Base implementation of _tokenize module
#6240
base: main
Conversation
Important: Review skipped due to path filters.
⛔ Files ignored due to path filters (2)
CodeRabbit blocks several paths by default. You can override this behavior by explicitly including those paths in the path filters, and you can disable this status message in the CodeRabbit configuration.
Walkthrough: Reorders and augments Ruff-related dependencies in the top-level Cargo.toml and stdlib/Cargo.toml, and wires a new tokenize module into stdlib.
Changes
Sequence Diagram(s)
sequenceDiagram
participant PyUser as Python user
participant TokenIter as TokenizerIter (Py)
participant State as PyTokenizerIterState
participant Parser as Ruff parser
PyUser->>TokenIter: next() / __next__()
TokenIter->>TokenIter: check internal state & tokens
alt tokens exhausted
TokenIter->>PyUser: call readline(vm) -> line
alt readline returns line
TokenIter->>State: push_line(line)
State->>Parser: parse_unchecked_source(line...)
Parser-->>State: tokens
State->>TokenIter: next_token()
TokenIter->>TokenIter: token_kind_value(kind)
alt extra_tokens and operator
TokenIter->>TokenIter: remap -> TOKEN_OP (55)
end
TokenIter-->>PyUser: (type, string, start, end, current_line)
else readline -> EOF
TokenIter->>State: mark EOF
TokenIter-->>PyUser: EndOfFile tuple (empty string + positions)
end
else tokens available
TokenIter-->>PyUser: next token tuple
end
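
For orientation, the consumer-facing counterpart of this flow is Python's readline-driven tokenizer API. A minimal sketch using only CPython's public tokenize module (illustrative, not this PR's code):

```python
import io
import tokenize

source = "def add(a, b):\n    return a + b\n"
readline = io.StringIO(source).readline  # the same readline() protocol as in the diagram

# Each item is a TokenInfo: (type, string, start, end, line),
# the same 5-tuple shape the iterator above produces.
for tok in tokenize.generate_tokens(readline):
    print(tokenize.tok_name[tok.type], repr(tok.string), tok.start, tok.end)
```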
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~20 minutes
Pre-merge checks and finishing touches: ✅ Passed checks (3 passed)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.
Actionable comments posted: 4
📜 Review details
Configuration used: Path: .coderabbit.yml
Review profile: CHILL
Plan: Pro
⛔ Files ignored due to path filters (15)
- Cargo.lock is excluded by !**/*.lock
- Lib/test/test_tokenize.py is excluded by !Lib/**
- Lib/test/tokenizedata/__init__.py is excluded by !Lib/**
- Lib/test/tokenizedata/bad_coding.py is excluded by !Lib/**
- Lib/test/tokenizedata/bad_coding2.py is excluded by !Lib/**
- Lib/test/tokenizedata/badsyntax_3131.py is excluded by !Lib/**
- Lib/test/tokenizedata/badsyntax_pep3120.py is excluded by !Lib/**
- Lib/test/tokenizedata/coding20731.py is excluded by !Lib/**
- Lib/test/tokenizedata/tokenize_tests-latin1-coding-cookie-and-utf8-bom-sig.txt is excluded by !Lib/**
- Lib/test/tokenizedata/tokenize_tests-no-coding-cookie-and-utf8-bom-sig-only.txt is excluded by !Lib/**
- Lib/test/tokenizedata/tokenize_tests-utf8-coding-cookie-and-no-utf8-bom-sig.txt is excluded by !Lib/**
- Lib/test/tokenizedata/tokenize_tests-utf8-coding-cookie-and-utf8-bom-sig.txt is excluded by !Lib/**
- Lib/test/tokenizedata/tokenize_tests.txt is excluded by !Lib/**
- Lib/token.py is excluded by !Lib/**
- Lib/tokenize.py is excluded by !Lib/**
📒 Files selected for processing (4)
- Cargo.toml (1 hunks)
- stdlib/Cargo.toml (1 hunks)
- stdlib/src/lib.rs (2 hunks)
- stdlib/src/tokenize.rs (1 hunks)
🧰 Additional context used
📓 Path-based instructions (2)
**/*.rs
📄 CodeRabbit inference engine (.github/copilot-instructions.md)
**/*.rs: Format Rust code with the default rustfmt style (run cargo fmt)
Run clippy and fix any warnings or lints introduced by your changes
Follow Rust best practices for error handling and memory management
Files:
- stdlib/src/lib.rs
- stdlib/src/tokenize.rs
{vm,stdlib}/**/*.rs
📄 CodeRabbit inference engine (.github/copilot-instructions.md)
Use RustPython macros (pyclass, pymodule, pyfunction, etc.) when implementing Python functionality in Rust
Files:
- stdlib/src/lib.rs
- stdlib/src/tokenize.rs
🧠 Learnings (4)
📓 Common learnings
Learnt from: CR
Repo: RustPython/RustPython PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-09-11T05:06:59.971Z
Learning: Applies to {vm,stdlib}/**/*.rs : Use RustPython macros (`pyclass`, `pymodule`, `pyfunction`, etc.) when implementing Python functionality in Rust
📚 Learning: 2025-09-11T05:06:59.971Z
Learnt from: CR
Repo: RustPython/RustPython PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-09-11T05:06:59.971Z
Learning: Applies to {vm,stdlib}/**/*.rs : Use RustPython macros (`pyclass`, `pymodule`, `pyfunction`, etc.) when implementing Python functionality in Rust
Applied to files:
- stdlib/src/lib.rs
- stdlib/src/tokenize.rs
📚 Learning: 2025-09-11T05:06:59.971Z
Learnt from: CR
Repo: RustPython/RustPython PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-09-11T05:06:59.971Z
Learning: Applies to Lib/**/*.py : Minimize modifications to CPython standard library files; only make minimal changes to work around RustPython limitations
Applied to files:
- stdlib/Cargo.toml
- stdlib/src/tokenize.rs
- Cargo.toml
📚 Learning: 2025-09-11T05:06:59.971Z
Learnt from: CR
Repo: RustPython/RustPython PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-09-11T05:06:59.971Z
Learning: Applies to extra_tests/**/*.py : Use ruff to lint custom Python code
Applied to files:
Cargo.toml
🧬 Code graph analysis (2)
stdlib/src/lib.rs (2)
Lib/tokenize.py (1)
- tokenize (466-492)
stdlib/src/socket.rs (1)
- make_module (7-11)
stdlib/src/tokenize.rs (2)
vm/src/stdlib/io.rs (1)
- vm (3324-3326)
Lib/test/test_tokenize.py (4)
- readline (1329-1335)
- readline (1350-1356)
- readline (1576-1581)
- readline (2106-2107)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (11)
- GitHub Check: Run snippets and cpython tests (ubuntu-latest)
- GitHub Check: Run tests under miri
- GitHub Check: Run snippets and cpython tests on wasm-wasi
- GitHub Check: Check the WASM package and demo
- GitHub Check: Run snippets and cpython tests (macos-latest)
- GitHub Check: Run snippets and cpython tests (windows-latest)
- GitHub Check: Check Rust code with rustfmt and clippy
- GitHub Check: Run rust tests (windows-latest)
- GitHub Check: Run rust tests (ubuntu-latest)
- GitHub Check: Run rust tests (macos-latest)
- GitHub Check: Ensure compilation on various targets
            if nline.is_empty() {
                state.eof = true;
                *zelf.state.write() = state.clone();

                let line_num = &state.start().0;
                let out = vm
                    .ctx
                    .new_tuple(vec![
                        token_kind_value(TokenKind::EndOfFile).to_pyobject(vm),
                        vm.ctx.new_str("").into(),
                        vm.ctx
                            .new_tuple(vec![line_num.to_pyobject(vm), (-1).to_pyobject(vm)])
                            .into(),
                        vm.ctx
                            .new_tuple(vec![line_num.to_pyobject(vm), (-1).to_pyobject(vm)])
                            .into(),
                        vm.ctx.new_str(state.current_line()).into(),
                    ])
                    .into();
                return Ok(PyIterReturn::Return(out));
            }
            state.push_line(&nline);
        };
End-of-file tuple needs to come from the real EOF token.
Synthesizing the tuple here hardcodes (line, -1) and reuses the previous line text, so consumers see the wrong coordinates and line content. Grab the parser’s EndOfFile token, set it as prev_token, and let the existing emission path produce the tuple.
- if nline.is_empty() {
- state.eof = true;
- *zelf.state.write() = state.clone();
-
- let line_num = &state.start().0;
- let out = vm
- .ctx
- .new_tuple(vec![
- token_kind_value(TokenKind::EndOfFile).to_pyobject(vm),
- vm.ctx.new_str("").into(),
- vm.ctx
- .new_tuple(vec![line_num.to_pyobject(vm), (-1).to_pyobject(vm)])
- .into(),
- vm.ctx
- .new_tuple(vec![line_num.to_pyobject(vm), (-1).to_pyobject(vm)])
- .into(),
- vm.ctx.new_str(state.current_line()).into(),
- ])
- .into();
- return Ok(PyIterReturn::Return(out));
- }
+ if nline.is_empty() {
+ state.eof = true;
+ if let Some(eof) = state
+ .tokens
+ .iter()
+ .rev()
+ .find(|token| token.kind() == TokenKind::EndOfFile)
+ {
+ state.prev_token = Some(*eof);
+ break eof.clone();
+ }
+ return Ok(PyIterReturn::StopIteration(None));
+ }
Committable suggestion skipped: line range outside the PR's diff.
🤖 Prompt for AI Agents
In stdlib/src/tokenize.rs around lines 116 to 138, don't synthesize the EOF
tuple here (which hardcodes (line, -1) and reuses the previous line text);
instead obtain the parser's EndOfFile token, set it as the state's prev_token
(or otherwise set prev_token to that EOF token), set state.eof = true and
persist the state as before, then allow the existing token-emission path to
produce and return the EOF tuple so the coordinates and line content come from
the real EOF token rather than being hardcoded.
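
For comparison, CPython's own tokenizer ends the stream with tokens that carry real coordinates rather than -1 placeholders. A quick way to inspect them via the public tokenize module (illustrative only; exact positions depend on the input):

```python
import io
import tokenize

toks = list(tokenize.generate_tokens(io.StringIO("x = 1\n").readline))

# The trailing tokens (NEWLINE, ENDMARKER) still carry genuine
# (line, column) positions and the associated line text.
for tok in toks[-2:]:
    print(tokenize.tok_name[tok.type], repr(tok.string), tok.start, tok.end, repr(tok.line))
```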
                if !zelf.extra_tokens {
                    token_repr = token_repr.trim();
                }
Don't trim token text when extra_tokens is false.
trim() wipes indentation whitespace (and other significant padding), so INDENT tokens end up empty strings. CPython preserves the exact lexeme. Just use the slice as-is.
- let mut token_repr = &state.source[state.range()];
- if !zelf.extra_tokens {
- token_repr = token_repr.trim();
- }
+ let token_repr = &state.source[state.range()];
🤖 Prompt for AI Agents
In stdlib/src/tokenize.rs around lines 152 to 154, token text is being trimmed
when zelf.extra_tokens is false which removes significant whitespace (e.g.,
INDENT lexemes); remove the call to trim() and use the slice as-is so the
original lexeme (including leading/trailing whitespace) is preserved when
extra_tokens is false.
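
To see the behaviour being preserved here, CPython reports the INDENT token's string as the literal leading whitespace. A small check against the public tokenize module (illustrative):

```python
import io
import tokenize

source = "if True:\n    x = 1\n"

for tok in tokenize.generate_tokens(io.StringIO(source).readline):
    if tok.type == tokenize.INDENT:
        # The lexeme is the exact indentation ('    ' here), not an empty
        # string, which is what trimming would produce.
        print(repr(tok.string), tok.start, tok.end)
```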
        extra_tokens: bool,
        #[pyarg(named, optional)]
extra_tokens must default to False.
CPython’s TokenizerIter allows calling TokenizerIter(readline) without naming the flag. Requiring the argument here breaks that API.
- #[pyarg(named)]
+ #[pyarg(named, default = "false")]
extra_tokens: bool,
Committable suggestion skipped: line range outside the PR's diff.
🤖 Prompt for AI Agents
In stdlib/src/tokenize.rs around lines 179-180, the extra_tokens parameter
currently requires being passed and therefore breaks CPython's TokenizerIter
API; make extra_tokens default to False so it can be omitted. Update the pyarg
declaration to make the argument optional with a default of false (or change the
parameter to Option<bool> and treat None as false) so TokenizerIter(readline)
works without naming this flag.
Actionable comments posted: 2
♻️ Duplicate comments (4)
stdlib/src/tokenize.rs (4)
39-42: Don't swallow non-StopIteration errors from readline.

Catching every exception and treating it as EOF hides real failures (TypeError, UnicodeDecodeError, etc.). Only StopIteration should terminate the stream; all other errors must propagate.

Apply this diff to fix the error handling:

-        let raw_line = match self.readline.invoke((), vm) {
-            Ok(v) => v,
-            Err(_) => return Ok(String::new()),
-        };
+        let raw_line = match self.readline.invoke((), vm) {
+            Ok(v) => v,
+            Err(exc) => {
+                if exc.fast_isinstance(vm.ctx.exceptions.stop_iteration) {
+                    return Ok(String::new());
+                }
+                return Err(exc);
+            }
+        };
116-138: End-of-file tuple needs to come from the real EOF token.

Synthesizing the EOF tuple here hardcodes (line, -1) and reuses the previous line text, so consumers see the wrong coordinates and line content. Instead, retrieve the parser's EndOfFile token, set it as prev_token, and let the existing emission path produce the correct tuple.

Apply this diff to use the parser's EOF token:

             if nline.is_empty() {
                 state.eof = true;
-                *zelf.state.write() = state.clone();
-
-                let line_num = &state.start().0;
-                let out = vm
-                    .ctx
-                    .new_tuple(vec![
-                        token_kind_value(TokenKind::EndOfFile).to_pyobject(vm),
-                        vm.ctx.new_str("").into(),
-                        vm.ctx
-                            .new_tuple(vec![line_num.to_pyobject(vm), (-1).to_pyobject(vm)])
-                            .into(),
-                        vm.ctx
-                            .new_tuple(vec![line_num.to_pyobject(vm), (-1).to_pyobject(vm)])
-                            .into(),
-                        vm.ctx.new_str(state.current_line()).into(),
-                    ])
-                    .into();
-                return Ok(PyIterReturn::Return(out));
+                if let Some(eof) = state
+                    .tokens
+                    .iter()
+                    .rev()
+                    .find(|token| token.kind() == TokenKind::EndOfFile)
+                {
+                    state.prev_token = Some(*eof);
+                    break *eof;
+                }
+                *zelf.state.write() = state.clone();
+                return Ok(PyIterReturn::StopIteration(None));
             }
151-154: Don't trim token text when extra_tokens is false.

trim() removes indentation whitespace and other significant padding, causing INDENT tokens to become empty strings. CPython preserves the exact lexeme. Use the slice as-is without trimming.

Apply this diff to preserve the exact lexeme:

-        let mut token_repr = &state.source[state.range()];
-        if !zelf.extra_tokens {
-            token_repr = token_repr.trim();
-        }
+        let token_repr = &state.source[state.range()];
178-180: extra_tokens must default to False.

CPython's TokenizerIter allows calling TokenizerIter(readline) without naming the flag. Requiring the argument here breaks API compatibility.

Apply this diff to make extra_tokens optional with a default:

         #[pyarg(positional)]
         readline: ArgCallable,
-        #[pyarg(named)]
-        extra_tokens: bool,
+        #[pyarg(named, default = "false")]
+        extra_tokens: bool,
         #[pyarg(named, optional)]
         encoding: Option<PyStrRef>,
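
On the 39-42 item above: a readline callable conventionally signals end of input either by returning an empty string (as file-like objects do) or, at some call sites, by raising StopIteration; any other exception should surface to the caller. A small illustration with public tokenize APIs (a sketch; minor details vary across CPython versions):

```python
import io
import tokenize

# EOF signalled by an empty string, the way io objects behave.
toks = list(tokenize.generate_tokens(io.StringIO("x = 1\n").readline))
print(len(toks), "tokens")

# EOF signalled by StopIteration: detect_encoding() treats it as end of
# input rather than an error.
lines = iter([b"# just a comment\n"])
def readline_bytes():
    return next(lines)  # raises StopIteration once the iterator is exhausted
encoding, consumed = tokenize.detect_encoding(readline_bytes)
print(encoding, consumed)
```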
🧹 Nitpick comments (2)
stdlib/src/tokenize.rs (2)
108-109: Reminder: Address the TODO comment for error handling.

The TODO indicates that error checking (e.g., raising SyntaxError for parse errors) is missing. The state.errors field is populated but never checked.

Do you want me to generate code to check state.errors and raise appropriate SyntaxError exceptions, or open a new issue to track this task?
200-207: Re-parsing the entire source on each line may be inefficient.

The push_line method re-parses the entire accumulated source text each time a new line is added. For large files, this could become a performance bottleneck with O(n²) behavior.

However, this approach may be necessary for correct tokenization since Python's grammar is context-sensitive (e.g., for indent/dedent tracking). Consider profiling to determine if incremental parsing or caching strategies would be beneficial.
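
To make the cost concrete: re-tokenizing the whole accumulated text after every appended line does O(n²) total work. A rough sketch of the pattern, using the public tokenize module as a stand-in for the Ruff parser (purely illustrative):

```python
import io
import time
import tokenize

def retokenize_per_line(lines):
    # Re-tokenize the full accumulated source each time a line arrives.
    acc = ""
    total = 0
    for line in lines:
        acc += line  # the source grows linearly...
        toks = list(tokenize.generate_tokens(io.StringIO(acc).readline))
        total += len(toks)  # ...but every pass re-scans everything seen so far
    return total

lines = [f"x{i} = {i}\n" for i in range(200)]
start = time.perf_counter()
retokenize_per_line(lines)
print(f"full re-parse per line: {time.perf_counter() - start:.3f}s")
```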
📜 Review details
Configuration used: Path: .coderabbit.yml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
stdlib/src/tokenize.rs (1 hunks)
🧰 Additional context used
📓 Path-based instructions (2)
**/*.rs
📄 CodeRabbit inference engine (.github/copilot-instructions.md)
**/*.rs: Format Rust code with the default rustfmt style (run cargo fmt)
Run clippy and fix any warnings or lints introduced by your changes
Follow Rust best practices for error handling and memory management
Files:
stdlib/src/tokenize.rs
{vm,stdlib}/**/*.rs
📄 CodeRabbit inference engine (.github/copilot-instructions.md)
Use RustPython macros (pyclass, pymodule, pyfunction, etc.) when implementing Python functionality in Rust
Files:
stdlib/src/tokenize.rs
🧠 Learnings (2)
📓 Common learnings
Learnt from: CR
Repo: RustPython/RustPython PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-09-11T05:06:59.971Z
Learning: Applies to {vm,stdlib}/**/*.rs : Use RustPython macros (`pyclass`, `pymodule`, `pyfunction`, etc.) when implementing Python functionality in Rust
Learnt from: CR
Repo: RustPython/RustPython PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-09-11T05:06:59.971Z
Learning: Applies to Lib/**/*.py : Minimize modifications to CPython standard library files; only make minimal changes to work around RustPython limitations
📚 Learning: 2025-09-11T05:06:59.971Z
Learnt from: CR
Repo: RustPython/RustPython PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-09-11T05:06:59.971Z
Learning: Applies to {vm,stdlib}/**/*.rs : Use RustPython macros (`pyclass`, `pymodule`, `pyfunction`, etc.) when implementing Python functionality in Rust
Applied to files:
stdlib/src/tokenize.rs
🧬 Code graph analysis (1)
stdlib/src/tokenize.rs (2)
vm/src/stdlib/io.rs (1)
- vm (3324-3326)
Lib/test/test_tokenize.py (4)
- readline (1329-1335)
- readline (1350-1356)
- readline (1576-1581)
- readline (2106-2107)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (10)
- GitHub Check: Run snippets and cpython tests on wasm-wasi
- GitHub Check: Check Rust code with rustfmt and clippy
- GitHub Check: Check the WASM package and demo
- GitHub Check: Run snippets and cpython tests (macos-latest)
- GitHub Check: Run snippets and cpython tests (ubuntu-latest)
- GitHub Check: Run snippets and cpython tests (windows-latest)
- GitHub Check: Run rust tests (ubuntu-latest)
- GitHub Check: Run rust tests (windows-latest)
- GitHub Check: Run rust tests (macos-latest)
- GitHub Check: Ensure compilation on various targets
🔇 Additional comments (5)
stdlib/src/tokenize.rs (5)
1-23: LGTM! Module structure and imports follow RustPython conventions correctly.

25-33: LGTM! The struct definition follows RustPython patterns correctly. The TODO comment is noted but doesn't block the current implementation.

61-72: LGTM! Debug implementation and empty pyclass impl block are correct.

74-93: LGTM! Constructor implementation follows RustPython patterns correctly.
210-220: No changes needed to current_line logic.

The current_line method correctly uses the LineRanges trait methods from ruff_source_file: full_line_str(offset) returns the full line containing an offset, and full_lines_str(range) returns text covering all lines that a range spans. The logic properly handles edge cases:

- Multi-line tokens: Correctly handled via full_lines_str(range) for non-Newline tokens
- Newline tokens: Correctly isolated via full_line_str(range.start())
- When prev_token is None: Safely defaults to TokenKind::Unknown with TextRange::default(), which calls full_lines_str and returns lines from the start

The code is sound and requires no modifications.
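
For reference on what the line element of each tuple should hold, CPython reports the full physical line containing the token, or all spanned lines for a multi-line token. A quick look via the public tokenize module (illustrative):

```python
import io
import tokenize

source = 'msg = """two\nlines"""\n'

for tok in tokenize.generate_tokens(io.StringIO(source).readline):
    # tok.line is the physical source line(s) the token comes from,
    # which is what current_line() is expected to reproduce.
    print(tokenize.tok_name[tok.type], repr(tok.string), "->", repr(tok.line))
```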
    fn next_token(&mut self) -> Option<Token> {
        for token in self.tokens.iter() {
            let (kind, range) = token.as_tuple();

            if matches!(kind, TokenKind::NonLogicalNewline) {
                continue;
            }

            if matches!(range.ordering(self.range()), Ordering::Greater) {
                self.prev_token = Some(*token);
                return self.prev_token;
            }
        }

        None
    }
🧩 Analysis chain
Verify the token ordering logic.
The next_token method uses range.ordering(self.range()) to find tokens after the current position. Ensure this correctly handles:
- Zero-width tokens (if any)
- Tokens at the exact same position
- The initial state when prev_token is None (resulting in TextRange::default())
Run the following script to verify the behavior with the Ruff crates:
🏁 Script executed:
#!/bin/bash
# Check how TextRange::default() and ordering work
ast-grep --pattern $'TextRange::default()'
rg -n -C3 "TextRange::default|ordering" --type=rust
Length of output: 10814
I need to find the TextRange type definition and understand its ordering() method to verify the logic.
Now I need to understand the context better. Let me search for how next_token is used and what the intended behavior should be.
Let me search for the actual tokenize.rs file context to understand the full implementation.
Let me try a different approach to view the file:
Now let me examine the tokenize.rs file more closely:
---
🏁 Script executed:
```shell
#!/bin/bash
# Get the full context of the tokenize.rs file around the problematic method
cat stdlib/src/tokenize.rs | head -300 | tail -100
Length of output: 3125
🏁 Script executed:
#!/bin/bash
# Find and display the tokenize.rs file with context around the next_token method
fd -name "tokenize.rs" -type f
Length of output: 230
🏁 Script executed:
#!/bin/bash
# Display the tokenize.rs file to understand the structure and usage
rg -n "fn next_token|impl|struct|fn range|self.tokens" stdlib/src/tokenize.rs -A5 -B2 --type=rust
Length of output: 3859
Let me examine the IterNext implementation to understand how next_token() is actually being used:
---
🏁 Script executed:
```shell
#!/bin/bash
# Get the IterNext implementation to see how next_token is used
sed -n '97,173p' stdlib/src/tokenize.rs
Length of output: 2843
Token ordering logic has a critical issue: tokens at position 0 are skipped on initial state.
When prev_token is None, self.range() returns TextRange::default(), a zero-width range at position 0 (0..0). The check matches!(range.ordering(self.range()), Ordering::Greater) only accepts tokens that lie completely after the current range. Since tokens starting at position 0 (e.g., 0..1) overlap with 0..0, they return Ordering::Equal and are skipped, so the initial token in the source is never yielded.
The same issue affects zero-width tokens (e.g., at position N): they overlap with their containing range and won't be returned when that range is current.
The fix requires checking for Ordering::Equal and Ordering::Greater on initial state, or using TextRange::EMPTY properly, or advancing past the current position differently to include overlapping tokens.
🤖 Prompt for AI Agents
In stdlib/src/tokenize.rs around lines 223 to 238, the token selection logic
skips tokens that overlap the current range (notably tokens at position 0 when
prev_token is None) because it only accepts Ordering::Greater; change the
comparison to accept both Ordering::Equal and Ordering::Greater (or compare
token.range().start() against self.range().end()) so tokens that start at the
same position as the current range are returned; update the conditional to check
for Equal || Greater (or use token_start >= current_end) and ensure prev_token
is set before returning to preserve state for subsequent calls.
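
A toy model of the failure mode, with ranges as half-open (start, end) pairs and the ordering semantics described above (Greater only when one range lies entirely after the other). Names and semantics here are a simplified illustration, not the Ruff API:

```python
from enum import Enum

class Ordering(Enum):
    LESS = -1
    EQUAL = 0      # overlapping ranges, per the semantics described above
    GREATER = 1

def ordering(a, b):
    # Greater only if a starts strictly after b ends; Less only if a ends strictly before b starts.
    if a[1] < b[0]:
        return Ordering.LESS
    if a[0] > b[1]:
        return Ordering.GREATER
    return Ordering.EQUAL

current = (0, 0)       # models TextRange::default() when prev_token is None
first_token = (0, 1)   # a one-character token at the very start of the source

# The strictly-Greater test never accepts the first token...
print(ordering(first_token, current))      # Ordering.EQUAL -> skipped
# ...whereas comparing start against end (token_start >= current_end) admits it.
print(first_token[0] >= current[1])        # True -> would be yielded
```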
const fn token_kind_value(kind: TokenKind) -> u8 {
    match kind {
        TokenKind::EndOfFile => 0,
        TokenKind::Name
        | TokenKind::For
        | TokenKind::In
        | TokenKind::Pass
        | TokenKind::Class
        | TokenKind::And
        | TokenKind::Is
        | TokenKind::Raise
        | TokenKind::True
        | TokenKind::False
        | TokenKind::Assert
        | TokenKind::Try
        | TokenKind::While
        | TokenKind::Yield
        | TokenKind::Lambda
        | TokenKind::None
        | TokenKind::Not
        | TokenKind::Or
        | TokenKind::Break
        | TokenKind::Continue
        | TokenKind::Global
        | TokenKind::Nonlocal
        | TokenKind::Return
        | TokenKind::Except
        | TokenKind::Import
        | TokenKind::Case
        | TokenKind::Match
        | TokenKind::Type
        | TokenKind::Await
        | TokenKind::With
        | TokenKind::Del
        | TokenKind::Finally
        | TokenKind::From
        | TokenKind::Def
        | TokenKind::If
        | TokenKind::Else
        | TokenKind::Elif
        | TokenKind::As
        | TokenKind::Async => 1,
        TokenKind::Int | TokenKind::Complex | TokenKind::Float => 2,
        TokenKind::String => 3,
        TokenKind::Newline | TokenKind::NonLogicalNewline => 4,
        TokenKind::Indent => 5,
        TokenKind::Dedent => 6,
        TokenKind::Lpar => 7,
        TokenKind::Rpar => 8,
        TokenKind::Lsqb => 9,
        TokenKind::Rsqb => 10,
        TokenKind::Colon => 11,
        TokenKind::Comma => 12,
        TokenKind::Semi => 13,
        TokenKind::Plus => 14,
        TokenKind::Minus => 15,
        TokenKind::Star => 16,
        TokenKind::Slash => 17,
        TokenKind::Vbar => 18,
        TokenKind::Amper => 19,
        TokenKind::Less => 20,
        TokenKind::Greater => 21,
        TokenKind::Equal => 22,
        TokenKind::Dot => 23,
        TokenKind::Percent => 24,
        TokenKind::Lbrace => 25,
        TokenKind::Rbrace => 26,
        TokenKind::EqEqual => 27,
        TokenKind::NotEqual => 28,
        TokenKind::LessEqual => 29,
        TokenKind::GreaterEqual => 30,
        TokenKind::Tilde => 31,
        TokenKind::CircumFlex => 32,
        TokenKind::LeftShift => 33,
        TokenKind::RightShift => 34,
        TokenKind::DoubleStar => 35,
        TokenKind::PlusEqual => 36,
        TokenKind::MinusEqual => 37,
        TokenKind::StarEqual => 38,
        TokenKind::SlashEqual => 39,
        TokenKind::PercentEqual => 40,
        TokenKind::AmperEqual => 41,
        TokenKind::VbarEqual => 42,
        TokenKind::CircumflexEqual => 43,
        TokenKind::LeftShiftEqual => 44,
        TokenKind::RightShiftEqual => 45,
        TokenKind::DoubleStarEqual => 46,
        TokenKind::DoubleSlash => 47,
        TokenKind::DoubleSlashEqual => 48,
        TokenKind::At => 49,
        TokenKind::AtEqual => 50,
        TokenKind::Rarrow => 51,
        TokenKind::Ellipsis => 52,
        TokenKind::ColonEqual => 53,
        TokenKind::Exclamation => 54,
        TokenKind::FStringStart => 59,
        TokenKind::FStringMiddle => 60,
        TokenKind::FStringEnd => 61,
        TokenKind::Comment => 62,
        TokenKind::TStringStart => 62,  // 3.14 compatible
        TokenKind::TStringMiddle => 63, // 3.14 compatible
        TokenKind::TStringEnd => 64,    // 3.14 compatible
        TokenKind::IpyEscapeCommand | TokenKind::Question => 0, // Ruff's specific
        TokenKind::Unknown => 0,
    }
}
🧩 Analysis chain
Verify token value mappings match CPython.
This mapping should match the token values from CPython's token module. The comment at line 22 indicates TOKEN_OP is 55, which is correct for Python 3.x.
Please verify that all other token values match CPython, especially:
- Lines 380-382: TString tokens marked as "3.14 compatible"
- Line 379: Comment token maps to 62 (same as TStringStart on line 380)
- Lines 383-384: Ruff-specific tokens map to 0
🌐 Web query:
What are the token type values in CPython's token module for Python 3.14?
💡 Result:
I can show you the exact numeric values from your local CPython 3.14 installation — run this in a Python 3.14 interpreter:
python -c "import token, pprint; pprint.pprint({name: getattr(token, name) for name in sorted(token.tok_name.values())})"
Or, to get the reverse mapping (name → numeric value):
python -c "import token, pprint; pprint.pprint({k: v for v, k in token.tok_name.items()})"
The official Python 3.14 documentation for the token module (and the module source) is here for reference. [1]
Sources:
[1] Python 3.14 token module documentation — docs.python.org.
#!/usr/bin/env python3
import sys

try:
    import token
    print("Python version:", sys.version)
    print("\nToken name to value mapping:")
    # Get token name to value mapping
    tok_mapping = {}
    for value, name in token.tok_name.items():
        tok_mapping[name] = value
    # Sort and print
    for name in sorted(tok_mapping.keys()):
        print(f" {name}: {tok_mapping[name]}")
except ImportError as e:
    print(f"Error: {e}")
🏁 Script executed:
#!/bin/bash
# Also check if we can inspect the tokenize.rs file for comparison
cd /tmp || exit 1
cat > check_cpython_tokens.py << 'EOF'
import token
import pprint
print("=== Token module values ===")
for tok_val in sorted(token.tok_name.keys()):
print(f"{token.tok_name[tok_val]}: {tok_val}")
EOF
python3 check_cpython_tokens.py
Length of output: 844
🏁 Script executed:
#!/bin/bash
# Get the full TokenKind enum definition and context around the function
cd /tmp || exit 1
find . -name "tokenize.rs" -type f 2>/dev/null | head -5
Length of output: 0
🏁 Script executed:
#!/bin/bash
# Search for TokenKind enum definition
rg -n "enum TokenKind" --type=rs -A 100
Length of output: 26
🏁 Script executed:
#!/bin/bash
# Inspect the TokenKind enum and the token_kind_value function
rg -n "enum TokenKind" -A 150 | head -200
Length of output: 0
🏁 Script executed:
#!/bin/bash
# Look at the token_kind_value function and surrounding context
rg -n "const fn token_kind_value" -B 5 -A 120 | head -200
Length of output: 6992
🏁 Script executed:
#!/bin/bash
# Search for any comments about token mapping or Python compatibility
rg -n "CPython|Python.*token|token.*value" stdlib/src/tokenize.rs | head -20
Length of output: 333
Critical token mapping errors found - multiple duplicates and CPython misalignment.
Verification against CPython 3.14 token module revealed significant issues:
- Duplicate values (lines 379-380): Both Comment and TStringStart map to 62, which is invalid
- Incorrect Comment value (line 379): Maps to 62, but CPython has COMMENT=61
- FString tokens don't match CPython (lines 376-378): Map to 59, 60, 61, which correspond to SOFT_KEYWORD, ERRORTOKEN, and COMMENT in CPython, not FString values
- TString tokens misaligned (lines 380-382): The "3.14 compatible" comment is inaccurate; these values (62, 63, 64) don't correspond to template string tokens in CPython 3.14
- Exclamation token (line 375): Maps to 54, but CPython has OP=54; Exclamation is not a standard Python token type
The mapping for lines 375-382 needs correction to either match CPython standard tokens or use reserved/unassigned values to avoid collisions and type misalignment.
🤖 Prompt for AI Agents
stdlib/src/tokenize.rs around lines 281-386: the TokenKind->u8 mapping has
duplicate and misaligned values (e.g., Comment and TStringStart both 62,
FStringStart/Middle/End set to 59/60/61 which conflict with CPython 3.14
meanings, Exclamation mapped to 54 which collides with OP, and the "3.14
compatible" comments are incorrect). Fix by replacing those values with the
correct CPython 3.14 token numeric assignments for COMMENT,
ERRORTOKEN/SOFT_KEYWORD/SO_ON if you intend CPython parity; if you intentionally
diverge, pick unused/reserved numeric values that do not collide with existing
entries, remove or correct the misleading "3.14 compatible" comments, ensure
IpyEscapeCommand/Question and Unknown map to distinct safe values (or 0 only
when appropriate), and run tokenizer unit tests and a full token value
uniqueness check to confirm no duplicates remain.
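
A concrete way to run that uniqueness and parity check from the Python side is to compare a candidate mapping against the token module of the interpreter being targeted. The proposed dict below is a hypothetical excerpt, not the PR's actual table:

```python
import token

# Hypothetical excerpt of a TokenKind -> numeric value table to validate.
proposed = {
    "EndOfFile": 0,
    "Name": 1,
    "String": 3,
    "FStringStart": 59,
    "Comment": 62,
}

# Report duplicate numeric values within the proposed table.
by_value = {}
for kind, value in proposed.items():
    by_value.setdefault(value, []).append(kind)
for value, kinds in by_value.items():
    if len(kinds) > 1:
        print(f"duplicate value {value}: {kinds}")

# Show what each numeric value means to the running CPython.
for kind, value in proposed.items():
    print(f"{kind:12} -> {value:3}  (CPython: {token.tok_name.get(value, '<unassigned>')})")
```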
Actionable comments posted: 0
♻️ Duplicate comments (5)
stdlib/src/tokenize.rs (5)
121-143: EOF tuple must use the parser's actual EndOfFile token.

The code synthesizes the EOF tuple with hardcoded (-1) positions (lines 132, 135) and reuses state.current_line(), which produces incorrect coordinates and line content. Instead, retrieve the parser's EndOfFile token from state.tokens, set it as prev_token, and let the standard token-emission path (lines 145-175) produce the tuple with correct metadata.

Apply this approach:

             let nline = zelf.readline(vm)?;
             if nline.is_empty() {
                 state.eof = true;
-                *zelf.state.write() = state.clone();
-
-                let line_num = &state.start().0;
-                let out = vm
-                    .ctx
-                    .new_tuple(vec![
-                        token_kind_value(TokenKind::EndOfFile).to_pyobject(vm),
-                        vm.ctx.new_str("").into(),
-                        vm.ctx
-                            .new_tuple(vec![line_num.to_pyobject(vm), (-1).to_pyobject(vm)])
-                            .into(),
-                        vm.ctx
-                            .new_tuple(vec![line_num.to_pyobject(vm), (-1).to_pyobject(vm)])
-                            .into(),
-                        vm.ctx.new_str(state.current_line()).into(),
-                    ])
-                    .into();
-                return Ok(PyIterReturn::Return(out));
+                if let Some(eof) = state
+                    .tokens
+                    .iter()
+                    .find(|token| token.kind() == TokenKind::EndOfFile)
+                {
+                    state.prev_token = Some(*eof);
+                    break *eof;
+                }
+                *zelf.state.write() = state.clone();
+                return Ok(PyIterReturn::StopIteration(None));
             }
156-159: Remove trim() to preserve exact token lexemes.

Calling trim() when extra_tokens is false (line 158) strips significant whitespace, causing INDENT tokens to become empty strings. CPython's tokenizer preserves the exact lexeme. Use the slice as-is.

Apply this diff:

-        let mut token_repr = &state.source[state.range()];
-        if !zelf.extra_tokens {
-            token_repr = token_repr.trim();
-        }
+        let token_repr = &state.source[state.range()];
183-184: Make extra_tokens optional with default false.

CPython allows TokenizerIter(readline) without specifying extra_tokens. The current declaration requires the argument, breaking API compatibility.

Apply this diff:

-        #[pyarg(named)]
-        extra_tokens: bool,
+        #[pyarg(named, default = "false")]
+        extra_tokens: bool,
227-243: Token ordering skips tokens that overlap the current range.

When prev_token is None, self.range() returns TextRange::default() (0..0). The check matches!(range.ordering(self.range()), Ordering::Greater) (line 236) only matches tokens completely after the current range. Tokens starting at position 0 (e.g., 0..1) overlap with 0..0, returning Ordering::Equal, and are skipped. The first token in the source is never yielded.

Fix by comparing start positions:

-            if matches!(range.ordering(self.range()), Ordering::Greater) {
+            if range.start() > self.range().end() {
                 self.prev_token = Some(*token);
                 return self.prev_token;
             }
286-391: Critical token mapping errors: duplicates and CPython conflicts.

Multiple critical issues remain:

- Duplicate value 62 (lines 384-385): Both Comment and TStringStart map to 62
- Incorrect Comment mapping (line 384): Maps to 62, but CPython 3.14 has COMMENT=61
- FString conflicts (lines 381-383): Values 59, 60, 61 correspond to SOFT_KEYWORD, ERRORTOKEN, and COMMENT in CPython 3.14
- Exclamation collision (line 380): Maps to 54, which is CPython's OP token
- Misleading comments (lines 385-387): "3.14 compatible" claims are inaccurate
These mappings must be corrected to either:
- Match CPython 3.14 standard token values exactly, or
- Use reserved/unassigned values (e.g., 100+) to avoid collisions
Run this verification to confirm no remaining duplicates after fixes:
#!/bin/bash
# Extract all token value assignments and check for duplicates
rg -n "TokenKind::\w+ => \d+" stdlib/src/tokenize.rs | \
  gawk '{print $NF}' | sort | uniq -d | \
  xargs -I {} echo "Duplicate token value found: {}"
🧹 Nitpick comments (2)
stdlib/src/tokenize.rs (2)
189-212: State structure is correct; consider performance implications.

The push_line method reparses the entire accumulated source on each call (line 208). While correct, this has O(n²) complexity for n lines. This is acceptable for now but may need optimization for large files.
113-118: TODO: Implement syntax error handling.

The TODO on line 114 indicates that parse errors from state.errors are not currently raised as SyntaxError. This means invalid Python syntax may be silently ignored. Consider implementing error checking in a follow-up to ensure proper error reporting.

Based on learnings.
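
For reference on the expected behaviour: CPython's tokenizer surfaces invalid input as an exception rather than silently ignoring it. The exact exception type has varied between the pure-Python and C tokenizers, so this sketch catches both usual candidates:

```python
import io
import tokenize

bad_source = "(1 +\n"  # unbalanced parenthesis followed by EOF

try:
    list(tokenize.generate_tokens(io.StringIO(bad_source).readline))
except (SyntaxError, tokenize.TokenizeError) as exc:
    # Depending on the Python version this is a SyntaxError (C tokenizer)
    # or a TokenizeError (pure-Python tokenizer).
    print(type(exc).__name__, ":", exc)
```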
📜 Review details
Configuration used: Path: .coderabbit.yml
Review profile: CHILL
Plan: Pro
⛔ Files ignored due to path filters (1)
Lib/test/test_tokenize.py is excluded by !Lib/**
📒 Files selected for processing (1)
stdlib/src/tokenize.rs (1 hunks)
🧰 Additional context used
📓 Path-based instructions (2)
**/*.rs
📄 CodeRabbit inference engine (.github/copilot-instructions.md)
**/*.rs: Format Rust code with the default rustfmt style (run cargo fmt)
Run clippy and fix any warnings or lints introduced by your changes
Follow Rust best practices for error handling and memory management
Files:
stdlib/src/tokenize.rs
{vm,stdlib}/**/*.rs
📄 CodeRabbit inference engine (.github/copilot-instructions.md)
Use RustPython macros (pyclass, pymodule, pyfunction, etc.) when implementing Python functionality in Rust
Files:
stdlib/src/tokenize.rs
🧠 Learnings (2)
📓 Common learnings
Learnt from: CR
Repo: RustPython/RustPython PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-09-11T05:06:59.971Z
Learning: Applies to {vm,stdlib}/**/*.rs : Use RustPython macros (`pyclass`, `pymodule`, `pyfunction`, etc.) when implementing Python functionality in Rust
Learnt from: CR
Repo: RustPython/RustPython PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-09-11T05:06:59.971Z
Learning: Applies to Lib/**/*.py : Minimize modifications to CPython standard library files; only make minimal changes to work around RustPython limitations
📚 Learning: 2025-09-11T05:06:59.971Z
Learnt from: CR
Repo: RustPython/RustPython PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-09-11T05:06:59.971Z
Learning: Applies to {vm,stdlib}/**/*.rs : Use RustPython macros (`pyclass`, `pymodule`, `pyfunction`, etc.) when implementing Python functionality in Rust
Applied to files:
stdlib/src/tokenize.rs
🧬 Code graph analysis (1)
stdlib/src/tokenize.rs (2)
vm/src/stdlib/io.rs (1)
- vm (3324-3326)
Lib/test/test_tokenize.py (4)
- readline (1328-1334)
- readline (1349-1355)
- readline (1575-1580)
- readline (2105-2106)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (10)
- GitHub Check: Run snippets and cpython tests (ubuntu-latest)
- GitHub Check: Run snippets and cpython tests on wasm-wasi
- GitHub Check: Run snippets and cpython tests (macos-latest)
- GitHub Check: Run snippets and cpython tests (windows-latest)
- GitHub Check: Run rust tests (macos-latest)
- GitHub Check: Run rust tests (windows-latest)
- GitHub Check: Check the WASM package and demo
- GitHub Check: Run rust tests (ubuntu-latest)
- GitHub Check: Check Rust code with rustfmt and clippy
- GitHub Check: Ensure compilation on various targets
🔇 Additional comments (9)
stdlib/src/tokenize.rs (9)
1-23: LGTM: Module structure and imports are well-organized.

The #[pymodule] macro usage follows RustPython patterns, and the TOKEN_OP constant correctly matches CPython's value.

25-33: Struct definition is sound; track the TODO for future work.

The use of PyRwLock for state management is appropriate. The TODO on line 29 regarding PyObject vs ArgCallable can be addressed in a follow-up.

35-64: LGTM: Error handling correctly distinguishes StopIteration from other exceptions.

The implementation now properly treats only StopIteration as EOF (lines 42-43) and propagates all other errors (line 45), fixing the critical issue from the previous review. The encoding logic for both bytes and strings is also correct.
66-77: LGTM: Standard Debug implementation and trait delegation.
79-98: LGTM: Constructor implementation is clean.
100-100: LGTM: Standard SelfIter implementation.
214-225: LGTM: Current line extraction logic is appropriate.
245-267: LGTM: Position helper methods correctly convert to Python conventions.

The start() and end() methods properly return 1-indexed line numbers and 0-indexed column offsets, matching Python's tokenize module.
270-284: LGTM: Default state initialization is clean.
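
On the 245-267 item: CPython's tokenize reports positions as (line, column) with 1-based lines and 0-based columns, which is the convention those helpers mirror. A quick check with the public API (illustrative):

```python
import io
import tokenize

for tok in tokenize.generate_tokens(io.StringIO("x = 1\n").readline):
    # start/end are (row, col): rows start at 1, columns at 0.
    print(tokenize.tok_name[tok.type], tok.start, tok.end)
```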
Summary by CodeRabbit
New Features
Chores