
Conversation

ShaharNaveh (Collaborator) commented Nov 8, 2025

Summary by CodeRabbit

  • New Features

    • New Python tokenizer module: Tokenizer iterator that yields token tuples (type, text, start, end, line), accepts a readline callable, optional encoding, and an extra-token mode; emits EOF tokens with position metadata for streaming/tokenization.
  • Chores

    • Internal dependency declarations updated to support the new tokenizer module.

coderabbitai bot (Contributor) commented Nov 8, 2025

Important

Review skipped

Review was skipped due to path filters

⛔ Files ignored due to path filters (2)
  • Lib/test/test_tarfile.py is excluded by !Lib/**
  • Lib/test/test_unittest/testmock/testhelpers.py is excluded by !Lib/**

CodeRabbit blocks several paths by default. You can override this behavior by explicitly including those paths in the path filters. For example, including **/dist/** overrides the default block on the dist directory by removing the pattern from both lists.

You can disable this status message by setting reviews.review_status to false in the CodeRabbit configuration file.

Walkthrough

Reorders and augments Ruff-related dependencies in the top-level Cargo.toml and adds workspace dependencies in stdlib/Cargo.toml; registers a new internal tokenize module in stdlib/src/lib.rs; and introduces stdlib/src/tokenize.rs, a _tokenize module built with RustPython's pymodule/pyclass macros that exposes TokenizerIter, yields Python token 5-tuples, and manages tokenizer state and EOF.
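For orientation, here is a minimal Python sketch of how such an iterator is typically consumed, mirroring CPython's private _tokenize.TokenizerIter API; the exact constructor signature in this PR is still under review, so the keyword arguments below are assumptions:

```python
import io
import _tokenize  # CPython-private module; Lib/tokenize.py wraps it

source = "x = 1\n"
readline = io.StringIO(source).readline  # returns one line per call, '' at EOF

# Assumed signature: TokenizerIter(readline, *, extra_tokens=..., encoding=...)
it = _tokenize.TokenizerIter(readline, extra_tokens=False)
for tok_type, tok_str, start, end, line in it:
    print(tok_type, repr(tok_str), start, end, repr(line))
```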

Changes

Top-level manifest: Cargo.toml
Reordered Ruff-related crate entries; added ruff_python_trivia; moved ruff_text_size next to ruff_source_file. No API changes.

Stdlib manifest updates: stdlib/Cargo.toml
Added workspace dependencies: ruff_python_ast, ruff_python_parser, ruff_source_file, ruff_text_size (all workspace = true).

Module registration: stdlib/src/lib.rs
Added mod tokenize; and registered _tokenize by including "_tokenize" => tokenize::make_module in the module initializers.

Tokenizer implementation: stdlib/src/tokenize.rs
New _tokenize module, implemented with RustPython's pymodule/pyclass macros, exposing the Python class TokenizerIter (PyTokenizerIter) and its constructor arguments. Implements an iterator that reads lines via a Python readline callable, manages internal state behind a PyRwLock, emits 5-tuples (type, string, start_pos, end_pos, current_line), remaps operator tokens to TOKEN_OP when extra_tokens is true, and emits an EOF token tuple.

Sequence Diagram(s)

sequenceDiagram
    participant PyUser as Python user
    participant TokenIter as TokenizerIter (Py)
    participant State as PyTokenizerIterState
    participant Parser as Ruff parser

    PyUser->>TokenIter: next() / __next__()
    TokenIter->>TokenIter: check internal state & tokens
    alt tokens exhausted
        TokenIter->>PyUser: call readline(vm) -> line
        alt readline returns line
            TokenIter->>State: push_line(line)
            State->>Parser: parse_unchecked_source(line...)
            Parser-->>State: tokens
            State->>TokenIter: next_token()
            TokenIter->>TokenIter: token_kind_value(kind)
            alt extra_tokens and operator
                TokenIter->>TokenIter: remap -> TOKEN_OP (55)
            end
            TokenIter-->>PyUser: (type, string, start, end, current_line)
        else readline -> EOF
            TokenIter->>State: mark EOF
            TokenIter-->>PyUser: EndOfFile tuple (empty string + positions)
        end
    else tokens available
        TokenIter-->>PyUser: next token tuple
    end

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

  • Focus review on stdlib/src/tokenize.rs concurrency (use of PyRwLock), correctness of token position/range calculations, EOF emission behavior, and token-kind remapping.
  • Verify workspace dependency consistency across Cargo.toml and stdlib/Cargo.toml.

Suggested reviewers

  • youknowone

Poem

A rabbit reads each Python line with care,
Hops through tokens drifting in the air,
Ruff crates in order, ready for the quest,
Tuples hop out tidy, positions dressed,
EOF tiptoes in — the iterator rests. 🐇✨

Pre-merge checks and finishing touches

✅ Passed checks (3 passed)
Description Check: ✅ Passed. Skipped because CodeRabbit’s high-level summary is enabled.
Title Check: ✅ Passed. The title accurately describes the main change: implementing a base _tokenize module with a TokenizerIter class and supporting infrastructure.
Docstring Coverage: ✅ Passed. No functions were found in the changed files to evaluate, so the docstring coverage check was skipped.


coderabbitai bot (Contributor) left a comment


Actionable comments posted: 4

📜 Review details

Configuration used: Path: .coderabbit.yml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 5cad66c and 7bc0197.

⛔ Files ignored due to path filters (15)
  • Cargo.lock is excluded by !**/*.lock
  • Lib/test/test_tokenize.py is excluded by !Lib/**
  • Lib/test/tokenizedata/__init__.py is excluded by !Lib/**
  • Lib/test/tokenizedata/bad_coding.py is excluded by !Lib/**
  • Lib/test/tokenizedata/bad_coding2.py is excluded by !Lib/**
  • Lib/test/tokenizedata/badsyntax_3131.py is excluded by !Lib/**
  • Lib/test/tokenizedata/badsyntax_pep3120.py is excluded by !Lib/**
  • Lib/test/tokenizedata/coding20731.py is excluded by !Lib/**
  • Lib/test/tokenizedata/tokenize_tests-latin1-coding-cookie-and-utf8-bom-sig.txt is excluded by !Lib/**
  • Lib/test/tokenizedata/tokenize_tests-no-coding-cookie-and-utf8-bom-sig-only.txt is excluded by !Lib/**
  • Lib/test/tokenizedata/tokenize_tests-utf8-coding-cookie-and-no-utf8-bom-sig.txt is excluded by !Lib/**
  • Lib/test/tokenizedata/tokenize_tests-utf8-coding-cookie-and-utf8-bom-sig.txt is excluded by !Lib/**
  • Lib/test/tokenizedata/tokenize_tests.txt is excluded by !Lib/**
  • Lib/token.py is excluded by !Lib/**
  • Lib/tokenize.py is excluded by !Lib/**
📒 Files selected for processing (4)
  • Cargo.toml (1 hunks)
  • stdlib/Cargo.toml (1 hunks)
  • stdlib/src/lib.rs (2 hunks)
  • stdlib/src/tokenize.rs (1 hunks)
🧰 Additional context used
📓 Path-based instructions (2)
**/*.rs

📄 CodeRabbit inference engine (.github/copilot-instructions.md)

**/*.rs: Format Rust code with the default rustfmt style (run cargo fmt)
Run clippy and fix any warnings or lints introduced by your changes
Follow Rust best practices for error handling and memory management

Files:

  • stdlib/src/lib.rs
  • stdlib/src/tokenize.rs
{vm,stdlib}/**/*.rs

📄 CodeRabbit inference engine (.github/copilot-instructions.md)

Use RustPython macros (pyclass, pymodule, pyfunction, etc.) when implementing Python functionality in Rust

Files:

  • stdlib/src/lib.rs
  • stdlib/src/tokenize.rs
🧠 Learnings (4)
📓 Common learnings
Learnt from: CR
Repo: RustPython/RustPython PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-09-11T05:06:59.971Z
Learning: Applies to {vm,stdlib}/**/*.rs : Use RustPython macros (`pyclass`, `pymodule`, `pyfunction`, etc.) when implementing Python functionality in Rust
📚 Learning: 2025-09-11T05:06:59.971Z
Learnt from: CR
Repo: RustPython/RustPython PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-09-11T05:06:59.971Z
Learning: Applies to {vm,stdlib}/**/*.rs : Use RustPython macros (`pyclass`, `pymodule`, `pyfunction`, etc.) when implementing Python functionality in Rust

Applied to files:

  • stdlib/src/lib.rs
  • stdlib/src/tokenize.rs
📚 Learning: 2025-09-11T05:06:59.971Z
Learnt from: CR
Repo: RustPython/RustPython PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-09-11T05:06:59.971Z
Learning: Applies to Lib/**/*.py : Minimize modifications to CPython standard library files; only make minimal changes to work around RustPython limitations

Applied to files:

  • stdlib/Cargo.toml
  • stdlib/src/tokenize.rs
  • Cargo.toml
📚 Learning: 2025-09-11T05:06:59.971Z
Learnt from: CR
Repo: RustPython/RustPython PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-09-11T05:06:59.971Z
Learning: Applies to extra_tests/**/*.py : Use ruff to lint custom Python code

Applied to files:

  • Cargo.toml
🧬 Code graph analysis (2)
stdlib/src/lib.rs (2)
Lib/tokenize.py (1)
  • tokenize (466-492)
stdlib/src/socket.rs (1)
  • make_module (7-11)
stdlib/src/tokenize.rs (2)
vm/src/stdlib/io.rs (1)
  • vm (3324-3326)
Lib/test/test_tokenize.py (4)
  • readline (1329-1335)
  • readline (1350-1356)
  • readline (1576-1581)
  • readline (2106-2107)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (11)
  • GitHub Check: Run snippets and cpython tests (ubuntu-latest)
  • GitHub Check: Run tests under miri
  • GitHub Check: Run snippets and cpython tests on wasm-wasi
  • GitHub Check: Check the WASM package and demo
  • GitHub Check: Run snippets and cpython tests (macos-latest)
  • GitHub Check: Run snippets and cpython tests (windows-latest)
  • GitHub Check: Check Rust code with rustfmt and clippy
  • GitHub Check: Run rust tests (windows-latest)
  • GitHub Check: Run rust tests (ubuntu-latest)
  • GitHub Check: Run rust tests (macos-latest)
  • GitHub Check: Ensure compilation on various targets

Comment on lines +116 to +138
                if nline.is_empty() {
                    state.eof = true;
                    *zelf.state.write() = state.clone();

                    let line_num = &state.start().0;
                    let out = vm
                        .ctx
                        .new_tuple(vec![
                            token_kind_value(TokenKind::EndOfFile).to_pyobject(vm),
                            vm.ctx.new_str("").into(),
                            vm.ctx
                                .new_tuple(vec![line_num.to_pyobject(vm), (-1).to_pyobject(vm)])
                                .into(),
                            vm.ctx
                                .new_tuple(vec![line_num.to_pyobject(vm), (-1).to_pyobject(vm)])
                                .into(),
                            vm.ctx.new_str(state.current_line()).into(),
                        ])
                        .into();
                    return Ok(PyIterReturn::Return(out));
                }
                state.push_line(&nline);
            };

⚠️ Potential issue | 🔴 Critical

End-of-file tuple needs to come from the real EOF token.

Synthesizing the tuple here hardcodes (line, -1) and reuses the previous line text, so consumers see the wrong coordinates and line content. Grab the parser’s EndOfFile token, set it as prev_token, and let the existing emission path produce the tuple.

-                if nline.is_empty() {
-                    state.eof = true;
-                    *zelf.state.write() = state.clone();
-
-                    let line_num = &state.start().0;
-                    let out = vm
-                        .ctx
-                        .new_tuple(vec![
-                            token_kind_value(TokenKind::EndOfFile).to_pyobject(vm),
-                            vm.ctx.new_str("").into(),
-                            vm.ctx
-                                .new_tuple(vec![line_num.to_pyobject(vm), (-1).to_pyobject(vm)])
-                                .into(),
-                            vm.ctx
-                                .new_tuple(vec![line_num.to_pyobject(vm), (-1).to_pyobject(vm)])
-                                .into(),
-                            vm.ctx.new_str(state.current_line()).into(),
-                        ])
-                        .into();
-                    return Ok(PyIterReturn::Return(out));
-                }
+                if nline.is_empty() {
+                    state.eof = true;
+                    if let Some(eof) = state
+                        .tokens
+                        .iter()
+                        .rev()
+                        .find(|token| token.kind() == TokenKind::EndOfFile)
+                    {
+                        state.prev_token = Some(*eof);
+                        break eof.clone();
+                    }
+                    return Ok(PyIterReturn::StopIteration(None));
+                }

Committable suggestion skipped: line range outside the PR's diff.

🤖 Prompt for AI Agents
In stdlib/src/tokenize.rs around lines 116 to 138, don't synthesize the EOF
tuple here (which hardcodes (line, -1) and reuses the previous line text);
instead obtain the parser's EndOfFile token, set it as the state's prev_token
(or otherwise set prev_token to that EOF token), set state.eof = true and
persist the state as before, then allow the existing token-emission path to
produce and return the EOF tuple so the coordinates and line content come from
the real EOF token rather than being hardcoded.
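For reference, CPython's public tokenize module reports the end-of-file token at column 0 of the line after the last input line, with an empty string and empty line text; a quick check using the public API rather than _tokenize, so this is a rough comparison only:

```python
import io
import tokenize

tokens = list(tokenize.generate_tokens(io.StringIO("x = 1\n").readline))
print(tokens[-1])
# Expected along the lines of:
# TokenInfo(type=0 (ENDMARKER), string='', start=(2, 0), end=(2, 0), line='')
```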

Comment on lines +152 to +154
            if !zelf.extra_tokens {
                token_repr = token_repr.trim();
            }

⚠️ Potential issue | 🟠 Major

Don't trim token text when extra_tokens is false.

trim() wipes indentation whitespace (and other significant padding), so INDENT tokens end up empty strings. CPython preserves the exact lexeme. Just use the slice as-is.

-            let mut token_repr = &state.source[state.range()];
-            if !zelf.extra_tokens {
-                token_repr = token_repr.trim();
-            }
+            let token_repr = &state.source[state.range()];
🤖 Prompt for AI Agents
In stdlib/src/tokenize.rs around lines 152 to 154, token text is being trimmed
when zelf.extra_tokens is false which removes significant whitespace (e.g.,
INDENT lexemes); remove the call to trim() and use the slice as-is so the
original lexeme (including leading/trailing whitespace) is preserved when
extra_tokens is false.
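To see why trimming is a problem, CPython's public tokenizer keeps the raw indentation as the INDENT token's string; a small check:

```python
import io
import tokenize

src = "if x:\n    y\n"
for tok in tokenize.generate_tokens(io.StringIO(src).readline):
    if tok.type == tokenize.INDENT:
        print(repr(tok.string))  # expected '    ', not ''
```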

Comment on lines +179 to +180
extra_tokens: bool,
#[pyarg(named, optional)]

⚠️ Potential issue | 🔴 Critical

extra_tokens must default to False.

CPython’s TokenizerIter allows calling TokenizerIter(readline) without naming the flag. Requiring the argument here breaks that API.

-        #[pyarg(named)]
+        #[pyarg(named, default = "false")]
         extra_tokens: bool,

Committable suggestion skipped: line range outside the PR's diff.

🤖 Prompt for AI Agents
In stdlib/src/tokenize.rs around lines 179-180, the extra_tokens parameter
currently requires being passed and therefore breaks CPython's TokenizerIter
API; make extra_tokens default to False so it can be omitted. Update the pyarg
declaration to make the argument optional with a default of false (or change the
parameter to Option<bool> and treat None as false) so TokenizerIter(readline)
works without naming this flag.

coderabbitai bot (Contributor) left a comment

Actionable comments posted: 2

♻️ Duplicate comments (4)
stdlib/src/tokenize.rs (4)

39-42: Don't swallow non-StopIteration errors from readline.

Catching every exception and treating it as EOF hides real failures (TypeError, UnicodeDecodeError, etc.). Only StopIteration should terminate the stream; all other errors must propagate.

Apply this diff to fix the error handling:

-            let raw_line = match self.readline.invoke((), vm) {
-                Ok(v) => v,
-                Err(_) => return Ok(String::new()),
-            };
+            let raw_line = match self.readline.invoke((), vm) {
+                Ok(v) => v,
+                Err(exc) => {
+                    if exc.fast_isinstance(vm.ctx.exceptions.stop_iteration) {
+                        return Ok(String::new());
+                    }
+                    return Err(exc);
+                }
+            };
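To make the distinction concrete, here is a small Python sketch of the two readline conventions the iterator has to handle (the broken readline below is a hypothetical illustration; per the comment above, only StopIteration should be treated as end of input):

```python
import io
from functools import partial

# EOF signalled by returning '' (file-like readline):
readline_empty = io.StringIO("x = 1\n").readline

# EOF signalled by raising StopIteration (e.g. next() over an iterator):
readline_stop = partial(next, iter(["x = 1\n"]))

# Any other exception should propagate to the caller, not be swallowed as EOF:
def readline_broken():
    raise UnicodeDecodeError("utf-8", b"\xff", 0, 1, "invalid start byte")
```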

116-138: End-of-file tuple needs to come from the real EOF token.

Synthesizing the EOF tuple here hardcodes (line, -1) and reuses the previous line text, so consumers see the wrong coordinates and line content. Instead, retrieve the parser's EndOfFile token, set it as prev_token, and let the existing emission path produce the correct tuple.

Apply this diff to use the parser's EOF token:

                 if nline.is_empty() {
                     state.eof = true;
-                    *zelf.state.write() = state.clone();
-
-                    let line_num = &state.start().0;
-                    let out = vm
-                        .ctx
-                        .new_tuple(vec![
-                            token_kind_value(TokenKind::EndOfFile).to_pyobject(vm),
-                            vm.ctx.new_str("").into(),
-                            vm.ctx
-                                .new_tuple(vec![line_num.to_pyobject(vm), (-1).to_pyobject(vm)])
-                                .into(),
-                            vm.ctx
-                                .new_tuple(vec![line_num.to_pyobject(vm), (-1).to_pyobject(vm)])
-                                .into(),
-                            vm.ctx.new_str(state.current_line()).into(),
-                        ])
-                        .into();
-                    return Ok(PyIterReturn::Return(out));
+                    if let Some(eof) = state
+                        .tokens
+                        .iter()
+                        .rev()
+                        .find(|token| token.kind() == TokenKind::EndOfFile)
+                    {
+                        state.prev_token = Some(*eof);
+                        break *eof;
+                    }
+                    *zelf.state.write() = state.clone();
+                    return Ok(PyIterReturn::StopIteration(None));
                 }

151-154: Don't trim token text when extra_tokens is false.

trim() removes indentation whitespace and other significant padding, causing INDENT tokens to become empty strings. CPython preserves the exact lexeme. Use the slice as-is without trimming.

Apply this diff to preserve the exact lexeme:

-            let mut token_repr = &state.source[state.range()];
-            if !zelf.extra_tokens {
-                token_repr = token_repr.trim();
-            }
+            let token_repr = &state.source[state.range()];

178-180: extra_tokens must default to False.

CPython's TokenizerIter allows calling TokenizerIter(readline) without naming the flag. Requiring the argument here breaks API compatibility.

Apply this diff to make extra_tokens optional with a default:

         #[pyarg(positional)]
         readline: ArgCallable,
-        #[pyarg(named)]
-        extra_tokens: bool,
+        #[pyarg(named, default = "false")]
+        extra_tokens: bool,
         #[pyarg(named, optional)]
         encoding: Option<PyStrRef>,
🧹 Nitpick comments (2)
stdlib/src/tokenize.rs (2)

108-109: Reminder: Address the TODO comment for error handling.

The TODO indicates that error checking (e.g., raising SyntaxError for parse errors) is missing. The state.errors field is populated but never checked.

Do you want me to generate code to check state.errors and raise appropriate SyntaxError exceptions, or open a new issue to track this task?
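For comparison, CPython's tokenizer surfaces such problems as exceptions rather than ignoring them; the exact exception type varies by version and error kind, so the check below is deliberately broad:

```python
import io
import tokenize

bad = "x = (1\n"  # unclosed parenthesis at EOF
try:
    list(tokenize.generate_tokens(io.StringIO(bad).readline))
except (SyntaxError, tokenize.TokenizeError) as exc:
    print(type(exc).__name__, exc)  # an error is raised, not silently dropped
```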


200-207: Re-parsing the entire source on each line may be inefficient.

The push_line method re-parses the entire accumulated source text each time a new line is added. For large files, this could become a performance bottleneck with O(n²) behavior.

However, this approach may be necessary for correct tokenization since Python's grammar is context-sensitive (e.g., for indent/dedent tracking). Consider profiling to determine if incremental parsing or caching strategies would be beneficial.

📜 Review details

Configuration used: Path: .coderabbit.yml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 7bc0197 and 04371d3.

📒 Files selected for processing (1)
  • stdlib/src/tokenize.rs (1 hunks)
🧰 Additional context used
📓 Path-based instructions (2)
**/*.rs

📄 CodeRabbit inference engine (.github/copilot-instructions.md)

**/*.rs: Format Rust code with the default rustfmt style (run cargo fmt)
Run clippy and fix any warnings or lints introduced by your changes
Follow Rust best practices for error handling and memory management

Files:

  • stdlib/src/tokenize.rs
{vm,stdlib}/**/*.rs

📄 CodeRabbit inference engine (.github/copilot-instructions.md)

Use RustPython macros (pyclass, pymodule, pyfunction, etc.) when implementing Python functionality in Rust

Files:

  • stdlib/src/tokenize.rs
🧠 Learnings (2)
📓 Common learnings
Learnt from: CR
Repo: RustPython/RustPython PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-09-11T05:06:59.971Z
Learning: Applies to {vm,stdlib}/**/*.rs : Use RustPython macros (`pyclass`, `pymodule`, `pyfunction`, etc.) when implementing Python functionality in Rust
Learnt from: CR
Repo: RustPython/RustPython PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-09-11T05:06:59.971Z
Learning: Applies to Lib/**/*.py : Minimize modifications to CPython standard library files; only make minimal changes to work around RustPython limitations
📚 Learning: 2025-09-11T05:06:59.971Z
Learnt from: CR
Repo: RustPython/RustPython PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-09-11T05:06:59.971Z
Learning: Applies to {vm,stdlib}/**/*.rs : Use RustPython macros (`pyclass`, `pymodule`, `pyfunction`, etc.) when implementing Python functionality in Rust

Applied to files:

  • stdlib/src/tokenize.rs
🧬 Code graph analysis (1)
stdlib/src/tokenize.rs (2)
vm/src/stdlib/io.rs (1)
  • vm (3324-3326)
Lib/test/test_tokenize.py (4)
  • readline (1329-1335)
  • readline (1350-1356)
  • readline (1576-1581)
  • readline (2106-2107)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (10)
  • GitHub Check: Run snippets and cpython tests on wasm-wasi
  • GitHub Check: Check Rust code with rustfmt and clippy
  • GitHub Check: Check the WASM package and demo
  • GitHub Check: Run snippets and cpython tests (macos-latest)
  • GitHub Check: Run snippets and cpython tests (ubuntu-latest)
  • GitHub Check: Run snippets and cpython tests (windows-latest)
  • GitHub Check: Run rust tests (ubuntu-latest)
  • GitHub Check: Run rust tests (windows-latest)
  • GitHub Check: Run rust tests (macos-latest)
  • GitHub Check: Ensure compilation on various targets
🔇 Additional comments (5)
stdlib/src/tokenize.rs (5)

1-23: LGTM!

Module structure and imports follow RustPython conventions correctly.


25-33: LGTM!

The struct definition follows RustPython patterns correctly. The TODO comment is noted but doesn't block the current implementation.


61-72: LGTM!

Debug implementation and empty pyclass impl block are correct.


74-93: LGTM!

Constructor implementation follows RustPython patterns correctly.


210-220: No changes needed to current_line logic.

The current_line method correctly uses the LineRanges trait methods from ruff_source_file: full_line_str(offset) returns the full line containing an offset, and full_lines_str(range) returns text covering all lines that a range spans. The logic properly handles edge cases:

  • Multi-line tokens: Correctly handled via full_lines_str(range) for non-Newline tokens
  • Newline tokens: Correctly isolated via full_line_str(range.start())
  • When prev_token is None: Safely defaults to TokenKind::Unknown with TextRange::default(), which calls full_lines_str and returns lines from the start

The code is sound and requires no modifications.

Comment on lines +223 to +238
    fn next_token(&mut self) -> Option<Token> {
        for token in self.tokens.iter() {
            let (kind, range) = token.as_tuple();

            if matches!(kind, TokenKind::NonLogicalNewline) {
                continue;
            }

            if matches!(range.ordering(self.range()), Ordering::Greater) {
                self.prev_token = Some(*token);
                return self.prev_token;
            }
        }

        None
    }

⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

Verify the token ordering logic.

The next_token method uses range.ordering(self.range()) to find tokens after the current position. Ensure this correctly handles:

  • Zero-width tokens (if any)
  • Tokens at the exact same position
  • The initial state when prev_token is None (resulting in TextRange::default())

Run the following script to verify the behavior with the Ruff crates:


🏁 Script executed:

#!/bin/bash
# Check how TextRange::default() and ordering work
ast-grep --pattern $'TextRange::default()'
rg -n -C3 "TextRange::default|ordering" --type=rust

Length of output: 10814


I need to find the TextRange type definition and understand its ordering() method to verify the logic.

Now I need to understand the context better. Let me search for how next_token is used and what the intended behavior should be.

Let me search for the actual tokenize.rs file context to understand the full implementation.


Let me try a different approach to view the file:



Now let me examine the tokenize.rs file more closely:


---

🏁 Script executed:

#!/bin/bash
# Get the full context of the tokenize.rs file around the problematic method
cat stdlib/src/tokenize.rs | head -300 | tail -100

Length of output: 3125


🏁 Script executed:

#!/bin/bash
# Find and display the tokenize.rs file with context around the next_token method
fd -name "tokenize.rs" -type f

Length of output: 230


🏁 Script executed:

#!/bin/bash
# Display the tokenize.rs file to understand the structure and usage
rg -n "fn next_token|impl|struct|fn range|self.tokens" stdlib/src/tokenize.rs -A5 -B2 --type=rust

Length of output: 3859


Let me examine the IterNext implementation to understand how next_token() is actually being used:


---

🏁 Script executed:

```shell
#!/bin/bash
# Get the IterNext implementation to see how next_token is used
sed -n '97,173p' stdlib/src/tokenize.rs

Length of output: 2843


Token ordering logic has a critical issue: tokens at position 0 are skipped on initial state.

When prev_token is None, self.range() returns TextRange::default(), a zero-width range at position 0 (0..0). The check matches!(range.ordering(self.range()), Ordering::Greater) only accepts tokens that lie completely after the current range. Since tokens starting at position 0 (e.g., 0..1) overlap with 0..0, they compare as Ordering::Equal and are skipped, so the initial token in the source is never yielded.

The same issue affects zero-width tokens (e.g., at position N): they overlap with their containing range and won't be returned when that range is current.

The fix requires checking for Ordering::Equal and Ordering::Greater on initial state, or using TextRange::EMPTY properly, or advancing past the current position differently to include overlapping tokens.
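The externally visible requirement is easy to state: a correct tokenizer must yield the token that starts at line 1, column 0. CPython's public tokenizer does, and any fix here should preserve that:

```python
import io
import tokenize

first = next(tokenize.generate_tokens(io.StringIO("x = 1\n").readline))
print(first)
# Expected along the lines of:
# TokenInfo(type=1 (NAME), string='x', start=(1, 0), end=(1, 1), line='x = 1\n')
```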

🤖 Prompt for AI Agents
In stdlib/src/tokenize.rs around lines 223 to 238, the token selection logic
skips tokens that overlap the current range (notably tokens at position 0 when
prev_token is None) because it only accepts Ordering::Greater; change the
comparison to accept both Ordering::Equal and Ordering::Greater (or compare
token.range().start() against self.range().end()) so tokens that start at the
same position as the current range are returned; update the conditional to check
for Equal || Greater (or use token_start >= current_end) and ensure prev_token
is set before returning to preserve state for subsequent calls.

Comment on lines +281 to +386
const fn token_kind_value(kind: TokenKind) -> u8 {
    match kind {
        TokenKind::EndOfFile => 0,
        TokenKind::Name
        | TokenKind::For
        | TokenKind::In
        | TokenKind::Pass
        | TokenKind::Class
        | TokenKind::And
        | TokenKind::Is
        | TokenKind::Raise
        | TokenKind::True
        | TokenKind::False
        | TokenKind::Assert
        | TokenKind::Try
        | TokenKind::While
        | TokenKind::Yield
        | TokenKind::Lambda
        | TokenKind::None
        | TokenKind::Not
        | TokenKind::Or
        | TokenKind::Break
        | TokenKind::Continue
        | TokenKind::Global
        | TokenKind::Nonlocal
        | TokenKind::Return
        | TokenKind::Except
        | TokenKind::Import
        | TokenKind::Case
        | TokenKind::Match
        | TokenKind::Type
        | TokenKind::Await
        | TokenKind::With
        | TokenKind::Del
        | TokenKind::Finally
        | TokenKind::From
        | TokenKind::Def
        | TokenKind::If
        | TokenKind::Else
        | TokenKind::Elif
        | TokenKind::As
        | TokenKind::Async => 1,
        TokenKind::Int | TokenKind::Complex | TokenKind::Float => 2,
        TokenKind::String => 3,
        TokenKind::Newline | TokenKind::NonLogicalNewline => 4,
        TokenKind::Indent => 5,
        TokenKind::Dedent => 6,
        TokenKind::Lpar => 7,
        TokenKind::Rpar => 8,
        TokenKind::Lsqb => 9,
        TokenKind::Rsqb => 10,
        TokenKind::Colon => 11,
        TokenKind::Comma => 12,
        TokenKind::Semi => 13,
        TokenKind::Plus => 14,
        TokenKind::Minus => 15,
        TokenKind::Star => 16,
        TokenKind::Slash => 17,
        TokenKind::Vbar => 18,
        TokenKind::Amper => 19,
        TokenKind::Less => 20,
        TokenKind::Greater => 21,
        TokenKind::Equal => 22,
        TokenKind::Dot => 23,
        TokenKind::Percent => 24,
        TokenKind::Lbrace => 25,
        TokenKind::Rbrace => 26,
        TokenKind::EqEqual => 27,
        TokenKind::NotEqual => 28,
        TokenKind::LessEqual => 29,
        TokenKind::GreaterEqual => 30,
        TokenKind::Tilde => 31,
        TokenKind::CircumFlex => 32,
        TokenKind::LeftShift => 33,
        TokenKind::RightShift => 34,
        TokenKind::DoubleStar => 35,
        TokenKind::PlusEqual => 36,
        TokenKind::MinusEqual => 37,
        TokenKind::StarEqual => 38,
        TokenKind::SlashEqual => 39,
        TokenKind::PercentEqual => 40,
        TokenKind::AmperEqual => 41,
        TokenKind::VbarEqual => 42,
        TokenKind::CircumflexEqual => 43,
        TokenKind::LeftShiftEqual => 44,
        TokenKind::RightShiftEqual => 45,
        TokenKind::DoubleStarEqual => 46,
        TokenKind::DoubleSlash => 47,
        TokenKind::DoubleSlashEqual => 48,
        TokenKind::At => 49,
        TokenKind::AtEqual => 50,
        TokenKind::Rarrow => 51,
        TokenKind::Ellipsis => 52,
        TokenKind::ColonEqual => 53,
        TokenKind::Exclamation => 54,
        TokenKind::FStringStart => 59,
        TokenKind::FStringMiddle => 60,
        TokenKind::FStringEnd => 61,
        TokenKind::Comment => 62,
        TokenKind::TStringStart => 62,  // 3.14 compatible
        TokenKind::TStringMiddle => 63, // 3.14 compatible
        TokenKind::TStringEnd => 64,    // 3.14 compatible
        TokenKind::IpyEscapeCommand | TokenKind::Question => 0, // Ruff's specific
        TokenKind::Unknown => 0,
    }
}

⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

Verify token value mappings match CPython.

This mapping should match the token values from CPython's token module. The comment at line 22 indicates TOKEN_OP is 55, which is correct for Python 3.x.

Please verify that all other token values match CPython, especially:

  • Lines 380-382: TString tokens marked as "3.14 compatible"
  • Line 379: Comment token maps to 62 (same as TStringStart on line 380)
  • Lines 383-384: Ruff-specific tokens map to 0

🌐 Web query:

What are the token type values in CPython's token module for Python 3.14?

💡 Result:

I can show you the exact numeric values from your local CPython 3.14 installation — run this in a Python 3.14 interpreter:

python -c "import token, pprint; pprint.pprint({name: getattr(token, name) for name in sorted(token.tok_name.values())})"

Or, to get the reverse mapping (name → numeric value):

python -c "import token, pprint; pprint.pprint({k: v for v, k in token.tok_name.items()})"

The official Python 3.14 documentation for the token module (and the module source) is here for reference. [1]

Sources:
[1] Python 3.14 token module documentation — docs.python.org.


#!/usr/bin/env python3
import sys
try:
    import token
    print("Python version:", sys.version)
    print("\nToken name to value mapping:")
    # Get token name to value mapping
    tok_mapping = {}
    for value, name in token.tok_name.items():
        tok_mapping[name] = value
    
    # Sort and print
    for name in sorted(tok_mapping.keys()):
        print(f"  {name}: {tok_mapping[name]}")
except ImportError as e:
    print(f"Error: {e}")

🏁 Script executed:

#!/bin/bash
# Also check if we can inspect the tokenize.rs file for comparison
cd /tmp || exit 1
cat > check_cpython_tokens.py << 'EOF'
import token
import pprint

print("=== Token module values ===")
for tok_val in sorted(token.tok_name.keys()):
    print(f"{token.tok_name[tok_val]}: {tok_val}")
EOF
python3 check_cpython_tokens.py

Length of output: 844



🏁 Script executed:

#!/bin/bash
# Get the full TokenKind enum definition and context around the function
cd /tmp || exit 1
find . -name "tokenize.rs" -type f 2>/dev/null | head -5

Length of output: 0


🏁 Script executed:

#!/bin/bash
# Search for TokenKind enum definition
rg -n "enum TokenKind" --type=rs -A 100

Length of output: 26



🏁 Script executed:

#!/bin/bash
# Inspect the TokenKind enum and the token_kind_value function
rg -n "enum TokenKind" -A 150 | head -200

Length of output: 0


🏁 Script executed:

#!/bin/bash
# Look at the token_kind_value function and surrounding context
rg -n "const fn token_kind_value" -B 5 -A 120 | head -200

Length of output: 6992


🏁 Script executed:

#!/bin/bash
# Search for any comments about token mapping or Python compatibility
rg -n "CPython|Python.*token|token.*value" stdlib/src/tokenize.rs | head -20

Length of output: 333


Critical token mapping errors found - multiple duplicates and CPython misalignment.

Verification against CPython 3.14 token module revealed significant issues:

  1. Duplicate values (lines 379-380): Both Comment and TStringStart map to 62, which is invalid
  2. Incorrect Comment value (line 379): Maps to 62, but CPython has COMMENT=61
  3. FString tokens don't match CPython (lines 376-378): Map to 59, 60, 61 which correspond to SOFT_KEYWORD, ERRORTOKEN, and COMMENT in CPython—not FString values
  4. TString tokens misaligned (lines 380-382): The "3.14 compatible" comment is inaccurate; these values (62, 63, 64) don't correspond to template string tokens in CPython 3.14
  5. Exclamation token (line 375): Maps to 54, but CPython has OP=54; Exclamation is not a standard Python token type

The mapping for lines 375-382 needs correction to either match CPython standard tokens or use reserved/unassigned values to avoid collisions and type misalignment.

🤖 Prompt for AI Agents
stdlib/src/tokenize.rs around lines 281-386: the TokenKind->u8 mapping has
duplicate and misaligned values (e.g., Comment and TStringStart both 62,
FStringStart/Middle/End set to 59/60/61 which conflict with CPython 3.14
meanings, Exclamation mapped to 54 which collides with OP, and the "3.14
compatible" comments are incorrect). Fix by replacing those values with the
correct CPython 3.14 token numeric assignments for COMMENT,
ERRORTOKEN/SOFT_KEYWORD/SO_ON if you intend CPython parity; if you intentionally
diverge, pick unused/reserved numeric values that do not collide with existing
entries, remove or correct the misleading "3.14 compatible" comments, ensure
IpyEscapeCommand/Question and Unknown map to distinct safe values (or 0 only
when appropriate), and run tokenizer unit tests and a full token value
uniqueness check to confirm no duplicates remain.

coderabbitai bot (Contributor) left a comment

Actionable comments posted: 0

♻️ Duplicate comments (5)
stdlib/src/tokenize.rs (5)

121-143: EOF tuple must use the parser's actual EndOfFile token.

The code synthesizes the EOF tuple with hardcoded (-1) positions (lines 132, 135) and reuses state.current_line(), which produces incorrect coordinates and line content. Instead, retrieve the parser's EndOfFile token from state.tokens, set it as prev_token, and let the standard token-emission path (lines 145-175) produce the tuple with correct metadata.

Apply this approach:

                 let nline = zelf.readline(vm)?;
                 if nline.is_empty() {
                     state.eof = true;
-                    *zelf.state.write() = state.clone();
-
-                    let line_num = &state.start().0;
-                    let out = vm
-                        .ctx
-                        .new_tuple(vec![
-                            token_kind_value(TokenKind::EndOfFile).to_pyobject(vm),
-                            vm.ctx.new_str("").into(),
-                            vm.ctx
-                                .new_tuple(vec![line_num.to_pyobject(vm), (-1).to_pyobject(vm)])
-                                .into(),
-                            vm.ctx
-                                .new_tuple(vec![line_num.to_pyobject(vm), (-1).to_pyobject(vm)])
-                                .into(),
-                            vm.ctx.new_str(state.current_line()).into(),
-                        ])
-                        .into();
-                    return Ok(PyIterReturn::Return(out));
+                    if let Some(eof) = state
+                        .tokens
+                        .iter()
+                        .find(|token| token.kind() == TokenKind::EndOfFile)
+                    {
+                        state.prev_token = Some(*eof);
+                        break *eof;
+                    }
+                    *zelf.state.write() = state.clone();
+                    return Ok(PyIterReturn::StopIteration(None));
                 }

156-159: Remove trim() to preserve exact token lexemes.

Calling trim() when extra_tokens is false (line 158) strips significant whitespace, causing INDENT tokens to become empty strings. CPython's tokenizer preserves the exact lexeme. Use the slice as-is.

Apply this diff:

-            let mut token_repr = &state.source[state.range()];
-            if !zelf.extra_tokens {
-                token_repr = token_repr.trim();
-            }
+            let token_repr = &state.source[state.range()];

183-184: Make extra_tokens optional with default false.

CPython allows TokenizerIter(readline) without specifying extra_tokens. The current declaration requires the argument, breaking API compatibility.

Apply this diff:

-        #[pyarg(named)]
-        extra_tokens: bool,
+        #[pyarg(named, default = "false")]
+        extra_tokens: bool,

227-243: Token ordering skips tokens that overlap the current range.

When prev_token is None, self.range() returns TextRange::default() (0..0). The comparison range.ordering(self.range()) > Ordering::Greater (line 236) only matches tokens completely after the current range. Tokens starting at position 0 (e.g., 0..1) overlap with 0..0, returning Ordering::Equal, and are skipped. The first token in the source is never yielded.

Fix by comparing start positions:

-                if matches!(range.ordering(self.range()), Ordering::Greater) {
+                if range.start() > self.range().end() {
                     self.prev_token = Some(*token);
                     return self.prev_token;
                 }

286-391: Critical token mapping errors: duplicates and CPython conflicts.

Multiple critical issues remain:

  1. Duplicate value 62 (lines 384-385): Both Comment and TStringStart map to 62
  2. Incorrect Comment mapping (line 384): Maps to 62, but CPython 3.14 has COMMENT=61
  3. FString conflicts (lines 381-383): Values 59, 60, 61 correspond to SOFT_KEYWORD, ERRORTOKEN, and COMMENT in CPython 3.14
  4. Exclamation collision (line 380): Maps to 54, which is CPython's OP token
  5. Misleading comments (lines 385-387): "3.14 compatible" claims are inaccurate

These mappings must be corrected to either:

  • Match CPython 3.14 standard token values exactly, or
  • Use reserved/unassigned values (e.g., 100+) to avoid collisions

Run this verification to confirm no remaining duplicates after fixes:

#!/bin/bash
# Extract all token value assignments and check for duplicates
rg -n "TokenKind::\w+ => \d+" stdlib/src/tokenize.rs | \
  gawk '{print $NF}' | sort | uniq -d | \
  xargs -I {} echo "Duplicate token value found: {}"
🧹 Nitpick comments (2)
stdlib/src/tokenize.rs (2)

189-212: State structure is correct; consider performance implications.

The push_line method reparses the entire accumulated source on each call (line 208). While correct, this has O(n²) complexity for n lines. This is acceptable for now but may need optimization for large files.


113-118: TODO: Implement syntax error handling.

The TODO on line 114 indicates that parse errors from state.errors are not currently raised as SyntaxError. This means invalid Python syntax may be silently ignored. Consider implementing error checking in a follow-up to ensure proper error reporting.

Based on learnings.

📜 Review details

Configuration used: Path: .coderabbit.yml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 04371d3 and 30f54f6.

⛔ Files ignored due to path filters (1)
  • Lib/test/test_tokenize.py is excluded by !Lib/**
📒 Files selected for processing (1)
  • stdlib/src/tokenize.rs (1 hunks)
🧰 Additional context used
📓 Path-based instructions (2)
**/*.rs

📄 CodeRabbit inference engine (.github/copilot-instructions.md)

**/*.rs: Format Rust code with the default rustfmt style (run cargo fmt)
Run clippy and fix any warnings or lints introduced by your changes
Follow Rust best practices for error handling and memory management

Files:

  • stdlib/src/tokenize.rs
{vm,stdlib}/**/*.rs

📄 CodeRabbit inference engine (.github/copilot-instructions.md)

Use RustPython macros (pyclass, pymodule, pyfunction, etc.) when implementing Python functionality in Rust

Files:

  • stdlib/src/tokenize.rs
🧠 Learnings (2)
📓 Common learnings
Learnt from: CR
Repo: RustPython/RustPython PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-09-11T05:06:59.971Z
Learning: Applies to {vm,stdlib}/**/*.rs : Use RustPython macros (`pyclass`, `pymodule`, `pyfunction`, etc.) when implementing Python functionality in Rust
Learnt from: CR
Repo: RustPython/RustPython PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-09-11T05:06:59.971Z
Learning: Applies to Lib/**/*.py : Minimize modifications to CPython standard library files; only make minimal changes to work around RustPython limitations
📚 Learning: 2025-09-11T05:06:59.971Z
Learnt from: CR
Repo: RustPython/RustPython PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-09-11T05:06:59.971Z
Learning: Applies to {vm,stdlib}/**/*.rs : Use RustPython macros (`pyclass`, `pymodule`, `pyfunction`, etc.) when implementing Python functionality in Rust

Applied to files:

  • stdlib/src/tokenize.rs
🧬 Code graph analysis (1)
stdlib/src/tokenize.rs (2)
vm/src/stdlib/io.rs (1)
  • vm (3324-3326)
Lib/test/test_tokenize.py (4)
  • readline (1328-1334)
  • readline (1349-1355)
  • readline (1575-1580)
  • readline (2105-2106)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (10)
  • GitHub Check: Run snippets and cpython tests (ubuntu-latest)
  • GitHub Check: Run snippets and cpython tests on wasm-wasi
  • GitHub Check: Run snippets and cpython tests (macos-latest)
  • GitHub Check: Run snippets and cpython tests (windows-latest)
  • GitHub Check: Run rust tests (macos-latest)
  • GitHub Check: Run rust tests (windows-latest)
  • GitHub Check: Check the WASM package and demo
  • GitHub Check: Run rust tests (ubuntu-latest)
  • GitHub Check: Check Rust code with rustfmt and clippy
  • GitHub Check: Ensure compilation on various targets
🔇 Additional comments (9)
stdlib/src/tokenize.rs (9)

1-23: LGTM: Module structure and imports are well-organized.

The #[pymodule] macro usage follows RustPython patterns, and the TOKEN_OP constant correctly matches CPython's value.


25-33: Struct definition is sound; track the TODO for future work.

The use of PyRwLock for state management is appropriate. The TODO on line 29 regarding PyObject vs ArgCallable can be addressed in a follow-up.


35-64: LGTM: Error handling correctly distinguishes StopIteration from other exceptions.

The implementation now properly treats only StopIteration as EOF (lines 42-43) and propagates all other errors (line 45), fixing the critical issue from the previous review. The encoding logic for both bytes and strings is also correct.


66-77: LGTM: Standard Debug implementation and trait delegation.


79-98: LGTM: Constructor implementation is clean.


100-100: LGTM: Standard SelfIter implementation.


214-225: LGTM: Current line extraction logic is appropriate.


245-267: LGTM: Position helper methods correctly convert to Python conventions.

The start() and end() methods properly return 1-indexed line numbers and 0-indexed column offsets, matching Python's tokenize module.


270-284: LGTM: Default state initialization is clean.

ShaharNaveh marked this pull request as draft November 14, 2025 10:26
