
Conversation

ShaharNaveh (Collaborator) commented Nov 8, 2025

Summary by CodeRabbit

  • New Features

    • New Python tokenizer module: Tokenizer iterator that yields token tuples (type, text, start, end, line), accepts a readline callable, optional encoding, and an extra-token mode; emits EOF tokens with position metadata for streaming/tokenization.
  • Chores

    • Internal dependency declarations updated to support the new tokenizer module.

coderabbitai bot (Contributor) commented Nov 8, 2025

Important

Review skipped

Review was skipped due to path filters

⛔ Files ignored due to path filters (2)
  • Lib/test/test_tarfile.py is excluded by !Lib/**
  • Lib/test/test_unittest/testmock/testhelpers.py is excluded by !Lib/**

CodeRabbit blocks several paths by default. You can override this behavior by explicitly including those paths in the path filters. For example, including **/dist/** overrides the default block on the dist directory by removing the pattern from both lists.

You can disable this status message by setting reviews.review_status to false in the CodeRabbit configuration file.

Walkthrough

Reorders and augments Ruff-related dependencies in the top-level Cargo.toml and adds workspace dependencies in stdlib/Cargo.toml; registers a new internal tokenize module in stdlib/src/lib.rs; and introduces stdlib/src/tokenize.rs, a _tokenize module built with RustPython's pymodule/pyclass macros that exposes TokenizerIter, yields Python token 5-tuples, and manages tokenizer state and EOF.
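For orientation, here is a minimal Python sketch of how such an iterator is typically consumed, mirroring CPython's private _tokenize.TokenizerIter API; the exact constructor signature in this PR is still under review, so the keyword arguments below are assumptions:

```python
import io
import _tokenize  # CPython-private module; Lib/tokenize.py wraps it

source = "x = 1\n"
readline = io.StringIO(source).readline  # returns one line per call, '' at EOF

# Assumed signature: TokenizerIter(readline, *, extra_tokens=..., encoding=...)
it = _tokenize.TokenizerIter(readline, extra_tokens=False)
for tok_type, tok_str, start, end, line in it:
    print(tok_type, repr(tok_str), start, end, repr(line))
```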

Changes

Top-level manifest: Cargo.toml
Reordered Ruff-related crate entries; added ruff_python_trivia; moved ruff_text_size next to ruff_source_file. No API changes.

Stdlib manifest updates: stdlib/Cargo.toml
Added workspace dependencies: ruff_python_ast, ruff_python_parser, ruff_source_file, ruff_text_size (all workspace = true).

Module registration: stdlib/src/lib.rs
Added mod tokenize; and registered _tokenize by including "_tokenize" => tokenize::make_module in the module initializers.

Tokenizer implementation: stdlib/src/tokenize.rs
New _tokenize module, implemented with RustPython's pymodule/pyclass macros, exposing the Python class TokenizerIter (PyTokenizerIter) and its constructor arguments. Implements an iterator that reads lines via a Python readline callable, manages internal state behind a PyRwLock, emits 5-tuples (type, string, start_pos, end_pos, current_line), remaps operator tokens to TOKEN_OP when extra_tokens is true, and emits an EOF token tuple.

Sequence Diagram(s)

sequenceDiagram
    participant PyUser as Python user
    participant TokenIter as TokenizerIter (Py)
    participant State as PyTokenizerIterState
    participant Parser as Ruff parser

    PyUser->>TokenIter: next() / __next__()
    TokenIter->>TokenIter: check internal state & tokens
    alt tokens exhausted
        TokenIter->>PyUser: call readline(vm) -> line
        alt readline returns line
            TokenIter->>State: push_line(line)
            State->>Parser: parse_unchecked_source(line...)
            Parser-->>State: tokens
            State->>TokenIter: next_token()
            TokenIter->>TokenIter: token_kind_value(kind)
            alt extra_tokens and operator
                TokenIter->>TokenIter: remap -> TOKEN_OP (55)
            end
            TokenIter-->>PyUser: (type, string, start, end, current_line)
        else readline -> EOF
            TokenIter->>State: mark EOF
            TokenIter-->>PyUser: EndOfFile tuple (empty string + positions)
        end
    else tokens available
        TokenIter-->>PyUser: next token tuple
    end

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

  • Focus review on stdlib/src/tokenize.rs concurrency (use of PyRwLock), correctness of token position/range calculations, EOF emission behavior, and token-kind remapping.
  • Verify workspace dependency consistency across Cargo.toml and stdlib/Cargo.toml.

Suggested reviewers

  • youknowone

Poem

A rabbit reads each Python line with care,
Hops through tokens drifting in the air,
Ruff crates in order, ready for the quest,
Tuples hop out tidy, positions dressed,
EOF tiptoes in — the iterator rests. 🐇✨

Pre-merge checks and finishing touches

✅ Passed checks (3 passed)
Description Check: ✅ Passed. Skipped because CodeRabbit’s high-level summary is enabled.
Title Check: ✅ Passed. The title accurately describes the main change: implementing a base _tokenize module with a TokenizerIter class and supporting infrastructure.
Docstring Coverage: ✅ Passed. No functions were found in the changed files to evaluate, so the docstring coverage check was skipped.


coderabbitai bot (Contributor) left a comment


Actionable comments posted: 4

📜 Review details

Configuration used: Path: .coderabbit.yml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 5cad66c and 7bc0197.

⛔ Files ignored due to path filters (15)
  • Cargo.lock is excluded by !**/*.lock
  • Lib/test/test_tokenize.py is excluded by !Lib/**
  • Lib/test/tokenizedata/__init__.py is excluded by !Lib/**
  • Lib/test/tokenizedata/bad_coding.py is excluded by !Lib/**
  • Lib/test/tokenizedata/bad_coding2.py is excluded by !Lib/**
  • Lib/test/tokenizedata/badsyntax_3131.py is excluded by !Lib/**
  • Lib/test/tokenizedata/badsyntax_pep3120.py is excluded by !Lib/**
  • Lib/test/tokenizedata/coding20731.py is excluded by !Lib/**
  • Lib/test/tokenizedata/tokenize_tests-latin1-coding-cookie-and-utf8-bom-sig.txt is excluded by !Lib/**
  • Lib/test/tokenizedata/tokenize_tests-no-coding-cookie-and-utf8-bom-sig-only.txt is excluded by !Lib/**
  • Lib/test/tokenizedata/tokenize_tests-utf8-coding-cookie-and-no-utf8-bom-sig.txt is excluded by !Lib/**
  • Lib/test/tokenizedata/tokenize_tests-utf8-coding-cookie-and-utf8-bom-sig.txt is excluded by !Lib/**
  • Lib/test/tokenizedata/tokenize_tests.txt is excluded by !Lib/**
  • Lib/token.py is excluded by !Lib/**
  • Lib/tokenize.py is excluded by !Lib/**
📒 Files selected for processing (4)
  • Cargo.toml (1 hunks)
  • stdlib/Cargo.toml (1 hunks)
  • stdlib/src/lib.rs (2 hunks)
  • stdlib/src/tokenize.rs (1 hunks)
🧰 Additional context used
📓 Path-based instructions (2)
**/*.rs

📄 CodeRabbit inference engine (.github/copilot-instructions.md)

**/*.rs: Format Rust code with the default rustfmt style (run cargo fmt)
Run clippy and fix any warnings or lints introduced by your changes
Follow Rust best practices for error handling and memory management

Files:

  • stdlib/src/lib.rs
  • stdlib/src/tokenize.rs
{vm,stdlib}/**/*.rs

📄 CodeRabbit inference engine (.github/copilot-instructions.md)

Use RustPython macros (pyclass, pymodule, pyfunction, etc.) when implementing Python functionality in Rust

Files:

  • stdlib/src/lib.rs
  • stdlib/src/tokenize.rs
🧠 Learnings (4)
📓 Common learnings
Learnt from: CR
Repo: RustPython/RustPython PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-09-11T05:06:59.971Z
Learning: Applies to {vm,stdlib}/**/*.rs : Use RustPython macros (`pyclass`, `pymodule`, `pyfunction`, etc.) when implementing Python functionality in Rust
📚 Learning: 2025-09-11T05:06:59.971Z
Learnt from: CR
Repo: RustPython/RustPython PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-09-11T05:06:59.971Z
Learning: Applies to {vm,stdlib}/**/*.rs : Use RustPython macros (`pyclass`, `pymodule`, `pyfunction`, etc.) when implementing Python functionality in Rust

Applied to files:

  • stdlib/src/lib.rs
  • stdlib/src/tokenize.rs
📚 Learning: 2025-09-11T05:06:59.971Z
Learnt from: CR
Repo: RustPython/RustPython PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-09-11T05:06:59.971Z
Learning: Applies to Lib/**/*.py : Minimize modifications to CPython standard library files; only make minimal changes to work around RustPython limitations

Applied to files:

  • stdlib/Cargo.toml
  • stdlib/src/tokenize.rs
  • Cargo.toml
📚 Learning: 2025-09-11T05:06:59.971Z
Learnt from: CR
Repo: RustPython/RustPython PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-09-11T05:06:59.971Z
Learning: Applies to extra_tests/**/*.py : Use ruff to lint custom Python code

Applied to files:

  • Cargo.toml
🧬 Code graph analysis (2)
stdlib/src/lib.rs (2)
Lib/tokenize.py (1)
  • tokenize (466-492)
stdlib/src/socket.rs (1)
  • make_module (7-11)
stdlib/src/tokenize.rs (2)
vm/src/stdlib/io.rs (1)
  • vm (3324-3326)
Lib/test/test_tokenize.py (4)
  • readline (1329-1335)
  • readline (1350-1356)
  • readline (1576-1581)
  • readline (2106-2107)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (11)
  • GitHub Check: Run snippets and cpython tests (ubuntu-latest)
  • GitHub Check: Run tests under miri
  • GitHub Check: Run snippets and cpython tests on wasm-wasi
  • GitHub Check: Check the WASM package and demo
  • GitHub Check: Run snippets and cpython tests (macos-latest)
  • GitHub Check: Run snippets and cpython tests (windows-latest)
  • GitHub Check: Check Rust code with rustfmt and clippy
  • GitHub Check: Run rust tests (windows-latest)
  • GitHub Check: Run rust tests (ubuntu-latest)
  • GitHub Check: Run rust tests (macos-latest)
  • GitHub Check: Ensure compilation on various targets

Comment on lines +116 to +138
                if nline.is_empty() {
                    state.eof = true;
                    *zelf.state.write() = state.clone();

                    let line_num = &state.start().0;
                    let out = vm
                        .ctx
                        .new_tuple(vec![
                            token_kind_value(TokenKind::EndOfFile).to_pyobject(vm),
                            vm.ctx.new_str("").into(),
                            vm.ctx
                                .new_tuple(vec![line_num.to_pyobject(vm), (-1).to_pyobject(vm)])
                                .into(),
                            vm.ctx
                                .new_tuple(vec![line_num.to_pyobject(vm), (-1).to_pyobject(vm)])
                                .into(),
                            vm.ctx.new_str(state.current_line()).into(),
                        ])
                        .into();
                    return Ok(PyIterReturn::Return(out));
                }
                state.push_line(&nline);
            };

⚠️ Potential issue | 🔴 Critical

End-of-file tuple needs to come from the real EOF token.

Synthesizing the tuple here hardcodes (line, -1) and reuses the previous line text, so consumers see the wrong coordinates and line content. Grab the parser’s EndOfFile token, set it as prev_token, and let the existing emission path produce the tuple.

-                if nline.is_empty() {
-                    state.eof = true;
-                    *zelf.state.write() = state.clone();
-
-                    let line_num = &state.start().0;
-                    let out = vm
-                        .ctx
-                        .new_tuple(vec![
-                            token_kind_value(TokenKind::EndOfFile).to_pyobject(vm),
-                            vm.ctx.new_str("").into(),
-                            vm.ctx
-                                .new_tuple(vec![line_num.to_pyobject(vm), (-1).to_pyobject(vm)])
-                                .into(),
-                            vm.ctx
-                                .new_tuple(vec![line_num.to_pyobject(vm), (-1).to_pyobject(vm)])
-                                .into(),
-                            vm.ctx.new_str(state.current_line()).into(),
-                        ])
-                        .into();
-                    return Ok(PyIterReturn::Return(out));
-                }
+                if nline.is_empty() {
+                    state.eof = true;
+                    if let Some(eof) = state
+                        .tokens
+                        .iter()
+                        .rev()
+                        .find(|token| token.kind() == TokenKind::EndOfFile)
+                    {
+                        state.prev_token = Some(*eof);
+                        break eof.clone();
+                    }
+                    return Ok(PyIterReturn::StopIteration(None));
+                }

Committable suggestion skipped: line range outside the PR's diff.

🤖 Prompt for AI Agents
In stdlib/src/tokenize.rs around lines 116 to 138, don't synthesize the EOF
tuple here (which hardcodes (line, -1) and reuses the previous line text);
instead obtain the parser's EndOfFile token, set it as the state's prev_token
(or otherwise set prev_token to that EOF token), set state.eof = true and
persist the state as before, then allow the existing token-emission path to
produce and return the EOF tuple so the coordinates and line content come from
the real EOF token rather than being hardcoded.
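For reference, CPython's public tokenize module reports the end-of-file token at column 0 of the line after the last input line, with an empty string and empty line text; a quick check using the public API rather than _tokenize, so this is a rough comparison only:

```python
import io
import tokenize

tokens = list(tokenize.generate_tokens(io.StringIO("x = 1\n").readline))
print(tokens[-1])
# Expected along the lines of:
# TokenInfo(type=0 (ENDMARKER), string='', start=(2, 0), end=(2, 0), line='')
```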

Comment on lines +152 to +154
            if !zelf.extra_tokens {
                token_repr = token_repr.trim();
            }

⚠️ Potential issue | 🟠 Major

Don't trim token text when extra_tokens is false.

trim() wipes indentation whitespace (and other significant padding), so INDENT tokens end up empty strings. CPython preserves the exact lexeme. Just use the slice as-is.

-            let mut token_repr = &state.source[state.range()];
-            if !zelf.extra_tokens {
-                token_repr = token_repr.trim();
-            }
+            let token_repr = &state.source[state.range()];
🤖 Prompt for AI Agents
In stdlib/src/tokenize.rs around lines 152 to 154, token text is being trimmed
when zelf.extra_tokens is false which removes significant whitespace (e.g.,
INDENT lexemes); remove the call to trim() and use the slice as-is so the
original lexeme (including leading/trailing whitespace) is preserved when
extra_tokens is false.
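To see why trimming is a problem, CPython's public tokenizer keeps the raw indentation as the INDENT token's string; a small check:

```python
import io
import tokenize

src = "if x:\n    y\n"
for tok in tokenize.generate_tokens(io.StringIO(src).readline):
    if tok.type == tokenize.INDENT:
        print(repr(tok.string))  # expected '    ', not ''
```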

Comment on lines +179 to +180
extra_tokens: bool,
#[pyarg(named, optional)]

⚠️ Potential issue | 🔴 Critical

extra_tokens must default to False.

CPython’s TokenizerIter allows calling TokenizerIter(readline) without naming the flag. Requiring the argument here breaks that API.

-        #[pyarg(named)]
+        #[pyarg(named, default = "false")]
         extra_tokens: bool,

Committable suggestion skipped: line range outside the PR's diff.

🤖 Prompt for AI Agents
In stdlib/src/tokenize.rs around lines 179-180, the extra_tokens parameter
currently requires being passed and therefore breaks CPython's TokenizerIter
API; make extra_tokens default to False so it can be omitted. Update the pyarg
declaration to make the argument optional with a default of false (or change the
parameter to Option<bool> and treat None as false) so TokenizerIter(readline)
works without naming this flag.

coderabbitai bot (Contributor) left a comment

Actionable comments posted: 2

♻️ Duplicate comments (4)
stdlib/src/tokenize.rs (4)

39-42: Don't swallow non-StopIteration errors from readline.

Catching every exception and treating it as EOF hides real failures (TypeError, UnicodeDecodeError, etc.). Only StopIteration should terminate the stream; all other errors must propagate.

Apply this diff to fix the error handling:

-            let raw_line = match self.readline.invoke((), vm) {
-                Ok(v) => v,
-                Err(_) => return Ok(String::new()),
-            };
+            let raw_line = match self.readline.invoke((), vm) {
+                Ok(v) => v,
+                Err(exc) => {
+                    if exc.fast_isinstance(vm.ctx.exceptions.stop_iteration) {
+                        return Ok(String::new());
+                    }
+                    return Err(exc);
+                }
+            };
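To make the distinction concrete, here is a small Python sketch of the two readline conventions the iterator has to handle (the broken readline below is a hypothetical illustration; per the comment above, only StopIteration should be treated as end of input):

```python
import io
from functools import partial

# EOF signalled by returning '' (file-like readline):
readline_empty = io.StringIO("x = 1\n").readline

# EOF signalled by raising StopIteration (e.g. next() over an iterator):
readline_stop = partial(next, iter(["x = 1\n"]))

# Any other exception should propagate to the caller, not be swallowed as EOF:
def readline_broken():
    raise UnicodeDecodeError("utf-8", b"\xff", 0, 1, "invalid start byte")
```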

116-138: End-of-file tuple needs to come from the real EOF token.

Synthesizing the EOF tuple here hardcodes (line, -1) and reuses the previous line text, so consumers see the wrong coordinates and line content. Instead, retrieve the parser's EndOfFile token, set it as prev_token, and let the existing emission path produce the correct tuple.

Apply this diff to use the parser's EOF token:

                 if nline.is_empty() {
                     state.eof = true;
-                    *zelf.state.write() = state.clone();
-
-                    let line_num = &state.start().0;
-                    let out = vm
-                        .ctx
-                        .new_tuple(vec![
-                            token_kind_value(TokenKind::EndOfFile).to_pyobject(vm),
-                            vm.ctx.new_str("").into(),
-                            vm.ctx
-                                .new_tuple(vec![line_num.to_pyobject(vm), (-1).to_pyobject(vm)])
-                                .into(),
-                            vm.ctx
-                                .new_tuple(vec![line_num.to_pyobject(vm), (-1).to_pyobject(vm)])
-                                .into(),
-                            vm.ctx.new_str(state.current_line()).into(),
-                        ])
-                        .into();
-                    return Ok(PyIterReturn::Return(out));
+                    if let Some(eof) = state
+                        .tokens
+                        .iter()
+                        .rev()
+                        .find(|token| token.kind() == TokenKind::EndOfFile)
+                    {
+                        state.prev_token = Some(*eof);
+                        break *eof;
+                    }
+                    *zelf.state.write() = state.clone();
+                    return Ok(PyIterReturn::StopIteration(None));
                 }

151-154: Don't trim token text when extra_tokens is false.

trim() removes indentation whitespace and other significant padding, causing INDENT tokens to become empty strings. CPython preserves the exact lexeme. Use the slice as-is without trimming.

Apply this diff to preserve the exact lexeme:

-            let mut token_repr = &state.source[state.range()];
-            if !zelf.extra_tokens {
-                token_repr = token_repr.trim();
-            }
+            let token_repr = &state.source[state.range()];

178-180: extra_tokens must default to False.

CPython's TokenizerIter allows calling TokenizerIter(readline) without naming the flag. Requiring the argument here breaks API compatibility.

Apply this diff to make extra_tokens optional with a default:

         #[pyarg(positional)]
         readline: ArgCallable,
-        #[pyarg(named)]
-        extra_tokens: bool,
+        #[pyarg(named, default = "false")]
+        extra_tokens: bool,
         #[pyarg(named, optional)]
         encoding: Option<PyStrRef>,
🧹 Nitpick comments (2)
stdlib/src/tokenize.rs (2)

108-109: Reminder: Address the TODO comment for error handling.

The TODO indicates that error checking (e.g., raising SyntaxError for parse errors) is missing. The state.errors field is populated but never checked.

Do you want me to generate code to check state.errors and raise appropriate SyntaxError exceptions, or open a new issue to track this task?
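For comparison, CPython's tokenizer surfaces such problems as exceptions rather than ignoring them; the exact exception type varies by version and error kind, so the check below is deliberately broad:

```python
import io
import tokenize

bad = "x = (1\n"  # unclosed parenthesis at EOF
try:
    list(tokenize.generate_tokens(io.StringIO(bad).readline))
except (SyntaxError, tokenize.TokenizeError) as exc:
    print(type(exc).__name__, exc)  # an error is raised, not silently dropped
```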


200-207: Re-parsing the entire source on each line may be inefficient.

The push_line method re-parses the entire accumulated source text each time a new line is added. For large files, this could become a performance bottleneck with O(n²) behavior.

However, this approach may be necessary for correct tokenization since Python's grammar is context-sensitive (e.g., for indent/dedent tracking). Consider profiling to determine if incremental parsing or caching strategies would be beneficial.

📜 Review details

Configuration used: Path: .coderabbit.yml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 7bc0197 and 04371d3.

📒 Files selected for processing (1)
  • stdlib/src/tokenize.rs (1 hunks)
🧰 Additional context used
📓 Path-based instructions (2)
**/*.rs

📄 CodeRabbit inference engine (.github/copilot-instructions.md)

**/*.rs: Format Rust code with the default rustfmt style (run cargo fmt)
Run clippy and fix any warnings or lints introduced by your changes
Follow Rust best practices for error handling and memory management

Files:

  • stdlib/src/tokenize.rs
{vm,stdlib}/**/*.rs

📄 CodeRabbit inference engine (.github/copilot-instructions.md)

Use RustPython macros (pyclass, pymodule, pyfunction, etc.) when implementing Python functionality in Rust

Files:

  • stdlib/src/tokenize.rs
🧠 Learnings (2)
📓 Common learnings
Learnt from: CR
Repo: RustPython/RustPython PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-09-11T05:06:59.971Z
Learning: Applies to {vm,stdlib}/**/*.rs : Use RustPython macros (`pyclass`, `pymodule`, `pyfunction`, etc.) when implementing Python functionality in Rust
Learnt from: CR
Repo: RustPython/RustPython PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-09-11T05:06:59.971Z
Learning: Applies to Lib/**/*.py : Minimize modifications to CPython standard library files; only make minimal changes to work around RustPython limitations
📚 Learning: 2025-09-11T05:06:59.971Z
Learnt from: CR
Repo: RustPython/RustPython PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-09-11T05:06:59.971Z
Learning: Applies to {vm,stdlib}/**/*.rs : Use RustPython macros (`pyclass`, `pymodule`, `pyfunction`, etc.) when implementing Python functionality in Rust

Applied to files:

  • stdlib/src/tokenize.rs
🧬 Code graph analysis (1)
stdlib/src/tokenize.rs (2)
vm/src/stdlib/io.rs (1)
  • vm (3324-3326)
Lib/test/test_tokenize.py (4)
  • readline (1329-1335)
  • readline (1350-1356)
  • readline (1576-1581)
  • readline (2106-2107)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (10)
  • GitHub Check: Run snippets and cpython tests on wasm-wasi
  • GitHub Check: Check Rust code with rustfmt and clippy
  • GitHub Check: Check the WASM package and demo
  • GitHub Check: Run snippets and cpython tests (macos-latest)
  • GitHub Check: Run snippets and cpython tests (ubuntu-latest)
  • GitHub Check: Run snippets and cpython tests (windows-latest)
  • GitHub Check: Run rust tests (ubuntu-latest)
  • GitHub Check: Run rust tests (windows-latest)
  • GitHub Check: Run rust tests (macos-latest)
  • GitHub Check: Ensure compilation on various targets
🔇 Additional comments (5)
stdlib/src/tokenize.rs (5)

1-23: LGTM!

Module structure and imports follow RustPython conventions correctly.


25-33: LGTM!

The struct definition follows RustPython patterns correctly. The TODO comment is noted but doesn't block the current implementation.


61-72: LGTM!

Debug implementation and empty pyclass impl block are correct.


74-93: LGTM!

Constructor implementation follows RustPython patterns correctly.


210-220: No changes needed to current_line logic.

The current_line method correctly uses the LineRanges trait methods from ruff_source_file: full_line_str(offset) returns the full line containing an offset, and full_lines_str(range) returns text covering all lines that a range spans. The logic properly handles edge cases:

  • Multi-line tokens: Correctly handled via full_lines_str(range) for non-Newline tokens
  • Newline tokens: Correctly isolated via full_line_str(range.start())
  • When prev_token is None: Safely defaults to TokenKind::Unknown with TextRange::default(), which calls full_lines_str and returns lines from the start

The code is sound and requires no modifications.

Comment on lines +223 to +238
    fn next_token(&mut self) -> Option<Token> {
        for token in self.tokens.iter() {
            let (kind, range) = token.as_tuple();

            if matches!(kind, TokenKind::NonLogicalNewline) {
                continue;
            }

            if matches!(range.ordering(self.range()), Ordering::Greater) {
                self.prev_token = Some(*token);
                return self.prev_token;
            }
        }

        None
    }

⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

Verify the token ordering logic.

The next_token method uses range.ordering(self.range()) to find tokens after the current position. Ensure this correctly handles:

  • Zero-width tokens (if any)
  • Tokens at the exact same position
  • The initial state when prev_token is None (resulting in TextRange::default())

Run the following script to verify the behavior with the Ruff crates:


🏁 Script executed:

#!/bin/bash
# Check how TextRange::default() and ordering work
ast-grep --pattern $'TextRange::default()'
rg -n -C3 "TextRange::default|ordering" --type=rust

Length of output: 10814


I need to find the TextRange type definition and understand its ordering() method to verify the logic.

Now I need to understand the context better. Let me search for how next_token is used and what the intended behavior should be.

Let me search for the actual tokenize.rs file context to understand the full implementation.


Let me try a different approach to view the file:



Now let me examine the tokenize.rs file more closely:


---

🏁 Script executed:

#!/bin/bash
# Get the full context of the tokenize.rs file around the problematic method
cat stdlib/src/tokenize.rs | head -300 | tail -100

Length of output: 3125


🏁 Script executed:

#!/bin/bash
# Find and display the tokenize.rs file with context around the next_token method
fd -name "tokenize.rs" -type f

Length of output: 230


🏁 Script executed:

#!/bin/bash
# Display the tokenize.rs file to understand the structure and usage
rg -n "fn next_token|impl|struct|fn range|self.tokens" stdlib/src/tokenize.rs -A5 -B2 --type=rust

Length of output: 3859


Let me examine the IterNext implementation to understand how next_token() is actually being used:


---

🏁 Script executed:

```shell
#!/bin/bash
# Get the IterNext implementation to see how next_token is used
sed -n '97,173p' stdlib/src/tokenize.rs

Length of output: 2843


Token ordering logic has a critical issue: tokens at position 0 are skipped on initial state.

When prev_token is None, self.range() returns TextRange::default(), a zero-width range at position 0 (0..0). The check matches!(range.ordering(self.range()), Ordering::Greater) only accepts tokens that lie completely after the current range. Since tokens starting at position 0 (e.g., 0..1) overlap with 0..0, they compare as Ordering::Equal and are skipped, so the initial token in the source is never yielded.

The same issue affects zero-width tokens (e.g., at position N): they overlap with their containing range and won't be returned when that range is current.

The fix requires checking for Ordering::Equal and Ordering::Greater on initial state, or using TextRange::EMPTY properly, or advancing past the current position differently to include overlapping tokens.
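The externally visible requirement is easy to state: a correct tokenizer must yield the token that starts at line 1, column 0. CPython's public tokenizer does, and any fix here should preserve that:

```python
import io
import tokenize

first = next(tokenize.generate_tokens(io.StringIO("x = 1\n").readline))
print(first)
# Expected along the lines of:
# TokenInfo(type=1 (NAME), string='x', start=(1, 0), end=(1, 1), line='x = 1\n')
```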

🤖 Prompt for AI Agents
In stdlib/src/tokenize.rs around lines 223 to 238, the token selection logic
skips tokens that overlap the current range (notably tokens at position 0 when
prev_token is None) because it only accepts Ordering::Greater; change the
comparison to accept both Ordering::Equal and Ordering::Greater (or compare
token.range().start() against self.range().end()) so tokens that start at the
same position as the current range are returned; update the conditional to check
for Equal || Greater (or use token_start >= current_end) and ensure prev_token
is set before returning to preserve state for subsequent calls.

Comment on lines +281 to +386
const fn token_kind_value(kind: TokenKind) -> u8 {
    match kind {
        TokenKind::EndOfFile => 0,
        TokenKind::Name
        | TokenKind::For
        | TokenKind::In
        | TokenKind::Pass
        | TokenKind::Class
        | TokenKind::And
        | TokenKind::Is
        | TokenKind::Raise
        | TokenKind::True
        | TokenKind::False
        | TokenKind::Assert
        | TokenKind::Try
        | TokenKind::While
        | TokenKind::Yield
        | TokenKind::Lambda
        | TokenKind::None
        | TokenKind::Not
        | TokenKind::Or
        | TokenKind::Break
        | TokenKind::Continue
        | TokenKind::Global
        | TokenKind::Nonlocal
        | TokenKind::Return
        | TokenKind::Except
        | TokenKind::Import
        | TokenKind::Case
        | TokenKind::Match
        | TokenKind::Type
        | TokenKind::Await
        | TokenKind::With
        | TokenKind::Del
        | TokenKind::Finally
        | TokenKind::From
        | TokenKind::Def
        | TokenKind::If
        | TokenKind::Else
        | TokenKind::Elif
        | TokenKind::As
        | TokenKind::Async => 1,
        TokenKind::Int | TokenKind::Complex | TokenKind::Float => 2,
        TokenKind::String => 3,
        TokenKind::Newline | TokenKind::NonLogicalNewline => 4,
        TokenKind::Indent => 5,
        TokenKind::Dedent => 6,
        TokenKind::Lpar => 7,
        TokenKind::Rpar => 8,
        TokenKind::Lsqb => 9,
        TokenKind::Rsqb => 10,
        TokenKind::Colon => 11,
        TokenKind::Comma => 12,
        TokenKind::Semi => 13,
        TokenKind::Plus => 14,
        TokenKind::Minus => 15,
        TokenKind::Star => 16,
        TokenKind::Slash => 17,
        TokenKind::Vbar => 18,
        TokenKind::Amper => 19,
        TokenKind::Less => 20,
        TokenKind::Greater => 21,
        TokenKind::Equal => 22,
        TokenKind::Dot => 23,
        TokenKind::Percent => 24,
        TokenKind::Lbrace => 25,
        TokenKind::Rbrace => 26,
        TokenKind::EqEqual => 27,
        TokenKind::NotEqual => 28,
        TokenKind::LessEqual => 29,
        TokenKind::GreaterEqual => 30,
        TokenKind::Tilde => 31,
        TokenKind::CircumFlex => 32,
        TokenKind::LeftShift => 33,
        TokenKind::RightShift => 34,
        TokenKind::DoubleStar => 35,
        TokenKind::PlusEqual => 36,
        TokenKind::MinusEqual => 37,
        TokenKind::StarEqual => 38,
        TokenKind::SlashEqual => 39,
        TokenKind::PercentEqual => 40,
        TokenKind::AmperEqual => 41,
        TokenKind::VbarEqual => 42,
        TokenKind::CircumflexEqual => 43,
        TokenKind::LeftShiftEqual => 44,
        TokenKind::RightShiftEqual => 45,
        TokenKind::DoubleStarEqual => 46,
        TokenKind::DoubleSlash => 47,
        TokenKind::DoubleSlashEqual => 48,
        TokenKind::At => 49,
        TokenKind::AtEqual => 50,
        TokenKind::Rarrow => 51,
        TokenKind::Ellipsis => 52,
        TokenKind::ColonEqual => 53,
        TokenKind::Exclamation => 54,
        TokenKind::FStringStart => 59,
        TokenKind::FStringMiddle => 60,
        TokenKind::FStringEnd => 61,
        TokenKind::Comment => 62,
        TokenKind::TStringStart => 62,  // 3.14 compatible
        TokenKind::TStringMiddle => 63, // 3.14 compatible
        TokenKind::TStringEnd => 64,    // 3.14 compatible
        TokenKind::IpyEscapeCommand | TokenKind::Question => 0, // Ruff's specific
        TokenKind::Unknown => 0,
    }
}

⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

Verify token value mappings match CPython.

This mapping should match the token values from CPython's token module. The comment at line 22 indicates TOKEN_OP is 55, which is correct for Python 3.x.

Please verify that all other token values match CPython, especially:

  • Lines 380-382: TString tokens marked as "3.14 compatible"
  • Line 379: Comment token maps to 62 (same as TStringStart on line 380)
  • Lines 383-384: Ruff-specific tokens map to 0

🌐 Web query:

What are the token type values in CPython's token module for Python 3.14?

💡 Result:

I can show you the exact numeric values from your local CPython 3.14 installation — run this in a Python 3.14 interpreter:

python -c "import token, pprint; pprint.pprint({name: getattr(token, name) for name in sorted(token.tok_name.values())})"

Or, to get the reverse mapping (name → numeric value):

python -c "import token, pprint; pprint.pprint({k: v for v, k in token.tok_name.items()})"

The official Python 3.14 documentation for the token module (and the module source) is here for reference. [1]

Sources:
[1] Python 3.14 token module documentation — docs.python.org.


#!/usr/bin/env python3
import sys
try:
    import token
    print("Python version:", sys.version)
    print("\nToken name to value mapping:")
    # Get token name to value mapping
    tok_mapping = {}
    for value, name in token.tok_name.items():
        tok_mapping[name] = value
    
    # Sort and print
    for name in sorted(tok_mapping.keys()):
        print(f"  {name}: {tok_mapping[name]}")
except ImportError as e:
    print(f"Error: {e}")

🏁 Script executed:

#!/bin/bash
# Also check if we can inspect the tokenize.rs file for comparison
cd /tmp || exit 1
cat > check_cpython_tokens.py << 'EOF'
import token
import pprint

print("=== Token module values ===")
for tok_val in sorted(token.tok_name.keys()):
    print(f"{token.tok_name[tok_val]}: {tok_val}")
EOF
python3 check_cpython_tokens.py

Length of output: 844



🏁 Script executed:

#!/bin/bash
# Get the full TokenKind enum definition and context around the function
cd /tmp || exit 1
find . -name "tokenize.rs" -type f 2>/dev/null | head -5

Length of output: 0


🏁 Script executed:

#!/bin/bash
# Search for TokenKind enum definition
rg -n "enum TokenKind" --type=rs -A 100

Length of output: 26



🏁 Script executed:

#!/bin/bash
# Inspect the TokenKind enum and the token_kind_value function
rg -n "enum TokenKind" -A 150 | head -200

Length of output: 0


🏁 Script executed:

#!/bin/bash
# Look at the token_kind_value function and surrounding context
rg -n "const fn token_kind_value" -B 5 -A 120 | head -200

Length of output: 6992


🏁 Script executed:

#!/bin/bash
# Search for any comments about token mapping or Python compatibility
rg -n "CPython|Python.*token|token.*value" stdlib/src/tokenize.rs | head -20

Length of output: 333


Critical token mapping errors found - multiple duplicates and CPython misalignment.

Verification against CPython 3.14 token module revealed significant issues:

  1. Duplicate values (lines 379-380): Both Comment and TStringStart map to 62, which is invalid
  2. Incorrect Comment value (line 379): Maps to 62, but CPython has COMMENT=61
  3. FString tokens don't match CPython (lines 376-378): Map to 59, 60, 61 which correspond to SOFT_KEYWORD, ERRORTOKEN, and COMMENT in CPython—not FString values
  4. TString tokens misaligned (lines 380-382): The "3.14 compatible" comment is inaccurate; these values (62, 63, 64) don't correspond to template string tokens in CPython 3.14
  5. Exclamation token (line 375): Maps to 54, but CPython has OP=54; Exclamation is not a standard Python token type

The mapping for lines 375-382 needs correction to either match CPython standard tokens or use reserved/unassigned values to avoid collisions and type misalignment.

🤖 Prompt for AI Agents
stdlib/src/tokenize.rs around lines 281-386: the TokenKind->u8 mapping has
duplicate and misaligned values (e.g., Comment and TStringStart both 62,
FStringStart/Middle/End set to 59/60/61 which conflict with CPython 3.14
meanings, Exclamation mapped to 54 which collides with OP, and the "3.14
compatible" comments are incorrect). Fix by replacing those values with the
correct CPython 3.14 token numeric assignments for COMMENT,
ERRORTOKEN/SOFT_KEYWORD/SO_ON if you intend CPython parity; if you intentionally
diverge, pick unused/reserved numeric values that do not collide with existing
entries, remove or correct the misleading "3.14 compatible" comments, ensure
IpyEscapeCommand/Question and Unknown map to distinct safe values (or 0 only
when appropriate), and run tokenizer unit tests and a full token value
uniqueness check to confirm no duplicates remain.

coderabbitai bot (Contributor) left a comment

Actionable comments posted: 0

♻️ Duplicate comments (5)
stdlib/src/tokenize.rs (5)

121-143: EOF tuple must use the parser's actual EndOfFile token.

The code synthesizes the EOF tuple with hardcoded (-1) positions (lines 132, 135) and reuses state.current_line(), which produces incorrect coordinates and line content. Instead, retrieve the parser's EndOfFile token from state.tokens, set it as prev_token, and let the standard token-emission path (lines 145-175) produce the tuple with correct metadata.

Apply this approach:

                 let nline = zelf.readline(vm)?;
                 if nline.is_empty() {
                     state.eof = true;
-                    *zelf.state.write() = state.clone();
-
-                    let line_num = &state.start().0;
-                    let out = vm
-                        .ctx
-                        .new_tuple(vec![
-                            token_kind_value(TokenKind::EndOfFile).to_pyobject(vm),
-                            vm.ctx.new_str("").into(),
-                            vm.ctx
-                                .new_tuple(vec![line_num.to_pyobject(vm), (-1).to_pyobject(vm)])
-                                .into(),
-                            vm.ctx
-                                .new_tuple(vec![line_num.to_pyobject(vm), (-1).to_pyobject(vm)])
-                                .into(),
-                            vm.ctx.new_str(state.current_line()).into(),
-                        ])
-                        .into();
-                    return Ok(PyIterReturn::Return(out));
+                    if let Some(eof) = state
+                        .tokens
+                        .iter()
+                        .find(|token| token.kind() == TokenKind::EndOfFile)
+                    {
+                        state.prev_token = Some(*eof);
+                        break *eof;
+                    }
+                    *zelf.state.write() = state.clone();
+                    return Ok(PyIterReturn::StopIteration(None));
                 }

156-159: Remove trim() to preserve exact token lexemes.

Calling trim() when extra_tokens is false (line 158) strips significant whitespace, causing INDENT tokens to become empty strings. CPython's tokenizer preserves the exact lexeme. Use the slice as-is.

Apply this diff:

-            let mut token_repr = &state.source[state.range()];
-            if !zelf.extra_tokens {
-                token_repr = token_repr.trim();
-            }
+            let token_repr = &state.source[state.range()];

183-184: Make extra_tokens optional with default false.

CPython allows TokenizerIter(readline) without specifying extra_tokens. The current declaration requires the argument, breaking API compatibility.

Apply this diff:

-        #[pyarg(named)]
-        extra_tokens: bool,
+        #[pyarg(named, default = "false")]
+        extra_tokens: bool,

227-243: Token ordering skips tokens that overlap the current range.

When prev_token is None, self.range() returns TextRange::default() (0..0). The comparison range.ordering(self.range()) > Ordering::Greater (line 236) only matches tokens completely after the current range. Tokens starting at position 0 (e.g., 0..1) overlap with 0..0, returning Ordering::Equal, and are skipped. The first token in the source is never yielded.

Fix by comparing start positions:

-                if matches!(range.ordering(self.range()), Ordering::Greater) {
+                if range.start() > self.range().end() {
                     self.prev_token = Some(*token);
                     return self.prev_token;
                 }

286-391: Critical token mapping errors: duplicates and CPython conflicts.

Multiple critical issues remain:

  1. Duplicate value 62 (lines 384-385): Both Comment and TStringStart map to 62
  2. Incorrect Comment mapping (line 384): Maps to 62, but CPython 3.14 has COMMENT=61
  3. FString conflicts (lines 381-383): Values 59, 60, 61 correspond to SOFT_KEYWORD, ERRORTOKEN, and COMMENT in CPython 3.14
  4. Exclamation collision (line 380): Maps to 54, which is CPython's OP token
  5. Misleading comments (lines 385-387): "3.14 compatible" claims are inaccurate

These mappings must be corrected to either:

  • Match CPython 3.14 standard token values exactly, or
  • Use reserved/unassigned values (e.g., 100+) to avoid collisions

Run this verification to confirm no remaining duplicates after fixes:

#!/bin/bash
# Extract all token value assignments and check for duplicates
rg -n "TokenKind::\w+ => \d+" stdlib/src/tokenize.rs | \
  gawk '{print $NF}' | sort | uniq -d | \
  xargs -I {} echo "Duplicate token value found: {}"
🧹 Nitpick comments (2)
stdlib/src/tokenize.rs (2)

189-212: State structure is correct; consider performance implications.

The push_line method reparses the entire accumulated source on each call (line 208). While correct, this has O(n²) complexity for n lines. This is acceptable for now but may need optimization for large files.


113-118: TODO: Implement syntax error handling.

The TODO on line 114 indicates that parse errors from state.errors are not currently raised as SyntaxError. This means invalid Python syntax may be silently ignored. Consider implementing error checking in a follow-up to ensure proper error reporting.

Based on learnings.

📜 Review details

Configuration used: Path: .coderabbit.yml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 04371d3 and 30f54f6.

⛔ Files ignored due to path filters (1)
  • Lib/test/test_tokenize.py is excluded by !Lib/**
📒 Files selected for processing (1)
  • stdlib/src/tokenize.rs (1 hunks)
🧰 Additional context used
📓 Path-based instructions (2)
**/*.rs

📄 CodeRabbit inference engine (.github/copilot-instructions.md)

**/*.rs: Format Rust code with the default rustfmt style (run cargo fmt)
Run clippy and fix any warnings or lints introduced by your changes
Follow Rust best practices for error handling and memory management

Files:

  • stdlib/src/tokenize.rs
{vm,stdlib}/**/*.rs

📄 CodeRabbit inference engine (.github/copilot-instructions.md)

Use RustPython macros (pyclass, pymodule, pyfunction, etc.) when implementing Python functionality in Rust

Files:

  • stdlib/src/tokenize.rs
🧠 Learnings (2)
📓 Common learnings
Learnt from: CR
Repo: RustPython/RustPython PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-09-11T05:06:59.971Z
Learning: Applies to {vm,stdlib}/**/*.rs : Use RustPython macros (`pyclass`, `pymodule`, `pyfunction`, etc.) when implementing Python functionality in Rust
Learnt from: CR
Repo: RustPython/RustPython PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-09-11T05:06:59.971Z
Learning: Applies to Lib/**/*.py : Minimize modifications to CPython standard library files; only make minimal changes to work around RustPython limitations
📚 Learning: 2025-09-11T05:06:59.971Z
Learnt from: CR
Repo: RustPython/RustPython PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-09-11T05:06:59.971Z
Learning: Applies to {vm,stdlib}/**/*.rs : Use RustPython macros (`pyclass`, `pymodule`, `pyfunction`, etc.) when implementing Python functionality in Rust

Applied to files:

  • stdlib/src/tokenize.rs
🧬 Code graph analysis (1)
stdlib/src/tokenize.rs (2)
vm/src/stdlib/io.rs (1)
  • vm (3324-3326)
Lib/test/test_tokenize.py (4)
  • readline (1328-1334)
  • readline (1349-1355)
  • readline (1575-1580)
  • readline (2105-2106)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (10)
  • GitHub Check: Run snippets and cpython tests (ubuntu-latest)
  • GitHub Check: Run snippets and cpython tests on wasm-wasi
  • GitHub Check: Run snippets and cpython tests (macos-latest)
  • GitHub Check: Run snippets and cpython tests (windows-latest)
  • GitHub Check: Run rust tests (macos-latest)
  • GitHub Check: Run rust tests (windows-latest)
  • GitHub Check: Check the WASM package and demo
  • GitHub Check: Run rust tests (ubuntu-latest)
  • GitHub Check: Check Rust code with rustfmt and clippy
  • GitHub Check: Ensure compilation on various targets
🔇 Additional comments (9)
stdlib/src/tokenize.rs (9)

1-23: LGTM: Module structure and imports are well-organized.

The #[pymodule] macro usage follows RustPython patterns, and the TOKEN_OP constant correctly matches CPython's value.


25-33: Struct definition is sound; track the TODO for future work.

The use of PyRwLock for state management is appropriate. The TODO on line 29 regarding PyObject vs ArgCallable can be addressed in a follow-up.


35-64: LGTM: Error handling correctly distinguishes StopIteration from other exceptions.

The implementation now properly treats only StopIteration as EOF (lines 42-43) and propagates all other errors (line 45), fixing the critical issue from the previous review. The encoding logic for both bytes and strings is also correct.


66-77: LGTM: Standard Debug implementation and trait delegation.


79-98: LGTM: Constructor implementation is clean.


100-100: LGTM: Standard SelfIter implementation.


214-225: LGTM: Current line extraction logic is appropriate.


245-267: LGTM: Position helper methods correctly convert to Python conventions.

The start() and end() methods properly return 1-indexed line numbers and 0-indexed column offsets, matching Python's tokenize module.


270-284: LGTM: Default state initialization is clean.

ShaharNaveh marked this pull request as draft November 14, 2025 10:26
