
Conversation

@BoyuanFeng
Contributor

@BoyuanFeng BoyuanFeng commented Oct 27, 2025

Graph partition relies on `get_free_symbol_uses()` to collect symbol inputs:

```python
def get_scheduler_node_symbol_uses(
    node: BaseSchedulerNode,
) -> OrderedSet[sympy.Symbol]:
    """
    Gets symbols used in node.
    """
    if isinstance(node, FusedSchedulerNode):
        return OrderedSet().union(
            *(get_scheduler_node_symbol_uses(snode) for snode in node.snodes)
        )
    assert node.node is not None
    free_symbol_uses = node.node.get_free_symbol_uses()
    free_symbol_uses.update(
        *(get_layout_symints(ir_node) for ir_node in node.node.get_outputs())
    )
    return free_symbol_uses
```

I empirically observed that `get_free_symbol_uses()` becomes slower for larger graphs. Specifically, I tried an aten fallback for torchtitan, which results in 10k+ aten nodes. By the time the scheduler reaches the 600th node, a single call to `get_free_symbol_uses()` takes seconds.

Why? Because `get_free_symbol_uses()` may recursively call other `get_free_symbol_uses()` implementations, so the same work is repeated many times.

pytorch/torch/_inductor/ir.py, lines 4541 to 4543 at ee7434b:

```python
result = self.layout.get_free_symbol_uses(
    unbacked_only
) | self.data.get_free_symbol_uses(unbacked_only)
```
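
To see why this can blow up, here is a toy sketch (hypothetical classes, not Inductor's actual IR): if every node's `get_free_symbol_uses()` recursively unions its input's result, then asking each of n chained nodes for its symbols re-walks all of its ancestors, roughly O(n^2) work in total.

```python
# Toy sketch (hypothetical ToyNode class, not Inductor's real IR) showing why
# uncached recursive get_free_symbol_uses() grows quadratically on a chain.
import sympy


class ToyNode:
    def __init__(self, sym: sympy.Symbol, inp: "ToyNode | None" = None):
        self.sym = sym
        self.inp = inp

    def get_free_symbol_uses(self) -> set[sympy.Symbol]:
        # Each call re-walks the entire chain of inputs.
        result = {self.sym}
        if self.inp is not None:
            result |= self.inp.get_free_symbol_uses()
        return result


# Build a chain of n nodes; asking every node for its symbols visits
# 1 + 2 + ... + n nodes, i.e. O(n^2) total work without any caching.
n = 600
nodes: list[ToyNode] = []
prev = None
for i in range(n):
    prev = ToyNode(sympy.Symbol(f"s{i}"), prev)
    nodes.append(prev)

total_visits = sum(len(node.get_free_symbol_uses()) for node in nodes)
print(total_visits)  # n*(n+1)/2 = 180300 for n=600
```

This toy is much simpler than real Inductor IR (which also recurses through layouts, views, and reads), but it is in the same spirit as the slowdown described above: the per-node cost keeps growing as the graph gets larger.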

This PR fixes the issue by caching the results of `get_free_symbol_uses()`. I validated on torchtitan that the issue is fixed.
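
A minimal sketch of the kind of caching described here, assuming a hypothetical per-instance decorator keyed on the `unbacked_only` flag (the actual PR caches inside Inductor's IR classes; the names below are illustrative only):

```python
# Hedged sketch: per-instance memoization for a get_free_symbol_uses-style
# method. Names and wiring are illustrative, not the exact helpers in the PR.
import functools
from typing import Any, Callable, TypeVar

T = TypeVar("T")


def cache_free_symbol_uses(fn: Callable[..., T]) -> Callable[..., T]:
    """Memoize fn(self, unbacked_only) per instance and per flag value."""
    cache_attr = f"_cache_{fn.__name__}"

    @functools.wraps(fn)
    def wrapper(self: Any, unbacked_only: bool = False) -> T:
        # Store results on the instance so the cache dies with the IR node.
        cache = getattr(self, cache_attr, None)
        if cache is None:
            cache = {}
            setattr(self, cache_attr, cache)
        if unbacked_only not in cache:
            cache[unbacked_only] = fn(self, unbacked_only)
        return cache[unbacked_only]

    return wrapper
```

With something like this applied to each `get_free_symbol_uses` implementation, the recursive union in ir.py is computed once per (node, unbacked_only) pair instead of once per caller. Caching is only sound if a node's free symbols cannot change after the first call, which is exactly what the review discussion below is about.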

cc @voznesenskym @penguinwu @EikanWang @jgong5 @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @ipiszy @chenyang78 @kadeng @muchulee8 @amjames @chauhang @aakhundov @coconutruben

@BoyuanFeng added the ciflow/trunk, topic: not user facing, and module: inductor labels on Oct 27, 2025
@pytorch-bot

pytorch-bot bot commented Oct 27, 2025

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/166338

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit 0d670c8 with merge base 365ed62:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@eellison eellison requested a review from laithsakka October 27, 2025 21:50
@laithsakka
Contributor

laithsakka commented Oct 28, 2025

Seems reasonable. Are inductor nodes immutable? @eellison
If not, I wonder if we can do this optimization in a safer way: only cache within a context where I know the nodes are not changing. That would depend on the torchtitan case.

```python
def wrapper(self: Any, *args: P.args, **kwargs: P.kwargs) -> RV:
    key = (id(self), args, tuple(sorted(kwargs.items())))
    if key not in cache:
        cache[key] = fn(self, *args, **kwargs)
```
Contributor

Looking at how cache_on_self was implemented, I think we should do something similar here to further improve the performance.

Contributor

@eellison eellison left a comment

Right, a flexible layout might change strides, although it's unlikely that would induce a new symbol use. Would it be safer to only cache if the layout is fixed?


```python
@offset.setter
def offset(self, value: Expr) -> None:
    self.assert_free_symbol_uses_unchanged("offset", value)
```
Contributor Author

This errors if free symbols are added or deleted after initialization.
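
A minimal sketch of what such a guard could look like, written here as a standalone helper for illustration (the real `assert_free_symbol_uses_unchanged` in the PR is a method and may differ in shape):

```python
# Hedged sketch of a guard in the spirit of assert_free_symbol_uses_unchanged:
# refuse a mutation if it would add or delete free symbols after the cached
# get_free_symbol_uses() result was computed. Signature is illustrative only.
from sympy import Expr


def assert_free_symbol_uses_unchanged(
    field: str, old_value: Expr, new_value: Expr
) -> None:
    # sympy expressions expose .free_symbols; plain ints have none.
    old_syms = getattr(old_value, "free_symbols", set())
    new_syms = getattr(new_value, "free_symbols", set())
    if old_syms != new_syms:
        raise AssertionError(
            f"free symbols of {field} changed after initialization "
            f"({old_syms} -> {new_syms}); this would invalidate the cached "
            "get_free_symbol_uses() result"
        )
```

A setter like the `offset` one quoted above would then compare the current value against the incoming one before assigning it, so a stale cache can never go unnoticed.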

Contributor

@eellison eellison left a comment

Sorry, one last question: at the point we call it, all these nodes should have a fixed layout. Should we just cache only in the fixed-layout case? I think that would be a bit simpler.

@desertfire
Contributor

> Sorry, one last question: at the point we call it, all these nodes should have a fixed layout. Should we just cache only in the fixed-layout case? I think that would be a bit simpler.

That is simpler, but it probably has performance implications. @BoyuanFeng, I wonder how much of a performance difference it would make.
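
For reference, the more conservative variant being discussed would look roughly like this sketch. `FixedLayout` refers to the class in torch/_inductor/ir.py; everything else (function name, cache attribute) is my assumption, not code from the PR.

```python
# Hedged sketch of the "only cache when the layout is fixed" alternative.
# FixedLayout is a real torch/_inductor/ir.py class; the wiring around it
# is illustrative only.
from torch._inductor import ir


def get_free_symbol_uses_maybe_cached(node, unbacked_only: bool = False):
    # Flexible layouts may still change strides, so skip the cache there.
    if not isinstance(node.layout, ir.FixedLayout):
        return node.get_free_symbol_uses(unbacked_only)

    cache = getattr(node, "_free_symbol_uses_cache", None)
    if cache is None:
        cache = {}
        node._free_symbol_uses_cache = cache
    if unbacked_only not in cache:
        cache[unbacked_only] = node.get_free_symbol_uses(unbacked_only)
    return cache[unbacked_only]
```

The trade-off @desertfire raises is that any node whose layout is still flexible at this point gets no caching at all, so every query on it re-runs the full recursion; how much that costs depends on how many such nodes survive to the partitioning step.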

@atalman
Contributor

atalman commented Oct 31, 2025

@pytorchmergebot revert -c nosignal -m "Failure: test/nn/test_convolution.py::TestConvolutionNN::test_conv3d_overflow_values GH job link HUD commit link"

@pytorchmergebot
Collaborator

@pytorchbot successfully started a revert job. Check the current status here.
Questions? Feedback? Please reach out to the PyTorch DevX Team

@pytorchmergebot
Collaborator

@BoyuanFeng your PR has been successfully reverted.

pytorchmergebot added a commit that referenced this pull request Oct 31, 2025
This reverts commit a6b1ef1.

Reverted #166338 on behalf of https://github.com/atalman due to Failure: test/nn/test_convolution.py::TestConvolutionNN::test_conv3d_overflow_values [GH job link](https://github.com/pytorch/pytorch/actions/runs/18961173726/job/54149112920) [HUD commit link](https://hud.pytorch.org/pytorch/pytorch/commit/a6b1ef17173f56ba93ac97ff4384fa4060b5e41e) ([comment](#166338 (comment)))
@pytorchmergebot added the Reverted and ci-no-td (Do not run TD on this PR) labels on Oct 31, 2025
@BoyuanFeng
Contributor Author

@atalman the failure is not related to this PR. I also cannot repro it locally. Let me rebase and retry CI.


@BoyuanFeng
Contributor Author

@pytorchbot merge

@pytorchmergebot
Collaborator

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging: check the merge workflow status here.

BoyuanFeng added a commit that referenced this pull request Oct 31, 2025
BoyuanFeng pushed a commit that referenced this pull request Oct 31, 2025
BoyuanFeng added a commit that referenced this pull request Oct 31, 2025
etaf pushed a commit to etaf/pytorch-inductor-xpu that referenced this pull request Nov 4, 2025
etaf pushed a commit to etaf/pytorch-inductor-xpu that referenced this pull request Nov 4, 2025
etaf pushed a commit to etaf/pytorch-inductor-xpu that referenced this pull request Nov 4, 2025
@BoyuanFeng
Contributor Author

@pytorchbot cherry-pick --onto release/2.9 --fixes "Inductor partition compilation infinite hang issue introduced in 2.9.0 breaking torchtitan" -c fixnewfeature

@pytorchbot
Collaborator

Cherry picking #166338

Command git -C /home/runner/work/pytorch/pytorch cherry-pick -x dfebdcab86acbaa0eaa996b47595e5f27a66492e returned non-zero exit code 1

Auto-merging test/inductor/test_torchinductor.py
Auto-merging torch/_inductor/ir.py
CONFLICT (content): Merge conflict in torch/_inductor/ir.py
Auto-merging torch/_inductor/utils.py
CONFLICT (content): Merge conflict in torch/_inductor/utils.py
error: could not apply dfebdcab86a... [GraphPartition] cache get_free_symbol_uses (#166338)
hint: After resolving the conflicts, mark them with
hint: "git add/rm <pathspec>", then run
hint: "git cherry-pick --continue".
hint: You can instead skip this commit with "git cherry-pick --skip".
hint: To abort and get back to the state before "git cherry-pick",
hint: run "git cherry-pick --abort".
hint: Disable this message with "git config set advice.mergeConflict false"
Details for Dev Infra team: raised by workflow job.

Lucaskabela pushed a commit that referenced this pull request Nov 4, 2025
Lucaskabela pushed a commit that referenced this pull request Nov 4, 2025
Lucaskabela pushed a commit that referenced this pull request Nov 4, 2025
atalman pushed a commit that referenced this pull request Nov 6, 2025
@github-actions github-actions bot deleted the bf/partition-cache-free-symbols branch December 5, 2025 02:18