Conversation

@QuLogic
Member

@QuLogic QuLogic commented Sep 27, 2025

PR summary

The original code split fonts into subsets based on the character code modulo the subset size (determined by font type limits). This had some limitations:

  1. If characters were used from across various font blocks (especially for type 3, whose blocks were only 256 characters), then there would be many, possibly sparsely populated, font subsets.
  2. Sometimes a single character code may map to multiple glyphs. This is the case if you mix languages (e.g., Add language parameter to Text objects #29794), but the naive tracking would only produce one glyph.
  3. Sometimes multiple characters can map to a single glyph. This is the case for ligatures, and also with complex text shaping (such as Arabic), and the old code would simply fail when calling `ord` on a multi-character string.
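
To make limitations 1 and 3 concrete, here is a minimal sketch of a modulo-based split. It is not the removed Matplotlib code; `naive_subsets` is a hypothetical helper, and the block size of 256 is taken from the Type 3 case mentioned above.

```python
def naive_subsets(text, subset_size):
    """Hypothetical model of a modulo-based split: one subset per code block."""
    subsets = {}
    for char in text:
        # The quotient selects the subset, the remainder is the code within it.
        subset_index, subset_code = divmod(ord(char), subset_size)
        subsets.setdefault(subset_index, set()).add(subset_code)
    return subsets


# Three characters from distant Unicode blocks, with a Type 3-sized block.
sparse = naive_subsets("A\u03b1\u4e2d", 256)
print(sorted(sparse))  # each character lands in its own subset

# A ligature is tracked as a multi-character string, which ord() rejects.
try:
    ord("ffi")
    ord_failed = False
except TypeError:
    ord_failed = True
print(ord_failed)
```

Each of the three characters ends up alone in a separate subset, which is the sparse-subset problem from point 1.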

To fix this, CharacterTracker now tracks characters and glyphs more closely. Specifically,

  1. for each font, a (character code(s), glyph index)-pair is mapped to a (subset index, subset character code)-pair. This ensures that point 2 above is handled.
  2. If the above mapping doesn't exist yet, then a subset index/character code is calculated:
    1. if the (singular) character code is in the first block (up to 255 for type 3, or 64k for type 42), then keep the character code the same and put it in subset 0; this preserves the text in those lower ranges if you happen to be looking at a PDF directly
    2. if the (singular) character code is already taken in subset 0, then bump it to the next available spot; a conflict here means the character is being used with multiple glyphs (another case of point 2 above)
    3. if the character code is in fact multiple character codes, then also bump it to the next available spot, as it can never be in subset 0 (this is point 3 above)
    4. the next available spot is the next character code, moving to the next subset block if necessary; by filling as needed, this takes care of point 1 above
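
The assignment rules above can be sketched in a few lines of Python. This is an illustrative model only: `SketchTracker`, its attribute names, and the choice to start overflow codes at 0 in subset 1 are assumptions, not the actual `CharacterTracker` implementation.

```python
class SketchTracker:
    """Illustrative model of the rules above; not Matplotlib's API."""

    def __init__(self, subset_size):
        self.subset_size = subset_size
        # (characters, glyph index) -> (subset index, subset character code)
        self.glyph_map = {}
        # Character codes already claimed in subset 0.
        self.subset0_codes = set()
        # Count of entries placed in overflow subsets (subset 1 onwards).
        self.overflow_count = 0

    def track(self, chars, glyph_index):
        key = (chars, glyph_index)
        if key in self.glyph_map:
            return self.glyph_map[key]
        if len(chars) == 1:
            code = ord(chars)
            # Rule 1: keep the code if it fits in the first block and is free.
            if code < self.subset_size and code not in self.subset0_codes:
                self.subset0_codes.add(code)
                self.glyph_map[key] = (0, code)
                return self.glyph_map[key]
        # Rules 2-4: take the next free spot, opening a new subset when full.
        block, subset_code = divmod(self.overflow_count, self.subset_size)
        self.overflow_count += 1
        self.glyph_map[key] = (block + 1, subset_code)
        return self.glyph_map[key]


tracker = SketchTracker(subset_size=256)
first = tracker.track("A", 36)         # plain ASCII keeps its code
repeat = tracker.track("A", 36)        # same pair, same answer
variant = tracker.track("A", 99)       # same character, different glyph
ligature = tracker.track("ffi", 200)   # several characters, one glyph
distant = tracker.track("\u4e2d", 300) # code outside the first block
print(first, repeat, variant, ligature, distant)
```

In this model the three overflow entries share subset 1 instead of each opening a subset of its own.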

With these changes, the complex/font features/languages tests in #30607 produce correct results.

PR checklist

@QuLogic QuLogic added this to the v3.11.0 milestone Sep 27, 2025
@github-project-automation github-project-automation bot moved this to Waiting for other PR in Font and text overhaul Sep 27, 2025
@QuLogic QuLogic moved this from Waiting for other PR to Ready for Review in Font and text overhaul Sep 27, 2025
@QuLogic QuLogic changed the title from "Deduplicate CharacterTracker.track implementation" to "Prepare CharacterTracker for advanced font features" Sep 27, 2025
@QuLogic
Member Author

QuLogic commented Sep 27, 2025

Note: the original commit was small, and the remaining ones ended up small enough, so I just put them all in this PR.

print("%%BeginProlog", file=fh)
if not mpl.rcParams['ps.useafm']:
-    Ndict += len(ps_renderer._character_tracker.used)
+    Ndict += sum(map(len, ps_renderer._character_tracker.used.values()), 0)
Contributor

The 0 at the end is unneeded.
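
For context, the built-in `sum` already starts from 0 when no start value is given, so both spellings give the same total. A small check, using a made-up dictionary as a stand-in for the tracker's `used` mapping (its real value type is not shown here):

```python
# Hypothetical stand-in: font name -> collection of tracked entries.
used = {"DejaVuSans": {65, 66, 67}, "cmr10": {120}}

with_start = sum(map(len, used.values()), 0)
without_start = sum(map(len, used.values()))
empty_total = sum(map(len, {}.values()))  # still 0 with nothing tracked

print(with_start, without_start, empty_total)
```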

No need to repeat the calculation of subset blocks, but instead offload
it to `track_glyph`.

Instead of splitting fonts into `subset_size` blocks and writing text as
character code modulo `subset_size`, compress the blocks by doing two
things:

1. Preserve the character code if it lies in the first block. This keeps
   ASCII (for Type 3) and the Basic Multilingual Plane (for Type 42) as
   their normal codes.
2. Push everything else into the next spot in the next block, splitting
   by `subset_size` as necessary.

This should reduce the number of additional font subsets to embed.

If mixing languages, sometimes a single character may use different
glyphs in one document. In that case, we need to give it a new character
code in the next subset, since subset 0 is preserving character codes.
@QuLogic
Member Author

QuLogic commented Sep 30, 2025

OK, I've handled all your comments, I think. I also fixed subsetting in the PostScript backend, noted above.

There are 3 test image changes:

  • 2 PDF tests think that the fourth emoji moved slightly; I think this is because the characters on the line are now from the same font instead of split between emoji/not. But this seems very small, as viewing it at 100% doesn't really seem to show any difference.
  • 1 EPS test change; this is because subsetting is now "real", so the switch from type 3 to type 42 no longer happens. Ghostscript seems to rasterize those a little differently, but on the upside, the Computer Modern and DejaVu fonts now look the same.

Contributor

@anntzer anntzer left a comment

Just a minor point regarding a comment.

For ligatures or complex shapings, multiple characters may map to a
single glyph. In this case, we still want to output a single character
code for the string using the font subset, but the `ToUnicode` map
should give back all the characters.

Previously, this was supposed to "upgrade" type 3 to type 42 if the
number of glyphs overflowed. However, as `CharacterTracker` can suggest
a new subset for other reasons (i.e., multiple glyphs for the same
character or a glyph for multiple characters may go to a second subset),
we do need proper subset handling here as well.

Since that is now done, we can drop the "promotion" from type 3 to type
42, as we don't get too many glyphs in each embedded font.
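
On the ligature commit note above: in a PDF `ToUnicode` CMap, a `bfchar` entry pairs a source character code with a UTF-16BE hex string, so a single subset code can map back to several characters. A minimal sketch of formatting such an entry follows; `bfchar_entry` and the default one-byte code width are assumptions for illustration, not the backend's actual code.

```python
def bfchar_entry(subset_code, chars, code_bytes=1):
    """Format one hypothetical bfchar line: <source code> <UTF-16BE text>."""
    source = f"{subset_code:0{2 * code_bytes}X}"
    target = chars.encode("utf-16-be").hex().upper()
    return f"<{source}> <{target}>"


# One subset code standing for the three characters of an "ffi" ligature.
ligature_line = bfchar_entry(1, "ffi")
# A plain character, using a two-byte code as a wider font type might.
plain_line = bfchar_entry(0x41, "A", code_bytes=2)
print(ligature_line)
print(plain_line)
```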
@QuLogic
Member Author

QuLogic commented Oct 2, 2025

Removed the image changes (and moved them to the text-overhaul-figures branch) in preparation for merging.

@QuLogic
Member Author

QuLogic commented Oct 2, 2025

Linting issues are known (#30626), so merging over those.

@QuLogic QuLogic merged commit ed4ca6c into matplotlib:text-overhaul Oct 2, 2025
35 of 36 checks passed
@github-project-automation github-project-automation bot moved this from Ready for Review to Done in Font and text overhaul Oct 2, 2025
@QuLogic QuLogic deleted the simpler-track branch October 2, 2025 23:00