Prepare `CharacterTracker` for advanced font features #30608

QuLogic · 2025-09-27T02:38:13Z

PR summary

The original code split fonts into subsets based on the character modulo the subset size (determined by font type limits). This had some limitations:

1. If characters were used from across various font blocks (especially for type 3, whose blocks were only 256 characters), then there would be many, possibly sparsely-populated, font subsets.
2. Sometimes a single character code may map to multiple glyphs. This is the case if you mix languages (see Add language parameter to Text objects #29794), but the naive tracking would only produce one glyph.
3. Sometimes multiple characters can map to a single glyph. This is the case for ligatures, and also with complex text shaping (such as Arabic), and this would just fail when calling `ord` on a multi-character string.
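
As a rough illustration of the first and third limitations, here is a minimal sketch of modulo-based splitting. It is not Matplotlib's code: `old_subset` and `TYPE3_SUBSET_SIZE` are hypothetical names, and reading "modulo the subset size" as a `divmod` of the code point is an assumption.

```python
# Hypothetical sketch of modulo-based subset splitting (not Matplotlib's code).
def old_subset(char: str, subset_size: int) -> tuple[int, int]:
    """Return (subset index, code within the subset) for one character."""
    code = ord(char)  # raises TypeError for multi-character strings
    return divmod(code, subset_size)


TYPE3_SUBSET_SIZE = 256  # block size quoted above for type 3 fonts

# Three characters from distant Unicode blocks need three separate subsets.
placements = {c: old_subset(c, TYPE3_SUBSET_SIZE) for c in "Aα€"}
subsets_needed = {subset for subset, _ in placements.values()}
print(placements)
print(len(subsets_needed))

# A ligature or shaped cluster arrives as several characters at once.
try:
    old_subset("ffi", TYPE3_SUBSET_SIZE)
except TypeError as exc:
    print("multi-character string rejected:", exc)
```

Each of the three characters ends up alone in its own subset, which is the sparse population described in the first point.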

To fix this, CharacterTracker now tracks characters and glyphs more closely. Specifically,

for each font, a (character code(s), glyph index) pair is mapped to a (subset index, subset character code) pair. This ensures that point 2 above is handled.
If the above mapping doesn't exist yet, then a subset index/character code is calculated:
1. if the (singular) character code is in the first block (255 for type 3, or 64k for type 42), then keep the character code the same and put it in subset 0; this preserves the text in those lower ranges if you happen to be looking at a PDF directly
2. if the (singular) character code is already taken in subset 0, then bump it to the next available spot; a conflict here means the character is being used with multiple glyphs (i.e., another case of point 2 above)
3. if the character code is in fact multiple character codes, then also bump to the next available spot, as it could never be in subset 0 (this is point 3 above)
4. the next available spot is the next free character code, moving on to the next subset block if necessary; by filling blocks only as needed, this takes care of point 1 above
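
The steps above can be sketched roughly as follows. This is a toy model for a single font, not the actual `CharacterTracker`: the class name `SketchTracker`, its attributes, and the choice to start overflow slots at code 0 of subset 1 are all assumptions made for illustration, and any reserved codes a real font subset may need are ignored.

```python
from __future__ import annotations


class SketchTracker:
    """Toy tracker for one font; NOT Matplotlib's CharacterTracker."""

    def __init__(self, subset_size: int) -> None:
        self.subset_size = subset_size
        # (characters, glyph index) -> (subset index, subset character code)
        self.mapping: dict[tuple[str, int], tuple[int, int]] = {}
        # Character codes already occupied in subset 0.
        self.used_in_subset0: set[int] = set()
        # Running counter over the slots of the overflow subsets 1, 2, ...
        self.next_slot = 0

    def track(self, chars: str, glyph: int) -> tuple[int, int]:
        key = (chars, glyph)
        if key in self.mapping:  # this pair was seen before
            return self.mapping[key]
        fits_subset0 = (
            len(chars) == 1  # step 3: multi-character strings never fit
            and ord(chars) < self.subset_size  # step 1: inside the first block
            and ord(chars) not in self.used_in_subset0  # step 2: no conflict
        )
        if fits_subset0:
            code = ord(chars)
            self.used_in_subset0.add(code)
            result = (0, code)
        else:
            # Step 4: take the next free slot, opening a new block when full.
            block, code = divmod(self.next_slot, self.subset_size)
            result = (block + 1, code)
            self.next_slot += 1
        self.mapping[key] = result
        return result


tracker = SketchTracker(subset_size=256)
print(tracker.track("A", 36))     # kept at its own code in subset 0
print(tracker.track("A", 99))     # same character, different glyph: bumped
print(tracker.track("ffi", 500))  # several characters, one glyph: bumped
print(tracker.track("€", 200))    # outside the first block: bumped
```

In this sketch the three bumped entries share subset 1 instead of each landing in a separate block, which is the compaction the description is after.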

With these changes, the complex/font features/languages tests in #30607 produce correct results.

PR checklist

[n/a] "closes #0000" is in the body of the PR description to link the related issue
new and changed code is tested
[n/a] Plotting related features are demonstrated in an example
[n/a] New Features and API Changes are noted with a directive and release note
[n/a] Documentation complies with general and docstring guidelines

QuLogic · 2025-09-27T09:56:28Z

Note: the original commit was small, and the remaining ones ended up small enough, that I just put them all in this PR.

lib/matplotlib/backends/_backend_pdf_ps.py

lib/matplotlib/backends/backend_pdf.py

anntzer · 2025-09-29T12:07:41Z

lib/matplotlib/backends/backend_ps.py

            print("%%BeginProlog", file=fh)
            if not mpl.rcParams['ps.useafm']:
-                Ndict += len(ps_renderer._character_tracker.used)
+                Ndict += sum(map(len, ps_renderer._character_tracker.used.values()), 0)


The 0 at the end is unneeded.
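
A quick check of this point: the built-in `sum` already starts from 0 by default, so passing it explicitly changes nothing. The `used` dictionary below is made-up sample data standing in for the tracker's per-font mapping, not its real contents.

```python
# Made-up stand-in for a per-font mapping; only the sizes matter here.
used = {
    "DejaVuSans.ttf": {0: [65, 66, 67]},
    "NotoSansArabic.ttf": {0: [1575], 1: [0, 1]},
}

with_start = sum(map(len, used.values()), 0)
without_start = sum(map(len, used.values()))
print(with_start == without_start)

# The default start value also covers the empty case.
print(sum(map(len, {}.values())))
```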

lib/matplotlib/backends/backend_ps.py

No need to repeat the calculation of subset blocks, but instead offload it to `track_glyph`.

Instead of splitting fonts into `subset_size` blocks and writing text as character code modulo `subset_size`, compress the blocks by doing two things:

1. Preserve the character code if it lies in the first block. This keeps ASCII (for Type 3) and the Basic Multilingual Plane (for Type 42) as their normal codes.
2. Push everything else into the next spot in the next block, splitting by `subset_size` as necessary.

This should reduce the number of additional font subsets to embed.

If mixing languages, sometimes a single character may use different glyphs in one document. In that case, we need to give it a new character code in the next subset, since subset 0 is preserving character codes.

QuLogic · 2025-09-30T05:27:22Z

OK, I've handled all your comments, I think. I also fixed subsetting in the PostScript backend, noted above.

There are 3 test image changes:

2 PDF tests think that the fourth emoji moved slightly; I think this is because the characters on the line are now from the same font instead of split between emoji/not. But this seems very small, as viewing it at 100% doesn't really seem to show any difference.
1 EPS test change; this is because subsetting is now "real", so the switch from type 3 to type 42 no longer happens. Ghostscript seems to convert those to raster a little differently, but on the upside, now the Computer Modern and DejaVu fonts look the same.

lib/matplotlib/backends/backend_pdf.py

anntzer

Just a minor point regarding a comment.

For ligatures or complex shapings, multiple characters may map to a single glyph. In this case, we still want to output a single character code for the string using the font subset, but the `ToUnicode` map should give back all the characters.
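
To make the `ToUnicode` side concrete, here is a small sketch of how one subset character code can map back to several characters. It only formats a `bfchar`-style line (source code in hex, then the text as UTF-16BE hex), which is the entry format PDF `ToUnicode` CMaps use; the helper name `bfchar_entry` and its parameters are hypothetical, and how the backend actually assembles its CMap is not shown here.

```python
def bfchar_entry(code: int, text: str, code_bytes: int = 1) -> str:
    """Format one ToUnicode bfchar-style line for a subset character code.

    `code_bytes` is the width of the source code in bytes (an assumption:
    1 for a 256-entry subset, 2 for a 64k-entry subset).
    """
    src = f"{code:0{2 * code_bytes}X}"
    dst = text.encode("utf-16-be").hex().upper()
    return f"<{src}> <{dst}>"


# One code for the "fi" ligature glyph still yields both characters on copy.
print(bfchar_entry(1, "fi"))
print(bfchar_entry(1, "ffi", code_bytes=2))
```

With an entry like this, a PDF viewer that extracts text for the single code can recover the full original string.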

Previously, this was supposed to "upgrade" type 3 to type 42 if the number of glyphs overflowed. However, as `CharacterTracker` can suggest a new subset for other reasons (i.e., multiple glyphs for the same character or a glyph for multiple characters may go to a second subset), we do need proper subset handling here as well. Since that is now done, we can drop the "promotion" from type 3 to type 42, as we don't get too many glyphs in each embedded font.

QuLogic · 2025-10-02T19:46:01Z

Removed the image changes (and moved them to the text-overhaul-figures branch) in preparation for merging.

QuLogic · 2025-10-02T23:00:05Z

Linting issues are known (#30626), so merging over those.

QuLogic added this to the v3.11.0 milestone Sep 27, 2025

QuLogic added this to Font and text overhaul Sep 27, 2025

github-project-automation bot moved this to Waiting for other PR in Font and text overhaul Sep 27, 2025

QuLogic moved this from Waiting for other PR to Ready for Review in Font and text overhaul Sep 27, 2025

github-actions bot added backend: ps backend: pdf labels Sep 27, 2025

QuLogic force-pushed the simpler-track branch from 58874b6 to 5afd71b September 27, 2025 03:21

QuLogic changed the title ~~Deduplicate CharacterTracker.track implementation~~ Prepare CharacterTracker for advanced font features Sep 27, 2025

QuLogic mentioned this pull request Sep 27, 2025

Implement libraqm for vector outputs #30607

Merged
