Can a general LLM really just read handwriting?

I first worked with Transkribus nearly ten years ago. At the time it seemed almost impossible: a computer that could learn to read historical handwriting. It also demanded a fair amount from the user. The software was still complex, and a new hand usually meant preparing training material before the model became useful. But it did feel a little bit like the future had arrived.

A few weeks ago I had that feeling again. I was at the municipal archives of Westerwolde, doing research for an article on the electrification of the region in the early twentieth century, due to appear soon in Terra Westerwolda. There was about forty centimetres of archival material to go through, and I only had the morning. Under normal circumstances I would have made quick notes, photographed selectively, and accepted that I needed to come back.

This time I scanned every page with the new scan function in Google Drive. It made tidy multipage PDFs of each record. I uploaded the files to NotebookLM, had a cup of coffee, and then got into the car to drive home while listening to an automatically generated podcast based on the documents I had just digitised.

The strange part was how useful it became before I had done any serious reading myself. NotebookLM misread things, of course; I would never treat that output as a finished transcription. Even so, it found several points in the records that later made their way into the article. The feeling was familiar from those early Transkribus experiments: something that had seemed just out of reach was suddenly sitting inside an ordinary research day.

NotebookLM is a Google product, and much of what happens behind the scenes depends on Google’s Gemini models. The experience left me with a practical question: can these general-purpose LLMs now read old handwriting cold?

Spend enough time around registers, charters, account books, minutes, indexes and loose administrative paper, and handwriting stops feeling like a category. It becomes a mass of unsolved work. At the Groninger Archieven, as in every archive I know, there are shelves and shelves of pages that are legible to trained people and still mostly opaque to computers.

Classic handwritten text recognition has real strengths, and tools such as Transkribus, eScriptorium and Loghi are now part of the normal working vocabulary in digital heritage. For obscure hands, mixed scripts, small collections, damaged documents and especially tables, they can still require training data, layout work, correction loops and patience. That is manageable for a project with time and funding. It is a different proposition when the question is whether a box of records is worth a closer look.

I turned that question into a small benchmark: fifteen pages, ten models, three runs, and a set of failures that were still worth keeping because of what they can tell us. Can a general vision-language model just read these pages?

What I tested

The benchmark uses fifteen hand-curated pages from the Groninger Archieven, ranging from the early 15th to the early 20th century. They include Latin and Dutch (in many variants); charters, registers, resolutions, a book list, a city account and indexes; prose pages, tabular pages, and one damaged page that turned out to set a hard limit on what machine reading can do.

The ground truth is diplomatic. In plain language: transcribe what is literally there. Do not expand a month unless the page expands it. A written Id. stays Id., and a ditto stroke becomes a quote mark, because that is what is on the page. The one exception is medieval Latin charter material, where expanding abbreviations is the scholarly norm. A strictly literal rendering there would require abbreviation glyphs that the scorer would then have to normalise anyway, so for those pages I followed charter practice and kept the diplomatic rule everywhere else.

I scored the outputs with weighted checkpoints: names, dates, places, sums and other pieces of information that matter if you are actually trying to use the source. Character error rate, or CER, is still useful, especially when a model starts producing absurd amounts of text. But for tables, line breaks and column order can make CER noisy. The checkpoints carry the signal.
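To make the two measures concrete, here is a minimal sketch of how a weighted checkpoint score and a character error rate could be computed. It is not the benchmark's actual scorer: the `Checkpoint` class, the exact-substring matching and the weights are assumptions for illustration, and a real scorer would probably normalise spelling and whitespace first.

```python
from dataclasses import dataclass


@dataclass
class Checkpoint:
    text: str      # string that must appear in the transcription (assumed exact match)
    weight: float  # hypothetical weight: how much this item matters to a researcher


def checkpoint_score(transcription: str, checkpoints: list[Checkpoint]) -> float:
    """Return the weighted share of checkpoints found, as a percentage."""
    total = sum(c.weight for c in checkpoints)
    if total == 0:
        return 0.0
    found = sum(c.weight for c in checkpoints if c.text in transcription)
    return 100.0 * found / total


def cer(reference: str, hypothesis: str) -> float:
    """Character error rate: Levenshtein distance divided by reference length."""
    if not reference:
        raise ValueError("reference must not be empty")
    # previous[j] holds the edit distance between reference[:i-1] and hypothesis[:j].
    previous = list(range(len(hypothesis) + 1))
    for i, ref_char in enumerate(reference, start=1):
        current = [i]
        for j, hyp_char in enumerate(hypothesis, start=1):
            substitution = previous[j - 1] + (ref_char != hyp_char)
            current.append(min(previous[j] + 1, current[j - 1] + 1, substitution))
        previous = current
    return previous[-1] / len(reference)
```

Because CER divides by the length of the reference, a model that produces far too much text can score well above 1.0, which is what makes it a useful alarm for runaway output.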

There were also two blind LLM judges, one using Claude Opus 4.8 and one using Gemini 3.1 Pro (both state-of-the-art frontier models at the time of writing). They compared anonymous submissions against the ground truth across several dimensions. That helps, with limits: the judges are LLMs too, and may share some of the same failure modes as the models they are scoring. Blindness reduces the obvious bias, but it does not make a model an impartial authority.

Each model-page pair ran three times because cloud temperature=0 does not guarantee identical output. Routing, batching, safety layers and implementation details can still change the result. The scoreboard therefore reports means and low-high ranges, with the spread kept visible.
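As a rough illustration of that reporting, the sketch below summarises repeated runs as a mean plus a low-high range. The function name and the three scores are made up for the example; they are not benchmark figures.

```python
from statistics import mean


def summarise_runs(scores: list[float]) -> dict[str, float]:
    """Summarise repeated runs of one model as mean, low, high and spread."""
    if not scores:
        raise ValueError("need at least one run")
    low, high = min(scores), max(scores)
    return {"mean": mean(scores), "low": low, "high": high, "spread": high - low}


# Illustrative numbers only, not taken from the benchmark.
runs = [61.0, 64.0, 70.0]
summary = summarise_runs(runs)
print(f"{summary['mean']:.2f}% ({summary['low']:.2f}-{summary['high']:.2f})")
```

Keeping the spread as its own field makes it easy to sort models by stability as well as by average.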

The final roster was ten vision-capable models: Gemini 3.1 Pro and Gemini 3.5 Flash from Google, Claude Haiku, Sonnet and Opus from Anthropic, Qwen3-VL from Alibaba’s Qwen team, GLM-4.6V from Z.ai, Mistral Medium and Mistral Large from Mistral, and Llama 4 Maverick from Meta. Most API access went through OpenRouter.

For material that cannot leave an institution, some of these models could be self-hosted and run locally. In this run, the cloud-only top tier was Gemini and Claude. The models with downloadable weights, at least in principle, were the Qwen3-VL, GLM-4.6V, Llama 4 Maverick and Mistral Large 3 side of the roster, with the usual caveat that “can be run locally” sometimes means “can be run locally if your local machine is a small data centre”. Still, it is an option worth considering for restricted archival material.

The scoreboard

Here is the headline result, using the final checkpoint percentage over three full runs.

Rank	Model	Checkpoint mean	Run range	CER	Outlier pages
1	gemini-3.1-pro-think	74.91%	74.63-75.23	0.200	1
2	opus-4.8	70.06%	68.86-71.84	0.208	0
3	gemini-3.5-flash-think	67.27%	64.09-72.42	0.358	4
4	sonnet-4.6	54.54%	52.26-55.78	0.281	1
5	qwen3-vl-235b-thinking	44.16%	43.09-45.77	0.335	1
6	glm-4.6v	39.44%	37.97-40.98	3.143	1
7	mistral-medium-3.5	35.98%	33.06-37.48	0.454	0
8	llama-4-maverick-tiled	29.72%	28.87-31.07	0.575	1
9	mistral-large-tiled	28.66%	27.52-29.41	0.495	0
10	haiku-4.5	19.07%	18.31-19.86	0.495	0

Gemini 3.1 Pro wins cleanly at 74.9%, and it barely moves across runs. Its three-run range is less than a percentage point. Opus 4.8, with a tiled input pipeline, is close behind at 70.1% and has no outlier pages. Then comes the surprise: Gemini 3.5 Flash, after fixing a failure mode I will get to in a moment, reaches 67.3%. That puts it ahead of Sonnet 4.6, which is not where I expected it to land when I started.

After that the scores fall away quickly: Qwen at 44.2%, GLM at 39.4%, Mistral Medium at 36.0%, Llama 4 Maverick at 29.7%, Mistral Large at 28.7%, and Haiku at 19.1%. These are not useless systems. Some of them produce decent passages on some pages. But a benchmark matters precisely because it measures every page, not just the ones a model happens to get right.

The ranking matters, but the failure modes are probably the more useful result.

The image pipeline

The first technical lesson came from the image pipeline. Claude, when run through the CLI, reads large scans with a Read tool that downscales images. On dense pages this reduced the effective input quality. On a table page, the checkpoint scores plummeted: Haiku 4.5%, Sonnet 24%, Opus 39%. Those numbers measured the delivery route as much as the model: the page had been softened before Claude ever saw it.

The fix was tiling. I split the page into four overlapping horizontal strips and let Claude read those strips one by one. Effective resolution went up. So did the scores: Opus went from 39% to 61% on the smoke-test table page, Sonnet from 24% to 60%, Haiku from 4.5% to 26%.
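The strip layout can be sketched in a few lines. This is an assumption-laden illustration rather than the pipeline used here: the function name is hypothetical, and the 15% overlap is a guess, since the exact overlap is not something the benchmark notes specify.

```python
def horizontal_strips(
    height: int, count: int = 4, overlap: float = 0.15
) -> list[tuple[int, int]]:
    """Return (top, bottom) pixel rows for `count` overlapping horizontal strips.

    `overlap` is the fraction of a strip's base height added on each inner edge
    (0.15 is an assumed value, chosen only for illustration).
    """
    if height <= 0 or count <= 0:
        raise ValueError("height and count must be positive")
    if not 0 <= overlap < 1:
        raise ValueError("overlap must be in [0, 1)")
    base = height / count
    margin = base * overlap
    strips = []
    for index in range(count):
        # Clamp to the page so the first and last strips do not run off the edge.
        top = max(0, round(index * base - margin))
        bottom = min(height, round((index + 1) * base + margin))
        strips.append((top, bottom))
    return strips


print(horizontal_strips(4000))
```

Each pair can then be turned into a crop box of the form (0, top, width, bottom) for whatever image library is in use. The overlap matters because a line of handwriting cut in half at a strip boundary would otherwise be unreadable in both strips.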

That immediately raises the fairness question. If only Claude gets tiling, is that a hidden advantage? I tested that too. The answer was useful because it was not symmetrical. GLM gained 14.8 percentage points from tiling, so I tiled it too. Gemini Pro lost 2.3 points, presumably because fragmenting the page damaged its ability to see the whole layout. Qwen was basically unchanged.

So I treated tiling as an input condition. Resolution-limited routes need it. Strong whole-image routes may not. I reported it as a measured condition rather than assuming that every model saw an equivalent version of the scan. It turned out that, in a benchmark like this, the image pipeline has to be part of the test.

The same experiment settled a related question: the gains came from resolution, not reasoning. I ran Claude with extended thinking turned up and turned off, and the quality barely moved — no-thinking scored about the same as medium thinking, and even “ultrathink” without tiling got stuck at 24% on the table page. Tiling did the work; thinking did not. So Claude runs tiled and with thinking off. The lesson is consistent across the benchmark: a page has to be legible before reasoning can help, and once it is legible the reasoning barely helps.

The most revealing failure

The most instructive failure was Gemini Flash at temperature=0 with high reasoning. On hard handwriting it sometimes fell into a repetition loop. It would worry at a word, revise it, worry again, and then keep going until it had burned the entire 32k token budget. It would go something like: g-e-a-r-r-e-s-t? No, g-e-a-r-r-e-s-t-e-e-r-t? No.... It got stuck on a single word, unable to leave it alone.

More tokens did not help. They only let the loop run longer. Capping the reasoning tokens did not help either: Gemini honoured the cap unreliably, and even the high and medium effort settings were applied inconsistently from one identical call to the next — sometimes a clean ten thousand tokens, sometimes a thirty-thousand-token loop. The fix that held was a low reasoning budget, with a small temperature bump to 0.3 alongside it. The low budget was the best guarantee: with almost nothing to think with, the model cannot fall into the loop, and the call finishes cleanly every time. The temperature bump helps break the loop, but on its own it is not watertight. Only the low setting was dependably stable, and since Flash is the model that overthinks while Pro does not, this is a Flash-specific accommodation.

The effect was significant. Flash jumped from 47.2% to 67.3% checkpoints, moving from fourth place to third and beating the pricier Sonnet. The model had not become smarter. The harness had just stopped feeding one of its weaknesses.

glm-4.6v on p6: 68,388 characters, about 40x the ground truth length
... toestand van de beame op het gemelde ten regarde lebben de toestand
van de beame op het gemelde ten regarde lebben de toestand van de beame
op het gemelde ten regarde ...

GLM-4.6V did the same kind of thing more visibly. On that page it produced 68,388 characters, about forty times the real length, repeating a phrase ad infinitum. That is why CER still matters alongside checkpoints: GLM’s checkpoint score merely says “low” while the CER records the scale of the failure.
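A harness can catch this kind of runaway output before it reaches the scorer. The sketch below flags a transcription that is far longer than expected or that repeats one phrase many times. The function names and thresholds are hypothetical, and outside a benchmark the expected length would have to be estimated rather than taken from ground truth.

```python
from collections import Counter


def length_ratio(output: str, reference_length: int) -> float:
    """How many times longer the output is than the expected length."""
    if reference_length <= 0:
        raise ValueError("reference_length must be positive")
    return len(output) / reference_length


def most_repeated_phrase(output: str, words_per_phrase: int = 5) -> tuple[str, int]:
    """Return the most frequent run of `words_per_phrase` words and its count."""
    words = output.split()
    phrases = Counter(
        " ".join(words[i : i + words_per_phrase])
        for i in range(len(words) - words_per_phrase + 1)
    )
    if not phrases:
        return "", 0
    return phrases.most_common(1)[0]


def looks_like_loop(
    output: str,
    reference_length: int,
    max_ratio: float = 3.0,   # assumed threshold
    max_repeats: int = 5,     # assumed threshold
) -> bool:
    """Flag output that is far too long or repeats one phrase many times."""
    _, repeats = most_repeated_phrase(output)
    return length_ratio(output, reference_length) > max_ratio or repeats > max_repeats
```

Genuine archival formulae do repeat, so a check like this is better used to trigger a rerun or a human look than to discard a page automatically.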

The cost

The economics were less intuitive than the model ranking. In the scored runs, Flash was much cheaper than Pro. The bad Flash configuration was the expensive part: failed and experimental calls could burn 25k to 30k reasoning tokens, often without producing useful text, costing up to 30 cents per run while yielding no usable results.

After the loop fix, Flash became what the price list had promised: 67.3% in 38 seconds for about $0.065 per scored page. That is the value result in the benchmark. Pro remains the quality winner at 74.9% in 96 seconds for about $0.161. Opus is accurate but slow at 174 seconds and $0.268 API-equivalent. Sonnet is slower still at 224 seconds and $0.228 API-equivalent.
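One way to compare these figures is dollars per checkpoint percentage point. That metric is my own illustrative choice, not something the benchmark reports; the scores, timings and costs are the ones quoted in this article.

```python
# (checkpoint mean in %, seconds per page, dollars per scored page), as quoted above.
results = {
    "gemini-3.5-flash": (67.3, 38, 0.065),
    "gemini-3.1-pro": (74.9, 96, 0.161),
    "opus-4.8": (70.1, 174, 0.268),
    "sonnet-4.6": (54.5, 224, 0.228),
}


def cost_per_point(score: float, dollars: float) -> float:
    """Dollars spent per checkpoint percentage point (an illustrative metric)."""
    if score <= 0:
        raise ValueError("score must be positive")
    return dollars / score


# Cheapest per point first.
ranked = sorted(results, key=lambda name: cost_per_point(results[name][0], results[name][2]))
for name in ranked:
    score, seconds, dollars = results[name]
    print(f"{name}: ${cost_per_point(score, dollars):.4f} per point, {seconds}s per page")
```

A ratio like this flatters cheap models, so it is best read next to the absolute score: a page read badly is not a bargain at any price.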

Mistral makes the reasoning-cost trade visible inside a single model family. Mistral Medium, which reasons, scored 36.0% at roughly twelve thousand tokens a page; Mistral Large, which does not, scored 28.7% at about fifteen hundred. That is a real 7.3-point gain for reasoning and a larger model, but it costs roughly twenty-four times as much per page.

I didn’t pay for the Claude runs, since I ran them through my subscription. In practice, that made the Claude calls feel like $0. But the tiled CLI route is token-heavy: four Read rounds, agent overhead, and a lot of output. In API-equivalent terms, the same technique that rescued Claude’s quality made it roughly five to ten times heavier than a plain direct call would have been. That is something to take into account if you want to deploy it at scale.

Tables split the field

Tables are where I expected general models to suffer, because tables are difficult for many HTR workflows. Rows and columns are not text in a simple line; their meaning is in the layout, in which value sits under which heading. If the model misreads the grid, the words may be right and the data still useless.

The result split by model family and, I suspect, by image pipeline.

Gemini Pro scored 80.7% on tabular pages and 72.1% on prose, a positive delta of 8.6 points. Flash was even more table-positive: 74.0% on tables, 63.9% on prose. GLM and Llama also did better on tables. Claude went the other way. Opus scored 64.9% on tables and 72.7% on prose. Sonnet showed the same pattern.

My working explanation is this: Gemini sees the whole grid in one go, which enables it to read a cell within the context of its column. Claude, in this setup, sees four overlapping strips. That helps resolution but fragments the table. A model can be good at reading handwriting and still be put at a disadvantage by how the page is sliced.

The spreadsheet

Then I asked the champion to do a slightly different job: not “transcribe this page”, but “turn this handwritten 23-column hospital register into CSV”. Semicolon-separated columns, one row per table row, keep the ditto marks, keep the amounts as written.

It surprised me how well this worked. Gemini Pro produced a clean CSV with 23 columns, headers, 13 patient rows, name corrections, Id. entries, ditto marks and quarter amounts. It opened in Excel. The p14 two-column index became a neat 40-row CSV too. Below is an excerpt. Because of the width of the table I’ve omitted a few fields.

Nummer	Naam en voornaam	Geboren	Geboorteplaats	Woonplaats	Ingekomen	Uitgegaan	Totaal
1	Postema / Wubbina	12 Aug 1851	Aduard	Violetsteeg Groningen	1 Januari	13 Juli	48.50
2	van Weering / Meggelina	21 Septb 1891	Groningen	Groningen	1
3	Barendse / Abraham	31 Juli 1887	Groningen	Bloemstraat Groningen	1	20 Januari	5.00
4	Weide / Wilhelmina Adriana	23 Decb 1879	Goes	Leeuwarderstr Groningen	1	5 Septb	62.00
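Before trusting a model-made CSV, it is worth checking that every row has the expected number of columns. The sketch below does that for semicolon-separated text using Python's standard `csv` module. The function name is hypothetical, and the sample is cut down to the eight fields of the excerpt, with the last row deliberately truncated to show a failure.

```python
import csv
import io


def check_rows(csv_text: str, expected_columns: int) -> list[int]:
    """Return the 1-based numbers of rows whose column count is wrong."""
    reader = csv.reader(io.StringIO(csv_text), delimiter=";")
    return [
        number
        for number, row in enumerate(reader, start=1)
        if len(row) != expected_columns
    ]


# Shortened to the eight fields of the excerpt; the last row is deliberately broken.
sample = (
    "Nummer;Naam en voornaam;Geboren;Geboorteplaats;Woonplaats;Ingekomen;Uitgegaan;Totaal\n"
    "1;Postema / Wubbina;12 Aug 1851;Aduard;Violetsteeg Groningen;1 Januari;13 Juli;48.50\n"
    "3;Barendse / Abraham;31 Juli 1887;Groningen;Bloemstraat Groningen;1;20 Januari\n"
)
print(check_rows(sample, expected_columns=8))
```

For the full register the expected count would be 23. A row that fails the check is usually a sign that the model merged or dropped a cell, which is exactly the kind of error that leaves the words right and the data wrong.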

For archives, this may be the most interesting part of the whole exercise. A lot of heritage data becomes useful only when the table structure survives. If a strong vision model can go from scan to database-able CSV for sources like this, it might actually be viable for making large amounts of historical data accessible quickly. I immediately started fantasising about what this could mean for our website AlleGroningers.

Claude Opus, using the tiled CLI route, did not even get close to giving me the same result here. It timed out after more than fifteen minutes. That timeout belongs in the findings. The same tiling that rescued Claude on dense handwriting made it a poor fit for table-to-CSV conversion, because the grid was no longer visible as one grid.

The three-run setup mattered most for Flash. After the fix it was a strong contender, but the range was still wide: 64.09 to 72.42. It had four outlier pages, including pages where the spread between runs was enormous. Pro and Opus were much steadier. Pro had one outlier page, the damaged 1672 city account. Opus had none.

That makes Flash attractive for large batches where you can tolerate review, reruns or ensemble checks. It is fast and cheap enough to be used generously. But if I needed one expensive page read once, I would rather have the steadier range of Pro or Opus. The average alone would hide that distinction, and a single run would have hidden it entirely.

What I still do not trust

I do not trust any of these models not to confabulate on damaged or ambiguous pages. The damaged 1672 city account is the clearest warning. Gemini Pro spent roughly 32k tokens on that page and scored only 15.2% checkpoints. That page shows the bottleneck clearly. More reasoning did not make the missing or damaged marks readable.

I also do not trust the judges without checking them. They are useful, especially because there are two of them and the submissions are blind, but they are not outside the system. They share some failure modes with the contestants. For table pages I trust checkpoints more than judge prose.

The sample is also small. Fifteen pages, three runs each, ten models. That is enough to see patterns, but not enough for broad claims about all archival handwriting. The ground truth took careful human curation, and that remains the real bottleneck. The benchmark exists because someone looked at the pages, decided what counted, wrote checkpoints and corrected the diplomatic transcription. There is no escaping that labour. There are only better ways to spend it.

Finally, at the speed LLMs are developing, the roster will always be dated and incomplete. Fable was pulled from availability before the expansion, after the US government ordered Anthropic to disable it for foreign nationals under export controls. GPT-5.5 was too expensive and slow for the large run, though it remains plausible for occasional single-page transcription. Kimi produced failures around reasoning tokens and output budgets. Gemma 4 could not be fielded at all: through OpenRouter it received zero image tokens and never actually saw the scan, and through Google’s own studio it thought without limit and timed out before producing any transcription, with no way to switch the thinking off.

What we learn from this

I think we can draw a few lessons from this, even as these models age out.

Tables are the surprise strength. This is the material where classic HTR struggles most, and where I expected general models to struggle too. Instead the strong ones thrived, Gemini most of all: it read the whole grid in one pass, scored higher on tabular pages than on prose, and turned a 23-column handwritten hospital register into clean, Excel-ready CSV. For a great deal of archival material the structure is the point, and structure is exactly what these models preserved.

Flawed material is (for now, at least) the hard ceiling. Damaged, ambiguous or unclear pages defeated every model, and more reasoning never rescued them. The damaged 1672 city account set the ceiling for the whole benchmark. Worse, a model’s instinct on an uncertain mark is to produce confident output, which is precisely the wrong instinct for a source you cannot fully read.

These tests showed that the harness and the method matter just as much as the model. The same model scored 24% or 60% on one page depending only on how the scan reached it, and each one wanted different handling: tiling helped some and hurt others, and a low reasoning budget rescued one model and a high one wrecked another. Several did not work out of the box at all. One never received the image, one looped for tens of thousands of characters, and one quietly ignored the setting I had chosen.

One run is not enough to trust. Cloud models are not deterministic even at temperature zero, and the cheapest strong model, Flash, was also the least stable, swinging from near-zero to around 80% on the same page between runs. A single pass would have hidden that. Reliability is part of accuracy: models that score well on average do not always score consistently.

And all of this has a short shelf life. The field moves fast enough that one model vanished mid-project to a government export ban, and the rest will be overtaken soon enough. Read this for the shape of the problems, which will likely change more slowly, rather than the ranking, which will be stale within weeks.

What I would try next

The next step is to build more workflows on these findings.

The biggest gap is the comparison I deliberately left out of this round: these general models against the dedicated HTR tools (Transkribus, eScriptorium, Loghi) head to head, on the same pages and the same scoring. My strong suspicion is that the answer is not either-or but both. HTR does the line-level reading it was built for, and an LLM does the post-processing it is good at: correcting, normalising, and above all imposing structure, turning a flat line of text into the table or index the source actually is. If that holds, the most useful system is a combination — HTR with LLM post-processing — and the next benchmark should measure that pairing directly rather than either alone.

This is already the direction things are headed in. Transkribus, the most widely used HTR platform, now pairs its text recognition with AI-driven information extraction, table recognition, field models and entity tagging, to turn a flat transcription into structured CSV or XML, which is precisely the post-processing job I am describing. So the hybrid is less a bet than an observation: the dedicated tools and the general models are converging on the same workflow from opposite ends, and a benchmark that pits them head to head should also measure them working together.

For prose pages, I would test rerun strategies: run Flash cheaply several times, detect low-confidence or high-variance pages, and escalate those to Pro or Opus. For tables, I would separate the problem from line-oriented HTR and treat it as structured extraction from the start. For restricted collections, I would give the open-weight models a more serious local run, especially Qwen, GLM, Llama and the newer Mistral weights.
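The rerun-and-escalate idea could be prototyped without any ground truth by measuring how much repeated cheap runs disagree with each other. The sketch below uses `difflib.SequenceMatcher` from the standard library. The function names, the page identifiers and the 0.9 threshold are all assumptions, and disagreement is only a proxy: three runs can agree and still be wrong.

```python
from difflib import SequenceMatcher
from itertools import combinations


def agreement(transcripts: list[str]) -> float:
    """Lowest pairwise similarity (0 to 1) between repeated transcriptions."""
    if len(transcripts) < 2:
        raise ValueError("need at least two runs to compare")
    return min(
        # autojunk=False keeps long pages from being treated as noise.
        SequenceMatcher(None, a, b, autojunk=False).ratio()
        for a, b in combinations(transcripts, 2)
    )


def pages_to_escalate(
    runs_by_page: dict[str, list[str]], threshold: float = 0.9
) -> list[str]:
    """Return the pages whose cheap runs disagree too much to trust."""
    return [
        page
        for page, transcripts in runs_by_page.items()
        if agreement(transcripts) < threshold
    ]


# Hypothetical page ids and outputs, for illustration only.
runs = {
    "page-a": ["Anno 1672 den 3 Maij"] * 3,
    "page-b": ["Anno 1672 den 3 Maij", "xxxx", "Anno 1672"],
}
print(pages_to_escalate(runs))
```

Using the lowest pairwise similarity rather than the average is a deliberately cautious choice: one run that goes off the rails is enough to send the page to the more expensive model.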

I also want better benchmark pages. More damaged material. More marginalia. More mixed hands. More pages where the answer is “there is no safe answer”. The temptation with machine reading is to reward confident output. Archives often require the opposite: calibrated hesitation. Let me know what you don’t know.

What stays with me most is how conditional the strongest result is. The models did not make palaeography unnecessary, or remove the need for human judgment, source criticism, ground truth, or the little decisions about ditto marks. Good. That is not the part of archival work I want automated away. It also means that, even though on the surface it looks like LLMs can just read old handwriting out of the box, in reality they still need quite a lot of guidance and support, especially if you want the results to be of archival quality.

What they did do is become useful enough to change the shape of some work. They will not handle the hard scholarly parts where a faint abbreviation changes the reading. But a great deal of archival labour sits between “impossible” and “already solved”: sorting, rough transcription, table extraction, candidate indexing, triage, repeatable comparison. That middle zone is where these models are starting to matter. I can use them as more of a multiplier of my skills than a replacement of them.

That is why the morning doing research in the municipal archives of Westerwolde keeps coming back to me. NotebookLM did not solve the archive. It did not replace reading, checking or writing. But it gave me a usable first pass while the documents were still fresh in my head, and it helped point me toward material I might otherwise have missed in a hurried visit. The benchmark gives that feeling a more sober frame: not magic, but a set of tools that are becoming practical if you know where they break.

I began with the deliberately naive question: can a general LLM just read old handwriting? After this benchmark, my answer is: sometimes, provided we are honest about everything around the word “read”. The pipeline matters. The scoring matters. The page matters. The cost model matters. So does the person checking the result. For now, the useful work is knowing where these tools help, and where the archive still needs a careful human reader.