Great list! I’ll definitely run your benchmark against Doctly.ai (our PDF-to-Markdown service), especially once we publish our workflow service, to see how we stack up.
One thing I’ve noticed in many benchmarks, though, is the potential for bias. I’m actually working on a post about this issue, so it’s top of mind for me. For example, in the Omni benchmark, the ground truth expected a specific order for heading information—like logo, phone number, and customer details. While this data was all located near the top of the document, the exact ordering felt subjective. Should the model prioritize horizontal or vertical scanning? Since the ground truth was created by the company running the benchmark, their model naturally scored the highest for maintaining the same order as the ground truth.
However, this approach penalized other LLMs for not adhering to the "correct" order, even though the order itself was arguably arbitrary. This kind of bias can skew results and make it harder to evaluate models fairly. I’d love to see benchmarks that account for subjectivity or allow for multiple valid interpretations of document structure.
Did you run into this when looking at the benchmarks?
On a side note, Doctly.ai leverages multiple LLMs to evaluate documents, and runs a tournament with a judge for each page to get the best data (this is only on the Precision Ultra selection).
themanmaran 3 hours ago [-]
Hey, I wrote the Omni benchmark. I think you might be misreading the methodology on our side. Order on the page does not matter in our accuracy scoring; in fact, we only score JSON extraction as the measurement of accuracy, which is order independent.
We chose this method for all the same reasons you highlight. Text-similarity-based measurements are very susceptible to bias and don't correlate super well with accuracy. I covered the same concepts in the "The case against text-similarity"[1] section of our writeup.
[1] https://getomni.ai/ocr-benchmark
I'll dig deeper into your code, but from scanning your post it does look like you're addressing this. That's great.
If I do find anything, I'll share with you for comments before I publish the post.
prats226 3 hours ago [-]
Bias wrt ordering is a great point. What we consider structured information in this benchmark should be directly comparable irrespective of its presentation (order, format, etc.), so the benchmark does take that into account.
An example: if you are only converting, let's say, an invoice into markdown, you can introduce bias wrt ordering and so on. But if the task is to extract the invoice number, the total amount, and the line items with headers like price, amount, and description, then you can compare two outputs without a lot of bias. E.g. even if columns are interchanged, you will still get the same metric.
kapitalx 3 hours ago [-]
Exactly. You still have to be explicit in order to remove bias. Either by sorting the keys, or looking up specific keys. For arrays, I would say order still matters. For example when you capture a list of invoice items, you should maintain order.
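To make this concrete, here is a minimal, hypothetical sketch (not code from any of the benchmarks discussed) of a field-level comparison where key order never enters the metric, while list order, e.g. for line items, still does:

    def field_accuracy(predicted: dict, ground_truth: dict) -> float:
        """Fraction of ground-truth fields the prediction got exactly right.

        Dict equality ignores key order, so interchanged columns still match;
        list equality is positional, so the order of line items still matters.
        """
        if not ground_truth:
            return 0.0
        return sum(predicted.get(key) == expected
                   for key, expected in ground_truth.items()) / len(ground_truth)

    gt = {
        "invoice_number": "INV-001",
        "total_amount": "118.00",
        "line_items": [{"description": "Widget", "price": "100.00", "amount": "100.00"}],
    }
    pred = {  # same fields, different key order, columns interchanged inside the line item
        "total_amount": "118.00",
        "line_items": [{"amount": "100.00", "description": "Widget", "price": "100.00"}],
        "invoice_number": "INV-001",
    }
    print(field_accuracy(pred, gt))  # 1.0 -- ordering never entered the metric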
themanmaran 5 hours ago [-]
Love to see another benchmark! We published the OmniAI OCR benchmark the other week. Thanks for adding us to the list.
One question on the "Automation" score in the results: is this a function of extraction accuracy, or of the accuracy of the LLM's "confidence score"? I noticed the "accuracy" column was very tightly grouped (between 79% and 84%), but the automation score was far more variable.
And a side note: is there an open source Mistral benchmark for their latest OCR model? I know they claimed it was 95% accurate, but it looks like that was based on an internal evaluation.
prats226 5 hours ago [-]
Automation is a combination of both: extraction accuracy and the accuracy of the confidence scores.
A good way to think about automation is recall at high precision, which is what you need for true automation: you stop worrying about documents that are very likely to have correct results and focus on manually correcting the documents likely to have errors.
The reason accuracies are tightly grouped but automation scores are not is that these models are trained to be accurate but not necessarily predictable; there is no real way to get their confidence scores calibrated.
I couldn't find the benchmark Mistral used either.
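For illustration, here is one plausible, simplified sketch of such a "recall at high precision" automation score, computed from per-document confidences and correctness flags; the names and exact definition here are my own and may differ from the benchmark's actual implementation:

    def automation_score(docs, precision_target=0.99):
        """Max fraction of correct documents that can be auto-accepted while
        the accepted pool stays at or above the target precision."""
        total_correct = sum(correct for _, correct in docs)
        if total_correct == 0:
            return 0.0
        best_recall = 0.0
        accepted = correct_accepted = 0
        # Accept documents from most to least confident, tracking pool precision.
        for _, correct in sorted(docs, key=lambda d: d[0], reverse=True):
            accepted += 1
            correct_accepted += correct
            if correct_accepted / accepted >= precision_target:
                best_recall = max(best_recall, correct_accepted / total_correct)
        return best_recall

    # (confidence, extraction_was_correct) pairs: similar raw accuracy, but only
    # well-calibrated confidences let you safely automate a large share of docs.
    docs = [(0.99, True), (0.97, True), (0.95, True), (0.90, False), (0.60, True)]
    print(automation_score(docs, precision_target=0.95))  # 0.75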
bn-l 3 hours ago [-]
In your pricing example I see $6.27 for a 10-page document. That is extremely expensive.
prats226 3 hours ago [-]
Can you share a link? It might be some kind of mistake.
criddell 5 hours ago [-]
How do these compare to traditional commercial and open source OCR tools? What about things like the Apple Vision APIs?
sumedh 4 hours ago [-]
> Apple Vision APIs
Can we even OCR PDFs using the Apple Vision APIs?
derrasterpunkt 2 hours ago [-]
You can[1].
I’ve been vibe coding a little macOS OCR app since last weekend, and I’m really happy with the results so far. This is my first app, so fingers crossed. If it becomes feature-complete and polished enough, I’m considering open sourcing it. There’s still a long way to go, though.
[1] https://developer.apple.com/documentation/vision/vnrecognize...
I don't think you can. There is also a big difference between plain old OCR, which just gets all the text out of an image, and document processing, which is about getting only the relevant information in a good structure that can be pushed directly into a database.
29athrowaway 1 hour ago [-]
Many of the benchmarks I have seen in this space suffer from the Texas Sharpshooter fallacy, where you shoot first and then paint a target around the hole.
If you create a benchmark and your product outperforms everything else, it could mean many things. Overfitting being one of them.
prats226 46 minutes ago [-]
That's an interesting point. The bias might or might not be intentional. From the benchmarks I have seen, a lot of tools solve slightly different problems altogether, target slightly different data distributions, and in the end have to build the best solution around that.
Which is why publishing open benchmarks is the first step: there is public scrutiny of whether the benchmark itself, irrespective of the results, is fair. In the end, the end user will choose the benchmark that best fits their use case, or more often create a variation of their own and do their own unbiased evaluations.