NewMind AI Journal #146

Scaling VL-OCR on MareNostrum5: 25K-Image Inference with vLLM 

📌 We built and scaled a vision–language OCR pipeline on MareNostrum5, processing 25,000 images into structured Parquet with vLLM-served RolmOCR and DotsOCR models. 

📌 The system achieves high concurrency and reliability through SLURM-orchestrated, single-node jobs on 4× H100 GPUs, with full monitoring of latency, throughput, and GPU health. 

📌 Benchmarks show that DotsOCR and RolmOCR both surpass Tesseract, with DotsOCR leading in throughput, layout fidelity, and Turkish diacritic handling—reducing errors and post-processing effort in large-scale document pipelines. 

System Design and Experimental Setup 

We developed a high-throughput VL-OCR pipeline that transforms PDFs into structured Parquet files using the reducto/RolmOCR and rednote-hilab/dots.ocr models, served through vLLM 0.9.1. The system runs on BSC’s MareNostrum5 supercomputer and is orchestrated with SLURM job arrays for efficient parallel processing. PDFs are first rendered into images with PyMuPDF (fitz) at 300 DPI, resized to a maximum dimension of 1024 px, and then passed through OCR. Final outputs are written as Snappy-compressed Parquet, while real-time performance and health metrics are automatically tracked via a Prometheus + Grafana monitoring stack. 

Environment 

Experiments were conducted on BSC’s MareNostrum5, a high-performance computing (HPC) system optimized for large-scale parallel workloads. Its combination of advanced GPU nodes, many-core CPUs, and high-performance storage provides an ideal foundation for high-throughput OCR at scale. This environment enables high per-node concurrency while keeping orchestration straightforward: each job starts its own vLLM server, processes its assigned images, and writes sealed Parquet shards, with no cross-node coordination required. 

MN5 setup 

  • Scheduler & partition: SLURM; GPU jobs on “acc” partition. 

  • Per-node profile: 4× NVIDIA H100 (64 GB VRAM each), ~80 CPUs (≈20 per GPU), 128 GB RAM. 

  • Scale-out approach: Independent single-node jobs orchestrated with SLURM arrays, which allow multiple jobs to be launched and managed efficiently under a single submission. In this run, we used a 2-job array (no multi-node vLLM). 

  • Storage & I/O: High-performance /gpfs/scratch for temporary files; project data under /gpfs/projects; flat image/output layout for simple sharding. 

  • Containers & Toolchain: Singularity 3.11.5; Python 3.12.1; CUDA 12.8, cuDNN 9.6.0, NCCL 2.24.3-1; NVIDIA HPC SDK 25.3; GCC 13.2.0, MKL 2024.2, HDF5 1.14.4.2-nvidia23.9-nvhpcx23.9. 

  • Models & serving: vLLM 0.9.1 hosting reducto/RolmOCR and rednote-hilab/dots.ocr. 

  • Monitoring approach: A dedicated node runs Prometheus and Grafana; OCR jobs emit file-based discovery targets, enabling automatic ingestion of vLLM and GPU metrics without manual target edits. 

Pipeline Overview 

Figure 1: End-to-end VL-OCR on MN5.  

PDFs are rendered at 300 DPI and resized to a 1024 px longest side via PyMuPDF (fitz), producing a flat PNG set with class-prefixed filenames (toc_*, table_*, formula_*, noise_*, scattered_text_*). A SLURM array builds balanced worker lists, skips already-processed items by scanning existing Parquet outputs, and runs RolmOCR/DotsOCR on vLLM 0.9.1 (tensor-parallel across 4× H100 per task) at high concurrency. Results are written as Snappy-compressed Parquet. A separate monitoring node ingests per-job metrics and visualizes throughput, latency, cache behavior, and GPU health. 

Stages 

1) PDF → PNG (preprocessing)   

  • Renderer: PyMuPDF (fitz).   

  • Policy: Render each page at 300 DPI, then resize so the longest side = 1024 px.   

  • Layout: Flat directory output; filenames are class-prefixed to preserve the document category.   
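
A minimal sketch of this rendering stage is shown below, assuming PyMuPDF (fitz) and Pillow; the helper name, argument defaults, and filename pattern are illustrative, while the 300 DPI render and 1024 px longest-side resize follow the policy above.

```python
# Sketch of the PDF -> PNG stage (300 DPI render, 1024 px longest side).
# Assumes PyMuPDF (fitz) and Pillow; names and defaults are illustrative.
import io
from pathlib import Path

import fitz  # PyMuPDF
from PIL import Image

def render_pdf_to_pngs(pdf_path: str, out_dir: str, class_prefix: str) -> None:
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    doc = fitz.open(pdf_path)
    for page_no, page in enumerate(doc):
        # Render at 300 DPI (PDF user space is 72 DPI, hence the 300/72 zoom).
        pix = page.get_pixmap(matrix=fitz.Matrix(300 / 72, 300 / 72))
        img = Image.open(io.BytesIO(pix.tobytes("png")))
        # Downscale so the longest side is at most 1024 px, keeping aspect ratio.
        img.thumbnail((1024, 1024), Image.LANCZOS)
        # Flat output layout with a class-prefixed filename, e.g. toc_report_0003.png.
        img.save(out / f"{class_prefix}_{Path(pdf_path).stem}_{page_no:04d}.png")
    doc.close()
```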

2) Work partitioning (idempotent segmentation)   

  • Discovery: List all PNG basenames.   

  • Skip cache: Scan existing Parquet outputs to reconstruct processed basenames. 

  • Rebalance: Remove already-processed items and evenly distribute the remaining images into lists across SLURM array tasks.   
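
A minimal sketch of this partitioning step, assuming pyarrow and that the doc_id column in existing Parquet shards holds the PNG basename (consistent with the output schema described later); paths and the helper name are illustrative.

```python
# Sketch of idempotent work partitioning: discover PNGs, rebuild the skip cache
# from existing Parquet shards, and rebalance the remainder across array tasks.
from pathlib import Path

import pyarrow.parquet as pq

def build_worker_lists(png_dir: str, parquet_dir: str, num_tasks: int) -> list[list[str]]:
    all_items = {p.stem for p in Path(png_dir).glob("*.png")}

    # Skip cache: reconstruct the set of already-processed basenames.
    done: set[str] = set()
    for shard in Path(parquet_dir).glob("*.parquet"):
        done.update(pq.read_table(shard, columns=["doc_id"]).column("doc_id").to_pylist())

    # Rebalance: round-robin the remaining images across SLURM array tasks.
    remaining = sorted(all_items - done)
    return [remaining[i::num_tasks] for i in range(num_tasks)]

# Each array task then picks its own list via SLURM_ARRAY_TASK_ID, e.g.:
#   my_list = build_worker_lists(...)[int(os.environ["SLURM_ARRAY_TASK_ID"])]
```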

3) PNG → Parquet (vLLM RolmOCR and DotsOCR)   

  • Serving: Per-task vLLM 0.9.1 server hosting reducto/RolmOCR or rednote-hilab/dots.ocr, bfloat16, tensor-parallel size = 4 (one per GPU on the node).   

  • Client: Async HTTP with initial_concurrency = 300, retries and adaptive timeouts; each image is resized/encoded (JPEG 90 preferred, PNG fallback), base64-embedded, and sent to the OpenAI-compatible /v1/chat/completions endpoint.   

  • Schema & sharding: Write Parquet rows {reocr_reason, doc_id, vl_ocr_content} with Snappy compression; flush in sealed shards (default threshold 25,000 rows), atomic rename, timestamped filenames.   

4) Monitoring (operational telemetry)   

  • Logging: Each OCR task writes file-based discovery JSON targets for the vLLM server and the NVIDIA GPU exporter. 

  • Collection & visualization: A dedicated node runs Prometheus + Grafana, auto-loading those targets to show token throughput, TTFT, end-to-end latency, queue depth, KV-cache usage, and GPU metrics in real time. 
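
The sketch below shows the general shape of such a discovery target, assuming Prometheus file_sd_configs on the monitoring node; the target directory, ports, and label names are assumptions rather than the exact values used in the run.

```python
# Sketch: emit a Prometheus file_sd target for this job's vLLM server and GPU
# exporter. Directory, ports, and labels are illustrative assumptions.
import json
import os
import socket
from pathlib import Path

def write_discovery_targets(target_dir: str, vllm_port: int = 8000,
                            gpu_exporter_port: int = 9835) -> None:
    host = socket.gethostname()
    job_id = os.environ.get("SLURM_JOB_ID", "local")
    targets = [
        {"targets": [f"{host}:{vllm_port}"], "labels": {"job": "vllm", "slurm_job": job_id}},
        {"targets": [f"{host}:{gpu_exporter_port}"], "labels": {"job": "gpu", "slurm_job": job_id}},
    ]
    out = Path(target_dir) / f"ocr_{job_id}.json"
    # Prometheus watches the directory and reloads file_sd targets automatically.
    out.write_text(json.dumps(targets, indent=2))
```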

Configuration 

Workload 

The experiment processed a dataset of 25,000 images across five content classes (5,000 each: scattered_text, formula, toc, table, noise). Files were organized in a flat directory with class-prefixed names, simplifying downstream partitioning and evaluation. 

Preprocessing (pdf2png) 

  • Renderer: PyMuPDF (fitz).   

  • Policy: Render at 300 DPI, then resize longest side to 1024 px.   

  • Output: Flat PNG set for downstream OCR. 

Serving (vLLM 0.9.1 with RolmOCR and DotsOCR) 

Each job launched a dedicated vLLM 0.9.1 server configured for tensor-parallel execution across four H100 GPUs. 

  • --model reducto/RolmOCR 

  • --dtype bfloat16 

  • --tensor-parallel-size 4 

  • --pipeline-parallel-size 1 

  • --served-model-name reducto/RolmOCR 

  • --max-model-len 12288 

  • --mm-processor-kwargs '{"num_crops": 1}' 

  • --max-num-seqs 300 

  • --limit-mm-per-prompt image=300 

  • --enforce-eager 

  • --trust-remote-code 

  • --api-key #### 

 Notes: 

  • The num_crops argument is ignored at runtime, as the underlying vision–language models behind RolmOCR and DotsOCR do not support it. 

  • DotsOCR was run with a higher max sequence length (16,384) compared to RolmOCR (12,288).   
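
For reference, a job script can assemble the flags above into a single server launch. The sketch below uses the OpenAI-compatible api_server entrypoint from Python; the actual launch wrapper on MN5 (e.g. inside the Singularity container) is not shown in this post, so treat this as an illustration rather than the exact command.

```python
# Sketch: launch the per-task vLLM 0.9.1 OpenAI-compatible server with the flags
# listed above (RolmOCR variant). Intended to run once per SLURM array task;
# the API key placeholder is kept from the original configuration.
import subprocess

cmd = [
    "python", "-m", "vllm.entrypoints.openai.api_server",
    "--model", "reducto/RolmOCR",
    "--served-model-name", "reducto/RolmOCR",
    "--dtype", "bfloat16",
    "--tensor-parallel-size", "4",
    "--pipeline-parallel-size", "1",
    "--max-model-len", "12288",           # 16384 for rednote-hilab/dots.ocr
    "--mm-processor-kwargs", '{"num_crops": 1}',
    "--max-num-seqs", "300",
    "--limit-mm-per-prompt", "image=300",
    "--enforce-eager",
    "--trust-remote-code",
    "--api-key", "####",
]
server = subprocess.Popen(cmd)  # the client polls /health before sending requests
```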

Inference Client (png → text) 

  • Concurrency: initial_concurrency = 300, min_concurrency = 50 

  • Retries/Timeouts: max_retries = 2, retry_delay = 0.5s, timeout_base = 240s, timeout_increment = 20s 

  • Generation: max_tokens = 3072, temperature = 0 

  • HTTP connection pool: pool_size = 400, per_host_limit = 300 

  • Image encoding: Resize to ≤1024×1024; JPEG (quality 90) preferred, PNG fallback; base64 data URL to the OpenAI-compatible /v1/chat/completions endpoint.   
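
A condensed sketch of the client’s request path is shown below, assuming aiohttp and Pillow; the adaptive concurrency and bookkeeping from the settings above are reduced to a fixed semaphore and a simple retry loop, and the prompt text, endpoint URL, and API key are placeholders.

```python
# Sketch of the async OCR client: bounded concurrency, base64 data-URL images,
# and requests to the OpenAI-compatible chat completions endpoint.
import asyncio
import base64
import io

import aiohttp
from PIL import Image

API_URL = "http://localhost:8000/v1/chat/completions"
HEADERS = {"Authorization": "Bearer ####"}        # matches the server's --api-key
SEM = asyncio.Semaphore(300)                      # initial_concurrency

def encode_image(path: str) -> str:
    img = Image.open(path).convert("RGB")
    img.thumbnail((1024, 1024))                   # resize to <=1024x1024
    buf = io.BytesIO()
    img.save(buf, format="JPEG", quality=90)      # PNG fallback omitted for brevity
    return "data:image/jpeg;base64," + base64.b64encode(buf.getvalue()).decode()

async def ocr_one(session: aiohttp.ClientSession, path: str, model: str) -> str:
    payload = {
        "model": model,
        "temperature": 0,
        "max_tokens": 3072,
        "messages": [{
            "role": "user",
            "content": [
                {"type": "image_url", "image_url": {"url": encode_image(path)}},
                {"type": "text", "text": "Transcribe all text in this image."},
            ],
        }],
    }
    async with SEM:
        for attempt in range(3):                  # max_retries = 2
            try:
                timeout = aiohttp.ClientTimeout(total=240 + 20 * attempt)
                async with session.post(API_URL, json=payload,
                                        headers=HEADERS, timeout=timeout) as resp:
                    resp.raise_for_status()
                    data = await resp.json()
                    return data["choices"][0]["message"]["content"]
            except (aiohttp.ClientError, asyncio.TimeoutError):
                await asyncio.sleep(0.5)          # retry_delay
    return ""
```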

Output Format 

Final results were written in Snappy-compressed Parquet with the schema: 

  • reocr_reason: string 

  • doc_id: string 

  • vl_ocr_content: string 

Sharding policy: Flush at 25,000 rows per shard with atomic renaming and timestamped filenames to ensure idempotency. 
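
A minimal sketch of this shard-sealing logic, assuming pyarrow; rows are buffered, written to a temporary file with Snappy compression, and atomically renamed into a timestamped shard. Class and file names are illustrative.

```python
# Sketch: sealed, Snappy-compressed Parquet shards with atomic rename and
# timestamped filenames (threshold 25,000 rows).
import os
import time
from pathlib import Path

import pyarrow as pa
import pyarrow.parquet as pq

class ShardWriter:
    def __init__(self, out_dir: str, threshold: int = 25_000):
        self.out_dir, self.threshold = Path(out_dir), threshold
        self.rows: list[dict] = []

    def add(self, reocr_reason: str, doc_id: str, vl_ocr_content: str) -> None:
        self.rows.append({"reocr_reason": reocr_reason, "doc_id": doc_id,
                          "vl_ocr_content": vl_ocr_content})
        if len(self.rows) >= self.threshold:
            self.flush()

    def flush(self) -> None:
        if not self.rows:
            return
        table = pa.Table.from_pylist(self.rows)
        final = self.out_dir / f"shard_{time.strftime('%Y%m%d_%H%M%S')}_{os.getpid()}.parquet"
        tmp = final.with_suffix(".tmp")
        pq.write_table(table, tmp, compression="snappy")
        tmp.rename(final)  # atomic on the same filesystem: readers never see partial shards
        self.rows.clear()
```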

Quantitative Results 

Both RolmOCR and DotsOCR were evaluated on a dataset of 25,000 images (12,500 per node) on MareNostrum5. Metrics were collected in Grafana over a 30-minute window and are reported across latency, concurrency, caching, and throughput dimensions. 

1) Latency & Token-level Metrics (Grafana 30-minute Window) 

Table 1: Latency and token-level metrics for RolmOCR and DotsOCR on MareNostrum5 

Latency distributions (P50, P95, P99) and throughput are shown in Table 1. 

  • DotsOCR delivers much higher generation throughput, peaking at 16.7K tokens/s versus 7.03K for RolmOCR. 

  • Inter-token latency is more variable for DotsOCR, but overall end-to-end latency remains competitive across both models. 

  • Prompt lengths and input sizes were similar for both models, so the comparison reflects equivalent workloads. 

In short, DotsOCR is better suited for high-speed token generation, while RolmOCR remains steady with lower variance. 

2) Concurrency & Request Flow 

Table 2: Concurrency and request flow metrics for RolmOCR and DotsOCR on MN5 

Concurrency behavior is summarized in Table 2. 

  • DotsOCR supports significantly higher concurrency, averaging 425 active requests vs. 291 for RolmOCR. 

  • RolmOCR maintained zero queued requests throughout the run, while DotsOCR occasionally queued requests due to its longer sequence lengths. 

  • DotsOCR also processed more successful HTTP requests per five-minute interval, confirming its ability to handle heavier demand. 

Overall, DotsOCR prioritizes throughput at scale, while RolmOCR prioritizes predictable, queue-free execution. 

3) Caching Behavior 

Cache utilization metrics are shown in Table 3. 

Table 3: KV/GPU cache usage metrics for RolmOCR and DotsOCR on MN5 

  • Both models averaged ~13% KV/GPU cache usage. 

  • Peaks remained well below 20%, confirming that caching was not a bottleneck in this workload. 

Quality Assessment (LLM-as-a-Judge) 

To assess text quality beyond raw throughput, we evaluated 500 images (100 per class: formula, table, table of contents, scattered text, and noise) with three OCR systems: Tesseract (baseline), RolmOCR, and DotsOCR. Evaluations were performed using Qwen2.5-VL-32B-Instruct as an automated judge, served through our vLLM API on MareNostrum5. 

The judge was given one image and two OCR transcripts (A = Tesseract, B = RolmOCR or DotsOCR). It returned JSON-formatted scores (0–100) across four criteria: 

  • Coverage / Completeness — all text from the image captured, with no missing lines or sections. 

  • Text Accuracy — including Turkish diacritics (ş, ğ, ı, …). 

  • Formatting — lists, alignment, tables, and layout fidelity. 

  • Grammar & Fluency — correct sentence structure, punctuation, spelling, and overall readability. 

For reproducibility, temperature was fixed at 0, and the judge was instructed to output JSON only. 

Example prompt used for the LLM judge:  

“Judge Turkish OCR quality against the image. Score A and B (0–100) on coverage, accuracy, formatting, fluency; declare per-criterion winners and overall; tag common errors; return valid JSON only.” 
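
A sketch of a single judge call against the same OpenAI-compatible endpoint, assuming the openai Python client; the endpoint URL and the exact prompt wording are placeholders around the setup described above (one image, transcripts A and B, temperature 0, JSON-only output).

```python
# Sketch: one LLM-as-a-judge call (image + two transcripts, JSON-only verdict).
# Assumes the openai client pointed at the vLLM server; URL and prompt are
# placeholders, condensed from the prompt quoted above.
import base64
import json

from openai import OpenAI

client = OpenAI(base_url="http://judge-node:8000/v1", api_key="####")

def judge(image_path: str, text_a: str, text_b: str) -> dict:
    with open(image_path, "rb") as f:
        image_url = "data:image/png;base64," + base64.b64encode(f.read()).decode()
    prompt = (
        "Judge Turkish OCR quality against the image. Score A and B (0-100) on "
        "coverage, accuracy, formatting, fluency; declare per-criterion winners "
        "and overall; tag common errors; return valid JSON only.\n\n"
        f"A (Tesseract):\n{text_a}\n\nB:\n{text_b}"
    )
    resp = client.chat.completions.create(
        model="Qwen/Qwen2.5-VL-32B-Instruct",
        temperature=0,
        messages=[{
            "role": "user",
            "content": [
                {"type": "image_url", "image_url": {"url": image_url}},
                {"type": "text", "text": prompt},
            ],
        }],
    )
    return json.loads(resp.choices[0].message.content)
```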

Sample output (verbatim): 

Figure 2: Judge VLM output sample 

Results: 

Compared with Tesseract, RolmOCR shows dramatically fewer errors across all categories, especially layout, diacritics, numbers, and missing_sections, while hallucinations are near zero. This indicates stronger visual grounding and better handling of Turkish text (diacritics), which reduces downstream correction effort. DotsOCR handles these categories even better than RolmOCR, achieving fewer overall errors and generating no nonsensical text, unlike RolmOCR. This suggests DotsOCR is better suited to data with heavy diacritics and noise, while RolmOCR still retains good visual grounding and effective Turkish text handling. 

Figure 3: Counts of error tags (e.g., missing_sections, diacritics, numbers, …) assigned by the judge to Tesseract vs RolmOCR across the 500 samples. 

Figure 4: Counts of error tags (e.g., missing_sections, diacritics, numbers, layout, …) assigned by the judge to Tesseract vs DotsOCR across the 500 samples. 

Figure 5: Counts of error tags (e.g., missing_sections, diacritics, numbers, layout, …) assigned by the judge to RolmOCR vs DotsOCR across the 500 samples. 

RolmOCR outperforms Tesseract on all four criteria, with the largest gains in Formatting and Fluency. This aligns with the error-reduction pattern, showing that fewer layout/diacritic issues translate into more readable, faithful outputs. DotsOCR, on the other hand, appears to surpass the RolmOCR model, achieving higher overall scores. 

Figure 6: Mean scores (0–100) for Coverage, Accuracy, Formatting, and Fluency aggregated over all 500 images for Tesseract vs RolmOCR. 

 Figure 7: Mean scores (0–100) for Coverage, Accuracy, Formatting, and Fluency aggregated over all 500 images for Tesseract vs DotsOCR. 

Figure 8: Mean scores (0–100) for Coverage, Accuracy, Formatting, and Fluency aggregated over all 500 images for RolmOCR vs DotsOCR. 

RolmOCR is consistently strong across classes, with TOC and tables particularly high, and even noise pages handled robustly (minimal hallucinations, high coverage) compared to Tesseract. Minor dips appear on scattered_text and formulas, primarily in formatting rather than coverage. Tesseract, on the other hand, shows class-sensitive weaknesses, especially on scattered_text (formatting/fluency) and tables/formulas (numeric/structure fidelity). The contrast with RolmOCR highlights where the VL approach closes the largest gaps—layout and diacritics in particular. 

Figure 9: RolmOCR’s mean scores by class (formula, table, toc, scattered_text, noise) across the four criteria. 

Figure 10: DotsOCR’s mean scores by class (formula, table, toc, scattered_text, noise) across the four criteria. 

Figure 11: Tesseract’s mean scores by class across the same four criteria. 

Figure 12: All three models’ mean scores by class across the same four criteria. 

DotsOCR scores consistently higher than RolmOCR across all four evaluation criteria: coverage, accuracy, formatting, and fluency. 

Figure 13: Mean scores (0–100) for Coverage, Accuracy, Formatting, and Fluency by class aggregated over all 500 images for RolmOCR vs DotsOCR. 

DotsOCR shows significantly fewer errors in its outputs than RolmOCR, especially on layouts derived from tables or table-like content. Although the judge’s error tags highlight many more differences, the overall accuracy scores remain similar; this gap mainly stems from the prompting strategies used. 

Figure 14: Mean scores (0–100) for Coverage, Accuracy, Formatting, and Fluency by class aggregated over all 500 images for RolmOCR vs DotsOCR. 

Figure 15: Distribution of winners across evaluation criteria. 

Our Mind: Scaling VL-OCR at MareNostrum5 

We set out to prove VL-OCR could replace traditional OCR at enterprise scale—and succeeded decisively. The breakthrough wasn't complex distributed architecture but elegant simplicity: independent SLURM-orchestrated jobs, each running vLLM on 4× H100 GPUs with comprehensive monitoring. 

Our technical insights challenged assumptions. KV cache averaged only 13%—not the bottleneck we expected. The sweet spot emerged at 300 DPI renders resized to 1024px with JPEG 90 encoding, balancing accuracy and throughput perfectly. 

DotsOCR delivered superior performance: 16.7K tokens/s throughput with exceptional Turkish diacritic handling and layout fidelity, though with occasional queuing. RolmOCR offered predictable latency and zero queues—choose DotsOCR for maximum throughput, RolmOCR for consistency. 

Our LLM-as-a-Judge methodology revolutionized OCR evaluation, moving beyond character error rates to assess holistic quality: coverage, accuracy, formatting, and fluency. Both models crushed Tesseract across all metrics, with DotsOCR leading in complex layouts and noisy documents. 

The operational impact is transformative—dramatically reduced post-processing effort and higher confidence in extracted data. This study proves VL-OCR is production-ready for millions of documents, opening new possibilities for intelligent document processing at unprecedented scale. 

Key Takeaways 

In a comparative evaluation, DotsOCR is the clear winner, outperforming RolmOCR in both output quality and processing efficiency. 

  • Quality and Accuracy: DotsOCR delivers significantly higher quality with fewer errors in layout, formatting, and handling of Turkish-specific diacritics. This advantage is especially pronounced in structured content like tables, leading to more reliable data and less need for manual post-editing. 

  • Performance and Throughput: Both models are highly scalable and reliable. While RolmOCR had a slightly higher throughput (~8.1 img/s vs. ~7.5 img/s for DotsOCR), DotsOCR achieved its performance while handling a much longer token sequence, demonstrating superior efficiency. 


References 

  1. BSC MareNostrum5 - https://www.bsc.es/supportkc/docs/MareNostrum5/intro/ 

  2. SLURM Workload Manager - https://slurm.schedmd.com/ 

  3. SingularityCE (container runtime used on MN5) - https://sylabs.io/singularity/ 

  4. vLLM Paper - https://arxiv.org/abs/2309.06180 

  5. vLLM GitHub - https://github.com/vllm-project/vllm 

  6. RolmOCR - https://huggingface.co/reducto/RolmOCR 

  7. DotsOCR - https://huggingface.co/rednote-hilab/dots.ocr 

  8. Tesseract OCR - https://github.com/tesseract-ocr/tesseract 

  9. Prometheus - https://github.com/prometheus/prometheus 

  10. Grafana - https://github.com/grafana/grafana 

  11. NVIDIA GPU Exporter - https://github.com/utkuozdemir/nvidia_gpu_exporter 

  12. Qwen2.5-VL-32B-Instruct (VLM judge) - https://huggingface.co/Qwen/Qwen2.5-VL-32B-Instruct 
