
Unlimited OCR: One-Shot Long-Horizon Parsing

Hacker News · Jun 23, 2026, 11:35 AM


Welcome the Era of One-shot Long-horizon Parsing.

[2026/06/23] 📄 Our paper is now available on arXiv.
[2026/06/23] 🤝 Thanks to the ModelScope community for their support. Our model is now available at ModelScope.
[2026/06/22] 🚀 We present Unlimited-OCR, aiming to push DeepSeek-OCR one step further.

Inference

Transformers

Inference using Hugging Face transformers on NVIDIA GPUs. Requirements (tested on Python 3.12.3 + CUDA 12.9):

```
torch==2.10.0
torchvision==0.25.0
transformers==4.57.1
Pillow==12.1.1
matplotlib==3.10.8
einops==0.8.2
addict==2.4.0
easydict==1.13
pymupdf==1.27.2.2
psutil==7.2.2
```

```python
import os
import torch
from transformers import AutoModel, AutoTokenizer

model_name = 'baidu/Unlimited-OCR'

tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
model = AutoModel.from_pretrained(
    model_name,
    trust_remote_code=True,
    use_safetensors=True,
    torch_dtype=torch.bfloat16,
)
model = model.eval().cuda()

# ── Single image supports two configs: gundam or base ──
# gundam: base_size=1024, image_size=640,  crop_mode=True
# base:   base_size=1024, image_size=1024, crop_mode=False
model.infer(
    tokenizer,
    prompt='<image>document parsing.',
    image_file='your_image.jpg',
    output_path='your/output/dir',
    base_size=1024,
    image_size=640,
    crop_mode=True,
    max_length=32768,
    no_repeat_ngram_size=35,
    ngram_window=128,
    save_results=True,
)

# ── Multi page / PDF only uses base (image_size=1024) ──
model.infer_multi(
    tokenizer,
    prompt='<image>Multi page parsing.',
    image_files=['page1.png', 'page2.png', 'page3.png'],
    output_path='your/output/dir',
    image_size=1024,
    max_length=32768,
    no_repeat_ngram_size=35,
    ngram_window=1024,
    save_results=True,
)

# ── PDF (convert pages to images, then multi-page parsing) ──
import tempfile
import fitz  # PyMuPDF


def pdf_to_images(pdf_path, dpi=300):
    doc = fitz.open(pdf_path)
    tmp_dir = tempfile.mkdtemp(prefix='pdf_ocr_')
    # PDF units are points (1/72 inch), so dpi / 72 is the zoom factor.
    mat = fitz.Matrix(dpi / 72, dpi / 72)
    paths = []
    for i, page in enumerate(doc):
        out = os.path.join(tmp_dir, f'page_{i+1:04d}.png')
        page.get_pixmap(matrix=mat).save(out)
        paths.append(out)
    doc.close()
    return paths


model.infer_multi(
    tokenizer,
    prompt='<image>Multi page parsing.',
    image_files=pdf_to_images('your_doc.pdf', dpi=300),
    output_path='your/output/dir',
    image_size=1024,
    max_length=32768,
    no_repeat_ngram_size=35,
    ngram_window=1024,
    save_results=True,
)
```

SGLang

Set up the environment (uv-managed virtualenv).
Install the local SGLang wheel first, then pin kernels==0.9.0 and install PyMuPDF for PDF-to-image conversion:
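
As a recap of the parameter choices in the Transformers example above, here is a minimal sketch of two helpers. Both names, `single_image_kwargs` and `rendered_size`, are hypothetical and not part of the model's API. The first restates the two single-image configs (gundam and base) as keyword arguments. The second shows why `pdf_to_images` scales by `dpi / 72`: PDF page units are points, and there are 72 points per inch. The pixel size is an approximation, since PyMuPDF's own rounding may differ by a pixel.

```python
# Hypothetical helpers for illustration only; not part of Unlimited-OCR's API.

# Single-image presets, copied from the comments in the Transformers example.
SINGLE_IMAGE_PRESETS = {
    "gundam": {"base_size": 1024, "image_size": 640, "crop_mode": True},
    "base": {"base_size": 1024, "image_size": 1024, "crop_mode": False},
}


def single_image_kwargs(mode: str) -> dict:
    """Return a copy of the keyword arguments for the named preset."""
    if mode not in SINGLE_IMAGE_PRESETS:
        expected = sorted(SINGLE_IMAGE_PRESETS)
        raise ValueError(f"unknown mode {mode!r}; expected one of {expected}")
    return dict(SINGLE_IMAGE_PRESETS[mode])


def rendered_size(width_pt: float, height_pt: float, dpi: int = 300) -> tuple[int, int]:
    """Approximate pixel size of a PDF page rendered at `dpi`.

    A page measured in points (1/72 inch) is scaled by dpi / 72,
    the same zoom factor the example passes to fitz.Matrix.
    """
    return round(width_pt * dpi / 72), round(height_pt * dpi / 72)


if __name__ == "__main__":
    print(single_image_kwargs("gundam"))
    print(rendered_size(612, 792))  # US Letter page, measured in points
```

Under these assumptions, the preset can be splatted into the single-image call, for example `model.infer(tokenizer, ..., **single_image_kwargs('gundam'))`. A US Letter page (612 x 792 points) comes out at roughly 2550 x 3300 pixels at 300 dpi.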

