Technical Report-v1: DeepOCR
Authors: Ming Liu (pkulium@iastate.edu), Shilong Liu (sl8264@princeton.edu)
Abstract
This report presents DeepOCR, an open-source reproduction of DeepSeek-OCR implemented using the VILA framework. DeepSeek-OCR explores vision-text compression through optical 2D mapping, achieving high compression ratios while maintaining OCR accuracy. Our reproduction successfully implements the core DeepEncoder architecture (SAM + CLIP with 16× compression) and validates the feasibility of context optical compression. We evaluate our model on OmniDocBench and olmOCR benchmarks, achieving competitive performance. This work provides the research community with an accessible implementation for further exploration of optical context compression mechanisms.
1. Introduction
1.1 Motivation
Large Language Models face significant computational challenges when processing long textual contexts due to quadratic scaling with sequence length. DeepSeek-OCR proposes an innovative solution: leveraging visual modality as an efficient compression medium for textual information. By rendering text as images and encoding them with specialized vision encoders, the model achieves 7-20× compression ratios compared to raw text tokens.
1.2 Contributions
This reproduction makes three key contributions:
- Open-source implementation: We provide a complete, working implementation of DeepSeek-OCR’s architecture adapted to the VILA framework, enabling further research in optical context compression.
- Architecture validation: We successfully reproduce the DeepEncoder design, demonstrating that the combination of window attention (SAM) and global attention (CLIP) with convolutional compression enables efficient high-resolution processing.
- Benchmark evaluation: We evaluate our model on two standard benchmarks (OmniDocBench and olmOCR), providing empirical evidence of the approach’s strengths and limitations.
2. Architecture Design
Figure: architecture overview. Source: DeepSeek-AI (2025), DeepSeek-OCR: Contexts Optical Compression [1].
2.1 DeepEncoder: The Core Vision Encoder
Our implementation faithfully reproduces DeepSeek-OCR’s novel DeepEncoder architecture, which addresses five critical requirements for optical compression:
Requirement Analysis:
- High-resolution processing capability (1024×1024+)
- Low activation memory under high resolution
- Minimal vision token output
- Multi-resolution input support
- Moderate parameter count (~380M)
Architectural Solution:
The DeepEncoder consists of three components connected in series:
- SAM-base (80M parameters):
  - Input processing: 1024×1024 images with 16×16 patches
  - Architecture: 12-layer ViT with window attention (window_size=14)
  - Output: 1024-dimensional features through three convolutional layers (256→512→1024 dims)
  - Key advantage: Window attention keeps activation memory manageable even with 4096 initial tokens
- 16× Convolutional Compressor:
  - Two convolutional layers (kernel=3, stride=2, padding=1)
  - Token reduction: 4096 tokens → 256 tokens (16× compression)
  - Dimension expansion: 256 → 1024 channels
  - Critical role: Reduces token count before expensive global attention
- CLIP-large (300M parameters):
  - Input: 256 compressed tokens from SAM (not raw images)
  - Architecture: 24-layer ViT with dense global attention
  - Output: 1024-dimensional features
  - Key innovation: Operates on pre-compressed features, avoiding memory explosion
Feature Fusion: The final output concatenates CLIP’s patch features (excluding CLS token) with flattened SAM features, producing 2048-dimensional embeddings:
final_features = concat([CLIP_patches, SAM_spatial_features], dim=-1)
→ Shape: [256, 2048]
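To make the token flow concrete, the following is a minimal PyTorch sketch of the pipeline described above. The module names (`sam`, `clip`, `ConvCompressor16x`), the GELU between the two convolutions, and the 256-channel SAM output are illustrative assumptions, not the exact implementation.

```python
import torch
import torch.nn as nn

class ConvCompressor16x(nn.Module):
    """Two stride-2 convs: 64x64 grid (4096 tokens) -> 16x16 grid (256 tokens)."""
    def __init__(self, in_ch: int = 256, out_ch: int = 1024):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(in_ch, out_ch // 2, kernel_size=3, stride=2, padding=1),
            nn.GELU(),  # activation choice is an assumption
            nn.Conv2d(out_ch // 2, out_ch, kernel_size=3, stride=2, padding=1),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: [B, in_ch, 64, 64] -> [B, out_ch, 16, 16]
        return self.net(x)

def deepencoder_forward(sam, compressor, clip, images):
    """images: [B, 3, 1024, 1024] -> fused features [B, 256, 2048]."""
    feat = sam(images)                            # [B, 256, 64, 64], window attention
    feat = compressor(feat)                       # [B, 1024, 16, 16], 16x fewer tokens
    sam_tokens = feat.flatten(2).transpose(1, 2)  # [B, 256, 1024]
    clip_out = clip(sam_tokens)                   # [B, 257, 1024], global attention (CLS + 256 patches)
    clip_patches = clip_out[:, 1:, :]             # drop CLS -> [B, 256, 1024]
    return torch.cat([clip_patches, sam_tokens], dim=-1)  # [B, 256, 2048]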
2.3 Multimodal Projector
The projector transforms 2048-dimensional vision features to LLM embedding space through:
Linear Projection:
Vision features [B, N, 2048] → LLM space [B, N, hidden_size]
Token Formatting with Structure Signals:
For tiled images, we implement 2D spatial structure encoding:
- Local tiles arrangement:
  - Reshape tiles into spatial grid: [h_tiles×h2, w_tiles×w2, hidden_size]
  - Insert newline tokens after each row: [h_tiles×h2, w_tiles×w2+1, hidden_size]
  - Flatten: [(h_tiles×h2) × (w_tiles×w2+1), hidden_size]
- Global view formatting:
  - Similarly add newline tokens: [h × (w+1), hidden_size]
- View separation:
  - Concatenate: [local_tokens, global_tokens, view_separator]
Design rationale: The newline and separator tokens help the LLM understand spatial structure and distinguish between different image views, crucial for parsing complex document layouts.
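A minimal sketch of this formatting logic follows, assuming a learned newline embedding and view separator; the class/attribute names and the hidden size (3584, matching Qwen2-7B-Instruct) are stated for illustration, and the actual VILA projector differs in details.

```python
import torch
import torch.nn as nn

def format_view(tokens_2d: torch.Tensor, newline_embed: torch.Tensor) -> torch.Tensor:
    """tokens_2d: [H, W, D] grid of projected vision tokens.
    Appends a newline token at the end of every row, then flattens row-major."""
    H, W, D = tokens_2d.shape
    newline = newline_embed.expand(H, 1, D)             # [H, 1, D]
    with_newlines = torch.cat([tokens_2d, newline], 1)  # [H, W+1, D]
    return with_newlines.reshape(H * (W + 1), D)        # [H*(W+1), D]

class Projector(nn.Module):
    def __init__(self, vision_dim: int = 2048, hidden_size: int = 3584):
        super().__init__()
        self.proj = nn.Linear(vision_dim, hidden_size)   # 2048 -> LLM embedding space
        self.newline = nn.Parameter(torch.zeros(1, 1, hidden_size))
        self.view_sep = nn.Parameter(torch.zeros(1, hidden_size))

    def forward(self, local_grid: torch.Tensor, global_grid: torch.Tensor) -> torch.Tensor:
        # local_grid: [h_tiles*h2, w_tiles*w2, 2048], global_grid: [h, w, 2048]
        local_tokens = format_view(self.proj(local_grid), self.newline)
        global_tokens = format_view(self.proj(global_grid), self.newline)
        # local tiles, then the global view, then a separator token
        return torch.cat([local_tokens, global_tokens, self.view_sep], dim=0)
```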
2.4 Decoder Architecture
While the original DeepSeek-OCR uses DeepSeek-3B-MoE (570M activated parameters), our reproduction employs Qwen2-7B-Instruct for two practical reasons:
- Framework compatibility: Better integration with VILA training pipeline
- Accessibility: Fully open-source with permissive licensing
This substitution represents a reasonable trade-off between reproduction fidelity and practical implementation constraints.
3. Training Methodology
3.1 Data Preparation
Our training follows a two-stage curriculum:
Stage 1: Vision-Language Alignment
- Dataset: LLaVA-CC3M-Pretrain-595K
- Composition: General image-caption pairs
- Purpose: Learn basic vision-to-language mapping
- Size: 595K samples
Stage 2: OCR-Specific Pretraining
- Dataset: olmOCR-mix-1025
- Composition: PDF documents and images with OCR annotations
- Size: ~260K samples
- Coverage: English and multilingual documents
Data Format: Each training sample consists of:
- Image input: [pixel_values, images_crop, images_spatial_crop]
- Text prompt: "<image>\nFree OCR." (for plain text extraction)
- Ground truth: Document text content
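For concreteness, a hypothetical layout of one sample is shown below; the field names follow the list above, while the tensor shapes, the 2×2 tile grid, and the `<image>` placeholder token are illustrative assumptions (the actual VILA loader keys may differ).

```python
import torch

sample = {
    "pixel_values": torch.zeros(3, 1024, 1024),    # resized global view of the page
    "images_crop": torch.zeros(4, 3, 1024, 1024),  # local high-resolution tiles (2x2 grid here)
    "images_spatial_crop": torch.tensor([2, 2]),   # tile grid layout (h_tiles, w_tiles)
    "prompt": "<image>\nFree OCR.",                # plain-text extraction instruction
    "ground_truth": "...",                         # target document text
}
```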
3.2 Two-Stage Training Pipeline
Stage 1: Projector Alignment (1 epoch)
Training configuration:
- Trainable components: Multimodal projector only
- Frozen components: DeepEncoder (SAM + CLIP), LLM
- Batch size: 512 (global)
- Learning rate: 1e-3 with cosine schedule
- Warmup ratio: 0.03
- Sequence length: 4096 tokens
- Optimization: AdamW with ZeRO-3 offloading
Objective: Initialize the projector to map frozen vision features into the LLM’s semantic space without catastrophic forgetting.
Stage 2: Full Model Pretraining (1 epoch)
Training configuration:
- Trainable components: Multimodal projector + LLM
- Frozen components: DeepEncoder (SAM + CLIP)
- Batch size: 32 (global)
- Learning rate: 5e-5 with cosine schedule
- Warmup ratio: 0.03
- Sequence length: 4096 tokens
- Gradient checkpointing: Enabled for memory efficiency
Objective: Fine-tune the entire vision-language pipeline on OCR-specific data while preserving the vision encoder’s pre-trained representations.
Rationale for freezing DeepEncoder:
- Stability: SAM and CLIP are already well-trained on massive datasets
- Memory: Enables training on modest GPU infrastructure (2× H200)
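A short sketch of how this freezing scheme can be expressed; the attribute names (`vision_tower`, `mm_projector`, `llm`) are placeholders for the corresponding VILA modules.

```python
def configure_trainable(model, stage: int) -> None:
    # DeepEncoder (SAM + CLIP) stays frozen in both stages.
    for p in model.vision_tower.parameters():
        p.requires_grad = False
    # The multimodal projector trains in both stages.
    for p in model.mm_projector.parameters():
        p.requires_grad = True
    # The LLM is frozen in Stage 1 and unfrozen in Stage 2.
    for p in model.llm.parameters():
        p.requires_grad = (stage == 2)
```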
3.3 Training Optimizations
Memory Management:
- ZeRO-3 optimization for distributed training
- Gradient checkpointing to reduce activation memory
- Mixed precision training (bfloat16)
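A representative sketch of these settings as a DeepSpeed configuration dictionary; the exact values and offload choices here are illustrative, not our precise config.

```python
ds_config = {
    "bf16": {"enabled": True},                   # mixed-precision training in bfloat16
    "zero_optimization": {
        "stage": 3,                              # ZeRO-3 parameter/optimizer sharding
        "offload_optimizer": {"device": "cpu"},  # optimizer-state offloading
        "offload_param": {"device": "cpu"},
    },
}
# Gradient checkpointing is enabled on the model itself
# (e.g. model.gradient_checkpointing_enable() in Hugging Face Transformers).
```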
Distributed Training:
- Pipeline parallelism for DeepEncoder
- Synchronization fix for conditional code paths (critical for stability)
Key Implementation Fix: We discovered a critical distributed training bug where different ranks took different code paths based on whether an image had tiles. We fixed this by synchronizing the condition across all processes using all_reduce, ensuring deterministic execution.
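A sketch of the fix, assuming a per-rank boolean condition (here, "does this image have tiles?"); the function name and the choice of a MAX reduction are illustrative.

```python
import torch
import torch.distributed as dist

def sync_condition(local_condition: bool) -> bool:
    """Agree on a single branch across all ranks before any collectives run."""
    flag = torch.tensor([1.0 if local_condition else 0.0], device="cuda")
    # With MAX, if any rank sees tiles, every rank takes the tiled code path,
    # so all processes execute the same sequence of collective operations.
    dist.all_reduce(flag, op=dist.ReduceOp.MAX)
    return bool(flag.item())
```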
4. Experimental Results
4.1 OmniDocBench Evaluation
OmniDocBench is a comprehensive document parsing benchmark covering diverse document types and parsing tasks.

Analysis:
- Text Recognition: Our model achieves 0.093 edit distance on English text, slightly behind DeepSeek-OCR Base (0.054) but significantly better than most end-to-end models using 3949+ tokens.
- Table Parsing: Competitive performance (0.142 vs 0.163) demonstrates effective structural understanding despite using similar token counts.
- Formula Recognition: The gap in formula parsing (0.493 vs 0.267) suggests our training data lacked sufficient mathematical content, a known limitation of olmOCR-mix-1025.
- Chinese Performance: Larger gap in Chinese documents (0.509 vs 0.240) indicates insufficient multilingual training data in our pipeline.
Key Achievement: With ~250 vision tokens, we achieve comparable performance to models using 3949+ tokens, validating the optical compression hypothesis.
4.2 olmOCR Benchmark Evaluation
The olmOCR benchmark tests diverse document understanding capabilities across eight categories.

Strengths:
- Simple documents: Near-perfect performance on base documents (99.5) demonstrates strong fundamental OCR capability.
- Layout understanding: Strong performance on headers/footers (94.3) shows effective spatial structure modeling.
- Table structure: Better-than-expected table performance (70.3) validates the 2D spatial encoding approach.
Weaknesses:
- Complex layouts: Significant gap on ArXiv papers (51.2) and multi-column documents (68.1) suggests difficulty with complex spatial arrangements.
- Degraded quality: Poor performance on “long tiny text” (45.5) indicates resolution limitations when text is extremely small or dense.
- Old scans: Gap on historical documents (60.3) suggests insufficient training on degraded/noisy images.
5. Discussion
5.1 Architectural Insights
Success Factors:
- Window + Global Attention Hierarchy:
  - SAM's window attention handles high-resolution perception efficiently
  - CLIP's global attention integrates holistic understanding
  - This two-stage design avoids the memory explosion of pure global attention
- Convolutional Compression:
  - 16× compression before global attention is crucial
  - Simple conv layers (3×3, stride 2) work effectively
  - Preserves spatial structure better than naive pooling
- Feature Concatenation:
  - Combining SAM (local) and CLIP (semantic) features provides complementary information
  - The 2048-dim combined features are richer than either alone
  - Simple concatenation outperforms complex fusion mechanisms
5.2 Performance Analysis
Why the Performance Gap?
- Training Data Quantity:
  - Our training: ~260K OCR samples
  - SOTA models: Likely 1M+ samples with synthetic augmentation
  - Impact: Insufficient exposure to diverse layouts and degradations
- Training Data Quality:
  - olmOCR-mix-1025 lacks mathematical formulas, old scans, and edge cases
  - No synthetic data for long-tail scenarios
  - Evidence: Large gaps on ArXiv (math-heavy) and old scans
- Training Techniques:
  - We used only basic supervised learning
  - SOTA models use RLVR, dynamic temperature, and model souping
  - Potential gain: The olmOCR ablation study shows +14.2 from advanced techniques
- Prompt Engineering:
  - We used a single generic prompt: “Free OCR.”
  - SOTA models likely use task-specific prompts
  - Evidence: Ablation shows +3.0 from “better prompting”
- Architecture Differences:
  - Qwen2-7B-Instruct may not be optimized for dense text
  - The original MoE design might have task-specific experts
  - Hypothesis: Cannot verify without direct comparison
6. Limitations and Future Work
6.1 Current Limitations
1. Training Infrastructure:
   - Limited to 1 epoch due to computational constraints
   - Cannot fully explore learning rate schedules and convergence
   - No hyperparameter tuning due to cost
2. Data Limitations:
   - A single dataset (olmOCR-mix-1025) is insufficient for generalization; it is mostly English, which leads to poor performance on the Chinese test set
   - Lack of mathematical content hurts ArXiv performance
   - No training on old scans or degraded images
6.2 Future Research Directions
Immediate Improvements:
- Data Augmentation:
  - Synthesize mathematical content for formula recognition
  - Generate degraded images for robustness
  - Include more multilingual documents
- Training Enhancements:
  - Implement dynamic temperature scaling
  - Add RLVR for iterative improvement
  - Explore model souping for ensemble benefits
- Architecture Experiments:
  - Test with different LLM decoders
  - Ablate window size in SAM
  - Experiment with different compression ratios
Long-term Exploration:
- Beyond OCR:
  - Apply to screenshot understanding
  - Extend to UI element detection
  - Explore code screenshot → code generation
- Efficiency Optimization:
  - Quantization for deployment
  - Knowledge distillation to smaller models
  - Optimize for edge devices
7. Conclusion
This work presents DeepOCR, a successful reproduction of DeepSeek-OCR’s innovative optical context compression approach. Our implementation validates the core architectural hypothesis: combining window attention (SAM), global attention (CLIP), and convolutional compression enables efficient high-resolution document processing with significantly fewer tokens than conventional approaches.
Key Achievements:
- Architecture Validation: Successfully reproduced the 380M parameter DeepEncoder, demonstrating that the SAM+CLIP+compression design enables practical vision-text compression.
- Open-Source Contribution: Released complete, working code in the VILA framework, providing the research community with an accessible starting point for optical compression research.
- Empirical Evidence: Achieved competitive performance on two benchmarks using 15× fewer tokens than baseline VLMs, supporting the feasibility of optical compression.
Performance Summary:
- OmniDocBench: 0.356 overall edit distance with ~250 tokens
- olmOCR: 65.5 average score across 8 document categories
Critical Insights:
The performance gap compared to state-of-the-art (17.6 points on olmOCR) stems primarily from training data and methodology rather than architectural deficiencies. The ablation study in the olmOCR paper shows similar baseline performance (68.2) improving to 82.4 through advanced techniques—suggesting our reproduction captures the fundamental architecture successfully.
Acknowledgments
This reproduction builds upon the VILA framework and leverages pre-trained SAM and CLIP models. We thank the DeepSeek team for their innovative work and detailed technical report, which made this reproduction possible.
References
[1] DeepSeek-AI. (2025). DeepSeek-OCR: Contexts Optical Compression.
[2] VILA Framework. NVIDIA Research.
[3] Segment Anything Model (SAM). Meta AI Research.
[4] CLIP. OpenAI.
[5] Qwen2-VL. Qwen Team.
[6] OmniDocBench. Document parsing benchmark.
[7] olmOCR. Allen Institute for AI.