An Overview of ByteDance’s Document Parsing Model, Dolphin

Introduction

ByteDance is a Chinese technology company that has developed novel video-sharing social networking applications, most notably TikTok. It has also made impressive contributions to the AI industry, such as open-sourcing Monolith, a high-throughput, low-latency deep learning framework for large-scale recommendation modeling. There has been a slew of recent releases from ByteDance, including Bagel, an open-source multimodal foundation model with image generation and editing capabilities; Trae, an AI assistant designed for programmers that can answer coding queries, complete code snippets, and develop entire projects from prompts; DAPO, a distributed reinforcement-learning framework for LLM optimization; and UI-TARS, an open-source agent for automating GUI interactions.

Additionally, they introduced Dolphin (Document Image Parsing via Heterogeneous Anchor Prompting), a new multimodal document image parsing model. We’ve been covering open-source document processing models in our SmolDocling (from IBM Research and Hugging Face) and olmOCR/rolmOCR (from AllenAI and Reducto) articles, and we’re therefore very excited to explore ByteDance’s contribution to the space.

Prerequisites

This tutorial has two parts: (1) an overview covering the model architecture and training methodology, and (2) an implementation where we run the model. We’ll show you how to run Dolphin on DigitalOcean GPU Droplets.

The overview section of this article requires familiarity with transformers, the attention mechanism (self-attention and cross-attention), and Vision Language Models (VLMs). The implementation section may require some familiarity with the command line.

Feel free to skip sections that aren’t of use to you.

Motivation for Developing Dolphin

The researchers developed Dolphin to address limitations in current integration-based document parsing solutions and Vision Language Model (VLM) solutions.

Existing document parsing methods mentioned in the Dolphin paper are listed below. The repositories are linked for your exploration. If the repository was not found, blog posts or papers are linked.


Integration-based Document Parsing: Traditional document parsing solutions use a multistage pipeline with multiple specialized models, starting with layout detection and followed by dedicated recognizers for each element type.

General VLMs: Large vision-language models can handle document parsing tasks without task-specific training, thanks to their zero-shot capabilities developed through large-scale pre-training on diverse visual data.

Expert VLMs: Specialized document parsing models, which are fine-tuned for document-specific challenges and often outperform other models on document parsing benchmarks.


In the paper, the researchers explain that integration-based document parsing solutions require independent optimization of each OCR task (e.g., layout detection, reading order prediction, and recognition of text lines, formulas, and tables), while existing VLM solutions (both general and expert VLMs) suffer from layout structure degradation and efficiency bottlenecks when parsing lengthy documents with complicated layouts. To address these limitations, ByteDance proposes Dolphin for document processing.

What is Dolphin?

Dolphin follows an “analyze-then-parse” approach to extract structured content from documents. In the first stage (analyze), the model analyzes the page-level layout and extracts the elements in reading order. In the second stage (parse), the extracted elements are parsed individually, in parallel. This parallel processing, paired with prompts tailored to specific element types, makes Dolphin both computationally efficient and accurate at identifying content.

Architecture

Dolphin uses an encoder-decoder transformer architecture. The encoder is a Swin Transformer that takes the page image as input and outputs a sequence of visual embeddings. Conditioned on the prompt “Parse the reading order of this document.”, the mBART decoder attends to these encoded visual features via cross-attention and generates the sequence of layout elements while preserving their structural relationships.
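
To make this concrete, here is a minimal sketch of stage one, run against the checkpoint and demo image we download later in this tutorial. It assumes the Hugging Face checkpoint loads as a standard Donut-style VisionEncoderDecoderModel with an AutoProcessor; the repository’s demo scripts wrap this with the exact prompt template, batching, and post-processing, so treat the snippet as illustrative rather than the official API.

# Minimal sketch of stage 1 (layout analysis), assuming a Donut-style
# VisionEncoderDecoderModel checkpoint; the official demo scripts handle
# the exact prompt formatting and post-processing.
from PIL import Image
from transformers import AutoProcessor, VisionEncoderDecoderModel

processor = AutoProcessor.from_pretrained("./hf_model")
model = VisionEncoderDecoderModel.from_pretrained("./hf_model").eval()

image = Image.open("./demo/page_imgs/page_1.jpeg").convert("RGB")
pixel_values = processor(image, return_tensors="pt").pixel_values  # Swin encoder input

prompt_ids = processor.tokenizer(
    "Parse the reading order of this document.",
    add_special_tokens=False, return_tensors="pt"
).input_ids  # layout-analysis prompt fed to the mBART decoder

outputs = model.generate(pixel_values, decoder_input_ids=prompt_ids, max_new_tokens=1024)
print(processor.tokenizer.decode(outputs[0], skip_special_tokens=True))  # layout elements in reading order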

The second stage uses the layout elements from stage one to parse content in parallel, which keeps element-specific detail while remaining efficient. This happens in two steps, sketched in pseudocode below:

Element Image Encoding: Each layout element’s region is cropped from the original image to create a local view. These views are encoded using the Swin Transformer to produce element-specific visual features.

Parallel Content Parsing: With these encoded features, the decoder generates parsed content for each element in parallel.
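
Putting the two stages together, the control flow looks roughly like the pseudocode below. The names analyze_layout, crop_region, ELEMENT_PROMPTS, and parse_batch are hypothetical placeholders used only to illustrate the flow; they are not functions from the Dolphin repository.

# Conceptual pseudocode for Dolphin's analyze-then-parse flow.
# analyze_layout, crop_region, ELEMENT_PROMPTS, and parse_batch are
# hypothetical helpers used only to illustrate the control flow.
def parse_document(page_image, model):
    # Stage 1: page-level layout analysis; elements come back in reading order
    elements = model.analyze_layout(
        page_image, prompt="Parse the reading order of this document."
    )

    # Stage 2a: crop each element's region from the page to get a local view
    crops = [crop_region(page_image, el.bbox) for el in elements]

    # Stage 2b: pair each crop with an element-specific prompt (text, table, formula, ...)
    prompts = [ELEMENT_PROMPTS[el.type] for el in elements]

    # Decode all elements in parallel batches for efficiency
    parsed = parse_batch(model, crops, prompts, max_batch_size=16)

    # Reassemble results in the reading order established in stage 1
    return [{"type": el.type, "content": text} for el, text in zip(elements, parsed)]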

Training

Dolphin was initialized with pretrained weights from Donut. The instruction-tuning dataset includes 30 million samples, covering both page-level documents and element-level components, and the paper details the different data formats and how each was processed for either Dolphin’s layout-analysis or parsing stage.

Implementation

Dolphin offers two inference frameworks that support parsing documents at two different levels. The first is page-level parsing, where the entire document page is converted into structured JSON and Markdown output. The second is element-level parsing, which breaks the document down into individual components, such as text, tables, and formulas, for more detailed analysis.

Step 1: Set up a GPU Droplet

Begin by setting up a DigitalOcean GPU Droplet: select AI/ML and choose the NVIDIA H100 option.

Step 2: SSH

SSH into the Droplet from your favorite code editor or terminal:

ssh root@<your IPv4 address here>        

Step 3: Install Dependencies

In the terminal, copy and paste the following code snippet:

apt install python3-pip python3.10        

Step 4: Install Conda

Download the Miniconda installer:

wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh        

Run the installer:

bash Miniconda3-latest-Linux-x86_64.sh        

Follow the installer prompts, then restart your shell (or run source ~/.bashrc) so the conda command is available. Now let’s set up the Dolphin project and download the necessary model.

Step 5: Create a Conda Environment and Clone Dolphin

conda create -n ai python=3.11 -y && conda activate ai
git clone https://github.com/ByteDance/Dolphin.git && cd Dolphin        

These commands create a new Conda environment named ai with Python 3.11, activate it, then clone the Dolphin repository and move into its directory.

Step 6: Install Python Requirements

Next, install all the Python libraries Dolphin needs:

pip install -r requirements.txt huggingface_hub        

This installs everything listed in Dolphin’s requirements.txt file, plus huggingface_hub for interacting with Hugging Face.

Step 7: Prepare for Model Download

We need a spot to save the model:

mkdir hf_model        

Log in to Hugging Face: You’ll need a Hugging Face access token to download the model. If you don’t have one, create it on the Hugging Face website under your profile settings (Settings -> Access Tokens).

Then, log in via the command line:

huggingface-cli login        

Paste your token when prompted.

Step 8: Download the Dolphin Model

Finally, download the model files directly into your new directory:

huggingface-cli download ByteDance/Dolphin --local-dir ./hf_model        
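
If you prefer to do this from Python instead of the CLI, huggingface_hub’s snapshot_download fetches the same files into the same directory (a small equivalent sketch):

# Alternative to the CLI command above: download the Dolphin checkpoint via huggingface_hub.
from huggingface_hub import snapshot_download

snapshot_download(repo_id="ByteDance/Dolphin", local_dir="./hf_model")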

Step 9: Run Inference

Let’s run inference on the images provided in the Dolphin demo folder.

# Process a single document image
python demo_page_hf.py --model_path ./hf_model --input_path ./demo/page_imgs/page_1.jpeg --save_dir ./results

# Process a single document pdf
python demo_page_hf.py --model_path ./hf_model --input_path ./demo/page_imgs/page_6.pdf --save_dir ./results

# Process all documents in a directory
python demo_page_hf.py --model_path ./hf_model --input_path ./demo/page_imgs --save_dir ./results

# Process with custom batch size for parallel element decoding
python demo_page_hf.py --model_path ./hf_model --input_path ./demo/page_imgs --save_dir ./results --max_batch_size 16        
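
After any of these commands finish, you can quickly list what was written to the output directory (a small sketch, assuming the --save_dir ./results used above):

# List the files the demo wrote to the --save_dir directory.
from pathlib import Path

for path in sorted(Path("./results").rglob("*")):
    print(path)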

Let’s take a look at the page_1 output. The demo saves both a Markdown version and a JSON version of the parsed page under ./results; open them to compare the structured output against the original page.

We’re quite pleased with the model’s performance! Try both page-level and element-level parsing (the repository also ships element-level demo scripts) and let us know what you think in the comments below.

Conclusion

In summary, ByteDance’s Dolphin model presents a promising approach to document parsing through its analyze-then-parse strategy. By leveraging heterogeneous anchor prompting, it achieves both accuracy and efficiency, addressing limitations found in existing integration-based and VLM solutions. We went over the model architecture and training process, and we showed how to run Dolphin’s page-level and element-level parsing options on DigitalOcean GPU Droplets.

Happy experimenting!


Thanks for learning with the DigitalOcean Community. Check out our offerings for compute, storage, networking, and managed databases.

Learn more about our products.


