An Overview of ByteDance’s Document Parsing Model, Dolphin

Introduction

ByteDance is a Chinese technology company that has developed novel video-sharing social networking applications, most notably TikTok. It has also made impressive contributions to the AI industry, such as open-sourcing Monolith, a high-throughput, low-latency deep learning framework for large-scale recommendation modeling. There has been a slew of recent releases from ByteDance, including Bagel, an open-source multimodal foundation model with image generation and editing capabilities; Trae, an AI assistant designed for programmers that can answer coding queries, complete code snippets, and develop entire projects from prompts; DAPO, a distributed reinforcement-learning framework for LLM optimization; and UI-TARS, an open-source agent for automating GUI interactions.

Additionally, they introduced Dolphin (Document Image Parsing via Heterogeneous Anchor Prompting), a new multimodal document image parsing model. We’ve been covering open-source document processing models in our SmolDocling (from IBM Research and Hugging Face) and olmOCR/rolmOCR (from AllenAI and Reducto) articles, and we’re therefore very excited to explore ByteDance’s contribution to the space.

Prerequisites

This tutorial has two parts: (1) an overview covering the model architecture and training methodology, and (2) an implementation where we run the model. We’ll show you how to run Dolphin on DigitalOcean GPU Droplets.

The overview section of this article requires familiarity with transformers, the attention mechanism (self-attention and cross-attention), and Vision Language Models (VLMs). The implementation section may require some familiarity with the command line.

Feel free to skip sections that aren’t of use to you.

Motivation for Developing Dolphin

The researchers developed Dolphin to address limitations in current integration-based document parsing solutions and Vision Language Model (VLM) solutions.

Existing document parsing methods mentioned in the Dolphin paper are listed below. The repositories are linked for your exploration. If the repository was not found, blog posts or papers are linked.


Integration-based Document Parsing: Traditional document parsing solutions use a multistage pipeline with multiple specialized models, starting with layout detection and followed by dedicated recognizers for each element type.

General VLMs: Large vision-language models can handle document parsing tasks without task-specific training, thanks to their zero-shot capabilities developed through large-scale pre-training on diverse visual data.

Expert VLMs: Specialized document parsing models, which are fine-tuned for document-specific challenges and often outperform other models on document parsing benchmarks.


In the paper, the researchers explain that integration-based document parsing solutions require independent optimization of each OCR task (e.g., layout detection, reading order prediction, and recognition of text lines, formulas, and tables), while existing VLM solutions (both general and expert VLMs) suffer from layout structure degradation and efficiency bottlenecks when parsing lengthy documents with complicated layouts. To address these limitations, ByteDance proposes Dolphin for document processing.

What is Dolphin?

Dolphin follows an “analyze-then-parse” approach to extract structured content from documents. In the first stage (analyze), the model analyzes the page-level layout and extracts the elements in reading order. In the second stage (parse), the extracted elements are parsed individually, in parallel. This parallel processing, paired with prompts tailored to specific element types, makes Dolphin both computationally efficient and accurate at identifying content.

Architecture

Dolphin uses an encoder-decoder transformer architecture. The encoder is a Swin Transformer that takes the page image as input and outputs a sequence of visual embeddings. Conditioned on the prompt “Parse the reading order of this document.”, the mBART decoder attends to these encoded visual features via cross-attention and generates the sequence of layout elements while preserving their structural relationships.
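
To make this concrete, here is a minimal sketch of stage one, run against the checkpoint and demo image we download later in this tutorial. It assumes the Hugging Face checkpoint loads as a standard Donut-style VisionEncoderDecoderModel with an AutoProcessor; the repository’s demo scripts wrap this with the exact prompt template, batching, and post-processing, so treat the snippet as illustrative rather than the official API.

# Minimal sketch of stage 1 (layout analysis), assuming a Donut-style
# VisionEncoderDecoderModel checkpoint; the official demo scripts handle
# the exact prompt formatting and post-processing.
from PIL import Image
from transformers import AutoProcessor, VisionEncoderDecoderModel

processor = AutoProcessor.from_pretrained("./hf_model")
model = VisionEncoderDecoderModel.from_pretrained("./hf_model").eval()

image = Image.open("./demo/page_imgs/page_1.jpeg").convert("RGB")
pixel_values = processor(image, return_tensors="pt").pixel_values  # Swin encoder input

prompt_ids = processor.tokenizer(
    "Parse the reading order of this document.",
    add_special_tokens=False, return_tensors="pt"
).input_ids  # layout-analysis prompt fed to the mBART decoder

outputs = model.generate(pixel_values, decoder_input_ids=prompt_ids, max_new_tokens=1024)
print(processor.tokenizer.decode(outputs[0], skip_special_tokens=True))  # layout elements in reading order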

The second stage uses the layout elements from stage one to parse content in parallel, which keeps element-specific detail while remaining efficient. This happens in two steps, sketched in pseudocode below:

Element Image Encoding: Each layout element’s region is cropped from the original image to create a local view. These views are encoded using the Swin Transformer to produce element-specific visual features.

Parallel Content Parsing: With these encoded features, the decoder generates parsed content for each element in parallel.
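
Putting the two stages together, the control flow looks roughly like the pseudocode below. The names analyze_layout, crop_region, ELEMENT_PROMPTS, and parse_batch are hypothetical placeholders used only to illustrate the flow; they are not functions from the Dolphin repository.

# Conceptual pseudocode for Dolphin's analyze-then-parse flow.
# analyze_layout, crop_region, ELEMENT_PROMPTS, and parse_batch are
# hypothetical helpers used only to illustrate the control flow.
def parse_document(page_image, model):
    # Stage 1: page-level layout analysis; elements come back in reading order
    elements = model.analyze_layout(
        page_image, prompt="Parse the reading order of this document."
    )

    # Stage 2a: crop each element's region from the page to get a local view
    crops = [crop_region(page_image, el.bbox) for el in elements]

    # Stage 2b: pair each crop with an element-specific prompt (text, table, formula, ...)
    prompts = [ELEMENT_PROMPTS[el.type] for el in elements]

    # Decode all elements in parallel batches for efficiency
    parsed = parse_batch(model, crops, prompts, max_batch_size=16)

    # Reassemble results in the reading order established in stage 1
    return [{"type": el.type, "content": text} for el, text in zip(elements, parsed)]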

Training

Dolphin was initialized with pretrained weights from Donut. The instruction-tuning dataset includes 30 million samples, covering both page-level documents and element-level components, and the paper details the different data formats and how each was processed for either Dolphin’s layout-analysis or parsing stage.

Implementation

Dolphin offers two inference frameworks that support parsing documents at two different levels. The first is page-level parsing, where the entire document page is converted into structured JSON and Markdown output. The second is element-level parsing, which breaks the document down into individual components, such as text, tables, and formulas, for more detailed analysis.

Step 1: Set up a GPU Droplet

Begin by setting up a DigitalOcean GPU Droplet: select AI/ML and choose the NVIDIA H100 option.

Step 2: SSH

SSH into the Droplet from your favorite code editor or terminal:

ssh root@<your IPv4 address here>        

Step 3: Install Dependencies

In the terminal, copy and paste the following code snippet:

apt install python3-pip python3.10        

Step 4: Install Conda

Download the Miniconda installer:

wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh        

Run the installer:

bash Miniconda3-latest-Linux-x86_64.sh        

Follow the installer prompts, then restart your shell (or run source ~/.bashrc) so the conda command is available. Now let’s set up the Dolphin project and download the necessary model.

Step 5: Create a Conda Environment and Clone Dolphin

conda create -n ai python=3.11 -y && conda activate ai
git clone https://github.com/ByteDance/Dolphin.git && cd Dolphin        

These commands create a new Conda environment named ai with Python 3.11, activate it, then clone the Dolphin repository and move into its directory.

Step 6: Install Python Requirements

Next, install all the Python libraries Dolphin needs:

pip install -r requirements.txt huggingface_hub        

This installs everything listed in Dolphin’s requirements.txt file, plus huggingface_hub for interacting with Hugging Face.

Step 7: Prepare for Model Download

We need a spot to save the model:

mkdir hf_model        

Log in to Hugging Face: You’ll need a Hugging Face access token to download the model. If you don’t have one, create it on the Hugging Face website under your profile settings (Settings -> Access Tokens).

Then, log in via the command line:

huggingface-cli login        

Paste your token when prompted.

Step 8: Download the Dolphin Model

Finally, download the model files directly into your new directory:

huggingface-cli download ByteDance/Dolphin --local-dir ./hf_model        
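
If you prefer to do this from Python instead of the CLI, huggingface_hub’s snapshot_download fetches the same files into the same directory (a small equivalent sketch):

# Alternative to the CLI command above: download the Dolphin checkpoint via huggingface_hub.
from huggingface_hub import snapshot_download

snapshot_download(repo_id="ByteDance/Dolphin", local_dir="./hf_model")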

Step 9: Run Inference

Let’s run inference on the images provided in the Dolphin demo folder.

# Process a single document image
python demo_page_hf.py --model_path ./hf_model --input_path ./demo/page_imgs/page_1.jpeg --save_dir ./results

# Process a single document pdf
python demo_page_hf.py --model_path ./hf_model --input_path ./demo/page_imgs/page_6.pdf --save_dir ./results

# Process all documents in a directory
python demo_page_hf.py --model_path ./hf_model --input_path ./demo/page_imgs --save_dir ./results

# Process with custom batch size for parallel element decoding
python demo_page_hf.py --model_path ./hf_model --input_path ./demo/page_imgs --save_dir ./results --max_batch_size 16        
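
After any of these commands finish, you can quickly list what was written to the output directory (a small sketch, assuming the --save_dir ./results used above):

# List the files the demo wrote to the --save_dir directory.
from pathlib import Path

for path in sorted(Path("./results").rglob("*")):
    print(path)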

Let’s take a look at the page_1 output. The demo saves both a Markdown version and a JSON version of the parsed page under ./results; open them to compare the structured output against the original page.

We’re quite pleased with the model’s performance! Try both page-level and element-level parsing (the repository also ships element-level demo scripts) and let us know what you think in the comments below.

Conclusion

In summary, ByteDance’s Dolphin model presents a promising approach to document parsing through its analyze-then-parse strategy. By leveraging heterogeneous anchor prompting, it achieves both accuracy and efficiency, addressing limitations found in existing integration-based and VLM solutions. We went over the model architecture and training process, and we showed how to run Dolphin’s page-level and element-level parsing options on DigitalOcean GPU Droplets.

Happy experimenting!


Thanks for learning with the DigitalOcean Community. Check out our offerings for compute, storage, networking, and managed databases.

Learn more about our products.


