---
sidebar_position: -4
slug: /select_pdf_parser
---

# Select PDF parser

Select a visual model for parsing your PDFs.

---

RAGFlow isn't one-size-fits-all. It is built for flexibility and supports deeper customization to accommodate more complex use cases. From v0.17.0 onwards, RAGFlow decouples DeepDoc-specific data extraction tasks from chunking methods **for PDF files**. This separation lets you choose the visual model that performs OCR (Optical Character Recognition), TSR (Table Structure Recognition), and DLR (Document Layout Recognition), so you can trade off speed against performance for your specific use case. If your PDFs contain only plain text, you can skip these tasks by selecting the **Naive** option, which reduces the overall parsing time.

![data extraction](https://raw.githubusercontent.com/infiniflow/ragflow-docs/main/images/data_extraction.jpg)

## Prerequisites

- The PDF parser dropdown menu appears only when you select a chunking method compatible with PDFs, including:
    - **General**
    - **Manual**
    - **Paper**
    - **Book**
    - **Laws**
    - **Presentation**
    - **One**
- To use a third-party visual model for parsing PDFs, ensure you have set a default img2txt model under **Set default models** on the **Model providers** page.

## Quickstart

1. On your dataset's **Configuration** page, select a chunking method, for example **General**.

   _The **PDF parser** dropdown menu appears._

2. Select the option that best fits your scenario:

   - **DeepDoc** (default): The built-in visual model, which performs OCR, TSR, and DLR tasks on PDFs. This can be time-consuming.
   - **Naive**: Skips OCR, TSR, and DLR tasks. Use it only if *all* your PDFs are plain text.
   - **MinerU**: An experimental feature.
   - A third-party visual model provided by a specific model provider.

:::danger IMPORTANT
MinerU PDF document parsing is available starting from v0.21.1. To use this feature, follow these steps:

1. Before deploying ragflow-server, update your **docker/.env** file:  
   - Enable `HF_ENDPOINT=https://hf-mirror.com`
   - Add a MinerU entry: `MINERU_EXECUTABLE=/ragflow/uv_tools/.venv/bin/mineru`

2. Start the ragflow-server and run the following commands inside the container:  

```bash
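# Run from /ragflow so the venv matches the MINERU_EXECUTABLE path set in step 1
mkdir -p uv_tools
cd uv_tools
# Create and activate a dedicated virtual environment for MinerU
uv venv .venv
source .venv/bin/activate
# Install MinerU from the Aliyun PyPI mirror
uv pip install -U "mineru[core]" -i https://mirrors.aliyun.com/pypi/simple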
```

3. Restart the ragflow-server.
4. In the web UI, navigate to the **Configuration** page of your dataset. In the **Ingestion pipeline** section, click **Built-in**, select a chunking method that supports PDF parsing from the **Built-in** dropdown, and then select **MinerU** in **PDF parser**.
5. If you use a custom ingestion pipeline instead, you must also complete the first three steps before selecting **MinerU** in the **Parsing method** section of the **Parser** component.
:::
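
Before deploying, it can help to confirm that both MinerU settings are active in **docker/.env**. The following is a minimal sketch and not part of RAGFlow: `check_mineru_env` is a hypothetical helper name, and it assumes that "enabled" means the key appears on an uncommented `KEY=value` line.

```bash
#!/usr/bin/env bash
# Hypothetical helper (not shipped with RAGFlow): verify that the two
# MinerU-related settings appear as uncommented KEY=value lines in an env file.
check_mineru_env() {
  local env_file="${1:-docker/.env}"
  local key missing=0
  for key in HF_ENDPOINT MINERU_EXECUTABLE; do
    # Lines starting with "#" do not match, so commented-out keys are reported.
    if ! grep -Eqs "^${key}=.+" "$env_file"; then
      echo "not set: ${key}"
      missing=1
    fi
  done
  return "$missing"
}
```

For example, `check_mineru_env docker/.env && echo "MinerU settings found"` prints the confirmation only when both keys are present. The check does not validate the values themselves.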

:::caution WARNING
Third-party visual models are marked **Experimental**, because we have not fully tested these models for the aforementioned data extraction tasks.
:::

## Frequently asked questions

### When should I select DeepDoc or a third-party visual model as the PDF parser?

Use a visual model to extract data if your PDFs contain formatted or image-based text rather than plain text. DeepDoc is the default visual model but can be time-consuming. You can also choose a lightweight or high-performance img2txt model depending on your needs and hardware capabilities.
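
If you are unsure whether a PDF is image-based, one rough check is to see how much text its embedded text layer contains. The sketch below is a heuristic, not a RAGFlow feature: it assumes `pdftotext` from poppler-utils is installed, the function names are hypothetical, and the 100-character threshold is an illustrative value rather than a tuned one.

```bash
#!/usr/bin/env bash
# Assumption: fewer than MIN_CHARS extracted characters suggests a scanned or
# image-based PDF. The threshold is illustrative only.
MIN_CHARS=100

# Count non-whitespace characters in the PDF's embedded text layer.
# Requires pdftotext (poppler-utils); "-" writes the extracted text to stdout.
count_text_chars() {
  pdftotext "$1" - | tr -d '[:space:]' | wc -c | tr -d ' '
}

# Map a character count to a rough hint about the PDF.
suggest_parser() {
  local chars="$1"
  if [ "$chars" -lt "$MIN_CHARS" ]; then
    echo "needs OCR"
  else
    echo "has text layer"
  fi
}
```

A typical call would be `suggest_parser "$(count_text_chars report.pdf)"`, where `report.pdf` is a placeholder file name. A result of `has text layer` only means that **Naive** is a candidate; PDFs with tables or complex layouts may still benefit from a visual model.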

### Can I select a visual model to parse my DOCX files?

No, you cannot. This dropdown menu is for PDFs only. To use this feature, convert your DOCX files to PDF first.

