Author: Datatera.ai

We are happy to announce that we are open-sourcing Docfold, a Python library for extracting structured data from documents. The project is available on PyPI and GitHub under the MIT license.

What is Docfold

Docfold provides a single interface for 15 document-processing engines. Instead of learning the API of each library separately, you make one call:

from docfold import process

result = process("invoice.pdf", engine="pymupdf")

One interface, 15 engines, predictable output.
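
Because every engine sits behind the same call, comparing engines on one file reduces to a loop. The sketch below is illustrative only: `run_engines` is a hypothetical helper, not part of Docfold. It takes the processing function as a parameter, so the only Docfold call it assumes is the `process(path, engine=...)` form shown above. The contents of the result object and the exceptions an engine may raise are not specified here, so the helper catches any exception and records it.

```python
from typing import Any, Callable, Dict, Iterable, Tuple


def run_engines(
    process_fn: Callable[..., Any],
    path: str,
    engines: Iterable[str],
) -> Tuple[Dict[str, Any], Dict[str, Exception]]:
    """Call process_fn(path, engine=name) once per engine name.

    Returns (results, errors), both keyed by engine name.
    """
    results: Dict[str, Any] = {}
    errors: Dict[str, Exception] = {}
    for name in engines:
        try:
            results[name] = process_fn(path, engine=name)
        except Exception as exc:  # assumption: exception types vary by engine
            errors[name] = exc
    return results, errors
```

With Docfold installed, this could be called as `run_engines(process, "invoice.pdf", ["pymupdf"])`, extending the list with whatever other engine identifiers the Docfold documentation defines.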

Supported engines

Docfold brings together local and cloud solutions:

Local: PyMuPDF, Docling, Marker, MinerU, PaddleOCR, Tesseract, Unstructured, Nougat, Surya
Cloud: LlamaParse, Mistral OCR, AWS Textract, Google Document AI, Azure Document Intelligence
LLM-based: Zerox (requires Python 3.11+)

Each engine has its strengths. PyMuPDF excels at native PDFs, Tesseract handles scanned documents well, and Nougat is optimized for academic papers.
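
The Python 3.11+ requirement for Zerox can be checked before that engine is requested. The snippet below is a small sketch using only the standard library; `zerox_supported` is a hypothetical helper name, and the only fact it encodes is the version floor stated in the list above.

```python
import sys
from typing import Tuple

# Minimum Python version for the Zerox engine, as stated in this announcement.
ZEROX_MIN_PYTHON: Tuple[int, int] = (3, 11)


def zerox_supported(version: Tuple[int, int] = sys.version_info[:2]) -> bool:
    """Return True if the given (major, minor) version meets the Zerox floor."""
    return tuple(version) >= ZEROX_MIN_PYTHON
```

Calling `zerox_supported()` with no argument checks the running interpreter; passing a tuple such as `(3, 10)` checks a specific version.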

Smart routing

Not sure which engine to pick? Docfold will choose for you:

from docfold import process

result = process("document.pdf")  # automatic selection

The router considers file type, OCR layer presence, and which engines are available on your machine.
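
To make those three inputs concrete, here is a minimal sketch of a rule-based selector. It is not Docfold's actual router: the function `choose_engine`, the preference lists, and every engine identifier other than `pymupdf` are hypothetical, chosen only to mirror the strengths described earlier (PyMuPDF for native PDFs, Tesseract for scans).

```python
from pathlib import Path
from typing import Optional, Set

# Hypothetical preference orders; a real router is likely more nuanced.
NATIVE_PDF_PREFERENCE = ["pymupdf", "docling", "tesseract"]
SCANNED_PREFERENCE = ["tesseract", "paddleocr", "docling"]
IMAGE_SUFFIXES = {".png", ".jpg", ".jpeg", ".tiff"}


def choose_engine(path: str, has_text_layer: bool, available: Set[str]) -> Optional[str]:
    """Pick the first available engine suited to the file, or None."""
    suffix = Path(path).suffix.lower()
    if suffix == ".pdf" and has_text_layer:
        candidates = NATIVE_PDF_PREFERENCE  # text can be read directly
    elif suffix == ".pdf" or suffix in IMAGE_SUFFIXES:
        candidates = SCANNED_PREFERENCE  # pixels only, so OCR is needed
    else:
        return None  # file type not covered by this sketch
    for name in candidates:
        if name in available:
            return name
    return None
```

The `available` set stands in for "engines available on your machine"; one way to build it would be to test whether each engine's package can be imported, for example with `importlib.util.find_spec` from the standard library.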

Why we open-sourced it

Datatera.ai uses Docfold internally for initial document processing. We believe that foundational document tooling should be accessible to everyone. Open-sourcing allows:

Developers to use a battle-tested tool without vendor lock-in
The community to improve and extend the library
Us to receive feedback and contributions from the community

Getting started

Install via pip:

pip install docfold

To install all engines (the quotes keep shells such as zsh from expanding the square brackets):

pip install "docfold[all]"

Documentation and examples are available on GitHub. The package is published on PyPI.

Docfold is at version 0.5.1, with 225 tests, support for Python 3.10-3.12, and CI across three operating systems.

Docfold - open-source document processing toolkit
