
Docfold - open-source document processing toolkit

We are open-sourcing Docfold, a Python library for extracting structured data from documents. The project is available on PyPI and GitHub under the MIT license.

What is Docfold

Docfold provides a single interface for 15 document-processing engines. Instead of learning the API of each library separately, you make one call:

from docfold import process

result = process("invoice.pdf", engine="pymupdf")

One interface, 15 engines, predictable output.
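To make the "one interface, many engines" idea concrete, here is a minimal sketch of a registry-based dispatcher. It is not Docfold's implementation: `ExtractionResult`, `register`, `process_sketch`, and the two stand-in engines are hypothetical names invented for illustration, and the real library's result type and internals may look quite different.

```python
from dataclasses import dataclass
from typing import Callable, Dict


@dataclass
class ExtractionResult:
    """Hypothetical uniform result; not Docfold's actual result type."""
    engine: str
    text: str


# Maps an engine name to a function that takes a file path and returns text.
_ENGINES: Dict[str, Callable[[str], str]] = {}


def register(name: str):
    """Decorator that adds an engine adapter to the registry."""
    def wrap(func: Callable[[str], str]) -> Callable[[str], str]:
        _ENGINES[name] = func
        return func
    return wrap


@register("plaintext")
def _plaintext(path: str) -> str:
    # Stand-in engine: reads the file as UTF-8 text.
    with open(path, encoding="utf-8") as fh:
        return fh.read()


@register("upper")
def _upper(path: str) -> str:
    # Second stand-in engine, showing the same call reaching another backend.
    return _plaintext(path).upper()


def process_sketch(path: str, engine: str) -> ExtractionResult:
    """Dispatch to the named engine and wrap its output in a uniform result."""
    if engine not in _ENGINES:
        raise ValueError(f"unknown engine {engine!r}; available: {sorted(_ENGINES)}")
    return ExtractionResult(engine=engine, text=_ENGINES[engine](path))
```

Because every adapter returns the same result shape, calling code does not change when the engine name does; that is the property the single `process` call relies on.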

Supported engines

Docfold brings together local and cloud solutions:

  • Local: PyMuPDF, Docling, Marker, MinerU, PaddleOCR, Tesseract, Unstructured, Nougat, Surya
  • Cloud: LlamaParse, Mistral OCR, AWS Textract, Google Document AI, Azure Document Intelligence
  • LLM-based: Zerox (requires Python 3.11+)

Each engine has its strengths. PyMuPDF excels at native PDFs, Tesseract handles scanned documents well, and Nougat is optimized for academic papers.

Smart routing

Not sure which engine to pick? Omit the engine argument and Docfold will choose one for you:

from docfold import process

result = process("document.pdf")  # automatic selection

The router considers the file type, whether the document already has an OCR text layer, and which engines are installed on your machine.
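The three routing inputs can be sketched as a small rule-based selector. This is only an illustration under stated assumptions: the preference lists, the `choose_engine` and `installed_engines` helpers, and the decision to pass the text-layer flag in from outside are all hypothetical, and Docfold's actual router may weigh these signals differently. The import names used for detection (`fitz` for PyMuPDF, `pytesseract` for the Tesseract wrapper) are the ones those packages are commonly installed under.

```python
import importlib.util
from pathlib import Path
from typing import Iterable, Optional, Set

# Hypothetical preference order per situation; Docfold's real rules may differ.
PREFERENCES = {
    "pdf_with_text": ["pymupdf", "docling", "tesseract"],
    "pdf_scanned": ["tesseract", "paddleocr", "docling"],
    "image": ["tesseract", "paddleocr"],
}

# Import names used to detect whether an engine's package is installed.
IMPORT_NAMES = {
    "pymupdf": "fitz",
    "docling": "docling",
    "tesseract": "pytesseract",
    "paddleocr": "paddleocr",
}

IMAGE_SUFFIXES = {".png", ".jpg", ".jpeg", ".tif", ".tiff"}


def installed_engines() -> Set[str]:
    """Return the engines whose Python package can be found on this machine."""
    return {
        name
        for name, module in IMPORT_NAMES.items()
        if importlib.util.find_spec(module) is not None
    }


def choose_engine(path: str, has_text_layer: bool, available: Iterable[str]) -> Optional[str]:
    """Pick the first preferred engine that is available, or None if none fits."""
    suffix = Path(path).suffix.lower()
    if suffix == ".pdf":
        key = "pdf_with_text" if has_text_layer else "pdf_scanned"
    elif suffix in IMAGE_SUFFIXES:
        key = "image"
    else:
        return None  # unsupported file type in this sketch
    available_set = set(available)
    for engine in PREFERENCES[key]:
        if engine in available_set:
            return engine
    return None
```

A caller would combine the two helpers, for example `choose_engine("document.pdf", has_text_layer=True, available=installed_engines())`. Keeping availability as an explicit argument makes the selection rule easy to test without installing any engine.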

Why we open-sourced it

Datatera.ai uses Docfold internally for initial document processing. We believe that foundational document tooling should be accessible to everyone. Open-sourcing allows:

  • Developers to use a battle-tested tool without vendor lock-in
  • The community to improve and extend the library
  • Us to receive feedback and contributions

Getting started

Install via pip:

pip install docfold

To install all engines (the quotes keep shells such as zsh from interpreting the square brackets):

pip install "docfold[all]"
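After installing, one quick way to confirm the package is present is to ask Python's standard `importlib.metadata` for its version. The `installed_version` helper below is a hypothetical convenience written for this post, not part of Docfold; it only assumes the distribution name `docfold` used in the pip command.

```python
from importlib import metadata
from typing import Optional


def installed_version(package: str) -> Optional[str]:
    """Return the installed version of a distribution, or None if it is absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    version = installed_version("docfold")
    print(version if version else "docfold is not installed")
```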

Documentation and examples are available on GitHub. The package is published on PyPI.


Docfold is currently at version 0.5.1, with 225 tests, support for Python 3.10 to 3.12, and CI across three operating systems.
