Docfold - open-source document processing toolkit
We are happy to announce that we are open-sourcing Docfold, a Python library for extracting structured data from documents. It is available on PyPI and GitHub under the MIT license.
What is Docfold
Docfold provides a single interface for 15 document-processing engines. Instead of learning the API of each library separately, you make one call:
from docfold import process
result = process("invoice.pdf", engine="pymupdf")
One interface, 15 engines, predictable output.
Supported engines
Docfold brings together local and cloud solutions:
- Local: PyMuPDF, Docling, Marker, MinerU, PaddleOCR, Tesseract, Unstructured, Nougat, Surya
- Cloud: LlamaParse, Mistral OCR, AWS Textract, Google Document AI, Azure Document Intelligence
- LLM-based: Zerox (requires Python 3.11+)
Each engine has its strengths. PyMuPDF excels at native PDFs, Tesseract handles scanned documents well, and Nougat is optimized for academic papers.
Smart routing
Not sure which engine to pick? Docfold will choose for you:
from docfold import process
result = process("document.pdf") # automatic selection
The router considers file type, OCR layer presence, and which engines are available on your machine.
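To make the three routing signals concrete, here is a minimal sketch of how such a selection could work. It is not Docfold's actual router: the function names, the engine-to-module mapping, the preference order, and the has_text_layer flag (standing in for OCR layer detection) are all assumptions made for illustration.

```python
from importlib.util import find_spec
from pathlib import Path

# Hypothetical mapping from engine name to the module that must be importable
# for that engine to be usable. This is NOT taken from Docfold's source.
ENGINE_MODULES = {
    "pymupdf": "fitz",            # PyMuPDF's classic import name
    "tesseract": "pytesseract",
    "paddleocr": "paddleocr",
}

IMAGE_SUFFIXES = {".png", ".jpg", ".jpeg", ".tif", ".tiff"}


def available_engines(modules=ENGINE_MODULES):
    """Return the set of engines whose backing module can be found locally."""
    return {name for name, module in modules.items() if find_spec(module) is not None}


def choose_engine(path, has_text_layer, available):
    """Pick an engine from file type, text-layer presence, and availability.

    Returns None when no suitable engine is installed.
    """
    suffix = Path(path).suffix.lower()
    if suffix == ".pdf" and has_text_layer:
        # Native PDF: prefer direct text extraction, fall back to OCR.
        preferences = ["pymupdf", "tesseract", "paddleocr"]
    elif suffix == ".pdf" or suffix in IMAGE_SUFFIXES:
        # Scanned PDF or image: only OCR engines make sense.
        preferences = ["tesseract", "paddleocr"]
    else:
        preferences = []
    for engine in preferences:
        if engine in available:
            return engine
    return None


if __name__ == "__main__":
    print(choose_engine("invoice.pdf", True, available_engines()))
```

The ordering mirrors the strengths described above: direct extraction first for native PDFs, OCR for scans. A real router would also need to detect the text layer itself rather than receive it as a flag.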
Why we open-sourced it
Datatera.ai uses Docfold internally for initial document processing. We believe that foundational document tooling should be accessible to everyone. Open-sourcing allows:
- Developers to use a battle-tested tool without vendor lock-in
- The community to improve and extend the library
- Us to receive feedback and contributions
Getting started
Install via pip:
pip install docfold
To install all engines:
pip install "docfold[all]"
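After installing, a quick environment check can confirm that the package is visible and that the interpreter falls in the supported range. The sketch below uses only the standard library; the helper names are hypothetical and not part of Docfold, and the version bounds are simply the ones stated in this post (Python 3.10-3.12 for the library, 3.11+ for Zerox).

```python
import sys
from importlib.metadata import PackageNotFoundError, version
from typing import Optional


def installed_version(package: str) -> Optional[str]:
    """Return the installed version of a distribution, or None if it is absent."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


def python_supported(info=sys.version_info, minimum=(3, 10), maximum=(3, 12)) -> bool:
    """Check that the (major, minor) pair lies within the given inclusive range."""
    return minimum <= (info[0], info[1]) <= maximum


if __name__ == "__main__":
    print("docfold:", installed_version("docfold") or "not installed")
    print("supported Python:", python_supported())
    # Zerox needs a newer interpreter than the library's minimum.
    print("Zerox-capable Python:", python_supported(minimum=(3, 11)))
```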
Documentation and examples are available on GitHub. The package is published on PyPI.
Docfold is currently at version 0.5.1, with 225 tests, support for Python 3.10-3.12, and CI across three operating systems.