DocLang

Your documents are lying to your models.

The world's knowledge lives in formats designed for rendering, not understanding. Markdown was built for readers. HTML for browsers. LaTeX for typesetting. PDF for print. None were built for machines.

Modern AI pipelines assume clean, structured input. Real-world documents (contracts, invoices, research papers, regulatory filings) are neither. Parsers guess at reading order. Tables become flat text. Figures vanish. Metadata is stripped.

The result: your model's accuracy is bottlenecked by document quality, not model quality. You spend more engineering time wrangling pre-processing than building the product.

parse("quarterly_report.pdf")

reading_order
  expected: sequential hierarchy
  received: undefined

table_structure
  expected: 3×12 grid with merged cells
  received: flat string (156 chars)

figure_references
  expected: 8 embedded figures
  received: 0 (omitted)

document_metadata
  expected: { author, created, lang }
  received: null

A document representation built for how AI actually reads.

DocLang defines a structured, machine-readable format for documents of any type. Not a converter. Not an API. A standard — like JSON for data, like HTML for the web — that any tool can implement and any pipeline can consume.

Every component carries a semantic tag, bounding box coordinates, and reading order — natively encoded in a format LLM tokenizers can parse without translation overhead. A table encodes its full grid structure via OTSL. A heading carries its level and page position. Your model doesn't have to guess. Governance metadata — PII flags, RAG permissions, training constraints — lives inside <head>, not in a sidecar file.

The same standard extends beyond text documents. Audio transcripts, images, and video segments encode as first-class elements — speakers, timestamps, and scenes using the same primitives as headings and tables.

What your parser returns

Q3 2024Financial Re
port Net Revenue42
M51M39M Figure3.2
omitted author:null

What DocLang returns

<head><author>J. Smith</author></head>
<heading level="1">Q3 2024</heading>
<table><ched/>Revenue<fcel/>$42M<nl/></table>
<picture><src uri="fig-3.2.png"/></picture>

Six properties. No compromises.

AI-native

Every element maps directly to LLM tokens. No translation layers, no postprocessing, no structural guesswork.

Lossless

Tables keep their full grid structure. Figures keep their position. Reading order is preserved, not inferred.

Expressive

Semantic roles, bounding boxes, document hierarchy — all fully encoded. Your model stops hallucinating structure.

Beyond documents

Audio transcripts, images, video segments — same format, same primitives. Speakers, timestamps, and scenes are native elements.

Unambiguous

One canonical representation per content type. No parser-dependent variance. Every tool produces the same output.

Open

A Joint Development Foundation Projects standard and an LF AI & Data project. Public spec, open working group, no lock-in.

The business context layer for enterprise AI.

AI is only as reliable as the context it receives. DocLang transforms documents into structured business context that can be trusted across AI agents, workflows, and enterprise systems.

Business context, preserved

Structure alone is not enough. DocLang preserves the meaning, relationships, and business context behind your documents so AI systems can act on knowledge, not just content.

Fewer errors, faster decisions

Reliable structure means fewer errors in automated document workflows — fewer manual reviews, lower compliance exposure, and faster time-to-decision.

Audit-ready by default

Compliance metadata travels with the document, not alongside it. Legal and compliance teams define rules once, and every downstream system reads them automatically.

No lock-in, ever

Swap components as the market evolves. Your documents stay portable because the specification is standardized and any vendor can implement it.

AI-native document format specification

DocLang is a constrained XML format built from the ground up for LLM tokenizers — a 1-to-1 mapping between DocLang tokens and model tokens, with minimal token count. Every component carries semantic role, geometric bounding box, and reading order. Tables use OTSL: 5 structural tokens where HTML needs 28.
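The compactness argument can be illustrated with a small Python sketch. It does not reproduce the 5-versus-28 figure above, which depends on vocabulary and tokenizer details not given on this page. It only counts structural markers for a two-row, three-column table, written once with OTSL-style markers and once as plain HTML tags. The marker names ched, fcel, and nl are taken from the sample file; the function names, and the choice to count each HTML open or close tag as one unit, are assumptions made for illustration.

```python
def otsl_markers(rows, cols, header_rows=1):
    """One marker per cell plus one nl marker per row (assumed layout)."""
    markers = []
    for r in range(rows):
        cell = "ched" if r < header_rows else "fcel"
        markers.extend([cell] * cols)
        markers.append("nl")
    return markers


def html_tags(rows, cols, header_rows=1):
    """Open and close tags for an equivalent plain HTML table."""
    tags = ["<table>"]
    for r in range(rows):
        cell = "th" if r < header_rows else "td"
        tags.append("<tr>")
        for _ in range(cols):
            tags.extend([f"<{cell}>", f"</{cell}>"])
        tags.append("</tr>")
    tags.append("</table>")
    return tags


otsl = otsl_markers(rows=2, cols=3)
html = html_tags(rows=2, cols=3)
print(len(otsl), len(html))  # prints: 8 18
```

Under these counting assumptions the marker sequence is less than half the length of the HTML one, because a cell needs a single marker rather than an open and close pair, and rows need no wrapper.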

Full spec and reference implementation on GitHub →

DocLang quarterly_report.dclg.xml

<doclang>

  <heading level="1">
    <location value="48"/><location value="40"/>
    <location value="420"/><location value="72"/>
    Q3 Financial Summary
  </heading>

  <table>
    <location value="48"/><location value="88"/>
    <location value="420"/><location value="168"/>
    <ched/>Quarter<ched/>Revenue<ched/>YoY<nl/>
    <fcel/>Q3 2024<fcel/>$42M<fcel/>+18%<nl/>
  </table>

</doclang>
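
To make the structure of this sample concrete, here is a minimal Python sketch that reads it with the standard library's xml.etree.ElementTree. It rests on assumptions this page does not state: that the four <location> values form a bounding box in x0, y0, x1, y1 order, that <ched/> marks a column-header cell, <fcel/> a filled cell, and <nl/> the end of a row, and that each cell's text follows its marker. The helper names bbox, own_text, and table_rows are hypothetical and not part of any DocLang tooling.

```python
import xml.etree.ElementTree as ET

SAMPLE = """<doclang>
  <heading level="1">
    <location value="48"/><location value="40"/>
    <location value="420"/><location value="72"/>
    Q3 Financial Summary
  </heading>
  <table>
    <location value="48"/><location value="88"/>
    <location value="420"/><location value="168"/>
    <ched/>Quarter<ched/>Revenue<ched/>YoY<nl/>
    <fcel/>Q3 2024<fcel/>$42M<fcel/>+18%<nl/>
  </table>
</doclang>"""


def bbox(element):
    """Collect the <location> values (assumed order: x0, y0, x1, y1)."""
    return tuple(int(loc.get("value")) for loc in element.findall("location"))


def own_text(element):
    """Join the text fragments sitting directly inside the element."""
    parts = [element.text or ""] + [child.tail or "" for child in element]
    return " ".join(p.strip() for p in parts if p.strip())


def table_rows(table):
    """Rebuild rows from cell markers; an <nl/> marker closes the current row."""
    rows, current = [], []
    for child in table:
        if child.tag in ("ched", "fcel"):
            # The cell text is the tail that follows the empty marker element.
            current.append((child.tag, (child.tail or "").strip()))
        elif child.tag == "nl":
            rows.append(current)
            current = []
    return rows


root = ET.fromstring(SAMPLE)
heading = root.find("heading")
table = root.find("table")

heading_info = (heading.get("level"), bbox(heading), own_text(heading))
rows = table_rows(table)

print(heading_info)
for row in rows:
    print([text for _tag, text in row])
```

Because the cell markers are empty elements, a generic XML parser exposes each cell's text as the marker's tail, so no format-specific parser is needed to recover the grid.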

Join the working group.

The spec, the reference implementation, and the working group processes are all public. The standard improves when more perspectives are in the room.

Ready to get involved? Sign the CLA to join the Working Group and start contributing.

Mailing list: Latest news, discussions, and announcements.
GitHub org: Spec, reference implementation, and open issues.
Meeting calendar: Working group sessions, open to all contributors.

Is this just another document parser?

No. Parsers convert documents into some proprietary output format. DocLang is a standard — a shared specification that any parser, any converter, any AI tool can implement. The goal is interoperability, not another tool to integrate.

What's wrong with just using Docling / FineReader / MyParser instead of DocLang?

Nothing — that's the point. Enable the DocLang output and the structured data they produce stops being tied to your specific pipeline configuration. It becomes consumable by any downstream system that speaks the DocLang standard.

Docling and ABBYY FineReader Engine already natively support the DocLang standard.

How is this different from what PDF already contains?

PDF is a presentation format. It tells a renderer where to draw pixels. DocLang is a semantic format — it tells a model what content is. A PDF table and a DocLang table are fundamentally different objects.

Who governs the spec?

The DocLang Specification development process is governed by Joint Development Foundation Projects. DocLang is an LF AI & Data project. The DocLang working group — founded by IBM, NVIDIA, Red Hat, ABBYY, and HumanSignal — proposes and reviews changes, but the foundation ensures the process remains open and no single vendor controls the roadmap.

Can I contribute?

Yes. The spec, the reference implementation, and the working group processes are all public. Join the GitHub discussion, open an issue, or attend a working group session. The standard improves when more perspectives are in the room.

The AI-native document format.

Honest answers to obvious questions.

The substrate your pipeline has been missing.
