Preparing Documents with Docling for CAG, RAG, and Embedding Pipelines

If you want reliable AI answers from your own content, document preparation matters as much as model choice.

Most production failures come from bad source material: duplicated pages, broken tables, missing headings, weak metadata, and chunks that lose context. Docling is useful because it converts varied document formats into structured outputs that are easier to split, embed, and trace.

This guide goes beyond CAG alone. You will see when to use CAG, when to use RAG, when embeddings are worth the effort, and how to prepare one corpus for all three.

CAG, RAG, and Embeddings: Quick Definitions

Terms are used differently across teams, so this article uses these definitions:

  • CAG (Context-Augmented Generation): context is assembled ahead of generation and passed directly to the model
  • RAG (Retrieval-Augmented Generation): relevant chunks are retrieved at query time from an index
  • Embeddings: dense vector representations used for semantic search, clustering, deduplication, and routing

When to Use CAG vs RAG

Use CAG when:

  • The working knowledge set is small or bounded
  • You need predictable, fixed context packets (for example, policy packs per team)
  • Retrieval infrastructure would add latency and complexity you do not need

Use RAG when:

  • The corpus is large, changes often, or cannot fit in prompt context
  • You need query-time relevance instead of fixed context bundles
  • You want citations back to exact source chunks

Use hybrid (CAG + RAG) when:

  • You always need baseline context plus query-specific retrieval
  • You have mandatory instructions/policies and a large evolving knowledge base

What Good AI Ingestion Input Looks Like

Before writing code, define your target shape:

  • Clean text with preserved structure (headings, lists, tables)
  • Stable chunk IDs so you can re-index safely
  • Source metadata for traceability (file, section, version)
  • Chunk sizes aligned to your embedding and generation limits

Docling helps with the conversion and structure-preserving part, then you layer chunking and indexing on top.
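The target shape above can be made concrete as a small validation helper. This is a sketch, not a Docling schema: the field names and the character limit are assumptions you should adapt to your own pipeline.

```python
# A chunk record ready for indexing: clean text plus traceability fields.
REQUIRED_FIELDS = {"id", "text", "source_path", "section", "version"}

def validate_chunk(record: dict, max_chars: int = 4000) -> list[str]:
    """Return a list of problems; an empty list means the record is ready to index."""
    problems = [f"missing field: {f}" for f in sorted(REQUIRED_FIELDS - record.keys())]
    text = record.get("text", "")
    if not text.strip():
        problems.append("empty text")
    if len(text) > max_chars:
        problems.append(f"text exceeds {max_chars} chars; re-chunk")
    return problems
```

Running a gate like this over every record before indexing catches empty or oversized chunks early, when they are still cheap to fix.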

1. Install Docling

python -m venv .venv
source .venv/bin/activate
pip install -U docling
docling --version

2. Convert Source Files to Structured Outputs

Start by converting all source documents into Markdown and JSON so you keep both human-readable and machine-friendly representations.

mkdir -p ingest/raw ingest/normalised

# Example batch conversion for PDFs and DOCX files
# --image-export-mode referenced keeps image refs in output where applicable
docling ingest/raw \
    --from pdf \
    --from docx \
    --to md \
    --to json \
    --output ingest/normalised \
    --image-export-mode referenced

For digitally generated PDFs (already text-based), you can skip OCR to reduce noise and speed up processing:

docling ingest/raw \
    --from pdf \
    --to json \
    --output ingest/normalised \
    --no-ocr

If your corpus has many tables and extraction quality matters, use the more accurate table mode:

docling ingest/raw \
    --from pdf \
    --to json \
    --output ingest/normalised \
    --table-mode accurate

3. Normalise Metadata Before Chunking

Before chunking, enrich each converted file with consistent metadata:

  • doc_id (stable identifier)
  • source_path
  • source_type (pdf, docx, html, md)
  • version or effective_date
  • optional access labels (team, region, policy level)

Without this, retrieval quality may still look good, but provenance and compliance checks are painful later.
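One way to apply this consistently is a small enrichment pass over each converted record. A sketch, assuming your own field names; the 16-character hash prefix for doc_id is an arbitrary choice:

```python
import hashlib
from pathlib import Path

def enrich_metadata(record: dict, source_path: str, version: str) -> dict:
    """Attach consistent provenance metadata to a converted document record."""
    path = Path(source_path)
    # Stable doc_id derived from the source path, so re-runs keep the same IDs.
    doc_id = hashlib.sha256(str(path).encode("utf-8")).hexdigest()[:16]
    record["metadata"] = {
        "doc_id": doc_id,
        "source_path": str(path),
        "source_type": path.suffix.lstrip(".").lower() or "unknown",
        "version": version,
    }
    return record
```

Deriving doc_id from the path (rather than a random UUID) means re-processing the same file updates records in place instead of duplicating them.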

4. Chunk with Structural Context (Not Blind Fixed Windows)

Naive fixed-size chunking often splits ideas mid-section and hurts retrieval relevance. Docling’s hybrid chunking keeps more context by combining document structure and token-aware boundaries.

from pathlib import Path
import json

from docling.document_converter import DocumentConverter
from docling.chunking import HybridChunker

source = Path("ingest/raw/product-manual.pdf")
converter = DocumentConverter()
chunker = HybridChunker()

result = converter.convert(source=source)
doc = result.document

chunks = []
for i, chunk in enumerate(chunker.chunk(dl_doc=doc)):
    chunks.append(
        {
            "id": f"{source.stem}:{i}",
            "text": chunker.contextualize(chunk=chunk),
            "source_path": str(source),
            "chunk_index": i,
        }
    )

Path("ingest/chunks").mkdir(parents=True, exist_ok=True)
Path("ingest/chunks/product-manual.json").write_text(
    json.dumps(chunks, ensure_ascii=False, indent=2),
    encoding="utf-8",
)

print(f"Wrote {len(chunks)} chunks")

The important part is chunker.contextualize(...): it builds chunk text with surrounding structural context, which usually improves retrieval quality versus raw text slices.

For CAG, keep chunks slightly larger if you assemble curated context packs later. For RAG, prefer tighter chunks so retrieval returns focused evidence.
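Assembling a CAG context pack from those chunks can be as simple as greedy packing under a token budget. A sketch with a rough characters-per-token estimate; in production you would swap in your model's real tokenizer:

```python
def assemble_context_pack(chunks: list[dict], token_budget: int) -> list[str]:
    """Greedily fill a fixed context pack, stopping before the budget is exceeded."""
    blocks, used = [], 0
    for chunk in chunks:
        # Rough estimate: ~4 characters per token for English prose.
        est_tokens = max(1, len(chunk["text"]) // 4)
        if used + est_tokens > token_budget:
            break
        blocks.append(chunk["text"])
        used += est_tokens
    return blocks
```

Because the pack is assembled ahead of time, you can review and version it like any other artifact, which is part of CAG's appeal.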

5. When and How to Use Embeddings

Use embeddings when you need semantic matching:

  • Query-to-chunk retrieval (core RAG)
  • Near-duplicate detection
  • Topic clustering and corpus cleanup
  • Routing queries to the right document set

Embeddings are usually not needed for:

  • Exact key lookups (use keyword or structured filters)
  • Tiny static corpora where full-context prompting is simpler

Embedding Pipeline (Practical Sequence)

  1. Convert and normalise with Docling
  2. Chunk with structure-aware logic
  3. Generate embeddings for each chunk text
  4. Upsert vectors and metadata into an index
  5. Retrieve top-k at query time, then optionally re-rank
A minimal sketch of steps 3 and 4, with embed_text and upsert_vectors standing in for your embedding provider and vector database clients:

import json
from pathlib import Path

chunks = json.loads(Path("ingest/chunks/product-manual.json").read_text())

records = []
for chunk in chunks:
    vector = embed_text(chunk["text"])  # provider/model specific
    records.append(
        {
            "id": chunk["id"],
            "vector": vector,
            "metadata": {
                "source_path": chunk["source_path"],
                "chunk_index": chunk["chunk_index"],
                "pipeline_version": "2026-02-22",
            },
            "text": chunk["text"],
        }
    )

upsert_vectors(records)  # vector DB specific

Use deterministic IDs and keep the original text with metadata so you can audit outputs and provide citations.
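A hashed variant of a deterministic ID can be sketched like this (the key format and 24-character prefix are assumptions, not a fixed convention):

```python
import hashlib

def chunk_id(source_path: str, chunk_index: int, pipeline_version: str) -> str:
    """Deterministic chunk ID: the same inputs always produce the same ID,
    so re-running the pipeline upserts in place instead of duplicating."""
    key = f"{source_path}|{chunk_index}|{pipeline_version}"
    return hashlib.sha256(key.encode("utf-8")).hexdigest()[:24]
```

Including pipeline_version in the key means a chunking change produces new IDs, so old and new chunks never silently collide in the index.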

6. Ingest into CAG, RAG, or Hybrid Stores

At ingestion time, keep payloads explicit and reproducible.

CAG context-pack record

{
    "id": "policy-pack-eu-v3",
    "context_blocks": ["...chunk text 1...", "...chunk text 2..."],
    "metadata": {
        "scope": "eu-support",
        "pipeline_version": "2026-02-22"
    }
}

RAG vector record

{
    "id": "product-manual:42",
    "text": "...contextualised chunk text...",
    "metadata": {
        "source_path": "ingest/raw/product-manual.pdf",
        "chunk_index": 42,
        "pipeline_version": "2026-02-22"
    }
}

Hybrid pattern

  • Load a small fixed CAG context pack first
  • Retrieve top-k chunks from your vector index
  • Merge, de-duplicate, and pass to generation

This gives stable baseline context plus fresh query-time evidence.
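The merge-and-deduplicate step above can be sketched as a small function. The whitespace-and-case normalisation used for deduplication is an assumption; adapt it to how your chunks actually repeat:

```python
def merge_context(cag_blocks: list[str], retrieved: list[dict]) -> list[str]:
    """Merge a fixed CAG pack with retrieved chunks, dropping duplicates.
    CAG blocks come first so baseline policy text is never crowded out."""
    seen, merged = set(), []
    for text in cag_blocks + [r["text"] for r in retrieved]:
        key = " ".join(text.split()).lower()  # normalise whitespace and case
        if key not in seen:
            seen.add(key)
            merged.append(text)
    return merged
```

Ordering matters here: putting the CAG pack first guarantees mandatory instructions survive even if the retrieval step returns overlapping material.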

7. Other Ingestion Outputs from the Same Corpus

The same Docling-normalised corpus can feed more than chat retrieval:

  • Fine-tuning datasets: instruction/answer examples with source links
  • Evaluation suites: benchmark question sets tied to canonical docs
  • Knowledge graphs: extracted entities and relations from structured chunks
  • Search indexes: lexical BM25 index for hybrid lexical + semantic retrieval

You do not need separate document conversion pipelines for each output. Reuse one normalised source of truth.

8. Validate Before Full Indexing

Run a quick quality gate before large indexing jobs:

  1. Spot-check 20 to 50 chunks for readability and context completeness
  2. Confirm key tables and headings survived conversion
  3. Check for duplicate or near-duplicate chunks
  4. Measure token length distribution against your embedding model limits
  5. Run a small retrieval benchmark on real user queries

This step is where most production quality gains come from.
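Step 4 of the quality gate, the token length check, can be sketched with the same rough characters-per-token estimate used earlier (swap in your embedding model's tokenizer for real numbers):

```python
def length_report(chunks: list[dict], limit: int) -> dict:
    """Summarise estimated chunk token sizes against a model limit."""
    sizes = sorted(max(1, len(c["text"]) // 4) for c in chunks)
    return {
        "count": len(sizes),
        "min": sizes[0],
        "median": sizes[len(sizes) // 2],
        "max": sizes[-1],
        "over_limit": sum(1 for s in sizes if s > limit),
    }
```

A non-zero over_limit count before indexing tells you which documents need re-chunking, rather than discovering truncated embeddings after the fact.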

Common Pitfalls

  • Converting to plain text too early and losing structure
  • Mixing OCR and digital PDFs without separate handling rules
  • Chunking by character count only, with no semantic boundaries
  • Using embeddings without metadata filters (causes noisy retrieval)
  • Choosing CAG for large, fast-changing corpora where RAG is a better fit
  • Ignoring metadata, then being unable to explain answer provenance

Summary

Docling gives you a practical base for AI ingestion across CAG and RAG:

  • Convert heterogeneous files into structured output
  • Apply structure-aware chunking with contextualised text
  • Add embeddings where semantic retrieval is needed
  • Preserve metadata so results are explainable
  • Validate chunk quality before full indexing

If quality is unstable, fix document preparation first. In most systems, that has higher ROI than model swaps.
