Sarah Chen, SEO Content Strategist
What This Tool Does
Before a document can power a RAG system, it has to be chunked — and naive fixed-size chunking destroys the structure that makes Markdown valuable. This formatter splits along your document's natural heading boundaries while honoring a token budget, so each chunk is a coherent, appropriately sized unit ready to embed and retrieve.
Chunking Strategies
- Heading + token cap (recommended): split on headings, then subdivide any oversized section.
- By heading: one chunk per section at the chosen heading depth.
- By token size: pure token-budget splitting, structure-agnostic.
The split depth controls which heading level starts a new chunk — H2 is a sensible default for most documents.
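To make the recommended strategy concrete, here is a minimal sketch of heading-first splitting with a token cap. It is not this tool's actual implementation. The function names are hypothetical, a whitespace word count stands in for a real tokenizer, and the sketch assumes that a split depth of H2 means any heading at level 2 or shallower starts a new chunk. It also ignores fenced code blocks, so a "#" comment inside one would be misread as a heading.

```python
import re

# ATX-style Markdown heading: one to six '#' characters, then a title.
HEADING = re.compile(r"^(#{1,6})\s+(.*)$")


def estimate_tokens(text: str) -> int:
    # Assumption: a word count approximates the token count.
    return len(text.split())


def split_by_heading(markdown: str, depth: int = 2) -> list[str]:
    """Start a new section at every heading of level <= depth."""
    sections: list[str] = []
    current: list[str] = []
    for line in markdown.splitlines():
        match = HEADING.match(line)
        if match and len(match.group(1)) <= depth and current:
            sections.append("\n".join(current))
            current = []
        current.append(line)
    if current:
        sections.append("\n".join(current))
    return sections


def cap_tokens(section: str, max_tokens: int) -> list[str]:
    """Return the section unchanged if it fits, else cut it into word windows.

    Oversized sections are re-joined with single spaces, so their original
    line breaks are lost in this simplified version.
    """
    words = section.split()
    if len(words) <= max_tokens:
        return [section]
    return [
        " ".join(words[i:i + max_tokens])
        for i in range(0, len(words), max_tokens)
    ]


def chunk(markdown: str, depth: int = 2, max_tokens: int = 512) -> list[str]:
    """Heading + token cap: split on headings, then subdivide big sections."""
    return [
        piece
        for section in split_by_heading(markdown, depth)
        for piece in cap_tokens(section, max_tokens)
    ]
```

Calling split_by_heading alone corresponds to the "By heading" strategy, and calling cap_tokens on the whole document corresponds to "By token size".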
Overlap & Context
Overlap copies a configurable number of trailing tokens from each chunk into the start of the next, so context isn't lost at boundaries. This matters when an answer spans the end of one section and the start of another: with overlap, the later chunk also carries the tail of the earlier one, so a single retrieved chunk is more likely to contain the full picture.
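One way to picture overlap is the sketch below. It is an illustration under assumptions, not the tool's own code: add_overlap is a hypothetical name, and words stand in for tokens.

```python
def add_overlap(chunks: list[str], overlap: int = 64) -> list[str]:
    """Prefix each chunk with the last `overlap` words of the previous chunk.

    The tail is always taken from the original previous chunk, not from its
    overlapped version, so the carried-over text does not compound.
    """
    if overlap <= 0:
        # Guard: a slice of [-0:] would copy the entire previous chunk.
        return list(chunks)
    result: list[str] = []
    for index, text in enumerate(chunks):
        if index == 0:
            result.append(text)  # the first chunk has no predecessor
            continue
        tail = chunks[index - 1].split()[-overlap:]
        result.append(" ".join(tail) + "\n" + text)
    return result
```

Keep in mind that overlap adds to each chunk's size, so a chunk at the token cap ends up larger than the cap once the tail is prepended in this simplified version.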
Chunk Metadata
Every chunk carries a stable id, the source filename, a heading_path (the chain of headings above it), and tokens, an estimated token count. The heading path is especially useful: you can surface it as a citation, filter retrieval by section, or reconstruct where a chunk came from.
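The sketch below shows one way such a record could be built. The field names id, heading_path, and tokens come from the description above; everything else is an assumption for illustration, including the name of the source field, the hash-based id scheme, and the word-count token estimate.

```python
import hashlib
import re
from dataclasses import dataclass

HEADING = re.compile(r"^(#{1,6})\s+(.*)$")


@dataclass
class Chunk:
    id: str
    source: str
    heading_path: list[str]
    tokens: int
    text: str


def sections_with_paths(markdown: str) -> list[tuple[list[str], str]]:
    """Return (heading_path, body) pairs, tracking the open heading chain."""
    stack: list[tuple[int, str]] = []  # (level, title) of headings in scope
    sections: list[tuple[list[str], str]] = []
    path: list[str] = []
    body: list[str] = []

    def flush() -> None:
        text = "\n".join(body).strip()
        if text:
            sections.append((path, text))

    for line in markdown.splitlines():
        match = HEADING.match(line)
        if match is None:
            body.append(line)
            continue
        flush()
        level, title = len(match.group(1)), match.group(2).strip()
        # A new heading closes every open heading at the same or deeper level.
        while stack and stack[-1][0] >= level:
            stack.pop()
        stack.append((level, title))
        path = [t for _, t in stack]
        body = []
    flush()
    return sections


def make_chunk(source: str, path: list[str], text: str) -> Chunk:
    # Assumed id scheme: hash of source, heading path, and text, so the same
    # input always yields the same id.
    key = "\x1f".join([source, *path, text])
    digest = hashlib.sha1(key.encode("utf-8")).hexdigest()[:12]
    return Chunk(digest, source, path, len(text.split()), text)
```

With a content-derived id like this, re-chunking an unchanged document produces the same ids, which makes it easier to update a vector store without duplicating entries.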
Export Formats
Export as JSON (one array), JSONL (one object per line, ideal for bulk-loading vector stores), or Markdown with YAML front-matter for human-readable chunk files. The footer reports the chunk count and token distribution so you can tune your settings.
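The three formats can be sketched with the standard library alone. This is an illustration, not the tool's exporter: the function names are hypothetical, chunks are assumed to be plain dicts with a text key, and the front-matter values are written as JSON literals, which YAML parsers accept as flow-style values.

```python
import json


def to_json(chunks: list[dict]) -> str:
    """All chunks as a single JSON array."""
    return json.dumps(chunks, ensure_ascii=False, indent=2)


def to_jsonl(chunks: list[dict]) -> str:
    """One JSON object per line, with no enclosing array."""
    return "\n".join(json.dumps(c, ensure_ascii=False) for c in chunks) + "\n"


def to_markdown(chunk: dict) -> str:
    """One chunk as Markdown, with its metadata in YAML front-matter."""
    meta = {key: value for key, value in chunk.items() if key != "text"}
    front = "\n".join(
        f"{key}: {json.dumps(value, ensure_ascii=False)}"
        for key, value in meta.items()
    )
    return f"---\n{front}\n---\n\n{chunk['text']}\n"
```

JSONL suits bulk loading because a loader can read and insert one line at a time without parsing the whole file first.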
Tips
- Start with Heading + token cap, a 512-token cap, a 64-token overlap, and splitting at H2.
- Match the max-token setting to your embedding model's recommended input size, and never exceed its maximum input length.
- Use JSONL when loading directly into a vector database pipeline.
