SmartMarkdown

Markdown RAG Formatter

Split Markdown into retrieval-ready chunks for RAG pipelines. Chunks respect heading boundaries and a token budget, carry overlap for context, and include rich metadata — exportable as JSON, JSONL, or Markdown with front-matter.

Source Markdown Document

RAG Chunks

[
  {
    "id": "knowledge-base-000",
    "source": "knowledge-base.md",
    "heading_path": [
      "Knowledge Base"
    ],
    "tokens": 4,
    "content": "# Knowledge Base"
  },
  {
    "id": "knowledge-base-001",
    "source": "knowledge-base.md",
    "heading_path": [
      "Knowledge Base",
      "Installation"
    ],
    "tokens": 29,
    "content": "## Installation\nInstall the package with npm. Make sure you are on Node 18+.\nRun the post-install step to generate config."
  },
  {
    "id": "knowledge-base-002",
    "source": "knowledge-base.md",
    "heading_path": [
      "Knowledge Base",
      "Configuration"
    ],
    "tokens": 30,
    "content": "## Configuration\nSet the API key in your environment. The config file lives at ./config.json\nand supports overrides per environment."
  },
  {
    "id": "knowledge-base-003",
    "source": "knowledge-base.md",
    "heading_path": [
      "Knowledge Base",
      "Usage"
    ],
    "tokens": 32,
    "content": "## Usage\nImport the client and call connect(). The client handles retries and\nbackoff automatically. See the API reference for all methods."
  },
  {
    "id": "knowledge-base-004",
    "source": "knowledge-base.md",
    "heading_path": [
      "Knowledge Base",
      "Troubleshooting"
    ],
    "tokens": 30,
    "content": "## Troubleshooting\nIf connection fails, check your network and API key. Enable debug logging\nwith DEBUG=app:* to see detailed traces."
  }
]

5 chunks · ~125 total tokens · avg ~25 tokens/chunk

Reviewers

Sarah Chen, SEO Content Strategist

Based on 6 sources
203 people find this tool helpful

What This Tool Does

Before a document can power a RAG system, it has to be chunked — and naive fixed-size chunking destroys the structure that makes Markdown valuable. This formatter splits along your document's natural heading boundaries while honoring a token budget, so each chunk is a coherent, appropriately sized unit ready to embed and retrieve.

Chunking Strategies

  • Heading + token cap (recommended): split on headings, then subdivide any oversized section.
  • By heading: one chunk per section at the chosen heading depth.
  • By token size: pure token-budget splitting, structure-agnostic.

The split depth controls which heading level starts a new chunk — H2 is a sensible default for most documents.

Overlap & Context

Overlap copies a configurable number of trailing tokens from each chunk into the next, so context isn't lost at boundaries. This matters when an answer spans the end of one section and the start of another — with overlap, a single retrieved chunk still contains the full picture.

Chunk Metadata

Every chunk carries a stable id, the source filename, a heading_path (the chain of headings above it), and an estimated tokens count. The heading path is especially useful: you can surface it as a citation, filter retrieval by section, or reconstruct where a chunk came from.

Export Formats

Export as JSON (one array), JSONL (one object per line, ideal for bulk-loading vector stores), or Markdown with YAML front-matter for human-readable chunk files. The footer reports the chunk count and token distribution so you can tune your settings.

Tips

  • Start with Heading + token cap, 512 tokens, 64 overlap, splitting at H2.
  • Match the max-token setting to your embedding model's optimal input size.
  • Use JSONL when loading directly into a vector database pipeline.

FAQ

Frequently Asked Questions