Most teams hit the same wall when fine-tuning GPT-style models: compiling clean, realistic training data takes forever. At AI Tech Inspire, we spotted a newly open-sourced utility that aims squarely at that bottleneck—turning domain PDFs and documents into ready-to-train JSONL files via an agentic retrieval pipeline and the OpenAI API. If your backlog includes “convert client PDFs into instruction pairs,” this is one you’ll want to evaluate.

What was open-sourced, in plain terms

  • Open-source tool that converts domain documents (e.g., PDFs) into realistic training datasets.
  • Uses the Agno agentic framework for knowledge retrieval and orchestration.
  • Generates examples via the OpenAI API.
  • Outputs JSONL formatted data ready for LLM fine-tuning.
  • Reportedly applied to medical compliance, a legal proof of concept, and financial regulations.
  • Quality depends on the source material; prompt tweaks are expected.
  • Designed to reduce the manual time spent creating fine-tuning datasets.
  • A GitHub repository is available for developers to try.

Why this matters for builders

The fine-tuning run itself is often easier than preparing the dataset. Many teams already use RAG-like pipelines for production, but eventually want their model to demonstrate native fluency in their domain—without always relying on retrieval at runtime. That shift requires high-quality, diverse, and representative examples. The challenge: clients hand over binders of policy PDFs, SLAs, playbooks, or technical specs, and someone has to translate that into instruction/response pairs, classification examples, or structured extraction targets.

Manually generating these examples is slow, inconsistent, and expensive. It also tends to bias toward simple cases. An automated pipeline that (1) ingests docs, (2) retrieves context, (3) composes prompts, (4) synthesizes candidate examples, and (5) exports clean JSONL could reduce the cycle time from weeks to days. That’s the practical promise behind this tool.

“Garbage in, garbage out” still applies. The approach amplifies good source material—and faithfully reflects messy data when that’s what it’s given.

How the pipeline likely works (mental model)

While every implementation differs, here’s a reasonable mental model based on the description:

  • Document ingestion: PDFs and related files are parsed and chunked (e.g., by headings or semantic length).
  • Agentic retrieval: Using the Agno framework, an agent retrieves relevant chunks per topic, objective, or schema.
  • Prompted synthesis: With retrieved context, the pipeline asks the OpenAI API to generate training pairs—Q&A, instructions, rationales, or structured fields.
  • Validation and formatting: Outputs are normalized into JSONL suitable for fine-tuning endpoints (e.g., message-style examples).

Expect to iterate on prompt templates. For example, you might dial up scenario complexity, enforce structured outputs, or request rationales for later pruning. Small prompt tweaks can significantly change data variety and usefulness.
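
To make that mental model concrete, here is a minimal Python sketch of the synthesis and export steps, assuming retrieval has already produced text chunks (the actual project delegates that part to Agno agents, and its real interface may differ). Names like PROMPT_TEMPLATE and generate_pair are illustrative, not the repo’s API.

# Minimal sketch of the synthesis step; assumes chunks were already retrieved.
import json
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

PROMPT_TEMPLATE = (
    "You write fine-tuning examples for a policy-compliance assistant.\n"
    "Using ONLY the context below, produce one realistic question a staff member\n"
    "might ask and a grounded answer.\n\n"
    "Context:\n{context}\n\n"
    'Reply as JSON: {{"question": "...", "answer": "..."}}'
)

def generate_pair(context: str, model: str = "gpt-4o-mini") -> dict:
    # Ask the model for one question/answer pair grounded in a retrieved chunk.
    resp = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": PROMPT_TEMPLATE.format(context=context)}],
        response_format={"type": "json_object"},
    )
    return json.loads(resp.choices[0].message.content)

def write_jsonl(pairs: list[dict], path: str = "train.jsonl") -> None:
    # Normalize pairs into message-style JSONL ready for fine-tuning.
    with open(path, "w", encoding="utf-8") as f:
        for p in pairs:
            record = {"messages": [
                {"role": "system", "content": "You are a helpful assistant for policy compliance."},
                {"role": "user", "content": p["question"]},
                {"role": "assistant", "content": p["answer"]},
            ]}
            f.write(json.dumps(record, ensure_ascii=False) + "\n")

Swapping in a different PROMPT_TEMPLATE (more scenario complexity, rationales, stricter schemas) is exactly the iteration loop described above.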


What domains this could help immediately

  • Compliance and policy: HIPAA summaries, SOC 2 control Q&A, internal policy application to scenarios.
  • Legal prototypes: Clause classification, obligation extraction, scenario-based question generation for proofs of concept.
  • Financial regulations: Rule interpretation, exception handling, and documentation-based checks.

These areas share a feature: dense text with structured logic hiding inside. Automatic synthesis can surface edge cases that busy humans might overlook on a first pass.

Fine-tuning vs. RAG: where this fits

RAG is often the fastest path to production. But teams frequently blend strategies:

  • Fine-tune for style, policy adherence, and task formats you always want “baked in.”
  • RAG for freshness and long-tail knowledge that changes often.

Automated dataset creation helps with both: you can fine-tune a base policy-following behavior while keeping retrieval for the newest rules. Tools like LangChain or LlamaIndex can still power your runtime retrieval, while this generator accelerates building the fine-tuning data that teaches domain-specific responses.


Quickstart-style walkthrough (conceptual)

While exact commands depend on the repo, a typical flow looks like this:

  • Prepare a folder of clean PDFs or text exports.
  • Configure your OPENAI_API_KEY (consider export OPENAI_API_KEY=... locally or a secure secret manager).
  • Run the script/agent to index docs and generate examples.
  • Open the resulting .jsonl and spot-check for quality, leakage, and formatting.
  • Feed the file into your fine-tuning endpoint (e.g., see OpenAI fine-tuning docs or tools on Hugging Face).

A minimal line might resemble:

{"messages": [{"role": "system", "content": "You are a helpful assistant for policy compliance."}, {"role": "user", "content": "Does Section 4.2 allow storing PII in third-party tools?"}, {"role": "assistant", "content": "No. Section 4.2 requires vendor DPAs and encryption at rest with AES-256."}]}

Adjust for your target format—some endpoints prefer prompt/completion, others use messages.
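
For instance, if your trainer expects prompt/completion pairs rather than messages, a small conversion pass is enough. A minimal sketch, assuming the message-style lines shown above and illustrative file names:

# Convert message-style JSONL into prompt/completion pairs.
import json

def messages_to_prompt_completion(in_path: str, out_path: str) -> None:
    with open(in_path, encoding="utf-8") as src, open(out_path, "w", encoding="utf-8") as dst:
        for line in src:
            msgs = json.loads(line)["messages"]
            user = next(m["content"] for m in msgs if m["role"] == "user")
            assistant = next(m["content"] for m in msgs if m["role"] == "assistant")
            dst.write(json.dumps({"prompt": user, "completion": assistant}, ensure_ascii=False) + "\n")

messages_to_prompt_completion("train.jsonl", "train_prompt_completion.jsonl")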

How to judge dataset quality

Before burning API credits on fine-tuning, try this checklist:

  • Diversity: Do examples cover edge cases, ambiguity, and exceptions—not just happy paths?
  • Grounding: Are responses traceable to specific clauses? Consider adding a hidden “source span” for internal audits.
  • Structure: Maintain consistent schemas for extraction tasks; use explicit keys in outputs.
  • Leakage control: Remove proprietary content you don’t intend the model to memorize, or switch to a local stack if necessary.
  • Deduplication: Collapse near-duplicates to avoid overweighting any pattern (see the sketch after this list).
  • Human-in-the-loop: Sample and review 1–5% of lines. A short review loop often lifts quality dramatically.
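
For the deduplication point above, a rough first pass can be as simple as normalizing the user prompt and dropping repeats. A minimal sketch, assuming the message-style JSONL shown earlier; fuzzier matching (embeddings, MinHash) would catch more near-duplicates:

# Drop rows whose normalized user prompt was already seen.
import json

def dedupe_jsonl(in_path: str, out_path: str) -> None:
    seen: set[str] = set()
    with open(in_path, encoding="utf-8") as src, open(out_path, "w", encoding="utf-8") as dst:
        for line in src:
            record = json.loads(line)
            user = next(m["content"] for m in record["messages"] if m["role"] == "user")
            key = " ".join(user.lower().split())  # collapse case and whitespace
            if key in seen:
                continue
            seen.add(key)
            dst.write(line)

dedupe_jsonl("train.jsonl", "train_deduped.jsonl")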

How it compares to common approaches

  • Manual authoring: Highest control, slowest throughput. The open-source tool aims to speed up that initial draft by roughly 10x, keeping humans for review and rejection sampling.
  • Synthetic pipelines with LangChain/LlamaIndex: Similar in spirit; this project uses Agno for the agentic piece. If your team standardizes on one framework, portability matters.
  • Program synthesis with DSPy: Offers declarative optimization of prompts and subroutines. This open-source tool looks more turnkey, focused on dataset output.
  • Public datasets on Hugging Face: Great for baselines, but domain specificity still requires custom examples from your policies and PDFs.

If you plan to fine-tune using PyTorch or TensorFlow locally, the JSONL outputs still help—you can convert them to your trainer’s preferred schema and use libraries like datasets on Hugging Face to iterate.
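
As a sketch of that conversion, the datasets library can load the JSONL directly and map it into whatever flat text format your local trainer expects; the chat-to-text template below is a placeholder, not your base model’s actual chat format:

# Load generated JSONL with Hugging Face `datasets` and flatten it for a local trainer.
from datasets import load_dataset

ds = load_dataset("json", data_files="train.jsonl", split="train")

def to_text(example):
    # Placeholder template; substitute your base model's real chat format here.
    msgs = example["messages"]
    example["text"] = "\n".join(f"{m['role']}: {m['content']}" for m in msgs)
    return example

ds = ds.map(to_text)
print(ds[0]["text"])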

Cost, privacy, and operational notes

  • API costs: Synthetic generation isn’t free. Budget for multiple passes as you iterate prompts and filters.
  • Data governance: If documents contain sensitive or regulated data, confirm contractual terms before using third-party APIs. Consider redaction, synthetic red-teaming, or local model variants.
  • Model drift: Regulated domains evolve. Time-box your datasets and note the policy versions used during generation.
  • Eval sets: Keep a holdout set authored or validated by subject-matter experts. Don’t train on your evaluation questions.
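
One lightweight way to enforce that separation is to carve out a slice of the generated file before any training run and hand it to subject-matter experts for validation. A minimal sketch with assumed file names and split ratio:

# Reserve ~10% of rows as a holdout for SME review; never train on these.
import random

random.seed(13)  # fixed seed so the split is reproducible
with open("train.jsonl", encoding="utf-8") as f:
    rows = f.readlines()

random.shuffle(rows)
cut = max(1, int(0.1 * len(rows)))
with open("holdout.jsonl", "w", encoding="utf-8") as f:
    f.writelines(rows[:cut])
with open("train_only.jsonl", "w", encoding="utf-8") as f:
    f.writelines(rows[cut:])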

Practical prompt ideas to boost quality

  • Ask for contrastive pairs: one correct answer and several plausible-but-wrong distractors, labeled explicitly.
  • Request structured rationales, then strip them for training but keep them for QA review.
  • Generate scenario templates (“new hire onboarding,” “data deletion request”) and vary entities, dates, and constraints.
  • Enforce schemas in outputs: e.g., {category, clause_id, policy_basis, answer, confidence}.
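
As a sketch of that last point, a post-generation filter can drop any row missing the expected keys. The key names mirror the example above and are assumptions about your own output schema, not the tool’s:

# Keep only rows that parse as JSON and carry every required key.
import json

REQUIRED_KEYS = {"category", "clause_id", "policy_basis", "answer", "confidence"}

def keep_valid(in_path: str, out_path: str) -> int:
    kept = 0
    with open(in_path, encoding="utf-8") as src, open(out_path, "w", encoding="utf-8") as dst:
        for line in src:
            try:
                row = json.loads(line)
            except json.JSONDecodeError:
                continue  # drop malformed lines outright
            if REQUIRED_KEYS.issubset(row):
                dst.write(json.dumps(row, ensure_ascii=False) + "\n")
                kept += 1
    return kept

print(keep_valid("extraction_raw.jsonl", "extraction_clean.jsonl"), "rows kept")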

Who should try this

Teams that maintain large policy or knowledge repositories and need to move beyond RAG-only solutions. Consulting firms building domain copilots. Platform teams who want a repeatable, documented way to produce training data for internal models. If your developers already juggle retrieval frameworks and fine-tuning scripts, adding an automated data generator can shave weeks off client onboarding.

Conversely, if your domain is highly sensitive and you can’t use external APIs, consider swapping the generator for local models or adapting the pipeline to run on on-prem hardware (with CUDA and supporting infrastructure in place). The conceptual workflow still applies.


Bottom line from AI Tech Inspire

This open-source release won’t magically solve data quality, but it reframes the effort: spend less time drafting examples and more time curating, rejecting, and refining. That’s the right direction. The inclusion of an agentic retrieval layer is a smart touch; it nudges examples to stay grounded in the source corpus rather than drifting into generic internet answers.

If dataset preparation has been the hardest part of your fine-tuning efforts, this project is worth a test run. Start with a narrow slice of your corpus, generate a few hundred examples, and run a small fine-tune. Measure real shifts in behavior—format adherence, policy fidelity, and error rates on tricky cases. If the lift is there, scale thoughtfully.

The GitHub repository is available for developers ready to dive in. And if you adopt it, tell us what you learn—AI Tech Inspire is always on the lookout for practical, repeatable patterns that save builders time without sacrificing trust.
