Dataset Format
We accept JSONL (.jsonl) files only, and each line must be a valid JSON object. Validation runs before payment, so you can catch issues early; if validation fails, update the file and retry.
Accepted format: JSONL only
We accept .jsonl (JSON Lines) files only. Each line in the file must be a standalone, valid JSON object. This ensures every record is parsed reliably by the trainer with no ambiguity.
Why JSONL only?
Our training pipeline reads your dataset line by line and parses each line as a JSON object. Other formats (CSV, TXT, plain JSON arrays) cannot be reliably mapped to the structured record shapes the trainer expects. JSONL eliminates format ambiguity and ensures your data trains exactly as intended.
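The line-by-line parsing described above can be sketched in a few lines of Python. This is an illustrative reader, not BeaverYard's actual validator; the function name `read_jsonl` is our own.

```python
import json

def read_jsonl(path):
    """Yield one parsed record per non-empty line, mirroring line-by-line parsing."""
    with open(path, encoding="utf-8") as f:  # raises UnicodeDecodeError if not UTF-8
        for lineno, line in enumerate(f, start=1):
            line = line.strip()
            if not line:
                continue  # blank lines carry no record
            try:
                record = json.loads(line)
            except json.JSONDecodeError as exc:
                raise ValueError(f"line {lineno}: invalid JSON ({exc})")
            if not isinstance(record, dict):
                raise ValueError(f"line {lineno}: expected a JSON object")
            yield record
```

Because each record is a standalone line, a single malformed row can be reported (and fixed) by line number without re-parsing the rest of the file.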
Supported record structures
- Chat format (recommended for instruct models): `{"messages": [{"role": "user", "content": "Hello"}, {"role": "assistant", "content": "Bonjour"}]}`
- Text format (raw training text): `{"text": "This is a training example."}`
- Instruction format (Alpaca-style): `{"instruction": "Translate to French", "input": "Hello", "output": "Bonjour"}`
Required fields
Each JSON object must contain one of these field structures:
- Chat: A `messages` array with `role` and `content` fields (roles: `system`, `user`, `assistant`).
- Text: A `text` field containing the training example as a string.
- Instruction: An `instruction` field, an optional `input`, and an `output` field.
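The three required field structures can be checked with a small classifier. This is a minimal sketch based only on the field names listed above; the helper name `classify_record` is our own, and the real validator may apply stricter rules.

```python
def classify_record(record: dict) -> str:
    """Return which supported structure a record matches, or raise ValueError."""
    allowed_roles = {"system", "user", "assistant"}
    if "messages" in record:
        msgs = record["messages"]
        if (isinstance(msgs, list) and msgs
                and all(isinstance(m, dict)
                        and m.get("role") in allowed_roles
                        and isinstance(m.get("content"), str)
                        for m in msgs)):
            return "chat"
        raise ValueError("malformed messages array")
    if isinstance(record.get("text"), str):
        return "text"
    if isinstance(record.get("instruction"), str) and isinstance(record.get("output"), str):
        return "instruction"  # "input" is optional in the Alpaca-style format
    raise ValueError("record matches no supported structure")
```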
Tier limits
Make sure your dataset fits within BeaverYard's published caps for total size, token estimate, record count, and max line length. Final tier assignment happens automatically after upload and validation.
- Launch (S/M/L): Lower-capacity pricing bands for smaller datasets.
- Orbit (S/M/L): Higher-capacity pricing bands for larger datasets.
- All tiers: JSONL only, UTF-8 only, max 200,000 records/lines, and max line length of 20,000 characters.
Check the Pricing page for exact byte limits per tier.
Validation checks
- File must be a `.jsonl` file with valid JSON on every non-empty line.
- File must be UTF-8 encoded and contain at least one valid record.
- Each non-empty line must be a JSON object using one of the supported record structures.
- Record count over 200,000 or any line over 20,000 characters fails validation before any tier is assigned.
- Otherwise, dataset size and token estimate determine the required tier, and the higher required tier wins if those metrics disagree.
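The hard caps above (record count and line length) can be checked locally before uploading. A minimal pre-check sketch, assuming the published all-tier limits of 200,000 records and 20,000 characters per line; the function name `precheck` is our own:

```python
MAX_RECORDS = 200_000     # published cap, all tiers
MAX_LINE_CHARS = 20_000   # per-line cap, all tiers

def precheck(path):
    """Fail fast on the record-count and line-length caps before uploading."""
    count = 0
    with open(path, encoding="utf-8") as f:
        for lineno, line in enumerate(f, start=1):
            stripped = line.rstrip("\n")
            if not stripped.strip():
                continue  # blank lines are not records
            if len(stripped) > MAX_LINE_CHARS:
                raise ValueError(f"LINE_TOO_LONG: line {lineno} has {len(stripped)} characters")
            count += 1
    if count == 0:
        raise ValueError("dataset contains no records")
    if count > MAX_RECORDS:
        raise ValueError(f"too many records: {count} > {MAX_RECORDS}")
    return count
```

Note this covers only the fixed caps; the size- and token-based tier assignment still happens server-side after upload.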
Per-row limit vs total dataset limit
There are two separate size constraints — don't confuse them:
- Max dataset tokens (5.5M–28M): This is the total combined token estimate for your entire dataset. Checked against BeaverYard's published tier limits. Final tier assignment happens automatically after upload and validation.
- Max line length (20,000 characters / ~5,000 tokens): This is the limit for one single row or record inside the dataset. Every tier enforces this limit. It ensures each entry fits safely inside the AI's context window during training and prevents silent data truncation.
If you get a LINE_TOO_LONG error: find and shorten the individual rows that exceed 20,000 characters. This is a per-row limit, not a file size limit — even a 1 KB file can fail this check if a single entry is too long.
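To locate the offending rows, you can scan the file for lines over the per-row limit. A minimal sketch; the helper name `find_long_lines` is our own:

```python
def find_long_lines(path, limit=20_000):
    """Return (line_number, length) for every line exceeding the per-row limit."""
    with open(path, encoding="utf-8") as f:
        return [(i, len(line.rstrip("\n")))
                for i, line in enumerate(f, start=1)
                if len(line.rstrip("\n")) > limit]
```

Each returned line number points at a single record to shorten or split.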
Dataset quality is your responsibility
We validate format and published caps, but we do not score or improve dataset quality for you.
- We do not rewrite, deduplicate, or clean your dataset automatically.
- We do not judge whether your examples are high-quality for your use case.
- Training results depend on the quality and relevance of the data you upload.
Common mistakes
If your dataset fails validation, it likely falls into one of these categories (which are detailed further in our Troubleshooting Guide):
- Format mismatch
- Missing required fields
- Published tier limits exceeded
- Per-line or total-size limits exceeded