Dataset Format
We accept JSONL (.jsonl) files only, and each line must be a valid JSON object. Validation runs before payment, so you can catch issues early; if validation fails, update the file and retry.
Accepted format: JSONL only
We accept .jsonl (JSON Lines) files only. Each line in the file must be a standalone, valid JSON object. This ensures every record is parsed reliably by the trainer with no ambiguity.
Why JSONL only?
Our training pipeline reads your dataset line by line and parses each line as a JSON object. Other formats (CSV, TXT, plain JSON arrays) cannot be reliably mapped to the structured record shapes the trainer expects. JSONL eliminates format ambiguity and ensures your data trains exactly as intended.
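The line-by-line parsing described above can be sketched in a few lines of Python. This is an illustrative reader, not BeaverYard's actual validator; the function name `read_jsonl` is our own.

```python
import json

def read_jsonl(path):
    """Yield one parsed record per non-empty line, mirroring line-by-line parsing."""
    with open(path, encoding="utf-8") as f:  # raises UnicodeDecodeError if not UTF-8
        for lineno, line in enumerate(f, start=1):
            line = line.strip()
            if not line:
                continue  # blank lines carry no record
            try:
                record = json.loads(line)
            except json.JSONDecodeError as exc:
                raise ValueError(f"line {lineno}: invalid JSON ({exc})")
            if not isinstance(record, dict):
                raise ValueError(f"line {lineno}: expected a JSON object")
            yield record
```

Because each record is a standalone line, a single malformed row can be reported (and fixed) by line number without re-parsing the rest of the file.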
Supported record structures
- Chat format (recommended for instruct models): `{"messages": [{"role": "user", "content": "Hello"}, {"role": "assistant", "content": "Bonjour"}]}`
- Text format (raw training text): `{"text": "This is a training example."}`
- Instruction format (Alpaca-style): `{"instruction": "Translate to French", "input": "Hello", "output": "Bonjour"}`
Required fields
Each JSON object must contain one of these field structures:
- Chat: A `messages` array with `role` and `content` fields (roles: `system`, `user`, `assistant`).
- Text: A `text` field containing the training example as a string.
- Instruction: An `instruction` field, an optional `input`, and an `output` field.
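The three required field structures can be checked with a small classifier. This is a minimal sketch based only on the field names listed above; the helper name `classify_record` is our own, and the real validator may apply stricter rules.

```python
def classify_record(record: dict) -> str:
    """Return which supported structure a record matches, or raise ValueError."""
    allowed_roles = {"system", "user", "assistant"}
    if "messages" in record:
        msgs = record["messages"]
        if (isinstance(msgs, list) and msgs
                and all(isinstance(m, dict)
                        and m.get("role") in allowed_roles
                        and isinstance(m.get("content"), str)
                        for m in msgs)):
            return "chat"
        raise ValueError("malformed messages array")
    if isinstance(record.get("text"), str):
        return "text"
    if isinstance(record.get("instruction"), str) and isinstance(record.get("output"), str):
        return "instruction"  # "input" is optional in the Alpaca-style format
    raise ValueError("record matches no supported structure")
```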
Tier limits
Make sure your dataset fits within BeaverYard's published caps for total size, token estimate, record count, and max line length. Final tier assignment happens automatically after upload and validation.
- Launch (S/M/L): Lower-capacity pricing bands for smaller datasets.
- Orbit (S/M/L): Higher-capacity pricing bands for larger datasets.
- All tiers: JSONL only, UTF-8 only, max 200,000 records/lines, and max line length of 20,000 characters.
Check the Pricing page for exact byte limits per tier.
Validation checks
- File must be a `.jsonl` file with valid JSON on every non-empty line.
- File must be UTF-8 encoded and contain at least one valid record.
- Each non-empty line must be a JSON object using one of the supported record structures.
- Record count over 200,000 or any line over 20,000 characters fails validation before any tier is assigned.
- Otherwise, dataset size and token estimate determine the required tier, and the higher required tier wins if those metrics disagree.
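The hard caps above (record count and line length) can be checked locally before uploading. A minimal pre-check sketch, assuming the published all-tier limits of 200,000 records and 20,000 characters per line; the function name `precheck` is our own:

```python
MAX_RECORDS = 200_000     # published cap, all tiers
MAX_LINE_CHARS = 20_000   # per-line cap, all tiers

def precheck(path):
    """Fail fast on the record-count and line-length caps before uploading."""
    count = 0
    with open(path, encoding="utf-8") as f:
        for lineno, line in enumerate(f, start=1):
            stripped = line.rstrip("\n")
            if not stripped.strip():
                continue  # blank lines are not records
            if len(stripped) > MAX_LINE_CHARS:
                raise ValueError(f"LINE_TOO_LONG: line {lineno} has {len(stripped)} characters")
            count += 1
    if count == 0:
        raise ValueError("dataset contains no records")
    if count > MAX_RECORDS:
        raise ValueError(f"too many records: {count} > {MAX_RECORDS}")
    return count
```

Note this covers only the fixed caps; the size- and token-based tier assignment still happens server-side after upload.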
Per-row limit vs total dataset limit
There are two separate size constraints — don't confuse them:
- Max dataset tokens (5.5M–28M): This is the total combined token estimate for your entire dataset. Checked against BeaverYard's published tier limits. Final tier assignment happens automatically after upload and validation.
- Max line length (20,000 characters / ~5,000 tokens): This is the limit for one single row or record inside the dataset. Every tier enforces this limit. It ensures each entry fits safely inside the AI's context window during training and prevents silent data truncation.
If you get a LINE_TOO_LONG error: find and shorten the individual rows that exceed 20,000 characters. This is a per-row limit, not a file size limit — even a 1 KB file can fail this check if a single entry is too long.
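To locate the offending rows, you can scan the file for lines over the per-row limit. A minimal sketch; the helper name `find_long_lines` is our own:

```python
def find_long_lines(path, limit=20_000):
    """Return (line_number, length) for every line exceeding the per-row limit."""
    with open(path, encoding="utf-8") as f:
        return [(i, len(line.rstrip("\n")))
                for i, line in enumerate(f, start=1)
                if len(line.rstrip("\n")) > limit]
```

Each returned line number points at a single record to shorten or split.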
Dataset quality is your responsibility
We validate format and published caps, but we do not score or improve dataset quality for you.
- We do not rewrite, deduplicate, or clean your dataset automatically.
- We do not judge whether your examples are high-quality for your use case.
- Training results depend on the quality and relevance of the data you upload.
Common mistakes
If your dataset fails validation, it likely falls into one of these categories (which are detailed further in our Troubleshooting Guide):
- Format mismatch
- Missing required fields
- Published tier limits exceeded
- Per-line or total-size limits exceeded