40.2M High-Quality Multilingual Documents
Premium datasets in Spanish, Arabic, and Norwegian.
15B tokens per language. Every document scores 4+ on the FineWeb-Edu quality classifier and has high knowledge density.
Premium Multilingual Datasets
Carefully curated and validated datasets designed specifically for training high-performance language models
Spanish
13.4M documents
High-quality Spanish content covering diverse domains and text types
Arabic
13.4M documents
Premium Arabic text data covering diverse domains and professional content
Norwegian
13.4M documents
Comprehensive Norwegian content from educational and professional sources
Uncompromising Quality Standards
Every document in our datasets has been rigorously validated and filtered to ensure maximum training efficiency
FineWeb-Edu Validated
All documents scored 4+ on the FineWeb-Edu quality classifier
Deduplicated
Advanced deduplication algorithms ensure unique, high-value content
Rich Annotations
Every document annotated with title, topic classification, and format metadata
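
For readers who want a concrete picture of the filtering and deduplication steps described above, here is a minimal Python sketch. It assumes each document is a dict with hypothetical `text` and `score` fields, where `score` was produced by running the FineWeb-Edu classifier separately. Exact-match hashing on normalised text is a simple stand-in; the deduplication algorithm actually used for these datasets is not specified here.

```python
import hashlib


def filter_and_dedupe(docs, min_score=4.0):
    """Keep documents scoring at least min_score, dropping exact duplicates.

    `docs` is an iterable of dicts with hypothetical "text" and "score" keys.
    """
    seen = set()
    kept = []
    for doc in docs:
        # Quality filter: discard anything below the threshold.
        if doc["score"] < min_score:
            continue
        # Normalise case and whitespace so trivially different copies collide.
        normalised = " ".join(doc["text"].lower().split())
        digest = hashlib.sha256(normalised.encode("utf-8")).hexdigest()
        if digest in seen:
            continue
        seen.add(digest)
        kept.append(doc)
    return kept
```

Near-duplicate detection (for example MinHash) would catch more than this exact-match approach, at the cost of extra dependencies and tuning.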
Comprehensive Token Statistics
Detailed breakdown of document token distribution across all languages. Each dataset maintains consistent statistical properties.
Token Distribution per Document
Number of documents by token count range (consistent across Spanish, Arabic, and Norwegian)
The distribution peaks at around 200-600 tokens per document, which supports efficient training
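
A distribution like the one described above can be reproduced from per-document token counts with a small binning routine. The sketch below is illustrative only: the 200-token bin width is an assumption, and the token counts would come from whichever tokenizer you use, since none is named here.

```python
from collections import Counter


def token_histogram(token_counts, bin_width=200):
    """Count documents per token-count range, keyed by inclusive (low, high) bounds."""
    bins = Counter()
    for n in token_counts:
        # Round down to the start of the bin this document falls into.
        low = (n // bin_width) * bin_width
        bins[(low, low + bin_width - 1)] += 1
    # Return ranges in ascending order for easy display.
    return dict(sorted(bins.items()))
```

Calling `token_histogram(counts)` on the token counts for one language returns a mapping such as `(200, 399)` to the number of documents in that range, which can then be compared across languages.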
Request Dataset Access
Get in touch to discuss your requirements and receive access to our premium multilingual datasets
Request Access
Fill out the form below and we'll get back to you within 24 hours
What's Included:
- Complete dataset documentation
- Quality metrics and statistics
- Licensing and usage guidelines
- Technical support