Web Scraped Data | Text collected from the internet via web crawling, spanning a wide range of topics and styles. | Common Crawl archives, public websites
All Publicly Available Websites | Content from publicly accessible websites, covering diverse topics and formats. | Various websites across the internet
Books | Digitized books, including fiction, non-fiction, and textbooks. | Project Gutenberg, Open Library
Academic Papers | Scholarly articles and research papers across various fields. | arXiv, PubMed
Wikipedia Articles | Crowdsourced encyclopedia entries covering diverse topics. | Wikipedia dumps
News Articles | Reports and articles from newspapers and media outlets. | News websites, RSS feeds
Code Repositories | Source code and documentation from programming projects. | GitHub, GitLab
Question-Answer Data | Data from Q&A forums where users ask and answer questions. | Stack Exchange, Quora
Dialogue Transcripts | Conversations between two or more parties, including informal chats. | Movie scripts, chat logs
Government Documents | Public records, legislative texts, and official reports. | Government websites, public databases
Legal Documents | Texts of laws, case studies, and legal analyses. | Court records, legal journals
Social Media Data | User-generated content from social platforms, including posts, comments, and messages. | Tweets, Reddit posts, Facebook posts
Product Review Data | User reviews and ratings of products and services. | Amazon reviews, Yelp, TripAdvisor
Transcriptions | Text converted from audio or video recordings. | Subtitles, speech-to-text corpora
Multilingual Data | Texts in various languages for training multilingual models. | Parallel corpora, multilingual websites
Medical Literature | Articles and papers related to medicine and healthcare. | Medical journals, clinical trial reports
Patent Documents | Technical documents describing inventions and processes. | Patent offices, USPTO database
Manuals & Documentation | Instructional materials and user guides. | Software docs, product manuals
Open-source Datasets | Curated datasets specifically prepared for training models. | OpenWebText, C4 dataset