| Web Scraped Data | Text collected from the internet via web crawling. It includes a vast array of topics and styles. | Common Crawl archives, public websites |
| All Publicly Available Websites | Content from websites that are accessible to the public, covering diverse topics and formats. | Various websites across the internet |
| Books | Digitized books covering fiction, non-fiction, textbooks, etc. | Project Gutenberg, Open Library |
| Academic Papers | Scholarly articles and research papers across various fields. | arXiv, PubMed |
| Wikipedia Articles | Crowdsourced encyclopedia entries covering diverse topics. | Wikipedia dumps |
| News Articles | Reports and articles from newspapers and media outlets. | News websites, RSS feeds |
| Code Repositories | Source code and documentation from programming projects. | GitHub, GitLab |
| Question-Answer Data | Data from Q&A forums where users ask and answer questions. | Stack Exchange, Quora |
| Dialogue Transcripts | Conversations between two or more parties, including informal chats. | Movie scripts, chat logs |
| Government Documents | Public records, legislative texts, and official reports. | Government websites, public databases |
| Legal Documents | Texts of laws, case studies, and legal analyses. | Court records, legal journals |
| Social Media Data | User-generated content from social platforms, including posts, comments, and messages. | Twitter tweets, Reddit posts, Facebook |
| Product Review Data | User reviews and ratings of products and services. | Amazon reviews, Yelp, TripAdvisor |
| Transcriptions | Text converted from audio or video recordings. | Subtitles, speech-to-text corpora |
| Multilingual Data | Texts in various languages for training multilingual models. | Parallel corpora, multilingual websites |
| Medical Literature | Articles and papers related to medicine and healthcare. | Medical journals, clinical trial reports |
| Patent Documents | Technical documents describing inventions and processes. | Patent offices, USPTO database |
| Manuals & Documentation | Instructional materials and user guides. | Software docs, product manuals |
| Open-source Datasets | Curated datasets specifically prepared for training models. | OpenWebText, C4 dataset |
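
One way a data pipeline might use the categories in the table is to tag each crawled document with a source category based on its URL. The following is a minimal sketch under stated assumptions, not a description of how any particular corpus was built: the `CATEGORY_BY_DOMAIN` mapping, the `DEFAULT_CATEGORY` fallback, and the `categorize` function are hypothetical names introduced for illustration, and the domain list covers only a few of the example sources named above.

```python
from urllib.parse import urlparse

# Hypothetical mapping from hostnames to categories in the table.
# Only a handful of the example sources are listed; a real pipeline
# would need a far larger and more carefully maintained mapping.
CATEGORY_BY_DOMAIN = {
    "gutenberg.org": "Books",
    "openlibrary.org": "Books",
    "arxiv.org": "Academic Papers",
    "pubmed.ncbi.nlm.nih.gov": "Academic Papers",
    "wikipedia.org": "Wikipedia Articles",
    "github.com": "Code Repositories",
    "gitlab.com": "Code Repositories",
    "stackexchange.com": "Question-Answer Data",
    "quora.com": "Question-Answer Data",
    "reddit.com": "Social Media Data",
    "yelp.com": "Product Review Data",
    "tripadvisor.com": "Product Review Data",
}

# Assumption: anything not matched falls back to the generic web category.
DEFAULT_CATEGORY = "Web Scraped Data"


def categorize(url: str) -> str:
    """Return the table category for a URL, matching on hostname or subdomain."""
    host = (urlparse(url).hostname or "").lower()
    for domain, category in CATEGORY_BY_DOMAIN.items():
        # Match the domain itself or any subdomain of it.
        if host == domain or host.endswith("." + domain):
            return category
    return DEFAULT_CATEGORY


if __name__ == "__main__":
    for example in (
        "https://en.wikipedia.org/wiki/Web_crawler",
        "https://arxiv.org/abs/0000.00000",
        "https://example.com/some-page",
    ):
        print(example, "->", categorize(example))
```

Hostname matching is only a coarse heuristic: categories such as Dialogue Transcripts or Multilingual Data cannot be inferred from a URL alone and would need content-based signals instead.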