| Type of Data | Description | Source Examples |
| --- | --- | --- |
| Web Scraped Data | Text collected from the internet via web crawling. It includes a vast array of topics and styles. | Common Crawl archives, public websites |
| All Publicly Available Websites | Content from websites that are accessible to the public, covering diverse topics and formats. | Various websites across the internet |
| Books | Digitized books covering fiction, non-fiction, textbooks, etc. | Project Gutenberg, Open Library |
| Academic Papers | Scholarly articles and research papers across various fields. | arXiv, PubMed |
| Wikipedia Articles | Crowdsourced encyclopedia entries covering diverse topics. | Wikipedia dumps |
| News Articles | Reports and articles from newspapers and media outlets. | News websites, RSS feeds |
| Code Repositories | Source code and documentation from programming projects. | GitHub, GitLab |
| Question-Answer Data | Data from Q&A forums where users ask and answer questions. | Stack Exchange, Quora |
| Dialogue Transcripts | Conversations between two or more parties, including informal chats. | Movie scripts, chat logs |
| Government Documents | Public records, legislative texts, and official reports. | Government websites, public databases |
| Legal Documents | Texts of laws, case studies, and legal analyses. | Court records, legal journals |
| Social Media Data | User-generated content from social platforms, including posts, comments, and messages. | Twitter tweets, Reddit posts, Facebook |
| Product Review Data | User reviews and ratings of products and services. | Amazon reviews, Yelp, TripAdvisor |
| Transcriptions | Text converted from audio or video recordings. | Subtitles, speech-to-text corpora |
| Multilingual Data | Texts in various languages for training multilingual models. | Parallel corpora, multilingual websites |
| Medical Literature | Articles and papers related to medicine and healthcare. | Medical journals, clinical trial reports |
| Patent Documents | Technical documents describing inventions and processes. | Patent offices, USPTO database |
| Manuals & Documentation | Instructional materials and user guides. | Software docs, product manuals |
| Open-source Datasets | Curated datasets specifically prepared for training models. | OpenWebText, C4 dataset |
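
When this catalog is used in tooling, it may help to keep the rows as structured records instead of free text. The following is a minimal sketch, not a prescribed schema: the `DataSource` class, the `CATALOG` list, and the `find_by_example` helper are hypothetical names introduced here for illustration, and only three rows from the table are copied in.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class DataSource:
    """One row of the table: data type, description, and source examples."""
    data_type: str
    description: str
    examples: Tuple[str, ...]


# Hypothetical in-memory catalog holding a few rows copied from the table.
CATALOG: List[DataSource] = [
    DataSource(
        "Web Scraped Data",
        "Text collected from the internet via web crawling.",
        ("Common Crawl archives", "public websites"),
    ),
    DataSource(
        "Academic Papers",
        "Scholarly articles and research papers across various fields.",
        ("arXiv", "PubMed"),
    ),
    DataSource(
        "Code Repositories",
        "Source code and documentation from programming projects.",
        ("GitHub", "GitLab"),
    ),
]


def find_by_example(catalog: List[DataSource], name: str) -> List[str]:
    """Return the data types whose source examples mention `name` (case-insensitive)."""
    needle = name.lower()
    return [
        source.data_type
        for source in catalog
        if any(needle in example.lower() for example in source.examples)
    ]


print(find_by_example(CATALOG, "github"))
```

A lookup such as `find_by_example(CATALOG, "arxiv")` answers which category a given source falls under; the remaining rows can be added in the same form.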