Semantic Scholar Open Research Corpus (S2ORC)
A vast corpus of 8.1M open-access papers with rich metadata, abstracts, references, and full text.
Executive Summary
The Semantic Scholar Open Research Corpus (S2ORC) is a general-purpose dataset of 8.1 million open-access academic papers spanning a wide range of disciplines. Each paper comes with rich metadata, an abstract, resolved bibliographic references, and structured full text. The corpus is designed to support research and development in natural language processing, information retrieval, and machine learning applied to scientific documents. Structured access to this collection makes it possible to train large language models, build text classification systems, and explore academic trends and connections. Because the papers are open access, the corpus is broadly usable across the academic and technical communities.
Use Cases
- Training and fine-tuning Large Language Models (LLMs) for scientific domains.
- Developing and evaluating text classification and information retrieval systems for academic papers.
- Building applications that require programmatic access to a large corpus of scientific literature.
- Enriching metadata and facilitating discovery in academic reference management tools.
- Analyzing academic trends, citations, and research landscapes across disciplines.
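As an illustration of the citation and trend analysis use case, the following is a minimal sketch that counts inbound citations and papers per year over a toy sample. The record layout and the field names (`paper_id`, `year`, `outbound_citations`) are assumptions made for this example, not a confirmed S2ORC schema.

```python
from collections import Counter

# Toy records; the field names are illustrative assumptions,
# not a confirmed S2ORC schema.
PAPERS = [
    {"paper_id": "p1", "year": 2018, "outbound_citations": []},
    {"paper_id": "p2", "year": 2019, "outbound_citations": ["p1"]},
    {"paper_id": "p3", "year": 2019, "outbound_citations": ["p1", "p2"]},
]


def inbound_citation_counts(papers):
    """Count how often each paper is cited by other papers in the sample."""
    known_ids = {p["paper_id"] for p in papers}
    counts = Counter({pid: 0 for pid in known_ids})
    for paper in papers:
        for cited_id in paper["outbound_citations"]:
            if cited_id in known_ids:  # ignore citations that leave the sample
                counts[cited_id] += 1
    return dict(counts)


def papers_per_year(papers):
    """Count papers by publication year."""
    return dict(Counter(p["year"] for p in papers))


print(inbound_citation_counts(PAPERS))
print(papers_per_year(PAPERS))
```

On a real download, the same two passes would run over streamed records rather than an in-memory list.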
Features
Intelligence
- Rich Metadata Extraction: Automated extraction and structuring of comprehensive metadata including authors, affiliations, venues, and publication dates.
- Full Text Structuring: Processing of full text to identify sections, paragraphs, and key entities, enabling deeper textual analysis.
- Reference Resolution: Automated resolution and linking of bibliographic references within papers.
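The three features above suggest a record that carries metadata, sectioned full text, and references that may or may not be linked to another paper. The sketch below models such a record with Python dataclasses. All class and field names are hypothetical and chosen for illustration; the actual corpus schema may differ.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Reference:
    raw_text: str                             # bibliography entry as printed
    resolved_paper_id: Optional[str] = None   # None if the reference was not linked


@dataclass
class Paragraph:
    section: str   # section heading the paragraph belongs to
    text: str


@dataclass
class PaperRecord:
    """Hypothetical record shape; not the official S2ORC schema."""
    paper_id: str
    title: str
    authors: List[str]
    venue: Optional[str]
    year: Optional[int]
    abstract: str
    body: List[Paragraph] = field(default_factory=list)
    references: List[Reference] = field(default_factory=list)

    def text_by_section(self) -> Dict[str, str]:
        """Join paragraphs under each section heading, keeping document order."""
        sections: Dict[str, List[str]] = {}
        for para in self.body:
            sections.setdefault(para.section, []).append(para.text)
        return {name: "\n".join(parts) for name, parts in sections.items()}

    def resolved_reference_ids(self) -> List[str]:
        """Return IDs of references that were linked to a paper."""
        return [r.resolved_paper_id for r in self.references
                if r.resolved_paper_id is not None]


record = PaperRecord(
    paper_id="p3",
    title="An Example Paper",
    authors=["A. Author"],
    venue=None,
    year=2019,
    abstract="A short abstract.",
    body=[
        Paragraph("Introduction", "First."),
        Paragraph("Introduction", "Second."),
        Paragraph("Methods", "Third."),
    ],
    references=[
        Reference("Author (2018). Earlier work.", "p1"),
        Reference("Unknown technical report."),
    ],
)

print(record.text_by_section())
print(record.resolved_reference_ids())
```

Keeping unresolved references in the list, with `resolved_paper_id` left as `None`, preserves the original bibliography while still making the linked subset easy to extract.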
Support
- Email Support: Assistance available via email for technical queries and data usage.
- Online Documentation: Comprehensive API documentation and tutorials for developers.
Technical Specifications
- Architecture: Leverages AI/ML for data extraction and structuring, accessible via RESTful API for programmatic interaction.
- Deployment: Dataset (downloadable) and Cloud API
- API Available: Yes
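For programmatic access, a request to a RESTful API usually comes down to building a URL and issuing a GET. The sketch below only builds the URL and makes no network call. The base URL, the `/paper/{id}` path, and the `fields` query parameter are assumptions based on the public Semantic Scholar Graph API, since this listing does not specify endpoints. Verify them against the official API documentation before relying on them.

```python
from urllib.parse import quote, urlencode

# Assumed base URL; this listing does not state one.
BASE_URL = "https://api.semanticscholar.org/graph/v1"


def paper_lookup_url(paper_id, fields=("title", "abstract")):
    """Build a GET URL for a single paper, requesting selected fields."""
    # Percent-encode the ID, keeping ':' and '/' readable.
    path = f"{BASE_URL}/paper/{quote(paper_id, safe='/:')}"
    # Keep commas unescaped so the field list stays human-readable.
    query = urlencode({"fields": ",".join(fields)}, safe=",")
    return f"{path}?{query}"


print(paper_lookup_url("abc123"))
```

The resulting string can be passed to any HTTP client. Separating URL construction from the request itself keeps the function easy to test offline.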
Infrastructure
- AWS
AI/ML Stack
- Machine Learning
- Natural Language Processing
Integrations
- Paperlib
- CiteSpy
- Zotero
Security & Compliance
Certifications: SOC 2, GDPR
Encryption: Data encrypted at rest and in transit within AWS infrastructure.
Pricing
- Model: Free (Open Access)
- Starting Price: Free
- Target Customer: SMB, Mid-Market, Enterprise
About Semantic Scholar
Semantic Scholar is a free, AI-powered research tool developed by the Allen Institute for AI (AI2), a non-profit research institute. It uses machine reading and semantic search to help scholars find and understand scientific literature quickly, extracting key information and providing relevance-ranked results and paper summaries across more than 200 million academic papers.