Corpus - Groups

Semantic Scholar Open Research Corpus

The Semantic Scholar Open Research Corpus contains metadata for 46,947,044 published research papers in Computer Science, Neuroscience, and Biomedicine from 1936 to 2019.
- Dataset
- JSON
ROC-Stories: A Corpus for Evaluating Story Generation Models
- Dataset
- JSON
OPUS-100

The dataset used in the paper is a subset of the OPUS-MT dataset, containing 1M randomly sampled examples from the OPUS-100 dataset.
- Dataset
- JSON
PB-Br.v1

The PB-Br.v1 corpus consists of Brazilian Portuguese texts annotated with semantic roles.
- Dataset
- JSON
PB-Br.v2

The PB-Br.v2 corpus consists of Brazilian Portuguese texts annotated with semantic roles.
- Dataset
- JSON
PropBank.Br

The PropBank.Br corpus consists of Brazilian Portuguese texts annotated with semantic roles.
- Dataset
- JSON
Asian Scientific Paper Excerpt Corpus (ASPEC)
- Dataset
- JSON
Penn Discourse Treebank 2.0

The Penn Discourse Treebank 2.0 (PDTB 2.0) is a large-scale corpus containing 2,312 Wall Street Journal (WSJ) articles.
- Dataset
- JSON
Brown Corpus

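The Brown Corpus is a roughly one-million-word collection of American English text spanning multiple genres. It is commonly used as an out-of-domain test set for models trained on Wall Street Journal data.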
- Dataset
- JSON
Switchboard Corpus

The Switchboard corpus is a collection of recorded two-sided telephone conversations in American English, in which pairs of speakers discuss an assigned topic.
- Dataset
- JSON
Penn Treebank

The Penn Treebank dataset contains one million words of 1989 Wall Street Journal material annotated in Treebank II style, with 42k sentences of varying lengths.
- Dataset
- JSON
LibriSpeech

The LibriSpeech dataset is a large-scale corpus of approximately 1,000 hours of read English speech derived from LibriVox audiobooks.
- Dataset
- JSON

12 datasets found