Dataset - LDM

DDI corpus of the 2013 DDIExtraction challenge

The DDI corpus of the 2013 DDIExtraction challenge contains thousands of XML files, each of which consists of several records. The dataset is used to train and test a...
- Dataset
- JSON
Hub5e-swb Dataset

The Hub5e-swb dataset is the Switchboard portion of the Hub5 2000 English evaluation set, consisting of conversational telephone speech recordings.
- Dataset
- JSON
Copenhagen Corpus of Eye Tracking Recordings from Natural Reading of Danish T...

The Copenhagen Corpus contains eye-tracking recordings from natural reading of Danish texts.
- Dataset
- JSON
The Pile

The Pile dataset contains 3.5 million samples of diverse text for language modeling.
- Dataset
- JSON
CAP: Corpus of Adjective Pairs

The CAP dataset is a corpus of adjective pairs used to evaluate adjective order preferences in language models.
- Dataset
- JSON
Music Corpus

A dataset used for term clustering, with the goal of building a modular ontology around a core ontology from domain-specific text.
- Dataset
- JSON
Corpus of Spoken Dutch

The Corpus of Spoken Dutch (CGN) is a dataset of spoken Dutch recordings.
- Dataset
- JSON
Video Corpus

A corpus of free and representative video content was gathered. This corpus includes progressive-scan videos at 1280x720 resolution with frame rates between 24 and 30 frames...
- Dataset
- JSON
S2ORC

A collection of 81.1 million scholarly publications in English from various academic fields, used to pre-train a language model.
- Dataset
- JSON
WSJ corpus

The WSJ corpus contains 81.48 hours of speech from 283 adults.
- Dataset
- JSON
TIMIT

The TIMIT corpus is a widely used benchmark for speech recognition tasks. It contains 3,696 training utterances from 462 speakers, excluding the SA sentences. The core test set...
- Dataset
- JSON
UKWaC and Wackypedia corpora

A large text corpus compiled from the UKWaC and Wackypedia corpora.
- Dataset
- JSON
Switchboard Corpus

The Switchboard corpus is a dataset of recorded two-sided telephone conversations in English.
- Dataset
- JSON
MuST-C: a Multilingual Speech Translation Corpus

MuST-C is a multilingual speech translation corpus.
- Dataset
- JSON
Leela’s corpus

The dataset contains word order frequencies from Leela’s corpus, which are used as a proxy for cognitive cost.
- Dataset
- JSON
Switchboard

Human speech data comprises a rich set of domain factors such as accent, syntactic and semantic variety, or acoustic environment.
- Dataset
- JSON
Librispeech

The Librispeech dataset is a large-scale speaker-dependent speech corpus containing 1080 hours of speech, 5600 utterances, and 1000 speakers.
- Dataset
- JSON
Leibniz University Hannover

Imported

NLPContributionGraph Trial Dataset

An Annotation Scheme for Machine Reading of Scholarly Contributions in Natural Language Processing Literature. This dataset is the result of a pilot annotation exercise to...
- Imported Dataset
- JSON

You can also access this registry using the API (see API Docs).
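
As a rough illustration of programmatic access, the sketch below assumes the registry exposes a CKAN-style Action API with a `package_search` endpoint; this page does not confirm that, so check the API Docs first. The base URL `https://registry.example.org` and the sample response are hypothetical placeholders, and the code only builds a request URL and parses a response without contacting any server.

```python
import json
from urllib.parse import urlencode

# Hypothetical base URL (assumption): replace with the address given in the API Docs.
BASE_URL = "https://registry.example.org"


def build_search_url(query, rows=20, base_url=BASE_URL):
    """Build a CKAN-style package_search URL.

    Assumes the registry follows the CKAN Action API layout
    (/api/3/action/package_search); verify this against the API Docs.
    """
    params = urlencode({"q": query, "rows": rows})
    return f"{base_url}/api/3/action/package_search?{params}"


def extract_titles(response_text):
    """Return (total_count, titles) from a CKAN-style search response body."""
    payload = json.loads(response_text)
    if not payload.get("success"):
        raise ValueError("API reported failure")
    result = payload["result"]
    titles = [dataset.get("title", "") for dataset in result["results"]]
    return result["count"], titles


# Made-up response in the CKAN response shape, used here instead of a live request.
SAMPLE_RESPONSE = json.dumps({
    "success": True,
    "result": {
        "count": 2,
        "results": [{"title": "TIMIT"}, {"title": "S2ORC"}],
    },
})

if __name__ == "__main__":
    print(build_search_url("corpus", rows=5))
    print(extract_titles(SAMPLE_RESPONSE))
```

To query a live instance, the URL from `build_search_url` could be fetched with any HTTP client and the response body passed to `extract_titles`.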

18 datasets found
