Dataset - LDM

TQ+ datasets for whataboutism detection

Two new datasets for whataboutism detection
- Dataset
- JSON
Vietnamese Diacritic Restoration Dataset

A dataset for the Vietnamese diacritic restoration problem, consisting of 180,000 sentence pairs.
- Dataset
- JSON
IPA Transcription of Bengali Texts

A comprehensive study of IPA transcription issues and challenges for Bangla, a novel IPA transcription framework, a DUAL-IPA, a sentence-level IPA-transcribed parallel corpus...
- Dataset
- JSON
Corpus of Spoken Dutch

The Corpus of Spoken Dutch (CGN) is a dataset of spoken Dutch recordings.
- Dataset
- JSON
Language Models of Spoken Dutch

The dataset consists of subtitles of television shows provided by the Flemish public-service broadcaster VRT. The dataset is used to train language models of spoken Dutch.
- Dataset
- JSON
Scholarly Paper Recommendation via User's Recent Research Interests

The dataset used in this paper is a collection of research papers, and the authors propose a scholarly paper recommendation system.
- Dataset
- JSON
Interactive Research Paper Recommender System

The dataset used in this paper is a collection of research papers, and the authors propose an interactive research paper recommender system.
- Dataset
- JSON
OPT

The dataset used in the paper is OPT, a large language model.
- Dataset
- JSON
LLaMA

The dataset used in the paper is LLaMA, a large language model.
- Dataset
- JSON
Grammaticality Judgment Task

The dataset used in the paper is a grammaticality judgment task featuring four linguistic phenomena: anaphora, center embedding, comparatives, and negative polarity constructions.
- Dataset
- JSON
TruthfulQA

The TruthfulQA dataset contains 817 questions designed to evaluate language models' tendency to mimic human falsehoods.
- Dataset
- JSON
WIMCOR: A Large Harvested Corpus of Location Metonymy

WIMCOR is a large and rich dataset of location metonymy, harvested from Wikipedia. It is suitable for metonymy detection and entity linking tasks.
- Dataset
- JSON
Extracting Blockchain Concepts from Text

The dataset is used to extract information from blockchain-focused whitepapers and academic articles, to organize this information and help users navigate the space.
- Dataset
- JSON
ACL Anthology Dataset

The ACL Anthology dataset contains 21,212 papers, 17,792 authors, 342 venues, and 110,975 citations.
- Dataset
- JSON
ACL Anthology

The ACL Anthology dataset contains papers on natural language processing, including citation patterns, authorship, and language use over time.
- Dataset
- JSON
Collecting and Characterizing Natural Language Utterances for Specifying Data...

A dataset of natural language utterances for specifying data visualizations.
- Dataset
- JSON
Semantic Profiling of Natural Language Utterances for Data Visualization Gene...

A dataset of 500 natural language utterances for data visualization generation, including utterances with uncertainties and missing data references.
- Dataset
- JSON
NL4Opt Generation Dataset

The NL4Opt Generation Dataset consists of 1,101 examples, divided into train, dev, and test splits of 713, 99, and 289 examples, respectively. Each example consists...
- Dataset
- JSON
Zh-En Multi-Domain Dataset

The Zh-En multi-domain dataset consists of four balanced domains: news, patent, subtitles, and COVID-19.
- Dataset
- JSON
XSUM Dataset

The XSUM dataset comprises 226,711 British Broadcasting Corporation (BBC) articles paired with their single-sentence summaries.
- Dataset
- JSON

You can also access this registry using the API (see API Docs).

420 datasets found