-
TQ+ datasets for whataboutism detection
Two new datasets for whataboutism detection -
Vietnamese Diacritic Restoration Dataset
A dataset for the Vietnamese diacritic restoration problem, consisting of 180,000 sentence pairs. -
IPA Transcription of Bengali Texts
A comprehensive study of IPA transcription issues and challenges for Bangla, a novel IPA transcription framework, a DUAL-IPA, a sentence-level IPA-transcribed parallel corpus... -
Corpus of Spoken Dutch
The Corpus of Spoken Dutch (CGN) is a dataset of spoken Dutch recordings. -
Language Models of Spoken Dutch
The dataset consists of subtitles of television shows provided by the Flemish public-service broadcaster VRT. The dataset is used to train language models of spoken Dutch. -
Scholarly Paper Recommendation via User's Recent Research Interests
The dataset used in this paper is a collection of research papers, and the authors propose a scholarly paper recommendation system. -
Interactive Research Paper Recommender System
The dataset used in this paper is a collection of research papers, and the authors propose an interactive research paper recommender system. -
Grammaticality Judgment Task
The dataset used in the paper supports a grammaticality judgment task covering four linguistic phenomena: anaphora, center embedding, comparatives, and negative polarity constructions. -
TruthfulQA
The TruthfulQA dataset contains 817 questions designed to evaluate language models' tendency to mimic human falsehoods. -
WIMCOR: A Large Harvested Corpus of Location Metonymy
WIMCOR is a large and rich dataset of location metonymy, extracted using Wikipedia. It is suitable for metonymy detection and entity linking tasks. -
Extracting Blockchain Concepts from Text
The dataset is used to extract information from whitepapers and academic articles about blockchain, in order to organize that information and help users navigate the space. -
ACL Anthology Dataset
The ACL Anthology dataset contains 21,212 papers, 17,792 authors, 342 venues, and 110,975 citations. -
ACL Anthology
The ACL Anthology dataset contains papers on natural language processing, including citation patterns, authorship, and language use over time. -
Collecting and Characterizing Natural Language Utterances for Specifying Data...
A dataset of natural language utterances for specifying data visualizations. -
Semantic Profiling of Natural Language Utterances for Data Visualization Gene...
A dataset of 500 natural language utterances for data visualization generation, including utterances with uncertainties and missing data references. -
NL4Opt Generation Dataset
The NL4Opt Generation Dataset consists of 1,101 examples, divided into train, dev, and test splits of 713, 99, and 289 examples, respectively. Each example consists... -
Zh-En Multi-Domain Dataset
The Zh-En multi-domain dataset consists of four balanced domains: news, patent, subtitles, and COVID-19. -
XSUM Dataset
The XSUM dataset comprises 226,711 British Broadcasting Corporation (BBC) articles paired with their single-sentence summaries.