Design and Implementation of a Tool for Extracting Uzbek Syllables
A comprehensive approach to syllabification for the Uzbek language, including rule-based techniques and machine learning algorithms.
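Rule-based syllabification of this kind is easy to illustrate. The sketch below is a minimal, hypothetical onset-maximization rule over a simplified vowel inventory, not the authors' actual tool; real Uzbek orthography (e.g. oʻ, gʻ, and digraphs like sh and ch) would need extra handling.

```python
VOWELS = set("aeiou")  # simplified; real Uzbek orthography also has o' and digraphs

def syllabify(word: str) -> list[str]:
    """Crude rule-based syllabifier: split each intervocalic consonant
    cluster so that its last consonant opens the next syllable
    (V.CV, VC.CV, VCC.CV); illustrative only."""
    v = [i for i, ch in enumerate(word) if ch in VOWELS]
    if len(v) < 2:
        return [word]  # at most one vowel: treat as a single syllable
    # cut before the vowel for V.V, otherwise before the last consonant
    cuts = [b if b - a == 1 else b - 1 for a, b in zip(v, v[1:])]
    bounds = [0, *cuts, len(word)]
    return [word[s:e] for s, e in zip(bounds, bounds[1:])]

print(syllabify("kitob"))   # ['ki', 'tob']
print(syllabify("maktab"))  # ['mak', 'tab']
print(syllabify("oila"))    # ['o', 'i', 'la']
```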
Wikitext-103 and MusDB datasets
The dataset is not explicitly named in the paper, but the authors state that they trained a 16-layer Transformer-based (Vaswani et al., 2017) language model on...
Sentence Compression via DC Programming Approach
The dataset used in this paper for the sentence compression task.
Buffer of Thoughts
Buffer of Thoughts is a novel and versatile thought-augmented reasoning approach for enhancing the accuracy, efficiency, and robustness of large language models (LLMs).
Winograd Schema - Knowledge Extraction Using Narrative Chains
The Winograd Schema Challenge (WSC) is a test of machine intelligence, designed to be an improvement on the Turing test. A Winograd Schema consists of a sentence and a...
Automatic Scansion of Poetry
Automatic scansion of poetry.
Supervised Machine Learning for Hybrid Meter
Supervised machine learning for hybrid meter.
Automatic Analysis of Rhythmic Poetry
Automatic analysis of rhythmic poetry with applications to generation and translation.
ZeuScansion
A tool for scansion of English poetry.
Automatic Scansion of Classical Greek Hexameter
A fully automatic approach to the scansion of Classical Greek hexameter verse.
Graph-free Multi-hop Reading Comprehension: A Select-to-Guide Strategy
Multi-hop reading comprehension (MHRC) requires not only predicting the correct answer span in the given passage, but also providing a chain of supporting evidence for...
EBM-NLP corpus
A dataset of 4,993 RCT abstracts sourced from MEDLINE, with a general focus on the domain areas of cardiovascular disease, autism, and cancer.
DocRED dataset
The DocRED dataset was built from Wikipedia and Wikidata, covering various relations related to science, art, personal life, etc.
Learning to summarize with human feedback
The paper trains summarization models from human feedback: human labelers compare pairs of summaries, a reward model is trained on these comparisons, and the summarization policy is then fine-tuned with reinforcement learning against the learned reward, primarily on the Reddit TL;DR dataset.
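The reward-modeling step is compact enough to sketch. Below is a minimal PyTorch rendering of the pairwise preference loss the paper describes, -log σ(r(x, y_chosen) − r(x, y_rejected)); the tensor names and toy values are illustrative.

```python
import torch
import torch.nn.functional as F

def preference_loss(r_chosen: torch.Tensor, r_rejected: torch.Tensor) -> torch.Tensor:
    """Pairwise reward-model loss: -log sigmoid(r_chosen - r_rejected),
    averaged over a batch of human preference comparisons."""
    return -F.logsigmoid(r_chosen - r_rejected).mean()

# toy usage: scalar reward-model scores for four comparison pairs
r_chosen = torch.tensor([1.2, 0.3, 0.8, 2.0])
r_rejected = torch.tensor([0.4, 0.5, -0.1, 1.0])
loss = preference_loss(r_chosen, r_rejected)
print(loss)  # scalar; backpropagated to train the reward model
```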
Reward Model Ensembles
The authors used three datasets: TL;DR, HELPFULNESS, and XSUM/NLI.
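Reward-model ensembling itself is straightforward to sketch. The snippet below shows one common conservative aggregation, mean reward minus a disagreement penalty; whether this matches the paper's exact objective is an assumption, and beta is a hypothetical hyperparameter.

```python
import torch

def ensemble_reward(scores: torch.Tensor, beta: float = 1.0) -> torch.Tensor:
    """Aggregate per-model rewards conservatively.

    scores: [n_models, batch] rewards from each ensemble member.
    Returns the mean reward penalized by ensemble disagreement
    (mean - beta * std), a common guard against reward hacking."""
    return scores.mean(dim=0) - beta * scores.std(dim=0)

# toy usage: 3 reward models scoring 2 candidate responses
scores = torch.tensor([[1.0, 0.2],
                       [0.8, 0.6],
                       [1.2, -0.1]])
print(ensemble_reward(scores))
```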
Navigating the Grey Area: How Expressions of Uncertainty and Overconfidence A...
The authors used a variety of datasets for question answering, including TriviaQA, Natural Questions, CountryQA, and Jeopardy questions.
BIG-Bench Hard
The BIG-Bench Hard dataset is derived from the original BIG-Bench evaluation suite, focusing on the 23 tasks on which prior language models failed to outperform the average human rater.
CoNLL-2009
The CoNLL-2009 dataset is used for the semantic role labeling (SRL) task. It contains 10,177 sentences in English and 10,177 sentences in Chinese.