-
Twitter OOV Word Dataset
The dataset is a collection of tweets, filtered to include only English-language tweets. It is used to study out-of-vocabulary (OOV) words on Twitter. -
LLM dataset
The paper does not explicitly describe the dataset it uses; it mentions only that it is a large language model (LLM) and that the authors used it to train and evaluate their... -
Utilizing Prolog for converting between active and passive sentence with thre...
This work introduces a simple but efficient method for handling one of the critical aspects of English grammar: the relationship between active and passive sentences. -
Reddit Comments dataset
The Reddit Comments dataset is constructed from publicly available user comments on Reddit submissions. -
Open Subtitles dataset
The Open Subtitles dataset consists of transcriptions of spoken dialog in movies and television shows. -
UzSyllable dataset
A comprehensive dataset for training machine learning algorithms for syllable prediction and evaluating their accuracy and performance. -
Design and Implementation of a Tool for Extracting Uzbek Syllables
A comprehensive approach to syllabification for the Uzbek language, including rule-based techniques and machine learning algorithms. -
ZeuScansion
A tool for scansion of English poetry. -
BIG-Bench Hard
The BIG-Bench Hard dataset is derived from the original BIG-Bench evaluation suite, focusing on tasks that pose challenges to existing language models. -
Leveraging QA Datasets to Improve Generative Data Augmentation
The paper proposes a method to leverage QA datasets for training generative language models to be context generators for a given question and answer. -
Femicide perception dataset
A large-scale perception survey of gender-based violence (GBV) descriptions automatically extracted from a corpus of Italian newspapers. -
Wang271K Dataset
The Wang271K dataset is used for the Chinese Spelling Check (CSC) task and contains a large number of Chinese characters and their corresponding errors. -
SIGHAN Datasets
The SIGHAN datasets are used for the Chinese Spelling Check (CSC) task and contain a limited number of Chinese characters and their corresponding errors. -
Chinese Spelling Check Dataset
The dataset is used for the Chinese Spelling Check (CSC) task and contains a large number of Chinese characters and their corresponding errors. -
COVID-19 Twitter Data
The COVID-19 Twitter Data collection contains tweets about the COVID-19 pandemic. -
Phi-2: A Dataset for Language Model Evaluation
The Phi-2 dataset is a collection of language models used to evaluate the performance of language models.