Natural Language Processing - Groups

Twitter OOV Word Dataset

The dataset is a collection of tweets, filtered to include only English-language ones. It is used to study out-of-vocabulary (OOV) words on Twitter.

Dataset
JSON

LLM dataset

The dataset used in this paper is not explicitly described, but it is mentioned that it is a large language model (LLM) and that the authors used it to train and evaluate their...

Dataset
JSON

Utilizing Prolog for converting between active and passive sentence with thre...

This work introduces a simple but efficient method for handling one of the critical aspects of English grammar: the relationship between active and passive sentences.

Dataset
JSON

GPT-2 XL

GPT-2 XL is a large language model, not a dataset: the largest GPT-2 variant, trained on the WebText corpus of web pages linked from Reddit rather than on Common Crawl.

Dataset
JSON

Reddit Comments dataset

The Reddit Comments dataset is constructed from publicly available user comments on Reddit submissions.

Dataset
JSON

Open Subtitles dataset

The Open Subtitles dataset consists of transcriptions of spoken dialog in movies and television shows.

Dataset
JSON

UzSyllable dataset

A comprehensive dataset for training machine learning algorithms for syllable prediction and evaluating their accuracy and performance.

Dataset
JSON

Design and Implementation of a Tool for Extracting Uzbek Syllables

A comprehensive approach to syllabification for the Uzbek language, including rule-based techniques and machine learning algorithms.

Dataset
JSON

ZeuScansion

A tool for scansion of English poetry.

Dataset
JSON

NLDD

NLDD is a specialized open-source collection designed for the Natural Language to Software Generation domain.

Dataset
JSON

BIG-Bench Hard

The BIG-Bench Hard dataset is derived from the original BIG-Bench evaluation suite, focusing on tasks that pose challenges to existing language models.

Dataset
JSON

Leveraging QA Datasets to Improve Generative Data Augmentation

The paper proposes a method to leverage QA datasets for training generative language models to be context generators for a given question and answer.

Dataset
JSON

LongPile

LongPile is a diverse dataset derived from the Pile corpus.

Dataset
JSON

PG-19

PG-19 is a well-established benchmark for long-form language modeling.

Dataset
JSON

Femicide perception dataset

A large-scale perception survey of gender-based violence (GBV) descriptions automatically extracted from a corpus of Italian newspapers.

Dataset
JSON

Wang271K Dataset

The Wang271K dataset is used for the Chinese Spelling Check (CSC) task and contains a large number of Chinese characters with their corresponding errors.

Dataset
JSON

SIGHAN Datasets

The SIGHAN datasets are used for the Chinese Spelling Check (CSC) task and contain a limited number of Chinese characters with their corresponding errors.

Dataset
JSON

Chinese Spelling Check Dataset

The dataset is used for the Chinese Spelling Check (CSC) task and contains a large number of Chinese characters with their corresponding errors.

Dataset
JSON

COVID-19 Twitter Data

The COVID-19 Twitter Data dataset contains tweets about the COVID-19 pandemic.

Dataset
JSON

Phi-2: A Dataset for Language Model Evaluation

The Phi-2 dataset is a collection of language models used to evaluate the performance of language models.

Dataset
JSON

120 datasets found