Dataset - LDM

SlimPajama

The dataset is used to evaluate the performance of the xLSTM architecture on various tasks, including language modeling, question answering, and text classification.
- Dataset
- JSON
YELP

The YELP dataset is used for language modeling.
- Dataset
- JSON
PTB

Object tracking by reconstruction with view-specific discriminative correlation filters.
- Dataset
- JSON
Penn Treebank (PTB) and WikiText-2 (WT-2)

The paper uses Penn Treebank (PTB) and WikiText-2 (WT-2), both of which are language modeling datasets.
- Dataset
- JSON
Patrika Dataset

The Patrika dataset is used as an independent test set.
- Dataset
- JSON
Nayadiganta Dataset

The Nayadiganta dataset is used as an independent test set.
- Dataset
- JSON
Hindinews and Livehindustan Articles

Hindinews, Livehindustan, and Patrika newspaper articles, available as open source on Kaggle and encompassing similar domains.
- Dataset
- JSON
Bengali and Hindi News Articles

The Bengali dataset consists of articles from online public news portals such as Prothom-Alo, BDNews24, and Nayadiganta. The articles encompass domains such as politics,...
- Dataset
- JSON
Chinese Poetry

The Chinese Poetry dataset is a collection of Chinese poems used for language modeling.
- Dataset
- JSON
Text8

Word2Vec is a distributed word embedding generator that uses an artificial neural network to learn dense vector representations of words.
- Dataset
- JSON
Penn Treebank

The Penn Treebank dataset contains one million words of 1989 Wall Street Journal material annotated in Treebank II style, with 42k sentences of varying lengths.
- Dataset
- JSON
Wikitext-103

The dataset used in this paper is Wikitext-103, a general English language corpus containing good and featured Wikipedia articles.
- Dataset
- JSON
GLUE

Pre-trained language models (PrLM) have to carefully manage input units when training on a very large text with a vocabulary consisting of millions of words. Previous works have...
- Dataset
- JSON
Penn Treebank (PTB) dataset

The Penn Treebank (PTB) dataset is used for the word ordering task, where it serves to evaluate the performance of different models.
- Dataset
- JSON

You can also access this registry using the API (see API Docs).

34 datasets found