Language Modeling - Groups

LibriSpeech LM

The LibriSpeech LM corpus used for pre-training speech-text models.
- Dataset
- JSON
Penn Treebank PCFG

Penn Treebank PCFG dataset
- Dataset
- JSON
Simple CFG

Simple CFG dataset
- Dataset
- JSON
SAM: Semantic Attribute Modulation for Language Modeling and Style Variation

The Semantic Attribute Modulation (SAM) for language modeling and style variation.
- Dataset
- JSON
LLaMA-7B

A benchmark for evaluating the perception ability of Large Vision-Language Models (LVLMs) via various subtasks and scenarios.
- Dataset
- JSON
Penn Tree Bank (PTB)

The Penn Tree Bank (PTB) dataset used for language modeling.
- Dataset
- JSON
ControlVAE: Controllable Variational Autoencoder

The dataset used for language modeling, disentangled representation learning, and image generation.
- Dataset
- JSON
BookCorpus Dataset

The dataset used in the paper is the BookCorpus dataset.
- Dataset
- JSON
Word2Vec

Bilingual word embeddings from parallel and non-parallel corpora for cross-language text classification
- Dataset
- JSON
Morfessor 2.0 dataset

Morfessor 2.0 dataset for English, Finnish and Turkish language models
- Dataset
- JSON
Den samiske tekstbanken dataset

Den samiske tekstbanken dataset for North Sámi language model
- Dataset
- JSON
Morpho Challenge 2010 dataset

Morpho Challenge 2010 dataset for English, Finnish and Turkish language models
- Dataset
- JSON
Wikitext-2

The WikiText-2 dataset, which the authors used for text generation tasks. The paper does not describe the dataset explicitly.
- Dataset
- JSON
Billion Word Benchmark Dataset

The dataset contains 768M tokens for language modeling.
- Dataset
- JSON
SlimPajama

The dataset is used to evaluate the performance of the xLSTM architecture on various tasks, including language modeling, question answering, and text classification.
- Dataset
- JSON
YELP

The YELP dataset is used for language modeling.
- Dataset
- JSON
PTB

Object tracking by reconstruction with view-specific discriminative correlation filters.
- Dataset
- JSON
C4

The dataset used for pre-training language models, containing a large collection of text documents.
- Dataset
- JSON
Penn Treebank (PTB) and WikiText-2 (WT-2)

The datasets used in the paper are Penn Treebank (PTB) and WikiText-2 (WT-2), both language modeling datasets.
- Dataset
- JSON
Patrika Dataset

The Patrika dataset is used as an independent test set.
- Dataset
- JSON
