Language Modeling - Groups

SAM: Semantic Attribute Modulation for Language Modeling and Style Variation

Semantic Attribute Modulation (SAM) for language modeling and style variation.
- Dataset
- JSON
LLaMA-7B

A benchmark for evaluating the perception ability of Large Vision-Language Models (LVLMs) via various subtasks and scenarios.
- Dataset
- JSON
Penn Tree Bank (PTB)

The Penn Tree Bank (PTB) dataset used for language modeling.
- Dataset
- JSON
ControlVAE: Controllable Variational Autoencoder

The dataset used for language modeling, disentangled representation learning, and image generation.
- Dataset
- JSON
BookCorpus Dataset

The dataset used in the paper is the BookCorpus dataset.
- Dataset
- JSON
Word2Vec

Bilingual word embeddings from parallel and non-parallel corpora for cross-language text classification
- Dataset
- JSON
Morfessor 2.0 dataset

Morfessor 2.0 dataset for English, Finnish and Turkish language models
- Dataset
- JSON
Den samiske tekstbanken dataset

Den samiske tekstbanken dataset for North Sámi language model
- Dataset
- JSON
Morpho Challenge 2010 dataset

Morpho Challenge 2010 dataset for English, Finnish and Turkish language models
- Dataset
- JSON
Wikitext-2

The dataset used in this paper is not explicitly described. However, it is mentioned that the authors used the Wikitext-2 dataset for text generation tasks.
- Dataset
- JSON
Billion Word Benchmark Dataset

The dataset contains 768M tokens for language modeling.
- Dataset
- JSON
SlimPajama

The dataset is used to evaluate the performance of the xLSTM architecture on various tasks, including language modeling, question answering, and text classification.
- Dataset
- JSON
YELP

The YELP dataset is used for language modeling.
- Dataset
- JSON
PTB

Object tracking by reconstruction with view-specific discriminative correlation filters.
- Dataset
- JSON
C4

The dataset used for pre-training language models, containing a large collection of text documents.
- Dataset
- JSON
Penn Treebank (PTB) and WikiText-2 (WT-2)

The dataset used in the paper is Penn Treebank (PTB) and WikiText-2 (WT-2), which are language modeling datasets.
- Dataset
- JSON
Patrika Dataset

The Patrika dataset is used as an independent test set.
- Dataset
- JSON
Nayadiganta Dataset

The Nayadiganta dataset is used as an independent test set.
- Dataset
- JSON
Hindinews and Livehindustan Articles

Hindinews, Livehindustan and Patrika newspaper articles, available as open-source data on Kaggle and covering similar domains.
- Dataset
- JSON
Bengali and Hindi News Articles

The Bengali dataset consists of articles from online public news portals such as Prothom-Alo, BDNews24 and Nayadiganta. The articles encompass domains such as politics,...
- Dataset
- JSON

52 datasets found