Dataset - LDM

Enron Email Corpus

The dataset is used to discover hierarchical relationships from unstructured observations, specifically in the setting of discovering pairwise hierarchical relations between...
- Dataset
- JSON
NAVER Open Podium and NAVER Encyclopedia

A large dataset of Korean text.
- Dataset
- JSON
CiteULike

CiteULike is a user-article dataset, where each article has a 300-dimension tf-idf vector. XING is a user-view-job dataset where each job is described by a 2738-dimension...
- Dataset
- JSON
YELP

The YELP dataset is used for language modeling.
- Dataset
- JSON
News Articles Dataset

A collection of news articles from an international news website, covering September 2012 to April 2014.
- Dataset
- JSON
Jester

The Jester dataset contains continuous joke ratings from -10 to 10, along with the text of each joke.
- Dataset
- JSON
Yahoo and Yelp corpora

The Yahoo and Yelp corpora contain 100k sentences with a greater average length.
- Dataset
- JSON
20NewsGroups

A collection of documents from various domains, including news, articles, and emails.
- Dataset
- JSON
Penn Treebank

The Penn Treebank dataset contains one million words of 1989 Wall Street Journal material annotated in Treebank II style, with 42k sentences of varying lengths.
- Dataset
- JSON
SNLI

The Stanford Natural Language Inference (SNLI) dataset consists of 549,367 premise-hypothesis pairs for train/dev/test sets and target...
- Dataset
- JSON
DailyDialog

The DailyDialog dataset is a large-scale multi-turn dialogue dataset, consisting of 10,000 conversations with 5 turns each.
- Dataset
- JSON
Customer Service Calls Dataset

A dataset consisting of ten years of customer service calls to a fleet truck company.
- Dataset
- JSON

You can also access this registry using the API (see API Docs).
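
As a rough illustration of API access, the sketch below assumes the registry exposes a CKAN-style Action API with a `package_search` endpoint. This page does not confirm that, so check the API Docs first. The base URL `https://example.org`, the helper names `build_search_url` and `extract_titles`, and the sample response are all hypothetical; the example runs offline and makes no network request.

```python
import json
from urllib.parse import urlencode

# Hypothetical base URL; replace with the registry's real address.
BASE_URL = "https://example.org"


def build_search_url(query, rows=20, start=0, base_url=BASE_URL):
    """Build a CKAN-style package_search URL (assumes a CKAN Action API)."""
    params = urlencode({"q": query, "rows": rows, "start": start})
    return f"{base_url}/api/3/action/package_search?{params}"


def extract_titles(response_text):
    """Return dataset titles from a CKAN-style package_search JSON response."""
    payload = json.loads(response_text)
    if not payload.get("success"):
        return []
    return [item.get("title", "") for item in payload["result"]["results"]]


# Offline sample shaped like a CKAN response; the values are illustrative only.
sample = json.dumps({
    "success": True,
    "result": {
        "count": 2,
        "results": [{"title": "Penn Treebank"}, {"title": "SNLI"}],
    },
})

print(build_search_url("language modeling", rows=12))
print(extract_titles(sample))
```

If the registry does follow this convention, the `rows` and `start` parameters would page through the full result list rather than only the entries shown on one page.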

32 datasets found