13 datasets found

Formats: JSON; Tags: Text Data

  • The Pile

    The Pile dataset contains 3.5 million samples of diverse text for language modeling.
  • LAMBADA

    LAMBADA is a text corpus of approximately 10,000 examples, each a sequence of sentences extracted from books.
  • Wall Street Journal

    The Wall Street Journal dataset is used for syntactic linearization. It contains a large corpus of news articles with their corresponding syntactic trees.
  • TMC

    The TMC dataset is a collection of air traffic reports.
  • Reuters21578

    The problem of similarity search is to find the most similar items in a large collection to a query item of interest. Fast similarity search is at the core of many information...
  • CNAE-9 Dataset

    The CNAE-9 dataset is a set of 9 categories from the National Classification of Economic Activities.
  • MuST-C

    MuST-C is a multilingual speech translation dataset, which contains at least 385 hours of audio recordings from TED Talks, with their manual transcriptions and translations at...
  • PeerRead Dataset

    The paper presents the PeerRead dataset for testing the sparse deep generative model.
  • CommonCrawl

    CommonCrawl is a non-profit organization that provides a large corpus of web pages for research and development purposes.
  • CCNet

    CCNet is the dataset used in the paper to train the Toolformer model.
  • OSCAR

    The OSCAR corpus is a multilingual web corpus that is used for pre-training large generative language models. It is a document-oriented corpus that is comparable in size and...
  • fr-wiki

    The fr-wiki dataset is a French Wikipedia dataset containing 0.5 GT (gigatokens).
  • Common Crawl

    The Common Crawl (CC) project browses and indexes all content available online. It generates 200-300 TiB of data per month (around 5% of which is in French), and constitutes the...
You can also access this registry using the API (see API Docs).