49 datasets found

Tags: Multilingual

  • MCV-10

    This work showcases a cost-effective method for generating training data for speech processing tasks. MCV-10 is a multilingual dataset that contains 50 hours of...
  • TransMuCoRes

    Translated dataset for Multilingual Coreference Resolution (TransMuCoRes) in 31 South Asian languages.
  • DBP15K ZH-EN, DBP15K JA-EN, and DBP15K FR-EN datasets

    The DBP15K ZH-EN, DBP15K JA-EN, and DBP15K FR-EN datasets are used for cross-lingual entity alignment. The datasets contain entities, relations, and attributes, and are used to...
  • Multilingual Text Classification Dataset

    Multilingual text classification dataset covering 17 languages.
  • MARC

    The MARC dataset is a multilingual text classification dataset that contains 6 languages.
  • MEANTIME

    MEANTIME, the NewsReader Multilingual Event and Time Corpus.
  • Robust Multilingual Named Entity Recognition with Shallow Semi-Supervised Fea...

    Multilingual Named Entity Recognition approach based on a robust and general set of features across languages and datasets.
  • UD Dataset v2.7

    The UD Dataset v2.7 is a multilingual dataset for dependency parsing.
  • WikiANN

    The WikiANN dataset is a multilingual dataset for named entity recognition.
  • Parallel Meaning Bank

    A semantically annotated parallel corpus for English, German, Italian, and Dutch where sentences are aligned with scoped meaning representations in order to capture the...
  • CLEF 2003

    The dataset used for the experiments in the paper.
  • PaLI

    PaLI is a large-scale multilingual language-image model.
  • WikiMatrix

    The WikiMatrix dataset is a multilingual dataset that contains parallel texts between English and other languages.
  • DBpedia

    DBpedia is a public knowledge graph which is derived from structured information in Wikipedia, mainly infoboxes.
  • MMLU

    The dataset is used for instruction-tuning of LLMs in multiple languages using reinforcement learning from human feedback.
  • HellaSwag

    The dataset is used for instruction-tuning of LLMs in multiple languages using reinforcement learning from human feedback.
  • ARC

    The dataset is used for instruction-tuning of LLMs in multiple languages using reinforcement learning from human feedback.
  • BABEL

    The BABEL dataset is a multilingual speech recognition dataset containing over 1,000 hours of speech from 6 languages.
  • Very Deep Multilingual Convolutional Neural Networks for LVCSR

    Convolutional neural networks (CNNs) are a standard component of many current state-of-the-art Large Vocabulary Continuous Speech Recognition (LVCSR) systems. However, CNNs in...
  • XNLI

    The XNLI dataset comprises pairs of sentences with a label categorizing the semantic relationship between the two sentences into one of three classifications: entailment,...
You can also access this registry using the API (see API Docs).