Dataset - LDM

SexHateLex

The SexHateLex lexicon is a large collection of sexist and abusive terms in Chinese.
- Dataset
- JSON
SWSR: A Chinese Dataset and Lexicon for Online Sexism Detection

The SWSR dataset consists of two files: SexWeibo.csv and SexComment.csv, containing weibos (posts) and comments (replies) respectively.
- Dataset
- JSON
DialogRE

The DialogRE dataset is a dialogue-based relation extraction dataset for Chinese, which is used to improve machine reading comprehension.
- Dataset
- JSON
C3

C3 is a multiple-choice reading comprehension dataset. Here we use the C3 dataset proposed in [32], which is a competition dataset. It is necessary to construct pseudo-evidence...
- Dataset
- JSON
WAT2015

The dataset used in the paper is the WAT2015 translation task from Japanese (ja) to/from English (en) and Chinese (zh).
- Dataset
- JSON
Chinese–Japanese Unsupervised Neural Machine Translation Using Sub-character ...

Chinese–Japanese Unsupervised Neural Machine Translation Using Sub-character Level Information
- Dataset
- JSON
Chinese Corpus

The dataset is used to analyze corpora in a completely language-independent and unsupervised way, without any prior linguistic knowledge.
- Dataset
- JSON
CoNLL-2009

The CoNLL-2009 dataset is used for the semantic role labeling (SRL) task. It contains 10,177 sentences in English and 10,177 sentences in Chinese.
- Dataset
- JSON
Chinese Prosody Prediction Dataset

This is the dataset used in the paper on automatic prosody prediction for Chinese speech synthesis using BLSTM-RNN and embedding features.
- Dataset
- JSON
Chinese-to-English Controlled Machine Translation

The dataset for Chinese-to-English controlled machine translation.
- Dataset
- JSON
Chinese Controlled Paraphrase Generation

The dataset for Chinese controlled paraphrase generation.
- Dataset
- JSON
Chinese Medical Short Sentence (CMSS) corpus

The Chinese Medical Short Sentence (CMSS) corpus contains 17,787 sentences classified into three symptom severity ratings: slightly, moderately, and heavily.
- Dataset
- JSON
FewCLUE dataset

The FewCLUE dataset is a Chinese few-shot learning evaluation benchmark.
- Dataset
- JSON
CFT

The CFT dataset is a Chinese machine reading comprehension dataset.
- Dataset
- JSON
PD

The PD dataset is a Chinese machine reading comprehension dataset.
- Dataset
- JSON
CMRC-2017

The CMRC-2017 dataset is a Chinese machine reading comprehension dataset.
- Dataset
- JSON
Chinese Spell Check

The proposed approach achieves SOTA error correction results on two spell check datasets.
- Dataset
- JSON
NIST 2003 (MT03), NIST 2004 (MT04), NIST 2005 (MT05), NIST 2006 (MT06) datasets

Chinese↔English translation tasks, KFTT English↔Japanese translation datasets.
- Dataset
- JSON
DuReader

The DuReader dataset is a Chinese machine reading comprehension dataset that focuses on real-world web data.
- Dataset
- JSON
China Workshop on Machine Translation in 2017

The dataset used in the paper is the news data from the China Workshop on Machine Translation in 2017.
- Dataset
- JSON

You can also access this registry using the API (see API Docs).

22 datasets found