The Google Billion Word dataset (also known as the One Billion Word Benchmark) is one of the largest language-modeling datasets, containing almost one billion tokens and a vocabulary of roughly 800K words, drawn from an English corpus of 30,301,028 shuffled sentences.
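As a quick way to get started, here is a minimal sketch of iterating over the corpus with the Hugging Face `datasets` library; the `lm1b` dataset id and the `text` field name are assumptions about how the benchmark is published on that hub.

```python
# Minimal sketch: stream the One Billion Word Benchmark sentence by sentence.
# Assumes the corpus is available on the Hugging Face Hub under the id "lm1b"
# with each example exposing a "text" field containing one shuffled sentence.
from datasets import load_dataset

# Streaming avoids downloading the ~1B-token training split up front.
dataset = load_dataset("lm1b", split="train", streaming=True)

# Print the first few sentences as a sanity check.
for i, example in enumerate(dataset):
    print(example["text"])
    if i == 2:
        break
```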