Word2Vec dictionary for 65000 Gutenberg E-books

Thomas Egense
Description: 55,000 e-books from Project Gutenberg (http://www.gutenberg.org/). About 35,000 of the books are in English, but over 50 different languages are represented. The word2vec algorithm does a good job of separating the languages, so the result behaves almost like 50 separate word2vec dictionaries. Corpus size: 30 GB of text, split into 230 million sentences with all punctuation removed. Word2Vec takes about 1.5 weeks of CPU time to build the dictionary. Word2Vec parameters: Software implementation: Google; Model: Skip-Gram; Word...
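The description says the corpus was split into sentences with all punctuation removed, but it does not give the exact rules. As an illustration only, the following minimal Python sketch shows one way such preprocessing could look. The function name `to_sentences`, the sentence-boundary rule (a `.`, `!` or `?` followed by whitespace) and the restriction to ASCII punctuation are assumptions made for this example, not details of the original pipeline.

```python
import re
import string

# Assumption: a sentence ends at '.', '!' or '?' followed by whitespace.
# The real corpus may have been split with different rules.
_SENTENCE_END = re.compile(r"(?<=[.!?])\s+")

# Assumption: only ASCII punctuation is stripped; a multilingual corpus
# would likely need Unicode-aware handling as well.
_PUNCT_TABLE = str.maketrans("", "", string.punctuation)


def to_sentences(text):
    """Split raw text into sentences and remove ASCII punctuation from each."""
    sentences = []
    for raw in _SENTENCE_END.split(text):
        # Drop punctuation, then collapse runs of whitespace to single spaces.
        cleaned = " ".join(raw.translate(_PUNCT_TABLE).split())
        if cleaned:
            sentences.append(cleaned)
    return sentences


if __name__ == "__main__":
    for sentence in to_sentences("Hello, world! It's a test. "):
        print(sentence)
```

Each returned string would correspond to one training sentence, written one per line in the plain-text input that word2vec tools typically expect.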