HC Resources | BLCU balanced corpus frequency lists

These lists are based on a ridiculously large corpus of 15 billion (simplified) characters, composed of news, literature, blogs and much more. It is probably the biggest, most comprehensive dataset of its kind available. You can access the corpus online here and read more about the project here (in Chinese). The ZIP file linked to above contains a text file for each part of the corpus, as well as a global file covering the whole corpus. If you can’t view the text files, a user over at Pleco’s forum posted UTF-8 encoded versions that work well for me. The lists contain some oddities, such as 第 coming out on top and some Japanese kana showing up; if you want to remove these, please see the discussion over at Pleco for cleaned-up versions.
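If you would rather filter out the kana yourself, the following is a minimal sketch in Python. It assumes each line of a list holds an entry followed by its count, separated by whitespace; that layout is an assumption, not something confirmed here, so check the actual files first. The function names and the sample data are made up for illustration.

```python
from typing import Iterable, List, Tuple

Entry = Tuple[str, int]


def contains_kana(text: str) -> bool:
    """Return True if any character falls in the Hiragana or Katakana
    Unicode blocks (U+3040 to U+30FF)."""
    return any("\u3040" <= ch <= "\u30ff" for ch in text)


def parse_frequency_lines(lines: Iterable[str]) -> List[Entry]:
    """Parse lines of the ASSUMED form '<entry><whitespace><count>'.

    Lines that do not match this shape are skipped silently.
    """
    entries: List[Entry] = []
    for line in lines:
        parts = line.split()
        if len(parts) < 2:
            continue
        try:
            count = int(parts[1])
        except ValueError:
            continue
        entries.append((parts[0], count))
    return entries


def drop_kana(entries: Iterable[Entry]) -> List[Entry]:
    """Keep only entries that contain no Japanese kana."""
    return [(word, count) for word, count in entries if not contains_kana(word)]


def load_frequency_file(path: str, encoding: str = "utf-8") -> List[Entry]:
    """Read one list from disk; the path and encoding are up to the caller."""
    with open(path, encoding=encoding) as handle:
        return parse_frequency_lines(handle)


# Made-up sample lines, only to show the filtering step.
sample = ["第\t100", "的\t90", "の\t5", "カ\t3", ""]
cleaned = drop_kana(parse_frequency_lines(sample))
```

With the UTF-8 versions, `load_frequency_file` can be pointed at a file directly; for the original files you may need to pass a different `encoding` value. Note that this only removes kana. Whether to drop other oddities such as 第 is a judgment call that the sketch leaves to you.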

This resource was discussed in this article: The most common Chinese words, characters and components for language learners and teachers