Word Count in a large file
How would you count word occurrences in a very large file? How would you keep track of the top 10 most frequent words?
Method 1:
- Create N output (partition) files.
- Stream words from the large input file.
- For each word, compute its hash code.
- Write the word to partition file `hashcode % N`.
- All occurrences of a given word land in the same partition, so each partition can be counted independently (and each should be small enough to count in memory). To get the top 10, take the top 10 from each partition and merge them; since a word's entire count lives in a single partition, this merge gives the exact global answer.
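The steps above can be sketched in Python as follows (a minimal illustration, assuming whitespace-separated words; the names `partition_words` and `top_k` are mine, not from the original):

```python
import heapq
import os
import tempfile
from collections import Counter

def partition_words(input_path, n, out_dir):
    """Hash-partition words so every copy of a word lands in the same file.

    Note: Python's built-in hash() for strings is salted per process, which
    is fine here because partitioning and counting happen in the same run.
    """
    paths = [os.path.join(out_dir, f"part-{i}.txt") for i in range(n)]
    outs = [open(p, "w") for p in paths]
    try:
        with open(input_path) as f:
            for line in f:
                for word in line.split():
                    outs[hash(word) % n].write(word + "\n")
    finally:
        for o in outs:
            o.close()
    return paths

def top_k(input_path, n=8, k=10):
    """Count each partition independently, then merge the per-partition top-k."""
    candidates = []
    with tempfile.TemporaryDirectory() as tmp:
        for part in partition_words(input_path, n, tmp):
            counts = Counter()  # only one partition's words in memory at a time
            with open(part) as f:
                for line in f:
                    counts[line.strip()] += 1
            candidates.extend(counts.most_common(k))
    # A word's full count lives in exactly one partition, so merging the
    # per-partition top-k candidates yields the exact global top-k.
    return heapq.nlargest(k, candidates, key=lambda kv: kv[1])
```

The merge step is exact (not approximate) precisely because of the hashing: no word is split across partitions, so a word outside its partition's top k can never be in the global top k.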