Word Count in a large file
How would you count word occurrences in a very large file? How would you keep track of the top 10 most frequent words?
Method 1:
- Create N output (partition) files.
- Stream words from the large input file.
- For each word, compute its hash code.
- Write the word to partition file `hashcode % N`.
- All occurrences of a given word land in the same partition, so each partition can be counted independently (and each should be small enough to count in memory). To get the top 10, take the top 10 from each partition and merge them; since a word's entire count lives in a single partition, this merge gives the exact global answer.
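The steps above can be sketched in Python as follows (a minimal illustration, assuming whitespace-separated words; the names `partition_words` and `top_k` are mine, not from the original):

```python
import heapq
import os
import tempfile
from collections import Counter

def partition_words(input_path, n, out_dir):
    """Hash-partition words so every copy of a word lands in the same file.

    Note: Python's built-in hash() for strings is salted per process, which
    is fine here because partitioning and counting happen in the same run.
    """
    paths = [os.path.join(out_dir, f"part-{i}.txt") for i in range(n)]
    outs = [open(p, "w") for p in paths]
    try:
        with open(input_path) as f:
            for line in f:
                for word in line.split():
                    outs[hash(word) % n].write(word + "\n")
    finally:
        for o in outs:
            o.close()
    return paths

def top_k(input_path, n=8, k=10):
    """Count each partition independently, then merge the per-partition top-k."""
    candidates = []
    with tempfile.TemporaryDirectory() as tmp:
        for part in partition_words(input_path, n, tmp):
            counts = Counter()  # only one partition's words in memory at a time
            with open(part) as f:
                for line in f:
                    counts[line.strip()] += 1
            candidates.extend(counts.most_common(k))
    # A word's full count lives in exactly one partition, so merging the
    # per-partition top-k candidates yields the exact global top-k.
    return heapq.nlargest(k, candidates, key=lambda kv: kv[1])
```

The merge step is exact (not approximate) precisely because of the hashing: no word is split across partitions, so a word outside its partition's top k can never be in the global top k.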