Word Count in a Large File

How would you count word occurrences in a very large file? How would you keep track of the top 10 most frequently occurring words?

Method 1:

  1. Create N output (partition) files.
  2. Stream words from the large input file.
  3. For each word, compute its hash code.
  4. Write the word to partition file `hashcode % N`.
  5. All occurrences of a given word land in the same partition file, so we can count word frequencies in each partition independently and merge the per-partition results into a final output; the top 10 can be tracked during this merge (see the sketch below).
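
A minimal Python sketch of this method, under a few assumptions of my own: N = 64 partitions, partition files named `part-{i}.txt` in a working directory, a simple lowercase regex tokenizer, and `zlib.crc32` as the hash function (so the word-to-partition mapping is stable across runs). None of these choices come from the original answer; tune them to the data.

```python
import heapq
import os
import re
import zlib
from collections import Counter

N = 64  # number of partition files; pick so each partition's counts fit in memory (assumption)


def partition_words(input_path, work_dir, n=N):
    """Stream words from the big file and route each word to partition crc32(word) % n."""
    outs = [open(os.path.join(work_dir, f"part-{i}.txt"), "w") for i in range(n)]
    try:
        with open(input_path) as f:
            for line in f:
                for word in re.findall(r"[a-z']+", line.lower()):
                    outs[zlib.crc32(word.encode()) % n].write(word + "\n")
    finally:
        for out in outs:
            out.close()


def top_k(work_dir, n=N, k=10):
    """Count each partition independently, keeping a global size-k min-heap of (count, word)."""
    heap = []  # smallest count sits at heap[0]
    for i in range(n):
        counts = Counter()
        with open(os.path.join(work_dir, f"part-{i}.txt")) as f:
            for word in f:
                counts[word.strip()] += 1
        for word, cnt in counts.items():
            if len(heap) < k:
                heapq.heappush(heap, (cnt, word))
            elif cnt > heap[0][0]:
                heapq.heapreplace(heap, (cnt, word))
    return sorted(heap, reverse=True)


# Usage sketch (hypothetical paths):
# partition_words("big.txt", "/tmp/wc")
# print(top_k("/tmp/wc"))
```

Because all duplicates of a word are co-located in one partition, each `Counter` only needs to hold that partition's distinct words, and the min-heap keeps just k entries in memory while scanning all partitions, so the full word-frequency table never has to exist in one place.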
