How to detect duplicate documents in billions of URLs

Question: You have a billion URLs, where each URL points to a huge page. How do you detect duplicate documents? Two documents are considered duplicates if:

  • The two documents contain exactly the same links inside the page
  • They have the same title
  • Their creation times are the same

Approach

Observations:

  1. Pages are huge, so bringing all of them into memory is costly. We need a shorter representation of each page in memory, and a hash is an obvious choice.
  2. With a billion URLs, we don’t want to compare every page with every other page (that would be O(n²)).

Based on the above two observations we can derive an algorithm which is as follows:

  1. Iterate through the pages and compute the hash value of each one.
  2. Check whether the hash value is already in a hash table. If it is, throw out the URL as a duplicate. If it is not, keep the URL and insert the hash into the hash table. (A sketch follows this list.)
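
As a minimal sketch of this single-machine version (assuming the pages have already been fetched and are available as strings, and using Python's hashlib as a stand-in for whatever hash function is actually used):

```python
import hashlib

def find_unique_urls(pages):
    """pages: an iterable of (url, page_content) pairs.

    Returns the URLs whose page content has not been seen before.
    """
    seen_hashes = set()
    unique_urls = []
    for url, content in pages:
        # Keep only a short fingerprint of each page in memory.
        digest = hashlib.md5(content.encode("utf-8")).digest()
        if digest in seen_hashes:
            continue  # duplicate page: throw the URL out
        seen_hashes.add(digest)
        unique_urls.append(url)
    return unique_urls
```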

This algorithm gives us a list of unique URLs. But wait, can this fit on one computer?

  • How much space does each page take up in the hash table?

    Each page hashes to a four-byte value.

    Each URL is an average of 30 characters, so that’s another 30 bytes at least.

    Each entry therefore takes up roughly 34 bytes.

  • 34 bytes * 1 billion ≈ 31.6 gigabytes. We’re going to have trouble holding that all in memory! (A quick back-of-envelope check of this estimate follows.)
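
A quick back-of-envelope check of that estimate (the 4-byte hash and 30-byte average URL are the assumptions stated above):

```python
hash_bytes = 4            # assumed 4-byte hash per page
url_bytes = 30            # assumed 30-byte average URL
num_urls = 1_000_000_000

total_bytes = (hash_bytes + url_bytes) * num_urls
print(total_bytes / 2**30)  # ~31.7 GiB, more than fits comfortably in RAM
```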

Solution 1: Store the URLs in different files

If we stored all the data on one machine, we would do two passes over the data. The first pass would split the list of URLs into 4000 smaller chunks, each small enough to fit in memory. An easy way to do that is to store each URL u in a file named <x>.txt, where x = hash(u) % 4000. That is, we divide up the URLs based on their hash value (modulo the number of chunks), so all URLs with the same hash value end up in the same file.

In the second pass, we essentially implement the simple solution from earlier: load each file into memory, build a hash table of its URLs, and look for duplicates. A sketch of both passes follows.
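
This is only a sketch, assuming the URLs arrive one per line in an input file, that writing the 4000 chunk files to local disk is acceptable, and that we key on the hash of the URL string itself as described above:

```python
import hashlib

NUM_CHUNKS = 4000

def chunk_id(url):
    # Stable hash of the URL, reduced modulo the number of chunks.
    return int(hashlib.md5(url.encode("utf-8")).hexdigest(), 16) % NUM_CHUNKS

def first_pass(url_file):
    """Split the URLs into files named <x>.txt, where x = hash(u) % 4000."""
    with open(url_file) as f:
        for url in f:
            url = url.strip()
            x = chunk_id(url)
            # Appending one line at a time keeps the sketch simple;
            # a real implementation would batch the writes.
            with open(f"{x}.txt", "a") as chunk:
                chunk.write(url + "\n")

def second_pass():
    """Load each chunk into memory and deduplicate it with a hash set."""
    unique_urls = []
    for x in range(NUM_CHUNKS):
        seen = set()
        try:
            with open(f"{x}.txt") as f:
                for url in f:
                    url = url.strip()
                    if url not in seen:
                        seen.add(url)
                        unique_urls.append(url)
        except FileNotFoundError:
            continue  # no URLs hashed to this chunk
    return unique_urls
```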

Solution 2: Use multiple machines

Alternatively, we could split this work across multiple machines and accept the cost of network latency. Let's go with this solution and assume we have n machines.

  1. First, we hash the document to get a hash value v.
  2. v % n tells us which machine this document’s hash table entry can be found on.
  3. v / n is the key into the hash table that lives on that machine. (A sketch of this placement logic follows the list.)
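
A sketch of the placement logic only (the network layer that actually sends the lookup to the chosen machine is assumed and omitted; hash_document and locate are hypothetical names):

```python
import hashlib

def hash_document(document):
    # Assumed fingerprint of the page contents; any stable hash would do.
    return int(hashlib.md5(document.encode("utf-8")).hexdigest(), 16)

def locate(document, n):
    """Return (machine_index, key): which of the n machines holds the
    hash table entry for this document, and the key to use on that machine."""
    v = hash_document(document)
    machine_index = v % n  # step 2: v % n picks the machine
    key = v // n           # step 3: v / n is the key on that machine
    return machine_index, key
```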

