Design A Web Crawler
Web frontier: the list of URLs to visit
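Analysis
1 How do we model the Internet?
The Internet can be viewed as a directed graph: web pages are the nodes and hyperlinks are the edges.
2 Crawling algorithm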
BFS is the usual choice. DFS is also useful in some situations: for example, once the crawler has established a connection to a website, it may crawl all the URLs within that site depth-first to save handshake overhead.
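A minimal sketch of the BFS crawl over the frontier, for illustration only. To keep it runnable without network access, it assumes a hypothetical in-memory link graph `TOY_WEB` and a caller-supplied `get_links` function; a real crawler would fetch and parse pages instead.

```python
from collections import deque

def bfs_crawl(seed, get_links, max_pages=100):
    """Visit pages breadth-first from seed; return URLs in visit order."""
    frontier = deque([seed])   # the web frontier: URLs waiting to be visited
    seen = {seed}              # URLs already queued, to avoid revisiting
    order = []
    while frontier and len(order) < max_pages:
        url = frontier.popleft()          # FIFO order gives BFS
        order.append(url)
        for link in get_links(url):
            if link not in seen:
                seen.add(link)
                frontier.append(link)
    return order

# Hypothetical toy link graph: page -> pages it links to.
TOY_WEB = {
    "a.com/": ["a.com/1", "b.com/"],
    "a.com/1": ["a.com/2"],
    "b.com/": ["a.com/"],
    "a.com/2": [],
}

order = bfs_crawl("a.com/", lambda url: TOY_WEB.get(url, []))
print(order)
```

Replacing `frontier.popleft()` with `frontier.pop()` turns the same loop into a depth-first traversal, which is the variant mentioned for staying within one site.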
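Senior: Use priority-queue scheduling instead of plain BFS. Assign each page a crawl weight and fetch higher-weight pages first. Factors that can go into the weight include: 1. whether the page belongs to a popular website 2. the length of the link 3. the weights of the pages that link to it 4. the number of pages that point to it, and so on. Going one step further, a popular website must not be crawled without limit, so two-level scheduling is needed: first decide which website to crawl, then, once a website has been chosen, decide which of its pages to crawl. The benefit is politeness: crawling of any single website is bounded, and pages on other websites also get a chance to be crawled.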
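The two-level scheduling idea can be sketched as follows. This is one possible arrangement, not a definitive design: the class name `TwoLevelScheduler`, the `per_site_budget` parameter, and the round-robin choice of the next website are all assumptions made for illustration. Page weights are taken as given numbers.

```python
import heapq
from collections import deque

class TwoLevelScheduler:
    """Level 1 picks a website (round-robin, an assumption);
    level 2 picks that site's highest-weight pages, up to a budget."""

    def __init__(self, per_site_budget=2):
        self.per_site_budget = per_site_budget
        self.pages = {}        # site -> heap of (-weight, url)
        self.sites = deque()   # sites that still have pending pages

    def add(self, site, url, weight):
        if site not in self.pages:
            self.pages[site] = []
            self.sites.append(site)
        # heapq is a min-heap, so negate the weight to pop the largest first.
        heapq.heappush(self.pages[site], (-weight, url))

    def next_batch(self):
        """Return up to per_site_budget URLs from the next site in turn."""
        if not self.sites:
            return []
        site = self.sites.popleft()
        heap = self.pages[site]
        batch = []
        while heap and len(batch) < self.per_site_budget:
            _, url = heapq.heappop(heap)
            batch.append(url)
        if heap:
            self.sites.append(site)   # go to the back of the line
        else:
            del self.pages[site]
        return batch

sched = TwoLevelScheduler(per_site_budget=2)
sched.add("hot.com", "hot.com/a", 0.9)
sched.add("hot.com", "hot.com/b", 0.8)
sched.add("hot.com", "hot.com/c", 0.7)
sched.add("small.org", "small.org/x", 0.2)

batches = [sched.next_batch() for _ in range(4)]
print(batches)
```

Even though every page on the popular site outweighs the page on the small site, the per-site budget forces the scheduler to move on after two pages, so the small site is served before the popular site's third page.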
3 Decide what you want to crawl
- Keep a short list of websites that the crawler visits most frequently, such as authoritative news websites.
- Run many fetchers spread across many host classes. Use machine learning to predict which websites are most likely to update frequently, and put those into the fetchers' priority queues.
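A rough sketch of feeding predicted update frequency into a fetcher's priority queue. The predictor here is deliberately a stand-in, not machine learning: the hypothetical `change_rate` function just uses the fraction of past visits on which a site had changed. The site names and histories are made-up sample data.

```python
import heapq

def change_rate(history):
    """Fraction of past visits on which the site had changed.

    Stand-in for a learned predictor; returns 0.0 with no history.
    """
    return sum(history) / len(history) if history else 0.0

def build_recrawl_queue(histories):
    """Build a heap of (-rate, site) so the fastest-changing site pops first."""
    heap = [(-change_rate(h), site) for site, h in histories.items()]
    heapq.heapify(heap)
    return heap

# Made-up observations: True means the site had changed on that visit.
histories = {
    "news.example": [True, True, True, False],
    "blog.example": [False, True, False, False],
    "static.example": [False, False, False, False],
}

queue = build_recrawl_queue(histories)
crawl_order = [heapq.heappop(queue)[1] for _ in range(len(histories))]
print(crawl_order)
```

A real predictor could replace `change_rate` without changing how the queue is built or consumed.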