---
layout: post
location: Shanghai, China
tldr: false
audio: false
title: The Way to Build a Crawler
categories: Java
---
Introduction
How can we fetch data from the Internet and extract useful information from it to support decision making?
Basic Crawler
There are two types of crawlers:
- Crawlers that download web pages;
- Crawlers that extract links from the downloaded pages.
Workflow
- Load the crawling task (the URL list and shared resources) into memory;
- Scan through the URL list; for each URL the crawler:
  - first checks whether it already holds a connection to the server, and creates one if not;
  - then sends an HTTP request to the server to download the data;
  - marks the current URL as completed;
  - parses the data, extracts new URLs, and adds them to the URL list;
- Repeat until there are no URLs left to crawl (a minimal Java sketch follows).
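A minimal single-threaded sketch of this workflow, assuming Java 11+ for `java.net.http`; the `taskQueue`/`completed` names and the regex-based `extractUrls` are illustrative simplifications, and a real crawler would use a proper HTML parser:

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class BasicCrawler {

    // Hypothetical task list and state table for illustration.
    private final Deque<String> taskQueue = new ArrayDeque<>();
    private final Set<String> completed = new HashSet<>();
    // HttpClient keeps connections alive and reuses them across requests.
    private final HttpClient client = HttpClient.newHttpClient();

    public void crawl(String seedUrl) throws Exception {
        taskQueue.add(seedUrl);
        while (!taskQueue.isEmpty()) {          // stop when no URLs are left
            String url = taskQueue.poll();
            if (!completed.add(url)) continue;  // mark the URL as completed (dedupe)
            HttpRequest request = HttpRequest.newBuilder(URI.create(url)).GET().build();
            HttpResponse<String> response =
                    client.send(request, HttpResponse.BodyHandlers.ofString());
            for (String link : extractUrls(response.body())) {
                if (!completed.contains(link)) taskQueue.add(link);
            }
        }
    }

    // Naive regex-based link extraction, good enough for a sketch.
    private List<String> extractUrls(String html) {
        Pattern href = Pattern.compile("href=\"(https?://[^\"]+)\"");
        Matcher m = href.matcher(html);
        List<String> links = new ArrayList<>();
        while (m.find()) links.add(m.group(1));
        return links;
    }

    public static void main(String[] args) throws Exception {
        new BasicCrawler().crawl("https://example.com");
    }
}
```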
Discussion
- What if the crawling task is too large?
- How do we store the state of each URL?
- How do we crawl pages from different sites with different page schemas?
Maybe
- Divide it into pieces?
- Store the state in memory, in a file, or in a database?
- Define different parsing strategies and pass them into the crawlers? (a sketch follows)
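For the last point, a common approach is the strategy pattern: define one parser per page schema and inject it into a generic crawler. A minimal sketch with hypothetical parser names:

```java
import java.util.List;

// One parsing strategy per site schema.
interface PageParser {
    List<String> extractLinks(String html);
}

class NewsSiteParser implements PageParser {
    @Override
    public List<String> extractLinks(String html) {
        // Site-specific extraction logic would go here.
        return List.of();
    }
}

class ForumParser implements PageParser {
    @Override
    public List<String> extractLinks(String html) {
        return List.of();
    }
}

class ConfigurableCrawler {
    private final PageParser parser;

    // The parsing strategy is passed in, so the crawl loop stays generic.
    ConfigurableCrawler(PageParser parser) {
        this.parser = parser;
    }

    void handlePage(String html) {
        List<String> links = parser.extractLinks(html);
        // ... enqueue the links as in the basic workflow ...
    }
}
```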
Multithreading
Main idea
- Each thread runs one crawler;
- Multiple crawlers run concurrently through multithreaded programming;
- A scheduler assigns pending tasks to the crawlers (a thread-pool sketch follows).
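A sketch of the "one thread per crawler" idea using an `ExecutorService`; the shared `taskQueue`/`seen` structures stand in for the post's taskTable and pageTable, and the download/parse step is only indicated in comments:

```java
import java.util.Queue;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class CrawlerPool {
    // Thread-safe shared structures standing in for taskTable/pageTable.
    private static final Queue<String> taskQueue = new ConcurrentLinkedQueue<>();
    private static final Set<String> seen = ConcurrentHashMap.newKeySet();

    public static void main(String[] args) {
        taskQueue.add("https://example.com");
        seen.add("https://example.com");
        // A fixed pool: each worker thread runs the same crawl loop.
        ExecutorService pool = Executors.newFixedThreadPool(4);
        for (int i = 0; i < 4; i++) {
            pool.submit(() -> {
                String url;
                while ((url = taskQueue.poll()) != null) {
                    // download(url) and parse it here, then for each extracted link:
                    // if (seen.add(link)) taskQueue.add(link);
                    System.out.println(Thread.currentThread().getName() + " crawled " + url);
                }
            });
        }
        pool.shutdown();
    }
}
```

One caveat this sketch glosses over: `poll()` returning null only means the queue is momentarily empty, not that all work is done; a real scheduler must distinguish the two, which is exactly where the synchronization tools below come in.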
Issue
The taskTable and pageTable are shared between threads, so concurrent access to them must be coordinated.
Three types of solutions:
- Sleep
- Condition variable (sketched below)
- Semaphore
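For the condition-variable option, a minimal sketch using Java's intrinsic monitor (`wait`/`notifyAll`) to guard a shared task table; the class and method names are illustrative, not from the original post. The semaphore variant would replace the monitor with `java.util.concurrent.Semaphore` permits.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// A blocking task table guarded by an intrinsic lock and condition variable.
public class TaskTable {
    private final Deque<String> tasks = new ArrayDeque<>();

    // Producer side: add a task and wake up waiting consumers.
    public synchronized void put(String url) {
        tasks.add(url);
        notifyAll(); // signal the condition "tasks is non-empty"
    }

    // Consumer side: block until a task is available.
    public synchronized String take() throws InterruptedException {
        while (tasks.isEmpty()) {
            wait(); // release the lock and sleep until notified
        }
        return tasks.poll();
    }
}
```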
Pub/Sub Model
The LMAX Disruptor is among the fastest lock-free producer/consumer models for this.
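The Disruptor itself lives in the `com.lmax.disruptor` library; as a standard-library stand-in with the same publisher/subscriber shape (lock-based rather than lock-free), here is a minimal sketch with a `BlockingQueue` playing the role of the ring buffer:

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

public class PubSubDemo {
    public static void main(String[] args) {
        // Bounded queue standing in for the Disruptor's ring buffer.
        BlockingQueue<String> queue = new ArrayBlockingQueue<>(1024);

        // Publisher: crawler threads push downloaded pages.
        Thread publisher = new Thread(() -> {
            try {
                queue.put("page-content"); // blocks if the buffer is full
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });

        // Subscriber: parser threads consume pages and extract links.
        Thread subscriber = new Thread(() -> {
            try {
                String page = queue.take(); // blocks until a page arrives
                System.out.println("parsed: " + page);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });

        publisher.start();
        subscriber.start();
    }
}
```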
Distributed
4 Components
- Crawlers: the workers that download and parse pages;
- Database: the place to store the tasks and pages;
- Sender: sends messages (task requests, crawled results) from the crawlers to the database machine;
- Receiver: receives those messages and reads from or writes to the database (a socket sketch follows).
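A rough sketch of the sender/receiver pair over plain sockets; the one-line protocol ("GET_TASK" answered with a URL), the `task-db.internal` host, and `nextTaskFromDatabase` are all assumptions for illustration, and a production system would more likely use an RPC framework or a message queue:

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.io.PrintWriter;
import java.net.ServerSocket;
import java.net.Socket;

public class TaskService {

    // Receiver: runs next to the database, hands out tasks to crawlers.
    static void receiver() throws Exception {
        try (ServerSocket server = new ServerSocket(9000)) {
            while (true) {
                try (Socket conn = server.accept();
                     BufferedReader in = new BufferedReader(
                             new InputStreamReader(conn.getInputStream()));
                     PrintWriter out = new PrintWriter(conn.getOutputStream(), true)) {
                    if ("GET_TASK".equals(in.readLine())) {
                        out.println(nextTaskFromDatabase()); // hypothetical DB read
                    }
                }
            }
        }
    }

    // Sender: runs inside each crawler, requests the next task.
    static String sender() throws Exception {
        try (Socket conn = new Socket("task-db.internal", 9000); // hypothetical host
             PrintWriter out = new PrintWriter(conn.getOutputStream(), true);
             BufferedReader in = new BufferedReader(
                     new InputStreamReader(conn.getInputStream()))) {
            out.println("GET_TASK");
            return in.readLine();
        }
    }

    private static String nextTaskFromDatabase() {
        return "https://example.com"; // stand-in for a real tasks-table query
    }
}
```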
Conclusion
Crawling the web is, at heart, tree and graph traversal: pages are the nodes and links are the edges.