---
layout: post
location: Shanghai, China
tldr: false
audio: false
title: The Way to Build a Crawler
categories: Java
---
Introduction
How can we fetch data from the Internet and extract useful information from it to support decision making?
Basic Crawler
There are two types of crawlers:
- Crawlers that download web pages;
- Crawlers that extract links from the downloaded pages.
Workflow
- Load the crawling task (the URL list and shared resources) into memory;
- Scan through the URL list; for each URL the crawler:
  - first checks whether it already holds a connection to the server, and creates one if not;
  - then sends an HTTP request to the server to download the data;
  - marks the current URL as completed;
  - parses the data, extracts new URLs, and adds them to the URL list;
- Repeat until there are no URLs left to crawl (a minimal Java sketch follows).
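A minimal single-threaded sketch of this workflow, assuming Java 11+ for `java.net.http`; the `taskQueue`/`completed` names and the regex-based `extractUrls` are illustrative simplifications, and a real crawler would use a proper HTML parser:

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class BasicCrawler {

    // Hypothetical task list and state table for illustration.
    private final Deque<String> taskQueue = new ArrayDeque<>();
    private final Set<String> completed = new HashSet<>();
    // HttpClient keeps connections alive and reuses them across requests.
    private final HttpClient client = HttpClient.newHttpClient();

    public void crawl(String seedUrl) throws Exception {
        taskQueue.add(seedUrl);
        while (!taskQueue.isEmpty()) {          // stop when no URLs are left
            String url = taskQueue.poll();
            if (!completed.add(url)) continue;  // mark the URL as completed (dedupe)
            HttpRequest request = HttpRequest.newBuilder(URI.create(url)).GET().build();
            HttpResponse<String> response =
                    client.send(request, HttpResponse.BodyHandlers.ofString());
            for (String link : extractUrls(response.body())) {
                if (!completed.contains(link)) taskQueue.add(link);
            }
        }
    }

    // Naive regex-based link extraction, good enough for a sketch.
    private List<String> extractUrls(String html) {
        Pattern href = Pattern.compile("href=\"(https?://[^\"]+)\"");
        Matcher m = href.matcher(html);
        List<String> links = new ArrayList<>();
        while (m.find()) links.add(m.group(1));
        return links;
    }

    public static void main(String[] args) throws Exception {
        new BasicCrawler().crawl("https://example.com");
    }
}
```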
Discussion
- What if the crawling task is too large?
- How do we store the state of each URL?
- How do we crawl pages from different sites with different page schemas?
Maybe
- Divide it into pieces?
- Store the state in memory, in a file, or in a database?
- Define different parsing strategies and pass them into the crawlers? (a sketch follows)
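For the last point, a common approach is the strategy pattern: define one parser per page schema and inject it into a generic crawler. A minimal sketch with hypothetical parser names:

```java
import java.util.List;

// One parsing strategy per site schema.
interface PageParser {
    List<String> extractLinks(String html);
}

class NewsSiteParser implements PageParser {
    @Override
    public List<String> extractLinks(String html) {
        // Site-specific extraction logic would go here.
        return List.of();
    }
}

class ForumParser implements PageParser {
    @Override
    public List<String> extractLinks(String html) {
        return List.of();
    }
}

class ConfigurableCrawler {
    private final PageParser parser;

    // The parsing strategy is passed in, so the crawl loop stays generic.
    ConfigurableCrawler(PageParser parser) {
        this.parser = parser;
    }

    void handlePage(String html) {
        List<String> links = parser.extractLinks(html);
        // ... enqueue the links as in the basic workflow ...
    }
}
```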
Multithreading
Main idea
- Each thread runs one crawler;
- Multiple crawlers run concurrently through multithreaded programming;
- A scheduler assigns pending tasks to the crawlers (a thread-pool sketch follows).
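A sketch of the "one thread per crawler" idea using an `ExecutorService`; the shared `taskQueue`/`seen` structures stand in for the post's taskTable and pageTable, and the download/parse step is only indicated in comments:

```java
import java.util.Queue;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class CrawlerPool {
    // Thread-safe shared structures standing in for taskTable/pageTable.
    private static final Queue<String> taskQueue = new ConcurrentLinkedQueue<>();
    private static final Set<String> seen = ConcurrentHashMap.newKeySet();

    public static void main(String[] args) {
        taskQueue.add("https://example.com");
        seen.add("https://example.com");
        // A fixed pool: each worker thread runs the same crawl loop.
        ExecutorService pool = Executors.newFixedThreadPool(4);
        for (int i = 0; i < 4; i++) {
            pool.submit(() -> {
                String url;
                while ((url = taskQueue.poll()) != null) {
                    // download(url) and parse it here, then for each extracted link:
                    // if (seen.add(link)) taskQueue.add(link);
                    System.out.println(Thread.currentThread().getName() + " crawled " + url);
                }
            });
        }
        pool.shutdown();
    }
}
```

One caveat this sketch glosses over: `poll()` returning null only means the queue is momentarily empty, not that all work is done; a real scheduler must distinguish the two, which is exactly where the synchronization tools below come in.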
Issue
The taskTable and pageTable are shared between threads, so concurrent access to them must be coordinated.
Three types of solutions:
- Sleep
- Condition variable (sketched below)
- Semaphore
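For the condition-variable option, a minimal sketch using Java's intrinsic monitor (`wait`/`notifyAll`) to guard a shared task table; the class and method names are illustrative, not from the original post. The semaphore variant would replace the monitor with `java.util.concurrent.Semaphore` permits.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// A blocking task table guarded by an intrinsic lock and condition variable.
public class TaskTable {
    private final Deque<String> tasks = new ArrayDeque<>();

    // Producer side: add a task and wake up waiting consumers.
    public synchronized void put(String url) {
        tasks.add(url);
        notifyAll(); // signal the condition "tasks is non-empty"
    }

    // Consumer side: block until a task is available.
    public synchronized String take() throws InterruptedException {
        while (tasks.isEmpty()) {
            wait(); // release the lock and sleep until notified
        }
        return tasks.poll();
    }
}
```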
Pub/Sub Model
The LMAX Disruptor is among the fastest lock-free producer/consumer models for this.
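The Disruptor itself lives in the `com.lmax.disruptor` library; as a standard-library stand-in with the same publisher/subscriber shape (lock-based rather than lock-free), here is a minimal sketch with a `BlockingQueue` playing the role of the ring buffer:

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

public class PubSubDemo {
    public static void main(String[] args) {
        // Bounded queue standing in for the Disruptor's ring buffer.
        BlockingQueue<String> queue = new ArrayBlockingQueue<>(1024);

        // Publisher: crawler threads push downloaded pages.
        Thread publisher = new Thread(() -> {
            try {
                queue.put("page-content"); // blocks if the buffer is full
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });

        // Subscriber: parser threads consume pages and extract links.
        Thread subscriber = new Thread(() -> {
            try {
                String page = queue.take(); // blocks until a page arrives
                System.out.println("parsed: " + page);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });

        publisher.start();
        subscriber.start();
    }
}
```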
Distributed
4 Components
- Crawlers: the workers that download and parse pages;
- Database: the place to store the tasks and pages;
- Sender: sends messages (task requests, crawled results) from the crawlers to the database machine;
- Receiver: receives those messages and reads from or writes to the database (a socket sketch follows).
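A rough sketch of the sender/receiver pair over plain sockets; the one-line protocol ("GET_TASK" answered with a URL), the `task-db.internal` host, and `nextTaskFromDatabase` are all assumptions for illustration, and a production system would more likely use an RPC framework or a message queue:

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.io.PrintWriter;
import java.net.ServerSocket;
import java.net.Socket;

public class TaskService {

    // Receiver: runs next to the database, hands out tasks to crawlers.
    static void receiver() throws Exception {
        try (ServerSocket server = new ServerSocket(9000)) {
            while (true) {
                try (Socket conn = server.accept();
                     BufferedReader in = new BufferedReader(
                             new InputStreamReader(conn.getInputStream()));
                     PrintWriter out = new PrintWriter(conn.getOutputStream(), true)) {
                    if ("GET_TASK".equals(in.readLine())) {
                        out.println(nextTaskFromDatabase()); // hypothetical DB read
                    }
                }
            }
        }
    }

    // Sender: runs inside each crawler, requests the next task.
    static String sender() throws Exception {
        try (Socket conn = new Socket("task-db.internal", 9000); // hypothetical host
             PrintWriter out = new PrintWriter(conn.getOutputStream(), true);
             BufferedReader in = new BufferedReader(
                     new InputStreamReader(conn.getInputStream()))) {
            out.println("GET_TASK");
            return in.readLine();
        }
    }

    private static String nextTaskFromDatabase() {
        return "https://example.com"; // stand-in for a real tasks-table query
    }
}
```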
Conclusion
Crawling the web is, at heart, tree and graph traversal: pages are the nodes and links are the edges.