SmoothCrawler-Cluster#
A Python package that encapsulates high fault tolerance features for building cluster or distributed crawler systems with SmoothCrawler.
Overview#
Have you ever been troubled by how to develop a highly fault-tolerant crawler? In general, a crawler starts to fail or hit weird issues after running for a while, because front-end code changes constantly (whether for new features, breaking changes, or anti-scraping measures 🫣 ). SmoothCrawler addresses this by separating different concerns into different objects, so that every component focuses on the one thing it should do. SmoothCrawler-Cluster extends that separation-of-concerns design further into a distributed or cluster system, giving the crawler high fault tolerance!
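To make the separation-of-concerns idea concrete, here is a toy sketch in plain Python. The class names (`HttpSender`, `ResponseParser`, `DataHandler`, `Crawler`) are invented for illustration and are not SmoothCrawler's actual API; the point is only that each component does exactly one job, so any one of them can be swapped out when a website's front-end changes:

```python
# Toy sketch of separation of concerns: each component has exactly one job.
# Illustrative names only, NOT SmoothCrawler's real classes.

class HttpSender:
    """Only knows how to send a request and return a raw response."""
    def send(self, url: str) -> str:
        return f"<html>response from {url}</html>"

class ResponseParser:
    """Only knows how to pull the interesting part out of a raw response."""
    def parse(self, raw: str) -> str:
        start = raw.index(">") + 1
        end = raw.rindex("<")
        return raw[start:end]

class DataHandler:
    """Only knows how to post-process the parsed data."""
    def process(self, data: str) -> str:
        return data.upper()

class Crawler:
    """Composes the components; it orchestrates but does none of their jobs."""
    def __init__(self, sender: HttpSender, parser: ResponseParser, handler: DataHandler):
        self._sender = sender
        self._parser = parser
        self._handler = handler

    def run(self, url: str) -> str:
        raw = self._sender.send(url)
        parsed = self._parser.parse(raw)
        return self._handler.process(parsed)

crawler = Crawler(HttpSender(), ResponseParser(), DataHandler())
print(crawler.run("https://example.com"))
# prints "RESPONSE FROM HTTPS://EXAMPLE.COM"
```

Because the crawler only composes the pieces, fixing a breakage caused by a front-end change usually means replacing a single component, not rewriting the whole crawler.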
With it, you can develop a highly fault-tolerant crawler easily. Let's walk through an example to show how easy and clear it is!
from smoothcrawler_cluster import ZookeeperCrawler

# Instantiate a crawler that coordinates through Zookeeper:
# 2 runner crawlers plus 1 backup that takes over if a runner fails
zk_crawler = ZookeeperCrawler(runner=2,
                              backup=1,
                              ensure_initial=True,
                              zk_hosts="localhost:2181")

# Same as general SmoothCrawler usage: register the components into the factory
# (RequestsHTTPRequest, RequestsExampleHTTPResponseParser and ExampleDataHandler
# are SmoothCrawler component implementations defined elsewhere)
zk_crawler.register_factory(http_req_sender=RequestsHTTPRequest(),
                            http_resp_parser=RequestsExampleHTTPResponseParser(),
                            data_process=ExampleDataHandler())

# Listen and wait for tasks
zk_crawler.run()
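The runner/backup behavior behind `runner=2, backup=1` can be sketched conceptually: runners keep refreshing a heartbeat, and a backup promotes itself to replace any runner whose heartbeat goes stale. The following is a simplified toy model of that idea with invented names, not ZookeeperCrawler's real implementation (the real package coordinates this state through Zookeeper):

```python
# Toy model of runner/backup failover: a backup watches runner heartbeats
# and takes over a role whose heartbeat has gone stale.
# Simplified illustration only; names and logic are invented for this sketch.

HEARTBEAT_TIMEOUT = 3  # ticks without a heartbeat before a runner counts as dead

class Cluster:
    def __init__(self, runners, backups):
        self.heartbeats = {name: 0 for name in runners}  # runner -> last-seen tick
        self.backups = list(backups)
        self.clock = 0

    def tick(self, alive_runners):
        """Advance time; only live runners refresh their heartbeat."""
        self.clock += 1
        for name in alive_runners:
            if name in self.heartbeats:
                self.heartbeats[name] = self.clock
        self._failover()

    def _failover(self):
        """Promote a backup for any runner whose heartbeat is stale."""
        for name, last_seen in list(self.heartbeats.items()):
            if self.clock - last_seen >= HEARTBEAT_TIMEOUT and self.backups:
                replacement = self.backups.pop(0)
                del self.heartbeats[name]
                self.heartbeats[replacement] = self.clock
                print(f"{replacement} takes over for dead runner {name}")

cluster = Cluster(runners=["runner-1", "runner-2"], backups=["backup-1"])
for _ in range(3):
    cluster.tick(alive_runners=["runner-1"])  # runner-2 has stopped heartbeating
print(sorted(cluster.heartbeats))
# prints "['backup-1', 'runner-1']"
```

This is the essence of the high fault tolerance feature: the crawling workload survives the death of an individual runner because a standby crawler assumes its role automatically.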
General Documentation#
This part of the documentation introduces the package and provides step-by-step instructions for using or building a crawler with high fault tolerance features.
Usage Guides#
This section is for anyone who has racked their brain developing and designing their own customized crawler but still has no idea where to start.
API Reference#
Detailed information about the package's functions, classes, and methods.
Development Documentation#
If you're curious about the implementation details of this package, including its workflow, software architecture, system design, or development process, this section is for you.
Change Logs#
Release information.