Depend on third-party application - Zookeeper

Zookeeper manages the meta-data objects, so every meta-data operation goes through Zookeeper.

Crawler that supports managing meta-data with Zookeeper:

ZookeeperCrawler

In SmoothCrawler-Cluster, ZookeeperCrawler sets up a crawler cluster whose meta-data is managed by Zookeeper. Import it as follows:

from smoothcrawler_cluster import ZookeeperCrawler

Remember to set the Zookeeper host(s) through the zk_hosts argument.

zk_crawler = ZookeeperCrawler(runner=5,
                              name="crawler_<index>",
                              zk_hosts="localhost:2181")    # Don't forget to set this option!
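When Zookeeper runs as an ensemble, the standard ZooKeeper connection string lists several host:port pairs separated by commas. The sketch below builds such a string. Note that build_zk_hosts is a hypothetical helper and not part of SmoothCrawler-Cluster, the host names are made up, and it is an assumption that zk_hosts accepts a comma-separated string in this form.

```python
from typing import Iterable, Tuple


def build_zk_hosts(servers: Iterable[Tuple[str, int]]) -> str:
    """Join (host, port) pairs into a ZooKeeper-style connection string."""
    pairs = [f"{host}:{port}" for host, port in servers]
    if not pairs:
        raise ValueError("at least one Zookeeper server is required")
    return ",".join(pairs)


# Hypothetical host names, for illustration only.
zk_hosts = build_zk_hosts([("zk1.example.com", 2181), ("zk2.example.com", 2181)])
print(zk_hosts)
```

The resulting string could then be passed as the zk_hosts argument in place of "localhost:2181", provided the assumption above holds for your version of the library.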

After the crawler cluster is set up, zkCli.sh can be used to check the data in Zookeeper.

root@d8e0a4b0c3f4:/apache-zookeeper-3.8.0-bin# ./bin/zkCli.sh
Connecting to localhost:2181
...
Welcome to ZooKeeper!
...
WatchedEvent state:SyncConnected type:None path:null
[zk: localhost:2181(CONNECTED) 0]

In general, the usual CRUD commands can verify data operations, for example get:

[zk: localhost:2181(CONNECTED) 2] get /smoothcrawler/group/sc-crawler-cluster/state
{"total_crawler": 3, "total_runner": 2, "total_backup": 1, "standby_id": "3", "current_crawler": ["sc-crawler_1", "sc-crawler_2", "sc-crawler_3"], "current_runner": ["sc-crawler_1", "sc-crawler_2", "sc-crawler_1", "sc-crawler_2", "sc-crawler_1", "sc-crawler_2"], "current_backup": ["sc-crawler_3", "sc-crawler_3", "sc-crawler_3"], "fail_crawler": [], "fail_runner": [], "fail_backup": []}
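The same check can be scripted instead of typed into zkCli.sh. The following is a minimal sketch, assuming the kazoo client library is installed and that the state node holds UTF-8 encoded JSON with the fields shown in the output above. The function names summarize_group_state and read_group_state are hypothetical helpers, not part of SmoothCrawler-Cluster.

```python
import json


def summarize_group_state(raw: bytes) -> dict:
    """Decode the JSON payload of the group state node and summarize it."""
    state = json.loads(raw.decode("utf-8"))
    return {
        "total_crawler": state["total_crawler"],
        # The lists may contain repeated names, so collapse them to unique values.
        "runners": sorted(set(state["current_runner"])),
        "backups": sorted(set(state["current_backup"])),
        "has_failures": bool(
            state["fail_crawler"] or state["fail_runner"] or state["fail_backup"]
        ),
    }


def read_group_state(zk_hosts: str, path: str) -> dict:
    """Fetch the node at `path` from Zookeeper and summarize its content."""
    # kazoo is imported here so the parsing helper above works without it.
    from kazoo.client import KazooClient

    client = KazooClient(hosts=zk_hosts)
    client.start()
    try:
        data, _stat = client.get(path)  # returns (bytes, ZnodeStat)
    finally:
        client.stop()
        client.close()
    return summarize_group_state(data)
```

With a running Zookeeper, calling read_group_state("localhost:2181", "/smoothcrawler/group/sc-crawler-cluster/state") would read the same node as the get command above.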