Task
Some options of the meta-data object Task hold values with a complex structure, such as running_content, running_result and result_detail. Their values are dicts, or even lists of dicts. Operating on these options directly can lead to unexpected behavior or hard-coded keys. The following 3 objects address these issues and make usage more predictable and easier to maintain.
- smoothcrawler_cluster.model.metadata.RunningContent = <class 'smoothcrawler_cluster.model.metadata.RunningContent'>
The object for the task content details.
It corresponds to the option running_content of meta-data object Task.
Note that this is a namedtuple object.
- smoothcrawler_cluster.model.metadata.RunningResult = <class 'smoothcrawler_cluster.model.metadata.RunningResult'>
The object for the running result statistics
The object for the option running_result of meta-data object Task.
- smoothcrawler_cluster.model.metadata.ResultDetail = <class 'smoothcrawler_cluster.model.metadata.ResultDetail'>
The object for the running result details
The object for the option result_detail of meta-data object Task.
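As a minimal sketch of why a namedtuple helps with these dict-valued options, the snippet below defines a local stand-in for RunningContent instead of importing the library. The field names are an assumption taken from the keys of the running_content entry in the example data on this page; the real class may differ.

```python
from collections import namedtuple

# Hypothetical stand-in, NOT the library class. Field names are assumed
# from the keys of the "running_content" entry in the example data.
RunningContent = namedtuple(
    "RunningContent",
    ["task_id", "url", "method", "parameters", "header", "body"],
)

content = RunningContent(
    task_id=0,
    url="https://www.example.com",
    method="GET",
    parameters=None,
    header=None,
    body=None,
)

# _asdict() gives the plain-dict form, so no key names are typed by hand.
content_as_dict = dict(content._asdict())
print(content_as_dict["url"])
```

Building the value through named fields means a misspelled key fails at construction time instead of silently producing a wrong dict.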
- class smoothcrawler_cluster.model.metadata.Task
Meta-data for one specific crawler's task.
The web spider task currently assigned to a Runner member. Runner and Backup Runner members use this record differently, depending on the scenario.
Runner member
A Runner reads this record to learn which task to run and with what details.
Backup Runner member
A Backup Runner tries to read this record when any Runner member stops updating its heartbeat stamp and times out. It uses the record to know that it should take over from the original Runner member and wait for or run the current web spider task.
Zookeeper node path:
/smoothcrawler/node/<crawler name>/task/
Example data:
{
    "running_content": [
        {
            "task_id": 0,
            "url": "https://www.example.com",
            "method": "GET",
            "parameters": None,
            "header": None,
            "body": None
        }
    ],
    "cookies": {},
    "authorization": {},
    "in_progressing_id": "0",
    "running_result": {
        "success_count": 0,
        "fail_count": 0
    },
    "running_status": "running",
    "result_detail": [
        {
            "task_id": 0,
            "state": "done",
            "status_code": 200,
            "response": "",
            "error_msg": None
        }
    ]
}
- to_readable_object() -> dict
Convert the instance's current data to a dict type value. The goal is to make the data easy to handle as a JSON format value.
- Returns:
A dict type value keeps the current instance’s data.
- Return type:
dict
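To illustrate why a plain dict is convenient for JSON handling, here is a small sketch using only the standard json module. The dict is a trimmed copy of the example data above, written by hand; it is not produced by calling to_readable_object().

```python
import json

# Trimmed, hand-written copy of the example data (not library output).
task_data = {
    "running_content": [
        {
            "task_id": 0,
            "url": "https://www.example.com",
            "method": "GET",
            "parameters": None,
            "header": None,
            "body": None,
        }
    ],
    "in_progressing_id": "0",
    "running_result": {"success_count": 0, "fail_count": 0},
    "running_status": "running",
}

# Python None is written as JSON null and read back as None.
encoded = json.dumps(task_data)
decoded = json.loads(encoded)
```

Because every value is a dict, list, str, int or None, the round trip through JSON returns an equal object.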
- property running_content: List[dict]
Property with both a getter and a setter for the details of the task content. In general, it records the necessary info about a task, such as url, method, etc. Developers are encouraged to use the RunningContent object to configure this attribute.
This property is reset to an empty list after the crawler instance finishes all tasks.
- The setter raises ValueError in 2 scenarios:
The value type is NOT list.
Any element of the list is NOT a dict or RunningContent.
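The two failure scenarios above can be sketched as follows. TaskSketch and the stand-in RunningContent are hypothetical names written for this illustration, and converting RunningContent elements to dicts on assignment is an assumption suggested by the List[dict] property type, not confirmed library behavior.

```python
from collections import namedtuple
from typing import List

# Hypothetical stand-in; field names assumed from the example data.
RunningContent = namedtuple(
    "RunningContent",
    ["task_id", "url", "method", "parameters", "header", "body"],
)


class TaskSketch:
    """Illustrative only; not the library's Task class."""

    def __init__(self) -> None:
        self._running_content: List[dict] = []

    @property
    def running_content(self) -> List[dict]:
        return self._running_content

    @running_content.setter
    def running_content(self, value) -> None:
        # Scenario 1: the value itself is not a list.
        if not isinstance(value, list):
            raise ValueError("running_content must be a list.")
        converted = []
        for item in value:
            if isinstance(item, RunningContent):
                # Assumption: namedtuple elements are stored as dicts.
                converted.append(dict(item._asdict()))
            elif isinstance(item, dict):
                converted.append(item)
            else:
                # Scenario 2: an element is neither dict nor RunningContent.
                raise ValueError("Each element must be a dict or RunningContent.")
        self._running_content = converted


task = TaskSketch()
task.running_content = [
    RunningContent(0, "https://www.example.com", "GET", None, None, None),
    {"task_id": 1, "url": "https://www.example.com", "method": "GET"},
]
```

The list is validated fully before assignment, so a bad element leaves the previous value untouched.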
- property cookie: dict
Property with both a getter and a setter for the cookie used by tasks.
The setter raises ValueError if the value data type is NOT dict.
- Type:
dict
- property authorization: dict
Property with both a getter and a setter for the authorization used by tasks.
The setter raises ValueError if the value data type is NOT dict.
- Type:
dict
- property in_progressing_id: str
Property with both a getter and a setter for the ID of the task that the crawler instance is currently running.
- The setter raises ValueError in 2 scenarios:
The value type is NOT str or int.
The value could NOT be parsed as a number.
- Type:
str
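The two checks described for in_progressing_id can be sketched as a standalone function. normalize_in_progressing_id is a hypothetical helper name; returning a str is an assumption based on the property's declared type, and rejecting bool is a choice made for this sketch only.

```python
def normalize_in_progressing_id(value) -> str:
    """Illustrative check; not the library's implementation."""
    # Scenario 1: only str or int are accepted.
    # bool is a subclass of int, so this sketch rejects it explicitly.
    if isinstance(value, bool) or not isinstance(value, (str, int)):
        raise ValueError("in_progressing_id must be a str or an int.")
    # Scenario 2: the value must be parseable as a number.
    try:
        int(value)
    except ValueError:
        raise ValueError("in_progressing_id must be a number format value.") from None
    # Assumption: the ID is kept as a str, matching the property type.
    return str(value)


normalized_from_int = normalize_in_progressing_id(3)
normalized_from_str = normalize_in_progressing_id("0")
```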
- property running_result: dict
Property with both a getter and a setter for the statistics of the running results of all tasks. Currently, it holds only 2 values: the total number of successes (success_count) and the total number of failures (fail_count).
The setter raises ValueError if the value data type is NOT dict or RunningResult.
- Type:
dict