Task

The meta-data object Task has several options whose values are complex structures: running_content, running_result and result_detail. Each of these holds a dict type value, or a list of dict type elements. Operating on such raw values invites unexpected behavior and hard-coded keys, so the following 3 objects are provided to make usage more predictable and easier to maintain.

smoothcrawler_cluster.model.metadata.RunningContent = <class 'smoothcrawler_cluster.model.metadata.RunningContent'>

The object for the task content detail.

It describes the option running_content of meta-data object Task.

This is a namedtuple object.

Parameters:
  • task_id (str) – Task ID. It’s a unique index for every task content object.

  • url (str) – The URL target to crawl.

  • method (str) – The HTTP method used to send the request to the URL.

  • parameters (str) – The parameters of the HTTP request.

  • header (str) – The HTTP header of the request.

  • body (str) – The body of the HTTP request.
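
As a minimal sketch of how one task content entry might be built and read, the snippet below uses a stand-in namedtuple with the field names documented above, so that it runs without the library installed. The field order and the sample values are assumptions for illustration, not the library's definition.

```python
from collections import namedtuple

# Stand-in for smoothcrawler_cluster.model.metadata.RunningContent.
# Assumption: the fields follow the order of the parameter list above.
RunningContent = namedtuple(
    "RunningContent",
    ["task_id", "url", "method", "parameters", "header", "body"],
)

content = RunningContent(
    task_id="0",
    url="https://www.example.com",
    method="GET",
    parameters=None,
    header=None,
    body=None,
)

# Fields of a namedtuple are read by name; _asdict() yields a mapping
# that can be turned into a plain dict.
content_dict = dict(content._asdict())
print(content.url)
print(content_dict["method"])
```

Because a namedtuple is immutable, an entry built this way cannot be modified by accident after it has been created.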

smoothcrawler_cluster.model.metadata.RunningResult = <class 'smoothcrawler_cluster.model.metadata.RunningResult'>

The object for the running result statistics.

It describes the option running_result of meta-data object Task.

Parameters:
  • success_count (int) – The total number of tasks that finished successfully.

  • fail_count (int) – The total number of tasks that failed.
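
The counters could be maintained as sketched below. This assumes RunningResult is a namedtuple like RunningContent (the page states this only for RunningContent), and record_outcome is a hypothetical helper, not part of the library.

```python
from collections import namedtuple

# Stand-in with the two documented fields (assumption: namedtuple).
RunningResult = namedtuple("RunningResult", ["success_count", "fail_count"])


def record_outcome(result, succeeded):
    """Return a new RunningResult with the matching counter incremented.

    Hypothetical helper for illustration only.
    """
    if succeeded:
        return result._replace(success_count=result.success_count + 1)
    return result._replace(fail_count=result.fail_count + 1)


stats = RunningResult(success_count=0, fail_count=0)
stats = record_outcome(stats, succeeded=True)
stats = record_outcome(stats, succeeded=False)
stats = record_outcome(stats, succeeded=True)
print(stats)
```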

smoothcrawler_cluster.model.metadata.ResultDetail = <class 'smoothcrawler_cluster.model.metadata.ResultDetail'>

The object for the running result details.

It describes the option result_detail of meta-data object Task.

Parameters:
  • task_id (str) – Task ID. It is assigned from the task ID in RunningContent.

  • state (str) – Task running state. This option can be assigned from enum TaskResult.

  • status_code (str) – The HTTP status code in response.

  • response (str) – The HTTP response.

  • error_msg (str) – Error message.
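
The link between a result detail and its task content can be sketched as follows. Both namedtuples are stand-ins with the documented field names, detail_for is a hypothetical helper, and the sample values are taken from the example data of Task; none of this is the library's own code.

```python
from collections import namedtuple

# Stand-ins (assumption: both objects are namedtuples with these fields).
RunningContent = namedtuple(
    "RunningContent",
    ["task_id", "url", "method", "parameters", "header", "body"],
)
ResultDetail = namedtuple(
    "ResultDetail",
    ["task_id", "state", "status_code", "response", "error_msg"],
)


def detail_for(content, state, status_code, response, error_msg=None):
    """Build a ResultDetail that reuses the task ID of its RunningContent."""
    return ResultDetail(
        task_id=content.task_id,
        state=state,
        status_code=status_code,
        response=response,
        error_msg=error_msg,
    )


content = RunningContent("0", "https://www.example.com", "GET", None, None, None)
detail = detail_for(content, state="done", status_code=200, response="")
print(detail)
```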

class smoothcrawler_cluster.model.metadata.Task

Meta-data for one specific crawler’s task.

The current web spider task that a Runner member has received. Runner and Backup Runner members use this record for different purposes in different scenarios.

Runner member

A Runner reads this record to learn which task to run and which details to run it with.

Backup Runner member

A Backup Runner tries to get this info when any Runner member stops updating its heartbeat stamp and times out. The Backup Runner member then uses the info to take over from the original Runner member and either wait for or run the current web spider task.

  • Zookeeper node path:

/smoothcrawler/node/<crawler name>/task/

  • Example data:

{
    "running_content": [
        {
            "task_id": 0,
            "url": "https://www.example.com",
            "method": "GET",
            "parameters": None,
            "header": None,
            "body": None
        }
    ],
    "cookies": {},
    "authorization": {},
    "in_progressing_id": "0",
    "running_result": {
        "success_count": 0,
        "fail_count": 0
    },
    "running_status": "running",
    "result_detail": [
        {
            "task_id": 0,
            "state": "done",
            "status_code": 200,
            "response": "",
            "error_msg": None
        }
    ]
}
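
A small sketch of how the node path and its value might be handled, assuming the node value is stored as JSON text. task_node_path and the crawler name sc-crawler_1 are hypothetical names for illustration. The example data above is written as a Python literal, so None appears as null once encoded.

```python
import json


def task_node_path(crawler_name):
    """Build the Zookeeper node path documented above (hypothetical helper)."""
    return f"/smoothcrawler/node/{crawler_name}/task/"


# Shortened version of the example data, as a Python dict.
task_data = {
    "running_content": [
        {
            "task_id": 0,
            "url": "https://www.example.com",
            "method": "GET",
            "parameters": None,
            "header": None,
            "body": None,
        }
    ],
    "in_progressing_id": "0",
    "running_result": {"success_count": 0, "fail_count": 0},
}

path = task_node_path("sc-crawler_1")
payload = json.dumps(task_data)   # Python None is encoded as JSON null
restored = json.loads(payload)    # JSON null is decoded back to None
print(path)
```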
to_readable_object() -> dict

Convert the instance’s current data to a dict type value. The goal is to make the data easy to convert to a JSON format value for convenient serialization.

Returns:

A dict type value that holds the current instance’s data.

Return type:

dict

property running_content: List[dict]

Property with both a getter and a setter for the details of the task content. In general, it records the necessary info about a task, such as url, method, etc. Developers are encouraged to use object RunningContent to configure this attribute.

This property is reset to an empty list after the crawler instance finishes all tasks.

The setter raises ValueError in 2 scenarios:
  • The value type is NOT list.

  • Any element of the list is NOT dict or RunningContent.
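
The two checks above can be sketched as a standalone function. This is an illustration of the documented rules under stated assumptions, not the library's setter: check_running_content is a hypothetical name and RunningContent is a stand-in namedtuple.

```python
from collections import namedtuple

# Stand-in for the library's RunningContent (assumption).
RunningContent = namedtuple(
    "RunningContent",
    ["task_id", "url", "method", "parameters", "header", "body"],
)


def check_running_content(value):
    """Apply the two documented rules and return the value unchanged."""
    # Rule 1: the value itself must be a list.
    if not isinstance(value, list):
        raise ValueError("running_content must be a list.")
    # Rule 2: every element must be a dict or a RunningContent.
    for element in value:
        if not isinstance(element, (dict, RunningContent)):
            raise ValueError(
                "Every element must be a dict or a RunningContent."
            )
    return value


valid = check_running_content(
    [
        {"task_id": 0, "url": "https://www.example.com", "method": "GET"},
        RunningContent("1", "https://www.example.com", "GET", None, None, None),
    ]
)
print(len(valid))
```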

Type:

list of dict

property cookie: dict

Property with both a getter and a setter for the cookie used by tasks.

The setter raises ValueError if the value data type is NOT dict.

Type:

dict

property authorization: dict

Property with both a getter and a setter for the authorization used by tasks.

The setter raises ValueError if the value data type is NOT dict.

Type:

dict

property in_progressing_id: str

Property with both a getter and a setter for the ID of the task that the crawler instance is currently running.

The setter raises ValueError in 2 scenarios:
  • The value type is NOT str or int.

  • The value can NOT be parsed as a number.
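
A sketch of these two rules as a standalone function follows. It assumes that "parsed as a number" means the value is accepted by int(), and that the stored form is a str, as the property type suggests; normalize_in_progressing_id is a hypothetical name, not the library's setter.

```python
def normalize_in_progressing_id(value):
    """Check a task ID against the two documented rules, then return it as str."""
    # Rule 1: only str and int are accepted.
    if not isinstance(value, (str, int)):
        raise ValueError("in_progressing_id must be a str or an int.")
    # Rule 2: the value must be parseable as a number.
    # Assumption: int() is used for the parsing.
    try:
        int(value)
    except ValueError:
        raise ValueError("in_progressing_id must be a number format value.")
    return str(value)


print(normalize_in_progressing_id(0))
print(normalize_in_progressing_id("12"))
```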

Type:

str

property running_result: dict

Property with both a getter and a setter for the statistics of the running result of all tasks. Currently, it holds only 2 values: the total number of successes (success_count) and the total number of failures (fail_count).

The setter raises ValueError if the value data type is NOT dict or RunningResult.

Type:

dict

property running_status: str

Property with both a getter and a setter for the status of the task that the crawler is running. Developers are encouraged to configure this attribute with enum object TaskResult.

The setter raises ValueError if the value data type is NOT str or TaskResult.

Type:

str

property result_detail: List[dict]

Property with both a getter and a setter for the details of every task’s running result.

The setter raises ValueError in 2 scenarios:
  • The value data type is NOT list.

  • Any element of the list is NOT dict or ResultDetail.

Type:

list of dict