When I worked on distributed crawlers in the past, the spider always got the links it consumed from pushed URLs. That raises a question: what if the request is a POST request? I looked into the scrapy-redis source, where spider.py is written as follows.
1. scrapy-redis source code analysis

```python
def make_request_from_data(self, data):
    """Returns a `Request` instance for data coming from Redis.

    Overriding this function to support the `json` requested `data` that
    contains `url`, `meta` and other optional parameters. `meta` is a nested
    json which contains sub-data.

    Along with:
    After accessing the data, sending the FormRequest with `url`, `meta` and
    addition `formdata`, `method`.

    For example:

    .. code:: json

        {
            "url": "https://example.com",
            "meta": {
                "job-id": "123xsd",
                "start-date": "dd/mm/yy",
            },
            "url_cookie_key": "fertxsas",
            "method": "POST",
        }

    If `url` is empty, return `[]`. So you should verify the `url` in the data.
    If `method` is empty, the request object will set method to 'GET', optional.
    If `meta` is empty, the request object will set `meta` to an empty dictionary, optional.

    This json supported data can be accessed from 'scrapy.spider' through
    'response.request.url', 'request.meta', 'request.cookies', 'request.method'.

    Parameters
    ----------
    data : bytes
        Message from redis.
    """
    formatted_data = bytes_to_str(data, self.redis_encoding)

    if is_dict(formatted_data):
        parameter = json.loads(formatted_data)
    else:
        self.logger.warning(
            f"{TextColor.WARNING}WARNING: String request is deprecated, please use JSON data format. "
            f"Detail information, please check https://github.com/rmax/scrapy-redis#features{TextColor.ENDC}"
        )
        return FormRequest(formatted_data, dont_filter=True)

    if parameter.get("url", None) is None:
        self.logger.warning(
            f"{TextColor.WARNING}The data from Redis has no url key in push data{TextColor.ENDC}"
        )
        return []

    url = parameter.pop("url")
    method = parameter.pop("method").upper() if "method" in parameter else "GET"
    metadata = parameter.pop("meta") if "meta" in parameter else {}

    return FormRequest(url, dont_filter=True, method=method, formdata=parameter, meta=metadata)
```
Source repository: https://github.com/rmax/scrapy-redis. As you can see, POST requests can be handled here.
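As a usage sketch, here is how a producer might push such a JSON request onto the spider's Redis list. The queue key `myspider:start_urls` and the endpoint URL are assumptions; the key has to match your spider's `redis_key` (by default `<spider_name>:start_urls`):

```python
import json

import redis

# Connect to the Redis instance the spider consumes from
# (host/port/db are assumptions; adjust to your deployment).
r = redis.Redis(host="localhost", port=6379, db=0)

# Payload in the format make_request_from_data() expects:
# "url" is required, "method" and "meta" are optional, and any
# remaining keys are passed to FormRequest as formdata.
payload = {
    "url": "https://example.com/api/search",  # hypothetical endpoint
    "method": "POST",
    "meta": {"job-id": "123xsd"},
    "keyword": "scrapy",  # ends up in the POST formdata
}

# The key must match the spider's redis_key setting.
r.lpush("myspider:start_urls", json.dumps(payload))
```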
2. scrapy-rabbitmq-scheduler source code analysis
Repository address:
https://github.com/aox-lei/scrapy-rabbitmq-scheduler
```python
import pickle

import scrapy
from scrapy.utils.reqser import request_from_dict  # moved in newer Scrapy versions


class RabbitSpider(scrapy.Spider):

    def _make_request(self, mframe, hframe, body):
        try:
            # Try to deserialize the message body as a pickled request dict
            request = request_from_dict(pickle.loads(body), self)
        except Exception as e:
            # Fall back: treat the raw body as a plain URL string
            body = body.decode()
            request = scrapy.Request(body, callback=self.parse, dont_filter=True)
        return request
```
You can see that RabbitSpider inherits from the Spider class and overrides how requests are built. When we publish a POST request, request_from_dict(pickle.loads(body), self) raises an error:
builtins.UnicodeDecodeError: 'utf-8' codec can't decode byte 0x80 in position 0: invalid start byte
This is caused by pickle.loads running into a byte sequence it cannot handle while deserializing the bytes. Specifically, UnicodeDecodeError: 'utf-8' codec can't decode byte 0x80 in position 0: invalid start byte means the incoming data contains bytes that are not valid UTF-8, likely binary data or data in some other encoding. In fact, 0x80 is exactly the PROTO opcode that opens every pickle stream from protocol 2 onward, so pickled binary can never be decoded as UTF-8 text.
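A minimal reproduction outside the spider, using nothing but the standard library:

```python
import pickle

payload = pickle.dumps({"url": "https://example.com"})
print(payload[:2])  # b'\x80\x05' on Python 3.8+: PROTO opcode plus protocol number
payload.decode()    # UnicodeDecodeError: 'utf-8' codec can't decode byte 0x80 in position 0
```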
To support POST messages, _make_request can be rewritten to unpack a plain dict and build the request explicitly:

```python
def _make_request(self, mframe, hframe, body):
    try:
        # Deserialize the message body
        data = pickle.loads(body)

        # Extract the request URL and the other parameters
        url = data.get("url")
        method = data.get("method", "GET").upper()  # defaults to GET; set "POST" for POST requests
        headers = data.get("headers", {})
        cookies = data.get("cookies", {})
        body_data = data.get("body")         # form data of a POST request, if any
        callback_str = data.get("callback")  # callback function name as a string
        errback_str = data.get("errback")    # errback function name as a string
        meta = data.get("meta", {})

        # Resolve the callback names to methods on the spider instance via getattr
        callback = getattr(self, callback_str, None) if callback_str else None
        errback = getattr(self, errback_str, None) if errback_str else None

        # # Ensure the callback functions exist
        # if callback is None:
        #     self.logger.error(f"Callback function {callback_str} not found.")
        # if errback is None:
        #     self.logger.error(f"Errback function {errback_str} not found.")

        if callback:
            # If the method is POST, use FormRequest
            if method == "POST":
                # FormRequest suits POST requests carrying form data
                request = scrapy.FormRequest(
                    url=url,
                    method="POST",
                    headers=headers,
                    cookies=cookies,
                    body=body_data,  # request body
                    callback=callback,
                    errback=errback,
                    meta=meta,
                    dont_filter=True,
                )
            else:
                # Handle GET requests by default
                request = scrapy.Request(
                    url=url,
                    headers=headers,
                    cookies=cookies,
                    callback=callback,
                    errback=errback,
                    meta=meta,
                    dont_filter=True,
                )
        else:
            # No callback name given: fall back to self.parse so that
            # `request` is always bound (a bare `pass` here would raise
            # UnboundLocalError at the return statement)
            request = scrapy.Request(url, callback=self.parse, dont_filter=True)
    except Exception as e:
        # Fall back: treat the raw body as a plain URL string
        body = body.decode()
        request = scrapy.Request(body, callback=self.parse, dont_filter=True)
    return request
```
Note that the callback fetched from the message is a string, not a function; it has to be resolved to the corresponding method on the spider (here via getattr). A matching producer-side sketch follows below.
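For reference, a minimal producer sketch that publishes a message in the shape the rewritten `_make_request` unpacks. The queue name `myspider_queue`, the endpoint URL, and the callback name `parse_search` are assumptions: the queue must match the spider's queue, and the callback must be a method defined on the spider.

```python
import pickle

import pika

# Connect to RabbitMQ (connection parameters are assumptions).
connection = pika.BlockingConnection(pika.ConnectionParameters(host="localhost"))
channel = connection.channel()
channel.queue_declare(queue="myspider_queue", durable=True)

# Message in the shape _make_request() unpacks: callback/errback are
# method names on the spider, resolved there via getattr().
message = {
    "url": "https://example.com/api/search",  # hypothetical endpoint
    "method": "POST",
    "headers": {"Content-Type": "application/x-www-form-urlencoded"},
    "body": "keyword=scrapy",    # raw POST body
    "callback": "parse_search",  # resolved to self.parse_search on the spider
    "meta": {"job-id": "123xsd"},
}

# The body is pickled, matching the pickle.loads() call on the consumer side.
channel.basic_publish(exchange="", routing_key="myspider_queue", body=pickle.dumps(message))
connection.close()
```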
Note: since scrapy-rabbitmq-scheduler is no longer maintained and does not work with recent Scrapy releases, the updated code above has been pushed to GitHub: https://github.com/tieyongjie/scrapy-rabbitmq-task
To install, just run:
pip install scrapy-rabbitmq-task
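For orientation, a minimal settings sketch in the style of the upstream scrapy-rabbitmq-scheduler README; the exact setting names and module paths in the scrapy-rabbitmq-task fork are assumptions here, so check the repository README before relying on them:

```python
# settings.py (sketch; names follow the upstream scrapy-rabbitmq-scheduler
# README and are assumptions for the scrapy-rabbitmq-task fork)

# Use the RabbitMQ-backed scheduler
SCHEDULER = "scrapy_rabbitmq_scheduler.scheduler.SaaS"

# RabbitMQ connection DSN
RABBITMQ_CONNECTION_PARAMETERS = "amqp://guest:guest@localhost:5672/"
```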