An Example of Faking Random Request Headers for a Crawler in Pyspider


This article presents an example of faking random request headers for a crawler in Pyspider. It should be a useful reference; readers who need this technique can follow along.

Pyspider uses the tornado library to make HTTP requests. Various parameters can be attached to a request, such as the connection timeout, the data-transfer timeout, the request headers, and so on. In pyspider's stock framework, however, crawler-wide parameters can only be set through the crawl_config Python dictionary (shown below); the framework code converts the entries of this dictionary into task data and issues the HTTP request. The drawback of this mechanism is that it is awkward to give every single request a random request header.

crawl_config = {
    "user_agent": "Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/52.0.2743.116 Safari/537.36",
    "timeout": 120,
    "connect_timeout": 60,
    "retries": 5,
    "fetch_type": 'js',
    "auto_recrawl": True,
}

Here is a way to give the crawler random request headers:

1. Write a script, place it in pyspider's libs folder, and name it header_switch.py

#!/usr/bin/env python
# -*- coding:utf-8 -*-
# Created on 2017-10-18 11:52:26
import random


class HeadersSelector(object):
    """
    These headers are missing a few fields, Host and Cookie,
    which are filled in per request.
    """
    headers_1 = {
        "Proxy-Connection": "keep-alive",
        "Pragma": "no-cache",
        "Cache-Control": "no-cache",
        "User-Agent": "Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/52.0.2743.116 Safari/537.36",
        "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8",
        "DNT": "1",
        "Accept-Encoding": "gzip, deflate, sdch",
        "Accept-Language": "zh-CN,zh;q=0.8,en-US;q=0.6,en;q=0.4",
        "Referer": "https://www.baidu.com/s?wd=%BC%96%E7%A0%81&rsv_spt=1&rsv_iqid=0x9fcbc99a0000b5d7&issp=1&f=8&rsv_bp=1&rsv_idx=2&ie=utf-8&rqlang=cn&tn=baiduhome_pg&rsv_enter=0&oq=If-None-Match&inputT=7282&rsv_t",
        "Accept-Charset": "gb2312,gbk;q=0.7,utf-8;q=0.7,*;q=0.7",
    }  # a browser header found online
    headers_2 = {
        "Proxy-Connection": "keep-alive",
        "Pragma": "no-cache",
        "Cache-Control": "no-cache",
        "User-Agent": "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/49.0.2623.221 Safari/537.36 SE 2.X MetaSr 1.0",
        "Accept": "image/gif,image/x-xbitmap,image/jpeg,application/x-shockwave-flash,application/vnd.ms-excel,application/vnd.ms-powerpoint,application/msword,*/*",
        "DNT": "1",
        "Referer": "https://www.baidu.com/link?url=c-FMHf06-ZPhoRM4tWduhraKXhnSm_RzjXZ-ZTFnPAvZN",
        "Accept-Encoding": "gzip, deflate, sdch",
        "Accept-Language": "zh-CN,zh;q=0.8,en-US;q=0.6,en;q=0.4",
    }  # a Windows 7 browser
    headers_3 = {
        "Proxy-Connection": "keep-alive",
        "Pragma": "no-cache",
        "Cache-Control": "no-cache",
        "User-Agent": "Mozilla/5.0 (X11; Linux x86_64; rv:52.0) Gecko/20100101 Firefox/52.0",
        "Accept": "image/x-xbitmap,image/jpeg,application/x-shockwave-flash,application/vnd.ms-excel,application/vnd.ms-powerpoint,application/msword,*/*",
        "DNT": "1",
        "Referer": "https://www.baidu.com/s?wd=http%B4%20Pragma&rsf=1&rsp=4&f=1&oq=Pragma&tn=baiduhome_pg&ie=utf-8&usm=3&rsv_idx=2&rsv_pq=e9bd5e5000010",
        "Accept-Encoding": "gzip, deflate, sdch",
        "Accept-Language": "zh-CN,zh;q=0.8,en-US;q=0.7,en;q=0.6",
    }  # Firefox on Linux
    headers_4 = {
        "Proxy-Connection": "keep-alive",
        "Pragma": "no-cache",
        "Cache-Control": "no-cache",
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:55.0) Gecko/20100101 Firefox/55.0",
        "Accept": "*/*",
        "DNT": "1",
        "Referer": "https://www.baidu.com/link?url=c-FMHf06-ZPhoRM4tWduhraKXhnSm_RzjXZ-ZTFnP",
        "Accept-Encoding": "gzip, deflate, sdch",
        "Accept-Language": "zh-CN,zh;q=0.9,en-US;q=0.7,en;q=0.6",
    }  # Firefox on Windows 10
    headers_5 = {
        "Connection": "keep-alive",
        "Pragma": "no-cache",
        "Cache-Control": "no-cache",
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/52.0.2743.116 Safari/537.36 Edge/15.15063",
        "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8",
        "Referer": "https://www.baidu.com/link?url=c-FMHf06-ZPhoRM4tWduhraKXhnSm_RzjXZ-",
        "Accept-Encoding": "gzip, deflate, sdch",
        "Accept-Language": "zh-CN,zh;q=0.9,en-US;q=0.7,en;q=0.6",
        "Accept-Charset": "gb2312,gbk;q=0.7,utf-8;q=0.7,*;q=0.7",
    }  # Edge on Windows 10
    headers_6 = {
        "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8",
        "Accept-Encoding": "gzip, deflate, sdch",
        "Accept-Language": "zh-CN,zh;q=0.8",
        "Pragma": "no-cache",
        "Cache-Control": "no-cache",
        "Connection": "keep-alive",
        "DNT": "1",
        "Referer": "https://www.baidu.com/s?wd=If-None-Match&rsv_spt=1&rsv_iqid=0x9fcbc99a0000b5d7&issp=1&f=8&rsv_bp=1&rsv_idx=2&ie=utf-8&rq",
        "Accept-Charset": "gb2312,gbk;q=0.7,utf-8;q=0.7,*;q=0.7",
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/49.0.2623.221 Safari/537.36 SE 2.X MetaSr 1.0",
    }  # a Windows 10 browser

    def select_header(self):
        # pick one of the six header dicts at random
        n = random.randint(1, 6)
        switch = {
            1: self.headers_1,
            2: self.headers_2,
            3: self.headers_3,
            4: self.headers_4,
            5: self.headers_5,
            6: self.headers_6,
        }
        return switch[n]
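
As a quick sanity check, the selector can be exercised on its own, assuming the module path from step 1 (pyspider/libs/header_switch.py):

from pyspider.libs.header_switch import HeadersSelector

selector = HeadersSelector()
for _ in range(3):
    # each call may return a different one of the six header dicts
    print(selector.select_header()["User-Agent"])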

Only six request headers are written here. If the crawl volume is very large, you can write many more, even hundreds, and then widen random's range accordingly; see the sketch below.
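
For a large pool, a flat list plus random.choice is easier to maintain than the switch dictionary, because the random range no longer has to be kept in sync with the number of headers by hand. A minimal sketch, reusing the dicts defined on the class above:

import random

# collect every candidate header once; adding a new header only means
# appending it to this list
HEADERS_POOL = [
    HeadersSelector.headers_1, HeadersSelector.headers_2,
    HeadersSelector.headers_3, HeadersSelector.headers_4,
    HeadersSelector.headers_5, HeadersSelector.headers_6,
]

def select_random_header():
    return random.choice(HEADERS_POOL)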

2. In the pyspider script, write the following code:

#!/usr/bin/env python
# -*- encoding: utf-8 -*-
# Created on 2017-08-18 11:52:26
from pyspider.libs.base_handler import *
from pyspider.libs.header_switch import HeadersSelector
import sys

# Python 2 only: force the default encoding to utf-8
defaultencoding = 'utf-8'
if sys.getdefaultencoding() != defaultencoding:
    reload(sys)
    sys.setdefaultencoding(defaultencoding)


class Handler(BaseHandler):
    crawl_config = {
        "user_agent": "Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/52.0.2743.116 Safari/537.36",
        "timeout": 120,
        "connect_timeout": 60,
        "retries": 5,
        "fetch_type": 'js',
        "auto_recrawl": True,
    }

    @every(minutes=24 * 60)
    def on_start(self):
        header_slt = HeadersSelector()
        header = header_slt.select_header()  # get a fresh header
        # header["X-Requested-With"] = "XMLHttpRequest"
        orig_href = 'http://sww.bjxch.gov.cn/gggs.html'
        self.crawl(orig_href,
                   callback=self.index_page,
                   headers=header)  # the headers must be passed inside crawl; cookies are found in response.cookies

    @config(age=24 * 60 * 60)
    def index_page(self, response):
        header_slt = HeadersSelector()
        header = header_slt.select_header()  # get a fresh header
        # header["X-Requested-With"] = "XMLHttpRequest"
        if response.cookies:
            # the standard request header is "Cookie" and its value must be
            # a string, so serialize the cookie dict
            header["Cookie"] = "; ".join("%s=%s" % (k, v) for k, v in response.cookies.items())

The crucial point is that in every callback function, on_start, index_page, and so on, a header selector is instantiated on each call, so every request gets a different header. Take note of the code to add, shown below:

header_slt = HeadersSelector()
header = header_slt.select_header()  # get a fresh header
# header["X-Requested-With"] = "XMLHttpRequest"
header["Host"] = "www.baidu.com"
if response.cookies:
    header["Cookie"] = "; ".join("%s=%s" % (k, v) for k, v in response.cookies.items())

When an AJAX request is sent via XHR, this header is carried along, and servers often use it to decide whether a request is an Ajax request; {'X-Requested-With': 'XMLHttpRequest'} must be added to the headers to fetch such content.

Once the url is fixed, the Host in the request header is fixed too; add it as needed. The urlparse module provides functions that parse the host out of a url; just read the netloc attribute of the result.
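
A minimal sketch using the standard-library urlparse module (Python 2, matching the script above; on Python 3 the same function lives in urllib.parse), where header is the dict returned by select_header():

from urlparse import urlparse

orig_href = 'http://sww.bjxch.gov.cn/gggs.html'
header["Host"] = urlparse(orig_href).netloc  # 'sww.bjxch.gov.cn'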

If the response carries cookies, they need to be added to the request header.

If you need any other disguise, add it yourself in the same way.

That is all it takes to implement random request headers.
