A Brief Introduction to talonspider, a Python Crawler Framework
2020-11-27 14:23:52 Editor: Xiaocai


This article gives a brief introduction to talonspider, a crawler framework written in Python, and shows how to use it. Readers who need such a tool can use it as a reference.

1. Why write this?

Some pages are simple enough that a full-blown framework is overkill for crawling them, yet writing everything by hand from scratch is tedious.

So talonspider was written for this need:

1. Item extraction for single pages - detailed introduction here
2. A spider module - detailed introduction here

2. Introduction & Usage

2.1. item

This module can be used on its own. For sites where the requests are simple (for example, only GET requests are needed), this module alone is enough to quickly write the crawler you want. For example (the code below uses Python 3; for Python 2 see the examples directory):

2.1.1. Single page, single target

For example, to get the book information (title, author, cover, etc.) from http://book.qidian.com/info/1004608738, you can write it like this:

from talonspider import Item, TextField, AttrField
from pprint import pprint


class TestSpider(Item):
    # Each field is extracted with a CSS selector
    title = TextField(css_select='.book-info>h1>em')
    author = TextField(css_select='a.writer')
    cover = AttrField(css_select='a#bookImg>img', attr='src')

    # tal_<field> hooks post-process each extracted value
    def tal_title(self, title):
        return title

    def tal_cover(self, cover):
        # The src attribute is protocol-relative, so prepend the scheme
        return 'http:' + cover


if __name__ == '__main__':
    item_data = TestSpider.get_item(url='http://book.qidian.com/info/1004608738')
    pprint(item_data)

See qidian_details_by_item.py for the full example.
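Since get_item returns the parsed item, you can also work with its fields directly instead of only pretty-printing it. A minimal sketch, assuming the returned object exposes the declared fields as attributes (the attribute names below simply mirror the class above):

# Sketch: access the extracted fields of the returned item directly
book = TestSpider.get_item(url='http://book.qidian.com/info/1004608738')
print(book.title)
print(book.author)
print(book.cover)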

2.1.2. Single page, multiple targets

For example, to get the 25 movies shown on the first page of the Douban Top 250, there are 25 targets on one page, and you can write it like this:

from talonspider import Item, TextField, AttrField
from pprint import pprint


# Define a crawler class that inherits from Item
class DoubanSpider(Item):
    # target_item marks the repeated node; the other fields are extracted inside each one
    target_item = TextField(css_select='p.item')
    title = TextField(css_select='span.title')
    cover = AttrField(css_select='p.pic>a>img', attr='src')
    abstract = TextField(css_select='span.inq')

    def tal_title(self, title):
        # A movie may have several title nodes; join them into a single string
        if isinstance(title, str):
            return title
        else:
            return ''.join([i.text.strip().replace('\xa0', '') for i in title])


if __name__ == '__main__':
    items_data = DoubanSpider.get_items(url='https://movie.douban.com/top250')
    result = []
    for item in items_data:
        result.append({
            'title': item.title,
            'cover': item.cover,
            'abstract': item.abstract,
        })
    pprint(result)

See douban_page_by_item.py for the full example.
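Because get_items simply returns a list of parsed items, you can post-process them like any other Python objects. A small sketch, reusing the DoubanSpider class above and assuming each item exposes the declared fields as attributes, that keeps only the movies that actually have a one-line abstract:

from pprint import pprint

items_data = DoubanSpider.get_items(url='https://movie.douban.com/top250')
# Keep only movies whose abstract field is a non-empty string
with_abstract = [
    {'title': item.title, 'abstract': item.abstract}
    for item in items_data
    if isinstance(item.abstract, str) and item.abstract.strip()
]
pprint(with_abstract)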

2.2. spider

When you need to crawl pages with multiple levels, for example all of the movies in the Douban Top 250 rather than just the first page, the spider part comes in handy:

#!/usr/bin/env python
from talonspider import Spider, Item, TextField, AttrField, Request
from talonspider.utils import get_random_user_agent


# Define an item class that inherits from Item
class DoubanItem(Item):
    target_item = TextField(css_select='p.item')
    title = TextField(css_select='span.title')
    cover = AttrField(css_select='p.pic>a>img', attr='src')
    abstract = TextField(css_select='span.inq')

    def tal_title(self, title):
        if isinstance(title, str):
            return title
        else:
            return ''.join([i.text.strip().replace('\xa0', '') for i in title])


class DoubanSpider(Spider):
    # Start URLs (required)
    start_urls = ['https://movie.douban.com/top250']
    # Request configuration
    request_config = {
        'RETRIES': 3,
        'DELAY': 0,
        'TIMEOUT': 20
    }

    # Parse function (required)
    def parse(self, html):
        # Convert the HTML into an etree
        etree = self.e_html(html)
        # Extract the pagination links and build the follow-up URLs
        pages = [i.get('href') for i in etree.cssselect('.paginator>a')]
        pages.insert(0, '?start=0&filter=')
        headers = {
            "User-Agent": get_random_user_agent()
        }
        for page in pages:
            url = self.start_urls[0] + page
            yield Request(url, request_config=self.request_config, headers=headers, callback=self.parse_item)

    def parse_item(self, html):
        items_data = DoubanItem.get_items(html=html)
        for item in items_data:
            # Save each title to a text file
            with open('douban250.txt', 'a+') as f:
                f.write(item.title + '\n')


if __name__ == '__main__':
    DoubanSpider.start()

Console output:

/Users/howie/anaconda3/envs/work3/bin/python /Users/howie/Documents/programming/python/git/talonspider/examples/douban_page_by_spider.py
2017-06-07 23:17:30,346 - talonspider - INFO: talonspider started
2017-06-07 23:17:30,693 - talonspider_requests - INFO: GET a url: https://movie.douban.com/top250
2017-06-07 23:17:31,074 - talonspider_requests - INFO: GET a url: https://movie.douban.com/top250?start=25&filter=
2017-06-07 23:17:31,416 - talonspider_requests - INFO: GET a url: https://movie.douban.com/top250?start=50&filter=
2017-06-07 23:17:31,853 - talonspider_requests - INFO: GET a url: https://movie.douban.com/top250?start=75&filter=
2017-06-07 23:17:32,523 - talonspider_requests - INFO: GET a url: https://movie.douban.com/top250?start=100&filter=
2017-06-07 23:17:33,032 - talonspider_requests - INFO: GET a url: https://movie.douban.com/top250?start=125&filter=
2017-06-07 23:17:33,537 - talonspider_requests - INFO: GET a url: https://movie.douban.com/top250?start=150&filter=
2017-06-07 23:17:33,990 - talonspider_requests - INFO: GET a url: https://movie.douban.com/top250?start=175&filter=
2017-06-07 23:17:34,406 - talonspider_requests - INFO: GET a url: https://movie.douban.com/top250?start=200&filter=
2017-06-07 23:17:34,787 - talonspider_requests - INFO: GET a url: https://movie.douban.com/top250?start=225&filter=
2017-06-07 23:17:34,809 - talonspider - INFO: Time usage:0:00:04.462108

Process finished with exit code 0

At this point douban250.txt is generated in the current directory. See douban_page_by_spider.py for the full example.
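If you want to keep more than the title, parse_item can write richer records. A hedged variation, reusing only the API shown above and assuming a Spider subclass can be extended like an ordinary Python class (the class name DoubanJsonSpider and the file name douban250.jsonl are just illustrative), that stores each movie as one JSON line:

import json

class DoubanJsonSpider(DoubanSpider):
    def parse_item(self, html):
        # Variation: dump title, cover and abstract of each movie as one JSON line
        for item in DoubanItem.get_items(html=html):
            with open('douban250.jsonl', 'a+', encoding='utf-8') as f:
                f.write(json.dumps({
                    'title': item.title,
                    'cover': item.cover,
                    'abstract': item.abstract,
                }, ensure_ascii=False) + '\n')

if __name__ == '__main__':
    DoubanJsonSpider.start()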

3. Notes

This is a learning project with plenty of room for improvement; suggestions are welcome. Project address: talonspider.
