A Scrapy example: scraping a college's news reports
2020-11-27 14:14:09  Editor: 小采

The knowledge points covered, in order, are:

1. Scrape the basic data from a single listing page.
2. Do a second-level crawl by following the scraped links.
3. Loop over the pages so that all of the data gets scraped.

Enough talk; let's get to work.

3.1 Scrape all the news links under one page of the news section



By analyzing the source code of the news section page (screenshots omitted), we find that each news entry we want to scrape sits in a div with class newsinfo_box cf.

So we only need to point the spider's selector at those div[@class='newsinfo_box cf'] nodes and loop over them with a for loop.
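Before writing the spider, the selector can be sanity-checked interactively with Scrapy's shell. A quick sketch (the XPath expressions are the same ones used in the code below):

scrapy shell "http://ggglxy.scu.edu.cn/index.php?c=special&sid=1&page=1"
>>> response.xpath("//div[@class='newsinfo_box cf']")
>>> response.xpath("//div[@class='newsinfo_box cf']/div[@class='news_c fr']/h3/a/@href").extract_first()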

Write the code:
import scrapy

class News2Spider(scrapy.Spider):
    name = "news_info_2"
    start_urls = [
        "http://ggglxy.scu.edu.cn/index.php?c=special&sid=1&page=1",
    ]

    def parse(self, response):
        for href in response.xpath("//div[@class='newsinfo_box cf']"):
            url = response.urljoin(href.xpath("div[@class='news_c fr']/h3/a/@href").extract_first())

Test it, and it passes!



3.2 Follow each scraped news link into the detail page and scrape the data we need (mainly the news content)

Now I have a set of URLs, and I need to visit each one to scrape the title, date, and content I'm after. The implementation is quite simple: whenever the original code scrapes a URL, follow that URL and scrape the corresponding data there. So all I need is one more method that scrapes the news detail page, invoked via scrapy.Request.

Write the code:

# Method that scrapes the news detail page
def parse_dir_contents(self, response):
    item = GgglxyItem()
    item['date'] = response.xpath("//div[@class='detail_zy_title']/p/text()").extract_first()
    item['href'] = response
    item['title'] = response.xpath("//div[@class='detail_zy_title']/h1/text()").extract_first()
    data = response.xpath("//div[@class='detail_zy_c pb30 mb30']")
    item['content'] = data[0].xpath('string(.)').extract()[0]
    yield item
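This method fills a GgglxyItem, whose definition the article never shows. A minimal sketch of ggglxy/items.py that is consistent with the fields used above (the exact field set is an assumption based on this code):

import scrapy

class GgglxyItem(scrapy.Item):
    # Fields referenced by parse_dir_contents (assumed from the usage above)
    date = scrapy.Field()
    href = scrapy.Field()
    title = scrapy.Field()
    content = scrapy.Field()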

Merged into the original code, we get:

import scrapy
from ggglxy.items import GgglxyItem

class News2Spider(scrapy.Spider):
    name = "news_info_2"
    start_urls = [
        "http://ggglxy.scu.edu.cn/index.php?c=special&sid=1&page=1",
    ]

    def parse(self, response):
        for href in response.xpath("//div[@class='newsinfo_box cf']"):
            url = response.urljoin(href.xpath("div[@class='news_c fr']/h3/a/@href").extract_first())
            # Follow the link with the news-detail scraping method
            yield scrapy.Request(url, callback=self.parse_dir_contents)

    # Method that scrapes the news detail page
    def parse_dir_contents(self, response):
        item = GgglxyItem()
        item['date'] = response.xpath("//div[@class='detail_zy_title']/p/text()").extract_first()
        item['href'] = response
        item['title'] = response.xpath("//div[@class='detail_zy_title']/h1/text()").extract_first()
        data = response.xpath("//div[@class='detail_zy_c pb30 mb30']")
        item['content'] = data[0].xpath('string(.)').extract()[0]
        yield item
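To test it, run the spider from the project root (assuming the Scrapy project is named ggglxy, as the import path suggests):

scrapy crawl news_info_2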

The test passes!



Now we add a loop over the pages:

# Module-level page counter
NEXT_PAGE_NUM = 1

# Appended at the end of parse():
        NEXT_PAGE_NUM = NEXT_PAGE_NUM + 1
        if NEXT_PAGE_NUM < 11:
            next_url = 'http://ggglxy.scu.edu.cn/index.php?c=special&sid=1&page=%s' % NEXT_PAGE_NUM
            yield scrapy.Request(next_url, callback=self.parse)
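A note on this design: the module-level NEXT_PAGE_NUM counter works for a single crawl, but a more common Scrapy pattern is to follow a "next page" link taken from the response itself. A rough sketch of that idea (hypothetical; it assumes the listing page exposes a next-page anchor, which this article does not show):

        # Hypothetical: follow the next-page link if the listing page has one
        next_href = response.xpath("//a[contains(text(), '下一页')]/@href").extract_first()
        if next_href:
            yield scrapy.Request(response.urljoin(next_href), callback=self.parse)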

Adding this loop into the original code gives:

import scrapy
from ggglxy.items import GgglxyItem

NEXT_PAGE_NUM = 1

class News2Spider(scrapy.Spider):
    name = "news_info_2"
    start_urls = [
        "http://ggglxy.scu.edu.cn/index.php?c=special&sid=1&page=1",
    ]

    def parse(self, response):
        for href in response.xpath("//div[@class='newsinfo_box cf']"):
            URL = response.urljoin(href.xpath("div[@class='news_c fr']/h3/a/@href").extract_first())
            yield scrapy.Request(URL, callback=self.parse_dir_contents)
        global NEXT_PAGE_NUM
        NEXT_PAGE_NUM = NEXT_PAGE_NUM + 1
        if NEXT_PAGE_NUM < 11:
            next_url = 'http://ggglxy.scu.edu.cn/index.php?c=special&sid=1&page=%s' % NEXT_PAGE_NUM
            yield scrapy.Request(next_url, callback=self.parse)

    def parse_dir_contents(self, response):
        item = GgglxyItem()
        item['date'] = response.xpath("//div[@class='detail_zy_title']/p/text()").extract_first()
        item['href'] = response
        item['title'] = response.xpath("//div[@class='detail_zy_title']/h1/text()").extract_first()
        data = response.xpath("//div[@class='detail_zy_c pb30 mb30']")
        item['content'] = data[0].xpath('string(.)').extract()[0]
        yield item

Test:



We scraped 191 items, but the official site shows 193 news entries, so two are missing.
Why? We notice two errors in the log.
Locating the problem: it turns out the college's news section also contains two hidden second-level sections (sub-columns), and their URLs (shown in the omitted screenshots) look different from the ordinary article URLs. No wonder they weren't scraped!

So we have to add a dedicated rule for these two second-level section URLs; all it takes is a check for whether a URL points to a second-level section:

            if URL.find('type') != -1:
                yield scrapy.Request(URL, callback=self.parse)

Assembled back into the original spider:

import scrapy
from ggglxy.items import GgglxyItem

NEXT_PAGE_NUM = 1

class News2Spider(scrapy.Spider):
    name = "news_info_2"
    start_urls = [
        "http://ggglxy.scu.edu.cn/index.php?c=special&sid=1&page=1",
    ]

    def parse(self, response):
        for href in response.xpath("//div[@class='newsinfo_box cf']"):
            URL = response.urljoin(href.xpath("div[@class='news_c fr']/h3/a/@href").extract_first())
            if URL.find('type') != -1:
                yield scrapy.Request(URL, callback=self.parse)
            yield scrapy.Request(URL, callback=self.parse_dir_contents)
        global NEXT_PAGE_NUM
        NEXT_PAGE_NUM = NEXT_PAGE_NUM + 1
        if NEXT_PAGE_NUM < 11:
            next_url = 'http://ggglxy.scu.edu.cn/index.php?c=special&sid=1&page=%s' % NEXT_PAGE_NUM
            yield scrapy.Request(next_url, callback=self.parse)

    def parse_dir_contents(self, response):
        item = GgglxyItem()
        item['date'] = response.xpath("//div[@class='detail_zy_title']/p/text()").extract_first()
        item['href'] = response
        item['title'] = response.xpath("//div[@class='detail_zy_title']/h1/text()").extract_first()
        data = response.xpath("//div[@class='detail_zy_c pb30 mb30']")
        item['content'] = data[0].xpath('string(.)').extract()[0]
        yield item
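One thing to note about parse above: a URL that contains 'type' is yielded twice, once back to parse as a listing page and once to parse_dir_contents as a detail page. If you prefer to treat second-level section URLs purely as listing pages, an if/else variant would do it (a variation shown for illustration only, not the code the author tested):

            # Variation (not the author's code): listing pages and detail pages kept separate
            if URL.find('type') != -1:
                yield scrapy.Request(URL, callback=self.parse)
            else:
                yield scrapy.Request(URL, callback=self.parse_dir_contents)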

Test:



We find that the number of items scraped has grown beyond the previous 191, and the log no longer contains any errors, which shows that our crawl rules are OK!

4. Exporting the scraped data

 scrapy crawl news_info_2 -o 0016.json
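The -o flag picks the export format from the file extension; JSON, JSON Lines, and CSV are all standard Scrapy feed exports, so the following work as well:

scrapy crawl news_info_2 -o 0016.csv
scrapy crawl news_info_2 -o 0016.jl

If the Chinese text ends up as \u escape sequences in the JSON output, adding FEED_EXPORT_ENCODING = 'utf-8' to settings.py (supported in recent Scrapy versions) keeps it human-readable.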
