Scrapy spider: example code for several ways of crawling

Posted in Python on January 25, 2018

This lesson introduces the Scrapy crawling framework, focusing on one of its components: the spider.

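A spider can crawl in several ways:

Crawl a single page
Build links from a given list and crawl multiple pages
Find the "next page" link and follow it
Enter the links on a page and crawl the pages they point to

An example of each follows.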

1. Crawl a single page

#by 寒小阳(hanxiaoyang.ml@gmail.com)

import scrapy


class JulyeduSpider(scrapy.Spider):
  name = "julyedu"
  start_urls = [
    'https://www.julyedu.com/category/index',
  ]

  def parse(self, response):
    # Each course card on the page is a div with class "course_info_box".
    for julyedu_class in response.xpath('//div[@class="course_info_box"]'):
      item = {
        'title': julyedu_class.xpath('a/h4/text()').extract_first(),
        'desc': julyedu_class.xpath('a/p[@class="course-info-tip"][1]/text()').extract_first(),
        'time': julyedu_class.xpath('a/p[@class="course-info-tip"][2]/text()').extract_first(),
        # The image src may be relative, so resolve it against the page URL.
        'img_url': response.urljoin(julyedu_class.xpath('a/img[1]/@src').extract_first()),
      }
      # Print each field for a quick visual check (Python 3 print function).
      print(item['title'])
      print(item['desc'])
      print(item['time'])
      print(item['img_url'])
      print("\n")

      yield item

2. Build links from a given list and crawl multiple pages

#by 寒小阳(hanxiaoyang.ml@gmail.com)

import scrapy


class CnBlogSpider(scrapy.Spider):
  name = "cnblogs"
  allowed_domains = ["cnblogs.com"]
  start_urls = [
    'http://www.cnblogs.com/pick/#p%s' % p for p in range(1, 11)
    ]

  def parse(self, response):
    # Each article on the listing page is a div with class "post_item".
    for article in response.xpath('//div[@class="post_item"]'):
      item = {
        'title': article.xpath('div[@class="post_item_body"]/h3/a/text()').extract_first().strip(),
        'link': response.urljoin(article.xpath('div[@class="post_item_body"]/h3/a/@href').extract_first()).strip(),
        'summary': article.xpath('div[@class="post_item_body"]/p/text()').extract_first().strip(),
        'author': article.xpath('div[@class="post_item_body"]/div[@class="post_item_foot"]/a/text()').extract_first().strip(),
        'author_link': response.urljoin(article.xpath('div[@class="post_item_body"]/div/a/@href').extract_first()).strip(),
        'comment': article.xpath('div[@class="post_item_body"]/div[@class="post_item_foot"]/span[@class="article_comment"]/a/text()').extract_first().strip(),
        'view': article.xpath('div[@class="post_item_body"]/div[@class="post_item_foot"]/span[@class="article_view"]/a/text()').extract_first().strip(),
      }
      # Print the fields in insertion order for a quick visual check.
      for value in item.values():
        print(value)
      print("")

      yield item
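
The `start_urls` list above is built with a list comprehension, and it can help to look at what that expression produces before crawling. The sketch below uses only the standard library. One thing it makes visible is that the page number in this pattern sits in the URL fragment (the part after `#`), which is not sent to the server in an HTTP request, so whether the ten URLs return different content depends on the site.

```python
from urllib.parse import urldefrag

# Same pattern as start_urls in the spider, written for Python 3.
start_urls = ['http://www.cnblogs.com/pick/#p%s' % p for p in range(1, 11)]

# urldefrag splits off the "#..." fragment; element 0 is the URL without it.
without_fragment = {urldefrag(url)[0] for url in start_urls}

print(start_urls[0])
print(start_urls[-1])
print(len(start_urls), len(without_fragment))
```

If the set of fragment-free URLs has a single element, as it does here, a pagination scheme based on the path or query string may be needed instead; which one applies would have to be checked against the site itself.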

3. Find the "next page" link and follow it

import scrapy


class QuotesSpider(scrapy.Spider):
  name = "quotes"
  start_urls = [
    'http://quotes.toscrape.com/tag/humor/',
  ]

  def parse(self, response):
    for quote in response.xpath('//div[@class="quote"]'):
      yield {
        'text': quote.xpath('span[@class="text"]/text()').extract_first(),
        'author': quote.xpath('span/small[@class="author"]/text()').extract_first(),
      }

    # The link lives on the <a> inside <li class="next">, in its href attribute.
    next_page = response.xpath('//li[@class="next"]/a/@href').extract_first()
    if next_page is not None:
      next_page = response.urljoin(next_page)
      yield scrapy.Request(next_page, callback=self.parse)
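
The call to `response.urljoin` matters because a "next" link is often relative. As a rough standard-library illustration, `urllib.parse.urljoin` resolves an href against the current page URL in much the same way. `resolve_next_page` and the href values below are hypothetical, chosen only to show the behavior.

```python
from urllib.parse import urljoin


def resolve_next_page(page_url, next_href):
    """Return the absolute URL of the next page, or None when there is no link."""
    if next_href is None:
        return None
    return urljoin(page_url, next_href)


page_url = 'http://quotes.toscrape.com/tag/humor/'

# A root-relative href and a path-relative href resolve to the same absolute URL.
print(resolve_next_page(page_url, '/tag/humor/page/2/'))
print(resolve_next_page(page_url, 'page/2/'))

# On the last page no "next" link is found, so the crawl stops there.
print(resolve_next_page(page_url, None))
```

The `None` check mirrors the `if next_page is not None` guard in the spider: when the last page has no "next" link, no further request is yielded.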

4. Enter links and crawl the pages they point to

#by 寒小阳(hanxiaoyang.ml@gmail.com)

import scrapy


class QQNewsSpider(scrapy.Spider):
  name = 'qqnews'
  start_urls = ['http://news.qq.com/society_index.shtml']

  def parse(self, response):
    for href in response.xpath('//*[@id="news"]/div/div/div/div/em/a/@href'):
      full_url = response.urljoin(href.extract())
      yield scrapy.Request(full_url, callback=self.parse_question)

  def parse_question(self, response):
    # Called once for each article page requested in parse().
    item = {
      'title': response.xpath('//div[@class="qq_article"]/div/h1/text()').extract_first(),
      'content': "\n".join(response.xpath('//div[@id="Cnt-Main-Article-QQ"]/p[@class="text"]/text()').extract()),
      'time': response.xpath('//span[@class="a_time"]/text()').extract_first(),
      'cate': response.xpath('//span[@class="a_catalog"]/a/text()').extract_first(),
    }
    # Print title, time, category, then content for a quick visual check.
    print(item['title'])
    print(item['time'])
    print(item['cate'])
    print(item['content'])
    print("")
    yield item

Summary

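That is all for this article on example code for several ways of crawling with a Scrapy spider; we hope it is helpful. Interested readers can browse the other related topics on this site, and if anything is lacking, please point it out in a comment. Thank you for supporting this site!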

- Author -

NodYoung
