Scrapy爬虫实例讲解_校花网


Posted in Python onOctober 23, 2017

学习爬虫有一段时间了,今天使用Scrapy框架将校花网的图片爬取到本地。Scrapy爬虫框架相对于使用requests库进行网页的爬取,拥有更高的性能。

Scrapy官方定义:Scrapy是用于抓取网站并提取结构化数据的应用程序框架,可用于广泛的有用应用程序,如数据挖掘,信息处理或历史存档。

建立Scrapy爬虫工程

在安装好Scrapy框架后,直接使用命令行进行项目的创建:

E:\ScrapyDemo>scrapy startproject xiaohuar
New Scrapy project 'xiaohuar', using template directory 'c:\\users\\lei\\appdata\\local\\programs\\python\\python35\\lib
\\site-packages\\scrapy\\templates\\project', created in:
 E:\ScrapyDemo\xiaohuar

You can start your first spider with:
 cd xiaohuar
 scrapy genspider example example.com

创建一个Scrapy爬虫

创建工程的时候,会自动创建一个与工程同名的目录,进入到目录中执行如下命令:

E:\ScrapyDemo\xiaohuar>scrapy genspider -t basic xiaohua xiaohuar.com
Created spider 'xiaohua' using template 'basic' in module:
xiaohuar.spiders.xiaohua命令中"xiaohua"

是生成Spider中*.py文件的文件名,"xiaohuar.com"是将要爬取网站的URL,可以在程序中更改。

编写Spider代码

编写E:\ScrapyDemo\xiaohuar\xiaohuar\spiders中的xiaohua.py文件。主要是配置URL和对请求到的页面的解析方式。

# -*- coding: utf-8 -*-
import scrapy
from scrapy.http import Request
import re

class XiaohuaSpider(scrapy.Spider):
 name = 'xiaohua'
 allowed_domains = ['xiaohuar.com']
 start_urls = []
 for i in range(43):
  url = "http://www.xiaohuar.com/list-1-%s.html" %i
  start_urls.append(url)

 def parse(self, response):
  if "www.xiaohuar.com/list-1" in response.url:
   # 下载的html源代码
   html = response.text
   # 网页中图片存储地址:src="/d/file/20160126/905e563421921adf9b6fb4408ec4e72f.jpg"
   # 通过正则匹配到所有的图片
   # 获取的是图片的相对路径的列表
   img_urls = re.findall(r'/d/file/\d+/\w+\.jpg',html)
   
   # 使用循环对图片页进行请求
   for img_url in img_urls:
    # 将图片的URL补全
    if "http://" not in img_url:
     img_url = "http://www.xiaohuar.com%s" %img_url
    
    # 回调,返回response
    yield Request(img_url)
  else:
   # 下载图片 
   url = response.url
   # 保存的图片文件名
   title = re.findall(r'\w*.jpg',url)[0]
   # 保存图片
   with open('E:\\xiaohua_img\\%s' % title, 'wb') as f:
    f.write(response.body)

这里使用正则表达式对图片的地址进行匹配,其他网页也都大同小异,需要根据具体的网页源代码进行分析。

运行爬虫

E:\ScrapyDemo\xiaohuar>scrapy crawl xiaohua
2017-10-22 22:30:11 [scrapy.utils.log] INFO: Scrapy 1.4.0 started (bot: xiaohuar)
2017-10-22 22:30:11 [scrapy.utils.log] INFO: Overridden settings: {'BOT_NAME': 'xiaohuar', 'SPIDER_MODULES': ['xiaohuar.
spiders'], 'ROBOTSTXT_OBEY': True, 'NEWSPIDER_MODULE': 'xiaohuar.spiders'}
2017-10-22 22:30:11 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.telnet.TelnetConsole',
 'scrapy.extensions.corestats.CoreStats',
 'scrapy.extensions.logstats.LogStats']
2017-10-22 22:30:12 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware',
 'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
 'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
 'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
 'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
 'scrapy.downloadermiddlewares.retry.RetryMiddleware',
 'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
 'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
 'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
 'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
 'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
 'scrapy.downloadermiddlewares.stats.DownloaderStats']
2017-10-22 22:30:12 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
 'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
 'scrapy.spidermiddlewares.referer.RefererMiddleware',
 'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
 'scrapy.spidermiddlewares.depth.DepthMiddleware']
2017-10-22 22:30:12 [scrapy.middleware] INFO: Enabled item pipelines:
[]
2017-10-22 22:30:12 [scrapy.core.engine] INFO: Spider opened
2017-10-22 22:30:12 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min
)
2017-10-22 22:30:12 [scrapy.extensions.telnet] DEBUG: Telnet console listening on 127.0.0.1:6023
2017-10-22 22:30:12 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://www.xiaohuar.com/robots.txt> (referer: None)
2017-10-22 22:30:13 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://www.xiaohuar.com/list-1-0.html> (referer: None
)
2017-10-22 22:30:13 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://www.xiaohuar.com/d/file/20170721/cb96f1b106b3d
b4a6bfcf3d2e880dea0.jpg> (referer: http://www.xiaohuar.com/list-1-0.html)
2017-10-22 22:30:13 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://www.xiaohuar.com/d/file/20170824/dcc166b0eba6a
37e05424cfc29023121.jpg> (referer: http://www.xiaohuar.com/list-1-0.html)
2017-10-22 22:30:13 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://www.xiaohuar.com/d/file/20170916/7f78145b1ca16
2eb814fbc03ad24fbc1.jpg> (referer: http://www.xiaohuar.com/list-1-0.html)
2017-10-22 22:30:13 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://www.xiaohuar.com/d/file/20170919/2f728d0f110a2
1fea95ce13e0b010d06.jpg> (referer: http://www.xiaohuar.com/list-1-0.html)
2017-10-22 22:30:13 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://www.xiaohuar.com/d/file/20170819/9c3dfeef7e08c
c0303ce233e4ddafa7f.jpg> (referer: http://www.xiaohuar.com/list-1-0.html)
2017-10-22 22:30:14 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://www.xiaohuar.com/d/file/20170917/715515e7fe1f1
cb9fd388bbbb00467c2.jpg> (referer: http://www.xiaohuar.com/list-1-0.html)
2017-10-22 22:30:14 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://www.xiaohuar.com/d/file/20170628/f3d06ef49965a
edbe18286a2f221fd9f.jpg> (referer: http://www.xiaohuar.com/list-1-0.html)
2017-10-22 22:30:14 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://www.xiaohuar.com/d/file/20170513/6121e3e90ff3b
a4c9398121bda1dd582.jpg> (referer: http://www.xiaohuar.com/list-1-0.html)
2017-10-22 22:30:14 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://www.xiaohuar.com/d/file/20170516/6e295fe48c332
45be858c40d37fb5ee6.jpg> (referer: http://www.xiaohuar.com/list-1-0.html)
2017-10-22 22:30:14 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://www.xiaohuar.com/d/file/20170707/f7ca636f73937
e33836e765b7261f036.jpg> (referer: http://www.xiaohuar.com/list-1-0.html)
2017-10-22 22:30:14 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://www.xiaohuar.com/d/file/20170528/b352258c83776
b9a2462277dec375d0c.jpg> (referer: http://www.xiaohuar.com/list-1-0.html)
2017-10-22 22:30:14 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://www.xiaohuar.com/d/file/20170527/4a7a7f1e6b69f
126292b981c90110d0a.jpg> (referer: http://www.xiaohuar.com/list-1-0.html)
2017-10-22 22:30:14 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://www.xiaohuar.com/d/file/20170715/61110ba027f00
4fb503ff09cdee44d0c.jpg> (referer: http://www.xiaohuar.com/list-1-0.html)
2017-10-22 22:30:14 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://www.xiaohuar.com/d/file/20170520/dd21a21751e24
a8f161792b66011688c.jpg> (referer: http://www.xiaohuar.com/list-1-0.html)
2017-10-22 22:30:15 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://www.xiaohuar.com/d/file/20170529/8140c4ad797ca
01f5e99d09c82dd8a42.jpg> (referer: http://www.xiaohuar.com/list-1-0.html)
2017-10-22 22:30:15 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://www.xiaohuar.com/d/file/20170603/e55f77fb3aa3c
7f118a46eeef5c0fbbf.jpg> (referer: http://www.xiaohuar.com/list-1-0.html)
2017-10-22 22:30:15 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://www.xiaohuar.com/d/file/20170529/e5902d4d3e408
29f9a0d30f7488eab84.jpg> (referer: http://www.xiaohuar.com/list-1-0.html)
2017-10-22 22:30:15 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://www.xiaohuar.com/d/file/20170604/ec3794d0d42b5
38bf4461a84dac32509.jpg> (referer: http://www.xiaohuar.com/list-1-0.html)
2017-10-22 22:30:15 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://www.xiaohuar.com/d/file/20170603/c34b29f68e8f9
6d44c63fe29bf4a66b8.jpg> (referer: http://www.xiaohuar.com/list-1-0.html)
2017-10-22 22:30:15 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://www.xiaohuar.com/d/file/20170701/fb18711a6af87
f30942d6a19f6da6b3e.jpg> (referer: http://www.xiaohuar.com/list-1-0.html)
2017-10-22 22:30:15 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://www.xiaohuar.com/d/file/20170619/e0456729d4dcb
ea569a1acbc6a47ab69.jpg> (referer: http://www.xiaohuar.com/list-1-0.html)
2017-10-22 22:30:15 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://www.xiaohuar.com/d/file/20170626/0ab1d89f54c90
df477a90aa533ceea36.jpg> (referer: http://www.xiaohuar.com/list-1-0.html)
2017-10-22 22:30:15 [scrapy.core.engine] INFO: Closing spider (finished)
2017-10-22 22:30:15 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 8785,
 'downloader/request_count': 24,
 'downloader/request_method_count/GET': 24,
 'downloader/response_bytes': 2278896,
 'downloader/response_count': 24,
 'downloader/response_status_count/200': 24,
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(2017, 10, 22, 14, 30, 15, 892287),
 'log_count/DEBUG': 25,
 'log_count/INFO': 7,
 'request_depth_max': 1,
 'response_received_count': 24,
 'scheduler/dequeued': 23,
 'scheduler/dequeued/memory': 23,
 'scheduler/enqueued': 23,
 'scheduler/enqueued/memory': 23,
 'start_time': datetime.datetime(2017, 10, 22, 14, 30, 12, 698874)}
2017-10-22 22:30:15 [scrapy.core.engine] INFO: Spider closed (finished)
scrapy crawl xiaohua

图片保存

在图片保存过程中"\"需要进行转义。

>>> r = requests.get("https://timgsa.baidu.com/timg?image&quality=80&size=b9999_10000&sec=1508693697147&di=23eb655d8e450f84cf39453bc1029bc0&imgtype=0&src=http%3A%2F%2Fb.hiphotos.baidu.com%2Fimage%2Fpic%2Fitem%2Fc9fcc3cec3fdfc038b027f7bde3f8794a5c226fe.jpg")
>>> open("E:\xiaohua_img\01.jpg",'wb').write(r.content)
 File "<stdin>", line 1
SyntaxError: (unicode error) 'unicodeescape' codec can't decode by
>>> open("E:\\xiaohua_img\1.jpg",'wb').write(r.content)
Traceback (most recent call last):
 File "<stdin>", line 1, in <module>
OSError: [Errno 22] Invalid argument: 'E:\\xiaohua_img\x01.jpg'
>>> open("E:\\xiaohua_img\\1.jpg",'wb').write(r.content)

以上这篇Scrapy爬虫实例讲解_校花网就是小编分享给大家的全部内容了,希望能给大家一个参考,也希望大家多多支持三水点靠木。

Python 相关文章推荐
python字符串中的单双引
Feb 16 Python
Python搭建HTTP服务器和FTP服务器
Mar 09 Python
windows10下python3.5 pip3安装图文教程
Apr 02 Python
VScode编写第一个Python程序HelloWorld步骤
Apr 06 Python
Django model序列化为json的方法示例
Oct 16 Python
Python图像滤波处理操作示例【基于ImageFilter类】
Jan 03 Python
python matplotlib画图库学习绘制常用的图
Mar 19 Python
在Python中合并字典模块ChainMap的隐藏坑【推荐】
Jun 27 Python
pytorch进行上采样的种类实例
Feb 18 Python
使用Pytorch搭建模型的步骤
Nov 16 Python
Python爬虫进阶之爬取某视频并下载的实现
Dec 08 Python
python解包概念及实例
Feb 17 Python
Python学习笔记之if语句的使用示例
Oct 23 #Python
Django实现快速分页的方法实例
Oct 22 #Python
python使用SMTP发送qq或sina邮件
Oct 21 #Python
python爬虫headers设置后无效的解决方法
Oct 21 #Python
Python 结巴分词实现关键词抽取分析
Oct 21 #Python
恢复百度云盘本地误删的文件脚本(简单方法)
Oct 21 #Python
Python实现对百度云的文件上传(实例讲解)
Oct 21 #Python
You might like
解析posix与perl标准的正则表达式区别
2013/06/17 PHP
php中curl、fsocket、file_get_content三个函数的使用比较
2014/05/09 PHP
使用配置类定义Codeigniter全局变量
2014/06/12 PHP
php实现用于删除整个目录的递归函数
2015/03/16 PHP
详解PHP的Yii框架中扩展的安装与使用
2016/04/01 PHP
php如何获取Http请求
2020/04/30 PHP
javascript:json数据的页面绑定示例代码
2014/01/26 Javascript
Jquery 在页面加载后执行的几种方式
2014/03/14 Javascript
使用JQuery库提供的扩展功能实现自定义方法
2014/09/09 Javascript
基于javascript实现tab切换特效
2016/03/29 Javascript
探索Vue.js component内容实现
2016/11/03 Javascript
Bootstrap 手风琴菜单的实现代码
2017/01/20 Javascript
基于JS实现9种不同的面包屑和分布式多步骤导航效果
2017/02/21 Javascript
Koa 中的错误处理解析
2019/04/09 Javascript
微信小程序云开发实现云数据库读写权限
2019/05/17 Javascript
vue 中 命名视图的用法实例详解
2019/08/14 Javascript
微信小程序图片右边加两行文字的代码
2020/04/23 Javascript
[03:22]DOTA2超级联赛专访单车:找到属于自己的英雄
2013/06/08 DOTA
python thread 并发且顺序运行示例
2009/04/09 Python
解析Python中的异常处理
2015/04/28 Python
Python实现利用最大公约数求三个正整数的最小公倍数示例
2017/09/30 Python
python中for用来遍历range函数的方法
2018/06/08 Python
python递归实现快速排序
2018/08/18 Python
Python3字符串encode与decode的讲解
2019/04/02 Python
Python + Flask 实现简单的验证码系统
2019/10/01 Python
HTML5之SVG 2D入门4—笔画与填充
2013/01/30 HTML / CSS
Stuart Weitzman美国官网:美国奢华鞋履品牌
2016/08/18 全球购物
SmartBuyGlasses台湾:名牌眼镜,名牌太阳眼镜及隐形眼镜
2017/01/04 全球购物
一名老师的自我评价
2014/02/07 职场文书
酒店员工检讨书
2014/02/18 职场文书
十佳好少年事迹材料
2014/08/21 职场文书
小学生国庆演讲稿
2014/09/05 职场文书
党员学习中共十八大思想报告
2014/09/12 职场文书
学生检讨书
2015/01/27 职场文书
python神经网络 tf.name_scope 和 tf.variable_scope 的区别
2022/05/04 Python
Nginx使用ngx_http_upstream_module实现负载均衡功能示例
2022/08/05 Servers