Example code: crawling image resources with Scrapy's ImagesPipeline


Posted in Python on September 28, 2020

This example uses Scrapy's ImagesPipeline to crawl and download the images in a Douban photo album; the downloaded files are saved in a full folder under the spider's IMAGES_STORE directory. Set up the project first, then run the spider with scrapy crawl download_douban (pass -a url=<album_id> to crawl a different album).

scrapy startproject DoubanImgs

cd DoubanImgs

scrapy genspider download_douban douban.com

vim spiders/download_douban.py

# coding=utf-8
from scrapy.spiders import Spider
from scrapy import Request
from ..items import DoubanImgsItem


class DownloadDoubanSpider(Spider):
  name = 'download_douban'

  default_headers = {
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',
    'Accept-Encoding': 'gzip, deflate, sdch, br',
    'Accept-Language': 'zh-CN,zh;q=0.8,en;q=0.6',
    'Cache-Control': 'max-age=0',
    'Connection': 'keep-alive',
    'Host': 'www.douban.com',
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/52.0.2743.116 Safari/537.36',
  }

  def __init__(self, url='1638835355', *args, **kwargs):
    # Call the Spider base constructor before adding our own attributes.
    super().__init__(*args, **kwargs)
    self.allowed_domains = ['douban.com']
    self.start_urls = []
    # The album is paginated 18 photos per page; build one URL per page.
    for i in range(23):
      if i == 0:
        page_url = 'http://www.douban.com/photos/album/' + url
      else:
        page_url = 'http://www.douban.com/photos/album/' + url + '/?start=' + str(i * 18)
      self.start_urls.append(page_url)
    self.url = url

  def start_requests(self):

    for url in self.start_urls:
      yield Request(url=url, headers=self.default_headers, callback=self.parse)

  def parse(self, response):
    list_imgs = response.xpath('//div[@class="photolst clearfix"]//img/@src').extract()
    if list_imgs:
      item = DoubanImgsItem()
      item['image_urls'] = list_imgs
      yield item
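The constructor builds 23 page URLs, one per album page of 18 photos. A quick standalone check of that URL scheme (no Scrapy needed, using the same default album id as the spider):

```python
album = '1638835355'
start_urls = []
for i in range(23):
  if i == 0:
    # First page has no start offset.
    start_urls.append('http://www.douban.com/photos/album/' + album)
  else:
    # Subsequent pages offset by 18 photos each.
    start_urls.append('http://www.douban.com/photos/album/%s/?start=%d' % (album, i * 18))

print(len(start_urls))   # 23
print(start_urls[0])     # http://www.douban.com/photos/album/1638835355
print(start_urls[1])     # http://www.douban.com/photos/album/1638835355/?start=18
```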

vim settings.py

# -*- coding: utf-8 -*-

# Scrapy settings for DoubanImgs project
#
# For simplicity, this file contains only settings considered important or
# commonly used. You can find more settings consulting the documentation:
#
#   https://doc.scrapy.org/en/latest/topics/settings.html
#   https://doc.scrapy.org/en/latest/topics/downloader-middleware.html
#   https://doc.scrapy.org/en/latest/topics/spider-middleware.html

BOT_NAME = 'DoubanImgs'

SPIDER_MODULES = ['DoubanImgs.spiders']
NEWSPIDER_MODULE = 'DoubanImgs.spiders'

ITEM_PIPELINES = {
  'DoubanImgs.pipelines.DoubanImgDownloadPipeline': 300,
}
IMAGES_STORE = '.'
IMAGES_EXPIRES = 90

# Crawl responsibly by identifying yourself (and your website) on the user-agent
#USER_AGENT = 'DoubanImgs (+http://www.yourdomain.com)'

# Obey robots.txt rules
ROBOTSTXT_OBEY = False

# Configure maximum concurrent requests performed by Scrapy (default: 16)
#CONCURRENT_REQUESTS = 32

# Configure a delay for requests for the same website (default: 0)
# See https://doc.scrapy.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
DOWNLOAD_DELAY = 0.5
# The download delay setting will honor only one of:
#CONCURRENT_REQUESTS_PER_DOMAIN = 16
#CONCURRENT_REQUESTS_PER_IP = 16

# Disable cookies (enabled by default)
#COOKIES_ENABLED = False

# Disable Telnet Console (enabled by default)
#TELNETCONSOLE_ENABLED = False

# Override the default request headers:
#DEFAULT_REQUEST_HEADERS = {
#  'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
#  'Accept-Language': 'en',
#}

# Enable or disable spider middlewares
# See https://doc.scrapy.org/en/latest/topics/spider-middleware.html
#SPIDER_MIDDLEWARES = {
#  'DoubanImgs.middlewares.DoubanimgsSpiderMiddleware': 543,
#}

# Enable or disable downloader middlewares
# See https://doc.scrapy.org/en/latest/topics/downloader-middleware.html
#DOWNLOADER_MIDDLEWARES = {
#  'DoubanImgs.middlewares.DoubanimgsDownloaderMiddleware': 543,
#}

# Enable or disable extensions
# See https://doc.scrapy.org/en/latest/topics/extensions.html
#EXTENSIONS = {
#  'scrapy.extensions.telnet.TelnetConsole': None,
#}

# Configure item pipelines
# See https://doc.scrapy.org/en/latest/topics/item-pipeline.html
#ITEM_PIPELINES = {
#  'DoubanImgs.pipelines.DoubanimgsPipeline': 300,
#}

# Enable and configure the AutoThrottle extension (disabled by default)
# See https://doc.scrapy.org/en/latest/topics/autothrottle.html
#AUTOTHROTTLE_ENABLED = True
# The initial download delay
#AUTOTHROTTLE_START_DELAY = 5
# The maximum download delay to be set in case of high latencies
#AUTOTHROTTLE_MAX_DELAY = 60
# The average number of requests Scrapy should be sending in parallel to
# each remote server
#AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
# Enable showing throttling stats for every response received:
#AUTOTHROTTLE_DEBUG = False

# Enable and configure HTTP caching (disabled by default)
# See https://doc.scrapy.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
#HTTPCACHE_ENABLED = True
#HTTPCACHE_EXPIRATION_SECS = 0
#HTTPCACHE_DIR = 'httpcache'
#HTTPCACHE_IGNORE_HTTP_CODES = []
#HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'
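With IMAGES_STORE = '.' the pipeline writes into a full folder in the project directory, which is where the downloaded images mentioned in the intro come from. By default ImagesPipeline names each file after the SHA-1 hash of its URL, i.e. full/<sha1(url)>.jpg. A small sketch of that naming scheme (the image URL below is made up for illustration):

```python
import hashlib

def image_file_path(url):
  # Mirrors ImagesPipeline's default naming: full/<sha1 of the URL>.jpg
  digest = hashlib.sha1(url.encode('utf-8')).hexdigest()
  return 'full/%s.jpg' % digest

path = image_file_path('https://img1.doubanio.com/view/photo/m/public/p0000000000.jpg')
print(path)  # full/<40 hex chars>.jpg
```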

vim items.py

# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# https://doc.scrapy.org/en/latest/topics/items.html
import scrapy
from scrapy import Field


class DoubanImgsItem(scrapy.Item):
  # define the fields for your item here like:
  # name = scrapy.Field()
  image_urls = Field()
  images = Field()
  image_paths = Field()

vim pipelines.py

# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: http://doc.scrapy.org/en/latest/topics/item-pipeline.html
from scrapy.pipelines.images import ImagesPipeline
from scrapy.exceptions import DropItem
from scrapy import Request


class DoubanImgsPipeline(object):
  def process_item(self, item, spider):
    return item


class DoubanImgDownloadPipeline(ImagesPipeline):
  default_headers = {
    'accept': 'image/webp,image/*,*/*;q=0.8',
    'accept-encoding': 'gzip, deflate, sdch, br',
    'accept-language': 'zh-CN,zh;q=0.8,en;q=0.6',
    'cookie': 'bid=yQdC/AzTaCw',
    'referer': 'https://www.douban.com/photos/photo/2370443040/',
    'user-agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/52.0.2743.116 Safari/537.36',
  }

  def get_media_requests(self, item, info):
    for image_url in item['image_urls']:
      # Use the image URL itself as the Referer so the request passes
      # Douban's hotlink check.
      self.default_headers['referer'] = image_url
      yield Request(image_url, headers=self.default_headers)

  def item_completed(self, results, item, info):
    image_paths = [x['path'] for ok, x in results if ok]
    if not image_paths:
      raise DropItem("Item contains no images")
    item['image_paths'] = image_paths
    return item
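item_completed receives one (success, info) tuple per requested image; on success the info dict carries the stored path. A minimal sketch of the filtering above, with mock data in place of real download results:

```python
# Mock of the results list Scrapy passes to item_completed:
# (success_flag, info_dict) on success, (False, failure) on error.
results = [
  (True, {'url': 'http://example.com/a.jpg', 'path': 'full/a1b2.jpg', 'checksum': 'deadbeef'}),
  (False, Exception('download error')),
  (True, {'url': 'http://example.com/b.jpg', 'path': 'full/c3d4.jpg', 'checksum': 'cafebabe'}),
]

# Same filtering as in the pipeline: keep paths of successful downloads only.
image_paths = [x['path'] for ok, x in results if ok]
print(image_paths)  # ['full/a1b2.jpg', 'full/c3d4.jpg']
```

If every download fails, image_paths is empty and the pipeline raises DropItem, discarding the item.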

That concludes this article on example code for crawling image resources with Scrapy's ImagesPipeline. I hope it is helpful.
