Example code for crawling image resources with Scrapy's ImagesPipeline


Posted in Python on September 28, 2020

This is an example of using Scrapy's ImagesPipeline to crawl and download images; the downloaded images are saved in the spider's full folder.

scrapy startproject DoubanImgs

cd DoubanImgs

scrapy genspider download_douban douban.com

vim spiders/download_douban.py

# coding=utf-8
from scrapy.spiders import Spider
import re
from scrapy import Request
from ..items import DoubanImgsItem


class download_douban(Spider):
  name = 'download_douban'

  default_headers = {
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',
    'Accept-Encoding': 'gzip, deflate, sdch, br',
    'Accept-Language': 'zh-CN,zh;q=0.8,en;q=0.6',
    'Cache-Control': 'max-age=0',
    'Connection': 'keep-alive',
    'Host': 'www.douban.com',
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/52.0.2743.116 Safari/537.36',
  }

  def __init__(self, url='1638835355', *args, **kwargs):
    self.allowed_domains = ['douban.com']
    self.start_urls = []
    for i in range(23):  # 23 album pages, 18 photos per page
      if i == 0:
        page_url = 'http://www.douban.com/photos/album/' + url
      else:
        page_url = 'http://www.douban.com/photos/album/' + url + '/?start=' + str(i*18)
      self.start_urls.append(page_url)
    self.url = url
    # call the parent Spider constructor
    super(download_douban, self).__init__(*args, **kwargs)

  def start_requests(self):

    for url in self.start_urls:
      yield Request(url=url, headers=self.default_headers, callback=self.parse)

  def parse(self, response):
    list_imgs = response.xpath('//div[@class="photolst clearfix"]//img/@src').extract()
    if list_imgs:
      item = DoubanImgsItem()
      item['image_urls'] = list_imgs
      yield item
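
The spider builds 23 start URLs for the album (the first page plus pages offset in steps of 18 photos), requests each one with browser-like headers, and yields a single item per page containing every thumbnail URL matched by the XPath. If you want to verify the XPath before a full crawl, a quick check in scrapy shell is possible; this is only an optional sanity check, and douban may refuse requests that lack the headers used above, so treat it as a sketch.

scrapy shell 'http://www.douban.com/photos/album/1638835355'
>>> response.xpath('//div[@class="photolst clearfix"]//img/@src').extract()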

vim settings.py

# -*- coding: utf-8 -*-

# Scrapy settings for DoubanImgs project
#
# For simplicity, this file contains only settings considered important or
# commonly used. You can find more settings consulting the documentation:
#
#   https://doc.scrapy.org/en/latest/topics/settings.html
#   https://doc.scrapy.org/en/latest/topics/downloader-middleware.html
#   https://doc.scrapy.org/en/latest/topics/spider-middleware.html

BOT_NAME = 'DoubanImgs'

SPIDER_MODULES = ['DoubanImgs.spiders']
NEWSPIDER_MODULE = 'DoubanImgs.spiders'

ITEM_PIPELINES = {
  'DoubanImgs.pipelines.DoubanImgDownloadPipeline': 300,
}
IMAGES_STORE = '.'
IMAGES_EXPIRES = 90

# Crawl responsibly by identifying yourself (and your website) on the user-agent
#USER_AGENT = 'DoubanImgs (+http://www.yourdomain.com)'

# Obey robots.txt rules
ROBOTSTXT_OBEY = False

# Configure maximum concurrent requests performed by Scrapy (default: 16)
#CONCURRENT_REQUESTS = 32

# Configure a delay for requests for the same website (default: 0)
# See https://doc.scrapy.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
DOWNLOAD_DELAY = 0.5
# The download delay setting will honor only one of:
#CONCURRENT_REQUESTS_PER_DOMAIN = 16
#CONCURRENT_REQUESTS_PER_IP = 16

# Disable cookies (enabled by default)
#COOKIES_ENABLED = False

# Disable Telnet Console (enabled by default)
#TELNETCONSOLE_ENABLED = False

# Override the default request headers:
#DEFAULT_REQUEST_HEADERS = {
#  'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
#  'Accept-Language': 'en',
#}

# Enable or disable spider middlewares
# See https://doc.scrapy.org/en/latest/topics/spider-middleware.html
#SPIDER_MIDDLEWARES = {
#  'DoubanImgs.middlewares.DoubanimgsSpiderMiddleware': 543,
#}

# Enable or disable downloader middlewares
# See https://doc.scrapy.org/en/latest/topics/downloader-middleware.html
#DOWNLOADER_MIDDLEWARES = {
#  'DoubanImgs.middlewares.DoubanimgsDownloaderMiddleware': 543,
#}

# Enable or disable extensions
# See https://doc.scrapy.org/en/latest/topics/extensions.html
#EXTENSIONS = {
#  'scrapy.extensions.telnet.TelnetConsole': None,
#}

# Configure item pipelines
# See https://doc.scrapy.org/en/latest/topics/item-pipeline.html
#ITEM_PIPELINES = {
#  'DoubanImgs.pipelines.DoubanimgsPipeline': 300,
#}

# Enable and configure the AutoThrottle extension (disabled by default)
# See https://doc.scrapy.org/en/latest/topics/autothrottle.html
#AUTOTHROTTLE_ENABLED = True
# The initial download delay
#AUTOTHROTTLE_START_DELAY = 5
# The maximum download delay to be set in case of high latencies
#AUTOTHROTTLE_MAX_DELAY = 60
# The average number of requests Scrapy should be sending in parallel to
# each remote server
#AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
# Enable showing throttling stats for every response received:
#AUTOTHROTTLE_DEBUG = False

# Enable and configure HTTP caching (disabled by default)
# See https://doc.scrapy.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
#HTTPCACHE_ENABLED = True
#HTTPCACHE_EXPIRATION_SECS = 0
#HTTPCACHE_DIR = 'httpcache'
#HTTPCACHE_IGNORE_HTTP_CODES = []
#HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'
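
In this configuration, IMAGES_STORE = '.' makes the pipeline save files into a full/ subdirectory of the directory the crawl is run from, and IMAGES_EXPIRES = 90 avoids re-downloading files fetched within the last 90 days. Scrapy's ImagesPipeline also supports thumbnail generation and size filtering through standard settings; the values below are purely illustrative and are not part of the original project.

# Optional ImagesPipeline settings (illustrative values)
IMAGES_THUMBS = {
  'small': (50, 50),
  'big': (270, 270),
}
IMAGES_MIN_HEIGHT = 110  # skip images shorter than 110 px
IMAGES_MIN_WIDTH = 110   # skip images narrower than 110 px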

vim items.py

# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# https://doc.scrapy.org/en/latest/topics/items.html
import scrapy
from scrapy import Field


class DoubanImgsItem(scrapy.Item):
  # define the fields for your item here like:
  # name = scrapy.Field()
  image_urls = Field()
  images = Field()
  image_paths = Field()
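
The field names image_urls and images are not arbitrary: they are the defaults ImagesPipeline reads the URLs from and writes the download results to, while image_paths is a custom field filled in by the pipeline below. If different names were ever needed, they could be remapped in settings.py; the lines below only restate the defaults and are not required for this project.

# Only needed for non-default field names (these are the defaults)
IMAGES_URLS_FIELD = 'image_urls'
IMAGES_RESULT_FIELD = 'images'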

vim pipelines.py

# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: http://doc.scrapy.org/en/latest/topics/item-pipeline.html
from scrapy.pipelines.images import ImagesPipeline
from scrapy.exceptions import DropItem
from scrapy import Request


class DoubanImgsPipeline(object):
  def process_item(self, item, spider):
    return item


class DoubanImgDownloadPipeline(ImagesPipeline):
  default_headers = {
    'accept': 'image/webp,image/*,*/*;q=0.8',
    'accept-encoding': 'gzip, deflate, sdch, br',
    'accept-language': 'zh-CN,zh;q=0.8,en;q=0.6',
    'cookie': 'bid=yQdC/AzTaCw',
    'referer': 'https://www.douban.com/photos/photo/2370443040/',
    'user-agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/52.0.2743.116 Safari/537.36',
  }

  def get_media_requests(self, item, info):
    for image_url in item['image_urls']:
      self.default_headers['referer'] = image_url
      yield Request(image_url, headers=self.default_headers)

  def item_completed(self, results, item, info):
    image_paths = [x['path'] for ok, x in results if ok]
    if not image_paths:
      raise DropItem("Item contains no images")
    item['image_paths'] = image_paths
    return item
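
get_media_requests issues one download request per URL in image_urls, again with browser-like headers so that douban serves the files, and item_completed collects the relative storage paths of the successfully downloaded images, dropping any item for which nothing was downloaded. With the spider, settings, items, and pipeline in place, start the crawl from the project root; the album id below is simply the spider's default url argument, and another album can be passed with -a.

scrapy crawl download_douban -a url=1638835355

Because IMAGES_STORE points at the current directory, the downloaded files end up under ./full/.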

This concludes the article on example code for crawling image resources with Scrapy's ImagesPipeline. For more on crawling images with Scrapy's ImagesPipeline, please search 三水点靠木's earlier articles, and we hope you will continue to support 三水点靠木!
