使用scrapy ImagesPipeline爬取图片资源的示例代码


Posted in Python onSeptember 28, 2020

这是一个使用scrapy的ImagesPipeline爬取下载图片的示例,生成的图片保存在爬虫的full文件夹里。

scrapy startproject DoubanImgs

cd DoubanImgs

scrapy genspider download_douban  douban.com

vim spiders/download_douban.py

# coding=utf-8
from scrapy.spiders import Spider
import re
from scrapy import Request
from ..items import DoubanImgsItem


class download_douban(Spider):
  name = 'download_douban'

  default_headers = {
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',
    'Accept-Encoding': 'gzip, deflate, sdch, br',
    'Accept-Language': 'zh-CN,zh;q=0.8,en;q=0.6',
    'Cache-Control': 'max-age=0',
    'Connection': 'keep-alive',
    'Host': 'www.douban.com',
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/52.0.2743.116 Safari/537.36',
  }

  def __init__(self, url='1638835355', *args, **kwargs):
    self.allowed_domains = ['douban.com']
    self.start_urls = []
    for i in xrange(23):
      if i == 0:
        page_url = 'http://www.douban.com/photos/album/' + url
      else:
        page_url = 'http://www.douban.com/photos/album/' + url + '/?start=' + str(i*18)
      self.start_urls.append(page_url)
    self.url = url
    # call the father base function

    # super(download_douban, self).__init__(*args, **kwargs)

  def start_requests(self):

    for url in self.start_urls:
      yield Request(url=url, headers=self.default_headers, callback=self.parse)

  def parse(self, response):
    list_imgs = response.xpath('//div[@class="photolst clearfix"]//img/@src').extract()
    if list_imgs:
      item = DoubanImgsItem()
      item['image_urls'] = list_imgs
      yield item

vim settings.py

# -*- coding: utf-8 -*-

# Scrapy settings for DoubanImgs project
#
# For simplicity, this file contains only settings considered important or
# commonly used. You can find more settings consulting the documentation:
#
#   https://doc.scrapy.org/en/latest/topics/settings.html
#   https://doc.scrapy.org/en/latest/topics/downloader-middleware.html
#   https://doc.scrapy.org/en/latest/topics/spider-middleware.html

BOT_NAME = 'DoubanImgs'

SPIDER_MODULES = ['DoubanImgs.spiders']
NEWSPIDER_MODULE = 'DoubanImgs.spiders'

ITEM_PIPELINES = {
  'DoubanImgs.pipelines.DoubanImgDownloadPipeline': 300,
}
IMAGES_STORE = '.'
IMAGES_EXPIRES = 90

# Crawl responsibly by identifying yourself (and your website) on the user-agent
#USER_AGENT = 'DoubanImgs (+http://www.yourdomain.com)'

# Obey robots.txt rules
ROBOTSTXT_OBEY = False

# Configure maximum concurrent requests performed by Scrapy (default: 16)
#CONCURRENT_REQUESTS = 32

# Configure a delay for requests for the same website (default: 0)
# See https://doc.scrapy.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
DOWNLOAD_DELAY = 0.5
# The download delay setting will honor only one of:
#CONCURRENT_REQUESTS_PER_DOMAIN = 16
#CONCURRENT_REQUESTS_PER_IP = 16

# Disable cookies (enabled by default)
#COOKIES_ENABLED = False

# Disable Telnet Console (enabled by default)
#TELNETCONSOLE_ENABLED = False

# Override the default request headers:
#DEFAULT_REQUEST_HEADERS = {
#  'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
#  'Accept-Language': 'en',
#}

# Enable or disable spider middlewares
# See https://doc.scrapy.org/en/latest/topics/spider-middleware.html
#SPIDER_MIDDLEWARES = {
#  'DoubanImgs.middlewares.DoubanimgsSpiderMiddleware': 543,
#}

# Enable or disable downloader middlewares
# See https://doc.scrapy.org/en/latest/topics/downloader-middleware.html
#DOWNLOADER_MIDDLEWARES = {
#  'DoubanImgs.middlewares.DoubanimgsDownloaderMiddleware': 543,
#}

# Enable or disable extensions
# See https://doc.scrapy.org/en/latest/topics/extensions.html
#EXTENSIONS = {
#  'scrapy.extensions.telnet.TelnetConsole': None,
#}

# Configure item pipelines
# See https://doc.scrapy.org/en/latest/topics/item-pipeline.html
#ITEM_PIPELINES = {
#  'DoubanImgs.pipelines.DoubanimgsPipeline': 300,
#}

# Enable and configure the AutoThrottle extension (disabled by default)
# See https://doc.scrapy.org/en/latest/topics/autothrottle.html
#AUTOTHROTTLE_ENABLED = True
# The initial download delay
#AUTOTHROTTLE_START_DELAY = 5
# The maximum download delay to be set in case of high latencies
#AUTOTHROTTLE_MAX_DELAY = 60
# The average number of requests Scrapy should be sending in parallel to
# each remote server
#AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
# Enable showing throttling stats for every response received:
#AUTOTHROTTLE_DEBUG = False

# Enable and configure HTTP caching (disabled by default)
# See https://doc.scrapy.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
#HTTPCACHE_ENABLED = True
#HTTPCACHE_EXPIRATION_SECS = 0
#HTTPCACHE_DIR = 'httpcache'
#HTTPCACHE_IGNORE_HTTP_CODES = []
#HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'

vim items.py

# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# https://doc.scrapy.org/en/latest/topics/items.html
import scrapy
from scrapy import Field


class DoubanImgsItem(scrapy.Item):
  # define the fields for your item here like:
  # name = scrapy.Field()
  image_urls = Field()
  images = Field()
  image_paths = Field()

vim pipelines.py

# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: http://doc.scrapy.org/en/latest/topics/item-pipeline.html
from scrapy.pipelines.images import ImagesPipeline
from scrapy.exceptions import DropItem
from scrapy import Request
from scrapy import log


class DoubanImgsPipeline(object):
  def process_item(self, item, spider):
    return item


class DoubanImgDownloadPipeline(ImagesPipeline):
  default_headers = {
    'accept': 'image/webp,image/*,*/*;q=0.8',
    'accept-encoding': 'gzip, deflate, sdch, br',
    'accept-language': 'zh-CN,zh;q=0.8,en;q=0.6',
    'cookie': 'bid=yQdC/AzTaCw',
    'referer': 'https://www.douban.com/photos/photo/2370443040/',
    'user-agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/52.0.2743.116 Safari/537.36',
  }

  def get_media_requests(self, item, info):
    for image_url in item['image_urls']:
      self.default_headers['referer'] = image_url
      yield Request(image_url, headers=self.default_headers)

  def item_completed(self, results, item, info):
    image_paths = [x['path'] for ok, x in results if ok]
    if not image_paths:
      raise DropItem("Item contains no images")
    item['image_paths'] = image_paths
    return item

到此这篇关于使用scrapy ImagesPipeline爬取图片资源的示例代码的文章就介绍到这了,更多相关scrapy ImagesPipeline爬取图片内容请搜索三水点靠木以前的文章或继续浏览下面的相关文章希望大家以后多多支持三水点靠木!

Python 相关文章推荐
c++生成dll使用python调用dll的方法
Jan 20 Python
Python中的面向对象编程详解(下)
Apr 13 Python
使用Python的urllib2模块处理url和图片的技巧两则
Feb 18 Python
Python中的字符串操作和编码Unicode详解
Jan 18 Python
Python3随机漫步生成数据并绘制
Aug 27 Python
在cmder下安装ipython以及环境的搭建
Oct 19 Python
Python3自动签到 定时任务 判断节假日的实例
Nov 13 Python
windows下python虚拟环境virtualenv安装和使用详解
Jul 16 Python
Python实现使用dir获取类的方法列表
Dec 24 Python
Python版中国省市经纬度
Feb 11 Python
python实现扫雷小游戏
Apr 24 Python
Python基础之函数嵌套知识总结
May 23 Python
详解scrapy内置中间件的顺序
Sep 28 #Python
Python爬虫代理池搭建的方法步骤
Sep 28 #Python
浅析python 通⽤爬⾍和聚焦爬⾍
Sep 28 #Python
Scrapy 配置动态代理IP的实现
Sep 28 #Python
Scrapy中如何向Spider传入参数的方法实现
Sep 28 #Python
详解向scrapy中的spider传递参数的几种方法(2种)
Sep 28 #Python
小结Python的反射机制
Sep 28 #Python
You might like
咖啡知识 咖啡养豆要养多久 排气又是什么
2021/03/06 新手入门
用PHP编程语言开发动态WAP页面
2006/10/09 PHP
php 判断服务器操作系统的类型
2014/02/17 PHP
Zend Framework页面缓存实例
2014/06/25 PHP
学习php开源项目的源码指南
2014/12/21 PHP
php中动态调用函数的方法
2015/03/16 PHP
PHP浮点数精度问题汇总
2015/05/13 PHP
学习php设计模式 php实现合成模式(composite)
2015/12/08 PHP
PHP重置数组为连续数字索引的几种方式总结
2018/03/12 PHP
用js脚本控制asp.net下treeview的NodeCheck的实现代码
2010/03/02 Javascript
JavaScript实现向setTimeout执行代码传递参数的方法
2015/04/16 Javascript
JavaScript知识点总结之如何提高性能
2016/01/15 Javascript
js省市县三级联动效果实例
2020/04/15 Javascript
jQuery.Form上传文件操作
2017/02/05 Javascript
vue实现todolist功能、todolist组件拆分及todolist的删除功能
2019/04/11 Javascript
原生js+canvas实现下雪效果
2020/08/02 Javascript
Python中生成器和yield语句的用法详解
2015/04/17 Python
python如何为被装饰的函数保留元数据
2018/03/21 Python
python安装requests库的实例代码
2019/06/25 Python
详解python中__name__的意义以及作用
2019/08/07 Python
python标准库OS模块函数列表与实例全解
2020/03/10 Python
Keras - GPU ID 和显存占用设定步骤
2020/06/22 Python
Pycharm学生免费专业版安装教程的方法步骤
2020/09/24 Python
Python中的流程控制详解
2021/02/18 Python
css3过渡_动力节点Java学院整理
2017/07/11 HTML / CSS
HTML5单页面手势滑屏切换原理分析
2017/07/10 HTML / CSS
Sperry官网:帆船鞋创始品牌
2016/09/07 全球购物
上班玩手机检讨书
2014/02/17 职场文书
情人节寄语大全
2014/04/11 职场文书
教师工作失职检讨书
2014/09/18 职场文书
局领导领导班子四风对照检查材料
2014/09/27 职场文书
质检员工作总结2015
2015/04/25 职场文书
Java各种比较对象的方式的对比总结
2021/06/20 Java/Android
Django实现drf搜索过滤和排序过滤
2021/06/21 Python
python flappy bird小游戏分步实现流程
2022/02/15 Python
浅谈Redis缓冲区机制
2022/06/05 Redis