Example code for crawling image resources with Scrapy's ImagesPipeline


Posted in Python on September 28, 2020

This is an example of using Scrapy's ImagesPipeline to crawl and download images. The downloaded images are saved in the crawler's full folder under IMAGES_STORE.

scrapy startproject DoubanImgs

cd DoubanImgs

scrapy genspider download_douban douban.com

vim spiders/download_douban.py

# coding=utf-8
from scrapy.spiders import Spider
import re
from scrapy import Request
from ..items import DoubanImgsItem


class download_douban(Spider):
  name = 'download_douban'

  default_headers = {
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',
    'Accept-Encoding': 'gzip, deflate, sdch, br',
    'Accept-Language': 'zh-CN,zh;q=0.8,en;q=0.6',
    'Cache-Control': 'max-age=0',
    'Connection': 'keep-alive',
    'Host': 'www.douban.com',
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/52.0.2743.116 Safari/537.36',
  }

  def __init__(self, url='1638835355', *args, **kwargs):
    # Call the parent class initializer first.
    super(download_douban, self).__init__(*args, **kwargs)
    self.allowed_domains = ['douban.com']
    self.start_urls = []
    # The album shows 18 photos per page, so build one page URL per offset.
    for i in range(23):
      if i == 0:
        page_url = 'http://www.douban.com/photos/album/' + url
      else:
        page_url = 'http://www.douban.com/photos/album/' + url + '/?start=' + str(i * 18)
      self.start_urls.append(page_url)
    self.url = url

  def start_requests(self):

    for url in self.start_urls:
      yield Request(url=url, headers=self.default_headers, callback=self.parse)

  def parse(self, response):
    list_imgs = response.xpath('//div[@class="photolst clearfix"]//img/@src').extract()
    if list_imgs:
      item = DoubanImgsItem()
      item['image_urls'] = list_imgs
      yield item
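
The album id is passed to the spider through the url argument of __init__, so it can be overridden from the command line with Scrapy's -a option. For example, to crawl the default album (assuming the project layout created above):

scrapy crawl download_douban -a url=1638835355

Once the settings and pipeline below are in place, the downloaded files appear under IMAGES_STORE in a full subdirectory.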

vim settings.py

# -*- coding: utf-8 -*-

# Scrapy settings for DoubanImgs project
#
# For simplicity, this file contains only settings considered important or
# commonly used. You can find more settings consulting the documentation:
#
#   https://doc.scrapy.org/en/latest/topics/settings.html
#   https://doc.scrapy.org/en/latest/topics/downloader-middleware.html
#   https://doc.scrapy.org/en/latest/topics/spider-middleware.html

BOT_NAME = 'DoubanImgs'

SPIDER_MODULES = ['DoubanImgs.spiders']
NEWSPIDER_MODULE = 'DoubanImgs.spiders'

ITEM_PIPELINES = {
  'DoubanImgs.pipelines.DoubanImgDownloadPipeline': 300,
}
IMAGES_STORE = '.'
IMAGES_EXPIRES = 90

# Crawl responsibly by identifying yourself (and your website) on the user-agent
#USER_AGENT = 'DoubanImgs (+http://www.yourdomain.com)'

# Obey robots.txt rules
ROBOTSTXT_OBEY = False

# Configure maximum concurrent requests performed by Scrapy (default: 16)
#CONCURRENT_REQUESTS = 32

# Configure a delay for requests for the same website (default: 0)
# See https://doc.scrapy.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
DOWNLOAD_DELAY = 0.5
# The download delay setting will honor only one of:
#CONCURRENT_REQUESTS_PER_DOMAIN = 16
#CONCURRENT_REQUESTS_PER_IP = 16

# Disable cookies (enabled by default)
#COOKIES_ENABLED = False

# Disable Telnet Console (enabled by default)
#TELNETCONSOLE_ENABLED = False

# Override the default request headers:
#DEFAULT_REQUEST_HEADERS = {
#  'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
#  'Accept-Language': 'en',
#}

# Enable or disable spider middlewares
# See https://doc.scrapy.org/en/latest/topics/spider-middleware.html
#SPIDER_MIDDLEWARES = {
#  'DoubanImgs.middlewares.DoubanimgsSpiderMiddleware': 543,
#}

# Enable or disable downloader middlewares
# See https://doc.scrapy.org/en/latest/topics/downloader-middleware.html
#DOWNLOADER_MIDDLEWARES = {
#  'DoubanImgs.middlewares.DoubanimgsDownloaderMiddleware': 543,
#}

# Enable or disable extensions
# See https://doc.scrapy.org/en/latest/topics/extensions.html
#EXTENSIONS = {
#  'scrapy.extensions.telnet.TelnetConsole': None,
#}

# Configure item pipelines
# See https://doc.scrapy.org/en/latest/topics/item-pipeline.html
#ITEM_PIPELINES = {
#  'DoubanImgs.pipelines.DoubanimgsPipeline': 300,
#}

# Enable and configure the AutoThrottle extension (disabled by default)
# See https://doc.scrapy.org/en/latest/topics/autothrottle.html
#AUTOTHROTTLE_ENABLED = True
# The initial download delay
#AUTOTHROTTLE_START_DELAY = 5
# The maximum download delay to be set in case of high latencies
#AUTOTHROTTLE_MAX_DELAY = 60
# The average number of requests Scrapy should be sending in parallel to
# each remote server
#AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
# Enable showing throttling stats for every response received:
#AUTOTHROTTLE_DEBUG = False

# Enable and configure HTTP caching (disabled by default)
# See https://doc.scrapy.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
#HTTPCACHE_ENABLED = True
#HTTPCACHE_EXPIRATION_SECS = 0
#HTTPCACHE_DIR = 'httpcache'
#HTTPCACHE_IGNORE_HTTP_CODES = []
#HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'
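
Besides IMAGES_STORE and IMAGES_EXPIRES, the ImagesPipeline supports optional settings for filtering out small images and generating thumbnails. A minimal sketch of what could be appended to settings.py; the values below are illustrative and not part of the original project:

# Optional ImagesPipeline settings (illustrative values).
# Skip images smaller than 110x110 pixels and generate two thumbnail sizes
# alongside each full-size image.
IMAGES_MIN_HEIGHT = 110
IMAGES_MIN_WIDTH = 110
IMAGES_THUMBS = {
  'small': (50, 50),
  'big': (270, 270),
}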

vim items.py

# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# https://doc.scrapy.org/en/latest/topics/items.html
import scrapy
from scrapy import Field


class DoubanImgsItem(scrapy.Item):
  # define the fields for your item here like:
  # name = scrapy.Field()
  image_urls = Field()
  images = Field()
  image_paths = Field()

vim pipelines.py

# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: http://doc.scrapy.org/en/latest/topics/item-pipeline.html
from scrapy.pipelines.images import ImagesPipeline
from scrapy.exceptions import DropItem
from scrapy import Request


class DoubanImgsPipeline(object):
  def process_item(self, item, spider):
    return item


class DoubanImgDownloadPipeline(ImagesPipeline):
  default_headers = {
    'accept': 'image/webp,image/*,*/*;q=0.8',
    'accept-encoding': 'gzip, deflate, sdch, br',
    'accept-language': 'zh-CN,zh;q=0.8,en;q=0.6',
    'cookie': 'bid=yQdC/AzTaCw',
    'referer': 'https://www.douban.com/photos/photo/2370443040/',
    'user-agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/52.0.2743.116 Safari/537.36',
  }

  def get_media_requests(self, item, info):
    for image_url in item['image_urls']:
      self.default_headers['referer'] = image_url
      yield Request(image_url, headers=self.default_headers)

  def item_completed(self, results, item, info):
    image_paths = [x['path'] for ok, x in results if ok]
    if not image_paths:
      raise DropItem("Item contains no images")
    item['image_paths'] = image_paths
    return item
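
By default the ImagesPipeline names each file with a SHA-1 hash of its URL and stores it under a full/ subdirectory of IMAGES_STORE. If a different layout is wanted, the file_path method can be overridden. The following is a minimal sketch; AlbumImagesPipeline is a hypothetical name, and the exact file_path signature varies slightly between Scrapy versions:

import hashlib

from scrapy.pipelines.images import ImagesPipeline
from scrapy.utils.python import to_bytes


class AlbumImagesPipeline(ImagesPipeline):
  # Hypothetical variant: group files by spider name instead of 'full/'.
  def file_path(self, request, response=None, info=None, *args, **kwargs):
    # Same SHA-1-of-URL naming scheme the stock ImagesPipeline uses.
    image_guid = hashlib.sha1(to_bytes(request.url)).hexdigest()
    return '%s/%s.jpg' % (info.spider.name, image_guid)

To activate a class like this, point ITEM_PIPELINES at it instead of DoubanImgDownloadPipeline.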

This concludes the article on example code for crawling image resources with Scrapy's ImagesPipeline. For more on downloading images with Scrapy, see the related articles on this site.
