Example code: downloading image resources with Scrapy's ImagesPipeline


Posted in Python on September 28, 2020

This is an example of using Scrapy's ImagesPipeline to crawl and download images. The downloaded images are saved in the crawler's full folder.
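
As a rough illustration of where the full folder comes from: as far as I know, the default ImagesPipeline names each file after the SHA-1 hash of its URL and stores it under full/ inside IMAGES_STORE. The helper name default_image_path and the sample URL below are made up for illustration. This is a sketch of the naming scheme, not Scrapy's own code.

```python
import hashlib


def default_image_path(url):
    """Approximate the relative path ImagesPipeline uses by default (assumption)."""
    # SHA-1 of the URL bytes gives a stable, filesystem-safe file name.
    image_guid = hashlib.sha1(url.encode('utf-8')).hexdigest()
    return 'full/%s.jpg' % image_guid


# Hypothetical image URL, for illustration only.
print(default_image_path('https://example.com/photo/p123.jpg'))
```

Because the name depends only on the URL, the same image always maps to the same file.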

scrapy startproject DoubanImgs

cd DoubanImgs

scrapy genspider download_douban douban.com

vim spiders/download_douban.py

# coding=utf-8
from scrapy import Request
from scrapy.spiders import Spider

from ..items import DoubanImgsItem


class download_douban(Spider):
  name = 'download_douban'

  # Browser-like headers sent with every album page request
  default_headers = {
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',
    # 'sdch' and 'br' are omitted: Scrapy can only decode brotli
    # responses when the brotli package is installed
    'Accept-Encoding': 'gzip, deflate',
    'Accept-Language': 'zh-CN,zh;q=0.8,en;q=0.6',
    'Cache-Control': 'max-age=0',
    'Connection': 'keep-alive',
    'Host': 'www.douban.com',
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/52.0.2743.116 Safari/537.36',
  }

  def __init__(self, url='1638835355', *args, **kwargs):
    # Call the base class initializer so Scrapy can finish setting up the spider
    super().__init__(*args, **kwargs)
    self.allowed_domains = ['douban.com']
    self.start_urls = []
    # `url` is the album ID; request 23 pages, stepping 18 photos per page
    for i in range(23):
      if i == 0:
        page_url = 'http://www.douban.com/photos/album/' + url
      else:
        page_url = 'http://www.douban.com/photos/album/' + url + '/?start=' + str(i*18)
      self.start_urls.append(page_url)
    self.url = url

  def start_requests(self):
    for url in self.start_urls:
      yield Request(url=url, headers=self.default_headers, callback=self.parse)

  def parse(self, response):
    # Collect the src attribute of every image in the album's photo list
    list_imgs = response.xpath('//div[@class="photolst clearfix"]//img/@src').extract()
    if list_imgs:
      item = DoubanImgsItem()
      item['image_urls'] = list_imgs
      yield item
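
The start URLs built in __init__ can be checked without running a crawl. The sketch below pulls that loop into a standalone function; the name album_page_urls and its pages and page_size parameters are my own, chosen to mirror the hard-coded 23 and 18 in the spider.

```python
def album_page_urls(album_id, pages=23, page_size=18):
    """Build the album page URLs the spider starts from."""
    base = 'http://www.douban.com/photos/album/' + album_id
    urls = [base]  # the first page has no query string
    for i in range(1, pages):
        # later pages are addressed by the offset of their first photo
        urls.append(base + '/?start=' + str(i * page_size))
    return urls


urls = album_page_urls('1638835355')
print(len(urls))
print(urls[0])
print(urls[1])
```

Since the album ID is an __init__ argument named url, it should also be possible to override it at launch with Scrapy's -a option, for example: scrapy crawl download_douban -a url=1638835355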

vim settings.py

# -*- coding: utf-8 -*-

# Scrapy settings for DoubanImgs project
#
# For simplicity, this file contains only settings considered important or
# commonly used. You can find more settings consulting the documentation:
#
#   https://doc.scrapy.org/en/latest/topics/settings.html
#   https://doc.scrapy.org/en/latest/topics/downloader-middleware.html
#   https://doc.scrapy.org/en/latest/topics/spider-middleware.html

BOT_NAME = 'DoubanImgs'

SPIDER_MODULES = ['DoubanImgs.spiders']
NEWSPIDER_MODULE = 'DoubanImgs.spiders'

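# Enable the custom image download pipeline defined in pipelines.py
ITEM_PIPELINES = {
  'DoubanImgs.pipelines.DoubanImgDownloadPipeline': 300,
}
# Root directory for downloaded images; files are written to ./full/
IMAGES_STORE = '.'
# Do not re-download images that were fetched within the last 90 days
IMAGES_EXPIRES = 90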

# Crawl responsibly by identifying yourself (and your website) on the user-agent
#USER_AGENT = 'DoubanImgs (+http://www.yourdomain.com)'

# Obey robots.txt rules
ROBOTSTXT_OBEY = False

# Configure maximum concurrent requests performed by Scrapy (default: 16)
#CONCURRENT_REQUESTS = 32

# Configure a delay for requests for the same website (default: 0)
# See https://doc.scrapy.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
DOWNLOAD_DELAY = 0.5
# The download delay setting will honor only one of:
#CONCURRENT_REQUESTS_PER_DOMAIN = 16
#CONCURRENT_REQUESTS_PER_IP = 16

# Disable cookies (enabled by default)
#COOKIES_ENABLED = False

# Disable Telnet Console (enabled by default)
#TELNETCONSOLE_ENABLED = False

# Override the default request headers:
#DEFAULT_REQUEST_HEADERS = {
#  'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
#  'Accept-Language': 'en',
#}

# Enable or disable spider middlewares
# See https://doc.scrapy.org/en/latest/topics/spider-middleware.html
#SPIDER_MIDDLEWARES = {
#  'DoubanImgs.middlewares.DoubanimgsSpiderMiddleware': 543,
#}

# Enable or disable downloader middlewares
# See https://doc.scrapy.org/en/latest/topics/downloader-middleware.html
#DOWNLOADER_MIDDLEWARES = {
#  'DoubanImgs.middlewares.DoubanimgsDownloaderMiddleware': 543,
#}

# Enable or disable extensions
# See https://doc.scrapy.org/en/latest/topics/extensions.html
#EXTENSIONS = {
#  'scrapy.extensions.telnet.TelnetConsole': None,
#}

# Configure item pipelines
# See https://doc.scrapy.org/en/latest/topics/item-pipeline.html
#ITEM_PIPELINES = {
#  'DoubanImgs.pipelines.DoubanimgsPipeline': 300,
#}

# Enable and configure the AutoThrottle extension (disabled by default)
# See https://doc.scrapy.org/en/latest/topics/autothrottle.html
#AUTOTHROTTLE_ENABLED = True
# The initial download delay
#AUTOTHROTTLE_START_DELAY = 5
# The maximum download delay to be set in case of high latencies
#AUTOTHROTTLE_MAX_DELAY = 60
# The average number of requests Scrapy should be sending in parallel to
# each remote server
#AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
# Enable showing throttling stats for every response received:
#AUTOTHROTTLE_DEBUG = False

# Enable and configure HTTP caching (disabled by default)
# See https://doc.scrapy.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
#HTTPCACHE_ENABLED = True
#HTTPCACHE_EXPIRATION_SECS = 0
#HTTPCACHE_DIR = 'httpcache'
#HTTPCACHE_IGNORE_HTTP_CODES = []
#HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'
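
The IMAGES_EXPIRES = 90 setting above is measured in days. A minimal sketch of the check it implies, assuming the pipeline compares the stored file's last-modified time against the current time (the function name is_expired and its parameters are hypothetical, not part of Scrapy's API):

```python
import time


def is_expired(last_modified, expires_days=90, now=None):
    """Return True when a stored image is old enough to be downloaded again."""
    now = time.time() if now is None else now
    # convert the age from seconds to days before comparing
    age_days = (now - last_modified) / 60 / 60 / 24
    return age_days > expires_days


one_day = 24 * 60 * 60
print(is_expired(0, now=89 * one_day))
print(is_expired(0, now=91 * one_day))
```

Under this reading, re-running the crawl within 90 days skips images that are already on disk.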

vim items.py

# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# https://doc.scrapy.org/en/latest/topics/items.html
import scrapy
from scrapy import Field


class DoubanImgsItem(scrapy.Item):
  # define the fields for your item here like:
  # name = scrapy.Field()
  image_urls = Field()
  images = Field()
  image_paths = Field()

vim pipelines.py

# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: http://doc.scrapy.org/en/latest/topics/item-pipeline.html
from scrapy import Request
from scrapy.exceptions import DropItem
from scrapy.pipelines.images import ImagesPipeline  # requires Pillow


class DoubanImgsPipeline(object):
  def process_item(self, item, spider):
    return item


class DoubanImgDownloadPipeline(ImagesPipeline):
  # Browser-like headers sent with every image request
  default_headers = {
    'accept': 'image/webp,image/*,*/*;q=0.8',
    'accept-encoding': 'gzip, deflate',
    'accept-language': 'zh-CN,zh;q=0.8,en;q=0.6',
    'cookie': 'bid=yQdC/AzTaCw',
    'referer': 'https://www.douban.com/photos/photo/2370443040/',
    'user-agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/52.0.2743.116 Safari/537.36',
  }

  def get_media_requests(self, item, info):
    for image_url in item['image_urls']:
      # Copy the shared headers so each request carries its own referer
      headers = dict(self.default_headers, referer=image_url)
      yield Request(image_url, headers=headers)

  def item_completed(self, results, item, info):
    # results is a list of (success, info) tuples; keep the paths of the successful downloads
    image_paths = [x['path'] for ok, x in results if ok]
    if not image_paths:
      raise DropItem("Item contains no images")
    item['image_paths'] = image_paths
    return item
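
To make the list comprehension in item_completed easier to follow, here is a small simulation of the results argument. It is a list of (success, info) tuples: on success, info is a dict that includes the image's url and path; on failure, Scrapy passes a failure object instead. The URL, path and checksum values below are invented, and a plain Exception stands in for the failure object.

```python
# Simulated `results` as received by item_completed(); all values are made up.
results = [
    (True, {'url': 'https://example.com/a.jpg',
            'path': 'full/aaa.jpg',
            'checksum': 'dummy'}),
    (False, Exception('download failed')),  # stand-in for the failure object
]

# Keep only the storage paths of the downloads that succeeded.
image_paths = [info['path'] for ok, info in results if ok]
print(image_paths)
```

If every download fails, image_paths is empty and the pipeline drops the item.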
