Scraping Douban Read with Scrapy and Selenium: The Complete Steps


Posted in Python on September 20, 2020

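First, create the Scrapy project

Command: scrapy startproject douban_read

Create the spider

Command: scrapy genspider douban_spider url

Target URL: https://read.douban.com/charts

The key points are explained in the code comments. If anything falls short, corrections are welcome.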

The Scrapy project directory structure is as follows

[Figure: project directory structure]

Code for douban_spider.py

The spider file

import json
import re

import scrapy

from ..items import DoubanReadItem


class DoubanSpiderSpider(scrapy.Spider):
    name = 'douban_spider'
    # allowed_domains = ['www']
    start_urls = ['https://read.douban.com/charts']

    def parse(self, response):
        # Collect the URLs of the book categories from the rankings navigation
        type_urls = response.xpath('//div[@class="rankings-nav"]/a[position()>1]/@href').extract()
        for type_url in type_urls:
            # Example href: /charts?type=unfinished_column&index=featured&dcs=charts&dcm=charts-nav
            match = re.search(r'charts\?(.*?)&dcs', type_url)
            if match is None:
                # Skip links that do not follow the expected pattern
                continue
            part_param = match.group(1)
            # Example result: https://read.douban.com/j/index//charts?type=intermediate_finalized&index=science_fiction&verbose=1
            ajax_url = 'https://read.douban.com/j/index//charts?{}&verbose=1'.format(part_param)
            yield scrapy.Request(ajax_url, callback=self.parse_ajax, encoding='utf-8', meta={'request_type': 'ajax'})

    def parse_ajax(self, response):
        # Parse the JSON data for the books in this category
        json_data = json.loads(response.text)
        for data in json_data['list']:
            item = DoubanReadItem()
            item['book_id'] = data['works']['id']
            item['book_url'] = data['works']['url']
            item['book_title'] = data['works']['title']
            item['book_author'] = data['works']['author']
            item['book_cover_image'] = data['works']['cover']
            item['book_abstract'] = data['works']['abstract']
            item['book_wordCount'] = data['works']['wordCount']
            item['book_kinds'] = data['works']['kinds']
            # Hand the item over to the item pipeline
            yield item
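
The step in parse() that turns a navigation link into the JSON endpoint can be tried on its own, without running the crawler. The following is a minimal sketch: the regex and the URL template are taken from the spider above, while the helper name build_ajax_url and the non-matching sample link '/ebooks' are only illustrative assumptions.

```python
import re

# URL template copied from the spider; the helper name below is hypothetical.
AJAX_TEMPLATE = 'https://read.douban.com/j/index//charts?{}&verbose=1'


def build_ajax_url(nav_href):
    """Return the chart JSON URL for a navigation href, or None if it does not match."""
    match = re.search(r'charts\?(.*?)&dcs', nav_href)
    if match is None:
        return None
    # group(1) is the query string between "charts?" and "&dcs"
    return AJAX_TEMPLATE.format(match.group(1))


# Sample href taken from the comment in the spider code
sample_href = '/charts?type=unfinished_column&index=featured&dcs=charts&dcm=charts-nav'
print(build_ajax_url(sample_href))
# An href without the expected pattern (made up for illustration) yields None
print(build_ajax_url('/ebooks'))
```

Returning None for a link that does not match avoids the AttributeError that calling .group(1) on a failed search would raise.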

Code for items.py

The file that defines what the project collects

# Define here the models for your scraped items
#
# See documentation in:
# https://docs.scrapy.org/en/latest/topics/items.html

import scrapy


class DoubanReadItem(scrapy.Item):
    # define the fields for your item here like:
    book_id = scrapy.Field()
    book_url = scrapy.Field()
    book_title = scrapy.Field()
    book_author = scrapy.Field()
    book_cover_image = scrapy.Field()
    book_abstract = scrapy.Field()
    book_wordCount = scrapy.Field()
    book_kinds = scrapy.Field()
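
To see how the item fields line up with the keys of the JSON response, here is a small standalone sketch. The field names and JSON keys are the ones used in this project; the helper works_to_fields and the sample entry are made up for illustration, and a plain dict stands in for DoubanReadItem so the snippet runs without Scrapy.

```python
# Item field name -> key inside the "works" object of the JSON response
FIELD_MAP = {
    'book_id': 'id',
    'book_url': 'url',
    'book_title': 'title',
    'book_author': 'author',
    'book_cover_image': 'cover',
    'book_abstract': 'abstract',
    'book_wordCount': 'wordCount',
    'book_kinds': 'kinds',
}


def works_to_fields(entry):
    """Flatten one element of json_data['list'] into item-style fields (hypothetical helper)."""
    works = entry.get('works', {})
    # .get() gives None for a missing key instead of raising KeyError
    return {field: works.get(key) for field, key in FIELD_MAP.items()}


# Made-up sample entry; a real response carries more keys
sample_entry = {'works': {'id': 1, 'title': 'Example Book', 'wordCount': 12345}}
fields = works_to_fields(sample_entry)
print(fields['book_title'], fields['book_author'])
```

Whether missing keys should become None or cause the entry to be skipped is a design choice; the spider above indexes the keys directly and would raise KeyError instead.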

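Code for my_download_middle.py

Every request passes through the downloader middleware. A custom middleware can therefore set proxies, set request headers dynamically, take over the download itself, and so on.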

import random
import time

from scrapy.http.response.html import HtmlResponse
from selenium import webdriver


class MymiddleWares(object):
    def __init__(self):
        # Pool of User-Agent strings to choose from
        self.USER_AGENT_LIST = [
            "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/22.0.1207.1 Safari/537.1",
            "Mozilla/5.0 (X11; CrOS i686 2268.111.0) AppleWebKit/536.11 (KHTML, like Gecko) Chrome/20.0.1132.57 Safari/536.11",
            "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.6 (KHTML, like Gecko) Chrome/20.0.1092.0 Safari/536.6",
            "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.6 (KHTML, like Gecko) Chrome/20.0.1090.0 Safari/536.6",
            "Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/19.77.34.5 Safari/537.1",
            "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/536.5 (KHTML, like Gecko) Chrome/19.0.1084.9 Safari/536.5",
            "Mozilla/5.0 (Windows NT 6.0) AppleWebKit/536.5 (KHTML, like Gecko) Chrome/19.0.1084.36 Safari/536.5",
            "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3",
            "Mozilla/5.0 (Windows NT 5.1) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3",
            "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; Trident/4.0; SE 2.X MetaSr 1.0; SE 2.X MetaSr 1.0; .NET CLR 2.0.50727; SE 2.X MetaSr 1.0)",
            "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1062.0 Safari/536.3",
            "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1062.0 Safari/536.3",
            "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; 360SE)",
            "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3",
            "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3",
            "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.0 Safari/536.3",
            "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/535.24 (KHTML, like Gecko) Chrome/19.0.1055.1 Safari/535.24",
            "Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/535.24 (KHTML, like Gecko) Chrome/19.0.1055.1 Safari/535.24"
        ]

    def process_request(self, request, spider):
        '''
        Downloader-middleware hook that handles each request.
        :param request: the request about to be fetched by the downloader
        :param spider: the spider that issued the request
        :return: an HtmlResponse for pages rendered by Selenium, otherwise None
        '''
        # The spider sets meta['request_type'] = 'ajax'; meta travels with the request through Scrapy
        request_type = request.meta.get('request_type')
        if not request_type:
            # Not an AJAX request: download the page with Selenium instead
            spider.logger.debug('Rendering %s with Selenium', request.url)
            # 1. Create the driver
            driver = webdriver.Chrome()
            try:
                # 2. Load the URL
                driver.get(request.url)
                # 3. Wait for the page to render
                # driver.implicitly_wait(20)
                time.sleep(3)
                # 4. Grab the page content
                html_str = driver.page_source
            finally:
                # Close the browser so Chrome processes do not pile up
                driver.quit()
            # Returning an HtmlResponse hands it straight to the spider;
            # Scrapy's own downloader never fetches this request
            return HtmlResponse(url=request.url, body=html_str, request=request, encoding='utf-8')
        else:
            # AJAX requests return JSON directly, so Selenium is not a good fit;
            # let Scrapy's downloader handle them
            ua = random.choice(self.USER_AGENT_LIST)
            # Assign directly: DEFAULT_REQUEST_HEADERS has already set a User-Agent,
            # so setdefault() would leave that value unchanged
            request.headers['User-Agent'] = ua
            request.headers.setdefault('X-Requested-With', 'XMLHttpRequest')
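
One detail about setting the User-Agent is worth checking in isolation. Since settings.py already supplies a User-Agent through DEFAULT_REQUEST_HEADERS, a setdefault() call on that header has nothing left to do, whereas plain assignment replaces the value. The sketch below illustrates the difference under the assumption that a plain dict behaves like request.headers in this respect; the agent strings are placeholders, not real browser strings.

```python
import random

# Shortened stand-in for USER_AGENT_LIST (placeholder values)
USER_AGENTS = ['agent-a', 'agent-b', 'agent-c']

# A plain dict stands in for request.headers after the default headers were applied
headers = {'User-Agent': 'default-agent'}

chosen = random.choice(USER_AGENTS)

# setdefault() only writes when the key is absent, so the default survives
headers.setdefault('User-Agent', chosen)
after_setdefault = headers['User-Agent']

# Plain assignment replaces whatever was there
headers['User-Agent'] = chosen
after_assignment = headers['User-Agent']

print(after_setdefault)
print(after_assignment in USER_AGENTS)
```

For a header that nothing else sets, such as X-Requested-With in this project, setdefault() and assignment give the same result.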

Code for pipelines.py

The project's pipeline file

import pymongo
from itemadapter import ItemAdapter


class MongoPipeline:
    # Name of the collection to store into
    collection_name = 'book'

    def __init__(self, mongo_uri, mongo_db):
        self.mongo_uri = mongo_uri
        self.mongo_db = mongo_db

    @classmethod
    def from_crawler(cls, crawler):
        return cls(
            mongo_uri=crawler.settings.get('MONGO_URI'),
            mongo_db=crawler.settings.get('MONGO_DATABASE', 'items')
        )

    def open_spider(self, spider):
        '''
        Called when the spider starts.
        :param spider:
        :return:
        '''
        self.client = pymongo.MongoClient(self.mongo_uri)
        self.db = self.client[self.mongo_db]

    def close_spider(self, spider):
        self.client.close()

    # Save into the book collection of the douban_read database in MongoDB
    def process_item(self, item, spider):
        # upsert=True: update the document if this book_id already exists, insert it otherwise.
        # update_one() replaces the deprecated Collection.update() used previously.
        self.db[self.collection_name].update_one(
            {'book_id': item['book_id']},
            {'$set': ItemAdapter(item).asdict()},
            upsert=True,
        )
        print(item)
        return item
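
The "update if present, insert if absent" behaviour of process_item can be illustrated without a running MongoDB. This is only a rough in-memory model, not PyMongo code: the upsert helper and the sample books are hypothetical, and a dict keyed by book_id plays the role of the collection.

```python
def upsert(collection, key_field, document):
    """Roughly mimic an upsert with {'$set': document} on an in-memory dict (hypothetical helper)."""
    key = document[key_field]
    existing = collection.get(key)
    if existing is None:
        # No document with this key yet: insert a copy
        collection[key] = dict(document)
        return 'inserted'
    # A document with this key exists: overwrite only the given fields
    existing.update(document)
    return 'updated'


books = {}
first = upsert(books, 'book_id', {'book_id': 1, 'book_title': 'Old title'})
second = upsert(books, 'book_id', {'book_id': 1, 'book_title': 'New title'})
print(first, second, len(books), books[1]['book_title'])
```

Keying on book_id means that re-running the crawl refreshes existing records instead of piling up duplicates.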

Code for settings.py

Configuration

# Scrapy settings for douban_read project
#
# For simplicity, this file contains only settings considered important or
# commonly used. You can find more settings consulting the documentation:
#
# https://docs.scrapy.org/en/latest/topics/settings.html
# https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
# https://docs.scrapy.org/en/latest/topics/spider-middleware.html

BOT_NAME = 'douban_read'

SPIDER_MODULES = ['douban_read.spiders']
NEWSPIDER_MODULE = 'douban_read.spiders'


# Crawl responsibly by identifying yourself (and your website) on the user-agent
#USER_AGENT = 'douban_read (+http://www.yourdomain.com)'

# Obey robots.txt rules
# robots.txt protocol (not obeyed in this project)
ROBOTSTXT_OBEY = False

# Configure maximum concurrent requests performed by Scrapy (default: 16)
#CONCURRENT_REQUESTS = 32

# Configure a delay for requests for the same website (default: 0)
# See https://docs.scrapy.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
#DOWNLOAD_DELAY = 3
# The download delay setting will honor only one of:
#CONCURRENT_REQUESTS_PER_DOMAIN = 16
#CONCURRENT_REQUESTS_PER_IP = 16

# Disable cookies (enabled by default)
#COOKIES_ENABLED = False

# Disable Telnet Console (enabled by default)
#TELNETCONSOLE_ENABLED = False

# Override the default request headers:
# Default request headers
DEFAULT_REQUEST_HEADERS = {
 'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
 'Accept-Language': 'en',
 'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/85.0.4183.102 Safari/537.36',

}

# Enable or disable spider middlewares
# See https://docs.scrapy.org/en/latest/topics/spider-middleware.html
#SPIDER_MIDDLEWARES = {
# 'douban_read.middlewares.DoubanReadSpiderMiddleware': 543,
#}

# Enable or disable downloader middlewares
# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
# Register the custom downloader middleware
DOWNLOADER_MIDDLEWARES = {
 'douban_read.my_download_middle.MymiddleWares': 543,
}

# Enable or disable extensions
# See https://docs.scrapy.org/en/latest/topics/extensions.html
#EXTENSIONS = {
# 'scrapy.extensions.telnet.TelnetConsole': None,
#}

# Configure item pipelines
# See https://docs.scrapy.org/en/latest/topics/item-pipeline.html
# Register the MongoDB pipeline
ITEM_PIPELINES = {
 'douban_read.pipelines.MongoPipeline': 300,
}

# Enable and configure the AutoThrottle extension (disabled by default)
# See https://docs.scrapy.org/en/latest/topics/autothrottle.html
#AUTOTHROTTLE_ENABLED = True
# The initial download delay
#AUTOTHROTTLE_START_DELAY = 5
# The maximum download delay to be set in case of high latencies
#AUTOTHROTTLE_MAX_DELAY = 60
# The average number of requests Scrapy should be sending in parallel to
# each remote server
#AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
# Enable showing throttling stats for every response received:
#AUTOTHROTTLE_DEBUG = False

# Enable and configure HTTP caching (disabled by default)
# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
#HTTPCACHE_ENABLED = True
#HTTPCACHE_EXPIRATION_SECS = 0
#HTTPCACHE_DIR = 'httpcache'
#HTTPCACHE_IGNORE_HTTP_CODES = []
#HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'

# MongoDB configuration
MONGO_URI = 'localhost'
# Database name: douban_read
MONGO_DATABASE = 'douban_read'
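
The number 543 given to the custom middleware in DOWNLOADER_MIDDLEWARES decides where it sits among Scrapy's built-in downloader middlewares: process_request() is called in ascending order of priority. The sketch below sorts the custom entry together with three built-ins. The built-in priorities are quoted from Scrapy's DOWNLOADER_MIDDLEWARES_BASE as I understand it, so treat them as an assumption and check them against your installed version.

```python
# A few built-in priorities (assumed from DOWNLOADER_MIDDLEWARES_BASE)
# merged with the custom entry from this project's settings.py
middlewares = {
    'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware': 400,
    'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware': 500,
    'douban_read.my_download_middle.MymiddleWares': 543,
    'scrapy.downloadermiddlewares.retry.RetryMiddleware': 550,
}

# process_request() runs from the lowest number to the highest
request_order = sorted(middlewares, key=middlewares.get)
for path in request_order:
    print(middlewares[path], path.rsplit('.', 1)[-1])
```

With these values the default headers and the User-Agent are applied before MymiddleWares sees the request.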

Finally, start the project

scrapy crawl douban_spider

The data is then saved to the MongoDB database.
