Python 详解通过Scrapy框架实现爬取CSDN全站热榜标题热词流程


Posted in Python onNovember 11, 2021

前言

接着我的上一篇:Python 详解爬取并统计CSDN全站热榜标题关键词词频流程

我换成Scrapy架构也实现了一遍。获取页面源码底层原理是一样的,Scrapy架构更系统一些。下面我会把需要注意的问题,也说明一下。

提供一下GitHub仓库地址:github本项目地址

环境部署

scrapy安装

pip install scrapy -i https://pypi.douban.com/simple

selenium安装

pip install selenium -i https://pypi.douban.com/simple

jieba安装

pip install jieba -i https://pypi.douban.com/simple

IDE:PyCharm

google chrome driver下载对应版本:google chrome driver下载地址

检查浏览器版本,下载对应版本。

Python 详解通过Scrapy框架实现爬取CSDN全站热榜标题热词流程

实现过程

下面开始搞起。

创建项目

使用scrapy命令创建我们的项目。

scrapy startproject csdn_hot_words

项目结构,如同官方给出的结构。

Python 详解通过Scrapy框架实现爬取CSDN全站热榜标题热词流程

定义Item实体

按照之前的逻辑,主要属性为标题关键词对应出现次数的字典。代码如下:

# Define here the models for your scraped items
#
# See documentation in:
# https://docs.scrapy.org/en/latest/topics/items.html
 
import scrapy
 
 
class CsdnHotWordsItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    words = scrapy.Field()

关键词提取工具

使用jieba分词获取工具。

#!/usr/bin/env python
# -*- coding: utf-8 -*-
# @Time    : 2021/11/5 23:47
# @Author  : 至尊宝
# @Site    : 
# @File    : analyse_sentence.py
 
import jieba.analyse
 
 
def get_key_word(sentence):
    result_dic = {}
    words_lis = jieba.analyse.extract_tags(
        sentence, topK=3, withWeight=True, allowPOS=())
    for word, flag in words_lis:
        if word in result_dic:
            result_dic[word] += 1
        else:
            result_dic[word] = 1
    return result_dic

爬虫构造

这里需要给爬虫初始化一个浏览器参数,用来实现页面的动态加载。

#!/usr/bin/env python
# -*- coding: utf-8 -*-
# @Time    : 2021/11/5 23:47
# @Author  : 至尊宝
# @Site    : 
# @File    : csdn.py
 
import scrapy
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
 
from csdn_hot_words.items import CsdnHotWordsItem
from csdn_hot_words.tools.analyse_sentence import get_key_word
 
 
class CsdnSpider(scrapy.Spider):
    name = 'csdn'
    # allowed_domains = ['blog.csdn.net']
    start_urls = ['https://blog.csdn.net/rank/list']
 
    def __init__(self):
        chrome_options = Options()
        chrome_options.add_argument('--headless')  # 使用无头谷歌浏览器模式
        chrome_options.add_argument('--disable-gpu')
        chrome_options.add_argument('--no-sandbox')
        self.browser = webdriver.Chrome(chrome_options=chrome_options,
                                        executable_path="E:\\chromedriver_win32\\chromedriver.exe")
        self.browser.set_page_load_timeout(30)
 
    def parse(self, response, **kwargs):
        titles = response.xpath("//div[@class='hosetitem-title']/a/text()")
        for x in titles:
            item = CsdnHotWordsItem()
            item['words'] = get_key_word(x.get())
            yield item

代码说明

1、这里使用的是chrome的无头模式,就不需要有个浏览器打开再访问,都是后台执行的。

2、需要添加chromedriver的执行文件地址。

3、在parse的部分,可以参考之前我文章的xpath,获取到标题并且调用关键词提取,构造item对象。

中间件代码构造

添加js代码执行内容。中间件完整代码:

# Define here the models for your spider middleware
#
# See documentation in:
# https://docs.scrapy.org/en/latest/topics/spider-middleware.html
 
from scrapy import signals
from scrapy.http import HtmlResponse
from selenium.common.exceptions import TimeoutException
import time
 
from selenium.webdriver.chrome.options import Options
 
# useful for handling different item types with a single interface
from itemadapter import is_item, ItemAdapter
 
 
class CsdnHotWordsSpiderMiddleware:
    # Not all methods need to be defined. If a method is not defined,
    # scrapy acts as if the spider middleware does not modify the
    # passed objects.
 
    @classmethod
    def from_crawler(cls, crawler):
        # This method is used by Scrapy to create your spiders.
        s = cls()
        crawler.signals.connect(s.spider_opened, signal=signals.spider_opened)
        return s
 
    def process_spider_input(self, response, spider):
        # Called for each response that goes through the spider
        # middleware and into the spider.
 
        # Should return None or raise an exception.
        return None
 
    def process_spider_output(self, response, result, spider):
        # Called with the results returned from the Spider, after
        # it has processed the response.
 
        # Must return an iterable of Request, or item objects.
        for i in result:
            yield i
 
    def process_spider_exception(self, response, exception, spider):
        # Called when a spider or process_spider_input() method
        # (from other spider middleware) raises an exception.
 
        # Should return either None or an iterable of Request or item objects.
        pass
 
    def process_start_requests(self, start_requests, spider):
        # Called with the start requests of the spider, and works
        # similarly to the process_spider_output() method, except
        # that it doesn't have a response associated.
 
        # Must return only requests (not items).
        for r in start_requests:
            yield r
 
    def spider_opened(self, spider):
        spider.logger.info('Spider opened: %s' % spider.name)
 
 
class CsdnHotWordsDownloaderMiddleware:
    # Not all methods need to be defined. If a method is not defined,
    # scrapy acts as if the downloader middleware does not modify the
    # passed objects.
 
    @classmethod
    def from_crawler(cls, crawler):
        # This method is used by Scrapy to create your spiders.
        s = cls()
        crawler.signals.connect(s.spider_opened, signal=signals.spider_opened)
        return s
 
    def process_request(self, request, spider):
        js = '''
                        let height = 0
                let interval = setInterval(() => {
                    window.scrollTo({
                        top: height,
                        behavior: "smooth"
                    });
                    height += 500
                }, 500);
                setTimeout(() => {
                    clearInterval(interval)
                }, 20000);
            '''
        try:
            spider.browser.get(request.url)
            spider.browser.execute_script(js)
            time.sleep(20)
            return HtmlResponse(url=spider.browser.current_url, body=spider.browser.page_source,
                                encoding="utf-8", request=request)
        except TimeoutException as e:
            print('超时异常:{}'.format(e))
            spider.browser.execute_script('window.stop()')
        finally:
            spider.browser.close()
 
    def process_response(self, request, response, spider):
        # Called with the response returned from the downloader.
 
        # Must either;
        # - return a Response object
        # - return a Request object
        # - or raise IgnoreRequest
        return response
 
    def process_exception(self, request, exception, spider):
        # Called when a download handler or a process_request()
        # (from other downloader middleware) raises an exception.
 
        # Must either:
        # - return None: continue processing this exception
        # - return a Response object: stops process_exception() chain
        # - return a Request object: stops process_exception() chain
        pass
 
    def spider_opened(self, spider):
        spider.logger.info('Spider opened: %s' % spider.name)

制作自定义pipeline

定义按照词频统计最终结果输出到文件。代码如下:

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://docs.scrapy.org/en/latest/topics/item-pipeline.html
 
 
# useful for handling different item types with a single interface
from itemadapter import ItemAdapter
 
 
class CsdnHotWordsPipeline:
 
    def __init__(self):
        self.file = open('result.txt', 'w', encoding='utf-8')
        self.all_words = []
 
    def process_item(self, item, spider):
        self.all_words.append(item)
        return item
 
    def close_spider(self, spider):
        key_word_dic = {}
        for y in self.all_words:
            print(y)
            for k, v in y['words'].items():
                if k.lower() in key_word_dic:
                    key_word_dic[k.lower()] += v
                else:
                    key_word_dic[k.lower()] = v
        word_count_sort = sorted(key_word_dic.items(),
                                 key=lambda x: x[1], reverse=True)
        for word in word_count_sort:
            self.file.write('{},{}\n'.format(word[0], word[1]))
        self.file.close()

settings配置

配置上要做一些调整。如下调整:

# Scrapy settings for csdn_hot_words project
#
# For simplicity, this file contains only settings considered important or
# commonly used. You can find more settings consulting the documentation:
#
#     https://docs.scrapy.org/en/latest/topics/settings.html
#     https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
#     https://docs.scrapy.org/en/latest/topics/spider-middleware.html
 
BOT_NAME = 'csdn_hot_words'
 
SPIDER_MODULES = ['csdn_hot_words.spiders']
NEWSPIDER_MODULE = 'csdn_hot_words.spiders'
 
# Crawl responsibly by identifying yourself (and your website) on the user-agent
# USER_AGENT = 'csdn_hot_words (+http://www.yourdomain.com)'
USER_AGENT = 'Mozilla/5.0'
 
# Obey robots.txt rules
ROBOTSTXT_OBEY = False
 
# Configure maximum concurrent requests performed by Scrapy (default: 16)
# CONCURRENT_REQUESTS = 32
 
# Configure a delay for requests for the same website (default: 0)
# See https://docs.scrapy.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
DOWNLOAD_DELAY = 30
# The download delay setting will honor only one of:
# CONCURRENT_REQUESTS_PER_DOMAIN = 16
# CONCURRENT_REQUESTS_PER_IP = 16
 
# Disable cookies (enabled by default)
COOKIES_ENABLED = False
 
# Disable Telnet Console (enabled by default)
# TELNETCONSOLE_ENABLED = False
 
# Override the default request headers:
DEFAULT_REQUEST_HEADERS = {
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Accept-Language': 'en',
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/27.0.1453.94 Safari/537.36'
}
 
# Enable or disable spider middlewares
# See https://docs.scrapy.org/en/latest/topics/spider-middleware.html
SPIDER_MIDDLEWARES = {
   'csdn_hot_words.middlewares.CsdnHotWordsSpiderMiddleware': 543,
}
 
# Enable or disable downloader middlewares
# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
DOWNLOADER_MIDDLEWARES = {
   'csdn_hot_words.middlewares.CsdnHotWordsDownloaderMiddleware': 543,
}
 
# Enable or disable extensions
# See https://docs.scrapy.org/en/latest/topics/extensions.html
# EXTENSIONS = {
#    'scrapy.extensions.telnet.TelnetConsole': None,
# }
 
# Configure item pipelines
# See https://docs.scrapy.org/en/latest/topics/item-pipeline.html
ITEM_PIPELINES = {
    'csdn_hot_words.pipelines.CsdnHotWordsPipeline': 300,
}
 
# Enable and configure the AutoThrottle extension (disabled by default)
# See https://docs.scrapy.org/en/latest/topics/autothrottle.html
# AUTOTHROTTLE_ENABLED = True
# The initial download delay
# AUTOTHROTTLE_START_DELAY = 5
# The maximum download delay to be set in case of high latencies
# AUTOTHROTTLE_MAX_DELAY = 60
# The average number of requests Scrapy should be sending in parallel to
# each remote server
# AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
# Enable showing throttling stats for every response received:
# AUTOTHROTTLE_DEBUG = False
 
# Enable and configure HTTP caching (disabled by default)
# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
# HTTPCACHE_ENABLED = True
# HTTPCACHE_EXPIRATION_SECS = 0
# HTTPCACHE_DIR = 'httpcache'
# HTTPCACHE_IGNORE_HTTP_CODES = []
# HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'

执行主程序

可以通过scrapy的命令执行,但是为了看日志方便,加了一个主程序代码。

#!/usr/bin/env python
# -*- coding: utf-8 -*-
# @Time    : 2021/11/5 22:41
# @Author  : 至尊宝
# @Site    : 
# @File    : main.py
from scrapy import cmdline
 
cmdline.execute('scrapy crawl csdn'.split())

执行结果

执行部分日志

Python 详解通过Scrapy框架实现爬取CSDN全站热榜标题热词流程

得到result.txt结果。

Python 详解通过Scrapy框架实现爬取CSDN全站热榜标题热词流程

总结

看,java还是yyds。不知道为什么2021这个关键词也可以排名靠前。于是我觉着把我标题也加上2021。

GitHub项目地址在发一遍:github本项目地址

申明一下,本文案例仅研究探索使用,不是为了恶意攻击。

分享:

凡夫俗子不下苦功夫、死力气去努力做成一件事,根本就没资格去谈什么天赋不天赋。

——烽火戏诸侯《剑来》

如果本文对你有用的话,请不要吝啬你的赞,谢谢。

以上就是Python 详解通过Scrapy框架实现爬取CSDN全站热榜标题热词流程的详细内容,更多关于Python Scrapy框架的资料请关注三水点靠木其它相关文章!

Python 相关文章推荐
Python创建日历实例
Aug 21 Python
python实现图片变亮或者变暗的方法
Jun 01 Python
用python简单实现mysql数据同步到ElasticSearch的教程
May 30 Python
django从请求到响应的过程深入讲解
Aug 01 Python
Python 实现中值滤波、均值滤波的方法
Jan 09 Python
详解Python正则表达式re模块
Mar 19 Python
Python datetime和unix时间戳之间相互转换的讲解
Apr 01 Python
tensorflow 重置/清除计算图的实现
Jan 19 Python
django实现将后台model对象转换成json对象并传递给前端jquery
Mar 16 Python
Pandas缺失值2种处理方式代码实例
Jun 13 Python
Python批量修改xml的坐标值全部转为整数的实例代码
Nov 26 Python
Python爬虫中Selenium实现文件上传
Dec 04 Python
Python 多线程处理任务实例
Nov 07 #Python
python利用while求100内的整数和方式
Nov 07 #Python
python中if和elif的区别介绍
Nov 07 #Python
python中取整数的几种方法
Python 中的 copy()和deepcopy()
Nov 07 #Python
Python MNIST手写体识别详解与试练
Python基础 括号()[]{}的详解
Nov 07 #Python
You might like
DC动漫人物排行
2020/03/03 欧美动漫
PHP配置把错误日志以邮件方式发送方法(Windows系统)
2015/06/23 PHP
php通过淘宝API查询IP地址归属等信息
2015/12/25 PHP
Zend Framework基本页面布局分析
2016/03/19 PHP
javascript 操作Word和Excel的实现代码
2009/10/26 Javascript
ASP.NET jQuery 实例18 通过使用jQuery validation插件校验DropDownList
2012/02/03 Javascript
Node.js(安装,启动,测试)
2014/06/09 Javascript
一个不错的字符串转码解码函数(自写)
2014/07/31 Javascript
jQuery随机密码生成的方法
2015/03/09 Javascript
Angularjs 设置全局变量的方法总结
2016/10/20 Javascript
详解angular中如何监控dom渲染完毕
2017/01/03 Javascript
微信小程序 action-sheet 反馈上拉菜单简单实例
2017/05/11 Javascript
vue单页应用加百度统计代码(亲测有效)
2018/01/31 Javascript
基于vue 动态加载图片src的解决方法
2018/02/05 Javascript
react中使用swiper的具体方法
2018/05/15 Javascript
Vue-Quill-Editor富文本编辑器的使用教程
2018/09/21 Javascript
vue2.0移动端滑动事件vue-touch的实例代码
2018/11/27 Javascript
解决vue-cli@3.xx安装不成功的问题及搭建ts-vue项目
2020/02/09 Javascript
微信小程序中使用 async/await的方法实例分析
2020/05/06 Javascript
Python 文件重命名工具代码
2009/07/26 Python
使用go和python递归删除.ds store文件的方法
2014/01/22 Python
python中的五种异常处理机制介绍
2014/09/02 Python
浅析Python 中整型对象存储的位置
2016/05/16 Python
Python实现多进程共享数据的方法分析
2017/12/04 Python
浅谈python数据类型及类型转换
2017/12/18 Python
Python+selenium 获取浏览器窗口坐标、句柄的方法
2018/10/14 Python
Python面向对象之类和对象属性的增删改查操作示例
2018/12/14 Python
python使用suds调用webservice接口的方法
2019/01/03 Python
python使用numpy实现直方图反向投影示例
2020/01/17 Python
详解python常用命令行选项与环境变量
2020/02/20 Python
AmazeUI 图标的示例代码
2020/08/13 HTML / CSS
Expedia印度尼西亚站:预订酒店、廉价航班和度假套餐
2018/01/31 全球购物
研究生导师推荐信
2014/09/06 职场文书
2015学校年度工作总结
2015/05/11 职场文书
公司与个人合作协议书
2016/03/19 职场文书
MySQL如何快速创建800w条测试数据表
2022/03/17 MySQL