How to implement a multithreaded crawler with queues in Python

Posted in Python on May 12, 2020

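Note: this example scrapes jokes from Qiushibaike (糗事百科) using queues and multiple threads. The key points are Queue.task_done() and Queue.join(): together they let the main thread wait until the worker threads have processed every queued item, so the stages finish in an orderly way.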
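Before the full crawler, here is a minimal sketch of how the two calls cooperate, with no network access involved. The names worker and square_all are made up for this illustration and do not appear in the crawler itself.

```python
from queue import Queue
import threading


def worker(q, results):
    # Loops forever; the daemon flag lets the program exit once q.join() returns.
    while True:
        n = q.get()            # blocks until an item is available
        results.append(n * n)
        q.task_done()          # marks the item taken by get() as finished


def square_all(numbers, worker_count=3):
    q = Queue()
    results = []
    for _ in range(worker_count):
        threading.Thread(target=worker, args=(q, results), daemon=True).start()
    for n in numbers:
        q.put(n)               # each put() adds one unfinished task
    q.join()                   # returns once every put() is matched by a task_done()
    return sorted(results)


print(square_all([1, 2, 3, 4]))
```

If a worker skipped task_done(), q.join() would block forever, which is why the crawler calls it after every get().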

The full crawler code is as follows:

import requests
from lxml import etree
import json
from queue import Queue
import threading

class Qsbk(object):
  def __init__(self):
    self.headers = {
      "User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/81.0.4044.138 Safari/537.36",
      "Referer": "https://www.qiushibaike.com/"
    }
    # Create three queues: URLs to fetch, parsed HTML trees, and extracted items
    self.url_queue = Queue()
    self.html_queue = Queue()
    self.content_queue = Queue()

  def get_total_url(self):
    """
    Generate the URL of every page and put each one into url_queue.
    """
    url_temp = "https://www.qiushibaike.com/text/page/{}/"
    for i in range(1,13):
      # Put each generated URL into the url_queue
      self.url_queue.put(url_temp.format(i))

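  def parse_url(self):
    """
    Send the request, read the response, and parse the HTML with etree.
    """
    # Loop forever: get() blocks while the queue is empty, and the daemon
    # thread ends when the main thread exits.
    while True:
      # Take one URL from the queue
      url = self.url_queue.get()
      print("parsing url:",url)
      try:
        # Send the request
        response = requests.get(url,headers=self.headers,timeout=10)
        # Decode the response body into an HTML string
        html = response.content.decode()
        # Parse the string into an lxml element
        html = etree.HTML(html)
        # Put the element into the html_queue
        self.html_queue.put(html)
      except requests.RequestException as e:
        # A failed request must not keep url_queue.join() waiting forever
        print("request failed:",url,e)
      finally:
        # task_done() tells the queue that the item taken by get() has been processed
        self.url_queue.task_done()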

  def get_content(self):
    """
    Parse the page content and extract the fields we want.
    """
    # Loop forever; get() blocks until a parsed page is available
    while True:
      items = list()
      html = self.html_queue.get()
      total_div = html.xpath("//div[@class='col1 old-style-col1']/div")
      for i in total_div:

        author_img = i.xpath(".//a[@rel='nofollow']/img/@src")
        # Assumes the src attribute is protocol-relative ("//..."), so the scheme needs its colon
        author_img = "https:"+author_img[0] if len(author_img)>0 else None

        author_name = i.xpath(".//a[@rel='nofollow']/img/@alt")
        author_name = author_name[0] if len(author_name)>0 else None

        author_href = i.xpath("./a/@href")
        author_href = "https://www.qiushibaike.com/"+author_href[0] if len(author_href)>0 else None

        author_gender = i.xpath("./div[1]/div/@class")
        author_gender = author_gender[0].split(" ")[-1].replace("Icon","").strip() if len(author_gender)>0 else None

        author_age = i.xpath("./div[1]/div/text()")
        author_age = author_age[0] if len(author_age)>0 else None

        content = i.xpath("./a/div/span/text()")
        content = content[0].strip() if len(content)>0 else None

        content_vote = i.xpath("./div[@class='stats']/span[@class='stats-vote']/i/text()")
        content_vote = content_vote[0] if len(content_vote)>0 else None

        content_comment_numbers = i.xpath("./div[@class='stats']/span[@class='stats-comments']/a/i/text()")
        content_comment_numbers = content_comment_numbers[0] if len(content_comment_numbers)>0 else None

        item = {
          "author_name":author_name,
          "author_age" :author_age,
          "author_gender":author_gender,
          "author_img":author_img,
          "author_href":author_href,
          "content":content,
          "content_vote":content_vote,
          "content_comment_numbers":content_comment_numbers,
        }
        items.append(item)
      self.content_queue.put(items)
      # Each task_done() call decrements the queue's unfinished-task count by one
      self.html_queue.task_done()

  def save_items(self):
    """
    Save the items to a file.
    """
    # Loop forever; get() blocks until a batch of items is available
    while True:
      items = self.content_queue.get()
      with open("quishibaike.txt",'a',encoding='utf-8') as f:
        for i in items:
          json.dump(i,f,ensure_ascii=False,indent=2)
          # Separate consecutive JSON objects with a newline
          f.write("\n")
      self.content_queue.task_done()

  def run(self):
    thread_list = list()
    # Generate the page URLs
    thread_url = threading.Thread(target=self.get_total_url)
    thread_list.append(thread_url)

    # Send the network requests
    for i in range(10):
      thread_parse = threading.Thread(target=self.parse_url)
      thread_list.append(thread_parse)

    # Extract the data
    thread_get_content = threading.Thread(target=self.get_content)
    thread_list.append(thread_get_content)

    # Save the results
    thread_save = threading.Thread(target=self.save_items)
    thread_list.append(thread_save)


    for t in thread_list:
      # Mark each thread as a daemon thread so that it exits when the main thread exits
      t.daemon = True
      t.start()

    # Make the main thread wait until every item in each queue has been processed
    self.url_queue.join()
    self.html_queue.join()
    self.content_queue.join()


if __name__=="__main__":
  obj = Qsbk()
  obj.run()
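The worker loops in the listing never return on their own; the program ends only because the threads are daemon threads and the main thread exits after the three join() calls. As an alternative, the sketch below stops each stage explicitly with a sentinel value. It is an illustration under stated assumptions rather than part of the original crawler: STOP, stage, fake_fetch, fake_parse and run_pipeline are hypothetical names, and the two fake functions stand in for the real request and XPath steps so the example runs without network access.

```python
from queue import Queue
import threading

STOP = object()  # hypothetical sentinel that tells a stage to shut down


def stage(in_q, out_q, func):
    # Apply func to every item from in_q until the sentinel arrives.
    while True:
        item = in_q.get()
        if item is STOP:
            in_q.task_done()
            break
        out_q.put(func(item))
        in_q.task_done()


def fake_fetch(url):
    # Stand-in for requests.get(); returns a fake HTML string.
    return "<html>" + url + "</html>"


def fake_parse(html):
    # Stand-in for the XPath extraction; returns the page length.
    return len(html)


def run_pipeline(urls):
    url_q, html_q, out_q = Queue(), Queue(), Queue()
    fetcher = threading.Thread(target=stage, args=(url_q, html_q, fake_fetch))
    parser = threading.Thread(target=stage, args=(html_q, out_q, fake_parse))
    fetcher.start()
    parser.start()
    for u in urls:
        url_q.put(u)
    url_q.put(STOP)      # no more URLs: the fetcher exits after draining url_q
    fetcher.join()
    html_q.put(STOP)     # the fetcher is done, so the parser can stop too
    parser.join()
    results = []
    while not out_q.empty():
        results.append(out_q.get())
    return results


print(run_pipeline(["a", "bb"]))
```

With one thread per stage, results come out in input order. A stage with several worker threads would need one sentinel per worker.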

That is all for this article. I hope it helps with your learning, and thank you for supporting 三水点靠木.

- Author -

Norni
