
Downloading WeChat Official Account Articles with Python

Posted in Python on February 26, 2019

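This article shares working code for downloading the articles of a WeChat official account with Python. The details are as follows.

Goal: download the "pytest" series of documents from the official account 从零开始学自动化测试.

1. Search for WeChat articles by keyword through Sogou's WeChat search (weixin.sogou.com)

2. Parse the first N pages of search results to get each article's title and URL

The script mainly uses requests and BeautifulSoup from bs4.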
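The first step, building one search URL per result page, can be sketched with the standard library alone. This is a minimal sketch, not the author's code: build_search_urls is a hypothetical helper name, and it assumes that the query, type, page and ie parameters are enough and that the empty _sug_ parameters seen in the script's URL can be left out, which has not been verified against the site.

```python
from urllib.parse import urlencode

BASE_URL = "https://weixin.sogou.com/weixin"


def build_search_urls(query, pages):
    """Return the search-result URLs for pages 1..pages (hypothetical helper)."""
    urls = []
    for page in range(1, pages + 1):
        # urlencode percent-encodes the non-ASCII query as UTF-8
        params = {"query": query, "type": 2, "page": page, "ie": "utf8"}
        urls.append(BASE_URL + "?" + urlencode(params))
    return urls


urls = build_search_urls("pytest文档", 2)
for url in urls:
    print(url)
```

Using urlencode keeps the escaping in one place, so a query containing spaces or "&" would not break the URL.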

Weixin.py

import requests
from urllib.parse import quote
from bs4 import BeautifulSoup
import re
from WeixinSpider.HTML2doc import MyHTMLParser
 
class WeixinSpider(object):
 
 def __init__(self, gzh_name, pageno,keyword):
  self.GZH_Name = gzh_name
  self.pageno = pageno
  self.keyword = keyword.lower()
  self.page_url = []
  self.article_list = []
  self.headers = {
   'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.110 Safari/537.36'}
  self.timeout = 5
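  # [...] denotes a character set: [amk] matches 'a', 'm' or 'k'
  # re+ matches one or more occurrences of the preceding expression
  # This pattern matches runs of characters that are not allowed in Windows file names
  self.pattern = r'[\\/:*?"<>|\r\n]+'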
 
 def get_page_url(self):
  for i in range(1,self.pageno+1):
   # https://weixin.sogou.com/weixin?query=从零开始学自动化测试&_sug_type_=&s_from=input&_sug_=n&type=2&page=2&ie=utf8
   url = "https://weixin.sogou.com/weixin?query=%s&_sug_type_=&s_from=input&_sug_=n&type=2&page=%s&ie=utf8" \
     % (quote(self.GZH_Name),i)
   self.page_url.append(url)
 
 def get_article_url(self):
  for url in self.page_url:
   response = requests.get(url,headers=self.headers,timeout=self.timeout)
   result = BeautifulSoup(response.text, 'html.parser')
   articles = result.select('ul[class="news-list"] > li > div[class="txt-box"] > h3 > a ')
   for a in articles:
    # print(a.text)
    # print(a["href"])
    if self.keyword in a.text.lower():
     # Strip characters that cannot appear in a file name
     new_text=re.sub(self.pattern,"",a.text)
     # Append one dict per article; appending a single shared dict
     # would make every list entry contain all articles
     self.article_list.append({new_text: a["href"]})
 
 
 
headers = {'User-Agent':
      'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.110 Safari/537.36'}
timeout = 5
gzh_name = 'pytest文档'
My_GZH = WeixinSpider(gzh_name,5,'pytest')
My_GZH.get_page_url()
# print(My_GZH.page_url)
My_GZH.get_article_url()
# print(My_GZH.article_list)
for article in My_GZH.article_list:
 for (key,value) in article.items():
  url=value
  html_response = requests.get(url,headers=headers,timeout=timeout)
  myHTMLParser = MyHTMLParser(key)
  myHTMLParser.feed(html_response.text)
  myHTMLParser.doc.save(myHTMLParser.docfile)
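The article title is later used as a file name, which is why the spider strips certain characters from it. The snippet below isolates that step. It is a small sketch that reuses the same character class as self.pattern; safe_filename is a hypothetical name that does not appear in the original script.

```python
import re

# Same character class as self.pattern in WeixinSpider
INVALID_CHARS = r'[\\/:*?"<>|\r\n]+'


def safe_filename(title):
    """Remove characters that Windows does not allow in file names."""
    return re.sub(INVALID_CHARS, "", title)


print(safe_filename('pytest: "fixture" a/b?\n'))  # prints: pytest fixture ab
```

Note that the characters are deleted rather than replaced, so two titles that differ only in punctuation would map to the same file name.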

HTML2doc.py

from html.parser import HTMLParser
import requests
from docx import Document
import re
from docx.shared import RGBColor
import docx
 
 
class MyHTMLParser(HTMLParser):
 def __init__(self,docname):
  HTMLParser.__init__(self)
  self.docname=docname
  # python-docx writes the .docx format, so the file gets a .docx extension
  self.docfile = r"D:\pytest\%s.docx"%self.docname
  self.doc=Document()
  self.title = False
  self.code = False
  self.text=''
  self.processing =None
  self.codeprocessing =None
  self.picindex = 1
  self.headers = {
   'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.110 Safari/537.36'}
  self.timeout = 5
 
 def handle_startendtag(self, tag, attrs):
  # Images take more work: find the image URL, download it, then write it into the document
  if tag == "img":
   picname = None
   picdata = None
   for (variable, value) in attrs:
    if variable == "data-type":
     picname = r"D:\pytest\%s%s.%s" % (self.docname, self.picindex, value)
     # print(picname)
    if variable == "data-src":
     picdata = requests.get(value, headers=self.headers, timeout=self.timeout)
     # print(value)
   # Skip <img> tags that lack either attribute instead of failing on an unset variable
   if picname is None or picdata is None:
    return
   self.picindex = self.picindex + 1
   with open(picname, "wb") as pic:
    pic.write(picdata.content)
   try:
    self.doc.add_picture(picname)
   except docx.image.exceptions.UnexpectedEndOfFileError as e:
    print(e)
 
 def handle_starttag(self, tag, attrs):
  if re.match(r"h(\d)", tag):
   self.title = True
  if tag =="p":
   self.processing = tag
  if tag == "code":
   self.code = True
   self.codeprocessing = tag
 
 def handle_data(self, data):
   if self.title == True:
    self.doc.add_heading(data, level=2)
   # if self.in_div == True and self.tag == "p":
   if self.processing:
    self.text = self.text + data
   if self.code == True:
    p =self.doc.add_paragraph()
    run=p.add_run(data)
    run.font.color.rgb = RGBColor(111,111,111)
 
 def handle_endtag(self, tag):
  self.title = False
  # self.code = False
  if tag == self.processing:
   self.doc.add_paragraph(self.text)
 
   self.processing = None
   self.text=''
  if tag == self.codeprocessing:
   self.code =False
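The parser above tracks which tag it is inside by setting flags in handle_starttag and clearing them in handle_endtag. The same idea can be tried without python-docx or any network access. The following is a simplified sketch, not the author's parser: OutlineParser is a hypothetical class, and unlike the original it buffers heading text until the closing tag instead of adding it as soon as data arrives.

```python
from html.parser import HTMLParser
import re


class OutlineParser(HTMLParser):
    """Collect headings and paragraphs with start/end tag flags (hypothetical sketch)."""

    def __init__(self):
        super().__init__()
        self.blocks = []     # (kind, text) tuples in document order
        self._kind = None    # "heading" or "paragraph" while inside such a tag
        self._tag = None     # the tag name that opened the current block
        self._buffer = []    # text fragments seen inside the current block

    def handle_starttag(self, tag, attrs):
        # Nested tags such as <code> inside <p> do not start a new block
        if self._kind is None:
            if re.fullmatch(r"h[1-6]", tag):
                self._kind, self._tag = "heading", tag
            elif tag == "p":
                self._kind, self._tag = "paragraph", tag

    def handle_data(self, data):
        if self._kind is not None:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        # Only the tag that opened the block closes it
        if tag == self._tag:
            text = "".join(self._buffer).strip()
            if text:
                self.blocks.append((self._kind, text))
            self._kind = self._tag = None
            self._buffer = []


parser = OutlineParser()
parser.feed("<h2>Install</h2><p>Run <code>pip install pytest</code> first.</p>")
parser.close()
print(parser.blocks)
```

Buffering until the end tag keeps a heading or paragraph in one piece even when it contains inline tags, at the cost of not distinguishing code from the surrounding text.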

Result:


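Some documents are missing, for example "pytest文档4", because they do not appear in the Sogou WeChat article search results in the first place.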
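Gaps like this can be spotted automatically once the titles are collected. The sketch below is only an illustration: missing_numbers is a hypothetical helper, the sample titles are made up, and it assumes each title contains the prefix "pytest文档" immediately followed by the document number.

```python
import re


def missing_numbers(titles, prefix="pytest文档"):
    """Return the document numbers not found in titles, from 1 up to the highest number seen."""
    found = set()
    for title in titles:
        match = re.search(re.escape(prefix) + r"(\d+)", title)
        if match:
            found.add(int(match.group(1)))
    if not found:
        return []
    return [n for n in range(1, max(found) + 1) if n not in found]


# Made-up sample titles, for illustration only
titles = ["pytest文档1-sample", "pytest文档2-sample", "pytest文档3-sample", "pytest文档5-sample"]
print(missing_numbers(titles))  # prints: [4]
```

Documents numbered above the highest one found cannot be detected this way, since the helper has no knowledge of how long the series is.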

That is all for this article. I hope it helps with your learning, and I hope you will keep supporting 三水点靠木.


- Author -

qd_tudou

