对Python3 解析html的几种操作方式小结


Posted in Python onFebruary 16, 2019

解析html是爬虫后的重要的一个处理数据的环节。一下记录解析html的几种方式。

先介绍基础的辅助函数,主要用于获取html并输入解析后的结束

#把传递解析函数,便于下面的修改
def get_html(url, paraser=bs4_paraser):
 headers = {
  'Accept': '*/*',
  'Accept-Encoding': 'gzip, deflate, sdch',
  'Accept-Language': 'zh-CN,zh;q=0.8',
  'Host': 'www.360kan.com',
  'Proxy-Connection': 'keep-alive',
  'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/52.0.2743.116 Safari/537.36'
 }
 request = urllib2.Request(url, headers=headers)
 response = urllib2.urlopen(request)
 response.encoding = 'utf-8'
 if response.code == 200:
  data = StringIO.StringIO(response.read())
  gzipper = gzip.GzipFile(fileobj=data)
  data = gzipper.read()
  value = paraser(data) # open('E:/h5/haPkY0osd0r5UB.html').read()
  return value
 else:
  pass
 
 
value = get_html('http://www.360kan.com/m/haPkY0osd0r5UB.html', paraser=lxml_parser)
for row in value:
 print row

1,lxml.html的方式进行解析,

The lxml XML toolkit is a Pythonic binding for the C libraries libxml2 and libxslt. It is unique in that it combines the speed and XML feature completeness of these libraries with the simplicity of a native Python API, mostly compatible but superior to the well-known ElementTree API. The latest release works with all CPython versions from 2.6 to 3.5. See the introduction for more information about background and goals of the lxml project. Some common questions are answered in the FAQ. [官网](http://lxml.de/)

def lxml_parser(page):
 data = []
 doc = etree.HTML(page)
 all_div = doc.xpath('//div[@class="yingping-list-wrap"]')
 for row in all_div:
  # 获取每一个影评,即影评的item
  all_div_item = row.xpath('.//div[@class="item"]') # find_all('div', attrs={'class': 'item'})
  for r in all_div_item:
   value = {}
   # 获取影评的标题部分
   title = r.xpath('.//div[@class="g-clear title-wrap"][1]')
   value['title'] = title[0].xpath('./a/text()')[0]
   value['title_href'] = title[0].xpath('./a/@href')[0]
   score_text = title[0].xpath('./div/span/span/@style')[0]
   score_text = re.search(r'\d+', score_text).group()
   value['score'] = int(score_text) / 20
   # 时间
   value['time'] = title[0].xpath('./div/span[@class="time"]/text()')[0]
   # 多少人喜欢
   value['people'] = int(
     re.search(r'\d+', title[0].xpath('./div[@class="num"]/span/text()')[0]).group())
   data.append(value)
 return data

2,使用BeautifulSoup,不多说了,大家网上找资料看看

def bs4_paraser(html):
 all_value = []
 value = {}
 soup = BeautifulSoup(html, 'html.parser')
 # 获取影评的部分
 all_div = soup.find_all('div', attrs={'class': 'yingping-list-wrap'}, limit=1)
 for row in all_div:
  # 获取每一个影评,即影评的item
  all_div_item = row.find_all('div', attrs={'class': 'item'})
  for r in all_div_item:
   # 获取影评的标题部分
   title = r.find_all('div', attrs={'class': 'g-clear title-wrap'}, limit=1)
   if title is not None and len(title) > 0:
    value['title'] = title[0].a.string
    value['title_href'] = title[0].a['href']
    score_text = title[0].div.span.span['style']
    score_text = re.search(r'\d+', score_text).group()
    value['score'] = int(score_text) / 20
    # 时间
    value['time'] = title[0].div.find_all('span', attrs={'class': 'time'})[0].string
    # 多少人喜欢
    value['people'] = int(
      re.search(r'\d+', title[0].find_all('div', attrs={'class': 'num'})[0].span.string).group())
   # print r
   all_value.append(value)
   value = {}
 return all_value

3,使用SGMLParser,主要是通过start、end tag的方式进行了,解析工程比较明朗,但是有点麻烦,而且该案例的场景不太适合该方法,(哈哈)

class CommentParaser(SGMLParser):
 def __init__(self):
  SGMLParser.__init__(self)
  self.__start_div_yingping = False
  self.__start_div_item = False
  self.__start_div_gclear = False
  self.__start_div_ratingwrap = False
  self.__start_div_num = False
  # a
  self.__start_a = False
  # span 3中状态
  self.__span_state = 0
  # 数据
  self.__value = {}
  self.data = []
 
 def start_div(self, attrs):
  for k, v in attrs:
   if k == 'class' and v == 'yingping-list-wrap':
    self.__start_div_yingping = True
   elif k == 'class' and v == 'item':
    self.__start_div_item = True
   elif k == 'class' and v == 'g-clear title-wrap':
    self.__start_div_gclear = True
   elif k == 'class' and v == 'rating-wrap g-clear':
    self.__start_div_ratingwrap = True
   elif k == 'class' and v == 'num':
    self.__start_div_num = True
 
 def end_div(self):
  if self.__start_div_yingping:
   if self.__start_div_item:
    if self.__start_div_gclear:
     if self.__start_div_num or self.__start_div_ratingwrap:
      if self.__start_div_num:
       self.__start_div_num = False
      if self.__start_div_ratingwrap:
       self.__start_div_ratingwrap = False
     else:
      self.__start_div_gclear = False
    else:
     self.data.append(self.__value)
     self.__value = {}
     self.__start_div_item = False
   else:
    self.__start_div_yingping = False
 
 def start_a(self, attrs):
  if self.__start_div_yingping and self.__start_div_item and self.__start_div_gclear:
   self.__start_a = True
   for k, v in attrs:
    if k == 'href':
     self.__value['href'] = v
 
 def end_a(self):
  if self.__start_div_yingping and self.__start_div_item and self.__start_div_gclear and self.__start_a:
   self.__start_a = False
 
 def start_span(self, attrs):
  if self.__start_div_yingping and self.__start_div_item and self.__start_div_gclear:
   if self.__start_div_ratingwrap:
    if self.__span_state != 1:
     for k, v in attrs:
      if k == 'class' and v == 'rating':
       self.__span_state = 1
      elif k == 'class' and v == 'time':
       self.__span_state = 2
    else:
     for k, v in attrs:
      if k == 'style':
       score_text = re.search(r'\d+', v).group()
     self.__value['score'] = int(score_text) / 20
     self.__span_state = 3
   elif self.__start_div_num:
    self.__span_state = 4
 
 def end_span(self):
  self.__span_state = 0
 
 def handle_data(self, data):
  if self.__start_a:
   self.__value['title'] = data
  elif self.__span_state == 2:
   self.__value['time'] = data
  elif self.__span_state == 4:
   score_text = re.search(r'\d+', data).group()
   self.__value['people'] = int(score_text)
  pass
def sgl_parser(html):
 parser = CommentParaser()
 parser.feed(html)
 return parser.data

4,HTMLParaer,与3原理相识,就是调用的方法不太一样,基本上可以公用,

class CommentHTMLParser(HTMLParser.HTMLParser):
 def __init__(self):
  HTMLParser.HTMLParser.__init__(self)
  self.__start_div_yingping = False
  self.__start_div_item = False
  self.__start_div_gclear = False
  self.__start_div_ratingwrap = False
  self.__start_div_num = False
  # a
  self.__start_a = False
  # span 3中状态
  self.__span_state = 0
  # 数据
  self.__value = {}
  self.data = []
 
 def handle_starttag(self, tag, attrs):
  if tag == 'div':
   for k, v in attrs:
    if k == 'class' and v == 'yingping-list-wrap':
     self.__start_div_yingping = True
    elif k == 'class' and v == 'item':
     self.__start_div_item = True
    elif k == 'class' and v == 'g-clear title-wrap':
     self.__start_div_gclear = True
    elif k == 'class' and v == 'rating-wrap g-clear':
     self.__start_div_ratingwrap = True
    elif k == 'class' and v == 'num':
     self.__start_div_num = True
  elif tag == 'a':
   if self.__start_div_yingping and self.__start_div_item and self.__start_div_gclear:
    self.__start_a = True
    for k, v in attrs:
     if k == 'href':
      self.__value['href'] = v
  elif tag == 'span':
   if self.__start_div_yingping and self.__start_div_item and self.__start_div_gclear:
    if self.__start_div_ratingwrap:
     if self.__span_state != 1:
      for k, v in attrs:
       if k == 'class' and v == 'rating':
        self.__span_state = 1
       elif k == 'class' and v == 'time':
        self.__span_state = 2
     else:
      for k, v in attrs:
       if k == 'style':
        score_text = re.search(r'\d+', v).group()
      self.__value['score'] = int(score_text) / 20
      self.__span_state = 3
    elif self.__start_div_num:
     self.__span_state = 4
 
 def handle_endtag(self, tag):
  if tag == 'div':
   if self.__start_div_yingping:
    if self.__start_div_item:
     if self.__start_div_gclear:
      if self.__start_div_num or self.__start_div_ratingwrap:
       if self.__start_div_num:
        self.__start_div_num = False
       if self.__start_div_ratingwrap:
        self.__start_div_ratingwrap = False
      else:
       self.__start_div_gclear = False
     else:
      self.data.append(self.__value)
      self.__value = {}
      self.__start_div_item = False
    else:
     self.__start_div_yingping = False
  elif tag == 'a':
   if self.__start_div_yingping and self.__start_div_item and self.__start_div_gclear and self.__start_a:
    self.__start_a = False
  elif tag == 'span':
   self.__span_state = 0
 
 def handle_data(self, data):
  if self.__start_a:
   self.__value['title'] = data
  elif self.__span_state == 2:
   self.__value['time'] = data
  elif self.__span_state == 4:
   score_text = re.search(r'\d+', data).group()
   self.__value['people'] = int(score_text)
  pass
def html_parser(html):
 parser = CommentHTMLParser()
 parser.feed(html)
 return parser.data

3,4对于该案例来说确实是不太适合,趁现在有空记录下来,功学习使用!

以上这篇对Python3 解析html的几种操作方式小结就是小编分享给大家的全部内容了,希望能给大家一个参考,也希望大家多多支持三水点靠木。

Python 相关文章推荐
python实现斐波那契数列的方法示例
Jan 12 Python
python+VTK环境搭建及第一个简单程序代码
Dec 13 Python
浅谈python爬虫使用Selenium模拟浏览器行为
Feb 23 Python
用python统计代码行的示例(包括空行和注释)
Jul 24 Python
[原创]Python入门教程5. 字典基本操作【定义、运算、常用函数】
Nov 01 Python
Python2与Python3的区别实例总结
Apr 17 Python
Python Selenium参数配置方法解析
Jan 19 Python
Python 实现自动完成A4标签排版打印功能
Apr 09 Python
python openCV实现摄像头获取人脸图片
Aug 20 Python
python 进程池pool使用详解
Oct 15 Python
基于python获取本地时间并转换时间戳和日期格式
Oct 27 Python
Python基础之元组与文件知识总结
May 19 Python
Python实现爬取马云的微博功能示例
Feb 16 #Python
对Python3 * 和 ** 运算符详解
Feb 16 #Python
Python docx库用法示例分析
Feb 16 #Python
Python中整数的缓存机制讲解
Feb 16 #Python
Python实现的爬取百度文库功能示例
Feb 16 #Python
对Python3 序列解包详解
Feb 16 #Python
对Python3 pyc 文件的使用详解
Feb 16 #Python
You might like
php之字符串变相相减的代码
2007/03/19 PHP
php自定义函数之递归删除文件及目录
2010/08/08 PHP
作为PHP程序员应该了解MongoDB的五件事
2013/06/03 PHP
浅析Yii中使用RBAC的完全指南(用户角色权限控制)
2013/06/20 PHP
php session_start()出错原因分析及解决方法
2013/10/28 PHP
Yii框架登录流程分析
2014/12/03 PHP
Thinkphp自定义生成缩略图尺寸的方法
2019/08/05 PHP
js中scrollHeight,scrollWidth,scrollLeft,scrolltop等差别介绍
2012/05/16 Javascript
IE6兼容透明背景图片及解决方案
2015/08/19 Javascript
JavaScript获取function所有参数名的方法
2015/10/30 Javascript
jQuery实现ajax调用WCF服务的方法(附带demo下载)
2015/12/04 Javascript
通过Tabs方法基于easyUI+bootstrap制作工作站
2016/03/28 Javascript
webpack进阶——缓存与独立打包的用法
2017/08/02 Javascript
Angular 2.0+ 的数据绑定的实现示例
2017/08/09 Javascript
[01:08:30]DOTA2-DPC中国联赛 正赛 Ehome vs Elephant BO3 第一场 2月28日
2021/03/11 DOTA
Python 变量类型及命名规则介绍
2013/06/08 Python
编写Python脚本来获取mp3文件tag信息的教程
2015/05/04 Python
Python出现segfault错误解决方法
2016/04/16 Python
浅谈Pandas 排序之后索引的问题
2018/06/07 Python
Python 微信爬虫完整实例【单线程与多线程】
2019/07/06 Python
python递归法实现简易连连看小游戏
2020/03/25 Python
python paramiko远程服务器终端操作过程解析
2019/12/14 Python
tensorflow实现读取模型中保存的值 tf.train.NewCheckpointReader
2020/02/10 Python
python数据库开发之MongoDB安装及Python3操作MongoDB数据库详细方法与实例
2020/03/18 Python
中科软测试工程师面试题
2012/06/16 面试题
介绍一下EJB的分类及其各自的功能及应用
2016/08/23 面试题
银行财务部实习生的自我鉴定
2013/11/27 职场文书
商业项目策划方案
2014/06/05 职场文书
拉歌口号大全
2014/06/13 职场文书
党员自我剖析材料(群众路线)
2014/10/06 职场文书
2015年个人现实表现材料
2014/12/10 职场文书
初中思想品德教学反思
2016/02/24 职场文书
Pytorch反向传播中的细节-计算梯度时的默认累加操作
2021/06/05 Python
IDEA使用SpringAssistant插件创建SpringCloud项目
2021/06/23 Java/Android
Java Spring 控制反转(IOC)容器详解
2021/10/05 Java/Android
python游戏开发Pygame框架
2022/04/22 Python