对Python3 解析html的几种操作方式小结


Posted in Python onFebruary 16, 2019

解析html是爬虫后的重要的一个处理数据的环节。一下记录解析html的几种方式。

先介绍基础的辅助函数,主要用于获取html并输入解析后的结束

#把传递解析函数,便于下面的修改
def get_html(url, paraser=bs4_paraser):
 headers = {
  'Accept': '*/*',
  'Accept-Encoding': 'gzip, deflate, sdch',
  'Accept-Language': 'zh-CN,zh;q=0.8',
  'Host': 'www.360kan.com',
  'Proxy-Connection': 'keep-alive',
  'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/52.0.2743.116 Safari/537.36'
 }
 request = urllib2.Request(url, headers=headers)
 response = urllib2.urlopen(request)
 response.encoding = 'utf-8'
 if response.code == 200:
  data = StringIO.StringIO(response.read())
  gzipper = gzip.GzipFile(fileobj=data)
  data = gzipper.read()
  value = paraser(data) # open('E:/h5/haPkY0osd0r5UB.html').read()
  return value
 else:
  pass
 
 
value = get_html('http://www.360kan.com/m/haPkY0osd0r5UB.html', paraser=lxml_parser)
for row in value:
 print row

1,lxml.html的方式进行解析,

The lxml XML toolkit is a Pythonic binding for the C libraries libxml2 and libxslt. It is unique in that it combines the speed and XML feature completeness of these libraries with the simplicity of a native Python API, mostly compatible but superior to the well-known ElementTree API. The latest release works with all CPython versions from 2.6 to 3.5. See the introduction for more information about background and goals of the lxml project. Some common questions are answered in the FAQ. [官网](http://lxml.de/)

def lxml_parser(page):
 data = []
 doc = etree.HTML(page)
 all_div = doc.xpath('//div[@class="yingping-list-wrap"]')
 for row in all_div:
  # 获取每一个影评,即影评的item
  all_div_item = row.xpath('.//div[@class="item"]') # find_all('div', attrs={'class': 'item'})
  for r in all_div_item:
   value = {}
   # 获取影评的标题部分
   title = r.xpath('.//div[@class="g-clear title-wrap"][1]')
   value['title'] = title[0].xpath('./a/text()')[0]
   value['title_href'] = title[0].xpath('./a/@href')[0]
   score_text = title[0].xpath('./div/span/span/@style')[0]
   score_text = re.search(r'\d+', score_text).group()
   value['score'] = int(score_text) / 20
   # 时间
   value['time'] = title[0].xpath('./div/span[@class="time"]/text()')[0]
   # 多少人喜欢
   value['people'] = int(
     re.search(r'\d+', title[0].xpath('./div[@class="num"]/span/text()')[0]).group())
   data.append(value)
 return data

2,使用BeautifulSoup,不多说了,大家网上找资料看看

def bs4_paraser(html):
 all_value = []
 value = {}
 soup = BeautifulSoup(html, 'html.parser')
 # 获取影评的部分
 all_div = soup.find_all('div', attrs={'class': 'yingping-list-wrap'}, limit=1)
 for row in all_div:
  # 获取每一个影评,即影评的item
  all_div_item = row.find_all('div', attrs={'class': 'item'})
  for r in all_div_item:
   # 获取影评的标题部分
   title = r.find_all('div', attrs={'class': 'g-clear title-wrap'}, limit=1)
   if title is not None and len(title) > 0:
    value['title'] = title[0].a.string
    value['title_href'] = title[0].a['href']
    score_text = title[0].div.span.span['style']
    score_text = re.search(r'\d+', score_text).group()
    value['score'] = int(score_text) / 20
    # 时间
    value['time'] = title[0].div.find_all('span', attrs={'class': 'time'})[0].string
    # 多少人喜欢
    value['people'] = int(
      re.search(r'\d+', title[0].find_all('div', attrs={'class': 'num'})[0].span.string).group())
   # print r
   all_value.append(value)
   value = {}
 return all_value

3,使用SGMLParser,主要是通过start、end tag的方式进行了,解析工程比较明朗,但是有点麻烦,而且该案例的场景不太适合该方法,(哈哈)

class CommentParaser(SGMLParser):
 def __init__(self):
  SGMLParser.__init__(self)
  self.__start_div_yingping = False
  self.__start_div_item = False
  self.__start_div_gclear = False
  self.__start_div_ratingwrap = False
  self.__start_div_num = False
  # a
  self.__start_a = False
  # span 3中状态
  self.__span_state = 0
  # 数据
  self.__value = {}
  self.data = []
 
 def start_div(self, attrs):
  for k, v in attrs:
   if k == 'class' and v == 'yingping-list-wrap':
    self.__start_div_yingping = True
   elif k == 'class' and v == 'item':
    self.__start_div_item = True
   elif k == 'class' and v == 'g-clear title-wrap':
    self.__start_div_gclear = True
   elif k == 'class' and v == 'rating-wrap g-clear':
    self.__start_div_ratingwrap = True
   elif k == 'class' and v == 'num':
    self.__start_div_num = True
 
 def end_div(self):
  if self.__start_div_yingping:
   if self.__start_div_item:
    if self.__start_div_gclear:
     if self.__start_div_num or self.__start_div_ratingwrap:
      if self.__start_div_num:
       self.__start_div_num = False
      if self.__start_div_ratingwrap:
       self.__start_div_ratingwrap = False
     else:
      self.__start_div_gclear = False
    else:
     self.data.append(self.__value)
     self.__value = {}
     self.__start_div_item = False
   else:
    self.__start_div_yingping = False
 
 def start_a(self, attrs):
  if self.__start_div_yingping and self.__start_div_item and self.__start_div_gclear:
   self.__start_a = True
   for k, v in attrs:
    if k == 'href':
     self.__value['href'] = v
 
 def end_a(self):
  if self.__start_div_yingping and self.__start_div_item and self.__start_div_gclear and self.__start_a:
   self.__start_a = False
 
 def start_span(self, attrs):
  if self.__start_div_yingping and self.__start_div_item and self.__start_div_gclear:
   if self.__start_div_ratingwrap:
    if self.__span_state != 1:
     for k, v in attrs:
      if k == 'class' and v == 'rating':
       self.__span_state = 1
      elif k == 'class' and v == 'time':
       self.__span_state = 2
    else:
     for k, v in attrs:
      if k == 'style':
       score_text = re.search(r'\d+', v).group()
     self.__value['score'] = int(score_text) / 20
     self.__span_state = 3
   elif self.__start_div_num:
    self.__span_state = 4
 
 def end_span(self):
  self.__span_state = 0
 
 def handle_data(self, data):
  if self.__start_a:
   self.__value['title'] = data
  elif self.__span_state == 2:
   self.__value['time'] = data
  elif self.__span_state == 4:
   score_text = re.search(r'\d+', data).group()
   self.__value['people'] = int(score_text)
  pass
def sgl_parser(html):
 parser = CommentParaser()
 parser.feed(html)
 return parser.data

4,HTMLParaer,与3原理相识,就是调用的方法不太一样,基本上可以公用,

class CommentHTMLParser(HTMLParser.HTMLParser):
 def __init__(self):
  HTMLParser.HTMLParser.__init__(self)
  self.__start_div_yingping = False
  self.__start_div_item = False
  self.__start_div_gclear = False
  self.__start_div_ratingwrap = False
  self.__start_div_num = False
  # a
  self.__start_a = False
  # span 3中状态
  self.__span_state = 0
  # 数据
  self.__value = {}
  self.data = []
 
 def handle_starttag(self, tag, attrs):
  if tag == 'div':
   for k, v in attrs:
    if k == 'class' and v == 'yingping-list-wrap':
     self.__start_div_yingping = True
    elif k == 'class' and v == 'item':
     self.__start_div_item = True
    elif k == 'class' and v == 'g-clear title-wrap':
     self.__start_div_gclear = True
    elif k == 'class' and v == 'rating-wrap g-clear':
     self.__start_div_ratingwrap = True
    elif k == 'class' and v == 'num':
     self.__start_div_num = True
  elif tag == 'a':
   if self.__start_div_yingping and self.__start_div_item and self.__start_div_gclear:
    self.__start_a = True
    for k, v in attrs:
     if k == 'href':
      self.__value['href'] = v
  elif tag == 'span':
   if self.__start_div_yingping and self.__start_div_item and self.__start_div_gclear:
    if self.__start_div_ratingwrap:
     if self.__span_state != 1:
      for k, v in attrs:
       if k == 'class' and v == 'rating':
        self.__span_state = 1
       elif k == 'class' and v == 'time':
        self.__span_state = 2
     else:
      for k, v in attrs:
       if k == 'style':
        score_text = re.search(r'\d+', v).group()
      self.__value['score'] = int(score_text) / 20
      self.__span_state = 3
    elif self.__start_div_num:
     self.__span_state = 4
 
 def handle_endtag(self, tag):
  if tag == 'div':
   if self.__start_div_yingping:
    if self.__start_div_item:
     if self.__start_div_gclear:
      if self.__start_div_num or self.__start_div_ratingwrap:
       if self.__start_div_num:
        self.__start_div_num = False
       if self.__start_div_ratingwrap:
        self.__start_div_ratingwrap = False
      else:
       self.__start_div_gclear = False
     else:
      self.data.append(self.__value)
      self.__value = {}
      self.__start_div_item = False
    else:
     self.__start_div_yingping = False
  elif tag == 'a':
   if self.__start_div_yingping and self.__start_div_item and self.__start_div_gclear and self.__start_a:
    self.__start_a = False
  elif tag == 'span':
   self.__span_state = 0
 
 def handle_data(self, data):
  if self.__start_a:
   self.__value['title'] = data
  elif self.__span_state == 2:
   self.__value['time'] = data
  elif self.__span_state == 4:
   score_text = re.search(r'\d+', data).group()
   self.__value['people'] = int(score_text)
  pass
def html_parser(html):
 parser = CommentHTMLParser()
 parser.feed(html)
 return parser.data

3,4对于该案例来说确实是不太适合,趁现在有空记录下来,功学习使用!

以上这篇对Python3 解析html的几种操作方式小结就是小编分享给大家的全部内容了,希望能给大家一个参考,也希望大家多多支持三水点靠木。

Python 相关文章推荐
Python入门教程之if语句的用法
May 14 Python
python3序列化与反序列化用法实例
May 26 Python
python基于phantomjs实现导入图片
May 13 Python
用tensorflow实现弹性网络回归算法
Jan 09 Python
Python中psutil的介绍与用法
May 02 Python
Python集中化管理平台Ansible介绍与YAML简介
Jun 12 Python
Python中BeautifuSoup库的用法使用详解
Nov 15 Python
python利用datetime模块计算程序运行时间问题
Feb 20 Python
Python使用jupyter notebook查看ipynb文件过程解析
Jun 02 Python
简单了解Django项目应用创建过程
Jul 06 Python
python中逻辑与或(and、or)和按位与或异或(&、|、^)区别
Aug 05 Python
python 实现图片特效处理
Apr 03 Python
Python实现爬取马云的微博功能示例
Feb 16 #Python
对Python3 * 和 ** 运算符详解
Feb 16 #Python
Python docx库用法示例分析
Feb 16 #Python
Python中整数的缓存机制讲解
Feb 16 #Python
Python实现的爬取百度文库功能示例
Feb 16 #Python
对Python3 序列解包详解
Feb 16 #Python
对Python3 pyc 文件的使用详解
Feb 16 #Python
You might like
php处理斐波那契数列非递归方法
2012/02/04 PHP
php实例分享之html转为rtf格式
2014/06/02 PHP
Zend Framework 2.0事件管理器(The EventManager)入门教程
2014/08/11 PHP
PHP多个文件上传到服务器实例
2014/10/29 PHP
php使用explode()函数将字符串拆分成数组的方法
2015/02/17 PHP
Laravel 5.3 学习笔记之 错误&日志
2016/08/28 PHP
PHP 接入支付宝即时到账功能
2016/09/18 PHP
PHP _construct()函数讲解
2019/02/03 PHP
DOMAssitant最新版 DOMAssistant 2.5发布
2007/12/25 Javascript
JavaScript CSS修改学习第三章 修改样式表
2010/02/19 Javascript
jQuery getJSON()+.ashx 实现分页(改进版)
2013/03/28 Javascript
使用js dom和jquery分别实现简单增删改
2014/09/11 Javascript
nodejs下打包模块archiver详解
2014/12/03 NodeJs
javascript实现随时变化着的背景颜色
2015/04/02 Javascript
javascript实现在网页中运行本地程序的方法
2016/02/03 Javascript
JavaScript获取当前运行脚本文件所在目录的方法
2016/02/03 Javascript
jQuery实现可以编辑的表格实例详解【附demo源码下载】
2016/07/09 Javascript
一个仿微博登陆邮箱提示框js开发案例
2016/07/28 Javascript
简述vue路由打开一个新的窗口的方法
2018/11/29 Javascript
vscode中eslint插件的配置(prettier配置无效)
2019/09/10 Javascript
[01:07:22]2014 DOTA2华西杯精英邀请赛 5 24 DK VS VG加赛
2014/05/26 DOTA
Python解析最简单的验证码
2016/01/07 Python
浅谈python 里面的单下划线与双下划线的区别
2017/12/01 Python
python读取mysql数据绘制条形图
2020/03/25 Python
Pycharm 使用 Pipenv 新建的虚拟环境(图文详解)
2020/04/16 Python
jupyter notebook tensorflow打印device信息实例
2020/04/20 Python
使用tensorflow框架在Colab上跑通猫狗识别代码
2020/04/26 Python
美国在线自行车商店:Jenson USA
2018/05/22 全球购物
高二化学教学反思
2014/01/30 职场文书
学校安全防火方案
2014/06/07 职场文书
篮球友谊赛通讯稿
2014/10/10 职场文书
实习班主任自我评价
2015/03/11 职场文书
会议通知范文
2015/04/15 职场文书
2015年机关作风和效能建设工作总结
2015/07/23 职场文书
《刺客之王:C罗全景传记》:时代从来不会亏待手艺人
2019/11/28 职场文书
python实现自动清理文件夹旧文件
2021/05/10 Python