对Python3 解析html的几种操作方式小结


Posted in Python onFebruary 16, 2019

解析html是爬虫后的重要的一个处理数据的环节。一下记录解析html的几种方式。

先介绍基础的辅助函数,主要用于获取html并输入解析后的结束

#把传递解析函数,便于下面的修改
def get_html(url, paraser=bs4_paraser):
 headers = {
  'Accept': '*/*',
  'Accept-Encoding': 'gzip, deflate, sdch',
  'Accept-Language': 'zh-CN,zh;q=0.8',
  'Host': 'www.360kan.com',
  'Proxy-Connection': 'keep-alive',
  'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/52.0.2743.116 Safari/537.36'
 }
 request = urllib2.Request(url, headers=headers)
 response = urllib2.urlopen(request)
 response.encoding = 'utf-8'
 if response.code == 200:
  data = StringIO.StringIO(response.read())
  gzipper = gzip.GzipFile(fileobj=data)
  data = gzipper.read()
  value = paraser(data) # open('E:/h5/haPkY0osd0r5UB.html').read()
  return value
 else:
  pass
 
 
value = get_html('http://www.360kan.com/m/haPkY0osd0r5UB.html', paraser=lxml_parser)
for row in value:
 print row

1,lxml.html的方式进行解析,

The lxml XML toolkit is a Pythonic binding for the C libraries libxml2 and libxslt. It is unique in that it combines the speed and XML feature completeness of these libraries with the simplicity of a native Python API, mostly compatible but superior to the well-known ElementTree API. The latest release works with all CPython versions from 2.6 to 3.5. See the introduction for more information about background and goals of the lxml project. Some common questions are answered in the FAQ. [官网](http://lxml.de/)

def lxml_parser(page):
 data = []
 doc = etree.HTML(page)
 all_div = doc.xpath('//div[@class="yingping-list-wrap"]')
 for row in all_div:
  # 获取每一个影评,即影评的item
  all_div_item = row.xpath('.//div[@class="item"]') # find_all('div', attrs={'class': 'item'})
  for r in all_div_item:
   value = {}
   # 获取影评的标题部分
   title = r.xpath('.//div[@class="g-clear title-wrap"][1]')
   value['title'] = title[0].xpath('./a/text()')[0]
   value['title_href'] = title[0].xpath('./a/@href')[0]
   score_text = title[0].xpath('./div/span/span/@style')[0]
   score_text = re.search(r'\d+', score_text).group()
   value['score'] = int(score_text) / 20
   # 时间
   value['time'] = title[0].xpath('./div/span[@class="time"]/text()')[0]
   # 多少人喜欢
   value['people'] = int(
     re.search(r'\d+', title[0].xpath('./div[@class="num"]/span/text()')[0]).group())
   data.append(value)
 return data

2,使用BeautifulSoup,不多说了,大家网上找资料看看

def bs4_paraser(html):
 all_value = []
 value = {}
 soup = BeautifulSoup(html, 'html.parser')
 # 获取影评的部分
 all_div = soup.find_all('div', attrs={'class': 'yingping-list-wrap'}, limit=1)
 for row in all_div:
  # 获取每一个影评,即影评的item
  all_div_item = row.find_all('div', attrs={'class': 'item'})
  for r in all_div_item:
   # 获取影评的标题部分
   title = r.find_all('div', attrs={'class': 'g-clear title-wrap'}, limit=1)
   if title is not None and len(title) > 0:
    value['title'] = title[0].a.string
    value['title_href'] = title[0].a['href']
    score_text = title[0].div.span.span['style']
    score_text = re.search(r'\d+', score_text).group()
    value['score'] = int(score_text) / 20
    # 时间
    value['time'] = title[0].div.find_all('span', attrs={'class': 'time'})[0].string
    # 多少人喜欢
    value['people'] = int(
      re.search(r'\d+', title[0].find_all('div', attrs={'class': 'num'})[0].span.string).group())
   # print r
   all_value.append(value)
   value = {}
 return all_value

3,使用SGMLParser,主要是通过start、end tag的方式进行了,解析工程比较明朗,但是有点麻烦,而且该案例的场景不太适合该方法,(哈哈)

class CommentParaser(SGMLParser):
 def __init__(self):
  SGMLParser.__init__(self)
  self.__start_div_yingping = False
  self.__start_div_item = False
  self.__start_div_gclear = False
  self.__start_div_ratingwrap = False
  self.__start_div_num = False
  # a
  self.__start_a = False
  # span 3中状态
  self.__span_state = 0
  # 数据
  self.__value = {}
  self.data = []
 
 def start_div(self, attrs):
  for k, v in attrs:
   if k == 'class' and v == 'yingping-list-wrap':
    self.__start_div_yingping = True
   elif k == 'class' and v == 'item':
    self.__start_div_item = True
   elif k == 'class' and v == 'g-clear title-wrap':
    self.__start_div_gclear = True
   elif k == 'class' and v == 'rating-wrap g-clear':
    self.__start_div_ratingwrap = True
   elif k == 'class' and v == 'num':
    self.__start_div_num = True
 
 def end_div(self):
  if self.__start_div_yingping:
   if self.__start_div_item:
    if self.__start_div_gclear:
     if self.__start_div_num or self.__start_div_ratingwrap:
      if self.__start_div_num:
       self.__start_div_num = False
      if self.__start_div_ratingwrap:
       self.__start_div_ratingwrap = False
     else:
      self.__start_div_gclear = False
    else:
     self.data.append(self.__value)
     self.__value = {}
     self.__start_div_item = False
   else:
    self.__start_div_yingping = False
 
 def start_a(self, attrs):
  if self.__start_div_yingping and self.__start_div_item and self.__start_div_gclear:
   self.__start_a = True
   for k, v in attrs:
    if k == 'href':
     self.__value['href'] = v
 
 def end_a(self):
  if self.__start_div_yingping and self.__start_div_item and self.__start_div_gclear and self.__start_a:
   self.__start_a = False
 
 def start_span(self, attrs):
  if self.__start_div_yingping and self.__start_div_item and self.__start_div_gclear:
   if self.__start_div_ratingwrap:
    if self.__span_state != 1:
     for k, v in attrs:
      if k == 'class' and v == 'rating':
       self.__span_state = 1
      elif k == 'class' and v == 'time':
       self.__span_state = 2
    else:
     for k, v in attrs:
      if k == 'style':
       score_text = re.search(r'\d+', v).group()
     self.__value['score'] = int(score_text) / 20
     self.__span_state = 3
   elif self.__start_div_num:
    self.__span_state = 4
 
 def end_span(self):
  self.__span_state = 0
 
 def handle_data(self, data):
  if self.__start_a:
   self.__value['title'] = data
  elif self.__span_state == 2:
   self.__value['time'] = data
  elif self.__span_state == 4:
   score_text = re.search(r'\d+', data).group()
   self.__value['people'] = int(score_text)
  pass
def sgl_parser(html):
 parser = CommentParaser()
 parser.feed(html)
 return parser.data

4,HTMLParaer,与3原理相识,就是调用的方法不太一样,基本上可以公用,

class CommentHTMLParser(HTMLParser.HTMLParser):
 def __init__(self):
  HTMLParser.HTMLParser.__init__(self)
  self.__start_div_yingping = False
  self.__start_div_item = False
  self.__start_div_gclear = False
  self.__start_div_ratingwrap = False
  self.__start_div_num = False
  # a
  self.__start_a = False
  # span 3中状态
  self.__span_state = 0
  # 数据
  self.__value = {}
  self.data = []
 
 def handle_starttag(self, tag, attrs):
  if tag == 'div':
   for k, v in attrs:
    if k == 'class' and v == 'yingping-list-wrap':
     self.__start_div_yingping = True
    elif k == 'class' and v == 'item':
     self.__start_div_item = True
    elif k == 'class' and v == 'g-clear title-wrap':
     self.__start_div_gclear = True
    elif k == 'class' and v == 'rating-wrap g-clear':
     self.__start_div_ratingwrap = True
    elif k == 'class' and v == 'num':
     self.__start_div_num = True
  elif tag == 'a':
   if self.__start_div_yingping and self.__start_div_item and self.__start_div_gclear:
    self.__start_a = True
    for k, v in attrs:
     if k == 'href':
      self.__value['href'] = v
  elif tag == 'span':
   if self.__start_div_yingping and self.__start_div_item and self.__start_div_gclear:
    if self.__start_div_ratingwrap:
     if self.__span_state != 1:
      for k, v in attrs:
       if k == 'class' and v == 'rating':
        self.__span_state = 1
       elif k == 'class' and v == 'time':
        self.__span_state = 2
     else:
      for k, v in attrs:
       if k == 'style':
        score_text = re.search(r'\d+', v).group()
      self.__value['score'] = int(score_text) / 20
      self.__span_state = 3
    elif self.__start_div_num:
     self.__span_state = 4
 
 def handle_endtag(self, tag):
  if tag == 'div':
   if self.__start_div_yingping:
    if self.__start_div_item:
     if self.__start_div_gclear:
      if self.__start_div_num or self.__start_div_ratingwrap:
       if self.__start_div_num:
        self.__start_div_num = False
       if self.__start_div_ratingwrap:
        self.__start_div_ratingwrap = False
      else:
       self.__start_div_gclear = False
     else:
      self.data.append(self.__value)
      self.__value = {}
      self.__start_div_item = False
    else:
     self.__start_div_yingping = False
  elif tag == 'a':
   if self.__start_div_yingping and self.__start_div_item and self.__start_div_gclear and self.__start_a:
    self.__start_a = False
  elif tag == 'span':
   self.__span_state = 0
 
 def handle_data(self, data):
  if self.__start_a:
   self.__value['title'] = data
  elif self.__span_state == 2:
   self.__value['time'] = data
  elif self.__span_state == 4:
   score_text = re.search(r'\d+', data).group()
   self.__value['people'] = int(score_text)
  pass
def html_parser(html):
 parser = CommentHTMLParser()
 parser.feed(html)
 return parser.data

3,4对于该案例来说确实是不太适合,趁现在有空记录下来,功学习使用!

以上这篇对Python3 解析html的几种操作方式小结就是小编分享给大家的全部内容了,希望能给大家一个参考,也希望大家多多支持三水点靠木。

Python 相关文章推荐
python实现查找excel里某一列重复数据并且剔除后打印的方法
May 26 Python
python实现在控制台输入密码不显示的方法
Jul 02 Python
pandas数值计算与排序方法
Apr 12 Python
python爬取淘宝商品销量信息
Nov 16 Python
python 获取utc时间转化为本地时间的方法
Dec 31 Python
浅谈python str.format与制表符\t关于中文对齐的细节问题
Jan 14 Python
Django使用AJAX调用自己写的API接口的方法
Mar 06 Python
Python opencv实现人眼/人脸识别以及实时打码处理
Apr 29 Python
python交易记录整合交易类详解
Jul 03 Python
python os.path.isfile()因参数问题判断错误的解决
Nov 29 Python
对python中list的五种查找方法说明
Jul 13 Python
Python中json.load()和json.loads()有哪些区别
Jun 07 Python
Python实现爬取马云的微博功能示例
Feb 16 #Python
对Python3 * 和 ** 运算符详解
Feb 16 #Python
Python docx库用法示例分析
Feb 16 #Python
Python中整数的缓存机制讲解
Feb 16 #Python
Python实现的爬取百度文库功能示例
Feb 16 #Python
对Python3 序列解包详解
Feb 16 #Python
对Python3 pyc 文件的使用详解
Feb 16 #Python
You might like
php录入页面中动态从数据库中提取数据的实现
2006/10/09 PHP
PHP中防止SQL注入实现代码
2011/02/19 PHP
php广告加载类用法实例
2014/09/23 PHP
基于Swoole实现PHP与websocket聊天室
2016/08/03 PHP
PHP简单实现数字分页功能示例
2016/08/24 PHP
php中final关键字用法分析
2016/12/07 PHP
Javascript - HTML的request类
2007/01/09 Javascript
javascript各浏览器中option元素的表现差异
2011/04/07 Javascript
让innerText在firefox火狐和IE浏览器都能用的写法
2011/05/14 Javascript
js计算字符串长度包含的中文是utf8格式
2013/10/15 Javascript
自定义ExtJS控件之下拉树和下拉表格附源码
2013/10/15 Javascript
js隐式全局变量造成的bug示例代码
2014/04/22 Javascript
js预加载图片方法汇总
2015/06/15 Javascript
js实现文字垂直滚动和鼠标悬停效果
2015/12/31 Javascript
js document.getElementsByClassName的使用介绍与自定义函数
2016/11/25 Javascript
Angularjs通过指令监听ng-repeat渲染完成后执行脚本的方法
2016/12/31 Javascript
详解能在多种前端框架下使用的表格控件
2017/01/11 Javascript
JavaScript html5利用FileReader实现上传功能
2020/03/27 Javascript
Angular2入门--架构总览
2017/03/29 Javascript
详解vee-validate的使用个人小结
2017/06/07 Javascript
Vue 2.5.2下axios + express 本地请求404的解决方法
2018/02/21 Javascript
vue语法自动转typescript(解放双手)
2019/09/18 Javascript
tornado框架blog模块分析与使用
2013/11/21 Python
python使用win32com在百度空间插入html元素示例
2014/02/20 Python
十条建议帮你提高Python编程效率
2016/02/16 Python
通过Python编写一个简单登录功能过程解析
2019/09/04 Python
python读取图像矩阵文件并转换为向量实例
2020/06/18 Python
css3 实现滚动条美化效果的实例代码
2021/01/06 HTML / CSS
世界上最好的旅行夹克:BauBax
2018/12/23 全球购物
Tomcat Mysql datasource数据源配置
2015/12/28 面试题
学生励志演讲稿
2014/01/06 职场文书
月度优秀员工获奖感言
2014/08/16 职场文书
2016年教师节特级教师获奖感言
2015/12/09 职场文书
甜美蛋糕店的创业计划书模板,拿来即用!
2019/08/21 职场文书
redis配置文件中常用配置详解
2021/04/14 Redis
TensorFlow的自动求导原理分析
2021/05/26 Python