Python lxml模块的基本使用方法分析


Posted in Python onDecember 21, 2019

本文实例讲述了Python lxml模块的基本使用方法。分享给大家供大家参考,具体如下:

1 lxml的安装

安装方式:pip install lxml

2 lxml的使用

2.1 lxml模块的入门使用

导入lxml 的 etree 库 (导入没有提示不代表不能用)

from lxml import etree

利用etree.HTML,将字符串转化为Element对象,Element对象具有xpath的方法,返回结果的列表,能够接受bytes类型的数据和str类型的数据

html = etree.HTML(text) 
ret_list = html.xpath("xpath字符串")

把转化后的element对象转化为字符串,返回bytes类型结果 etree.tostring(element)

假设我们现有如下的html字符换,尝试对他进行操作

<div> <ul> 
<li class="item-1"><a href="link1.html" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" >first item</a></li> 
<li class="item-1"><a href="link2.html" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" >second item</a></li> 
<li class="item-inactive"><a href="link3.html" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" >third item</a></li> 
<li class="item-1"><a href="link4.html" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" >fourth item</a></li> 
<li class="item-0"><a href="link5.html" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" >fifth item</a> # 注意,此处缺少一个 </li> 闭合标签 
</ul> </div>
from lxml import etree
text = ''' <div> <ul> 
    <li class="item-1"><a href="link1.html" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" >first item</a></li> 
    <li class="item-1"><a href="link2.html" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" >second item</a></li> 
    <li class="item-inactive"><a href="link3.html" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" >third item</a></li> 
    <li class="item-1"><a href="link4.html" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" >fourth item</a></li> 
    <li class="item-0"><a href="link5.html" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" >fifth item</a> 
    </ul> </div> '''
html = etree.HTML(text)
print(type(html)) 
handeled_html_str = etree.tostring(html).decode()
print(handeled_html_str)

输出为

<class 'lxml.etree._Element'>
<html><body><div> <ul>
        <li class="item-1"><a href="link1.html" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" >first item</a></li>
        <li class="item-1"><a href="link2.html" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" >second item</a></li>
        <li class="item-inactive"><a href="link3.html" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" >third item</a></li>
        <li class="item-1"><a href="link4.html" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" >fourth item</a></li>
        <li class="item-0"><a href="link5.html" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" >fifth item</a>
        </li></ul> </div> </body></html>

可以发现,lxml确实能够把确实的标签补充完成,但是请注意lxml是人写的,很多时候由于网页不够规范,或者是lxml的bug,即使参考url地址对应的响应去提取数据,任然获取不到,这个时候我们需要使用etree.tostring的方法,观察etree到底把html转化成了什么样子,即根据转化后的html字符串去进行数据的提取。

2.2 lxml的深入练习

接下来我们继续操作,假设每个class为item-1的li标签是1条新闻数据,如何把这条新闻数据组成一个字典

from lxml import etree
text = ''' <div> <ul> 
    <li class="item-1"><a href="link1.html" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" >first item</a></li> 
    <li class="item-1"><a href="link2.html" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" >second item</a></li> 
    <li class="item-inactive"><a href="link3.html" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" >third item</a></li> 
    <li class="item-1"><a href="link4.html" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" >fourth item</a></li> 
    <li class="item-0"><a href="link5.html" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" >fifth item</a> 
    </ul> </div> '''
html = etree.HTML(text)
#获取href的列表和title的列表
href_list = html.xpath("//li[@class='item-1']/a/@href")
title_list = html.xpath("//li[@class='item-1']/a/text()")
#组装成字典
for href in href_list:
  item = {}
  item["href"] = href
  item["title"] = title_list[href_list.index(href)]
  print(item)

输出为

{'href': 'link1.html', 'title': 'first item'}
{'href': 'link2.html', 'title': 'second item'}
{'href': 'link4.html', 'title': 'fourth item'}

假设在某种情况下,某个新闻的href没有,那么会怎样呢?

from lxml import etree
text = ''' <div> <ul> 
    <li class="item-1"><a>first item</a></li> 
    <li class="item-1"><a href="link2.html" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" >second item</a></li> 
    <li class="item-inactive"><a href="link3.html" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" >third item</a></li> 
    <li class="item-1"><a href="link4.html" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" >fourth item</a></li> 
    <li class="item-0"><a href="link5.html" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" >fifth item</a> 
    </ul> </div> '''

结果是

{'href': 'link2.html', 'title': 'first item'}
{'href': 'link4.html', 'title': 'second item'}

数据的对应全部错了,这不是我们想要的,接下来通过2.3小节的学习来解决这个问题

2.3 lxml模块的进阶使用

前面我们取到属性,或者是文本的时候,返回字符串 但是如果我们取到的是一个节点,返回什么呢?

返回的是element对象,可以继续使用xpath方法,对此我们可以在后面的数据提取过程中:先根据某个标签进行分组,分组之后再进行数据的提取

示例如下:

from lxml import etree
text = ''' <div> <ul> 
    <li class="item-1"><a>first item</a></li> 
    <li class="item-1"><a href="link2.html" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" >second item</a></li> 
    <li class="item-inactive"><a href="link3.html" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" >third item</a></li> 
    <li class="item-1"><a href="link4.html" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" >fourth item</a></li> 
    <li class="item-0"><a href="link5.html" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" >fifth item</a> 
    </ul> </div> '''
html = etree.HTML(text)
li_list = html.xpath("//li[@class='item-1']")
print(li_list)

结果为:

[<Element li at 0x11106cb48>, <Element li at 0x11106cb88>, <Element li at 0x11106cbc8>]

可以发现结果是一个element对象,这个对象能够继续使用xpath方法

先根据li标签进行分组,之后再进行数据的提取

from lxml import etree
text = ''' <div> <ul> 
    <li class="item-1"><a>first item</a></li> 
    <li class="item-1"><a href="link2.html" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" >second item</a></li> 
    <li class="item-inactive"><a href="link3.html" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" >third item</a></li> 
    <li class="item-1"><a href="link4.html" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" >fourth item</a></li> 
    <li class="item-0"><a href="link5.html" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" >fifth item</a> 
    </ul> </div> '''
#根据li标签进行分组
html = etree.HTML(text)
li_list = html.xpath("//li[@class='item-1']")
#在每一组中继续进行数据的提取
for li in li_list:
  item = {}
  item["href"] = li.xpath("./a/@href")[0] if len(li.xpath("./a/@href"))>0 else None
  item["title"] = li.xpath("./a/text()")[0] if len(li.xpath("./a/text()"))>0 else None
  print(item)

结果是:

{'href': None, 'title': 'first item'}
{'href': 'link2.html', 'title': 'second item'}
{'href': 'link4.html', 'title': 'fourth item'}

前面的代码中,进行数据提取需要判断,可能某些一面不存在数据的情况,对应的可以使用三元运算符来解决

Python 相关文章推荐
Python设计模式之中介模式简单示例
Jan 09 Python
Python实现购物车程序
Apr 16 Python
python 保存float类型的小数的位数方法
Oct 17 Python
python 自定义对象的打印方法
Jan 12 Python
Python基础之文件读取的讲解
Feb 16 Python
深入了解和应用Python 装饰器 @decorator
Apr 02 Python
Python深拷贝与浅拷贝用法实例分析
May 05 Python
python中tkinter的应用:修改字体的实例讲解
Jul 17 Python
python针对mysql数据库的连接、查询、更新、删除操作示例
Sep 11 Python
python文件和文件夹复制函数
Feb 07 Python
使用Keras中的ImageDataGenerator进行批次读图方式
Jun 17 Python
Python 如何利用ffmpeg 处理视频素材
Nov 27 Python
python Manager 之dict KeyError问题的解决
Dec 21 #Python
tornado+celery的简单使用详解
Dec 21 #Python
Python selenium的基本使用方法分析
Dec 21 #Python
Flask框架搭建虚拟环境的步骤分析
Dec 21 #Python
Django restframework 框架认证、权限、限流用法示例
Dec 21 #Python
python支持多线程的爬虫实例
Dec 21 #Python
Python 实现try重新执行
Dec 21 #Python
You might like
PHP连接access数据库
2008/03/27 PHP
提高php运行速度的一些小技巧分享
2012/07/03 PHP
PHP使用pcntl_fork实现多进程下载图片的方法
2014/12/16 PHP
如何解决PHP使用mysql_query查询超大结果集超内存问题
2016/03/14 PHP
php抽象方法和抽象类实例分析
2016/12/07 PHP
有趣的JavaScript数组长度问题代码说明
2011/01/20 Javascript
Jquery实现搜索框提示功能示例代码
2013/08/13 Javascript
JavaScript实现维吉尼亚(Vigenere)密码算法实例
2013/11/22 Javascript
Javascript 构造函数详解
2014/10/22 Javascript
JavaScript判断变量是否为空的自定义函数分享
2015/01/31 Javascript
jQuery中的pushStack实现原理和应用实例
2015/02/03 Javascript
JavaScript将当前时间转换成UTC标准时间的方法
2015/04/06 Javascript
模拟javascript中的sort排序(简单实例)
2016/08/17 Javascript
浅谈jquery之on()绑定事件和off()解除绑定事件
2016/10/26 Javascript
详解angularJs指令的3种绑定策略
2017/04/13 Javascript
React简单介绍
2017/05/24 Javascript
nodejs 图片预览和上传的示例代码
2017/09/30 NodeJs
JS中创建自定义类型的常用模式总结【工厂模式,构造函数模式,原型模式,动态原型模式等】
2019/01/19 Javascript
微信小程序录音实现功能并上传(使用node解析接收)
2020/02/26 Javascript
react-router-dom 嵌套路由的实现
2020/05/02 Javascript
Vue+Bootstrap收藏(点赞)功能逻辑与具体实现
2020/10/22 Javascript
[01:56]无止竞 再出发——中国军团出征2017年DOTA2国际邀请赛
2017/07/05 DOTA
[54:26]完美世界DOTA2联赛PWL S3 Forest vs Rebirth 第一场 12.10
2020/12/12 DOTA
Python中的choice()方法使用详解
2015/05/15 Python
python3抓取中文网页的方法
2015/07/28 Python
python实现DEM数据的阴影生成的方法
2019/07/23 Python
linux下python中文乱码解决方案详解
2019/08/28 Python
普通PHP程序员笔试题
2016/01/01 面试题
志愿者活动总结报告
2014/06/27 职场文书
汽车检测与维修专业求职信
2014/07/04 职场文书
培养联系人考察意见
2015/06/01 职场文书
法律意见书范文
2015/06/04 职场文书
《去年的树》教学反思
2016/02/18 职场文书
如何在pycharm中快捷安装pip命令(如pygame)
2021/05/31 Python
解决vue $http的get和post请求跨域问题
2021/06/07 Vue.js
SpringBoot项目多数据源及mybatis 驼峰失效的问题解决方法
2022/07/07 Java/Android