Python lxml模块的基本使用方法分析


Posted in Python onDecember 21, 2019

本文实例讲述了Python lxml模块的基本使用方法。分享给大家供大家参考,具体如下:

1 lxml的安装

安装方式:pip install lxml

2 lxml的使用

2.1 lxml模块的入门使用

导入lxml 的 etree 库 (导入没有提示不代表不能用)

from lxml import etree

利用etree.HTML,将字符串转化为Element对象,Element对象具有xpath的方法,返回结果的列表,能够接受bytes类型的数据和str类型的数据

html = etree.HTML(text) 
ret_list = html.xpath("xpath字符串")

把转化后的element对象转化为字符串,返回bytes类型结果 etree.tostring(element)

假设我们现有如下的html字符换,尝试对他进行操作

<div> <ul> 
<li class="item-1"><a href="link1.html" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" >first item</a></li> 
<li class="item-1"><a href="link2.html" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" >second item</a></li> 
<li class="item-inactive"><a href="link3.html" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" >third item</a></li> 
<li class="item-1"><a href="link4.html" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" >fourth item</a></li> 
<li class="item-0"><a href="link5.html" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" >fifth item</a> # 注意,此处缺少一个 </li> 闭合标签 
</ul> </div>
from lxml import etree
text = ''' <div> <ul> 
    <li class="item-1"><a href="link1.html" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" >first item</a></li> 
    <li class="item-1"><a href="link2.html" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" >second item</a></li> 
    <li class="item-inactive"><a href="link3.html" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" >third item</a></li> 
    <li class="item-1"><a href="link4.html" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" >fourth item</a></li> 
    <li class="item-0"><a href="link5.html" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" >fifth item</a> 
    </ul> </div> '''
html = etree.HTML(text)
print(type(html)) 
handeled_html_str = etree.tostring(html).decode()
print(handeled_html_str)

输出为

<class 'lxml.etree._Element'>
<html><body><div> <ul>
        <li class="item-1"><a href="link1.html" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" >first item</a></li>
        <li class="item-1"><a href="link2.html" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" >second item</a></li>
        <li class="item-inactive"><a href="link3.html" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" >third item</a></li>
        <li class="item-1"><a href="link4.html" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" >fourth item</a></li>
        <li class="item-0"><a href="link5.html" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" >fifth item</a>
        </li></ul> </div> </body></html>

可以发现,lxml确实能够把确实的标签补充完成,但是请注意lxml是人写的,很多时候由于网页不够规范,或者是lxml的bug,即使参考url地址对应的响应去提取数据,任然获取不到,这个时候我们需要使用etree.tostring的方法,观察etree到底把html转化成了什么样子,即根据转化后的html字符串去进行数据的提取。

2.2 lxml的深入练习

接下来我们继续操作,假设每个class为item-1的li标签是1条新闻数据,如何把这条新闻数据组成一个字典

from lxml import etree
text = ''' <div> <ul> 
    <li class="item-1"><a href="link1.html" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" >first item</a></li> 
    <li class="item-1"><a href="link2.html" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" >second item</a></li> 
    <li class="item-inactive"><a href="link3.html" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" >third item</a></li> 
    <li class="item-1"><a href="link4.html" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" >fourth item</a></li> 
    <li class="item-0"><a href="link5.html" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" >fifth item</a> 
    </ul> </div> '''
html = etree.HTML(text)
#获取href的列表和title的列表
href_list = html.xpath("//li[@class='item-1']/a/@href")
title_list = html.xpath("//li[@class='item-1']/a/text()")
#组装成字典
for href in href_list:
  item = {}
  item["href"] = href
  item["title"] = title_list[href_list.index(href)]
  print(item)

输出为

{'href': 'link1.html', 'title': 'first item'}
{'href': 'link2.html', 'title': 'second item'}
{'href': 'link4.html', 'title': 'fourth item'}

假设在某种情况下,某个新闻的href没有,那么会怎样呢?

from lxml import etree
text = ''' <div> <ul> 
    <li class="item-1"><a>first item</a></li> 
    <li class="item-1"><a href="link2.html" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" >second item</a></li> 
    <li class="item-inactive"><a href="link3.html" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" >third item</a></li> 
    <li class="item-1"><a href="link4.html" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" >fourth item</a></li> 
    <li class="item-0"><a href="link5.html" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" >fifth item</a> 
    </ul> </div> '''

结果是

{'href': 'link2.html', 'title': 'first item'}
{'href': 'link4.html', 'title': 'second item'}

数据的对应全部错了,这不是我们想要的,接下来通过2.3小节的学习来解决这个问题

2.3 lxml模块的进阶使用

前面我们取到属性,或者是文本的时候,返回字符串 但是如果我们取到的是一个节点,返回什么呢?

返回的是element对象,可以继续使用xpath方法,对此我们可以在后面的数据提取过程中:先根据某个标签进行分组,分组之后再进行数据的提取

示例如下:

from lxml import etree
text = ''' <div> <ul> 
    <li class="item-1"><a>first item</a></li> 
    <li class="item-1"><a href="link2.html" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" >second item</a></li> 
    <li class="item-inactive"><a href="link3.html" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" >third item</a></li> 
    <li class="item-1"><a href="link4.html" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" >fourth item</a></li> 
    <li class="item-0"><a href="link5.html" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" >fifth item</a> 
    </ul> </div> '''
html = etree.HTML(text)
li_list = html.xpath("//li[@class='item-1']")
print(li_list)

结果为:

[<Element li at 0x11106cb48>, <Element li at 0x11106cb88>, <Element li at 0x11106cbc8>]

可以发现结果是一个element对象,这个对象能够继续使用xpath方法

先根据li标签进行分组,之后再进行数据的提取

from lxml import etree
text = ''' <div> <ul> 
    <li class="item-1"><a>first item</a></li> 
    <li class="item-1"><a href="link2.html" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" >second item</a></li> 
    <li class="item-inactive"><a href="link3.html" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" >third item</a></li> 
    <li class="item-1"><a href="link4.html" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" >fourth item</a></li> 
    <li class="item-0"><a href="link5.html" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" >fifth item</a> 
    </ul> </div> '''
#根据li标签进行分组
html = etree.HTML(text)
li_list = html.xpath("//li[@class='item-1']")
#在每一组中继续进行数据的提取
for li in li_list:
  item = {}
  item["href"] = li.xpath("./a/@href")[0] if len(li.xpath("./a/@href"))>0 else None
  item["title"] = li.xpath("./a/text()")[0] if len(li.xpath("./a/text()"))>0 else None
  print(item)

结果是:

{'href': None, 'title': 'first item'}
{'href': 'link2.html', 'title': 'second item'}
{'href': 'link4.html', 'title': 'fourth item'}

前面的代码中,进行数据提取需要判断,可能某些一面不存在数据的情况,对应的可以使用三元运算符来解决

Python 相关文章推荐
跟老齐学Python之使用Python查询更新数据库
Nov 25 Python
python检测远程端口是否打开的方法
Mar 14 Python
学习python之编写简单简单连接数据库并执行查询操作
Feb 27 Python
Python的Flask框架及Nginx实现静态文件访问限制功能
Jun 27 Python
python中实现迭代器(iterator)的方法示例
Jan 19 Python
Python字典中的键映射多个值的方法(列表或者集合)
Oct 17 Python
Python程序打包工具py2exe和PyInstaller详解
Jun 28 Python
python shutil文件操作工具使用实例分析
Dec 25 Python
tensorflow之tf.record实现存浮点数数组
Feb 17 Python
Python常用编译器原理及特点解析
Mar 23 Python
django xadmin action兼容自定义model权限教程
Mar 30 Python
浅析Python打包时包含静态文件处理方法
Jan 15 Python
python Manager 之dict KeyError问题的解决
Dec 21 #Python
tornado+celery的简单使用详解
Dec 21 #Python
Python selenium的基本使用方法分析
Dec 21 #Python
Flask框架搭建虚拟环境的步骤分析
Dec 21 #Python
Django restframework 框架认证、权限、限流用法示例
Dec 21 #Python
python支持多线程的爬虫实例
Dec 21 #Python
Python 实现try重新执行
Dec 21 #Python
You might like
PHP异步调用socket实现代码
2012/01/12 PHP
RSA实现JS前端加密与PHP后端解密功能示例
2019/08/05 PHP
Jquery Ajax请求代码(2)
2011/01/07 Javascript
基于jquery自定义图片热区效果
2012/07/21 Javascript
可兼容IE的获取及设置cookie的jquery.cookie函数方法
2013/09/02 Javascript
Ext修改GridPanel数据和字体颜色、css属性等
2014/06/13 Javascript
js读取json的两种常用方法示例介绍
2014/10/19 Javascript
js实现文本框宽度自适应文本宽度的方法
2015/08/13 Javascript
javascript实现checkbox复选框实例代码
2016/01/10 Javascript
基于javascript实现checkbox复选框实例代码
2016/01/28 Javascript
JavaScript中removeChild 方法开发示例代码
2016/08/15 Javascript
聊聊Vue.js的template编译的问题
2017/10/09 Javascript
Node Puppeteer图像识别实现百度指数爬虫的示例
2018/02/22 Javascript
详解jQuery获取特殊属性的值以及设置内容
2018/11/14 jQuery
原生js生成图片验证码
2020/10/11 Javascript
ant design的table组件实现全选功能以及自定义分页
2020/11/17 Javascript
[01:03:27]NAVI vs EG 2019国际邀请赛小组赛 BO2 第一场 8.15
2019/08/17 DOTA
Python实现计算字符串中出现次数最多的字符示例
2019/01/21 Python
pyside+pyqt实现鼠标右键菜单功能
2020/12/08 Python
Python之NumPy(axis=0 与axis=1)区分详解
2019/05/27 Python
Python初学者常见错误详解
2019/07/02 Python
Django 路由控制的实现
2019/07/17 Python
python PyAutoGUI 模拟鼠标键盘操作和截屏功能
2019/08/04 Python
在Python中使用MongoEngine操作数据库教程实例
2019/12/03 Python
CSS3 渐变(Gradients)之CSS3 径向渐变
2016/07/08 HTML / CSS
Shopping happy life西班牙:以最优惠的价格提供最好的时尚配饰
2020/03/13 全球购物
工作推荐信范文
2014/05/10 职场文书
2014年小学数学教师工作总结
2014/12/03 职场文书
先进教师个人总结
2015/02/11 职场文书
担保贷款承诺书
2015/04/30 职场文书
2016小学教师读书心得体会
2016/01/13 职场文书
MySQL8.0无法启动3534的解决方法
2021/06/03 MySQL
MySQL中datetime时间字段的四舍五入操作
2021/10/05 MySQL
JS中如何优雅的使用async await详解
2021/10/05 Javascript
MySQL数据库完全卸载的方法
2022/03/03 MySQL
Python使用MapReduce进行简单的销售统计
2022/04/22 Python