python爬虫之xpath的基本使用详解


Posted in Python onApril 18, 2018

一、简介

XPath 是一门在 XML 文档中查找信息的语言。XPath 可用来在 XML 文档中对元素和属性进行遍历。XPath 是 W3C XSLT 标准的主要元素,并且 XQuery 和 XPointer 都构建于 XPath 表达之上。 

二、安装

pip3 install lxml

三、使用

1、导入

from lxml import etree

2、基本使用

from lxml import etree
wb_data = """
    <div>
      <ul>
         <li class="item-0"><a href="link1.html" rel="external nofollow" rel="external nofollow" rel="external nofollow" >first item</a></li>

         <li class="item-1"><a href="link2.html" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" >second item</a></li>

         <li class="item-inactive"><a href="link3.html" rel="external nofollow" rel="external nofollow" rel="external nofollow" >third item</a></li>

         <li class="item-1"><a href="link4.html" rel="external nofollow" rel="external nofollow" rel="external nofollow" >fourth item</a></li>

         <li class="item-0"><a href="link5.html" rel="external nofollow" rel="external nofollow" rel="external nofollow" >fifth item</a>
       </ul>
     </div>

    """
html = etree.HTML(wb_data)
print(html)
result = etree.tostring(html)
print(result.decode("utf-8"))

从下面的结果来看,我们打印机html其实就是一个python对象,etree.tostring(html)则是不全里html的基本写法,补全了缺胳膊少腿的标签。

<Element html at 0x39e58f0>
<html><body><div>
      <ul>
         <li class="item-0"><a href="link1.html" rel="external nofollow" rel="external nofollow" rel="external nofollow" >first item</a></li>

         <li class="item-1"><a href="link2.html" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" >second item</a></li>

         <li class="item-inactive"><a href="link3.html" rel="external nofollow" rel="external nofollow" rel="external nofollow" >third item</a></li>

         <li class="item-1"><a href="link4.html" rel="external nofollow" rel="external nofollow" rel="external nofollow" >fourth item</a></li>

         <li class="item-0"><a href="link5.html" rel="external nofollow" rel="external nofollow" rel="external nofollow" >fifth item</a>

       </li></ul>
     </div>
    </body></html>

3、获取某个标签的内容(基本使用),注意,获取a标签的所有内容,a后面就不用再加正斜杠,否则报错。

写法一

html = etree.HTML(wb_data)

html_data = html.xpath('/html/body/div/ul/li/a')

print(html)

for i in html_data:

  print(i.text)

<Element html at 0x12fe4b8>

first item

second item

third item

fourth item

fifth item

写法二(直接在需要查找内容的标签后面加一个/text()就行)

html = etree.HTML(wb_data)

html_data = html.xpath('/html/body/div/ul/li/a/text()')

print(html)

for i in html_data:

  print(i) 

<Element html at 0x138e4b8>

first item

second item

third item

fourth item

fifth item

4、打开读取html文件

#使用parse打开html的文件

html = etree.parse('test.html')

html_data = html.xpath('//*')<br>#打印是一个列表,需要遍历

print(html_data)

for i in html_data:

  print(i.text)
html = etree.parse('test.html')

html_data = etree.tostring(html,pretty_print=True)

res = html_data.decode('utf-8')

print(res)

 

打印:

<div>

   <ul>

     <li class="item-0"><a href="link1.html" rel="external nofollow" rel="external nofollow" rel="external nofollow" >first item</a></li>

     <li class="item-1"><a href="link2.html" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" >second item</a></li>

     <li class="item-inactive"><a href="link3.html" rel="external nofollow" rel="external nofollow" rel="external nofollow" >third item</a></li>

     <li class="item-1"><a href="link4.html" rel="external nofollow" rel="external nofollow" rel="external nofollow" >fourth item</a></li>

     <li class="item-0"><a href="link5.html" rel="external nofollow" rel="external nofollow" rel="external nofollow" >fifth item</a></li>

   </ul>

</div>

5、打印指定路径下a标签的属性(可以通过遍历拿到某个属性的值,查找标签的内容)

html = etree.HTML(wb_data)

html_data = html.xpath('/html/body/div/ul/li/a/@href')

for i in html_data:

  print(i)

打印:

link1.html

link2.html

link3.html

link4.html

link5.html

6、我们知道我们使用xpath拿到得都是一个个的ElementTree对象,所以如果需要查找内容的话,还需要遍历拿到数据的列表。

查到绝对路径下a标签属性等于link2.html的内容。

html = etree.HTML(wb_data)

html_data = html.xpath('/html/body/div/ul/li/a[@href="link2.html" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" ]/text()')

print(html_data)

for i in html_data:

  print(i)

打印:

['second item']

second item

7、上面我们找到全部都是绝对路径(每一个都是从根开始查找),下面我们查找相对路径,例如,查找所有li标签下的a标签内容。

html = etree.HTML(wb_data)

html_data = html.xpath('//li/a/text()')

print(html_data)

for i in html_data:

  print(i)

打印:

['first item', 'second item', 'third item', 'fourth item', 'fifth item']

first item

second item

third item

fourth item

fifth item

8、上面我们使用绝对路径,查找了所有a标签的属性等于href属性值,利用的是/---绝对路径,下面我们使用相对路径,查找一下l相对路径下li标签下的a标签下的href属性的值,注意,a标签后面需要双//。

html = etree.HTML(wb_data)

html_data = html.xpath('//li/a//@href')

print(html_data)

for i in html_data:

  print(i)

打印:

['link1.html', 'link2.html', 'link3.html', 'link4.html', 'link5.html']

link1.html

link2.html

link3.html

link4.html

link5.html

9、相对路径下跟绝对路径下查特定属性的方法类似,也可以说相同。

html = etree.HTML(wb_data)

html_data = html.xpath('//li/a[@href="link2.html" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" ]')

print(html_data)

for i in html_data:

  print(i.text)

打印:

[<Element a at 0x216e468>]

second item

10、查找最后一个li标签里的a标签的href属性

html = etree.HTML(wb_data)

html_data = html.xpath('//li[last()]/a/text()')

print(html_data)

for i in html_data:

  print(i)

打印:

['fifth item']

fifth item

11、查找倒数第二个li标签里的a标签的href属性 

html = etree.HTML(wb_data)

html_data = html.xpath('//li[last()-1]/a/text()')

print(html_data)

for i in html_data:

  print(i)

打印:

['fourth item']

fourth item

12、如果在提取某个页面的某个标签的xpath路径的话,可以如下图:

//*[@id="kw"]

解释:使用相对路径查找所有的标签,属性id等于kw的标签。

python爬虫之xpath的基本使用详解

常用

#!/usr/bin/env python
# -*- coding:utf-8 -*-
from scrapy.selector import Selector, HtmlXPathSelector
from scrapy.http import HtmlResponse
html = """<!DOCTYPE html>
<html>
  <head lang="en">
    <meta charset="UTF-8">
    <title></title>
  </head>
  <body>
    <ul>
      <li class="item-"><a id='i1' href="link.html" rel="external nofollow" rel="external nofollow" >first item</a></li>
      <li class="item-0"><a id='i2' href="llink.html" rel="external nofollow" >first item</a></li>
      <li class="item-1"><a href="llink2.html" rel="external nofollow" rel="external nofollow" >second item<span>vv</span></a></li>
    </ul>
    <div><a href="llink2.html" rel="external nofollow" rel="external nofollow" >second item</a></div>
  </body>
</html>
"""
response = HtmlResponse(url='http://example.com', body=html,encoding='utf-8')
# hxs = HtmlXPathSelector(response)
# print(hxs)
# hxs = Selector(response=response).xpath('//a')
# print(hxs)
# hxs = Selector(response=response).xpath('//a[2]')
# print(hxs)
# hxs = Selector(response=response).xpath('//a[@id]')
# print(hxs)
# hxs = Selector(response=response).xpath('//a[@id="i1"]')
# print(hxs)
# hxs = Selector(response=response).xpath('//a[@href="link.html" rel="external nofollow" rel="external nofollow" ][@id="i1"]')
# print(hxs)
# hxs = Selector(response=response).xpath('//a[contains(@href, "link")]')
# print(hxs)
# hxs = Selector(response=response).xpath('//a[starts-with(@href, "link")]')
# print(hxs)
# hxs = Selector(response=response).xpath('//a[re:test(@id, "i\d+")]')
# print(hxs)
# hxs = Selector(response=response).xpath('//a[re:test(@id, "i\d+")]/text()').extract()
# print(hxs)
# hxs = Selector(response=response).xpath('//a[re:test(@id, "i\d+")]/@href').extract()
# print(hxs)
# hxs = Selector(response=response).xpath('/html/body/ul/li/a/@href').extract()
# print(hxs)
# hxs = Selector(response=response).xpath('//body/ul/li/a/@href').extract_first()
# print(hxs)
 
# ul_list = Selector(response=response).xpath('//body/ul/li')
# for item in ul_list:
#   v = item.xpath('./a/span')
#   # 或
#   # v = item.xpath('a/span')
#   # 或
#   # v = item.xpath('*/a/span')
#   print(v)

以上就是本文的全部内容,希望对大家的学习有所帮助,也希望大家多多支持三水点靠木。

Python 相关文章推荐
python中实现定制类的特殊方法总结
Sep 28 Python
举例讲解如何在Python编程中进行迭代和遍历
Jan 19 Python
python文件与目录操作实例详解
Feb 22 Python
python bottle框架支持jquery ajax的RESTful风格的PUT和DELETE方法
May 24 Python
Python实现购物系统(示例讲解)
Sep 13 Python
python+pyqt实现右下角弹出框
Oct 26 Python
python hook监听事件详解
Oct 25 Python
Python提取频域特征知识点浅析
Mar 04 Python
python调用pyaudio使用麦克风录制wav声音文件的教程
Jun 26 Python
python 通过视频url获取视频的宽高方式
Dec 10 Python
python的链表基础知识点
Sep 13 Python
Python pygame实现中国象棋单机版源码
Jun 20 Python
基于python list对象中嵌套元组使用sort时的排序方法
Apr 18 #Python
python购物车程序简单代码
Apr 18 #Python
python list元素为tuple时的排序方法
Apr 18 #Python
详谈Python中列表list,元祖tuple和numpy中的array区别
Apr 18 #Python
Python3实现购物车功能
Apr 18 #Python
Python numpy 点数组去重的实例
Apr 18 #Python
对numpy中轴与维度的理解
Apr 18 #Python
You might like
PHP中UNIX时间戳和日期间的转换与计算实例
2014/11/19 PHP
php数组去除空值函数分享
2015/02/02 PHP
PHP抓取淘宝商品的用户晒单评论+图片+搜索商品列表实例
2016/04/14 PHP
PHP+MySQL实现消息队列的方法分析
2018/05/09 PHP
laravel 验证错误信息到 blade模板的方法
2019/09/29 PHP
jQuery UI Dialog 创建友好的弹出对话框实现代码
2012/04/12 Javascript
jquery mobile changepage的三种传参方法介绍
2013/09/13 Javascript
jQuery实现页面内锚点平滑跳转特效的方法总结
2015/05/11 Javascript
关于javascript中dataset的问题小结
2015/11/16 Javascript
js实现自定义进度条效果
2017/03/15 Javascript
关于meta viewport中target-densitydpi属性详解(推荐)
2017/08/18 Javascript
详解JavaScript中的数组合并方法和对象合并方法
2018/05/11 Javascript
使用layui 渲染table数据表格的实例代码
2018/08/19 Javascript
微信小程序页面间传值与页面取值操作实例分析
2019/04/30 Javascript
微信小程序实现圆形进度条动画
2020/11/18 Javascript
layui动态表头的实现代码
2019/08/22 Javascript
浅谈Vue3.0新版API之composition-api入坑指南
2020/04/30 Javascript
[02:40]DOTA2英雄基础教程 巨牙海民
2013/12/23 DOTA
[34:44]Liquid vs TNC Supermajor 胜者组 BO3 第二场 6.4
2018/06/05 DOTA
Django实现全文检索的方法(支持中文)
2018/05/14 Python
Python基于Tkinter模块实现的弹球小游戏
2018/12/27 Python
Python求解正态分布置信区间教程
2019/11/20 Python
python ubplot使用方法解析
2020/01/10 Python
Python图像处理库PIL的ImageFilter模块使用介绍
2020/02/26 Python
Python 使用 PyQt5 开发的关机小工具分享
2020/07/16 Python
PyTorch中clone()、detach()及相关扩展详解
2020/12/09 Python
房地产活动策划方案
2014/05/14 职场文书
啦啦队口号大全
2014/06/16 职场文书
电话客服工作职责
2014/07/27 职场文书
元旦趣味活动方案
2014/08/22 职场文书
党员志愿者活动方案
2014/08/28 职场文书
社区元宵节活动总结
2015/02/06 职场文书
投资申请报告
2015/05/19 职场文书
中国梦宣传标语口号
2015/12/26 职场文书
Python包管理工具pip的15 个使用小技巧
2021/05/17 Python
德劲DE1105机评
2022/04/05 无线电