使用Python爬虫库BeautifulSoup遍历文档树并对标签进行操作详解


Posted in Python onJanuary 25, 2020

下面就是使用Python爬虫库BeautifulSoup对文档树进行遍历并对标签进行操作的实例,都是最基础的内容

html_doc = """
<html><head><title>The Dormouse's story</title></head>

<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
"""

from bs4 import BeautifulSoup
soup = BeautifulSoup(html_doc,'lxml')

一、子节点

一个Tag可能包含多个字符串或者其他Tag,这些都是这个Tag的子节点.BeautifulSoup提供了许多操作和遍历子结点的属性。

1.通过Tag的名字来获得Tag

print(soup.head)
print(soup.title)
<head><title>The Dormouse's story</title></head>
<title>The Dormouse's story</title>

通过名字的方法只能获得第一个Tag,如果要获得所有的某种Tag可以使用find_all方法

soup.find_all('a')
[<a class="sister" href="http://example.com/elsie" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" id="link1">Elsie</a>,
 <a class="sister" href="http://example.com/lacie" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" id="link2">Lacie</a>,
 <a class="sister" href="http://example.com/tillie" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" id="link3">Tillie</a>]

2.contents属性:将Tag的子节点通过列表的方式返回

head_tag = soup.head
head_tag.contents
[<title>The Dormouse's story</title>]
title_tag = head_tag.contents[0]
title_tag
<title>The Dormouse's story</title>
title_tag.contents
["The Dormouse's story"]

3.children:通过该属性对子节点进行循环

for child in title_tag.children:
  print(child)
The Dormouse's story

4.descendants: 不论是contents还是children都是返回直接子节点,而descendants对所有tag的子孙节点进行递归循环

for child in head_tag.children:
  print(child)
<title>The Dormouse's story</title>
for child in head_tag.descendants:
  print(child)
<title>The Dormouse's story</title>
The Dormouse's story

5.string 如果tag只有一个NavigableString类型的子节点,那么tag可以使用.string得到该子节点

title_tag.string
"The Dormouse's story"

如果一个tag只有一个子节点,那么使用.string可以获得其唯一子结点的NavigableString.

head_tag.string
"The Dormouse's story"

如果tag有多个子节点,tag无法确定.string对应的是那个子结点的内容,故返回None

print(soup.html.string)
None

6.strings和stripped_strings

如果tag包含多个字符串,可以使用.strings循环获取

for string in soup.strings:
  print(string)
The Dormouse's story


The Dormouse's story


Once upon a time there were three little sisters; and their names were

Elsie
,

Lacie
 and

Tillie
;
and they lived at the bottom of a well.


...

.string输出的内容包含了许多空格和空行,使用strpped_strings去除这些空白内容

for string in soup.stripped_strings:
  print(string)
The Dormouse's story
The Dormouse's story
Once upon a time there were three little sisters; and their names were
Elsie
,
Lacie
and
Tillie
;
and they lived at the bottom of a well.
...

二、父节点

1.parent:获得某个元素的父节点

title_tag = soup.title
title_tag.parent
<head><title>The Dormouse's story</title></head>

字符串也有父节点

title_tag.string.parent
<title>The Dormouse's story</title>

2.parents:递归的获得所有父辈节点

link = soup.a
for parent in link.parents:
  if parent is None:
    print(parent)
  else:
    print(parent.name)
p
body
html
[document]

三、兄弟结点

sibling_soup = BeautifulSoup("<a><b>text1</b><c>text2</c></b></a>",'lxml')
print(sibling_soup.prettify())
<html>
 <body>
 <a>
  <b>
  text1
  </b>
  <c>
  text2
  </c>
 </a>
 </body>
</html>

1.next_sibling和previous_sibling

sibling_soup.b.next_sibling
<c>text2</c>
sibling_soup.c.previous_sibling
<b>text1</b>

在实际文档中.next_sibling和previous_sibling通常是字符串或者空白符

soup.find_all('a')
[<a class="sister" href="http://example.com/elsie" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" id="link1">Elsie</a>,
 <a class="sister" href="http://example.com/lacie" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" id="link2">Lacie</a>,
 <a class="sister" href="http://example.com/tillie" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" id="link3">Tillie</a>]
soup.a.next_sibling # 第一个<a></a>的next_sibling是,\n
',\n'
soup.a.next_sibling.next_sibling
<a class="sister" href="http://example.com/lacie" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" id="link2">Lacie</a>

2.next_siblings和previous_siblings

for sibling in soup.a.next_siblings:
  print(repr(sibling))
',\n'
<a class="sister" href="http://example.com/lacie" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" id="link2">Lacie</a>
' and\n'
<a class="sister" href="http://example.com/tillie" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" id="link3">Tillie</a>
';\nand they lived at the bottom of a well.'
for sibling in soup.find(id="link3").previous_siblings:
  print(repr(sibling))
' and\n'
<a class="sister" href="http://example.com/lacie" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" id="link2">Lacie</a>
',\n'
<a class="sister" href="http://example.com/elsie" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" id="link1">Elsie</a>
'Once upon a time there were three little sisters; and their names were\n'

四、回退与前进

1.next_element和previous_element

指向下一个或者前一个被解析的对象(字符串或tag),即深度优先遍历的后序节点和前序节点

last_a_tag = soup.find("a", id="link3")
print(last_a_tag.next_sibling)
print(last_a_tag.next_element)
;
and they lived at the bottom of a well.
Tillie
last_a_tag.previous_element
' and\n'

2.next_elements和previous_elements

通过.next_elements和previous_elements可以向前或向后访问文档的解析内容,就好像文档正在被解析一样

for element in last_a_tag.next_elements:
  print(repr(element))
'Tillie'
';\nand they lived at the bottom of a well.'
'\n'
<p class="story">...</p>
'...'
'\n'

更多关于使用Python爬虫库BeautifulSoup遍历文档树并对标签进行操作的方法与文章大家可以点击下面的相关文章

Python 相关文章推荐
线程和进程的区别及Python代码实例
Feb 04 Python
python中字典(Dictionary)用法实例详解
May 30 Python
使用Python生成XML的方法实例
Mar 21 Python
Python基于Matplotlib库简单绘制折线图的方法示例
Aug 14 Python
Python 判断是否为质数或素数的实例
Oct 30 Python
python2.6.6如何升级到python2.7.14
Apr 08 Python
Python中断多重循环的思路总结
Oct 04 Python
Python+OpenCV+图片旋转并用原底色填充新四角的例子
Dec 12 Python
Python -m参数原理及使用方法解析
Aug 21 Python
python 利用Pyinstaller打包Web项目
Oct 23 Python
Pyqt助手安装PyQt5帮助文档过程图解
Nov 20 Python
python自动化调用百度api解决验证码
Apr 13 Python
Python爬虫库BeautifulSoup获取对象(标签)名,属性,内容,注释
Jan 25 #Python
Python爬虫库BeautifulSoup的介绍与简单使用实例
Jan 25 #Python
使用Python爬虫库requests发送表单数据和JSON数据
Jan 25 #Python
Python爬虫库requests获取响应内容、响应状态码、响应头
Jan 25 #Python
使用Python爬虫库requests发送请求、传递URL参数、定制headers
Jan 25 #Python
flask框架自定义url转换器操作详解
Jan 25 #Python
常用python爬虫库介绍与简要说明
Jan 25 #Python
You might like
php 获取完整url地址
2008/12/20 PHP
PHP支持多种格式图片上传(支持jpg、png、gif)
2011/11/03 PHP
深入PHP购物车模块功能分析(函数讲解,附源码)
2013/06/25 PHP
[原创]PHP实现字节数Byte转换为KB、MB、GB、TB的方法
2017/08/31 PHP
服务端 VBScript 与 JScript 几个相同特性的写法 By shawl.qiu
2007/03/06 Javascript
javascript 运算数的求值顺序
2011/08/23 Javascript
jquery.tmpl JQuery模板插件
2011/10/10 Javascript
分享一则JavaScript滚动条插件源码
2015/03/03 Javascript
Angularjs全局变量被作用域监听的正确姿势
2016/02/06 Javascript
Nodejs学习item【入门手上】
2016/05/05 NodeJs
Bootstrap Table服务器分页与在线编辑应用总结
2016/08/08 Javascript
使用JS实现图片展示瀑布流效果(简单实例)
2016/09/06 Javascript
AngularJS获取json数据的方法详解
2017/05/27 Javascript
nodejs+mongodb aggregate级联查询操作示例
2018/03/17 NodeJs
vue.js中实现登录控制的方法示例
2018/04/23 Javascript
Vue 组件传值几种常用方法【总结】
2018/05/28 Javascript
vue+elementUI实现简单日历功能
2020/09/24 Javascript
[03:11]2014DOTA2国际邀请赛-VG掉入败者组 独家专访357
2014/07/19 DOTA
[41:37]DOTA2北京网鱼队选拔赛——冲击职业之路
2015/04/13 DOTA
python if not in 多条件判断代码
2016/09/21 Python
Python获取本机所有网卡ip,掩码和广播地址实例代码
2018/01/22 Python
python 使用正则表达式按照多个空格分割字符的实例
2018/12/20 Python
python树莓派红外反射传感器
2019/01/21 Python
Python下简易的单例模式详解
2019/04/08 Python
Python调用SMTP服务自动发送Email的实现步骤
2021/02/07 Python
CSS3实现曲线阴影和翘边阴影
2016/05/03 HTML / CSS
HTML5实现视频直播功能思路详解
2017/11/16 HTML / CSS
教师师德承诺书
2014/03/26 职场文书
物流管理专业自荐信
2014/06/23 职场文书
教师自查自纠材料
2014/10/14 职场文书
奖学金感谢信
2015/01/21 职场文书
前台接待岗位职责范本
2015/04/03 职场文书
毕业晚宴祝酒词
2015/08/11 职场文书
详解如何在Canvas中添加事件的方法
2021/04/17 Javascript
如何利用python创作字符画
2022/06/25 Python
mysql数据库如何转移到oracle
2022/12/24 MySQL