Traversing the Document Tree and Working with Tags Using the Python Scraping Library BeautifulSoup: A Detailed Guide


Posted in Python on January 25, 2020

The examples below show how to traverse the document tree and work with tags using the Python scraping library BeautifulSoup. They cover only the basics.

html_doc = """
<html><head><title>The Dormouse's story</title></head>

<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
"""

from bs4 import BeautifulSoup
soup = BeautifulSoup(html_doc,'lxml')

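I. Child nodes

A Tag may contain multiple strings or other Tags; all of these are children of that Tag. BeautifulSoup provides many attributes for navigating and iterating over child nodes.

1. Getting a Tag by its name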

print(soup.head)
print(soup.title)
<head><title>The Dormouse's story</title></head>
<title>The Dormouse's story</title>

Access by name returns only the first matching Tag. To get every Tag of a given kind, use the find_all method.

soup.find_all('a')
[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
 <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
 <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
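
As a small illustration of the difference between attribute access and find_all, here is a minimal sketch. The markup is a shortened stand-in for the document above, and "html.parser" is used only because it ships with Python; both are assumptions made for this example.

```python
from bs4 import BeautifulSoup

# Shortened stand-in markup (an assumption for this sketch).
html = '<p><a id="link1">Elsie</a>, <a id="link2">Lacie</a> and <a id="link3">Tillie</a></p>'
soup = BeautifulSoup(html, "html.parser")

first_link = soup.a               # attribute access: the first <a> only
all_links = soup.find_all("a")    # every <a> in the document, as a list

print(first_link["id"])                       # link1
print([link["id"] for link in all_links])     # ['link1', 'link2', 'link3']
print(soup.table)                             # None, because there is no <table>
```

Note that access by name returns None rather than raising an error when no such tag exists, so it is worth checking the result before using it.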

2. The contents attribute: returns a Tag's children as a list

head_tag = soup.head
head_tag.contents
[<title>The Dormouse's story</title>]
title_tag = head_tag.contents[0]
title_tag
<title>The Dormouse's story</title>
title_tag.contents
["The Dormouse's story"]

3. children: use this attribute to iterate over a Tag's direct children

for child in title_tag.children:
  print(child)
The Dormouse's story
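
To make the relationship between contents and children concrete, the following sketch uses a made-up one-line fragment and the built-in "html.parser" (both assumptions). It shows that contents is an ordinary list, while children walks over the same direct children one at a time.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Made-up fragment: a <p> with a string, a <b> tag, and another string.
soup = BeautifulSoup("<p>Hello <b>big</b> world</p>", "html.parser")
p_tag = soup.p

# .contents is a plain list, so len() and indexing work.
print(len(p_tag.contents))    # 3
print(p_tag.contents[1])      # <b>big</b>

# .children yields the same direct children; keep only the Tag objects here.
child_tags = [child.name for child in p_tag.children if isinstance(child, Tag)]
print(child_tags)             # ['b']
```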

4. descendants: both contents and children return only direct children, whereas descendants iterates recursively over all of a tag's descendants

for child in head_tag.children:
  print(child)
<title>The Dormouse's story</title>
for child in head_tag.descendants:
  print(child)
<title>The Dormouse's story</title>
The Dormouse's story
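
The difference is easier to see on a slightly deeper tree. This is a sketch on a made-up fragment, parsed with "html.parser" as an assumption, that counts direct children and all descendants of the same tag.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Made-up fragment with two levels of nesting.
soup = BeautifulSoup("<div><p>One <b>two</b></p><p>three</p></div>", "html.parser")
div = soup.div

direct = list(div.children)       # the two <p> tags only
nested = list(div.descendants)    # every tag and string below <div>, in document order

print(len(direct))    # 2
print(len(nested))    # 6
print([node.name for node in nested if isinstance(node, Tag)])    # ['p', 'b', 'p']
```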

5. string: if a tag has only one child and that child is a NavigableString, .string returns that child

title_tag.string
"The Dormouse's story"

If a tag's only child is another tag, and that child tag has a .string, then .string on the parent returns the same NavigableString.

head_tag.string
"The Dormouse's story"

If a tag has more than one child, it is unclear which child .string should refer to, so it returns None

print(soup.html.string)
None
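
The three cases above can be reproduced side by side. The sketch below uses a made-up fragment and "html.parser" (both assumptions). It also calls get_text(), a BeautifulSoup method not covered in this article, as one possible way to get the text when .string returns None.

```python
from bs4 import BeautifulSoup

# Made-up fragment: the first <p> has a single child tag, the second has two children.
soup = BeautifulSoup("<div><p><b>only</b></p><p>one <i>two</i></p></div>", "html.parser")
first_p, second_p = soup.find_all("p")

print(first_p.b.string)       # only  (single NavigableString child)
print(first_p.string)         # only  (single child tag, so its .string is reused)
print(second_p.string)        # None  (two children, so .string is ambiguous)

# get_text() concatenates all strings below the tag.
print(second_p.get_text())    # one two
```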

6. strings and stripped_strings

If a tag contains more than one string, use .strings to iterate over them

for string in soup.strings:
  print(string)
The Dormouse's story


The Dormouse's story


Once upon a time there were three little sisters; and their names were

Elsie
,

Lacie
 and

Tillie
;
and they lived at the bottom of a well.


...

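The output of .strings contains a lot of extra whitespace and blank lines. Use stripped_strings instead: it skips strings that consist only of whitespace and strips leading and trailing whitespace from the rest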

for string in soup.stripped_strings:
  print(string)
The Dormouse's story
The Dormouse's story
Once upon a time there were three little sisters; and their names were
Elsie
,
Lacie
and
Tillie
;
and they lived at the bottom of a well.
...
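
A common use of stripped_strings is to collect the cleaned pieces into a list or join them into one line. The following is a minimal sketch on a made-up fragment, again parsed with "html.parser" as an assumption.

```python
from bs4 import BeautifulSoup

# Made-up fragment with indentation and line breaks between the pieces of text.
html = "<p>\n  Once upon a time\n  <a>Elsie</a>,\n  <a>Lacie</a>\n</p>"
soup = BeautifulSoup(html, "html.parser")

raw = list(soup.strings)               # includes whitespace-only strings
clean = list(soup.stripped_strings)    # whitespace-only strings skipped, the rest stripped

print(len(raw))           # 5
print(clean)              # ['Once upon a time', 'Elsie', ',', 'Lacie']
print(" ".join(clean))    # Once upon a time Elsie , Lacie
```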

II. Parent nodes

1. parent: get an element's parent node

title_tag = soup.title
title_tag.parent
<head><title>The Dormouse's story</title></head>

Strings have a parent too

title_tag.string.parent
<title>The Dormouse's story</title>

2. parents: iterate over all of an element's ancestors, from its parent up to the top of the document

link = soup.a
for parent in link.parents:
  if parent is None:
    print(parent)
  else:
    print(parent.name)
p
body
html
[document]
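
The same walk can be written as a list comprehension. This sketch assumes a made-up fragment and "html.parser". It also shows that the last ancestor is the BeautifulSoup object itself, whose name is "[document]" and whose parent is None.

```python
from bs4 import BeautifulSoup

# Made-up fragment with an explicit <html> and <body>.
soup = BeautifulSoup("<html><body><p><a>link</a></p></body></html>", "html.parser")

# Collect the names of all ancestors of the first <a>, nearest first.
ancestors = [parent.name for parent in soup.a.parents]
print(ancestors)      # ['p', 'body', 'html', '[document]']

# The BeautifulSoup object is the top of the tree and has no parent.
print(soup.parent)    # None
```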

III. Sibling nodes

sibling_soup = BeautifulSoup("<a><b>text1</b><c>text2</c></a>", 'lxml')
print(sibling_soup.prettify())
<html>
 <body>
  <a>
   <b>
    text1
   </b>
   <c>
    text2
   </c>
  </a>
 </body>
</html>

1. next_sibling and previous_sibling

sibling_soup.b.next_sibling
<c>text2</c>
sibling_soup.c.previous_sibling
<b>text1</b>

In real documents, the .next_sibling or .previous_sibling of a tag is usually a string or whitespace

soup.find_all('a')
[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
 <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
 <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
soup.a.next_sibling  # the next_sibling of the first <a> is the string ',\n'
',\n'
soup.a.next_sibling.next_sibling
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>

2. next_siblings and previous_siblings

for sibling in soup.a.next_siblings:
  print(repr(sibling))
',\n'
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
' and\n'
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
';\nand they lived at the bottom of a well.'
for sibling in soup.find(id="link3").previous_siblings:
  print(repr(sibling))
' and\n'
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
',\n'
<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>
'Once upon a time there were three little sisters; and their names were\n'
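
Because the siblings of a tag are often strings, it can help to skip them when only tags are of interest. The sketch below, on a made-up fragment parsed with "html.parser" (both assumptions), shows two ways to do that: filtering next_siblings with isinstance, and calling find_next_sibling, a BeautifulSoup method not covered in this article.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Made-up fragment: three links separated by short strings.
html = '<p>Names: <a id="link1">Elsie</a>, <a id="link2">Lacie</a> and <a id="link3">Tillie</a>.</p>'
soup = BeautifulSoup(html, "html.parser")
first = soup.a

# The immediate next sibling is the string between the first two links.
print(repr(str(first.next_sibling)))    # ', '

# Keep only the siblings that are tags.
tag_siblings = [s["id"] for s in first.next_siblings if isinstance(s, Tag)]
print(tag_siblings)                     # ['link2', 'link3']

# find_next_sibling skips over strings and returns the next matching tag.
next_link = first.find_next_sibling("a")
print(next_link["id"])                  # link2
```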

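IV. Going back and forth

1. next_element and previous_element

These point to the object (string or tag) that was parsed immediately after or immediately before the current one, that is, the next and previous nodes in a depth-first traversal of the document.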

last_a_tag = soup.find("a", id="link3")
print(last_a_tag.next_sibling)
print(last_a_tag.next_element)
;
and they lived at the bottom of a well.
Tillie
last_a_tag.previous_element
' and\n'
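
The distinction between next_sibling and next_element is easy to mix up, so here is a minimal sketch on a made-up fragment, parsed with "html.parser" as an assumption. The sibling is the next node at the same level, while the element is whatever was parsed next, which for a tag with content is its own first child.

```python
from bs4 import BeautifulSoup

# Made-up fragment: a link followed by a string inside one paragraph.
soup = BeautifulSoup("<p><a>Tillie</a>; the end</p>", "html.parser")
a_tag = soup.a

print(a_tag.next_sibling)             # ; the end  (next node at the same level)
print(a_tag.next_element)             # Tillie     (parsed right after the <a> start tag)
print(a_tag.previous_element.name)    # p          (parsed right before the <a> tag)

# From the string inside <a>, the next parsed object is the string after </a>.
print(a_tag.string.next_element)      # ; the end
```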

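2. next_elements and previous_elements

With .next_elements and .previous_elements you can move forward or backward through the document's content in the order it was parsed, as if the document were being parsed again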

for element in last_a_tag.next_elements:
  print(repr(element))
'Tillie'
';\nand they lived at the bottom of a well.'
'\n'
<p class="story">...</p>
'...'
'\n'
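
Since next_elements yields both tags and strings, a typical pattern is to split the two with isinstance. This is a sketch under the same assumptions as the earlier examples: a made-up fragment and the built-in "html.parser".

```python
from bs4 import BeautifulSoup
from bs4.element import NavigableString, Tag

# Made-up fragment; start from the <b> tag and look at everything parsed after it.
soup = BeautifulSoup("<div><p>one <b>two</b></p><p>three</p></div>", "html.parser")
start = soup.b

tags_after = [e.name for e in start.next_elements if isinstance(e, Tag)]
texts_after = [str(e) for e in start.next_elements if isinstance(e, NavigableString)]

print(tags_after)     # ['p']
print(texts_after)    # ['two', 'three']
```

The first string, 'two', belongs to the starting tag itself, because next_elements begins with the tag's own content before moving on to the rest of the document.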
