Traversing the Document Tree and Working with Tags Using the Python Scraping Library BeautifulSoup: A Detailed Guide

Posted in Python on January 25, 2020

Below are examples of using the Python scraping library BeautifulSoup to traverse the document tree and work with tags. All of it is basic material.

html_doc = """
<html><head><title>The Dormouse's story</title></head>

<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
"""

from bs4 import BeautifulSoup
soup = BeautifulSoup(html_doc,'lxml')

I. Child Nodes

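A Tag may contain strings and other Tags; these are all children of that Tag. BeautifulSoup provides many attributes for navigating and iterating over a Tag's children.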

1. Getting a Tag by its name

print(soup.head)
print(soup.title)

<head><title>The Dormouse's story</title></head>
<title>The Dormouse's story</title>

Accessing a tag by name returns only the first Tag with that name. To get every Tag of a given kind, use the find_all method.

soup.find_all('a')

[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
 <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
 <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
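
The difference between the two access styles can be sketched as follows. This is a minimal illustration, not the article's own code: it assumes a small stand-in document rather than html_doc, and it swaps in Python's built-in "html.parser" for lxml so that only bs4 needs to be installed.

```python
from bs4 import BeautifulSoup

# Small stand-in document (an assumption for this sketch, not the article's html_doc).
html = '<p><a id="link1">Elsie</a>, <a id="link2">Lacie</a> and <a id="link3">Tillie</a></p>'
soup = BeautifulSoup(html, "html.parser")

first_link = soup.a                      # attribute access: only the first <a>
all_links = soup.find_all("a")           # find_all: every <a>, in document order
link_ids = [a["id"] for a in all_links]
missing = soup.table                     # a tag name that does not occur gives None

print(first_link["id"])
print(link_ids)
print(missing)
```

Because a missing tag yields None rather than raising, it is worth checking the result before chaining further attribute access onto it.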

2. The contents attribute: returns a Tag's children as a list

head_tag = soup.head
head_tag.contents

[<title>The Dormouse's story</title>]

title_tag = head_tag.contents[0]
title_tag

<title>The Dormouse's story</title>

title_tag.contents

["The Dormouse's story"]
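
Since .contents is an ordinary Python list, the usual list operations apply. The sketch below assumes a hypothetical three-item list document and the built-in "html.parser" (chosen here only to avoid the lxml dependency).

```python
from bs4 import BeautifulSoup

# Hypothetical document used only for this illustration.
soup = BeautifulSoup("<ul><li>one</li><li>two</li><li>three</li></ul>", "html.parser")
items = soup.ul.contents                 # a plain list of the direct children

count = len(items)                       # number of direct children
last_text = items[-1].string             # negative indexing works as on any list
first_two = [li.string for li in items[:2]]  # so does slicing

print(count, last_text, first_two)
```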

3. children: iterate over a Tag's direct children through this attribute

for child in title_tag.children:
  print(child)

The Dormouse's story

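4. descendants: both contents and children return only a Tag's direct children, whereas descendants iterates recursively over all of the Tag's descendants (children, grandchildren, and so on)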

for child in head_tag.children:
  print(child)

<title>The Dormouse's story</title>

for child in head_tag.descendants:
  print(child)

<title>The Dormouse's story</title>
The Dormouse's story
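
The same contrast on a slightly deeper tree, as a rough sketch. The document is a made-up example and "html.parser" is assumed instead of lxml; Tag is imported from bs4 to tell tags apart from strings.

```python
from bs4 import BeautifulSoup, Tag

# Made-up nested document for illustration.
soup = BeautifulSoup("<div><p>Hello <b>world</b></p></div>", "html.parser")
div = soup.div

direct = list(div.children)              # only the <p>
everything = list(div.descendants)       # <p>, 'Hello ', <b>, 'world' (depth-first)
nested_tags = [node.name for node in everything if isinstance(node, Tag)]

print(len(direct), len(everything), nested_tags)
```

Note that descendants includes strings as well as tags, which is why the isinstance check is needed when only tags are wanted.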

5. string: if a tag has exactly one child and that child is a NavigableString, the child is available as .string

title_tag.string

"The Dormouse's story"

If a tag's only child is another tag, and that child tag has a .string, then the parent's .string is the same NavigableString as its child's.

head_tag.string

"The Dormouse's story"

If a tag has more than one child, it is unclear which child .string should refer to, so .string is None

print(soup.html.string)

None
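
When .string is None because a tag has several children, get_text() is one way to obtain the combined text instead. A minimal sketch, assuming a made-up paragraph and the built-in "html.parser":

```python
from bs4 import BeautifulSoup

# Made-up paragraph with two children: a string and a <b> tag.
soup = BeautifulSoup("<p>Hello <b>world</b></p>", "html.parser")
p = soup.p

single = p.b.string                      # <b> has exactly one string child
ambiguous = p.string                     # <p> has two children, so this is None
combined = p.get_text()                  # concatenates all text inside <p>

print(single, ambiguous, combined)
```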

6. strings and stripped_strings

If a tag contains more than one string, you can iterate over them with .strings

for string in soup.strings:
  print(string)

The Dormouse's story


The Dormouse's story


Once upon a time there were three little sisters; and their names were

Elsie
,

Lacie
 and

Tillie
;
and they lived at the bottom of a well.


...

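The output of .strings contains a lot of extra whitespace and blank lines. Use stripped_strings instead: it strips leading and trailing whitespace from each string and skips strings that consist only of whitespace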

for string in soup.stripped_strings:
  print(string)

The Dormouse's story
The Dormouse's story
Once upon a time there were three little sisters; and their names were
Elsie
,
Lacie
and
Tillie
;
and they lived at the bottom of a well.
...
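
A common use of stripped_strings is to collapse a fragment into one clean line of text. The sketch below assumes a small made-up fragment and the built-in "html.parser"; the variable names are illustrative only.

```python
from bs4 import BeautifulSoup

# Made-up fragment with indentation and padding around the text.
html = "<div>\n  <h1> Title </h1>\n  <p>First paragraph.</p>\n</div>"
soup = BeautifulSoup(html, "html.parser")

raw = list(soup.strings)                 # includes the whitespace-only strings
clean = list(soup.stripped_strings)      # trimmed, whitespace-only strings skipped
joined = " ".join(clean)                 # one line of text

print(len(raw), clean, joined)
```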

II. Parent Nodes

1. parent: get an element's parent node

title_tag = soup.title
title_tag.parent

<head><title>The Dormouse's story</title></head>

Strings have a parent too

title_tag.string.parent

<title>The Dormouse's story</title>

2. parents: iterate over all of an element's ancestors, from its parent up to the top of the document

link = soup.a
for parent in link.parents:
  if parent is None:
    print(parent)
  else:
    print(parent.name)

p
body
html
[document]
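
The ancestor names can also be collected into a list, for example to build a simple path to an element. This is a sketch under stated assumptions: the document is made up, it spells out html and body explicitly because the built-in "html.parser" (used here instead of lxml) does not add them, and the " > " path format is just an illustrative choice.

```python
from bs4 import BeautifulSoup

# Made-up document with explicit <html> and <body>.
soup = BeautifulSoup("<html><body><p><a>link</a></p></body></html>", "html.parser")

names = [parent.name for parent in soup.a.parents]
# Drop the top-level '[document]' entry and reverse to read from the root down.
path = " > ".join(reversed(names[:-1]))

print(names)
print(path)
```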

III. Sibling Nodes

sibling_soup = BeautifulSoup("<a><b>text1</b><c>text2</c></a>", 'lxml')
print(sibling_soup.prettify())

<html>
 <body>
  <a>
   <b>
    text1
   </b>
   <c>
    text2
   </c>
  </a>
 </body>
</html>

1. next_sibling and previous_sibling

sibling_soup.b.next_sibling

<c>text2</c>

sibling_soup.c.previous_sibling

<b>text1</b>

In real documents, the .next_sibling or .previous_sibling of a tag is usually a string, often nothing but whitespace

soup.find_all('a')

[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
 <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
 <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

soup.a.next_sibling # the next_sibling of the first <a> is the string ',\n'

',\n'

soup.a.next_sibling.next_sibling

<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
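
Chaining .next_sibling twice works here but depends on exactly one string sitting between the tags. The find_next_sibling and find_next_siblings methods skip over strings and return only matching tags. A minimal sketch, assuming a cut-down version of the three-sisters paragraph and the built-in "html.parser":

```python
from bs4 import BeautifulSoup

# Cut-down stand-in for the article's paragraph of links.
html = '<p><a id="link1">Elsie</a>,\n<a id="link2">Lacie</a> and\n<a id="link3">Tillie</a></p>'
soup = BeautifulSoup(html, "html.parser")
first = soup.a

raw_next = first.next_sibling                    # the string ',\n', not a tag
next_link = first.find_next_sibling("a")         # the next <a>, strings skipped
later_ids = [a["id"] for a in first.find_next_siblings("a")]

print(repr(raw_next), next_link["id"], later_ids)
```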

2. next_siblings and previous_siblings

for sibling in soup.a.next_siblings:
  print(repr(sibling))

',\n'
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
' and\n'
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
';\nand they lived at the bottom of a well.'

for sibling in soup.find(id="link3").previous_siblings:
  print(repr(sibling))

' and\n'
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
',\n'
<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>
'Once upon a time there were three little sisters; and their names were\n'

IV. Going Back and Forth

1. next_element and previous_element

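These point to the object (string or tag) that was parsed immediately after or immediately before the current one, that is, the node's successor and predecessor in a depth-first, document-order traversal. This is why the next_element of a tag is usually its first child, not its next sibling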

last_a_tag = soup.find("a", id="link3")
print(last_a_tag.next_sibling)
print(last_a_tag.next_element)

;
and they lived at the bottom of a well.
Tillie

last_a_tag.previous_element

' and\n'
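
The contrast between sibling navigation and element navigation is easier to see on a tiny tree. The sketch below assumes a made-up two-tag paragraph and the built-in "html.parser".

```python
from bs4 import BeautifulSoup

# Made-up paragraph: <a> and <b> are siblings, each holding one string.
soup = BeautifulSoup("<p><a>one</a><b>two</b></p>", "html.parser")
a = soup.a

sibling = a.next_sibling                 # same level of the tree: <b>two</b>
element = a.next_element                 # parsed right after <a> opened: 'one'
before = a.previous_element              # parsed right before <a>: the <p> tag

print(sibling, repr(element), before.name)
```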

2. next_elements and previous_elements

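With .next_elements and .previous_elements you can move forward or backward through the document's content in the order it was parsed, as if the document were being parsed again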

for element in last_a_tag.next_elements:
  print(repr(element))

'Tillie'
';\nand they lived at the bottom of a well.'
'\n'
<p class="story">...</p>
'...'
'\n'
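
Because these iterators yield both tags and strings, a type check is a simple way to keep only one kind. The following sketch gathers the text that comes after and before a given tag; the document is made up, "html.parser" is assumed, and NavigableString is imported from bs4 for the isinstance test.

```python
from bs4 import BeautifulSoup, NavigableString

# Made-up document: the <b> tag sits in the middle of the text.
soup = BeautifulSoup("<div><p>one <b>two</b></p><p>three</p></div>", "html.parser")
start = soup.b

# Strings parsed after <b> was opened, in document order.
following_text = [str(el) for el in start.next_elements
                  if isinstance(el, NavigableString)]
# Strings parsed before <b>, walking backwards from it.
preceding_text = [str(el) for el in start.previous_elements
                  if isinstance(el, NavigableString)]

print(following_text, preceding_text)
```

Note that next_elements starts inside the tag itself, so the tag's own text ('two' here) is part of the result.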


- Author -

BQW_
