
Installing and Using the Python BS4 Library: A Detailed Guide

Posted in Python on August 08, 2018

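Beautiful Soup, usually referred to as the bs4 library, supports Python 3 and is an excellent third-party library for writing web crawlers. It is simple and smooth to use, and Chinese speakers often call it "美味汤" ("delicious soup"). At the time of writing, the latest version of bs4 is 4.6.0. This article covers only the most basic usage; for the full details, see the official Beautiful Soup Documentation.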

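Installing the bs4 library

Much of Python's strength comes from being an open-source language with many developers writing third-party libraries for it. When we want to implement a feature, we can concentrate on that feature and leave the underlying details to a library. bs4 is one such powerful helper for writing crawlers.

Installation is simple: use the pip tool from the command line.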

$ pip install beautifulsoup4

Next, check whether bs4 was installed successfully:

$ pip list

If beautifulsoup4 appears in the output, bs4 has been installed successfully.
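The same check can also be done from Python itself. The following is a minimal sketch that uses only the standard library; the helper name `is_installed` is made up for this illustration and is not part of pip or bs4.

```python
import importlib.util


def is_installed(module_name):
    """Return True if `module_name` can be located for import (hypothetical helper)."""
    return importlib.util.find_spec(module_name) is not None


# The package is installed as "beautifulsoup4" but imported as "bs4".
print("bs4 installed:", is_installed("bs4"))
```

Note that the name passed to the helper is the import name (`bs4`), not the name used with pip (`beautifulsoup4`).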


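Basic usage of the bs4 library

Here we briefly go over how to use bs4, setting aside for now how to fetch pages from the web. Suppose the HTML we want to scrape is the snippet below.

This HTML will be used as the example several times. It is an excerpt from Alice in Wonderland (referred to below as the Alice document):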

<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
  
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
  
<p class="story">...</p>
</body>
</html>

Now let us parse this HTML with bs4.

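# Import the BeautifulSoup class from the bs4 module
from bs4 import BeautifulSoup
# Make the soup; `html` is a string holding the Alice document shown above
soup = BeautifulSoup(html, 'html.parser')
# Print the result
print(soup.prettify())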
  
'''
OUT:

# <html>
#  <head>
#   <title>
#    The Dormouse's story
#   </title>
#  </head>
#  <body>
#   <p class="title">
#    <b>
#     The Dormouse's story
#    </b>
#   </p>
#   <p class="story">
#    Once upon a time there were three little sisters; and their names were
#    <a class="sister" href="http://example.com/elsie" id="link1">
#     Elsie
#    </a>
#    ,
#    <a class="sister" href="http://example.com/lacie" id="link2">
#     Lacie
#    </a>
#    and
#    <a class="sister" href="http://example.com/tillie" id="link3">
#     Tillie
#    </a>
#    ; and they lived at the bottom of a well.
#   </p>
#   <p class="story">
#    ...
#   </p>
#  </body>
# </html>
'''

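As you can see, bs4 has turned the page into a soup object (an instance of BeautifulSoup).

In fact, bs4 is a library for parsing, traversing, and maintaining a "tag tree".

Put simply, bs4 restructures the HTML source so that we can conveniently work with its nodes, tags, and attributes.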
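To make "traversing" and "maintaining" the tag tree more concrete, here is a small sketch. The `snippet` string is a stand-in document made up for this example (it is not the Alice document), and `tag_names` is just an illustrative variable name.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# A tiny stand-in document (an assumption for this example).
snippet = "<html><body><p class='title'><b>Hello</b></p><p>World</p></body></html>"
soup = BeautifulSoup(snippet, "html.parser")

# Traverse: walk every node below the root in document order,
# keeping only tags (text nodes are skipped).
tag_names = [node.name for node in soup.descendants if isinstance(node, Tag)]
print(tag_names)

# Maintain: change the tree in place, then print the modified tag.
soup.b.string = "Hi"
print(soup.b)
```

`soup.descendants` yields text nodes as well as tags, which is why the sketch filters on `Tag`.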

Here are a few simple ways to navigate the structured data.

Compare each result against the HTML document shown at the beginning.

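# Find the document's title
soup.title
# <title>The Dormouse's story</title>

# The name of the title tag
soup.title.name
# 'title'

# The string inside the title
soup.title.string
# "The Dormouse's story"

# The name of the title's parent node
soup.title.parent.name
# 'head'

# The first paragraph found in the document
soup.p
# <p class="title"><b>The Dormouse's story</b></p>

# The class attribute of that p tag (class is multi-valued, so a list is returned)
soup.p['class']
# ['title']

# Find the first a tag
soup.a
# <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>

# Find all a tags
soup.find_all('a')
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
#  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
#  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

# Find the tag whose id is "link3"
soup.find(id="link3")
# <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>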

The examples above show how bs4 approaches an HTML source file:

First, convert the HTML source into a soup object.

Then extract content from it in specific ways.
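These two steps can be wrapped in a single function. This is only a sketch: `extract_links` and the `sample` string are hypothetical names and data introduced for illustration.

```python
from bs4 import BeautifulSoup


def extract_links(html):
    """Parse `html`, then return (text, href) pairs for every <a> tag that has an href."""
    # Step 1: turn the raw HTML string into a soup object.
    soup = BeautifulSoup(html, "html.parser")
    # Step 2: pull the wanted content out of the soup.
    # href=True keeps only <a> tags that actually carry an href attribute.
    return [(a.get_text(strip=True), a["href"]) for a in soup.find_all("a", href=True)]


sample = '<p><a href="http://example.com/elsie" id="link1">Elsie</a> and <a>no link</a></p>'
print(extract_links(sample))
```

Keeping parsing and extraction in one place means callers only deal with plain Python values and never touch the soup object directly.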

Slightly more advanced usage

Find the links of all <a> tags in the document:

# Note that find_all returns a list-like ResultSet that can be iterated over
for link in soup.find_all('a'):
    print(link.get('href'))
# http://example.com/elsie
# http://example.com/lacie
# http://example.com/tillie
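find_all also accepts filters. The sketch below uses a shortened stand-in for the Alice document in which the third link deliberately has no href (an assumption made to show the difference between `get` and indexing); the variable names are illustrative.

```python
from bs4 import BeautifulSoup

# Shortened stand-in document; the missing href on the last link is deliberate.
doc = """
<p class="story">
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>
<a class="sister" id="link3">Tillie</a>
</p>
"""
soup = BeautifulSoup(doc, "html.parser")

# Filter by CSS class; the keyword is `class_` because `class` is reserved in Python.
sisters = soup.find_all("a", class_="sister")
print(len(sisters))

# `limit` stops the search after the given number of matches.
first_two_ids = [a["id"] for a in soup.find_all("a", limit=2)]
print(first_two_ids)

# tag.get('href') returns None when the attribute is missing,
# whereas tag['href'] would raise KeyError.
hrefs = [a.get("href") for a in sisters]
print(hrefs)
```

Using `get` is the safer choice when scraping real pages, where not every tag is guaranteed to carry the attribute you expect.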

Get all the text content of the document:

# The get_text method quickly returns all the text in the source document
print(soup.get_text())
# The Dormouse's story
#
# The Dormouse's story
#
# Once upon a time there were three little sisters; and their names were
# Elsie,
# Lacie and
# Tillie;
# and they lived at the bottom of a well.
#
# ...
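The raw get_text output keeps the whitespace and line breaks of the source. As a hedged sketch, the `separator` and `strip` arguments of get_text, or the `stripped_strings` generator, can tidy this up; the `doc` string here is a made-up fragment, not the Alice document.

```python
from bs4 import BeautifulSoup

# A made-up fragment with untidy whitespace around the text.
doc = "<p>  Once upon a time  <b> Elsie </b></p>"
soup = BeautifulSoup(doc, "html.parser")

# strip=True trims each piece of text; separator is placed between the pieces.
text = soup.get_text(separator=" | ", strip=True)
print(text)

# stripped_strings yields the same trimmed pieces one at a time.
pieces = list(soup.stripped_strings)
print(pieces)
```

Choosing a visible separator such as " | " makes it easy to see where one text node ends and the next begins.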

That is as far as we will go with this introduction to bs4.

That is all for this article; I hope it helps with your learning.
