Introduction to the Python Scraping Library BeautifulSoup, with Simple Usage Examples


Posted in Python on January 25, 2020

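1. Introduction

BeautifulSoup is a flexible, convenient library for parsing web pages. It is efficient, supports several parsers, and lets you extract information from a page without writing regular expressions.

Parsers supported by BeautifulSoup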

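Parser | Usage | Advantages | Disadvantages
Python standard library | BeautifulSoup(markup, "html.parser") | Built into Python; moderate speed; tolerant of malformed documents | Poor tolerance of malformed documents in versions before Python 2.7.3 / 3.2.2
lxml HTML parser | BeautifulSoup(markup, "lxml") | Fast; tolerant of malformed documents | Requires a C library to be installed
lxml XML parser | BeautifulSoup(markup, "xml") | Fast; the only parser that supports XML | Requires a C library to be installed
html5lib | BeautifulSoup(markup, "html5lib") | Best error tolerance; parses documents the way a browser does; produces HTML5-format documents | Slow; requires installing the separate html5lib package (pure Python, no C extension)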
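
Since the parsers in the table differ in whether they need an extra install, one possible pattern is to try the preferred parser and fall back to the built-in one. The sketch below is only an illustration: the helper name make_soup and the preference order are assumptions, not part of BeautifulSoup itself. It relies on bs4 raising FeatureNotFound when a requested parser is not installed.

```python
from bs4 import BeautifulSoup, FeatureNotFound


def make_soup(markup, preferred=("lxml", "html.parser")):
    """Hypothetical helper: build a soup with the first parser that is installed."""
    for parser in preferred:
        try:
            return BeautifulSoup(markup, parser)
        except FeatureNotFound:
            # The requested parser is not installed; try the next one.
            continue
    raise RuntimeError("none of the requested parsers is available")


# Deliberately unclosed markup: both parsers close the tags for us.
soup = make_soup("<p>Hello<b>world")
print(soup.p.get_text())
```

Because html.parser ships with Python, the fallback should always succeed; the RuntimeError is only there for the case where a caller passes a custom preference list.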

2. Quick start

Given an HTML document, create a BeautifulSoup object:

from bs4 import BeautifulSoup
html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
"""
soup = BeautifulSoup(html_doc,'lxml')

Print the full document

print(soup.prettify())

Output:

<html>
 <head>
  <title>
   The Dormouse's story
  </title>
 </head>
 <body>
  <p class="title">
   <b>
    The Dormouse's story
   </b>
  </p>
  <p class="story">
   Once upon a time there were three little sisters; and their names were
   <a class="sister" href="http://example.com/elsie" id="link1">
    Elsie
   </a>
   ,
   <a class="sister" href="http://example.com/lacie" id="link2">
    Lacie
   </a>
   and
   <a class="sister" href="http://example.com/tillie" id="link3">
    Tillie
   </a>
   ;
and they lived at the bottom of a well.
  </p>
  <p class="story">
   ...
  </p>
 </body>
</html>

Navigating the data structure

print(soup.title)  # the <title> tag and its content
print(soup.title.name)  # the name of the <title> tag
print(soup.title.string)  # the string inside <title>
print(soup.title.parent.name)  # the name of <title>'s parent tag (head)
print(soup.p)  # the first <p></p>
print(soup.p['class'])  # the class of the first <p></p> (returned as a list)
print(soup.a)  # the first <a></a>
print(soup.find_all('a'))  # all <a></a> tags
print(soup.find(id="link3"))  # the first tag with id='link3'

Output:

<title>The Dormouse's story</title>
title
The Dormouse's story
head
<p class="title"><b>The Dormouse's story</b></p>
['title']
<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>
[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>, <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>

Find the links in all <a> tags

for link in soup.find_all('a'):
    print(link.get('href'))

Output:

http://example.com/elsie
http://example.com/lacie
http://example.com/tillie

Get all the text content

print(soup.get_text())

Output:

The Dormouse's story

The Dormouse's story
Once upon a time there were three little sisters; and their names were
Elsie,
Lacie and
Tillie;
and they lived at the bottom of a well.
...

Auto-completing tags and formatting

html = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')  # pass in the parser: lxml
print(soup.prettify())  # format the markup; missing closing tags are filled in
print(soup.title.string)  # the content of the <title> tag

Tag selectors

Selecting elements

html = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')  # pass in the parser: lxml
print(soup.title)  # select the <title> tag
print(type(soup.title))  # check its type
print(soup.head)

Getting the tag name

from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')  # pass in the parser: lxml
print(soup.title.name)

Getting tag attributes

from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')  # pass in the parser: lxml
print(soup.p.attrs['name'])  # the value of the name attribute on the <p> tag
print(soup.p['name'])  # a shorter, more direct form

Getting tag content

print(soup.p.string)
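
One detail worth illustrating here: .string only returns a value when a tag has a single child (it will look through a single nested tag), and it returns None when the tag contains several children. The snippet below is a small sketch on made-up markup; the variable names and the choice of html.parser (to avoid the lxml dependency) are assumptions for the example.

```python
from bs4 import BeautifulSoup

snippet = (
    '<p class="title"><b>The Dormouse\'s story</b></p>'
    '<p class="story">Once <a>Elsie</a> and <a>Lacie</a></p>'
)
demo = BeautifulSoup(snippet, "html.parser")
first, second = demo.find_all("p")

print(first.string)       # the <b> is the only child, so its string is returned
print(second.string)      # None: this <p> has several children
print(second.get_text())  # get_text() joins all nested strings instead
```

When a tag may contain mixed text and child tags, get_text() is usually the safer choice.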

Nested tag selection

from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')  # pass in the parser: lxml
print(soup.head.title.string)

Children and descendants

html = """
<html>
  <head>
    <title>The Dormouse's story</title>
  </head>
  <body>
    <p class="story">
      Once upon a time there were three little sisters; and their names were
      <a href="http://example.com/elsie" class="sister" id="link1">
        <span>Elsie</span>
      </a>
      <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>
      and
      <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>
      and they lived at the bottom of a well.
    </p>
    <p class="story">...</p>
"""


from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')  # pass in the parser: lxml
print(soup.p.contents)  # the tag's direct children, returned as a list

Another approach is .children:

from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')  # pass in the parser: lxml
print(soup.p.children)  # an iterator over the tag's direct children
for i, child in enumerate(soup.p.children):  # i is the index, child is the node
    print(i, child)

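The output matches the previous example, with an index added. Note that .children returns an iterator rather than a list, so you have to loop over it (or pass it to list()) to see the child nodes.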
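
The difference between the list and the iterator can be sketched as follows. The markup and variable names are invented for the illustration, and html.parser is used only so the example does not depend on lxml.

```python
from bs4 import BeautifulSoup

demo = BeautifulSoup("<p>one<a>two</a>three</p>", "html.parser")

children = demo.p.children         # an iterator, consumed as it is read
as_list = list(children)           # materialise it once
print(as_list == demo.p.contents)  # same nodes as .contents
print(list(children))              # the iterator is now exhausted
```

If the children are needed more than once, .contents (or a stored list) avoids the exhausted-iterator surprise.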

Getting descendants:

from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')  # pass in the parser: lxml
print(soup.p.descendants)  # an iterator over all of the tag's descendants
for i, child in enumerate(soup.p.descendants):  # i is the index, child is the node
    print(i, child)

Parent and ancestor nodes

parent

from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')  # pass in the parser: lxml
print(soup.a.parent)  # the tag's direct parent

parents

from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')  # pass in the parser: lxml
print(list(enumerate(soup.a.parents)))  # all ancestors of the tag

Sibling nodes

from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')  # pass in the parser: lxml
print(list(enumerate(soup.a.next_siblings)))  # siblings that come after the tag
print(list(enumerate(soup.a.previous_siblings)))  # siblings that come before the tag
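
The sibling iterators yield text nodes as well as tags, which is easy to overlook. The following sketch, on invented markup and with html.parser chosen to avoid the lxml dependency, shows one way to keep only the tags.

```python
from bs4 import BeautifulSoup, Tag

markup = "<p><a id='1'>A</a>, <a id='2'>B</a> and <a id='3'>C</a></p>"
demo = BeautifulSoup(markup, "html.parser")

siblings = list(demo.a.next_siblings)
print(siblings)  # text nodes (", " and " and ") sit between the <a> tags

# Keep only real tags, dropping the text nodes.
tags_only = [node for node in siblings if isinstance(node, Tag)]
print([tag["id"] for tag in tags_only])
```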

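Standard selectors

find_all(name, attrs, recursive, text, **kwargs)

Searches the document by tag name, attributes, or text content. (Since Beautiful Soup 4.4.0 the text argument is named string; text is the older name.)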

name

html='''
<div class="panel">
  <div class="panel-heading">
    <h4>Hello</h4>
  </div>
  <div class="panel-body">
    <ul class="list" id="list-1">
      <li class="element">Foo</li>
      <li class="element">Bar</li>
      <li class="element">Jay</li>
    </ul>
    <ul class="list list-small" id="list-2">
      <li class="element">Foo</li>
      <li class="element">Bar</li>
    </ul>
  </div>
</div>
'''
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
print(soup.find_all('ul'))  # find all <ul> tags
print(type(soup.find_all('ul')[0]))  # check the type of one result

The next example finds the <li> tags inside every <ul> tag:

from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
for ul in soup.find_all('ul'):
    print(ul.find_all('li'))

attrs (attributes)

Find elements by their attributes:

html='''
<div class="panel">
  <div class="panel-heading">
    <h4>Hello</h4>
  </div>
  <div class="panel-body">
    <ul class="list" id="list-1" name="elements">
      <li class="element">Foo</li>
      <li class="element">Bar</li>
      <li class="element">Jay</li>
    </ul>
    <ul class="list list-small" id="list-2">
      <li class="element">Foo</li>
      <li class="element">Bar</li>
    </ul>
  </div>
</div>
'''


from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
print(soup.find_all(attrs={'id': 'list-1'}))  # pass a dict of the attributes to match
print(soup.find_all(attrs={'name': 'elements'}))

Both calls return the same element, because the two attributes belong to the same tag.
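
The attrs dictionary matters most for attributes that cannot be written as keyword arguments. As a rough sketch (the markup and the data-kind attribute are invented for illustration, and html.parser is used to avoid the lxml dependency): name= as a keyword means the tag name, and data-* attributes are not valid Python identifiers.

```python
from bs4 import BeautifulSoup

markup = (
    '<ul id="list-1" name="elements" data-kind="main"><li>Foo</li></ul>'
    '<ul id="list-2"><li>Bar</li></ul>'
)
demo = BeautifulSoup(markup, "html.parser")

by_name_attr = demo.find_all(attrs={"name": "elements"})  # the name attribute
by_tag_name = demo.find_all(name="elements")              # looks for <elements> tags
by_data_attr = demo.find_all(attrs={"data-kind": "main"})

print([ul["id"] for ul in by_name_attr])
print(by_tag_name)
print([ul["id"] for ul in by_data_attr])
```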

Searching with keyword arguments:

from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
print(soup.find_all(id='list-1'))  # id can be passed directly as a keyword argument
print(soup.find_all(class_='element'))  # class is a Python keyword, so use class_
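
A point that follows from the markup above: a tag such as class="list list-small" carries two classes, and class_ matches if any one of them equals the value given. A minimal sketch, with invented variable names and html.parser chosen to avoid the lxml dependency:

```python
from bs4 import BeautifulSoup

markup = (
    '<ul class="list" id="list-1"></ul>'
    '<ul class="list list-small" id="list-2"></ul>'
)
demo = BeautifulSoup(markup, "html.parser")

all_lists = [ul["id"] for ul in demo.find_all(class_="list")]
small_lists = [ul["id"] for ul in demo.find_all(class_="list-small")]
# Keyword filters can be combined with a tag name.
combined = demo.find_all("ul", class_="list", id="list-2")

print(all_lists)    # both <ul> tags carry the "list" class
print(small_lists)  # only the second one carries "list-small"
print(combined[0]["class"])
```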

text

Select by text content:

html='''
<div class="panel">
  <div class="panel-heading">
    <h4>Hello</h4>
  </div>
  <div class="panel-body">
    <ul class="list" id="list-1">
      <li class="element">Foo</li>
      <li class="element">Bar</li>
      <li class="element">Jay</li>
    </ul>
    <ul class="list list-small" id="list-2">
      <li class="element">Foo</li>
      <li class="element">Bar</li>
    </ul>
  </div>
</div>
'''
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
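print(soup.find_all(text='Foo'))  # finds text equal to "Foo", but returns strings, not tags

So text is handy for matching content, but less convenient for locating elements, because the results are strings rather than tags.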
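
If the enclosing tags are what you need, two possible workarounds are to walk up from each matched string with .parent, or to combine a tag name with the text filter. The sketch below assumes Beautiful Soup 4.4.0 or later, where the argument is spelled string; the markup is invented and html.parser is used to avoid the lxml dependency.

```python
from bs4 import BeautifulSoup

markup = '<ul><li class="element">Foo</li><li class="element">Bar</li></ul>'
demo = BeautifulSoup(markup, "html.parser")

matches = demo.find_all(string="Foo")           # strings, not tags
tags_via_parent = [s.parent for s in matches]   # step up to the enclosing tag
tags_direct = demo.find_all("li", string="Foo") # tags whose .string is "Foo"

print(matches)
print(tags_via_parent)
print(tags_direct)
```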

Methods

find

find() is used exactly like find_all(), but it returns only the first matching result.
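
The relationship between the two methods can be sketched as below; the markup is invented and html.parser is used to avoid the lxml dependency. One practical difference is the "no match" case: find() gives None, while find_all() gives an empty list.

```python
from bs4 import BeautifulSoup

demo = BeautifulSoup("<ul><li>Foo</li><li>Bar</li></ul>", "html.parser")

first = demo.find("li")                     # a single tag
limited = demo.find_all("li", limit=1)      # a one-element list holding the same tag
missing = demo.find("table")                # None when nothing matches
missing_all = demo.find_all("table")        # empty list when nothing matches

print(first.string)
print(limited == [first])
print(missing, missing_all)
```

Checking for None before chaining attribute access on a find() result avoids an AttributeError on pages where the element is absent.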

find_parents(), find_parent()

find_parents() returns all matching ancestors; find_parent() returns the nearest one, which is the direct parent when no filter is given.
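
A short sketch of the two calls with a tag-name filter, on invented markup and with html.parser chosen to avoid the lxml dependency:

```python
from bs4 import BeautifulSoup

markup = (
    '<div class="panel"><div class="panel-body">'
    '<ul id="list-1"><li>Foo</li></ul>'
    '</div></div>'
)
demo = BeautifulSoup(markup, "html.parser")
li = demo.li

nearest_div = li.find_parent("div")   # closest enclosing <div>
all_divs = li.find_parents("div")     # every enclosing <div>, nearest first

print(nearest_div["class"])
print([d["class"] for d in all_divs])
print(li.find_parent("ul")["id"])
```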

find_next_siblings(), find_next_sibling()

find_next_siblings() returns all following siblings; find_next_sibling() returns the first following sibling.

find_previous_siblings(), find_previous_sibling()

find_previous_siblings() returns all preceding siblings; find_previous_sibling() returns the first preceding sibling.
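
The four sibling methods can be sketched together on a small invented list (html.parser is used to avoid the lxml dependency). Note that the "previous" variants return results nearest first, that is, in reverse document order.

```python
from bs4 import BeautifulSoup

demo = BeautifulSoup("<ul><li>Foo</li><li>Bar</li><li>Jay</li></ul>", "html.parser")
first, middle, last = demo.find_all("li")

next_one = middle.find_next_sibling("li")
next_all = first.find_next_siblings("li")
prev_one = middle.find_previous_sibling("li")
prev_all = last.find_previous_siblings("li")

print(next_one.string)                # Jay
print([li.string for li in next_all]) # ['Bar', 'Jay']
print(prev_one.string)                # Foo
print([li.string for li in prev_all]) # ['Bar', 'Foo']
```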

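find_all_next(), find_next()

find_all_next() returns all matching nodes after the current node; find_next() returns the first matching node after it.

find_all_previous(), find_previous()

find_all_previous() returns all matching nodes before the current node; find_previous() returns the first matching node before it.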
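
Unlike the sibling methods, these four search everything that comes later (or earlier) in the document, regardless of nesting. A minimal sketch on invented markup, using html.parser to avoid the lxml dependency:

```python
from bs4 import BeautifulSoup

markup = "<div><h4>Hello</h4><ul><li>Foo</li><li>Bar</li></ul></div>"
demo = BeautifulSoup(markup, "html.parser")

h4 = demo.h4
last_li = demo.find_all("li")[-1]

# The <li> tags are not siblings of <h4>, but they do come after it.
print(h4.find_next("li").string)
print([li.string for li in h4.find_all_next("li")])

# Searching backwards from the last <li>, nearest match first.
print(last_li.find_previous("h4").string)
print([t.name for t in last_li.find_all_previous(["li", "h4"])])
```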

CSS selectors

Pass a CSS selector directly to select() to make a selection.

html='''
<div class="panel">
  <div class="panel-heading">
    <h4>Hello</h4>
  </div>
  <div class="panel-body">
    <ul class="list" id="list-1">
      <li class="element">Foo</li>
      <li class="element">Bar</li>
      <li class="element">Jay</li>
    </ul>
    <ul class="list list-small" id="list-2">
      <li class="element">Foo</li>
      <li class="element">Bar</li>
    </ul>
  </div>
</div>
'''
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
print(soup.select('.panel .panel-heading'))  # . denotes a class; a space separates ancestor and descendant
print(soup.select('ul li'))  # <li> tags inside <ul> tags
print(soup.select('#list-2 .element'))  # # denotes an id: elements with class "element" inside the tag whose id is "list-2"
print(type(soup.select('ul')[0]))  # print the node type

Now a nested selection:

from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
for ul in soup.select('ul'):
    print(ul.select('li'))
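
When only the first match is wanted, select_one() can be used in the same way that find() complements find_all(). The sketch below also shows the child combinator (>) and chained class selectors. The markup is a trimmed, invented variant of the panel example, and html.parser is used to avoid the lxml dependency.

```python
from bs4 import BeautifulSoup

markup = (
    '<div class="panel">'
    '<ul class="list" id="list-1"><li class="element">Foo</li><li class="element">Bar</li></ul>'
    '<ul class="list list-small" id="list-2"><li class="element">Jay</li></ul>'
    '</div>'
)
demo = BeautifulSoup(markup, "html.parser")

first_in_list2 = demo.select_one("#list-2 > li.element")  # direct child of #list-2
small_lists = demo.select("ul.list.list-small")           # must carry both classes
missing = demo.select_one("table")                        # None when nothing matches

print(first_in_list2.string)
print([ul["id"] for ul in small_lists])
print(missing)
```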

Getting attributes

from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
for ul in soup.select('ul'):
    print(ul['id'])  # use [] to read an attribute
    print(ul.attrs['id'])  # an equivalent form

Getting content

from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
for li in soup.select('li'):
    print(li.get_text())

The get_text() method returns the text content.
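
On real pages the text is often surrounded by indentation and newlines. get_text() accepts a separator and a strip flag for that case; the sketch below, on invented markup and with html.parser to avoid the lxml dependency, compares the raw and cleaned results.

```python
from bs4 import BeautifulSoup

markup = "<ul>\n  <li> Foo </li>\n  <li> Bar </li>\n</ul>"
demo = BeautifulSoup(markup, "html.parser")

raw = demo.ul.get_text()                   # keeps the whitespace between tags
clean = demo.ul.get_text("|", strip=True)  # strip each piece, join with "|"
pieces = list(demo.ul.stripped_strings)    # the same pieces as a list

print(repr(raw))
print(clean)
print(pieces)
```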

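Summary

Prefer the lxml parser; fall back to html.parser when necessary.

Selecting by tag name (soup.title, soup.p) is fast but offers little filtering. Use find() and find_all() to match a single result or multiple results.

If you are comfortable with CSS selectors, use select().

Remember the common ways of getting attribute values and text.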
