Using the Python HTML Parser BeautifulSoup, with Detailed Examples [Crawler Parser]


Posted in Python on April 05, 2019

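This article shows, through examples, how to use BeautifulSoup, an HTML parser for Python. The details are as follows:

Introduction to BeautifulSoup

Python ships with a capable built-in HTML parser module, HTMLParser (html.parser in Python 3). BeautifulSoup is a third-party library that offers a more powerful tool for parsing HTML or XML. Put simply, its main job is extracting data from web pages. This article walks through what it can do.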

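Installing BeautifulSoup

BeautifulSoup 3 is no longer developed, and BeautifulSoup 4 is recommended for current projects. It now lives in the bs4 package, so it has to be imported from bs4 (for example, from bs4 import BeautifulSoup). It can be installed with either pip or easy_install; pip is used below.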

pip install beautifulsoup4
pip install lxml

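Installing the lxml module as well is recommended. BeautifulSoup supports the HTML parser in the Python standard library (HTMLParser) as well as several third-party parsers. If no third-party parser is installed, BeautifulSoup uses Python's default parser. The lxml parser is more powerful and faster.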

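Creating the object

After installation, create the object:

from bs4 import BeautifulSoup
soup = BeautifulSoup(html_doc, 'lxml')  # html_doc: an HTML string or an open file object

Pretty-printed output:

print(soup.prettify())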
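
As a minimal sketch of the two steps above, the following builds a soup from an in-memory string and pretty-prints it. The sample markup and variable names are illustrative only, and the built-in "html.parser" is used here just so the snippet runs without lxml installed.

```python
from bs4 import BeautifulSoup

# Illustrative markup; any HTML string or open file object can be passed instead.
html_doc = "<html><head><title>Demo</title></head><body><p>Hello</p></body></html>"

# The second argument names the parser; "lxml" is passed the same way if installed.
soup = BeautifulSoup(html_doc, "html.parser")

# prettify() returns a string with each tag and text node on its own indented line.
pretty = soup.prettify()
print(pretty)
```

prettify() only returns the formatted string; it has to be printed or written somewhere to be seen.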

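The four BeautifulSoup object types

BeautifulSoup converts a complex HTML document into a tree structure in which every node is a Python object. All of these objects fall into 4 kinds:

  • Tag (a tag)
  • NavigableString (text content)
  • BeautifulSoup (the document)
  • Comment (a comment)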

1. The Tag type

A Tag is a whole HTML tag. For example, to get the <title> tag:

print(soup.title)
# <title>The Dormouse's story</title>

A Tag has two important attributes: name and attrs.

name

The name of the HTML tag:

print(soup.name)
# [document]
print(soup.head.name)
# head

attrs

The dictionary of the tag's HTML attributes:

print(soup.p.attrs)
# {'class': ['title'], 'name': 'dromouse'}

To get a single attribute:

print(soup.p['class'])
# ['title']
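
The attribute access described above can be sketched as follows. The markup is a made-up one-line sample, and "html.parser" is chosen only to avoid the lxml dependency. The sketch also shows Tag.get(), which returns None for a missing attribute instead of raising KeyError.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup for illustration.
html_doc = '<p class="title" id="intro"><b>The Dormouse\'s story</b></p>'
soup = BeautifulSoup(html_doc, "html.parser")
tag = soup.p

print(tag.name)         # the tag name as a string
print(tag.attrs)        # all attributes as a dictionary
print(tag["class"])     # class is multi-valued, so it comes back as a list
print(tag.get("id"))    # dictionary-style lookup without KeyError
print(tag.get("href"))  # the attribute is absent, so this is None
```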

2. The NavigableString type

Once we have the whole tag, how do we get the text inside it? Simply use .string:

print(soup.p.string)
# The Dormouse's story

3. The BeautifulSoup type

A BeautifulSoup object represents the entire content of a document:

print(soup.name)
# [document]

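4. The Comment type

The content of an HTML comment. Note that it does not include the comment delimiters. First check whether the value is of type Comment, then do something with it, such as printing it:

from bs4.element import Comment
if isinstance(soup.a.string, Comment):
  print(soup.a.string)
#  Elsie   (the <!-- and --> delimiters are not part of the string)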
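
A small sketch of that type check, under the assumption of a two-link sample document made up for this purpose. It relies on Comment being a subclass of NavigableString, so isinstance() is enough to tell a comment from ordinary text.

```python
from bs4 import BeautifulSoup
from bs4.element import Comment

# Hypothetical markup: the first link holds a comment, the second plain text.
html_doc = '<a id="link1"><!-- Elsie --></a><a id="link2">Lacie</a>'
soup = BeautifulSoup(html_doc, "html.parser")

for a in soup.find_all("a"):
    text = a.string
    # Comments and ordinary strings both come back from .string.
    kind = "comment" if isinstance(text, Comment) else "text"
    print(a["id"], kind, repr(str(text)))
```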

Traversing the document tree

1. Child nodes

contents

Gets all child nodes, returned as a list:

print(soup.head.contents)
# [<title>The Dormouse's story</title>]

children

Gets all child nodes, returned as an iterator over the list:

print(soup.head.children)
# <list_iterator object at 0x7f71457f5710>
## it has to be iterated over
for child in soup.body.children:
  print(child)
## Result
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>,
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
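
The difference between the two attributes can be sketched with a short list. The markup below is a hypothetical example without whitespace between tags, so that only tag nodes appear as children.

```python
from bs4 import BeautifulSoup

html_doc = "<ul><li>one</li><li>two</li><li>three</li></ul>"
soup = BeautifulSoup(html_doc, "html.parser")
ul = soup.ul

# .contents is a real list: it supports len() and indexing.
print(len(ul.contents), ul.contents[0])

# .children iterates over the same nodes; consume it with a loop.
items = [child.get_text() for child in ul.children]
print(items)
```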

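2. Node content

string

Returns a single text string. If a tag contains no other tags, .string returns the text inside it. If a tag contains exactly one child tag, .string also returns the innermost text. If a tag contains several child nodes, it is ambiguous which child .string should refer to, so the result is None. For example:

print(soup.head.string)
print(soup.title.string)
# The Dormouse's story
# The Dormouse's story
print(soup.html.string)
# None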
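
The three cases can be reproduced with a small made-up document (an assumption for illustration, not the article's sample page):

```python
from bs4 import BeautifulSoup

html_doc = "<div><p><b>only child</b></p><p>first <i>second</i></p></div>"
soup = BeautifulSoup(html_doc, "html.parser")
first_p, second_p = soup.find_all("p")

print(first_p.string)   # one nested child: .string reaches through <b>
print(second_p.string)  # two children (text and <i>): ambiguous, so None
print(soup.div.string)  # <div> has two <p> children: None as well
```

When .string may be None, get_text() is usually the safer way to collect all the text.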

strings

Returns all text strings, including blank lines and whitespace.

stripped_strings

Returns all text strings with surrounding whitespace stripped and whitespace-only strings skipped:

for string in soup.stripped_strings:
  print(repr(string))
  # "The Dormouse's story"
  # "The Dormouse's story"
  # 'Once upon a time there were three little sisters; and their names were'
  # 'Elsie'
  # ','
  # 'Lacie'
  # 'and'
  # 'Tillie'
  # ';\nand they lived at the bottom of a well.'
  # '...'
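
A side-by-side sketch of the two generators on a tiny hypothetical fragment makes the difference visible:

```python
from bs4 import BeautifulSoup

# Hypothetical markup with indentation and padding inside the tags.
html_doc = "<div>\n  <p> Alice </p>\n  <p>Bob</p>\n</div>"
soup = BeautifulSoup(html_doc, "html.parser")

raw = list(soup.strings)             # keeps whitespace-only text nodes
clean = list(soup.stripped_strings)  # strips each string and drops empty ones
print(raw)
print(clean)
```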

The get_text() method

Returns the text content of the current node and its descendants.

from bs4 import BeautifulSoup
html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
  <p class="title"><b>The Dormouse's story</b></p>
  <p class="story">Once upon a time there were three little sisters; and their names were
    <a href="http://example.com/elsie" class="sister1" id="link1">Elsie</a>,
    <a href="http://example.com/lacie" class="sister2" id="link2">Lacie</a> and
    <a href="http://example.com/tillie" class="sister3" id="link3">Tillie</a>;
    and they lived at the bottom of a well.
  </p>
  <p class="story">...</p>
</body>
</html>
"""
soup = BeautifulSoup(markup=html_doc, features='lxml')
node_p_text = soup.find('p', class_='story').get_text()
# note the trailing underscore in class_
print(node_p_text)
# Result
Once upon a time there were three little sisters; and their names were
    Elsie,
    Lacie and
    Tillie;
    and they lived at the bottom of a well.
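
get_text() also accepts separator and strip arguments, which help when the markup is indented as in the result above. A minimal sketch on a shortened, made-up fragment:

```python
from bs4 import BeautifulSoup

html_doc = "<p>Names:\n  <a>Elsie</a>,\n  <a>Lacie</a>\n</p>"
soup = BeautifulSoup(html_doc, "html.parser")

# Default: text nodes are concatenated exactly as they appear in the markup.
default_text = soup.p.get_text()
print(repr(default_text))

# strip=True trims every text node; separator is placed between the pieces.
joined = soup.p.get_text(separator=" ", strip=True)
print(joined)
```

Because the comma is its own text node here, it ends up surrounded by separators; further cleanup is left to the caller.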

3. Parent nodes

parent

Returns the direct parent of a node:

p = soup.p
print(p.parent.name)
# body

parents

Returns all ancestors of a node, from its parent upward:

content = soup.head.title.string
for parent in content.parents:
  print(parent.name)
## Result
title
head
html
[document]
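
The same walk can be collected into a list. This sketch uses a small hypothetical document; the last entry is the BeautifulSoup object itself, whose name is "[document]".

```python
from bs4 import BeautifulSoup

html_doc = "<html><body><div><p><b>deep</b></p></div></body></html>"
soup = BeautifulSoup(html_doc, "html.parser")

b = soup.b
print(b.parent.name)  # the direct parent only

# .parents walks upward until it reaches the BeautifulSoup object.
chain = [ancestor.name for ancestor in b.parents]
print(chain)
```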

4. Sibling nodes

next_sibling

The next_sibling attribute gets the next sibling of a node. The result is often a string or whitespace, because whitespace and line breaks also count as nodes.

previous_sibling

The previous_sibling attribute gets the previous sibling of a node.

print(soup.p.next_sibling)
# (blank: this sibling is a whitespace text node)
print(soup.p.previous_sibling)
# None when there is no previous sibling; a whitespace string if whitespace precedes the tag
print(soup.p.next_sibling.next_sibling)
#<p class="story">Once upon a time there were three little sisters; and their names were
#<a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>,
#<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
#<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
#and they lived at the bottom of a well.</p>
# the sibling after the whitespace node is the visible <p> tag

next_siblings, previous_siblings

Iterate over all following or all preceding siblings.
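
Since whitespace text nodes show up among the siblings, a common pattern is to filter the iterator by type. A minimal sketch, assuming a made-up list with newlines between the items:

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

html_doc = "<ul><li>a</li>\n<li>b</li>\n<li>c</li></ul>"
soup = BeautifulSoup(html_doc, "html.parser")
first = soup.li

# The immediate sibling is the newline text node, not the next <li>.
print(repr(first.next_sibling))

# Keep only Tag objects while iterating forward.
following = [s.get_text() for s in first.next_siblings if isinstance(s, Tag)]
print(following)

# previous_siblings runs backwards, nearest sibling first.
last = soup.find_all("li")[-1]
preceding = [s.get_text() for s in last.previous_siblings if isinstance(s, Tag)]
print(preceding)
```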

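5. Next and previous nodes

next_element, previous_element

These are not limited to siblings. They refer to the next and previous node among all nodes in parse order, regardless of nesting level.

next_elements, previous_elements

Iterate over all following or all preceding nodes.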
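
The contrast between sibling order and parse order can be sketched on a two-tag fragment (hypothetical markup, chosen so the expected order is easy to follow):

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

html_doc = "<p><b>bold</b><i>italic</i></p>"
soup = BeautifulSoup(html_doc, "html.parser")
b = soup.b

print(repr(b.next_sibling))  # the <i> tag: next node at the same level
print(repr(b.next_element))  # 'bold': the next node parsed, b's own text

# .next_elements continues in document order until the document ends.
order = [e.name if isinstance(e, Tag) else str(e) for e in b.next_elements]
print(order)
```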

Searching the document tree

1. find_all(name=None, attrs={}, recursive=True, text=None, limit=None, **kwargs)

find_all() searches all tag descendants of the current tag and checks whether each one matches the filter conditions.

Parameters

The name parameter

The name parameter is flexible and accepts several kinds of values. It finds all tags whose name matches name; string (text) nodes are ignored automatically.

(a) Passing a tag name

The simplest filter is a tag name. Pass a tag name to a search method and BeautifulSoup finds the tags whose name matches it exactly. The following example finds all <a> tags in the document:

print(soup.find_all('a'))
#[<a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>, <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

The elements of the returned list are Tag objects, so they can be navigated and searched in the same way.

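(b) Passing a regular expression

If a compiled regular expression is passed, BeautifulSoup filters by calling its search() method. The following example finds all tags whose names start with b, so both <body> and <b> are found:

import re
for tag in soup.find_all(re.compile("^b")):
  print(tag.name)
# body
# b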

(c) Passing a list

If a list is passed, BeautifulSoup returns everything that matches any element of the list. The following code finds all <a> tags and <b> tags in the document:

soup.find_all(["a", "b"])
# [<b>The Dormouse's story</b>,
# <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
# <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
# <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

(d) Passing True

True matches any value. The following code finds every tag but returns no string nodes:

for tag in soup.find_all(True):
  print(tag.name)
# html
# head
# title
# body
# p
# b
# p
# a
# a
# a
# p

(e) Passing a function

If none of the filters fits, you can define a function that takes a single element as its argument. It should return True if the element matches and False otherwise:

def has_class_but_no_id(tag):
  return tag.has_attr('class') and not tag.has_attr('id')
soup.find_all(has_class_but_no_id)
# [<p class="title"><b>The Dormouse's story</b></p>,
# <p class="story">Once upon a time there were...</p>,
# <p class="story">...</p>]
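
A self-contained sketch of function filters follows. Besides the tag-level filter above, it assumes the documented behaviour that a function can also be passed as an attribute value, in which case it receives the attribute value (or None when the attribute is missing). The markup and the not_lacie helper are illustrative.

```python
from bs4 import BeautifulSoup

html_doc = (
    '<p class="title"><b>Title</b></p>'
    '<a class="sister" id="link1" href="http://example.com/elsie">Elsie</a>'
    '<a class="sister" id="link2" href="http://example.com/lacie">Lacie</a>'
)
soup = BeautifulSoup(html_doc, "html.parser")

def has_class_but_no_id(tag):
    # Called once per tag; a true result keeps the tag.
    return tag.has_attr("class") and not tag.has_attr("id")

tag_names = [t.name for t in soup.find_all(has_class_but_no_id)]
print(tag_names)

def not_lacie(href):
    # Called with the href value, or None for tags without href.
    return href is not None and "lacie" not in href

link_ids = [t["id"] for t in soup.find_all(href=not_lacie)]
print(link_ids)
```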

Keyword arguments

Note that any named argument that is not one of the built-in parameter names is treated as a filter on the tag attribute of that name. For example, passing an argument named id makes BeautifulSoup filter on each tag's "id" attribute:

soup.find_all(id='link2')
# [<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>]

If an href argument is passed, BeautifulSoup filters on each tag's "href" attribute:

soup.find_all(href=re.compile("elsie"))
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>]

Several named arguments can be combined to filter on several attributes at once:

soup.find_all(href=re.compile("elsie"), id='link1')
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>]

To filter by class, note that class is a Python keyword, so it cannot be used as an argument name. Append an underscore instead:

soup.find_all("a", class_="sister")
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
# <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
# <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

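The attrs parameter

Some tag attributes cannot be used as keyword arguments in a search, such as the HTML5 "data-*" custom attributes:

data_soup = BeautifulSoup('<div data-foo="value">foo!</div>', 'lxml')
# data_soup.find_all(data-foo="value")
# SyntaxError: keyword can't be an expression (the exact message depends on the Python version)
## Such attributes can still be searched by passing a dictionary through the attrs parameter of find_all()
data_soup.find_all(attrs={"data-foo": "value"})
# [<div data-foo="value">foo!</div>]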
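
The attrs dictionary is also the way to match the HTML "name" attribute, because the name keyword is already used for the tag-name filter. A small sketch with made-up markup:

```python
from bs4 import BeautifulSoup

html_doc = '<div data-foo="value">foo!</div><input name="email"/>'
soup = BeautifulSoup(html_doc, "html.parser")

# data-foo is not a valid Python identifier, so it goes in the attrs dict.
by_data = soup.find_all(attrs={"data-foo": "value"})
print(by_data)

# name= means "tag name" in find_all(), so the HTML name attribute
# has to be matched through attrs as well.
by_name = soup.find_all(attrs={"name": "email"})
print(by_name)
```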

The text parameter

The text parameter searches the string content of the document. Like the name parameter, it accepts a string, a regular expression, a list, or True.

soup.find_all(text="Elsie")
# ['Elsie']
soup.find_all(text=["Tillie", "Elsie", "Lacie"])
# ['Elsie', 'Lacie', 'Tillie']
soup.find_all(text=re.compile("Dormouse"))  # partial match
# ["The Dormouse's story", "The Dormouse's story"]
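
In Beautiful Soup 4.4.0 and later the same filter is also available under the name string. The sketch below assumes such a version and a made-up three-paragraph fragment. It shows that a plain string must equal the whole text node, while a compiled pattern matches partially.

```python
import re
from bs4 import BeautifulSoup

html_doc = "<p>Elsie</p><p>Lacie</p><p>Tillie and Elsie</p>"
soup = BeautifulSoup(html_doc, "html.parser")

# A plain string must equal the whole text node.
exact = soup.find_all(string="Elsie")
print(exact)

# A compiled pattern is applied with search(), so partial matches count.
partial = soup.find_all(string=re.compile("Elsie"))
print(partial)

# Combined with a tag name, the tags whose .string matches are returned.
tags = soup.find_all("p", string=re.compile("^L"))
print(tags)
```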

The limit parameter

find_all() returns every result of the search, which can be slow when the document tree is large. If not all results are needed, the limit parameter restricts how many are returned. It works like the LIMIT keyword in SQL: once the number of results reaches the limit, the search stops and the results are returned.

soup.find_all("a", limit=2)
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
# <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>]

The recursive parameter

When find_all() is called on a tag, BeautifulSoup examines all descendants of that tag. To search only the tag's direct children, pass recursive=False:

soup.html.find_all("title")
# [<title>The Dormouse's story</title>]
soup.html.find_all("title", recursive=False)
# []

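2. find(name, attrs, recursive, text, **kwargs)

The only difference from find_all() is the return value: find_all() returns a list, even when it contains a single element, whereas find() returns the first matching result directly.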
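
The difference in return values matters most when nothing matches: find_all() gives an empty list and find() gives None. A minimal sketch with hypothetical markup:

```python
from bs4 import BeautifulSoup

html_doc = "<div><a id='x'>one</a><a id='y'>two</a></div>"
soup = BeautifulSoup(html_doc, "html.parser")

print(soup.find_all("a", limit=1))  # a list holding one tag
print(soup.find("a"))               # the tag itself

print(soup.find_all("table"))       # no match: empty list
print(soup.find("table"))           # no match: None

# Guard against None before chaining calls on a find() result.
table = soup.find("table")
caption = table.get_text() if table is not None else ""
print(repr(caption))
```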

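3. find_parents() and find_parent()

find_all() and find() only search downward, through the children, grandchildren and further descendants of the current node. find_parents() and find_parent() search the ancestors of the current node instead. They accept the same filters as an ordinary tag search.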
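
A short sketch of the upward search, using a made-up nested fragment:

```python
from bs4 import BeautifulSoup

html_doc = "<html><body><div class='box'><p><a id='link'>x</a></p></div></body></html>"
soup = BeautifulSoup(html_doc, "html.parser")
a = soup.find(id="link")

# Nearest ancestor that matches the filter.
box_class = a.find_parent("div")["class"]
print(box_class)

# All matching ancestors, nearest first.
ancestors = [t.name for t in a.find_parents(["p", "div"])]
print(ancestors)

# None when no ancestor matches.
print(a.find_parent("table"))
```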

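4. find_next_siblings() and find_next_sibling()

These two methods iterate, through the .next_siblings attribute, over the siblings parsed after the current tag. find_next_siblings() returns all following siblings that match, and find_next_sibling() returns only the first following sibling that matches.

5. find_previous_siblings() and find_previous_sibling()

These two methods iterate, through the .previous_siblings attribute, over the siblings parsed before the current tag. find_previous_siblings() returns all preceding siblings that match, and find_previous_sibling() returns the first preceding sibling that matches.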
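
Both directions can be sketched on one paragraph. The markup is a shortened, illustrative version of the three-sisters example.

```python
from bs4 import BeautifulSoup

html_doc = (
    "<p><a id='link1'>Elsie</a>, <a id='link2'>Lacie</a>"
    " and <a id='link3'>Tillie</a></p>"
)
soup = BeautifulSoup(html_doc, "html.parser")
middle = soup.find(id="link2")

# Text nodes such as " and " are skipped because a tag name is given.
next_id = middle.find_next_sibling("a")["id"]
prev_id = middle.find_previous_sibling("a")["id"]
print(next_id, prev_id)

# The plural forms return lists; backward searches list the nearest first.
after = [t["id"] for t in middle.find_next_siblings("a")]
before_last = [t["id"] for t in soup.find(id="link3").find_previous_siblings("a")]
print(after, before_last)
```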

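6. find_all_next() and find_next()

These two methods iterate, through the .next_elements attribute, over the tags and strings that come after the current tag. find_all_next() returns all nodes that match, and find_next() returns the first node that matches.

7. find_all_previous() and find_previous()

These two methods iterate, through the .previous_elements attribute, over the tags and strings that come before the current node. find_all_previous() returns all nodes that match, and find_previous() returns the first node that matches.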
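
Because these methods follow document order rather than nesting, a search can leave the current container. A minimal sketch with two hypothetical sections:

```python
from bs4 import BeautifulSoup

html_doc = (
    "<div><h2>A</h2><p>a1</p></div>"
    "<div><h2>B</h2><p>b1</p><p>b2</p></div>"
)
soup = BeautifulSoup(html_doc, "html.parser")
first_p = soup.p

# The forward search leaves the first <div> and continues into the second.
next_heading = first_p.find_next("h2").get_text()
later_paragraphs = [p.get_text() for p in first_p.find_all_next("p")]
print(next_heading, later_paragraphs)

# The backward search returns the nearest match first.
last_p = soup.find_all("p")[-1]
prev_heading = last_p.find_previous("h2").get_text()
earlier_headings = [h.get_text() for h in last_p.find_all_previous("h2")]
print(prev_heading, earlier_headings)
```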

CSS selectors

In CSS, a tag name is written bare, a class name is prefixed with a dot, and an id is prefixed with #. The same syntax can be used here to filter elements through soup.select(), which returns a list.

Lookup by tag name

print(soup.select('title'))
#[<title>The Dormouse's story</title>]
print(soup.select('a'))
#[<a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>, <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
print(soup.select('b'))
#[<b>The Dormouse's story</b>]

Lookup by class name

print(soup.select('.sister'))
#[<a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>, <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

Lookup by id

print(soup.select('#link1'))
#[<a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>]

Combined lookup

Combined lookups follow the same rules as combining tag names, class names and ids in a CSS file. For example, to find the element with id link1 inside a p tag, separate the two parts with a space.

print(soup.select('p #link1'))
#[<a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>]

Lookup of direct child tags:

print(soup.select("head > title"))
#[<title>The Dormouse's story</title>]

Lookup by attribute

An attribute condition can also be added; it goes in square brackets. The attribute and the tag name refer to the same node, so there must be no space between them, otherwise nothing matches.

print(soup.select('a[class="sister"]'))
#[<a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>, <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
print(soup.select('a[href="http://example.com/elsie"]'))
#[<a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>]

Attribute conditions can also be combined with the lookups above: parts that refer to different nodes are separated by a space, and parts that refer to the same node are not:

print(soup.select('p a[href="http://example.com/elsie"]'))
#[<a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>]

All of the select() calls above return lists. Iterate over the result and use .string or get_text() to get the content:

soup = BeautifulSoup(html, 'lxml')
print(type(soup.select('title')))
print(soup.select('title')[0].get_text())
for title in soup.select('title'):
  print(title.get_text())
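
When only the first match is wanted, select_one() avoids indexing into the list and returns None if nothing matches. The sketch below assumes a Beautiful Soup version whose CSS support is provided by the soupsieve package (installed alongside recent bs4 releases); the markup is a shortened illustrative sample.

```python
from bs4 import BeautifulSoup

html_doc = (
    "<html><head><title>The Dormouse's story</title></head><body>"
    '<p class="story">'
    '<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>'
    '<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>'
    "</p></body></html>"
)
soup = BeautifulSoup(html_doc, "html.parser")

# select() always returns a list of matching tags.
links = soup.select("p.story > a.sister")
link_ids = [a["id"] for a in links]
print(link_ids)

# select_one() returns the first match, or None when nothing matches.
first = soup.select_one('a[href$="lacie"]')
print(first.get_text() if first is not None else None)
print(soup.select_one("#missing"))
```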


We hope this article is helpful for your Python programming.
