Python HTML解析器BeautifulSoup用法实例详解【爬虫解析器】


Posted in Python onApril 05, 2019

本文实例讲述了Python HTML解析器BeautifulSoup用法。分享给大家供大家参考,具体如下:

BeautifulSoup简介

我们知道,Python拥有出色的内置HTML解析器模块——HTMLParser,然而还有一个功能更为强大的HTML或XML解析工具——BeautifulSoup(美味的汤),它是一个第三方库。简单来说,BeautifulSoup最主要的功能是从网页抓取数据。本文我们来感受一下BeautifulSoup的优雅而强大的功能吧!

BeautifulSoup安装

BeautifulSoup3 目前已经停止开发,推荐在现在的项目中使用BeautifulSoup4,不过它已经被移植到bs4了,也就是说导入时我们需要 import bs4 。可以利用 pip 或者 easy_install 两种方法来安装。下面采用pip安装。

pip install beautifulsoup4
pip install lxml

建议同时安装"lxml"模块,BeautifulSoup支持Python标准库中的HTML解析器(HTMLParser),还支持一些第三方的解析器,如果我们不安装它,则 Python 会使用 Python默认的解析器,lxml 解析器更加强大,速度更快,推荐安装。

创建对象

安装后,创建对象:

soup = BeautifulSoup(markup='html文件', 'lxml')

格式化输出:

soup.prettify()

BeautifulSoup四大对象类型

BeautifulSoup将复杂HTML文档转换成一个复杂的树形结构,每个节点都是Python对象,所有对象可以归纳为4种:

  • Tag(标签)
  • NavigableString(内容)
  • BeautifulSoup(文档)
  • Comment(注释)

1.Tag类型

即HTML的整个标签,如获取<title>标签:

print soup.title
#<title>The Dormouse's story</title>

Tag有两个重要属性:name,attrs。

name

即HTML的标签名称:

print soup.name
#[document]
print soup.head.name
#head

attrs

即HTML的标签属性字典:

print soup.p.attrs
#{'class': ['title'], 'name': 'dromouse'}

如果想要单独获取某个属性:

print soup.p['class']
#['title']

2.NavigableString类型

既然我们已经得到了整个标签,那么问题来了,我们要想获取标签内部的文字内容怎么办呢?很简单,用 string 即可:

print soup.p.string
#The Dormouse's story

3.BeautifulSoup类型

BeautifulSoup 对象表示的是一个文档的全部内容.:

print soup.name
# [document]

4.Comment类型

HTML的注释内容,注意的是,不包含注释符号。我们首先判断它的类型,是否为 Comment 类型,然后再进行其他操作,如打印输出:

if type(soup.a.string)==bs4.element.Comment:
  print soup.a.string
#<!-- Elsie -->

遍历文档树

1.子节点

contents

获取所有子节点,返回列表:

print soup.head.contents
#[<title>The Dormouse's story</title>]

children

获取所有子节点,返回列表生成器:

print soup.head.children
#<listiterator object at 0x7f71457f5710>
## 需要遍历
for child in soup.body.children:
  print child
## 结果
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a class="sister" href="http://example.com/elsie" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" id="link1"><!-- Elsie --></a>,
<a class="sister" href="http://example.com/lacie" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" id="link2">Lacie</a> and
<a class="sister" href="http://example.com/tillie" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>

2.节点内容

string

返回单个文本内容。如果一个标签里面没有标签了,那么 string 就会返回标签里面的内容。如果标签里面只有唯一的一个标签了,那么 string 也会返回最里面的内容。如果tag包含了多个子节点,tag就无法确定,string 方法应该调用哪个子节点的内容,string 的输出结果是 None。例如:

print soup.head.string
print soup.title.string
#The Dormouse's story
#The Dormouse's story
print soup.html.string
# None

strings

返回多个文本内容,且包含空行和空格。

stripped_strings

返回多个文本内容,且不包含空行和空格:

for string in soup.stripped_strings:
  print(repr(string))
  # u"The Dormouse's story"
  # u"The Dormouse's story"
  # u'Once upon a time there were three little sisters; and their names were'
  # u'Elsie'
  # u','
  # u'Lacie'
  # u'and'
  # u'Tillie'
  # u';\nand they lived at the bottom of a well.'
  # u'...'

get_text()方法

返回当前节点和子节点的文本内容。

from bs4 import BeautifulSoup
html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
  <p class="title"><b>The Dormouse's story</b></p>
  <p class="story">Once upon a time there were three little sisters; and their names were
    <a href="http://example.com/elsie" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" class="sister1" id="link1">Elsie</a>,
    <a href="http://example.com/lacie" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" class="sister2" id="link2">Lacie</a> and
    <a href="http://example.com/tillie" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" class="sister3" id="link3">Tillie</a>;
    and they lived at the bottom of a well.
  </p>
  <p class="story">...</p>
</body>
</html>
"""
soup = BeautifulSoup(markup=html_doc,features='lxml')
node_p_text=soup.find('p',class_='story').get_text()
# 注意class_带下划线
print(node_p_text)
# 结果
Once upon a time there were three little sisters; and their names were
    Elsie,
    Lacie and
    Tillie;
    and they lived at the bottom of a well.

3.父节点

parent

返回某节点的直接父节点:

p = soup.p
print p.parent.name
#body

parents

返回某节点的所有父辈及以上辈的节点:

content = soup.head.title.string
for parent in content.parents:
  print parent.name
## 结果
title
head
html
[document]

4.兄弟节点

next_sibling

next_sibling 属性获取该节点的下一个兄弟节点,结果通常是字符串或空白,因为空白或者换行也可以被视作一个节点。

previous_sibling

previous_sibling 属性获取该节点的上一个兄弟节点。

print soup.p.next_sibling
#    实际该处为空白
print soup.p.prev_sibling
#None  没有前一个兄弟节点,返回 None
print soup.p.next_sibling.next_sibling
#<p class="story">Once upon a time there were three little sisters; and their names were
#<a class="sister" href="http://example.com/elsie" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" id="link1"><!-- Elsie --></a>,
#<a class="sister" href="http://example.com/lacie" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" id="link2">Lacie</a> and
#<a class="sister" href="http://example.com/tillie" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" id="link3">Tillie</a>;
#and they lived at the bottom of a well.</p>
#下一个节点的下一个兄弟节点是我们可以看到的节点

next_siblingsprevious_siblings

迭代获取全部兄弟节点。

5.前后节点

next_elementprevious_element

不是针对于兄弟节点,而是在于所有节点,不分层次的前一个和后一个节点。

next_elementsprevious_elements

迭代获取所有前和后节点。

搜索文档树

1.find_all(name=None, attrs={}, recursive=True, text=None, limit=None, **kwargs)

find_all()方法搜索当前tag的所有tag子节点,并判断是否符合过滤器的条件。

参数说明

name参数

name参数很强大,可以传多种方式的参数,查找所有名字为 name 的tag,字符串对象会被自动忽略掉。

(a)传标签名

最简单的过滤器是标签名。在搜索方法中传入一个标签名参数,BeautifulSoup会查找与标签名完整匹配的内容,下面的例子用于查找文档中所有的<a>标签:

print soup.find_all('a')
#[<a class="sister" href="http://example.com/elsie" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" id="link1"><!-- Elsie --></a>, <a class="sister" href="http://example.com/lacie" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" id="link2">Lacie</a>, <a class="sister" href="http://example.com/tillie" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" id="link3">Tillie</a>]

返回结果列表中的元素仍然是BeautifulSoup对象。

(b)传正则表达式

如果传入正则表达式作为参数,BeautifulSoup会通过正则表达式的 match() 来匹配内容。下面例子中找出所有以b开头的标签,这表示<body>和<b>标签都应该被找到:

import re
for tag in soup.find_all(re.compile("^b")):
  print(tag.name)
# body
# b

(c)传列表

如果传入列表参数,BeautifulSoup会将与列表中任一元素匹配的内容返回。下面代码找到文档中所有<a>标签和<b>标签:

soup.find_all(["a", "b"])
# [<b>The Dormouse's story</b>,
# <a class="sister" href="http://example.com/elsie" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" id="link1">Elsie</a>,
# <a class="sister" href="http://example.com/lacie" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" id="link2">Lacie</a>,
# <a class="sister" href="http://example.com/tillie" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" id="link3">Tillie</a>]

(d)传True

True 可以匹配任何值,下面代码查找到所有的tag,但是不会返回字符串节点:

for tag in soup.find_all(True):
  print(tag.name)
# html
# head
# title
# body
# p
# b
# p
# a
# a

(e)传函数

如果没有合适过滤器,那么还可以定义一个方法,方法只接受一个元素参数。如果这个方法返回 True 表示当前元素匹配并且被找到,如果不是则反回 False:

def has_class_but_no_id(tag):
  return tag.has_attr('class') and not tag.has_attr('id')
soup.find_all(has_class_but_no_id)
# [<p class="title"><b>The Dormouse's story</b></p>,
# <p class="story">Once upon a time there were...</p>,
# <p class="story">...</p>]

keyword参数

注意的是,如果一个指定名字的参数不是搜索内置的参数名,搜索时会把该参数当作指定名字tag的属性来搜索,如果包含一个名字为 id 的参数,BeautifulSoup会搜索每个tag的”id”属性:

soup.find_all(id='link2')
# [<a class="sister" href="http://example.com/lacie" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" id="link2">Lacie</a>]

如果传入 href 参数,Beautiful Soup会搜索每个tag的"href"属性:

soup.find_all(href=re.compile("elsie"))
# [<a class="sister" href="http://example.com/elsie" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" id="link1">Elsie</a>]

使用多个指定名字的参数可以同时过滤tag的多个属性:

soup.find_all(href=re.compile("elsie"), id='link1')
# [<a class="sister" href="http://example.com/elsie" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" id="link1">three</a>]

在这里我们想用 class 过滤,不过 class 是 python 的关键词,这怎么办?加个下划线就可以:

soup.find_all("a", class_="sister")
# [<a class="sister" href="http://example.com/elsie" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" id="link1">Elsie</a>,
# <a class="sister" href="http://example.com/lacie" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" id="link2">Lacie</a>,
# <a class="sister" href="http://example.com/tillie" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" id="link3">Tillie</a>]

attrs参数

有些tag属性在搜索不能使用,比如HTML5中的 " data-* " 自定义属性:

data_soup = BeautifulSoup('<div data-foo="value">foo!</div>')
data_soup.find_all(data-foo="value")
# SyntaxError: keyword can't be an expression
## 但是可以通过 find_all() 方法的 attrs 参数定义一个字典参数来搜索包含特殊属性的tag
data_soup.find_all(attrs={"data-foo": "value"})
# [<div data-foo="value">foo!</div>]

text参数

通过 text 参数可以搜搜文档中的字符串内容。与 name 参数的可选值一样,text 参数接受字符串 、正则表达式 、列表、True。

soup.find_all(text="Elsie")
# [u'Elsie']
soup.find_all(text=["Tillie", "Elsie", "Lacie"])
# [u'Elsie', u'Lacie', u'Tillie']
soup.find_all(text=re.compile("Dormouse"))# 模糊查找
[u"The Dormouse's story", u"The Dormouse's story"]

limit参数

find_all() 方法返回全部的搜索结构,如果文档树很大那么搜索会很慢。如果我们不需要全部结果,可以使用 limit 参数限制返回结果的数量。效果与SQL中的limit关键字类似,当搜索到的结果数量达到 limit 的限制时,就停止搜索返回结果。

soup.find_all("a", limit=2)
# [<a class="sister" href="http://example.com/elsie" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" id="link1">Elsie</a>,
# <a class="sister" href="http://example.com/lacie" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" id="link2">Lacie</a>]

recursive参数

调用tag的 find_all() 方法时,BeautifulSoup会检索当前tag的所有子孙节点,如果只想搜索tag的直接子节点,可以使用参数 recursive=False

soup.html.find_all("title")
# [<title>The Dormouse's story</title>]
soup.html.find_all("title", recursive=False)
# []

2.find( name , attrs , recursive , text , **kwargs )

它与 find_all() 方法唯一的区别是 find_all() 方法的返回结果是值包含一个元素的列表,而 find() 方法直接返回结果。

3.find_parents() 和 find_parent()

find_all() 和 find() 只搜索当前节点的所有子节点,孙子节点等。find_parents() 和 find_parent() 用来搜索当前节点的父辈节点,搜索方法与普通tag的搜索方法相同,搜索文档搜索文档包含的内容。

4.find_next_siblings() 和 find_next_sibling()  

这2个方法通过 .next_siblings 属性对当 tag 的所有后面解析的兄弟 tag 节点进行迭代, find_next_siblings() 方法返回所有符合条件的后面的兄弟节点,find_next_sibling() 只返回符合条件的后面的第一个tag节点。

5.find_previous_siblings() 和 find_previous_sibling()

这2个方法通过 .previous_siblings 属性对当前 tag 的前面解析的兄弟 tag 节点进行迭代, find_previous_siblings() 方法返回所有符合条件的前面的兄弟节点,find_previous_sibling() 方法返回第一个符合条件的前面的兄弟节点。

6.find_all_next() 和 find_next()

这2个方法通过 .next_elements 属性对当前 tag 的之后的 tag 和字符串进行迭代, find_all_next() 方法返回所有符合条件的节点, find_next() 方法返回第一个符合条件的节点。

7.find_all_previous() 和 find_previous()

这2个方法通过 .previous_elements 属性对当前节点前面的 tag 和字符串进行迭代,find_all_previous() 方法返回所有符合条件的节点, find_previous()方法返回第一个符合条件的节点。

CSS选择器

我们在写 CSS 时,标签名不加任何修饰,类名前加点,id名前加 #,在这里我们也可以利用类似的方法来筛选元素,用到的方法是 soup.select(),返回类型是 list。

通过标签名查找

print soup.select('title')
#[<title>The Dormouse's story</title>]
print soup.select('a')
#[<a class="sister" href="http://example.com/elsie" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" id="link1"><!-- Elsie --></a>, <a class="sister" href="http://example.com/lacie" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" id="link2">Lacie</a>, <a class="sister" href="http://example.com/tillie" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" id="link3">Tillie</a>]
print soup.select('b')
#[<b>The Dormouse's story</b>]

通过类名查找

print soup.select('.sister')
#[<a class="sister" href="http://example.com/elsie" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" id="link1"><!-- Elsie --></a>, <a class="sister" href="http://example.com/lacie" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" id="link2">Lacie</a>, <a class="sister" href="http://example.com/tillie" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" id="link3">Tillie</a>]

通过 id 名查找

print soup.select('#link1')
#[<a class="sister" href="http://example.com/elsie" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" id="link1"><!-- Elsie --></a>]

组合查找

组合查找即和写 class 文件时,标签名与类名、id名进行的组合原理是一样的,例如查找 p 标签中,id 等于 link1的内容,二者需要用空格分开。

print soup.select('p #link1')
#[<a class="sister" href="http://example.com/elsie" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" id="link1"><!-- Elsie --></a>]

直接子标签查找:

print soup.select("head > title")
#[<title>The Dormouse's story</title>]

属性查找

查找时还可以加入属性元素,属性需要用中括号括起来,注意属性和标签属于同一节点,所以中间不能加空格,否则会无法匹配到。

print soup.select('a[class="sister"]')
#[<a class="sister" href="http://example.com/elsie" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" id="link1"><!-- Elsie --></a>, <a class="sister" href="http://example.com/lacie" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" id="link2">Lacie</a>, <a class="sister" href="http://example.com/tillie" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" id="link3">Tillie</a>]
print soup.select('a[href="http://example.com/elsie" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" ]')
#[<a class="sister" href="http://example.com/elsie" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" id="link1"><!-- Elsie --></a>]

同样,属性仍然可以与上述查找方式组合,不在同一节点的空格隔开,同一节点的不加空格:

print soup.select('p a[href="http://example.com/elsie" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" ]')
#[<a class="sister" href="http://example.com/elsie" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" id="link1"><!-- Elsie --></a>]

以上的 select 方法返回的结果都是列表形式,可以遍历形式输出,然后用 string或get_text() 方法来获取它的内容:

soup = BeautifulSoup(html, 'lxml')
print type(soup.select('title'))
print soup.select('title')[0].get_text()
for title in soup.select('title'):
  print title.get_text()

更多关于Python相关内容可查看本站专题:《Python Socket编程技巧总结》、《Python正则表达式用法总结》、《Python数据结构与算法教程》、《Python函数使用技巧总结》、《Python字符串操作技巧汇总》、《Python入门与进阶经典教程》及《Python文件与目录操作技巧汇总》

希望本文所述对大家Python程序设计有所帮助。

Python 相关文章推荐
Python代码调试的几种方法总结
Apr 15 Python
简单介绍Python中利用生成器实现的并发编程
May 04 Python
Python实现识别手写数字大纲
Jan 29 Python
Python实现简单求解给定整数的质因数算法示例
Mar 25 Python
详解js文件通过python访问数据库方法
Mar 03 Python
Python中查看变量的类型内存地址所占字节的大小
Jun 26 Python
Python3实现zip分卷压缩过程解析
Oct 09 Python
python各层级目录下import方法代码实例
Jan 20 Python
在python中使用pymysql往mysql数据库中插入(insert)数据实例
Mar 02 Python
Python3 获取文件属性的方式(时间、大小等)
Mar 12 Python
python 逐步回归算法
Apr 06 Python
python使用pycharm安装pyqt5以及相关配置
Apr 22 Python
Python HTML解析模块HTMLParser用法分析【爬虫工具】
Apr 05 #Python
Python爬虫实现爬取百度百科词条功能实例
Apr 05 #Python
Python3.5多进程原理与用法实例分析
Apr 05 #Python
Python选择网卡发包及接收数据包
Apr 04 #Python
详解Python的数据库操作(pymysql)
Apr 04 #Python
python dlib人脸识别代码实例
Apr 04 #Python
python图像处理入门(一)
Apr 04 #Python
You might like
php中显示数组与对象的实现代码
2011/04/18 PHP
thinkphp学习笔记之多表查询
2014/07/28 PHP
微信公众平台网页授权获取用户基本信息中授权回调域名设置的变动
2014/10/21 PHP
php删除数组中重复元素的方法
2015/12/22 PHP
PHP下载远程图片的几种方法总结
2017/04/07 PHP
PHP检查网站是否宕机的方法示例
2017/07/24 PHP
PHP实现微信退款的方法示例
2019/03/26 PHP
jquery构造器的实现代码小结
2011/05/16 Javascript
js与jquery中获取当前鼠标的x、y坐标位置的代码
2011/05/23 Javascript
js 获取屏幕各种宽高的方法(浏览器兼容)
2013/05/15 Javascript
仿百度的关键词匹配搜索示例
2013/09/25 Javascript
Extjs4实现两个GridPanel之间数据拖拽功能具体方法
2013/11/21 Javascript
Js实现当前点击a标签变色突出显示其他a标签回复原色
2013/11/27 Javascript
Js表格万条数据瞬间加载实现代码
2014/02/20 Javascript
jQuery+CSS实现简单切换菜单示例
2016/07/27 Javascript
Vue实现商品详情页的评价列表功能
2019/09/04 Javascript
JS实现移动端双指缩放和旋转方法
2019/12/13 Javascript
JS 逻辑判断不要只知道用 if-else 和 switch条件判断(小技巧)
2020/05/27 Javascript
Vue包大小优化的实现(从1.72M到94K)
2021/02/18 Vue.js
[50:02]完美世界DOTA2联赛循环赛 Magma vs IO BO2第一场 11.01
2020/11/02 DOTA
Python处理RSS、ATOM模块FEEDPARSER介绍
2015/02/18 Python
在Python中操作字符串之replace()方法的使用
2015/05/19 Python
python实现神经网络感知器算法
2017/12/20 Python
python实现俄罗斯方块游戏
2020/03/25 Python
在Pycharm中自动添加时间日期作者等信息的方法
2019/01/16 Python
python儿童学游戏编程知识点总结
2019/06/03 Python
pytorch三层全连接层实现手写字母识别方式
2020/01/14 Python
matlab、python中矩阵的互相导入导出方式
2020/06/01 Python
Python析构函数__del__定义原理解析
2020/11/20 Python
ET Mall东森购物网:东森严选
2017/03/06 全球购物
戛纳奢侈品商店:Jacques Loup法国
2019/11/04 全球购物
2014副局长群众路线对照检查材料思想汇报
2014/09/22 职场文书
小平您好观后感
2015/06/09 职场文书
2015年学校远程教育工作总结
2015/07/20 职场文书
Redis Cluster集群动态扩容的实现
2021/07/15 Redis
Python使用Web框架Flask开发项目
2022/06/01 Python