
Building a web scraper in Python with BeautifulSoup

Posted in Python on September 29, 2014

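An earlier post (https://3water.com/article/55789.htm) covered scraping web pages with phantomjs, which relied on selectors.

With the Python module BeautifulSoup (docs: http://www.crummy.com/software/BeautifulSoup/bs4/doc/), extracting content from a web page is straightforward.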

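# coding=utf-8
# Python 3: urlencode and urlopen live in urllib.parse and urllib.request
from urllib.parse import urlencode
from urllib.request import urlopen

from bs4 import BeautifulSoup

url = 'http://www.baidu.com/s'
values = {'wd': '网球'}  # search keyword ("tennis")
encoded_param = urlencode(values)  # percent-encodes the query string
full_url = url + '?' + encoded_param
response = urlopen(full_url)
soup = BeautifulSoup(response, 'html.parser')  # name the parser explicitly
alinks = soup.find_all('a')  # every <a> tag on the page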

The code above fetches Baidu's search results page for "网球" (tennis) and collects every link on it.
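The objects returned by find_all are Tag objects, so the usual next step is to read each link's text and href. The following is a minimal sketch of that step. It parses a small inline HTML sample instead of a live page so that it runs without network access. The sample markup and the helper name extract_links are assumptions made for illustration, not part of the original post.

```python
from bs4 import BeautifulSoup

# Sample markup standing in for a downloaded page (illustrative only).
SAMPLE_HTML = """
<div>
  <a href="http://example.com/one">First</a>
  <a href="http://example.com/two"> Second </a>
  <a name="anchor-without-href">No link</a>
</div>
"""


def extract_links(html):
    """Return (text, href) pairs for every <a> tag that has an href."""
    soup = BeautifulSoup(html, 'html.parser')
    links = []
    # href=True keeps only the tags that actually carry an href attribute.
    for a in soup.find_all('a', href=True):
        links.append((a.get_text(strip=True), a['href']))
    return links


links = extract_links(SAMPLE_HTML)
for text, href in links:
    print(text, href)
```

Filtering with href=True avoids a KeyError on anchors that have no href, which are common on real pages.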

BeautifulSoup ships with many useful built-in methods.

A few of the handier features:

Constructing a node (tag) element

soup = BeautifulSoup('<b class="boldest">Extremely bold</b>', 'html.parser')

tag = soup.b

type(tag)

# <class 'bs4.element.Tag'>

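Attributes are available through .attrs, which returns a dictionary

tag.attrs

# {'class': ['boldest']}   (class is multi-valued, so its value is a list)

A single attribute can also be read by indexing the tag like a dictionary, for example tag['class']. Dotted access such as tag.class does not work: class is a reserved word in Python, and a dotted name on a tag looks up a child tag, not an attribute.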

Attributes can also be modified freely

tag['class'] = 'verybold'
tag['id'] = 1
tag
# <b class="verybold" id="1">Extremely bold</b>

del tag['class']
del tag['id']
tag
# <b>Extremely bold</b>

tag['class']
# KeyError: 'class'
print(tag.get('class'))
# None
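The difference between indexing and .get() matters when scraping real pages, where an attribute may or may not be present. Below is a minimal sketch of defensive attribute access. The markup, the id value x1, and the fallback string 'n/a' are made up for illustration.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<b class="boldest" id="x1">Extremely bold</b>', 'html.parser')
tag = soup.b

# class is a multi-valued attribute, so bs4 returns it as a list.
classes = tag['class']
# Single-valued attributes such as id come back as plain strings.
tag_id = tag['id']

# .get() returns a default instead of raising KeyError for a missing attribute.
missing = tag.get('title', 'n/a')
# has_attr() checks for presence without reading the value.
has_title = tag.has_attr('title')

print(classes, tag_id, missing, has_title)
```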

You can also navigate and search DOM elements freely, as in the following example

1. Build a document

html_doc = """<html><head><title>The Dormouse's story</title></head>
<body>
<p><b>The Dormouse's story</b></p>

<p>Once upon a time there were three little sisters; and their names were
<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p>...</p>
"""

from bs4 import BeautifulSoup
soup = BeautifulSoup(html_doc, 'html.parser')

2. Explore it in various ways

soup.head
# <head><title>The Dormouse's story</title></head>
soup.title
# <title>The Dormouse's story</title>
soup.body.b
# <b>The Dormouse's story</b>
soup.a
# <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>
soup.find_all('a')
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
# <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
# <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
head_tag = soup.head
head_tag
# <head><title>The Dormouse's story</title></head>

head_tag.contents
# [<title>The Dormouse's story</title>]

title_tag = head_tag.contents[0]
title_tag
# <title>The Dormouse's story</title>
title_tag.contents
# ["The Dormouse's story"]
len(soup.contents)
# 1
soup.contents[0].name
# 'html'
text = title_tag.contents[0]
text.contents
# AttributeError: a string (NavigableString) cannot contain anything, so it has no .contents

for child in title_tag.children:
    print(child)
# The Dormouse's story
head_tag.contents
# [<title>The Dormouse's story</title>]
for child in head_tag.descendants:
    print(child)
# <title>The Dormouse's story</title>
# The Dormouse's story

len(list(soup.children))
# 1
len(list(soup.descendants))
# 26 with html.parser (the count depends on the parser and on whitespace in the document)
title_tag.string
# "The Dormouse's story"
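The navigation attributes used above differ in how deep they go: .children yields only direct children, .descendants walks the whole subtree in document order, and .string is set only when a tag has exactly one child. The sketch below separates tags from text nodes while walking .descendants. The compact one-line document is an assumption, chosen so that no whitespace-only text nodes appear.

```python
from bs4 import BeautifulSoup
from bs4.element import NavigableString, Tag

# A compact document with no whitespace between tags (illustrative only).
html = ("<html><head><title>The Dormouse's story</title></head>"
        "<body><p><b>Hi</b> there</p></body></html>")
soup = BeautifulSoup(html, 'html.parser')

# .descendants yields tags and strings alike; isinstance tells them apart.
tag_names = [node.name for node in soup.descendants if isinstance(node, Tag)]
strings = [str(node) for node in soup.descendants
           if isinstance(node, NavigableString)]

# .children stops at the first level below <html>.
direct_children = [child.name for child in soup.html.children]

# .string is None when a tag has more than one child, as <p> does here.
b_string = soup.b.string
p_string = soup.p.string

print(tag_names)
print(strings)
print(direct_children)
```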


- Author -

mdxy-dxy
