
Key Points When Writing Web Crawlers in Python with the Beautiful Soup Package

Posted in Python on January 20, 2016

1. Make good use of the parent attribute of soup nodes

For example, suppose you have already obtained the following HTML:

<td style="padding-left:0" width="60%"><label>November</label>
<input type="Hidden" id="cboMonth1" name="cboMonth1" value="11">
</td><td style="padding-right:0;" width="40%">
  <label>2012</label>
  <input type="Hidden" id="cboYear1" name="cboYear1" value="2012">
</td>

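Assume this HTML is held in the soup variable eachMonthHeader.

You want to extract from it:

the value of the Month label: November

and the value of the Year label: 2012

The simplest and least laborious approach is to search for label tags directly. The search is sure to find both labels, which correspond to Month and Year respectively, and you can then read the string from each one: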

foundTwoLabel = eachMonthHeader.findAll("label")
print "foundTwoLabel=",foundTwoLabel
monthLabel = foundTwoLabel[0]
yearLabel = foundTwoLabel[1]

monthStr = monthLabel.string
yearStr = yearLabel.string

print "monthStr=",monthStr # monthStr= November
print "yearStr=",yearStr # yearStr= 2012

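This logic is clearly fragile, though: it depends on the position of the labels rather than on what they mean. If you process several such soup variables and the two labels appear in the opposite order in one of them, the result will be wrong.

In that case, consider using the parent attribute of a soup variable, which takes you from a node to the node one level above it.
Sample code: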

# <td style="padding-left:0" width="60%"><label>November</label>
# <input type="Hidden" id="cboMonth1" name="cboMonth1" value="11">
# </td><td style="padding-right:0;" width="40%">
  # <label>2012</label>
  # <input type="Hidden" id="cboYear1" name="cboYear1" value="2012">
# </td>
import re

# Find the hidden month input by its id, then step up to the enclosing <td>
foundCboMonth = eachMonthHeader.find("input", {"id":re.compile(r"cboMonth\d+")})
#print "foundCboMonth=",foundCboMonth
tdMonth = foundCboMonth.parent
#print "tdMonth=",tdMonth
# The label inside that same <td> is the month label
tdMonthLabel = tdMonth.label
#print "tdMonthLabel=",tdMonthLabel
monthStr = tdMonthLabel.string
print "monthStr=",monthStr

# Same steps for the year: input -> parent <td> -> label
foundCboYear = eachMonthHeader.find("input", {"id":re.compile(r"cboYear\d+")})
#print "foundCboYear=",foundCboYear
tdYear = foundCboYear.parent
#print "tdYear=",tdYear
tdYearLabel = tdYear.label
#print "tdYearLabel=",tdYearLabel
yearStr = tdYearLabel.string
print "yearStr=",yearStr
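
To try the parent-based lookup end to end, here is a minimal sketch. It assumes Python 3 and the bs4 package (Beautiful Soup 4) with the built-in "html.parser" backend, not the BeautifulSoup 3 module used elsewhere in this article. The helper name label_for_input is hypothetical, and the two cells are deliberately swapped compared with the HTML above to show that the result no longer depends on their order.

```python
import re
from bs4 import BeautifulSoup

# Same cells as in the article, but with Year placed before Month on purpose.
html = """
<td style="padding-right:0;" width="40%">
  <label>2012</label>
  <input type="Hidden" id="cboYear1" name="cboYear1" value="2012">
</td>
<td style="padding-left:0" width="60%"><label>November</label>
<input type="Hidden" id="cboMonth1" name="cboMonth1" value="11">
</td>
"""

each_month_header = BeautifulSoup(html, "html.parser")


def label_for_input(soup, id_pattern):
    """Return the text of the label that shares a parent with the matching input.

    Hypothetical helper for illustration; returns None if nothing matches.
    """
    found_input = soup.find("input", {"id": re.compile(id_pattern)})
    if found_input is None:
        return None
    # Step up to the enclosing <td>, then take the first <label> inside it.
    label = found_input.parent.label
    if label is None or label.string is None:
        return None
    return label.string.strip()


month_str = label_for_input(each_month_header, r"cboMonth\d+")
year_str = label_for_input(each_month_header, r"cboYear\d+")

print("monthStr=", month_str)  # monthStr= November
print("yearStr=", year_str)    # yearStr= 2012
```

Because each label is reached through the input's id, the positional approach's failure mode (labels arriving in a different order) does not apply here.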

Let's look at another example:

from BeautifulSoup import BeautifulSoup 
doc = ['<html><head><title>Page title</title></head>',
    '<body><p id="firstpara" align="center">This is paragraph <b>one</b>.',
    '<p id="secondpara" align="blah">This is paragraph <b>two</b>.',
    '</html>']
soup = BeautifulSoup(''.join(doc))

print soup.prettify()
# <html>
# <head>
#  <title>
#  Page title
#  </title>
# </head>
# <body>
#  <p id="firstpara" align="center">
#  This is paragraph
#  <b>
#   one
#  </b>
#  .
#  </p>
#  <p id="secondpara" align="blah">
#  This is paragraph
#  <b>
#   two
#  </b>
#  .
#  </p>
# </body>
# </html>

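In this example, the parent of the <HEAD> tag is the <HTML> tag, and the parent of the <HTML> tag is the BeautifulSoup parser object itself. The parent of the parser object is None. Using parent, you can walk up the parse tree toward the root.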

soup.head.parent.name
# u'html'
soup.head.parent.parent.__class__.__name__
# 'BeautifulSoup'
soup.parent == None
# True
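
The same upward walk can be written as a loop that follows parent until it reaches None. This is a sketch assuming Python 3 and bs4 (Beautiful Soup 4) with the "html.parser" backend; the <p> tags are closed explicitly here because that backend does not close them automatically, and the variable names are illustrative.

```python
from bs4 import BeautifulSoup

doc = (
    '<html><head><title>Page title</title></head>'
    '<body><p id="firstpara" align="center">This is paragraph <b>one</b>.</p>'
    '<p id="secondpara" align="blah">This is paragraph <b>two</b>.</p>'
    '</body></html>'
)
soup = BeautifulSoup(doc, "html.parser")

# Start at the first <b> tag and climb until there is no parent left.
node = soup.find("b")
ancestor_names = []
while node.parent is not None:
    node = node.parent
    ancestor_names.append(node.name)

print(ancestor_names)
# ['p', 'body', 'html', '[document]']
```

The last entry, '[document]', is the name bs4 gives to the BeautifulSoup object itself, which sits at the top of the tree and has no parent.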

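2. Specify the character encoding when parsing HTML that is not UTF-8 or ASCII

When the HTML is encoded as ASCII or UTF-8, you can parse it into a soup correctly without specifying its character encoding:

# respHtml is ASCII or UTF-8 encoded here, so the soup is parsed correctly without specifying an encoding
soup = BeautifulSoup(respHtml)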

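When the HTML uses some other encoding, such as GB2312, you need to specify that encoding so that BeautifulSoup can parse it into a soup correctly.

For example:

# respHtml is GB2312 encoded here, so the encoding must be specified for BeautifulSoup to parse it correctly
htmlCharset = "GB2312"
soup = BeautifulSoup(respHtml, fromEncoding=htmlCharset)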
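
For readers on Beautiful Soup 4, the equivalent keyword is spelled from_encoding, not fromEncoding. The sketch below assumes Python 3 and bs4 with the "html.parser" backend; the GB2312 bytes are built in the script itself to stand in for a real HTTP response body, so resp_html and its contents are illustrative only.

```python
from bs4 import BeautifulSoup

# Stand-in for a response body: a small page encoded as GB2312 bytes.
page_text = "<html><head><title>中文标题</title></head><body><p>你好世界</p></body></html>"
resp_html = page_text.encode("gb2312")

# Tell Beautiful Soup which encoding the bytes use so it does not have to guess.
html_charset = "gb2312"
soup = BeautifulSoup(resp_html, "html.parser", from_encoding=html_charset)

title_text = soup.title.string
paragraph_text = soup.p.string

print(title_text)      # 中文标题
print(paragraph_text)  # 你好世界
```

Once parsed, the strings in the soup are ordinary Unicode text, so the rest of the crawler does not need to know what encoding the page was served in.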

- Author -

crifan
