Scraping Web Page Data with Python: Worked Examples

Posted in Python on July 08, 2018

I. Open a website with webbrowser.open():

>>> import webbrowser 
>>> webbrowser.open('http://i.firefoxchina.cn/?from=worldindex') 
True

Example: open a web page from a script.

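By convention, the first line of a Python script is a #! python line (the "shebang"), which tells the computer to run the program with Python. (I tried the script without this line and it still worked, so it is a convention rather than a requirement: the line matters when the script is launched directly, for example through the Windows py launcher, and is not needed when you start the script as python map.py.)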

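1. Read the command-line arguments from sys.argv: open a new file editor window, enter the code below, and save it as map.py.

2. Read the clipboard contents (pyperclip is a third-party module; install it with pip install pyperclip):

3. Call webbrowser.open() to open the address in the external browser: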

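#! python3
# map.py - open Baidu Maps for an address taken from the command line or the clipboard.
import webbrowser, sys, pyperclip
if len(sys.argv) > 1:
    # Address given on the command line: join the words with spaces.
    mapAddress = ' '.join(sys.argv[1:])
else:
    # No arguments: fall back to the clipboard contents.
    mapAddress = pyperclip.paste()
webbrowser.open('http://map.baidu.com/?newmap=1&ie=utf-8&s=s%26wd%3D' + mapAddress)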

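Note: sys.argv is a list of strings. Its first item is the script name, so sys.argv[1:] holds only the arguments typed after it, and passing that list to the join() method returns a single string.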

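Now select the text '天安门广场' (Tiananmen Square), copy it, and double-click the program on your desktop. Alternatively, locate the program on the command line and type the place name after it.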
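The script appends the address to the URL unchanged, which can break when the address contains spaces or characters such as & or #. As a hedged sketch, the address could be percent-encoded first with urllib.parse.quote from the standard library. build_map_url is a hypothetical helper, not part of the original script, and whether Baidu Maps expects exactly this encoding is not verified here.

```python
from urllib.parse import quote

# Prefix copied from the script above; the address is appended to it.
BASE_URL = 'http://map.baidu.com/?newmap=1&ie=utf-8&s=s%26wd%3D'


def build_map_url(words):
    """Join the words of an address and percent-encode the result.

    Hypothetical helper for illustration only.
    """
    address = ' '.join(words)
    # safe='' also encodes '/', so the whole address is treated as one value.
    return BASE_URL + quote(address, safe='')


print(build_map_url(['Tiananmen', 'Square']))
```

In map.py, the result of build_map_url(sys.argv[1:]) would be passed to webbrowser.open() in place of the concatenated string.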

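II. Download files from the web with the requests module: requests does not ship with Python; install it from the command line with pip install requests. Without a reliable connection to the package index the install often fails, in which case the package can be installed manually.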

>>> import requests
>>> res = requests.get('http://i.firefoxchina.cn/?from=worldindex')  # pass a URL to get()
>>> type(res)  # the response object
<class 'requests.models.Response'>
>>> print(res.status_code)  # the HTTP status code
200
>>> res.text  # the response body as text

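requests offers many other ways to inspect downloaded content; I will explain them in later posts as they come up rather than list them all here. While downloading, call raise_for_status() to make sure the download really succeeded before the program goes on to do anything else: it raises an exception if the server returned an error status.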

import requests 
res = requests.get('http://i.firefoxchina.cn/?from=worldindex') 
try: 
 res.raise_for_status() 
except Exception as exc: 
 print('There was a problem: %s' % (exc))
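A minimal sketch of how the check above might be wrapped into a reusable function. fetch and its timeout default are hypothetical, chosen for illustration. Catching requests.exceptions.RequestException instead of Exception is a swapped-in choice that limits the handler to errors raised by requests itself (connection failures, timeouts, invalid URLs, and the HTTPError raised by raise_for_status()).

```python
import requests


def fetch(url, timeout=10):
    """Return the Response for url, or None if the download failed.

    Hypothetical helper; the 10-second timeout is an arbitrary assumption.
    """
    try:
        res = requests.get(url, timeout=timeout)
        res.raise_for_status()  # raises HTTPError for 4xx and 5xx responses
    except requests.exceptions.RequestException as exc:
        print('There was a problem: %s' % exc)
        return None
    return res
```

A caller would then write res = fetch('http://i.firefoxchina.cn/?from=worldindex') and continue only if res is not None.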

III. Save the downloaded file to disk:

>>> import requests
>>> res = requests.get('http://tech.firefox.sina.com/17/0820/10/6DKQALVRW5JHGE1I.html##0-tsina-1-13074-397232819ff9a47a7b7e80a40613cfe1')
>>> res.raise_for_status()
>>> file = open('1.txt', 'wb')  # write-binary mode, to preserve the Unicode encoding of the text
>>> for word in res.iter_content(100000):  # each iteration returns a chunk of bytes; the argument sets how many bytes a chunk may hold
    file.write(word)

16997
>>> file.close()
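The same loop can be written with a with statement so that the file is closed even if an error interrupts the download. The sketch below is an illustration under assumptions: save_chunks is a hypothetical helper that accepts any iterable of bytes objects, and the demo feeds it two made-up chunks written to a temporary directory instead of a real download.

```python
import os
import tempfile


def save_chunks(chunks, path):
    """Write an iterable of bytes objects to path and return the byte count.

    Hypothetical helper for illustration only.
    """
    total = 0
    with open(path, 'wb') as f:  # closed automatically, even on error
        for chunk in chunks:
            total += f.write(chunk)  # write() returns the number of bytes written
    return total


# Demo with made-up data in a temporary directory.
demo_path = os.path.join(tempfile.mkdtemp(), 'demo.bin')
written = save_chunks([b'<html>', b'</html>'], demo_path)
print(written)
```

With a real response, the call would be save_chunks(res.iter_content(100000), '1.txt').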

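IV. Parse HTML with the BeautifulSoup module: install it from the command line with pip install beautifulsoup4.
1. bs4.BeautifulSoup() can parse the HTML text of a page downloaded with requests.get(), and it can also parse an HTML file saved locally: just open() the file and pass it in.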

>>> import requests, bs4 
>>> res = requests.get('http://i.firefoxchina.cn/?from=worldindex') 
>>> res.raise_for_status() 
>>> soup = bs4.BeautifulSoup(res.text) 
 
Warning (from warnings module): 
 File "C:\Users\King\AppData\Local\Programs\Python\Python36-32\lib\site-packages\beautifulsoup4-4.6.0-py3.6.egg\bs4\__init__.py", line 181 
 markup_type=markup_type)) 
UserWarning: No parser was explicitly specified, so I'm using the best available HTML parser for this system ("html.parser"). This usually isn't a problem, but if you run this code on another system, or in a different virtual environment, it may use a different parser and behave differently. 
 
The code that caused this warning is on line 1 of the file <string>. To get rid of this warning, change code that looks like this: 
 
 BeautifulSoup(YOUR_MARKUP}) 
 
to this: 
 
 BeautifulSoup(YOUR_MARKUP, "html.parser") 
 
>>> soup = bs4.BeautifulSoup(res.text, 'html.parser') 
>>> type(soup) 
<class 'bs4.BeautifulSoup'>

The first call produced the warning shown above, so I passed 'html.parser' as the second argument.

>>> import bs4
>>> html = open('C:\\Users\\King\\Desktop\\1.htm')
>>> exampleSoup = bs4.BeautifulSoup(html, 'html.parser')  # parse once: a file object can only be read through a single time
>>> type(exampleSoup)
<class 'bs4.BeautifulSoup'>

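2. Find elements with the select() method: pass a string containing a CSS selector to retrieve the matching elements of the web page, for example:
soup.select('div'): all <div> elements;

soup.select('#author'): the element whose id attribute is author;

soup.select('.notice'): all elements whose CSS class attribute is notice;

soup.select('div span'): all <span> elements inside a <div> element;

soup.select('input[name]'): all <input> elements that have a name attribute, whatever its value;

soup.select('input[type="button"]'): all <input> elements whose type attribute has the value button.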

For more selector patterns and the available parsers, see the Beautiful Soup documentation.
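The selectors listed above can be tried out without downloading anything. The sketch below parses a small HTML snippet that is made up for this illustration (it is not taken from any real page) and runs each selector against it.

```python
import bs4

# Made-up sample markup, used only to exercise the selectors.
HTML = '''
<html><body>
<div id="author"><span>Example Author</span></div>
<p class="notice">Read the terms first.</p>
<form>
<input name="q" type="text"/>
<input type="button" value="Go"/>
</form>
</body></html>
'''

soup = bs4.BeautifulSoup(HTML, 'html.parser')

divs = soup.select('div')                      # every <div> element
author = soup.select('#author')                # the element with id="author"
notices = soup.select('.notice')               # elements with class="notice"
spans = soup.select('div span')                # <span> elements inside a <div>
named = soup.select('input[name]')             # <input> elements with a name attribute
buttons = soup.select('input[type="button"]')  # <input> elements with type="button"

print(len(divs), len(named), len(buttons))
print(author[0].get_text())
print(notices[0].get_text())
```

select() always returns a list, so an empty list (as with '#author' on the page in the next transcript) simply means that nothing matched.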

>>> import requests, bs4 
>>> res = requests.get('http://i.firefoxchina.cn/?from=worldindex') 
>>> res.raise_for_status() 
>>> soup = bs4.BeautifulSoup(res.text, 'html.parser') 
>>> author = soup.select('#author') 
>>> print(author) 
[] 
>>> type(author) 
<class 'list'> 
>>> link = soup.select('link')
>>> print(link)
[<link href="css/mozMainStyle-min.css?v=20170705" rel="stylesheet" type="text/css"/>, <link href="" id="moz-skin" rel="stylesheet" type="text/css"/>, <link href="" id="moz-dir" rel="stylesheet" type="text/css"/>, <link href="" id="moz-ver" rel="stylesheet" type="text/css"/>]
>>> type(link) 
<class 'list'> 
>>> len(link) 
4 
>>> type(link[0]) 
<class 'bs4.element.Tag'> 
>>> link[0] 
<link href="css/mozMainStyle-min.css?v=20170705" rel="stylesheet" type="text/css"/>
>>> link[0].attrs 
{'rel': ['stylesheet'], 'type': 'text/css', 'href': 'css/mozMainStyle-min.css?v=20170705'}

3. Read data from an element's attributes: this continues from the code above.

>>> link[0].get('href') 
'css/mozMainStyle-min.css?v=20170705'
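The href returned here is relative to the page it came from, so it cannot be passed to requests.get() as it stands. One possible way to turn it into a full URL, shown as a sketch, is urljoin from the standard library. PAGE_URL is assumed to be the address that was downloaded in the transcript above.

```python
from urllib.parse import urljoin

# Assumed to be the page the <link> element was read from.
PAGE_URL = 'http://i.firefoxchina.cn/?from=worldindex'
href = 'css/mozMainStyle-min.css?v=20170705'

# Resolve the relative href against the address of the page.
absolute = urljoin(PAGE_URL, href)
print(absolute)
```

Note that get() returns None when an element does not have the requested attribute, so a crawler would check for that before joining.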

The methods above are a first look at writing a web crawler.

- Author -

hzp666
