编程 Python

python爬虫开发之urllib模块详细使用方法与实例全解

Posted in Python onMarch 09, 2020

爬虫所需要的功能，基本上在urllib中都能找到，学习这个标准库，可以更加深入的理解后面更加便利的requests库。

首先

在Pytho2.x中使用import urllib2——-对应的，在Python3.x中会使用import urllib.request，urllib.error

在Pytho2.x中使用import urllib——-对应的，在Python3.x中会使用import urllib.request，urllib.error，urllib.parse

在Pytho2.x中使用import urlparse——-对应的，在Python3.x中会使用import urllib.parse

在Pytho2.x中使用import urlopen——-对应的，在Python3.x中会使用import urllib.request.urlopen

在Pytho2.x中使用import urlencode——-对应的，在Python3.x中会使用import urllib.parse.urlencode

在Pytho2.x中使用import urllib.quote——-对应的，在Python3.x中会使用import urllib.request.quote

在Pytho2.x中使用cookielib.CookieJar——-对应的，在Python3.x中会使用http.CookieJar

在Pytho2.x中使用urllib2.Request——-对应的，在Python3.x中会使用urllib.request.Request

urllib是Python自带的标准库，无需安装，直接可以用。

urllib模块提供了如下功能：

网页请求(urllib.request)
URL解析(urllib.parse)
代理和cookie设置
异常处理(urllib.error)
robots.txt解析模块(urllib.robotparser)

urllib包中urllib.request模块

1、urllib.request.urlopen

urlopen一般常用的有三个参数，它的参数如下：

r = urllib.requeset.urlopen(url,data,timeout)

url：链接格式：协议://主机名:[端口]/路径

data：附加参数必须是字节流编码格式的内容(bytes类型)，可通过bytes()函数转化，如果要传递这个参数，请求方式就不再是GET方式请求，而是POST方式

timeout: 超时单位为秒

get请求

import urllib
r = urllib.urlopen('//3water.com/')
datatLine = r.readline() #读取html页面的第一行
data=file.read() #读取全部
f=open("./1.html","wb") # 网页保存在本地
f.write(data)
f.close()

urlopen返回对象提供方法：

read() , readline() ,readlines() , fileno() , close() ：这些方法的使用方式与文件对象完全一样 info()：返回一个httplib.HTTPMessage对象，表示远程服务器返回的头信息 getcode()：返回Http状态码。如果是http请求，200请求成功完成;404网址未找到 geturl()：返回请求的url

urllib.quote(url)和urllib.quote_plus(url)，对关键字进行编码可使得urlopen能够识别

POST请求

import urllib.request
import urllib.parse
url = 'https://passport.3water.com/user/signin?'
post = {
'username': 'xxx',
'password': 'xxxx'
}
postdata = urllib.parse.urlencode(post).encode('utf-8')
req = urllib.request.Request(url, postdata)
r = urllib.request.urlopen(req)

我们在进行注册、登录等操作时，会通过POST表单传递信息

这时，我们需要分析页面结构，构建表单数据post，使用urlencode()进行编码处理，返回字符串，再指定'utf-8'的编码格式，这是因为POSTdata只能是bytes或者file object。最后通过Request()对象传递postdata，使用urlopen()发送请求。

2、urllib.request.Request

urlopen()方法可以实现最基本请求的发起，但这几个简单的参数并不足以构建一个完整的请求，如果请求中需要加入headers（请求头）等信息模拟浏览器，我们就可以利用更强大的Request类来构建一个请求。

import urllib.request
import urllib.parse
url = 'https://passport.3water.com/user/signin?'
post = {
'username': 'xxx',
'password': 'xxxx'
}
postdata = urllib.parse.urlencode(post).encode('utf-8')
req = urllib.request.Request(url, postdata)
r = urllib.request.urlopen(req)

3、urllib.request.BaseHandler

在上面的过程中，我们虽然可以构造Request ，但是一些更高级的操作，比如 Cookies处理，代理该怎样来设置？

接下来就需要更强大的工具 Handler 登场了基本的urlopen()函数不支持验证、cookie、代理或其他HTTP高级功能。要支持这些功能，必须使用build_opener()函数来创建自己的自定义opener对象。

首先介绍下 urllib.request.BaseHandler ，它是所有其他 Handler 的父类，它提供了最基本的 Handler 的方法。

HTTPDefaultErrorHandler 用于处理HTTP响应错误，错误都会抛出 HTTPError 类型的异常。

HTTPRedirectHandler 用于处理重定向

HTTPCookieProcessor 用于处理 Cookie 。

ProxyHandler 用于设置代理，默认代理为空。

HTTPPasswordMgr用于管理密码，它维护了用户名密码的表。

HTTPBasicAuthHandler 用于管理认证，如果一个链接打开时需要认证，那么可以用它来解决认证问题。

代理服务器设置

def use_proxy(proxy_addr,url):
import urllib.request
#构建代理
proxy=urllib.request.ProxyHandler({'http':proxy_addr})
# 构建opener对象
opener=urllib.request.build_opener(proxy,urllib.request.HTTPHandler)
# 安装到全局
# urllib.request.install_opener(opener)
# data=urllib.request.urlopen(url).read().decode('utf8') 以全局方式打开
data=opener.open(url) # 直接用句柄方式打开
return data
proxy_addr='61.163.39.70:9999'
data=use_proxy(proxy_addr,'//3water.com')
print(len(data))
## 异常处理以及日输出

opener通常是build_opener()创建的opener对象。

install_opener(opener) 安装opener作为urlopen()使用的全局URL opener

cookie的使用

获取Cookie保存到变量

import http.cookiejar, urllib.request
#使用http.cookiejar.CookieJar()创建CookieJar对象
cookie = http.cookiejar.CookieJar()
handler = urllib.request.HTTPCookieProcessor(cookie)
#使用HTTPCookieProcessor创建cookie处理器，并以其为参数构建opener对象
opener = urllib.request.build_opener(handler)
#将opener安装为全局
urllib.request.install_opener(opener)
response = urllib.request.urlopen('//3water.com')
#response = opener.open('//3water.com')
for item in cookie:
print 'Name = '+item.name
print 'Value = '+item.value

首先我们必须声明一个 CookieJar 对象，接下来我们就需要利用 HTTPCookieProcessor 来构建一个 handler ，最后利用 build_opener 方法构建出 opener ，执行 open() 即可。最后循环输出cookiejar

获取Cookie保存到本地

import cookielib
import urllib
#设置保存cookie的文件，同级目录下的cookie.txt
filename = 'cookie.txt'
#声明一个MozillaCookieJar对象实例来保存cookie，之后写入文件
cookie = cookielib.MozillaCookieJar(filename)
#利用urllib库的HTTPCookieProcessor对象来创建cookie处理器
handler = urllib.request.HTTPCookieProcessor(cookie)
#通过handler来构建opener
opener = urllib.request.build_opener(handler)
#创建一个请求，原理同urllib2的urlopen
response = opener.open("//3water.com")
#保存cookie到文件
cookie.save(ignore_discard=True, ignore_expires=True)

异常处理

异常处理结构如下

try:
# 要执行的代码
print(...)
except:
#try代码块里的代码如果抛出异常了，该执行什么内容
print(...)
else:
#try代码块里的代码如果没有跑出异常，就执行这里
print(...)
finally:
#不管如何，finally里的代码，是总会执行的
print(...)

URLerror产生原因：

1、网络未连接（即不能上网）

from urllib import request, error
try:
r=request.urlopen('//3water.com')
except error.URLError as e:
print(e.reason)

2、访问页面不存(HTTPError)

客户端向服务器发送请求，如果成功地获得请求的资源，则返回的状态码为200，表示响应成功。如果请求的资源不存在，则通常返回404错误。

from urllib imort request, error
try:
response = request.urlopen('//3water.com')
except error.HTTPError as e:
print(e.reason, e.code, e.headers, sep='\n')
else:
print("Request Successfully')
# 加入 hasattr属性提前对属性,进行判断原因
from urllib import request,error
try:
response=request.urlopen('http://blog.3water.com')
except error.HTTPError as e:
if hasattr(e,'code'):
print('the server couldn\'t fulfill the request')
print('Error code:',e.code)
elif hasattr(e,'reason'):
print('we failed to reach a server')
print('Reason:',e.reason)
else:
print('no exception was raised')
# everything is ok

下面为大家列出几个urllib模块很有代表性的实例

1、引入urllib模块

import urllib.request
response = urllib.request.urlopen('http://3water.com/')
html = response.read()

2、使用 Request

import urllib.request
req = urllib.request.Request('http://3water.com/')
response = urllib.request.urlopen(req)
the_page = response.read()

3、发送数据

#! /usr/bin/env python3
import urllib.parse
import urllib.request
url = 'http://localhost/login.php'
user_agent = 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)'
values = {
'act' : 'login',
'login[email]' : 'yzhang@i9i8.com',
'login[password]' : '123456'
}
data = urllib.parse.urlencode(values)
req = urllib.request.Request(url, data)
req.add_header('Referer', '//3water.com/')
response = urllib.request.urlopen(req)
the_page = response.read()
print(the_page.decode("utf8"))

4、发送数据和header

#! /usr/bin/env python3
import urllib.parse
import urllib.request
url = 'http://localhost/login.php'
user_agent = 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)'
values = {
'act' : 'login',
'login[email]' : 'yzhang@i9i8.com',
'login[password]' : '123456'
}
headers = { 'User-Agent' : user_agent }
data = urllib.parse.urlencode(values)
req = urllib.request.Request(url, data, headers)
response = urllib.request.urlopen(req)
the_page = response.read()
print(the_page.decode("utf8"))

5、http 错误

#! /usr/bin/env python3
import urllib.request
req = urllib.request.Request('//3water.com ')
try:
urllib.request.urlopen(req)
except urllib.error.HTTPError as e:
print(e.code)
print(e.read().decode("utf8"))

6、异常处理

#! /usr/bin/env python3
from urllib.request import Request, urlopen
from urllib.error import URLError, HTTPError
req = Request("//3water.com /")
try:
response = urlopen(req)
except HTTPError as e:
print('The server couldn't fulfill the request.')
print('Error code: ', e.code)
except URLError as e:
print('We failed to reach a server.')
print('Reason: ', e.reason)
else:
print("good!")
print(response.read().decode("utf8"))

7、异常处理

from urllib.request import Request, urlopen
from urllib.error import URLError
req = Request("//3water.com /")
try:
response = urlopen(req)
except URLError as e:
if hasattr(e, 'reason'):
print('We failed to reach a server.')
print('Reason: ', e.reason)
elif hasattr(e, 'code'):
print('The server couldn't fulfill the request.')
print('Error code: ', e.code)
else:
print("good!")
print(response.read().decode("utf8"))

8、HTTP 认证

#! /usr/bin/env python3
import urllib.request
# create a password manager
password_mgr = urllib.request.HTTPPasswordMgrWithDefaultRealm()
# Add the username and password.
# If we knew the realm, we could use it instead of None.
top_level_url = "https://3water.com /"
password_mgr.add_password(None, top_level_url, 'rekfan', 'xxxxxx')
handler = urllib.request.HTTPBasicAuthHandler(password_mgr)
# create "opener" (OpenerDirector instance)
opener = urllib.request.build_opener(handler)
# use the opener to fetch a URL
a_url = "https://3water.com /"
x = opener.open(a_url)
print(x.read())
# Install the opener.
# Now all calls to urllib.request.urlopen use our opener.
urllib.request.install_opener(opener)
a = urllib.request.urlopen(a_url).read().decode('utf8')
print(a)

9、使用代理

#! /usr/bin/env python3
import urllib.request
proxy_support = urllib.request.ProxyHandler({'sock5': 'localhost:1080'})
opener = urllib.request.build_opener(proxy_support)
urllib.request.install_opener(opener)
a = urllib.request.urlopen("//3water.com ").read().decode("utf8")
print(a)

10、超时

#! /usr/bin/env python3
import socket
import urllib.request
# timeout in seconds
timeout = 2
socket.setdefaulttimeout(timeout)
# this call to urllib.request.urlopen now uses the default timeout
# we have set in the socket module
req = urllib.request.Request('//3water.com /')
a = urllib.request.urlopen(req).read()
print(a)

11.自己创建build_opener

header=[('User-Agent','Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/57.0.2987.133 Safari/537.36')]
#创建opener对象
opener=urllib.request.build_opener()
opener.addheaders=header
#设置opener对象作为urlopen()使用的全局opener
urllib.request.install_opener(opener)
response =urllib.request.urlopen('//3water.com/')
buff = response.read()
html = buff .decode("utf8")
response.close()
print(the_page)

12.urlib.resquest.urlretrieve远程下载

header=[('User-Agent','Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/57.0.2987.133 Safari/537.36')]
#创建opener对象
opener=urllib.request.build_opener()
opener.addheaders=header
#设置opener对象作为urlopen()使用的全局opener
urllib.request.install_opener(opener)
#下载文件到当前文件夹
urllib.request.urlretrieve('//3water.com/','baidu.html')
#清除urlretrieve产生的缓存
urlib.resquest.urlcleanup()

13.post请求

import urllib.request
import urllib.parse
url='//3water.com/mypost/'
#将数据使用urlencode编码处理后，使用encode()设置为utf-8编码
postdata=urllib.parse.urlencode({name:'测试名',pass:"123456"}).encode('utf-8')
#urllib.request.quote()接受字符串，
#urllib.parse.urlencode()接受字典或者列表中的二元组[(a,b),(c,d)],将URL中的键值对以连接符&划分
req=urllib.request.Request(url,postdata)
#urllib.request.Request(url, data=None, header={}, origin_req_host=None, unverifiable=False, #method=None)
#url：包含URL的字符串。
#data：http request中使用，如果指定，则发送POST而不是GET请求。
#header：是一个字典。
#后两个参数与第三方cookie有关。
req.add_header('user-agent','User-Agent', 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/
537.36 (KHTML, like Gecko) Chrome/38.0.2125.122 Safari/537.36 SE 2.X MetaSr 1.0')
data=urllib.request.urlopen(req).read()
//urlopen（）的data参数默认为None，当data参数不为空的时候，urlopen（）提交方式为Post。

14.cookie的使用

1.获取Cookie保存到变量

import urllib.request
 import http.cookie
 # 声明一个CookieJar对象实例来保存cookie
 cookie = cookielib.CookieJar()
 # 利用urllib库的HTTPCookieProcessor对象来创建cookie处理器
 handler = urllib.request.HTTPCookieProcessor(cookie)
 # 通过handler来构建opener
 opener = urllib.request.build_opener(handler)
 # 此处的open方法同urllib.request的urlopen方法，也可以传入request
 urllib.request.install_opener(opener)
 #使用opener或者urlretrieve方法来获取需要的网站cookie
 urllib.request.urlretrieve('//3water.com/','baidu.html')
 # data=urllib.request.urlopen('//3water.com/')

2.保存cookies到文件

import http.cookie
import urllib.request
# 设置保存cookie的文件，同级目录下的cookie.txt
filename = 'cookie.txt'
# 声明一个MozillaCookieJar对象实例来保存cookie，之后写入文件
cookie = http.cookie.MozillaCookieJar(filename)
# 利用urllib库的HTTPCookieProcessor对象来创建cookie处理器
handler = urllib.request.HTTPCookieProcessor(cookie)
# 通过handler来构建opener
opener = urllib.request.build_opener(handler)
# 创建一个请求，原理同urllib的urlopen
response = opener.open("//3water.com")
# 保存cookie到文件
cookie.save(ignore_discard=True, ignore_expires=True)

3.从文件中获取cookies并访问

import http.cookielib
import urllib.request
# 创建MozillaCookieJar实例对象
cookie = http.cookie.MozillaCookieJar()
# 从文件中读取cookie内容到变量
cookie.load('cookie.txt', ignore_discard=True, ignore_expires=True)
# 创建请求的request
req = urllib.Request("//3water.com")
# 利用urllib的build_opener方法创建一个opener
opener = urllib.build_opener(urllib.request.HTTPCookieProcessor(cookie))
response = opener.open(req)
print (response.read())

15.代理服务器设置

import socket
#设置Socket连接超时时间,同时决定了urlopen的超时时间
socket.setdefaulttimeout(1)
import urllib.request
#代理服务器信息，http代理使用地址
startime = time.time()
#设置http和https代理
proxy=request.ProxyHandler({'https':'175.155.25.91:808','http':'175.155.25.91:808'})
opener=request.build_opener(proxy)
opener.addheaders=[('User-Agent','Mozilla/5.0 (Windows NT 10.0; WOW64; rv:53.0) Gecko/20100101 Firefox/53.0'),
# ("Accept","text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8"),
# ("Accept-Language", "zh-CN,zh;q=0.8,en-US;q=0.5,en;q=0.3"),
# ("Accept-Encoding","gzip, deflate, br"),
# ("Connection","keep-alive"),
# ("Pragma","no-cache"),
# ("Cache-Control","no-cache")
  ]
request.install_opener(opener)
# data = request.urlopen('https://3water.com/find-ip-address').read()
data = request.urlopen( 'http://www.ipip.net/' ).read().decode('utf-8')
# data=gzip.decompress(data).decode('utf-8','ignore')
endtime = time.time()
delay = endtime-startime
print(data)

有时在urlopen的data数据直接decode(‘utf-8')会失败，必须要使用gzip.decompress(‘utf-8','ignore')才能打开，猜测应该是header的问题，换一个有时会好

本文主要讲解了python爬虫模块urllib详细使用方法与实例全解，更多关于python爬虫模块urllib详细使用方法与实例请查看下面的相关链接

python爬虫开发之urllib模块详细使用方法与实例全解

- Author -

jia666666

声明：登载此文出于传递更多信息之目的，并不意味着赞同其观点或证实其描述。

Python 相关文章推荐

python实现simhash算法实例

Apr 25 Python

简单的抓取淘宝图片的Python爬虫

Dec 25 Python

Python实现批量下载文件

May 17 Python

详解Python编程中time模块的使用

Nov 20 Python

Python中的字符串操作和编码Unicode详解

Jan 18 Python

解决Python中定时任务线程无法自动退出的问题

Feb 18 Python

Python 等分切分数据及规则命名的实例代码

Aug 16 Python

Python3.7安装keras和TensorFlow的教程图解

Jun 18 Python

Jupyter Notebook 文件默认目录的查看以及更改步骤

Apr 14 Python

浅析Python 责任链设计模式

Sep 11 Python

用python监控服务器的cpu,磁盘空间,内存,超过邮件报警

Jan 29 Python

Python OpenCV 图像平移的实现示例

Jun 04 Python

在Python IDLE 下调用anaconda中的库教程

Mar 09 #Python

python shell命令行中import多层目录下的模块操作

Mar 09 #Python

使用Python获取当前工作目录和执行命令的位置

Mar 09 #Python

python爬虫开发之Request模块从安装到详细使用方法与实例全解

Mar 09 #Python

Python如何存储数据到json文件

Mar 09 #Python

找Python安装目录,设置环境路径以及在命令行运行python脚本实例

Mar 09 #Python

Python运行异常管理解决方案

Mar 09 #Python