N Ways to Fetch Web Resources with Python 3

Posted in Python on May 02, 2017

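Over the past couple of days I have been learning how to fetch web resources with Python 3 and found quite a few ways to do it, so here are some short notes.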

1. The simplest way

import urllib.request
response = urllib.request.urlopen('http://python.org/')
html = response.read()
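
In practice it also helps to close the response and to decode the returned bytes. The following is a minimal sketch, assuming a hypothetical helper named `fetch_text` (not part of the original notes) that falls back to UTF-8 when no charset is declared. The `data:` URL is used only so the example runs without network access.

```python
import urllib.request


def fetch_text(url, default_charset="utf-8"):
    """Fetch a URL and return its body as str (hypothetical helper)."""
    # The with-block closes the response even if read() raises.
    with urllib.request.urlopen(url) as response:
        raw = response.read()  # bytes
        # Use the charset from the Content-Type header when present.
        charset = response.headers.get_content_charset() or default_charset
    return raw.decode(charset)


# urllib also understands data: URLs, which makes an offline demo possible.
print(fetch_text("data:text/plain;charset=utf-8,hello"))  # hello
```

For a real page, the same call would be `fetch_text('http://python.org/')`.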

2. Using Request

import urllib.request
 
req = urllib.request.Request('http://python.org/')
response = urllib.request.urlopen(req)
the_page = response.read()
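
A `Request` object can be inspected before it is sent, which is useful when debugging. This short sketch only builds the object and reads its attributes; nothing goes over the network. The `method="HEAD"` argument is an illustrative choice, not something the original snippet uses.

```python
import urllib.request

req = urllib.request.Request("http://python.org/", method="HEAD")

# None of these calls touch the network; they only describe the request.
print(req.full_url)      # http://python.org/
print(req.host)          # python.org
print(req.get_method())  # HEAD
```

Without the `method` argument, `get_method()` returns `GET` when there is no body and `POST` when `data` is supplied.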

3. Sending data

#! /usr/bin/env python3
 
import urllib.parse
import urllib.request
 
url = 'http://localhost/login.php'
values = {
    'act': 'login',
    'login[email]': 'yzhang@i9i8.com',
    'login[password]': '123456',
}

# In Python 3 the request body must be bytes, so encode the urlencoded string.
data = urllib.parse.urlencode(values).encode('utf-8')
# Passing data turns the request into a POST.
req = urllib.request.Request(url, data)
req.add_header('Referer', 'http://www.python.org/')
response = urllib.request.urlopen(req)
the_page = response.read()
 
print(the_page.decode("utf8"))
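
`urllib.parse.urlencode` returns a `str`, while `Request` expects the body as `bytes`. The sketch below shows the two steps separately and confirms that supplying a body switches the method to POST. The form values are placeholders, and the endpoint is never contacted.

```python
import urllib.parse
import urllib.request

# Placeholder form fields; the endpoint is never contacted in this sketch.
values = {"act": "login", "login[email]": "user@example.com"}

encoded = urllib.parse.urlencode(values)  # str, percent-encoded
body = encoded.encode("utf-8")            # bytes, as Request expects

req = urllib.request.Request("http://localhost/login.php", data=body)

print(encoded)           # act=login&login%5Bemail%5D=user%40example.com
print(req.get_method())  # POST
```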

4. Sending data and headers

#! /usr/bin/env python3
 
import urllib.parse
import urllib.request
 
url = 'http://localhost/login.php'
user_agent = 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)'
values = {
    'act': 'login',
    'login[email]': 'yzhang@i9i8.com',
    'login[password]': '123456',
}
headers = {'User-Agent': user_agent}

# In Python 3 the request body must be bytes, so encode the urlencoded string.
data = urllib.parse.urlencode(values).encode('utf-8')
req = urllib.request.Request(url, data, headers)
response = urllib.request.urlopen(req)
the_page = response.read()
 
print(the_page.decode("utf8"))
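
One detail worth knowing when reading headers back: `Request` stores header names after applying `str.capitalize()`, so `User-Agent` is kept as `User-agent`. A small sketch, using a made-up user agent string and sending nothing:

```python
import urllib.request

headers = {"User-Agent": "Mozilla/5.0 (compatible; example-bot)"}
req = urllib.request.Request("http://localhost/login.php", headers=headers)
req.add_header("Referer", "http://www.python.org/")

# Header names are stored via str.capitalize(): "User-Agent" -> "User-agent".
print(req.has_header("User-agent"))  # True
print(req.has_header("User-Agent"))  # False
print(req.get_header("Referer"))     # http://www.python.org/
```

This affects only lookups on the `Request` object; HTTP header names themselves are case-insensitive on the wire.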

5. HTTP errors

#! /usr/bin/env python3
 
import urllib.error
import urllib.request
 
req = urllib.request.Request('http://www.python.org/fish.html')
try:
  urllib.request.urlopen(req)
except urllib.error.HTTPError as e:
  print(e.code)
  print(e.read().decode("utf8"))
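
An `HTTPError` doubles as a response object: besides `code` it exposes `reason`, `headers` and `read()`. The sketch below builds one by hand so that it runs offline; `describe_http_error` is a hypothetical helper, and the body text is invented for the demo.

```python
import io
import urllib.error
from email.message import Message


def describe_http_error(err):
    """Summarize an HTTPError as 'code reason: body' (hypothetical helper)."""
    body = err.read().decode("utf-8", errors="replace")
    return f"{err.code} {err.reason}: {body}"


# Build an HTTPError by hand so the sketch runs offline; urlopen() raises
# the same exception type for 4xx/5xx responses.
err = urllib.error.HTTPError(
    "http://www.python.org/fish.html", 404, "Not Found",
    Message(), io.BytesIO(b"no such page"),
)
print(describe_http_error(err))  # 404 Not Found: no such page
```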

6. Exception handling, version 1

#! /usr/bin/env python3
 
from urllib.request import Request, urlopen
from urllib.error import URLError, HTTPError
req = Request("http://twitter.com/")
try:
  response = urlopen(req)
except HTTPError as e:
  print('The server couldn\'t fulfill the request.')
  print('Error code: ', e.code)
except URLError as e:
  print('We failed to reach a server.')
  print('Reason: ', e.reason)
else:
  print("good!")
  print(response.read().decode("utf8"))
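
The `HTTPError` clause has to come first because `HTTPError` is a subclass of `URLError`; in the opposite order the `URLError` clause would catch everything. The same pattern can be wrapped in a function, sketched here with a hypothetical `opener` parameter so the logic can be exercised without a network connection.

```python
import urllib.error
import urllib.request


def try_fetch(url, opener=urllib.request.urlopen):
    """Return ('ok', body), ('http_error', code) or ('unreachable', reason).

    `opener` is a hypothetical injection point so the logic can be
    exercised without a network connection.
    """
    try:
        with opener(url) as response:
            return ("ok", response.read())
    except urllib.error.HTTPError as e:   # must precede URLError
        return ("http_error", e.code)
    except urllib.error.URLError as e:
        return ("unreachable", e.reason)


def unreachable(url):
    # Stand-in for urlopen() that simulates a failed connection.
    raise urllib.error.URLError("connection refused")


print(issubclass(urllib.error.HTTPError, urllib.error.URLError))  # True
print(try_fetch("http://twitter.com/", opener=unreachable))
# ('unreachable', 'connection refused')
```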

7. Exception handling, version 2

#! /usr/bin/env python3
 
from urllib.request import Request, urlopen
from urllib.error import URLError
req = Request("http://twitter.com/")
try:
  response = urlopen(req)
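except URLError as e:
  # HTTPError is a subclass of URLError and has both .code and .reason,
  # so test for .code first to tell the two cases apart.
  if hasattr(e, 'code'):
    print('The server couldn\'t fulfill the request.')
    print('Error code: ', e.code)
  elif hasattr(e, 'reason'):
    print('We failed to reach a server.')
    print('Reason: ', e.reason)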
else:
  print("good!")
  print(response.read().decode("utf8"))
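
When a single `except URLError` clause handles both cases, the attribute that separates them is `code`, since in Python 3 an `HTTPError` also has a `reason`. A quick check, with both exceptions built by hand rather than raised by a real request:

```python
import urllib.error

plain = urllib.error.URLError("name resolution failed")
http = urllib.error.HTTPError("http://twitter.com/", 500, "Server Error", None, None)

# URLError carries only .reason; HTTPError carries both .code and .reason.
print(hasattr(plain, "code"), hasattr(plain, "reason"))  # False True
print(hasattr(http, "code"), hasattr(http, "reason"))    # True True
```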

8. HTTP authentication

#! /usr/bin/env python3
 
import urllib.request
 
# create a password manager
password_mgr = urllib.request.HTTPPasswordMgrWithDefaultRealm()
 
# Add the username and password.
# If we knew the realm, we could use it instead of None.
top_level_url = "https://cms.tetx.com/"
password_mgr.add_password(None, top_level_url, 'yzhang', 'cccddd')
 
handler = urllib.request.HTTPBasicAuthHandler(password_mgr)
 
# create "opener" (OpenerDirector instance)
opener = urllib.request.build_opener(handler)
 
# use the opener to fetch a URL
a_url = "https://cms.tetx.com/"
x = opener.open(a_url)
print(x.read())
 
# Install the opener.
# Now all calls to urllib.request.urlopen use our opener.
urllib.request.install_opener(opener)
 
a = urllib.request.urlopen(a_url).read().decode('utf8')
print(a)
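
To see which credentials the password manager would hand to the auth handler, it can be queried directly with `find_user_password`. The host and credentials below are placeholders for an offline sketch, not the ones from the example above.

```python
import urllib.request

password_mgr = urllib.request.HTTPPasswordMgrWithDefaultRealm()
# Placeholder host and credentials, used only for this offline sketch.
password_mgr.add_password(None, "https://example.com/", "demo", "secret")

# Any URL below the registered prefix matches, whatever realm the server names.
print(password_mgr.find_user_password("Some Realm", "https://example.com/admin/"))
# ('demo', 'secret')

# A different host has no stored credentials.
print(password_mgr.find_user_password(None, "https://example.org/"))
# (None, None)
```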

9. Using a proxy

#! /usr/bin/env python3
 
import urllib.request
 
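# ProxyHandler maps URL schemes ('http', 'https') to proxy URLs.
# urllib.request has no built-in SOCKS support, so this assumes an
# HTTP proxy is listening on localhost:1080.
proxy_support = urllib.request.ProxyHandler({'http': 'http://localhost:1080'})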
opener = urllib.request.build_opener(proxy_support)
urllib.request.install_opener(opener)

 
a = urllib.request.urlopen("http://g.cn").read().decode("utf8")
print(a)
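
The keys of the `ProxyHandler` mapping name the URL scheme of the requests to be proxied, and the handler gets one `<scheme>_open` method per key. The sketch below assumes an HTTP proxy at a placeholder address and builds an opener without installing it globally; nothing is sent.

```python
import urllib.request

# Assumed: an HTTP proxy at localhost:1080 (placeholder address).
proxy_url = "http://localhost:1080"
handler = urllib.request.ProxyHandler({"http": proxy_url, "https": proxy_url})

# One <scheme>_open method exists per key in the mapping.
print(hasattr(handler, "http_open"), hasattr(handler, "ftp_open"))  # True False

# Use opener.open(url) directly instead of install_opener() to avoid
# changing the behaviour of every urlopen() call in the process.
opener = urllib.request.build_opener(handler)
print(isinstance(opener, urllib.request.OpenerDirector))  # True
```

Passing an empty dictionary, `ProxyHandler({})`, disables any proxy picked up from the environment.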

10. Timeouts

#! /usr/bin/env python3
 
import socket
import urllib.request
 
# timeout in seconds
timeout = 2
socket.setdefaulttimeout(timeout)
 
# this call to urllib.request.urlopen now uses the default timeout
# we have set in the socket module
req = urllib.request.Request('http://twitter.com/')
a = urllib.request.urlopen(req).read()
print(a)
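
`urlopen` also accepts a per-call `timeout` argument, which avoids changing the default for every socket in the process. A minimal sketch, assuming a hypothetical helper named `fetch_with_timeout`; the `data:` URL keeps the demo offline.

```python
import socket
import urllib.error
import urllib.request


def fetch_with_timeout(url, timeout=2.0):
    """Return the body as bytes, or None if the request fails or times out.

    Hypothetical helper; `timeout` is in seconds and applies to this call only.
    """
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.read()
    except (urllib.error.URLError, socket.timeout):
        # Connect failures arrive wrapped in URLError; a stalled read can
        # raise socket.timeout (an alias of TimeoutError since Python 3.10).
        return None


# A data: URL keeps the demo offline; a real call would pass an http(s) URL.
print(fetch_with_timeout("data:text/plain,ok"))  # b'ok'
```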

That is all for this article. I hope it helps with your studies, and I hope you will continue to support 三水点靠木.

- Author -

方倍工作室
