Using proxies for crawling with Python's Scrapy framework


Posted in Python on February 18, 2016

1. Create "middlewares.py" in your Scrapy project

# base64 is only needed if the proxy requires authentication
import base64


# Downloader middleware that routes every request through a proxy
class ProxyMiddleware(object):
    # Called by Scrapy for each request before it is downloaded
    def process_request(self, request, spider):
        # Set the location of the proxy
        request.meta['proxy'] = "http://YOUR_PROXY_IP:PORT"

        # Use the following lines only if your proxy requires authentication
        proxy_user_pass = "USERNAME:PASSWORD"
        # Build the HTTP Basic credentials; b64encode works on bytes and,
        # unlike the removed base64.encodestring, adds no trailing newline
        encoded_user_pass = base64.b64encode(proxy_user_pass.encode('utf-8')).decode('ascii')
        request.headers['Proxy-Authorization'] = 'Basic ' + encoded_user_pass
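
The authentication lines above boil down to building an HTTP Basic credential string. The following is a minimal standalone sketch of that step, using only the standard library; the helper name `basic_proxy_auth` is hypothetical and not part of Scrapy. It relies on `base64.b64encode`, since the legacy `base64.encodestring` appends a trailing newline to its output and was removed in Python 3.9.

```python
import base64


def basic_proxy_auth(username, password):
    """Return a value for the Proxy-Authorization header (HTTP Basic scheme).

    Hypothetical helper for illustration; not part of Scrapy.
    """
    credentials = "{}:{}".format(username, password).encode("utf-8")
    # b64encode returns bytes without a trailing newline
    token = base64.b64encode(credentials).decode("ascii")
    return "Basic " + token


header_value = basic_proxy_auth("USERNAME", "PASSWORD")
print(header_value)
```

Inside a middleware, the returned string could be assigned to request.headers['Proxy-Authorization'] directly.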

2. Add the following to the project settings file (./project_name/settings.py)

DOWNLOADER_MIDDLEWARES = {
    'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware': 110,
    'project_name.middlewares.ProxyMiddleware': 100,
}
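
The numbers in this setting are priorities: Scrapy calls the process_request hooks of downloader middlewares in increasing order of these values, and an entry set to None is disabled. The sketch below only mimics that ordering rule with plain Python to show why the custom middleware (100) runs before the built-in one (110), presumably so that request.meta['proxy'] is already set when the built-in middleware sees the request. The function `request_order` is a hypothetical illustration, not a Scrapy API, and the built-in path assumes Scrapy 1.0 or later.

```python
# Stand-in for the DOWNLOADER_MIDDLEWARES dict from settings.py
DOWNLOADER_MIDDLEWARES = {
    "scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware": 110,
    "project_name.middlewares.ProxyMiddleware": 100,
}


def request_order(middlewares):
    """Return middleware paths in the order their process_request hooks run.

    Hypothetical helper: drops disabled (None) entries, then sorts by priority.
    """
    enabled = {path: prio for path, prio in middlewares.items() if prio is not None}
    return sorted(enabled, key=enabled.get)


for path in request_order(DOWNLOADER_MIDDLEWARES):
    print(DOWNLOADER_MIDDLEWARES[path], path)
```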

Just two steps, and requests now go through the proxy. Let's test it with a small spider:

import scrapy


# A plain Spider is used here: CrawlSpider relies on parse() internally,
# so overriding parse() on a CrawlSpider breaks it
class TestSpider(scrapy.Spider):
    name = "test"
    # The following URL is subject to change; you can get the latest one from:
    # http://www.whatismyip.com/faq/automation.asp
    start_urls = ["http://xujian.info"]

    def parse(self, response):
        # Save the page so you can check which IP address the site saw
        with open('test.html', 'wb') as f:
            f.write(response.body)

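3. Use a random user-agent

By default, Scrapy sends the same user-agent with every request of a crawl, which makes it easy for a site to block. The code below picks a random user-agent from a predefined list for each request.

Add the following to settings.py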

DOWNLOADER_MIDDLEWARES = {
    'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware': None,
    'Crawler.comm.rotate_useragent.RotateUserAgentMiddleware': 400,
}
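
Note that settings.py is an ordinary Python module, so if both DOWNLOADER_MIDDLEWARES assignments from this article were placed in the same file, the second would replace the first and the proxy entries would be lost. A minimal sketch of a single merged setting might look like this; `project_name` and `Crawler` are both placeholders for your own project package, and the built-in paths assume Scrapy 1.0 or later.

```python
# Sketch of one merged setting for settings.py (placeholder package names)
DOWNLOADER_MIDDLEWARES = {
    # Proxy support (steps 1 and 2)
    "project_name.middlewares.ProxyMiddleware": 100,
    "scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware": 110,
    # User-agent rotation (step 3); None disables the built-in middleware
    "scrapy.downloadermiddlewares.useragent.UserAgentMiddleware": None,
    "Crawler.comm.rotate_useragent.RotateUserAgentMiddleware": 400,
}
```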

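Note: Crawler is the name of your project, which is also the name of its directory. Below is the code for the middleware (Crawler/comm/rotate_useragent.py):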

#!/usr/bin/python
# -*- coding: utf-8 -*-

import random
from scrapy.downloadermiddlewares.useragent import UserAgentMiddleware


class RotateUserAgentMiddleware(UserAgentMiddleware):
    def __init__(self, user_agent=''):
        self.user_agent = user_agent

    def process_request(self, request, spider):
        # Pick a random user-agent for this request
        ua = random.choice(self.user_agent_list)
        if ua:
            request.headers.setdefault('User-Agent', ua)

    # Sample user-agent strings (all Chrome variants here); more can be found at
    # http://www.useragentstring.com/pages/useragentstring.php
    user_agent_list = [
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/22.0.1207.1 Safari/537.1",
        "Mozilla/5.0 (X11; CrOS i686 2268.111.0) AppleWebKit/536.11 (KHTML, like Gecko) Chrome/20.0.1132.57 Safari/536.11",
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.6 (KHTML, like Gecko) Chrome/20.0.1092.0 Safari/536.6",
        "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.6 (KHTML, like Gecko) Chrome/20.0.1090.0 Safari/536.6",
        "Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/19.77.34.5 Safari/537.1",
        "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/536.5 (KHTML, like Gecko) Chrome/19.0.1084.9 Safari/536.5",
        "Mozilla/5.0 (Windows NT 6.0) AppleWebKit/536.5 (KHTML, like Gecko) Chrome/19.0.1084.36 Safari/536.5",
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3",
        "Mozilla/5.0 (Windows NT 5.1) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3",
        "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_8_0) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3",
        "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1062.0 Safari/536.3",
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1062.0 Safari/536.3",
        "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3",
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3",
        "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3",
        "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.0 Safari/536.3",
        "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/535.24 (KHTML, like Gecko) Chrome/19.0.1055.1 Safari/535.24",
        "Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/535.24 (KHTML, like Gecko) Chrome/19.0.1055.1 Safari/535.24",
    ]
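
To see what the rotation logic does without running a crawl, here is a small standalone sketch. `FakeRequest`, `rotate_user_agent` and the shortened `USER_AGENTS` list are hypothetical stand-ins, not Scrapy classes. It shows that `setdefault` only fills in a User-Agent when the request does not already carry one.

```python
import random

# Shortened, hypothetical list; a real one would hold full user-agent strings
USER_AGENTS = ["agent-a", "agent-b", "agent-c"]


class FakeRequest:
    """Stand-in for a Scrapy request, with a plain dict of headers."""

    def __init__(self, headers=None):
        self.headers = dict(headers or {})


def rotate_user_agent(request, agents, rng=random):
    """Give the request a random User-Agent unless one is already set."""
    ua = rng.choice(agents)
    if ua:
        request.headers.setdefault("User-Agent", ua)
    return request


plain = rotate_user_agent(FakeRequest(), USER_AGENTS)
preset = rotate_user_agent(FakeRequest({"User-Agent": "custom"}), USER_AGENTS)
print(plain.headers["User-Agent"] in USER_AGENTS)  # True
print(preset.headers["User-Agent"])                # custom
```

This is why the built-in UserAgentMiddleware is disabled in settings.py: if it had already set the header, setdefault would leave its value in place.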