How to use a proxy for scraping with Python's Scrapy crawler framework


Posted in Python on February 18, 2016

1. Create a new file named "middlewares.py" in your Scrapy project

# base64 is only needed if the proxy you are going to use requires authentication
import base64

# Start your middleware class
class ProxyMiddleware(object):
    # override process_request
    def process_request(self, request, spider):
        # Set the location of the proxy
        request.meta['proxy'] = "http://YOUR_PROXY_IP:PORT"

        # Use the following lines only if your proxy requires authentication
        proxy_user_pass = "USERNAME:PASSWORD"
        # set up HTTP Basic authentication for the proxy;
        # b64encode is used instead of encodestring, which appends a
        # trailing newline that would corrupt the header value
        encoded_user_pass = base64.b64encode(proxy_user_pass.encode()).decode()
        request.headers['Proxy-Authorization'] = 'Basic ' + encoded_user_pass
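
If you have more than one proxy, a common variation is to choose one at random for every request. The sketch below is only an illustrative assumption, not part of the original article: the RandomProxyMiddleware name and the PROXIES list are placeholders you would adapt to your own setup.

import random

# Placeholder list; fill in your own proxy endpoints
PROXIES = [
    "http://PROXY_IP_1:PORT",
    "http://PROXY_IP_2:PORT",
]

class RandomProxyMiddleware(object):
    def process_request(self, request, spider):
        # pick a different proxy for each outgoing request
        request.meta['proxy'] = random.choice(PROXIES)

Register it in DOWNLOADER_MIDDLEWARES just like ProxyMiddleware below.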

2. Add the following to the project settings file (./project_name/settings.py). The custom ProxyMiddleware gets priority 100 so its process_request runs before the built-in HttpProxyMiddleware at 110 (lower numbers run earlier for requests):

DOWNLOADER_MIDDLEWARES = {
    'scrapy.contrib.downloadermiddleware.httpproxy.HttpProxyMiddleware': 110,
    'project_name.middlewares.ProxyMiddleware': 100,
}

With just these two steps, requests now go through the proxy. Let's test it ^_^

from scrapy.spider import BaseSpider

class TestSpider(BaseSpider):
    name = "test"
    allowed_domains = ["whatismyip.com"]
    # The following url is subject to change, you can get the last updated one from here:
    # http://www.whatismyip.com/faq/automation.asp
    start_urls = ["http://xujian.info"]

    def parse(self, response):
        # save the fetched page so we can inspect which IP the site saw
        with open('test.html', 'wb') as f:
            f.write(response.body)
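
To try it, run scrapy crawl test from the project directory and open the saved test.html. If the proxy is working and start_urls points at an IP-echo page such as whatismyip.com, the page should show the proxy's IP rather than your own.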

3. Use a random user-agent

By default Scrapy sends every request with the same user-agent, which makes the crawler easy for a site to detect and block. The code below picks a user-agent at random from a predefined list, so different pages are fetched with different user-agents.

Add the following to settings.py:

DOWNLOADER_MIDDLEWARES = {
    # disable the built-in UserAgentMiddleware
    'scrapy.contrib.downloadermiddleware.useragent.UserAgentMiddleware': None,
    'Crawler.comm.rotate_useragent.RotateUserAgentMiddleware': 400,
}

Note: Crawler is the name of your project, i.e. the project's directory name; the dotted path above means the middleware lives in Crawler/comm/rotate_useragent.py. The middleware code follows:

#!/usr/bin/python
# -*- coding: utf-8 -*-

import random
from scrapy.contrib.downloadermiddleware.useragent import UserAgentMiddleware

class RotateUserAgentMiddleware(UserAgentMiddleware):
    def __init__(self, user_agent=''):
        self.user_agent = user_agent

    def process_request(self, request, spider):
        # pick a user-agent at random for this request
        ua = random.choice(self.user_agent_list)
        if ua:
            request.headers.setdefault('User-Agent', ua)

    # The default user_agent_list below contains Chrome strings for several
    # platforms; more user-agent strings can be found at
    # http://www.useragentstring.com/pages/useragentstring.php
    user_agent_list = [
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/22.0.1207.1 Safari/537.1",
        "Mozilla/5.0 (X11; CrOS i686 2268.111.0) AppleWebKit/536.11 (KHTML, like Gecko) Chrome/20.0.1132.57 Safari/536.11",
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.6 (KHTML, like Gecko) Chrome/20.0.1092.0 Safari/536.6",
        "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.6 (KHTML, like Gecko) Chrome/20.0.1090.0 Safari/536.6",
        "Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/19.77.34.5 Safari/537.1",
        "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/536.5 (KHTML, like Gecko) Chrome/19.0.1084.9 Safari/536.5",
        "Mozilla/5.0 (Windows NT 6.0) AppleWebKit/536.5 (KHTML, like Gecko) Chrome/19.0.1084.36 Safari/536.5",
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3",
        "Mozilla/5.0 (Windows NT 5.1) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3",
        "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_8_0) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3",
        "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1062.0 Safari/536.3",
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1062.0 Safari/536.3",
        "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3",
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3",
        "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3",
        "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.0 Safari/536.3",
        "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/535.24 (KHTML, like Gecko) Chrome/19.0.1055.1 Safari/535.24",
        "Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/535.24 (KHTML, like Gecko) Chrome/19.0.1055.1 Safari/535.24",
    ]
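
To confirm that the rotation is actually applied, a quick check is to fetch a page that echoes the User-Agent header back. The sketch below is an assumption of mine, not part of the original article: the spider name uacheck is made up, and http://httpbin.org/user-agent is simply a convenient echo endpoint.

from scrapy.spider import BaseSpider

class UACheckSpider(BaseSpider):
    # throwaway spider used only to verify RotateUserAgentMiddleware
    name = "uacheck"
    start_urls = ["http://httpbin.org/user-agent"]

    def parse(self, response):
        # httpbin.org/user-agent echoes back the User-Agent header it received
        self.log(response.body)

Run scrapy crawl uacheck a few times; the logged user-agent should differ between runs.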