讲解Python的Scrapy爬虫框架使用代理进行采集的方法


Posted in Python onFebruary 18, 2016

1.在Scrapy工程下新建“middlewares.py”

# Importing base64 library because we'll need it ONLY in case if the proxy we are going to use requires authentication
import base64

# Start your middleware class
class ProxyMiddleware(object):
 # overwrite process request
 def process_request(self, request, spider):
  # Set the location of the proxy
  request.meta['proxy'] = "http://YOUR_PROXY_IP:PORT"

  # Use the following lines if your proxy requires authentication
  proxy_user_pass = "USERNAME:PASSWORD"
  # setup basic authentication for the proxy
  encoded_user_pass = base64.encodestring(proxy_user_pass)
  request.headers['Proxy-Authorization'] = 'Basic ' + encoded_user_pass

2.在项目配置文件里(./project_name/settings.py)添加

DOWNLOADER_MIDDLEWARES = {
 'scrapy.contrib.downloadermiddleware.httpproxy.HttpProxyMiddleware': 110,
 'project_name.middlewares.ProxyMiddleware': 100,
}

只要两步,现在请求就是通过代理的了。测试一下^_^

from scrapy.spider import BaseSpider
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.http import Request

class TestSpider(CrawlSpider):
 name = "test"
 domain_name = "whatismyip.com"
 # The following url is subject to change, you can get the last updated one from here :
 # http://www.whatismyip.com/faq/automation.asp
 start_urls = ["http://xujian.info"]

 def parse(self, response):
  open('test.html', 'wb').write(response.body)

3.使用随机user-agent

默认情况下scrapy采集时只能使用一种user-agent,这样容易被网站屏蔽,下面的代码可以从预先定义的user- agent的列表中随机选择一个来采集不同的页面

在settings.py中添加以下代码

DOWNLOADER_MIDDLEWARES = {
  'scrapy.contrib.downloadermiddleware.useragent.UserAgentMiddleware' : None,
  'Crawler.comm.rotate_useragent.RotateUserAgentMiddleware' :400
 }

注意: Crawler; 是你项目的名字 ,通过它是一个目录的名称 下面是蜘蛛的代码

#!/usr/bin/python
#-*-coding:utf-8-*-

import random
from scrapy.contrib.downloadermiddleware.useragent import UserAgentMiddleware

class RotateUserAgentMiddleware(UserAgentMiddleware):
 def __init__(self, user_agent=''):
  self.user_agent = user_agent

 def process_request(self, request, spider):
  #这句话用于随机选择user-agent
  ua = random.choice(self.user_agent_list)
  if ua:
   request.headers.setdefault('User-Agent', ua)

 #the default user_agent_list composes chrome,I E,firefox,Mozilla,opera,netscape
 #for more user agent strings,you can find it in http://www.useragentstring.com/pages/useragentstring.php
 user_agent_list = [\
  "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/22.0.1207.1 Safari/537.1"\
  "Mozilla/5.0 (X11; CrOS i686 2268.111.0) AppleWebKit/536.11 (KHTML, like Gecko) Chrome/20.0.1132.57 Safari/536.11",\
  "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.6 (KHTML, like Gecko) Chrome/20.0.1092.0 Safari/536.6",\
  "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.6 (KHTML, like Gecko) Chrome/20.0.1090.0 Safari/536.6",\
  "Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/19.77.34.5 Safari/537.1",\
  "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/536.5 (KHTML, like Gecko) Chrome/19.0.1084.9 Safari/536.5",\
  "Mozilla/5.0 (Windows NT 6.0) AppleWebKit/536.5 (KHTML, like Gecko) Chrome/19.0.1084.36 Safari/536.5",\
  "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3",\
  "Mozilla/5.0 (Windows NT 5.1) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3",\
  "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_8_0) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3",\
  "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1062.0 Safari/536.3",\
  "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1062.0 Safari/536.3",\
  "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3",\
  "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3",\
  "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3",\
  "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.0 Safari/536.3",\
  "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/535.24 (KHTML, like Gecko) Chrome/19.0.1055.1 Safari/535.24",\
  "Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/535.24 (KHTML, like Gecko) Chrome/19.0.1055.1 Safari/535.24"
  ]
Python 相关文章推荐
跟老齐学Python之有点简约的元组
Sep 24 Python
Python中使用dom模块生成XML文件示例
Apr 05 Python
Python中基础的socket编程实战攻略
Jun 01 Python
python 表达式和语句及for、while循环练习实例
Jul 07 Python
python+selenium开发环境搭建图文教程
Aug 11 Python
python实现批量解析邮件并下载附件
Jun 19 Python
Python拼接微信好友头像大图的实现方法
Aug 01 Python
python之mock模块基本使用方法详解
Jun 27 Python
python itsdangerous模块的具体使用方法
Feb 17 Python
python selenium操作cookie的实现
Mar 18 Python
Tensorflow实现将标签变为one-hot形式
May 22 Python
keras model.fit 解决validation_spilt=num 的问题
Jun 19 Python
使用Python的PIL模块来进行图片对比
Feb 18 #Python
使用Python来编写HTTP服务器的超级指南
Feb 18 #Python
python装饰器与递归算法详解
Feb 18 #Python
Python利用Nagios增加微信报警通知的功能
Feb 18 #Python
Python多线程、异步+多进程爬虫实现代码
Feb 17 #Python
玩转python爬虫之爬取糗事百科段子
Feb 17 #Python
玩转python爬虫之正则表达式
Feb 17 #Python
You might like
用phpmyadmin更改mysql5.0登录密码
2008/03/25 PHP
用Zend Studio+PHPnow+Zend Debugger搭建PHP服务器调试环境步骤
2014/01/19 PHP
php实现的短网址算法分享
2014/06/20 PHP
php通过前序遍历树实现无需递归的无限极分类
2015/07/10 PHP
JS动画效果代码3
2008/04/03 Javascript
javascript String 的扩展方法集合
2008/06/01 Javascript
JQUERY CHECKBOX全选,取消全选,反选方法三
2008/08/30 Javascript
jquery的ajax从纯真网(cz88.net)获取IP地址对应地区名
2009/12/02 Javascript
解决IE下select标签innerHTML插入option的BUG(兼容IE,FF,Opera,Chrome,Safari)
2010/05/13 Javascript
jquery 弹出层注册页面等(asp.net后台)
2010/06/17 Javascript
基于Unit PNG Fix.js有时候在ie6下不正常的解决办法
2013/06/26 Javascript
JavaScript中的console.assert()函数介绍
2014/12/29 Javascript
js实现照片墙功能实例
2015/02/05 Javascript
jQuery的promise与deferred对象在异步回调中的作用
2016/05/03 Javascript
JQuery组件基于Bootstrap的DropDownList(完整版)
2016/07/05 Javascript
jQuery Easyui使用(二)之可折叠面板动态加载无效果的解决方法
2016/08/17 Javascript
微信小程序实现滚动消息通知
2018/02/02 Javascript
详解微信小程序的 request 封装示例
2018/08/21 Javascript
Vue CLI3基础学习之pages构建多页应用
2019/06/02 Javascript
Layui 导航默认展开和菜单栏选中高亮设置的方法
2019/09/04 Javascript
nodejs一个简单的文件服务器的创建方法
2019/09/13 NodeJs
javascript设计模式 ? 装饰模式原理与应用实例分析
2020/04/14 Javascript
python 实现将txt文件多行合并为一行并将中间的空格去掉方法
2018/12/20 Python
Python的log日志功能及设置方法
2019/07/11 Python
详解Anconda环境下载python包的教程(图形界面+命令行+pycharm安装)
2019/11/11 Python
python 字典访问的三种方法小结
2019/12/05 Python
python爬虫scrapy基于CrawlSpider类的全站数据爬取示例解析
2021/02/20 Python
印尼在线精品店:Berrybenka.com
2016/10/22 全球购物
服装创业计划书范文
2014/02/05 职场文书
医药营销个人求职信范文
2014/02/07 职场文书
文员岗位职责范本
2014/03/08 职场文书
企业消防安全责任书
2014/07/23 职场文书
班子个人四风问题整改措施
2014/10/04 职场文书
党小组评议意见
2015/06/02 职场文书
SQL Server一个字符串拆分多行显示或者多行数据合并成一个字符串
2022/05/25 SQL Server
vue ant design 封装弹窗表单的使用
2022/06/01 Vue.js