讲解Python的Scrapy爬虫框架使用代理进行采集的方法


Posted in Python onFebruary 18, 2016

1.在Scrapy工程下新建“middlewares.py”

# Importing base64 library because we'll need it ONLY in case if the proxy we are going to use requires authentication
import base64

# Start your middleware class
class ProxyMiddleware(object):
 # overwrite process request
 def process_request(self, request, spider):
  # Set the location of the proxy
  request.meta['proxy'] = "http://YOUR_PROXY_IP:PORT"

  # Use the following lines if your proxy requires authentication
  proxy_user_pass = "USERNAME:PASSWORD"
  # setup basic authentication for the proxy
  encoded_user_pass = base64.encodestring(proxy_user_pass)
  request.headers['Proxy-Authorization'] = 'Basic ' + encoded_user_pass

2.在项目配置文件里(./project_name/settings.py)添加

DOWNLOADER_MIDDLEWARES = {
 'scrapy.contrib.downloadermiddleware.httpproxy.HttpProxyMiddleware': 110,
 'project_name.middlewares.ProxyMiddleware': 100,
}

只要两步,现在请求就是通过代理的了。测试一下^_^

from scrapy.spider import BaseSpider
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.http import Request

class TestSpider(CrawlSpider):
 name = "test"
 domain_name = "whatismyip.com"
 # The following url is subject to change, you can get the last updated one from here :
 # http://www.whatismyip.com/faq/automation.asp
 start_urls = ["http://xujian.info"]

 def parse(self, response):
  open('test.html', 'wb').write(response.body)

3.使用随机user-agent

默认情况下scrapy采集时只能使用一种user-agent,这样容易被网站屏蔽,下面的代码可以从预先定义的user- agent的列表中随机选择一个来采集不同的页面

在settings.py中添加以下代码

DOWNLOADER_MIDDLEWARES = {
  'scrapy.contrib.downloadermiddleware.useragent.UserAgentMiddleware' : None,
  'Crawler.comm.rotate_useragent.RotateUserAgentMiddleware' :400
 }

注意: Crawler; 是你项目的名字 ,通过它是一个目录的名称 下面是蜘蛛的代码

#!/usr/bin/python
#-*-coding:utf-8-*-

import random
from scrapy.contrib.downloadermiddleware.useragent import UserAgentMiddleware

class RotateUserAgentMiddleware(UserAgentMiddleware):
 def __init__(self, user_agent=''):
  self.user_agent = user_agent

 def process_request(self, request, spider):
  #这句话用于随机选择user-agent
  ua = random.choice(self.user_agent_list)
  if ua:
   request.headers.setdefault('User-Agent', ua)

 #the default user_agent_list composes chrome,I E,firefox,Mozilla,opera,netscape
 #for more user agent strings,you can find it in http://www.useragentstring.com/pages/useragentstring.php
 user_agent_list = [\
  "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/22.0.1207.1 Safari/537.1"\
  "Mozilla/5.0 (X11; CrOS i686 2268.111.0) AppleWebKit/536.11 (KHTML, like Gecko) Chrome/20.0.1132.57 Safari/536.11",\
  "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.6 (KHTML, like Gecko) Chrome/20.0.1092.0 Safari/536.6",\
  "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.6 (KHTML, like Gecko) Chrome/20.0.1090.0 Safari/536.6",\
  "Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/19.77.34.5 Safari/537.1",\
  "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/536.5 (KHTML, like Gecko) Chrome/19.0.1084.9 Safari/536.5",\
  "Mozilla/5.0 (Windows NT 6.0) AppleWebKit/536.5 (KHTML, like Gecko) Chrome/19.0.1084.36 Safari/536.5",\
  "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3",\
  "Mozilla/5.0 (Windows NT 5.1) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3",\
  "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_8_0) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3",\
  "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1062.0 Safari/536.3",\
  "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1062.0 Safari/536.3",\
  "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3",\
  "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3",\
  "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3",\
  "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.0 Safari/536.3",\
  "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/535.24 (KHTML, like Gecko) Chrome/19.0.1055.1 Safari/535.24",\
  "Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/535.24 (KHTML, like Gecko) Chrome/19.0.1055.1 Safari/535.24"
  ]
Python 相关文章推荐
python的tkinter布局之简单的聊天窗口实现方法
Sep 03 Python
Python中实现对Timestamp和Datetime及UTC时间之间的转换
Apr 08 Python
Python求两个文本文件以行为单位的交集、并集与差集的方法
Jun 17 Python
Python实现简易端口扫描器代码实例
Mar 15 Python
Python urls.py的三种配置写法实例详解
Apr 28 Python
python生成式的send()方法(详解)
May 08 Python
Python Learning 列表的更多操作及示例代码
Aug 22 Python
Python笔试面试题小结
Sep 07 Python
使用pytorch搭建AlexNet操作(微调预训练模型及手动搭建)
Jan 18 Python
django数据模型中null和blank的区别说明
Sep 02 Python
详解pycharm的python包opencv(cv2)无代码提示问题的解决
Jan 29 Python
Python面向对象之内置函数相关知识总结
Jun 24 Python
使用Python的PIL模块来进行图片对比
Feb 18 #Python
使用Python来编写HTTP服务器的超级指南
Feb 18 #Python
python装饰器与递归算法详解
Feb 18 #Python
Python利用Nagios增加微信报警通知的功能
Feb 18 #Python
Python多线程、异步+多进程爬虫实现代码
Feb 17 #Python
玩转python爬虫之爬取糗事百科段子
Feb 17 #Python
玩转python爬虫之正则表达式
Feb 17 #Python
You might like
PHP 采集程序中常用的函数
2009/12/09 PHP
php 上传功能实例代码
2010/04/13 PHP
PHP获取用户的浏览器与操作系统信息的代码
2012/09/04 PHP
PHP框架性能测试报告
2016/05/08 PHP
phpmailer绑定邮箱的实现方法
2016/12/01 PHP
PHP获取当前执行php文件名的代码
2017/03/02 PHP
PHP实现表单提交数据的验证处理功能【防SQL注入和XSS攻击等】
2017/07/21 PHP
在云虚拟主机部署thinkphp5项目的步骤详解
2017/12/21 PHP
bcastr2.0 通用的图片浏览器
2006/11/22 Javascript
js 兼容多浏览器的回车和鼠标焦点事件代码(IE6/7/8,firefox,chrome)
2010/04/14 Javascript
JQUBAR1.1 jQuery 柱状图插件发布
2010/11/28 Javascript
初识JQuery 实例一(first)
2011/03/16 Javascript
读jQuery之十一 添加事件核心方法
2011/07/31 Javascript
mvc中form表单提交的三种方式(推荐)
2016/08/10 Javascript
微信小程序实现多宫格抽奖活动
2020/04/15 Javascript
validform表单验证的实现方法
2019/03/08 Javascript
微信小程序Flex布局用法深入浅出分析
2019/04/25 Javascript
详解Vue前端生产环境发布配置实战篇
2019/05/07 Javascript
Taro UI框架开发小程序实现左滑喜欢右滑不喜欢效果的示例代码
2020/05/18 Javascript
python中查找excel某一列的重复数据 剔除之后打印
2013/02/10 Python
Python编程语言的35个与众不同之处(语言特征和使用技巧)
2014/07/07 Python
python之DataFrame实现excel合并单元格
2021/02/22 Python
Python爬虫框架scrapy实现的文件下载功能示例
2018/08/04 Python
Pandas库之DataFrame使用的学习笔记
2019/06/21 Python
Python pandas DataFrame操作的实现代码
2019/06/21 Python
Python函数参数分类原理详解
2020/05/28 Python
Python更改pip镜像源的方法示例
2020/12/01 Python
Appium+Python实现简单的自动化登录测试的实现
2021/01/26 Python
ASP.NET Core中的配置详解
2021/02/05 Python
英国高档百货连锁店:John Lewis
2017/11/20 全球购物
Hunkemöller瑞士网上商店:欧洲最大的内衣品牌之一
2018/12/03 全球购物
行政工作试用期自我评价
2014/09/14 职场文书
运动会表扬稿范文
2015/05/05 职场文书
创业计划书之牛肉汤快餐店
2019/10/08 职场文书
python实现股票历史数据可视化分析案例
2021/06/10 Python
react使用antd的上传组件实现文件表单一起提交功能(完整代码)
2021/06/29 Javascript