讲解Python的Scrapy爬虫框架使用代理进行采集的方法


Posted in Python onFebruary 18, 2016

1.在Scrapy工程下新建“middlewares.py”

# Importing base64 library because we'll need it ONLY in case if the proxy we are going to use requires authentication
import base64

# Start your middleware class
class ProxyMiddleware(object):
 # overwrite process request
 def process_request(self, request, spider):
  # Set the location of the proxy
  request.meta['proxy'] = "http://YOUR_PROXY_IP:PORT"

  # Use the following lines if your proxy requires authentication
  proxy_user_pass = "USERNAME:PASSWORD"
  # setup basic authentication for the proxy
  encoded_user_pass = base64.encodestring(proxy_user_pass)
  request.headers['Proxy-Authorization'] = 'Basic ' + encoded_user_pass

2.在项目配置文件里(./project_name/settings.py)添加

DOWNLOADER_MIDDLEWARES = {
 'scrapy.contrib.downloadermiddleware.httpproxy.HttpProxyMiddleware': 110,
 'project_name.middlewares.ProxyMiddleware': 100,
}

只要两步,现在请求就是通过代理的了。测试一下^_^

from scrapy.spider import BaseSpider
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.http import Request

class TestSpider(CrawlSpider):
 name = "test"
 domain_name = "whatismyip.com"
 # The following url is subject to change, you can get the last updated one from here :
 # http://www.whatismyip.com/faq/automation.asp
 start_urls = ["http://xujian.info"]

 def parse(self, response):
  open('test.html', 'wb').write(response.body)

3.使用随机user-agent

默认情况下scrapy采集时只能使用一种user-agent,这样容易被网站屏蔽,下面的代码可以从预先定义的user- agent的列表中随机选择一个来采集不同的页面

在settings.py中添加以下代码

DOWNLOADER_MIDDLEWARES = {
  'scrapy.contrib.downloadermiddleware.useragent.UserAgentMiddleware' : None,
  'Crawler.comm.rotate_useragent.RotateUserAgentMiddleware' :400
 }

注意: Crawler; 是你项目的名字 ,通过它是一个目录的名称 下面是蜘蛛的代码

#!/usr/bin/python
#-*-coding:utf-8-*-

import random
from scrapy.contrib.downloadermiddleware.useragent import UserAgentMiddleware

class RotateUserAgentMiddleware(UserAgentMiddleware):
 def __init__(self, user_agent=''):
  self.user_agent = user_agent

 def process_request(self, request, spider):
  #这句话用于随机选择user-agent
  ua = random.choice(self.user_agent_list)
  if ua:
   request.headers.setdefault('User-Agent', ua)

 #the default user_agent_list composes chrome,I E,firefox,Mozilla,opera,netscape
 #for more user agent strings,you can find it in http://www.useragentstring.com/pages/useragentstring.php
 user_agent_list = [\
  "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/22.0.1207.1 Safari/537.1"\
  "Mozilla/5.0 (X11; CrOS i686 2268.111.0) AppleWebKit/536.11 (KHTML, like Gecko) Chrome/20.0.1132.57 Safari/536.11",\
  "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.6 (KHTML, like Gecko) Chrome/20.0.1092.0 Safari/536.6",\
  "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.6 (KHTML, like Gecko) Chrome/20.0.1090.0 Safari/536.6",\
  "Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/19.77.34.5 Safari/537.1",\
  "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/536.5 (KHTML, like Gecko) Chrome/19.0.1084.9 Safari/536.5",\
  "Mozilla/5.0 (Windows NT 6.0) AppleWebKit/536.5 (KHTML, like Gecko) Chrome/19.0.1084.36 Safari/536.5",\
  "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3",\
  "Mozilla/5.0 (Windows NT 5.1) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3",\
  "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_8_0) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3",\
  "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1062.0 Safari/536.3",\
  "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1062.0 Safari/536.3",\
  "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3",\
  "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3",\
  "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3",\
  "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.0 Safari/536.3",\
  "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/535.24 (KHTML, like Gecko) Chrome/19.0.1055.1 Safari/535.24",\
  "Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/535.24 (KHTML, like Gecko) Chrome/19.0.1055.1 Safari/535.24"
  ]
Python 相关文章推荐
python爬虫实战之爬取京东商城实例教程
Apr 24 Python
一个简单的python爬虫程序 爬取豆瓣热度Top100以内的电影信息
Apr 17 Python
Python之读取TXT文件的方法小结
Apr 27 Python
python 识别图片中的文字信息方法
May 10 Python
python3学习之Splash的安装与实例教程
Jul 09 Python
Python文件读写常见用法总结
Feb 22 Python
Python迭代器模块itertools使用原理解析
Dec 11 Python
Python面向对象编程基础实例分析
Jan 17 Python
python通用读取vcf文件的类(复制粘贴即可用)
Feb 29 Python
Python实现Wordcloud生成词云图的示例
Mar 30 Python
解决Jupyter Notebook开始菜单栏Anaconda下消失的问题
Apr 13 Python
matplotlib源码解析标题实现(窗口标题,标题,子图标题不同之间的差异)
Feb 22 Python
使用Python的PIL模块来进行图片对比
Feb 18 #Python
使用Python来编写HTTP服务器的超级指南
Feb 18 #Python
python装饰器与递归算法详解
Feb 18 #Python
Python利用Nagios增加微信报警通知的功能
Feb 18 #Python
Python多线程、异步+多进程爬虫实现代码
Feb 17 #Python
玩转python爬虫之爬取糗事百科段子
Feb 17 #Python
玩转python爬虫之正则表达式
Feb 17 #Python
You might like
sphinx增量索引的一个问题
2011/06/14 PHP
PHP实现的json类实例
2015/07/28 PHP
PHP执行linux命令6个函数代码实例
2020/11/24 PHP
建议大家看下JavaScript重要知识更新
2007/07/08 Javascript
jquery 分页控件实现代码
2009/11/30 Javascript
javascript学习笔记(八) js内置对象
2012/06/19 Javascript
两个listbox实现选项的添加删除和搜索
2013/03/01 Javascript
jQuery调用RESTful WCF示例代码(GET方法/POST方法)
2014/01/26 Javascript
jQuery实现回车键(Enter)切换文本框焦点的代码实例
2014/05/05 Javascript
jQuery取得设置清空select选择的文本与值
2014/07/08 Javascript
JS获取文件大小方法小结
2015/12/08 Javascript
jQuery同步提交示例代码
2015/12/12 Javascript
深入理解JavaScript中的并行处理
2016/09/22 Javascript
Select2.js下拉框使用小结
2016/10/24 Javascript
JavaScript实现两个select下拉框选项左移右移
2017/03/09 Javascript
vue插件开发之使用pdf.js实现手机端在线预览pdf文档的方法
2018/07/12 Javascript
微信小程序-可移动菜单的实现过程详解
2019/06/24 Javascript
解决$store.getters调用不执行的问题
2019/11/08 Javascript
微信sdk实现禁止微信分享(使用原生php实现)
2019/11/15 Javascript
JS实现百度搜索框关键字推荐
2020/02/17 Javascript
springboot+vue实现文件上传下载
2020/11/17 Vue.js
javascript实现移动端轮播图
2020/12/09 Javascript
100行python代码实现跳一跳辅助程序
2018/01/15 Python
Python使用progressbar模块实现的显示进度条功能
2018/05/31 Python
python利用selenium进行浏览器爬虫
2019/04/25 Python
Giglio英国站:意大利奢侈品购物网
2018/03/06 全球购物
彪马日本官网:PUMA日本
2019/01/31 全球购物
学生打架检讨书大全
2014/01/23 职场文书
材料工程专业毕业生求职信
2014/03/04 职场文书
销售经理竞聘书
2014/03/31 职场文书
2014学习优秀共产党员先进事迹材料思想汇报
2014/09/14 职场文书
党员先进事迹材料
2014/12/19 职场文书
会计岗位职责范本
2015/04/02 职场文书
党内外群众意见范文
2015/06/02 职场文书
2015初一年级组工作总结
2015/07/24 职场文书
mysql数据库入门第一步之创建表
2021/05/14 MySQL