How to Build a Proxy Pool for Python Web Crawlers


Posted in Python on September 28, 2020

I. Why Build a Crawler Proxy Pool

Among the many anti-scraping measures websites use, one throttles by IP request frequency: once an IP's request count within a time window reaches a threshold, the IP is blacklisted and blocked for a period of time.

There are two ways to deal with this:

1. Lower the crawl frequency so the IP never trips the limit. The drawback is obvious: crawling becomes much slower.

2. Build an IP proxy pool and rotate through different IPs while crawling.
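The second option boils down to keeping a pool of proxies and attaching a different one to each request. A minimal sketch of the rotation idea (the proxy addresses are made-up placeholders; with requests, the chosen proxy goes into the `proxies` argument):

```python
from itertools import cycle

# Hypothetical pool of working proxies (placeholder addresses, not real servers).
proxy_pool = cycle([
    'http://10.0.0.1:8080',
    'http://10.0.0.2:8080',
    'http://10.0.0.3:8080',
])

def next_proxies():
    """Build the `proxies` dict that requests expects, from the next pool entry."""
    return {'http': next(proxy_pool)}

# Each call hands out the next proxy, wrapping around at the end.
print(next_proxies())  # {'http': 'http://10.0.0.1:8080'}
print(next_proxies())  # {'http': 'http://10.0.0.2:8080'}
# requests.get(url, proxies=next_proxies())  # how a crawler would use it
```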

II. Design

1. Crawl proxy IPs from proxy-listing sites (e.g. XiCiDaili, KuaiDaili, YunDaili, WuYouDaili);

2. Verify that each proxy IP works (request a known URL through the proxy and check the response);

3. Save the working proxy IPs to a database.

Commonly used proxy sites: XiCiDaili, YunDaili, IPHai, WuYouDaili, FeiYiDaili, KuaiDaili
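The three steps can be sketched end to end as a single loop; fetch_proxies and is_available are stand-ins here (stubbed for illustration, no real network calls):

```python
def fetch_proxies():
    """Placeholder for step 1: scrape candidate proxies from a listing site."""
    return ['http://10.0.0.1:8080', 'http://10.0.0.2:8080']

def is_available(proxy_url):
    """Placeholder for step 2: request a known URL through the proxy and
    report whether it answered. Stubbed here to keep the sketch offline."""
    return proxy_url.endswith(':8080')

def build_pool():
    """Step 3: keep only the proxies that passed the check."""
    pool = []
    for candidate in fetch_proxies():
        if is_available(candidate):
            pool.append(candidate)  # a real pool would persist this to a database
    return pool

print(build_pool())  # both stub candidates pass the stub check
```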

III. Implementation

The project layout:

ipproxy_pool/
├── ipproxy.py         # the IPProxy proxy class
├── settings.py        # project configuration
├── proxy_util.py      # utility functions
├── proxy_queue.py     # proxy queues
├── proxy_crawlers.py  # per-site crawlers
└── run.py             # entry point

ipproxy.py

The IPProxy class defines the fields of a crawled proxy and some basic helper methods.

# -*- coding: utf-8 -*-
import re
import time
from settings import PROXY_URL_FORMATTER

schema_pattern = re.compile(r'^(http|https)$', re.I)
ip_pattern = re.compile(r'^([0-9]{1,3}\.){3}[0-9]{1,3}$')
port_pattern = re.compile(r'^[0-9]{2,5}$')

class IPProxy:
  '''
  {
    "schema": "http", # proxy scheme
    "ip": "127.0.0.1", # proxy IP address
    "port": "8050", # proxy port
    "used_total": 11, # number of times the proxy has been used
    "success_times": 5, # number of successful requests through the proxy
    "continuous_failed": 3, # current streak of consecutive failed requests
    "created_time": "2018-05-02" # date the proxy was crawled
  }
  '''

  def __init__(self, schema, ip, port, used_total=0, success_times=0, continuous_failed=0,
         created_time=None):
    """Initialize the proxy instance"""
    if schema == "" or schema is None:
      schema = "http"
    self.schema = schema.lower()
    self.ip = ip
    self.port = port
    self.used_total = used_total
    self.success_times = success_times
    self.continuous_failed = continuous_failed
    if created_time is None:
      created_time = time.strftime('%Y-%m-%d', time.localtime(time.time()))
    self.created_time = created_time

  def _get_url(self):
    ''' Return the proxy url'''
    return PROXY_URL_FORMATTER % {'schema': self.schema, 'ip': self.ip, 'port': self.port}

  def _check_format(self):
    ''' Return True if the proxy fields are well-formed, otherwise return False'''
    if self.schema is not None and self.ip is not None and self.port is not None:
      if schema_pattern.match(self.schema) and ip_pattern.match(self.ip) and port_pattern.match(self.port):
        return True
    return False

  def _is_https(self):
    ''' Return True if the proxy is https, otherwise return False'''
    return self.schema == 'https'

  def _update(self, succeeded=False):
    ''' Update the proxy's counters based on the outcome of a request'''
    self.used_total = self.used_total + 1
    if succeeded:
      self.continuous_failed = 0
      self.success_times = self.success_times + 1
    else:
      self.continuous_failed = self.continuous_failed + 1

if __name__ == '__main__':
  proxy = IPProxy('HTTPS', '192.168.2.25', "8080")
  print(proxy._get_url())
  print(proxy._check_format())
  print(proxy._is_https())
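A detail worth checking in patterns like ip_pattern: the dot must be escaped and the match anchored, otherwise malformed strings slip through. A quick standalone check:

```python
import re

# Escaped dot + anchors: only well-formed dotted quads pass.
ip_ok = re.compile(r'^([0-9]{1,3}\.){3}[0-9]{1,3}$')
# Unescaped dot: '.' matches ANY character, so junk passes too.
ip_bad = re.compile(r'^([0-9]{1,3}.){3}[0-9]{1,3}$')

assert ip_ok.match('192.168.2.25')
assert not ip_ok.match('192x168y2z25')
assert ip_bad.match('192x168y2z25')  # the bug: separators need not be dots
```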

settings.py

settings.py gathers all the configuration the project needs.

# Redis host and port
REDIS_HOST = 'localhost'
REDIS_PORT = 6379

# Format string for the Redis keys under which proxies are stored
PROXIES_REDIS_FORMATTER = 'proxies::{}'
# Redis set of HTTP/HTTPS proxies that are already enqueued
PROXIES_REDIS_EXISTED = 'proxies::existed'

# Maximum number of consecutive failures before a proxy is discarded
MAX_CONTINUOUS_TIMES = 3
# Format string for building a proxy URL
PROXY_URL_FORMATTER = '%(schema)s://%(ip)s:%(port)s'

USER_AGENT_LIST = [
  "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/22.0.1207.1 Safari/537.1",
  "Mozilla/5.0 (X11; CrOS i686 2268.111.0) AppleWebKit/536.11 (KHTML, like Gecko) Chrome/20.0.1132.57 Safari/536.11",
  "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.6 (KHTML, like Gecko) Chrome/20.0.1092.0 Safari/536.6",
  "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.6 (KHTML, like Gecko) Chrome/20.0.1090.0 Safari/536.6",
  "Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/19.77.34.5 Safari/537.1",
  "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/536.5 (KHTML, like Gecko) Chrome/19.0.1084.9 Safari/536.5",
  "Mozilla/5.0 (Windows NT 6.0) AppleWebKit/536.5 (KHTML, like Gecko) Chrome/19.0.1084.36 Safari/536.5",
  "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3",
  "Mozilla/5.0 (Windows NT 5.1) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3",
  "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_8_0) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3",
  "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1062.0 Safari/536.3",
  "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1062.0 Safari/536.3",
  "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3",
  "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3",
  "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3",
  "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.0 Safari/536.3",
  "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/535.24 (KHTML, like Gecko) Chrome/19.0.1055.1 Safari/535.24",
  "Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/535.24 (KHTML, like Gecko) Chrome/19.0.1055.1 Safari/535.24"
]
# Whether to check that a crawled proxy works before saving it (default True)
PROXY_CHECK_BEFOREADD = True
# URLs used to check proxy availability; multiple URLs per scheme are supported
PROXY_CHECK_URLS = {'https': ['https://icanhazip.com'], 'http': ['http://icanhazip.com']}
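To make the two format strings concrete, here is what they produce for an HTTP proxy (standalone sketch using the values above):

```python
PROXIES_REDIS_FORMATTER = 'proxies::{}'
PROXY_URL_FORMATTER = '%(schema)s://%(ip)s:%(port)s'

# Redis list key under which HTTP proxies are queued
key = PROXIES_REDIS_FORMATTER.format('http')
# Full proxy URL built from the parts stored on an IPProxy instance
url = PROXY_URL_FORMATTER % {'schema': 'http', 'ip': '127.0.0.1', 'port': '8050'}

print(key)  # proxies::http
print(url)  # http://127.0.0.1:8050
```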

proxy_util.py

proxy_util.py defines utility functions: proxy_to_dict(proxy) converts an IPProxy instance into a dict, proxy_from_dict(d) converts a dict back into an IPProxy instance, request_page() sends a request, and _is_proxy_available() checks whether a proxy IP works.

# -*- coding: utf-8 -*-
import random
import logging
import requests
from ipproxy import IPProxy
from settings import USER_AGENT_LIST, PROXY_CHECK_URLS

# Set the logger output format
logging.basicConfig(level=logging.INFO,
          format='[%(asctime)-15s] [%(levelname)8s] [%(name)10s ] - %(message)s (%(filename)s:%(lineno)s)',
          datefmt='%Y-%m-%d %T'
          )
logger = logging.getLogger(__name__)


def proxy_to_dict(proxy):
  d = {
    "schema": proxy.schema,
    "ip": proxy.ip,
    "port": proxy.port,
    "used_total": proxy.used_total,
    "success_times": proxy.success_times,
    "continuous_failed": proxy.continuous_failed,
    "created_time": proxy.created_time
  }
  return d


def proxy_from_dict(d):
  return IPProxy(schema=d['schema'], ip=d['ip'], port=d['port'], used_total=d['used_total'],
          success_times=d['success_times'], continuous_failed=d['continuous_failed'],
          created_time=d['created_time'])


# Strip leading and trailing whitespace
def strip(data):
  if data is not None:
    return data.strip()
  return data


base_headers = {
  'Accept-Encoding': 'gzip, deflate, br',
  'Accept-Language': 'en-US,en;q=0.9,zh-CN;q=0.8,zh;q=0.7'
}


def request_page(url, options=None, encoding='utf-8'):
  """Send a request and return the decoded response body"""
  headers = dict(base_headers, **(options or {}))
  if 'User-Agent' not in headers.keys():
    headers['User-Agent'] = random.choice(USER_AGENT_LIST)

  logger.info('Fetching: ' + url)
  try:
    response = requests.get(url, headers=headers)
    if response.status_code == 200:
      logger.info('Fetched: ' + url)
      return response.content.decode(encoding=encoding)
  except requests.RequestException:
    logger.error('Failed to fetch: ' + url)
    return None

def _is_proxy_available(proxy, options=None):
  """Check whether the proxy is available or not"""
  headers = dict(base_headers, **(options or {}))
  if 'User-Agent' not in headers.keys():
    headers['User-Agent'] = random.choice(USER_AGENT_LIST)
  proxies = {proxy.schema: proxy._get_url()}
  check_urls = PROXY_CHECK_URLS[proxy.schema]
  for url in check_urls:
    try:
      response = requests.get(url=url, proxies=proxies, headers=headers, timeout=5)
    except requests.RequestException:
      logger.info("< " + url + " > checked proxy < " + proxy._get_url() + " > result: unavailable")
    else:
      if response.status_code == 200:
        logger.info("< " + url + " > checked proxy < " + proxy._get_url() + " > result: available")
        return True
      else:
        logger.info("< " + url + " > checked proxy < " + proxy._get_url() + " > result: unavailable")
  return False


if __name__ == '__main__':
  headers = dict(base_headers)
  if 'User-Agent' not in headers.keys():
    headers['User-Agent'] = random.choice(USER_AGENT_LIST)
  proxies = {"https": "https://163.125.255.154:9797"}
  response = requests.get("https://www.baidu.com", headers=headers, proxies=proxies, timeout=3)
  print(response.content)
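The `dict(base_headers, **options)` idiom used in both functions merges the per-call options over the shared defaults, with the options winning on conflicts. A quick standalone illustration:

```python
base_headers = {
    'Accept-Encoding': 'gzip, deflate, br',
    'Accept-Language': 'en-US,en;q=0.9,zh-CN;q=0.8,zh;q=0.7'
}

# Per-call options both override and extend the defaults.
options = {'User-Agent': 'my-crawler/1.0', 'Accept-Language': 'en-US'}
headers = dict(base_headers, **options)

print(headers['Accept-Language'])  # en-US (overridden by options)
print(headers['Accept-Encoding'])  # gzip, deflate, br (kept from defaults)
print(headers['User-Agent'])       # my-crawler/1.0 (added by options)
```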

proxy_queue.py

A proxy queue stores proxies and serves them to clients; different queue implementations may use different storage and retrieval strategies. BaseQueue is the base class for all proxy queues and declares the interface every queue must implement: push a proxy, pop a proxy, report the queue length, and so on. The sample FifoQueue is a first-in-first-out queue backed by a Redis list; to guarantee that a given proxy IP is enqueued at most once, a Redis set (proxies::existed) is checked before each push.

# -*- coding: utf-8 -*-
from proxy_util import logger
import json
import redis
from ipproxy import IPProxy
from proxy_util import proxy_to_dict, proxy_from_dict, _is_proxy_available
from settings import PROXIES_REDIS_EXISTED, PROXIES_REDIS_FORMATTER, MAX_CONTINUOUS_TIMES, PROXY_CHECK_BEFOREADD

"""
Proxy Queue Base Class
"""
class BaseQueue(object):

  def __init__(self, server):
    """Initialize the proxy queue instance

    Parameters
    ----------
    server : StrictRedis
      Redis client instance
    """
    self.server = server

  def _serialize_proxy(self, proxy):
    """Serialize proxy instance"""
    return proxy_to_dict(proxy)

  def _deserialize_proxy(self, serialized_proxy):
    """Deserialize a proxy instance (the queue stores JSON, so parse it as JSON)"""
    return proxy_from_dict(json.loads(serialized_proxy))

  def __len__(self, schema='http'):
    """Return the length of the queue"""
    raise NotImplementedError

  def push(self, proxy, need_check):
    """Push a proxy"""
    raise NotImplementedError

  def pop(self, schema='http', timeout=0):
    """Pop a proxy"""
    raise NotImplementedError


class FifoQueue(BaseQueue):
  """First in first out queue"""

  def __len__(self, schema='http'):
    """Return the length of the queue"""
    return self.server.llen(PROXIES_REDIS_FORMATTER.format(schema))

  def push(self, proxy, need_check=PROXY_CHECK_BEFOREADD):
    """Push a proxy"""
    if need_check and not _is_proxy_available(proxy):
      return
    elif proxy.continuous_failed < MAX_CONTINUOUS_TIMES and not self._is_existed(proxy):
      key = PROXIES_REDIS_FORMATTER.format(proxy.schema)
      self.server.rpush(key, json.dumps(self._serialize_proxy(proxy),ensure_ascii=False))

  def pop(self, schema='http', timeout=0):
    """Pop a proxy"""
    if timeout > 0:
      p = self.server.blpop(PROXIES_REDIS_FORMATTER.format(schema.lower()), timeout)
      if isinstance(p, tuple):
        p = p[1]
    else:
      p = self.server.lpop(PROXIES_REDIS_FORMATTER.format(schema.lower()))
    if p:
      p = self._deserialize_proxy(p)
      self.server.srem(PROXIES_REDIS_EXISTED, p._get_url())
      return p

  def _is_existed(self, proxy):
    added = self.server.sadd(PROXIES_REDIS_EXISTED, proxy._get_url())
    return added == 0


if __name__ == '__main__':
  r = redis.StrictRedis(host='localhost', port=6379)
  queue = FifoQueue(r)
  proxy = IPProxy('http', '218.66.253.144', '80')
  queue.push(proxy)
  proxy = queue.pop(schema='http')
  print(proxy._get_url())
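The dedup-and-FIFO behavior FifoQueue gets from Redis (RPUSH/LPOP on the list, SADD/SREM on the set) can be sketched in memory, with no Redis server, to make the strategy explicit:

```python
from collections import deque

class InMemoryFifoQueue:
    """Same contract as FifoQueue, minus Redis: FIFO order plus
    at-most-once enqueueing keyed by the proxy URL."""

    def __init__(self):
        self._queue = deque()   # plays the role of the Redis list
        self._existed = set()   # plays the role of proxies::existed

    def push(self, proxy_url):
        if proxy_url in self._existed:   # like SADD returning 0: a duplicate
            return False
        self._existed.add(proxy_url)
        self._queue.append(proxy_url)    # RPUSH
        return True

    def pop(self):
        if not self._queue:
            return None
        proxy_url = self._queue.popleft()  # LPOP
        self._existed.discard(proxy_url)   # SREM, so it may be re-added later
        return proxy_url

q = InMemoryFifoQueue()
q.push('http://1.2.3.4:80')
q.push('http://1.2.3.4:80')    # duplicate, silently ignored
q.push('http://5.6.7.8:8080')
print(q.pop())  # http://1.2.3.4:80
```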

proxy_crawlers.py

ProxyBaseCrawler is the base class for all proxy crawlers; it defines a single method, _start_crawl(), which scrapes proxy IPs from the collected proxy sites.

# -*- coding: utf-8 -*-
from lxml import etree
from ipproxy import IPProxy
from proxy_util import strip, request_page, logger


class ProxyBaseCrawler(object):

  def __init__(self, queue=None, website=None, urls=[]):
    self.queue = queue
    self.website = website
    self.urls = urls

  def _start_crawl(self):
    raise NotImplementedError


class KuaiDailiCrawler(ProxyBaseCrawler): # kuaidaili.com
  def _start_crawl(self):
    for url_dict in self.urls:
      logger.info("Start crawling [ " + self.website + " ] :::> [ " + url_dict['type'] + " ]")
      has_more = True
      url = None
      while has_more:
        if 'page' in url_dict.keys() and '{}' in url_dict['url']:
          url = url_dict['url'].format(str(url_dict['page']))
          url_dict['page'] = url_dict['page'] + 1
        else:
          url = url_dict['url']
          has_more = False
        html = etree.HTML(request_page(url))
        tr_list = html.xpath("//table[@class='table table-bordered table-striped']/tbody/tr")
        for tr in tr_list:
          ip = tr.xpath("./td[@data-title='IP']/text()")[0] if len(
            tr.xpath("./td[@data-title='IP']/text()")) else None
          port = tr.xpath("./td[@data-title='PORT']/text()")[0] if len(
            tr.xpath("./td[@data-title='PORT']/text()")) else None
          schema = tr.xpath("./td[@data-title='类型']/text()")[0] if len(
            tr.xpath("./td[@data-title='类型']/text()")) else None
          proxy = IPProxy(schema=strip(schema), ip=strip(ip), port=strip(port))
          if proxy._check_format():
            self.queue.push(proxy)
        if not tr_list: # an empty page means there are no more results
          has_more = False


class FeiyiDailiCrawler(ProxyBaseCrawler): # feiyiproxy.com
  def _start_crawl(self):
    for url_dict in self.urls:
      logger.info("Start crawling [ " + self.website + " ] :::> [ " + url_dict['type'] + " ]")
      has_more = True
      url = None
      while has_more:
        if 'page' in url_dict.keys() and '{}' in url_dict['url']:
          url = url_dict['url'].format(str(url_dict['page']))
          url_dict['page'] = url_dict['page'] + 1
        else:
          url = url_dict['url']
          has_more = False
        html = etree.HTML(request_page(url))
        tr_list = html.xpath("//div[@id='main-content']//table/tr[position()>1]")
        for tr in tr_list:
          ip = tr.xpath("./td[1]/text()")[0] if len(tr.xpath("./td[1]/text()")) else None
          port = tr.xpath("./td[2]/text()")[0] if len(tr.xpath("./td[2]/text()")) else None
          schema = tr.xpath("./td[4]/text()")[0] if len(tr.xpath("./td[4]/text()")) else None
          proxy = IPProxy(schema=strip(schema), ip=strip(ip), port=strip(port))
          if proxy._check_format():
            self.queue.push(proxy)
        if not tr_list: # an empty page means there are no more results
          has_more = False


class WuyouDailiCrawler(ProxyBaseCrawler): # data5u.com
  def _start_crawl(self):
    for url_dict in self.urls:
      logger.info("Start crawling [ " + self.website + " ] :::> [ " + url_dict['type'] + " ]")
      has_more = True
      url = None
      while has_more:
        if 'page' in url_dict.keys() and '{}' in url_dict['url']:
          url = url_dict['url'].format(str(url_dict['page']))
          url_dict['page'] = url_dict['page'] + 1
        else:
          url = url_dict['url']
          has_more = False
        html = etree.HTML(request_page(url))
        ul_list = html.xpath("//div[@class='wlist'][2]//ul[@class='l2']")
        for ul in ul_list:
          ip = ul.xpath("./span[1]/li/text()")[0] if len(ul.xpath("./span[1]/li/text()")) else None
          port = ul.xpath("./span[2]/li/text()")[0] if len(ul.xpath("./span[2]/li/text()")) else None
          schema = ul.xpath("./span[4]/li/text()")[0] if len(ul.xpath("./span[4]/li/text()")) else None
          proxy = IPProxy(schema=strip(schema), ip=strip(ip), port=strip(port))
          if proxy._check_format():
            self.queue.push(proxy)
        if not ul_list: # an empty page means there are no more results
          has_more = False


class IPhaiDailiCrawler(ProxyBaseCrawler): # iphai.com
  def _start_crawl(self):
    for url_dict in self.urls:
      logger.info("Start crawling [ " + self.website + " ] :::> [ " + url_dict['type'] + " ]")
      has_more = True
      url = None
      while has_more:
        if 'page' in url_dict.keys() and '{}' in url_dict['url']:
          url = url_dict['url'].format(str(url_dict['page']))
          url_dict['page'] = url_dict['page'] + 1
        else:
          url = url_dict['url']
          has_more = False
        html = etree.HTML(request_page(url))
        tr_list = html.xpath("//table//tr[position()>1]")
        for tr in tr_list:
          ip = tr.xpath("./td[1]/text()")[0] if len(tr.xpath("./td[1]/text()")) else None
          port = tr.xpath("./td[2]/text()")[0] if len(tr.xpath("./td[2]/text()")) else None
          schema = tr.xpath("./td[4]/text()")[0] if len(tr.xpath("./td[4]/text()")) else None
          proxy = IPProxy(schema=strip(schema), ip=strip(ip), port=strip(port))
          if proxy._check_format():
            self.queue.push(proxy)
        if not tr_list: # an empty page means there are no more results
          has_more = False


class YunDailiCrawler(ProxyBaseCrawler): # ip3366.net
  def _start_crawl(self):
    for url_dict in self.urls:
      logger.info("Start crawling [ " + self.website + " ] :::> [ " + url_dict['type'] + " ]")
      has_more = True
      url = None
      while has_more:
        if 'page' in url_dict.keys() and '{}' in url_dict['url']:
          url = url_dict['url'].format(str(url_dict['page']))
          url_dict['page'] = url_dict['page'] + 1
        else:
          url = url_dict['url']
          has_more = False
        html = etree.HTML(request_page(url, encoding='gbk'))
        tr_list = html.xpath("//table/tbody/tr")
        for tr in tr_list:
          ip = tr.xpath("./td[1]/text()")[0] if len(tr.xpath("./td[1]/text()")) else None
          port = tr.xpath("./td[2]/text()")[0] if len(tr.xpath("./td[2]/text()")) else None
          schema = tr.xpath("./td[4]/text()")[0] if len(tr.xpath("./td[4]/text()")) else None
          proxy = IPProxy(schema=strip(schema), ip=strip(ip), port=strip(port))
          if proxy._check_format():
            self.queue.push(proxy)
        if not tr_list: # an empty page means there are no more results
          has_more = False


class XiCiDailiCrawler(ProxyBaseCrawler): # xicidaili.com
  def _start_crawl(self):
    for url_dict in self.urls:
      logger.info("Start crawling [ " + self.website + " ] :::> [ " + url_dict['type'] + " ]")
      has_more = True
      url = None
      while has_more:
        if 'page' in url_dict.keys() and '{}' in url_dict['url']:
          url = url_dict['url'].format(str(url_dict['page']))
          url_dict['page'] = url_dict['page'] + 1
        else:
          url = url_dict['url']
          has_more = False
        html = etree.HTML(request_page(url))
        tr_list = html.xpath("//table[@id='ip_list']//tr[@class!='subtitle']")
        for tr in tr_list:
          ip = tr.xpath("./td[2]/text()")[0] if len(tr.xpath("./td[2]/text()")) else None
          port = tr.xpath("./td[3]/text()")[0] if len(tr.xpath("./td[3]/text()")) else None
          schema = tr.xpath("./td[6]/text()")[0] if len(tr.xpath("./td[6]/text()")) else None
          if schema is not None and schema.lower() in ("http", "https"):
            proxy = IPProxy(schema=strip(schema), ip=strip(ip), port=strip(port))
            if proxy._check_format():
              self.queue.push(proxy)
        if not tr_list: # an empty page means there are no more results
          has_more = False
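Every crawler above repeats the same extraction idiom: parse the page with lxml, select the rows with XPath, then take a cell's first text node if it exists. A standalone sketch against a small inline table (hypothetical markup; lxml assumed installed, as above):

```python
from lxml import etree

# Hypothetical page fragment in the same shape the XiCi crawler expects.
HTML = """
<table id="ip_list">
  <tr class="subtitle"><th>IP</th><th>PORT</th></tr>
  <tr class="odd"><td>1.2.3.4</td><td>80</td></tr>
</table>
"""

html = etree.HTML(HTML)
rows = []
for tr in html.xpath("//table[@id='ip_list']//tr[@class!='subtitle']"):
    # Same "first match or None" pattern the crawlers use.
    ips = tr.xpath("./td[1]/text()")
    ip = ips[0] if len(ips) else None
    ports = tr.xpath("./td[2]/text()")
    port = ports[0] if len(ports) else None
    rows.append((str(ip), str(port)))

print(rows)  # [('1.2.3.4', '80')]
```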

run.py

run.py starts the crawler for each proxy site.

# -*- coding: utf-8 -*-
import redis
from proxy_queue import FifoQueue
from settings import REDIS_HOST, REDIS_PORT
from proxy_crawlers import WuyouDailiCrawler, FeiyiDailiCrawler, KuaiDailiCrawler, IPhaiDailiCrawler, YunDailiCrawler, \
  XiCiDailiCrawler

r = redis.StrictRedis(host=REDIS_HOST, port=REDIS_PORT)
fifo_queue = FifoQueue(r)


def run_kuai():
  kuaidailiCrawler = KuaiDailiCrawler(queue=fifo_queue, website='KuaiDaili',
                    urls=[{'url': 'https://www.kuaidaili.com/free/inha/{}/', 'type': 'domestic high-anonymity',
                        'page': 1},
                       {'url': 'https://www.kuaidaili.com/free/intr/{}/', 'type': 'domestic regular',
                        'page': 1}])
  kuaidailiCrawler._start_crawl()


def run_feiyi():
  feiyidailiCrawler = FeiyiDailiCrawler(queue=fifo_queue, website='FeiYiDaili',
                     urls=[{'url': 'http://www.feiyiproxy.com/?page_id=1457', 'type': 'homepage picks'}])
  feiyidailiCrawler._start_crawl()


def run_wuyou():
  wuyoudailiCrawler = WuyouDailiCrawler(queue=fifo_queue, website='WuYouDaili',
                     urls=[{'url': 'http://www.data5u.com/free/index.html', 'type': 'homepage picks'},
                        {'url': 'http://www.data5u.com/free/gngn/index.shtml', 'type': 'domestic high-anonymity'},
                        {'url': 'http://www.data5u.com/free/gnpt/index.shtml', 'type': 'domestic regular'}])
  wuyoudailiCrawler._start_crawl()


def run_iphai():
  crawler = IPhaiDailiCrawler(queue=fifo_queue, website='IPHai',
                urls=[{'url': 'http://www.iphai.com/free/ng', 'type': 'domestic high-anonymity'},
                   {'url': 'http://www.iphai.com/free/np', 'type': 'domestic regular'},
                   {'url': 'http://www.iphai.com/free/wg', 'type': 'foreign high-anonymity'},
                   {'url': 'http://www.iphai.com/free/wp', 'type': 'foreign regular'}])
  crawler._start_crawl()


def run_yun():
  crawler = YunDailiCrawler(queue=fifo_queue, website='YunDaili',
               urls=[{'url': 'http://www.ip3366.net/free/?stype=1&page={}', 'type': 'domestic high-anonymity', 'page': 1},
                  {'url': 'http://www.ip3366.net/free/?stype=2&page={}', 'type': 'domestic regular', 'page': 1},
                  {'url': 'http://www.ip3366.net/free/?stype=3&page={}', 'type': 'foreign high-anonymity', 'page': 1},
                  {'url': 'http://www.ip3366.net/free/?stype=4&page={}', 'type': 'foreign regular', 'page': 1}])
  crawler._start_crawl()


def run_xici():
  crawler = XiCiDailiCrawler(queue=fifo_queue, website='XiCiDaili',
                urls=[{'url': 'https://www.xicidaili.com/', 'type': 'homepage picks'},
                   {'url': 'https://www.xicidaili.com/nn/{}', 'type': 'domestic high-anonymity', 'page': 1},
                   {'url': 'https://www.xicidaili.com/nt/{}', 'type': 'domestic regular', 'page': 1},
                   {'url': 'https://www.xicidaili.com/wn/{}', 'type': 'foreign high-anonymity', 'page': 1},
                   {'url': 'https://www.xicidaili.com/wt/{}', 'type': 'foreign regular', 'page': 1}])
  crawler._start_crawl()


if __name__ == '__main__':
  run_xici()
  run_iphai()
  run_kuai()
  run_feiyi()
  run_yun()
  run_wuyou()

A sample of the backend log while crawling XiCiDaili:

[figure: crawl log output omitted]

The data structure of the crawled proxy IPs in Redis:

[figure: Redis data structure omitted]

IV. Testing a Proxy

Next, use one of the crawled proxies to request http://icanhazip.com as a test:

# -*- coding: utf-8 -*-
import random
import requests
from proxy_util import logger
from run import fifo_queue
from settings import USER_AGENT_LIST
from proxy_util import base_headers

# Test URL
url = 'http://icanhazip.com'

# Take a proxy from the queue
proxy = fifo_queue.pop(schema='http')
proxies = {proxy.schema: proxy._get_url()}

# Build the request headers
headers = dict(base_headers)
if 'User-Agent' not in headers.keys():
  headers['User-Agent'] = random.choice(USER_AGENT_LIST)

response = None
succeeded = False
try:
  response = requests.get(url, headers=headers, proxies=proxies, timeout=5)
except requests.RequestException:
  logger.error("Proxy < " + proxy._get_url() + " > request to < " + url + " > result: failed")
else:
  if response.status_code == 200:
    logger.info(response.content.decode())
    succeeded = True
    logger.info("Proxy < " + proxy._get_url() + " > request to < " + url + " > result: success")
  else:
    logger.info(response.content.decode())
    logger.info("Proxy < " + proxy._get_url() + " > request to < " + url + " > result: failed")

# Update the proxy's counters based on the outcome
proxy._update(succeeded)
# Return the proxy to the queue without re-checking availability
fifo_queue.push(proxy, need_check=False)

After a successful request through the proxy http://218.66.253.144:80, the proxy is put back on the queue, and its used_total, success_times, and continuous_failed fields in Redis are updated accordingly.
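The bookkeeping that _update performs can be traced with plain numbers; this standalone sketch mirrors the same counter logic:

```python
# Counters as stored on a freshly crawled proxy.
used_total, success_times, continuous_failed = 0, 0, 0

def update(succeeded, used_total, success_times, continuous_failed):
    """Mirror of IPProxy._update: one more use; on success reset the
    failure streak and bump successes, on failure extend the streak."""
    used_total += 1
    if succeeded:
        continuous_failed = 0
        success_times += 1
    else:
        continuous_failed += 1
    return used_total, success_times, continuous_failed

# A success followed by two failures:
state = update(True, used_total, success_times, continuous_failed)
state = update(False, *state)
state = update(False, *state)
print(state)  # (3, 1, 2)
```

One more failure would reach MAX_CONTINUOUS_TIMES (3), at which point FifoQueue.push refuses to re-enqueue the proxy.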

[figure: updated proxy fields in Redis omitted]

Project repository: https://github.com/pengjunlee/ipproxy_pool.git

