Python example: scraping Xici proxy IPs with requests, XPath, and multiple threads


Posted in Python on March 06, 2020

Without further ado, here is the code.

import requests,random
from lxml import etree
import threading
import time

angents = [
  "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; AcooBrowser; .NET CLR 1.1.4322; .NET CLR 2.0.50727)",
  "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.0; Acoo Browser; SLCC1; .NET CLR 2.0.50727; Media Center PC 5.0; .NET CLR 3.0.04506)",
  "Mozilla/4.0 (compatible; MSIE 7.0; AOL 9.5; AOLBuild 4337.35; Windows NT 5.1; .NET CLR 1.1.4322; .NET CLR 2.0.50727)",
  "Mozilla/5.0 (Windows; U; MSIE 9.0; Windows NT 9.0; en-US)",
  "Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Win64; x64; Trident/5.0; .NET CLR 3.5.30729; .NET CLR 3.0.30729; .NET CLR 2.0.50727; Media Center PC 6.0)",
  "Mozilla/5.0 (compatible; MSIE 8.0; Windows NT 6.0; Trident/4.0; WOW64; Trident/4.0; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; .NET CLR 1.0.3705; .NET CLR 1.1.4322)",
  "Mozilla/4.0 (compatible; MSIE 7.0b; Windows NT 5.2; .NET CLR 1.1.4322; .NET CLR 2.0.50727; InfoPath.2; .NET CLR 3.0.04506.30)",
  "Mozilla/5.0 (Windows; U; Windows NT 5.1; zh-CN) AppleWebKit/523.15 (KHTML, like Gecko, Safari/419.3) Arora/0.3 (Change: 287 c9dfb30)",
  "Mozilla/5.0 (X11; U; Linux; en-US) AppleWebKit/527+ (KHTML, like Gecko, Safari/419.3) Arora/0.6",
  "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1.2pre) Gecko/20070215 K-Ninja/2.1.1",
  "Mozilla/5.0 (Windows; U; Windows NT 5.1; zh-CN; rv:1.9) Gecko/20080705 Firefox/3.0 Kapiko/3.0",
  "Mozilla/5.0 (X11; Linux i686; U;) Gecko/20070322 Kazehakase/0.4.5",
  "Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.0.8) Gecko Fedora/1.9.0.8-1.fc10 Kazehakase/0.5.6",
  "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/535.11 (KHTML, like Gecko) Chrome/17.0.963.56 Safari/535.11",
  "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_3) AppleWebKit/535.20 (KHTML, like Gecko) Chrome/19.0.1036.7 Safari/535.20",
  "Opera/9.80 (Macintosh; Intel Mac OS X 10.6.8; U; fr) Presto/2.9.168 Version/11.52",
]

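def get_all_xici_urls(start_num,stop_num):
  # Build one listing URL per page; start_num and stop_num are both included.
  xici_urls = []
  # stop_num is an int, so use it directly (len(stop_num) raises TypeError).
  for num in range(start_num,stop_num+1):
    xici_http_url = 'http://www.xicidaili.com/wt/'
    xici_http_url += str(num)
    xici_urls.append(xici_http_url)
  print('Finished building the list of xici URLs to crawl...')
  return xici_urls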
def get_all_http_ip(xici_http_url,headers,proxies_list):
  # Scrape one listing page and append its proxies to the shared proxies_list.
  try:
    # The 2nd child of each table row holds the IP, the 3rd holds the port.
    all_ip_xpath = '//table//tr/child::*[2]/text()'
    all_prot_xpath = '//table//tr/child::*[3]/text()'
    # A timeout keeps a stalled page from blocking its thread forever.
    response = requests.get(url=xici_http_url,headers=headers,timeout=10)
    html_tree = etree.HTML(response.text)
    ip_list = html_tree.xpath(all_ip_xpath)
    port_list = html_tree.xpath(all_prot_xpath)
    # print(ip_list)
    # print(port_list)
    new_proxies_list = []
    # Start at index 1: index 0 is expected to be the table's header row.
    for index in range(1,len(ip_list)):
      # print('http://{}:{}'.format(ip_list[index],port_list[index]))
      proxies_dict = {}
      proxies_dict['http'] = 'http://{}:{}'.format(str(ip_list[index]),str(port_list[index]))
      new_proxies_list.append(proxies_dict)
    proxies_list += new_proxies_list
    return proxies_list
  except Exception as e:
    print('An error occurred: url =',xici_http_url,'error =',e)

if __name__ == '__main__':
  start_num = int(input('Enter the first page: ').strip())
  stop_num = int(input('Enter the last page: ').strip())
  print('Starting crawl...')
  t_list = []
  # Holds the Xici proxy IPs collected by the worker threads
  proxies_list = []
  # Use one thread per listing page
  xici_urls = get_all_xici_urls(start_num,stop_num)
  for xici_get_url in xici_urls:
    # Pick a random User-Agent
    headers = {'User-Agent': random.choice(angents)}
    t = threading.Thread(target=get_all_http_ip,args=(xici_get_url,headers,proxies_list))
    t.start()
    t_list.append(t)
  for j in t_list:
    j.join()
  print('All requested proxy IPs have been crawled...')
  print(proxies_list)
  print(len(proxies_list))
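
The script above starts one thread per page, so a large page range opens many connections at once. As a minimal sketch of a bounded alternative, the standard library's concurrent.futures can cap the number of workers. The names build_page_urls, collect_proxies, and fake_fetch_page are hypothetical and not part of the original script; fake_fetch_page is an offline stand-in for a real page scraper that returns a list instead of mutating a shared one.

```python
from concurrent.futures import ThreadPoolExecutor


def build_page_urls(start_num, stop_num, base_url="http://www.xicidaili.com/wt/"):
    """Return one listing URL per page, with both ends inclusive."""
    return [base_url + str(num) for num in range(start_num, stop_num + 1)]


def collect_proxies(urls, fetch_page, max_workers=4):
    """Run fetch_page(url) on a bounded pool and merge the per-page results.

    fetch_page is a hypothetical callable that returns a list of proxy
    dicts for one page, or None when that page failed.
    """
    merged = []
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map yields results in the same order as `urls`.
        for page_result in pool.map(fetch_page, urls):
            if page_result:
                merged.extend(page_result)
    return merged


def fake_fetch_page(url):
    """Offline stand-in for a real scraper, used only for illustration."""
    page = url.rsplit("/", 1)[-1]
    return [{"http": "http://10.0.0.{}:8080".format(page)}]


urls = build_page_urls(1, 3)
proxies = collect_proxies(urls, fake_fetch_page)
print(proxies)
```

Because each worker returns its own list and the merge happens in the calling thread, no list is shared between threads in this variant.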

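Supplementary notes: scraping Xici's free proxies with Python and validating them (the key points, explained clearly)

There are plenty of posts about scraping Xici, but few of them explain validation clearly, so I will go through it carefully here.

I wrote a class named proxy with four methods (this is just my own style, nothing special): get_user_agent (returns a random User-Agent, the most important request header), get_proxy (scrapes proxy IPs), test_proxy (checks whether a proxy is usable), and store_txt (saves the usable proxies to a txt file).

1. Scraping: headers is the request headers, choice selects whether to scrape HTTP or HTTPS proxies, and first and end are the starting and ending page numbers (end is exclusive, so the page numbered end is not scraped).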

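def get_proxy(self, headers, choice='http', first=1, end=2):
    """
    Scrape proxies from the listing pages
    :param headers: request headers
    :param choice: 'http' or 'https', the kind of proxy to scrape
    :param first: first page to scrape
    :param end: page after the last one to scrape (exclusive)
    :return: list of 'ip:port' strings
    """
 
    ip_list = []
    base_url = None
    
    # Pick the listing to scrape: one lists HTTP proxies, the other HTTPS proxies
    if choice == 'http':
      base_url = 'http://www.xicidaili.com/wt/'
    elif choice == 'https':
      base_url = 'http://www.xicidaili.com/wn/'
    
    # Loop over the pages, match IP and port with a regex, and join them with ':'
    for n in range(first, end):
      actual_url = base_url + str(n)
      html = requests.get(url=actual_url, headers=headers).text
      # Raw string, so the backslashes reach the regex engine unchanged
      pattern = r'(\d+\.\d+\.\d+\.\d+)</td>\s*<td>(\d+)'
      re_list = re.findall(pattern, html)
 
      for ip_port in re_list:
        ip_port = ip_port[0] + ':' + ip_port[1]
        ip_list.append(ip_port)
    return ip_list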
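
To see what the regular expression in get_proxy captures without hitting the site, here is a small offline sketch. SAMPLE_HTML is a hypothetical fragment shaped like two table rows (an IP cell followed directly by a port cell); the real page markup may differ, and extract_ip_ports is a helper name introduced only for this illustration.

```python
import re

# Hypothetical HTML fragment: an IP cell immediately followed by a port cell.
SAMPLE_HTML = """
<tr>
  <td class="country"></td>
  <td>1.2.3.4</td>
  <td>8080</td>
</tr>
<tr>
  <td class="country"></td>
  <td>5.6.7.8</td>
  <td>3128</td>
</tr>
"""

# Same pattern as in get_proxy: group 1 is the IP, group 2 is the port.
PATTERN = re.compile(r'(\d+\.\d+\.\d+\.\d+)</td>\s*<td>(\d+)')


def extract_ip_ports(html):
    """Return every IP/port pair found in html as an 'ip:port' string."""
    return [ip + ':' + port for ip, port in PATTERN.findall(html)]


print(extract_ip_ports(SAMPLE_HTML))
```

With two capture groups, re.findall returns a list of (ip, port) tuples, which is why get_proxy indexes each match with [0] and [1].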

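2. Validation: most posts online simply send a request through the proxy and check whether it succeeds or whether the status code is 200. The problem is that the request can succeed even though it did not go through the proxy you configured and went out from your own public IP instead. (The address that ifconfig shows is usually your LAN address, that is, a private IP.)

On Linux, any one of the following commands shows your public IP: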

curl icanhazip.com
curl ifconfig.me
curl curlmyip.com
curl ip.appspot.com
curl ipinfo.io/ip
curl ipecho.net/plain
curl www.trackip.net/i
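
The same lookup can be done from Python. The following is a minimal sketch using only the standard library; fetch_public_ip and its opener parameter are hypothetical names introduced here so the function can be exercised without network access, and it assumes the echo service answers with the bare IP as plain text.

```python
import urllib.request


def fetch_public_ip(url="http://icanhazip.com/", timeout=8,
                    opener=urllib.request.urlopen):
    """Return the address that the echo service reports for this request.

    opener is a hypothetical injection point; by default it is
    urllib.request.urlopen, which performs a real HTTP request.
    """
    with opener(url, timeout=timeout) as response:
        # The body is expected to be the IP followed by a newline.
        return response.read().decode("utf-8").strip()


# Example call (needs network access, so it is left commented out):
# print(fetch_public_ip())
```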

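Note: the fix is to do what those commands do, but through the scraped proxy. Request http://icanhazip.com/, which returns the public IP the request came from (when the proxy is really in use, this is the proxy's IP). Read the response text and compare it with the proxy IP you configured. If they differ, the proxy is unusable: the request succeeded, but it was not routed through the proxy and went out from your own public IP, so the success says nothing about the proxy. If they match, the request really did go through the proxy, so the proxy works.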

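def test_proxy(self, ip_port, choice='http'):
    """
    Check whether a proxy is usable
    :param ip_port: proxy as an 'ip:port' string
    :param choice: 'http' or 'https'
    :return: the proxy IP if the request went through it, otherwise False
    """
    proxies = None
 
    # This site returns the public IP the request came from; through a working
    # proxy it returns the proxy's IP. Comparing the two catches requests that
    # silently went out from your own public IP instead.
    tar_url = "http://icanhazip.com/"
 
    # Get a random User-Agent
    user_agent = self.get_user_agent()
 
    # Put the User-Agent into the headers
    headers = {'User-Agent': user_agent}
 
    # Choose whether an HTTP or an HTTPS proxy is being validated
    if choice == 'http':
      proxies = {
        "http": "http://"+ip_port,
      }
 
    elif choice == 'https':
      # requests picks a proxy by the scheme of the target URL, so the target
      # must be https for the "https" entry to be used at all
      tar_url = "https://icanhazip.com/"
      proxies = {
        "https": "https://" + ip_port,
      }
 
    try:
      # Split the IP off the 'ip:port' string
      thisIP = ip_port.split(":")[0]
      res = requests.get(tar_url, proxies=proxies, headers=headers, timeout=8)
 
      # Take the returned text; strip() is required to drop surrounding whitespace
      proxyIP = res.text.strip()
      
      # Three outcomes: the request fails (False), it succeeds but the reported
      # IP is not the proxy's (False), or the reported IP matches (return it)
      if proxyIP == thisIP:
        return proxyIP
      else:
        return False
    except requests.RequestException:
      return False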
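
The comparison itself can be separated from the network call, which makes the three outcomes explicit and easy to test offline. This is a sketch under stated assumptions, not the author's method: classify_proxy, proxy_host, and the fetch_reported_ip callable are hypothetical names, and the labels 'ok', 'mismatch', and 'unreachable' are chosen here for illustration.

```python
def proxy_host(ip_port):
    """Return the host part of an 'ip:port' string."""
    return ip_port.split(":", 1)[0]


def classify_proxy(ip_port, fetch_reported_ip):
    """Classify a proxy as 'ok', 'mismatch', or 'unreachable'.

    fetch_reported_ip is a hypothetical callable: given 'ip:port', it sends
    a request through that proxy to an IP echo service and returns the
    address the service saw, raising an exception on failure.
    """
    try:
        reported = fetch_reported_ip(ip_port).strip()
    except Exception:
        # The request did not complete at all.
        return "unreachable"
    # Success alone is not enough: the echoed IP must be the proxy's own.
    return "ok" if reported == proxy_host(ip_port) else "mismatch"


# Offline demonstration with stand-in fetchers instead of real requests.
print(classify_proxy("1.2.3.4:8080", lambda p: "1.2.3.4\n"))
print(classify_proxy("1.2.3.4:8080", lambda p: "9.9.9.9\n"))
```

Only the 'ok' case corresponds to test_proxy returning the proxy IP; the other two both correspond to it returning False.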

Finally, here is the complete code:

import requests
import re
import random
import codecs


class proxy:
  """
  Proxy helper class
  """
  def __init__(self):
    pass
 
  def get_user_agent(self):
    """
    Return a random User-Agent
    :return: a User-Agent string
    """
    user_agents = [
      "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; AcooBrowser; .NET CLR 1.1.4322; .NET CLR 2.0.50727)",
      "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.0; Acoo Browser; SLCC1; .NET CLR 2.0.50727; Media Center PC 5.0; .NET CLR 3.0.04506)",
      "Mozilla/4.0 (compatible; MSIE 7.0; AOL 9.5; AOLBuild 4337.35; Windows NT 5.1; .NET CLR 1.1.4322; .NET CLR 2.0.50727)",
      "Mozilla/5.0 (Windows; U; MSIE 9.0; Windows NT 9.0; en-US)",
      "Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Win64; x64; Trident/5.0; .NET CLR 3.5.30729; .NET CLR 3.0.30729; .NET CLR 2.0.50727; Media Center PC 6.0)",
      "Mozilla/5.0 (compatible; MSIE 8.0; Windows NT 6.0; Trident/4.0; WOW64; Trident/4.0; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; .NET CLR 1.0.3705; .NET CLR 1.1.4322)",
      "Mozilla/4.0 (compatible; MSIE 7.0b; Windows NT 5.2; .NET CLR 1.1.4322; .NET CLR 2.0.50727; InfoPath.2; .NET CLR 3.0.04506.30)",
      "Mozilla/5.0 (Windows; U; Windows NT 5.1; zh-CN) AppleWebKit/523.15 (KHTML, like Gecko, Safari/419.3) Arora/0.3 (Change: 287 c9dfb30)",
      "Mozilla/5.0 (X11; U; Linux; en-US) AppleWebKit/527+ (KHTML, like Gecko, Safari/419.3) Arora/0.6",
      "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1.2pre) Gecko/20070215 K-Ninja/2.1.1",
      "Mozilla/5.0 (Windows; U; Windows NT 5.1; zh-CN; rv:1.9) Gecko/20080705 Firefox/3.0 Kapiko/3.0",
      "Mozilla/5.0 (X11; Linux i686; U;) Gecko/20070322 Kazehakase/0.4.5",
      "Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.0.8) Gecko Fedora/1.9.0.8-1.fc10 Kazehakase/0.5.6",
      "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/535.11 (KHTML, like Gecko) Chrome/17.0.963.56 Safari/535.11",
      "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_3) AppleWebKit/535.20 (KHTML, like Gecko) Chrome/19.0.1036.7 Safari/535.20",
      "Opera/9.80 (Macintosh; Intel Mac OS X 10.6.8; U; fr) Presto/2.9.168 Version/11.52",
      "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.11 (KHTML, like Gecko) Chrome/20.0.1132.11 TaoBrowser/2.0 Safari/536.11",
      "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/21.0.1180.71 Safari/537.1 LBBROWSER",
      "Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; WOW64; Trident/5.0; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; Media Center PC 6.0; .NET4.0C; .NET4.0E; LBBROWSER)",
      "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; QQDownload 732; .NET4.0C; .NET4.0E; LBBROWSER)",
      "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/535.11 (KHTML, like Gecko) Chrome/17.0.963.84 Safari/535.11 LBBROWSER",
      "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.1; WOW64; Trident/5.0; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; Media Center PC 6.0; .NET4.0C; .NET4.0E)",
      "Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; WOW64; Trident/5.0; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; Media Center PC 6.0; .NET4.0C; .NET4.0E; QQBrowser/7.0.3698.400)",
      "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; QQDownload 732; .NET4.0C; .NET4.0E)",
      "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; Trident/4.0; SV1; QQDownload 732; .NET4.0C; .NET4.0E; 360SE)",
      "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; QQDownload 732; .NET4.0C; .NET4.0E)",
      "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.1; WOW64; Trident/5.0; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; Media Center PC 6.0; .NET4.0C; .NET4.0E)",
      "Mozilla/5.0 (Windows NT 5.1) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/21.0.1180.89 Safari/537.1",
      "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/21.0.1180.89 Safari/537.1",
      "Mozilla/5.0 (iPad; U; CPU OS 4_2_1 like Mac OS X; zh-cn) AppleWebKit/533.17.9 (KHTML, like Gecko) Version/5.0.2 Mobile/8C148 Safari/6533.18.5",
      "Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:2.0b13pre) Gecko/20110307 Firefox/4.0b13pre",
      "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:16.0) Gecko/20100101 Firefox/16.0",
      "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.11 (KHTML, like Gecko) Chrome/23.0.1271.64 Safari/537.11",
      "Mozilla/5.0 (X11; U; Linux x86_64; zh-CN; rv:1.9.2.10) Gecko/20100922 Ubuntu/10.10 (maverick) Firefox/3.6.10"
    ]
    user_agent = random.choice(user_agents)
    return user_agent
 
 
  def get_proxy(self, headers, choice='http', first=1, end=2):
    """
    Scrape proxies from the listing pages
    :param headers: request headers
    :param choice: 'http' or 'https', the kind of proxy to scrape
    :param first: first page to scrape
    :param end: page after the last one to scrape (exclusive)
    :return: list of 'ip:port' strings
    """
 
    ip_list = []
    base_url = None
    if choice == 'http':
      base_url = 'http://www.xicidaili.com/wt/'
    elif choice == 'https':
      base_url = 'http://www.xicidaili.com/wn/'
 
    for n in range(first, end):
      actual_url = base_url + str(n)
      html = requests.get(url=actual_url, headers=headers).text
      # Raw string, so the backslashes reach the regex engine unchanged
      pattern = r'(\d+\.\d+\.\d+\.\d+)</td>\s*<td>(\d+)'
      re_list = re.findall(pattern, html)
 
      for ip_port in re_list:
        ip_port = ip_port[0] + ':' + ip_port[1]
        ip_list.append(ip_port)
    return ip_list
 
 
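  def test_proxy(self, ip_port, choice='http'):
    """
    Check whether a proxy is usable
    :param ip_port: proxy as an 'ip:port' string
    :param choice: 'http' or 'https'
    :return: the proxy IP if the request went through it, otherwise False
    """
    proxies = None
    # This site returns the public IP the request came from; through a working
    # proxy it returns the proxy's IP. Comparing the two catches requests that
    # silently went out from your own public IP instead.
    tar_url = "http://icanhazip.com/"
    user_agent = self.get_user_agent()
    headers = {'User-Agent': user_agent}
    if choice == 'http':
      proxies = {
        "http": "http://"+ip_port,
      }
 
    elif choice == 'https':
      # requests picks a proxy by the scheme of the target URL, so the target
      # must be https for the "https" entry to be used at all
      tar_url = "https://icanhazip.com/"
      proxies = {
        "https": "https://" + ip_port,
      }
    try:
      thisIP = ip_port.split(":")[0]
      res = requests.get(tar_url, proxies=proxies, headers=headers, timeout=8)
      proxyIP = res.text.strip()
      if proxyIP == thisIP:
        return proxyIP
      else:
        return False
    except requests.RequestException:
      return False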
 
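  def store_txt(self, choice='http', first=1, end=2):
    """
    Save every ip_port that passes the test to a txt file
    :param choice: 'http' or 'https'
    :param first: first page to scrape
    :param end: page after the last one to scrape (exclusive)
    :return:
    """
    user_agent = self.get_user_agent()
    headers = {'User-Agent': user_agent}
    ip_list = self.get_proxy(headers=headers, choice=choice, first=first, end=end)
    with codecs.open("Http_Agent.txt", 'a', 'utf-8') as file:
      for ip_port in ip_list:
        # test_proxy returns only the IP (or False), so keep ip_port intact
        # and write the full 'ip:port' rather than the bare IP
        verified_ip = self.test_proxy(ip_port, choice=choice)
        print(verified_ip)
        if verified_ip:
          file.write('\'' + ip_port + "\'\n")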
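
Once proxies have been saved, they need to be read back before use. The following sketch assumes the file holds one entry per line wrapped in single quotes, as store_txt writes it; load_saved_proxies and to_requests_proxies are hypothetical helper names, and the demonstration writes to a temporary file rather than the real Http_Agent.txt.

```python
import os
import tempfile


def load_saved_proxies(path):
    """Read entries written one per line, each wrapped in single quotes."""
    entries = []
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            # Drop the newline first, then the surrounding quotes.
            entry = line.strip().strip("'")
            if entry:
                entries.append(entry)
    return entries


def to_requests_proxies(entry):
    """Build the dict shape that requests accepts as its proxies argument."""
    return {"http": "http://" + entry}


# Offline demonstration using a temporary file with two made-up entries.
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, "Http_Agent.txt")
    with open(demo_path, "w", encoding="utf-8") as handle:
        handle.write("'1.2.3.4:8080'\n'5.6.7.8:3128'\n")
    loaded = load_saved_proxies(demo_path)

print([to_requests_proxies(entry) for entry in loaded])
```

Free proxies go stale quickly, so it is reasonable to run each loaded entry through test_proxy again before relying on it.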

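That is all for this example of scraping Xici proxy IPs in Python with requests, XPath, and multiple threads. I hope it serves as a useful reference, and thank you for supporting 三水点靠木.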
