An example of crawling Xici proxy IPs in Python with requests, XPath, and multithreading


Posted in Python on March 06, 2020

Without further ado, let's go straight to the code!

import requests,random
from lxml import etree
import threading
import time

agents = [
  "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; AcooBrowser; .NET CLR 1.1.4322; .NET CLR 2.0.50727)",
  "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.0; Acoo Browser; SLCC1; .NET CLR 2.0.50727; Media Center PC 5.0; .NET CLR 3.0.04506)",
  "Mozilla/4.0 (compatible; MSIE 7.0; AOL 9.5; AOLBuild 4337.35; Windows NT 5.1; .NET CLR 1.1.4322; .NET CLR 2.0.50727)",
  "Mozilla/5.0 (Windows; U; MSIE 9.0; Windows NT 9.0; en-US)",
  "Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Win64; x64; Trident/5.0; .NET CLR 3.5.30729; .NET CLR 3.0.30729; .NET CLR 2.0.50727; Media Center PC 6.0)",
  "Mozilla/5.0 (compatible; MSIE 8.0; Windows NT 6.0; Trident/4.0; WOW64; Trident/4.0; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; .NET CLR 1.0.3705; .NET CLR 1.1.4322)",
  "Mozilla/4.0 (compatible; MSIE 7.0b; Windows NT 5.2; .NET CLR 1.1.4322; .NET CLR 2.0.50727; InfoPath.2; .NET CLR 3.0.04506.30)",
  "Mozilla/5.0 (Windows; U; Windows NT 5.1; zh-CN) AppleWebKit/523.15 (KHTML, like Gecko, Safari/419.3) Arora/0.3 (Change: 287 c9dfb30)",
  "Mozilla/5.0 (X11; U; Linux; en-US) AppleWebKit/527+ (KHTML, like Gecko, Safari/419.3) Arora/0.6",
  "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1.2pre) Gecko/20070215 K-Ninja/2.1.1",
  "Mozilla/5.0 (Windows; U; Windows NT 5.1; zh-CN; rv:1.9) Gecko/20080705 Firefox/3.0 Kapiko/3.0",
  "Mozilla/5.0 (X11; Linux i686; U;) Gecko/20070322 Kazehakase/0.4.5",
  "Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.0.8) Gecko Fedora/1.9.0.8-1.fc10 Kazehakase/0.5.6",
  "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/535.11 (KHTML, like Gecko) Chrome/17.0.963.56 Safari/535.11",
  "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_3) AppleWebKit/535.20 (KHTML, like Gecko) Chrome/19.0.1036.7 Safari/535.20",
  "Opera/9.80 (Macintosh; Intel Mac OS X 10.6.8; U; fr) Presto/2.9.168 Version/11.52",
]

def get_all_xici_urls(start_num, stop_num):
  """Build the list of Xici page URLs to crawl, from start_num to stop_num inclusive."""
  xici_urls = []
  for num in range(start_num, stop_num + 1):
    xici_http_url = 'http://www.xicidaili.com/wt/' + str(num)
    xici_urls.append(xici_http_url)
  print('Finished building the list of Xici URLs to crawl...')
  return xici_urls
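
For example, requesting pages 1 through 3 yields:

>>> get_all_xici_urls(1, 3)
Finished building the list of Xici URLs to crawl...
['http://www.xicidaili.com/wt/1', 'http://www.xicidaili.com/wt/2', 'http://www.xicidaili.com/wt/3']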

def get_all_http_ip(xici_http_url, headers, proxies_list):
  """Scrape one Xici page and append the proxies found on it to the shared proxies_list."""
  try:
    # In each table row, the 2nd cell holds the IP and the 3rd holds the port
    all_ip_xpath = '//table//tr/child::*[2]/text()'
    all_port_xpath = '//table//tr/child::*[3]/text()'
    response = requests.get(url=xici_http_url, headers=headers)
    html_tree = etree.HTML(response.text)
    ip_list = html_tree.xpath(all_ip_xpath)
    port_list = html_tree.xpath(all_port_xpath)
    new_proxies_list = []
    # Start at index 1 to skip the table's header row
    for index in range(1, len(ip_list)):
      proxies_dict = {}
      proxies_dict['http'] = 'http://{}:{}'.format(ip_list[index], port_list[index])
      new_proxies_list.append(proxies_dict)
    # += extends the shared list in one C-level operation, so CPython's GIL
    # keeps this safe without an explicit lock
    proxies_list += new_proxies_list
    return proxies_list
  except Exception as e:
    print('Error for url', xici_http_url, ':', e)

if __name__ == '__main__':
  start_num = int(input('Enter the start page: ').strip())
  stop_num = int(input('Enter the end page: ').strip())
  print('Starting crawl...')
  t_list = []
  # Holds the scraped Xici proxy IPs
  proxies_list = []
  # Build the page URLs, then crawl them with one thread per page
  xici_urls = get_all_xici_urls(start_num, stop_num)
  for xici_get_url in xici_urls:
    # Pick a random User-Agent
    headers = {'User-Agent': random.choice(agents)}
    t = threading.Thread(target=get_all_http_ip, args=(xici_get_url, headers, proxies_list))
    t.start()
    t_list.append(t)
  for j in t_list:
    j.join()
  print('All requested proxy IPs have been crawled...')
  print(proxies_list)
  print(len(proxies_list))
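
Once proxies_list is filled, any entry can be passed straight to requests. A minimal sketch (assuming the list is non-empty and the chosen proxy is still alive, which free proxies frequently are not):

if proxies_list:
  proxy = random.choice(proxies_list)  # e.g. {'http': 'http://1.2.3.4:8080'}
  try:
    r = requests.get('http://icanhazip.com', proxies=proxy, timeout=8)
    print('Request went out via', r.text.strip())
  except requests.RequestException as e:
    print('Proxy failed:', e)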

Bonus: crawling Xici's free proxies with Python, and validating them (the key part, explained clearly)

There are plenty of posts online about crawling Xici, but none explains the validation step very clearly, so I'll walk through it carefully here.

Here I wrote a proxy class, proxy, with four methods (my personal style, don't mind it): get_user_agent (returns a random User-Agent, the single most important request header), get_proxy (crawls the proxy IPs), test_proxy (checks whether a proxy actually works), and store_txt (saves the working proxies to a txt file).

1. Crawling: headers is the request header; choice selects whether to crawl HTTP or HTTPS proxies; first and end are the start and end page numbers (end itself is not crawled).

def get_proxy(self, headers, choice='http', first=1, end=2):
    """
    Fetch proxy IPs.
    :param choice: 'http' or 'https'
    :param first: first page to crawl
    :param end: one past the last page to crawl
    :return: list of 'ip:port' strings
    """

    ip_list = []
    base_url = None

    # Pick the section to crawl: /wt/ lists HTTP proxies, /wn/ lists HTTPS ones
    if choice == 'http':
      base_url = 'http://www.xicidaili.com/wt/'
    elif choice == 'https':
      base_url = 'http://www.xicidaili.com/wn/'

    # Walk the page range, pull each IP and port out with a regex, and join them with ':'
    for n in range(first, end):
      actual_url = base_url + str(n)
      html = requests.get(url=actual_url, headers=headers).text
      pattern = r'(\d+\.\d+\.\d+\.\d+)</td>\s*<td>(\d+)'
      re_list = re.findall(pattern, html)

      for ip_port in re_list:
        ip_port = ip_port[0] + ':' + ip_port[1]
        ip_list.append(ip_port)
    return ip_list
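
To see what that pattern captures, here is a minimal sketch run against a hand-written table fragment (the HTML shape is assumed from the regex itself, not re-checked against the live site):

import re

sample = '<td>123.45.67.89</td>\n        <td>8080</td>'
print(re.findall(r'(\d+\.\d+\.\d+\.\d+)</td>\s*<td>(\d+)', sample))
# -> [('123.45.67.89', '8080')]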

2. Validation: most posts online simply send a request to some URL through the proxy and check whether it goes through, or whether the status code is 200. The problem is that even with a proxy IP configured, the request may go through without actually using it: requests can silently fall back to your own public IP (and note that what ifconfig usually shows is your LAN address, i.e. your private IP, not your public one).

On Linux, any one of these commands will show your public IP:

curl icanhazip.com
curl ifconfig.me
curl curlmyip.com
curl ip.appspot.com
curl ipinfo.io/ip
curl ipecho.net/plain
curl www.trackip.net/i
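
The same check from Python, as a quick sketch (any of the endpoints above behaves the same way):

import requests

# Fetch the public IP your requests currently go out from
print(requests.get('http://icanhazip.com', timeout=5).text.strip())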

Note: so how do we handle this? Just like the commands above, have your crawled proxy request http://icanhazip.com/. That site returns the public IP the request arrived from, so if the proxy is actually in effect, the response body is the proxy's IP. Read that value back (the response text is all you need) and compare it with the proxy IP you configured. If they differ, the proxy is unusable: although you set a proxy, the request that succeeded was sent from your own public IP after the proxy failed, so the success proves nothing. If they match, the request really went through the proxy, so the proxy works.

def test_proxy(self, ip_port, choice='http'):
    """
    Test whether a proxy actually works.
    :param ip_port: 'ip:port' string
    :param choice: 'http' or 'https'
    :return: the proxy's IP if it works, otherwise False
    """
    proxies = None

    # This site echoes back the public IP a request arrives from. With a proxy set,
    # it should echo the proxy's IP; this guards against the case where you configured
    # a proxy but the request actually went out from your own public IP.
    tar_url = "http://icanhazip.com/"

    # Get a random User-Agent
    user_agent = self.get_user_agent()

    # Put the User-Agent into the headers
    headers = {'User-Agent': user_agent}

    # Validate as an HTTP or an HTTPS proxy
    if choice == 'http':
      proxies = {
        "http": "http://" + ip_port,
      }

    elif choice == 'https':
      proxies = {
        "https": "https://" + ip_port,
      }

    try:
      # Split the bare IP off the 'ip:port' string
      thisIP = ip_port.split(":")[0]
      res = requests.get(tar_url, proxies=proxies, headers=headers, timeout=8)

      # The echoed IP; be sure to strip the surrounding whitespace
      proxyIP = res.text.strip()

      # Three outcomes: the request fails outright -> False; it succeeds but not
      # via the proxy -> False; it succeeds via the proxy -> return the IP
      if proxyIP == thisIP:
        return proxyIP
      else:
        return False
    except:
      return False

Finally, the complete code:

import requests
import re
import random
import codecs
 
 
class proxy:
  """
  Proxy helper class
  """
  def __init__(self):
    pass
 
  def get_user_agent(self):
    """
    Return a random User-Agent string
    :return:
    """
    user_agents = [
      "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; AcooBrowser; .NET CLR 1.1.4322; .NET CLR 2.0.50727)",
      "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.0; Acoo Browser; SLCC1; .NET CLR 2.0.50727; Media Center PC 5.0; .NET CLR 3.0.04506)",
      "Mozilla/4.0 (compatible; MSIE 7.0; AOL 9.5; AOLBuild 4337.35; Windows NT 5.1; .NET CLR 1.1.4322; .NET CLR 2.0.50727)",
      "Mozilla/5.0 (Windows; U; MSIE 9.0; Windows NT 9.0; en-US)",
      "Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Win64; x64; Trident/5.0; .NET CLR 3.5.30729; .NET CLR 3.0.30729; .NET CLR 2.0.50727; Media Center PC 6.0)",
      "Mozilla/5.0 (compatible; MSIE 8.0; Windows NT 6.0; Trident/4.0; WOW64; Trident/4.0; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; .NET CLR 1.0.3705; .NET CLR 1.1.4322)",
      "Mozilla/4.0 (compatible; MSIE 7.0b; Windows NT 5.2; .NET CLR 1.1.4322; .NET CLR 2.0.50727; InfoPath.2; .NET CLR 3.0.04506.30)",
      "Mozilla/5.0 (Windows; U; Windows NT 5.1; zh-CN) AppleWebKit/523.15 (KHTML, like Gecko, Safari/419.3) Arora/0.3 (Change: 287 c9dfb30)",
      "Mozilla/5.0 (X11; U; Linux; en-US) AppleWebKit/527+ (KHTML, like Gecko, Safari/419.3) Arora/0.6",
      "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1.2pre) Gecko/20070215 K-Ninja/2.1.1",
      "Mozilla/5.0 (Windows; U; Windows NT 5.1; zh-CN; rv:1.9) Gecko/20080705 Firefox/3.0 Kapiko/3.0",
      "Mozilla/5.0 (X11; Linux i686; U;) Gecko/20070322 Kazehakase/0.4.5",
      "Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.0.8) Gecko Fedora/1.9.0.8-1.fc10 Kazehakase/0.5.6",
      "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/535.11 (KHTML, like Gecko) Chrome/17.0.963.56 Safari/535.11",
      "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_3) AppleWebKit/535.20 (KHTML, like Gecko) Chrome/19.0.1036.7 Safari/535.20",
      "Opera/9.80 (Macintosh; Intel Mac OS X 10.6.8; U; fr) Presto/2.9.168 Version/11.52",
      "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.11 (KHTML, like Gecko) Chrome/20.0.1132.11 TaoBrowser/2.0 Safari/536.11",
      "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/21.0.1180.71 Safari/537.1 LBBROWSER",
      "Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; WOW64; Trident/5.0; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; Media Center PC 6.0; .NET4.0C; .NET4.0E; LBBROWSER)",
      "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; QQDownload 732; .NET4.0C; .NET4.0E; LBBROWSER)",
      "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/535.11 (KHTML, like Gecko) Chrome/17.0.963.84 Safari/535.11 LBBROWSER",
      "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.1; WOW64; Trident/5.0; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; Media Center PC 6.0; .NET4.0C; .NET4.0E)",
      "Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; WOW64; Trident/5.0; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; Media Center PC 6.0; .NET4.0C; .NET4.0E; QQBrowser/7.0.3698.400)",
      "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; QQDownload 732; .NET4.0C; .NET4.0E)",
      "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; Trident/4.0; SV1; QQDownload 732; .NET4.0C; .NET4.0E; 360SE)",
      "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; QQDownload 732; .NET4.0C; .NET4.0E)",
      "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.1; WOW64; Trident/5.0; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; Media Center PC 6.0; .NET4.0C; .NET4.0E)",
      "Mozilla/5.0 (Windows NT 5.1) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/21.0.1180.89 Safari/537.1",
      "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/21.0.1180.89 Safari/537.1",
      "Mozilla/5.0 (iPad; U; CPU OS 4_2_1 like Mac OS X; zh-cn) AppleWebKit/533.17.9 (KHTML, like Gecko) Version/5.0.2 Mobile/8C148 Safari/6533.18.5",
      "Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:2.0b13pre) Gecko/20110307 Firefox/4.0b13pre",
      "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:16.0) Gecko/20100101 Firefox/16.0",
      "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.11 (KHTML, like Gecko) Chrome/23.0.1271.64 Safari/537.11",
      "Mozilla/5.0 (X11; U; Linux x86_64; zh-CN; rv:1.9.2.10) Gecko/20100922 Ubuntu/10.10 (maverick) Firefox/3.6.10"
    ]
    user_agent = random.choice(user_agents)
    return user_agent
 
 
  def get_proxy(self, headers, choice='http', first=1, end=2):
    """
    Fetch proxy IPs.
    :param choice: 'http' or 'https'
    :param first: first page to crawl
    :param end: one past the last page to crawl
    :return: list of 'ip:port' strings
    """

    ip_list = []
    base_url = None
    if choice == 'http':
      base_url = 'http://www.xicidaili.com/wt/'
    elif choice == 'https':
      base_url = 'http://www.xicidaili.com/wn/'

    for n in range(first, end):
      actual_url = base_url + str(n)
      html = requests.get(url=actual_url, headers=headers).text
      pattern = r'(\d+\.\d+\.\d+\.\d+)</td>\s*<td>(\d+)'
      re_list = re.findall(pattern, html)

      for ip_port in re_list:
        ip_port = ip_port[0] + ':' + ip_port[1]
        ip_list.append(ip_port)
    return ip_list
 
 
  def test_proxy(self, ip_port, choice='http'):
    """
    Test whether a proxy actually works.
    :param ip_port: 'ip:port' string
    :param choice: 'http' or 'https'
    :return: the proxy's IP if it works, otherwise False
    """
    proxies = None
    # This site echoes back the public IP a request arrives from; with a proxy set
    # it should echo the proxy's IP (guards against silently using your own IP)
    tar_url = "http://icanhazip.com/"
    user_agent = self.get_user_agent()
    headers = {'User-Agent': user_agent}
    if choice == 'http':
      proxies = {
        "http": "http://" + ip_port,
      }

    elif choice == 'https':
      proxies = {
        "https": "https://" + ip_port,
      }
    try:
      thisIP = ip_port.split(":")[0]
      res = requests.get(tar_url, proxies=proxies, headers=headers, timeout=8)
      proxyIP = res.text.strip()
      if proxyIP == thisIP:
        return proxyIP
      else:
        return False
    except:
      return False
 
  def store_txt(self, choice='http', first=1, end=2):
    """
    Save the proxies that pass the test to a txt file
    :param choice:
    :param first:
    :param end:
    :return:
    """
    user_agent = self.get_user_agent()
    headers = {'User-Agent': user_agent}
    ip_list = self.get_proxy(headers=headers, choice=choice, first=first, end=end)
    with codecs.open("Http_Agent.txt", 'a', 'utf-8') as file:
      for ip_port in ip_list:
        # Keep the full ip:port here; test_proxy itself returns only the bare IP
        if self.test_proxy(ip_port, choice=choice):
          print(ip_port)
          file.write('\'' + ip_port + "\'\n")
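
A hypothetical entry point tying the class together (the page range is just an example):

if __name__ == '__main__':
  p = proxy()
  # Crawl pages 1 and 2 of the HTTP list, validate each proxy, save the good ones
  p.store_txt(choice='http', first=1, end=3)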

That's all for this walkthrough of crawling Xici proxy IPs in Python with requests, XPath, and multithreading. I hope it gives you a useful reference, and thanks for supporting 三水点靠木.
