python手机号前7位归属地爬虫代码实例


Posted in Python onMarch 31, 2020

需求分析

项目上需要用到手机号前7位,判断号码是否合法,还有归属地查询。旧的数据是几年前了太久了,打算用python爬虫重新爬一份

单线程版本

# coding:utf-8
import requests
from datetime import datetime


class PhoneInfoSpider:
  def __init__(self, phoneSections):
    self.phoneSections = phoneSections

  def phoneInfoHandler(self, textData):
    text = textData.splitlines(True)
    # print("text length:" + str(len(text)))

    if len(text) >= 9:
      number = text[1].split('\'')[1]
      province = text[2].split('\'')[1]
      mobile_area = text[3].split('\'')[1]
      postcode = text[5].split('\'')[1]
      line = "number:" + number + ",province:" + province + ",mobile_area:" + mobile_area + ",postcode:" + postcode
      line_text = number + "," + province + "," + mobile_area + "," + postcode
      print(line_text)
      # print("province:" + province)

      try:
        f = open('./result.txt', 'a')
        f.write(str(line_text) + '\n')
      except Exception as e:
        print(Exception, ":", e)

  def requestPhoneInfo(self, phoneNum):
    try:
      url = 'https://tcc.taobao.com/cc/json/mobile_tel_segment.htm?tel=' + phoneNum
      response = requests.get(url)
      self.phoneInfoHandler(response.text)
    except Exception as e:
      print(Exception, ":", e)

  def requestAllSections(self):
    # last用于接上次异常退出前的号码
    last = 0
    # last = 4
    # 自动生成手机号码,后四位补0
    for head in self.phoneSections:
      head_begin = datetime.now()
      print(head + " begin time:" + str(head_begin))

      # for i in range(last, 10000):
      for i in range(last, 10):
        middle = str(i).zfill(4)
        phoneNum = head + middle + "0000"
        self.requestPhoneInfo(phoneNum)
      last = 0

      head_end = datetime.now()
      print(head + " end time:" + str(head_end))


if __name__ == '__main__':
  task_begin = datetime.now()
  print("phone check begin time:" + str(task_begin))

  # 电信,联通,移动,虚拟运营商
  dx = ['133', '149', '153', '173', '177', '180', '181', '189', '199']
  lt = ['130', '131', '132', '145', '146', '155', '156', '166', '171', '175', '176', '185', '186', '166']
  yd = ['134', '135', '136', '137', '138', '139', '147', '148', '150', '151', '152', '157', '158', '159', '172',
     '178', '182', '183', '184', '187', '188', '198']
  add = ['170']
  all_num = dx + lt + yd + add

  # print(all_num)
  print(len(all_num))

  # 要爬的号码段
  spider = PhoneInfoSpider(all_num)
  spider.requestAllSections()

  task_end = datetime.now()
  print("phone check end time:" + str(task_end))

发现爬取一个号段,共10000次查询,单线程版大概要多1个半小时,太慢了。

多线程版本

# coding:utf-8
import requests
from datetime import datetime
import queue
import threading

threadNum = 32


class MyThread(threading.Thread):
  def __init__(self, func):
    threading.Thread.__init__(self)
    self.func = func

  def run(self):
    self.func()


def requestPhoneInfo():
  global lock
  while True:
    lock.acquire()
    if q.qsize() != 0:
      print("queue size:" + str(q.qsize()))
      p = q.get() # 获得任务
      lock.release()

      middle = str(9999 - q.qsize()).zfill(4)
      phoneNum = phone_head + middle + "0000"
      print("phoneNum:" + phoneNum)

      try:
        url = 'https://tcc.taobao.com/cc/json/mobile_tel_segment.htm?tel=' + phoneNum
        # print(url)
        response = requests.get(url)
        # print(response.text)
        phoneInfoHandler(response.text)
      except Exception as e:
        print(Exception, ":", e)
    else:
      lock.release()
      break


def phoneInfoHandler(textData):
  text = textData.splitlines(True)

  if len(text) >= 9:
    number = text[1].split('\'')[1]
    province = text[2].split('\'')[1]
    mobile_area = text[3].split('\'')[1]
    postcode = text[5].split('\'')[1]
    line = "number:" + number + ",province:" + province + ",mobile_area:" + mobile_area + ",postcode:" + postcode
    line_text = number + "," + province + "," + mobile_area + "," + postcode
    print(line_text)
    # print("province:" + province)

    try:
      f = open('./result.txt', 'a')
      f.write(str(line_text) + '\n')
    except Exception as e:
      print(Exception, ":", e)


if __name__ == '__main__':
  task_begin = datetime.now()
  print("phone check begin time:" + str(task_begin))

  dx = ['133', '149', '153', '173', '177', '180', '181', '189', '199']
  lt = ['130', '131', '132', '145', '155', '156', '166', '171', '175', '176', '185', '186', '166']
  yd = ['134', '135', '136', '137', '138', '139', '147', '150', '151', '152', '157', '158', '159', '172', '178',
     '182', '183', '184', '187', '188', '198']
  all_num = dx + lt + yd
  print(len(all_num))

  for head in all_num:
    head_begin = datetime.now()
    print(head + " begin time:" + str(head_begin))

    q = queue.Queue()
    threads = []
    lock = threading.Lock()

    for p in range(10000):
      q.put(p + 1)

    print(q.qsize())

    for i in range(threadNum):
      middle = str(i).zfill(4)
      global phone_head
      phone_head = head

      thread = MyThread(requestPhoneInfo)
      thread.start()
      threads.append(thread)
    for thread in threads:
      thread.join()

    head_end = datetime.now()
    print(head + " end time:" + str(head_end))

  task_end = datetime.now()
  print("phone check end time:" + str(task_end))

多线程版的1个号码段1000条数据,大概2,3min就好,cpu使用飙升,大概维持在70%左右。

总共40多个号段,爬完大概1,2个小时,总数据41w左右

以上就是本文的全部内容,希望对大家的学习有所帮助,也希望大家多多支持三水点靠木。

Python 相关文章推荐
Python中的super用法详解
May 28 Python
详解python while 函数及while和for的区别
Sep 07 Python
判断python对象是否可调用的三种方式及其区别详解
Jan 31 Python
Python中按值来获取指定的键
Mar 04 Python
将Python字符串生成PDF的实例代码详解
May 17 Python
Python语言进阶知识点总结
May 28 Python
python 日期排序的实例代码
Jul 11 Python
使用PyTorch将文件夹下的图片分为训练集和验证集实例
Jan 08 Python
tensorflow 获取checkpoint中的变量列表实例
Feb 11 Python
使用Pytorch实现two-head(多输出)模型的操作
May 28 Python
python图像处理 PIL Image操作实例
Apr 09 Python
python解析json数据
Apr 29 Python
django修改models重建数据库的操作
Mar 31 #Python
Python写捕鱼达人的游戏实现
Mar 31 #Python
Django 多对多字段的更新和插入数据实例
Mar 31 #Python
基于python爬取有道翻译过程图解
Mar 31 #Python
django实现将修改好的新模型写入数据库
Mar 31 #Python
Python urlencode和unquote函数使用实例解析
Mar 31 #Python
Python响应对象text属性乱码解决方案
Mar 31 #Python
You might like
一个简单实现多条件查询的例子
2006/10/09 PHP
php header()函数使用说明
2008/07/10 PHP
令PHP初学者头疼十四条问题大总结
2008/11/12 PHP
php使浏览器直接下载pdf文件的方法
2013/11/15 PHP
分享PHP计算两个日期相差天数的代码
2015/12/23 PHP
PHP微信API接口类
2016/08/22 PHP
Javascript学习笔记9 prototype封装继承
2010/01/11 Javascript
JavaScript 面向对象编程(2) 定义类
2010/05/18 Javascript
jquery ajax 调用失败的原因示例介绍
2013/09/27 Javascript
jQuery使用CSS()方法给指定元素同时设置多个样式
2015/03/26 Javascript
再JavaScript的jQuery库中编写动画效果的指南
2015/08/13 Javascript
JavaScript模块规范之AMD规范和CMD规范
2015/10/27 Javascript
Js自动截取字符串长度,添加省略号(……)的实现方法
2017/03/06 Javascript
使用angular-cli webpack创建多个包的方法
2018/10/16 Javascript
VUE搭建手机商城心得和遇到的坑
2019/02/21 Javascript
详解vue 动态加载并注册组件且通过 render动态创建该组件
2019/05/30 Javascript
vue+django实现一对一聊天功能的实例代码
2019/07/17 Javascript
[47:48]DOTA2上海特级锦标赛D组小组赛#2 Liquid VS VP第三局
2016/02/28 DOTA
windows下安装python paramiko模块的代码
2013/02/10 Python
在Python中操作时间之tzset()方法的使用教程
2015/05/22 Python
python字符串str和字节数组相互转化方法
2017/03/18 Python
Python 模拟购物车的实例讲解
2017/09/11 Python
python实现简单聊天应用 python群聊和点对点均实现
2017/09/14 Python
python中字符串数组逆序排列方法总结
2019/06/23 Python
使用Python画股票的K线图的方法步骤
2019/06/28 Python
Django封装交互接口代码
2020/07/12 Python
为什么要有struct关键字
2012/05/08 面试题
大专生自我鉴定范文
2013/10/01 职场文书
小学教师国培感言
2014/02/08 职场文书
弘扬民族精神演讲稿
2014/05/07 职场文书
国际贸易专业求职信
2014/06/04 职场文书
全陪导游词开场白
2015/05/29 职场文书
2015年社区反邪教工作总结
2015/10/14 职场文书
PyTorch dropout设置训练和测试模式的实现
2021/05/27 Python
教你如何使用Python开发一个钉钉群应答机器人
2021/06/21 Python
如何使用pdb进行Python调试
2021/06/30 Python