python手机号前7位归属地爬虫代码实例


Posted in Python onMarch 31, 2020

需求分析

项目上需要用到手机号前7位,判断号码是否合法,还有归属地查询。旧的数据是几年前了太久了,打算用python爬虫重新爬一份

单线程版本

# coding:utf-8
import requests
from datetime import datetime


class PhoneInfoSpider:
  def __init__(self, phoneSections):
    self.phoneSections = phoneSections

  def phoneInfoHandler(self, textData):
    text = textData.splitlines(True)
    # print("text length:" + str(len(text)))

    if len(text) >= 9:
      number = text[1].split('\'')[1]
      province = text[2].split('\'')[1]
      mobile_area = text[3].split('\'')[1]
      postcode = text[5].split('\'')[1]
      line = "number:" + number + ",province:" + province + ",mobile_area:" + mobile_area + ",postcode:" + postcode
      line_text = number + "," + province + "," + mobile_area + "," + postcode
      print(line_text)
      # print("province:" + province)

      try:
        f = open('./result.txt', 'a')
        f.write(str(line_text) + '\n')
      except Exception as e:
        print(Exception, ":", e)

  def requestPhoneInfo(self, phoneNum):
    try:
      url = 'https://tcc.taobao.com/cc/json/mobile_tel_segment.htm?tel=' + phoneNum
      response = requests.get(url)
      self.phoneInfoHandler(response.text)
    except Exception as e:
      print(Exception, ":", e)

  def requestAllSections(self):
    # last用于接上次异常退出前的号码
    last = 0
    # last = 4
    # 自动生成手机号码,后四位补0
    for head in self.phoneSections:
      head_begin = datetime.now()
      print(head + " begin time:" + str(head_begin))

      # for i in range(last, 10000):
      for i in range(last, 10):
        middle = str(i).zfill(4)
        phoneNum = head + middle + "0000"
        self.requestPhoneInfo(phoneNum)
      last = 0

      head_end = datetime.now()
      print(head + " end time:" + str(head_end))


if __name__ == '__main__':
  task_begin = datetime.now()
  print("phone check begin time:" + str(task_begin))

  # 电信,联通,移动,虚拟运营商
  dx = ['133', '149', '153', '173', '177', '180', '181', '189', '199']
  lt = ['130', '131', '132', '145', '146', '155', '156', '166', '171', '175', '176', '185', '186', '166']
  yd = ['134', '135', '136', '137', '138', '139', '147', '148', '150', '151', '152', '157', '158', '159', '172',
     '178', '182', '183', '184', '187', '188', '198']
  add = ['170']
  all_num = dx + lt + yd + add

  # print(all_num)
  print(len(all_num))

  # 要爬的号码段
  spider = PhoneInfoSpider(all_num)
  spider.requestAllSections()

  task_end = datetime.now()
  print("phone check end time:" + str(task_end))

发现爬取一个号段,共10000次查询,单线程版大概要多1个半小时,太慢了。

多线程版本

# coding:utf-8
import requests
from datetime import datetime
import queue
import threading

threadNum = 32


class MyThread(threading.Thread):
  def __init__(self, func):
    threading.Thread.__init__(self)
    self.func = func

  def run(self):
    self.func()


def requestPhoneInfo():
  global lock
  while True:
    lock.acquire()
    if q.qsize() != 0:
      print("queue size:" + str(q.qsize()))
      p = q.get() # 获得任务
      lock.release()

      middle = str(9999 - q.qsize()).zfill(4)
      phoneNum = phone_head + middle + "0000"
      print("phoneNum:" + phoneNum)

      try:
        url = 'https://tcc.taobao.com/cc/json/mobile_tel_segment.htm?tel=' + phoneNum
        # print(url)
        response = requests.get(url)
        # print(response.text)
        phoneInfoHandler(response.text)
      except Exception as e:
        print(Exception, ":", e)
    else:
      lock.release()
      break


def phoneInfoHandler(textData):
  text = textData.splitlines(True)

  if len(text) >= 9:
    number = text[1].split('\'')[1]
    province = text[2].split('\'')[1]
    mobile_area = text[3].split('\'')[1]
    postcode = text[5].split('\'')[1]
    line = "number:" + number + ",province:" + province + ",mobile_area:" + mobile_area + ",postcode:" + postcode
    line_text = number + "," + province + "," + mobile_area + "," + postcode
    print(line_text)
    # print("province:" + province)

    try:
      f = open('./result.txt', 'a')
      f.write(str(line_text) + '\n')
    except Exception as e:
      print(Exception, ":", e)


if __name__ == '__main__':
  task_begin = datetime.now()
  print("phone check begin time:" + str(task_begin))

  dx = ['133', '149', '153', '173', '177', '180', '181', '189', '199']
  lt = ['130', '131', '132', '145', '155', '156', '166', '171', '175', '176', '185', '186', '166']
  yd = ['134', '135', '136', '137', '138', '139', '147', '150', '151', '152', '157', '158', '159', '172', '178',
     '182', '183', '184', '187', '188', '198']
  all_num = dx + lt + yd
  print(len(all_num))

  for head in all_num:
    head_begin = datetime.now()
    print(head + " begin time:" + str(head_begin))

    q = queue.Queue()
    threads = []
    lock = threading.Lock()

    for p in range(10000):
      q.put(p + 1)

    print(q.qsize())

    for i in range(threadNum):
      middle = str(i).zfill(4)
      global phone_head
      phone_head = head

      thread = MyThread(requestPhoneInfo)
      thread.start()
      threads.append(thread)
    for thread in threads:
      thread.join()

    head_end = datetime.now()
    print(head + " end time:" + str(head_end))

  task_end = datetime.now()
  print("phone check end time:" + str(task_end))

多线程版的1个号码段1000条数据,大概2,3min就好,cpu使用飙升,大概维持在70%左右。

总共40多个号段,爬完大概1,2个小时,总数据41w左右

以上就是本文的全部内容,希望对大家的学习有所帮助,也希望大家多多支持三水点靠木。

Python 相关文章推荐
Python中使用items()方法返回字典元素对的教程
May 21 Python
python实现用于测试网站访问速率的方法
May 26 Python
python 信息同时输出到控制台与文件的实例讲解
May 11 Python
详解Python if-elif-else知识点
Jun 11 Python
Django model反向关联名称的方法
Dec 15 Python
Python绘图实现显示中文
Dec 04 Python
Pandas+Matplotlib 箱式图异常值分析示例
Dec 09 Python
Python+Selenium+phantomjs实现网页模拟登录和截图功能(windows环境)
Dec 11 Python
浅谈tensorflow 中tf.concat()的使用
Feb 07 Python
PyQt5.6+pycharm配置以及pyinstaller生成exe(小白教程)
Jun 02 Python
Python 操作SQLite数据库的示例
Oct 16 Python
python如何为list实现find方法
May 30 Python
django修改models重建数据库的操作
Mar 31 #Python
Python写捕鱼达人的游戏实现
Mar 31 #Python
Django 多对多字段的更新和插入数据实例
Mar 31 #Python
基于python爬取有道翻译过程图解
Mar 31 #Python
django实现将修改好的新模型写入数据库
Mar 31 #Python
Python urlencode和unquote函数使用实例解析
Mar 31 #Python
Python响应对象text属性乱码解决方案
Mar 31 #Python
You might like
PHP 和 MySQL 开发的 8 个技巧
2007/01/02 PHP
PHP中如何实现常用邮箱的基本判断
2014/01/07 PHP
PHP设计模式之观察者模式(Observer)详细介绍和代码实例
2014/04/08 PHP
WordPress中查询文章的循环Loop结构及用法分析
2015/12/17 PHP
List the Codec Files on a Computer
2007/06/11 Javascript
picChange 图片切换特效的函数代码
2010/05/06 Javascript
Javascript动态绑定事件的简单实现代码
2010/12/25 Javascript
window.open打开页面居中显示的示例代码
2013/12/27 Javascript
iscroll.js的上拉下拉刷新时无法回弹的解决方法
2016/02/18 Javascript
设置点击文本框或图片弹出日历控件的实现代码
2016/05/12 Javascript
jQuery autoComplete插件两种使用方式及动态改变参数值的方法详解
2016/10/24 Javascript
Vue.js实现在下拉列表区域外点击即可关闭下拉列表的功能(自定义下拉列表)
2017/05/30 Javascript
详解使用vue实现tab 切换操作
2017/07/03 Javascript
微信小程序基于slider组件动态修改标签透明度的方法示例
2017/12/04 Javascript
利用js实现前后台传送Json的示例代码
2018/03/29 Javascript
Bootstrap模态对话框中显示动态内容的方法
2018/08/10 Javascript
基于Element的组件改造的树形选择器(树形下拉框)
2020/02/27 Javascript
JS实现选项卡插件的两种写法(jQuery和class)
2020/12/30 jQuery
react项目从新建到部署的实现示例
2021/02/19 Javascript
[40:05]DOTA2上海特级锦标赛A组小组赛#1 EHOME VS MVP.Phx第一局
2016/02/25 DOTA
[28:07]完美世界DOTA2联赛PWL S3 Phoenix vs INK ICE 第二场 12.13
2020/12/17 DOTA
python端口扫描系统实现方法
2014/11/19 Python
python中类变量与成员变量的使用注意点总结
2017/04/29 Python
使用Python对微信好友进行数据分析
2018/06/27 Python
python批量获取html内body内容的实例
2019/01/02 Python
Python实现的对本地host127.0.0.1主机进行扫描端口功能示例
2019/02/15 Python
Python3+Appium实现多台移动设备操作的方法
2019/07/05 Python
python实现滑雪者小游戏
2020/02/22 Python
在pycharm中debug 实时查看数据操作(交互式)
2020/06/09 Python
全球精选男装和家居用品:Article
2020/04/13 全球购物
应聘护理专业毕业自荐书范文
2014/02/12 职场文书
作风年建设汇报材料
2014/08/14 职场文书
公证委托书标准格式
2014/09/11 职场文书
2014年国庆节演讲稿精选范文1500字
2014/09/25 职场文书
新闻稿标题
2015/07/18 职场文书
Python如何把不同类型数据的json序列化
2021/04/30 Python