python手机号前7位归属地爬虫代码实例


Posted in Python onMarch 31, 2020

需求分析

项目上需要用到手机号前7位,判断号码是否合法,还有归属地查询。旧的数据是几年前了太久了,打算用python爬虫重新爬一份

单线程版本

# coding:utf-8
import requests
from datetime import datetime


class PhoneInfoSpider:
  def __init__(self, phoneSections):
    self.phoneSections = phoneSections

  def phoneInfoHandler(self, textData):
    text = textData.splitlines(True)
    # print("text length:" + str(len(text)))

    if len(text) >= 9:
      number = text[1].split('\'')[1]
      province = text[2].split('\'')[1]
      mobile_area = text[3].split('\'')[1]
      postcode = text[5].split('\'')[1]
      line = "number:" + number + ",province:" + province + ",mobile_area:" + mobile_area + ",postcode:" + postcode
      line_text = number + "," + province + "," + mobile_area + "," + postcode
      print(line_text)
      # print("province:" + province)

      try:
        f = open('./result.txt', 'a')
        f.write(str(line_text) + '\n')
      except Exception as e:
        print(Exception, ":", e)

  def requestPhoneInfo(self, phoneNum):
    try:
      url = 'https://tcc.taobao.com/cc/json/mobile_tel_segment.htm?tel=' + phoneNum
      response = requests.get(url)
      self.phoneInfoHandler(response.text)
    except Exception as e:
      print(Exception, ":", e)

  def requestAllSections(self):
    # last用于接上次异常退出前的号码
    last = 0
    # last = 4
    # 自动生成手机号码,后四位补0
    for head in self.phoneSections:
      head_begin = datetime.now()
      print(head + " begin time:" + str(head_begin))

      # for i in range(last, 10000):
      for i in range(last, 10):
        middle = str(i).zfill(4)
        phoneNum = head + middle + "0000"
        self.requestPhoneInfo(phoneNum)
      last = 0

      head_end = datetime.now()
      print(head + " end time:" + str(head_end))


if __name__ == '__main__':
  task_begin = datetime.now()
  print("phone check begin time:" + str(task_begin))

  # 电信,联通,移动,虚拟运营商
  dx = ['133', '149', '153', '173', '177', '180', '181', '189', '199']
  lt = ['130', '131', '132', '145', '146', '155', '156', '166', '171', '175', '176', '185', '186', '166']
  yd = ['134', '135', '136', '137', '138', '139', '147', '148', '150', '151', '152', '157', '158', '159', '172',
     '178', '182', '183', '184', '187', '188', '198']
  add = ['170']
  all_num = dx + lt + yd + add

  # print(all_num)
  print(len(all_num))

  # 要爬的号码段
  spider = PhoneInfoSpider(all_num)
  spider.requestAllSections()

  task_end = datetime.now()
  print("phone check end time:" + str(task_end))

发现爬取一个号段,共10000次查询,单线程版大概要多1个半小时,太慢了。

多线程版本

# coding:utf-8
import requests
from datetime import datetime
import queue
import threading

threadNum = 32


class MyThread(threading.Thread):
  def __init__(self, func):
    threading.Thread.__init__(self)
    self.func = func

  def run(self):
    self.func()


def requestPhoneInfo():
  global lock
  while True:
    lock.acquire()
    if q.qsize() != 0:
      print("queue size:" + str(q.qsize()))
      p = q.get() # 获得任务
      lock.release()

      middle = str(9999 - q.qsize()).zfill(4)
      phoneNum = phone_head + middle + "0000"
      print("phoneNum:" + phoneNum)

      try:
        url = 'https://tcc.taobao.com/cc/json/mobile_tel_segment.htm?tel=' + phoneNum
        # print(url)
        response = requests.get(url)
        # print(response.text)
        phoneInfoHandler(response.text)
      except Exception as e:
        print(Exception, ":", e)
    else:
      lock.release()
      break


def phoneInfoHandler(textData):
  text = textData.splitlines(True)

  if len(text) >= 9:
    number = text[1].split('\'')[1]
    province = text[2].split('\'')[1]
    mobile_area = text[3].split('\'')[1]
    postcode = text[5].split('\'')[1]
    line = "number:" + number + ",province:" + province + ",mobile_area:" + mobile_area + ",postcode:" + postcode
    line_text = number + "," + province + "," + mobile_area + "," + postcode
    print(line_text)
    # print("province:" + province)

    try:
      f = open('./result.txt', 'a')
      f.write(str(line_text) + '\n')
    except Exception as e:
      print(Exception, ":", e)


if __name__ == '__main__':
  task_begin = datetime.now()
  print("phone check begin time:" + str(task_begin))

  dx = ['133', '149', '153', '173', '177', '180', '181', '189', '199']
  lt = ['130', '131', '132', '145', '155', '156', '166', '171', '175', '176', '185', '186', '166']
  yd = ['134', '135', '136', '137', '138', '139', '147', '150', '151', '152', '157', '158', '159', '172', '178',
     '182', '183', '184', '187', '188', '198']
  all_num = dx + lt + yd
  print(len(all_num))

  for head in all_num:
    head_begin = datetime.now()
    print(head + " begin time:" + str(head_begin))

    q = queue.Queue()
    threads = []
    lock = threading.Lock()

    for p in range(10000):
      q.put(p + 1)

    print(q.qsize())

    for i in range(threadNum):
      middle = str(i).zfill(4)
      global phone_head
      phone_head = head

      thread = MyThread(requestPhoneInfo)
      thread.start()
      threads.append(thread)
    for thread in threads:
      thread.join()

    head_end = datetime.now()
    print(head + " end time:" + str(head_end))

  task_end = datetime.now()
  print("phone check end time:" + str(task_end))

多线程版的1个号码段1000条数据,大概2,3min就好,cpu使用飙升,大概维持在70%左右。

总共40多个号段,爬完大概1,2个小时,总数据41w左右

以上就是本文的全部内容,希望对大家的学习有所帮助,也希望大家多多支持三水点靠木。

Python 相关文章推荐
Python中3种内建数据结构:列表、元组和字典
Nov 30 Python
Python构造函数及解构函数介绍
Feb 26 Python
python计算方程式根的方法
May 07 Python
Python中函数参数设置及使用的学习笔记
May 03 Python
Python文本相似性计算之编辑距离详解
Nov 28 Python
pandas数据处理基础之筛选指定行或者指定列的数据
May 03 Python
Python补齐字符串长度的实例
Nov 15 Python
python 实现调用子文件下的模块方法
Dec 07 Python
python解析含有重复key的json方法
Jan 22 Python
pytorch-神经网络拟合曲线实例
Jan 15 Python
简单了解pytest测试框架setup和tearDown
Apr 14 Python
Keras实现支持masking的Flatten层代码
Jun 16 Python
django修改models重建数据库的操作
Mar 31 #Python
Python写捕鱼达人的游戏实现
Mar 31 #Python
Django 多对多字段的更新和插入数据实例
Mar 31 #Python
基于python爬取有道翻译过程图解
Mar 31 #Python
django实现将修改好的新模型写入数据库
Mar 31 #Python
Python urlencode和unquote函数使用实例解析
Mar 31 #Python
Python响应对象text属性乱码解决方案
Mar 31 #Python
You might like
PHP5.3.1 不再支持ISAPI
2010/01/08 PHP
php图片上传存储源码并且可以预览
2011/08/26 PHP
php使用wordwrap格式化文本段落的方法
2015/03/17 PHP
PHP Cookei记录用户历史浏览信息的代码
2016/02/03 PHP
Zend Framework教程之Zend_Form组件实现表单提交并显示错误提示的方法
2016/03/21 PHP
简单的pgsql pdo php操作类实现代码
2016/08/25 PHP
PHP中strtr与str_replace函数运行性能简单测试示例
2019/06/22 PHP
用js统计用户下载网页所需时间的脚本
2008/10/15 Javascript
apycom出品的jQuery精美菜单破解方法
2011/02/18 Javascript
基于jQuery的图片左右无缝滚动插件
2012/05/23 Javascript
JS获取地址栏参数的小例子
2013/08/23 Javascript
深入分析Cookie的安全性问题
2015/03/01 Javascript
jQuery弹层插件jquery.fancybox.js用法实例
2016/01/22 Javascript
使用JavaScript为Kindeditor自定义按钮增加Audio标签
2016/03/18 Javascript
jQuery中on绑定事件后引发的事件冒泡问题如何解决
2016/05/25 Javascript
Bootstrap实现基于carousel.js框架的轮播图效果
2017/05/02 Javascript
ionic 3.0+ 项目搭建运行环境的教程
2017/08/09 Javascript
VUE长按事件需求详解
2017/10/18 Javascript
MVVM 双向绑定的实现代码
2018/06/21 Javascript
详解性能更优越的小程序图片懒加载方式
2018/07/18 Javascript
element el-input directive数字进行控制
2018/10/11 Javascript
如何从0开始用node写一个自己的命令行程序
2018/12/29 Javascript
python操作redis方法总结
2018/06/06 Python
PyCharm 设置SciView工具窗口的方法
2019/01/15 Python
django框架使用方法详解
2019/07/18 Python
tensorflow实现测试时读取任意指定的check point的网络参数
2020/01/21 Python
python实现飞行棋游戏
2020/02/05 Python
纯DOM+CSS3实现简单的小风车动画
2016/09/27 HTML / CSS
详解background属性的8个属性值(面试题)
2020/11/02 HTML / CSS
Bally美国官网:经典瑞士鞋履、手袋及配饰奢侈品牌
2018/05/18 全球购物
同学会邀请书大全
2014/01/12 职场文书
承兑汇票转让证明怎么写?
2014/11/30 职场文书
硕士学位论文评语
2014/12/31 职场文书
行政复议决定书
2015/06/24 职场文书
运动会通讯稿600字
2015/07/20 职场文书
Java SSM配置文件案例详解
2021/08/30 Java/Android