Python crawler code example: home regions for the first 7 digits of mobile numbers


Posted in Python on March 31, 2020

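Requirements analysis

The project needs the first 7 digits of mobile numbers, both to judge whether a number is valid and to look up its home region. The old data is several years out of date, so I decided to crawl a fresh copy with a Python crawler.

Single-threaded version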

# coding:utf-8
import requests
from datetime import datetime


class PhoneInfoSpider:
  def __init__(self, phoneSections):
    self.phoneSections = phoneSections

  def phoneInfoHandler(self, textData):
    text = textData.splitlines(True)
    # print("text length:" + str(len(text)))

    if len(text) >= 9:
      # Each field is the single-quoted value on its own line of the response
      number = text[1].split('\'')[1]
      province = text[2].split('\'')[1]
      mobile_area = text[3].split('\'')[1]
      postcode = text[5].split('\'')[1]
      line_text = number + "," + province + "," + mobile_area + "," + postcode
      print(line_text)
      # print("province:" + province)

      try:
        # The with block closes the file after each append
        with open('./result.txt', 'a') as f:
          f.write(line_text + '\n')
      except Exception as e:
        print(type(e).__name__, ":", e)

  def requestPhoneInfo(self, phoneNum):
    try:
      url = 'https://tcc.taobao.com/cc/json/mobile_tel_segment.htm?tel=' + phoneNum
      # A timeout keeps one stalled request from hanging the whole crawl
      response = requests.get(url, timeout=10)
      self.phoneInfoHandler(response.text)
    except Exception as e:
      print(type(e).__name__, ":", e)

  def requestAllSections(self):
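    # last is the index to resume from after an abnormal exit
    last = 0
    # last = 4
    # Generate the phone numbers automatically; the last four digits are padded with 0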
    for head in self.phoneSections:
      head_begin = datetime.now()
      print(head + " begin time:" + str(head_begin))

      # Full crawl: for i in range(last, 10000):
      # The line below is a short test run of 10 numbers per segment
      for i in range(last, 10):
        middle = str(i).zfill(4)
        phoneNum = head + middle + "0000"
        self.requestPhoneInfo(phoneNum)
      last = 0

      head_end = datetime.now()
      print(head + " end time:" + str(head_end))


if __name__ == '__main__':
  task_begin = datetime.now()
  print("phone check begin time:" + str(task_begin))

  # China Telecom, China Unicom, China Mobile, virtual operators
  dx = ['133', '149', '153', '173', '177', '180', '181', '189', '199']
  lt = ['130', '131', '132', '145', '146', '155', '156', '166', '171', '175', '176', '185', '186']
  yd = ['134', '135', '136', '137', '138', '139', '147', '148', '150', '151', '152', '157', '158', '159', '172',
     '178', '182', '183', '184', '187', '188', '198']
  add = ['170']
  all_num = dx + lt + yd + add

  # print(all_num)
  print(len(all_num))

  # Number segments to crawl
  spider = PhoneInfoSpider(all_num)
  spider.requestAllSections()

  task_end = datetime.now()
  print("phone check end time:" + str(task_end))
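The handler above pulls fields out by line position, which breaks if the response layout shifts. A minimal sketch of a key-based alternative follows. It assumes the response body is a JavaScript-style object with one key:'value' pair per field; the sample text, the key names in it, and the helper name parse_fields are illustrative assumptions, not a confirmed description of the endpoint.

```python
import re

# Matches pairs such as  province:'...'  (a key, then a single-quoted value).
PAIR_RE = re.compile(r"(\w+)\s*:\s*'([^']*)'")


def parse_fields(text_data):
    """Return every key:'value' pair found in the text as a dict."""
    return dict(PAIR_RE.findall(text_data))


# Hypothetical sample body; the real response may use different keys.
sample = """
__Result_ = {
    number:'1300000',
    province:'ExampleProvince',
    area:'ExampleArea'
}
"""

fields = parse_fields(sample)
print(fields.get('number'), fields.get('province'))
```

Looking fields up by key means a reordered or extended response yields a missing key (None from dict.get) instead of silently writing the wrong column.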

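Crawling one number segment takes 10,000 queries in total, and the single-threaded version needs roughly an hour and a half or more for that, which is too slow.

Multi-threaded version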
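To put that figure in perspective, here is a back-of-the-envelope estimate. It takes the roughly 90 minutes per 10,000 queries reported above at face value; the segment count of 40 is a round assumption based on the lists in the script, and both helper names are hypothetical.

```python
def seconds_per_query(total_minutes, queries):
    """Average wall-clock seconds spent on one query."""
    return total_minutes * 60 / queries


def estimate_hours(segments, minutes_per_segment):
    """Total hours if every segment is crawled one after another."""
    return segments * minutes_per_segment / 60


per_query = seconds_per_query(90, 10000)
total = estimate_hours(40, 90)
print(f"{per_query:.2f} s per query, about {total:.0f} hours in total")
```

At about half a second per request, most of the time is spent waiting on the network, which is the kind of workload where threads help.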

# coding:utf-8
import requests
from datetime import datetime
import queue
import threading

threadNum = 32


class MyThread(threading.Thread):
  def __init__(self, func):
    threading.Thread.__init__(self)
    self.func = func

  def run(self):
    self.func()


def requestPhoneInfo():
  while True:
    lock.acquire()
    if q.qsize() != 0:
      print("queue size:" + str(q.qsize()))
      p = q.get() # take the next task
      lock.release()

      # Derive the middle digits from the task itself (the queue holds 1..10000);
      # reading q.qsize() here, outside the lock, could race with other threads
      middle = str(p - 1).zfill(4)
      phoneNum = phone_head + middle + "0000"
      print("phoneNum:" + phoneNum)

      try:
        url = 'https://tcc.taobao.com/cc/json/mobile_tel_segment.htm?tel=' + phoneNum
        # print(url)
        # A timeout keeps one stalled request from blocking its worker forever
        response = requests.get(url, timeout=10)
        # print(response.text)
        phoneInfoHandler(response.text)
      except Exception as e:
        print(type(e).__name__, ":", e)
    else:
      lock.release()
      break


# Serializes appends so lines from different threads do not interleave
write_lock = threading.Lock()


def phoneInfoHandler(textData):
  text = textData.splitlines(True)

  if len(text) >= 9:
    # Each field is the single-quoted value on its own line of the response
    number = text[1].split('\'')[1]
    province = text[2].split('\'')[1]
    mobile_area = text[3].split('\'')[1]
    postcode = text[5].split('\'')[1]
    line_text = number + "," + province + "," + mobile_area + "," + postcode
    print(line_text)
    # print("province:" + province)

    try:
      # The with blocks release the lock and close the file after each append
      with write_lock:
        with open('./result.txt', 'a') as f:
          f.write(line_text + '\n')
    except Exception as e:
      print(type(e).__name__, ":", e)


if __name__ == '__main__':
  task_begin = datetime.now()
  print("phone check begin time:" + str(task_begin))

  dx = ['133', '149', '153', '173', '177', '180', '181', '189', '199']
  lt = ['130', '131', '132', '145', '155', '156', '166', '171', '175', '176', '185', '186']
  yd = ['134', '135', '136', '137', '138', '139', '147', '150', '151', '152', '157', '158', '159', '172', '178',
     '182', '183', '184', '187', '188', '198']
  all_num = dx + lt + yd
  print(len(all_num))

  for head in all_num:
    head_begin = datetime.now()
    print(head + " begin time:" + str(head_begin))

    q = queue.Queue()
    threads = []
    lock = threading.Lock()

    for p in range(10000):
      q.put(p + 1)

    print(q.qsize())

    # Every worker in this round reads the segment head from this module-level name
    phone_head = head

    for i in range(threadNum):
      thread = MyThread(requestPhoneInfo)
      thread.start()
      threads.append(thread)
    for thread in threads:
      thread.join()

    head_end = datetime.now()
    print(head + " end time:" + str(head_end))

  task_end = datetime.now()
  print("phone check end time:" + str(task_end))

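The multi-threaded version gets through one number segment (10,000 queries) in roughly 2 to 3 minutes. CPU usage spikes and then stays at around 70%.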
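The standard library's concurrent.futures.ThreadPoolExecutor can express the same fan-out without a hand-managed queue and lock. The sketch below is one possible arrangement, not the script used above: build_numbers, crawl_segment and fake_fetch are hypothetical names, and the fetch function is passed in as a parameter so that a real network call (for example one built on requests.get) can be swapped in.

```python
from concurrent.futures import ThreadPoolExecutor


def build_numbers(head, count=10000):
    """Yield 11-digit numbers: 3-digit head + 4-digit middle + '0000'."""
    for i in range(count):
        yield head + str(i).zfill(4) + "0000"


def crawl_segment(head, fetch, workers=32, count=10000):
    """Call fetch(number) for every number in the segment using a thread pool.

    Results come back in the same order as the generated numbers.
    """
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(fetch, build_numbers(head, count)))


def fake_fetch(phone_num):
    """Stand-in for the HTTP request; returns the 7-digit prefix."""
    return phone_num[:7]


results = crawl_segment('130', fake_fetch, workers=4, count=5)
print(results)
```

Because each task carries its own phone number, no worker has to infer its number from shared state such as the queue size.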

There are a little over 40 segments in total. Crawling all of them takes roughly 1 to 2 hours and yields about 410,000 records.
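With the crawl finished, the result file can serve the original requirement: checking a number's first seven digits and looking up its region. The sketch below assumes each line of result.txt has the form number,province,mobile_area,postcode as written by the scripts above, and that the number field is the 7-digit prefix. load_prefixes and lookup are hypothetical helper names, and the sample rows are made up.

```python
import io


def load_prefixes(lines):
    """Build {prefix: (province, mobile_area, postcode)} from CSV-like lines."""
    table = {}
    for raw in lines:
        parts = raw.strip().split(',')
        if len(parts) == 4:
            table[parts[0]] = tuple(parts[1:])
    return table


def lookup(table, phone_num):
    """Return the record for an 11-digit number, or None if the prefix is unknown."""
    if len(phone_num) != 11 or not phone_num.isdigit():
        return None
    return table.get(phone_num[:7])


# Made-up sample rows; with real data, pass an open file on ./result.txt instead.
sample = io.StringIO(
    "1300000,ProvinceA,CarrierA,000001\n"
    "1300001,ProvinceB,CarrierB,000002\n"
)
table = load_prefixes(sample)
print(lookup(table, '13000001234'))
print(lookup(table, '19999991234'))
```

A None result covers both malformed input and prefixes that were never crawled, so a caller that needs to tell those apart would have to check the format separately.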

That is all for this article. I hope it helps with your learning, and I hope you will continue to support 三水点靠木.
