python手机号前7位归属地爬虫代码实例


Posted in Python onMarch 31, 2020

需求分析

项目上需要用到手机号前7位,判断号码是否合法,还有归属地查询。旧的数据是几年前了太久了,打算用python爬虫重新爬一份

单线程版本

# coding:utf-8
import requests
from datetime import datetime


class PhoneInfoSpider:
  def __init__(self, phoneSections):
    self.phoneSections = phoneSections

  def phoneInfoHandler(self, textData):
    text = textData.splitlines(True)
    # print("text length:" + str(len(text)))

    if len(text) >= 9:
      number = text[1].split('\'')[1]
      province = text[2].split('\'')[1]
      mobile_area = text[3].split('\'')[1]
      postcode = text[5].split('\'')[1]
      line = "number:" + number + ",province:" + province + ",mobile_area:" + mobile_area + ",postcode:" + postcode
      line_text = number + "," + province + "," + mobile_area + "," + postcode
      print(line_text)
      # print("province:" + province)

      try:
        f = open('./result.txt', 'a')
        f.write(str(line_text) + '\n')
      except Exception as e:
        print(Exception, ":", e)

  def requestPhoneInfo(self, phoneNum):
    try:
      url = 'https://tcc.taobao.com/cc/json/mobile_tel_segment.htm?tel=' + phoneNum
      response = requests.get(url)
      self.phoneInfoHandler(response.text)
    except Exception as e:
      print(Exception, ":", e)

  def requestAllSections(self):
    # last用于接上次异常退出前的号码
    last = 0
    # last = 4
    # 自动生成手机号码,后四位补0
    for head in self.phoneSections:
      head_begin = datetime.now()
      print(head + " begin time:" + str(head_begin))

      # for i in range(last, 10000):
      for i in range(last, 10):
        middle = str(i).zfill(4)
        phoneNum = head + middle + "0000"
        self.requestPhoneInfo(phoneNum)
      last = 0

      head_end = datetime.now()
      print(head + " end time:" + str(head_end))


if __name__ == '__main__':
  task_begin = datetime.now()
  print("phone check begin time:" + str(task_begin))

  # 电信,联通,移动,虚拟运营商
  dx = ['133', '149', '153', '173', '177', '180', '181', '189', '199']
  lt = ['130', '131', '132', '145', '146', '155', '156', '166', '171', '175', '176', '185', '186', '166']
  yd = ['134', '135', '136', '137', '138', '139', '147', '148', '150', '151', '152', '157', '158', '159', '172',
     '178', '182', '183', '184', '187', '188', '198']
  add = ['170']
  all_num = dx + lt + yd + add

  # print(all_num)
  print(len(all_num))

  # 要爬的号码段
  spider = PhoneInfoSpider(all_num)
  spider.requestAllSections()

  task_end = datetime.now()
  print("phone check end time:" + str(task_end))

发现爬取一个号段,共10000次查询,单线程版大概要多1个半小时,太慢了。

多线程版本

# coding:utf-8
import requests
from datetime import datetime
import queue
import threading

threadNum = 32


class MyThread(threading.Thread):
  def __init__(self, func):
    threading.Thread.__init__(self)
    self.func = func

  def run(self):
    self.func()


def requestPhoneInfo():
  global lock
  while True:
    lock.acquire()
    if q.qsize() != 0:
      print("queue size:" + str(q.qsize()))
      p = q.get() # 获得任务
      lock.release()

      middle = str(9999 - q.qsize()).zfill(4)
      phoneNum = phone_head + middle + "0000"
      print("phoneNum:" + phoneNum)

      try:
        url = 'https://tcc.taobao.com/cc/json/mobile_tel_segment.htm?tel=' + phoneNum
        # print(url)
        response = requests.get(url)
        # print(response.text)
        phoneInfoHandler(response.text)
      except Exception as e:
        print(Exception, ":", e)
    else:
      lock.release()
      break


def phoneInfoHandler(textData):
  text = textData.splitlines(True)

  if len(text) >= 9:
    number = text[1].split('\'')[1]
    province = text[2].split('\'')[1]
    mobile_area = text[3].split('\'')[1]
    postcode = text[5].split('\'')[1]
    line = "number:" + number + ",province:" + province + ",mobile_area:" + mobile_area + ",postcode:" + postcode
    line_text = number + "," + province + "," + mobile_area + "," + postcode
    print(line_text)
    # print("province:" + province)

    try:
      f = open('./result.txt', 'a')
      f.write(str(line_text) + '\n')
    except Exception as e:
      print(Exception, ":", e)


if __name__ == '__main__':
  task_begin = datetime.now()
  print("phone check begin time:" + str(task_begin))

  dx = ['133', '149', '153', '173', '177', '180', '181', '189', '199']
  lt = ['130', '131', '132', '145', '155', '156', '166', '171', '175', '176', '185', '186', '166']
  yd = ['134', '135', '136', '137', '138', '139', '147', '150', '151', '152', '157', '158', '159', '172', '178',
     '182', '183', '184', '187', '188', '198']
  all_num = dx + lt + yd
  print(len(all_num))

  for head in all_num:
    head_begin = datetime.now()
    print(head + " begin time:" + str(head_begin))

    q = queue.Queue()
    threads = []
    lock = threading.Lock()

    for p in range(10000):
      q.put(p + 1)

    print(q.qsize())

    for i in range(threadNum):
      middle = str(i).zfill(4)
      global phone_head
      phone_head = head

      thread = MyThread(requestPhoneInfo)
      thread.start()
      threads.append(thread)
    for thread in threads:
      thread.join()

    head_end = datetime.now()
    print(head + " end time:" + str(head_end))

  task_end = datetime.now()
  print("phone check end time:" + str(task_end))

多线程版的1个号码段1000条数据,大概2,3min就好,cpu使用飙升,大概维持在70%左右。

总共40多个号段,爬完大概1,2个小时,总数据41w左右

以上就是本文的全部内容,希望对大家的学习有所帮助,也希望大家多多支持三水点靠木。

Python 相关文章推荐
python 解析html之BeautifulSoup
Jul 07 Python
总结Python编程中函数的使用要点
Mar 20 Python
Django-Rest-Framework 权限管理源码浅析(小结)
Nov 12 Python
Python实现 PS 图像调整中的亮度调整
Jun 28 Python
python字典排序的方法
Oct 12 Python
python文件读写代码实例
Oct 21 Python
pycharm 设置项目的根目录教程
Feb 12 Python
Python3.7下安装pyqt5的方法步骤(图文)
May 12 Python
Pycharm Plugins加载失败问题解决方案
Nov 28 Python
详解Python+OpenCV绘制灰度直方图
Mar 22 Python
Python多线程实用方法以及共享变量资源竞争问题
Apr 12 Python
Python使用pandas导入csv文件内容的示例代码
Dec 24 Python
django修改models重建数据库的操作
Mar 31 #Python
Python写捕鱼达人的游戏实现
Mar 31 #Python
Django 多对多字段的更新和插入数据实例
Mar 31 #Python
基于python爬取有道翻译过程图解
Mar 31 #Python
django实现将修改好的新模型写入数据库
Mar 31 #Python
Python urlencode和unquote函数使用实例解析
Mar 31 #Python
Python响应对象text属性乱码解决方案
Mar 31 #Python
You might like
Terran历史背景
2020/03/14 星际争霸
提升PHP执行速度全攻略(上)
2006/10/09 PHP
使用php get_headers 判断URL是否有效的解决办法
2013/04/27 PHP
给jqGrid数据行添加修改和删除操作链接(之一)
2011/11/04 Javascript
node.js适合游戏后台开发吗?
2014/09/03 Javascript
分享20款美化网站的 jQuery Lightbox 灯箱插件
2014/10/10 Javascript
jQuery过滤选择器详解
2015/01/13 Javascript
JavaScript实现多栏目切换效果
2016/12/12 Javascript
Node.js readline模块与util模块的使用
2018/03/01 Javascript
关于js对textarea换行符的处理方法浅析
2018/08/03 Javascript
一步步教你利用Docker设置Node.js
2018/11/20 Javascript
jQuery表单元素过滤选择器用法实例分析
2019/02/20 jQuery
Electron + vue 打包桌面操作流程详解
2019/06/24 Javascript
微信小程序开发常见问题及解决方案
2019/07/11 Javascript
微信小程序商品详情页底部弹出框
2019/11/22 Javascript
vue-cli3访问public文件夹静态资源报错的解决方式
2020/09/02 Javascript
vue使用require.context实现动态注册路由
2020/12/25 Vue.js
Python下的subprocess模块的入门指引
2015/04/16 Python
手把手教你如何安装Pycharm(详细图文教程)
2018/11/28 Python
python实现简单多人聊天室
2018/12/11 Python
Python实现图片批量加入水印代码实例
2019/11/30 Python
Python包,__init__.py功能与用法分析
2020/01/07 Python
Python bytes string相互转换过程解析
2020/03/05 Python
python实现数学模型(插值、拟合和微分方程)
2020/11/13 Python
全球虚拟主机商:HostGator
2017/02/06 全球购物
美国浴缸、水槽和水龙头购物网站:Vintage Tub & Bath
2019/11/05 全球购物
给医务人员表扬信
2014/01/12 职场文书
运动会解说词100字
2014/01/31 职场文书
模范家庭事迹材料
2014/02/10 职场文书
售后客服个人自我评价
2014/09/14 职场文书
先进基层党组织事迹材料
2014/12/25 职场文书
销售工作决心书
2015/02/04 职场文书
在CSS中映射鼠标位置并实现通过鼠标移动控制页面元素效果(实例代码)
2021/04/22 HTML / CSS
MySQL创建高性能索引的全步骤
2021/05/02 MySQL
mysql外连接与内连接查询的不同之处
2021/06/03 MySQL
Golang 并发下的问题定位及解决方案
2022/03/16 Golang