
Python crawler example: carrier-region lookup by the first 7 digits of a mobile number

Posted in Python on March 31, 2020

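Requirements analysis

The project needs the first 7 digits of a mobile number to check whether the number is valid and to look up its home region. The existing data is several years old, so the plan is to crawl a fresh copy with a Python crawler.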
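
As a rough illustration of how the crawled data might be used afterwards, the sketch below loads comma-separated rows into a dictionary keyed by the 7-digit prefix, then uses it both to check a number and to look up its region. It assumes each row has the four columns written by the crawler further down (number, province, mobile_area, postcode) and that the first column holds the 7-digit prefix. The names `load_prefix_table` and `lookup` are hypothetical, and the sample rows are made up.

```python
import csv
import io


def load_prefix_table(lines):
    """Build {prefix: (province, mobile_area, postcode)} from CSV rows."""
    table = {}
    for row in csv.reader(lines):
        if len(row) < 4:
            continue  # skip blank or truncated rows
        prefix = row[0][:7]  # assumption: column 0 starts with the 7-digit prefix
        table[prefix] = (row[1], row[2], row[3])
    return table


def lookup(table, phone):
    """Return the stored row for an 11-digit number, or None if unknown."""
    if len(phone) != 11 or not phone.isdigit():
        return None
    return table.get(phone[:7])


# Made-up sample rows, for illustration only.
sample = io.StringIO(
    "1380000,ProvinceA,CarrierX,0001\n"
    "1380001,ProvinceB,CarrierX,0002\n"
)
table = load_prefix_table(sample)
print(lookup(table, "13800011234"))
print(lookup(table, "19999990000"))
```

With real data, an open file object for result.txt would be passed in place of `sample`.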

Single-threaded version

# coding:utf-8
import requests
from datetime import datetime


class PhoneInfoSpider:
  def __init__(self, phoneSections):
    self.phoneSections = phoneSections

  def phoneInfoHandler(self, textData):
    # Split the response into lines; only responses with at least 9 lines are parsed.
    text = textData.splitlines(True)

    if len(text) >= 9:
      # For each field, take the text between the first pair of single quotes.
      number = text[1].split('\'')[1]
      province = text[2].split('\'')[1]
      mobile_area = text[3].split('\'')[1]
      postcode = text[5].split('\'')[1]
      line_text = number + "," + province + "," + mobile_area + "," + postcode
      print(line_text)

      try:
        # Append one CSV row; the with-block closes the file after each write.
        with open('./result.txt', 'a') as f:
          f.write(line_text + '\n')
      except Exception as e:
        print(type(e).__name__, ":", e)

  def requestPhoneInfo(self, phoneNum):
    try:
      url = 'https://tcc.taobao.com/cc/json/mobile_tel_segment.htm?tel=' + phoneNum
      # A timeout keeps one stalled request from hanging the whole crawl.
      response = requests.get(url, timeout=10)
      self.phoneInfoHandler(response.text)
    except Exception as e:
      print(type(e).__name__, ":", e)

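  def requestAllSections(self):
    # `last` lets a run resume from the number reached before an abnormal exit.
    last = 0
    # last = 4
    # Generate the phone numbers automatically, padding the last four digits with 0.
    for head in self.phoneSections:
      head_begin = datetime.now()
      print(head + " begin time:" + str(head_begin))

      # Cover all 10000 four-digit middles; use a small range such as
      # range(last, 10) for a trial run.
      for i in range(last, 10000):
        middle = str(i).zfill(4)
        phoneNum = head + middle + "0000"
        self.requestPhoneInfo(phoneNum)
      # Only the first segment resumes from `last`; later segments start at 0.
      last = 0

      head_end = datetime.now()
      print(head + " end time:" + str(head_end))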


if __name__ == '__main__':
  task_begin = datetime.now()
  print("phone check begin time:" + str(task_begin))

  # China Telecom, China Unicom, China Mobile, virtual operators
  dx = ['133', '149', '153', '173', '177', '180', '181', '189', '199']
  lt = ['130', '131', '132', '145', '146', '155', '156', '166', '171', '175', '176', '185', '186']
  yd = ['134', '135', '136', '137', '138', '139', '147', '148', '150', '151', '152', '157', '158', '159', '172',
     '178', '182', '183', '184', '187', '188', '198']
  add = ['170']
  all_num = dx + lt + yd + add

  # print(all_num)
  print(len(all_num))

  # Number segments to crawl
  spider = PhoneInfoSpider(all_num)
  spider.requestAllSections()

  task_end = datetime.now()
  print("phone check end time:" + str(task_end))

Crawling one segment takes 10,000 queries, and the single-threaded version needs roughly an hour and a half for it, which is too slow.
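
To put that figure in perspective, the back-of-the-envelope calculation below derives the average time per query and the projected time for a full crawl. The segment count of 40 is an assumption, taken as a round lower bound for the "40-odd segments" mentioned later in the article; `estimate` is a hypothetical helper name.

```python
def estimate(queries_per_segment, hours_per_segment, segment_count):
    """Return (seconds per query, total hours for all segments)."""
    seconds_per_query = hours_per_segment * 3600 / queries_per_segment
    total_hours = hours_per_segment * segment_count
    return seconds_per_query, total_hours


# 10000 queries in about 1.5 hours per segment; 40 segments is an assumed round figure.
per_query, total = estimate(10000, 1.5, 40)
print(f"{per_query:.2f} s per query, about {total:.0f} hours in total")
```

At about half a second per request, a sequential run over every segment would take days rather than hours, which is the motivation for running requests in parallel.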

Multithreaded version

# coding:utf-8
import requests
from datetime import datetime
import queue
import threading

threadNum = 32


class MyThread(threading.Thread):
  def __init__(self, func):
    threading.Thread.__init__(self)
    self.func = func

  def run(self):
    self.func()


def requestPhoneInfo():
  while True:
    # Take one task while holding the lock; stop once the queue is empty.
    with lock:
      if q.qsize() == 0:
        break
      print("queue size:" + str(q.qsize()))
      p = q.get()

    # Build the number from the task itself (tasks are 1..10000) rather than from
    # q.qsize(), which other threads may already have changed.
    middle = str(p - 1).zfill(4)
    phoneNum = phone_head + middle + "0000"
    print("phoneNum:" + phoneNum)

    try:
      url = 'https://tcc.taobao.com/cc/json/mobile_tel_segment.htm?tel=' + phoneNum
      # A timeout keeps one stalled request from blocking a worker forever.
      response = requests.get(url, timeout=10)
      phoneInfoHandler(response.text)
    except Exception as e:
      print(type(e).__name__, ":", e)


# Serializes appends to result.txt so rows from different threads cannot interleave.
write_lock = threading.Lock()


def phoneInfoHandler(textData):
  # Split the response into lines; only responses with at least 9 lines are parsed.
  text = textData.splitlines(True)

  if len(text) >= 9:
    # For each field, take the text between the first pair of single quotes.
    number = text[1].split('\'')[1]
    province = text[2].split('\'')[1]
    mobile_area = text[3].split('\'')[1]
    postcode = text[5].split('\'')[1]
    line_text = number + "," + province + "," + mobile_area + "," + postcode
    print(line_text)

    try:
      # Append one CSV row; the with-block closes the file after each write.
      with write_lock, open('./result.txt', 'a') as f:
        f.write(line_text + '\n')
    except Exception as e:
      print(type(e).__name__, ":", e)


if __name__ == '__main__':
  task_begin = datetime.now()
  print("phone check begin time:" + str(task_begin))

  dx = ['133', '149', '153', '173', '177', '180', '181', '189', '199']
  lt = ['130', '131', '132', '145', '155', '156', '166', '171', '175', '176', '185', '186']
  yd = ['134', '135', '136', '137', '138', '139', '147', '150', '151', '152', '157', '158', '159', '172', '178',
     '182', '183', '184', '187', '188', '198']
  all_num = dx + lt + yd
  print(len(all_num))

  for head in all_num:
    head_begin = datetime.now()
    print(head + " begin time:" + str(head_begin))

    q = queue.Queue()
    threads = []
    lock = threading.Lock()

    for p in range(10000):
      q.put(p + 1)

    print(q.qsize())

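    # All worker threads read the current segment from this module-level variable.
    phone_head = head

    for i in range(threadNum):
      thread = MyThread(requestPhoneInfo)
      thread.start()
      threads.append(thread)
    # Wait for every worker to finish before moving on to the next segment.
    for thread in threads:
      thread.join()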

    head_end = datetime.now()
    print(head + " end time:" + str(head_end))

  task_end = datetime.now()
  print("phone check end time:" + str(task_end))

With the multithreaded version, the 10,000 records of one segment finish in about 2 to 3 minutes. CPU usage spikes and then stays at around 70%.
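
The same fan-out can also be expressed with the standard library's `concurrent.futures.ThreadPoolExecutor`, which removes the hand-written queue and lock. This is only a sketch of an alternative, not the code used above: `crawl_segment` and `fake_fetch` are hypothetical names, and the fetch function is passed in as a parameter so the example runs without network access. In a real run it would wrap the `requests.get` call.

```python
from concurrent.futures import ThreadPoolExecutor


def crawl_segment(head, fetch, count=10000, workers=32):
    """Call fetch(phone_number) for every 7-digit prefix under `head`."""
    numbers = [head + str(i).zfill(4) + "0000" for i in range(count)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() returns results in the same order as `numbers`.
        return list(pool.map(fetch, numbers))


def fake_fetch(phone_number):
    # Stand-in for the HTTP request (an assumption for this sketch);
    # it simply echoes the 7-digit prefix of the number it was given.
    return phone_number[:7]


results = crawl_segment("138", fake_fetch, count=20, workers=4)
print(len(results), results[0], results[-1])
```

Because the results come back to the caller, the rows can be written to the file from a single thread afterwards, which sidesteps concurrent writes altogether.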

There are 40-odd segments in total. Crawling all of them takes roughly 1 to 2 hours and yields about 410,000 records.
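
If the row count for a segment comes out below 10,000, it may be useful to see which prefixes have no row, for example to retry them. The sketch below lists the prefixes of one segment that are absent from a set of crawled prefixes. `missing_prefixes` is a hypothetical helper, and the `seen` set is made-up sample data; in practice it would be built from the first column of result.txt.

```python
def missing_prefixes(head, seen_prefixes, count=10000):
    """Return the 7-digit prefixes under `head` that are not in seen_prefixes."""
    expected = (head + str(i).zfill(4) for i in range(count))
    return [p for p in expected if p not in seen_prefixes]


# Made-up example: only three prefixes out of a 5-prefix range were crawled.
seen = {"1380000", "1380001", "1380003"}
print(missing_prefixes("138", seen, count=5))
```

A missing prefix does not by itself say whether the request failed or the prefix is simply unassigned, so a retry is only a first check.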

- Author -

wanli001
