Python crawler example: home location of 7-digit mobile number prefixes


Posted in Python on March 31, 2020

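Requirements analysis

The project needs the first 7 digits of a mobile number, both to check whether a number is valid and to look up its home location. The existing data is several years old, so I decided to re-crawl a fresh copy with a Python crawler.

Single-threaded version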

# coding:utf-8
import requests
from datetime import datetime


class PhoneInfoSpider:
  def __init__(self, phoneSections):
    self.phoneSections = phoneSections

  def phoneInfoHandler(self, textData):
    # The fields are picked by their line position in the response text
    text = textData.splitlines(True)

    if len(text) >= 9:
      number = text[1].split('\'')[1]
      province = text[2].split('\'')[1]
      mobile_area = text[3].split('\'')[1]
      postcode = text[5].split('\'')[1]
      line_text = number + "," + province + "," + mobile_area + "," + postcode
      print(line_text)

      try:
        # "with" closes the file even if the write fails
        with open('./result.txt', 'a') as f:
          f.write(line_text + '\n')
      except Exception as e:
        print(type(e).__name__, ":", e)

  def requestPhoneInfo(self, phoneNum):
    try:
      url = 'https://tcc.taobao.com/cc/json/mobile_tel_segment.htm?tel=' + phoneNum
      # A timeout keeps one stalled request from blocking the whole crawl
      response = requests.get(url, timeout=10)
      self.phoneInfoHandler(response.text)
    except Exception as e:
      print(type(e).__name__, ":", e)

  def requestAllSections(self):
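    # "last" is the index to resume from after an abnormal exit
    last = 0
    # last = 4
    # Build each number automatically and pad the last four digits with zeros
    for head in self.phoneSections:
      head_begin = datetime.now()
      print(head + " begin time:" + str(head_begin))

      # Trial run over 10 prefixes; use range(last, 10000) for a full segment
      for i in range(last, 10):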
        middle = str(i).zfill(4)
        phoneNum = head + middle + "0000"
        self.requestPhoneInfo(phoneNum)
      last = 0

      head_end = datetime.now()
      print(head + " end time:" + str(head_end))


if __name__ == '__main__':
  task_begin = datetime.now()
  print("phone check begin time:" + str(task_begin))

  # China Telecom, China Unicom, China Mobile, virtual operators
  dx = ['133', '149', '153', '173', '177', '180', '181', '189', '199']
  lt = ['130', '131', '132', '145', '146', '155', '156', '166', '171', '175', '176', '185', '186']
  yd = ['134', '135', '136', '137', '138', '139', '147', '148', '150', '151', '152', '157', '158', '159', '172',
     '178', '182', '183', '184', '187', '188', '198']
  add = ['170']
  all_num = dx + lt + yd + add

  # print(all_num)
  print(len(all_num))

  # Number segments to crawl
  spider = PhoneInfoSpider(all_num)
  spider.requestAllSections()

  task_end = datetime.now()
  print("phone check end time:" + str(task_end))

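Crawling one number segment takes 10,000 queries, and the single-threaded version needs roughly an hour and a half or more for that, which is too slow.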
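
Separately from speed, the handler above picks fields by fixed line position, so it depends on the response keeping exactly the same layout. A minimal sketch of a more tolerant parser follows, assuming the response is a JavaScript-style object with one `key:'value'` pair per line (which is what splitting on single quotes implies). The function name `parse_segment_response`, the key names, and the sample values are illustrative assumptions, not taken from the service.

```python
import re

# Matches pairs such as  province:'Example'  and captures the key and the value.
FIELD_RE = re.compile(r"(\w+)\s*:\s*'([^']*)'")


def parse_segment_response(text):
    """Return every key:'value' pair found in the response text as a dict."""
    return dict(FIELD_RE.findall(text))


# Hypothetical response body with placeholder values, for illustration only.
sample = """__GetZoneResult_ = {
    mts:'1380000',
    province:'ExampleProvince',
    catName:'ExampleCarrier',
    telString:'13800000000'
}"""

fields = parse_segment_response(sample)
print(fields.get("mts"), fields.get("province"))
```

Looking fields up by key means a reordered or extended response still parses, and a missing key shows up as `None` rather than as an index error.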

Multithreaded version

# coding:utf-8
import requests
from datetime import datetime
import queue
import threading

threadNum = 32


class MyThread(threading.Thread):
  def __init__(self, func):
    threading.Thread.__init__(self)
    self.func = func

  def run(self):
    self.func()


def requestPhoneInfo():
  while True:
    lock.acquire()
    if q.qsize() != 0:
      print("queue size:" + str(q.qsize()))
      p = q.get() # take the next task
      lock.release()

      # Derive the middle digits from the task itself (tasks are 1..10000);
      # reading q.qsize() here, outside the lock, could repeat or skip values
      middle = str(p - 1).zfill(4)
      phoneNum = phone_head + middle + "0000"
      print("phoneNum:" + phoneNum)

      try:
        url = 'https://tcc.taobao.com/cc/json/mobile_tel_segment.htm?tel=' + phoneNum
        # A timeout keeps one stalled request from tying up a worker thread
        response = requests.get(url, timeout=10)
        phoneInfoHandler(response.text)
      except Exception as e:
        print(type(e).__name__, ":", e)
    else:
      lock.release()
      break


def phoneInfoHandler(textData):
  # The fields are picked by their line position in the response text
  text = textData.splitlines(True)

  if len(text) >= 9:
    number = text[1].split('\'')[1]
    province = text[2].split('\'')[1]
    mobile_area = text[3].split('\'')[1]
    postcode = text[5].split('\'')[1]
    line_text = number + "," + province + "," + mobile_area + "," + postcode
    print(line_text)

    try:
      # "with" closes the file even if the write fails
      with open('./result.txt', 'a') as f:
        f.write(line_text + '\n')
    except Exception as e:
      print(type(e).__name__, ":", e)


if __name__ == '__main__':
  task_begin = datetime.now()
  print("phone check begin time:" + str(task_begin))

  dx = ['133', '149', '153', '173', '177', '180', '181', '189', '199']
  lt = ['130', '131', '132', '145', '155', '156', '166', '171', '175', '176', '185', '186']
  yd = ['134', '135', '136', '137', '138', '139', '147', '150', '151', '152', '157', '158', '159', '172', '178',
     '182', '183', '184', '187', '188', '198']
  all_num = dx + lt + yd
  print(len(all_num))

  for head in all_num:
    head_begin = datetime.now()
    print(head + " begin time:" + str(head_begin))

    q = queue.Queue()
    threads = []
    lock = threading.Lock()

    for p in range(10000):
      q.put(p + 1)

    print(q.qsize())

    # All worker threads read the current prefix from this module-level name
    phone_head = head

    for i in range(threadNum):
      thread = MyThread(requestPhoneInfo)
      thread.start()
      threads.append(thread)
    for thread in threads:
      thread.join()

    head_end = datetime.now()
    print(head + " end time:" + str(head_end))

  task_end = datetime.now()
  print("phone check end time:" + str(task_end))

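With the multithreaded version, one number segment (10,000 queries) finishes in about 2 to 3 minutes. CPU usage spikes and then holds at roughly 70%.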
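
The same fan-out can also be written with the standard library's `concurrent.futures.ThreadPoolExecutor`, which takes over the queue and lock handling. This is an alternative sketch, not the code used for the timings above. The names `build_probe_numbers`, `crawl_segment`, and `fake_fetch` are hypothetical, and `fake_fetch` stands in for the HTTP request so the example runs offline.

```python
from concurrent.futures import ThreadPoolExecutor


def build_probe_numbers(head, count=10000):
    """Yield 11-digit probe numbers: 3-digit head + 4-digit middle + '0000'."""
    for i in range(count):
        yield head + str(i).zfill(4) + "0000"


def crawl_segment(head, fetch, workers=32, count=10000):
    """Call fetch(phone_num) for every probe number and keep non-empty results."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map gives each number to exactly one worker and preserves input order
        results = pool.map(fetch, build_probe_numbers(head, count))
        return [r for r in results if r]


def fake_fetch(phone_num):
    # Stand-in for the real request: just return the 7-digit prefix
    return phone_num[:7]


print(crawl_segment("130", fake_fetch, workers=4, count=5))
```

Because each task carries its own number, no worker has to infer it from the queue size, and collecting the results in the main thread allows them to be written to the file from one place.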

There are 40-odd segments in total, so the full crawl takes roughly one to two hours and yields about 410,000 records.
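
Once the data has been collected, the validity check and location lookup from the requirements can be sketched as below. This assumes each line of result.txt has the four comma-separated columns the crawler writes and that the first column is the 7-digit prefix. The function names `load_prefix_table` and `lookup` and the sample rows are placeholders, not real results.

```python
def load_prefix_table(lines):
    """Build {7-digit prefix: (remaining three columns)} from comma-separated lines."""
    table = {}
    for line in lines:
        parts = line.strip().split(",")
        # Skip lines that do not match the assumed four-column layout
        if len(parts) == 4 and len(parts[0]) == 7 and parts[0].isdigit():
            table[parts[0]] = tuple(parts[1:])
    return table


def lookup(table, phone_num):
    """Return the record for an 11-digit number, or None if it is not recognised."""
    if len(phone_num) != 11 or not phone_num.isdigit():
        return None
    return table.get(phone_num[:7])


# Placeholder rows in the same column order the crawler writes
rows = ["1300000,ProvinceA,AreaA,000000", "1300001,ProvinceB,AreaB,000001"]
table = load_prefix_table(rows)
print(lookup(table, "13000001234"))
print(lookup(table, "19999991234"))
```

In practice `rows` would come from iterating over the open result.txt file. A number counts as valid here only if it has 11 digits and its prefix is present in the table.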

