Python crawler example: home-region lookup by the first 7 digits of a mobile number


Posted in Python on March 31, 2020

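Requirements

The project needs the first 7 digits of mobile numbers, both to check whether a number is valid and to look up its home region. The existing data is several years old and out of date, so I decided to crawl a fresh copy with a Python crawler.

Single-threaded version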

# coding:utf-8
import requests
from datetime import datetime


class PhoneInfoSpider:
  def __init__(self, phoneSections):
    self.phoneSections = phoneSections

  def phoneInfoHandler(self, textData):
    # Each field of the response sits on its own line, with the value in single quotes
    text = textData.splitlines(True)

    if len(text) >= 9:
      number = text[1].split('\'')[1]
      province = text[2].split('\'')[1]
      mobile_area = text[3].split('\'')[1]
      postcode = text[5].split('\'')[1]
      line_text = number + "," + province + "," + mobile_area + "," + postcode
      print(line_text)

      try:
        # Append one comma-separated record; "with" closes the file even if the write fails
        with open('./result.txt', 'a') as f:
          f.write(line_text + '\n')
      except Exception as e:
        print(type(e).__name__, ":", e)

  def requestPhoneInfo(self, phoneNum):
    try:
      url = 'https://tcc.taobao.com/cc/json/mobile_tel_segment.htm?tel=' + phoneNum
      # A timeout keeps one stalled request from hanging the whole crawl
      response = requests.get(url, timeout=10)
      self.phoneInfoHandler(response.text)
    except Exception as e:
      print(type(e).__name__, ":", e)

  def requestAllSections(self):
    # "last" lets the crawl resume from the index reached before an abnormal exit
    last = 0
    # Generate the numbers automatically: 3-digit head + 4-digit middle, padded with "0000"
    for head in self.phoneSections:
      head_begin = datetime.now()
      print(head + " begin time:" + str(head_begin))

      # Use range(last, 10) instead for a quick trial run
      for i in range(last, 10000):
        middle = str(i).zfill(4)
        phoneNum = head + middle + "0000"
        self.requestPhoneInfo(phoneNum)
      # The resume offset applies only to the first head
      last = 0

      head_end = datetime.now()
      print(head + " end time:" + str(head_end))


if __name__ == '__main__':
  task_begin = datetime.now()
  print("phone check begin time:" + str(task_begin))

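  # China Telecom, China Unicom, China Mobile, virtual operators
  dx = ['133', '149', '153', '173', '177', '180', '181', '189', '199']
  # '166' was listed twice, which would crawl that segment a second time
  lt = ['130', '131', '132', '145', '146', '155', '156', '166', '171', '175', '176', '185', '186']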
  yd = ['134', '135', '136', '137', '138', '139', '147', '148', '150', '151', '152', '157', '158', '159', '172',
     '178', '182', '183', '184', '187', '188', '198']
  add = ['170']
  all_num = dx + lt + yd + add

  # print(all_num)
  print(len(all_num))

  # Number segments to crawl
  spider = PhoneInfoSpider(all_num)
  spider.requestAllSections()

  task_end = datetime.now()
  print("phone check end time:" + str(task_end))
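The handler above relies on each field sitting on a fixed line of the response. As a hedged alternative, the sketch below extracts `key:'value'` pairs with a regular expression so that it does not depend on line positions. The sample text and its key names are assumptions, modeled on the nine-line layout the handler expects (one name per line followed by a single-quoted value); the real response may differ, and `parse_fields` is a hypothetical helper name.

```python
import re

# Matches fragments of the form  key:'value'  (assumed response layout).
FIELD_RE = re.compile(r"(\w+)\s*:\s*'([^']*)'")


def parse_fields(text_data):
    """Return a dict of every key:'value' pair found in the response text."""
    return dict(FIELD_RE.findall(text_data))


# Hypothetical sample shaped like the 9-line response the handler expects.
# Key names and values are placeholders, not captured output.
sample = """__GetZoneResult_ = {
    mts:'1330000',
    province:'PROVINCE',
    catName:'CARRIER',
    telString:'13300000000',
    areaVid:'00000',
    ispVid:'00000',
    carrier:'CARRIER_NAME'
}"""

fields = parse_fields(sample)
print(fields["mts"], fields["province"])
```

With a dict in hand, a missing field shows up as a `KeyError` for that one key instead of an `IndexError` from a shifted line.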

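Crawling a single segment means 10,000 queries, and the single-threaded version needs roughly an hour and a half for that, which is far too slow.

Multithreaded version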
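To put that figure in perspective, here is a rough back-of-the-envelope calculation. The 10,000 queries and 1.5 hours come from the text; the segment count of 40 is an assumption chosen for illustration (the article later mentions "40-odd" segments).

```python
# Figures taken from the text: 10,000 queries per segment in about 1.5 hours.
queries_per_segment = 10_000
seconds_per_segment = 1.5 * 60 * 60

# Assumption for illustration only: 40 segments in total.
segment_count = 40

seconds_per_query = seconds_per_segment / queries_per_segment
total_hours = segment_count * seconds_per_segment / 3600

print(f"{seconds_per_query:.2f} s per query")        # 0.54 s per query
print(f"{total_hours:.0f} hours for all segments")   # 60 hours for all segments
```

At about half a second per request, nearly all of the time is spent waiting on the network, which is the kind of workload where threads help.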

# coding:utf-8
import requests
from datetime import datetime
import queue
import threading

threadNum = 32


class MyThread(threading.Thread):
  def __init__(self, func):
    threading.Thread.__init__(self)
    self.func = func

  def run(self):
    self.func()


def requestPhoneInfo():
  while True:
    # queue.Queue is thread-safe, so no extra lock is needed;
    # get_nowait() raises queue.Empty once every task has been taken
    try:
      p = q.get_nowait()
    except queue.Empty:
      break

    # Tasks are numbered 1..10000, so p - 1 gives the middle digits 0000..9999.
    # Deriving them from q.qsize() after the get is racy across threads.
    middle = str(p - 1).zfill(4)
    phoneNum = phone_head + middle + "0000"
    print("phoneNum:" + phoneNum)

    try:
      url = 'https://tcc.taobao.com/cc/json/mobile_tel_segment.htm?tel=' + phoneNum
      response = requests.get(url, timeout=10)
      phoneInfoHandler(response.text)
    except Exception as e:
      print(type(e).__name__, ":", e)


# Serializes appends to result.txt across the worker threads
write_lock = threading.Lock()


def phoneInfoHandler(textData):
  # Each field of the response sits on its own line, with the value in single quotes
  text = textData.splitlines(True)

  if len(text) >= 9:
    number = text[1].split('\'')[1]
    province = text[2].split('\'')[1]
    mobile_area = text[3].split('\'')[1]
    postcode = text[5].split('\'')[1]
    line_text = number + "," + province + "," + mobile_area + "," + postcode
    print(line_text)

    try:
      # "with" releases the lock and closes the file even if the write fails
      with write_lock, open('./result.txt', 'a') as f:
        f.write(line_text + '\n')
    except Exception as e:
      print(type(e).__name__, ":", e)


if __name__ == '__main__':
  task_begin = datetime.now()
  print("phone check begin time:" + str(task_begin))

  dx = ['133', '149', '153', '173', '177', '180', '181', '189', '199']
  # '166' was listed twice, which would crawl that segment a second time
  lt = ['130', '131', '132', '145', '155', '156', '166', '171', '175', '176', '185', '186']
  yd = ['134', '135', '136', '137', '138', '139', '147', '150', '151', '152', '157', '158', '159', '172', '178',
     '182', '183', '184', '187', '188', '198']
  all_num = dx + lt + yd
  print(len(all_num))

  for head in all_num:
    head_begin = datetime.now()
    print(head + " begin time:" + str(head_begin))

    q = queue.Queue()
    threads = []
    lock = threading.Lock()

    for p in range(10000):
      q.put(p + 1)

    print(q.qsize())

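    # Every worker reads the current head from this module-level variable
    phone_head = head

    for i in range(threadNum):
      thread = MyThread(requestPhoneInfo)
      thread.start()
      threads.append(thread)
    # Wait for all workers to drain the queue before moving to the next head
    for thread in threads:
      thread.join()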

    head_end = datetime.now()
    print(head + " end time:" + str(head_end))

  task_end = datetime.now()
  print("phone check end time:" + str(task_end))

With the multithreaded version, the 10,000 records of one segment finish in about 2 to 3 minutes. CPU usage spikes and then stays at around 70%.
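The hand-written thread class and task queue can also be expressed with the standard library's `concurrent.futures.ThreadPoolExecutor`. The following is a minimal sketch of that alternative, not the code used for the timings above. `build_numbers` and `crawl_segment` are hypothetical names, and `fetch` stands for whatever function performs the request and handles the response.

```python
from concurrent.futures import ThreadPoolExecutor


def build_numbers(head):
    """Yield the 10,000 probe numbers of one head: head + 0000..9999 + '0000'."""
    for i in range(10000):
        yield head + str(i).zfill(4) + "0000"


def crawl_segment(head, fetch, max_workers=32):
    """Run fetch(number) for every probe number of a segment on a thread pool.

    fetch is supplied by the caller (hypothetical here). Results are returned
    in the same order in which the numbers were generated.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(fetch, build_numbers(head)))


if __name__ == "__main__":
    # Stand-in fetch that does no network I/O, only to show the call shape.
    results = crawl_segment("133", lambda number: number[:7], max_workers=4)
    print(len(results), results[0], results[-1])  # 10000 1330000 1339999
```

Because each number is passed to `fetch` as an argument, no shared counter or global head is needed. Note that `pool.map` re-raises an exception from `fetch` when its result is consumed, so a real `fetch` should catch its own request errors.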

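There are a little over 40 segments in total. Crawling all of them takes about one to two hours and yields roughly 410,000 records.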
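Once the data is collected, it can serve the two needs stated at the start: validity checking and home-region lookup. Below is a minimal sketch, assuming each line of `result.txt` has the four comma-separated columns the crawler writes and that the first column is the 7-digit prefix. `load_segments` and `lookup` are hypothetical helper names, and the sample rows are made up.

```python
import csv
import io


def load_segments(lines):
    """Build {7-digit prefix: (province, mobile_area, postcode)} from result lines."""
    table = {}
    for row in csv.reader(lines):
        if len(row) == 4:
            number, province, mobile_area, postcode = row
            table[number] = (province, mobile_area, postcode)
    return table


def lookup(table, phone):
    """Return the record for an 11-digit number, or None if its prefix is unknown."""
    if len(phone) != 11 or not phone.isdigit():
        return None
    return table.get(phone[:7])


# Made-up rows in the column order the crawler writes; not real data.
sample = io.StringIO(
    "1330000,ProvinceA,CarrierA,10000\n"
    "1330001,ProvinceB,CarrierA,10001\n"
)
table = load_segments(sample)
print(lookup(table, "13300011234"))  # ('ProvinceB', 'CarrierA', '10001')
print(lookup(table, "19900001234"))  # None
```

For the real file, pass an open file object instead of the `StringIO` sample, for example `load_segments(open('./result.txt'))` inside a `with` block. A `None` result means the prefix was not found in the crawled data, which is how this sketch treats a number as invalid.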

