Python crawler example: home-region lookup by the first 7 digits of a mobile number


Posted in Python on March 31, 2020

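Requirements analysis

The project needs the first 7 digits of a mobile number, both to check whether a number is valid and to look up its home region. The existing data is several years old, so I decided to crawl a fresh copy with a Python crawler.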
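As a rough illustration of the "first 7 digits" idea, the sketch below normalises a number and extracts its 7-digit prefix. The helper name `prefix7` and the rule "11 digits starting with 1" are assumptions made for this example, not something taken from the project.

```python
import re

# Hypothetical rule for this sketch: 11 digits, starting with 1
_MOBILE_RE = re.compile(r"1\d{10}")


def prefix7(number):
    """Return the first 7 digits of an 11-digit mobile number, or None if malformed."""
    # Drop the separators people commonly type
    digits = number.strip().replace(" ", "").replace("-", "")
    if not _MOBILE_RE.fullmatch(digits):
        return None
    return digits[:7]


print(prefix7("138 1234 5678"))
print(prefix7("12345"))
```

A format check like this only rejects malformed input; whether the prefix is really in use still has to be checked against the crawled data.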

Single-threaded version

# coding:utf-8
import requests
from datetime import datetime


class PhoneInfoSpider:
  def __init__(self, phoneSections):
    self.phoneSections = phoneSections

  def phoneInfoHandler(self, textData):
    # The response is parsed by line position; each field line holds a 'quoted' value
    text = textData.splitlines(True)
    # print("text length:" + str(len(text)))

    if len(text) >= 9:
      number = text[1].split('\'')[1]
      province = text[2].split('\'')[1]
      mobile_area = text[3].split('\'')[1]
      postcode = text[5].split('\'')[1]
      line_text = number + "," + province + "," + mobile_area + "," + postcode
      print(line_text)

      try:
        # Append one comma-separated line per number; "with" closes the file
        with open('./result.txt', 'a', encoding='utf-8') as f:
          f.write(line_text + '\n')
      except Exception as e:
        print(type(e).__name__, ":", e)

  def requestPhoneInfo(self, phoneNum):
    try:
      url = 'https://tcc.taobao.com/cc/json/mobile_tel_segment.htm?tel=' + phoneNum
      # A timeout keeps one stalled request from hanging the whole crawl
      response = requests.get(url, timeout=10)
      self.phoneInfoHandler(response.text)
    except Exception as e:
      print(type(e).__name__, ":", e)

  def requestAllSections(self):
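    # "last" resumes from the middle number reached before an abnormal exit
    last = 0
    # last = 4
    # Generate the numbers automatically: head + 4-digit middle + "0000"
    for head in self.phoneSections:
      head_begin = datetime.now()
      print(head + " begin time:" + str(head_begin))

      # Test run of 10 numbers; use range(last, 10000) for a full crawl
      for i in range(last, 10):
        middle = str(i).zfill(4)
        phoneNum = head + middle + "0000"
        self.requestPhoneInfo(phoneNum)
      # Only the first segment resumes; later segments start from 0
      last = 0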

      head_end = datetime.now()
      print(head + " end time:" + str(head_end))


if __name__ == '__main__':
  task_begin = datetime.now()
  print("phone check begin time:" + str(task_begin))

  # China Telecom, China Unicom, China Mobile, virtual operators
  dx = ['133', '149', '153', '173', '177', '180', '181', '189', '199']
  lt = ['130', '131', '132', '145', '146', '155', '156', '166', '171', '175', '176', '185', '186']
  yd = ['134', '135', '136', '137', '138', '139', '147', '148', '150', '151', '152', '157', '158', '159', '172',
     '178', '182', '183', '184', '187', '188', '198']
  add = ['170']
  all_num = dx + lt + yd + add

  # print(all_num)
  print(len(all_num))

  # Number segments to crawl
  spider = PhoneInfoSpider(all_num)
  spider.requestAllSections()

  task_end = datetime.now()
  print("phone check end time:" + str(task_end))

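Crawling one segment means 10,000 queries, and the single-threaded version needs about an hour and a half for that, which is too slow.

Multi-threaded version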
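The handler above picks fields by line position, which breaks if the response layout shifts. A minimal sketch of a position-independent alternative is to pull every `key:'value'` pair with a regular expression. The sample text below is made up for illustration, and its key names (`mts`, `province`, `catName`, `areaVid` and so on) are assumptions about the response shape, not a confirmed description of the API.

```python
import re

# Made-up sample, assuming one key:'value' pair per line
SAMPLE = """__GetZoneResult_ = {
    mts:'1381234',
    province:'北京',
    catName:'中国移动',
    telString:'13812340000',
    areaVid:'29400',
    ispVid:'3236139',
    carrier:'北京移动'
}"""

# A word, a colon, then a single-quoted value
FIELD_RE = re.compile(r"(\w+)\s*:\s*'([^']*)'")


def parse_fields(text):
    """Return a dict of every key:'value' pair found in the text."""
    return dict(FIELD_RE.findall(text))


fields = parse_fields(SAMPLE)
print(fields["mts"], fields["province"], fields["catName"], fields["areaVid"])
```

An empty dict signals an unexpected response, which plays the role of the `len(text) >= 9` check in the original handler.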

# coding:utf-8
import requests
from datetime import datetime
import queue
import threading

threadNum = 32


class MyThread(threading.Thread):
  def __init__(self, func):
    threading.Thread.__init__(self)
    self.func = func

  def run(self):
    self.func()


def requestPhoneInfo():
  # Worker loop: keep taking tasks from the shared queue until it is empty
  while True:
    with lock:
      if q.empty():
        break
      print("queue size:" + str(q.qsize()))
      p = q.get() # fetch a task

    # Tasks are numbered 1..10000, so p - 1 is the 4-digit middle part.
    # Deriving it from q.qsize() after releasing the lock races with other threads.
    middle = str(p - 1).zfill(4)
    phoneNum = phone_head + middle + "0000"
    print("phoneNum:" + phoneNum)

    try:
      url = 'https://tcc.taobao.com/cc/json/mobile_tel_segment.htm?tel=' + phoneNum
      # A timeout keeps one stalled request from blocking a worker forever
      response = requests.get(url, timeout=10)
      phoneInfoHandler(response.text)
    except Exception as e:
      print(type(e).__name__, ":", e)


def phoneInfoHandler(textData):
  # The response is parsed by line position; each field line holds a 'quoted' value
  text = textData.splitlines(True)

  if len(text) >= 9:
    number = text[1].split('\'')[1]
    province = text[2].split('\'')[1]
    mobile_area = text[3].split('\'')[1]
    postcode = text[5].split('\'')[1]
    line_text = number + "," + province + "," + mobile_area + "," + postcode
    print(line_text)

    try:
      # Append one comma-separated line per number; "with" closes the file
      with open('./result.txt', 'a', encoding='utf-8') as f:
        f.write(line_text + '\n')
    except Exception as e:
      print(type(e).__name__, ":", e)


if __name__ == '__main__':
  task_begin = datetime.now()
  print("phone check begin time:" + str(task_begin))

  dx = ['133', '149', '153', '173', '177', '180', '181', '189', '199']
  lt = ['130', '131', '132', '145', '155', '156', '166', '171', '175', '176', '185', '186']
  yd = ['134', '135', '136', '137', '138', '139', '147', '150', '151', '152', '157', '158', '159', '172', '178',
     '182', '183', '184', '187', '188', '198']
  all_num = dx + lt + yd
  print(len(all_num))

  for head in all_num:
    head_begin = datetime.now()
    print(head + " begin time:" + str(head_begin))

    q = queue.Queue()
    threads = []
    lock = threading.Lock()

    for p in range(10000):
      q.put(p + 1)

    print(q.qsize())

    # Every worker reads the current segment head from this module-level name
    phone_head = head

    for i in range(threadNum):
      thread = MyThread(requestPhoneInfo)
      thread.start()
      threads.append(thread)
    for thread in threads:
      thread.join()

    head_end = datetime.now()
    print(head + " end time:" + str(head_end))

  task_end = datetime.now()
  print("phone check end time:" + str(task_end))

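With the multi-threaded version, the 10,000 records of one segment take about 2 to 3 minutes. CPU usage spikes and stays at around 70%.

There are 40-odd segments in total; crawling them all takes roughly one to two hours and yields about 410,000 records.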
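The queue, lock and `Thread` subclass above can also be expressed with the standard library's `concurrent.futures.ThreadPoolExecutor`. The sketch below is an alternative, not the author's method: `build_numbers`, `crawl_segment` and the stand-in `fake_fetch` are hypothetical names, and `fake_fetch` replaces the real HTTP request so the example runs offline.

```python
from concurrent.futures import ThreadPoolExecutor


def build_numbers(head, count=10000):
    """Yield head + 4-digit middle + '0000' for every middle in range(count)."""
    for i in range(count):
        yield head + str(i).zfill(4) + "0000"


def crawl_segment(head, fetch, workers=32, count=10000):
    """Call fetch(number) for each number of the segment; results keep input order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(fetch, build_numbers(head, count)))


def fake_fetch(number):
    # Stand-in for the real request: just return the 7-digit prefix
    return number[:7]


print(crawl_segment("138", fake_fetch, workers=4, count=3))
```

Because each task carries its own number, no worker has to derive it from the queue size, and collecting the results in the main thread lets a single writer append them to the output file.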
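Once the crawl has finished, the lines of result.txt can back the validity check and region lookup the project needs. The following is a minimal sketch that assumes the `number,province,mobile_area,postcode` line layout written by the scripts above; `load_segments` and `lookup` are hypothetical helper names and the sample line is made up.

```python
def load_segments(lines):
    """Build {7-digit prefix: (province, mobile_area, postcode)} from result lines."""
    table = {}
    for raw in lines:
        parts = raw.strip().split(",")
        if len(parts) != 4:
            continue  # skip blank or damaged lines
        number, province, mobile_area, postcode = parts
        table[number] = (province, mobile_area, postcode)
    return table


def lookup(table, phone):
    """Return the record for the phone's first 7 digits, or None if unknown."""
    return table.get(phone[:7])


# Made-up sample in the same layout as result.txt
sample = ["1381234,北京,中国移动,29400\n", "bad line\n"]
table = load_segments(sample)
print(lookup(table, "13812345678"))
print(lookup(table, "19900001111"))
```

In practice the lines would come from `open('result.txt', encoding='utf-8')`; a `None` result means the prefix was not found in the crawled data.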

