python手机号前7位归属地爬虫代码实例


Posted in Python onMarch 31, 2020

需求分析

项目上需要用到手机号前7位,判断号码是否合法,还有归属地查询。旧的数据是几年前了太久了,打算用python爬虫重新爬一份

单线程版本

# coding:utf-8
import requests
from datetime import datetime


class PhoneInfoSpider:
  def __init__(self, phoneSections):
    self.phoneSections = phoneSections

  def phoneInfoHandler(self, textData):
    text = textData.splitlines(True)
    # print("text length:" + str(len(text)))

    if len(text) >= 9:
      number = text[1].split('\'')[1]
      province = text[2].split('\'')[1]
      mobile_area = text[3].split('\'')[1]
      postcode = text[5].split('\'')[1]
      line = "number:" + number + ",province:" + province + ",mobile_area:" + mobile_area + ",postcode:" + postcode
      line_text = number + "," + province + "," + mobile_area + "," + postcode
      print(line_text)
      # print("province:" + province)

      try:
        f = open('./result.txt', 'a')
        f.write(str(line_text) + '\n')
      except Exception as e:
        print(Exception, ":", e)

  def requestPhoneInfo(self, phoneNum):
    try:
      url = 'https://tcc.taobao.com/cc/json/mobile_tel_segment.htm?tel=' + phoneNum
      response = requests.get(url)
      self.phoneInfoHandler(response.text)
    except Exception as e:
      print(Exception, ":", e)

  def requestAllSections(self):
    # last用于接上次异常退出前的号码
    last = 0
    # last = 4
    # 自动生成手机号码,后四位补0
    for head in self.phoneSections:
      head_begin = datetime.now()
      print(head + " begin time:" + str(head_begin))

      # for i in range(last, 10000):
      for i in range(last, 10):
        middle = str(i).zfill(4)
        phoneNum = head + middle + "0000"
        self.requestPhoneInfo(phoneNum)
      last = 0

      head_end = datetime.now()
      print(head + " end time:" + str(head_end))


if __name__ == '__main__':
  task_begin = datetime.now()
  print("phone check begin time:" + str(task_begin))

  # 电信,联通,移动,虚拟运营商
  dx = ['133', '149', '153', '173', '177', '180', '181', '189', '199']
  lt = ['130', '131', '132', '145', '146', '155', '156', '166', '171', '175', '176', '185', '186', '166']
  yd = ['134', '135', '136', '137', '138', '139', '147', '148', '150', '151', '152', '157', '158', '159', '172',
     '178', '182', '183', '184', '187', '188', '198']
  add = ['170']
  all_num = dx + lt + yd + add

  # print(all_num)
  print(len(all_num))

  # 要爬的号码段
  spider = PhoneInfoSpider(all_num)
  spider.requestAllSections()

  task_end = datetime.now()
  print("phone check end time:" + str(task_end))

发现爬取一个号段,共10000次查询,单线程版大概要多1个半小时,太慢了。

多线程版本

# coding:utf-8
import requests
from datetime import datetime
import queue
import threading

threadNum = 32


class MyThread(threading.Thread):
  def __init__(self, func):
    threading.Thread.__init__(self)
    self.func = func

  def run(self):
    self.func()


def requestPhoneInfo():
  global lock
  while True:
    lock.acquire()
    if q.qsize() != 0:
      print("queue size:" + str(q.qsize()))
      p = q.get() # 获得任务
      lock.release()

      middle = str(9999 - q.qsize()).zfill(4)
      phoneNum = phone_head + middle + "0000"
      print("phoneNum:" + phoneNum)

      try:
        url = 'https://tcc.taobao.com/cc/json/mobile_tel_segment.htm?tel=' + phoneNum
        # print(url)
        response = requests.get(url)
        # print(response.text)
        phoneInfoHandler(response.text)
      except Exception as e:
        print(Exception, ":", e)
    else:
      lock.release()
      break


def phoneInfoHandler(textData):
  text = textData.splitlines(True)

  if len(text) >= 9:
    number = text[1].split('\'')[1]
    province = text[2].split('\'')[1]
    mobile_area = text[3].split('\'')[1]
    postcode = text[5].split('\'')[1]
    line = "number:" + number + ",province:" + province + ",mobile_area:" + mobile_area + ",postcode:" + postcode
    line_text = number + "," + province + "," + mobile_area + "," + postcode
    print(line_text)
    # print("province:" + province)

    try:
      f = open('./result.txt', 'a')
      f.write(str(line_text) + '\n')
    except Exception as e:
      print(Exception, ":", e)


if __name__ == '__main__':
  task_begin = datetime.now()
  print("phone check begin time:" + str(task_begin))

  dx = ['133', '149', '153', '173', '177', '180', '181', '189', '199']
  lt = ['130', '131', '132', '145', '155', '156', '166', '171', '175', '176', '185', '186', '166']
  yd = ['134', '135', '136', '137', '138', '139', '147', '150', '151', '152', '157', '158', '159', '172', '178',
     '182', '183', '184', '187', '188', '198']
  all_num = dx + lt + yd
  print(len(all_num))

  for head in all_num:
    head_begin = datetime.now()
    print(head + " begin time:" + str(head_begin))

    q = queue.Queue()
    threads = []
    lock = threading.Lock()

    for p in range(10000):
      q.put(p + 1)

    print(q.qsize())

    for i in range(threadNum):
      middle = str(i).zfill(4)
      global phone_head
      phone_head = head

      thread = MyThread(requestPhoneInfo)
      thread.start()
      threads.append(thread)
    for thread in threads:
      thread.join()

    head_end = datetime.now()
    print(head + " end time:" + str(head_end))

  task_end = datetime.now()
  print("phone check end time:" + str(task_end))

多线程版的1个号码段1000条数据,大概2,3min就好,cpu使用飙升,大概维持在70%左右。

总共40多个号段,爬完大概1,2个小时,总数据41w左右

以上就是本文的全部内容,希望对大家的学习有所帮助,也希望大家多多支持三水点靠木。

Python 相关文章推荐
删除目录下相同文件的python代码(逐级优化)
May 25 Python
Centos Python2 升级到Python3的简单实现
Jun 21 Python
在Python中调用Ping命令,批量IP的方法
Jan 26 Python
Python中使用双下划线防止类属性被覆盖问题
Jun 27 Python
python gensim使用word2vec词向量处理中文语料的方法
Jul 05 Python
django框架创建应用操作示例
Sep 26 Python
python中的split()函数和os.path.split()函数使用详解
Dec 21 Python
python中shell执行知识点
May 06 Python
使用SQLAlchemy操作数据库表过程解析
Jun 10 Python
Django中ORM的基本使用教程
Dec 22 Python
如何利用python和DOS获取wifi密码
Mar 31 Python
PyTorch dropout设置训练和测试模式的实现
May 27 Python
django修改models重建数据库的操作
Mar 31 #Python
Python写捕鱼达人的游戏实现
Mar 31 #Python
Django 多对多字段的更新和插入数据实例
Mar 31 #Python
基于python爬取有道翻译过程图解
Mar 31 #Python
django实现将修改好的新模型写入数据库
Mar 31 #Python
Python urlencode和unquote函数使用实例解析
Mar 31 #Python
Python响应对象text属性乱码解决方案
Mar 31 #Python
You might like
如何对PHP程序中的常见漏洞进行攻击
2006/10/09 PHP
PHP Mysql编程之高级技巧
2008/08/27 PHP
PHP中开发XML应用程序之基础篇 添加节点 删除节点 查询节点 查询节
2010/07/09 PHP
PHP实现打包zip并下载功能
2018/06/12 PHP
Laravel框架中队列和工作(Queues、Jobs)操作实例详解
2020/04/06 PHP
基于jquery的获取mouse坐标插件的实现代码
2010/04/01 Javascript
JS过滤url参数特殊字符的实现方法
2013/12/24 Javascript
javascript设计模式之工厂模式示例讲解
2014/03/04 Javascript
js正则表达式匹配数字字母下划线等
2015/04/14 Javascript
jquery实现华丽的可折角广告代码
2015/09/02 Javascript
JS实现保留n位小数的四舍五入问题示例
2016/08/03 Javascript
Angular.js中ng-if、ng-show和ng-hide的区别介绍
2017/01/20 Javascript
bootstrap输入框组使用方法
2017/02/07 Javascript
.vue文件 加scoped 样式不起作用的解决方法
2018/05/28 Javascript
vue2.0页面前进刷新回退不刷新的实现方法
2018/07/31 Javascript
使用webpack构建应用的方法步骤
2019/03/04 Javascript
微信小程序实现原生步骤条
2019/07/25 Javascript
Vue全局loading及错误提示的思路与实现
2019/08/09 Javascript
layer弹出层自定义提交取消按钮的例子
2019/09/10 Javascript
Python实现扫描局域网活动ip(扫描在线电脑)
2015/04/28 Python
浅谈pandas筛选出表中满足另一个表所有条件的数据方法
2019/02/08 Python
Django之使用celery和NGINX生成静态页面实现性能优化
2019/10/08 Python
Python列表list常用内建函数实例小结
2019/10/22 Python
Pycharm中import torch报错的快速解决方法
2020/03/05 Python
Python 使用双重循环打印图形菱形操作
2020/08/09 Python
Python Opencv实现单目标检测的示例代码
2020/09/08 Python
Html5 web本地存储实例详解
2016/07/28 HTML / CSS
东南亚旅游平台:The Trip Guru
2018/01/01 全球购物
微软中国官方旗舰店:销售Surface、Xbox One、笔记本电脑、Office
2018/07/23 全球购物
添柏岚英国官方网站:Timberland英国
2019/11/28 全球购物
观看《永远的雷锋》心得体会
2014/03/12 职场文书
目标责任书格式
2014/07/28 职场文书
门面房租房协议书
2014/08/20 职场文书
javaScript Array api梳理
2021/03/31 Javascript
python munch库的使用解析
2021/05/25 Python
Golang表示枚举类型的详细讲解
2021/09/04 Golang