Python crawler example: home-location lookup for the first 7 digits of mobile numbers


Posted in Python on March 31, 2020

Requirements analysis

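The project needs the first 7 digits of mobile numbers, both to check whether a number is valid and to look up its home location. The existing data is several years old and out of date, so I decided to crawl a fresh copy with a Python crawler.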
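
As a rough illustration of the validity check described above, here is a minimal sketch. It assumes a number counts as valid when it has 11 ASCII digits and its first 7 digits appear in the crawled data. The function name `is_known_number` and the sample prefixes are hypothetical, not part of the original project.

```python
def is_known_number(number, known_prefixes):
    """Return True if `number` has 11 ASCII digits and a known 7-digit prefix."""
    return (
        len(number) == 11
        and number.isascii()
        and number.isdigit()
        and number[:7] in known_prefixes
    )


# Hypothetical sample data; the real set would be built from the crawl results.
sample_prefixes = {"1380000", "1380001"}

print(is_known_number("13800001234", sample_prefixes))  # True
print(is_known_number("13899991234", sample_prefixes))  # False: unknown prefix
print(is_known_number("1380000123", sample_prefixes))   # False: only 10 digits
```

A set keeps each membership test cheap even with several hundred thousand prefixes.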

Single-threaded version

# coding:utf-8
import requests
from datetime import datetime


class PhoneInfoSpider:
  def __init__(self, phoneSections):
    self.phoneSections = phoneSections

  def phoneInfoHandler(self, textData):
    # The handler expects one field per line, with each value in single quotes.
    text = textData.splitlines(True)

    if len(text) >= 9:
      # Fields are picked by line position; the names are the author's labels.
      number = text[1].split('\'')[1]
      province = text[2].split('\'')[1]
      mobile_area = text[3].split('\'')[1]
      postcode = text[5].split('\'')[1]
      line_text = number + "," + province + "," + mobile_area + "," + postcode
      print(line_text)

      try:
        # Append one CSV line per number; "with" closes the file after the write.
        with open('./result.txt', 'a', encoding='utf-8') as f:
          f.write(line_text + '\n')
      except Exception as e:
        print(type(e).__name__, ":", e)

  def requestPhoneInfo(self, phoneNum):
    try:
      url = 'https://tcc.taobao.com/cc/json/mobile_tel_segment.htm?tel=' + phoneNum
      # A timeout keeps one stalled request from blocking the whole crawl.
      response = requests.get(url, timeout=10)
      self.phoneInfoHandler(response.text)
    except Exception as e:
      print(type(e).__name__, ":", e)

  def requestAllSections(self):
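    # "last" lets the crawl resume from the number reached before an abnormal exit
    last = 0
    # last = 4
    # Generate the phone numbers automatically, padding the last four digits with zeros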
    for head in self.phoneSections:
      head_begin = datetime.now()
      print(head + " begin time:" + str(head_begin))

      # Full run: for i in range(last, 10000):
      for i in range(last, 10):  # shortened range for a trial run
        middle = str(i).zfill(4)
        phoneNum = head + middle + "0000"
        self.requestPhoneInfo(phoneNum)
      last = 0

      head_end = datetime.now()
      print(head + " end time:" + str(head_end))


if __name__ == '__main__':
  task_begin = datetime.now()
  print("phone check begin time:" + str(task_begin))

  # China Telecom, China Unicom, China Mobile, virtual operators
  dx = ['133', '149', '153', '173', '177', '180', '181', '189', '199']
  lt = ['130', '131', '132', '145', '146', '155', '156', '166', '171', '175', '176', '185', '186']
  yd = ['134', '135', '136', '137', '138', '139', '147', '148', '150', '151', '152', '157', '158', '159', '172',
     '178', '182', '183', '184', '187', '188', '198']
  add = ['170']
  all_num = dx + lt + yd + add

  # print(all_num)
  print(len(all_num))

  # Number segments to crawl
  spider = PhoneInfoSpider(all_num)
  spider.requestAllSections()

  task_end = datetime.now()
  print("phone check end time:" + str(task_end))

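Crawling one number segment takes 10,000 queries in total. The single-threaded version needs roughly an hour and a half or more for that, which is far too slow.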
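
To put that figure in perspective, here is a small back-of-the-envelope calculation. It assumes exactly 1.5 hours per segment, and 40 segments is a round illustrative figure rather than the exact count in the script. The helper names are hypothetical.

```python
def seconds_per_query(segment_hours, queries_per_segment):
    """Average time spent on one request, in seconds."""
    return segment_hours * 3600 / queries_per_segment


def total_hours(segments, segment_hours):
    """Time to crawl all segments one after another, in hours."""
    return segments * segment_hours


per_query = seconds_per_query(1.5, 10000)
print(f"{per_query:.2f} s per query")                        # 0.54 s per query
print(f"{total_hours(40, 1.5):.0f} hours for 40 segments")  # 60 hours for 40 segments
```

Each request spends most of its time waiting on the network, which is why running many requests in parallel helps so much.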

Multi-threaded version

# coding:utf-8
import requests
from datetime import datetime
import queue
import threading

threadNum = 32


class MyThread(threading.Thread):
  def __init__(self, func):
    threading.Thread.__init__(self)
    self.func = func

  def run(self):
    self.func()


def requestPhoneInfo():
  while True:
    lock.acquire()
    if q.qsize() != 0:
      print("queue size:" + str(q.qsize()))
      p = q.get() # take one task
      lock.release()

      # Derive the middle digits from the task itself; reading q.qsize() here
      # would race with the other threads and could repeat or skip numbers.
      middle = str(p - 1).zfill(4)
      phoneNum = phone_head + middle + "0000"
      print("phoneNum:" + phoneNum)

      try:
        url = 'https://tcc.taobao.com/cc/json/mobile_tel_segment.htm?tel=' + phoneNum
        # A timeout keeps one stalled request from blocking a worker thread forever.
        response = requests.get(url, timeout=10)
        phoneInfoHandler(response.text)
      except Exception as e:
        print(type(e).__name__, ":", e)
    else:
      lock.release()
      break


def phoneInfoHandler(textData):
  # The handler expects one field per line, with each value in single quotes.
  text = textData.splitlines(True)

  if len(text) >= 9:
    # Fields are picked by line position; the names are the author's labels.
    number = text[1].split('\'')[1]
    province = text[2].split('\'')[1]
    mobile_area = text[3].split('\'')[1]
    postcode = text[5].split('\'')[1]
    line_text = number + "," + province + "," + mobile_area + "," + postcode
    print(line_text)

    try:
      # Append one CSV line per number; "with" closes the file after the write.
      with open('./result.txt', 'a', encoding='utf-8') as f:
        f.write(line_text + '\n')
    except Exception as e:
      print(type(e).__name__, ":", e)


if __name__ == '__main__':
  task_begin = datetime.now()
  print("phone check begin time:" + str(task_begin))

  dx = ['133', '149', '153', '173', '177', '180', '181', '189', '199']
  lt = ['130', '131', '132', '145', '155', '156', '166', '171', '175', '176', '185', '186']
  yd = ['134', '135', '136', '137', '138', '139', '147', '150', '151', '152', '157', '158', '159', '172', '178',
     '182', '183', '184', '187', '188', '198']
  all_num = dx + lt + yd
  print(len(all_num))

  for head in all_num:
    head_begin = datetime.now()
    print(head + " begin time:" + str(head_begin))

    q = queue.Queue()
    threads = []
    lock = threading.Lock()

    for p in range(10000):
      q.put(p + 1)

    print(q.qsize())

    # All worker threads of this round share the same segment head.
    phone_head = head

    for i in range(threadNum):
      thread = MyThread(requestPhoneInfo)
      thread.start()
      threads.append(thread)
    for thread in threads:
      thread.join()

    head_end = datetime.now()
    print(head + " end time:" + str(head_end))

  task_end = datetime.now()
  print("phone check end time:" + str(task_end))

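With the multi-threaded version, one number segment (10,000 queries) finishes in about 2 to 3 minutes. CPU usage spikes and then stays at around 70%.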
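
The hand-written queue, lock and Thread subclass could also be expressed with the standard library's `concurrent.futures.ThreadPoolExecutor`. The following is a minimal sketch under that assumption, not the code used for the crawl. `crawl_segment` and `fake_fetch` are hypothetical names, and `fake_fetch` stands in for the real HTTP request so the example runs offline.

```python
from concurrent.futures import ThreadPoolExecutor


def crawl_segment(head, fetch, max_workers=32, count=10000):
    """Call fetch(number) for every probe number of one segment head.

    Each probe number is head + 4-digit middle + "0000". `fetch` is supplied
    by the caller and should return one result line, or None on failure.
    """
    numbers = [head + str(i).zfill(4) + "0000" for i in range(count)]
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map yields results in the same order as `numbers`.
        results = list(pool.map(fetch, numbers))
    return [r for r in results if r is not None]


def fake_fetch(number):
    # Stand-in for the real request; returns the 7-digit prefix and a dummy field.
    return number[:7] + ",demo"


if __name__ == "__main__":
    print(crawl_segment("138", fake_fetch, max_workers=4, count=3))
    # ['1380000,demo', '1380001,demo', '1380002,demo']
```

Because every task carries its own number, no shared counter or explicit lock is needed, and the results can be written to the file from a single place after the pool finishes.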

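There are just over 40 number segments in total. Crawling all of them takes about 1 to 2 hours and yields roughly 410,000 records.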
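
Once result.txt exists, it could be loaded into a dictionary for lookups. The sketch below assumes each line has the four comma-separated columns the crawler writes (number, province, mobile_area, postcode) and that the first column is the 7-digit prefix. `load_lookup`, `lookup` and the sample rows are hypothetical.

```python
import csv
import io


def load_lookup(lines):
    """Build {prefix: (province, mobile_area, postcode)} from CSV lines."""
    table = {}
    for row in csv.reader(lines):
        if len(row) != 4:
            continue  # skip blank or malformed lines
        number, province, mobile_area, postcode = row
        table[number] = (province, mobile_area, postcode)
    return table


def lookup(table, phone):
    """Return the record for the first 7 digits of `phone`, or None."""
    return table.get(phone[:7])


# Made-up rows in the same column order as result.txt.
sample = io.StringIO(
    "1380000,ProvinceA,AreaA,000001\n"
    "1380001,ProvinceB,AreaB,000002\n"
)
table = load_lookup(sample)
print(lookup(table, "13800011234"))  # ('ProvinceB', 'AreaB', '000002')
print(lookup(table, "13999991234"))  # None
```

For the real file, pass an open file object instead of the sample, for example `with open('result.txt', encoding='utf-8') as f: table = load_lookup(f)`.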

