Recognizing a website's CAPTCHA with pytesseract in Python


Posted in Python on June 06, 2016

I. Introduction to pytesseract

1. About pytesseract

The latest pytesseract release at the time of writing is 0.1.6; project page: https://pypi.python.org/pypi/pytesseract

Python-tesseract is a wrapper for google's Tesseract-OCR
( http://code.google.com/p/tesseract-ocr/ ). It is also useful as a
stand-alone invocation script to tesseract, as it can read all image types
supported by the Python Imaging Library, including jpeg, png, gif, bmp, tiff,
and others, whereas tesseract-ocr by default only supports tiff and bmp.
Additionally, if used as a script, Python-tesseract will print the recognized
text in stead of writing it to a file. Support for confidence estimates and
bounding box data is planned for future releases.

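In short:

a. Python-tesseract is a wrapper around Google's Tesseract-OCR, and it can also be run as a stand-alone script;

b. Python-tesseract recognizes the text in an image file and hands the result back to the caller (or prints it when run as a script) instead of writing it to a file;

c. Because Python-tesseract loads images through PIL, it can read jpeg, gif, png and other formats, whereas tesseract-ocr by default only supports tiff and bmp;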

2. Installing pytesseract

INSTALLATION:

Prerequisites:
* Python-tesseract requires python 2.5 or later or python 3.
* You will need the Python Imaging Library (PIL). Under Debian/Ubuntu, this is
the package "python-imaging" or "python3-imaging" for python3.
* Install google tesseract-ocr from http://code.google.com/p/tesseract-ocr/ .
You must be able to invoke the tesseract command as "tesseract". If this
isn't the case, for example because tesseract isn't in your PATH, you will
have to change the "tesseract_cmd" variable at the top of 'tesseract.py'.
Under Debian/Ubuntu you can use the package "tesseract-ocr".
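
As an aside to the prerequisite above, whether the tesseract command can be invoked as "tesseract" can be checked from Python before going further. This is only a minimal sketch: find_command is a hypothetical helper name, and it assumes Python 3.3 or later, where shutil.which is available.

```python
import shutil


def find_command(name="tesseract", path=None):
    """Return the full path of `name` if it is found on PATH, else None.

    Hypothetical helper; `path` optionally overrides the search path.
    """
    return shutil.which(name, path=path)


if __name__ == "__main__":
    location = find_command()
    if location is None:
        print("tesseract not found on PATH; set tesseract_cmd to its full path")
    else:
        print("tesseract found at %s" % location)
```

If the lookup fails, the README's advice applies: point the tesseract_cmd variable at the executable's full path.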

Installing via pip:

See the [pytesseract package page](https://pypi.python.org/pypi/pytesseract)
```
$> sudo pip install pytesseract
```

In other words:

a. Python-tesseract supports Python 2.5 and later;

b. Python-tesseract needs PIL (Python Imaging Library) installed so that it can read more image formats;

c. Python-tesseract needs the tesseract-ocr package to be installed.

Putting this together, pytesseract works as follows:

1. As mentioned in the previous post, running tesseract.exe 1.png output -l eng on the command line recognizes the text in 1.png and writes the result to output.txt;

2. pytesseract wraps that process: it invokes tesseract.exe automatically, reads the contents of output.txt, and returns them as the function's return value.

II. Using pytesseract

USAGE:
```
> try:
>   import Image
> except ImportError:
>   from PIL import Image
> import pytesseract
> print(pytesseract.image_to_string(Image.open('test.png')))
> print(pytesseract.image_to_string(Image.open('test-european.jpg'), lang='fra'))
```

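As you can see:

1. The core of the code is the image_to_string function, which also accepts a language (the -l eng option) and a page segmentation mode (the -psm option).

Usage:

image_to_string(Image.open('test.png'), lang="eng", config="-psm 7")

2. pytesseract uses Image internally, which is why PIL is required; tesseract.exe itself can in fact read jpeg, png and other image formats.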
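
To make the lang and config parameters concrete, here is a minimal sketch of how such arguments can be turned into a tesseract command line. It mirrors the command assembly in run_tesseract from the pytesseract.py listed later in this post; the name build_tesseract_command is hypothetical and is not part of pytesseract.

```python
import shlex


def build_tesseract_command(input_filename, output_base, lang=None,
                            config=None, tesseract_cmd="tesseract"):
    """Assemble the argument list for a tesseract call (hypothetical helper)."""
    command = [tesseract_cmd, input_filename, output_base]
    if lang is not None:
        # lang="eng" becomes the two arguments "-l", "eng"
        command += ["-l", lang]
    if config:
        # config="-psm 7" is split into separate arguments
        command += shlex.split(config)
    return command


if __name__ == "__main__":
    print(build_tesseract_command("1.png", "output", lang="eng", config="-psm 7"))
```

The resulting list corresponds to typing tesseract 1.png output -l eng -psm 7 by hand. Newer Tesseract releases may spell the segmentation option differently, so check tesseract --help for the installed version.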

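Sample code that recognizes the CAPTCHA of a public website (please don't use it for anything malicious; after thinking it over I decided to hide the site's domain, so try it on some other site instead...):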

# -*- coding: utf-8 -*-
__author__='zhongtang'

import urllib
import urllib2
import cookielib
import gzip
import socket
import StringIO
import random
import os
import re
import htmltool  # the author's own helper module, not included in this post
from pytesseract import image_to_string
from PIL import Image
from PIL import ImageEnhance

class orclnypcg:
  def __init__(self):
    self.baseUrl='http://jbywcg.****.com.cn'
    self.ht=htmltool.htmltool()
    self.curPath=self.ht.getPyFileDir()
    self.authCode=''

  def initUrllib2(self):
    # Install a global opener that keeps cookies and sends a browser User-Agent
    cookie = cookielib.CookieJar()
    cookieHandLer = urllib2.HTTPCookieProcessor(cookie)
    httpHandLer=urllib2.HTTPHandler(debuglevel=0)
    httpsHandLer=urllib2.HTTPSHandler(debuglevel=0)
    opener = urllib2.build_opener(cookieHandLer,httpHandLer,httpsHandLer)
    opener.addheaders = [('User-Agent','Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.11 (KHTML, like Gecko) Chrome/23.0.1271.64 Safari/537.11')]
    urllib2.install_opener(opener)

  def urllib2Navigate(self,url,data=None):
    # Fetch a URL, retrying on network errors (at most 20 attempts).
    # Returns (body, headers), or ('', None) if every attempt failed.
    tryTimes = 0
    while tryTimes < 20:
      tryTimes = tryTimes + 1
      try:
        if data:
          req = urllib2.Request(url,urllib.urlencode(data))
        else:
          req = urllib2.Request(url)
        response = urllib2.urlopen(req)
        bodydata = response.read()
        headerdata = response.info()
        if headerdata.get('Content-Encoding')=='gzip':
          # Decompress gzip-encoded responses
          rdata = StringIO.StringIO(bodydata)
          gz = gzip.GzipFile(fileobj=rdata)
          bodydata = gz.read()
          gz.close()
        return bodydata,headerdata
      except urllib2.HTTPError, e:
        print 'HTTPError[%s]\n' %e.code
      except urllib2.URLError, e:
        print 'URLError[%s]\n' %e.reason
      except socket.error:
        print u"Connection failed, retrying"
    print u"Still unable to connect after many attempts, giving up"
    return '',None

  def randomCodeOcr(self,filename):
    image = Image.open(filename)
    # ImageEnhance can be used to improve the recognition rate
    #enhancer = ImageEnhance.Contrast(image)
    #enhancer = enhancer.enhance(4)
    image = image.convert('L')  # convert to grayscale
    ltext = image_to_string(image)
    # Strip invalid characters, keeping only letters and digits
    # (the original pattern "\W" would also have kept underscores)
    ltext=re.sub(r"[^0-9A-Za-z]", "", ltext)
    print u'[%s] recognized CAPTCHA: [%s]!!!' %(filename,ltext)
    # Overwrite the downloaded file with the grayscale version
    image.save(filename)
    return ltext

  def getRandomCode(self):
    # Start fetching CAPTCHA images
    # http://jbywcg.****.com.cn/CommonPage/Code.aspx?0.9409255818463862
    i = 0
    while ( i<=100):
      i += 1
      # Build the CAPTCHA URL
      randomUrlNew='%s/CommonPage/Code.aspx?%s' %(self.baseUrl,random.random())
      # Build the local file name for the CAPTCHA image
      filename= '%s.png' %(i)
      filename= os.path.join(self.curPath,filename)
      jpgdata,jpgheader = self.urllib2Navigate(randomUrlNew)
      if len(jpgdata)<= 0 :
        print u'Failed to fetch the CAPTCHA!\n'
        return False
      f = open(filename, 'wb')
      f.write(jpgdata)
      f.close()
      self.authCode = self.randomCodeOcr(filename)
    return True


# Main program starts here
orcln=orclnypcg()
orcln.initUrllib2()
orcln.getRandomCode()
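
Two ideas in this script do not depend on the target site: retrying a flaky operation a bounded number of times, and stripping everything except letters and digits from the OCR result. The sketch below isolates both using only the standard library. The names retry_call and clean_ocr_text are hypothetical, and the flaky function merely simulates a network call.

```python
import re


def retry_call(func, max_tries=20, exceptions=(IOError, OSError)):
    """Call func() until it succeeds or max_tries attempts have failed.

    Returns func()'s result, or None if every attempt raised one of
    `exceptions`. Hypothetical helper, not part of the script above.
    """
    for attempt in range(1, max_tries + 1):
        try:
            return func()
        except exceptions as exc:
            print("attempt %d failed: %s" % (attempt, exc))
    return None


def clean_ocr_text(text):
    """Keep only ASCII letters and digits (underscores are removed too)."""
    return re.sub(r"[^0-9A-Za-z]", "", text)


if __name__ == "__main__":
    calls = {"count": 0}

    def flaky():
        # Simulated download: fails twice, then returns fake OCR text
        calls["count"] += 1
        if calls["count"] < 3:
            raise IOError("simulated network error")
        return " a_B 3-d\n"

    print(clean_ocr_text(retry_call(flaky)))
```

Counting every attempt, including failed ones, is what keeps the loop bounded. A counter that only advances on success would let persistent network errors retry forever.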

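III. Tweaking the pytesseract code

When the program above runs on Windows, a black console window flashes by on every call, which is not very friendly.

I made a small change to pytesseract.py (under C:\Python27\Lib\site-packages\pytesseract) to hide that window.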

# modified by zhongtang hide console window
# new code
startupinfo = None  # stays None on non-Windows platforms
IS_WIN32 = 'win32' in str(sys.platform).lower()
if IS_WIN32:
   startupinfo = subprocess.STARTUPINFO()
   startupinfo.dwFlags |= subprocess.STARTF_USESHOWWINDOW
   startupinfo.wShowWindow = subprocess.SW_HIDE
proc = subprocess.Popen(command,
     stderr=subprocess.PIPE,startupinfo=startupinfo)
'''
# old code
proc = subprocess.Popen(command,
   stderr=subprocess.PIPE)
'''
# modified end
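
The same idea can be tried outside pytesseract. The following sketch wraps the startupinfo setup in a hypothetical helper, run_hidden, which leaves startupinfo as None on non-Windows platforms so that the call works everywhere. It launches the Python interpreter itself as a stand-in for tesseract.exe.

```python
import subprocess
import sys


def run_hidden(command):
    """Run command, hiding its console window on Windows (hypothetical helper).

    Returns (exit status, captured stderr bytes).
    """
    startupinfo = None  # non-Windows platforms need no special handling
    if sys.platform.startswith("win"):
        startupinfo = subprocess.STARTUPINFO()
        startupinfo.dwFlags |= subprocess.STARTF_USESHOWWINDOW
        startupinfo.wShowWindow = subprocess.SW_HIDE
    proc = subprocess.Popen(command, stderr=subprocess.PIPE,
                            startupinfo=startupinfo)
    # communicate() drains stderr while waiting for the child to exit
    _, stderr_data = proc.communicate()
    return proc.returncode, stderr_data


if __name__ == "__main__":
    status, err = run_hidden([sys.executable, "-c", "import sys; sys.exit(0)"])
    print(status)
```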

For the convenience of beginners, here is the full pytesseract.py as well; experienced readers can skip it.

#!/usr/bin/env python
'''
Python-tesseract is an optical character recognition (OCR) tool for python.
That is, it will recognize and "read" the text embedded in images.

Python-tesseract is a wrapper for google's Tesseract-OCR
( http://code.google.com/p/tesseract-ocr/ ). It is also useful as a
stand-alone invocation script to tesseract, as it can read all image types
supported by the Python Imaging Library, including jpeg, png, gif, bmp, tiff,
and others, whereas tesseract-ocr by default only supports tiff and bmp.
Additionally, if used as a script, Python-tesseract will print the recognized
text in stead of writing it to a file. Support for confidence estimates and
bounding box data is planned for future releases.


USAGE:
```
 > try:
 >   import Image
 > except ImportError:
 >   from PIL import Image
 > import pytesseract
 > print(pytesseract.image_to_string(Image.open('test.png')))
 > print(pytesseract.image_to_string(Image.open('test-european.jpg'), lang='fra'))
```

INSTALLATION:

Prerequisites:
* Python-tesseract requires python 2.5 or later or python 3.
* You will need the Python Imaging Library (PIL). Under Debian/Ubuntu, this is
 the package "python-imaging" or "python3-imaging" for python3.
* Install google tesseract-ocr from http://code.google.com/p/tesseract-ocr/ .
 You must be able to invoke the tesseract command as "tesseract". If this
 isn't the case, for example because tesseract isn't in your PATH, you will
 have to change the "tesseract_cmd" variable at the top of 'tesseract.py'.
 Under Debian/Ubuntu you can use the package "tesseract-ocr".
 
Installing via pip:  
See the [pytesseract package page](https://pypi.python.org/pypi/pytesseract)   
$> sudo pip install pytesseract  

Installing from source:  
$> git clone git@github.com:madmaze/pytesseract.git  
$> sudo python setup.py install  


LICENSE:
Python-tesseract is released under the GPL v3.

CONTRIBUTERS:
- Originally written by [Samuel Hoffstaetter](https://github.com/hoffstaetter) 
- [Juarez Bochi](https://github.com/jbochi)
- [Matthias Lee](https://github.com/madmaze)
- [Lars Kistner](https://github.com/Sr4l)

'''

# CHANGE THIS IF TESSERACT IS NOT IN YOUR PATH, OR IS NAMED DIFFERENTLY
tesseract_cmd = 'tesseract'

try:
  import Image
except ImportError:
  from PIL import Image
import subprocess
import sys
import tempfile
import os
import shlex

__all__ = ['image_to_string']

def run_tesseract(input_filename, output_filename_base, lang=None, boxes=False, config=None):
  '''
  runs the command:
    `tesseract_cmd` `input_filename` `output_filename_base`
  
  returns the exit status of tesseract, as well as tesseract's stderr output

  '''
  command = [tesseract_cmd, input_filename, output_filename_base]
  
  if lang is not None:
    command += ['-l', lang]

  if boxes:
    command += ['batch.nochop', 'makebox']
    
  if config:
    command += shlex.split(config)
    
  # modified by zhongtang hide console window
  # new code
  startupinfo = None  # stays None on non-Windows platforms
  IS_WIN32 = 'win32' in str(sys.platform).lower()
  if IS_WIN32:
    startupinfo = subprocess.STARTUPINFO()
    startupinfo.dwFlags |= subprocess.STARTF_USESHOWWINDOW
    startupinfo.wShowWindow = subprocess.SW_HIDE
  proc = subprocess.Popen(command,
      stderr=subprocess.PIPE,startupinfo=startupinfo)
  '''
  # old code
  proc = subprocess.Popen(command,
      stderr=subprocess.PIPE)
  '''
  # modified end
  
  return (proc.wait(), proc.stderr.read())

def cleanup(filename):
  ''' tries to remove the given filename. Ignores non-existent files '''
  try:
    os.remove(filename)
  except OSError:
    pass

def get_errors(error_string):
  '''
  returns all lines in the error_string that contain the string "Error"

  '''

  lines = error_string.splitlines()
  error_lines = tuple(line for line in lines if line.find('Error') >= 0)
  if len(error_lines) > 0:
    return '\n'.join(error_lines)
  else:
    return error_string.strip()

def tempnam():
  ''' returns a temporary file-name '''
  tmpfile = tempfile.NamedTemporaryFile(prefix="tess_")
  return tmpfile.name

class TesseractError(Exception):
  def __init__(self, status, message):
    self.status = status
    self.message = message
    self.args = (status, message)

def image_to_string(image, lang=None, boxes=False, config=None):
  '''
  Runs tesseract on the specified image. First, the image is written to disk,
  and then the tesseract command is run on the image. Tesseract's result is
  read, and the temporary files are erased.
  
  also supports boxes and config.
  
  if boxes=True
    "batch.nochop makebox" gets added to the tesseract call
  if config is set, the config gets appended to the command.
    ex: config="-psm 6"

  '''

  if len(image.split()) == 4:
    # In case we have 4 channels, lets discard the Alpha.
    # Kind of a hack, should fix in the future some time.
    r, g, b, a = image.split()
    image = Image.merge("RGB", (r, g, b))
  
  input_file_name = '%s.bmp' % tempnam()
  output_file_name_base = tempnam()
  if not boxes:
    output_file_name = '%s.txt' % output_file_name_base
  else:
    output_file_name = '%s.box' % output_file_name_base
  try:
    image.save(input_file_name)
    status, error_string = run_tesseract(input_file_name,
                       output_file_name_base,
                       lang=lang,
                       boxes=boxes,
                       config=config)
    if status:
      #print 'test' , status,error_string
      errors = get_errors(error_string)
      raise TesseractError(status, errors)
    f = open(output_file_name)
    try:
      return f.read().strip()
    finally:
      f.close()
  finally:
    cleanup(input_file_name)
    cleanup(output_file_name)

def main():
  if len(sys.argv) == 2:
    filename = sys.argv[1]
    try:
      image = Image.open(filename)
      if len(image.split()) == 4:
        # In case we have 4 channels, lets discard the Alpha.
        # Kind of a hack, should fix in the future some time.
        r, g, b, a = image.split()
        image = Image.merge("RGB", (r, g, b))
    except IOError:
      sys.stderr.write('ERROR: Could not open file "%s"\n' % filename)
      exit(1)
    print(image_to_string(image))
  elif len(sys.argv) == 4 and sys.argv[1] == '-l':
    lang = sys.argv[2]
    filename = sys.argv[3]
    try:
      image = Image.open(filename)
    except IOError:
      sys.stderr.write('ERROR: Could not open file "%s"\n' % filename)
      exit(1)
    print(image_to_string(image, lang=lang))
  else:
    sys.stderr.write('Usage: python pytesseract.py [-l language] input_file\n')
    exit(2)

if __name__ == '__main__':
  main()
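
One fragile spot in the listing is tempnam(): it returns the name of a NamedTemporaryFile whose file is removed once the object is discarded (immediately, in CPython), so the name is only probably unused when tesseract later writes to it. A possible alternative, sketched here with a hypothetical temp_workdir helper and not taken from pytesseract, is to create a private temporary directory and place the input and output files inside it.

```python
import contextlib
import os
import shutil
import tempfile


@contextlib.contextmanager
def temp_workdir(prefix="tess_"):
    """Yield a private temporary directory and remove it afterwards."""
    path = tempfile.mkdtemp(prefix=prefix)
    try:
        yield path
    finally:
        shutil.rmtree(path, ignore_errors=True)


if __name__ == "__main__":
    with temp_workdir() as workdir:
        input_file_name = os.path.join(workdir, "input.bmp")
        output_file_name_base = os.path.join(workdir, "output")
        # tesseract would read input.bmp and write output.txt in this directory
        with open(input_file_name, "wb") as f:
            f.write(b"placeholder")
        print(os.path.exists(input_file_name))
    print(os.path.exists(workdir))
```

Removing the whole directory at the end also takes care of the per-file cleanup() calls.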

That's all...
