Converting HTML Tables to CSV Files with Python


Posted in Python on June 28, 2015

This article presents a Python example that converts HTML tables into CSV files, shared here for reference. Details follow:

Usage: python html2csv.py *.html
This code uses the HTMLParser module and is written for Python 2.
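
For readers new to it, HTMLParser is event-driven: feed() scans the markup and calls handle_starttag(), handle_endtag() and handle_data() as it meets tags and text. The sketch below is only an illustration of that callback model, written for Python 3, where the module is named html.parser. The TagLogger class and its events list are hypothetical names and are not part of the html2csv script.

```python
from html.parser import HTMLParser


class TagLogger(HTMLParser):
    """Record every parser callback so the order of events is visible."""

    def __init__(self):
        super().__init__()
        self.events = []  # (kind, value) tuples in the order they fired

    def handle_starttag(self, tag, attrs):
        # Tag names arrive lowercased, whatever their case in the source.
        self.events.append(("start", tag))

    def handle_endtag(self, tag):
        self.events.append(("end", tag))

    def handle_data(self, data):
        # Ignore whitespace-only text between tags.
        if data.strip():
            self.events.append(("data", data.strip()))


logger = TagLogger()
logger.feed("<TR><TD>cell 1</TD><TD>cell 2</TD></TR>")
logger.close()
for event in logger.events:
    print(event)
```

The html2csv class in the listing reacts to the same three callbacks, but only for tr and td tags, and builds CSV text instead of a log.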

#!/usr/bin/python
# -*- coding: iso-8859-1 -*-
# Hello, this program is written in Python - http://python.org
programname = 'html2csv - version 2002-09-20 - http://sebsauvage.net'
import sys, getopt, os.path, glob, HTMLParser, re
try:  import psyco ; psyco.jit() # If present, use psyco to accelerate the program
except: pass
def usage(progname):
  ''' Display program usage. '''
  progname = os.path.split(progname)[1]
  if os.path.splitext(progname)[1] in ['.py','.pyc']: progname = 'python '+progname
  return '''%s
A coarse HTML tables to CSV (Comma-Separated Values) converter.
Syntax  : %s source.html
Arguments : source.html is the HTML file you want to convert to CSV.
      By default, the file will be converted to csv with the same
      name and the csv extension (source.html -> source.csv)
      You can use * and ?.
Examples  : %s mypage.html
      : %s *.html
This program is public domain.
Author : Sebastien SAUVAGE <sebsauvage at sebsauvage dot net>
     http://sebsauvage.net
''' % (programname, progname, progname, progname)
class html2csv(HTMLParser.HTMLParser):
  ''' A basic parser which converts HTML tables into CSV.
    Feed HTML with feed(). Get CSV with getCSV(). (See example below.)
    All tables in HTML will be converted to CSV (in the order they occur
    in the HTML file).
    You can process very large HTML files by feeding this class with chunks
    of html while getting chunks of CSV by calling getCSV().
    Should handle badly formatted html (missing <tr>, </tr>, </td>,
    extraneous </td>, </tr>...).
    This parser uses HTMLParser from the HTMLParser module,
    not HTMLParser from the htmllib module.
    Example: parser = html2csv()
         parser.feed( open('mypage.html','rb').read() )
         open('mytables.csv','w+b').write( parser.getCSV() )
    This class is public domain.
    Author: Sébastien SAUVAGE <sebsauvage at sebsauvage dot net>
        http://sebsauvage.net
    Versions:
      2002-09-19 : - First version
      2002-09-20 : - now uses HTMLParser.HTMLParser instead of htmllib.HTMLParser.
            - now parses command-line.
    To do:
      - handle <PRE> tags
      - convert html entities (&name; and &#ref;) to Ascii.
      '''
  def __init__(self):
    HTMLParser.HTMLParser.__init__(self)
    self.CSV = ''   # The CSV data
    self.CSVrow = ''  # The current CSV row being constructed from HTML
    self.inTD = 0   # Used to track if we are inside or outside a <TD>...</TD> tag.
    self.inTR = 0   # Used to track if we are inside or outside a <TR>...</TR> tag.
    self.re_multiplespaces = re.compile(r'\s+') # regular expression used to remove spaces in excess
    self.rowCount = 0 # CSV output line counter.
  def handle_starttag(self, tag, attrs):
    if  tag == 'tr': self.start_tr()
    elif tag == 'td': self.start_td()
  def handle_endtag(self, tag):
    if  tag == 'tr': self.end_tr()
    elif tag == 'td': self.end_td()     
  def start_tr(self):
    if self.inTR: self.end_tr() # <TR> implies </TR>
    self.inTR = 1
  def end_tr(self):
    if self.inTD: self.end_td() # </TR> implies </TD>
    self.inTR = 0      
    if len(self.CSVrow) > 0:
      self.CSV += self.CSVrow[:-1]
      self.CSVrow = ''
    self.CSV += '\n'
    self.rowCount += 1
  def start_td(self):
    if self.inTD: self.end_td() # <TD> implies </TD>
    if not self.inTR: self.start_tr() # <TD> implies <TR>
    self.CSVrow += '"'
    self.inTD = 1
  def end_td(self):
    if self.inTD:
      self.CSVrow += '",' 
      self.inTD = 0
  def handle_data(self, data):
    if self.inTD:
      self.CSVrow += self.re_multiplespaces.sub(' ',data.replace('\t',' ').replace('\n','').replace('\r','').replace('"','""'))
  def getCSV(self,purge=False):
    ''' Get output CSV.
      If purge is true, getCSV() will return all remaining data,
      even if <td> or <tr> are not properly closed.
      (You would typically call getCSV with purge=True when you do not have
      any more HTML to feed and you suspect dirty HTML (unclosed tags).) '''
    if purge and self.inTR: self.end_tr() # This will also end_td and append last CSV row to output CSV.
    dataout = self.CSV[:]
    self.CSV = ''
    return dataout
if __name__ == "__main__":
  try: # Put getopt in place for future usage.
    opts, args = getopt.getopt(sys.argv[1:],None)
  except getopt.GetoptError:
    print usage(sys.argv[0]) # print help information and exit:
    sys.exit(2)
  if len(args) == 0:
    print usage(sys.argv[0]) # print help information and exit:
    sys.exit(2)    
  print programname
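  html_files = []
  for pattern in args: # The shell may already have expanded wildcards into several arguments.
    html_files.extend(glob.glob(pattern))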
  for htmlfilename in html_files:
    outputfilename = os.path.splitext(htmlfilename)[0]+'.csv'
    parser = html2csv()
    print 'Reading %s, writing %s...' % (htmlfilename, outputfilename)
    try:
      htmlfile = open(htmlfilename, 'rb')
      csvfile = open( outputfilename, 'w+b')
      data = htmlfile.read(8192)
      while data:
        parser.feed( data )
        csvfile.write( parser.getCSV() )
        sys.stdout.write('%d CSV rows written.\r' % parser.rowCount)
        data = htmlfile.read(8192)
      csvfile.write( parser.getCSV(True) )
      csvfile.close()
      htmlfile.close()
    except:
      print 'Error converting %s    ' % htmlfilename
      try:  htmlfile.close()
      except: pass
      try:  csvfile.close()
      except: pass
  print 'All done. '
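
The listing above targets Python 2 (the HTMLParser module name and print statements). As a rough sketch of how the same approach might look in Python 3, the code below uses html.parser and leaves quoting to the csv module. It is not the original author's code: the names TableToCSV and get_csv are hypothetical, th cells are handled in addition to td, character entities are decoded by the parser, and whitespace inside a cell is collapsed to single spaces, which differs slightly from the original's handling of newlines.

```python
import csv
import io
from html.parser import HTMLParser


class TableToCSV(HTMLParser):
    """Collect table cells row by row and hand them out as CSV text."""

    def __init__(self):
        # convert_charrefs=True makes the parser decode entities such as &amp;
        super().__init__(convert_charrefs=True)
        self.rows = []    # completed rows not yet returned by get_csv()
        self.row = None   # cells of the row being built, None outside a row
        self.cell = None  # text fragments of the open cell, None outside a cell

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._end_row()       # <tr> implies </tr>
            self.row = []
        elif tag in ("td", "th"):
            self._end_cell()      # <td> implies </td>
            if self.row is None:  # <td> implies <tr>
                self.row = []
            self.cell = []

    def handle_endtag(self, tag):
        if tag == "tr":
            self._end_row()
        elif tag in ("td", "th"):
            self._end_cell()

    def handle_data(self, data):
        if self.cell is not None:
            self.cell.append(data)

    def _end_cell(self):
        if self.cell is not None:
            # Join the fragments and collapse runs of whitespace.
            self.row.append(" ".join("".join(self.cell).split()))
            self.cell = None

    def _end_row(self):
        self._end_cell()          # </tr> implies </td>
        if self.row is not None:
            self.rows.append(self.row)
            self.row = None

    def get_csv(self, purge=False):
        """Return CSV for the rows completed so far and forget them.

        With purge=True, pending input is flushed and an unclosed row is
        emitted as well.
        """
        if purge:
            self.close()
            self._end_row()
        out = io.StringIO()
        csv.writer(out, lineterminator="\n").writerows(self.rows)
        self.rows = []
        return out.getvalue()


# Feed the markup in two chunks, split in the middle of a tag.
parser = TableToCSV()
parser.feed("<table><tr><td>name</td><t")
parser.feed('d>say "hi"</td></tr><tr><td>1,5')
print(parser.get_csv(purge=True), end="")
```

As in the original, completed rows can be collected after each feed() call, so a large file can be processed chunk by chunk; a final get_csv(purge=True) picks up rows whose tags were never closed.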

I hope this article helps you with your Python programming.
