Python example: reading specified elements from HTML and generating an Excel file


Posted in Python on April 03, 2014

A script written in Python 2.7 that reads specified elements (the text of td cells) from HTML files and generates an Excel file.
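As a quick orientation, the core step (collecting the text of every td cell) can be sketched in Python 3 using only the standard library. This is a minimal sketch, not the author's method: it swaps BeautifulSoup for html.parser.HTMLParser, the names TdTextCollector and extract_td_text are hypothetical, and it assumes the whole document is parsed at once rather than line by line.

```python
from html.parser import HTMLParser


class TdTextCollector(HTMLParser):
    """Collect the text of every <td> cell, in document order.

    Hypothetical helper for illustration; nested <td> cells are folded
    into the text of the outermost cell.
    """

    def __init__(self):
        super().__init__()
        self.cells = []    # finished cell texts
        self._depth = 0    # greater than 0 while inside a <td>
        self._parts = []   # text fragments of the current cell

    def handle_starttag(self, tag, attrs):
        if tag == "td":
            if self._depth == 0:
                self._parts = []
            self._depth += 1

    def handle_endtag(self, tag):
        if tag == "td" and self._depth > 0:
            self._depth -= 1
            if self._depth == 0:
                self.cells.append("".join(self._parts).strip())

    def handle_data(self, data):
        if self._depth > 0:
            self._parts.append(data)


def extract_td_text(html):
    """Return a list with the stripped text of each <td> in `html`."""
    parser = TdTextCollector()
    parser.feed(html)
    parser.close()
    return parser.cells


sample = "<table><tr><td>Title</td><td> 45 min </td></tr></table>"
print(extract_td_text(sample))  # ['Title', '45 min']
```

Feeding the whole document also picks up cells whose opening and closing tags sit on different lines, which a line-by-line parse can miss. The author's full Python 2.7 script follows.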

#coding=gbk
import string
import codecs
import os,time
import xlwt
from bs4 import BeautifulSoup
class LogMsg:
        def __init__(self,logfile,Level=0):
                try:
                        import logging
                        #self.logger = None
                        self.logger = logging.getLogger()
                        self.hdlr = logging.FileHandler(logfile)
                        formatter = logging.Formatter("[%(asctime)s]: %(message)s","%Y%m%d %H:%M:%S")
                        self.hdlr.setFormatter(formatter)
                        self.logger.addHandler(self.hdlr)
                        #logger.setLevel()
                        if Level == 10:
                                self.logger.setLevel(logging.DEBUG)
                        elif Level == 20:
                                self.logger.setLevel(logging.INFO)
                        elif Level == 30:
                                self.logger.setLevel(logging.WARNING)
                        elif Level == 40:
                                self.logger.setLevel(logging.ERROR)
                        elif Level == 50:
                                self.logger.setLevel(logging.CRITICAL)
                        else:
                                self.logger.setLevel(logging.NOTSET)
                except:
                        print "log init error!"
                        exit(1)
        def output(self,logInfo):
                Level = self.logger.getEffectiveLevel()
                try:
                        if Level == 10:
                                self.logger.debug(logInfo)
                        elif Level == 20:
                                self.logger.info(logInfo)
                        elif Level == 30:
                                self.logger.warning(logInfo)
                        elif Level == 40:
                                self.logger.error(logInfo)
                        elif Level == 50:
                                self.logger.critical(logInfo)
                        else:
                                self.logger.info(logInfo)
                except:
                        print "log output error!"
                        exit(1)
        def close(self):
                try:
                        #logging.shutdown([self.hdlr])
                        self.logger.removeHandler(self.hdlr)
                except:
                        print "log closed error!"
                        exit(1) 
Logtime = time.strftime("%Y%m%d%H%M%S",time.localtime())
logFileTime = time.strftime("%Y%m%d",time.localtime())
Logfile = '/data/pyExample/logs/htmlparser_%s.log' % logFileTime
log = LogMsg(Logfile,20)

DATAPATH = '/data/pyExample/' 
XLSname = 'dangjian_'+Logtime+'.xls'

if __name__ == '__main__':
    
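    wbk = xlwt.Workbook(encoding = 'gbk')
    # Sheet name: "basic content import template"
    sheet = wbk.add_sheet('基本内容导入模板')
    # Header row: content type, category name, category ID, content name,
    # duration, keywords, highlights, author, source, sub-content 1, sub-content 2
    sheet.write(0,0,'内容类型 ')
    sheet.write(0,1,'栏目名称')
    sheet.write(0,2,'栏目编号')
    sheet.write(0,3,'内容名称')
    sheet.write(0,4,'时长')
    sheet.write(0,5,'关键字')
    sheet.write(0,6,'看点')
    sheet.write(0,7,'作者')
    sheet.write(0,8,'来源')
    sheet.write(0,9,'子内容1')
    sheet.write(0,10,'子内容2')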
    xlsContent = []   
    files = os.listdir(DATAPATH)
    k = 0
    for f in files:  
        if os.path.splitext(f)[1] == '.html':
            content=[]
            log.output('current file: '+f)
            htmlFile =codecs.open(DATAPATH+f,'r','gbk')
            lines = htmlFile.readlines()
            if not lines:
                log.output('no lines in file')
            for line in lines:
                # strip() also removes the newline, so a blank line becomes ''
                if not line.strip():
                    log.output('blank line')
                else:
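                    # Note: this removes every space, including spaces inside tags
                    line = line.replace(' ','')
                    # Name the parser explicitly so the result does not depend on what is installed
                    soup  = BeautifulSoup(line, 'html.parser')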
                    for tdd in soup.findAll('td'):  
                        #print tdd.text.encode("gbk")
                        content.append(tdd.text.encode("gbk"))       
                #print line.encode('gbk') 
            htmlFile.close()    
            # enumerate() gives the real position even when cell texts repeat
            for idx, i in enumerate(content):
                print idx,',',i
                log.output(i)
                log.output(idx)
            print '----------------------------------------'
            
            folderName =  content[6]
            contentName=  content[4]       
            duration =    filter(str.isdigit, content[16])  # keep only the digits
            int_duration = int(duration)*60  # presumably minutes to seconds
            str_duration = "%i"%int_duration
            keyWord =     content[6] 
            desciption =  content[36]
            videoName_1 = content[10]
            print folderName
            print contentName
            print str_duration
            print keyWord
            print desciption
            print videoName_1
            log.output('xls row: '+','+folderName+',,'+contentName+','+str_duration+','+keyWord+','+desciption+',管理员,华数编辑,'+videoName_1+',,')
            print k            
            sheet.write(k+1,0,'')
            sheet.write(k+1,1,folderName)
            sheet.write(k+1,2,'')
            sheet.write(k+1,3,contentName)
            sheet.write(k+1,4,str_duration)
            sheet.write(k+1,5,keyWord)
            sheet.write(k+1,6,desciption)
            sheet.write(k+1,7,'管理员')
            sheet.write(k+1,8,'华数编辑')
            sheet.write(k+1,9,videoName_1)
            sheet.write(k+1,10,'')
            k+=1
    wbk.save(DATAPATH + XLSname)        
    print '=========================================' 
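
The field-mapping step above indexes content directly and assumes the duration cell always contains digits, so a page with fewer cells or a non-numeric duration stops the run. The following is a minimal Python 3 sketch of a more defensive version. The helper names minutes_to_seconds, cell and build_row are hypothetical. The indices (4, 6, 10, 16, 36) and the fixed author and source values are taken from the script. Treating the duration as minutes is an assumption inferred from the multiplication by 60.

```python
def minutes_to_seconds(text, default=0):
    """Keep only the ASCII digits in `text`, read them as minutes, return seconds.

    Mirrors filter(str.isdigit, ...) from the script, but returns `default`
    instead of raising when the text contains no digits.
    """
    digits = "".join(ch for ch in text if ch in "0123456789")
    return int(digits) * 60 if digits else default


def cell(cells, index, default=""):
    """Return cells[index], or `default` when the page has fewer cells."""
    return cells[index] if 0 <= index < len(cells) else default


def build_row(cells):
    """Build the 11 spreadsheet columns in the same order the script writes them."""
    folder_name = cell(cells, 6)  # also reused as the keyword, as in the script
    return [
        "",                                        # content type (left empty)
        folder_name,                               # category name
        "",                                        # category ID (left empty)
        cell(cells, 4),                            # content name
        str(minutes_to_seconds(cell(cells, 16))),  # duration in seconds
        folder_name,                               # keyword
        cell(cells, 36),                           # description
        "管理员",                                   # fixed author value from the script
        "华数编辑",                                 # fixed source value from the script
        cell(cells, 10),                           # sub-content 1
        "",                                        # sub-content 2 (left empty)
    ]


print(minutes_to_seconds("45分钟"))  # 2700
print(build_row([]))                 # every missing cell falls back to ""
```

Each value of the returned list could then be passed to sheet.write(row, column, value) in a loop, replacing the eleven separate calls.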