Python example: reading specified elements from HTML and generating an Excel file


Posted in Python on April 03, 2014

A script written in Python 2.7 that reads specified elements (the table cells) from HTML files and generates an Excel file from them.

#coding=gbk
import string
import codecs
import os, time
import xlwt
from bs4 import BeautifulSoup
class LogMsg:
    """Thin wrapper around a file-based logger."""
    def __init__(self, logfile, Level=0):
        try:
            import logging
            self.logger = logging.getLogger()
            self.hdlr = logging.FileHandler(logfile)
            formatter = logging.Formatter("[%(asctime)s]: %(message)s", "%Y%m%d %H:%M:%S")
            self.hdlr.setFormatter(formatter)
            self.logger.addHandler(self.hdlr)
            # Accept only the standard numeric levels; anything else means NOTSET
            if Level in (logging.DEBUG, logging.INFO, logging.WARNING,
                         logging.ERROR, logging.CRITICAL):
                self.logger.setLevel(Level)
            else:
                self.logger.setLevel(logging.NOTSET)
        except Exception:
            print "log init error!"
            exit(1)

    def output(self, logInfo):
        # Log at the logger's own effective level so the message is never filtered out
        Level = self.logger.getEffectiveLevel()
        try:
            if Level in (10, 20, 30, 40, 50):
                self.logger.log(Level, logInfo)
            else:
                self.logger.info(logInfo)
        except Exception:
            print "log output error!"
            exit(1)

    def close(self):
        try:
            self.logger.removeHandler(self.hdlr)
            self.hdlr.close()  # also release the log file
        except Exception:
            print "log closed error!"
            exit(1)

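# Timestamps used in the workbook name and the log file name
Logtime = time.strftime("%Y%m%d%H%M%S", time.localtime())
logFileTime = time.strftime("%Y%m%d", time.localtime())
Logfile = '/data/pyExample/logs/htmlparser_%s.log' % logFileTime
log = LogMsg(Logfile, 20)  # 20 is logging.INFO

# Directory that is scanned for .html files; the .xls file is saved here too
DATAPATH = '/data/pyExample/'
XLSname = 'dangjian_' + Logtime + '.xls'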

if __name__ == '__main__':
    wbk = xlwt.Workbook(encoding='gbk')
    # Sheet name: "basic content import template"
    sheet = wbk.add_sheet('基本内容导入模板')
    # Header row: content type, column name, column ID, content name, duration,
    # keywords, highlights, author, source, sub-content 1, sub-content 2
    headers = ['内容类型 ', '栏目名称', '栏目编号', '内容名称', '时长', '关键字',
               '看点', '作者', '来源', '子内容1', '子内容2']
    for col, header in enumerate(headers):
        sheet.write(0, col, header)

    k = 0  # number of data rows written so far
    for f in os.listdir(DATAPATH):
        if os.path.splitext(f)[1] != '.html':
            continue
        content = []
        log.output('Current file: ' + f)
        htmlFile = codecs.open(DATAPATH + f, 'r', 'gbk')
        lines = htmlFile.readlines()
        htmlFile.close()
        if not lines:
            log.output('file has no lines')
        for line in lines:
            # strip() already removes the newline, so compare against empty
            if not line.strip():
                log.output('blank line')
                continue
            # Remove spaces before parsing (note: this also removes spaces inside tags)
            line = line.replace(' ', '')
            soup = BeautifulSoup(line, 'html.parser')
            for tdd in soup.find_all('td'):
                content.append(tdd.text.encode("gbk"))
        # enumerate() gives the true position even when cell texts repeat
        for idx, cell in enumerate(content):
            print idx, ',', cell
            log.output(cell)
            log.output(idx)
        print '----------------------------------------'

        # The indexes below depend on the layout of the source HTML tables
        folderName = content[6]
        contentName = content[4]
        # Keep only the digits of the duration cell (minutes) and convert to seconds
        duration = filter(str.isdigit, content[16])
        str_duration = str(int(duration) * 60)
        keyWord = content[6]
        description = content[36]
        videoName_1 = content[10]
        print folderName
        print contentName
        print str_duration
        print keyWord
        print description
        print videoName_1

        # One output row; '管理员' is "administrator", '华数编辑' is "Wasu editor"
        row = ['', folderName, '', contentName, str_duration, keyWord,
               description, '管理员', '华数编辑', videoName_1, '']
        log.output('xls row data: ' + ','.join(row))
        print k
        for col, value in enumerate(row):
            sheet.write(k + 1, col, value)
        k += 1
    wbk.save(DATAPATH + XLSname)
    log.close()
    print '========================================='
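
The listing above targets Python 2.7 and depends on BeautifulSoup and xlwt. As a rough sketch of its two data-handling steps (collecting the text of every `<td>` cell, and turning a "minutes" string into seconds), here is a Python 3 version that uses only the standard library. It swaps in `html.parser` for BeautifulSoup, so it is not the author's method. The names `TdTextCollector`, `td_texts` and `minutes_to_seconds` are hypothetical and do not appear in the original script. It assumes flat tables: text inside a nested `<td>` is folded into the outer cell.

```python
from html.parser import HTMLParser


class TdTextCollector(HTMLParser):
    """Collect the text of every top-level <td> cell, in document order."""

    def __init__(self):
        super().__init__()
        self.cells = []    # finished cell texts
        self._depth = 0    # greater than 0 while inside a <td>
        self._parts = []   # text fragments of the current cell

    def handle_starttag(self, tag, attrs):
        if tag == "td":
            self._depth += 1
            if self._depth == 1:
                self._parts = []

    def handle_endtag(self, tag):
        if tag == "td" and self._depth > 0:
            self._depth -= 1
            if self._depth == 0:
                self.cells.append("".join(self._parts).strip())

    def handle_data(self, data):
        if self._depth > 0:
            self._parts.append(data)


def td_texts(html):
    """Return the stripped text of each <td> cell found in `html`."""
    parser = TdTextCollector()
    parser.feed(html)
    parser.close()
    return parser.cells


def minutes_to_seconds(text):
    """Keep only the ASCII digits in `text`, read them as minutes, return seconds."""
    digits = "".join(ch for ch in text if ch in "0123456789")
    if not digits:
        raise ValueError("no digits found in %r" % text)
    return int(digits) * 60
```

For example, `td_texts("<tr><td>Title</td><td> 45 min </td></tr>")` returns `['Title', '45 min']`, and `minutes_to_seconds("45 min")` returns `2700`. Raising `ValueError` on a cell without digits is a design choice of this sketch; the original script would fail at the integer conversion instead.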