Example: reading specified elements from HTML with Python and generating an Excel file


Posted in Python on April 03, 2014

A script written in Python 2.7 that reads specified elements from HTML files and generates an Excel file.
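
The core step of the script below is collecting the text of every td cell from an HTML file. As a rough illustration of that step alone, here is a minimal sketch that uses the standard library's html.parser instead of BeautifulSoup, so it runs without third-party packages. It targets Python 3 rather than the script's Python 2.7, and the names TdCollector and td_texts are hypothetical, not part of the original script. It also assumes td cells are not nested.

```python
from html.parser import HTMLParser


class TdCollector(HTMLParser):
    """Collect the stripped text of each (non-nested) <td> cell in order."""

    def __init__(self):
        super().__init__()
        self.cells = []       # finished cell texts
        self._in_td = False   # True while inside a <td>
        self._parts = []      # text fragments of the current cell

    def handle_starttag(self, tag, attrs):
        if tag == "td":
            self._in_td = True
            self._parts = []

    def handle_endtag(self, tag):
        if tag == "td" and self._in_td:
            self._in_td = False
            self.cells.append("".join(self._parts).strip())

    def handle_data(self, data):
        if self._in_td:
            self._parts.append(data)


def td_texts(html):
    """Return the text of every <td> cell found in the html string."""
    parser = TdCollector()
    parser.feed(html)
    parser.close()
    return parser.cells
```

Feeding the whole document at once, rather than one line at a time, means a cell whose text spans several lines is still returned as a single entry.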

#coding=gbk
import string
import codecs
import os,time
import xlwt
from bs4 import BeautifulSoup
class LogMsg:
        """Small file logger; Level is a standard logging level number (10-50)."""
        def __init__(self,logfile,Level=0):
                try:
                        import logging
                        self.logger = logging.getLogger()
                        self.hdlr = logging.FileHandler(logfile)
                        formatter = logging.Formatter("[%(asctime)s]: %(message)s","%Y%m%d %H:%M:%S")
                        self.hdlr.setFormatter(formatter)
                        self.logger.addHandler(self.hdlr)
                        if Level == 10:
                                self.logger.setLevel(logging.DEBUG)
                        elif Level == 20:
                                self.logger.setLevel(logging.INFO)
                        elif Level == 30:
                                self.logger.setLevel(logging.WARNING)
                        elif Level == 40:
                                self.logger.setLevel(logging.ERROR)
                        elif Level == 50:
                                self.logger.setLevel(logging.CRITICAL)
                        else:
                                self.logger.setLevel(logging.NOTSET)
                except:
                        print "log init error!"
                        exit(1)
        def output(self,logInfo):
                Level = self.logger.getEffectiveLevel()
                try:
                        if Level == 10:
                                self.logger.debug(logInfo)
                        elif Level == 20:
                                self.logger.info(logInfo)
                        elif Level == 30:
                                self.logger.warning(logInfo)
                        elif Level == 40:
                                self.logger.error(logInfo)
                        elif Level == 50:
                                self.logger.critical(logInfo)
                        else:
                                self.logger.info(logInfo)
                except:
                        print "log output error!"
                        exit(1)
        def close(self):
                try:
                        # Detach the handler and close the underlying log file
                        self.logger.removeHandler(self.hdlr)
                        self.hdlr.close()
                except:
                        print "log closed error!"
                        exit(1)
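# Timestamps used in the log and spreadsheet file names
Logtime = time.strftime("%Y%m%d%H%M%S",time.localtime())
logFileTime = time.strftime("%Y%m%d",time.localtime())
Logfile = '/data/pyExample/logs/htmlparser_%s.log' % logFileTime
log = LogMsg(Logfile,20)   # 20 = logging.INFO

# Directory scanned for .html files; the .xls file is written here too
DATAPATH = '/data/pyExample/'
XLSname = 'dangjian_'+Logtime+'.xls'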

if __name__ == '__main__':
    
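    wbk = xlwt.Workbook(encoding = 'gbk')
    # Sheet name: "basic content import template"
    sheet = wbk.add_sheet('基本内容导入模板')
    # Header row (English glosses in the comments)
    sheet.write(0,0,'内容类型 ')   # content type
    sheet.write(0,1,'栏目名称')    # category name
    sheet.write(0,2,'栏目编号')    # category number
    sheet.write(0,3,'内容名称')    # content name
    sheet.write(0,4,'时长')        # duration
    sheet.write(0,5,'关键字')      # keywords
    sheet.write(0,6,'看点')        # highlights
    sheet.write(0,7,'作者')        # author
    sheet.write(0,8,'来源')        # source
    sheet.write(0,9,'子内容1')     # sub-content 1
    sheet.write(0,10,'子内容2')    # sub-content 2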
    files = os.listdir(DATAPATH)
    k = 0
    for f in files:
        # Only process the .html files found in DATAPATH
        if os.path.splitext(f)[1] == '.html':
            content=[]
            log.output('当前文件:'+f)   # "current file:"
            htmlFile =codecs.open(DATAPATH+f,'r','gbk')
            lines = htmlFile.readlines()
            if not lines:
                log.output ('not line')
            for line in lines:
                # strip() also removes the newline, so a blank line becomes ''
                if not line.strip():
                    log.output('该处是空行')   # "this line is blank"
                else:
                    line = line.replace(' ','')
                    soup  = BeautifulSoup(line)
                    # Collect the text of every <td> cell as GBK-encoded bytes
                    for tdd in soup.findAll('td'):
                        content.append(tdd.text.encode("gbk"))
            htmlFile.close()
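            # Dump every cell with its index; enumerate() gives the true
            # position even when the same text occurs more than once
            for idx, cell in enumerate(content):
                print idx,',',cell
                log.output(cell)
                log.output(idx)
            print '----------------------------------------'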
            
            # The indexes below depend on the table layout of the source pages
            folderName =  content[6]
            contentName=  content[4]
            # Keep only the digits of the duration cell, then multiply by 60
            # (presumably minutes to seconds)
            duration =    filter(str.isdigit, content[16])
            int_duration = int(duration)*60
            str_duration = "%i"%int_duration
            keyWord =     content[6]
            description = content[36]
            videoName_1 = content[10]
            print folderName
            print contentName
            print str_duration
            print keyWord
            print description
            print videoName_1
            # Log the row ("output xls data:"); the fixed author and source
            # values mean "administrator" and "Wasu editor"
            log.output('输出xls数据:'+','+folderName+',,'+contentName+','+str_duration+','+keyWord+','+description+',管理员,华数编辑,'+videoName_1+',,')
            print k
            # Data goes to row k+1 because row 0 holds the headers
            sheet.write(k+1,0,'')
            sheet.write(k+1,1,folderName)
            sheet.write(k+1,2,'')
            sheet.write(k+1,3,contentName)
            sheet.write(k+1,4,str_duration)
            sheet.write(k+1,5,keyWord)
            sheet.write(k+1,6,description)
            sheet.write(k+1,7,'管理员')
            sheet.write(k+1,8,'华数编辑')
            sheet.write(k+1,9,videoName_1)
            sheet.write(k+1,10,'')
            k+=1
    wbk.save(DATAPATH + XLSname)
    # Release the log file once the workbook has been written
    log.close()
    print '========================================='
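
The LogMsg class above picks a logging method with an if/elif chain over the numeric level. Since those numbers (10 to 50) are the standard library's own level constants, the same behaviour can probably be expressed more compactly with Logger.log(level, msg). The following is a minimal sketch under that assumption, written for Python 3 rather than the script's Python 2.7. The function names make_logger, output and close, and the logger name htmlparser_sketch, are hypothetical and not part of the original script.

```python
import logging


def make_logger(logfile, level=logging.INFO):
    """Attach a file handler to a named logger and return both."""
    logger = logging.getLogger("htmlparser_sketch")  # hypothetical name
    handler = logging.FileHandler(logfile)
    handler.setFormatter(
        logging.Formatter("[%(asctime)s]: %(message)s", "%Y%m%d %H:%M:%S"))
    logger.addHandler(handler)
    logger.setLevel(level)
    return logger, handler


def output(logger, message):
    """Log message at the logger's effective level (INFO if unset)."""
    level = logger.getEffectiveLevel() or logging.INFO
    logger.log(level, message)


def close(logger, handler):
    """Detach the handler and close the log file."""
    logger.removeHandler(handler)
    handler.close()
```

A caller would create the pair once with make_logger(path, logging.INFO), pass each message through output(logger, message), and call close(logger, handler) when the run is finished.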