Python example: reading specified elements from HTML and generating an Excel file


Posted in Python on April 03, 2014

A script written in Python 2.7 that reads specified elements from HTML files and generates an Excel file.
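As a rough illustration of the "read specified elements" step, here is a minimal Python 3 sketch that collects the text of every `<td>` cell in document order. It deliberately swaps BeautifulSoup for the standard library's `html.parser` so it runs without extra packages; the names `TdCollector` and `extract_cells` and the sample markup are hypothetical and not part of the original script. Nested tables are not treated specially.

```python
from html.parser import HTMLParser


class TdCollector(HTMLParser):
    """Collect the text of every <td> cell, in document order."""

    def __init__(self):
        super().__init__()
        self.cells = []    # finished cell texts
        self._depth = 0    # greater than 0 while inside a <td>
        self._parts = []   # text fragments of the cell being read

    def handle_starttag(self, tag, attrs):
        if tag == "td":
            self._depth += 1
            if self._depth == 1:
                self._parts = []

    def handle_endtag(self, tag):
        if tag == "td" and self._depth > 0:
            self._depth -= 1
            if self._depth == 0:
                self.cells.append("".join(self._parts).strip())

    def handle_data(self, data):
        if self._depth > 0:
            self._parts.append(data)


def extract_cells(html_text):
    """Return the stripped text of each <td> in html_text as a list."""
    parser = TdCollector()
    parser.feed(html_text)
    parser.close()
    return parser.cells


# Hypothetical sample markup, used only to show the index listing
sample = "<table><tr><td>Title</td><td> 45 min </td></tr><tr><td></td></tr></table>"
cells = extract_cells(sample)
for index, text in enumerate(cells):
    print(index, text)
```

Printing each cell next to its index is how you would find out which positions hold the fields you want, which is the same approach the script takes before it picks fixed indices.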

#coding=gbk
import string
import codecs
import os,time
import xlwt
# xlrd is not imported: the script only writes a workbook and never reads one
from bs4 import BeautifulSoup
class LogMsg:
        def __init__(self,logfile,Level=0):
                try:
                        import logging
                        #self.logger = None
                        self.logger = logging.getLogger()
                        self.hdlr = logging.FileHandler(logfile)
                        formatter = logging.Formatter("[%(asctime)s]: %(message)s","%Y%m%d %H:%M:%S")
                        self.hdlr.setFormatter(formatter)
                        self.logger.addHandler(self.hdlr)
                        #logger.setLevel()
                        if Level == 10:
                                self.logger.setLevel(logging.DEBUG)
                        elif Level == 20:
                                self.logger.setLevel(logging.INFO)
                        elif Level == 30:
                                self.logger.setLevel(logging.WARNING)
                        elif Level == 40:
                                self.logger.setLevel(logging.ERROR)
                        elif Level == 50:
                                self.logger.setLevel(logging.CRITICAL)
                        else:
                                self.logger.setLevel(logging.NOTSET)
                except:
                        print "log init error!"
                        exit(1)
        def output(self,logInfo):
                Level = self.logger.getEffectiveLevel()
                try:
                        if Level == 10:
                                self.logger.debug(logInfo)
                        elif Level == 20:
                                self.logger.info(logInfo)
                        elif Level == 30:
                                self.logger.warning(logInfo)
                        elif Level == 40:
                                self.logger.error(logInfo)
                        elif Level == 50:
                                self.logger.critical(logInfo)
                        else:
                                self.logger.info(logInfo)
                except:
                        print "log output error!"
                        exit(1)
        def close(self):
                try:
                        #logging.shutdown([self.hdlr])
                        self.logger.removeHandler(self.hdlr)
                        # Close the handler so the log file is released
                        self.hdlr.close()
                except:
                        print "log closed error!"
                        exit(1) 
Logtime = time.strftime("%Y%m%d%H%M%S",time.localtime())
logFileTime = time.strftime("%Y%m%d",time.localtime())
Logfile = '/data/pyExample/logs/htmlparser_%s.log' % logFileTime
log = LogMsg(Logfile,20)

DATAPATH = '/data/pyExample/' 
XLSname = 'dangjian_'+Logtime+'.xls'

if __name__ == '__main__':
    
    wbk = xlwt.Workbook(encoding = 'gbk')
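    # Sheet name: "basic content import template"
    sheet = wbk.add_sheet('基本内容导入模板')
    # Header row: content type, section (folder) name, section ID, content name,
    # duration, keywords, highlights, author, source, sub-content 1, sub-content 2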
    sheet.write(0,0,'内容类型 ')
    sheet.write(0,1,'栏目名称')
    sheet.write(0,2,'栏目编号')
    sheet.write(0,3,'内容名称')
    sheet.write(0,4,'时长')
    sheet.write(0,5,'关键字')
    sheet.write(0,6,'看点')
    sheet.write(0,7,'作者')
    sheet.write(0,8,'来源')
    sheet.write(0,9,'子内容1')
    sheet.write(0,10,'子内容2')
    xlsContent = []   
    files = os.listdir(DATAPATH)
    k = 0
    for f in files:  
        if os.path.splitext(f)[1] == '.html':
            content=[]
            log.output('当前文件:'+f)
            htmlFile =codecs.open(DATAPATH+f,'r','gbk')
            lines = htmlFile.readlines()
            if not lines:
                log.output ('not line')
            for line in lines:
                if line.strip()=='':
                    # strip() also removes the newline, so a blank line equals ''
                    log.output('该处是空行')  # "this line is blank"
                else:
                    line = line.replace(' ','')
                    # Name the parser explicitly so results do not depend on what is installed
                    soup  = BeautifulSoup(line, 'html.parser')
                    for tdd in soup.findAll('td'):  
                        #print tdd.text.encode("gbk")
                        content.append(tdd.text.encode("gbk"))       
                #print line.encode('gbk') 
            htmlFile.close()    
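            # List every cell with its index; enumerate() gives the true position
            # even when two cells hold the same text (content.index() would not)
            for idx, i in enumerate(content):
                print idx,',',i 
                log.output(i) 
                log.output(idx) 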
            print '----------------------------------------'
            
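            # These indices match the author's HTML layout only; use the index
            # listing printed above to find the right cells for other pages.
            folderName =  content[6]
            contentName=  content[4]       
            # Keep only the digits of the duration cell, then multiply by 60
            # (presumably converting minutes to seconds)
            duration =    filter(str.isdigit, content[16])
            int_duration = int(duration)*60
            str_duration = "%i"%int_duration
            keyWord =     content[6] 
            desciption =  content[36]
            videoName_1 = content[10]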
            print folderName
            print contentName
            print str_duration
            print keyWord
            print desciption
            print videoName_1
            log.output('输出xls数据:'+','+folderName+',,'+contentName+','+str_duration+','+keyWord+','+desciption+',管理员,华数编辑,'+videoName_1+',,')
            print k            
            sheet.write(k+1,0,'')
            sheet.write(k+1,1,folderName)
            sheet.write(k+1,2,'')
            sheet.write(k+1,3,contentName)
            sheet.write(k+1,4,str_duration)
            sheet.write(k+1,5,keyWord)
            sheet.write(k+1,6,desciption)
            sheet.write(k+1,7,'管理员')  # author: "administrator"
            sheet.write(k+1,8,'华数编辑')  # source: "Wasu editor"
            sheet.write(k+1,9,videoName_1)
            sheet.write(k+1,10,'')
            k+=1
    wbk.save(DATAPATH + XLSname)        
    print '=========================================' 