python读取html中指定元素生成excle文件示例


Posted in Python onApril 03, 2014

Python2.7编写的读取html中指定元素,并生成excle文件

#coding=gbk
import string
import codecs
import os,time
import xlwt
import xlrd
from bs4 import BeautifulSoup 
from xlrd import open_workbook
class LogMsg:
        def __init__(self,logfile,Level=0):
                try:
                        import logging
                        #self.logger = None
                        self.logger = logging.getLogger()
                        self.hdlr = logging.FileHandler(logfile)
                        formatter = logging.Formatter("[%(asctime)s]: %(message)s","%Y%m%d %H:%M:%S")
                        self.hdlr.setFormatter(formatter)
                        self.logger.addHandler(self.hdlr)
                        #logger.setLevel()
                        if Level == 10:
                                self.logger.setLevel(logging.DEBUG)
                        elif Level == 20:
                                self.logger.setLevel(logging.INFO)
                        elif Level == 30:
                                self.logger.setLevel(logging.WARNING)
                        elif Level == 40:
                                self.logger.setLevel(logging.ERROR)
                        elif Level == 50:
                                self.logger.setLevel(logging.CRITICAL)
                        else:
                                self.logger.setLevel(logging.NOTSET)
                except:
                        print "log init error!"
                        exit(1)
        def output(self,logInfo):
                Level = self.logger.getEffectiveLevel()
                try:
                        if Level == 10:
                                self.logger.debug(logInfo)
                        elif Level == 20:
                                self.logger.info(logInfo)
                        elif Level == 30:
                                self.logger.warning(logInfo)
                        elif Level == 40:
                                self.logger.error(logInfo)
                        elif Level == 50:
                                self.logger.critical(logInfo)
                        else:
                                self.logger.info(logInfo)
                except:
                        print "log output error!"
                        exit(1)
        def close(self):
                try:
                #logging.shutdown([self.hdlr])
                        self.logger.removeHandler(self.hdlr)
                except:
                        print "log closed error!"
                        exit(1) 
Logtime = time.strftime("%Y%m%d%H%M%S",time.localtime())
logFileTime = time.strftime("%Y%m%d",time.localtime())
Logfile = '/data/pyExample/logs/htmlparser_%s.log' % logFileTime
log = LogMsg(Logfile,20)

DATAPATH = '/data/pyExample/' 
XLSname = 'dangjian_'+Logtime+'.xls'

if __name__ == '__main__':
    
    wbk = xlwt.Workbook(encoding = 'gbk')
    sheet = wbk.add_sheet('基本内容导入模板')
    sheet.write(0,0,'内容类型 ')
    sheet.write(0,1,'栏目名称')
    sheet.write(0,2,'栏目编号')
    sheet.write(0,3,'内容名称')
    sheet.write(0,4,'时长')
    sheet.write(0,5,'关键字')
    sheet.write(0,6,'看点')
    sheet.write(0,7,'作者')
    sheet.write(0,8,'来源')
    sheet.write(0,9,'子内容1')
    sheet.write(0,10,'子内容2')
    xlsContent = []   
    files = os.listdir(DATAPATH)
    k = 0
    for f in files:  
        if os.path.splitext(f)[1] == '.html':
            content=[]
            log.output('当前文件:'+f)
            htmlFile =codecs.open(DATAPATH+f,'r','gbk')
            lines = htmlFile.readlines()
            if not lines:
                log.output ('not line')
            for line in lines:
                if line.strip()=='\n':
                    log.output('该处是空行')
                else:
                    line = line.replace(' ','')
                    soup  = BeautifulSoup(line)
                    for tdd in soup.findAll('td'):  
                        #print tdd.text.encode("gbk")
                        content.append(tdd.text.encode("gbk"))       
                #print line.encode('gbk') 
            htmlFile.close()    
            for i in content:
                print content.index(i),',',i 
                log.output(i) 
                log.output(content.index(i)) 
            print '----------------------------------------'
            
            folderName =  content[6]
            contentName=  content[4]       
            duration =    filter(str.isdigit, content[16])
            int_duration = string.atoi(duration)*60
            str_duration = "%i"%int_duration
            keyWord =     content[6] 
            desciption =  content[36]
            videoName_1 = content[10]
            print folderName
            print contentName
            print str_duration
            print keyWord
            print desciption
            print videoName_1
            log.output('输出xls数据:'+','+folderName+',,'+contentName+','+str_duration+','+keyWord+','+desciption+',管理员,华数编辑,'+videoName_1+',,')
            print k            
            sheet.write(k+1,0,'')
            sheet.write(k+1,1,folderName)
            sheet.write(k+1,2,'')
            sheet.write(k+1,3,contentName)
            sheet.write(k+1,4,str_duration)
            sheet.write(k+1,5,keyWord)
            sheet.write(k+1,6,desciption)
            sheet.write(k+1,7,'管理员')
            sheet.write(k+1,8,'华数编辑')
            sheet.write(k+1,9,videoName_1)
            sheet.write(k+1,10,'')
            k+=1
    wbk.save(DATAPATH + XLSname)        
    print '=========================================' 
Python 相关文章推荐
python生成器的使用方法
Nov 21 Python
使用基于Python的Tornado框架的HTTP客户端的教程
Apr 24 Python
编写自定义的Django模板加载器的简单示例
Jul 21 Python
解决python3中自定义wsgi函数,make_server函数报错的问题
Nov 21 Python
django使用html模板减少代码代码解析
Dec 12 Python
Python实现多条件筛选目标数据功能【测试可用】
Jun 13 Python
使用TensorFlow实现二分类的方法示例
Feb 05 Python
简单了解python的内存管理机制
Jul 08 Python
python使用梯度下降算法实现一个多线性回归
Mar 24 Python
浅谈pytorch中的BN层的注意事项
Jun 23 Python
scrapy-redis分布式爬虫的搭建过程(理论篇)
Sep 29 Python
关于Python使用turtle库画任意图的问题
Apr 01 Python
python实现zencart产品数据导入到magento(python导入数据)
Apr 03 #Python
python模拟登陆阿里妈妈生成商品推广链接
Apr 03 #Python
python多线程抓取天涯帖子内容示例
Apr 03 #Python
python局域网ip扫描示例分享
Apr 03 #Python
python实现数通设备tftp备份配置文件示例
Apr 02 #Python
python实现巡检系统(solaris)示例
Apr 02 #Python
python实现apahce网站日志分析示例
Apr 02 #Python
You might like
Apache+php+mysql在windows下的安装与配置图解(最新版)
2008/11/30 PHP
PHP 变量类型的强制转换
2009/10/23 PHP
在PHP中操作Excel实例代码
2010/04/29 PHP
ThinkPHP CURD方法之table方法详解
2014/06/18 PHP
php中实现精确设置session过期时间的方法
2014/07/17 PHP
PHP中if和or运行效率对比
2014/12/12 PHP
关于WordPress的SEO优化相关的一些PHP页面脚本技巧
2015/12/10 PHP
Prototype 1.5.0_rc1 及 Prototype 1.5.0 Pre0小抄本
2006/09/22 Javascript
ExtJS TabPanel beforeremove beforeclose使用说明
2010/03/31 Javascript
在Javascript中 声明时用"var"与不用"var"的区别
2013/04/15 Javascript
最好用的省市二级联动 原生js实现你值得拥有
2013/09/22 Javascript
js修改原型的属性使用介绍
2014/01/26 Javascript
javascript window.open打开新窗口后无法再次打开该窗口问题的解决方法
2014/04/12 Javascript
利用jquery操作Radio方法小结
2014/10/20 Javascript
浅谈EasyUI中编辑treegrid的方法
2015/03/01 Javascript
jQuery实现仿百度帖吧头部固定导航效果
2015/08/07 Javascript
avalon js实现仿google plus图片多张拖动排序附源码下载
2015/09/24 Javascript
AngularJS控制器controller正确的通信的方法
2016/01/25 Javascript
Ajax分页插件Pagination从前台jQuery到后端java总结
2016/07/22 Javascript
JavaScript中的this使用详解
2016/07/27 Javascript
详谈Ajax请求中的async:false/true的作用(ajax 在外部调用问题)
2017/02/10 Javascript
JavaScript实现简单图片轮播效果
2017/08/21 Javascript
React + webpack 环境配置的方法步骤
2017/09/07 Javascript
JS+HTML实现自定义上传图片按钮并显示图片功能的方法分析
2020/02/12 Javascript
Python 实现将数组/矩阵转换成Image类
2020/01/09 Python
python如何建立全零数组
2020/07/19 Python
利用Python如何制作贪吃蛇及AI版贪吃蛇详解
2020/08/24 Python
鱼油专家:Omegavia
2016/10/10 全球购物
马来西亚最热门的在线时尚商店:FashionValet
2018/11/11 全球购物
会展中心部门工作职责
2013/11/27 职场文书
满月酒主持词
2014/03/27 职场文书
公司股东合作协议书
2014/09/14 职场文书
三年级学生评语大全
2014/12/26 职场文书
员工工作心得体会
2019/05/07 职场文书
创业开店,这样方式更合理
2019/08/26 职场文书
python中的plt.cm.Paired用法说明
2021/05/31 Python