python读取html中指定元素生成excle文件示例


Posted in Python onApril 03, 2014

Python2.7编写的读取html中指定元素,并生成excle文件

#coding=gbk
import string
import codecs
import os,time
import xlwt
import xlrd
from bs4 import BeautifulSoup 
from xlrd import open_workbook
class LogMsg:
        def __init__(self,logfile,Level=0):
                try:
                        import logging
                        #self.logger = None
                        self.logger = logging.getLogger()
                        self.hdlr = logging.FileHandler(logfile)
                        formatter = logging.Formatter("[%(asctime)s]: %(message)s","%Y%m%d %H:%M:%S")
                        self.hdlr.setFormatter(formatter)
                        self.logger.addHandler(self.hdlr)
                        #logger.setLevel()
                        if Level == 10:
                                self.logger.setLevel(logging.DEBUG)
                        elif Level == 20:
                                self.logger.setLevel(logging.INFO)
                        elif Level == 30:
                                self.logger.setLevel(logging.WARNING)
                        elif Level == 40:
                                self.logger.setLevel(logging.ERROR)
                        elif Level == 50:
                                self.logger.setLevel(logging.CRITICAL)
                        else:
                                self.logger.setLevel(logging.NOTSET)
                except:
                        print "log init error!"
                        exit(1)
        def output(self,logInfo):
                Level = self.logger.getEffectiveLevel()
                try:
                        if Level == 10:
                                self.logger.debug(logInfo)
                        elif Level == 20:
                                self.logger.info(logInfo)
                        elif Level == 30:
                                self.logger.warning(logInfo)
                        elif Level == 40:
                                self.logger.error(logInfo)
                        elif Level == 50:
                                self.logger.critical(logInfo)
                        else:
                                self.logger.info(logInfo)
                except:
                        print "log output error!"
                        exit(1)
        def close(self):
                try:
                #logging.shutdown([self.hdlr])
                        self.logger.removeHandler(self.hdlr)
                except:
                        print "log closed error!"
                        exit(1) 
Logtime = time.strftime("%Y%m%d%H%M%S",time.localtime())
logFileTime = time.strftime("%Y%m%d",time.localtime())
Logfile = '/data/pyExample/logs/htmlparser_%s.log' % logFileTime
log = LogMsg(Logfile,20)

DATAPATH = '/data/pyExample/' 
XLSname = 'dangjian_'+Logtime+'.xls'

if __name__ == '__main__':
    
    wbk = xlwt.Workbook(encoding = 'gbk')
    sheet = wbk.add_sheet('基本内容导入模板')
    sheet.write(0,0,'内容类型 ')
    sheet.write(0,1,'栏目名称')
    sheet.write(0,2,'栏目编号')
    sheet.write(0,3,'内容名称')
    sheet.write(0,4,'时长')
    sheet.write(0,5,'关键字')
    sheet.write(0,6,'看点')
    sheet.write(0,7,'作者')
    sheet.write(0,8,'来源')
    sheet.write(0,9,'子内容1')
    sheet.write(0,10,'子内容2')
    xlsContent = []   
    files = os.listdir(DATAPATH)
    k = 0
    for f in files:  
        if os.path.splitext(f)[1] == '.html':
            content=[]
            log.output('当前文件:'+f)
            htmlFile =codecs.open(DATAPATH+f,'r','gbk')
            lines = htmlFile.readlines()
            if not lines:
                log.output ('not line')
            for line in lines:
                if line.strip()=='\n':
                    log.output('该处是空行')
                else:
                    line = line.replace(' ','')
                    soup  = BeautifulSoup(line)
                    for tdd in soup.findAll('td'):  
                        #print tdd.text.encode("gbk")
                        content.append(tdd.text.encode("gbk"))       
                #print line.encode('gbk') 
            htmlFile.close()    
            for i in content:
                print content.index(i),',',i 
                log.output(i) 
                log.output(content.index(i)) 
            print '----------------------------------------'
            
            folderName =  content[6]
            contentName=  content[4]       
            duration =    filter(str.isdigit, content[16])
            int_duration = string.atoi(duration)*60
            str_duration = "%i"%int_duration
            keyWord =     content[6] 
            desciption =  content[36]
            videoName_1 = content[10]
            print folderName
            print contentName
            print str_duration
            print keyWord
            print desciption
            print videoName_1
            log.output('输出xls数据:'+','+folderName+',,'+contentName+','+str_duration+','+keyWord+','+desciption+',管理员,华数编辑,'+videoName_1+',,')
            print k            
            sheet.write(k+1,0,'')
            sheet.write(k+1,1,folderName)
            sheet.write(k+1,2,'')
            sheet.write(k+1,3,contentName)
            sheet.write(k+1,4,str_duration)
            sheet.write(k+1,5,keyWord)
            sheet.write(k+1,6,desciption)
            sheet.write(k+1,7,'管理员')
            sheet.write(k+1,8,'华数编辑')
            sheet.write(k+1,9,videoName_1)
            sheet.write(k+1,10,'')
            k+=1
    wbk.save(DATAPATH + XLSname)        
    print '=========================================' 
Python 相关文章推荐
Python实现在线程里运行scrapy的方法
Apr 07 Python
Python升级导致yum、pip报错的解决方法
Sep 06 Python
使用Python编写Prometheus监控的方法
Oct 15 Python
Python中GeoJson和bokeh-1的使用讲解
Jan 03 Python
通过shell+python实现企业微信预警
Mar 07 Python
python实现登录密码重置简易操作代码
Aug 14 Python
使用Python合成图片的实现代码(图片添加个性化文本,图片上叠加其他图片)
Apr 30 Python
基于python实现破解滑动验证码过程解析
May 28 Python
Python3如何使用多线程升程序运行速度
Aug 11 Python
pymongo insert_many 批量插入的实例
Dec 05 Python
68行Python代码实现带难度升级的贪吃蛇
Jan 18 Python
Python时间操作之pytz模块使用详解
Jun 14 Python
python实现zencart产品数据导入到magento(python导入数据)
Apr 03 #Python
python模拟登陆阿里妈妈生成商品推广链接
Apr 03 #Python
python多线程抓取天涯帖子内容示例
Apr 03 #Python
python局域网ip扫描示例分享
Apr 03 #Python
python实现数通设备tftp备份配置文件示例
Apr 02 #Python
python实现巡检系统(solaris)示例
Apr 02 #Python
python实现apahce网站日志分析示例
Apr 02 #Python
You might like
PHP error_log()将错误信息写入一个文件(定义和用法)
2013/10/25 PHP
PHP 二维数组根据某个字段排序的具体实现
2014/06/03 PHP
详解PHP版本兼容之openssl调用参数
2018/07/25 PHP
Laravel框架学习笔记之批量更新数据功能
2019/05/30 PHP
Ext第一周 史上最强学习笔记---GridPanel(基础篇)
2008/12/29 Javascript
实用的JS正则表达式(手机号码/IP正则/邮编正则/电话等)
2013/01/11 Javascript
利用JQuery动画制作滑动菜单项效果实现步骤及代码
2013/02/07 Javascript
扩展IE中一些不兼容的方法如contains、startWith等等
2014/01/09 Javascript
jquery对元素拖动排序示例
2014/01/16 Javascript
PHP配置文件php.ini中打开错误报告的设置方法
2015/01/09 PHP
jQuery实现为图片添加镜头放大效果的方法
2015/06/25 Javascript
jQuery实现图片预加载效果
2015/11/27 Javascript
JavaScript中的this到底是什么(一)
2015/12/09 Javascript
Node.js静态文件服务器改进版
2016/01/10 Javascript
JS延时器提示框的应用实例代码解析
2016/04/27 Javascript
第一次接触神奇的Bootstrap网格系统
2016/07/27 Javascript
JS实现直接运行html代码的方法
2017/03/13 Javascript
jQuery实现鼠标响应式淘宝动画效果示例
2018/02/13 jQuery
详解如何解决Vue和vue-template-compiler版本之间的问题
2018/09/17 Javascript
vue中element 的upload组件发送请求给后端操作
2020/09/07 Javascript
微信小程序自定义yPicker组件实现省市区三级联动功能
2020/10/29 Javascript
[01:02:03]2014 DOTA2华西杯精英邀请赛 5 24 NewBee VS VG
2014/05/26 DOTA
用实例说明python的*args和**kwargs用法
2013/11/01 Python
python 截取 取出一部分的字符串方法
2017/03/01 Python
Python使用Turtle库绘制一棵西兰花
2019/11/23 Python
Python彻底删除文件夹及其子文件方式
2019/12/23 Python
Python插件机制实现详解
2020/05/04 Python
CSS3实现千变万化的文字阴影text-shadow效果设计
2016/04/26 HTML / CSS
性能服装:HYLETE
2018/08/14 全球购物
探亲邀请信范文
2014/01/30 职场文书
小学中秋节活动方案
2014/02/06 职场文书
手工社团活动方案
2014/02/17 职场文书
新党章心得体会
2014/09/04 职场文书
2014年机关后勤工作总结
2014/12/16 职场文书
工程部文员岗位职责
2015/02/04 职场文书
win10识别不了U盘怎么办 win10系统读取U盘失败的解决办法
2022/08/05 数码科技