Python example: reading specified elements from HTML and generating an Excel file


Posted in Python on April 03, 2014

Written for Python 2.7: the script reads specified elements (the <td> cells) from HTML files and writes them to an Excel (.xls) file.

#coding=gbk
import string
import codecs
import os,time
import xlwt
import xlrd
from bs4 import BeautifulSoup 
from xlrd import open_workbook
class LogMsg:
        def __init__(self,logfile,Level=0):
                try:
                        import logging
                        #self.logger = None
                        self.logger = logging.getLogger()
                        self.hdlr = logging.FileHandler(logfile)
                        formatter = logging.Formatter("[%(asctime)s]: %(message)s","%Y%m%d %H:%M:%S")
                        self.hdlr.setFormatter(formatter)
                        self.logger.addHandler(self.hdlr)
                        #logger.setLevel()
                        if Level == 10:
                                self.logger.setLevel(logging.DEBUG)
                        elif Level == 20:
                                self.logger.setLevel(logging.INFO)
                        elif Level == 30:
                                self.logger.setLevel(logging.WARNING)
                        elif Level == 40:
                                self.logger.setLevel(logging.ERROR)
                        elif Level == 50:
                                self.logger.setLevel(logging.CRITICAL)
                        else:
                                self.logger.setLevel(logging.NOTSET)
                except:
                        print "log init error!"
                        exit(1)
        def output(self,logInfo):
                Level = self.logger.getEffectiveLevel()
                try:
                        if Level == 10:
                                self.logger.debug(logInfo)
                        elif Level == 20:
                                self.logger.info(logInfo)
                        elif Level == 30:
                                self.logger.warning(logInfo)
                        elif Level == 40:
                                self.logger.error(logInfo)
                        elif Level == 50:
                                self.logger.critical(logInfo)
                        else:
                                self.logger.info(logInfo)
                except:
                        print "log output error!"
                        exit(1)
        def close(self):
                try:
                        #logging.shutdown([self.hdlr])
                        self.logger.removeHandler(self.hdlr)
                        self.hdlr.close()  # release the log file handle
                except:
                        print "log closed error!"
                        exit(1)
Logtime = time.strftime("%Y%m%d%H%M%S",time.localtime())
logFileTime = time.strftime("%Y%m%d",time.localtime())
Logfile = '/data/pyExample/logs/htmlparser_%s.log' % logFileTime
log = LogMsg(Logfile,20)

DATAPATH = '/data/pyExample/' 
XLSname = 'dangjian_'+Logtime+'.xls'

if __name__ == '__main__':
    
    wbk = xlwt.Workbook(encoding = 'gbk')
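    sheet = wbk.add_sheet('基本内容导入模板')  # sheet name: "basic content import template"
    # header row; the Chinese labels are kept because they are the template's column names
    sheet.write(0,0,'内容类型 ')  # content type
    sheet.write(0,1,'栏目名称')   # category name
    sheet.write(0,2,'栏目编号')   # category ID
    sheet.write(0,3,'内容名称')   # content name
    sheet.write(0,4,'时长')       # duration
    sheet.write(0,5,'关键字')     # keywords
    sheet.write(0,6,'看点')       # highlights
    sheet.write(0,7,'作者')       # author
    sheet.write(0,8,'来源')       # source
    sheet.write(0,9,'子内容1')    # sub-content 1
    sheet.write(0,10,'子内容2')   # sub-content 2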
    xlsContent = []   
    files = os.listdir(DATAPATH)
    k = 0
    for f in files:  
        if os.path.splitext(f)[1] == '.html':
            content=[]
            log.output('Current file: '+f)
            htmlFile =codecs.open(DATAPATH+f,'r','gbk')
            lines = htmlFile.readlines()
            if not lines:
                log.output('file is empty')
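            for line in lines:
                # strip() already removes the trailing newline, so a blank
                # line is an empty string here, never '\n'
                if not line.strip():
                    log.output('blank line')
                else:
                    line = line.replace(' ','')
                    # name the parser explicitly so bs4 does not have to guess
                    soup  = BeautifulSoup(line, 'html.parser')
                    for tdd in soup.findAll('td'):
                        #print tdd.text.encode("gbk")
                        content.append(tdd.text.encode("gbk"))
                #print line.encode('gbk') 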
            htmlFile.close()    
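            # enumerate gives the true position of each cell; content.index(i)
            # would return the first match when two cells hold the same text
            for idx, cell in enumerate(content):
                print idx,',',cell
                log.output(cell)
                log.output(idx)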
            print '----------------------------------------'
            
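            # the cell positions used below depend on the layout of the source HTML files
            folderName =  content[6]
            contentName=  content[4]       
            # keep only the digits of the duration cell (minutes), then convert to seconds
            duration =    filter(str.isdigit, content[16])
            int_duration = int(duration)*60
            str_duration = "%i"%int_duration
            keyWord =     content[6] 
            desciption =  content[36]
            videoName_1 = content[10]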
            print folderName
            print contentName
            print str_duration
            print keyWord
            print desciption
            print videoName_1
            log.output('xls row data:'+','+folderName+',,'+contentName+','+str_duration+','+keyWord+','+desciption+',管理员,华数编辑,'+videoName_1+',,')
            print k            
            sheet.write(k+1,0,'')
            sheet.write(k+1,1,folderName)
            sheet.write(k+1,2,'')
            sheet.write(k+1,3,contentName)
            sheet.write(k+1,4,str_duration)
            sheet.write(k+1,5,keyWord)
            sheet.write(k+1,6,desciption)
            sheet.write(k+1,7,'管理员')    # author column: fixed value ("administrator")
            sheet.write(k+1,8,'华数编辑')  # source column: fixed value ("Wasu editor")
            sheet.write(k+1,9,videoName_1)
            sheet.write(k+1,10,'')
            k+=1
    wbk.save(DATAPATH + XLSname)        
    print '=========================================' 
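
The listing above targets Python 2.7 and BeautifulSoup. As a rough illustration of its two core steps (collecting the text of every <td> cell, then turning a "minutes" cell into seconds), here is a minimal Python 3 sketch that swaps BeautifulSoup for the standard library's html.parser. The names TdCollector, extract_cells and minutes_to_seconds, as well as the sample HTML, are made up for this example and are not part of the original script. It also assumes the tables are not nested.

```python
# Python 3 sketch; the original listing is Python 2.7 with BeautifulSoup.
from html.parser import HTMLParser


class TdCollector(HTMLParser):
    """Collect the text of every <td> cell in document order (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.cells = []       # finished cell texts
        self._in_td = False   # True while the parser is inside a <td>
        self._buf = []        # text fragments of the current cell

    def handle_starttag(self, tag, attrs):
        if tag == "td":
            self._in_td = True
            self._buf = []

    def handle_endtag(self, tag):
        if tag == "td" and self._in_td:
            self.cells.append("".join(self._buf).strip())
            self._in_td = False

    def handle_data(self, data):
        if self._in_td:
            self._buf.append(data)


def extract_cells(html):
    """Return the stripped text of each <td> found in html."""
    parser = TdCollector()
    parser.feed(html)
    parser.close()
    return parser.cells


def minutes_to_seconds(text):
    """Keep only the ASCII digits in text (read as minutes) and return seconds."""
    digits = "".join(ch for ch in text if "0" <= ch <= "9")
    if not digits:
        raise ValueError("no digits found in %r" % text)
    return int(digits) * 60


# Made-up sample input, only to show the call pattern.
sample = "<table><tr><td>Title</td><td>45 min</td></tr></table>"
cells = extract_cells(sample)
print(cells)                         # ['Title', '45 min']
print(minutes_to_seconds(cells[1]))  # 2700
```

The resulting values could then be written row by row with sheet.write(row, col, value), as the listing does with xlwt.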