Python example: reading specified elements from HTML and generating an Excel file


Posted in Python on April 03, 2014

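A Python 2.7 script that reads specified elements (the text of the td cells) from the HTML files in a directory and writes them to an Excel (.xls) file.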

#coding=gbk
import string
import codecs
import os,time
import xlwt
from bs4 import BeautifulSoup
class LogMsg:
        def __init__(self,logfile,Level=0):
                try:
                        import logging
                        #self.logger = None
                        self.logger = logging.getLogger()
                        self.hdlr = logging.FileHandler(logfile)
                        formatter = logging.Formatter("[%(asctime)s]: %(message)s","%Y%m%d %H:%M:%S")
                        self.hdlr.setFormatter(formatter)
                        self.logger.addHandler(self.hdlr)
                        #logger.setLevel()
                        if Level == 10:
                                self.logger.setLevel(logging.DEBUG)
                        elif Level == 20:
                                self.logger.setLevel(logging.INFO)
                        elif Level == 30:
                                self.logger.setLevel(logging.WARNING)
                        elif Level == 40:
                                self.logger.setLevel(logging.ERROR)
                        elif Level == 50:
                                self.logger.setLevel(logging.CRITICAL)
                        else:
                                self.logger.setLevel(logging.NOTSET)
                except:
                        print "log init error!"
                        exit(1)
        def output(self,logInfo):
                Level = self.logger.getEffectiveLevel()
                try:
                        if Level == 10:
                                self.logger.debug(logInfo)
                        elif Level == 20:
                                self.logger.info(logInfo)
                        elif Level == 30:
                                self.logger.warning(logInfo)
                        elif Level == 40:
                                self.logger.error(logInfo)
                        elif Level == 50:
                                self.logger.critical(logInfo)
                        else:
                                self.logger.info(logInfo)
                except:
                        print "log output error!"
                        exit(1)
        def close(self):
                try:
                        #logging.shutdown([self.hdlr])
                        self.logger.removeHandler(self.hdlr)
                except:
                        print "log closed error!"
                        exit(1) 
Logtime = time.strftime("%Y%m%d%H%M%S",time.localtime())
logFileTime = time.strftime("%Y%m%d",time.localtime())
Logfile = '/data/pyExample/logs/htmlparser_%s.log' % logFileTime
log = LogMsg(Logfile,20)

DATAPATH = '/data/pyExample/' 
XLSname = 'dangjian_'+Logtime+'.xls'

if __name__ == '__main__':
    
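    wbk = xlwt.Workbook(encoding = 'gbk')
    sheet = wbk.add_sheet('基本内容导入模板')   # sheet name: "basic content import template"
    # header row (row 0); the comments give an English gloss for each column
    sheet.write(0,0,'内容类型 ')   # content type
    sheet.write(0,1,'栏目名称')    # category name
    sheet.write(0,2,'栏目编号')    # category number
    sheet.write(0,3,'内容名称')    # content name
    sheet.write(0,4,'时长')        # duration
    sheet.write(0,5,'关键字')      # keywords
    sheet.write(0,6,'看点')        # highlights
    sheet.write(0,7,'作者')        # author
    sheet.write(0,8,'来源')        # source
    sheet.write(0,9,'子内容1')     # sub-content 1
    sheet.write(0,10,'子内容2')    # sub-content 2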
    xlsContent = []   
    files = os.listdir(DATAPATH)
    k = 0
    for f in files:  
        if os.path.splitext(f)[1] == '.html':
            content=[]
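            log.output('current file: '+f)
            htmlFile = codecs.open(DATAPATH+f,'r','gbk')
            lines = htmlFile.readlines()
            if not lines:
                log.output('file is empty')
            for line in lines:
                # strip() also removes the newline, so a blank line compares equal to ''
                if line.strip()=='':
                    log.output('blank line')
                else:
                    # NOTE: removing every space also removes the spaces inside tags
                    # (e.g. <td class=...>); the target was possibly '&nbsp;' before
                    # the web page rendered it as a space (an assumption)
                    line = line.replace(' ','')
                    soup = BeautifulSoup(line, 'html.parser')
                    for tdd in soup.find_all('td'):
                        # keep the text of every <td> cell, encoded as gbk bytes
                        content.append(tdd.text.encode("gbk"))
            htmlFile.close()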
            # enumerate() gives the right index even when two cells hold the same text
            for idx, cell in enumerate(content):
                print idx,',',cell
                log.output(cell)
                log.output(idx)
            print '----------------------------------------'
            
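            # The indices below depend on the order of the <td> cells in the source pages.
            folderName = content[6]
            contentName = content[4]
            # keep only the digits of the duration cell, then multiply by 60
            # (presumably minutes to seconds)
            duration = filter(str.isdigit, content[16])
            int_duration = int(duration)*60
            str_duration = str(int_duration)
            keyWord = content[6]
            desciption = content[36]
            videoName_1 = content[10]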
            print folderName
            print contentName
            print str_duration
            print keyWord
            print desciption
            print videoName_1
            log.output('xls row data: '+','+folderName+',,'+contentName+','+str_duration+','+keyWord+','+desciption+',管理员,华数编辑,'+videoName_1+',,')
            print k            
            sheet.write(k+1,0,'')
            sheet.write(k+1,1,folderName)
            sheet.write(k+1,2,'')
            sheet.write(k+1,3,contentName)
            sheet.write(k+1,4,str_duration)
            sheet.write(k+1,5,keyWord)
            sheet.write(k+1,6,desciption)
            sheet.write(k+1,7,'管理员')      # author: "administrator"
            sheet.write(k+1,8,'华数编辑')    # source: "Huashu editor"
            sheet.write(k+1,9,videoName_1)
            sheet.write(k+1,10,'')
            k+=1
    wbk.save(DATAPATH + XLSname)        
    print '=========================================' 
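
The two core steps of the script above, collecting the text of the td cells and turning the duration cell into seconds, can be sketched on their own. This is only a minimal sketch under a few assumptions: it targets Python 3 rather than 2.7, it swaps in the standard library's html.parser for BeautifulSoup so that it runs without extra packages, and it treats the number in the duration cell as minutes (which the multiplication by 60 in the script suggests). The names TdCollector, extract_cells and minutes_to_seconds are hypothetical and do not come from the original script.

```python
from html.parser import HTMLParser


class TdCollector(HTMLParser):
    """Collect the text of every <td> cell in document order (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.cells = []       # finished cell texts
        self._in_td = False   # True while the parser is inside a <td>
        self._parts = []      # text fragments of the current cell

    def handle_starttag(self, tag, attrs):
        if tag == "td":
            self._in_td = True
            self._parts = []

    def handle_endtag(self, tag):
        if tag == "td" and self._in_td:
            self._in_td = False
            self.cells.append("".join(self._parts).strip())

    def handle_data(self, data):
        if self._in_td:
            self._parts.append(data)


def extract_cells(html):
    """Return the stripped text of all <td> cells found in html."""
    parser = TdCollector()
    parser.feed(html)
    parser.close()
    return parser.cells


def minutes_to_seconds(text):
    """Keep only the ASCII digits of text and treat them as minutes (an assumption)."""
    digits = "".join(ch for ch in text if "0" <= ch <= "9")
    if not digits:
        raise ValueError("no digits in %r" % text)
    return int(digits) * 60


sample = "<table><tr><td>Title</td><td> 45 min </td></tr></table>"
cells = extract_cells(sample)
print(cells)                          # ['Title', '45 min']
print(minutes_to_seconds(cells[1]))   # 2700
```

Feeding the whole document to the parser at once, instead of one line at a time as the script does, also handles a td cell whose text spans several lines. Nested tables are not handled by this sketch.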