Example: parsing XML into the corresponding HTML with Python


Posted in Python on April 02, 2014

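This example uses SAX to parse ddt.xml into HTML. Of course, if you have the XSL stylesheet that corresponds to the XML, you can convert it to HTML directly with libxml2 and its companion library libxslt.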
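Before the full listing, it may help to see the raw event stream that SAX produces. The sketch below is not part of the original program: the `EventLogger` class and the `SAMPLE` feed are made up for illustration, and it is written for Python 3 rather than the Python 2.7 used by the listing.

```python
import xml.sax
from xml.sax.handler import ContentHandler

# Hypothetical sample feed, not the ddt.xml used by the article.
SAMPLE = b"""<rss><channel><title>Demo</title>
<item><title>First</title><link>http://example.com/1</link></item>
</channel></rss>"""


class EventLogger(ContentHandler):
    """Records every SAX callback as a (kind, value) tuple."""

    def __init__(self):
        super().__init__()
        self.events = []

    def startElement(self, name, attrs):
        self.events.append(('start', name))

    def endElement(self, name):
        self.events.append(('end', name))

    def characters(self, chars):
        # Whitespace between tags also arrives here, so filter it out.
        # The text of one element may be delivered in several chunks.
        if chars.strip():
            self.events.append(('text', chars))


logger = EventLogger()
xml.sax.parseString(SAMPLE, logger)
for kind, value in logger.events:
    print(kind, value)
```

Each tag yields one start and one end event, with the text arriving in between. That is why the listing collects text in a buffer and only acts on it when the matching end event fires.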

#!/usr/bin/env python 
# -*- coding: utf-8 -*-
#---------------------------------------
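#   Program: XML parser
#   Version: 01.0
#   Author: mupeng
#   Date: 2013-12-18
#   Language: Python 2.7
#   Purpose: parse XML into the corresponding HTML
#   Notes: the program parses the XML with the parse function of the xml.sax module, which generates events;
#   it subclasses ContentHandler and overrides its event handlers;
#   Dispatcher dispatches the start and end events of each tag to the matching method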
#---------------------------------------
from xml.sax.handler import ContentHandler
from xml.sax import parse
class Dispatcher:
    # Routes SAX events to methods named after the tag,
    # e.g. <title> -> startTitle() / endTitle().
    def dispatch(self, prefix, name, attrs=None):
        mname = prefix + name.capitalize()
        dname = 'default' + prefix.capitalize()
        method = getattr(self, mname, None)
        if not callable(method):
            # No tag-specific handler: fall back to defaultStart/defaultEnd.
            method = getattr(self, dname, None)
        # Handlers take no arguments; attrs is accepted but not forwarded.
        if callable(method): method()
    def startElement(self, name, attrs):
        self.dispatch('start', name, attrs)
    def endElement(self, name):
        self.dispatch('end', name)
class Website(Dispatcher, ContentHandler):
    def __init__(self):
        self.fout = open('ddt_SAX.html', 'w')
        self.imagein = False
        self.desflag = False
        self.item = False
        self.title = ''
        self.link = ''
        self.guid = ''
        self.url = ''
        self.pubdate = ''
        self.description = ''
        self.temp = ''
        self.prx = ''
    def startChannel(self):
        self.fout.write('''<html>\n<head>\n<title> RSS-''')
    def endChannel(self):
       self.fout.write('''
                    <tr><td height="20"></td></tr>
                    </table>
                    </center>
                    <script>
    function  GetTimeDiff(str)
    {
     if(str == '')
     {
      return '';
     }
     var pubDate = new Date(str);
     var nowDate = new Date();
     var diffMilSeconds = nowDate.valueOf()-pubDate.valueOf();
     var days = diffMilSeconds/86400000;
     days = parseInt(days);
     diffMilSeconds = diffMilSeconds-(days*86400000);
     var hours = diffMilSeconds/3600000;
     hours = parseInt(hours);
     diffMilSeconds = diffMilSeconds-(hours*3600000);
     var minutes = diffMilSeconds/60000;
     minutes = parseInt(minutes);
     diffMilSeconds = diffMilSeconds-(minutes*60000);
     var seconds = diffMilSeconds/1000;
     seconds = parseInt(seconds);
     var returnStr = "Beijing publication time: " + pubDate.toLocaleString();
     if(days > 0)
     {
      returnStr = returnStr + " (time elapsed: " + days + " days " + hours + " hours " + minutes + " minutes)";
     }
     else if (hours > 0)
     {
      returnStr = returnStr + " (time elapsed: " + hours + " hours " + minutes + " minutes)";
     }
     else if (minutes > 0)
     {
      returnStr = returnStr + " (time elapsed: " + minutes + " minutes)";
     }
     }
     return returnStr;
    }
    function GetSpanText()
    {
     var pubDate;
     var pubDateArray;
     var spanArray = document.getElementsByTagName("span");
     for(var i = 0; i < spanArray.length; i++)
     {
      pubDate = spanArray[i].innerHTML;
      document.getElementsByTagName("span")[i].innerHTML = GetTimeDiff(pubDate);   
     }
    }
    GetSpanText();
   </script>
                </body>
                </html>
                ''')
       self.fout.close()
    def characters(self, chars):
        # SAX may deliver an element's text in several chunks, so accumulate
        # them in self.temp; whitespace-only chunks are skipped.
        if chars.strip():
            self.temp += chars
    def startTitle(self):
        if self.item:
            self.fout.write('''
                        <tr bgcolor="#eeeeee">\n<td style="padding-top:5px;padding-left:5px;" height="30">\n<B>
                    ''')
    def endTitle(self):
        if not self.imagein and not self.item:
            self.title = self.temp
            self.temp = ''
            self.fout.write(self.title.encode('gb2312'))
            #self.title = self.temp
            self.fout.write('''
                </title>\n</head>\n<body>\n<center>\n
                <script>\n
                        function copyLink()
                        {
                                clipboardData.setData("Text",window.location.href);
                                alert("The RSS link has been copied to the clipboard");
                        }
                        function subscibeLink()
                        {
                                var str = window.location.pathname;
                                while(str.match(/^\//))
                                {
                                        str = str.replace(/^\//,"");
                                }
                                window.open("http://rss.sina.com.cn/my_sina_web_rss_news.html?url=" + str,"_self");
                        }
                        </script>\n
                <table width="750" cellpadding="0" cellspacing="0">\n
                <tr>\n
                <td align="right" style="padding-right:15px;" valign="bottom">\n
            ''')
        if self.item:
            self.title = self.temp
            self.temp = ''
            self.fout.write(self.title.encode('gb2312'))
            self.fout.write('''
                        </B>
                        </td>
                        </tr>
                        <tr bgcolor="#eeeeee">
                        <td style="padding-left:5px;">
                        ''')
    def startImage(self):
        self.imagein = True
    def endImage(self):
        self.imagein = False
    def startLink(self):
        if self.imagein:
            self.fout.write('''<A href=" ''')
            
    def endLink(self):
        self.link = self.temp
        self.temp = ''
        if self.imagein:
            self.fout.write(self.link.encode('gb2312'))
            self.fout.write('''" target="_blank">\n ''')
        elif self.item:
            #self.link = self.temp
            pass
        else:
            self.fout.write(self.link)
            self.fout.write(''' " target="_blank"> ''')
            self.fout.write(self.title.encode('gb2312'))
            self.fout.write(''' </A></B></td>
                            </tr>
                            <tr><td colspan="2" align="center">
                            ''')
            self.fout.write(self.description.encode('gb2312'))
            self.fout.write('''
                        </td></tr>
                        <tr style="font-size:12px;" bgcolor="#eeeeff"><td colspan="2" style="font-size:14px;padding-top:5px;padding-bottom:5px;"><b><a href="javascript:copyLink();">Copy the link to this page</a>                <a href="javascript:subscibeLink();">Embed this news list in my page (simple, fast, real-time, free)</a></b></td></tr>
                        </table>
                        <table width="750" cellpadding="0" cellspacing="0">
                            ''')
    def startUrl(self):
        if self.imagein:
            self.fout.write('''<IMG src=" ''')
    def endUrl(self):
        self.url = self.temp
        self.temp = ''
        if self.imagein:
            self.fout.write(self.url.encode('gb2312'))
            self.fout.write('''" border="0">\n
                            </A>
                            </td>
                            <td align="left" valign="bottom" style="padding-bottom:8px;"><B><A href="
                            ''')
        if self.item:
            #self.url = self.temp
            pass
    def defaultStart(self):
        pass
    def defaultEnd(self):
        self.temp = ''
    def startDescription(self):
        pass
    def endDescription(self):
        self.description = self.temp
        self.temp = ''
        if self.item:
            #self.fout.write('\xa1\xa1\xa1\xa1')  # two full-width spaces (GB2312 bytes)
            self.fout.write(self.description.encode('gb2312'))
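    def endGuid(self):
        self.guid = self.temp
        # Clear the buffer so the guid text does not leak into the next element.
        self.temp = ''
    def endPubdate(self):
        # Ignore the value if leftover URL text ended up in the buffer.
        if not self.temp.startswith('http'):
            self.pubdate = self.temp
        else:
            self.pubdate = ''
        self.temp = ''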
    def startItem(self):
        self.item = True
    def endItem(self):
        self.item = False
        self.fout.write('''
                            </td>
                            </tr>
                            <tr bgcolor="#eeeeee">
                            <td style="padding-top:5px;padding-left:5px;">
                            <A href="''')
        self.fout.write(self.link)
        self.fout.write(''' " target="_blank"> ''')
        self.fout.write(self.guid)
        self.fout.write('''
                        </A>
                        </td>
                        </tr>
                        <tr bgcolor="#eeeeee">
                        <td style="padding-top:5px;padding-left:5px;padding-bottom:5px;"><span>''')
        self.fout.write(self.pubdate)
        self.fout.write('''</span></td>
                        </tr>
                        <tr height="10"><td></td></tr>''')
# Program entry point
if __name__ == '__main__':
    parse('ddt.xml', Website())
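
The dispatch idea in the listing can be tried without ddt.xml or any file output. The following is a reduced sketch, not the author's program: it targets Python 3, the `ItemList` class and the inline `FEED` are invented for illustration, and it writes to an in-memory buffer. Unlike the listing, it escapes text with `html.escape` before writing it into the page.

```python
import io
import xml.sax
from html import escape
from xml.sax.handler import ContentHandler


class Dispatcher:
    """Maps a tag such as <item> to startItem()/endItem() if they exist."""

    def dispatch(self, prefix, name):
        method = getattr(self, prefix + name.capitalize(), None)
        if not callable(method):
            # Fall back to defaultStart()/defaultEnd().
            method = getattr(self, 'default' + prefix.capitalize(), None)
        if callable(method):
            method()

    def startElement(self, name, attrs):
        self.dispatch('start', name)

    def endElement(self, name):
        self.dispatch('end', name)


class ItemList(Dispatcher, ContentHandler):
    """Hypothetical handler: renders each <item> as an HTML list entry."""

    def __init__(self, out):
        ContentHandler.__init__(self)
        self.out = out
        self.in_item = False
        self.temp = ''      # text collected for the current element
        self.title = ''
        self.link = ''

    def characters(self, chars):
        self.temp += chars

    def defaultStart(self):
        self.temp = ''

    def defaultEnd(self):
        self.temp = ''

    def startChannel(self):
        self.out.write('<ul>\n')

    def endChannel(self):
        self.out.write('</ul>\n')

    def startItem(self):
        self.in_item = True

    def endTitle(self):
        if self.in_item:
            self.title = self.temp.strip()
        self.temp = ''

    def endLink(self):
        if self.in_item:
            self.link = self.temp.strip()
        self.temp = ''

    def endItem(self):
        self.in_item = False
        self.out.write('<li><a href="%s">%s</a></li>\n'
                       % (escape(self.link, quote=True), escape(self.title)))


# Made-up feed used only for this sketch.
FEED = b"""<rss><channel><title>Demo</title>
<item><title>Tom &amp; Jerry</title><link>http://example.com/1</link></item>
</channel></rss>"""

buf = io.StringIO()
xml.sax.parseString(FEED, ItemList(buf))
print(buf.getvalue())
```

Writing to a buffer passed in by the caller, rather than opening a fixed file in `__init__`, keeps the handler easy to test. Replacing `buf` with an open file object gives file output like the listing's.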