Processing XML Data in Python: A Detailed Walkthrough

Posted in Python on March 21, 2017

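This article walks through a worked example of processing XML data with Python, shared here for reference. The details are as follows:

Everything here was done on Python 3.

When processing XML with Python, the first problem you run into is encoding.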

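Python itself ships a gb2312 codec, but the expat parser behind xml.etree.ElementTree does not accept multi-byte encodings such as gb2312, so parsing an XML file declared with encoding="gb2312" fails with an error. The encoding of the file on disk can also cause an exception when Python reads it, in which case the encoding has to be given explicitly when opening the file. On top of that, there is the Chinese text contained in the XML nodes.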

I keep the handling simple here: rewrite the encoding named in the XML declaration.

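#!/usr/bin/env python
import os, sys
import re
def replaceXmlEncoding(filepath, oldEncoding='gb2312', newEncoding='utf-8'):
  # Read raw bytes so the check below does not depend on the platform's default encoding.
  with open(filepath, mode='rb') as f:
    raw = f.read()
  # Skip files whose first bytes no longer name the old encoding (already converted).
  if oldEncoding.lower().encode('ascii') not in raw[:200].lower():
    return
  content = raw.decode(oldEncoding)
  # Rewrite only the first occurrence, which is the one in the XML declaration.
  content = re.sub(oldEncoding, newEncoding, content, count=1, flags=re.IGNORECASE)
  # Save the text in the new encoding so the bytes match the declaration.
  with open(filepath, mode='w', encoding=newEncoding, newline='') as f:
    f.write(content)
if __name__ == "__main__":
  replaceXmlEncoding('./ActivateAccount.xml')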
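
The same header rewrite can also be done in memory, leaving the file on disk untouched. The following is only a minimal sketch under assumptions: the helper name parse_gb2312_xml and the sample document are hypothetical, and the input is assumed to really be GB2312-encoded bytes.

```python
import re
import xml.etree.ElementTree as ET

def parse_gb2312_xml(raw_bytes):
    """Parse GB2312-encoded XML bytes (hypothetical helper, not from the article)."""
    # Python's own gb2312 codec turns the bytes into text.
    text = raw_bytes.decode("gb2312")
    # Rewrite only the encoding named in the XML declaration.
    text = re.sub(r'encoding="gb2312"', 'encoding="utf-8"', text,
                  count=1, flags=re.IGNORECASE)
    # Hand the parser UTF-8 bytes that now match the declaration.
    return ET.fromstring(text.encode("utf-8"))

# Hypothetical sample document, stored as GB2312 bytes.
sample = '<?xml version="1.0" encoding="gb2312"?><ResItem ID="登录"/>'.encode("gb2312")
root = parse_gb2312_xml(sample)
print(root.tag, root.get("ID"))
```

This avoids modifying the source files, at the cost of repeating the conversion every time a file is parsed.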

Next, xml.etree.ElementTree is used to work with the XML file.
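
Before the full listing, here is a small sketch of the ElementTree calls it relies on: find, findall, get, and iteration over child elements. The sample XML is hypothetical; the real structure of ActivateAccount.xml is not shown in this article, so the element and attribute names are only modeled on what the extractor code looks for.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample, shaped after the names the extractor code queries.
SAMPLE = """
<Resource>
  <ResItem ID="ActivateAccount">
    <Property Name="Caption" Value="~IDS_ACTIVATE"/>
    <ResItem ID="OkButton"/>
  </ResItem>
</Resource>
"""

doc_root = ET.fromstring(SAMPLE)

# find() returns the first direct child with the given tag, or None.
item = doc_root.find("ResItem")
print(item.get("ID"))

# findall() returns a list of direct children with the given tag.
nested = item.findall("ResItem")
print(len(nested))

# Iterating over an element yields its direct children;
# get() reads an attribute and returns None if it is missing.
captions = [child.get("Value") for child in item if child.get("Name") == "Caption"]
print(captions)
```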

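Defining a __call__ method in a class makes instances of that class callable, as in the last few lines of the xmlExtractor listing below, inside the if __name__ == "__main__": block. It also nicely illustrates that in the Python world everything is an object, including objects themselves :)

I have always found the __main__ block really handy for quick testing.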

#!/usr/bin/env python
import os, re
import xml.etree.ElementTree as etree
Locale_Path = "./locale.txt"
class xmlExtractor(object):
  def __call__(self, filepath):
    retDict = {}
    # Count lines in binary mode so the count does not depend on the file's encoding.
    with open(filepath, 'rb') as f:
      retDict['Line'] = len(f.readlines())
    # Passing the path lets the parser read the bytes and honor the XML declaration.
    tree = etree.parse(filepath)
    root = tree.find("ResItem")
    retDict['Title'] = root.get("ID")
    # Direct child ResItem elements, plus the top-level ResItem itself.
    retDict['ResItemCount'] = len(root.findall("ResItem")) + 1
    retDict['ChineseTip'] = 'None'
    for child in root:
      if child.get("Name") == "Caption":
        value = child.get("Value", "")
        if len(value) > 1:
          # A leading '~' is dropped to obtain the string id.
          title = value[1:] if value[0] == '~' else value
          with open(Locale_Path) as localeFile:
            chs = localeFile.read()
          # Look up <String id="...">text</String> in the locale file.
          pattern = '<String id="' + re.escape(title) + '">[^>]+>'
          m = re.search(pattern, chs)
          if m is not None:
            # Strip the tags and keep only the text.
            retDict['ChineseTip'] = re.sub('<[^>]+>', '', m.group(0))
    return retDict
if __name__ == "__main__":
  fo = xmlExtractor()
  d = fo('./ActivateAccount.xml')
  print(d)
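
The __call__ mechanism used by xmlExtractor above can be seen in isolation in the sketch below. The Greeter class is purely illustrative and not part of the original scripts.

```python
class Greeter:
    """Illustrative class: its instances can be called like functions."""

    def __init__(self, greeting):
        self.greeting = greeting

    def __call__(self, name):
        # Runs when an instance is called: greet("XML")
        return f"{self.greeting}, {name}!"

greet = Greeter("Hello")
message = greet("XML")   # equivalent to greet.__call__("XML")
print(message)
print(callable(greet))
```

Compared with a plain function, a callable instance can carry state (here, the greeting) between calls.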

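Finally comes the entry script. It imports the two files above, then uses xml.dom.minidom and os.listdir to process the XML files recursively and build a result set.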
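
As a rough sketch of how such a result document is built with xml.dom.minidom, the snippet below creates a root element, adds one child per record, and serializes the result. The record values are made up for illustration; only the element names follow the entry script.

```python
from xml.dom import minidom

doc = minidom.Document()
root = doc.createElement("XmlResourceFile")
doc.appendChild(root)

# Hypothetical records; the real script fills these in from each XML file.
records = [{"Filename": "a.xml", "LineCount": 12}]

for record in records:
    node = doc.createElement("XmlFile")
    root.appendChild(node)
    for key, value in record.items():
        child = doc.createElement(key)
        # Attribute values must be strings.
        child.setAttribute("Value", str(value))
        node.appendChild(child)

xml_text = doc.toprettyxml(indent="  ")
print(xml_text)
```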

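I have always found Python's UnboundLocalError rather interesting. It comes from name scoping: assigning to a name anywhere inside a function makes that name local to the whole function, so reading it before the assignment fails. That is why the function below declares its counters as global.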

#!/usr/bin/env python
import os
from xmlExtractor import *
from replaceXmlEncoding import *
from xml.dom import minidom
doc = minidom.Document()
extractor = xmlExtractor()
totalLines = 0
totalResItemCnt = 0
totalXmlFileCnt = 0
totalErrorCnt = 0
errorFileList = []
xmlRoot = doc.createElement("XmlResourceFile")
doc.appendChild(xmlRoot)
def myWalkDir(level, path):
  # The counters are rebound with += below, so they have to be declared global.
  global doc, extractor, totalLines, totalResItemCnt, totalXmlFileCnt
  global totalErrorCnt, errorFileList
  global xmlRoot
  for i in os.listdir(path):
    # os.path.join keeps the script independent of the path separator.
    fullPath = os.path.join(path, i)
    if i.endswith('.xml'):
      totalXmlFileCnt += 1
      try:
        # First switch the XML encoding from gb2312 to utf-8.
        replaceXmlEncoding(fullPath)
        # Then extract the required information from the XML document.
        info = extractor(fullPath)
        # Create the nodes only if the two calls above raised no exception.
        # Values go into a Value attribute; doc.createTextNode(...) would
        # store them as element text instead.
        xmlNode = doc.createElement("XmlFile")
        xmlRoot.appendChild(xmlNode)
        xmlName = doc.createElement("Filename")
        xmlName.setAttribute('Value', i)
        xmlNode.appendChild(xmlName)
        filePath = doc.createElement("Filepath")
        # path[34:] drops the first 34 characters of the path.
        filePath.setAttribute('Value', path[34:])
        xmlNode.appendChild(filePath)
        titleNode = doc.createElement("Title")
        titleNode.setAttribute('Value', str(info['Title']))
        xmlNode.appendChild(titleNode)
        chsNode = doc.createElement("ChineseTip")
        chsNode.setAttribute('Value', str(info['ChineseTip']))
        xmlNode.appendChild(chsNode)
        resItemNode = doc.createElement("ResItemCount")
        resItemNode.setAttribute('Value', str(info['ResItemCount']))
        xmlNode.appendChild(resItemNode)
        lineNode = doc.createElement("LineCount")
        lineNode.setAttribute('Value', str(info['Line']))
        xmlNode.appendChild(lineNode)
        descNode = doc.createElement("Description")
        descNode.setAttribute('Value', '')
        xmlNode.appendChild(descNode)
      except Exception as errorDetail:
        totalErrorCnt += 1
        errorFileList.append(fullPath)
        print(fullPath, errorDetail)
    if os.path.isdir(fullPath):
      myWalkDir(level+1, fullPath)
if __name__ == "__main__":
  path = os.path.join(os.getcwd(), 'themes')
  myWalkDir(0, path)
  print(totalXmlFileCnt, totalErrorCnt)
  # toprettyxml() puts no encoding in the declaration, so write the file as UTF-8.
  with open("./xmlResourceList.xml", "w", encoding="utf-8") as resultXml:
    resultXml.write(doc.toprettyxml(indent="  "))
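
The global declarations in myWalkDir above are what keep the counters from raising UnboundLocalError. A minimal sketch of the difference, using hypothetical function names:

```python
counter = 0

def bump_broken():
    # The augmented assignment makes `counter` local to this function,
    # so reading its old value fails before anything is assigned.
    counter += 1

def bump_fixed():
    # Declaring the name global makes += rebind the module-level variable.
    global counter
    counter += 1

try:
    bump_broken()
except UnboundLocalError as exc:
    error_name = type(exc).__name__

bump_fixed()
print(error_name, counter)
```

Names that are only read or mutated in place, such as errorFileList with its append call, work without a global declaration; only names that are rebound need it.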

- Author -

jasonblog
