Parsing XML

Link management

Link snapshot platform

Enter a web page link and a snapshot is generated automatically
Manage web page links with tags

Option 1: ElementTree

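Python can parse XML documents in several ways (the examples here use a sample Atom feed, examples/feed.xml). It ships with DOM and SAX parsers, but the focus here is on another library called ElementTree.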

>>> import xml.etree.ElementTree as etree
>>> tree = etree.parse('examples/feed.xml')
>>> root = tree.getroot()
>>> root
<Element {http://www.w3.org/2005/Atom}feed at cd1eb0>


ElementTree is part of the Python standard library; it lives in xml.etree.ElementTree.


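The parse() function is the main entry point of the ElementTree library. It takes a filename or a file-like (stream) object as its argument.

parse() parses the entire document at once. If memory is tight, an XML document can also be parsed incrementally.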


parse() returns an object that represents the whole document. This is not the root element. To get a reference to the root element, call the getroot() method.
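A minimal sketch of the two points above: passing a stream to parse() instead of a filename, and parsing incrementally with iterparse(). The inline SAMPLE feed is a made-up stand-in for examples/feed.xml, and the names ATOM and entry_titles are chosen only for this illustration.

```python
import io
import xml.etree.ElementTree as etree

# Hypothetical stand-in for examples/feed.xml, kept tiny on purpose.
SAMPLE = b"""<feed xmlns="http://www.w3.org/2005/Atom" xml:lang="en">
  <title>dive into mark</title>
  <entry><title>first</title></entry>
  <entry><title>second</title></entry>
</feed>"""

ATOM = '{http://www.w3.org/2005/Atom}'

# parse() accepts a file-like object as well as a filename.
tree = etree.parse(io.BytesIO(SAMPLE))
root = tree.getroot()   # the tree object itself is not the root element
print(root.tag)

# iterparse() yields each element once its end tag has been read, so an
# entry can be processed and then cleared instead of being kept in memory.
entry_titles = []
for event, elem in etree.iterparse(io.BytesIO(SAMPLE), events=('end',)):
    if elem.tag == ATOM + 'entry':
        entry_titles.append(elem.find(ATOM + 'title').text)
        elem.clear()    # drop the children of the finished entry
print(entry_titles)
```

For a feed this small the incremental version gains nothing; it only starts to matter when the document is too large to hold comfortably in memory.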

An XML element name consists of a namespace and a local name. ElementTree writes element names as {namespace}localname. Every element in this document is in the Atom namespace, so the root element is represented as {http://www.w3.org/2005/Atom}feed.
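To make the {namespace}localname notation concrete, here is a small sketch that takes a tag apart again. split_tag is a hypothetical helper written for this example; it is not part of ElementTree.

```python
import xml.etree.ElementTree as etree


def split_tag(tag):
    """Split '{namespace}localname' into (namespace, localname).

    Hypothetical helper, not an ElementTree function. Tags without a
    namespace are returned with None in place of the namespace.
    """
    if tag.startswith('{'):
        namespace, _, local = tag[1:].partition('}')
        return namespace, local
    return None, tag


root = etree.fromstring('<feed xmlns="http://www.w3.org/2005/Atom"/>')
namespace, local = split_tag(root.tag)
print(namespace)
print(local)
```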

Elements are lists

In the ElementTree API, an element behaves like a list. The items in the list are the element's child elements.

# continued from the previous example
>>> root.tag                        
'{http://www.w3.org/2005/Atom}feed'
>>> len(root)                       
8
>>> for child in root:              
...   print(child)                  
... 
<Element {http://www.w3.org/2005/Atom}title at e2b5d0>
<Element {http://www.w3.org/2005/Atom}subtitle at e2b4e0>
<Element {http://www.w3.org/2005/Atom}id at e2b6c0>
<Element {http://www.w3.org/2005/Atom}updated at e2b6f0>
<Element {http://www.w3.org/2005/Atom}link at e2b4b0>
<Element {http://www.w3.org/2005/Atom}entry at e2b720>
<Element {http://www.w3.org/2005/Atom}entry at e2b510>
<Element {http://www.w3.org/2005/Atom}entry at e2b750>

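The "length" of the root element is its number of child elements. The element can be iterated over to visit each child. Note that the list contains only direct children: a child may have children of its own, but those are not included.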
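The difference between direct children and deeper descendants can be sketched with a miniature document. The XML string here is invented for the illustration; iter() is the ElementTree method that walks the element and everything below it.

```python
import xml.etree.ElementTree as etree

# Hypothetical miniature feed: two direct children, one of which has a child.
root = etree.fromstring(
    '<feed><title>t</title><entry><title>nested</title></entry></feed>')

direct = [child.tag for child in root]            # direct children only
everything = [elem.tag for elem in root.iter()]   # root plus all descendants

print(len(root), direct)
print(everything)
print(root[1][0].text)   # index twice to reach the nested title
```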

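Attributes are dictionaries

XML is not just a collection of elements; each element also has its own set of attributes. Once you have a reference to an element, its attributes are available as a Python dictionary.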

# continuing from the previous example
>>> root.attrib                           
{'{http://www.w3.org/XML/1998/namespace}lang': 'en'}
>>> root[4]            # the fifth child element, link
<Element {http://www.w3.org/2005/Atom}link at e181b0>
>>> root[4].attrib                        
{'href': 'http://diveintomark.org/',
 'type': 'text/html',
 'rel': 'alternate'}
>>> root[3]                               
<Element {http://www.w3.org/2005/Atom}updated at e2b4e0>
>>> root[3].attrib      # the updated element has no attributes, so this is empty
{}
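
Besides the attrib dictionary, elements also offer a few dictionary-style shortcuts. The following is a small sketch using a hand-written link element; the hreflang attribute is added purely for illustration.

```python
import xml.etree.ElementTree as etree

link = etree.fromstring(
    '<link rel="alternate" type="text/html" href="http://diveintomark.org/"/>')

print(link.attrib['href'])             # plain dictionary lookup
print(link.get('title', 'untitled'))   # get() takes a default, like dict.get()

link.set('hreflang', 'en')             # attributes can be added after parsing
print(sorted(link.keys()))
```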

Searching for nodes within an XML document

Often we need to find specific elements within an XML document. ElementTree can do that too.

>>> import xml.etree.ElementTree as etree
>>> tree = etree.parse('examples/feed.xml')
>>> root = tree.getroot()
>>> root.findall('{http://www.w3.org/2005/Atom}entry')    
[<Element {http://www.w3.org/2005/Atom}entry at e2b4e0>,
 <Element {http://www.w3.org/2005/Atom}entry at e2b510>,
 <Element {http://www.w3.org/2005/Atom}entry at e2b540>]
>>> tree.findall('{http://www.w3.org/2005/Atom}entry')
[<Element {http://www.w3.org/2005/Atom}entry at e2b4e0>,
 <Element {http://www.w3.org/2005/Atom}entry at e2b510>,
 <Element {http://www.w3.org/2005/Atom}entry at e2b540>]
>>> root.tag
'{http://www.w3.org/2005/Atom}feed'
>>> root.findall('{http://www.w3.org/2005/Atom}feed') # the root has no child element with this tag
[]
>>> root.findall('{http://www.w3.org/2005/Atom}author') # author elements are not direct children
[]
>>> all_authors = tree.findall('.//{http://www.w3.org/2005/Atom}author')
>>> all_authors
[<Element '{http://www.w3.org/2005/Atom}author' at 0x7fb709060838>,
 <Element '{http://www.w3.org/2005/Atom}author' at 0x7fb709060f70>,
 <Element '{http://www.w3.org/2005/Atom}author' at 0x7fb7090623c0>]
>>> entries = tree.findall('{http://www.w3.org/2005/Atom}entry')          
>>> len(entries)
3
>>> title_element = entries[0].find('{http://www.w3.org/2005/Atom}title') 
>>> title_element.text
'Dive into history, 2009 edition'
>>> foo_element = entries[0].find('{http://www.w3.org/2005/Atom}foo')     
>>> foo_element
>>> type(foo_element)
<class 'NoneType'>


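The findall() method finds child elements that match a given query. If the query starts with .//, the search covers every nesting level; otherwise only direct children are searched.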

For convenience, the tree object (the return value of etree.parse()) has several methods that mirror those of the root element, so tree.findall() in the example is equivalent to tree.getroot().findall().


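The find() method returns the first matching element, or None if nothing matches. It is useful when we expect only one match, or when there are several matches but only the first one matters.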
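The search rules can be condensed into one self-contained sketch. The inline feed and the ATOM constant are assumptions made for the example, not part of the original sample file.

```python
import xml.etree.ElementTree as etree

ATOM = '{http://www.w3.org/2005/Atom}'
root = etree.fromstring(
    '<feed xmlns="http://www.w3.org/2005/Atom">'
    '<entry><author><name>Mark</name></author><title>one</title></entry>'
    '<entry><title>two</title></entry>'
    '</feed>')

direct_authors = root.findall(ATOM + 'author')        # direct children only
all_authors = root.findall('.//' + ATOM + 'author')   # any nesting level

first_title = root.find('.//' + ATOM + 'title')       # first match only
missing = root.find(ATOM + 'foo')                     # nothing matches

# Test the result of find() against None explicitly before using it.
if missing is None:
    print('no foo element')
print(len(direct_authors), len(all_authors), first_title.text)
```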

Generating XML

>>> import xml.etree.ElementTree as etree
>>> new_feed = etree.Element('{http://www.w3.org/2005/Atom}feed',     
...     attrib={'{http://www.w3.org/XML/1998/namespace}lang': 'en'})  
>>> print(etree.tostring(new_feed))                                   
<ns0:feed xmlns:ns0='http://www.w3.org/2005/Atom' xml:lang='en'/>

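A note on the last command: at any time, ElementTree's tostring() function can serialize any element (along with its children).

Technically, the serialization ElementTree produces is accurate, but it is not ideal. The sample XML document given at the start of this chapter defines a default namespace (xmlns='http://www.w3.org/2005/Atom'). For documents in which every element is in the same namespace, such as Atom feeds, a default namespace is very useful: the namespace is declared once, and each element can then be written with just its local name (<feed>, <link>, <entry>). There is no need for prefixes unless you want to declare elements from another namespace.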

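An XML parser does not "see" any difference between a document that uses a default namespace and one that uses a prefixed namespace. The DOM of the current serialization:

<ns0:feed xmlns:ns0='http://www.w3.org/2005/Atom' xml:lang='en'/>

is identical to the DOM of this serialization:

<feed xmlns='http://www.w3.org/2005/Atom' xml:lang='en'/>

The second form is clearly more compact, and for something like an Atom feed, more compact data is cheaper to transfer.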
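One way to check this equivalence is to parse both serializations and compare what the parser hands back. This sketch uses the standard-library ElementTree rather than a DOM, on the assumption that the same tag and attribute names indicate the same parsed structure.

```python
import xml.etree.ElementTree as etree

prefixed = etree.fromstring(
    "<ns0:feed xmlns:ns0='http://www.w3.org/2005/Atom' xml:lang='en'/>")
default = etree.fromstring(
    "<feed xmlns='http://www.w3.org/2005/Atom' xml:lang='en'/>")

# After parsing, the prefix is gone: both elements carry the same
# {namespace}localname tag and the same attribute dictionary.
same_tag = prefixed.tag == default.tag
same_attrib = prefixed.attrib == default.attrib
print(prefixed.tag, same_tag, same_attrib)
```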

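Limitations of ElementTree

ElementTree offers only "limited XPath support". XPath is a W3C standard for querying XML documents. ElementTree's query syntax is close to XPath, but there are some differences. For more complex queries, consider the option below.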
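As a rough guide to where the limits lie, the sketch below shows queries that the standard-library subset does handle in current Python versions: attribute predicates, child predicates, and a prefix map passed as the second argument to findall(). The inline feed and the atom prefix are assumptions of this example. Expressions that use XPath functions or text() steps fall outside this subset.

```python
import xml.etree.ElementTree as etree

NS = {'atom': 'http://www.w3.org/2005/Atom'}   # prefix chosen for this sketch
root = etree.fromstring(
    '<feed xmlns="http://www.w3.org/2005/Atom">'
    '<link rel="alternate" href="http://diveintomark.org/"/>'
    '<entry>'
    '<author><name>Mark</name><uri>http://diveintomark.org/</uri></author>'
    '<link href="http://example.org/entry"/>'
    '</entry>'
    '<entry><author><name>Anonymous</name></author></entry>'
    '</feed>')

# Elements that have an href attribute, at any depth.
with_href = root.findall('.//atom:link[@href]', NS)

# Elements whose href attribute has one specific value.
exact = root.findall(".//atom:link[@href='http://diveintomark.org/']", NS)

# author elements that have a uri child element.
authors_with_uri = root.findall('.//atom:author[atom:uri]', NS)

print(len(with_href), len(exact), len(authors_with_uri))
```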

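Option 2: lxml (recommended)

lxml is an open-source third-party library built on the popular libxml2 parser. It provides an ElementTree-compatible API, extends it with full XPath 1.0 support, and refines a number of other details. An installer is available for Windows; Linux users are advised to install precompiled binaries from their distribution's repositories with a tool such as yum or apt-get. Otherwise, you will need to install it manually.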

Parsing XML

>>> from lxml import etree                   
>>> tree = etree.parse('examples/feed.xml')  
>>> root = tree.getroot()                    
>>> root.findall('{http://www.w3.org/2005/Atom}entry')  
[<Element {http://www.w3.org/2005/Atom}entry at e2b4e0>,
 <Element {http://www.w3.org/2005/Atom}entry at e2b510>,
 <Element {http://www.w3.org/2005/Atom}entry at e2b540>]

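Once lxml is imported, it offers the same API as the built-in ElementTree library.

parse() function: same as ElementTree.

getroot() method: the same.

findall() method: exactly the same.

For large XML documents, lxml is noticeably faster than the built-in ElementTree. If you only use the ElementTree API and want the fastest available implementation, try importing lxml and fall back to the built-in ElementTree.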

try:
    from lxml import etree
except ImportError:
    import xml.etree.ElementTree as etree
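
The fallback only works if the rest of the code sticks to the API both libraries share. A minimal sketch, assuming an inline feed written for this example, that runs the same way under either import:

```python
try:
    from lxml import etree
except ImportError:
    import xml.etree.ElementTree as etree

ATOM = '{http://www.w3.org/2005/Atom}'

# fromstring() and findall() with {namespace}localname tags exist in both
# libraries, so nothing below depends on which import succeeded.
root = etree.fromstring(
    b'<feed xmlns="http://www.w3.org/2005/Atom"><entry/><entry/></feed>')
entries = root.findall(ATOM + 'entry')

print(etree.__name__, len(entries))   # the module name shows which one was used
```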

A more powerful findall()

But lxml is more than a faster ElementTree. Its findall() method supports more complex expressions.

>>> import lxml.etree                                                                   
>>> tree = lxml.etree.parse('examples/feed.xml')
>>> tree.findall('//{http://www.w3.org/2005/Atom}*[@href]')                             
[<Element {http://www.w3.org/2005/Atom}link at eeb8a0>,
 <Element {http://www.w3.org/2005/Atom}link at eeb990>,
 <Element {http://www.w3.org/2005/Atom}link at eeb960>,
 <Element {http://www.w3.org/2005/Atom}link at eeb9c0>]
>>> tree.findall("//{http://www.w3.org/2005/Atom}*[@href='http://diveintomark.org/']")  
[<Element {http://www.w3.org/2005/Atom}link at eeb930>]
>>> NS = '{http://www.w3.org/2005/Atom}'
>>> tree.findall('//{NS}author[{NS}uri]'.format(NS=NS)) # format() keeps the namespace from being repeated
[<Element {http://www.w3.org/2005/Atom}author at eeba80>,
 <Element {http://www.w3.org/2005/Atom}author at eebba0>]

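The third command searches the whole document for all elements in the Atom namespace that have an href attribute.

The // at the start of the query means "search the whole document, not just the children of the root element."

{http://www.w3.org/2005/Atom} means "search only within the Atom namespace."

* means "elements with any local name."

[@href] means "that have an href attribute."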

The fourth command finds all Atom elements that have an href attribute whose value is http://diveintomark.org/.

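The last query, built with simple string formatting (otherwise the compound query would get very long), searches for Atom author elements that have an Atom uri element as a child.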

Using XPath expressions

lxml also integrates support for arbitrary XPath 1.0 expressions. For example:

>>> import lxml.etree
>>> tree = lxml.etree.parse('examples/feed.xml')
>>> NSMAP = {'atom': 'http://www.w3.org/2005/Atom'}                    
>>> entries = tree.xpath("//atom:category[@term='accessibility']/..",  
...     namespaces=NSMAP)
>>> entries                                                            
[<Element {http://www.w3.org/2005/Atom}entry at e2b630>]
>>> entry = entries[0]
>>> entry.xpath('./atom:title/text()', namespaces=NSMAP)               
['Accessibility is a harsh mistress']

Generating XML

The built-in ElementTree library offers no fine-grained control over how namespaced elements are serialized, but lxml does.

>>> import lxml.etree
>>> NSMAP = {None: 'http://www.w3.org/2005/Atom'}                     
>>> new_feed = lxml.etree.Element('feed', nsmap=NSMAP)                
>>> print(lxml.etree.tounicode(new_feed))                             
<feed xmlns='http://www.w3.org/2005/Atom'/>
>>> new_feed.set('{http://www.w3.org/XML/1998/namespace}lang', 'en') # set() adds an attribute to an element at any time
>>> print(lxml.etree.tounicode(new_feed))
<feed xmlns='http://www.w3.org/2005/Atom' xml:lang='en'/>

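In this example, only the nsmap argument is specific to lxml; it controls the namespace prefixes used in the serialized output.

Can an XML document have only one element? Of course not. We can create child elements.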

>>> title = lxml.etree.SubElement(new_feed, 'title',          
...     attrib={'type':'html'})                               
>>> print(lxml.etree.tounicode(new_feed))
<feed xmlns='http://www.w3.org/2005/Atom' xml:lang='en'><title type='html'/></feed>
>>> title.text = 'dive into &hellip;'                         
>>> print(lxml.etree.tounicode(new_feed))                     
<feed xmlns='http://www.w3.org/2005/Atom' xml:lang='en'><title type='html'>dive into &amp;hellip;</title></feed>
>>> print(lxml.etree.tounicode(new_feed, pretty_print=True))  
<feed xmlns='http://www.w3.org/2005/Atom' xml:lang='en'>
<title type='html'>dive into &amp;hellip;</title>
</feed>

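To create a child of an existing element, call SubElement(). It requires only two arguments: the parent element (new_feed in this example) and the name of the child element. The child inherits the namespace mapping of its parent, so there is no need to redeclare the namespace or prefix here.

We can also pass it a dictionary of attributes. The keys are attribute names and the values are attribute values (as in attrib={'type':'html'} above).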

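The title element is now serialized together with its text content. Any content that contains a < or & character must be escaped when it is serialized. lxml handles the escaping automatically.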
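The same automatic escaping can be observed with the standard-library ElementTree, which is used here only so that the sketch runs without lxml installed. The text value is made up for the illustration.

```python
import xml.etree.ElementTree as etree

feed = etree.Element('feed')
title = etree.SubElement(feed, 'title', attrib={'type': 'html'})
title.text = 'dive into &hellip; <b>history</b>'   # raw text, not yet escaped

# encoding='unicode' makes tostring() return str instead of bytes.
serialized = etree.tostring(feed, encoding='unicode')
print(serialized)

# Parsing the serialized form gives back the original, unescaped text.
round_trip = etree.fromstring(serialized).find('title').text
print(round_trip)
```

The text is assigned as a plain string and the serializer does the escaping, so the code never has to build escaped markup by hand.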

xmlwitch: another third-party library for generating XML. It makes heavy use of the with statement to make XML-generating code more readable.