lxml memory usage when parsing huge XML in Python

Welcome to Python and Stack Overflow!

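It looks like you've followed some good advice by using lxml, and especially etree.iterparse(..), but I think your implementation approaches the problem from the wrong angle. The idea of iterparse(..) is to get away from collecting and storing data, and instead to process tags as they are read in. Your readAllChildren(..) function saves everything to rowList, which keeps growing until it covers the entire document tree. I made a few changes to show what's going on: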

    from lxml import etree

    def parseXml(context, attribList):
        for event, element in context:
            print "%s element: %s" % (event, element)
            fieldMap = {}
            rowList = []
            readAttribs(element, fieldMap, attribList)
            readAllChildren(element, fieldMap, attribList, rowList)
            for row in rowList:
                yield row
            element.clear()

    def readAttribs(element, fieldMap, attribList):
        for attrib in attribList:
            fieldMap[attrib] = element.get(attrib, '')
        print "fieldMap:", fieldMap

    def readAllChildren(element, fieldMap, attribList, rowList):
        for childElem in element:
            print "Found child:", childElem
            readAttribs(childElem, fieldMap, attribList)
            if len(childElem) > 0:
                readAllChildren(childElem, fieldMap, attribList, rowList)
            rowList.append(fieldMap.copy())
            print "len(rowList) =", len(rowList)
            childElem.clear()

    def process_xml_original(xml_file):
        attribList = ['name', 'age', 'id']
        context = etree.iterparse(xml_file, events=("start",))
        for row in parseXml(context, attribList):
            print "Row:", row

Running with some dummy data:

    >>> from cStringIO import StringIO
    >>> test_xml = """
    ... <family>
    ...     <person name="somebody" id="5" />
    ...     <person age="45" />
    ...     <person name="Grandma" age="62">
    ...         <child age="35" id="10" name="Mom">
    ...             <grandchild age="7 and 3/4" />
    ...             <grandchild id="12345" />
    ...         </child>
    ...     </person>
    ...     <something-completely-different />
    ... </family>
    ... """
    >>> process_xml_original(StringIO(test_xml))
    start element: <Element family at 0x105ca58>
    fieldMap: {'age': '', 'name': '', 'id': ''}
    Found child: <Element person at 0x105ca80>
    fieldMap: {'age': '', 'name': 'somebody', 'id': '5'}
    len(rowList) = 1
    Found child: <Element person at 0x105c468>
    fieldMap: {'age': '45', 'name': '', 'id': ''}
    len(rowList) = 2
    Found child: <Element person at 0x105c7b0>
    fieldMap: {'age': '62', 'name': 'Grandma', 'id': ''}
    Found child: <Element child at 0x106e468>
    fieldMap: {'age': '35', 'name': 'Mom', 'id': '10'}
    Found child: <Element grandchild at 0x106e148>
    fieldMap: {'age': '7 and 3/4', 'name': '', 'id': ''}
    len(rowList) = 3
    Found child: <Element grandchild at 0x106e490>
    fieldMap: {'age': '', 'name': '', 'id': '12345'}
    len(rowList) = 4
    len(rowList) = 5
    len(rowList) = 6
    Found child: <Element something-completely-different at 0x106e4b8>
    fieldMap: {'age': '', 'name': '', 'id': ''}
    len(rowList) = 7
    Row: {'age': '', 'name': 'somebody', 'id': '5'}
    Row: {'age': '45', 'name': '', 'id': ''}
    Row: {'age': '7 and 3/4', 'name': '', 'id': ''}
    Row: {'age': '', 'name': '', 'id': '12345'}
    Row: {'age': '', 'name': '', 'id': '12345'}
    Row: {'age': '', 'name': '', 'id': '12345'}
    Row: {'age': '', 'name': '', 'id': ''}
    start element: <Element person at 0x105ca80>
    fieldMap: {'age': '', 'name': '', 'id': ''}
    start element: <Element person at 0x105c468>
    fieldMap: {'age': '', 'name': '', 'id': ''}
    start element: <Element person at 0x105c7b0>
    fieldMap: {'age': '', 'name': '', 'id': ''}
    start element: <Element child at 0x106e468>
    fieldMap: {'age': '', 'name': '', 'id': ''}
    start element: <Element grandchild at 0x106e148>
    fieldMap: {'age': '', 'name': '', 'id': ''}
    start element: <Element grandchild at 0x106e490>
    fieldMap: {'age': '', 'name': '', 'id': ''}
    start element: <Element something-completely-different at 0x106e4b8>
    fieldMap: {'age': '', 'name': '', 'id': ''}

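It's a little hard to read, but you can see that on the first pass it climbs down the whole tree from the root tag, building up rowList for every element in the entire document. It doesn't even stop there: because the element.clear() call comes after the yield statement in parseXml(..), it isn't executed until the second iteration (i.e. the next element in the tree).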
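The timing issue is a property of Python generators in general, not of lxml: code placed after a yield runs only when the consumer asks for the next item. Here is a minimal Python 3 sketch of that behavior. The names rows_then_cleanup and log are made up for illustration, and the "cleanup" entry merely stands in for element.clear().

```python
def rows_then_cleanup(log):
    """Yield two rows, then do 'cleanup' that stands in for element.clear()."""
    for row in ("row-1", "row-2"):
        yield row
    # Runs only once the consumer requests an item after the last row.
    log.append("cleanup")


log = []
gen = rows_then_cleanup(log)

first = next(gen)    # "row-1"; the cleanup has not run
second = next(gen)   # "row-2"; the cleanup still has not run
pending = list(log)  # snapshot of the log at this point: still empty

# Asking for a third item resumes the generator past the loop,
# runs the cleanup, and then signals exhaustion.
leftover = list(gen)

print(pending, log, leftover)
```

In parseXml(..) the same thing happens for each element: the rows are handed out first, and the clear() only runs when the caller comes back for more.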

Incremental processing FTW

A simple fix is to let iterparse(..) do its job: parse iteratively! The following pulls out the same information and processes it incrementally instead:

    def do_something_with_data(data):
        """This just prints it out. Yours will probably be more interesting."""
        print "Got data: ", data

    def process_xml_iterative(xml_file):
        # by using the default 'end' event, you start at the _bottom_ of the tree
        ATTRS = ('name', 'age', 'id')
        for event, element in etree.iterparse(xml_file):
            print "%s element: %s" % (event, element)
            data = {}
            for attr in ATTRS:
                data[attr] = element.get(attr, u"")
            do_something_with_data(data)
            element.clear()
            del element # for extra insurance

Running on the same dummy XML:

    >>> print test_xml
    <family>
        <person name="somebody" id="5" />
        <person age="45" />
        <person name="Grandma" age="62">
            <child age="35" id="10" name="Mom">
                <grandchild age="7 and 3/4" />
                <grandchild id="12345" />
            </child>
        </person>
        <something-completely-different />
    </family>
    >>> process_xml_iterative(StringIO(test_xml))
    end element: <Element person at 0x105cc10>
    Got data:  {'age': u'', 'name': 'somebody', 'id': '5'}
    end element: <Element person at 0x106e468>
    Got data:  {'age': '45', 'name': u'', 'id': u''}
    end element: <Element grandchild at 0x106e148>
    Got data:  {'age': '7 and 3/4', 'name': u'', 'id': u''}
    end element: <Element grandchild at 0x106e490>
    Got data:  {'age': u'', 'name': u'', 'id': '12345'}
    end element: <Element child at 0x106e508>
    Got data:  {'age': '35', 'name': 'Mom', 'id': '10'}
    end element: <Element person at 0x106e530>
    Got data:  {'age': '62', 'name': 'Grandma', 'id': u''}
    end element: <Element something-completely-different at 0x106e558>
    Got data:  {'age': u'', 'name': u'', 'id': u''}
    end element: <Element family at 0x105c6e8>
    Got data:  {'age': u'', 'name': u'', 'id': u''}

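This should give a good boost to both the speed and the memory performance of your script. Also, by hooking the 'end' event, you're free to clear and delete elements as you go, rather than waiting until all children have been processed.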
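To see why 'end' is a convenient place to clear elements, it can help to compare the order in which the two event types arrive. The sketch below is written for Python 3 and, as an assumption made so that it runs without extra dependencies, uses the standard library's xml.etree.ElementTree in place of lxml; its iterparse has a similar interface. The helper tags_in_order and the tiny XML document are made up for illustration.

```python
import io
import xml.etree.ElementTree as ET  # assumption: stdlib stand-in for lxml.etree

XML = b"<family><person name='Grandma'><child name='Mom'/></person></family>"


def tags_in_order(event_name):
    """Return element tags in the order iterparse reports the given event."""
    return [
        element.tag
        for _event, element in ET.iterparse(io.BytesIO(XML), events=(event_name,))
    ]


start_order = tags_in_order("start")  # parents come before their children
end_order = tags_in_order("end")      # children come before their parents

print("start:", start_order)
print("end:  ", end_order)
```

With 'end' events, an element is reported only after everything nested inside it has already been reported, so by the time you see it there is nothing below it left to wait for.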

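Depending on your dataset, it might be a good idea to process only certain types of elements. The root element, for one, probably isn't very meaningful, and other nested elements may also produce a lot of {'age': u'', 'id': u'', 'name': u''} rows.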
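One way to restrict processing to certain element types is to check the tag before building a row. The following Python 3 sketch again uses the standard library's xml.etree.ElementTree as a stand-in for lxml (an assumption made so the example is self-contained). The function iter_selected, the WANTED_TAGS set, and the sample document are hypothetical.

```python
import io
import xml.etree.ElementTree as ET  # assumption: stdlib stand-in for lxml.etree

ATTRS = ("name", "age", "id")
WANTED_TAGS = {"person", "child"}  # hypothetical choice for this sample

SAMPLE = b"""<family>
    <person name="somebody" id="5" />
    <person name="Grandma" age="62">
        <child age="35" id="10" name="Mom" />
    </person>
    <something-completely-different />
</family>"""


def iter_selected(xml_file, wanted=WANTED_TAGS):
    """Yield attribute dicts only for elements whose tag is in `wanted`."""
    for _event, element in ET.iterparse(xml_file):  # default: 'end' events
        row = None
        if element.tag in wanted:
            row = {attr: element.get(attr, "") for attr in ATTRS}
        # Clear every element, wanted or not, and do it before yielding
        # so the cleanup is not postponed until the next iteration.
        element.clear()
        if row is not None:
            yield row


rows = list(iter_selected(io.BytesIO(SAMPLE)))
for row in rows:
    print(row)
```

lxml's own iterparse also accepts a tag argument that can do this kind of filtering for you; check the lxml documentation for its exact behavior before relying on it.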

Or, with SAX

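As an aside, when I read "XML" and "low memory" my mind always jumps straight to SAX, which is another way you could attack this problem. Using the built-in xml.sax module: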

    import xml.sax

    class AttributeGrabber(xml.sax.handler.ContentHandler):
        """SAX Handler which will store selected attribute values."""
        def __init__(self, target_attrs=()):
            self.target_attrs = target_attrs

        def startElement(self, name, attrs):
            print "Found element: ", name
            data = {}
            for target_attr in self.target_attrs:
                data[target_attr] = attrs.get(target_attr, u"")
            # (no xml trees or elements created at all)
            do_something_with_data(data)

    def process_xml_sax(xml_file):
        grabber = AttributeGrabber(target_attrs=('name', 'age', 'id'))
        xml.sax.parse(xml_file, grabber)

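You'll have to evaluate both options based on what works best in your situation (and maybe run a couple of benchmarks, if this is something you'll be doing often).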
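A rough benchmark harness might look like the Python 3 sketch below. Everything in it is an assumption for illustration: the synthetic document from make_xml, the helper names, and the use of the standard library's xml.etree.ElementTree instead of lxml so that it runs anywhere. Note that tracemalloc only sees allocations made through Python's own allocators, so treat the memory figures as a rough indication at best and measure with your real files and your real parser before drawing conclusions.

```python
import io
import time
import tracemalloc
import xml.etree.ElementTree as ET  # assumption: stdlib stand-in for lxml.etree
import xml.sax


def make_xml(n):
    """Build a synthetic document with n <person> elements (made-up data)."""
    people = "".join('<person name="p%d" age="%d" />' % (i, i % 90) for i in range(n))
    return ("<family>%s</family>" % people).encode("utf-8")


def count_iterparse(data):
    """Count elements with iterparse, clearing each one as it ends."""
    count = 0
    for _event, element in ET.iterparse(io.BytesIO(data)):
        count += 1
        element.clear()
    return count


class ElementCounter(xml.sax.ContentHandler):
    """SAX handler that only counts start tags."""

    def __init__(self):
        super().__init__()
        self.count = 0

    def startElement(self, name, attrs):
        self.count += 1


def count_sax(data):
    """Count elements with SAX; no tree is built at all."""
    handler = ElementCounter()
    xml.sax.parse(io.BytesIO(data), handler)
    return handler.count


def measure(func, data):
    """Return (result, seconds, peak_bytes) for one call of func(data)."""
    tracemalloc.start()
    started = time.perf_counter()
    result = func(data)
    seconds = time.perf_counter() - started
    _current, peak = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    return result, seconds, peak


data = make_xml(2000)
for func in (count_iterparse, count_sax):
    result, seconds, peak = measure(func, data)
    print("%-16s elements=%d  time=%.4fs  peak=%d bytes"
          % (func.__name__, result, seconds, peak))
```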

Be sure to follow up with how things work out!

Edit based on follow-up comments

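Implementing either of the above solutions may require some changes to the overall structure of your code, but anything you have should still be doable. For instance, to process "rows" in batches, you could have: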

    def process_xml_batch(xml_file, batch_size=10):
        ATTRS = ('name', 'age', 'id')
        batch = []
        for event, element in etree.iterparse(xml_file):
            data = {}
            for attr in ATTRS:
                data[attr] = element.get(attr, u"")
            batch.append(data)
            element.clear()
            del element

            if len(batch) == batch_size:
                do_something_with_batch(batch)
                # Or, if you want this to be a generator:
                # yield batch
                batch = []
        if batch:
            # there are leftover items
            do_something_with_batch(batch) # Or, yield batch
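The generator variant mentioned in the comments could be sketched as follows, in Python 3. As an assumption made to keep the example dependency-free, it uses the standard library's xml.etree.ElementTree rather than lxml; iter_batches and the generated sample document are hypothetical names and data.

```python
import io
import xml.etree.ElementTree as ET  # assumption: stdlib stand-in for lxml.etree

ATTRS = ("name", "age", "id")


def iter_batches(xml_file, batch_size=10):
    """Yield lists of at most batch_size attribute dicts, one dict per element."""
    batch = []
    for _event, element in ET.iterparse(xml_file):
        batch.append({attr: element.get(attr, "") for attr in ATTRS})
        element.clear()
        if len(batch) == batch_size:
            yield batch
            batch = []  # start a fresh list; the consumer keeps the old one
    if batch:
        yield batch  # leftover items


# Seven <person> elements plus the <family> root make eight rows in total.
people = "".join('<person id="%d" />' % i for i in range(7))
sample = ("<family>%s</family>" % people).encode("utf-8")

sizes = [len(batch) for batch in iter_batches(io.BytesIO(sample), batch_size=3)]
print(sizes)  # → [3, 3, 2]
```

Rebinding batch to a new list after each yield, instead of emptying the old one in place, means a consumer that holds on to a batch will not see it change underneath it.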

Sharing is welcome; when reposting, please credit the source: 内存溢出

Original URL: http://outofmemory.cn/zaji/5649239.html
