Using Python iterparse for Large XML Files

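Try Liza Daly's fast_iter. After processing an element, elem, it calls elem.clear() to remove the element's descendants, and it also removes the element's preceding siblings.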

from lxml import etree


def fast_iter(context, func, *args, **kwargs):
    """
    http://lxml.de/parsing.html#modifying-the-tree
    based on Liza Daly's fast_iter
    http://www.ibm.com/developerworks/xml/library/x-hiperfparse/
    See also http://effbot.org/zone/element-iterparse.htm
    """
    for event, elem in context:
        func(elem, *args, **kwargs)
        # It's safe to call clear() here because no descendants will be
        # accessed
        elem.clear()
        # Also eliminate now-empty references from the root node to elem
        for ancestor in elem.xpath('ancestor-or-self::*'):
            while ancestor.getprevious() is not None:
                del ancestor.getparent()[0]
    del context


def process_element(elem):
    # Read what you need from elem here; fast_iter clears it afterwards
    print(elem.xpath('description/text()'))


# MYFILE is a placeholder for the path (or file object) of your XML file
context = etree.iterparse(MYFILE, tag='item')
fast_iter(context, process_element)

Daly's article is an excellent read, especially if you are processing large XML files.
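As a rough, self-contained illustration of how extra arguments reach the callback, the sketch below runs the same cleanup strategy on a small in-memory document. The sample data, the collect_description callback and the descriptions list are made up for this example and are not part of Daly's article; a real use would point iterparse at a large file instead.

```python
import io

from lxml import etree


def fast_iter(context, func, *args, **kwargs):
    """Same cleanup strategy as the modified fast_iter shown earlier."""
    for event, elem in context:
        func(elem, *args, **kwargs)
        elem.clear()
        for ancestor in elem.xpath('ancestor-or-self::*'):
            while ancestor.getprevious() is not None:
                del ancestor.getparent()[0]
    del context


def collect_description(elem, sink):
    # Hypothetical callback: copy the text out before fast_iter clears elem.
    sink.append(elem.findtext('description'))


# Hypothetical sample data standing in for a large file on disk.
SAMPLE = b"""<feed>
  <item><description>first</description></item>
  <item><description>second</description></item>
  <item><description>third</description></item>
</feed>"""

descriptions = []
context = etree.iterparse(io.BytesIO(SAMPLE), events=('end',), tag='item')
# Positional arguments after func are forwarded to the callback.
fast_iter(context, collect_description, descriptions)
print(descriptions)  # ['first', 'second', 'third']
```

The callback has to copy out whatever it needs (here, plain strings), because the element itself is emptied as soon as the callback returns.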

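Edit: The fast_iter posted above is a modified version of Daly's fast_iter. After processing an element, it is more aggressive about removing other elements that are no longer needed.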

The script below shows the difference in behavior. Note in particular that orig_fast_iter does not delete the A1 element, while mod_fast_iter does delete it, thus saving more memory.

import io
import textwrap

import lxml.etree as ET


def setup_ABC():
    content = textwrap.dedent('''
      <root>
        <A1>
          <B1></B1>
          <C>1<D1></D1></C>
          <E1></E1>
        </A1>
        <A2>
          <B2></B2>
          <C>2<D></D></C>
          <E2></E2>
        </A2>
      </root>
        ''')
    # io.BytesIO needs bytes under Python 3, so encode the text
    return content.encode('utf-8')


def show(elem):
    # Serialize as text and without the trailing whitespace (the tail),
    # so the printed lines match the comments below
    return ET.tostring(elem, encoding='unicode', with_tail=False)


def study_fast_iter():
    def orig_fast_iter(context, func, *args, **kwargs):
        for event, elem in context:
            print('Processing {e}'.format(e=show(elem)))
            func(elem, *args, **kwargs)
            print('Clearing {e}'.format(e=show(elem)))
            elem.clear()
            while elem.getprevious() is not None:
                print('Deleting {p}'.format(
                    p=(elem.getparent()[0]).tag))
                del elem.getparent()[0]
        del context

    def mod_fast_iter(context, func, *args, **kwargs):
        """
        http://www.ibm.com/developerworks/xml/library/x-hiperfparse/
        Author: Liza Daly
        See also http://effbot.org/zone/element-iterparse.htm
        """
        for event, elem in context:
            print('Processing {e}'.format(e=show(elem)))
            func(elem, *args, **kwargs)
            # It's safe to call clear() here because no descendants will be
            # accessed
            print('Clearing {e}'.format(e=show(elem)))
            elem.clear()
            # Also eliminate now-empty references from the root node to elem
            for ancestor in elem.xpath('ancestor-or-self::*'):
                print('Checking ancestor: {a}'.format(a=ancestor.tag))
                while ancestor.getprevious() is not None:
                    print('Deleting {p}'.format(
                        p=(ancestor.getparent()[0]).tag))
                    del ancestor.getparent()[0]
        del context

    content = setup_ABC()
    context = ET.iterparse(io.BytesIO(content), events=('end', ), tag='C')
    orig_fast_iter(context, lambda elem: None)
    # Processing <C>1<D1/></C>
    # Clearing <C>1<D1/></C>
    # Deleting B1
    # Processing <C>2<D/></C>
    # Clearing <C>2<D/></C>
    # Deleting B2

    print('-' * 80)
    # The improved fast_iter deletes A1. The original fast_iter does not.
    content = setup_ABC()
    context = ET.iterparse(io.BytesIO(content), events=('end', ), tag='C')
    mod_fast_iter(context, lambda elem: None)
    # Processing <C>1<D1/></C>
    # Clearing <C>1<D1/></C>
    # Checking ancestor: root
    # Checking ancestor: A1
    # Checking ancestor: C
    # Deleting B1
    # Processing <C>2<D/></C>
    # Clearing <C>2<D/></C>
    # Checking ancestor: root
    # Checking ancestor: A2
    # Deleting A1
    # Checking ancestor: C
    # Deleting B2


study_fast_iter()

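To put a number on the difference, here is a minimal sketch that counts how many top-level elements are still attached to the root once parsing has finished. The generated document and the helper names (prune_siblings_only, prune_ancestors_too, surviving_children) are assumptions made for this illustration; they only mirror the two cleanup loops discussed in this answer.

```python
import io

from lxml import etree

# Hypothetical test document: five <A> blocks, each containing one <C>.
XML = ('<root>'
       + ''.join('<A><B/><C>{}</C><E/></A>'.format(i) for i in range(5))
       + '</root>').encode('utf-8')


def prune_siblings_only(elem):
    # Original strategy: drop only the earlier siblings of elem itself.
    while elem.getprevious() is not None:
        del elem.getparent()[0]


def prune_ancestors_too(elem):
    # Modified strategy: also drop earlier siblings of every ancestor.
    for ancestor in elem.xpath('ancestor-or-self::*'):
        while ancestor.getprevious() is not None:
            del ancestor.getparent()[0]


def surviving_children(prune):
    """Parse XML, prune after each <C>, return the root's child count."""
    context = etree.iterparse(io.BytesIO(XML), events=('end',), tag='C')
    root = None
    for event, elem in context:
        elem.clear()
        prune(elem)
        # Walk up from the current element to find the root.
        root = elem
        while root.getparent() is not None:
            root = root.getparent()
    return len(root)


print(surviving_children(prune_siblings_only))   # 5
print(surviving_children(prune_ancestors_too))   # 1
```

With the siblings-only loop every A element stays attached to the root until the end of the parse, so memory use still grows with the size of the file. With the ancestor loop only the most recent A element survives.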
Original address: http://outofmemory.cn/zaji/5647748.html
