Iterating over a large XML file with a low memory footprint and retrieving all Sequence Elements, even nested ones

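Comment: As of now, it only outputs the result.

The output is only for demonstration, tracing and debugging. To write record and addresses into an SQL database, for instance using sqlite3, do the following: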

c.execute("INSERT INTO entity(id, name) VALUES(:id, :name)", record)

addresses = []
for addr in record['addresses']:
    # addr is a tuple (tag, dict); add the parent id to the address dict
    addr[1].update({'id': record['id']})
    addresses.append(addr[1])

c.executemany("INSERT INTO adresses(id, address, city) VALUES(:id, :address, :city)", addresses)
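A self-contained sketch of the same insert pattern, assuming an in-memory database and a schema that the original does not show. The record values, the column types and the `rows` variable are illustrative assumptions; the table name `adresses` follows the spelling used in the snippet above. One detail worth knowing: named placeholders raise an error when a key is missing from the dict, so the sketch fills absent keys with `None` via `dict.get`.

```python
import sqlite3

# Hypothetical record in the shape the parser yields; values are illustrative.
record = {
    "id": "1",
    "name": "Example Name",
    "addresses": [
        ("address", {"address": "1 Main St", "city": "London"}),
        ("address", {"city": "Paris"}),  # no 'address' key at all
    ],
}

conn = sqlite3.connect(":memory:")
c = conn.cursor()

# Assumed schema, not taken from the original post.
c.execute("CREATE TABLE entity(id TEXT PRIMARY KEY, name TEXT)")
c.execute("CREATE TABLE adresses(id TEXT, address TEXT, city TEXT)")

c.execute(
    "INSERT INTO entity(id, name) VALUES(:id, :name)",
    {"id": record["id"], "name": record["name"]},
)

# Build one parameter dict per address; .get() supplies None for missing keys.
rows = [
    {"id": record["id"], "address": fields.get("address"), "city": fields.get("city")}
    for _tag, fields in record["addresses"]
]
c.executemany(
    "INSERT INTO adresses(id, address, city) VALUES(:id, :address, :city)", rows
)
conn.commit()

stored = c.execute(
    "SELECT id, address, city FROM adresses ORDER BY city"
).fetchall()
entity_count = c.execute("SELECT COUNT(*) FROM entity").fetchone()[0]
```

Building fresh parameter dicts also leaves the original `record` untouched, which helps if the same record is reused afterwards.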

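To flatten for pandas, as a precondition outside the loop:

import pandas as pd

df = pd.DataFrame()

Then, for each record:

addresses = record.pop('addresses')
df_records = []
for addr in addresses:
    # Build a new dict per address; appending 'record' itself would make every row identical
    row = dict(record)
    row.update(addr[1])
    df_records.append(row)

# DataFrame.append was removed in pandas 2.0, so use pd.concat instead
df = pd.concat([df, pd.DataFrame(df_records)], ignore_index=True)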
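The flattening step can also be sketched without pandas, which makes the row-per-address logic easy to check in isolation. The helper name `flatten_record` and the sample values are hypothetical; the returned list of dicts is a shape that `pandas.DataFrame` accepts directly.

```python
def flatten_record(record, key="addresses"):
    """Return one flat dict per nested entry of record[key].

    The input record is not modified. A record without nested entries
    yields a single row that contains only its scalar fields.
    """
    base = {k: v for k, v in record.items() if k != key}
    rows = []
    for _tag, fields in record.get(key) or []:
        row = dict(base)    # fresh copy, so rows do not share state
        row.update(fields)  # nested fields become top-level columns
        rows.append(row)
    return rows or [base]


# Hypothetical sample in the shape the parser yields.
sample = {
    "id": "1",
    "name": "Example Name",
    "addresses": [
        ("address", {"city": "London"}),
        ("address", {"city": "Paris"}),
    ],
}
flat_rows = flatten_record(sample)
```

Copying `base` for every row is the important part: reusing a single dict would leave every row pointing at the last address.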

Question: Use etree.iterparse to include all nodes in an XML file

The following class Entity does this:

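  • Parses the XML file using lxml.etree.iterparse
  • No file size limit, as each <entity>...</entity> element tree is deleted after it has been processed
  • Builds a dict {tag, value, ...} from every <entity>...</entity> tree
  • Uses generator objects to yield the dict
  • Sequence elements, e.g. <addresses>/<address>, become a list of tuples [(address, {tag, text}), ...]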
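The points above can be sketched with the standard library alone. Note the swap: this uses `xml.etree.ElementTree.iterparse` rather than `lxml.etree.iterparse`, so it has no `tag=` filter and no `getprevious()`; instead it keeps a reference to the root element and clears it after each entity. The names `iter_entities`, `expand` and `SAMPLE` are hypothetical, and the XML content is made up for illustration.

```python
import io
import xml.etree.ElementTree as ET


def expand(seq):
    """Expand a sequence element into a list of (tag, value) tuples."""
    items = []
    for child in seq:
        if len(child):
            # Nested sequence, e.g. <address>: collapse its children into a dict
            items.append((child.tag, dict(expand(child))))
        else:
            items.append((child.tag, child.text))
    return items


def iter_entities(source, tag="entity"):
    """Yield one dict per <entity> element, freeing each tree after use."""
    context = ET.iterparse(source, events=("start", "end"))
    _event, root = next(context)  # the first event is the start of the root
    for event, elem in context:
        if event == "end" and elem.tag == tag:
            entity = dict(elem.attrib)
            for child in elem:
                entity[child.tag] = expand(child) if len(child) else child.text
            yield entity
            root.clear()  # drop processed children so memory stays flat


SAMPLE = b"""<entities>
  <entity id="1">
    <name>A</name>
    <addresses>
      <address><city>London</city></address>
      <address><city>Paris</city></address>
    </addresses>
  </entity>
  <entity id="2"><name>B</name></entity>
</entities>"""

records = list(iter_entities(io.BytesIO(SAMPLE)))
```

Because `iter_entities` is a generator, a caller that processes records one at a time never holds more than the current entity in memory; the `list(...)` call here is only for demonstration.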

ToDo

  • To flatten into many records, loop over record['addresses']
  • Equalize differing tag names: address, address1
  • Flatten sequence tags, e.g. <titels>, <probs>, <dobs>
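One possible reading of the last two ToDo items, sketched under assumptions: "equalizing" is taken to mean mapping numbered variants such as address1 onto the base name address, and "flattening" a simple sequence is taken to mean joining its texts into one string. Both helpers, `normalize_keys` and `flatten_simple_sequence`, are hypothetical and not part of the original class; the sample values are illustrative.

```python
import re


def normalize_keys(fields):
    """Map numbered variants such as 'address1' onto the base name 'address'.

    When several variants collapse onto the same base name, the first
    value that is not None is kept.
    """
    merged = {}
    for key, value in fields.items():
        base = re.sub(r"\d+$", "", key)  # strip trailing digits only
        if merged.get(base) is None:
            merged[base] = value
    return merged


def flatten_simple_sequence(items, sep="; "):
    """Join the texts of a simple sequence like [('title', 'x'), ('title', 'y')]."""
    return sep.join(text for _tag, text in items if text)


normalized = normalize_keys(
    {"city": "London", "address": None, "address1": "35-37 Example Rd"}
)
titles = flatten_simple_sequence([("title", "Football player"), ("title", "Coach")])
```

Whether collapsing variants like this is acceptable depends on the data: if address and address1 can both carry values in the same element, one of them would be lost.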


from lxml import etree


class Entity:
    def __init__(self, fh):
        """
        Initialize 'iterparse' to only generate 'end' events on tag '<entity>'

        :param fh: File Handle from the XML File to parse
        """
        self.context = etree.iterparse(fh, events=("end",), tag=['entity'])

    def _parse(self):
        """
        Parse the XML File for all '<entity>...</entity>' Elements
        Clear/Delete the Element Tree after processing

        :return: Yield the current '<entity>...</entity>' Element Tree
        """
        for event, elem in self.context:
            yield elem

            # Free the processed element and its already handled siblings
            elem.clear()
            while elem.getprevious() is not None:
                del elem.getparent()[0]

    def sequence(self, elements):
        """
        Expand a Sequence Element, e.g. <titels> to a Tuple ('titel', text).
        If found a nested Sequence Element, e.g. <address>,
          to a Tuple ('address', {tag, text})

        :param elements: The Sequence Element
        :return: List of Tuple [(tag1, value), (tag2, value), ... ,(tagn, value)]
        """
        _elements = []
        for elem in elements:
            if len(elem):
                _elements.append((elem.tag, dict(self.sequence(elem))))
            else:
                _elements.append((elem.tag, elem.text))

        return _elements

    def __iter__(self):
        """
        Iterate all '<entity>...</entity>' Element Trees yielded from self._parse()

        :return: Dict var 'entity' {tag1, value, tag2, value, ... ,tagn, value}
        """
        for xml_entity in self._parse():
            entity = {'id': xml_entity.attrib['id']}

            for elem in xml_entity:
                # if elem is Sequence
                if len(elem):
                    # Append tuple(tag, value)
                    entity[elem.tag] = self.sequence(elem)
                else:
                    entity[elem.tag] = elem.text

            yield entity


if __name__ == "__main__":
    # Raw string, so the backslash in the Windows path is not read as an escape
    with open(r'.\FILE.XML', 'rb') as in_xml:
        for record in Entity(in_xml):
            print("record:{}".format(record))

            for key, value in record.items():
                if isinstance(value, (list)):
                    #print_list(key, value)
                    print("{}:{}".format(key, value))
                else:
                    print("{}:{}".format(key, value))

Output: Shows only the first record, and only 4 of its fields.
Note: There is a pitfall with unique tag names: address and address1

record:{'id': '1124353', 'titles': {'title': 'Foot... (omitted for brevity)
id:1124353
name:DAVID, Beckham
titles:[('title', 'Football player')]
addresses:
address:{'city': 'London', 'address': None, 'post... (omitted for brevity)
address:{'city': 'London', 'address1': '35-37 Par... (omitted for brevity)

Tested with Python: 3.5 - lxml.etree: 3.7.1



Original article: http://outofmemory.cn/zaji/5675034.html
