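Comment: as currently only outputting results

The printed output is only for demonstration, tracing, and debugging.
To write record and addresses into a SQL database, for example with sqlite3, do the following: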
    c.execute("INSERT INTO entity(id, name) VALUES(:id, :name)", record)

    addresses = []
    for addr in record['addresses']:
        # addr is a tuple (tag, {field: text}); add the owning entity's id
        addr[1].update({'id': record['id']})
        addresses.append(addr[1])

    c.executemany("INSERT INTO adresses(id, address, city) VALUES(:id, :address, :city)", addresses)
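A self-contained sketch of the same insert pattern may make it easier to try out. The in-memory database, the two table definitions, and the sample record are assumptions made for illustration (the table name is spelled as in the snippet above). It also uses fields.get(), so an address that lacks a key is stored as NULL instead of raising an error; that is a design choice of this sketch, not something the original snippet does.

```python
import sqlite3

# Hypothetical sample record in the shape yielded by Entity
record = {
    "id": "1",
    "name": "Alice",
    "addresses": [
        ("address", {"city": "London", "address": "1 Main St"}),
        ("address", {"city": "Paris"}),  # no 'address' key on purpose
    ],
}

conn = sqlite3.connect(":memory:")  # assumption: throwaway in-memory database
c = conn.cursor()
c.execute("CREATE TABLE entity(id TEXT PRIMARY KEY, name TEXT)")
c.execute("CREATE TABLE adresses(id TEXT, address TEXT, city TEXT)")

# Named placeholders pick only the keys they need from the dict
c.execute("INSERT INTO entity(id, name) VALUES(:id, :name)", record)

# Build one row per address without mutating the original record
address_rows = [
    {"id": record["id"], "address": fields.get("address"), "city": fields.get("city")}
    for _tag, fields in record["addresses"]
]
c.executemany(
    "INSERT INTO adresses(id, address, city) VALUES(:id, :address, :city)",
    address_rows,
)
conn.commit()

stored = c.execute("SELECT id, address, city FROM adresses ORDER BY city").fetchall()
```

Building new row dicts keeps record untouched, which matters if the same record is later flattened for pandas.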
To flatten for pandas

Precondition, outside the loop:

    df = pd.DataFrame()
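    from copy import copy

    addresses = copy(record['addresses'])
    del record['addresses']

    df_records = []
    for addr in addresses:
        # Merge into a fresh dict so every row keeps its own address fields
        df_records.append({**record, **addr[1]})

    # DataFrame.append was removed in pandas 2.0; pd.concat does the same job
    df = pd.concat([df, pd.DataFrame(df_records)], ignore_index=True)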
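The flattening step itself does not depend on pandas, so it can be sketched and checked in plain Python. The helper name flatten() and the sample record below are hypothetical. The sketch assumes that an entity without addresses should still produce one row. The resulting list of dicts is the kind of input that pd.DataFrame(rows) accepts.

```python
def flatten(record):
    """Return one flat dict per address, repeating the entity-level fields."""
    # Entity-level fields, without the nested address list
    base = {k: v for k, v in record.items() if k != "addresses"}
    rows = []
    for _tag, fields in record.get("addresses", []):
        # A new dict per row avoids every row pointing at the same object
        rows.append({**base, **fields})
    # Assumption: an entity with no addresses still yields a single row
    return rows or [base]


# Hypothetical sample record in the shape yielded by Entity
record = {
    "id": "1",
    "name": "Alice",
    "addresses": [
        ("address", {"city": "London", "address": "1 Main St"}),
        ("address", {"city": "Paris", "address": "2 Rue A"}),
    ],
}

rows = flatten(record)
```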
Question: Use etree.iterparse to include all nodes in XML file

The following class Entity does this:
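- Parses the XML file using lxml.etree.iterparse.
- There is no file size limit, because each <entity>...</entity> element tree is deleted after it has been processed.
- Builds a dict {tag, value, ...} from every <entity>...</entity> tree.
- Uses generator objects to yield the dict.
- Sequence elements, e.g. <addresses>/<address>, become a list of tuples [(address, {tag, text}), ...].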
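The steps in the list above can be sketched in a few lines. This is only an illustration under stated assumptions: it uses the standard library's xml.etree.ElementTree.iterparse in place of lxml.etree.iterparse so that it runs without extra packages, and the XML sample as well as the names expand() and iter_entities() are hypothetical, not part of the original answer.

```python
import io
import xml.etree.ElementTree as ET

# Hypothetical sample input; the real file layout may differ
SAMPLE = b"""<entities>
  <entity id="1">
    <name>Alice</name>
    <addresses>
      <address><city>London</city><address>1 Main St</address></address>
      <address><city>Paris</city><address1>2 Rue A</address1></address>
    </addresses>
  </entity>
  <entity id="2">
    <name>Bob</name>
  </entity>
</entities>"""


def expand(element):
    """Return [(tag, value), ...] for the children of a sequence element."""
    items = []
    for child in element:
        if len(child):  # nested sequence: collapse its children into a dict
            items.append((child.tag, dict(expand(child))))
        else:
            items.append((child.tag, child.text))
    return items


def iter_entities(fh):
    """Yield one dict per <entity>, freeing processed subtrees as it goes."""
    context = ET.iterparse(fh, events=("start", "end"))
    _, root = next(context)  # the first event is the start of the root element
    for event, elem in context:
        if event == "end" and elem.tag == "entity":
            record = {"id": elem.attrib["id"]}
            for child in elem:
                record[child.tag] = expand(child) if len(child) else child.text
            yield record
            root.clear()  # drop the subtrees that have already been handled


records = list(iter_entities(io.BytesIO(SAMPLE)))
```

Because the record is fully built before it is yielded, clearing the tree afterwards does not affect the dict the caller receives.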
ToDo:
- To flatten into many records, loop over record['addresses'].
- Equalize the different tag names: address and address1.
- Flatten sequence tags, e.g. <titles>, <probs> and <dobs>.
    from lxml import etree


    class Entity:
        def __init__(self, fh):
            """
            Initialize 'iterparse' to only generate 'end' events on tag '<entity>'

            :param fh: File Handle from the XML File to parse
            """
            self.context = etree.iterparse(fh, events=("end",), tag=['entity'])

        def _parse(self):
            """
            Parse the XML File for all '<entity>...</entity>' Elements
            Clear/Delete the Element Tree after processing

            :return: Yield the current '<entity>...</entity>' Element Tree
            """
            for event, elem in self.context:
                yield elem

                # Free the processed element and its already-handled siblings
                elem.clear()
                while elem.getprevious() is not None:
                    del elem.getparent()[0]

        def sequence(self, elements):
            """
            Expand a Sequence Element, e.g. <titles> to a Tuple ('title', text).
            If found a nested Sequence Element, e.g. <address>,
            to a Tuple ('address', {tag, text})

            :param elements: The Sequence Element
            :return: List of Tuple [(tag1, value), (tag2, value), ... ,(tagn, value)]
            """
            _elements = []
            for elem in elements:
                if len(elem):
                    _elements.append((elem.tag, dict(self.sequence(elem))))
                else:
                    _elements.append((elem.tag, elem.text))

            return _elements

        def __iter__(self):
            """
            Iterate all '<entity>...</entity>' Element Trees yielded from self._parse()

            :return: Dict var 'entity' {tag1, value, tag2, value, ... ,tagn, value}
            """
            for xml_entity in self._parse():
                entity = {'id': xml_entity.attrib['id']}

                for elem in xml_entity:
                    # if elem is Sequence
                    if len(elem):
                        # Append tuple(tag, value)
                        entity[elem.tag] = self.sequence(elem)
                    else:
                        entity[elem.tag] = elem.text

                yield entity


    if __name__ == "__main__":
        # Raw string, so the backslash in the path is not read as an escape
        with open(r'.\FILE.XML', 'rb') as in_xml:
            for record in Entity(in_xml):
                print("record:{}".format(record))

                for key, value in record.items():
                    if isinstance(value, (list)):
                        #print_list(key, value)
                        print("{}:{}".format(key, value))
                    else:
                        print("{}:{}".format(key, value))
Output: Shows only the first record and only 4 fields.
Note: There is a pitfall with unique tag names: address and address1

    record:{'id': '1124353', 'titles': {'title': 'Foot... (omitted for brevity)
    id:1124353
    name:DAVID, Beckham
    titles:[('title', 'Football player')]
    addresses:
    address:{'city': 'London', 'address': None, 'post... (omitted for brevity)
    address:{'city': 'London', 'address1': '35-37 Par... (omitted for brevity)
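The address/address1 pitfall and the remaining to-do item on sequence tags could be handled with two small helpers. This is a minimal sketch: normalize_address() and flatten_sequence() are hypothetical names, the sample values are made up, and joining the texts with a separator is just one possible way to flatten a sequence.

```python
def normalize_address(fields, aliases=("address1",)):
    """Return a copy of an address dict with alias keys folded into 'address'."""
    out = dict(fields)
    for alias in aliases:
        if alias in out:
            value = out.pop(alias)
            # Only fill 'address' when it is missing or None; an existing
            # non-empty 'address' wins and the alias value is dropped
            if out.get("address") is None:
                out["address"] = value
    return out


def flatten_sequence(items, sep="; "):
    """Join the texts of a simple sequence such as [('title', text), ...]."""
    return sep.join(text for _tag, text in items if text)


# Hypothetical sample values in the shape shown in the output above
only_alias = normalize_address({"city": "London", "address1": "2 Rue A"})
empty_plus_alias = normalize_address({"city": "London", "address": None, "address1": "2 Rue A"})
titles = flatten_sequence([("title", "Football player")])
```

After normalization every address dict exposes the same address key, so the rows can share one column in SQL or pandas.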
Tested with Python: 3.5 - lxml.etree: 3.7.1