How to read Word file information with Python on Linux

Step 1: Extract the XML component file from the docx file

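import zipfile

def get_word_xml(docx_filename):
    # A .docx file is a zip archive, so it must be opened in binary mode
    with open(docx_filename, 'rb') as f:
        with zipfile.ZipFile(f) as archive:
            # The main document body is stored in word/document.xml
            xml_content = archive.read('word/document.xml')
    return xml_content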
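To try this step without a real Word file at hand, here is a minimal sketch. The helper `make_fake_docx` and the function name `read_document_xml` are made up for illustration; the fake archive holds only `word/document.xml` and is not a valid Word document.

```python
import io
import zipfile


def read_document_xml(docx_file):
    """Return the raw bytes of word/document.xml from a path or binary file object."""
    with zipfile.ZipFile(docx_file) as archive:
        return archive.read("word/document.xml")


def make_fake_docx(xml_bytes):
    """Hypothetical stand-in: an in-memory zip that contains only word/document.xml."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        archive.writestr("word/document.xml", xml_bytes)
    buffer.seek(0)
    return buffer


sample = make_fake_docx(b"<document>hello</document>")
xml_content = read_document_xml(sample)
print(xml_content)  # b'<document>hello</document>'
```

`zipfile.ZipFile` accepts either a path or a binary file object, so the same function should also work when given the path of a real .docx file.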

Step 2: Parse the XML into a tree data structure

from lxml import etree

def get_xml_tree(xml_string):
    # Parse the raw XML bytes into an element tree
    return etree.fromstring(xml_string)

Step 3: Read the Word content:

# WordprocessingML main namespace; <w:t> elements hold the document text
WORD_SCHEMA = 'http://schemas.openxmlformats.org/wordprocessingml/2006/main'

def _check_element_is(element, type_char):
    return element.tag == '{%s}%s' % (WORD_SCHEMA, type_char)

def _itertext(my_etree):
    """Iterator to go through xml tree's text nodes"""
    for node in my_etree.iter(tag=etree.Element):
        if _check_element_is(node, 't'):
            yield (node, node.text)
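The text nodes usually need to be joined back into paragraphs before they are useful. The following is a rough sketch of one way to do that. It swaps in the standard library's `xml.etree.ElementTree` for lxml so that it runs without extra packages, and `paragraphs_text` and `SAMPLE_XML` are names made up for this example.

```python
import xml.etree.ElementTree as ET

# WordprocessingML main namespace: <w:p> is a paragraph, <w:t> is a text run
WORD_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def paragraphs_text(xml_bytes):
    """Return one string per <w:p> element, joining the text of its <w:t> nodes."""
    root = ET.fromstring(xml_bytes)
    paragraphs = []
    for para in root.iter("{%s}p" % WORD_NS):
        runs = [node.text for node in para.iter("{%s}t" % WORD_NS) if node.text]
        paragraphs.append("".join(runs))
    return paragraphs


# Hand-written sample that imitates the structure of word/document.xml
SAMPLE_XML = (
    '<w:document xmlns:w="%s"><w:body>'
    '<w:p><w:r><w:t>Hello, </w:t></w:r><w:r><w:t>world</w:t></w:r></w:p>'
    '<w:p><w:r><w:t>Second paragraph</w:t></w:r></w:p>'
    '</w:body></w:document>' % WORD_NS
).encode("utf-8")

print(paragraphs_text(SAMPLE_XML))  # ['Hello, world', 'Second paragraph']
```

Real documents also contain tables, text boxes, and tab or line-break elements, which this sketch does not handle specially.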

If your script runs on the Linux server itself, you can simply open the file directly and work with it:

m = open("path/to/your/file", "rb")  # file() exists only in Python 2; open() works in both

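If the file has to be accessed remotely, you will probably need to set up an HTTP server and fetch the file by URL. For that, have a look at the Python libraries urllib and urllib2 (in Python 3, their functionality lives in urllib.request).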
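For the remote case, a minimal sketch might look like the following. It assumes Python 3 and that the .docx is reachable at a plain URL; `fetch_document_xml` is a name made up for this example, and the URL in the usage comment is a placeholder.

```python
import io
import urllib.request
import zipfile


def fetch_document_xml(url, timeout=10):
    """Download a .docx from url and return the bytes of its word/document.xml."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        data = response.read()
    # ZipFile needs a seekable file object, so wrap the downloaded bytes in BytesIO
    with zipfile.ZipFile(io.BytesIO(data)) as archive:
        return archive.read("word/document.xml")


# Usage (placeholder URL, replace it with your own server and path):
# xml_content = fetch_document_xml("http://example.com/report.docx")
```

The whole file is read into memory before unzipping, which is fine for ordinary documents but may not suit very large ones.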

If this solved your problem, please accept the answer!

If it did not, feel free to ask a follow-up question.

Feel free to share; when reposting, please credit the source: 内存溢出
