pyPdf ignores line breaks in PDF files

I don't know much about PDF encoding, but I think you can solve your particular problem by modifying pdf.py. In the PageObject.extractText method, you can see what is going on:

    def extractText(self):
        [...]
        for operands,operator in content.operations:
            if operator == "Tj":
                _text = operands[0]
                if isinstance(_text, TextStringObject):
                    text += _text
            elif operator == "T*":
                text += "\n"
            elif operator == "'":
                text += "\n"
                _text = operands[0]
                if isinstance(_text, TextStringObject):
                    text += operands[0]
            elif operator == '"':
                _text = operands[2]
                if isinstance(_text, TextStringObject):
                    text += "\n"
                    text += _text
            elif operator == "TJ":
                for i in operands[0]:
                    if isinstance(i, TextStringObject):
                        text += i

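If the operator is Tj or TJ (it is Tj in your example PDF), the text is simply appended and no newline is added. Now, you wouldn't necessarily want a newline added, at least if I'm reading the PDF reference right: Tj/TJ are simply the single and multiple show-string operators, and nothing requires a separator of any kind between them.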
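To make the effect concrete, here is a toy model of that dispatch loop. It is only a sketch under assumptions: the operations list is written by hand rather than parsed from a PDF, plain str stands in for TextStringObject, and join_shown_text is a made-up name, not part of pyPdf.

```python
def join_shown_text(operations):
    """Concatenate shown text the way the loop above does (simplified toy model)."""
    text = ""
    for operands, operator in operations:
        if operator == "Tj":
            # Single show-string: appended with nothing in between.
            if isinstance(operands[0], str):
                text += operands[0]
        elif operator == "T*":
            # Only an explicit move-to-next-line operator produces a newline.
            text += "\n"
        elif operator == "TJ":
            # Multiple show-string: the array mixes strings and numeric offsets.
            for item in operands[0]:
                if isinstance(item, str):
                    text += item
    return text


# Hand-written stand-in for content.operations: (operands, operator) pairs.
operations = [
    (["because of name, identifying"], "Tj"),
    (["number, mark or description"], "Tj"),
    ([], "T*"),
    ([["next ", -250, "line"]], "TJ"),
]

print(repr(join_shown_text(operations)))
```

The two Tj strings run together as "identifyingnumber" because nothing in the stream says there should be a space or a line break between them; only the T* operator contributes a newline.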

Anyway, if you modify extractText to be something like

    def extractText(self, Tj_sep="", TJ_sep=""):

    [...]

            if operator == "Tj":
                _text = operands[0]
                if isinstance(_text, TextStringObject):
                    text += Tj_sep
                    text += _text

    [...]

            elif operator == "TJ":
                for i in operands[0]:
                    if isinstance(i, TextStringObject):
                        text += TJ_sep
                        text += i

then the default behaviour should be the same:

    In [1]: pdf.getPage(1).extractText()[1120:1250]
    Out[1]: u'ing an individual which, because of name, identifyingnumber, mark or description can be readily associated with a particular indiv'

but you can change it when you want to:

    In [2]: pdf.getPage(1).extractText(Tj_sep=" ")[1120:1250]
    Out[2]: u'ta" means any information concerning an individual which, because of name, identifying number, mark or description can be readily '

or

    In [3]: pdf.getPage(1).extractText(Tj_sep="\n")[1120:1250]
    Out[3]: u'ta" means any information concerning an individual which, because of name, identifying\nnumber, mark or description can be readily '

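Alternatively, you could add the separators yourself by modifying the operands in place, but that could break something else (methods like get_original_bytes make me nervous).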
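For illustration only, the in-place idea might look roughly like this. It is a sketch on hand-written data: plain str stands in for the library's string objects, and append_separator_in_place is a hypothetical helper, not a pyPdf function.

```python
def append_separator_in_place(operations, sep=" "):
    """Append sep to the string operand of every Tj operation, mutating the operands."""
    for operands, operator in operations:
        if operator == "Tj" and isinstance(operands[0], str):
            # The operand list itself is changed, so every later reader sees it.
            operands[0] = operands[0] + sep


# Hand-written stand-in for content.operations.
ops = [(["identifying"], "Tj"), (["number"], "Tj"), ([], "T*")]
append_separator_in_place(ops)
print(ops)
```

Because the operands are mutated, anything else that later reads the same operations sees the altered strings too, which is exactly why this route feels riskier than passing a separator into the extraction loop.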

Finally, you don't have to edit pdf.py itself if you don't want to: you can simply pull this method out into a function.
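A minimal sketch of what such a pulled-out function could look like follows. The assumptions are mine: it takes the (operands, operator) pairs directly instead of a page object, and text_type is a hypothetical parameter that defaults to str so the example runs without pyPdf. With the real library you would pass in the same content.operations that extractText iterates over, with TextStringObject as text_type; that part is untested here.

```python
def extract_text(operations, text_type=str, Tj_sep="", TJ_sep=""):
    """Standalone variant of the extractText loop with optional separators."""
    text = ""
    for operands, operator in operations:
        if operator == "Tj":
            _text = operands[0]
            if isinstance(_text, text_type):
                text += Tj_sep + _text
        elif operator == "T*":
            text += "\n"
        elif operator == "'":
            text += "\n"
            _text = operands[0]
            if isinstance(_text, text_type):
                text += _text
        elif operator == '"':
            _text = operands[2]
            if isinstance(_text, text_type):
                text += "\n" + _text
        elif operator == "TJ":
            for i in operands[0]:
                if isinstance(i, text_type):
                    text += TJ_sep + i
    return text


# Hand-written operations standing in for a parsed content stream.
demo = [(["identifying"], "Tj"), (["number"], "Tj")]
print(repr(extract_text(demo)))
print(repr(extract_text(demo, Tj_sep=" ")))
print(repr(extract_text(demo, Tj_sep="\n")))
```

Note that the separator is emitted before each string, as in the modified method above, so the result starts with a separator; strip it afterwards if that matters for your use.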

Original article: http://outofmemory.cn/zaji/5631040.html
