I don't know much about PDF encoding, but I think you can solve your particular problem by modifying pdf.py. In the PageObject.extractText method, you can see what's going on:

    def extractText(self):
        [...]
        for operands, operator in content.operations:
            if operator == "Tj":
                _text = operands[0]
                if isinstance(_text, TextStringObject):
                    text += _text
            elif operator == "T*":
                text += "\n"
            elif operator == "'":
                text += "\n"
                _text = operands[0]
                if isinstance(_text, TextStringObject):
                    text += operands[0]
            elif operator == '"':
                _text = operands[2]
                if isinstance(_text, TextStringObject):
                    text += "\n"
                    text += _text
            elif operator == "TJ":
                for i in operands[0]:
                    if isinstance(i, TextStringObject):
                        text += i

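If the operator is Tj or TJ (it's Tj in your example PDF), the text is simply appended and no newline is added. Now, you wouldn't necessarily want a newline to be added, at least if I'm reading the PDF reference right: Tj/TJ are simply the single and multiple show-string operators, and the presence of a separator of some sort isn't mandatory.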
Anyway, if you modify this code to be something like

    def extractText(self, Tj_sep="", TJ_sep=""):
        […]
            if operator == "Tj":
                _text = operands[0]
                if isinstance(_text, TextStringObject):
                    text += Tj_sep
                    text += _text
        […]
            elif operator == "TJ":
                for i in operands[0]:
                    if isinstance(i, TextStringObject):
                        text += TJ_sep
                        text += i

then the default behaviour should be the same:

    In [1]: pdf.getPage(1).extractText()[1120:1250]
    Out[1]: u'ing an individual which, because of name, identifyingnumber, mark or description can be readily associated with a particular indiv'

but you can change it when you want to:

    In [2]: pdf.getPage(1).extractText(Tj_sep=" ")[1120:1250]
    Out[2]: u'ta" means any information concerning an individual which, because of name, identifying number, mark or description can be readily '

or

    In [3]: pdf.getPage(1).extractText(Tj_sep="\n")[1120:1250]
    Out[3]: u'ta" means any information concerning an individual which, because of name, identifying\nnumber, mark or description can be readily '

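Alternatively, you could add the separators yourself by modifying the operands themselves in place, but that could break something else (methods like get_original_bytes make me nervous).
Finally, you don't have to edit pdf.py itself if you don't want to: you could simply pull this method out into a function.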
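
As a rough sketch of that last option, the loop can live in a standalone function that takes the list of (operands, operator) pairs directly. The name extract_text and the demo_ops data are made up for illustration. Two assumptions are baked in: text strings are recognised with a plain str check instead of isinstance(..., TextStringObject), and operators may arrive as either bytes or str, so they are normalised first.

```python
def extract_text(operations, Tj_sep="", TJ_sep=""):
    """Join the text shown by content-stream operations.

    `operations` is assumed to be an iterable of (operands, operator)
    pairs, shaped like the content.operations list used by extractText.
    """
    parts = []
    for operands, operator in operations:
        # Assumption: operators may be bytes or str depending on the
        # library and Python version, so normalise to str.
        if isinstance(operator, bytes):
            operator = operator.decode("ascii")
        if operator == "Tj":
            # Assumption: a str check stands in for TextStringObject.
            if isinstance(operands[0], str):
                parts.append(Tj_sep + operands[0])
        elif operator == "T*":
            parts.append("\n")
        elif operator == "'":
            parts.append("\n")
            if isinstance(operands[0], str):
                parts.append(operands[0])
        elif operator == '"':
            if isinstance(operands[2], str):
                parts.append("\n" + operands[2])
        elif operator == "TJ":
            # TJ carries an array mixing strings and numeric offsets.
            for item in operands[0]:
                if isinstance(item, str):
                    parts.append(TJ_sep + item)
    return "".join(parts)


# Hand-made stand-in for a real content stream (not taken from the PDF).
demo_ops = [
    (["identifying"], "Tj"),
    (["number, mark"], b"Tj"),
    ([["or ", -250, "description"]], "TJ"),
]
print(repr(extract_text(demo_ops)))
print(repr(extract_text(demo_ops, Tj_sep=" ")))
```

With a real page you would presumably pass in the same content.operations list that extractText iterates over, building the ContentStream the way the elided part of that method does; check this against your copy of pdf.py.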