Extracting Text from PDFs (Linux)

Preface

1. Extracting text with pdfgrep

2. Extracting text with pdfplumber

3. Improvements

Summary

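Preface

Without opening the PDF file, you can extract its text from a Linux terminal using pdfgrep or a Python 3 program.

Known issue: text in a two-column layout is still extracted line by line across both columns; it is not yet reflowed automatically into a single column.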

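1. Extracting text with pdfgrep

Purpose: a command-line tool for searching the text of PDF files. See the official website for details.

pdfgrep -n 'Pdf*' file.pdf | grep -v 'pdf file'

Explanation: find the lines in file.pdf that match the regular expression 'Pdf*', drop the lines that contain 'pdf file', and prefix each result with its page number (the -n option).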
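Because pdfgrep treats the pattern as a regular expression, 'Pdf*' means "Pd" followed by zero or more "f", not a shell-style wildcard. The filtering logic of the pipeline above can be mimicked in Python as a rough sketch. The function filter_lines and the sample lines are hypothetical, and Python's re module stands in for pdfgrep's own regex engine.

```python
import re

def filter_lines(lines, include, exclude):
    """Keep lines that match the regex `include` and do not contain `exclude`."""
    pattern = re.compile(include)
    return [line for line in lines if pattern.search(line) and exclude not in line]

# Hypothetical "page:text" lines, in the shape pdfgrep -n prints them.
sample = [
    "1:Pdf viewers on Linux",
    "2:Open the pdf file with Pdftotext",
    "3:Pd is enough for the regex to match",
    "4:no match here",
]

# Line 2 matches the pattern but is dropped because it contains "pdf file".
print(filter_lines(sample, "Pdf*", "pdf file"))
```

Line 3 is kept even though it has no "Pdf", which shows that the pattern is a regex rather than a glob.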

2. Extracting text with pdfplumber

Installation: pip3 install pdfplumber

WARNING: The script pdfplumber is installed in '~/.local/bin' which is not on PATH.
Consider adding this directory to PATH or, if you prefer to suppress this warning, use --no-warn-script-location.

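If this warning appears, do the following:

In "~/.bashrc": export PYTHONPATH=::...:$PYTHONPATH
source ~/.bashrc

After this, pdfplumber can be imported in Python. (The warning itself is about PATH: to run the "pdfplumber" command-line script by name, ~/.local/bin would also have to be on PATH.)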
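To check whether the setup worked, one option is to ask Python whether it can locate the module. This is a minimal sketch using the standard library; the helper name is_importable is made up for illustration.

```python
import importlib.util

def is_importable(module_name):
    """Return True if Python can locate the named top-level module."""
    return importlib.util.find_spec(module_name) is not None

# Prints True once pdfplumber is visible to this interpreter.
print(is_importable("pdfplumber"))
```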

The following program extracts the text; run it with Python 3:

#!/usr/bin/env python3
# -*- coding: UTF-8 -*-
import pdfplumber

with pdfplumber.open("./test.pdf") as pdf:
    for page in pdf.pages:  # iterate over every page
        # All text on the current page; None or an empty string if the page has no text.
        print(page.extract_text(x_tolerance=1))
        # All tables on the current page: [] if there are none,
        # otherwise [[[row1], [row2], ...], [[row1], [row2], ...], ...].
        tables = page.extract_tables()
        print(tables)
        for table in tables:
            print(table)  # one table: [[row1], [row2], ...]
            for row in table:
                print(row)  # one row: ['xxx', 'xxx', ...]
        # One separator line per page.
        print('---------- Split Page line ----------')

#    first_page = pdf.pages[0]
#    print(first_page.chars[0])

Future improvement: the separator line could be labeled with the page number ("Page N").
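A page-labeled separator could look like the following sketch. The helper names page_separator and print_pages, the "Page N of M" wording, and the 40-character width are illustrative choices, not part of the original program.

```python
def page_separator(page_number, total_pages, width=40):
    """Build a dash-padded separator line containing 'Page N of M'."""
    label = f" Page {page_number} of {total_pages} "
    return label.center(width, "-")

def print_pages(path):
    """Print each page's text, preceded by a labeled separator."""
    # Imported here so page_separator can be used without pdfplumber installed.
    import pdfplumber

    with pdfplumber.open(path) as pdf:
        total = len(pdf.pages)
        for number, page in enumerate(pdf.pages, start=1):
            print(page_separator(number, total))
            print(page.extract_text(x_tolerance=1) or "")
```

Calling print_pages("./test.pdf") would then print each page under its own header line instead of a fixed separator string.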

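Known issue: words sometimes come out with no spaces between them, all run together. Adjusting the x_tolerance parameter helps: a space is inserted only where the horizontal gap between two characters exceeds x_tolerance, so a smaller value separates words more readily.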
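The effect of x_tolerance can be illustrated with a simplified model. This is not pdfplumber's actual implementation, only a sketch of the idea; join_chars and the sample coordinates are hypothetical, while the "text", "x0" and "x1" keys mirror the character dictionaries that pdfplumber exposes.

```python
def join_chars(chars, x_tolerance=3):
    """Join characters on one line, adding a space where the gap exceeds x_tolerance.

    Each character is a dict with "text", "x0" (left edge) and "x1" (right edge).
    """
    ordered = sorted(chars, key=lambda c: c["x0"])
    pieces = []
    previous = None
    for char in ordered:
        if previous is not None and char["x0"] - previous["x1"] > x_tolerance:
            pieces.append(" ")
        pieces.append(char["text"])
        previous = char
    return "".join(pieces)

# Two words, "ab" and "cd", separated by a gap of 2 units.
line = [
    {"text": "a", "x0": 0, "x1": 5},
    {"text": "b", "x0": 5, "x1": 10},
    {"text": "c", "x0": 12, "x1": 17},
    {"text": "d", "x0": 17, "x1": 22},
]
print(join_chars(line, x_tolerance=3))  # gap within tolerance: abcd
print(join_chars(line, x_tolerance=1))  # gap exceeds tolerance: ab cd
```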

3. Improvements

This is left for future work.
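For the two-column problem mentioned in the preface, one possible direction is to crop each page into a left and a right half and extract them separately. The sketch below assumes that every page has exactly two columns and that the gutter sits at the horizontal middle of the page; column_bboxes and extract_two_columns are hypothetical helper names, and the approach has not been tested on real documents here.

```python
def column_bboxes(page_bbox):
    """Split a page bounding box (x0, top, x1, bottom) into left and right halves."""
    x0, top, x1, bottom = page_bbox
    middle = (x0 + x1) / 2
    return (x0, top, middle, bottom), (middle, top, x1, bottom)

def extract_two_columns(path):
    """Return one string per page, with the left column before the right one."""
    # Imported here so column_bboxes can be used without pdfplumber installed.
    import pdfplumber

    texts = []
    with pdfplumber.open(path) as pdf:
        for page in pdf.pages:
            left_box, right_box = column_bboxes(page.bbox)
            left = page.crop(left_box).extract_text() or ""
            right = page.crop(right_box).extract_text() or ""
            # Read the left column first, then the right one.
            texts.append(left + "\n" + right)
    return texts
```

Pages with full-width titles, figures, or an off-center gutter would need a smarter split than the fixed midpoint used here.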

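Summary

The text extracted by pdfgrep is laid out more neatly than that of pdfplumber. For a quick look at the results, pdfgrep is the recommended choice.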

Feel free to share; when reposting, please credit the source: 内存溢出

Original article: http://outofmemory.cn/zaji/5711213.html
