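In several of my other posts I covered using Tesseract to recognize Chinese, digits, and letters, along with some tricks for removing watermarks from PDFs. But when an entire PDF consists of images (a scanned document, for example), how do you extract its tables and return the data as JSON organized by row and column?
One approach is to save each page of the PDF as an image and then run recognition on those images. GitHub hosts several deep learning projects for this, such as CascadeTabNet and CDecNet, and I also looked at Baidu and Tencent, which have similar deep learning projects. I tried CascadeTabNet (92 stars on GitHub at the time of writing) and Baidu's image table recognition project; CascadeTabNet came to 11 GB and Baidu's to 19 GB. The results were acceptable on small images, but accuracy was very low on large ones such as an A4 page, and neither could return the result as JSON.
Here I present a different method that extracts tables from images with OpenCV plus Tesseract. It can handle more complex tables, such as nested tables.
Approach: use OpenCV to run detection on the image and return the key data; then use a design tool to mark out and cut regions on the image, producing metadata for each page; finally, use that metadata to extract the table data from the image.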
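1. OpenCV table detection (handles about 60% of table recognition).
2. A design tool, a small utility written in Vue, for refining the cell data returned by OpenCV (this is what brings table detection up to 100%).
3. Table recognition with OpenCV.
4. OCR with Tesseract.
5. Conversion to JSON for the return value.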
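The idea behind step 1 can be sketched without OpenCV. The source below calls `getHorizontal` and `getVertical` without showing them; presumably they keep only long straight strokes, roughly what a morphological opening with a long, thin kernel does. The sketch here is an assumption-laden stand-in on a plain boolean grid: it keeps runs of at least `minLen` set pixels in each direction, then ANDs the two masks to find the joints where table lines cross. The class and method names (`LineJoints`, `horizontalLines`, `verticalLines`, `joints`, `parse`) are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

final class LineJoints {

    /** Keeps only pixels that sit in a horizontal run of at least minLen set pixels. */
    static boolean[][] horizontalLines(boolean[][] img, int minLen) {
        int h = img.length, w = img[0].length;
        boolean[][] out = new boolean[h][w];
        for (int y = 0; y < h; y++) {
            int x = 0;
            while (x < w) {
                if (!img[y][x]) { x++; continue; }
                int start = x;
                while (x < w && img[y][x]) x++;          // walk to the end of the run
                if (x - start >= minLen) {
                    for (int k = start; k < x; k++) out[y][k] = true;
                }
            }
        }
        return out;
    }

    static boolean[][] transpose(boolean[][] m) {
        boolean[][] t = new boolean[m[0].length][m.length];
        for (int y = 0; y < m.length; y++)
            for (int x = 0; x < m[0].length; x++)
                t[x][y] = m[y][x];
        return t;
    }

    /** Vertical counterpart: transpose, reuse the horizontal pass, transpose back. */
    static boolean[][] verticalLines(boolean[][] img, int minLen) {
        return transpose(horizontalLines(transpose(img), minLen));
    }

    /** Pixel-wise AND of the two masks; returns {x, y} for every crossing pixel. */
    static List<int[]> joints(boolean[][] hor, boolean[][] ver) {
        List<int[]> points = new ArrayList<>();
        for (int y = 0; y < hor.length; y++)
            for (int x = 0; x < hor[0].length; x++)
                if (hor[y][x] && ver[y][x]) points.add(new int[] {x, y});
        return points;
    }

    /** Test helper: '#' is a set pixel, anything else is background. */
    static boolean[][] parse(String... rows) {
        boolean[][] img = new boolean[rows.length][rows[0].length()];
        for (int y = 0; y < rows.length; y++)
            for (int x = 0; x < rows[y].length(); x++)
                img[y][x] = rows[y].charAt(x) == '#';
        return img;
    }

    public static void main(String[] args) {
        // A one-cell "table" with a stray noise pixel in the middle.
        boolean[][] img = parse("#####", "#...#", "#.#.#", "#...#", "#####");
        for (int[] p : joints(horizontalLines(img, 3), verticalLines(img, 3))) {
            System.out.println(p[0] + "," + p[1]);
        }
    }
}
```

A single rectangular cell yields four joints, which is why the detection code treats a region with fewer than four joint contours as something other than a table.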
Here is the source code (this post does not explain it in much detail, so please read it carefully). Part one, OpenCV table detection:
import java.util.ArrayList;
import java.util.LinkedList;
import java.util.List;
import org.opencv.core.*;
import org.opencv.imgcodecs.Imgcodecs;
import org.opencv.imgproc.Imgproc;

public Table parseImageTableStructure(String path) {
    // Correct the image skew first
    pictureTiltCorrection(path);
    Mat src = Imgcodecs.imread(path);
    // 1. Convert the image to grayscale
    Mat gray = OpenCVUtils.gray(src);
    // 2. Binarize the image
    Mat adaptiveThreshold = OpenCVUtils.adaptiveThreshold(gray);
    // 3. Dilate, then erode: closes small gaps inside the table lines
    Mat element = Imgproc.getStructuringElement(Imgproc.MORPH_RECT, new Size(3, 3));
    Imgproc.dilate(adaptiveThreshold, adaptiveThreshold, element);
    Imgproc.erode(adaptiveThreshold, adaptiveThreshold, element);
    // 4. Extract the horizontal lines
    Mat horizontalLine = getHorizontal(adaptiveThreshold.clone());
    // 5. Extract the vertical lines
    Mat verticalLine = getVertical(adaptiveThreshold.clone());
    // 6. Merge horizontal and vertical lines
    Mat tableLine = OpenCVUtils.getOr(horizontalLine, verticalLine);
    // 7. Use bitwise_and to locate the points where horizontal and vertical lines cross
    Mat points_image = new Mat();
    Core.bitwise_and(horizontalLine, verticalLine, points_image);
    // 8. Find contours
    List<MatOfPoint> contours = new ArrayList<>();
    Mat rootHierarchy = new Mat();
    Imgproc.findContours(tableLine, contours, rootHierarchy, Imgproc.RETR_LIST,
            Imgproc.CHAIN_APPROX_TC89_KCOS, new Point(0, 0));
    // 9. Analyze the contours
    LinkedList<MatWithProperty> tables = new LinkedList<>();
    for (MatOfPoint contour : contours) {
        // Anything below this area is treated as a stray mark, not a table cell
        if (Imgproc.contourArea(contour) < 100) {
            continue;
        }
        // Approximate the contour with a polygon. The original discarded this result
        // by writing it into a temporary; here it is kept and used for the rectangle.
        MatOfPoint2f approx = new MatOfPoint2f();
        Imgproc.approxPolyDP(new MatOfPoint2f(contour.toArray()), approx, 3, true);
        // Bounding rectangle that encloses the approximated shape
        Rect boundRect = Imgproc.boundingRect(new MatOfPoint(approx.toArray()));
        // Look at the crossing points that fall inside this region
        Mat table_image = points_image.submat(boundRect);
        List<MatOfPoint> table_contours = new ArrayList<>();
        Mat joint_mat = new Mat();
        Imgproc.findContours(table_image, table_contours, joint_mat, Imgproc.RETR_CCOMP,
                Imgproc.CHAIN_APPROX_TC89_L1);
        // A complete cell has at least 4 crossing points; skip regions with fewer
        if (table_contours.size() < 4) {
            continue;
        }
        // Keep the rectangle
        tables.addFirst(new MatWithProperty(null, boundRect));
    }
    ImageTable table = new ImageTable();
    table.setImageHeight(src.rows());
    table.setImageWidth(src.cols());
    // 10. Build the buckets
    List<Row> horBuckets = new ArrayList<>();
    table.setRows(horBuckets);
    // Build the row buckets
    createRowBuckets(tables, horBuckets);
    // Walk the row buckets
    for (Row row : horBuckets) {
        RowBucket bucket = (RowBucket) row;
        List<MatWithProperty> rowMats = bucket.elements;
        // Build the column buckets
        List<ColBucket> verBuckets = new ArrayList<>();
        createColBuckets(rowMats, verBuckets);
    }
    // Return the structure
    return table;
}
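The source does not show `createRowBuckets` or `createColBuckets`. As a rough illustration of what bucketing could look like, the sketch below groups cell rectangles into rows by their top edge (within a tolerance) and orders each row left to right. Everything in it is an assumption: the `Cell` class, the `toRows` method, and the tolerance rule are hypothetical, and the author's real buckets very likely handle more cases, nested tables in particular.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

final class CellBuckets {

    /** Hypothetical stand-in for the rectangle carried by MatWithProperty. */
    static final class Cell {
        final int x, y, width, height;
        Cell(int x, int y, int width, int height) {
            this.x = x; this.y = y; this.width = width; this.height = height;
        }
    }

    /**
     * Groups cells into rows. Cells are visited top to bottom; a cell joins the
     * current row while its top edge is within 'tolerance' pixels of the top edge
     * of the row's first cell, otherwise it starts a new row.
     */
    static List<List<Cell>> toRows(List<Cell> cells, int tolerance) {
        List<Cell> sorted = new ArrayList<>(cells);
        sorted.sort(Comparator.comparingInt((Cell c) -> c.y)
                .thenComparingInt((Cell c) -> c.x));
        List<List<Cell>> rows = new ArrayList<>();
        List<Cell> current = null;
        int rowTop = 0;
        for (Cell c : sorted) {
            if (current == null || Math.abs(c.y - rowTop) > tolerance) {
                current = new ArrayList<>();
                rows.add(current);
                rowTop = c.y;
            }
            current.add(c);
        }
        // Within a row, order cells left to right so they read as columns.
        for (List<Cell> row : rows) {
            row.sort(Comparator.comparingInt((Cell c) -> c.x));
        }
        return rows;
    }

    public static void main(String[] args) {
        List<Cell> cells = new ArrayList<>();
        cells.add(new Cell(60, 10, 40, 30));
        cells.add(new Cell(10, 12, 40, 30));
        cells.add(new Cell(60, 52, 40, 30));
        cells.add(new Cell(10, 50, 40, 30));
        for (List<Cell> row : toRows(cells, 5)) {
            StringBuilder sb = new StringBuilder();
            for (Cell c : row) sb.append('(').append(c.x).append(',').append(c.y).append(") ");
            System.out.println(sb.toString().trim());
        }
    }
}
```

The tolerance absorbs the pixel or two of jitter that contour detection leaves between cells of the same row; it should stay well below the smallest row height.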
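The returned table is the table structure data; you can think of it as a mask of the table. For simple, clear tables this detection finds every cell, but for large or nested tables the recognition rate is only around 60%. To reach 100% we need to rework the table structure data, and that has to be done through a UI. Below is the design page, written in Vue: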
(The template markup did not survive. What remains: the page title "Recognize table data in an image"; the buttons "Save design", "Show data", and "Click to save"; the upload hint "Drag an image file here, or click to upload. Only jpg/png files, no larger than 10MB"; the box bindings {onDragStop(x,y,c)}" @resizestop="(x,y,w,h) => {onResizeStop(x,y,w,h,c)}" :parent="true" :key="c.id"; and the {{attributeData}} display. The component styles follow.)

.ocr-design-wrapper {
  /deep/.upload-demo {
    width: 100%;
    .el-upload {
      width: 100%;
      .el-upload-dragger { width: 100%; }
    }
  }
  .design-port-container { width: 100%; position: relative; }
  .bounding-container {
    width: 100%; height: 100%; position: absolute; left: 0; top: 0;
    .sketch-container {
      /deep/.bound-box {
        background-color: rgba(100, 255, 187, 0.4);
        .bound-box-close { display: none; cursor: pointer; position: absolute; right: -7px; top: -7px; background-color: #00c9ff; color: #ffffff; }
        &.active { .bound-box-close { display: block; } }
        .handle-tl { top: -5px; left: -5px; }
        .handle-tm { top: -5px; }
        .handle-tr { right: -5px; top: -5px; }
        .handle-mr { right: -5px; }
        .handle-ml { left: -5px; }
        .handle-bl { left: -5px; bottom: -5px; }
        .handle-bm { bottom: -5px; }
        .handle-br { bottom: -5px; right: -5px; }
      }
    }
  }
  .top-header { height: 50px; line-height: 50px; padding-left: 20px; background-color: #dedede; width: 100%; left: 0px; z-index: 99999999; border-radius: 6px; }
  .design-port {
    margin-top: 10px;
    .left-view { width: 100%; height: 100%; }
  }
}
.btn-wrap {
  position: fixed; top: 50%; right: 10px; z-index: 14; width: 80px;
  /deep/ .el-button {
    margin-bottom: 10px; opacity: 0.6;
    &:hover { opacity: 1; }
  }
  /deep/ .el-button+.el-button { margin-left: 0; }
}
.innerDom { display: none !important; }
.box { padding: 20px; }
.comp-wrap { width: 313px; float: left; height: 736px; }
.page-wrap { width: 100%; float: left; padding: 0 !important; }
.edit-wrap { position: relative; float: left; width: 348px; height: 736px; }
.drag-sty { border: 1px solid #e6e6e6; width: 100px; padding: 6px; font-size: 12px; height: 30px; display: inline-block; line-height: 18px; }
.iconfont-back { background: #ccc; border-radius: 2px; padding: 0 2px; float: left; height: 18px; margin-right: 6px; }
.drag-sty:hover .iconfont { color: #2875e8; }
.iconfont { color: #a8a7a7; font-size: 18px; }
.bg-purple { background: #d3dce6; }
.bg-purple-light { background: #fafafa; }
.left-shadow {}
.grid-content { border-radius: 4px; overflow: auto; padding: 20px; }
.tab-content { border: 1px solid #eee; border-radius: 4px; min-height: 736px; height: 100%; overflow: auto; }
.item { height: 60px; border: 0px solid #333; display: inline-block; padding: 10px; margin-bottom: 5px; cursor: pointer; }
.el-upload { width: 100%; }
.el-upload-dragger { width: 100%; }
.item2 { height: 80px; border: 0px solid #333; padding: 10px; margin-bottom: 5px; cursor: pointer; }
#removeBox { height: 100px; width: 100px; border: 2px dashed #999; background: rgba(0, 0, 0, 0.3); position: absolute; bottom: 10px; right: 20px; background: url(/static/image/deleteBox.png) no-repeat; background-size: 90%; background-position: center center; }
.flxed { position: relative; top: 0; left: 0; }
.edit-content .el-form-item { margin-bottom: 0; }
.vali-el-input { margin: 0 10px; }
.el-checkbox { margin: 4px 0; }
.el-divider--horizontal { margin: 4px 0; }
h4, h5 { margin: 10px 0; }
.submit-btn { float: right; }
/deep/ .page-item-group {
  cursor: pointer; position: relative;
  .control-btn { right: 0; }
}
* { box-sizing: border-box; }
/deep/ .vali-el-input .el-input__inner { height: 26px !important; padding-right: 0; padding-left: 4px; }
/deep/ .sel-options .el-form-item__label { width: 100%; text-align: left; }
/deep/ .el-icon-delete { cursor: pointer; }
/deep/ .edit-content .el-input { width: 100%; }
/deep/ .edit-content .el-date-editor.el-input,
/deep/ .edit-content .el-date-editor.el-input__inner { width: 220px; }
/deep/ .edit-content .el-input__inner { height: 30px; box-sizing: border-box; }
/deep/ .edit-content .el-date-editor--date .el-icon-date,
/deep/ .edit-content .time-select .el-icon-circle-close { line-height: 30px; }
/deep/ .edit-content .el-form-item__label { height: 30px; }
/deep/ .long_input { margin-left: 80px !important; position: relative; }
/deep/ .long_input_label { width: 80px; }
/deep/ .page-item:hover { background: #e0f2ff; }
/deep/ .page-item-select { border: 1px dashed #4db8ff; background: #e0f2ff; }
/deep/ .page-item-select .control-btn { display: block; }
/deep/ .control-btn { position: absolute; top: 50%; right: -20px; transform: translate(0, -50%); display: none; }
/deep/ .control-btn .control-delete { position: absolute; right: 0; bottom: -26px; }
/deep/ .control-btn .control-arrow-wrap { height: 20px; cursor: pointer; line-height: 20px; background: #fff; display: block; }
/deep/ .control-btn .control-arrow-down { bottom: -2px; }
/deep/ .control-btn .control-arrow-up { top: 28px; margin-bottom: 6px; }
/deep/ .tab-content .page-item { margin-bottom: 0; padding: 16px 20px; min-height: 90px; }
.sel-sty {
  width: 200px; margin-top: 14px; display: block;
  // -webkit-appearance: none;
  background-color: #fff; background-image: none; border-radius: 4px; border: 1px solid #dcdfe6; box-sizing: border-box; color: #606266; font-size: inherit; height: 30px; line-height: 40px; outline: none; padding: 0 15px; transition: border-color 0.2s cubic-bezier(0.645, 0.045, 0.355, 1);
}
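Note that this page uses the vue-draggable-resizable component. I won't cover how to add it to a Vue project. The running page looks like the figure below:
After uploading an image of a table, it looks like this:
The red text is not suitable for display, so I blacked it out.
Once the design is finished you can view the resulting design data, which OpenCV can then use for complete recognition. The code is as follows: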
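One practical detail when boxes are drawn in a browser: if the image is displayed at a different size from its pixel dimensions, the box coordinates need to be mapped back to image pixels and kept inside the image, because OpenCV's `submat` rejects rectangles that extend past the image bounds. Whether the author's tool does this in the browser or on the server is not stated, so the sketch below is purely an assumption, and `BoxMapper` and `toImageRect` are hypothetical names.

```java
import java.util.Arrays;

final class BoxMapper {

    static int clamp(int value, int min, int max) {
        return Math.max(min, Math.min(max, value));
    }

    /**
     * Maps a box drawn on a (viewW x viewH) preview to pixel coordinates of an
     * (imageW x imageH) image. Returns {x, y, width, height}, clamped so the
     * rectangle never leaves the image.
     */
    static int[] toImageRect(double boxX, double boxY, double boxW, double boxH,
                             double viewW, double viewH, int imageW, int imageH) {
        double scaleX = imageW / viewW;
        double scaleY = imageH / viewH;
        // Scale both corners, then clamp, so width/height shrink instead of overflowing.
        int left = clamp((int) Math.round(boxX * scaleX), 0, imageW);
        int top = clamp((int) Math.round(boxY * scaleY), 0, imageH);
        int right = clamp((int) Math.round((boxX + boxW) * scaleX), 0, imageW);
        int bottom = clamp((int) Math.round((boxY + boxH) * scaleY), 0, imageH);
        return new int[] {left, top, Math.max(0, right - left), Math.max(0, bottom - top)};
    }

    public static void main(String[] args) {
        // A box on a 500x250 preview of a 1000x500 image.
        System.out.println(Arrays.toString(toImageRect(50, 25, 100, 50, 500, 250, 1000, 500)));
    }
}
```

A box that clamps down to zero width or height is worth skipping before OCR, since an empty crop cannot contain text.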
import java.awt.image.BufferedImage;
import java.io.File;
import java.util.ArrayList;
import java.util.List;
import org.opencv.core.Mat;
import org.opencv.core.Rect;
import org.opencv.imgcodecs.Imgcodecs;

List<Row> recognizeFromSettings(File imagePath) {
    Mat src = Imgcodecs.imread(imagePath.getAbsolutePath());
    String extention = Utils.getUriExtention(imagePath.getPath());
    // 1. Build the row buckets from the cells saved by the design tool
    List<MatWithProperty> cellProperty =
            JSON.parseArray(this.settings.getString("cells"), MatWithProperty.class);
    List<Row> horBuckets = new ArrayList<>();
    createRowBuckets(cellProperty, horBuckets);
    // Walk the row buckets
    for (Row row : horBuckets) {
        RowBucket bucket = (RowBucket) row;
        List<MatWithProperty> rowMats = bucket.elements;
        // Build the column buckets
        List<ColBucket> verBuckets = new ArrayList<>();
        createColBuckets(rowMats, verBuckets);
        for (ColBucket verticalBucket : verBuckets) {
            List<MatWithProperty> colMats = verticalBucket.elements;
            // Walk the cells in this column
            for (MatWithProperty mat : colMats) {
                // Crop the cell out of the source image
                Mat subMat = src.submat(
                        new Rect(mat.rect.x, mat.rect.y, mat.rect.width, mat.rect.height)).clone();
                try {
                    // Run OCR on the cropped cell
                    BufferedImage image = OpenCVUtils.convertMat2BufferedImage(subMat, extention);
                    String content = tesseract.doOCR(image);
                    row.addCell(content);
                } catch (Exception e) {
                    logger.error("[OCR Failed]", e);
                } finally {
                    // Free the native memory held by the crop
                    subMat.release();
                }
            }
        }
    }
    // Free the native memory held by the source image
    src.release();
    return horBuckets;
}
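Step 5, the conversion to JSON, is not shown in the source. Since the code above already parses JSON with a `JSON.parseArray` call, the author's project most likely serializes the rows with the same library. To stay free of dependencies, the sketch below instead builds the JSON by hand from a list of rows, each a list of cell strings. The `TableJson` class and its methods are hypothetical, and trimming each cell is an assumption made because OCR output usually carries trailing line breaks.

```java
import java.util.ArrayList;
import java.util.List;

final class TableJson {

    /** Escapes a string for use inside a JSON string literal. */
    static String escape(String s) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < s.length(); i++) {
            char ch = s.charAt(i);
            switch (ch) {
                case '"':  sb.append("\\\""); break;
                case '\\': sb.append("\\\\"); break;
                case '\n': sb.append("\\n"); break;
                case '\r': sb.append("\\r"); break;
                case '\t': sb.append("\\t"); break;
                default:
                    if (ch < 0x20) sb.append(String.format("\\u%04x", (int) ch));
                    else sb.append(ch);
            }
        }
        return sb.toString();
    }

    /** Serializes rows of cell text as a JSON array of arrays of strings. */
    static String toJson(List<List<String>> rows) {
        StringBuilder sb = new StringBuilder("[");
        for (int r = 0; r < rows.size(); r++) {
            if (r > 0) sb.append(',');
            sb.append('[');
            List<String> row = rows.get(r);
            for (int c = 0; c < row.size(); c++) {
                if (c > 0) sb.append(',');
                // Trim first: OCR text typically ends with a line break.
                sb.append('"').append(escape(row.get(c).trim())).append('"');
            }
            sb.append(']');
        }
        return sb.append(']').toString();
    }

    public static void main(String[] args) {
        List<List<String>> rows = new ArrayList<>();
        List<String> header = new ArrayList<>();
        header.add("Name\n");
        header.add("Qty\n");
        rows.add(header);
        System.out.println(toJson(rows));
    }
}
```

An array of arrays keeps the row and column positions implicit in the indices, which matches the goal of returning the table by row and column.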