Java OpenCV + Tesseract: Extracting Tables from Images and Returning JSON by Row and Column

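In several of my other articles I covered using Tesseract to recognize Chinese, digits, and letters, as well as some tricks for removing watermarks from PDFs. But when an entire PDF consists of images (a scanned document, for example), how can we extract its tables and return the data as JSON organized by row and column?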

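One approach is to save the images embedded in the PDF as image files and then run recognition on those images. GitHub hosts deep-learning projects for this, such as CascadeTabNet and CDecNet, and I found similar deep-learning projects from Baidu and Tencent. I tried CascadeTabNet (92 stars on GitHub at the time of writing) and Baidu's image table recognition project; CascadeTabNet is about 11 GB and Baidu's about 19 GB. The results were acceptable on small images, but accuracy was very low on large ones (such as an A4 page), and neither project could return the result as JSON.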

Here I describe another way to extract tables from images, using OpenCV + Tesseract. This method can handle more complex tables (such as nested tables).

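Approach: use OpenCV to run detection on the image and return the key data; then use a design tool to mark out and cut regions on the image and generate per-page metadata; finally, use that metadata to extract the table data from the image.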

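1. OpenCV table detection (detects about 60% of tables).

2. Design tool: a small tool built with Vue for refining the cell data returned by OpenCV (used to bring table detection to 100%).

3. Table recognition with OpenCV.

4. OCR with Tesseract.

5. Convert to JSON and return.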
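Step 1 depends on separating the table's ruling lines from the text. With OpenCV this is commonly done by a morphological opening with a long, thin structuring element, which is presumably what the project's `getHorizontal` and `getVertical` helpers do (they are not shown in this article). The sketch below illustrates the same idea on a plain boolean grid so that it runs without native libraries. The class `LineFilter`, its method names, and the `minLength` threshold are hypothetical and not taken from the original project.

```java
final class LineFilter {

    /**
     * Keeps only foreground pixels that belong to a horizontal run of at
     * least minLength pixels, which mirrors an opening with a
     * 1 x minLength kernel: long ruling lines survive, short strokes vanish.
     */
    static boolean[][] keepHorizontalRuns(boolean[][] binary, int minLength) {
        boolean[][] out = new boolean[binary.length][];
        for (int y = 0; y < binary.length; y++) {
            int cols = binary[y].length;
            out[y] = new boolean[cols];
            int runStart = -1; // -1 means "not inside a run"
            for (int x = 0; x <= cols; x++) {
                boolean on = x < cols && binary[y][x];
                if (on && runStart < 0) {
                    runStart = x;
                } else if (!on && runStart >= 0) {
                    if (x - runStart >= minLength) {
                        for (int k = runStart; k < x; k++) {
                            out[y][k] = true;
                        }
                    }
                    runStart = -1;
                }
            }
        }
        return out;
    }

    /** Vertical runs are horizontal runs of the transposed grid. */
    static boolean[][] keepVerticalRuns(boolean[][] binary, int minLength) {
        return transpose(keepHorizontalRuns(transpose(binary), minLength));
    }

    /** Pixel-wise AND of two grids of the same shape; marks line crossings. */
    static boolean[][] and(boolean[][] a, boolean[][] b) {
        boolean[][] out = new boolean[a.length][];
        for (int y = 0; y < a.length; y++) {
            out[y] = new boolean[a[y].length];
            for (int x = 0; x < a[y].length; x++) {
                out[y][x] = a[y][x] && b[y][x];
            }
        }
        return out;
    }

    /** Transposes a rectangular grid. */
    static boolean[][] transpose(boolean[][] m) {
        int rows = m.length;
        int cols = rows == 0 ? 0 : m[0].length;
        boolean[][] t = new boolean[cols][rows];
        for (int y = 0; y < rows; y++) {
            for (int x = 0; x < cols; x++) {
                t[x][y] = m[y][x];
            }
        }
        return t;
    }
}
```

Intersecting the horizontal and vertical results with `and` yields the points where ruling lines cross, which is the role `Core.bitwise_and` plays in the detection code.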

The source code follows (this article does not explain it in much detail, so please read it carefully). Part one, OpenCV table detection:

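	import java.util.ArrayList;
	import java.util.LinkedList;
	import java.util.List;

	import org.opencv.core.Core;
	import org.opencv.core.Mat;
	import org.opencv.core.MatOfPoint;
	import org.opencv.core.MatOfPoint2f;
	import org.opencv.core.Point;
	import org.opencv.core.Rect;
	import org.opencv.core.Size;
	import org.opencv.imgcodecs.Imgcodecs;
	import org.opencv.imgproc.Imgproc;

	// OpenCVUtils, Table, ImageTable, Row, RowBucket, ColBucket and MatWithProperty
	// are the project's own classes.
	public Table parseImageTableStructure(String path) {
		// Correct the image skew
		pictureTiltCorrection(path);

		Mat src = Imgcodecs.imread(path);

		// 1. Convert the image to grayscale
		Mat gray = OpenCVUtils.gray(src);

		// 2. Binarize the image
		Mat adaptiveThreshold = OpenCVUtils.adaptiveThreshold(gray);

		// 3. Dilate, then erode: fills small gaps in the table lines
		Mat element = Imgproc.getStructuringElement(Imgproc.MORPH_RECT, new Size(3, 3));
		Imgproc.dilate(adaptiveThreshold, adaptiveThreshold, element);
		Imgproc.erode(adaptiveThreshold, adaptiveThreshold, element);

		// 4. Extract the horizontal lines
		Mat horizontalLine = getHorizontal(adaptiveThreshold.clone());

		// 5. Extract the vertical lines
		Mat verticalLine = getVertical(adaptiveThreshold.clone());

		// 6. Merge the horizontal and vertical lines
		Mat tableLine = OpenCVUtils.getOr(horizontalLine, verticalLine);

		// 7. Use bitwise_and to locate the points where horizontal and vertical lines cross
		Mat points_image = new Mat();
		Core.bitwise_and(horizontalLine, verticalLine, points_image);

		// 8. Find contours
		List<MatOfPoint> contours = new ArrayList<>();
		Mat rootHierarchy = new Mat();
		Imgproc.findContours(tableLine, contours, rootHierarchy, Imgproc.RETR_LIST, Imgproc.CHAIN_APPROX_TC89_KCOS, new Point(0, 0));

		// 9. Analyze the contours
		Rect[] boundRect = new Rect[contours.size()];
		LinkedList<MatWithProperty> tables = new LinkedList<>();
		// Loop over every contour that was found
		for (int i = 0; i < contours.size(); i++) {

			MatOfPoint point = contours.get(i);

			double area = Imgproc.contourArea(point);

			// Ignore anything below the threshold: it is a stray line, not a table
			if (area < 100) {
				continue;
			}

			// Approximate the contour with a polygon. The result is written to the
			// second argument, so it must be kept in a variable to have any effect.
			MatOfPoint2f approx = new MatOfPoint2f();
			Imgproc.approxPolyDP(new MatOfPoint2f(point.toArray()), approx, 3, true);

			// Convert the region to the rectangle that encloses the approximated shape
			boundRect[i] = Imgproc.boundingRect(new MatOfPoint(approx.toArray()));

			// Take the part of the intersection-point image covered by this rectangle
			Mat table_image = points_image.submat(boundRect[i]);

			List<MatOfPoint> table_contours = new ArrayList<>();
			Mat joint_mat = new Mat();
			Imgproc.findContours(table_image, table_contours, joint_mat, Imgproc.RETR_CCOMP, Imgproc.CHAIN_APPROX_TC89_L1);

			// By the nature of tables, a region with fewer than 4 intersection points
			// does not contain a complete table, so skip it
			if (table_contours.size() < 4)
				continue;

			// Record the rectangle
			MatWithProperty mp = new MatWithProperty(null, boundRect[i]);
			tables.addFirst(mp);
		}

		ImageTable table = new ImageTable();
		table.setImageHeight(src.rows());
		table.setImageWidth(src.cols());

		// 10. Build the buckets
		List<Row> horBuckets = new ArrayList<>();

		table.setRows(horBuckets);

		// Build the row buckets
		createRowBuckets(tables, horBuckets);

		// Iterate over the row buckets
		for (Row row : horBuckets) {
			RowBucket bucket = (RowBucket) row;
			List<MatWithProperty> rowMats = bucket.elements;

			// Build the column buckets
			List<ColBucket> verBuckets = new ArrayList<>();
			createColBuckets(rowMats, verBuckets);
		}

		// Return the structure
		return table;
	}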
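The detection code calls `createRowBuckets` and `createColBuckets` without showing them. One plausible way to group cell rectangles into rows is to sort them by their top edge and start a new row whenever a cell's top edge lies more than a tolerance below the first cell of the current row; the cells of each row are then ordered left to right. The sketch below follows that approach and is only an assumption about how such bucketing could work, not the author's implementation. `CellGrouper`, `Cell`, `groupIntoRows`, and the `tolerance` parameter are hypothetical names.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

final class CellGrouper {

    /** Minimal cell rectangle; stands in for the article's MatWithProperty. */
    static final class Cell {
        final int x;
        final int y;
        final int width;
        final int height;

        Cell(int x, int y, int width, int height) {
            this.x = x;
            this.y = y;
            this.width = width;
            this.height = height;
        }
    }

    /**
     * Groups cells into rows by their top edge, then orders each row from
     * left to right. Cells whose top edge is within `tolerance` pixels of
     * the first cell in a row are treated as part of that row.
     */
    static List<List<Cell>> groupIntoRows(List<Cell> cells, int tolerance) {
        List<Cell> sorted = new ArrayList<>(cells);
        sorted.sort(Comparator.comparingInt((Cell c) -> c.y)
                .thenComparingInt((Cell c) -> c.x));

        List<List<Cell>> rows = new ArrayList<>();
        List<Cell> current = null;
        int rowTop = 0;
        for (Cell c : sorted) {
            if (current == null || c.y - rowTop > tolerance) {
                // Top edge is too far below the current row: open a new row
                current = new ArrayList<>();
                rows.add(current);
                rowTop = c.y;
            }
            current.add(c);
        }
        for (List<Cell> row : rows) {
            row.sort(Comparator.comparingInt((Cell c) -> c.x));
        }
        return rows;
    }
}
```

A small tolerance absorbs the pixel jitter that contour detection introduces between cells of the same row. Nested tables, where one outer cell spans several inner rows, would need extra handling that this sketch does not attempt.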

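The `table` returned by `parseImageTableStructure` is the table-structure data; you can think of it as a mask of the table. Detection is 100% accurate for simple, clear tables but only about 60% for large or nested tables, so to reach 100% the structure data has to be refined, this time through a UI. The stylesheet of the design page, which is built with Vue, follows: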




    .ocr-design-wrapper {
        /deep/.upload-demo {
            width: 100%;

            .el-upload {
                width: 100%;

                .el-upload-dragger {
                    width: 100%;
                }
            }
        }

        .design-port-container {
            width: 100%;
            position: relative;
        }

        .bounding-container {
            width: 100%;
            height: 100%;
            position: absolute;
            left: 0;
            top: 0;

            .sketch-container {
                /deep/.bound-box {
                    background-color: rgba(100, 255, 187, 0.4);

                    .bound-box-close {
                        display: none;
                        cursor: pointer;
                        position: absolute;
                        right: -7px;
                        top: -7px;
                        background-color: #00c9ff;
                        color: #ffffff;
                    }

                    &.active {
                        .bound-box-close {
                            display: block;
                        }
                    }

                    .handle-tl {
                        top: -5px;
                        left: -5px;
                    }

                    .handle-tm {
                        top: -5px;
                    }

                    .handle-tr {
                        right: -5px;
                        top: -5px;
                    }

                    .handle-mr {
                        right: -5px;
                    }

                    .handle-ml {
                        left: -5px;
                    }

                    .handle-bl {
                        left: -5px;
                        bottom: -5px;
                    }

                    .handle-bm {
                        bottom: -5px;
                    }

                    .handle-br {
                        bottom: -5px;
                        right: -5px;
                    }
                }
            }
        }

        .top-header {
            height: 50px;
            line-height: 50px;
            padding-left: 20px;
            background-color: #dedede;
            width: 100%;
            left: 0px;
            z-index: 99999999;
            border-radius: 6px;
        }

        .design-port {
            margin-top: 10px;

            .left-view {
                width: 100%;
                height: 100%;
            }
        }
    }


    .btn-wrap {
        position: fixed;
        top: 50%;
        right: 10px;
        z-index: 14;
        width: 80px;

        /deep/ .el-button {
            margin-bottom: 10px;
            opacity: 0.6;

            &:hover {
                opacity: 1;
            }
        }

        /deep/ .el-button+.el-button {
            margin-left: 0;
        }
    }

    .innerDom {
        display: none !important;
    }

    .box {
        padding: 20px;
    }

    .comp-wrap {
        width: 313px;
        float: left;
        height: 736px;
    }

    .page-wrap {
        width: 100%;
        float: left;
        padding: 0 !important;
    }

    .edit-wrap {
        position: relative;
        float: left;
        width: 348px;
        height: 736px;
    }

    .drag-sty {
        border: 1px solid #e6e6e6;
        width: 100px;
        padding: 6px;
        font-size: 12px;
        height: 30px;
        display: inline-block;
        line-height: 18px;
    }

    .iconfont-back {
        background: #ccc;
        border-radius: 2px;
        padding: 0 2px;
        float: left;
        height: 18px;
        margin-right: 6px;
    }

    .drag-sty:hover .iconfont {
        color: #2875e8;
    }

    .iconfont {
        color: #a8a7a7;
        font-size: 18px;
    }

    .bg-purple {
        background: #d3dce6;
    }

    .bg-purple-light {
        background: #fafafa;
    }

    .left-shadow {}

    .grid-content {
        border-radius: 4px;
        overflow: auto;
        padding: 20px;
    }

    .tab-content {
        border: 1px solid #eee;
        border-radius: 4px;
        min-height: 736px;
        height: 100%;
        overflow: auto;
    }

    .item {
        height: 60px;
        border: 0px solid #333;
        display: inline-block;
        
        padding: 10px;
        margin-bottom: 5px;
        cursor: pointer;
    }

    .el-upload {
        width: 100%;
    }

    .el-upload-dragger {
        width: 100%;
    }

    .item2 {
        height: 80px;
        border: 0px solid #333;
        padding: 10px;
        margin-bottom: 5px;
        cursor: pointer;
    }

    #removeBox {
        height: 100px;
        width: 100px;
        border: 2px dashed #999;
        background: rgba(0, 0, 0, 0.3);
        position: absolute;
        bottom: 10px;
        right: 20px;
        background: url(/static/image/deleteBox.png) no-repeat;
        background-size: 90%;
        background-position: center center;
    }

    .flxed {
        position: relative;
        top: 0;
        left: 0;
    }

    .edit-content .el-form-item {
        margin-bottom: 0;
    }

    .vali-el-input {
        margin: 0 10px;
    }

    .el-checkbox {
        margin: 4px 0;
    }

    .el-divider--horizontal {
        margin: 4px 0;
    }

    h4,
    h5 {
        margin: 10px 0;
    }

    .submit-btn {
        float: right;
    }

    /deep/ .page-item-group {
        cursor: pointer;
        position: relative;

        .control-btn {
            right: 0;
        }
    }

    * {
        box-sizing: border-box;
    }

    /deep/ .vali-el-input .el-input__inner {
        height: 26px !important;
        padding-right: 0;
        padding-left: 4px;
    }

    /deep/ .sel-options .el-form-item__label {
        width: 100%;
        text-align: left;
    }

    /deep/ .el-icon-delete {
        cursor: pointer;
    }

    /deep/ .edit-content .el-input {
        width: 100%;
    }

    /deep/ .edit-content .el-date-editor.el-input,
    /deep/ .edit-content .el-date-editor.el-input__inner {
        width: 220px;
    }

    /deep/ .edit-content .el-input__inner {
        height: 30px;
        box-sizing: border-box;
    }

    /deep/ .edit-content .el-date-editor--date .el-icon-date,
    /deep/ .edit-content .time-select .el-icon-circle-close {
        line-height: 30px;
    }

    /deep/ .edit-content .el-form-item__label {
        height: 30px;
    }

    /deep/ .long_input {
        margin-left: 80px !important;
        position: relative;
    }

    /deep/ .long_input_label {
        width: 80px;
    }

    /deep/ .page-item:hover {
        background: #e0f2ff;
    }

    /deep/ .page-item-select {
        border: 1px dashed #4db8ff;
        background: #e0f2ff;
    }

    /deep/ .page-item-select .control-btn {
        display: block;
    }

    /deep/ .control-btn {
        position: absolute;
        top: 50%;
        right: -20px;
        transform: translate(0, -50%);
        display: none;
    }

    /deep/ .control-btn .control-delete {
        position: absolute;
        right: 0;
        bottom: -26px;
    }

    /deep/ .control-btn .control-arrow-wrap {
        height: 20px;
        cursor: pointer;
        line-height: 20px;
        background: #fff;
        display: block;
    }

    /deep/ .control-btn .control-arrow-down {
        bottom: -2px;
    }

    /deep/ .control-btn .control-arrow-up {
        top: 28px;
        margin-bottom: 6px;
    }

    /deep/ .tab-content .page-item {
        margin-bottom: 0;
        padding: 16px 20px;
        min-height: 90px;
    }

    .sel-sty {
        width: 200px;
        margin-top: 14px;
        display: block;
        // -webkit-appearance: none;
        background-color: #fff;
        background-image: none;
        border-radius: 4px;
        border: 1px solid #dcdfe6;
        box-sizing: border-box;
        color: #606266;
        font-size: inherit;
        height: 30px;
        line-height: 40px;
        outline: none;
        padding: 0 15px;
        transition: border-color 0.2s cubic-bezier(0.645, 0.045, 0.355, 1);
    }

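Note that this page uses the vue-draggable-resizable component. I won't cover how to add it to a Vue project. The running tool is shown in the figure below:

After a table image is uploaded, it looks like the figure below:

The red text is not suitable for display, so I have blacked it out.

Once the design is finished, you can view the resulting design data, which OpenCV can use for complete recognition. The code is as follows: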

	List<Row> recognizeFromSettings(File imagePath) {
		Mat src = Imgcodecs.imread(imagePath.getAbsolutePath());
		String extention = Utils.getUriExtention(imagePath.getPath());

		// 1. Build the row buckets from the designed cell metadata
		List<MatWithProperty> cellProperty = JSON.parseArray(this.settings.getString("cells"), MatWithProperty.class);
		List<Row> horBuckets = new ArrayList<>();
		createRowBuckets(cellProperty, horBuckets);

		// Iterate over the row buckets
		for (Row row : horBuckets) {
			RowBucket bucket = (RowBucket) row;
			List<MatWithProperty> rowMats = bucket.elements;

			// Build the column buckets
			List<ColBucket> verBuckets = new ArrayList<>();
			createColBuckets(rowMats, verBuckets);

			for (ColBucket verticalBucket : verBuckets) {
				List<MatWithProperty> colMats = verticalBucket.elements;

				// Iterate over the cells in this column
				for (MatWithProperty mat : colMats) {
					// Copy the cell region out of the source image
					Mat subMat = src.submat(new Rect(mat.rect.x, mat.rect.y, mat.rect.width, mat.rect.height)).clone();
					try {
						// Run OCR on the cell
						BufferedImage image = OpenCVUtils.convertMat2BufferedImage(subMat, extention);
						String content = tesseract.doOCR(image);
						row.addCell(content);
					} catch (Exception e) {
						logger.error("[OCR Failed]", e);
					} finally {
						// Free the native memory held by the cell image
						subMat.release();
					}
				}
			}
		}

		// Free the native memory held by the source image
		src.release();
		return horBuckets;
	}
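Step 5, returning the rows as JSON, is not shown in the listing. The project appears to use fastjson already (`JSON.parseArray` above), so in practice a library call would probably do the serialization. The library-free sketch below only illustrates one possible target shape: an array of rows in which each row is an array of cell strings. `TableJson`, `toJson`, and `escape` are hypothetical names, and trimming each cell is an assumption made here because OCR output often carries trailing line breaks.

```java
import java.util.List;

final class TableJson {

    /** Escapes a string so it can be placed inside a JSON string literal. */
    static String escape(String s) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < s.length(); i++) {
            char ch = s.charAt(i);
            switch (ch) {
                case '"':
                    sb.append("\\\"");
                    break;
                case '\\':
                    sb.append("\\\\");
                    break;
                case '\n':
                    sb.append("\\n");
                    break;
                case '\r':
                    sb.append("\\r");
                    break;
                case '\t':
                    sb.append("\\t");
                    break;
                default:
                    if (ch < 0x20) {
                        // Remaining control characters use the \\uXXXX form
                        sb.append(String.format("\\u%04x", (int) ch));
                    } else {
                        sb.append(ch);
                    }
            }
        }
        return sb.toString();
    }

    /** Serializes rows of cell text as a JSON array of arrays of strings. */
    static String toJson(List<List<String>> rows) {
        StringBuilder sb = new StringBuilder("[");
        for (int r = 0; r < rows.size(); r++) {
            if (r > 0) {
                sb.append(',');
            }
            sb.append('[');
            List<String> row = rows.get(r);
            for (int c = 0; c < row.size(); c++) {
                if (c > 0) {
                    sb.append(',');
                }
                // Treat a missing cell as empty text and drop surrounding whitespace
                String cell = row.get(c) == null ? "" : row.get(c).trim();
                sb.append('"').append(escape(cell)).append('"');
            }
            sb.append(']');
        }
        return sb.append(']').toString();
    }
}
```

Keeping the output as nested arrays preserves the row and column order produced by the buckets; a keyed format (for example, one object per row using the header cells as keys) would be a straightforward variation.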

Feel free to share. When reposting, please credit the source: 内存溢出

Original article: http://outofmemory.cn/zaji/4012611.html
