Text Recognition


I. tesseract.js

1. Introduction

Tesseract.js is a JavaScript library that extracts text from images.

OCR stands for optical character recognition.

Official website

GitHub repository

2. Installation

(1) Installing in a Node project

npm install tesseract.js --save

import Tesseract from 'tesseract.js';

(2) Including directly with a script tag

<script src='https://cdn.jsdelivr.net/npm/tesseract.js@5/dist/tesseract.min.js'></script>

3. Using a Worker

(1) Creating a worker

// English recognition
const worker = await Tesseract.createWorker('eng');

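To recognize several languages, join their codes with +; for English plus Simplified Chinese, write eng+chi_sim. The remaining language packs are listed at the link below.

https://tesseract-ocr.github.io/tessdoc/Data-Files

If the language packs download too slowly, use langPath to change where they are downloaded from.

// Download the language packs from http://127.0.0.1, e.g. http://127.0.0.1/chi_sim.traineddata.gz
const worker = await Tesseract.createWorker('eng+chi_sim', 1, {
    langPath: 'http://127.0.0.1'    // must not end with /
});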

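// Or load everything from local files
// (createWorker takes the languages and the OEM first, then the options, and returns a promise;
// 'eng' is only an example, use the languages present under langPath)
const worker = await Tesseract.createWorker('eng', 1, {
    workerPath: '../node_modules/tesseract.js/dist/worker.min.js',
    langPath: '../lang-data',
    // a directory rather than a single .js file, so the SIMD or non-SIMD core can be picked
    corePath: '../node_modules/tesseract.js-core',
    logger: m => console.log(m),
});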
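
To make the langPath rule concrete, the sketch below lists the URLs implied by a language string and a langPath, assuming the `<langPath>/<lang>.traineddata.gz` layout shown in the comment above. `traineddataUrls` is a hypothetical helper written for these notes, not part of Tesseract.js; it also strips a trailing slash, which guards against the mistake the comment warns about.

```javascript
// Hypothetical helper (not part of Tesseract.js): list the traineddata URLs
// implied by a language string such as 'eng+chi_sim' and a langPath.
function traineddataUrls(langs, langPath, gzip = true) {
  // Drop trailing slashes, since langPath must not end with "/".
  const base = langPath.replace(/\/+$/, '');
  // The gzip flag mirrors the gzip option: gzipped files carry a .gz suffix.
  const suffix = gzip ? '.traineddata.gz' : '.traineddata';
  return langs
    .split('+')
    .filter(Boolean)
    .map((lang) => `${base}/${lang}${suffix}`);
}

console.log(traineddataUrls('eng+chi_sim', 'http://127.0.0.1/'));
```

Checking these URLs in a browser is a quick way to confirm that a self-hosted language pack is reachable before creating the worker.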

Other available constructor arguments

Arguments:

  • langs: a string naming the languages whose traineddata should be downloaded; join multiple languages with '+'
  • oem: an enum selecting the OCR engine mode to use
  • options: an object of customized options
    • corePath: path for downloading tesseract-core.wasm.js and tesseract-core-simd.wasm.js; both files come from the Tesseract.js-core package
      • Setting corePath to a specific .js file is strongly discouraged. To provide the best performance on all devices, Tesseract.js needs to be able to pick between tesseract-core.wasm.js and tesseract-core-simd.wasm.js. See this issue for more detail.
    • langPath: path for downloading the language packs (traineddata files); do not include / at the end of the path
    • workerPath: path for downloading the worker script
    • dataPath: path for saving traineddata in the WebAssembly file system; rarely needs to be changed
    • cachePath: path for the cached traineddata; more useful in Node, since in the browser it only changes the key in IndexedDB
    • cacheMethod: a string selecting the cache management method, one of the following
      • write: read the cache and write back (default method)
      • readOnly: read the cache but do not write back
      • refresh: do not read the cache, but write back
      • none: neither read the cache nor write back
    • legacyCore: set to true to ensure any code downloaded supports the Legacy model (in addition to the LSTM model)
    • legacyLang: set to true to ensure any language data downloaded supports the Legacy model (in addition to the LSTM model)
    • workerBlobURL: a boolean defining whether to use a Blob URL for the worker script, default: true
    • gzip: a boolean defining whether the traineddata from the remote is gzipped, default: true
    • logger: a function to log the progress, a quick example is m => console.log(m)
    • errorHandler: a function to handle worker errors, a quick example is err => console.error(err)
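
The four cacheMethod values differ only in whether the cache is read and whether the traineddata is written back. The sketch below restates the list above as a lookup table. `CACHE_METHODS` and `cachePolicy` are hypothetical names used for illustration; Tesseract.js does not export them.

```javascript
// Restates the cacheMethod list as data: is the cache read, and is the
// traineddata written back to it afterwards?
const CACHE_METHODS = {
  write: { read: true, write: true }, // default method
  readOnly: { read: true, write: false },
  refresh: { read: false, write: true },
  none: { read: false, write: false },
};

// Hypothetical helper: look up the policy for a method name.
// Calling it without an argument returns the policy of the default method.
function cachePolicy(method = 'write') {
  const policy = CACHE_METHODS[method];
  if (!policy) {
    throw new Error(`Unknown cacheMethod: ${method}`);
  }
  return policy;
}

console.log(cachePolicy('refresh'));
```

Read this way, refresh is the value to reach for after replacing a language pack on the server, since it skips the stale cached copy and stores the new one.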

(2) Recognizing text

let image = '../static/idcard.jpg'

const worker = await Tesseract.createWorker('eng');

const { data: { text } } = await worker.recognize(image);

console.log('Recognition result: ', text);

To recognize only part of the image, pass a rectangle:

let image = 'http://127.0.0.1/static/idcard.jpg'

const worker = await Tesseract.createWorker('eng');

const { data: { text } } = await worker.recognize(image, {
    rectangle: { top: 0, left: 0, width: 100, height: 100 },
});

console.log('Recognition result: ', text);
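
When several regions of the same image are needed (for example the separate fields of an ID card), one worker can be reused for all of them. The following is a minimal sketch of that idea. `recognizeRegions` is a hypothetical helper, and the worker is passed in as a parameter so the function only relies on the recognize(image, { rectangle }) call shown above.

```javascript
// Recognize several regions of one image with a single worker.
// `worker` is anything exposing recognize(image, options), such as the
// object returned by Tesseract.createWorker().
async function recognizeRegions(worker, image, rectangles) {
  const texts = [];
  for (const rectangle of rectangles) {
    // Await each call so the results come back in the order of `rectangles`.
    const { data: { text } } = await worker.recognize(image, { rectangle });
    texts.push(text.trim());
  }
  return texts;
}

// Usage (in a page where Tesseract.js is loaded):
//   const worker = await Tesseract.createWorker('eng');
//   const texts = await recognizeRegions(worker, '../static/idcard.jpg', [
//     { left: 0, top: 0, width: 100, height: 100 },
//     { left: 0, top: 100, width: 100, height: 100 },
//   ]);
//   await worker.terminate();
```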

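The image argument. The supported image formats and data types are listed below.

Supported image formats: bmp, jpg, png, pbm, webp

For both the browser and Node, the supported data types are:

  • a string containing a base64-encoded image, matching data:image\/([a-zA-Z]*);base64,([^"]*)

  • buffer

For the browser only, the supported data types are:

  • File or Blob object

  • img or canvas element

For Node only, the supported data types are:

  • a string containing the path to a local image

Note: an image must be in a supported image format and also be passed as a supported data type. For example, a buffer containing a png image is supported; a buffer containing raw pixel data is not.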
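
As a small illustration of the string case, the sketch below applies the pattern quoted above to decide whether a string is a base64 data URI. `describeImageString` is a hypothetical helper for these notes; it does not check that a path or URL actually points to an image.

```javascript
// The pattern quoted above for base64-encoded image strings.
const DATA_URI = /data:image\/([a-zA-Z]*);base64,([^"]*)/;

// Hypothetical helper: classify a string before passing it to recognize().
function describeImageString(input) {
  const match = DATA_URI.exec(input);
  if (match) {
    // match[1] is the format named in the URI, match[2] the base64 payload.
    return { kind: 'base64', format: match[1], payloadLength: match[2].length };
  }
  // Anything else is treated here as a path or URL; this sketch does not verify it.
  return { kind: 'path-or-url' };
}

console.log(describeImageString('data:image/png;base64,iVBORw0KGgo='));
console.log(describeImageString('../static/idcard.jpg'));
```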

(3) Terminating the worker

const worker = await Tesseract.createWorker('eng');

// Terminate the worker and release its resources
await worker.terminate();
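
A worker that is never terminated keeps its resources alive, so it helps to tie termination to a try/finally. The sketch below shows one way to do that. `withWorker` is a hypothetical helper, and the factory is passed in as a function so the code does not depend on how Tesseract.js was loaded.

```javascript
// Run `task` with a freshly created worker and always terminate it afterwards.
// `createWorker` is a function returning a promise of a worker,
// for example: () => Tesseract.createWorker('eng')
async function withWorker(createWorker, task) {
  const worker = await createWorker();
  try {
    return await task(worker);
  } finally {
    // Runs whether the task succeeded or threw.
    await worker.terminate();
  }
}

// Usage (in a page where Tesseract.js is loaded):
//   const text = await withWorker(
//     () => Tesseract.createWorker('eng'),
//     async (worker) => (await worker.recognize('../static/idcard.jpg')).data.text
//   );
```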

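(4) Writing a file to memory

MEMFS is an in-memory file system: the data lives entirely in memory, so anything written while the program runs is lost when the page is refreshed or the program is reloaded. This kind of file system is typically used for temporary data, or in embedded systems.

Worker.writeText() writes a text file to MEMFS.

Arguments:

  • path text file path
  • text content of the text file
  • jobId Please see details above

try {
    await worker.writeText('tmp.txt', 'hello world!!!');
} catch (e) {
    console.log(e);
}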

(5) Reading a file from memory

Arguments:

  • path text file path
  • jobId Please see details above

const { data } = await worker.readText('tmp.txt');
console.log(data);

(6) Deleting a file from memory

Arguments:

  • path file path
  • jobId Please see details above

await worker.removeFile('tmp.txt');
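
The three MEMFS calls above combine naturally into a write, read, delete round trip. The sketch below is one way to arrange them. `roundTripText` is a hypothetical helper, and it assumes only the writeText, readText and removeFile methods described in this section.

```javascript
// Write a text file into the worker's MEMFS, read it back, then remove it.
// `worker` must expose writeText, readText and removeFile as described above.
async function roundTripText(worker, path, text) {
  await worker.writeText(path, text);
  try {
    const { data } = await worker.readText(path);
    return data;
  } finally {
    // Remove the file even if reading fails, so MEMFS does not accumulate files.
    await worker.removeFile(path);
  }
}

// Usage (in a page where Tesseract.js is loaded):
//   const worker = await Tesseract.createWorker('eng');
//   console.log(await roundTripText(worker, 'tmp.txt', 'hello world!!!'));
//   await worker.terminate();
```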

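4. Scheduler

(1) Creating a scheduler

createScheduler() is a factory function that creates a scheduler. A scheduler manages a job queue and a pool of workers so that several workers can work together, which is useful when you want to speed up processing.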

const { createScheduler } = Tesseract;
const scheduler = createScheduler();

(2) addWorker

scheduler.addWorker() adds a worker to the scheduler's internal worker pool. A given worker should be added to only one scheduler.

const { createWorker, createScheduler } = Tesseract;
const scheduler = createScheduler();
const worker = await createWorker();
scheduler.addWorker(worker);

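(3) addJob

addJob() adds a job to the job queue; the scheduler waits for an idle worker and hands it the job.

Arguments:

  • action a string naming the action to perform; currently only recognize and detect are supported
  • payload any number of arguments, depending on the action being called

const { data: { text } } = await scheduler.addJob('recognize', image, options);
const { data } = await scheduler.addJob('detect', image);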

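(4) getQueueLen

Returns the length of the job queue.

var queueLen = scheduler.getQueueLen();

(5) getNumWorkers

Returns the number of workers in the scheduler.

(6) terminate

Terminates all workers that were added to the scheduler.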
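
Putting the scheduler calls together, the sketch below recognizes a batch of images with a small pool of workers and cleans up afterwards. `recognizeAll` is a hypothetical helper. Its `api` parameter stands for an object offering createScheduler and createWorker (the Tesseract namespace in practice), and the default of two workers is an arbitrary choice rather than a recommendation from the library.

```javascript
// Recognize many images with a pool of workers managed by one scheduler.
// `api` stands for the Tesseract.js namespace ({ createScheduler, createWorker });
// it is a parameter so the sketch can also be exercised with stand-ins.
async function recognizeAll(api, images, numWorkers = 2, langs = 'eng') {
  const scheduler = api.createScheduler();
  try {
    // Create the workers concurrently, then add each one to this scheduler only.
    const pending = Array.from({ length: numWorkers }, () => api.createWorker(langs));
    const workers = await Promise.all(pending);
    workers.forEach((worker) => scheduler.addWorker(worker));

    // Queue every image; the scheduler hands each job to an idle worker.
    const jobs = images.map((image) => scheduler.addJob('recognize', image));
    const results = await Promise.all(jobs);
    return results.map(({ data: { text } }) => text);
  } finally {
    // Terminates every worker that was added to the scheduler.
    await scheduler.terminate();
  }
}

// Usage (in a page where Tesseract.js is loaded):
//   const texts = await recognizeAll(Tesseract, ['a.jpg', 'b.jpg', 'c.jpg'], 2);
```

Because addJob returns a promise per job, Promise.all keeps the results in the same order as the input images even when the workers finish out of order.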

5. setLogging

Turns detailed logging on or off.

Arguments:

  • logging boolean to define whether to see detailed logs, default: false

Examples:

const { setLogging } = Tesseract;
setLogging(true);

6. Example: recognizing text inside a box drawn on an image

<!DOCTYPE html>
<html lang="en">
<head>
    <meta charset="UTF-8">
    <title>Rectangle Drawing Tool</title>
    <style type="text/css">
        .box {
            background: #f00;
            width: 0px;
            height: 0px;
            position: absolute;
            opacity: 0.5;
            cursor: move;
        }

        .droptarget {
            float: left;
            width: 100px;
            height: 1000px;
            margin: 15px;
            padding: 10px;
            border: 1px solid #aaaaaa;
        }
    </style>
</head>

<body>

<!-- Selection-box code adapted from https://blog.csdn.net/kuronekonano/article/details/98498613 -->


<script>

  // Show the coordinates while the pointer moves inside the image area
  function cnvs_getCoordinates(e) {
    // Use pageX/pageY (document coordinates); clientX/clientY are viewport coordinates and change with scrolling
    var x = e.pageX - e.target.offsetLeft;
    var y = e.pageY - e.target.offsetTop;
    document.getElementById("xycoordinates").innerHTML = "Coordinates: (" + x + "," + y + ")";
  }

  // Reset the displayed coordinates once the pointer leaves the area
  function cnvs_clearCoordinates() {
    document.getElementById("xycoordinates").innerHTML = "Coordinates: (0,0)";
  }

</script>
<div id="coordiv" style="" onmousemove="cnvs_getCoordinates(event)" onmouseout="cnvs_clearCoordinates()">
    <img src="../static/idcard.jpg" ondragstart="return false;">
</div>
<br>
<input type="text" name="coor" placeholder="Coordinates" readonly="true" id="x_y" style="width: 300px"><br>
<button id="ocr_btn">Recognize text</button>
<button id="reset">Reset</button>

<div id="xycoordinates"></div>

<!--https://github.com/naptha/tesseract.js#tesseractjs-->
<!--<script src='https://cdn.jsdelivr.net/npm/tesseract.js@5/dist/tesseract.min.js'></script>-->
<script src='../js/tesseract.min.js'></script>
<script>

  function dragStart(event) {
    event.dataTransfer.setData("Text", event.target.id);
  }

  function allowDrop(event) {
    event.preventDefault();
  }

  function drop(event) {
    event.preventDefault();
    var data = event.dataTransfer.getData("Text");
    event.target.appendChild(document.getElementById(data));
  }


  window.onload = function (e) {
    e = e || window.event;
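    // startX, startY: pointer coordinates at mouse down
    // diffX, diffY: offset between the pointer and the box's top-left corner, used while dragging
    var startX, startY, diffX, diffY;
    // Whether a box is being dragged; initially false
    var dragging = false;

    // Start and end points of the selection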
    var begin_x = 0;
    var begin_y = 0;
    var end_x = 0;
    var end_y = 0;

    var coordiv = document.getElementById('coordiv');

    // Mouse down
    document.onmousedown = function (e) {

      startX = e.pageX;
      startY = e.pageY;

      // Only react when the press lands inside the image area (coordiv); the coordinate check rejects presses outside it
      if (startY <= coordiv.offsetTop + coordiv.offsetHeight && startY >= coordiv.offsetTop && startX >= coordiv.offsetLeft && startX <= coordiv.offsetLeft + coordiv.offsetWidth) {
        // Allow only one selection box at a time
        reset();

        if (e.target.className.match(/box/)) {
          // Allow dragging
          dragging = true;

          // Mark the current box with the id moving_box
          if (document.getElementById("moving_box") !== null) {
            document.getElementById("moving_box").removeAttribute("id");
          }
          e.target.id = "moving_box";

          // Compute the offset between the pointer and the box
          diffX = startX - e.target.offsetLeft;
          diffY = startY - e.target.offsetTop;
        } else {
          // Create a box on the page
          var active_box = document.createElement("div");
          active_box.id = "active_box";
          active_box.className = "box";
          active_box.style.top = startY + 'px';
          active_box.style.left = startX + 'px';
          active_box.setAttribute("ondrop", "drop(event)");
          active_box.setAttribute("ondragover", "allowDrop(event)");
          document.body.appendChild(active_box);
          active_box = null;
        }
      }

    };

    // Mouse move
    document.onmousemove = function (e) {
      if (e.pageY <= coordiv.offsetTop + coordiv.offsetHeight && e.pageY >= coordiv.offsetTop && e.pageX >= coordiv.offsetLeft && e.pageX <= coordiv.offsetLeft + coordiv.offsetWidth) {
        // Resize the box being drawn
        var ab = document.getElementById("active_box");

        // If the document contains an active_box, change its size
        if (document.getElementById("active_box") !== null) {
          ab.style.width = e.pageX - startX + 'px';
          ab.style.height = e.pageY - startY + 'px';
        }

        // Move the box being dragged and update its coordinates
        if (document.getElementById("moving_box") !== null && dragging) {
          var mb = document.getElementById("moving_box");
          var xy_div = document.getElementById(mb.style.left.substring(0, mb.style.left.length - 2) + mb.style.top.substring(0, mb.style.top.length - 2));

          var tmx = e.pageX - diffX;
          var tmy = e.pageY - diffY;


          if (tmx + mb.offsetWidth <= coordiv.offsetLeft + coordiv.offsetWidth && tmx >= coordiv.offsetLeft && tmy + mb.offsetHeight <= coordiv.offsetTop + coordiv.offsetHeight && tmy >= coordiv.offsetTop) {
            mb.style.top = e.pageY - diffY + 'px';
            mb.style.left = e.pageX - diffX + 'px';

            if (xy_div !== null) {
              var new_x = mb.style.left.substring(0, mb.style.left.length - 2);
              var new_y = mb.style.top.substring(0, mb.style.top.length - 2);
              xy_div.id = new_x + new_y;
              new_x -= coordiv.offsetLeft;
              new_y -= coordiv.offsetTop;
              // new_x and new_y are already relative to coordiv, so right = left + width and bottom = top + height
              var new_r = parseInt(mb.style.width.substring(0, mb.style.width.length - 2), 10) + parseInt(new_x, 10);
              var new_b = parseInt(mb.style.height.substring(0, mb.style.height.length - 2), 10) + parseInt(new_y, 10);
              xy_div.innerText = new_x + "," + new_y + "," + new_r + "," + new_b;
              var input_div = document.getElementById("x_y")
              input_div.value = xy_div.innerHTML
            }
          }
        }
      }
    }

    // Mouse up
    document.onmouseup = function (e) {
      // Stop dragging
      dragging = false;
      if (document.getElementById("active_box") !== null) {
        var ab = document.getElementById("active_box");
        ab.removeAttribute("id");
        // Remove the box if its width or height is under 10px
        if (ab.offsetWidth < 10 || ab.offsetHeight < 10) {
          document.body.removeChild(ab);
        }
        if (ab.offsetWidth >= 10 && ab.offsetHeight >= 10) {
          var xy_div = document.createElement("div");
          xy_div.id = startX.toString() + startY.toString();
          xy_div.className = 'show_point';
          document.body.appendChild(xy_div);

          begin_x = startX - coordiv.offsetLeft;
          begin_y = startY - coordiv.offsetTop;
          end_x = e.pageX - coordiv.offsetLeft;
          end_y = e.pageY - coordiv.offsetTop;
          xy_div.innerHTML = begin_x + "," + begin_y + "," + end_x + "," + end_y;
          var input_div = document.getElementById("x_y")
          input_div.value = xy_div.innerHTML
        }
      }
    };

    // Double click
    document.ondblclick = function (e) {
      if (e.target.className.match(/box/)) {

        if (document.getElementById("del_box") !== null) {
          document.getElementById("del_box").removeAttribute("id");
        }
        e.target.id = "del_box";
        var ab = document.getElementById("del_box");

        var xy_div = document.getElementById(ab.style.left.substring(0, ab.style.left.length - 2) + ab.style.top.substring(0, ab.style.top.length - 2))
        if (xy_div !== null) {
          xy_div.removeAttribute("id");
          document.body.removeChild(xy_div);
        }
        document.body.removeChild(ab);
      }

    }

    /**
     * Submit the selected coordinates for OCR
     */
    document.getElementById('ocr_btn').addEventListener('click', async function (e) {
      console.log('begin_x=', begin_x)
      console.log('begin_y=', begin_y)
      console.log('end_x=', end_x)
      console.log('end_y=', end_y)

      let image = '../static/idcard.jpg'

      const worker = await Tesseract.createWorker('eng');
      const { data: { text } } = await worker.recognize(image, {
        rectangle: { top: begin_y, left: begin_x, width: end_x - begin_x, height: end_y - begin_y },
      });
      await worker.terminate();

      alert('Recognition result: ' + text);
      console.log('Recognition result: ' , text);
    })

    /**
     * Reset the selection
     */
    document.getElementById('reset').addEventListener('click', reset)

    /**
     * Remove all selection boxes
     */
    function reset() {
      var boxArr = document.getElementsByClassName('box');

      while (boxArr.length > 0){
        document.body.removeChild(boxArr.item(0));
      }

      var pointArr = document.getElementsByClassName('show_point');

      while (pointArr.length > 0){
        document.body.removeChild(pointArr.item(0));
      }
    }
  };
</script>
</body>
</html>
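
The click handler in the page above passes end minus begin as the width and height, which turns negative when the box is dragged up or to the left. The sketch below is one possible way to normalize the two corners into a rectangle. `selectionToRectangle` is a hypothetical helper, and its `scale` parameter rests on the assumption that the rectangle should be expressed in the image's own pixels when the image is displayed at a different size.

```javascript
// Convert two corner points (in displayed-image pixels) into the
// { left, top, width, height } shape passed to recognize().
// `scale` is naturalWidth / displayed width; 1 means the image is shown unscaled.
function selectionToRectangle(beginX, beginY, endX, endY, scale = 1) {
  // Use min/abs so the result is the same whichever corner the drag started from.
  const left = Math.min(beginX, endX);
  const top = Math.min(beginY, endY);
  const width = Math.abs(endX - beginX);
  const height = Math.abs(endY - beginY);
  return {
    left: Math.round(left * scale),
    top: Math.round(top * scale),
    width: Math.round(width * scale),
    height: Math.round(height * scale),
  };
}

console.log(selectionToRectangle(120, 80, 20, 30));
```

In the page above this would replace the inline rectangle literal, for example `rectangle: selectionToRectangle(begin_x, begin_y, end_x, end_y)`.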

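II. tess4j

1. Introduction

Tess4J is a Java API library for Tesseract OCR. Calling it from Java makes it easy to recognize images and extract the text they contain, which is what OCR does.

Tess4J can recognize the following inputs:

  • TIFF, JPEG, GIF, PNG and BMP image formats
  • multi-page TIFF images
  • PDF documents

Tesseract has three separate language-model repositories on GitHub: tessdata, tessdata-best and tessdata-fast. All three store language models; they differ as follows:

| Repository | How it was trained | Speed | Accuracy | Supports legacy engine | Retrainable |
| --- | --- | --- | --- | --- | --- |
| tessdata | Legacy + LSTM (integerized tessdata-best) | Medium | Medium | Yes | No |
| tessdata-best | LSTM only (based on langdata) | Slowest | Most accurate | No | Yes |
| tessdata-fast | Integerized LSTM of a smaller network than tessdata-best | Fastest | Least accurate | No | No |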

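2. Recognizing text

import net.sourceforge.tess4j.Tesseract;

import javax.imageio.ImageIO;
import java.awt.image.BufferedImage;
import java.io.File;

// OcrDemo is a placeholder class name added so the snippet compiles on its own
public class OcrDemo {
    public static void main(String[] args) throws Exception {
        BufferedImage bufferedImage = ImageIO.read(new File("D:\\static\\idcard.jpg"));
        // Crop the original image: the top-left corner is 100px from the left edge and 200px from the top; width 300px, height 400px
        BufferedImage subimage = bufferedImage.getSubimage(100, 200, 300, 400);
        // Save the cropped image as a temporary file (the OCR call below reads the in-memory image, not this file)
        File tempFile = File.createTempFile("subImage", ".jpg");
        ImageIO.write(subimage, "jpg", tempFile);

        Tesseract tesseract = new Tesseract();
        // Set the traineddata directory; putting it under resources is not recommended
        tesseract.setDatapath("D:\\data\\traineddata");
        // Recognize Simplified Chinese (use "eng" for English)
        tesseract.setLanguage("chi_sim");
        // Automatic page segmentation with OSD
        //tesseract.setPageSegMode(1);
        // Use the LSTM neural-network engine
        //tesseract.setOcrEngineMode(1);
        // Recognize the text
        String result = tesseract.doOCR(subimage);

        System.out.println(result);

        tempFile.delete();
    }
}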

III. Custom character set (training)

Download jTessBoxEditor:

https://sourceforge.net/projects/vietocr/files/jTessBoxEditor/jTessBoxEditor-2.6.0.zip/download

Unzip jTessBoxEditor-2.6.0.zip

Double-click jTessBoxEditor.jar or train.bat to run it

