I. tesseract.js
1. Introduction
Tesseract.js is a JavaScript library that extracts text from images.
OCR stands for optical character recognition.
2. Installation
(1) Installing with npm (Node)
npm install tesseract.js --save
import Tesseract from 'tesseract.js';
(2) Loading directly with a script tag
<script src='https://cdn.jsdelivr.net/npm/tesseract.js@5/dist/tesseract.min.js'></script>
3. Using a Worker
(1) Creating a worker
// English recognition
const worker = await Tesseract.createWorker('eng');
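To recognize several languages at once, join the language codes with +. For English plus Simplified Chinese, write eng+chi_sim. The other language packs are listed at the link below: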
https://tesseract-ocr.github.io/tessdoc/Data-Files
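If the language pack downloads too slowly, set langPath to change where it is downloaded from.
// Download the language pack from http://127.0.0.1/chi_sim.traineddata.gz
const worker = await Tesseract.createWorker('eng+chi_sim', 1, {
langPath: 'http://127.0.0.1' // must not end with /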
});
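The two rules above (language codes joined with +, and a langPath without a trailing slash) can be sketched as follows. The helper names joinLangs and languagePackUrls are invented for this illustration and are not part of Tesseract.js; the URL pattern is only inferred from the comment in the snippet above, and the gzip flag mirrors the option of the same name.

```javascript
// Hypothetical helpers for illustration only; not part of Tesseract.js.

// Join language codes the way createWorker expects them, e.g. "eng+chi_sim".
function joinLangs(langs) {
  return langs.join('+');
}

// List the URLs a download would presumably hit for a given langPath.
// Assumption: files are named <lang>.traineddata(.gz) directly under langPath.
function languagePackUrls(langString, langPath, gzip = true) {
  if (langPath.endsWith('/')) {
    throw new Error('langPath must not end with "/"');
  }
  const suffix = gzip ? '.traineddata.gz' : '.traineddata';
  return langString.split('+').map((lang) => `${langPath}/${lang}${suffix}`);
}

const langs = joinLangs(['eng', 'chi_sim']);
console.log(langs); // eng+chi_sim
console.log(languagePackUrls(langs, 'http://127.0.0.1'));
```

Checking the path up front makes a misconfigured langPath fail early instead of producing a URL with a doubled slash.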
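// Or load everything from local files
const localWorker = await Tesseract.createWorker('eng+chi_sim', 1, {
workerPath: '../node_modules/tesseract.js/dist/worker.min.js',
langPath: '../lang-data',
// Point corePath at the directory rather than a single .js file, so the library can choose between the SIMD and non-SIMD builds
corePath: '../node_modules/tesseract.js-core',
logger: m => console.log(m),
});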
Other available constructor arguments
Arguments:
- langs: a string indicating the languages whose traineddata should be downloaded; join multiple languages with '+'
- oem: an enum indicating the OCR engine mode to use
- options: an object of customized options
  - corePath: path for downloading tesseract-core.wasm.js and tesseract-core-simd.wasm.js; both files come from the Tesseract.js-core package
    - Setting corePath to a specific .js file is strongly discouraged. To provide the best performance on all devices, Tesseract.js needs to be able to pick between tesseract-core.wasm.js and tesseract-core-simd.wasm.js. See this issue for more detail.
  - langPath: path for downloading language packs (traineddata files); do not include / at the end of the path
  - workerPath: path for downloading the worker script
  - dataPath: path for saving traineddata in the WebAssembly file system; rarely needs to be changed
  - cachePath: path for the cached traineddata; more useful for Node, for the browser it only changes the key in IndexedDB
  - cacheMethod: a string indicating the method of cache management, one of the following options
    - write: read cache and write back (default method)
    - readOnly: read cache and not to write back
    - refresh: not to read cache and write back
    - none: not to read cache and not to write back
  - legacyCore: set to true to ensure any code downloaded supports the Legacy model (in addition to the LSTM model)
  - legacyLang: set to true to ensure any language data downloaded supports the Legacy model (in addition to the LSTM model)
  - workerBlobURL: a boolean to define whether to use a Blob URL for the worker script, default: true
  - gzip: a boolean to define whether the traineddata from the remote is gzipped, default: true
  - logger: a function to log the progress, a quick example is m => console.log(m)
  - errorHandler: a function to handle worker errors, a quick example is err => console.error(err)
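The four cacheMethod values differ only in whether the cache is read and whether it is written back. A minimal model of that table is sketched below; cachePolicy and its field names are made up for this illustration, and the real decision logic lives inside Tesseract.js.

```javascript
// Illustrative model of the cacheMethod options; not the library's own code.
const CACHE_METHODS = new Map([
  ['write', { readCache: true, writeCache: true }],     // default
  ['readOnly', { readCache: true, writeCache: false }],
  ['refresh', { readCache: false, writeCache: true }],
  ['none', { readCache: false, writeCache: false }],
]);

// Look up what a given cacheMethod implies; unknown values are rejected.
function cachePolicy(method = 'write') {
  const policy = CACHE_METHODS.get(method);
  if (!policy) {
    throw new Error(`Unknown cacheMethod: ${method}`);
  }
  return policy;
}

console.log(cachePolicy('refresh')); // { readCache: false, writeCache: true }
```

Read this way, refresh is the option to reach for after replacing a traineddata file on the server, since it skips the stale cached copy and stores the new one.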
(2) Recognizing text
let image = '../static/idcard.jpg'
const worker = await Tesseract.createWorker('eng');
const { data: { text } } = await worker.recognize(image);
console.log('Recognition result: ', text);
To recognize only a specific region of the image:
let image = 'http://127.0.0.1/static/idcard.jpg'
const worker = await Tesseract.createWorker('eng');
const { data: { text } } = await worker.recognize(image, {
rectangle: { top: 0, left: 0, width: 100, height: 100 },
});
console.log('Recognition result: ', text);
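The rectangle option takes top, left, width, and height, while a selection is often captured as two corner points. A small sketch of the conversion follows; toRectangle is a hypothetical helper, not part of Tesseract.js, and it also handles the case where the second point lies above or to the left of the first.

```javascript
// Hypothetical helper: turn two corner points into a { left, top, width, height } rectangle.
// Works regardless of the direction in which the selection was dragged.
function toRectangle(x1, y1, x2, y2) {
  return {
    left: Math.min(x1, x2),
    top: Math.min(y1, y2),
    width: Math.abs(x2 - x1),
    height: Math.abs(y2 - y1),
  };
}

// Dragged from bottom-right (120, 80) to top-left (20, 30).
console.log(toRectangle(120, 80, 20, 30)); // { left: 20, top: 30, width: 100, height: 50 }
```

The result can be passed as the rectangle field of the options object given to worker.recognize().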
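The image parameter: the supported image formats and data types are listed below.
Supported image formats: bmp, jpg, png, pbm, webp
For both the browser and Node, the supported data types are:
- a string containing a base64-encoded image, matching data:image\/([a-zA-Z]*);base64,([^"]*)
- a buffer
For the browser only, the supported data types are:
- a File or Blob object
- an img or canvas element
For Node only, the supported data types are:
- a string containing the path to a local image
Note: images must be in a supported image format and a supported data type. For example, a buffer containing a png image is supported; a buffer containing raw pixel data is not.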
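To tell a base64 string apart from a path before handing it to recognize(), the pattern quoted above can be applied directly. The sketch below is an assumption-laden illustration: parseBase64Image is an invented name, and the ^ and $ anchors are added here so that the whole string must match.

```javascript
// Pattern taken from the list above, with ^ and $ anchors added for this example.
const BASE64_IMAGE = /^data:image\/([a-zA-Z]*);base64,([^"]*)$/;

// Hypothetical helper: returns the image format and payload, or null if the
// input is not a base64 data URL of the expected shape.
function parseBase64Image(input) {
  const match = BASE64_IMAGE.exec(input);
  return match ? { format: match[1], payload: match[2] } : null;
}

console.log(parseBase64Image('data:image/png;base64,iVBORw0KGgo='));
console.log(parseBase64Image('../static/idcard.jpg')); // null
```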
(3) Terminating the worker
const worker = await Tesseract.createWorker('eng');
// Terminate the worker and release its resources
await worker.terminate();
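Because a worker stays alive until terminate() is called, it can help to wrap its lifetime in try/finally so that it is shut down even when recognition throws. The sketch below assumes only that the worker exposes an async terminate() method; withWorker is a made-up helper name, and the worker factory is passed in so the pattern does not depend on how the worker is created.

```javascript
// Hypothetical helper: create a worker, run a task with it, and always terminate it.
// createWorker is any function returning (a promise of) an object with terminate().
async function withWorker(createWorker, task) {
  const worker = await createWorker();
  try {
    return await task(worker);
  } finally {
    // Runs on success and on failure alike.
    await worker.terminate();
  }
}

// Intended usage with Tesseract.js (not executed here):
// const { data: { text } } = await withWorker(
//   () => Tesseract.createWorker('eng'),
//   (worker) => worker.recognize(image),
// );
```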
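(4) Writing to an in-memory file
MEMFS is an in-memory file system: data is stored entirely in memory, and anything written while the program runs is lost when the page is refreshed or the program reloads. This kind of file system is typically used for temporary data, or in embedded systems.
Worker.writeText() writes data to MEMFS.
Arguments:
- path: text file path
- text: content of the text file
- jobId: Please see details above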
await worker.writeText('tmp.txt', 'hello world!!!').then(()=>{
}).catch((e)=>{
console.log(e)
});
(5) Reading an in-memory file
Arguments:
- path: text file path
- jobId: Please see details above
const { data } = await worker.readText('tmp.txt');
console.log(data);
(6) Deleting an in-memory file
Arguments:
- path: file path
- jobId: Please see details above
await worker.removeFile('tmp.txt');
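Put together, the three MEMFS calls form a write, read, delete round trip. The sketch below takes the worker as a parameter and uses only writeText, readText, and removeFile as they appear in the snippets above; roundTrip itself is a hypothetical helper name.

```javascript
// Hypothetical helper: write a text file to MEMFS, read it back, then delete it.
// `worker` is expected to be a Tesseract.js worker (or anything with the same three methods).
async function roundTrip(worker, path, text) {
  await worker.writeText(path, text);
  const { data } = await worker.readText(path);
  await worker.removeFile(path);
  return data;
}

// Intended usage (not executed here):
// const worker = await Tesseract.createWorker('eng');
// console.log(await roundTrip(worker, 'tmp.txt', 'hello world!!!'));
// await worker.terminate();
```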
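4. Scheduler
(1) Creating a scheduler
createScheduler()
is a factory function that creates a scheduler. A scheduler manages a job queue and a set of workers so that multiple workers can work together, which is useful when you want to speed up processing.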
const { createScheduler } = Tesseract;
const scheduler = createScheduler();
(2) addWorker
scheduler.addWorker()
adds a worker to the scheduler's internal worker pool. It is recommended to add any given worker to only one scheduler.
const { createWorker, createScheduler } = Tesseract;
const scheduler = createScheduler();
const worker = await createWorker();
scheduler.addWorker(worker);
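(3) addJob
addJob()
adds a job to the job queue; the scheduler waits for an idle worker and hands the job to it.
Arguments:
- action: a string indicating the action you want to perform; currently only recognize and detect are supported
- payload: any number of arguments, depending on the action being called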
const { data: { text } } = await scheduler.addJob('recognize', image, options);
const { data } = await scheduler.addJob('detect', image);
(4) getQueueLen
Returns the length of the job queue.
var queueLen = scheduler.getQueueLen();
(5) getNumWorkers
Returns the number of workers.
(6) terminate
Terminates all workers that were added to the scheduler.
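The scheduler calls above can be combined into one routine that spreads several images across a pool of workers. This is a sketch under assumptions: recognizeAll is an invented helper, the tesseract parameter stands for the Tesseract namespace object (passed in rather than imported), and the pool size of 2 is arbitrary.

```javascript
// Hypothetical helper: recognize several images in parallel with a worker pool.
// `tesseract` is expected to provide createScheduler() and createWorker(lang).
async function recognizeAll(tesseract, images, numWorkers = 2, lang = 'eng') {
  const scheduler = tesseract.createScheduler();
  try {
    for (let i = 0; i < numWorkers; i++) {
      scheduler.addWorker(await tesseract.createWorker(lang));
    }
    // Queue every image; each job is picked up by whichever worker is idle.
    const results = await Promise.all(
      images.map((image) => scheduler.addJob('recognize', image)),
    );
    return results.map((result) => result.data.text);
  } finally {
    // Shut down the workers added to the scheduler.
    await scheduler.terminate();
  }
}

// Intended usage (not executed here):
// const texts = await recognizeAll(Tesseract, ['a.png', 'b.png', 'c.png']);
```

Promise.all keeps the returned texts in the same order as the input images, even if the jobs finish in a different order.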
5. setLogging
Enables detailed log output.
Arguments:
- logging: boolean to define whether to see detailed logs, default: false
Examples:
const { setLogging } = Tesseract;
setLogging(true);
6. Example: recognizing text in a region selected on an image
<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="UTF-8">
<title>Rectangle drawing tool</title>
<style type="text/css">
.box {
background: #f00;
width: 0px;
height: 0px;
position: absolute;
opacity: 0.5;
cursor: move;
}
.droptarget {
float: left;
width: 100px;
height: 1000px;
margin: 15px;
padding: 10px;
border: 1px solid #aaaaaa;
}
</style>
</head>
<body>
<!-- Selection-box code adapted from https://blog.csdn.net/kuronekonano/article/details/98498613 -->
<script>
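// Show the coordinates while the pointer moves inside the image area
function cnvs_getCoordinates(e) {
var x = e.pageX - e.target.offsetLeft; // Use pageX (document coordinates), not clientX (viewport coordinates), which changes with scrolling
var y = e.pageY - e.target.offsetTop;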
document.getElementById("xycoordinates").innerHTML = "Coordinates: (" + x + "," + y + ")";
}
// Reset the displayed coordinates once the pointer leaves the area
function cnvs_clearCoordinates() {
document.getElementById("xycoordinates").innerHTML = "Coordinates: (0,0)";
}
</script>
<div id="coordiv" style="" onmousemove="cnvs_getCoordinates(event)" onmouseout="cnvs_clearCoordinates()">
<img src="../static/idcard.jpg" ondragstart="return false;">
</div>
<br>
<input type="text" name="coor" placeholder="Coordinates" readonly="true" id="x_y" style="width: 300px"><br>
<button id="ocr_btn">Recognize text</button>
<button id="reset">Reset</button>
<div id="xycoordinates"></div>
<!--https://github.com/naptha/tesseract.js#tesseractjs-->
<!--<script src='https://cdn.jsdelivr.net/npm/tesseract.js@5/dist/tesseract.min.js'></script>-->
<script src='../js/tesseract.min.js'></script>
<script>
function dragStart(event) {
event.dataTransfer.setData("Text", event.target.id);
}
function allowDrop(event) {
event.preventDefault();
}
function drop(event) {
event.preventDefault();
var data = event.dataTransfer.getData("Text");
event.target.appendChild(document.getElementById(data));
}
window.onload = function (e) {
e = e || window.event;
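// startX, startY: pointer coordinates at mousedown
// diffX, diffY: offset between the pointer and the box's top-left corner, used for dragging
var startX, startY, diffX, diffY;
// Whether a box is being dragged; initially false
var dragging = false;
// Start and end points of the selection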
var begin_x = 0;
var begin_y = 0;
var end_x = 0;
var end_y = 0;
var coordiv = document.getElementById('coordiv');
// Mouse down
document.onmousedown = function (e) {
startX = e.pageX;
startY = e.pageY;
// Only handle presses inside the image container; the coordinate check rejects presses outside it
if (startY <= coordiv.offsetTop + coordiv.offsetHeight && startY >= coordiv.offsetTop && startX >= coordiv.offsetLeft && startX <= coordiv.offsetLeft + coordiv.offsetWidth) {
// Allow only one selection box at a time
reset();
if (e.target.className.match(/box/)) {
// Allow dragging
dragging = true;
// Mark the current box with the id moving_box
if (document.getElementById("moving_box") !== null) {
document.getElementById("moving_box").removeAttribute("id");
}
e.target.id = "moving_box";
// Compute the offset between the pointer and the box's top-left corner
diffX = startX - e.target.offsetLeft;
diffY = startY - e.target.offsetTop;
} else {
// Create a new box on the page
var active_box = document.createElement("div");
active_box.id = "active_box";
active_box.className = "box";
active_box.style.top = startY + 'px';
active_box.style.left = startX + 'px';
active_box.setAttribute("ondrop", "drop(event)");
active_box.setAttribute("ondragover", "allowDrop(event)");
document.body.appendChild(active_box);
active_box = null;
}
}
};
// Mouse move
document.onmousemove = function (e) {
if (e.pageY <= coordiv.offsetTop + coordiv.offsetHeight && e.pageY >= coordiv.offsetTop && e.pageX >= coordiv.offsetLeft && e.pageX <= coordiv.offsetLeft + coordiv.offsetWidth) {
// Update the box size
var ab = document.getElementById("active_box");
// If an active_box exists in the document, resize it
if (document.getElementById("active_box") !== null) {
ab.style.width = e.pageX - startX + 'px';
ab.style.height = e.pageY - startY + 'px';
}
// Move: update the box position
if (document.getElementById("moving_box") !== null && dragging) {
var mb = document.getElementById("moving_box");
var xy_div = document.getElementById(mb.style.left.substring(0, mb.style.left.length - 2) + mb.style.top.substring(0, mb.style.top.length - 2));
var tmx = e.pageX - diffX;
var tmy = e.pageY - diffY;
if (tmx + mb.offsetWidth <= coordiv.offsetLeft + coordiv.offsetWidth && tmx >= coordiv.offsetLeft && tmy + mb.offsetHeight <= coordiv.offsetTop + coordiv.offsetHeight && tmy >= coordiv.offsetTop) {
mb.style.top = e.pageY - diffY + 'px';
mb.style.left = e.pageX - diffX + 'px';
if (xy_div !== null) {
var new_x = mb.style.left.substring(0, mb.style.left.length - 2);
var new_y = mb.style.top.substring(0, mb.style.top.length - 2);
xy_div.id = new_x + new_y;
new_x -= coordiv.offsetLeft;
new_y -= coordiv.offsetTop;
var new_r = parseInt(mb.style.width.substring(0, mb.style.width.length - 2)) + parseInt(new_x) - coordiv.offsetLeft;
var new_b = parseInt(mb.style.height.substring(0, mb.style.height.length - 2)) + parseInt(new_y) - coordiv.offsetTop;//"[ left: "+ new_x +", top: "+ new_y + ", right: " + new_r +" , bottom: "+ new_b +" ]";
xy_div.innerText = new_x + "," + new_y + "," + new_r + "," + new_b;
var input_div = document.getElementById("x_y")
input_div.value = xy_div.innerHTML
}
}
}
}
}
// Mouse up
document.onmouseup = function (e) {
// Stop dragging
dragging = false;
if (document.getElementById("active_box") !== null) {
var ab = document.getElementById("active_box");
ab.removeAttribute("id");
// Remove the box if its width or height is less than 10px
if (ab.offsetWidth < 10 || ab.offsetHeight < 10) {
document.body.removeChild(ab);
}
if (ab.offsetWidth >= 10 && ab.offsetHeight >= 10) {
var xy_div = document.createElement("div");
xy_div.id = startX.toString() + startY.toString();
xy_div.className = 'show_point';
document.body.appendChild(xy_div);
begin_x = startX - coordiv.offsetLeft;
begin_y = startY - coordiv.offsetTop;
end_x = e.pageX - coordiv.offsetLeft;
end_y = e.pageY - coordiv.offsetTop;
xy_div.innerHTML = begin_x + "," + begin_y + "," + end_x + "," + end_y;
var input_div = document.getElementById("x_y")
input_div.value = xy_div.innerHTML
}
}
};
// Double click
document.ondblclick = function (e) {
if (e.target.className.match(/box/)) {
if (document.getElementById("del_box") !== null) {
document.getElementById("del_box").removeAttribute("id");
}
e.target.id = "del_box";
var ab = document.getElementById("del_box");
var xy_div = document.getElementById(ab.style.left.substring(0, ab.style.left.length - 2) + ab.style.top.substring(0, ab.style.top.length - 2))
if (xy_div !== null) {
xy_div.removeAttribute("id");
document.body.removeChild(xy_div);
}
document.body.removeChild(ab);
}
}
/**
* Submit the selected coordinates for OCR
*/
document.getElementById('ocr_btn').addEventListener('click', async function (e) {
console.log('begin_x=', begin_x)
console.log('begin_y=', begin_y)
console.log('end_x=', end_x)
console.log('end_y=', end_y)
let image = '../static/idcard.jpg'
const worker = await Tesseract.createWorker('eng');
const { data: { text } } = await worker.recognize(image, {
rectangle: { top: begin_y, left: begin_x, width: end_x - begin_x, height: end_y - begin_y },
});
await worker.terminate();
alert('Recognition result: ' + text);
console.log('Recognition result: ' , text);
})
/**
* Reset the selection
*/
document.getElementById('reset').addEventListener('click', reset)
/**
* Remove all selection boxes
*/
function reset() {
var boxArr = document.getElementsByClassName('box');
while (boxArr.length > 0){
document.body.removeChild(boxArr.item(0));
}
var pointArr = document.getElementsByClassName('show_point');
while (pointArr.length > 0){
document.body.removeChild(pointArr.item(0));
}
}
};
</script>
</body>
</html>
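The page above measures the selection in page pixels relative to the displayed image. If the image is shown at a size other than its natural size, those values probably need rescaling before they are used as the rectangle. The sketch below assumes the rectangle is interpreted in the source image's own pixel grid; scaleRectangle is a hypothetical helper, and in a browser the two sizes could be read from the img element's clientWidth/clientHeight and naturalWidth/naturalHeight.

```javascript
// Hypothetical helper: convert a rectangle measured on the displayed image
// into the pixel grid of the original image.
function scaleRectangle(rect, displayed, natural) {
  const sx = natural.width / displayed.width;
  const sy = natural.height / displayed.height;
  return {
    left: Math.round(rect.left * sx),
    top: Math.round(rect.top * sy),
    width: Math.round(rect.width * sx),
    height: Math.round(rect.height * sy),
  };
}

// An 800x600 image displayed at 400x300: every coordinate doubles.
console.log(scaleRectangle(
  { left: 10, top: 20, width: 100, height: 50 },
  { width: 400, height: 300 },
  { width: 800, height: 600 },
)); // { left: 20, top: 40, width: 200, height: 100 }
```

When the image is displayed at its natural size, as in the page above, both scale factors are 1 and the rectangle is returned unchanged.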
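II. tess4j
1. Introduction
Tess4J is a Java API wrapper for Tesseract OCR. It lets you call Tesseract from Java to recognize images and extract their text, which is exactly what OCR does.
Image formats that Tess4J can recognize:
- TIFF, JPEG, GIF, PNG, and BMP images
- multi-page TIFF images
- PDF documents
Tesseract has three separate language-model repositories on GitHub: tessdata, tessdata-best, and tessdata-fast. Each stores language models, and they differ as follows: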
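Repository | How it was trained | Speed | Accuracy | Supports legacy engine | Supports retraining
---|---|---|---|---|---
tessdata | Legacy + LSTM (integerized from tessdata-best) | Medium | Medium | Yes | No
tessdata-best | LSTM only (based on langdata) | Slowest | Most accurate | No | Yes
tessdata-fast | Integerized LSTM of a smaller network than tessdata-best | Fastest | Least accurate | No | No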
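2. Recognizing text
import java.awt.image.BufferedImage;
import java.io.File;
import javax.imageio.ImageIO;
import net.sourceforge.tess4j.Tesseract;

BufferedImage bufferedImage = ImageIO.read(new File("D:\\static\\idcard.jpg"));
// Crop the original image: the top-left corner is 100px from the left edge and 200px from the top, with a width of 300px and a height of 400px
BufferedImage subimage = bufferedImage.getSubimage(100, 200, 300, 400);
// Save the cropped image as a temporary file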
File tempFile = File.createTempFile("subImage", ".jpg");
ImageIO.write(subimage, "jpg", tempFile);
Tesseract tesseract = new Tesseract();
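// Set the path to the trained data files; placing them under resources is not recommended
tesseract.setDatapath("D:\\data\\traineddata");
// Set the recognition language to Simplified Chinese (use "eng" for English)
tesseract.setLanguage("chi_sim");
// Use automatic page segmentation with OSD
//tesseract.setPageSegMode(1);
// Use the neural-network LSTM engine
//tesseract.setOcrEngineMode(1);
// Recognize the text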
String result = tesseract.doOCR(subimage);
System.out.println(result);
tempFile.delete();
III. Custom character set (training)
Download jTessBoxEditor:
https://sourceforge.net/projects/vietocr/files/jTessBoxEditor/jTessBoxEditor-2.6.0.zip/download
Unzip jTessBoxEditor-2.6.0.zip.
To launch it, double-click jTessBoxEditor.jar or run train.bat.