Ghosting when translating scanned documents / feat (main): supports ocr on scanned document #19

jackiehejian · 2024-11-07T03:32:16Z

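When every page of a PDF is an image rather than selectable (copyable) text, translation fails completely; see the screenshot for details.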
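A minimal sketch of how such image-only PDFs might be recognized up front, assuming the text of each page has already been extracted into a list of strings. The function name `looks_scanned` and both thresholds are hypothetical and chosen only for illustration; this is not how pdf2zh detects scans.

```python
def looks_scanned(page_texts, min_chars_per_page=20, min_text_ratio=0.1):
    """Return True when almost no page carries an extractable text layer.

    page_texts: one string per page, as returned by a text extractor.
    min_chars_per_page and min_text_ratio are illustrative thresholds.
    """
    if not page_texts:
        return False
    pages_with_text = sum(
        1 for text in page_texts if len(text.strip()) >= min_chars_per_page
    )
    return pages_with_text / len(page_texts) < min_text_ratio


# Hypothetical sample data: an extractor that found only whitespace on every
# page, versus one that found ordinary body text.
scanned_pages = ["", "  \n", ""]
digital_pages = ["Abstract. We study layout analysis. " * 3, "1 Introduction " * 5]

print(looks_scanned(scanned_pages))
print(looks_scanned(digital_pages))
```

The per-page strings could come from any extractor, for example PyMuPDF's `page.get_text()` called on each page. A scan that already carries a hidden OCR text layer would not be caught by this check.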

Byaidu · 2024-11-07T03:59:48Z

Image-based PDF documents can't be translated yet. For now the focus is still on improving translation quality for e-books and papers.

jackiehejian · 2024-11-07T05:30:21Z

Understood, thank you very much.

fireinrain · 2024-11-08T02:11:05Z

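All-image PDFs are a tough ask: OCR quality determines the quality of the extracted text, which in turn determines the quality of the translation.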

xxsunyxx · 2024-11-19T03:24:03Z

Add PaddleOCR as an optional step.

xxnuo · 2024-11-20T15:41:20Z

sayura
This model is very accurate, but it needs more compute than Paddle OCR.

Byaidu · 2024-11-20T15:42:33Z

How does it compare with minerU/marker?

xxnuo · 2024-11-21T08:33:36Z

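sayura is the open-source multilingual and table OCR model made by the author of marker 😂
I haven't tested minerU. I only tested the PaddleOCR high-accuracy model; Sayura performs much better than it, and its multilingual support works well.
Judging from minerU's issues, its multilingual support seems to be poor.
The downside is that Sayura needs quite a lot of GPU memory, which is a headache, and I don't really know how to quantize models.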

xxnuo · 2024-12-02T01:57:30Z

How is the OCR work going? I think building it on paddleocr would work well. If someone is already working on it, I won't reinvent the wheel. @reycn @Byaidu

Byaidu · 2024-12-02T02:24:49Z

Nothing has been done yet…

If you get it working, contributions are welcome.

hellofinch · 2024-12-06T07:35:21Z

from typing import BinaryIO

import numpy as np
import tqdm
from paddleocr import PaddleOCR
from pdfminer.pdfdocument import PDFDocument
from pdfminer.pdfinterp import PDFResourceManager
from pdfminer.pdfpage import PDFPage
from pdfminer.pdfparser import PDFParser
from pymupdf import Document, Font

from pdf2zh.converter import TranslateConverter
from pdf2zh.pdfinterp import PDFPageInterpreterEx

def extract_text_to_fp(
    inf: BinaryIO,
    pages=None,
    password: str = "",
    debug: bool = False,
    page_count: int = 0,
    vfont: str = "",
    vchar: str = "",
    thread: int = 0,
    doc_en: Document = None,
    model=None,
    lang_in: str = "",
    lang_out: str = "",
    service: str = "",
    resfont: str = "",
    noto: Font = None,
    callback: object = None,
    **kwarg,
) -> tuple:  # returns (obj_patch, result_text)
    ocr = PaddleOCR(use_angle_cls=True, lang="en")
    rsrcmgr = PDFResourceManager()
    layout = {}
    device = TranslateConverter(
        rsrcmgr, vfont, vchar, thread, layout, lang_in, lang_out, service, resfont, noto
    )

    assert device is not None
    obj_patch = {}
    interpreter = PDFPageInterpreterEx(rsrcmgr, device, obj_patch)
    if pages:
        total_pages = len(pages)
    else:
        total_pages = page_count

    parser = PDFParser(inf)
    doc = PDFDocument(parser, password=password)
    result_text = []  # OCR text of every "plain text" block, across all pages
    with tqdm.tqdm(
        enumerate(PDFPage.create_pages(doc)),
        total=total_pages,
    ) as progress:
        for pageno, page in progress:
            if pages and (pageno not in pages):
                continue
            if callback:
                callback(progress)
            page.pageno = pageno
            pix = doc_en[page.pageno].get_pixmap()
            # np.fromstring is deprecated for binary data; use np.frombuffer.
            image = np.frombuffer(pix.samples, np.uint8).reshape(
                pix.height, pix.width, 3
            )[:, :, ::-1]
            page_layout = model.predict(image, imgsz=int(pix.height / 32) * 32)[0]
            # A k-d tree is out of the question; rasterize the layout into a
            # page-sized mask instead, trading memory for time.
            box = np.ones((pix.height, pix.width))
            h, w = box.shape
            vcls = ["abandon", "figure", "table", "isolate_formula", "formula_caption"]
            for i, d in enumerate(page_layout.boxes):
                if page_layout.names[int(d.cls)] not in vcls:
                    rx0, ry0, rx1, ry1 = d.xyxy.squeeze()
                    # The mask uses a flipped y axis (h - y), hence the swap.
                    x0, y0, x1, y1 = (
                        np.clip(int(rx0 - 1), 0, w - 1),
                        np.clip(int(h - ry1 - 1), 0, h - 1),
                        np.clip(int(rx1 + 1), 0, w - 1),
                        np.clip(int(h - ry0 + 1), 0, h - 1),
                    )
                    box[y0:y1, x0:x1] = i + 2
                    if page_layout.names[int(d.cls)] == "plain text":
                        # Crop with the unflipped detector box: the rendered
                        # image has its origin at the top-left corner.
                        crop = np.ascontiguousarray(
                            image[
                                max(int(ry0), 0) : max(int(ry1), 0),
                                max(int(rx0), 0) : max(int(rx1), 0),
                            ]
                        )
                        if crop.size == 0:
                            continue
                        result = ocr.ocr(crop, cls=False)
                        text = ""
                        for res in result:
                            if res is None:  # no text detected in this crop
                                continue
                            for line in res:
                                text += line[1][0]  # line = [points, (text, score)]
                        result_text.append(text)
            for i, d in enumerate(page_layout.boxes):
                if page_layout.names[int(d.cls)] in vcls:
                    x0, y0, x1, y1 = d.xyxy.squeeze()
                    x0, y0, x1, y1 = (
                        np.clip(int(x0 - 1), 0, w - 1),
                        np.clip(int(h - y1 - 1), 0, h - 1),
                        np.clip(int(x1 + 1), 0, w - 1),
                        np.clip(int(h - y0 + 1), 0, h - 1),
                    )
                    box[y0:y1, x0:x1] = 0
            layout[page.pageno] = box
            # Create a new xref to hold the new instruction stream
            page.page_xref = doc_en.get_new_xref()  # hack: new xref inserted into the page
            doc_en.update_object(page.page_xref, "<<>>")
            doc_en.update_stream(page.page_xref, b"")
            doc_en[page.pageno].set_contents(page.page_xref)
            interpreter.process_page(page)

    device.close()
    return obj_patch, result_text

This only covers the OCR part. I really can't figure out how to pass the OCR results on to the later stages.
:(
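The snippet above writes its layout mask with a flipped y axis (`h - y1`, `h - y0`), while a rendered page image has its origin at the top-left corner. A small sketch of keeping the two coordinate systems apart, assuming the layout detector reports boxes in image pixels with y growing downward; the helper names `to_mask_box` and `to_image_crop` are hypothetical.

```python
def clamp(value, low, high):
    """Truncate to int and clamp into [low, high]."""
    return max(low, min(int(value), high))


def to_mask_box(box, width, height, pad=1):
    """Detector box (y down) -> padded box in the flipped-y layout mask."""
    x0, y0, x1, y1 = box
    return (
        clamp(x0 - pad, 0, width - 1),
        clamp(height - y1 - pad, 0, height - 1),
        clamp(x1 + pad, 0, width - 1),
        clamp(height - y0 + pad, 0, height - 1),
    )


def to_image_crop(box, width, height):
    """Detector box (y down) -> crop bounds in the rendered image, unflipped."""
    x0, y0, x1, y1 = box
    return (
        clamp(x0, 0, width),
        clamp(y0, 0, height),
        clamp(x1, 0, width),
        clamp(y1, 0, height),
    )


# Hypothetical 200 x 300 page with one detected text block near the top.
detected = (10.0, 20.0, 110.0, 60.0)
print(to_mask_box(detected, 200, 300))    # (9, 239, 111, 281)
print(to_image_crop(detected, 200, 300))  # (10, 20, 110, 60)
```

A block near the top of the image lands near the high end of the flipped mask, so the mask coordinates cannot be reused to slice the image for OCR.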

xxnuo · 2024-12-15T13:43:33Z

https://github.com/jingsongliujing/OnnxOCR

xxnuo · 2024-12-20T10:09:36Z

https://huggingface.co/spaces/stepfun-ai/GOT_official_online_demo

gfhdhytghd · 2024-12-22T16:19:15Z

I'd suggest using Youdao for OCR translation.
You get what you pay for.

gfhdhytghd · 2024-12-22T16:19:38Z

Try integrating tesseract to implement OCR.

jj-a-li · 2025-01-11T21:05:11Z

In practice a very large share of PDFs are scans. If they can't be handled, the tool's range of use shrinks sharply.

su77ungr · 2025-02-15T00:44:41Z

https://huggingface.co/spaces/stepfun-ai/GOT_official_online_demo

hellofinch · 2025-02-17T01:07:09Z

Recognition itself isn't really the problem. The main issue is that the layout information is gone after recognition, and the matching layout information is needed.

1VeniVediVeci1 · 2025-02-19T03:18:24Z

sayura's OCR outputs a JSON file containing the bbox coordinates and the recognized text.
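A rough sketch of consuming that kind of output, assuming a JSON document with a `text_lines` list whose items carry `bbox` and `text`. The schema, the sample values, and the helper `load_lines` are assumptions made for illustration; the actual file layout of the tool may differ.

```python
import json

# Hypothetical OCR output: bbox is (x0, y0, x1, y1) in image pixels, y down.
SAMPLE = """
{"text_lines": [
  {"bbox": [40, 120, 300, 140], "text": "second line"},
  {"bbox": [40, 90, 300, 110], "text": "first line"},
  {"bbox": [320, 90, 560, 110], "text": "right column"}
]}
"""


def load_lines(raw):
    """Parse the assumed schema and sort lines top-to-bottom, then left-to-right."""
    data = json.loads(raw)
    lines = [(tuple(item["bbox"]), item["text"]) for item in data["text_lines"]]
    return sorted(lines, key=lambda line: (line[0][1], line[0][0]))


lines = load_lines(SAMPLE)
for bbox, text in lines:
    print(bbox, text)
```

The naive sort interleaves columns that share the same vertical position, so a multi-column page would still need the layout boxes to group lines before ordering them.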

hellofinch · 2025-02-19T06:15:17Z

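Layout information means more than position: it also covers things like font, italics, and bold. doclayout can already determine the boxes, and running OCR right after it recovers the content, but information such as line spacing and character spacing is lost.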
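Some of the lost metrics can be approximated from the OCR line boxes alone. A minimal sketch, assuming one bbox per recognized line within a single paragraph; `estimate_metrics` is a hypothetical helper, and font family, italics, and bold cannot be recovered this way.

```python
from statistics import median


def estimate_metrics(line_boxes):
    """Estimate (font_size, line_spacing_ratio) from OCR line boxes.

    line_boxes: (x0, y0, x1, y1) per line, in pixels, y growing downward.
    font_size is approximated by the median box height; the spacing ratio is
    the median distance between consecutive line tops divided by that height.
    """
    boxes = sorted(line_boxes, key=lambda b: b[1])
    font_size = median(b[3] - b[1] for b in boxes)
    pitches = [lower[1] - upper[1] for upper, lower in zip(boxes, boxes[1:])]
    line_pitch = median(pitches) if pitches else font_size
    return font_size, line_pitch / font_size


# Hypothetical paragraph: three lines, 12 px tall, 18 px apart.
paragraph = [(40, 100, 400, 112), (40, 118, 400, 130), (40, 136, 380, 148)]
print(estimate_metrics(paragraph))  # (12, 1.5)
```

Box height is only a proxy for font size, since it depends on which ascenders and descenders appear in the line, so the numbers are best treated as starting values for typesetting the translation.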

awwaawwa · 2025-02-19T06:17:07Z

I will explore related things recently, thanks for everyone's suggestions!

keonchennl · 2025-02-27T10:19:06Z

This is currently the only issue about handling scanned PDFs. I don't know what priority this feature has, so I'm subscribing for now.

awwaawwa · 2025-02-27T10:20:03Z

I'll take a look next month.

NullYing · 2025-03-25T07:35:35Z

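I've started following MinerU for now; its read accuracy on scanned documents (not phone photos) is quite high. Combining the two projects still looks somewhat difficult.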

NullYing · 2025-03-25T08:03:23Z

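A scanned document can be treated simply as an image, so this is really layout-preserving image translation. WeChat's implementation is a useful reference: long-press an image and tap Translate, and it is translated automatically.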

anbian123 · 2025-04-16T09:17:51Z

How well does mathTranslate translate scanned PDF files?

awwaawwa · 2025-04-16T11:08:13Z

It doesn't support them at all 😂

awwaawwa · 2025-04-19T13:19:49Z

BabelDOC 0.3.17 can add a white background beneath text regions, which gives partial support for OCR'd PDF documents.
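The idea of a white background under text regions can be illustrated on a toy raster. This is only a sketch of the general technique, not BabelDOC's implementation; the page is modelled as a list of grayscale rows and `paint_white` is a hypothetical helper.

```python
def paint_white(raster, boxes, white=255):
    """Fill each (x0, y0, x1, y1) box with white, in place.

    raster: list of rows of grayscale values (0 = black, 255 = white).
    Boxes are clipped to the raster; x1 and y1 are exclusive.
    """
    height = len(raster)
    width = len(raster[0]) if raster else 0
    for x0, y0, x1, y1 in boxes:
        for y in range(max(0, y0), min(height, y1)):
            for x in range(max(0, x0), min(width, x1)):
                raster[y][x] = white
    return raster


# Hypothetical 6 x 4 all-black "scan" with one text region to cover.
page = [[0] * 6 for _ in range(4)]
paint_white(page, [(1, 1, 4, 3)])
for row in page:
    print(row)
```

Once the scanned glyphs are covered, translated text drawn on top no longer overlaps the original, which is the ghosting this issue describes. On a real PDF the same effect could presumably be obtained by drawing filled rectangles, for example with PyMuPDF's `page.draw_rect(rect, color=None, fill=(1, 1, 1))`.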

one-word · 2025-04-23T08:49:00Z

mark

Jose-Maria-Martins · 2025-04-30T10:36:11Z

What about https://ocrmypdf.readthedocs.io/en/latest/. Couldn't it improve detection? And make it work for ocr pdfs?
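For reference, a hedged sketch of what pre-processing a scan with the OCRmyPDF command-line tool could look like from Python. The wrapper functions and file names are hypothetical; `-l` and `--skip-text` are OCRmyPDF options, and whether the resulting file then translates cleanly is not established in this thread.

```python
import shlex
import subprocess


def build_ocrmypdf_command(src, dst, languages=("eng",), skip_text=True):
    """Build an ocrmypdf invocation that adds a text layer to a scanned PDF."""
    cmd = ["ocrmypdf", "-l", "+".join(languages)]
    if skip_text:
        # Leave pages that already contain text untouched.
        cmd.append("--skip-text")
    cmd += [src, dst]
    return cmd


def run_ocrmypdf(src, dst, **options):
    """Run the command; requires ocrmypdf and its language data to be installed."""
    subprocess.run(build_ocrmypdf_command(src, dst, **options), check=True)


cmd = build_ocrmypdf_command("scan.pdf", "scan_ocr.pdf", languages=("eng", "chi_sim"))
print(shlex.join(cmd))
```

Because OCRmyPDF keeps the page image and places the recognized text invisibly behind it, the original glyphs would still be visible after translation unless they are covered or removed.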

awwaawwa · 2025-04-30T14:08:52Z

#860 thanks

awwaawwa · 2025-05-06T09:26:38Z

If you run into this problem, please try the 2.0 preview (#586) and enable OCR Workaround under the advanced options when translating.

Byaidu added the enhancement New feature or request label Nov 7, 2024

Byaidu mentioned this issue Nov 19, 2024

Progress bar finished but nothing was translated #64

Closed

reycn changed the title ~~Cannot translate when every page of the PDF is an image~~ feat (main): supports ocr on scanned document Nov 21, 2024

reycn added the help wanted Extra attention is needed label Nov 21, 2024

Byaidu mentioned this issue Nov 28, 2024

Problem with scanned PDFs #140

Closed

hellofinch mentioned this issue Dec 9, 2024

Translation does not work properly #185

Closed

Byaidu mentioned this issue Dec 11, 2024

Still English after translation, and stacked on top of the original text #212

Closed

hellofinch mentioned this issue Dec 13, 2024

Heavy overlap in the translated text #62

Closed

Byaidu changed the title ~~feat (main): supports ocr on scanned document~~ Ghosting when translating scanned documents / feat (main): supports ocr on scanned document Dec 13, 2024

Byaidu pinned this issue Dec 13, 2024

Byaidu mentioned this issue Dec 15, 2024

Translated PDF text overlays the original (high-quality scan) #235

Closed

This was referenced Dec 16, 2024

Will translating "image-type PDFs" be supported later? #239

Closed

Cannot translate text inside images in a PDF #269

Closed

Byaidu mentioned this issue Dec 18, 2024

Non-standard PDFs cause translation to fail #280

Closed

This was referenced Dec 18, 2024

Detect scanned documents & emit a warning #264

Closed

Output is still English after translation #296

Closed

This was referenced Jan 9, 2025

Where OCR? #439

Closed

Original document content overlaps the translation; is there an option to hide the original content? #446

Closed

This was referenced Jan 13, 2025

Issues with translated files #454

Closed

English text not removed #473

Closed

awwaawwa marked this as a duplicate of #594 Feb 10, 2025

awwaawwa marked this as a duplicate of #742 Mar 10, 2025

awwaawwa marked this as a duplicate of #757 Mar 13, 2025

awwaawwa marked this as a duplicate of #773 Mar 16, 2025

awwaawwa marked this as a duplicate of #801 Mar 25, 2025

hellofinch mentioned this issue Mar 26, 2025

Fix ghosting in PDF translation #803

Closed

awwaawwa marked this as a duplicate of #803 Mar 26, 2025

awwaawwa marked this as a duplicate of #807 Mar 26, 2025

awwaawwa marked this as a duplicate of #845 Apr 16, 2025
