Merge branch 'main' of https://github.com/Byaidu/PDFMathTranslate

Byaidu, 1 year ago
Commit 3e09e6f7ff

9 files changed, 276 insertions(+), 41 deletions(-)
1. .gitignore (+2, -0)
2. Dockerfile.Demo (+6, -4)
3. README.md (+12, -6)
4. README_zh-CN.md (+11, -6)
5. pdf2zh/doclayout.py (+213, -0)
6. pdf2zh/high_level.py (+2, -7)
7. pdf2zh/pdf2zh.py (+9, -15)
8. pdf2zh/utils.py (+13, -0)
9. pyproject.toml (+8, -3)

.gitignore (+2, -0)

@@ -1,3 +1,5 @@
+pdf2zh_files
+gui/pdf2zh_files
 gradio_files
 tmp
 gui/gradio_files

Dockerfile.Demo (+6, -4)

@@ -2,12 +2,14 @@ FROM python:3.12
 
 WORKDIR /app
 
+COPY . .
+
 ENV PYTHONUNBUFFERED=1
 
-RUN apt-get update && apt-get install -y libgl1 \
-    && rm -rf /var/lib/apt/lists/*
+RUN apt-get update && apt-get install -y libgl1
+
+RUN pip install .
 
-RUN pip install pdf2zh
 RUN mkdir -p /data
 RUN chmod 777 /data
 RUN mkdir -p /app
@@ -17,4 +19,4 @@ RUN chmod 777 /.cache
 RUN mkdir -p ./gradio_files
 RUN chmod 777 ./gradio_files
 
-CMD ["pdf2zh", "-i"]
+CMD ["pdf2zh", "-i"]

README.md (+12, -6)

@@ -37,15 +37,15 @@ Feel free to provide feedback in [GitHub Issues](https://github.com/Byaidu/PDFMa
 
 <h2 id="updates">Updates</h2>
 
+- [Nov. 23 2024] [ONNX](https://github.com/onnx/onnx) support to reduce dependency sizes *(by [@Wybxc](https://github.com/Wybxc))*  
+- [Nov. 23 2024] 🌟 [Public Service](#demo)  online! *(by [@Byaidu](https://github.com/Byaidu))*  
+- [Nov. 23 2024] Non-PDF/A documents are now supported *(by [@reycn](https://github.com/reycn))*  
 - [Nov. 23 2024] Firewall for preventing web bots *(by [@Byaidu](https://github.com/Byaidu))*  
 - [Nov. 22 2024] GUI now supports Italian, and has been improved *(by [@Byaidu](https://github.com/Byaidu), [@reycn](https://github.com/reycn))*  
 - [Nov. 22 2024] You can now share your deployed service to others *(by [@Zxis233](https://github.com/Zxis233))*  
-- [Nov. 22 2024] Now supportsTencent Translation *(by [@hellofinch](https://github.com/hellofinch))*  
+- [Nov. 22 2024] Now supports Tencent Translation *(by [@hellofinch](https://github.com/hellofinch))*  
 - [Nov. 21 2024] GUI now supports downloading dual-document *(by [@reycn](https://github.com/reycn))*  
-- [Nov. 20 2024] GUI now supports specifying Ollama and OpenAI models *(by [@IuvenisSapiens](https://github.com/IuvenisSapiens), [@Byaidu](https://github.com/Byaidu))*  
 - [Nov. 20 2024] 🌟 [Demo](#demo)  online! *(by [@reycn](https://github.com/reycn))*  
-- [Nov. 20 2024] Supports [Docker](#docker) *(by [@Byaidu](https://github.com/Byaidu))*  
-- [Nov. 20 2024] Supports [multiple-threads translation](#threads) *(by [@Byaidu](https://github.com/Byaidu))*  
 
 <h2 id="preview">Preview</h2>
 
@@ -53,9 +53,15 @@ Feel free to provide feedback in [GitHub Issues](https://github.com/Byaidu/PDFMa
 <img src="./docs/images/preview.gif" width="80%"/>
 </div>
 
-<h2 id="demo">Demo 🌟</h2>
+<h2 id="demo">Public Service 🌟</h2>
 
-You can try [our demo on HuggingFace](https://huggingface.co/spaces/reycn/PDFMathTranslate-Docker) without installation.  
+### Free Service (<https://pdf2zh.com/>)
+
+You can try our [public service](https://pdf2zh.com/) online without installation.  
+
+### Hugging Face Demo
+
+You can try [our demo on HuggingFace](https://huggingface.co/spaces/reycn/PDFMathTranslate-Docker) without installation.
 Note that the computing resources of the demo are limited, so please avoid abusing them.
 
 <h2 id="install">Installation and Usage</h2>

README_zh-CN.md (+11, -6)

@@ -37,16 +37,15 @@
 
 <h2 id="updates">近期更新</h2>
 
-
+- [Nov. 24 2024] 为降低依赖大小,提供 [ONNX](https://github.com/onnx/onnx) 支持 *(by [@Wybxc](https://github.com/Wybxc))*  
+- [Nov. 23 2024] 🌟 [免费公共服务](#demo) 上线! *(by [@Byaidu](https://github.com/Byaidu))*  
+- [Nov. 23 2024] 非 PDF/A 文档也能正常翻译了 *(by [@reycn](https://github.com/reycn))*  
 - [Nov. 23 2024] 防止网页爬虫的防火墙 *(by [@Byaidu](https://github.com/Byaidu))*  
 - [Nov. 22 2024] 图形用户界面现已支持意大利语,并获得了一些更新 *(by [@Byaidu](https://github.com/Byaidu), [@reycn](https://github.com/reycn))*  
 - [Nov. 22 2024] 现在你可以将自己部署的服务分享给朋友了 *(by [@Zxis233](https://github.com/Zxis233))*  
-- [Nov. 22 2024] Now supportsTencent Translation *(by [@hellofinch](https://github.com/hellofinch))*  
+- [Nov. 22 2024] 支持腾讯翻译 *(by [@hellofinch](https://github.com/hellofinch))*  
 - [Nov. 21 2024] 图形用户界面现在支持下载双语文档 *(by [@reycn](https://github.com/reycn))*  
-- [Nov. 20 2024] 图形用户界面现在支持指定 Ollama 和 OpenAI 的模型 *(by [@IuvenisSapiens](https://github.com/IuvenisSapiens), [@Byaidu](https://github.com/Byaidu))*  
 - [Nov. 20 2024] 🌟 提供了 [在线演示](#demo)! *(by [@reycn](https://github.com/reycn))*  
-- [Nov. 20 2024] 支持 [容器化部署](#docker) *(by [@Byaidu](https://github.com/Byaidu))*  
-- [Nov. 20 2024] 支持速度更快的 [多线程翻译](#threads) *(by [@Byaidu](https://github.com/Byaidu))* 
 
 <h2 id="preview">效果预览</h2>
 
@@ -56,7 +55,13 @@
 
 <h2 id="demo">在线演示 🌟</h2>
 
-你可以立即尝试 [在 HuggingFace 上的在线演示](https://huggingface.co/spaces/reycn/PDFMathTranslate-Docker) 而无需安装.  
+### 免费服务 (<https://pdf2zh.com/>)
+
+你可以立即尝试 [免费公共服务](https://pdf2zh.com/) 而无需安装。
+
+### Hugging Face 在线演示
+
+你可以立即尝试 [在 HuggingFace 上的在线演示](https://huggingface.co/spaces/reycn/PDFMathTranslate-Docker) 而无需安装。
 请注意,演示的计算资源有限,因此请避免滥用。
 
 <h2 id="install">安装和使用</h2>

pdf2zh/doclayout.py (+213, -0)

@@ -0,0 +1,213 @@
+import abc
+import cv2
+import numpy as np
+import contextlib
+from huggingface_hub import hf_hub_download
+
+
+class DocLayoutModel(abc.ABC):
+    @staticmethod
+    def load_torch():
+        model = TorchModel.from_pretrained(
+            repo_id="juliozhao/DocLayout-YOLO-DocStructBench",
+            filename="doclayout_yolo_docstructbench_imgsz1024.pt",
+        )
+        return model
+
+    @staticmethod
+    def load_onnx():
+        model = OnnxModel.from_pretrained(
+            repo_id="wybxc/DocLayout-YOLO-DocStructBench-onnx",
+            filename="doclayout_yolo_docstructbench_imgsz1024.onnx",
+        )
+        return model
+
+    @staticmethod
+    def load_available():
+        with contextlib.suppress(ImportError):
+            return DocLayoutModel.load_torch()
+
+        with contextlib.suppress(ImportError):
+            return DocLayoutModel.load_onnx()
+
+        raise ImportError(
+            "Please install the `torch` or `onnx` feature to use the DocLayout model."
+        )
+
+    @property
+    @abc.abstractmethod
+    def stride(self) -> int:
+        """Stride of the model input."""
+        pass
+
+    @abc.abstractmethod
+    def predict(self, image, imgsz=1024, **kwargs) -> list:
+        """
+        Predict the layout of a document page.
+
+        Args:
+            image: The image of the document page.
+            imgsz: Resize the image to this size. Must be a multiple of the stride.
+            **kwargs: Additional arguments.
+        """
+        pass
+
+
+class TorchModel(DocLayoutModel):
+    def __init__(self, model_path: str):
+        try:
+            import doclayout_yolo
+        except ImportError:
+            raise ImportError(
+                "Please install the `torch` feature to use the Torch model."
+            )
+
+        self.model_path = model_path
+        self.model = doclayout_yolo.YOLOv10(model_path)
+
+    @staticmethod
+    def from_pretrained(repo_id: str, filename: str):
+        pth = hf_hub_download(repo_id=repo_id, filename=filename)
+        return TorchModel(pth)
+
+    @property
+    def stride(self):
+        return 32
+
+    def predict(self, *args, **kwargs):
+        return self.model.predict(*args, **kwargs)
+
+
+class YoloResult:
+    """Helper class to store detection results from ONNX model."""
+
+    def __init__(self, boxes, names):
+        self.boxes = [YoloBox(data=d) for d in boxes]
+        self.boxes.sort(key=lambda x: x.conf, reverse=True)
+        self.names = names
+
+
+class YoloBox:
+    """Helper class to store detection results from ONNX model."""
+
+    def __init__(self, data):
+        self.xyxy = data[:4]
+        self.conf = data[-2]
+        self.cls = data[-1]
+
+
+class OnnxModel(DocLayoutModel):
+    def __init__(self, model_path: str):
+        import ast
+
+        try:
+
+            import onnx
+            import onnxruntime
+        except ImportError:
+            raise ImportError(
+                "Please install the `onnx` feature to use the ONNX model."
+            )
+
+        self.model_path = model_path
+
+        model = onnx.load(model_path)
+        metadata = {d.key: d.value for d in model.metadata_props}
+        self._stride = ast.literal_eval(metadata["stride"])
+        self._names = ast.literal_eval(metadata["names"])
+
+        self.model = onnxruntime.InferenceSession(model.SerializeToString())
+
+    @staticmethod
+    def from_pretrained(repo_id: str, filename: str):
+        pth = hf_hub_download(repo_id=repo_id, filename=filename)
+        return OnnxModel(pth)
+
+    @property
+    def stride(self):
+        return self._stride
+
+    def resize_and_pad_image(self, image, new_shape):
+        """
+        Resize and pad the image to the specified size, ensuring dimensions are multiples of stride.
+
+        Parameters:
+        - image: Input image
+        - new_shape: Target size (integer or (height, width) tuple)
+        - stride: Padding alignment stride, default 32
+
+        Returns:
+        - Processed image
+        """
+        if isinstance(new_shape, int):
+            new_shape = (new_shape, new_shape)
+
+        h, w = image.shape[:2]
+        new_h, new_w = new_shape
+
+        # Calculate scaling ratio
+        r = min(new_h / h, new_w / w)
+        resized_h, resized_w = int(round(h * r)), int(round(w * r))
+
+        # Resize image
+        image = cv2.resize(
+            image, (resized_w, resized_h), interpolation=cv2.INTER_LINEAR
+        )
+
+        # Calculate padding size and align to stride multiple
+        pad_w = (new_w - resized_w) % self.stride
+        pad_h = (new_h - resized_h) % self.stride
+        top, bottom = pad_h // 2, pad_h - pad_h // 2
+        left, right = pad_w // 2, pad_w - pad_w // 2
+
+        # Add padding
+        image = cv2.copyMakeBorder(
+            image, top, bottom, left, right, cv2.BORDER_CONSTANT, value=(114, 114, 114)
+        )
+
+        return image
+
+    def scale_boxes(self, img1_shape, boxes, img0_shape):
+        """
+        Rescales bounding boxes (in the format of xyxy by default) from the shape of the image they were originally
+        specified in (img1_shape) to the shape of a different image (img0_shape).
+
+        Args:
+            img1_shape (tuple): The shape of the image that the bounding boxes are for,
+                in the format of (height, width).
+            boxes (torch.Tensor): the bounding boxes of the objects in the image, in the format of (x1, y1, x2, y2)
+            img0_shape (tuple): the shape of the target image, in the format of (height, width).
+
+        Returns:
+            boxes (torch.Tensor): The scaled bounding boxes, in the format of (x1, y1, x2, y2)
+        """
+
+        # Calculate scaling ratio
+        gain = min(img1_shape[0] / img0_shape[0], img1_shape[1] / img0_shape[1])
+
+        # Calculate padding size
+        pad_x = round((img1_shape[1] - img0_shape[1] * gain) / 2 - 0.1)
+        pad_y = round((img1_shape[0] - img0_shape[0] * gain) / 2 - 0.1)
+
+        # Remove padding and scale boxes
+        boxes[..., :4] = (boxes[..., :4] - [pad_x, pad_y, pad_x, pad_y]) / gain
+        return boxes
+
+    def predict(self, image, imgsz=1024, **kwargs):
+        # Preprocess input image
+        orig_h, orig_w = image.shape[:2]
+        pix = self.resize_and_pad_image(image, new_shape=imgsz)
+        pix = np.transpose(pix, (2, 0, 1))  # CHW
+        pix = np.expand_dims(pix, axis=0)  # BCHW
+        pix = pix.astype(np.float32) / 255.0  # Normalize to [0, 1]
+        new_h, new_w = pix.shape[2:]
+
+        # Run inference
+        preds = self.model.run(None, {"images": pix})[0]
+
+        # Postprocess predictions
+        preds = preds[preds[..., 4] > 0.25]
+        preds[..., :4] = self.scale_boxes(
+            (new_h, new_w), preds[..., :4], (orig_h, orig_w)
+        )
+        return [YoloResult(boxes=preds, names=self._names)]
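The rest of the commit wires this module into the CLI (see pdf2zh/pdf2zh.py and pdf2zh/high_level.py below). For reference, here is a minimal usage sketch of the new abstraction; it is not part of the commit, `example.pdf` is a hypothetical input file, and it assumes the layout weights can be fetched from the Hugging Face Hub with either the default ONNX dependencies or the optional torch extra installed.

```python
import numpy as np
import pymupdf

from pdf2zh.doclayout import DocLayoutModel

# Prefer the torch backend if its optional dependencies are importable,
# otherwise fall back to the ONNX model, as load_available() does above.
model = DocLayoutModel.load_available()

doc = pymupdf.open("example.pdf")  # hypothetical input file
pix = doc[0].get_pixmap()
# Same image preparation as in pdf2zh/high_level.py: RGB samples -> BGR array.
image = np.frombuffer(pix.samples, np.uint8).reshape(
    pix.height, pix.width, 3
)[:, :, ::-1]

# imgsz should be a multiple of the model stride (32 for the torch backend;
# the ONNX model reads its stride from the exported metadata).
result = model.predict(image, imgsz=int(pix.height / 32) * 32)[0]
for box in result.boxes:
    print(result.names[int(box.cls)], float(box.conf), box.xyxy)
```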

pdf2zh/high_level.py (+2, -7)

@@ -4,7 +4,6 @@ import logging
 import sys
 from io import StringIO
 from typing import Any, BinaryIO, Container, Iterator, Optional, cast
-import torch
 import numpy as np
 import tqdm
 from pymupdf import Document
@@ -22,7 +21,7 @@ from pdf2zh.pdfdevice import PDFDevice, TagExtractor
 from pdf2zh.pdfexceptions import PDFValueError
 from pdf2zh.pdfinterp import PDFPageInterpreter, PDFResourceManager
 from pdf2zh.pdfpage import PDFPage
-from pdf2zh.utils import AnyIO, FileOrName, open_filename
+from pdf2zh.utils import AnyIO, FileOrName, open_filename, get_device
 
 
 def extract_text_to_fp(
@@ -176,11 +175,7 @@
                 pix.height, pix.width, 3
             )[:, :, ::-1]
             page_layout = model.predict(
-                image,
-                imgsz=int(pix.height / 32) * 32,
-                device=(
-                    "cuda:0" if torch.cuda.is_available() else "cpu"
-                ),  # Auto-select GPU if available
+                image, imgsz=int(pix.height / 32) * 32, device=get_device()
             )[0]
             # kdtree 是不可能 kdtree 的,不如直接渲染成图片,用空间换时间
             box = np.ones((pix.height, pix.width))

pdf2zh/pdf2zh.py (+9, -15)

@@ -14,7 +14,6 @@ from pathlib import Path
 from typing import TYPE_CHECKING, Any, Container, Iterable, List, Optional
 
 import pymupdf
-from huggingface_hub import hf_hub_download
 
 from pdf2zh import __version__
 from pdf2zh.pdfexceptions import PDFValueError
@@ -27,10 +26,14 @@ OUTPUT_TYPES = ((".htm", "html"), (".html", "html"), (".xml", "xml"), (".tag", "
 
 
 def setup_log() -> None:
-    import doclayout_yolo
-
     logging.basicConfig()
-    doclayout_yolo.utils.LOGGER.setLevel(logging.WARNING)
+
+    try:
+        import doclayout_yolo
+
+        doclayout_yolo.utils.LOGGER.setLevel(logging.WARNING)
+    except ImportError:
+        pass
 
 
 def check_files(files: List[str]) -> List[str]:
@@ -73,8 +76,7 @@
     output: str = "",
     **kwargs: Any,
 ) -> AnyIO:
-    import doclayout_yolo
-
+    from pdf2zh.doclayout import DocLayoutModel
     import pdf2zh.high_level
 
     if not files:
@@ -86,15 +88,7 @@
                 output_type = alttype
 
     outfp: AnyIO = sys.stdout
-    # pth = os.path.join(tempfile.gettempdir(), 'doclayout_yolo_docstructbench_imgsz1024.pt')
-    # if not os.path.exists(pth):
-    #     print('Downloading...')
-    #     urllib.request.urlretrieve("http://huggingface.co/juliozhao/DocLayout-YOLO-DocStructBench/resolve/main/doclayout_yolo_docstructbench_imgsz1024.pt",pth)
-    pth = hf_hub_download(
-        repo_id="juliozhao/DocLayout-YOLO-DocStructBench",
-        filename="doclayout_yolo_docstructbench_imgsz1024.pt",
-    )
-    model = doclayout_yolo.YOLOv10(pth)
+    model = DocLayoutModel.load_available()
 
     for file in files:
         filename = os.path.splitext(os.path.basename(file))[0]

pdf2zh/utils.py (+13, -0)

@@ -819,3 +819,16 @@ def format_int_alpha(value: int) -> str:
 
     result.reverse()
     return "".join(result)
+
+
+def get_device():
+    """Get the device to use for computation."""
+    try:
+        import torch
+
+        if torch.cuda.is_available():
+            return "cuda:0"
+    except ImportError:
+        pass
+
+    return "cpu"
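The helper above replaces the inline `torch.cuda.is_available()` check that pdf2zh/high_level.py used to perform. A small illustration of the intended behavior (not part of the commit): with the optional torch extra installed and a CUDA device visible it returns "cuda:0"; in the default ONNX-only install, where torch is absent, it falls back to "cpu".

```python
from pdf2zh.utils import get_device

# "cuda:0" only when torch is importable and reports an available GPU;
# otherwise "cpu", matching the ONNX-only default install.
device = get_device()
print(device)
```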

pyproject.toml (+8, -3)

@@ -5,7 +5,7 @@ description = "Latex PDF Translator"
 authors = [{ name = "Byaidu", email = "byaidux@gmail.com" }]
 license = "AGPL-3.0"
 readme = "README.md"
-requires-python = ">=3.8,<3.13"
+requires-python = ">=3.9,<3.13"
 classifiers = [
     "Programming Language :: Python :: 3",
     "Operating System :: OS Independent",
@@ -17,7 +17,6 @@ dependencies = [
     "pymupdf",
     "tqdm",
     "tenacity",
-    "doclayout-yolo",
     "numpy",
     "ollama",
     "deepl<1.19.1",
@@ -25,10 +24,16 @@
     "azure-ai-translation-text<=1.0.1",
     "gradio",
     "huggingface_hub",
-    "torch",
+    "onnx",
+    "onnxruntime",
+    "opencv-python-headless",
 ]
 
 [project.optional-dependencies]
+torch = [
+    "doclayout-yolo",
+    "torch",
+]
 dev = [
     "black",
     "flake8",