Merge branch 'main' of https://github.com/Byaidu/PDFMathTranslate

Byaidu 1 year ago
Commit 84f9263d83
6 files changed, 189 additions, 3 deletions
  1. README.md (+15 -1)
  2. README_zh-CN.md (+16 -0)
  3. pdf2zh/converter.py (+5 -0)
  4. pdf2zh/pdf2zh.py (+27 -1)
  5. pdf2zh/translator.py (+125 -1)
  6. pyproject.toml (+1 -0)

+ 15 - 1
README.md

@@ -37,7 +37,8 @@ Feel free to provide feedback in [GitHub Issues](https://github.com/Byaidu/PDFMa
 
 <h2 id="updates">Updates</h2>
 
-- [Nov. 23 2024] [ONNX](https://github.com/onnx/onnx) support to reduce dependency sizes *(by [@Wybxc](https://github.com/Wybxc))*  
+- [Nov. 26 2024] CLI now supports online file(s) *(by [@reycn](https://github.com/reycn))*  
+- [Nov. 24 2024] [ONNX](https://github.com/onnx/onnx) support to reduce dependency sizes *(by [@Wybxc](https://github.com/Wybxc))*  
 - [Nov. 23 2024] 🌟 [Public Service](#demo)  online! *(by [@Byaidu](https://github.com/Byaidu))*  
 - [Nov. 23 2024] Firewall for preventing web bots *(by [@Byaidu](https://github.com/Byaidu))*  
 - [Nov. 22 2024] GUI now supports Italian, and has been improved *(by [@Byaidu](https://github.com/Byaidu), [@reycn](https://github.com/reycn))*  
@@ -146,6 +147,8 @@ In the following table, we list all advanced options for reference:
 
 | Option    | Function | Example |
 | -------- | ------- |------- |
+| (document)  | Local file(s) |  `pdf2zh ~/local.pdf` |
+|  | Online file(s) |  `pdf2zh http://web.com/online.pdf` |
 | `-i`  | [Enter GUI](#gui) |  `pdf2zh -i` |
 | `-p`  | [Partial document translation](#partial) |  `pdf2zh example.pdf -p 1` |
 | `-li` | [Source language](#languages) |  `pdf2zh example.pdf -li en` |
@@ -239,6 +242,17 @@ pdf2zh example.pdf -li en -lo ja
   ```bash
   pdf2zh example.pdf -s azure
   ```
+- **Tencent Machine Translation**
+
+  See [Tencent Machine Translation](https://www.tencentcloud.com/products/tmt?from_qcintl=122110104)
+
+  The following environment variables are required:
+  - `TENCENT_SECRET_ID`, e.g., `export TENCENT_SECRET_ID=AKIDxxx`
+  - `TENCENT_SECRET_KEY`, e.g., `export TENCENT_SECRET_KEY=xxx`
+
+  ```bash
+  pdf2zh example.pdf -s tmt
+  ```
 
 <h3 id="exceptions">Translate with exceptions</h3>
 

+ 16 - 0
README_zh-CN.md

@@ -37,6 +37,7 @@
 
 <h2 id="updates">Recent Updates</h2>
 
+- [Nov. 26 2024] CLI now supports online PDF file(s) *(by [@reycn](https://github.com/reycn))*  
 - [Nov. 24 2024] [ONNX](https://github.com/onnx/onnx) support to reduce dependency sizes *(by [@Wybxc](https://github.com/Wybxc))*  
 - [Nov. 23 2024] 🌟 [Free public service](#demo) online! *(by [@Byaidu](https://github.com/Byaidu))*  
 - [Nov. 23 2024] Firewall for preventing web crawlers *(by [@Byaidu](https://github.com/Byaidu))*  
@@ -146,6 +147,8 @@
 
 | Option    | Function | Example |
 | -------- | ------- |------- |
+| (document)  | Local file(s) |  `pdf2zh ~/local.pdf` |
+|  | Online file(s) |  `pdf2zh http://web.com/online.pdf` |
 | `-i`  | [Enter GUI](#gui) |  `pdf2zh -i` |
 | `-p`  | [Partial document translation](#partial) |  `pdf2zh example.pdf -p 1` |
 | `-li` | [Source language](#languages) |  `pdf2zh example.pdf -li en` |
@@ -245,6 +248,19 @@ pdf2zh example.pdf -s openai:gpt-4o
 pdf2zh example.pdf -s azure
 ```
 
+- **Tencent Machine Translation**
+
+See [Tencent Machine Translation](https://cloud.tencent.com/product/tmt)
+
+The following environment variables are required:
+
+- `TENCENT_SECRET_ID`, e.g., `export TENCENT_SECRET_ID=AKIDxxx`
+- `TENCENT_SECRET_KEY`, e.g., `export TENCENT_SECRET_KEY=xxx`
+
+```bash
+pdf2zh example.pdf -s tmt
+```
+
 <h3 id="exceptions">Specify exceptions</h3>
 
 Use regular expressions to specify the formula fonts and characters to preserve

+ 5 - 0
pdf2zh/converter.py

@@ -69,6 +69,7 @@ from pdf2zh.translator import (
     OllamaTranslator,
     OpenAITranslator,
     AzureTranslator,
+    TencentTranslator,
 )
 
 
@@ -394,6 +395,10 @@ class TextConverter(PDFConverter[AnyIO]):
             self.translator: BaseTranslator = AzureTranslator(
                 service, lang_out, lang_in, None
             )
+        elif param[0] == "tencent":
+            self.translator: BaseTranslator = TencentTranslator(
+                service, lang_out, lang_in, None
+            )
         else:
             raise ValueError("Unsupported translation service")
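
Review note on the dispatch above: the new branch matches the prefix `tencent`, while the README example added in this commit passes `-s tmt`. If the prefix is compared verbatim, `-s tmt` would fall through to the `else` branch. The sketch below is one possible alternative to the growing `elif` chain, not the project's implementation: a lookup table keyed by service prefix. The stand-in classes, `TRANSLATORS`, and `make_translator` are hypothetical names, and the `name:model` split is an assumption based on the `openai:gpt-4o` example in the README.

```python
from typing import Dict, Type


class BaseTranslator:
    """Simplified stand-in for pdf2zh.translator.BaseTranslator (assumption)."""

    def __init__(self, service, lang_out, lang_in, model):
        self.service = service
        self.lang_out = lang_out
        self.lang_in = lang_in
        self.model = model


class AzureTranslator(BaseTranslator):
    pass


class TencentTranslator(BaseTranslator):
    pass


# Hypothetical registry: service prefix -> translator class.
TRANSLATORS: Dict[str, Type[BaseTranslator]] = {
    "azure": AzureTranslator,
    "tencent": TencentTranslator,
}


def make_translator(service: str, lang_out: str, lang_in: str) -> BaseTranslator:
    # Assumption: the service string is "name" or "name:model".
    name = service.split(":", 1)[0]
    try:
        cls = TRANSLATORS[name]
    except KeyError:
        raise ValueError("Unsupported translation service") from None
    return cls(service, lang_out, lang_in, None)
```

With this layout, accepting `tmt` as an alias would be a one-line addition to the table rather than another branch.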
 
 

+ 27 - 1
pdf2zh/pdf2zh.py

@@ -13,6 +13,7 @@ from pathlib import Path
 from typing import TYPE_CHECKING, Any, Container, Iterable, List, Optional
 
 import pymupdf
+import requests
 
 from pdf2zh import __version__
 from pdf2zh.pdfexceptions import PDFValueError
@@ -36,6 +37,12 @@ def setup_log() -> None:
 
 
 def check_files(files: List[str]) -> List[str]:
+    files = [
+        f for f in files if not f.startswith("http://")
+    ]  # exclude online files, http
+    files = [
+        f for f in files if not f.startswith("https://")
+    ]  # exclude online files, https
     missing_files = [file for file in files if not os.path.exists(file)]
     return missing_files
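
Review note: the two filters added to `check_files` can be expressed as a single pass, since `str.startswith` accepts a tuple of prefixes. A minimal sketch with the same behavior follows; `URL_PREFIXES` and `is_online_file` are names introduced here for illustration and are not part of the commit.

```python
import os
from typing import List

# Prefixes treated as "online" inputs (same two as in the diff).
URL_PREFIXES = ("http://", "https://")


def is_online_file(path: str) -> bool:
    """Return True if the argument looks like an HTTP(S) URL."""
    return path.startswith(URL_PREFIXES)


def check_files(files: List[str]) -> List[str]:
    """Return the local paths that do not exist; URLs are skipped."""
    local_files = [f for f in files if not is_online_file(f)]
    return [f for f in local_files if not os.path.exists(f)]
```

The same predicate could then be reused in `extract_text` to decide whether a download is needed.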
 
 
@@ -75,8 +82,8 @@ def extract_text(
     output: str = "",
     **kwargs: Any,
 ) -> AnyIO:
-    from pdf2zh.doclayout import DocLayoutModel
     import pdf2zh.high_level
+    from pdf2zh.doclayout import DocLayoutModel
 
     if not files:
         raise PDFValueError("Must provide files to work upon!")
@@ -90,6 +97,24 @@ def extract_text(
     model = DocLayoutModel.load_available()
 
     for file in files:
+        if file.startswith("http://") or file.startswith("https://"):
+            print("Online files detected, downloading...")
+            try:
+                r = requests.get(file, allow_redirects=True)
+                if r.status_code == 200:
+                    if not os.path.exists("./pdf2zh_files"):
+                        print("Making a temporary dir for downloading PDF files...")
+                        os.mkdir("./pdf2zh_files")  # dirname("./pdf2zh_files") is ".", which already exists
+                    with open("./pdf2zh_files/tmp_download.pdf", "wb") as f:
+                        print(f"Writing the file: {file}...")
+                        f.write(r.content)
+                    file = "./pdf2zh_files/tmp_download.pdf"
+                else:
+                    r.raise_for_status()
+            except Exception as e:
+                raise PDFValueError(
+                    f"Errors occur in downloading the PDF file. Please check the link(s).\nError:\n{e}"
+                )
         filename = os.path.splitext(os.path.basename(file))[0]
 
         doc_en = pymupdf.open(file)
@@ -282,3 +307,4 @@ def main(args: Optional[List[str]] = None) -> int:
 
 if __name__ == "__main__":
     sys.exit(main())
+    sys.exit(main())
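
Review note on the download step in this file: every online file is written to the same `tmp_download.pdf`, so the `filename` derived a few lines later is always `tmp_download`, and with several URLs each download overwrites the previous one. The sketch below shows one way to give each download its own name. It is an illustration under stated assumptions, not the commit's code: it uses the standard library's `urllib.request` instead of `requests` so that it runs without extra dependencies, and `local_name_for` and `download_pdf` are hypothetical helpers.

```python
import os
import urllib.request
from urllib.parse import unquote, urlparse


def local_name_for(url: str, default: str = "download.pdf") -> str:
    """Derive a file name from the URL path, falling back to a default."""
    name = os.path.basename(unquote(urlparse(url).path))
    return name or default


def download_pdf(url: str, target_dir: str) -> str:
    """Download url into target_dir and return the local path."""
    os.makedirs(target_dir, exist_ok=True)
    path = os.path.join(target_dir, local_name_for(url))
    # urlopen follows redirects and raises HTTPError/URLError on failure.
    with urllib.request.urlopen(url, timeout=30) as response, open(path, "wb") as out:
        out.write(response.read())
    return path
```

A caller could pass a directory created with `tempfile.mkdtemp()` as `target_dir` so that downloads do not accumulate in the working directory.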

+ 125 - 1
pdf2zh/translator.py

@@ -1,7 +1,11 @@
+import hashlib
+import hmac
 import html
 import logging
 import os
 import re
+import time
+from datetime import UTC, datetime
 from json import dumps, loads
 
 import deepl
@@ -55,6 +59,122 @@ class GoogleTranslator(BaseTranslator):
         return result
 
 
+class TencentTranslator(BaseTranslator):
+    def sign(self, key, msg):
+        return hmac.new(key, msg.encode("utf-8"), hashlib.sha256).digest()
+
+    def __init__(self, service, lang_out, lang_in, model):
+        lang_out = "zh" if lang_out == "auto" else lang_out
+        lang_in = "en" if lang_in == "auto" else lang_in
+        super().__init__(service, lang_out, lang_in, model)
+        try:
+            server_url = "tmt.tencentcloudapi.com"
+            self.secret_id = os.environ["TENCENT_SECRET_ID"]  # raises KeyError if unset
+            self.secret_key = os.environ["TENCENT_SECRET_KEY"]  # os.getenv would return None instead
+
+        except KeyError as e:
+            missing_var = e.args[0]
+            raise ValueError(
+                f"The environment variable '{missing_var}' is required but not set."
+            ) from e
+
+        self.session = requests.Session()
+        self.base_link = f"{server_url}"
+
+    def translate(self, text):
+        text = text[:5000]
+        data = {
+            "SourceText": text,
+            "Source": self.lang_in,
+            "Target": self.lang_out,
+            "ProjectId": 0,
+        }
+        payloadx = dumps(data)
+        hashed_request_payload = hashlib.sha256(payloadx.encode("utf-8")).hexdigest()
+        canonical_request = (
+            "POST"
+            + "\n"
+            + "/"
+            + "\n"
+            + ""
+            + "\n"
+            + "content-type:application/json; charset=utf-8\nhost:tmt.tencentcloudapi.com\nx-tc-action:texttranslate\n"
+            + "\n"
+            + "content-type;host;x-tc-action"
+            + "\n"
+            + hashed_request_payload
+        )
+
+        timestamp = int(time.time())
+        date = datetime.fromtimestamp(timestamp, UTC).strftime("%Y-%m-%d")
+        credential_scope = date + "/tmt/tc3_request"
+        hashed_canonical_request = hashlib.sha256(
+            canonical_request.encode("utf-8")
+        ).hexdigest()
+        algorithm = "TC3-HMAC-SHA256"
+        string_to_sign = (
+            algorithm
+            + "\n"
+            + str(timestamp)
+            + "\n"
+            + credential_scope
+            + "\n"
+            + hashed_canonical_request
+        )
+        secret_date = self.sign(("TC3" + self.secret_key).encode("utf-8"), date)
+        secret_service = self.sign(secret_date, "tmt")
+        secret_signing = self.sign(secret_service, "tc3_request")
+        signed_headers = "content-type;host;x-tc-action"
+        signature = hmac.new(
+            secret_signing, string_to_sign.encode("utf-8"), hashlib.sha256
+        ).hexdigest()
+        authorization = (
+            algorithm
+            + " "
+            + "Credential="
+            + self.secret_id
+            + "/"
+            + credential_scope
+            + ", "
+            + "SignedHeaders="
+            + signed_headers
+            + ", "
+            + "Signature="
+            + signature
+        )
+        self.headers = {
+            "Authorization": authorization,
+            "Content-Type": "application/json; charset=utf-8",
+            "Host": "tmt.tencentcloudapi.com",
+            "X-TC-Action": "TextTranslate",
+            "X-TC-Region": "ap-beijing",
+            "X-TC-Timestamp": str(timestamp),
+            "X-TC-Version": "2018-03-21",
+        }
+
+        response = self.session.post(
+            "https://" + self.base_link,
+            json=data,
+            headers=self.headers,
+        )
+        # 1. Status code test
+        if response.status_code == 200:
+            result = loads(response.text)
+        else:
+            raise ValueError("HTTP error: " + str(response.status_code))
+        # 2. Result test
+        try:
+            result = result["Response"]["TargetText"]
+        except KeyError:
+            raise ValueError("No valid key in Tencent's response")
+        # 3. Result length check (previously unreachable after the early return)
+        if len(result) == 0:
+            raise ValueError("Empty translation result")
+        return result
+
+
 class DeepLXTranslator(BaseTranslator):
     def __init__(self, service, lang_out, lang_in, model):
         lang_out = "zh" if lang_out == "auto" else lang_out
@@ -74,7 +194,11 @@ class DeepLXTranslator(BaseTranslator):
             ) from e
 
         self.session = requests.Session()
-        self.base_link = f"{server_url}/{auth_key}/translate"
+        server_url = server_url.rstrip("/")
+        if auth_key:
+            self.base_link = f"{server_url}/{auth_key}/translate"
+        else:
+            self.base_link = f"{server_url}/translate"
         self.headers = {
             "User-Agent": "Mozilla/4.0 (compatible;MSIE 6.0;Windows NT 5.1;SV1;.NET CLR 1.1.4322;.NET CLR 2.0.50727;.NET CLR 3.0.04506.30)"  # noqa: E501
         }
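
Review note on `TencentTranslator.translate`: the signing steps in the diff (canonical request, string to sign, derived key, `Authorization` header) can be isolated in a pure function, which makes them testable without network access. The sketch below mirrors the strings used in the diff; the function name `tc3_authorization` and its parameters are introduced here for illustration. It uses `timezone.utc` rather than `datetime.UTC`, because the `UTC` alias is only available from Python 3.11.

```python
import hashlib
import hmac
from datetime import datetime, timezone


def _hmac_sha256(key: bytes, msg: str) -> bytes:
    return hmac.new(key, msg.encode("utf-8"), hashlib.sha256).digest()


def tc3_authorization(secret_id: str, secret_key: str, payload: str, timestamp: int,
                      host: str = "tmt.tencentcloudapi.com",
                      service: str = "tmt",
                      action: str = "texttranslate") -> str:
    """Build the Authorization header value the way the diff does."""
    # Step 1: canonical request (method, URI, query string, headers, signed headers, body hash).
    signed_headers = "content-type;host;x-tc-action"
    canonical_headers = (
        "content-type:application/json; charset=utf-8\n"
        f"host:{host}\n"
        f"x-tc-action:{action}\n"
    )
    hashed_payload = hashlib.sha256(payload.encode("utf-8")).hexdigest()
    canonical_request = "\n".join(
        ["POST", "/", "", canonical_headers, signed_headers, hashed_payload]
    )

    # Step 2: string to sign, scoped to the UTC date and the service.
    date = datetime.fromtimestamp(timestamp, timezone.utc).strftime("%Y-%m-%d")
    credential_scope = f"{date}/{service}/tc3_request"
    string_to_sign = "\n".join([
        "TC3-HMAC-SHA256",
        str(timestamp),
        credential_scope,
        hashlib.sha256(canonical_request.encode("utf-8")).hexdigest(),
    ])

    # Step 3: derive the signing key by chained HMACs, then sign.
    secret_date = _hmac_sha256(("TC3" + secret_key).encode("utf-8"), date)
    secret_service = _hmac_sha256(secret_date, service)
    secret_signing = _hmac_sha256(secret_service, "tc3_request")
    signature = hmac.new(
        secret_signing, string_to_sign.encode("utf-8"), hashlib.sha256
    ).hexdigest()

    # Step 4: assemble the header value.
    return (
        f"TC3-HMAC-SHA256 Credential={secret_id}/{credential_scope}, "
        f"SignedHeaders={signed_headers}, Signature={signature}"
    )
```

Note that the signature covers the hash of the exact request body, so the bytes sent on the wire need to match the `payload` string that was hashed.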

+ 1 - 0
pyproject.toml

@@ -21,6 +21,7 @@ dependencies = [
     "ollama",
     "deepl<1.19.1",
     "openai",
+    "requests",
     "azure-ai-translation-text<=1.0.1",
     "gradio",
     "huggingface_hub",