feat (main): support online PDF file(s)

Rongxin 1 year ago
Parent
Commit
1263568a31
3 changed files with 34 additions and 2 deletions
  1. README.md (+4 -1)
  2. README_zh-CN.md (+3 -0)
  3. pdf2zh/pdf2zh.py (+27 -1)

+ 4 - 1
README.md

@@ -37,7 +37,8 @@ Feel free to provide feedback in [GitHub Issues](https://github.com/Byaidu/PDFMa
 
 <h2 id="updates">Updates</h2>
 
-- [Nov. 23 2024] [ONNX](https://github.com/onnx/onnx) support to reduce dependency sizes *(by [@Wybxc](https://github.com/Wybxc))*  
+- [Nov. 26 2024] CLI now supports online file(s) *(by [@reycn](https://github.com/reycn))*  
+- [Nov. 24 2024] [ONNX](https://github.com/onnx/onnx) support to reduce dependency sizes *(by [@Wybxc](https://github.com/Wybxc))*  
 - [Nov. 23 2024] 🌟 [Public Service](#demo)  online! *(by [@Byaidu](https://github.com/Byaidu))*  
 - [Nov. 23 2024] Firewall for preventing web bots *(by [@Byaidu](https://github.com/Byaidu))*  
 - [Nov. 22 2024] GUI now supports Italian, and has been improved *(by [@Byaidu](https://github.com/Byaidu), [@reycn](https://github.com/reycn))*  
@@ -146,6 +147,8 @@ In the following table, we list all advanced options for reference:
 
 | Option    | Function | Example |
 | -------- | ------- |------- |
+| (document)  | Local file(s) |  `pdf2zh ~/local.pdf` |
+|  | Online files(s) |  `pdf2zh http://web.com/online.pdf` |
 | `-i`  | [Enter GUI](#gui) |  `pdf2zh -i` |
 | `-p`  | [Partial document translation](#partial) |  `pdf2zh example.pdf -p 1` |
 | `-li` | [Source language](#languages) |  `pdf2zh example.pdf -li en` |

+ 3 - 0
README_zh-CN.md

@@ -37,6 +37,7 @@
 
 <h2 id="updates">近期更新</h2>
 
+- [Nov. 26 2024] CLI 现在已支持(多个)在线 PDF 文件 *(by [@reycn](https://github.com/reycn))*  
 - [Nov. 24 2024] 为降低依赖大小,提供 [ONNX](https://github.com/onnx/onnx) 支持 *(by [@Wybxc](https://github.com/Wybxc))*  
 - [Nov. 23 2024] 🌟 [免费公共服务](#demo) 上线! *(by [@Byaidu](https://github.com/Byaidu))*  
 - [Nov. 23 2024] 防止网页爬虫的防火墙 *(by [@Byaidu](https://github.com/Byaidu))*  
@@ -146,6 +147,8 @@
 
 | Option    | Function | Example |
 | -------- | ------- |------- |
+| (文档)  | 本地(多个)文件 |  `pdf2zh ~/local.pdf` |
+|  | 在线(多个)文件|  `pdf2zh http://web.com/online.pdf` |
 | `-i`  | [进入图形界面](#gui) |  `pdf2zh -i` |
 | `-p`  | [仅翻译部分文档](#partial) |  `pdf2zh example.pdf -p 1` |
 | `-li` | [源语言](#languages) |  `pdf2zh example.pdf -li en` |

+ 27 - 1
pdf2zh/pdf2zh.py

@@ -13,6 +13,7 @@ from pathlib import Path
 from typing import TYPE_CHECKING, Any, Container, Iterable, List, Optional
 
 import pymupdf
+import requests
 
 from pdf2zh import __version__
 from pdf2zh.pdfexceptions import PDFValueError
@@ -36,6 +37,12 @@ def setup_log() -> None:
 
 
 def check_files(files: List[str]) -> List[str]:
+    files = [
+        f for f in files if not f.startswith("http://")
+    ]  # exclude online files, http
+    files = [
+        f for f in files if not f.startswith("https://")
+    ]  # exclude online files, https
     missing_files = [file for file in files if not os.path.exists(file)]
     return missing_files
 
@@ -75,8 +82,8 @@ def extract_text(
     output: str = "",
     **kwargs: Any,
 ) -> AnyIO:
-    from pdf2zh.doclayout import DocLayoutModel
     import pdf2zh.high_level
+    from pdf2zh.doclayout import DocLayoutModel
 
     if not files:
         raise PDFValueError("Must provide files to work upon!")
@@ -90,6 +97,24 @@ def extract_text(
     model = DocLayoutModel.load_available()
 
     for file in files:
+        if file.startswith("http://") or file.startswith("https://"):
+            print("Online files detected, downloading...")
+            try:
+                r = requests.get(file, allow_redirects=True)
+                if r.status_code == 200:
+                    if not os.path.exists("./pdf2zh_files"):
+                        print("Making a temporary dir for downloading PDF files...")
+                        os.mkdir(os.path.dirname("./pdf2zh_files"))
+                    with open("./pdf2zh_files/tmp_download.pdf", "wb") as f:
+                        print(f"Writing the file: {file}...")
+                        f.write(r.content)
+                    file = "./pdf2zh_files/tmp_download.pdf"
+                else:
+                    r.raise_for_status()
+            except Exception as e:
+                raise PDFValueError(
+                    f"Errors occur in downloading the PDF file. Please check the link(s).\nError:\n{e}"
+                )
         filename = os.path.splitext(os.path.basename(file))[0]
 
         doc_en = pymupdf.open(file)
@@ -282,3 +307,4 @@ def main(args: Optional[List[str]] = None) -> int:
 
 if __name__ == "__main__":
     sys.exit(main())
+    sys.exit(main())
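
The download branch added to `extract_text` can be sketched as a standalone helper. This is a minimal sketch under stated assumptions, not the project's code: `is_online`, `local_name_for`, `save_pdf`, and `download_pdf` are hypothetical names, and the sketch swaps `requests` for the standard library's `urllib.request` so that it runs without extra dependencies. It also differs from the diff in two ways. First, the directory is created with `os.makedirs(..., exist_ok=True)`; in the diff, `os.path.dirname("./pdf2zh_files")` evaluates to `"."`, so the `os.mkdir` call appears to target the current directory rather than the download directory. Second, each download keeps the file name from its URL instead of sharing `tmp_download.pdf`.

```python
import os
import tempfile
import urllib.request
from typing import List, Optional
from urllib.parse import urlparse

URL_PREFIXES = ("http://", "https://")  # str.startswith accepts a tuple


def is_online(path: str) -> bool:
    """Return True when the argument looks like an http(s) URL."""
    return path.startswith(URL_PREFIXES)


def check_files(files: List[str]) -> List[str]:
    """Return the local paths that do not exist; URLs are skipped."""
    return [f for f in files if not is_online(f) and not os.path.exists(f)]


def local_name_for(url: str) -> str:
    """Derive a file name from the URL path, falling back to a fixed name."""
    name = os.path.basename(urlparse(url).path)
    return name if name else "download.pdf"


def save_pdf(content: bytes, url: str, dest_dir: str) -> str:
    """Write downloaded bytes under dest_dir and return the local path."""
    os.makedirs(dest_dir, exist_ok=True)  # no error if the directory exists
    path = os.path.join(dest_dir, local_name_for(url))
    with open(path, "wb") as f:
        f.write(content)
    return path


def download_pdf(
    url: str, dest_dir: Optional[str] = None, timeout: float = 30.0
) -> str:
    """Fetch url and return the path of the saved copy (hypothetical helper)."""
    dest_dir = dest_dir or tempfile.mkdtemp(prefix="pdf2zh_")
    # urlopen follows redirects and raises HTTPError on 4xx/5xx responses.
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return save_pdf(response.read(), url, dest_dir)
```

With helpers like these, the loop body could reduce to something like `if is_online(file): file = download_pdf(file)`. Any name later derived from `os.path.basename(file)` would then follow the remote file name, so several online files in one invocation would not overwrite each other's download.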