2 years ago · 8ab670afc3
--- a/README.md
+++ b/README.md
@@ -27,6 +27,7 @@
 
				 
			
 
				 <a name="whats-new"></a>
			
 
				 ## What's new:
			
 
				+- 2024/03/05: Added support for the Whisper-large-v3 model, a multitask model that performs multilingual speech recognition, speech translation, and language identification. It can be downloaded from [modelscope](https://www.modelscope.cn/models/iic/Whisper-large-v3/summary) or [openai](https://github.com/alibaba-damo-academy/FunASR/tree/main/examples/industrial_data_pretraining/whisper).
			
 
				 - 2024/03/03: Offline File Transcription Service 4.4, Offline File Transcription Service of English 1.5, and Real-time Transcription Service 1.9 released; the Docker image supports the ARM64 platform ([docs](runtime/readme.md))
			
 
				 - 2024/01/30: funasr-1.0 has been released ([docs](https://github.com/alibaba-damo-academy/FunASR/discussions/1319))
			
 
				 - 2024/01/30: Emotion recognition models are newly supported. [model link](https://www.modelscope.cn/models/iic/emotion2vec_base_finetuned/summary), modified from [repo](https://github.com/ddlBoJack/emotion2vec).
			
@@ -67,20 +68,21 @@ pip3 install -U modelscope
 
				 ## Model Zoo
			
 
				 FunASR has open-sourced a large number of models pre-trained on industrial data. You are free to use, copy, modify, and share FunASR models under the [Model License Agreement](./MODEL_LICENSE). Below are some representative models; for more, please refer to the [Model Zoo]().
			
 
				 
			
 
				-(Note: ⭐ represents the ModelScope model zoo link, 🤗 represents the Huggingface model zoo link)
			
 
				-
			
 
				-
			
 
				-|                                                                                                         Model Name                                                                                                         |                    Task Details                    |          Training Data           | Parameters |
			
 
				-|:--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------:|:--------------------------------------------------:|:--------------------------------:|:----------:|
			
 
				-|          paraformer-zh <br> ([⭐](https://www.modelscope.cn/models/damo/speech_paraformer-large-vad-punc_asr_nat-zh-cn-16k-common-vocab8404-pytorch/summary)  [🤗](https://huggingface.co/funasr/paraformer-tp) )           | speech recognition, with timestamps, non-streaming |      60000 hours, Mandarin       |    220M    |
			
 
				-| <nobr>paraformer-zh-streaming <br> ( [⭐](https://modelscope.cn/models/damo/speech_paraformer-large_asr_nat-zh-cn-16k-common-vocab8404-online/summary) [🤗](https://huggingface.co/funasr/paraformer-zh-streaming) )</nobr> |           speech recognition, streaming            |      60000 hours, Mandarin       |    220M    |
			
 
				-|               paraformer-en <br> ( [⭐](https://www.modelscope.cn/models/damo/speech_paraformer-large-vad-punc_asr_nat-en-16k-common-vocab10020/summary) [🤗](https://huggingface.co/funasr/paraformer-en) )                | speech recognition, with timestamps, non-streaming |       50000 hours, English       |    220M    |
			
 
				-|                            conformer-en <br> ( [⭐](https://modelscope.cn/models/damo/speech_conformer_asr-en-16k-vocab4199-pytorch/summary) [🤗](https://huggingface.co/funasr/conformer-en) )                             |         speech recognition, non-streaming          |       50000 hours, English       |    220M    |
			
 
				-|                               ct-punc <br> ( [⭐](https://modelscope.cn/models/damo/punc_ct-transformer_cn-en-common-vocab471067-large/summary) [🤗](https://huggingface.co/funasr/ct-punc) )                               |              punctuation restoration               |    100M, Mandarin and English    |    1.1G    | 
			
 
				-|                                   fsmn-vad <br> ( [⭐](https://modelscope.cn/models/damo/speech_fsmn_vad_zh-cn-16k-common-pytorch/summary) [🤗](https://huggingface.co/funasr/fsmn-vad) )                                   |              voice activity detection              | 5000 hours, Mandarin and English |    0.4M    | 
			
 
				-|                                     fa-zh <br> ( [⭐](https://modelscope.cn/models/damo/speech_timestamp_prediction-v1-16k-offline/summary) [🤗](https://huggingface.co/funasr/fa-zh) )                                     |                timestamp prediction                |       5000 hours, Mandarin       |    38M     | 
			
 
				-|                                       cam++ <br> ( [⭐](https://modelscope.cn/models/iic/speech_campplus_sv_zh-cn_16k-common/summary) [🤗](https://huggingface.co/funasr/campplus) )                                        |        speaker verification/diarization            |            5000 hours            |    7.2M    | 
			
 
				-|                                                 whisper-large-v2 <br> ([⭐](https://www.modelscope.cn/models/iic/speech_whisper-large_asr_multilingual/summary)  [🤗]() )                                                   | speech recognition, with timestamps, non-streaming |          multilingual            |     1G     |
			
 
				+(Note: ⭐ represents the ModelScope model zoo, 🤗 represents the Huggingface model zoo, 🍀 represents the OpenAI model zoo)
			
 
				+
			
 
				+
			
 
				+|                                                                                                         Model Name                                                                                                         |                     Task Details                      |          Training Data           | Parameters |
			
 
				+|:--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------:|:-----------------------------------------------------:|:--------------------------------:|:----------:|
			
 
				+|          paraformer-zh <br> ([⭐](https://www.modelscope.cn/models/damo/speech_paraformer-large-vad-punc_asr_nat-zh-cn-16k-common-vocab8404-pytorch/summary)  [🤗](https://huggingface.co/funasr/paraformer-tp) )           |  speech recognition, with timestamps, non-streaming   |      60000 hours, Mandarin       |    220M    |
			
 
				+| <nobr>paraformer-zh-streaming <br> ( [⭐](https://modelscope.cn/models/damo/speech_paraformer-large_asr_nat-zh-cn-16k-common-vocab8404-online/summary) [🤗](https://huggingface.co/funasr/paraformer-zh-streaming) )</nobr> |             speech recognition, streaming             |      60000 hours, Mandarin       |    220M    |
			
 
				+|               paraformer-en <br> ( [⭐](https://www.modelscope.cn/models/damo/speech_paraformer-large-vad-punc_asr_nat-en-16k-common-vocab10020/summary) [🤗](https://huggingface.co/funasr/paraformer-en) )                | speech recognition, without timestamps, non-streaming |       50000 hours, English       |    220M    |
			
 
				+|                            conformer-en <br> ( [⭐](https://modelscope.cn/models/damo/speech_conformer_asr-en-16k-vocab4199-pytorch/summary) [🤗](https://huggingface.co/funasr/conformer-en) )                             |           speech recognition, non-streaming           |       50000 hours, English       |    220M    |
			
 
				+|                               ct-punc <br> ( [⭐](https://modelscope.cn/models/damo/punc_ct-transformer_cn-en-common-vocab471067-large/summary) [🤗](https://huggingface.co/funasr/ct-punc) )                               |                punctuation restoration                |    100M, Mandarin and English    |    1.1G    | 
			
 
				+|                                   fsmn-vad <br> ( [⭐](https://modelscope.cn/models/damo/speech_fsmn_vad_zh-cn-16k-common-pytorch/summary) [🤗](https://huggingface.co/funasr/fsmn-vad) )                                   |               voice activity detection                | 5000 hours, Mandarin and English |    0.4M    | 
			
 
				+|                                     fa-zh <br> ( [⭐](https://modelscope.cn/models/damo/speech_timestamp_prediction-v1-16k-offline/summary) [🤗](https://huggingface.co/funasr/fa-zh) )                                     |                 timestamp prediction                  |       5000 hours, Mandarin       |    38M     | 
			
 
				+|                                       cam++ <br> ( [⭐](https://modelscope.cn/models/iic/speech_campplus_sv_zh-cn_16k-common/summary) [🤗](https://huggingface.co/funasr/campplus) )                                        |           speaker verification/diarization            |            5000 hours            |    7.2M    | 
			
 
				+|                                                  Whisper-large-v2 <br> ([⭐](https://www.modelscope.cn/models/iic/speech_whisper-large_asr_multilingual/summary)  [🍀](https://github.com/openai/whisper) )                                                  |  speech recognition, with timestamps, non-streaming   |          multilingual            |    1.5G    |
			
 
				+|                                                Whisper-large-v3 <br> ([⭐](https://www.modelscope.cn/models/iic/Whisper-large-v3/summary)  [🍀](https://github.com/openai/whisper) )                                                 |  speech recognition, with timestamps, non-streaming   |          multilingual            |    1.5G    |
			
 
				 
			
 
				 
			
 
				 
			
--- a/README_zh.md
+++ b/README_zh.md
@@ -29,6 +29,7 @@ FunASR希望在语音识别的学术研究和工业应用之间架起一座桥
 
				 
			
 
				 <a name="最新动态"></a>
			
 
				 ## What's new
			
 
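				+- 2024/03/05: Added support for the Whisper-large-v3 model (multilingual speech recognition, speech translation, and language identification). The model can be downloaded from the [modelscope](https://www.modelscope.cn/models/iic/Whisper-large-v3/summary) repository or from the [openai](https://github.com/alibaba-damo-academy/FunASR/tree/main/examples/industrial_data_pretraining/whisper) repository.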
			
 
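				 - 2024/03/03: Chinese Offline File Transcription Service 4.4, English Offline File Transcription Service 1.5, and Chinese Real-time Transcription Service 1.9 released; the Docker image supports the ARM64 platform. For details, see the ([deployment docs](runtime/readme_cn.md))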
			
 
				 - 2024/01/30: funasr-1.0 released; see the release notes [docs](https://github.com/alibaba-damo-academy/FunASR/discussions/1319)
			
 
				 - 2024/01/30: Added emotion recognition [model link](https://www.modelscope.cn/models/iic/emotion2vec_base_finetuned/summary), original model [repo](https://github.com/ddlBoJack/emotion2vec).
			
@@ -69,20 +70,21 @@ pip3 install -U modelscope
 
				 
			
 
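				 FunASR has open-sourced a large number of models pre-trained on industrial data. You are free to use, copy, modify, and share FunASR models under the [Model License Agreement](./MODEL_LICENSE). Representative models are listed below; for more models, please refer to the [Model Zoo]().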
			
 
				 
			
 
				-(Note: ⭐ represents the ModelScope model zoo link, 🤗 represents the Huggingface model zoo link)
			
 
				-
			
 
				-
			
 
				-| Model Name | Task Details | Training Data | Parameters |
			
 
				-|:------------------------------------------------------------------------------------------------------------------------------------------------------------:|:---------------:|:------------:|:----:|
			
 
				-| paraformer-zh <br> ([⭐](https://www.modelscope.cn/models/damo/speech_paraformer-large-vad-punc_asr_nat-zh-cn-16k-common-vocab8404-pytorch/summary)  [🤗](https://huggingface.co/funasr/paraformer-tp) ) | speech recognition, with timestamps, non-streaming | 60000 hours, Mandarin | 220M |
			
 
				-| paraformer-zh-streaming <br> ( [⭐](https://modelscope.cn/models/damo/speech_paraformer-large_asr_nat-zh-cn-16k-common-vocab8404-online/summary) [🤗](https://huggingface.co/funasr/paraformer-zh-streaming) ) | speech recognition, streaming | 60000 hours, Mandarin | 220M |
			
 
				-| paraformer-en <br> ( [⭐](https://www.modelscope.cn/models/damo/speech_paraformer-large-vad-punc_asr_nat-en-16k-common-vocab10020/summary) [🤗](https://huggingface.co/funasr/paraformer-en) ) | speech recognition, non-streaming | 50000 hours, English | 220M |
			
 
				-| conformer-en <br> ( [⭐](https://modelscope.cn/models/damo/speech_conformer_asr-en-16k-vocab4199-pytorch/summary) [🤗](https://huggingface.co/funasr/conformer-en) ) | speech recognition, non-streaming | 50000 hours, English | 220M |
			
 
				-| ct-punc <br> ( [⭐](https://modelscope.cn/models/damo/punc_ct-transformer_cn-en-common-vocab471067-large/summary) [🤗](https://huggingface.co/funasr/ct-punc) ) | punctuation restoration | 100M, Mandarin and English | 1.1G |
			
 
				-| fsmn-vad <br> ( [⭐](https://modelscope.cn/models/damo/speech_fsmn_vad_zh-cn-16k-common-pytorch/summary) [🤗](https://huggingface.co/funasr/fsmn-vad) ) | voice activity detection, streaming | 5000 hours, Mandarin and English | 0.4M |
			
 
				-| fa-zh <br> ( [⭐](https://modelscope.cn/models/damo/speech_timestamp_prediction-v1-16k-offline/summary) [🤗](https://huggingface.co/funasr/fa-zh) ) | character-level timestamp prediction | 50000 hours, Mandarin | 38M |
			
 
				-| cam++ <br> ( [⭐](https://modelscope.cn/models/iic/speech_campplus_sv_zh-cn_16k-common/summary) [🤗](https://huggingface.co/funasr/campplus) ) | speaker verification/diarization | 5000 hours | 7.2M |
			
 
				-| whisper-large-v2 <br> ([⭐](https://www.modelscope.cn/models/iic/speech_whisper-large_asr_multilingual/summary)  [🤗]() ) | speech recognition, with timestamps, non-streaming | multilingual | 1G |
			
 
				+(Note: ⭐ represents the ModelScope model zoo, 🤗 represents the Huggingface model zoo, 🍀 represents the OpenAI model zoo)
			
 
				+
			
 
				+
			
 
				+| Model Name | Task Details | Training Data | Parameters |
			
 
				+|:-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------:|:---------------:|:------------:|:----:|
			
 
				+| paraformer-zh <br> ([⭐](https://www.modelscope.cn/models/damo/speech_paraformer-large-vad-punc_asr_nat-zh-cn-16k-common-vocab8404-pytorch/summary)  [🤗](https://huggingface.co/funasr/paraformer-tp) ) | speech recognition, with timestamps, non-streaming | 60000 hours, Mandarin | 220M |
			
 
				+| paraformer-zh-streaming <br> ( [⭐](https://modelscope.cn/models/damo/speech_paraformer-large_asr_nat-zh-cn-16k-common-vocab8404-online/summary) [🤗](https://huggingface.co/funasr/paraformer-zh-streaming) ) | speech recognition, streaming | 60000 hours, Mandarin | 220M |
			
 
				+| paraformer-en <br> ( [⭐](https://www.modelscope.cn/models/damo/speech_paraformer-large-vad-punc_asr_nat-en-16k-common-vocab10020/summary) [🤗](https://huggingface.co/funasr/paraformer-en) ) | speech recognition, non-streaming | 50000 hours, English | 220M |
			
 
				+| conformer-en <br> ( [⭐](https://modelscope.cn/models/damo/speech_conformer_asr-en-16k-vocab4199-pytorch/summary) [🤗](https://huggingface.co/funasr/conformer-en) ) | speech recognition, non-streaming | 50000 hours, English | 220M |
			
 
				+| ct-punc <br> ( [⭐](https://modelscope.cn/models/damo/punc_ct-transformer_cn-en-common-vocab471067-large/summary) [🤗](https://huggingface.co/funasr/ct-punc) ) | punctuation restoration | 100M, Mandarin and English | 1.1G |
			
 
				+| fsmn-vad <br> ( [⭐](https://modelscope.cn/models/damo/speech_fsmn_vad_zh-cn-16k-common-pytorch/summary) [🤗](https://huggingface.co/funasr/fsmn-vad) ) | voice activity detection, streaming | 5000 hours, Mandarin and English | 0.4M |
			
 
				+| fa-zh <br> ( [⭐](https://modelscope.cn/models/damo/speech_timestamp_prediction-v1-16k-offline/summary) [🤗](https://huggingface.co/funasr/fa-zh) ) | character-level timestamp prediction | 50000 hours, Mandarin | 38M |
			
 
				+| cam++ <br> ( [⭐](https://modelscope.cn/models/iic/speech_campplus_sv_zh-cn_16k-common/summary) [🤗](https://huggingface.co/funasr/campplus) ) | speaker verification/diarization | 5000 hours | 7.2M |
			
 
				+| Whisper-large-v2 <br> ([⭐](https://www.modelscope.cn/models/iic/speech_whisper-large_asr_multilingual/summary)  [🍀](https://github.com/openai/whisper) ) | speech recognition, with timestamps, non-streaming | multilingual | 1G |
			
 
				+| Whisper-large-v3 <br> ([⭐](https://www.modelscope.cn/models/iic/Whisper-large-v3/summary)  [🍀](https://github.com/openai/whisper) ) | speech recognition, with timestamps, non-streaming | multilingual | 1G |
			
 
				 
			
 
				 
			
 
				 <a name="快速开始"></a>
			
--- a/examples/industrial_data_pretraining/whisper/demo.py
+++ b/examples/industrial_data_pretraining/whisper/demo.py
@@ -5,7 +5,7 @@
 
				 
			
 
				 from funasr import AutoModel
			
 
				 
			
 
				-model = AutoModel(model="iic/speech_whisper-large_asr_multilingual",
			
 
				+model = AutoModel(model="iic/Whisper-large-v3",
			
 
				                   model_revision="v2.0.4",
			
 
				                   )
			
 
				 
			
--- a/examples/industrial_data_pretraining/whisper/demo_from_openai.py
+++ b/examples/industrial_data_pretraining/whisper/demo_from_openai.py
@@ -7,8 +7,8 @@ from funasr import AutoModel
 
				 
			
 
				 # model = AutoModel(model="Whisper-small", hub="openai")
			
 
				 # model = AutoModel(model="Whisper-medium", hub="openai")
			
 
				-model = AutoModel(model="Whisper-large-v2", hub="openai")
			
 
				-# model = AutoModel(model="Whisper-large-v3", hub="openai")
			
 
				+# model = AutoModel(model="Whisper-large-v2", hub="openai")
			
 
				+model = AutoModel(model="Whisper-large-v3", hub="openai")
			
 
				 
			
 
				 res = model.generate(
			
 
				 	language=None,
			
--- a/examples/industrial_data_pretraining/whisper/infer_from_local.sh
+++ b/examples/industrial_data_pretraining/whisper/infer_from_local.sh
@@ -13,13 +13,19 @@ workspace=`pwd`
 
				 # download model
			
 
				 local_path_root=${workspace}/modelscope_models
			
 
				 mkdir -p ${local_path_root}
			
 
				-local_path=${local_path_root}/speech_whisper-large_asr_multilingual
			
 
				-git clone https://www.modelscope.cn/iic/speech_whisper-large_asr_multilingual.git ${local_path}
			
 
				+#Whisper-large-v2
			
 
				+#local_path=${local_path_root}/speech_whisper-large_asr_multilingual
			
 
				+#git clone https://www.modelscope.cn/iic/speech_whisper-large_asr_multilingual.git ${local_path}
			
 
				+#init_param="${local_path}/large-v2.pt"
			
 
				+#Whisper-large-v3
			
 
				+local_path=${local_path_root}/Whisper-large-v3
			
 
				+git clone https://www.modelscope.cn/iic/Whisper-large-v3.git ${local_path}
			
 
				+init_param="${local_path}/large-v3.pt"
			
 
				 
			
 
				 device="cuda:0" # "cuda:0" for gpu0, "cuda:1" for gpu1, "cpu"
			
 
				 
			
 
				 config="config.yaml"
			
 
				-init_param="${local_path}/large-v2.pt"
			
 
				+
			
 
				 
			
 
				 python -m funasr.bin.inference \
			
 
				 --config-path "${local_path}" \
			
--- a/funasr/download/name_maps_from_hub.py
+++ b/funasr/download/name_maps_from_hub.py
@@ -8,7 +8,8 @@ name_maps_ms = {
 
				     "ct-punc-c": "damo/punc_ct-transformer_zh-cn-common-vocab272727-pytorch",
			
 
				     "fa-zh": "damo/speech_timestamp_prediction-v1-16k-offline",
			
 
				     "cam++": "damo/speech_campplus_sv_zh-cn_16k-common",
			
 
				-    "whisper-large-v2": "iic/speech_whisper-large_asr_multilingual",
			
 
				+    "Whisper-large-v2": "iic/speech_whisper-large_asr_multilingual",
			
 
				+    "Whisper-large-v3": "iic/Whisper-large-v3",
			
 
				 }
			
 
				 
			
 
				 name_maps_hf = {
			
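Reviewer note on the `name_maps_ms` hunk: the key changes from `whisper-large-v2` to `Whisper-large-v2`, so the old lowercase alias no longer has an entry. The sketch below illustrates what that means for a plain dictionary lookup. It copies only the two entries visible in this diff; `resolve_model_name` is a hypothetical helper, and the pass-through behaviour for unknown names is an assumption rather than FunASR's confirmed logic.

```python
# Two entries copied from the name_maps_ms hunk; the real table has more.
NAME_MAPS_MS = {
    "Whisper-large-v2": "iic/speech_whisper-large_asr_multilingual",
    "Whisper-large-v3": "iic/Whisper-large-v3",
}


def resolve_model_name(name, name_map=NAME_MAPS_MS):
    """Hypothetical helper: map a short alias to a hub ID.

    Assumption: names without an entry are returned unchanged.
    """
    return name_map.get(name, name)


# The new capitalized alias resolves to the ModelScope repository ID.
print(resolve_model_name("Whisper-large-v3"))
# The pre-commit lowercase key has no entry any more, so it passes through as-is.
print(resolve_model_name("whisper-large-v2"))
```

Callers that still pass the lowercase alias may need updating if the lookup is case-sensitive, as it is in this sketch.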
--- a/funasr/models/whisper/model.py
+++ b/funasr/models/whisper/model.py
@@ -24,7 +24,7 @@ from funasr.register import tables
 
				 @tables.register("model_classes", "Whisper-large-v1")
			
 
				 @tables.register("model_classes", "Whisper-large-v2")
			
 
				 @tables.register("model_classes", "Whisper-large-v3")
			
 
				-@tables.register("model_classes", "Whisper-WhisperWarp")
			
 
				+@tables.register("model_classes", "WhisperWarp")
			
 
				 class WhisperWarp(nn.Module):
			
 
				     def __init__(self, *args, **kwargs):
			
 
				         super().__init__()
			
@@ -35,8 +35,8 @@ class WhisperWarp(nn.Module):
 
				                 model_or_path = model_or_path.replace("Whisper-", "")
			
 
				             model = whisper.load_model(model_or_path)
			
 
				         else:
			
 
				-            whisper_dims = kwargs.get("whisper_dims", {})
			
 
				-            dims = whisper.model.ModelDimensions(**whisper_dims)
			
 
				+            dims = kwargs.get("dims", {})
			
 
				+            dims = whisper.model.ModelDimensions(**dims)
			
 
				             model = whisper.model.Whisper(dims=dims)
			
 
				         
			
 
				         self.model = model
			
@@ -55,6 +55,13 @@ class WhisperWarp(nn.Module):
 
				         if kwargs.get("batch_size", 1) > 1:
			
 
				             raise NotImplementedError("batch decoding is not implemented")
			
 
				 
			
 
				+        if frontend is None and not hasattr(self, "frontend"):
			
 
				+            frontend_class = tables.frontend_classes.get("WhisperFrontend")
			
 
				+            frontend = frontend_class(n_mels=self.model.dims.n_mels, do_pad_trim=kwargs.get("do_pad_trim", True))
			
 
				+            self.frontend = frontend
			
 
				+        else:
			
 
				+            frontend = frontend if frontend is not None else self.frontend
			
 
				+
			
 
				         meta_data = {}
			
 
				         if isinstance(data_in, torch.Tensor) and kwargs.get("data_type", "sound") == "fbank":  # fbank
			
 
				             speech, speech_lengths = data_in, data_lengths
			
@@ -65,7 +72,7 @@ class WhisperWarp(nn.Module):
 
				         else:
			
 
				             # extract fbank feats
			
 
				             time1 = time.perf_counter()
			
 
				-            audio_sample_list = load_audio_text_image_video(data_in, fs=frontend.fs, audio_fs=kwargs.get("fs", 16000),
			
 
				+            audio_sample_list = load_audio_text_image_video(data_in, fs=frontend.fs if hasattr(frontend, "fs") else 16000, audio_fs=kwargs.get("fs", 16000),
			
 
				                                                             data_type=kwargs.get("data_type", "sound"),
			
 
				                                                             tokenizer=tokenizer)
			
 
				             time2 = time.perf_counter()
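
Reviewer note on the `model.py` hunks: the new code builds a default frontend the first time inference runs without one, caches it on the model, and falls back to a 16000 Hz sample rate when the frontend has no `fs` attribute. The sketch below reproduces that control flow with stand-in classes so it runs without FunASR or Whisper installed. `StubFrontend`, `StubModel`, and `resolve_frontend` are hypothetical names, and the `n_mels` value is a placeholder for `self.model.dims.n_mels`.

```python
class StubFrontend:
    """Stand-in for WhisperFrontend; deliberately defines no `fs` attribute."""

    def __init__(self, n_mels=80, do_pad_trim=True):
        self.n_mels = n_mels
        self.do_pad_trim = do_pad_trim


class StubModel:
    n_mels = 128  # placeholder for self.model.dims.n_mels

    def resolve_frontend(self, frontend=None, **kwargs):
        if frontend is None and not hasattr(self, "frontend"):
            # First call without a frontend: build the default and cache it.
            frontend = StubFrontend(
                n_mels=self.n_mels,
                do_pad_trim=kwargs.get("do_pad_trim", True),
            )
            self.frontend = frontend
        elif frontend is None:
            # Later calls without a frontend: reuse the cached instance.
            frontend = self.frontend
        # Same result as: frontend.fs if hasattr(frontend, "fs") else 16000
        fs = getattr(frontend, "fs", 16000)
        return frontend, fs


model = StubModel()
first, fs = model.resolve_frontend()
second, _ = model.resolve_frontend()
print(first is second, fs)
```

An explicitly passed frontend is used as-is and does not replace the cached default, which matches the `else` branch in the hunk.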