Issue with Language Specific Transcription Using txtai and Whisper · Issue #593 · neuml/txtai · GitHub
Issue with Language Specific Transcription Using txtai and Whisper #593

Open
Nondzu opened this issue Nov 3, 2023 · 4 comments
Nondzu commented Nov 3, 2023

Environment

  • txtai version: 6.2.0
  • whisper version:
  • Python version: 3.11.5
  • Operating System:
    Description: Linux Mint 21.2
    Release: 21.2
    Codename: victoria

Description

I'm attempting to transcribe Polish audio using the Whisper model within txtai, and while I am able to get transcriptions, they appear to be in English rather than the native language of the audio.

Here's a snippet of the code I'm using:

    from txtai.transcription import Transcription

    transcribe = Transcription("openai/whisper-large-v2")
    for text in transcribe(files):
        print(text)

Questions

  1. Does txtai's transcription feature automatically translate the text to English, or is it supposed to return text in the language of the audio?
  2. How can I disable any automatic translation feature or specify the language of the audio to ensure that the transcription is in Polish?

Any guidance or suggestions on this matter would be greatly appreciated.

Thank you!

@Nondzu
Author

image

@davidmezzetti
Member

It's possible Whisper runs the translation task by default. Here's an idea to test out, using code from the model page.

    from transformers import WhisperProcessor
    from txtai.transcription import Transcription

    transcribe = Transcription("openai/whisper-large-v2")

    # Test transcribe only
    transcribe.pipeline.model.config.forced_decoder_ids = WhisperProcessor.get_decoder_prompt_ids(language="polish", task="transcribe")

    for text in transcribe(files):
        print(text)

If that works, I can add in a change that makes this more streamlined.
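
For readers unfamiliar with what `forced_decoder_ids` controls: Whisper's decoder is conditioned on a short prefix of special tokens that names the language and the task, and `get_decoder_prompt_ids` returns that prefix as (position, token id) pairs starting at position 1. It is an instance method, so it needs a processor loaded with `from_pretrained`. The sketch below only illustrates the shape of that prefix using token strings instead of ids. The helper name `decoder_prefix` is hypothetical and is not part of txtai or transformers.

```python
# Illustrative only: builds the special-token prefix that conditions Whisper's
# decoder. Token spellings follow the multilingual Whisper vocabulary; the
# helper name `decoder_prefix` is hypothetical.

def decoder_prefix(language_code, task, timestamps=False):
    """Return the special tokens that select language and task."""
    if task not in ("transcribe", "translate"):
        raise ValueError(f"unknown task: {task}")
    tokens = ["<|startoftranscript|>", f"<|{language_code}|>", f"<|{task}|>"]
    if not timestamps:
        tokens.append("<|notimestamps|>")
    return tokens


# Same layout as forced decoder ids: (position, token) pairs from position 1,
# with strings standing in for the integer token ids.
forced = list(enumerate(decoder_prefix("pl", "transcribe")[1:], start=1))
print(forced)

# Swapping the task token is what turns transcription into English translation.
print(decoder_prefix("pl", "translate"))
```

With the task token pinned to `<|transcribe|>`, the output should stay in the language of the audio. With `<|translate|>`, Whisper produces English.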

@Nondzu
Author

Nondzu commented Nov 4, 2023

@davidmezzetti thank you for the help. After a small modification, this code works fine:

    from transformers import WhisperProcessor
    from txtai.pipeline import Transcription

    # from txtai.transcription import Transcription
    # model = "openai/whisper-large-v2"
    model = "bardsai/whisper-large-v2-pl-v2"

    transcribe = Transcription(model)
    processor = WhisperProcessor.from_pretrained(model)

    # Test transcribe only
    transcribe.pipeline.model.config.forced_decoder_ids = processor.get_decoder_prompt_ids(language="polish", task="transcribe")

    for text in transcribe(files):
        print(text)

image

@davidmezzetti
Member

Thanks for confirming. I'll keep this issue open and add an argument to disable automatic translation.
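
Until such an argument exists, the workaround confirmed in this thread can be wrapped in a small function. This is a minimal sketch, not the maintainer's planned change: `force_task` is a hypothetical name, and it assumes the `transcribe.pipeline.model.config` attribute path used in the thread. The dry run uses stand-in objects so the snippet runs without downloading a model.

```python
from types import SimpleNamespace


def force_task(transcribe, processor, language, task="transcribe"):
    """Pin the Whisper language and task on a txtai Transcription pipeline.

    Hypothetical helper: it repeats the assignment from the thread and
    assumes `transcribe.pipeline.model.config` exists, as it did there.
    """
    ids = processor.get_decoder_prompt_ids(language=language, task=task)
    transcribe.pipeline.model.config.forced_decoder_ids = ids
    return ids


# Dry run with stand-in objects. The strings below are placeholders,
# not real token ids.
fake_processor = SimpleNamespace(
    get_decoder_prompt_ids=lambda language, task: [(1, f"<{language}>"), (2, f"<{task}>")]
)
fake_config = SimpleNamespace(forced_decoder_ids=None)
fake_transcribe = SimpleNamespace(
    pipeline=SimpleNamespace(model=SimpleNamespace(config=fake_config))
)

ids = force_task(fake_transcribe, fake_processor, "polish")
print(ids)
```

In real use, `transcribe` would be the `Transcription(model)` instance and `processor` the result of `WhisperProcessor.from_pretrained(model)`, as in the working snippet above.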
