stt_en_conformer_ctc_large_ls.nemo WER 0.51 with common voice 13.0 #13229

jhss opened this issue Apr 24, 2025 · 1 comment


jhss commented Apr 24, 2025

Transcribing: 100%|██████████| 1024/1024 [01:12<00:00, 14.07it/s]
[NeMo I 2025-04-24 18:10:21 transcribe_speech:420] Model time for iteration 0: 74.435
[NeMo I 2025-04-24 18:10:21 transcribe_speech:425] Model time avg: 74.435
[NeMo I 2025-04-24 18:10:21 transcribe_speech:431] Finished transcribing from manifest file: datasets/en/mozilla-foundation/common_voice_13_0/en/test/test_mozilla-foundation_common_voice_13_0_manifest.json
[NeMo I 2025-04-24 18:10:21 transcribe_speech:436] Writing transcriptions into file: tmp
[NeMo I 2025-04-24 18:10:23 transcribe_speech:459] Finished writing predictions to tmp!
Backend tkagg is interactive backend. Turning interactive mode on.
[NeMo I 2025-04-24 18:11:12 transcribe_speech:477] Writing prediction and error rate of each sample to tmp!
[NeMo I 2025-04-24 18:11:12 transcribe_speech:478] {'samples': 16372, 'tokens': 152585, 'wer': 0.5112756824065275, 'ins_rate': 0.053124487990300485, 'del_rate': 0.023324704263197563, 'sub_rate': 0.43482649015302943}

I ran speech_to_text_eval.py with the model stt_en_conformer_ctc_large_ls.nemo.
The WER (0.51) seems high. Is this expected?
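As a sanity check on the numbers in the log above: WER is the sum of the insertion, deletion, and substitution rates, and the reported components do add up to the reported 0.511. A minimal verification of that arithmetic:

```python
# Error rates copied from the evaluation log above.
ins_rate = 0.053124487990300485
del_rate = 0.023324704263197563
sub_rate = 0.43482649015302943
reported_wer = 0.5112756824065275

# WER = (insertions + deletions + substitutions) / reference tokens,
# so the three component rates should sum to the overall WER.
wer = ins_rate + del_rate + sub_rate
assert abs(wer - reported_wer) < 1e-12
print(f"WER = {wer:.4f}")  # substitutions dominate the error
```

Note that substitutions account for roughly 85% of the errors, which often points to a domain mismatch between the model's training data (LibriSpeech, per the `_ls` suffix) and the Common Voice test set, rather than a broken evaluation pipeline.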

@nithinraok
Collaborator

Yes, those are older models and may not be suitable for all datasets. Could you try the latest https://huggingface.co/nvidia/parakeet-tdt_ctc-110m with the CTC decoder? You can switch the decoder to CTC before running inference using:
asr_model.change_decoding_strategy(decoder_type='ctc')
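A minimal sketch of the suggested workflow, assuming NeMo's `ASRModel.from_pretrained` interface; the audio file paths are placeholders, and this requires a working NeMo installation to run:

```python
import nemo.collections.asr as nemo_asr

# Load the hybrid TDT/CTC Parakeet model suggested above
# (downloads the checkpoint from Hugging Face on first use).
asr_model = nemo_asr.models.ASRModel.from_pretrained("nvidia/parakeet-tdt_ctc-110m")

# Hybrid models default to the TDT decoder; switch the decoding
# branch to CTC before inference, as recommended in this thread.
asr_model.change_decoding_strategy(decoder_type='ctc')

# Transcribe audio files (paths here are placeholders, not from the issue).
transcripts = asr_model.transcribe(["sample1.wav", "sample2.wav"])
print(transcripts)
```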
