
Is there any way to control how long after the last word is spoken before Vosk closes the session? #24

dv8inpp opened this issue Apr 6, 2022 · 16 comments

Comments

@dv8inpp

dv8inpp commented Apr 6, 2022

Is there any way to control how long after the last word is spoken before Vosk closes the session?

I am using the Python implementation and would like to limit how long the system will wait before closing the session.

Are there any parameter files I can create?

python3 ./asr_server.py /opt/vosk-model-en/model

@muyousif

Hi,

Same question here. Could you please confirm which parameter sets the maximum silence threshold? At the moment it appears to be very short.

@Goddard

Goddard commented Aug 23, 2022

Also looking at this.

@nshmyrev
Contributor

You can change the following parameters in model.conf:

--endpoint.rule2.min-trailing-silence=0.5
--endpoint.rule3.min-trailing-silence=1.0
--endpoint.rule4.min-trailing-silence=2.0

You can scale them all up proportionally if you need a longer silence window.
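
For example, doubling each of the stock values above would look like this in model.conf (the numbers are purely illustrative, not a recommendation):

--endpoint.rule2.min-trailing-silence=1.0
--endpoint.rule3.min-trailing-silence=2.0
--endpoint.rule4.min-trailing-silence=4.0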

@Goddard

Goddard commented Aug 23, 2022

Thanks for your response. When using this, will it stop the audio stream from the Asterisk server to my WebSocket server from ending before the call ends?

@nshmyrev
Contributor

Thanks for your response. When using this, will it stop the audio stream from the Asterisk server to my WebSocket server from ending before the call ends?

No, the current module stops the stream after every result. Unfortunately, that is how the Asterisk speech module works. It would be nice to have a long-transcription mode, though.

@Goddard

Goddard commented Aug 23, 2022

I see. Because of this limitation, I am also working on an alternative approach using a different plugin.

This plugin: https://github.com/nadirhamid/asterisk-audiofork

It provides a continuous audio stream, but of course it doesn't work with Vosk out of the box. What do you think would be needed to adapt the code to just use this binary audio stream?

@nshmyrev
Contributor

What do you think would be needed to adapt the code to just use this binary audio stream?

You can just adapt the backend server (https://github.com/nadirhamid/audiofork-transcribe-demo); there is no need to update the audiofork module itself, it should work the same way.
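
For reference, a minimal sketch of such a Vosk-based backend, assuming AudioFork streams raw 16 kHz 16-bit mono PCM frames over a WebSocket and that the websockets and vosk Python packages are installed; the host, port, sample rate, and model path are placeholders, not values from this thread:

# A sketch only, not a tested drop-in replacement for the Google-based demo.
import asyncio
import json

import websockets
from vosk import Model, KaldiRecognizer

model = Model("/opt/vosk-model-en/model")  # load once, shared by all streams

async def handle(websocket):
    # One recognizer per call; the sample rate must match what AudioFork sends.
    # On older "websockets" versions the handler also receives a path argument.
    rec = KaldiRecognizer(model, 16000)
    async for frame in websocket:
        if not isinstance(frame, bytes):
            continue  # ignore any text frames
        if rec.AcceptWaveform(frame):
            print(json.loads(rec.Result()))         # finalized utterance
        else:
            print(json.loads(rec.PartialResult()))  # running partial hypothesis
    print(json.loads(rec.FinalResult()))            # flush when the call ends

async def main():
    async with websockets.serve(handle, "0.0.0.0", 2800):
        await asyncio.Future()  # run until cancelled

asyncio.run(main())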

@Goddard

Goddard commented Aug 23, 2022

The audiofork transcribe demo uses Google's closed-source transcription.

How would I adapt it, especially if I wanted to use an open-source option?

@Goddard

Goddard commented Oct 4, 2022

OK, I made a script, but I am getting significant slowdowns. I've tried configuring the beam and the other things you have suggested, but the result still lags behind. This is on a CPU. Is there anything I can try to improve the speed so it is nearly real time?

https://gist.github.com/Goddard/b86c0469c42e1f4c415f37354a5f30db

@nshmyrev
Contributor

nshmyrev commented Oct 4, 2022

Is there anything I can try to improve the speed so it is nearly real time?

What is your hardware, and how many streams are you trying to process?

@Goddard

Goddard commented Oct 4, 2022

In my tests I am only processing 1 stream.
The results are okay, and the transcription reports processing times of about 0.2 to 0.9 ms, but it actually takes more like one to three seconds for the results to show up in the terminal. This seems to compound over time.

Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 8
On-line CPU(s) list: 0-7
Thread(s) per core: 1
Core(s) per socket: 4
Socket(s): 2
NUMA node(s): 1
Vendor ID: GenuineIntel
CPU family: 6
Model: 62
Model name: Intel(R) Xeon(R) CPU E5-2630 v2 @ 2.60GHz
Stepping: 4
CPU MHz: 2600.095
BogoMIPS: 5200.00
Hypervisor vendor: Xen
Virtualization type: full
L1d cache: 32K
L1i cache: 32K
L2 cache: 256K
L3 cache: 15360K
NUMA node0 CPU(s): 0-7
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush acpi mmx fxsr sse sse2 ss ht syscall nx pdpe1gb rdtscp lm constant_tsc rep_good nopl cpuid pni pclmulqdq ssse3 cx16 pcid sse4_1 sse4_2 x2apic popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm cpuid_fault pti ssbd ibrs ibpb stibp fsgsbase smep erms xsaveopt md_clear flush_l1d

@nshmyrev
Contributor

nshmyrev commented Oct 4, 2022

The results are okay, and the transcription reports processing times of about 0.2 to 0.9 ms.

That is a very small delay.

How much memory do you have?

@Goddard

Goddard commented Oct 4, 2022

64 GB.

That is what the timer reports, but I was expecting the processing to be asynchronous between transcriptions so that the delays wouldn't build up to take longer than the person speaking.

Unless I have an issue with my script, I don't see a way to increase the speed, because that figure is only the transcription time for each partial result, and there can be many partials. With 20 partials each adding 0.2 to 0.9, that can sometimes amount to a 10-second delay before a full transcription arrives.

Does vosk-api use a VAD as well? Do you think that would speed it up?
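
One thing that may be worth ruling out in the script: if AcceptWaveform runs on the same thread that reads the WebSocket, decode time and network I/O add up serially, which would produce exactly this kind of compounding lag. Below is a sketch of pushing the decoding into a thread pool, similar in spirit to how the vosk-server examples use run_in_executor; the handler shape and paths are placeholders:

import asyncio
import concurrent.futures
import json

from vosk import Model, KaldiRecognizer

pool = concurrent.futures.ThreadPoolExecutor(max_workers=2)
model = Model("/opt/vosk-model-en/model")

def process_chunk(rec, frame):
    # CPU-bound Kaldi decoding runs in a worker thread; returns a JSON string.
    if rec.AcceptWaveform(frame):
        return rec.Result()
    return rec.PartialResult()

async def handle(websocket):
    loop = asyncio.get_running_loop()
    rec = KaldiRecognizer(model, 16000)
    async for frame in websocket:
        if isinstance(frame, bytes):
            result = await loop.run_in_executor(pool, process_chunk, rec, frame)
            print(json.loads(result))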

@nshmyrev
Contributor

nshmyrev commented Oct 4, 2022

Do you see this delay with the asterisk-audiofork module or with vosk-asterisk?

@Goddard

Goddard commented Oct 4, 2022

Even when running everything locally I get poor results, for example with https://github.com/alphacep/vosk-server/tree/master/websocket-microphone.

Python will claim the transcription process only took milliseconds, but it really takes a few seconds for the data to print to the screen.

Sometimes it takes 4 seconds for the text to be printed to the terminal. I don't think it is a case of Python being slow, because even the websocket-cpp Boost.Beast server appears to lag behind considerably.

The vosk-asterisk plugin appears to be a bit faster, but the transcription ends before the call does, so it isn't very useful.

I just installed using a Python virtual environment and pip requirements.txt on Ubuntu 22.04.

My local machine has a newer Intel CPU with 64 GB of RAM as well:
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Address sizes: 39 bits physical, 48 bits virtual
Byte Order: Little Endian
CPU(s): 16
On-line CPU(s) list: 0-15
Vendor ID: GenuineIntel
Model name: 11th Gen Intel(R) Core(TM) i7-11800H @ 2.30GHz
CPU family: 6
Model: 141
Thread(s) per core: 2
Core(s) per socket: 8
Socket(s): 1
Stepping: 1
CPU max MHz: 4600.0000
CPU min MHz: 800.0000
BogoMIPS: 4608.00
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf tsc_known_freq pni pclmulqdq dtes64 monitor ds_cpl vmx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch cpuid_fault epb cat_l2 invpcid_single cdp_l2 ssbd ibrs ibpb stibp ibrs_enhanced tpr_shadow vnmi flexpriority ept vpid ept_ad fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid rdt_a avx512f avx512dq rdseed adx smap avx512ifma clflushopt clwb intel_pt avx512cd sha_ni avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves split_lock_detect dtherm ida arat pln pts hwp hwp_notify hwp_act_window hwp_epp hwp_pkg_req avx512vbmi umip pku ospke avx512_vbmi2 gfni vaes vpclmulqdq avx512_vnni avx512_bitalg avx512_vpopcntdq rdpid movdiri movdir64b fsrm avx512_vp2intersect md_clear flush_l1d arch_capabilities
Virtualization features:
Virtualization: VT-x
Caches (sum of all):
L1d: 384 KiB (8 instances)
L1i: 256 KiB (8 instances)
L2: 10 MiB (8 instances)
L3: 24 MiB (1 instance)
NUMA:
NUMA node(s): 1
NUMA node0 CPU(s): 0-15
Vulnerabilities:
Itlb multihit: Not affected
L1tf: Not affected
Mds: Not affected
Meltdown: Not affected
Mmio stale data: Not affected
Retbleed: Not affected
Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl and seccomp
Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Spectre v2: Mitigation; Enhanced IBRS, IBPB conditional, RSB filling
Srbds: Not affected
Tsx async abort: Not affected

Only thing I see is WARNING (VoskAPI:CheckMemoryUsage():determinize-lattice-pruned.cc:316) Did not reach requested beam in determinize-lattice: size exceeds maximum 50000000 bytes; (repo,arcs,elems) = (25158432,1108448,23744520), after rebuilding, repo size was 21053120, effective beam was 5.49789 vs. requested beam 6
WARNING (VoskAPI:CheckMemoryUsage():determinize-lattice-pruned.cc:316) Did not reach requested beam in determinize-lattice: size exceeds maximum 50000000 bytes; (repo,arcs,elems) = (27757600,743808,21513912), after rebuilding, repo size was 24994144, effective beam was 4.20504 vs. requested beam 6
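
For what it's worth, those warnings mean the lattice determinization hit its memory cap and narrowed the effective beam below the requested one, which usually points at a fairly wide decoding beam. If not tried already, the decoder settings in model.conf can be tightened to trade a little accuracy for speed; in models that ship with such a file the relevant lines look roughly like the following (which options are present depends on the model, and these values are only illustrative):

--max-active=3000
--beam=10.0
--lattice-beam=2.0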

@Goddard

Goddard commented Oct 5, 2022

For example, using the Boost.Beast WebSocket server example provided, it takes approximately 4 seconds for the speech recognition result to print.

I used the websocket-microphone example connected to a remote Boost.Beast WebSocket server:
INFO:root:{
"text" : "testing testing one two three"
}

But even locally I experience the same thing. Would a GPU be faster than that?
