Add Spiel speech API #1335

Draft: wants to merge 1 commit into base: gtk4

eeejay commented Jun 7, 2024

Spiel is a modern speech synthesis API for the desktop that will hopefully support many kinds of providers and voices. It has GI bindings, so adding it to foliate shouldn't be hard.

I started a port, but ran into some trouble with how Foliate creates SSML mark elements to report speech progress. Spiel has speech boundary events, including SSML marks, but not all providers support them. Unfortunately, I think changes will be needed in foliate-js as well to make this work. Specifically, pre-segmenting the text into marks gets in the way here.

@johnfactotum (Owner)

If you mean supporting providers that don't support mark events but do support word boundary events, that is indeed not something currently supported by foliate-js. I do plan on adding this, since SSML support seems to be problematic in browsers. (Ideally I'd like to switch to the Web Speech API once it's supported by WebKitGTK.)

One slight snag is that the marks are currently also used to implement "speak from here" and pausing, mainly to ensure that the speech always begins from word boundaries. Maybe it could do without this (or decouple this from the speech text), in which case it shouldn't be too hard to do without marks. Just need to maintain a text walker instance (see https://github.com/johnfactotum/foliate-js/blob/main/text-walker.js) to convert string offsets to ranges.
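As a rough illustration of converting a string offset into a position within a run of text nodes, here is a minimal sketch. It is not the API of text-walker.js; `locateOffset` and its `chunks` argument (the text content of each text node, in document order) are hypothetical names chosen for this example.

```javascript
// Hypothetical helper, not part of foliate-js: given the text of each text
// node in document order, find which node a flat string offset falls in.
function locateOffset(chunks, offset) {
    if (offset < 0) return null
    let start = 0
    for (let i = 0; i < chunks.length; i++) {
        const end = start + chunks[i].length
        // The offset belongs to this chunk if it lies before the chunk's end
        if (offset < end) return { index: i, offset: offset - start }
        start = end
    }
    // Offset is at or past the end of all the text
    return null
}

// Example: offset 8 in "Hello brave world" falls in the second chunk
const position = locateOffset(['Hello ', 'brave ', 'world'], 8)
```

In a web view, the returned `index` would select the matching text node and the returned `offset` could be passed to `Range.setStart()` or `Range.setEnd()` to build the DOM range.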

eeejay (Author) commented Jun 8, 2024

Moving it into the web view makes sense. Happy to see that it is at least proposed in WebKitGTK.

Note: because of the way Spiel works, there is no need to track speech progress for pausing purposes. Calling pause will simply pause the stream. So I think SSML marks would need to stay exclusively for Speech Dispatcher support. And yeah, we would need to de-serialize the string offset to a DOM range, which may or may not be trivial?

@johnfactotum (Owner)

> Calling pause will simply pause the stream.

Ideally, I think it should be capable of rewinding to the start of the last word or sentence rather than simply pausing the stream. The current behavior in Foliate isn't good either, because it simply restarts from a word boundary, which would result in incorrect pronunciation or intonation. The same goes for "speak from here" (i.e. starting speech from a user-selected position).
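One possible way to find a rewind point is sketched below, assuming the spoken text is available as a plain string and the last reported boundary is a string offset into it. `rewindOffset` is a hypothetical helper, not something in Foliate or Spiel; it relies on the standard `Intl.Segmenter` API, and the `'en'` locale is an assumption made for the example.

```javascript
// Hypothetical helper: given the text being spoken and the offset where
// speech stopped, return the offset of the start of the enclosing segment.
// `granularity` may be 'word' or 'sentence' (both accepted by Intl.Segmenter).
function rewindOffset(text, offset, granularity = 'sentence') {
    if (text.length === 0) return 0
    // Clamp so an offset at or past the end still maps to the last segment
    const clamped = Math.max(0, Math.min(offset, text.length - 1))
    // The 'en' locale is an assumption; a real caller would pass the book's language
    const segmenter = new Intl.Segmenter('en', { granularity })
    const segment = segmenter.segment(text).containing(clamped)
    return segment ? segment.index : 0
}

// Example: speech stopped in the middle of "are"
const text = 'Hello world. How are you?'
const sentenceStart = rewindOffset(text, 18)
const wordStart = rewindOffset(text, 18, 'word')
```

Restarting from the sentence start rather than the word start would address the intonation concern, at the cost of repeating more text.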

Avoiding partial words isn't really an important feature, though. Mostly it's just what you get "for free" since the text is already segmented. It's probably fine to just drop it.

> And yeah, we would need to de-serialize the string offset to a DOM range, which may or may not be trivial?

For plain text it should be okay. It might be non-trivial when using SSML, when the offsets are those of the SSML source string, because in general mapping source offsets to nodes is difficult if not impossible with the browser's DOM APIs. But since the XML here is controlled and rather simple, maybe one can just count the number of characters between < and > and adjust the offsets accordingly.
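The character-counting idea could look roughly like the sketch below. `ssmlOffsetToTextOffset` is a hypothetical name, and the sketch assumes the SSML is simple and controlled: no `<` or `>` inside attribute values, no comments or CDATA, and no character entities such as `&amp;` (which would each count as several source characters).

```javascript
// Hypothetical helper: convert an offset into the SSML source string to an
// offset into the plain text, by skipping everything between '<' and '>'.
// Assumes simple, controlled markup with no entities, comments, or CDATA.
function ssmlOffsetToTextOffset(ssml, sourceOffset) {
    let textOffset = 0
    let inTag = false
    const end = Math.min(sourceOffset, ssml.length)
    for (let i = 0; i < end; i++) {
        const ch = ssml[i]
        if (ch === '<') inTag = true
        else if (ch === '>') inTag = false
        // Only characters outside of tags contribute to the text offset
        else if (!inTag) textOffset++
    }
    return textOffset
}

// Example: where does "world" start in the plain text "Hello world"?
const ssml = '<speak><mark name="0"/>Hello <mark name="1"/>world</speak>'
const textOffset = ssmlOffsetToTextOffset(ssml, ssml.indexOf('world'))
```

The resulting plain-text offset could then be mapped to a DOM range in the same way as for plain text.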
