Add voices from Microsoft Edge #30

Open
HappyMac3920 opened this issue Jan 15, 2024 · 9 comments
Assignees
Labels
enhancement New feature or request

Comments

@HappyMac3920

Microsoft Edge has a feature called Immersive Reader that can read web pages aloud to the user. Before you ask: yes, it has the same voices as Bing Translator, except the Edge API has even more of them (such as en-US-AndrewNeural and en-US-AvaNeural, although it is important to note that the API might not have every voice; fi-FI-SelmaNeural, for example, appears to be missing), and the audio quality is better than that of the Bing Translator API.
This has already been reverse-engineered in https://github.com/rany2/edge-tts so if you plan on implementing it on the website, this can give you a head start.

@HappyMac3920
Author

Microsoft Clipchamp also has a wider range of Azure TTS voices (even multilingual AI ones!) which work even with a free subscription, but they require an authorization token and an account to use. :(

@chrisjp chrisjp self-assigned this Jan 26, 2024
@chrisjp chrisjp added the enhancement New feature or request label Jan 26, 2024
@chrisjp
Owner

chrisjp commented Jan 26, 2024

Of all the requests I get to add voices this looks like the most reasonable and realistic one to look into next. I appreciate the link to rany2's edge-tts project as I suspect we can borrow some stuff from that, namely the client token it seems to require and the URLs it sends requests to.

As for Clipchamp, I'm unfamiliar with this but if it's like you say and requires an account and auth token that could potentially make things more difficult.

Can't promise a timeframe for when this gets done but I will see what I can do.

@tobybear

Just as a side note, there are other implementations of the edge-tts/readaloud interface in other languages (like JS) that may be more suitable for you in case you don't like Python. In any case, be aware that, unlike all the other sites, this cannot be implemented using simple HTTP requests alone: you will need to implement a streaming interface using websockets. It can be (and has been) done, but it is a bit more tricky. :)

@HappyMac3920
Author

HappyMac3920 commented Jan 27, 2024

> Of all the requests I get to add voices this looks like the most reasonable and realistic one to look into next. I appreciate the link to rany2's edge-tts project as I suspect we can borrow some stuff from that, namely the client token it seems to require and the URLs it sends requests to.
>
> As for Clipchamp, I'm unfamiliar with this but if it's like you say and requires an account and auth token that could potentially make things more difficult.
>
> Can't promise a timeframe for when this gets done but I will see what I can do.

Well, Clipchamp does require an auth token. If I had to guess, the token comes from an account, but I wasn't able to find out where it is fetched.
Also, in the voice list I did find 4 multilingual voices (fr-FR-VivienneMultilingualNeural, fr-FR-RemyMultilingualNeural, de-DE-FlorianMultilingualNeural and de-DE-SeraphinaMultilingualNeural) that do work, but there is a caveat. When I tried English text on de-DE-FlorianMultilingualNeural, it synthesized correctly in English. When I tried Hungarian text through edge-tts, it did not say the text properly, while Clipchamp read the same text correctly in Hungarian. I thought it must be some sort of limitation imposed by the Edge TTS servers. Turns out I was wrong: on multilingual voices, Hungarian text works, but the language is not always detected properly.

@HappyMac3920
Author

> Just as a side note, there are other implementations of the edge-tts/readaloud interface in other languages (like JS) that may be more suitable for you in case you don't like Python. In any case, be aware that, unlike all the other sites, this cannot be implemented using simple HTTP requests alone: you will need to implement a streaming interface using websockets. It can be (and has been) done, but it is a bit more tricky. :)

I assume you are referring to https://github.com/Migushthe2nd/MsEdgeTTS but I am not aware of other similar projects that use JavaScript.

@tobybear

tobybear commented Jan 27, 2024

There are some JS-based Greasemonkey/Tampermonkey browser plugins, if I remember correctly. Also, several Chinese sites have source code for communicating with the MS servers using JS. It is probably best to search for the websocket URL endpoint (or the token) as seen in the Python project to find similar projects.
I played with several of these a month ago, but abandoned them for easier-to-use TTS sites.

@HappyMac3920
Author

> Microsoft Clipchamp also has a wider range of Azure TTS voices (even multilingual AI ones!) which work even with a free subscription, but they require an authorization token and an account to use. :(

Copilot's voice is another Microsoft TTS voice that (I think) cannot be implemented on this website because it sends a message ID to the server, not plain text. It is actually multilingual, though.

@HappyMac3920
Author

I am writing this here because it is a bit related to the MS Edge TTS API.
I dug deeper into the Clipchamp TTS API, and here is what I found:
The API first requests a "token" from https://app.clipchamp.com/v2/azure-cognitive/auth-token with my account's auth token required in the headers. If successful, the response is JSON containing a JWT client token (the one needed for TTS) and a region, in my case "eastus", although I live in Hungary (there might be other regions too).
So we have two tokens now, an account token, and a TTS token.
The voice list is in https://eastus.tts.speech.microsoft.com/cognitiveservices/voices/list but that also requires the account auth token.
A voice request is made as a websocket request, kind of similar to the MS Edge one. The URL is wss://eastus.tts.speech.microsoft.com/cognitiveservices/websocket/v1?Authorization=ttstokengoeshere&X-ConnectionId=connectionidgoeshere
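A minimal sketch of how a URL in that shape could be assembled. The function name is hypothetical, and generating the connection ID as a random 32-character hex string is an assumption (the thread only shows a placeholder for it); nothing here opens a connection.

```python
import uuid
from typing import Optional
from urllib.parse import urlencode


def build_tts_websocket_url(region: str, tts_token: str,
                            connection_id: Optional[str] = None) -> str:
    """Assemble a wss:// URL in the shape quoted in the comment above."""
    if connection_id is None:
        # Assumption: a random hex string is acceptable as a connection ID.
        connection_id = uuid.uuid4().hex
    # urlencode percent-escapes anything that is not URL-safe in the values.
    query = urlencode({
        "Authorization": tts_token,
        "X-ConnectionId": connection_id,
    })
    return (
        f"wss://{region}.tts.speech.microsoft.com"
        f"/cognitiveservices/websocket/v1?{query}"
    )
```

Calling `build_tts_websocket_url("eastus", "ttstokengoeshere", "connectionidgoeshere")` reproduces the URL quoted above.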
SSML is supported in requests, so the rate and the pitch can be modified.
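For illustration, an SSML document that sets rate and pitch might be built like this. This is only a sketch: `build_ssml` is a hypothetical helper, and the default values (`+0%`, `+0Hz`, `en-US`) are assumptions rather than anything captured from Clipchamp's actual requests.

```python
from xml.sax.saxutils import escape, quoteattr


def build_ssml(text: str, voice: str, rate: str = "+0%",
               pitch: str = "+0Hz", lang: str = "en-US") -> str:
    """Wrap text in speak/voice/prosody elements with rate and pitch set."""
    return (
        "<speak version=\"1.0\" "
        "xmlns=\"http://www.w3.org/2001/10/synthesis\" "
        f"xml:lang={quoteattr(lang)}>"
        f"<voice name={quoteattr(voice)}>"
        f"<prosody rate={quoteattr(rate)} pitch={quoteattr(pitch)}>"
        # Escape &, < and > so user text cannot break the markup.
        f"{escape(text)}"
        "</prosody></voice></speak>"
    )
```

For example, `build_ssml("Hello", "de-DE-FlorianMultilingualNeural", rate="+10%", pitch="-5Hz")` yields a single-line SSML string with the prosody attributes filled in.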
Also, I did see similarities in the request form (metadata and audio output):
Clipchamp requests:
{"synthesis":{"audio":{"metadataOptions":{"bookmarkEnabled":false,"punctuationBoundaryEnabled":"false","sentenceBoundaryEnabled":"false","sessionEndEnabled":true,"visemeEnabled":false,"wordBoundaryEnabled":"false"},"outputFormat":"audio-24khz-48kbitrate-mono-mp3"},"language":{"autoDetection":false}}}
MS Edge requests:
{"context":{"synthesis":{"audio":{"metadataoptions":{"sentenceBoundaryEnabled":false,"wordBoundaryEnabled":true},"outputFormat":"audio-24khz-48kbitrate-mono-mp3"}}}}
Notice that the audio output format requested is the same on both services.
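The comparison can be checked mechanically. The sketch below parses the two payloads exactly as quoted above and looks up keys case-insensitively, since Clipchamp sends `metadataOptions` while Edge sends `metadataoptions` and wraps everything in `context`. The helper name `find_key` is made up for this example.

```python
import json

CLIPCHAMP_CONFIG = '{"synthesis":{"audio":{"metadataOptions":{"bookmarkEnabled":false,"punctuationBoundaryEnabled":"false","sentenceBoundaryEnabled":"false","sessionEndEnabled":true,"visemeEnabled":false,"wordBoundaryEnabled":"false"},"outputFormat":"audio-24khz-48kbitrate-mono-mp3"},"language":{"autoDetection":false}}}'
EDGE_CONFIG = '{"context":{"synthesis":{"audio":{"metadataoptions":{"sentenceBoundaryEnabled":false,"wordBoundaryEnabled":true},"outputFormat":"audio-24khz-48kbitrate-mono-mp3"}}}}'


def find_key(obj, name):
    """Depth-first search for a key, ignoring case; returns None if absent."""
    if isinstance(obj, dict):
        for key, value in obj.items():
            if key.lower() == name.lower():
                return value
            found = find_key(value, name)
            if found is not None:
                return found
    return None


clipchamp = json.loads(CLIPCHAMP_CONFIG)
edge = json.loads(EDGE_CONFIG)

same_format = find_key(clipchamp, "outputFormat") == find_key(edge, "outputFormat")
print("same output format:", same_format)
# Clipchamp sends some flags as the string "false", Edge as real booleans.
print("clipchamp wordBoundaryEnabled:", repr(find_key(clipchamp, "wordBoundaryEnabled")))
print("edge wordBoundaryEnabled:", repr(find_key(edge, "wordBoundaryEnabled")))
```

Besides the shared output format, this surfaces one more difference: several of Clipchamp's boundary flags are strings rather than booleans.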
I am not familiar with how websockets work, but I do know quite a few things about JWTs. Notably, Apple's server that hosts software updates for their products responds with JWTs that include the update URLs, and I have messed with those quite a few times, especially decoding the base64 payload.
So overall, it is very similar to the MS Edge API, but because it requires authorization, I think the Clipchamp TTS API would most likely be difficult to implement. As I mentioned in my second post, Clipchamp TTS is free, so it does not require a subscription at all to use, only a Clipchamp account (which can come from a Microsoft account, a Google account, or an e-mail address).
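Decoding a JWT payload like the one mentioned above takes only a few lines. This is a sketch, not a validator: it does not verify the signature, and the claims in the demo token (`region`, `exp`) are invented for illustration, since the thread does not show what the real Clipchamp token contains.

```python
import base64
import json


def decode_jwt_payload(token: str) -> dict:
    """Return the claims from a JWT's middle segment (no signature check)."""
    parts = token.split(".")
    if len(parts) != 3:
        raise ValueError("expected header.payload.signature")
    payload = parts[1]
    # JWTs strip base64 padding; restore it before decoding.
    payload += "=" * (-len(payload) % 4)
    return json.loads(base64.urlsafe_b64decode(payload))


def make_demo_token(claims: dict) -> str:
    """Build an unsigned demo token so the decoder can be tried offline."""
    def segment(obj: dict) -> str:
        raw = json.dumps(obj).encode("utf-8")
        return base64.urlsafe_b64encode(raw).rstrip(b"=").decode("ascii")
    return ".".join([segment({"alg": "none"}), segment(claims), "signature"])


# Hypothetical claims, purely for demonstration.
demo = make_demo_token({"region": "eastus", "exp": 1700000000})
print(decode_jwt_payload(demo))
```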


3 participants