Text To Speech

Convert text to audio in near real time, play it back, and save it as a file for later use. Text to Speech is available in both Neural and Standard versions. Applying the latest in digital speech innovation, the Neural Text to Speech capability makes the voices of your apps nearly indistinguishable from recordings of people.

Are you ready to start dictating your documents and text using just your voice? Instead of offering separated dictation or speech-to-text capabilities, Windows 10 conveniently groups its voice commands under Speech Recognition, which interprets the spoken word across the operating system for a variety of tasks. We’ll teach you how to get everything ready, as well as how to enable speech-to-text in Windows 10, so you can start chatting away to your favorite OS and improve Windows’ “ear” for your voice.

Note: Speech recognition is currently available only in English, French, Italian, Spanish, German, Japanese, Portuguese, Simplified Chinese, and Traditional Chinese.

Finding a mic

The first step is to make sure you have the right hardware for speech-to-text options. These days, you may not think much about this step — after all, nearly all devices today come with built-in mics.

The problem here is one of quality. While built-in mics work well for simpler tasks, such as Skype conversations and quick voice commands, you have to consider distortion and mic quality if you really want to capitalize on speech-to-text. In the past, Microsoft has warned that its speech-recognition features are best suited for headset microphones that interpret sounds with greater clarity and are less susceptible to ambient noise. If you’re serious about using speech recognition for Windows 10, it’s a good idea to pick up a headset that is compatible with your computer.

If you are going to buy hardware, do it sooner rather than later, as the speech features tend to work best if you don’t switch devices after training. If you do decide to get a new mic, follow these steps to make sure Windows knows you want to use it over any previous microphone you may have had:

Step 1: Using the Windows 10 search box, type in microphone. This will allow you to go directly to the Set up a microphone section of your Control Panel.

Step 2: Windows will ask you what the problem with using Cortana is. Select Cortana can’t hear me.

Step 3: From the list that appears, choose your new microphone. Then select Set up the mic.

Step 4: Follow the on-screen prompts and repeat the spoken phrases to help calibrate your microphone for speech-to-text.

Setting up speech recognition

With your mic ready, it’s time to start configuring your various speech-recognition capabilities. In Windows 10, this is a more seamless process than it has been in the past. These steps and tutorials will affect an array of Windows programs, but you may also want to make sure dictation is enabled in any writing apps that you prefer to use. Begin with the steps below.

Step 1: Open Windows 10’s Control Panel by searching for it in the Windows search box.

Step 2: Click the menu for Ease of Access, and then click Start speech recognition. Follow the on-screen instructions to set up your microphone.

Step 3: You can set up document review if you want, though it’s worth reading the privacy statement that goes along with it before making that decision.

Step 4: Decide whether you want speech-to-text to be activated with a keyboard or vocal command and click Next. Use the reference sheet to familiarize yourself with commands you can make and continue through the other preferences. You can also run the tutorial to give you an idea of how it all works.

You should now be ready to go. You can enable or disable speech to text by pressing Ctrl + Windows key at any time.

Training your computer and more

At this point, you can venture into Windows docs and use speech-to-text with a variety of Microsoft files. You’re all set! However, you may want to improve Windows’ voice-recognition capabilities even further. Microsoft’s latest software has the ability to learn your voice with a little training, and that can really pay off after a few sessions.

Step 1: Navigate back to the Ease of Access menu and select Speech recognition.

Step 2: Choose Train your computer to better understand you.

Step 3: You will be given the task of reading out extended sequences of text to help Windows better understand your voice. By the end of it, it should have a better grasp of your particular accent and vocal traits.

Also, note the option at the bottom of the speech-recognition menu that allows access to the Speech Reference Card. This gives you all the vocal shortcuts you need to get around in a small side screen/printout. It’s a great tool for beginners who also want to control programs and software commands with their voices.


Text-to-speech from Azure Speech Services is a service that enables your applications, tools, or devices to convert text into natural human-like synthesized speech. Choose from standard and neural voices, or create your own custom voice unique to your product or brand. More than 75 standard voices are available in more than 45 languages and locales, and 5 neural voices are available in 4 languages and locales. For a full list, see supported languages.

Text-to-speech technology allows content creators to interact with their users in different ways. Text-to-speech can improve accessibility by providing users with an option to interact with content audibly. Whether the user has a visual impairment, a learning disability, or requires navigation information while driving, text-to-speech can improve an existing experience. Text-to-speech is also a valuable add-on for voice bots and virtual assistants.

Standard voices

Standard voices are created using Statistical Parametric Synthesis and/or Concatenation Synthesis techniques. These voices are highly intelligible and sound natural. You can easily enable your applications to speak in more than 45 languages, with a wide range of voice options. These voices provide high pronunciation accuracy, including support for abbreviations, acronym expansions, date/time interpretations, polyphones, and more. Use standard voice to improve accessibility for your applications and services by allowing users to interact with your content audibly.

Neural voices

Neural voices use deep neural networks to overcome the limits of traditional text-to-speech systems in matching the patterns of stress and intonation in spoken language, and in synthesizing the units of speech into a computer voice. Standard text-to-speech breaks down prosody into separate linguistic analysis and acoustic prediction steps that are governed by independent models, which can result in muffled voice synthesis. Our neural capability does prosody prediction and voice synthesis simultaneously, which results in a more fluid and natural-sounding voice.

Neural voices can be used to make interactions with chatbots and virtual assistants more natural and engaging, convert digital texts such as e-books into audiobooks and enhance in-car navigation systems. With the human-like natural prosody and clear articulation of words, neural voices significantly reduce listening fatigue when you interact with AI systems.

Neural voices support different styles, such as neutral and cheerful. For example, the Jessa (en-US) voice can speak cheerfully, which is optimized for warm, happy conversation. You can adjust the voice output, like tone, pitch, and speed using Speech Synthesis Markup Language. For a full list of available voices, see supported languages.

To learn more about the benefits of neural voices, see Microsoft’s new neural text-to-speech service helps machines speak like people.

Custom voices

Voice customization lets you create a recognizable, one-of-a-kind voice for your brand. To create your custom voice font, you make a studio recording and upload the associated scripts as the training data. The service then creates a unique voice model tuned to your recording. You can use this custom voice font to synthesize speech. For more information, see custom voices.

Speech Synthesis Markup Language (SSML)

Speech Synthesis Markup Language (SSML) is an XML-based markup language that lets developers specify how input text is converted into synthesized speech using the text-to-speech service. Compared to plain text, SSML allows developers to fine-tune the pitch, pronunciation, speaking rate, volume, and more of the text-to-speech output. Normal punctuation behavior, such as pausing after a period or using the correct intonation when a sentence ends with a question mark, is handled automatically.

All text inputs sent to the text-to-speech service must be structured as SSML. For more information, see Speech Synthesis Markup Language.
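As a rough illustration of what such a document looks like, the Python sketch below wraps plain text in speak and voice elements and optionally adds a prosody element for rate and pitch. The helper name build_ssml and the voice name used in the usage line are illustrative assumptions, not part of the service; check the SSML reference for the attributes and voice names the service actually accepts.

```python
from xml.sax.saxutils import escape, quoteattr

# Namespace defined by the W3C SSML 1.0 specification.
SSML_NS = "http://www.w3.org/2001/10/synthesis"


def build_ssml(text, voice_name, lang="en-US", rate=None, pitch=None):
    """Return an SSML string that wraps `text` in <speak> and <voice>.

    `rate` and `pitch` are optional prosody values (for example "+10%").
    `voice_name` is passed through unchanged; valid names come from the
    service's list of supported voices (not validated here).
    """
    body = escape(text)  # escape &, <, > so the result stays well-formed XML

    prosody_attrs = []
    if rate is not None:
        prosody_attrs.append(f"rate={quoteattr(rate)}")
    if pitch is not None:
        prosody_attrs.append(f"pitch={quoteattr(pitch)}")
    if prosody_attrs:
        body = f"<prosody {' '.join(prosody_attrs)}>{body}</prosody>"

    return (
        f'<speak version="1.0" xmlns="{SSML_NS}" xml:lang={quoteattr(lang)}>'
        f"<voice name={quoteattr(voice_name)}>{body}</voice>"
        "</speak>"
    )


if __name__ == "__main__":
    # "example-voice" is a placeholder, not a real voice name.
    print(build_ssml("Hello, world!", "example-voice", rate="+10%"))
```

Building the string with escape and quoteattr keeps user-supplied text from breaking the markup, which matters because the service expects well-formed XML.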

Pricing note

When using the text-to-speech service, you are billed for each character that is converted to speech, including punctuation. While the SSML document itself is not billable, optional elements that are used to adjust how the text is converted to speech, like phonemes and pitch, are counted as billable characters. Here's a list of what's billable:

Text passed to the text-to-speech service in the SSML body of the request
All markup within the text field of the request body in the SSML format, except for <speak> and <voice> tags
Letters, punctuation, spaces, tabs, markup, and all white-space characters
Every code point defined in Unicode

For detailed information, see Pricing.

Important

Each Chinese, Japanese, and Korean language character is counted as two characters for billing.
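To get a feel for how these rules add up, the Python sketch below estimates the billable count for a piece of text: every character counts once, and Chinese, Japanese, and Korean characters count twice. It is only an approximation. The function name and the Unicode ranges chosen to stand for "Chinese, Japanese, and Korean" are assumptions made for illustration, the input is assumed to be the billable portion of the request (not the full SSML document), and the authoritative count is whatever the service reports.

```python
# Assumed Unicode ranges treated as Chinese, Japanese, or Korean characters.
# This is an illustrative subset, not the service's definition.
CJK_RANGES = (
    (0x3040, 0x30FF),  # Hiragana and Katakana
    (0x4E00, 0x9FFF),  # CJK Unified Ideographs
    (0xAC00, 0xD7AF),  # Hangul Syllables
)


def is_cjk(ch):
    """Return True if the character falls in one of the assumed CJK ranges."""
    cp = ord(ch)
    return any(lo <= cp <= hi for lo, hi in CJK_RANGES)


def estimate_billable_characters(text):
    """Count each character once, and each assumed-CJK character twice.

    Spaces, tabs, punctuation, and any markup left in `text` are counted
    like any other character.
    """
    return sum(2 if is_cjk(ch) else 1 for ch in text)


if __name__ == "__main__":
    for sample in ("Hello, world!", "你好"):
        print(repr(sample), estimate_billable_characters(sample))
```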

Core features

This table lists the core features for text-to-speech:

Use case	SDK	REST
Convert text to speech.	Yes	Yes
Upload datasets for voice adaptation.	No	Yes*
Create and manage voice font models.	No	Yes*
Create and manage voice font deployments.	No	Yes*
Create and manage voice font tests.	No	Yes*
Manage subscriptions.	No	Yes*

* These services are available using the cris.ai endpoint. See Swagger reference. The custom voice training and management APIs are throttled to 25 requests per 5 seconds, while the speech synthesis API itself is throttled to a maximum of 200 requests per second. When throttling occurs, you'll be notified via message headers.
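One way to stay under these limits is to pace requests on the client before they are sent. The Python sketch below is a minimal sliding-window limiter, shown with the 25-requests-per-5-seconds figure quoted above. The class and method names are hypothetical, and this is a client-side convenience only; it does not replace handling the throttling response the service sends.

```python
import time
from collections import deque


class SlidingWindowLimiter:
    """Allow at most `max_requests` calls in any `window_seconds` span."""

    def __init__(self, max_requests, window_seconds, clock=time.monotonic):
        self.max_requests = max_requests
        self.window_seconds = window_seconds
        self._clock = clock      # injectable so the limiter can be tested
        self._sent = deque()     # timestamps of requests still in the window

    def try_acquire(self):
        """Return True and record the request if it fits in the window."""
        now = self._clock()
        # Drop timestamps that have aged out of the window.
        while self._sent and now - self._sent[0] >= self.window_seconds:
            self._sent.popleft()
        if len(self._sent) < self.max_requests:
            self._sent.append(now)
            return True
        return False


if __name__ == "__main__":
    limiter = SlidingWindowLimiter(max_requests=25, window_seconds=5.0)
    allowed = sum(limiter.try_acquire() for _ in range(30))
    print(f"{allowed} of 30 immediate requests allowed")
```

A caller would check try_acquire() before each management API call and wait briefly when it returns False.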

Get started with text to speech

We offer quickstarts designed to have you running code in less than 10 minutes. The tables below list the text-to-speech quickstarts, organized by language.

SDK quickstarts

Quickstart (SDK)	Platform	API reference
C#, .NET Framework	Windows	Browse
C++	Windows	Browse
C++	Linux	Browse

REST quickstarts

Quickstart (REST)	Platform	API reference
C#, .NET Core	Windows, macOS, Linux	Browse
Node.js	Windows, macOS, Linux	Browse
Python	Windows, macOS, Linux	Browse
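For orientation before opening a REST quickstart, the Python sketch below prepares (but does not send) a synthesis request using only the standard library. The endpoint pattern, header names, and output format string are written from memory of the REST reference and should be treated as assumptions to verify there; the function name, region, and key are placeholders.

```python
import urllib.request


def build_tts_request(ssml, region, subscription_key,
                      output_format="riff-24khz-16bit-mono-pcm"):
    """Prepare a POST request carrying an SSML body.

    Assumptions to verify against the REST reference: the URL pattern,
    the subscription-key header, and the output format value.
    """
    url = f"https://{region}.tts.speech.microsoft.com/cognitiveservices/v1"
    headers = {
        "Ocp-Apim-Subscription-Key": subscription_key,  # assumed auth header
        "Content-Type": "application/ssml+xml",
        "X-Microsoft-OutputFormat": output_format,
        "User-Agent": "tts-sketch",                      # arbitrary client name
    }
    return urllib.request.Request(
        url, data=ssml.encode("utf-8"), headers=headers, method="POST"
    )


if __name__ == "__main__":
    # Placeholder SSML, region, and key; nothing is sent over the network.
    ssml = (
        '<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis" '
        'xml:lang="en-US"><voice name="example-voice">Hello</voice></speak>'
    )
    request = build_tts_request(ssml, "westus", "YOUR_KEY")
    print(request.get_method(), request.full_url)
```

Sending the request would be a matter of passing it to urllib.request.urlopen and writing the response bytes to an audio file, with error handling for throttling and authentication failures.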

Sample code

Sample code for text-to-speech is available on GitHub. These samples cover text-to-speech conversion in most popular programming languages.