High quality, emotional, fluent and variable Text-to-Speech engine?

2023-03-14 08:02 Q&A author:

After looking at some services and tools, I've come to a conclusion: most Text-to-Speech tools have overly techy, robotic (in other words, bad-quality) voices.

On top of that, they seem to come with "hard-coded" voice templates, which limits variety and customization. Some tools let you set the reading speed and pitch, but that's not enough.

My guess about the problem behind the emotional aspect: it's hard to judge emotions from plain text, even more so if it's just a sentence or two. Plus, the good ol' PC is a machine, and machines don't have emotions, but that's a different story.

The thing that bothers me the most is quality. For example, there are tools out there that tend to cut off the apex of words, resulting in these techy voices. It feels like there's a problem with sentence construction or something. And while people are working on such tools, I wonder what keeps them from working a little more to improve them... cutting off the apex is not a small deal! Plus, keep in mind that a good, high-quality Text-to-Speech program is worth, well... A LOT, which would make it a pretty profitable product.

Oh, under fluency I'm including questions, exclamations and so on. (It's possible those don't belong under fluency; I'm not a native English speaker, so please excuse me if that's the case.)

A list of tools I've looked into:

Quite impressive, but still have room for improvement (++)

- Loquendo: lacks voice variety, has some minor apex/fluency problems (depending on the sentence), and too much coughing and excuses in the examples!

- Nuance Vocalizer: still lacks variety, but some of the provided voices are worthy.

Could just as well cooperate to get more resources rather than work on different but almost equal products (--)

- eSpeak: one of the best robots out there, hence the program logo(?!)

- Natural Reader (dumb autoplay!!): well, it has some fluency, but that techy feeling still kicks in.

- iSpeech: good for a laugh when setting the voice to Japanese with English text. I bet Japanese guys aren't very happy about it.

- Cepstral + Enhanced Voices ... plus the enhanced voices give the good ol' crappy result, so, except for ~5 more voices, nothing has been enhanced.

- AT&T: decent fluency, but has problems with sentence endings and too much robo!

- LumenVox TTS: seems to come from a background with lots of speech tools, but still results in robotic voices.

- And some more...

In case I've missed something worth a look, please share. Can be free, commercial, super expensive... as long as it works, I'm interested!

And the question(s):

What do you think are the main issues behind the quality, fluency and variety of those voices? Since the emotional aspect is hard to judge, I don't mind if you skip it, but if you have an idea or two, I wouldn't mind if you shared your thoughts.
How is text transformed into speech? Like, what algorithms are used behind these tools? Maybe a fresh theory or two could come in handy.
Are those actually different engines/drivers, or just different voice patterns for the same driver/engine?
Is it just me, or has the quality of Text2Speech tools not changed much (or at all) since the first ones appeared? I have to admit that this old-school Apple tool provides better results than some of the year-2000+ tools, at least when comparing the video with what I've looked into.

I don't know if you're looking for an open solution, but if you have a Mac, you should check out OS X advanced speech markup and the "Repeat After Me" phrase building tool. It's really powerful. The Alex voice built into Mac OS X 10.5 and later is more advanced than the other voices.

On a Mac, highlight the following text, control-click, and go to Speech > Start Speaking:

You talkin' to me
[[inpt PHON]] [[slnc 500]] [[rate -30]]
+yUW _1tAOl=kIHn ~AX [[pbas +3]]+mIY?

http://www.mattmontag.com/personal/mac-os-x-speech-synthesis-markup
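If you would rather script this than use the context menu, here is a minimal Python sketch. It only builds the same `[[slnc ...]]` and `[[rate ...]]` commands shown above and hands the result to the macOS `say` command-line tool. The helper names `with_speech_commands` and `speak` are made up for illustration, and I'm assuming `say` interprets the embedded commands the same way the Speech menu does for the selected voice.

```python
import shutil
import subprocess


def with_speech_commands(text, rate=None, pause_ms=None):
    """Prefix text with embedded speech commands in the [[...]] form used above.

    Hypothetical helper: rate is a relative change (e.g. -30), pause_ms is a
    silence in milliseconds inserted before the text.
    """
    parts = []
    if pause_ms is not None:
        parts.append(f"[[slnc {pause_ms}]]")
    if rate is not None:
        # Always emit an explicit sign, matching the "[[rate -30]]" example.
        parts.append(f"[[rate {rate:+d}]]")
    parts.append(text)
    return " ".join(parts)


def speak(markup):
    """Pass the markup to `say` if it is installed; return True if it ran."""
    say = shutil.which("say")
    if say is None:
        # Not on a Mac (or `say` is not on PATH): nothing to do.
        return False
    subprocess.run([say, markup], check=True)
    return True
```

Calling `speak(with_speech_commands("You talking to me?", rate=-30, pause_ms=500))` on a Mac should pause for half a second and then read the sentence at a slower rate. On other systems `speak` simply returns `False`.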

The TTS used by Google Translate is quite good for short phrases, though likely to produce an unnatural intonation contour for anything complicated. Still, at the word level, it's impressive. There is a small code example here

And there's Ivona. They might make slightly more articulation errors than, e.g., Google Translate, but they do somewhat better on rhythm and intonation. Check out their 'Raveena' voice; it's one of their best yet.

I know that this is an old question, but I just saw the demo of "Watson" from IBM, and it's pretty impressive! They support several languages, and you can control tone, pauses, intonation and some other variables.

You should go and take a look if you are still looking for this, or if any other person is looking for a good TTS.

Disclaimer: I don't work for IBM or anything related to this product, I just found it impressive!
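To give an idea of what controlling pauses and intonation usually looks like: services of this kind typically accept SSML, the W3C speech markup. Below is a small Python sketch that builds such a document with the standard library. `build_ssml` and its parameters are hypothetical names of my own, and I'm assuming the service accepts the standard `<break>` and `<prosody>` elements, so check its documentation for which attributes it actually honors.

```python
import xml.etree.ElementTree as ET


def build_ssml(before, after, pause_ms=500, rate="slow", pitch="+2st"):
    """Build an SSML string: plain text, a pause, then text with altered prosody.

    Hypothetical helper; the element and attribute names come from the SSML
    standard, not from any particular vendor's API.
    """
    speak = ET.Element("speak")
    speak.text = before
    # <break time="..."/> inserts a silence of the given length.
    ET.SubElement(speak, "break", {"time": f"{pause_ms}ms"})
    # <prosody> changes speaking rate and pitch for the enclosed text.
    prosody = ET.SubElement(speak, "prosody", {"rate": rate, "pitch": pitch})
    prosody.text = after
    # ElementTree escapes characters such as & and < in the text for us.
    return ET.tostring(speak, encoding="unicode")


ssml = build_ssml("You talking", "to me?")
```

The resulting string is what you would send as the text to synthesize. Building it with an XML library instead of string concatenation keeps the markup well-formed even when the input text contains special characters.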
