
Non-Speech Noise or Sound Recognition Software?

I'm working on some software for children, and looking to add the ability for the software to respond to a number of non-speech sounds. For instance, clapping, barking, whistling, fart noises, etc.

I've used CMU Sphinx and the Windows Speech API in the past; however, as far as I can tell, neither of these has any support for non-speech noises, and in fact I believe they actively filter them out.

In general I'm looking for "how do I get this functionality?", but I suspect it may help if I break it down into three questions that are my guesses for what to search for next:

  1. Is there a way to use one of the main speech recognition engines to recognize non-word sounds by changing an acoustic model or pronunciation lexicon?
  2. (or) Is there already an existing library to do non-word noise recognition?
  3. (or) I have a bit of familiarity with Hidden Markov Models and the underlying tech of voice recognition from college, but no good estimate on how difficult it would be to create a very small noise/sound recognizer from scratch (suppose <20 noises to be recognized). If 1) and 2) fail, any estimation on how long it would take to roll my own?

Thanks


Yes, you can use speech recognition software like CMU Sphinx to recognize non-speech sounds. To do this, you need to create your own acoustic and language models and define a lexicon restricted to your task. But to train the corresponding acoustic model, you must have enough training data with the sounds of interest annotated.

In short, the sequence of steps is the following:

First, prepare resources for training: the lexicon, dictionary, etc. The process is described here: http://cmusphinx.sourceforge.net/wiki/tutorialam. But in your case, you need to redefine the phoneme set and the lexicon. Namely, you should model the fillers as real words (so, no ++ around them), and you don't need to define the full phoneme set. There are many possibilities, but probably the simplest one is to have a single model for all speech phonemes. Your lexicon will then look like:

CLAP CLAP
BARK BARK
WHISTLE WHISTLE
FART FART
SPEECH SPEECH
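
The phone list that pairs with this lexicon (the .phone file from the tutorial linked above) would then simply enumerate the same units plus silence; a minimal sketch:

CLAP
BARK
WHISTLE
FART
SPEECH
SIL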

Second, prepare training data with labels: something similar to VoxForge, but the text annotations must contain only labels from your lexicon. Of course, the non-speech sounds must be labeled correctly as well. A good question here is where to get a large enough amount of such data, but I guess it should be possible.
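
A minimal sketch of what the training transcripts could look like, following the .transcription format from the acoustic model tutorial (the recording IDs here are made up):

<s> CLAP CLAP SPEECH </s> (rec_0001)
<s> BARK WHISTLE </s> (rec_0002)
<s> SPEECH FART SPEECH </s> (rec_0003)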

Having that, you can train your model. The task is simpler compared to speech recognition; for instance, you don't need to use triphones, just monophones.
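
If you train with SphinxTrain, context-dependent training can be switched off in sphinx_train.cfg. Whether your version exposes exactly this flag is an assumption worth checking against your generated config template:

# sphinx_train.cfg (assumed flag name; check your template)
$CFG_CD_TRAIN = 'no';   # stay with context-independent (monophone) models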

Assuming an equal prior probability for each sound (and for speech), the simplest language model can be a loop-like grammar (http://cmusphinx.sourceforge.net/wiki/tutoriallm):

#JSGF V1.0;
/**
 * JSGF grammar looping over the sound labels
 */
grammar foo;
public <foo> = (CLAP | BARK | WHISTLE | FART | SPEECH)+ ;

This is the very basic approach to using an ASR toolkit for your task. It can be further improved by fine-tuning the HMM configurations, using statistical language models, and using finer-grained phoneme modeling (e.g. distinguishing vowels and consonants instead of having a single SPEECH model; it depends on the nature of your training data).
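
To make the pipeline concrete, here is a decoding sketch using the pocketsphinx Python bindings. The model, dictionary, and grammar paths are placeholders for the files produced in the steps above, and the exact API differs between pocketsphinx versions, so treat this as an outline rather than the definitive call sequence:

from pocketsphinx.pocketsphinx import Decoder

# Placeholder paths: the trained monophone model, the lexicon and
# the loop grammar from above (assumed layout; adjust to your setup).
config = Decoder.default_config()
config.set_string('-hmm', 'model_parameters/sounds.ci_cont')
config.set_string('-dict', 'etc/sounds.dic')
config.set_string('-jsgf', 'etc/sounds.jsgf')
decoder = Decoder(config)

decoder.start_utt()
with open('clip.raw', 'rb') as f:  # 16 kHz, 16-bit mono PCM
    while True:
        buf = f.read(1024)
        if not buf:
            break
        decoder.process_raw(buf, False, False)
decoder.end_utt()

if decoder.hyp() is not None:
    print(decoder.hyp().hypstr)    # e.g. "CLAP CLAP SPEECH"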

Outside the framework of speech recognition, you can build a simple static classifier that analyzes the input data frame by frame. Convolutional neural networks that operate on spectrograms perform quite well for this task.
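
As a rough illustration of that second route, here is a small sketch assuming librosa for the spectrogram and PyTorch for the network; the class count, layer sizes, and file name are illustrative, and the model would still need to be trained on labeled clips:

import librosa
import torch
import torch.nn as nn

N_CLASSES = 5  # CLAP, BARK, WHISTLE, FART, SPEECH (illustrative)

class SoundCNN(nn.Module):
    """Tiny CNN over a (1, mels, frames) log-mel spectrogram."""
    def __init__(self, n_classes=N_CLASSES):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
        )
        self.head = nn.Sequential(
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(32, n_classes),
        )

    def forward(self, x):  # x: (batch, 1, mels, frames)
        return self.head(self.features(x))

def log_mel(path, sr=16000, n_mels=64):
    """Load a clip and convert it to a log-mel spectrogram."""
    y, _ = librosa.load(path, sr=sr, mono=True)
    mel = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=n_mels)
    return librosa.power_to_db(mel)

model = SoundCNN()
spec = torch.from_numpy(log_mel('clap.wav')).float()[None, None]  # batch + channel dims
scores = model(spec)  # untrained here; train with cross-entropy on labeled clips
print(scores.argmax(dim=1))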


I don't know of any existing libraries you can use; I suspect you may have to roll your own.

Would this paper be of interest? It has some technical detail; they seem to be able to recognise claps and differentiate them from whistles.

  • http://www.cs.bham.ac.uk/internal/courses/robotics/halloffame/2001/team14/sound.htm