What does a audio frame contain?

2023-01-20 18:08 问答作者：

I'm doing some research on how to compare sound files(wave). Basically, I want to compare stored soundfiles (wav) with sound from a microphone. So in the end I would like to pre-store some voice commands of my own and then when I'm running my app I would like to compare the pre-stored files with input from the microphone.

My thought was to put in some margin when comparing because saying something two times in a row in the exactly same way would be difficult I guess.

So after some googling I see that Python has this module named wave and the Wave_read object. That object has a function named readframes(n):

Reads and returns at most n frames of audio, as a string of bytes.

What do these bytes contain? I'm thi开发者_JAVA百科nking of looping thru the wave files one frame at the time comparing them frame by frame.

An audio frame, or sample, contains amplitude (loudness) information at that particular point in time. To produce sound, tens of thousands of frames are played in sequence to produce frequencies.

In the case of CD quality audio or uncompressed wave audio, there are around 44,100 frames/samples per second. Each of those frames contains 16-bits of resolution, allowing for fairly precise representations of the sound levels. Also, because CD audio is stereo, there is actually twice as much information, 16-bits for the left channel, 16-bits for the right.

When you use the sound module in python to get a frame, it will be returned as a series of hexadecimal characters:

One character for an 8-bit mono signal.
Two characters for 8-bit stereo.
Two characters for 16-bit mono.
Four characters for 16-bit stereo.

In order to convert and compare these values you'll have to first use the python wave module's functions to check the bit depth and number of channels. Otherwise, you'll be comparing mismatched quality settings.

A simple byte-by-byte comparison has almost no chance of a successful match, even with some tolerance thrown in. Voice-pattern recognition is a very complex and subtle problem that is still the subject of much research.

I believe the accepted description to be slightly incorrect.

A frame appears to be somewhat like stride in graphics formats. For interleaved stereo @ 16 bits/sample, the frame size is 2*sizeof(short)=4 bytes. For non-interleaved stereo @ 16 bits/sample, the samples of the left channel are all one after another, so the frame size is just sizeof(short).

The first thing you should do is a fourier transformation to transform the data into its frequencies. It is rather complex however. I wouldn't use voice recognition libraries here as it sounds like you don't record voices only. You would then try different time shifts (in case the sounds are not exactly aligned) and use the one that gives you the best similarity - where you have to define a similarity function. Oh and you should normalize both signals (same maximum loudness).

继续阅读：python wav

What does a audio frame contain?

更多精彩内容

精彩评论

最新问答

央视是哪个频道？

请问买过的朋友，舒提啦旅行箱实际使用体验如何？？

检查不孕不育需要的费用？

海信ULED电视画质有什么不同的地方?？

钉子可以挂的住画框幕布吗？

问答排行榜

河神2九牛入海钓河妖是第几集河妖什么来历可活吞牛？

性激素六项检查的最佳时间是多久？多少钱？？

Easiest way to get words of one line from istream into a vector?

《梦在燃烧 (《三国演义》动画片主题曲)》MP3歌词-汤子星？

抽烟只抽炫赫门？