
Given MP3, is it possible to break out different instruments using Fast Fourier transform (FFT)?

I am working on a music visualizer and I'd like to display a different visual element for each instrument. For example, a blue bar representing vocals, a red bar for guitar, a yellow bar for drums, etc.

Is there a way to analyze the results of FFT to get this information?

Thanks.


This is a challenge that's an active area of research in music technology.

It's possible, to an extent, but it's certainly not easy. It will be especially difficult using MP3, as a lot of important information is lost in compression.

What you're trying to do is known as Audio Source Separation, or Sound Source Separation: the task of separating an audio recording into its constituent elements.

These elements could be speech (several people talking at the same time - the 'cocktail party problem') or instruments (separating one instrument from another in a recording 'blind demixing').

There are various approaches you could take; some are based on the frequency-domain characteristics of sound, and others on spatial properties.

The frequency-domain approach might appear fairly straightforward if you're trying to separate a bass drum and a flute (i.e. the low-frequency bins of your FFT would belong to the bass drum and the higher-frequency bins to the flute). In reality, however, sounds are rarely neatly segregated into useful frequency regions; the bass drum, for example, will have harmonic content right the way up the frequency spectrum. These types of solutions are hence very mathematically complicated and often involve statistical modeling. Heavy stuff.
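To make the naive frequency-domain idea concrete, here's a toy sketch (all frequencies and parameters are invented for illustration): a synthetic "bass drum" tone and "flute" tone are mixed, then split with a hard cutoff in the FFT. This only works because the toy signals really do live in separate bands, which, as noted above, real instruments rarely do.

```python
import numpy as np

# Toy mix: a 60 Hz "bass drum" tone plus an 880 Hz "flute" tone.
# These numbers are invented; real instruments overlap in frequency.
fs = 8000
t = np.arange(0, 1.0, 1 / fs)
mix = np.sin(2 * np.pi * 60 * t) + np.sin(2 * np.pi * 880 * t)

# Forward FFT of the real-valued mix, plus the frequency of each bin.
spectrum = np.fft.rfft(mix)
freqs = np.fft.rfftfreq(len(mix), 1 / fs)

# Naive separation: everything below 300 Hz is "bass drum",
# everything above is "flute".
low = spectrum.copy()
low[freqs > 300] = 0
high = spectrum - low

bass = np.fft.irfft(low, n=len(mix))
flute = np.fft.irfft(high, n=len(mix))
```

On this toy input the split is essentially perfect; on a real recording, the bass drum's upper harmonics would land in the "flute" band and vice versa.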

Separation based on spatial properties of sound often relies on some prior knowledge of where each source was before recording (this is 'non-blind'). It's usually necessary to have more than one microphone (a stereo recording at least). Using some clever maths, it's possible to separate the sources based on knowledge of where each source is in space, derived from the relationship between the signals at each microphone. This is also the basis for a technique called beamforming, by which the position of a source can be determined using an array of microphones.
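The relationship between the microphone signals that such methods exploit can be sketched very simply: a source arrives at each microphone with a different delay, and cross-correlating the two signals recovers that delay. Everything below (signal, delay, sample count) is a made-up illustration, not any particular beamforming algorithm.

```python
import numpy as np

# A noise-like source signal (invented for illustration).
rng = np.random.default_rng(0)
source = rng.standard_normal(800)

# Pretend the source reaches mic2 five samples later than mic1.
delay = 5
mic1 = source
mic2 = np.concatenate([np.zeros(delay), source[:-delay]])

# The lag that maximizes the cross-correlation estimates the delay,
# which in turn constrains where the source sits relative to the mics.
corr = np.correlate(mic2, mic1, mode="full")
lag = int(np.argmax(corr)) - (len(mic1) - 1)
print(lag)  # -> 5
```

Real systems estimate such delays for several sources at once, in noise, which is where the clever maths comes in.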

So, back on track. People are trying to do it, but it's complicated, and using MP3 will make your life difficult!

I'm afraid I don't really know enough to explain the approaches better, but I can find a few references to get you started:

http://www.cs.tut.fi/~tuomasv/demopage.html

http://www.cs.northwestern.edu/~pardo/courses/eecs352/lectures/source%20separation.pdf (pdf warning!)

Good luck!


For the vocals and bass you can use the fact that they are usually in the center of the stereo mix, which means they have the exact same waveform in the left and right channels. If you subtract one channel from the other, you end up with a new channel that will often be without vocals and bass.

Something like:

sound = LoadMP3(...)
length = sound.SampleCount
left = sound.Channels[LEFT]
right = sound.Channels[RIGHT]
difference = new array[length]
for i = 0 to length - 1
    difference[i] = left[i] - right[i]

Now you can look at clever ways to visualize FFT(left), FFT(right) and FFT(difference).
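As a sketch of what the difference channel buys you, here's a toy stereo signal (frequencies and panning invented for illustration): a centered "vocal" tone cancels in the difference, while a left-panned "guitar" tone survives and dominates its spectrum.

```python
import numpy as np

# Toy stereo mix: a 220 Hz "vocal" panned center (identical in both
# channels) and a 660 Hz "guitar" present only on the left.
fs = 8000
t = np.arange(0, 1.0, 1 / fs)
vocal = np.sin(2 * np.pi * 220 * t)   # centered, so it cancels
guitar = np.sin(2 * np.pi * 660 * t)  # left only, so it survives

left = vocal + guitar
right = vocal
difference = left - right

# Magnitude spectrum of the difference channel: the 220 Hz "vocal"
# is gone, and the strongest bin sits at 660 Hz.
freqs = np.fft.rfftfreq(len(t), 1 / fs)
mag_diff = np.abs(np.fft.rfft(difference))
print(freqs[np.argmax(mag_diff)])  # -> 660.0
```

The same magnitude-spectrum computation applied to `left` and `right` gives you the other two spectra to visualize.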

Maybe this will take a small step towards the effect that you are after?
