Recognizing notes within recorded sound - Part 2 - Python
This is a continuation of this question here.
This is the code I used in order to get the samples:
import wave
from numpy import fromstring

spf = wave.open(speech, 'r')  # speech is the path of the wav file
sound_info = spf.readframes(-1)
sound_info = fromstring(sound_info, 'Int16')
The length of sound_info is 194560, which is 4.4 times the sample rate of 44100. The length of the sound file is 2.2 seconds, so isn't sound_info twice the length it should be?
Also, I can only seem to find information on why FFTs are used to produce the frequency spectrum.
I would like to split a sound up and analyse the frequency spectrum of multiple fractions of a second, rather than the whole sound file.
Help would be very much appreciated. :)
This is the basic sound_info graph
from matplotlib.pyplot import plot  # assuming matplotlib/pylab is the plotting library used
plot(sound_info)
This is the FFT graph
from numpy.fft import fft
freq = [abs(x.real) for x in fft(sound_info)]
plot(freq)
If your wav file has two channels, then the length of sound_info would be 2 * sample rate * duration (seconds). The channel data alternate, so if you have slurped all the values into a 1-dimensional array, data, then the values associated with one channel would be data[::2], and the other would be data[1::2].
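For example, a minimal sketch with a tiny synthetic interleaved buffer (the values are made up for illustration):

import numpy as np

# interleaved stereo samples: L0, R0, L1, R1, ...
data = np.array([10, -10, 20, -20, 30, -30], dtype=np.int16)

left = data[::2]    # array([10, 20, 30])    -> first channel
right = data[1::2]  # array([-10, -20, -30]) -> second channel

Each channel then has sample rate * duration samples, which matches the 2.2 second length you expected.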
Roughly speaking, smooth functions can be represented as sums of sine and cosine waves (with various amplitudes and frequencies).
The FFT (Fast Fourier Transform) relates the function to the coefficients (amplitudes) of those sine and cosine waves. That is, there is a one-to-one mapping between the function on the one hand and the sequence of coefficients on the other.
If a sound sample consists mainly of one note, its FFT will have one coefficient which is very big (in absolute value), and the others will be very small. That coefficient corresponds to a particular sine wave, with a particular frequency. That's the frequency of the note.
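A minimal sketch of that idea, using a synthetic 440 Hz tone (the note and sample rate are made up for illustration):

import numpy as np

sample_rate = 44100
t = np.arange(sample_rate // 2) / sample_rate   # half a second of time points
signal = np.sin(2 * np.pi * 440.0 * t)          # a pure A4 tone

spectrum = np.abs(np.fft.rfft(signal))                     # magnitudes of the coefficients
freqs = np.fft.rfftfreq(len(signal), d=1.0 / sample_rate)  # frequency of each coefficient
print(freqs[np.argmax(spectrum)])                          # ~440.0 Hz, the note's frequency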
Don't reinvent the wheel :)
Check out http://librosa.github.io, especially the part about the Short-Time-Fourier Transform (STFT) or in your case rather something like a Constant-Q-Transform (CQT).
But first things first: Let's assume we have a stereo signal (2 channels) from an audio file. For now, we throw away spatial information which is encoded in the two channels of the audio file by creating an average channel (sum up both channels and divide by 2). We now have a signal which is mono (1 channel). Since we have a digital signal, each point in time is called a sample.
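A minimal sketch of that downmix, assuming the two channels are the rows of a 2-D array (librosa.load(path, mono=True) does the same averaging for you):

import numpy as np

# hypothetical stereo signal: shape (2, n_samples)
stereo = np.array([[1.0, 1.0, 1.0, 1.0],
                   [0.0, 0.0, 0.0, 0.0]], dtype=np.float32)

mono = stereo.mean(axis=0)  # sum both channels and divide by 2
print(mono)                 # [0.5 0.5 0.5 0.5]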
Now the fun part begins: we chop the signal into small chunks (called frames) of consecutive samples (512 samples, or other powers of two, are standard values). By taking the discrete Fourier transform (DFT) of each of these frames, we get a time-frequency representation called the spectrogram. Further concepts (overlap etc.) can be read about in any DSP book or in resources like this lab course: https://www.audiolabs-erlangen.de/content/05-fau/professor/00-mueller/02-teaching/2016s_apl/LabCourse_STFT.pdf
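A naive sketch of that chopping, using a fixed hop and a Hann window (naive_spectrogram is a made-up helper; librosa.stft does this more carefully):

import numpy as np

def naive_spectrogram(signal, frame_size=512, hop=256):
    """Chop the signal into frames and take the DFT magnitude of each frame."""
    window = np.hanning(frame_size)
    frames = []
    for start in range(0, len(signal) - frame_size + 1, hop):
        frame = signal[start:start + frame_size] * window
        frames.append(np.abs(np.fft.rfft(frame)))
    return np.array(frames).T   # shape: (frequency bins, time frames)

# toy input: one second of a 440 Hz tone at 44.1 kHz
sr = 44100
t = np.arange(sr) / sr
spec = naive_spectrogram(np.sin(2 * np.pi * 440.0 * t))
print(spec.shape)               # (257, number of frames)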
Note that the frequency axis of the DFT is linearly spaced. In the Western music system, an octave is split into 12 semitones whose center frequencies are spaced logarithmically. Check out the lab course linked above for a binning strategy to obtain a logarithmically spaced frequency axis from the linear STFT. However, this approach is very basic, and there are lots of other and probably better approaches.
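If you use librosa, its constant-Q transform already gives you 12 logarithmically spaced bins per octave; a minimal sketch ('note.wav' is a placeholder path):

import numpy as np
import librosa

y, sr = librosa.load('note.wav', mono=True)   # downmixes to mono as described above

# 84 bins spanning 7 octaves, 12 bins per octave (librosa's defaults)
C = np.abs(librosa.cqt(y, sr=sr, hop_length=512, bins_per_octave=12))
print(C.shape)                                # (n_bins, n_frames)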
Now back to your problem of note recognition. First: it's a very hard one. :) As mentioned above, a real sound played by an instrument contains overtones. Also, if you are interested in transcribing notes played by complete bands, you get interference from the other musicians, etc.
Talking about methods you could try out: lots of people nowadays use non-negative matrix factorization (NMF or similar) or neural networks to approach this task. For instance, NMF is included in scikit-learn. To get started, I would recommend NMF. Use only mono-timbral sounds, i.e., a single instrument playing at a time. Initialize the templates with simple decaying overtone structures and see what happens.
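A minimal NMF sketch on a magnitude spectrogram (the file name and number of components are made up; for the template initialization suggested above you would pass init='custom' together with your own W and H):

import numpy as np
import librosa
from sklearn.decomposition import NMF

y, sr = librosa.load('melody.wav', mono=True)              # placeholder path, single instrument
S = np.abs(librosa.stft(y, n_fft=2048, hop_length=512))    # magnitude spectrogram

model = NMF(n_components=8, init='random', random_state=0, max_iter=500)
W = model.fit_transform(S)   # (n_freq_bins, 8): spectral templates, ideally one per note
H = model.components_        # (8, n_frames): when and how strongly each template sounds
print(W.shape, H.shape)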