Recognizing notes within recorded sound - Part 2 - Python
This is a continuation of this question here.
This is the code I used in order to get the samples:
import wave
from numpy import fromstring

spf = wave.open(speech, 'r')  # speech is the path of the wav file
sound_info = spf.readframes(-1)
sound_info = fromstring(sound_info, 'Int16')
The length of sound_info is 194560, which is 4.4 times the sample rate of 44100. The length of the sound file is 2.2 seconds, so isn't sound_info twice the length it should be?
Also, I can only seem to find information on why FFTs are used to produce the frequency spectrum.
I would like to split a sound up and analyse the frequency spectrum of multiple fractions of a second, rather than the whole sound file.
Help would be very much appreciated. :)
This is the basic sound_info graph
from matplotlib.pyplot import plot  # assuming matplotlib/pylab is the plotting library used
plot(sound_info)
This is the FFT graph
from numpy.fft import fft
freq = [abs(x.real) for x in fft(sound_info)]
plot(freq)
If your wav file has two channels, then the length of sound_info would be 2 * sample rate * duration (seconds). The channel data alternate, so if you have slurped all the values into a 1-dimensional array, data, then the values associated with one channel would be data[::2], and the other would be data[1::2].
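For example, a minimal sketch with a tiny synthetic interleaved buffer (the values are made up for illustration):

import numpy as np

# interleaved stereo samples: L0, R0, L1, R1, ...
data = np.array([10, -10, 20, -20, 30, -30], dtype=np.int16)

left = data[::2]    # array([10, 20, 30])    -> first channel
right = data[1::2]  # array([-10, -20, -30]) -> second channel

Each channel then has sample rate * duration samples, which matches the 2.2 second length you expected.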
Roughly speaking, smooth functions can be represented as sums of sine and cosine waves (with various amplitudes and frequencies).
The FFT (Fast Fourier Transform) relates the function to the coefficients (amplitudes) of those sine and cosine waves. That is, there is a one-to-one mapping between the function on the one hand and the sequence of coefficients on the other.
If a sound sample consists mainly of one note, its FFT will have one coefficient which is very big (in absolute value), and the others will be very small. That coefficient corresponds to a particular sine wave, with a particular frequency. That's the frequency of the note.
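A minimal sketch of that idea, using a synthetic 440 Hz tone (the note and sample rate are made up for illustration):

import numpy as np

sample_rate = 44100
t = np.arange(sample_rate // 2) / sample_rate   # half a second of time points
signal = np.sin(2 * np.pi * 440.0 * t)          # a pure A4 tone

spectrum = np.abs(np.fft.rfft(signal))                     # magnitudes of the coefficients
freqs = np.fft.rfftfreq(len(signal), d=1.0 / sample_rate)  # frequency of each coefficient
print(freqs[np.argmax(spectrum)])                          # ~440.0 Hz, the note's frequency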
Don't reinvent the wheel :)
Check out http://librosa.github.io, especially the part about the Short-Time-Fourier Transform (STFT) or in your case rather something like a Constant-Q-Transform (CQT).
But first things first: Let's assume we have a stereo signal (2 channels) from an audio file. For now, we throw away spatial information which is encoded in the two channels of the audio file by creating an average channel (sum up both channels and divide by 2). We now have a signal which is mono (1 channel). Since we have a digital signal, each point in time is called a sample.
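A minimal sketch of that downmix, assuming the two channels are the rows of a 2-D array (librosa.load(path, mono=True) does the same averaging for you):

import numpy as np

# hypothetical stereo signal: shape (2, n_samples)
stereo = np.array([[1.0, 1.0, 1.0, 1.0],
                   [0.0, 0.0, 0.0, 0.0]], dtype=np.float32)

mono = stereo.mean(axis=0)  # sum both channels and divide by 2
print(mono)                 # [0.5 0.5 0.5 0.5]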
Now the fun part begins: we chop the signal into small chunks (called frames) of consecutive samples (512 samples, or other powers of two, are standard values). By taking the discrete Fourier transform (DFT) of each of these frames, we get a time-frequency representation called the spectrogram. Further concepts (overlap etc.) can be read about in any DSP book or in resources like this lab course: https://www.audiolabs-erlangen.de/content/05-fau/professor/00-mueller/02-teaching/2016s_apl/LabCourse_STFT.pdf
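A naive sketch of that chopping, using a fixed hop and a Hann window (naive_spectrogram is a made-up helper; librosa.stft does this more carefully):

import numpy as np

def naive_spectrogram(signal, frame_size=512, hop=256):
    """Chop the signal into frames and take the DFT magnitude of each frame."""
    window = np.hanning(frame_size)
    frames = []
    for start in range(0, len(signal) - frame_size + 1, hop):
        frame = signal[start:start + frame_size] * window
        frames.append(np.abs(np.fft.rfft(frame)))
    return np.array(frames).T   # shape: (frequency bins, time frames)

# toy input: one second of a 440 Hz tone at 44.1 kHz
sr = 44100
t = np.arange(sr) / sr
spec = naive_spectrogram(np.sin(2 * np.pi * 440.0 * t))
print(spec.shape)               # (257, number of frames)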
Note that the frequency axis of the DFT is linearly spaced. In the Western music system, an octave is split into 12 semitones whose center frequencies are spaced logarithmically. Check out the lab course linked above for a binning strategy to obtain a logarithmically spaced frequency axis from the linear STFT. However, this approach is very basic, and there are lots of other and probably better approaches.
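If you use librosa, its constant-Q transform already gives you 12 logarithmically spaced bins per octave; a minimal sketch ('note.wav' is a placeholder path):

import numpy as np
import librosa

y, sr = librosa.load('note.wav', mono=True)   # downmixes to mono as described above

# 84 bins spanning 7 octaves, 12 bins per octave (librosa's defaults)
C = np.abs(librosa.cqt(y, sr=sr, hop_length=512, bins_per_octave=12))
print(C.shape)                                # (n_bins, n_frames)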
Now back to your problem of note recognition. First: it's a very hard one. :) As mentioned above, a real sound played by an instrument contains overtones. Also, if you are interested in transcribing notes played by complete bands, you get interference from the other musicians, etc.
Talking about methods you could try out: lots of people nowadays use non-negative matrix factorization (NMF or similar) or neural networks to approach this task. For instance, NMF is included in scikit-learn. To get started, I would recommend NMF. Use only mono-timbral sounds, i.e., a single instrument playing at a time. Initialize the templates with simple decaying overtone structures and see what happens.
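A minimal NMF sketch on a magnitude spectrogram (the file name and number of components are made up; for the template initialization suggested above you would pass init='custom' together with your own W and H):

import numpy as np
import librosa
from sklearn.decomposition import NMF

y, sr = librosa.load('melody.wav', mono=True)              # placeholder path, single instrument
S = np.abs(librosa.stft(y, n_fft=2048, hop_length=512))    # magnitude spectrogram

model = NMF(n_components=8, init='random', random_state=0, max_iter=500)
W = model.fit_transform(S)   # (n_freq_bins, 8): spectral templates, ideally one per note
H = model.components_        # (8, n_frames): when and how strongly each template sounds
print(W.shape, H.shape)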