
Algorithm to compare two spectrograms and find the offset at which they match

I record a daily 2-minute radio broadcast from the Internet. It always starts and ends with the same jingles. Since the broadcast's exact start time can vary by plus or minus 6 minutes, I have to record around 15 minutes of radio.

I want to identify the exact times at which those jingles occur in the 15-minute recording, so I can extract the portion of audio I want.

I have already started a C# application in which I decode an MP3 to PCM data and convert the PCM data to a spectrogram, based on http://www.codeproject.com/KB/audio-video/SoundCatcher.aspx

I tried using a cross-correlation algorithm on the PCM data, but it is very slow (around 6 minutes with a step of 10 ms), and on some occasions it fails to find the jingle start time.

Any ideas for algorithms to compare two spectrograms for a match? Or a better way to find the jingle start time?

Thanks,

Update: sorry for the delay.

First, thanks for all the answers; most of them were relevant and/or interesting ideas.

I tried to implement the Shazam algorithm proposed by fonzo, but failed to detect the peaks in the spectrogram. Below are three spectrograms of the starting jingle from three different recordings. I tried AForge.NET with the blob filter (it failed to identify peaks), blurring the image and checking for differences in height, Laplace convolution, slope analysis, and detecting the series of vertical bars (too many false positives)...

In the meantime, I tried the Hough-style algorithm proposed by Dave Aaron Smith, in which I calculate the RMS of each column (as sketched below). Yes, yes, each column; it's O(N*M), but M << N (note that a column is around 8k samples). So overall it's not that bad; the algorithm still takes about 3 minutes, but it has never failed.

I could go with that solution, but if possible I would prefer the Shazam approach, since it's O(N) and probably much faster (and cooler, too). So if any of you has an idea for an algorithm that always detects the same points in those spectrograms (they don't have to be peaks), please add a comment.

[Image: three spectrograms of the starting jingle, one from each recording]
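
For reference, here is a minimal sketch of the per-column RMS metric described above. Treating the spectrogram as a [column][frequency bin] array of magnitudes is an assumption for illustration; the real layout depends on how the FFT output is stored.

    using System;

    static class SpectrogramMetrics
    {
        // Minimal sketch: compute the RMS of each spectrogram column.
        // Assumes spectrogram[col][bin] holds the magnitude of one
        // frequency bin in one time slice (column).
        public static double[] ColumnRms(double[][] spectrogram)
        {
            var rms = new double[spectrogram.Length];
            for (int col = 0; col < spectrogram.Length; col++)
            {
                double sumSquares = 0.0;
                foreach (double magnitude in spectrogram[col])
                    sumSquares += magnitude * magnitude;
                rms[col] = Math.Sqrt(sumSquares / spectrogram[col].Length);
            }
            return rms;
        }
    }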

New Update

Finally, I went with the algorithm explained above. I tried to implement the Shazam algorithm but failed to find proper peaks in the spectrogram; the identified points were not consistent from one sound file to another. In theory, the Shazam algorithm is the solution for that kind of problem. The Hough-style algorithm proposed by Dave Aaron Smith was more stable and effective. I split around 400 files, and only 20 of them failed to split properly. Disk usage went from 8 GB to 1 GB.

Thanks for your help.


There's a description of the algorithm used by the Shazam service (which identifies a song from a short, possibly noisy sample) here: http://www.ee.columbia.edu/~dpwe/papers/Wang03-shazam.pdf
From what I understood, the first step is to isolate peaks in the spectrogram (with some tweaks to ensure uniform coverage), which gives a "constellation" of (time, frequency) pairs from the initial spectrogram. Once that's done, the sample constellation is compared to the constellation of the full track by translating a window of the sample's length from the beginning to the end and counting the number of matching points.
The paper then describes the technical solution they found for doing the comparison quickly, even against a huge collection of tracks.
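
For illustration, here is a minimal sketch of the constellation-matching step. The peak extraction itself is not shown, and the data layout is an assumption; note also that the paper's actual scheme gains robustness by hashing pairs of peaks (anchor plus target) rather than matching single points.

    using System;
    using System.Collections.Generic;
    using System.Linq;

    static class ConstellationMatcher
    {
        // A constellation point is a (time in columns, frequency bin) pair.
        // Finds the offset at which the jingle's constellation best lines
        // up with the recording's constellation.
        public static int BestOffset(
            List<(int Time, int Freq)> jingle,
            List<(int Time, int Freq)> recording)
        {
            // Index the recording's peaks by frequency bin for fast lookup.
            var byFreq = recording
                .GroupBy(p => p.Freq)
                .ToDictionary(g => g.Key, g => g.Select(p => p.Time).ToList());

            // Every pair of same-frequency peaks votes for one candidate offset.
            var votes = new Dictionary<int, int>();
            foreach (var (time, freq) in jingle)
            {
                if (!byFreq.TryGetValue(freq, out var times)) continue;
                foreach (int t in times)
                {
                    int offset = t - time;
                    if (offset < 0) continue;
                    votes.TryGetValue(offset, out int count);
                    votes[offset] = count + 1;
                }
            }

            // The offset with the most votes is the best alignment.
            return votes.OrderByDescending(kv => kv.Value).First().Key;
        }
    }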


I wonder if you could use a Hough transform. You would start by cataloging each step of the opening sequence. Let's say you use 10 ms steps and the opening sequence is 50 ms long. You compute some metric on each step and get

1 10 1 17 5

Now go through your audio and compute the same metric for each 10 ms step. Call this array have_audio:

8 10 8 7 5 1 10 1 17 6 2 10...

Now create a new empty array that's the same length as have_audio. Call it start_votes. It will contain "votes" for the start of the opening sequence. If you see a 1, you may be at the 1st or 3rd step of the opening sequence, so you cast 1 vote for the sequence starting at the current step and 1 vote for it starting 2 steps earlier. If you see a 10, you cast 1 vote for the sequence starting 1 step earlier; if you see a 17, 1 vote for 3 steps earlier, and so on.

So for that example have_audio, your votes will look like

2 0 0 1 0 4 0 1 0 0 1 0 ...

You have a lot of votes at position 6, so there's a good chance the opening sequence starts there.

You could improve performance by not bothering to analyze the entire opening sequence. If the opening sequence is 10 seconds long, you could just search for the first 5 seconds.
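
Here is a minimal sketch of that voting scheme. Voting within a tolerance is an assumption on my part, since a real metric is noisy and will rarely match exactly:

    using System;

    static class StartVoting
    {
        // Each audio step that matches a step of the opening sequence casts
        // a vote for the implied start position (0-based indexing here).
        public static int MostLikelyStart(double[] sequence, double[] audio, double tolerance)
        {
            var votes = new int[audio.Length];
            for (int i = 0; i < audio.Length; i++)
            {
                for (int j = 0; j < sequence.Length && j <= i; j++)
                {
                    // If audio step i looks like sequence step j, the
                    // sequence would have started at position i - j.
                    if (Math.Abs(audio[i] - sequence[j]) <= tolerance)
                        votes[i - j]++;
                }
            }

            // The position with the most votes is the likeliest start.
            int best = 0;
            for (int k = 1; k < votes.Length; k++)
                if (votes[k] > votes[best]) best = k;
            return best;
        }
    }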


Here is a good Python package that does just this:

https://code.google.com/p/py-astm/

If you are looking for a specific algorithm, good search terms to use are "acoustic fingerprinting" or "perceptual hashing".

Here's another Python package that could also be used:

http://rudd-o.com/new-projects/python-audioprocessing/documentation/manuals/algorithms/butterscotch-signatures


If you already know the jingle sequence, you could analyse the correlation with that sequence instead of the cross-correlation between the full 15-minute tracks.

To quickly calculate the correlation against the (short) sequence, I would suggest using a Wiener filter.

Edit: a Wiener filter is a way to locate a signal within a noisy sequence. In this application, we are treating anything that is "not jingle" as noise (question for the reader: can we still assume that the noise is white and uncorrelated?).

(I found the reference I was looking for! The formulas I remembered were a little off, so I've removed them.)

The relevant page is Wiener deconvolution. The idea is that we can define a system whose impulse response h(t) has the same waveform as the jingle, and that we have to locate the point in a noisy sequence at which the system received an impulse (i.e., emitted a jingle).

Since the jingle is known, we can calculate its spectrum H(f), and since we can assume that a single jingle appears in a recorded sequence, we can model the unknown input x(t) as a pulse, whose power density S(f) is constant at every frequency.

Given the above, you can use the Wiener deconvolution formula to obtain a "jingle-pass" filter (as in, only signals shaped like the jingle pass through) whose output is highest when the jingle is played.
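
For concreteness, here is a minimal sketch of that filter, G(f) = H*(f) / (|H(f)|^2 + N(f)/S(f)). It assumes the MathNet.Numerics library for the FFT, with a constant k standing in for the noise-to-signal ratio N(f)/S(f) (an assumption you would tune):

    using System;
    using System.Numerics;
    using MathNet.Numerics.IntegralTransforms;

    static class JinglePassFilter
    {
        // Wiener deconvolution sketch: treat the known jingle as the impulse
        // response h(t) and look for the impulse in the recording. The output
        // should peak near the position where the jingle starts.
        public static double[] Apply(double[] jingle, double[] recording, double k)
        {
            int n = recording.Length;
            var h = new Complex[n]; // jingle, zero-padded to the recording length
            var x = new Complex[n];
            for (int i = 0; i < jingle.Length; i++) h[i] = jingle[i];
            for (int i = 0; i < n; i++) x[i] = recording[i];

            Fourier.Forward(h);
            Fourier.Forward(x);

            // Apply G(f) = conj(H(f)) / (|H(f)|^2 + k) in the frequency domain.
            for (int i = 0; i < n; i++)
            {
                Complex H = h[i];
                x[i] *= Complex.Conjugate(H) / (H.Magnitude * H.Magnitude + k);
            }

            Fourier.Inverse(x);

            var output = new double[n];
            for (int i = 0; i < n; i++) output[i] = x[i].Real;
            return output;
        }
    }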
