shazam 音频指纹听歌识曲原理

如何用java来实现shazam？

几天以前我读了一篇论文：How shazam works？（shazam的工作原理）
这让我有了自己写一个java版本的shazam的兴趣。

关于shazam

shazam的核心功能是类似于网易云音乐和QQ音乐中的“听歌识曲”，当你开启听歌识曲功能以后，仅需要两三秒，便可以知道正在播放的歌曲叫什么。
当我第一次使用听歌识曲时：太神奇辽！！！这是怎么做到的？？？即便当我使用听歌识曲功能好几次以后，我依然感觉超级神奇！
如果自己用java来写一个听歌识曲的小程序那不是更酷吗哈哈哈哈，我开始依照shazam的论文来搞这个东西。
这就是我过去两天的课余小项目。

录入歌曲。。。！！！

要做的第一件事就是把一段样本录入java中。以前从来没有做过音频类的东西，所以有多难我也不清楚。
可是事实证明是，超级简单！！！！

final AudioFormat format = getFormat(); //Fill AudioFormat with the wanted settings 
DataLine.Info info = new DataLine.Info(TargetDataLine.class, format);
final TargetDataLine line = (TargetDataLine) AudioSystem.getLine(info);
line.open(format);
line.start();

现在我们就可以从TargetDatLine中读取数据了，就像读取普通数据流一样。

// In another thread I start: OutputStream out = new ByteArrayOutputStream();
running = true;try {while (running) {int count = line.read(buffer, 0, buffer.length);if (count > 0) {out.write(buffer, 0, count);}}out.close();
} catch (IOException e) {System.err.println("I/O problems: " + e);System.exit(-1);
}

使用这个方法就可以打开麦克风来记录下音频信息了。个人采用记录音频的格式为：

private AudioFormat getFormat() {float sampleRate = 44100;int sampleSizeInBits = 8;int channels = 1; //mono boolean signed = true;boolean bigEndian = true;return new AudioFormat(sampleRate, sampleSizeInBits, channels, signed, bigEndian);
}

所以，我们在 ByteArrayOutputStream数组中有了完整的音频信息。第一步捕获音频信息完成了。

这。。。是个啥？

下一步面临的问题就是分析得到的数据，当我尝试着输出这个数组的时候，他是这样的。。

0
0
1
2
4
7
6
3
-1
-2
-4
-2
-5
-7
-8
(etc)

What？？fuk？？你告诉我这是音乐？打死我吧。
但是当我将这个数组用Open Office可视化之后，它有点音乐的样子啦。
可视化数组
这就是我们平常看到的这个音乐曲线了。

这个图谱通常被称作时域（它的横轴是时间线），但是这些数据对我们来说没有什么用处。。如果你曾经读过shazam创始人的论文，你就知道它变成频域（横轴是频率）以后再处理的。（为了取得良好的抗噪性）

因此，下一个大问题是，我们如何将时域变成频域？

离散傅里叶变换（DFT）

为了将时域变成频域，我们用到了傅里叶变换。傅里叶变换可以将时域变成频域。

这只是其中一个问题，如果你把时域变成频域的话就会损失掉时间信息，只剩下一堆频率幅值类的信息。

为了解决这个问题，我决定采用一种数据块的方式，就是将这一大段时域音频数组分成N个小块，对他们分别实行傅里叶变换，这样就在一定程度上保住了时间信息。（第一个时间单位的频域图，第二个时间单位的频域图，第N个。。。。。）

我个人采用以4096byte为一个数据块，对数据块进行挨个傅里叶变换。

想办法实现它

没有从头去了解傅里叶变换究竟是怎么变换的。

Google后，我采用了一个变换模块，叫FFT（Fast Fourier Transformation），

通过调用FFT来将时域数据块变成频域数据块。

byte audio[] = out.toByteArray();final int totalSize = audio.length;int amountPossible = totalSize/Harvester.CHUNK_SIZE;//When turning into frequency domain we'll need complex numbers: 
Complex[][] results = new Complex[amountPossible][];//For all the chunks: 
for(int times = 0;times < amountPossible; times++) {Complex[] complex = new Complex[Harvester.CHUNK_SIZE];for(int i = 0;i < Harvester.CHUNK_SIZE;i++) {//Put the time domain data into a complex number with imaginary part as 0: complex[i] = new Complex(audio[(times*Harvester.CHUNK_SIZE)+i], 0);}//Perform FFT analysis on the chunk: results[times] = FFT.fft(complex);
}//Done!

因此现在我们有了一堆频域的数据块，这些包含着全部的幅值和频率信息。
。

我决定再可视化一下，下面是代码（没什么卵用，纯属为了验证傅里叶变换的正确与否hhhhh）

for(int i = 0; i < results.length; i++) {int freq = 1;for(int line = 1; line < size; line++) {// To get the magnitude of the sound at a given frequency slice // get the abs() from the complex number. // In this case I use Math.log to get a more managable number (used for color) double magnitude = Math.log(results[i][freq].abs()+1);// The more blue in the color the more intensity for a given frequency point: g2d.setColor(new Color(0,(int)magnitude*10,(int)magnitude*20));// Fill: g2d.fillRect(i*blockSizeX, (size-line)*blockSizeY,blockSizeX,blockSizeY);// I used a improviced logarithmic scale and normal scale: if (logModeEnabled && (Math.log10(line) * Math.log10(line)) > 1) {freq += (int) (Math.log10(line) * Math.log10(line));} else {freq++;}}
}

效果就是这样（虽然看起来这首歌的频域比较恐怖。。。）
在这里插入图片描述

找到频域中的关键点

下一步便是提取频域中的关键点，来做成一个个哈希值，为了检索时匹配。

因为我想把这个项目在两天之内完成，所以做的很粗糙，只有一个大体的算法，与其说这是一个应用型的项目，不如说这是为了验证shazam理论正确性而做出的项目（着实粗糙）。

提取关键点的时候，我提取了四个关键点，每个频域数据块中，在频率区间40-80, 80-120, 120-180, 180-300.分别提取一个。提取哪一个呢？提取幅值最大的那一个。

/For every line of data: for (int freq = LOWER_LIMIT; freq < UPPER_LIMIT-1; freq++) {//Get the magnitude: double mag = Math.log(results[freq].abs() + 1);//Find out which range we are in: int index = getIndex(freq);//Save the highest magnitude and corresponding frequency: if (mag > highscores[index]) {highscores[index] = mag;recordPoints[index] = freq;}
}//Write the points to a file: 
for (int i = 0; i < AMOUNT_OF_POINTS; i++) {fw.append(recordPoints[i] + "\t");
}
fw.append("\n");// ... snip ... public static final int[] RANGE = new int[] {40,80,120,180, UPPER_LIMIT+1};//Find out in which range 
public static int getIndex(int freq) {int i = 0;while(RANGE[i] < freq) i++;return i;}
}

因此我们就得到了这么一堆数组。。

33	56	99	121	195	
30	41	84	146	199	
33	51	99	133	183	
33	47	94	137	193	
32	41	106	161	191	
33	76	95	123	185	
40	68	110	134	232	
30	62	88	125	194	
34	57	83	121	182	
34	42	89	123	182	
33	56	99	121	195	
30	41	84	146	199	
33	51	99	133	183	
33	47	94	137	193	
32	41	106	161	191	
33	76	95	123	185

将这个数组可视化之后就是这个样子的，
在这里插入图片描述
红色的就是记录下来的很重要的点。

检索我自己本地的音乐

我将我的电脑中300+歌曲全部进行变换存储了起来以便进行比对测试。

构建哈希值

论文中可能因为某些商业原因并没有讲清楚应该如何构建哈希值，下面是我自己的构建方法。

刚才已经在每个频域段中提取出了一个频率值，就是幅值最大的那个。每个数据块中都会有五个值，我用了前四个，因为在我测试看来，这样构建哈希值效率最高。

将四个混合在一起就构成了一个哈希值。

//Using a little bit of error-correction, damping 
private static final int FUZ_FACTOR = 2;private long hash(String line) {String[] p = line.split("\t");long p1 = Long.parseLong(p[0]);long p2 = Long.parseLong(p[1]);long p3 = Long.parseLong(p[2]);long p4 = Long.parseLong(p[3]);return  (p4-(p4%FUZ_FACTOR)) * 100000000 + (p3-(p3%FUZ_FACTOR)) * 100000 + (p2-(p2%FUZ_FACTOR)) * 100 + (p1-(p1%FUZ_FACTOR));
}

哈希值构建已经完成，相应的，每个哈希值对应的应该是一个数据块，即第几首音乐的第几个数据块。因为数据块存储了两个信息，所以我新定义了一个类DataPoint。
DataPoint构造如下：

private class DataPoint {private int time;private int songId;public DataPoint(int songId, int time) {this.songId = songId;this.time = time;}public int getTime() {return time;}public int getSongId() {return songId;}
}

现在我们已经拥有了所有应该拥有的信息，下一步就是匹配啦。

匹配的想法是，当我们把正在录到的音频解析成哈希值后，在这个巨型数据库（对我来说其实只有三千首）中匹配，看看有没有一样的，如果查找到得了，就说明存在。

但是这种匹配精确度并不高，因为每首歌总会有那么几个匹配到的哈希值。
所以为了提高精确度，引入了一个相对时间，由于数据块是按照时间来进行平均切割的，每个数据块的序号就代表了时间。当匹配到一个哈希值后，记录下被检索的数据块序号与检索的数据块序号的差值。当再有一个哈希值匹配到后，再记录下二者差值。再有匹配到的再记录下，，，，，如果出现了差值相同的，我们就有把握把他认为是同一首歌，因为在相同的时间间隔后又在一次匹配到了。

匹配结果

Done loading: 2921 songsStart matching song...Top 20 matches:01: 08_the_kooks_-_match_box.mp3 with 16 matches.
02: 04 Racoon - Smoothly.mp3 with 8 matches.
03: 05 Röyksopp - Poor Leno.mp3 with 7 matches.
04: 07_athlete_-_yesterday_threw_everyting_a_me.mp3 with 7 matches.
05: Flogging Molly - WMH - Dont Let Me Dia Still Wonderin.mp3 with 7 matches.
06: coldplay - 04 - sparks.mp3 with 7 matches.
07: Coldplay - Help Is Round The Corner (yellow b-side).mp3 with 7 matches.
08: the arcade fire - 09 - rebellion (lies).mp3 with 7 matches.
09: 01-coldplay-_clocks.mp3 with 6 matches.
10: 02 Scared Tonight.mp3 with 6 matches.
11: 02-radiohead-pyramid_song-ksi.mp3 with 6 matches.
12: 03 Shadows Fall.mp3 with 6 matches.
13: 04 Röyksopp - In Space.mp3 with 6 matches.
14: 04 Track04.mp3 with 6 matches.
15: 05 - Dress Up In You.mp3 with 6 matches.
16: 05 Supergrass - Can't Get Up.mp3 with 6 matches.
17: 05 Track05.mp3 with 6 matches.
18: 05The Fox In The Snow.mp3 with 6 matches.
19: 05_athlete_-_wires.mp3 with 6 matches.
20: 06 Racoon - Feel Like Flying.mp3 with 6 matches.Matching took: 259 msFinal prediction: 08_the_kooks_-_match_box.mp3.song with 16 matches.

嗯，成功了！而且用时非常短！
有木有感觉很神奇！

结语

完整代码还需要我完善一下，需要的同学留下联系方式我有空发给你们。