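My graduation project was a user-based recommender system, and it involved the concept of "cosine similarity".
Ruan Yifeng's blog explains the principle behind search-by-image (I won't repeat it here) and points to a Python implementation. To me, parts of that program felt a bit complicated and the search results were not that good, so I rewrote it.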
I changed two things:
1. Image hashing:
The original code looks like this:
reduce(lambda x, (y, z): x | (z << y),enumerate(map(lambda i: 0 if i < avg else 1, im.getdata())),0)
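I simplified this with Python's map + lambda: take the 64 pixel values, compare each one with the average pixel value, and turn the result into a vector of 0s and 1s: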
avg = reduce(lambda x, y: x + y, im.getdata()) / 64
return map(lambda x: 0 if x>avg else 1, im.getdata())
[1, 1, 1, 0, 0, 1, 1, 1, 1, 1, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]
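To make the hashing step concrete without needing an image file, here is a minimal sketch in Python 3 that runs on a plain list of grayscale values. The function name `average_hash` and the four-pixel sample are assumptions for illustration; with a real image the values would come from `im.getdata()` as above. It uses `//` to mirror the integer division that `/` performs on ints in Python 2.

```python
def average_hash(pixels):
    """Return a list of bits: 0 for pixels brighter than the mean, 1 otherwise."""
    pixels = list(pixels)
    # Floor division, matching Python 2's "/" on integers.
    avg = sum(pixels) // len(pixels)
    return [0 if p > avg else 1 for p in pixels]


# A made-up 2x2 "image" flattened to four grayscale values.
sample = [10, 200, 30, 220]
print(average_hash(sample))  # [1, 0, 1, 0]
```

The mean of the sample is 115, so the two dark pixels map to 1 and the two bright ones to 0.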
2. Similarity measure:
The original code uses the Hamming distance:
def hamming(h1, h2):
    h, d = 0, h1 ^ h2
    while d:
        h += 1
        d &= d - 1
    return h
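The original `hamming` works on hashes packed into a single integer, while the rewritten hash is a list of bits. Here is a rough sketch of how the two forms relate, assuming Python 3; `hamming_bits`, `pack_bits` and `hamming_int` are hypothetical helper names, not part of the original program.

```python
def hamming_bits(a, b):
    """Count the positions where two equal-length bit lists differ."""
    if len(a) != len(b):
        raise ValueError("hashes must have the same length")
    return sum(x != y for x, y in zip(a, b))


def pack_bits(bits):
    """Pack a bit list into one integer, with bit i stored at position i."""
    value = 0
    for i, bit in enumerate(bits):
        value |= bit << i
    return value


def hamming_int(h1, h2):
    """Hamming distance between two integer hashes (number of set bits in the XOR)."""
    return bin(h1 ^ h2).count("1")


a = [1, 0, 1, 1]
b = [1, 1, 0, 1]
print(hamming_bits(a, b))                        # 2
print(hamming_int(pack_bits(a), pack_bits(b)))   # 2
```

Both forms give the same distance; the list form is simply easier to print and inspect.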
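I use the cosine measure instead (the function is named cos_dist, but what it returns is the cosine similarity, so a larger value means the images are more alike):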
def cos_dist(a, b):
    if len(a) != len(b):
        return None
    part_up = 0.0
    a_sq = 0.0
    b_sq = 0.0
    for x, y in zip(a,b):
        part_up += x*y
        a_sq += x**2
        b_sq += y**2
    part_down = math.sqrt(a_sq*b_sq)
    if part_down == 0.0:
        return None
    else:
        return part_up / part_down
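For vectors that contain only 0s and 1s, the formula simplifies: the dot product is the number of positions where both vectors hold 1, and each squared norm is just the count of 1s. The sketch below (Python 3, with the hypothetical name `cos_sim_binary`) shows that special case; it should agree with `cos_dist` on binary inputs.

```python
import math


def cos_sim_binary(a, b):
    """Cosine similarity of two 0/1 lists; None if lengths differ or a vector is all zeros."""
    if len(a) != len(b):
        return None
    both = sum(1 for x, y in zip(a, b) if x == 1 and y == 1)
    ones_a = sum(a)
    ones_b = sum(b)
    if ones_a == 0 or ones_b == 0:
        return None
    return both / math.sqrt(ones_a * ones_b)


print(cos_sim_binary([1, 1, 0, 0], [1, 1, 0, 0]))  # 1.0
print(cos_sim_binary([1, 1, 0, 0], [1, 0, 1, 0]))  # 0.5
print(cos_sim_binary([1, 0, 0, 0], [0, 1, 0, 0]))  # 0.0
```

One consequence worth noting: positions where both vectors are 0 do not contribute, so two hashes that differ in only two bits can still score 0.0, as in the last example.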
Finally, the complete code is as follows:
#!/usr/bin/env python
# -*- coding: utf-8 -*-
import math
import glob
from PIL import Image
EXTS = ['png']
master = Image.open('image.png')
master = master.resize((8, 8), Image.ANTIALIAS).convert('L')
avg = reduce(lambda x, y: x + y, master.getdata()) / 64
master_data = map(lambda x: 0 if x>avg else 1, master.getdata())
#print master_data
def cos_dist(a, b):
    if len(a) != len(b):
        return None
    part_up = 0.0
    a_sq = 0.0
    b_sq = 0.0
    for x, y in zip(a,b):
        part_up += x*y
        a_sq += x**2
        b_sq += y**2
    part_down = math.sqrt(a_sq*b_sq)
    if part_down == 0.0:
        return None
    else:
        return part_up / part_down
images = []
for ext in EXTS:
    #fill in images list
    images.extend(glob.glob('*.%s' % ext))
#print repr(images)
dists = []
for f in images:
    if f == "image.png":
        continue
    im = Image.open(f)
    im = im.resize((8, 8), Image.ANTIALIAS).convert('L')
    avg = reduce(lambda x, y: x + y, im.getdata()) / 64
    # There is still a problem here: the resulting pixels come out inverted, so I changed ">" to "<"
    im_data = map(lambda x: 0 if x<avg else 1, im.getdata())
    dist = cos_dist(master_data, im_data)
    print "image: %s\t avg: %f\t dist:%s\t" % (f, avg, dist)
    print im_data
    dists.append((f, dist))
for f, dist in sorted(dists, key=lambda i: i[1]):
    print "%f\t%s" % (dist, f)
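The ranking step at the end of the script can also be isolated from the file handling. Below is a sketch, assuming Python 3 and hashes already computed as bit lists; `rank_by_similarity`, `cos_sim` and the sample file names are made up for illustration. Unlike the script above, it sorts in descending order and skips candidates whose similarity is undefined (an all-zero hash), since formatting `None` with `%f` would raise an error.

```python
import math


def cos_sim(a, b):
    """Cosine similarity of two equal-length numeric lists, or None if undefined."""
    if len(a) != len(b):
        return None
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a) * sum(y * y for y in b))
    return None if norm == 0.0 else dot / norm


def rank_by_similarity(master_bits, candidates):
    """Return (name, similarity) pairs, most similar first.

    candidates maps a file name to its bit list; entries with an
    undefined similarity are left out of the result.
    """
    scored = []
    for name, bits in candidates.items():
        sim = cos_sim(master_bits, bits)
        if sim is not None:
            scored.append((name, sim))
    return sorted(scored, key=lambda item: item[1], reverse=True)


master = [1, 1, 0, 0]
candidates = {
    "a.png": [1, 1, 0, 0],
    "b.png": [1, 0, 1, 0],
    "c.png": [0, 0, 1, 1],
    "blank.png": [0, 0, 0, 0],  # all zeros, so similarity is undefined
}
for name, sim in rank_by_similarity(master, candidates):
    print("%f\t%s" % (sim, name))
```

With this toy data the order is a.png, b.png, c.png, and blank.png is dropped.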
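I have the following images:
1. Reference image image.png:
The images below are the ones to test. Judged by eye, their similarity to the reference image is image1 > image2 > image3 > image4 > image5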
2.image1.png
3.image2.png
4.image3.png
5.image4.png
6.image5.png:
The output is as follows:
image: image1.png avg: 215.000000 dist:0.970142500145
[1, 1, 1, 0, 0, 1, 1, 1, 1, 1, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]
image: image2.png avg: 210.000000 dist:0.857250857251
[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]
image: image3.png avg: 168.000000 dist:0.735980072194
[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]
image: image4.png avg: 108.000000 dist:0.714434508312
[1, 1, 1, 0, 1, 1, 1, 1, 0, 0, 1, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 0, 0, 1, 1, 1]
image: image5.png avg: 76.000000 dist:0.25
[0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 0, 0, 1, 1, 1, 0, 0, 1, 1, 0, 1, 1, 1, 1, 1, 1, 0, 0, 1, 1, 1, 1, 1, 1, 0, 0, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
0.250000 image5.png
0.714435 image4.png
0.735980 image3.png
0.857251 image2.png
0.970143 image1.png
As you can see, the computed cosine similarity also gives image1 > image2 > image3 > image4 > image5
This matches what we see by eye, so the most similar image the search finds is image1.
--------------
The code is on GitHub.