缓存穿透

在高并发场景中，为了避免请求直接打到数据库上（会导致数据库出现性能问题，从而影响整个系统），常使用缓存来处理大部分请求，如memcached、ehcache、redis等。
但对于一些恶意请求，传统缓存机制却难以应对，如存在以下场景：

接口参数为String id，业务处理为获取此参数，执行sql: select * from user where id = ? limit 1;

正常用户请求参数为25，业务处理则为

select * from user where id = 25 limit 1;

正常返回所查询的user表中id为25的信息，然后将次数据放到缓存中，如redis中，设置key=25,value=user.toString(),再有查询id 为25的请求，可以直接返回redis中的数据，而不用反复查询数据库。

但是如果存在请求参数为-1，业务处理则为

select * from user where id = -1 limit 1;

一般情况下，数据库中不会存在id为-1的数据，返回结果为空，所以缓存中也不会存在数据。如果存在大量的此类请求，则全部会打到数据库上，导致性能问题，从而影响整个系统，缓存变成了摆设，没有起到处理重复请求的作用，此种情况叫做缓存穿透

解决方案

应对缓存穿透有多种办法，如建立无效请求的黑名单，查询参数在黑名单中的，可以直接被拦截；
前级放一个有效数据的集合，记录所有的key，根据集合来判断；
等等。

但上述方案都存在很大的弊端

需要一个无限大的空间来存储不存在的信息，不合理；
集合大小和数据库大小正相关，消耗严重；
需要跟着数据库来实时更新。

布隆过滤器（Bloom Filter）

简介

是一个很长的二进制向量和一系列的随机映射函数
用于检测一个元素是否存在于某个集合中
空间效率和查询时间远优于一般算法
存在一定的误识别率并且无法删除元素

算法介绍

①使用一系列的随机映射函数将元素映射到二进制向量的上（每一个函数可设置一个1），以16位向量举例，因此可以得到

1010101000101000

②在对元素进行检测判别时

判断原理
③使用同样的映射函数完成映射，得到

0100101010000000

与上一步得到的向量按位进行或运算

        10101010001010000100101010000000----------------------------1110101010101000

得到的向量与原向量不相等，因此可以得出结论，此元素不存在于集合中。

布隆过滤器的误判别

①按照上述过程生成二进制向量

②检测某个不存在的元素
在这里插入图片描述
得到二进制向量

0010100010100000

与上一步生成的二进制向量按位进行或运算

        10101010001010000010100010100000----------------------------1010101000101000

得到的二进制向量与原向量相等，因此会被误识别为“存在”。

注：布隆过滤器的误识别率与二进制向量的长度、存储元素个数有关系。

JAVA使用

guava.jar已对布隆过滤器进行封装，在pom文件中引入依赖包

		<dependency><groupId>com.google.guava</groupId><artifactId>guava</artifactId><version>23.0</version></dependency>

创建布隆过滤器

private static BloomFilter<Integer> bloomFilter = BloomFilter.create(Funnels.integerFunnel(), total, fpp);

解析

BloomFilter.create有三个参数，分别为元素类型，最大存储个数，误判别率。

guava中的封装会根据这三个参数生成二进制向量数组LockFreeBitArray，

存储数越大，误判别率越低，这个数组就越大，理论上运算也就越慢，后面我们会对性能进行测试。
源码如下：

/*** Creates a {@link BloomFilter} with the expected number of insertions and* expected false positive probability.** <p>Note that overflowing a {@code BloomFilter} with significantly more elements than specified,* will result in its saturation, and a sharp deterioration of its false positive probability.** <p>The constructed {@code BloomFilter} will be serializable if the provided* {@code Funnel<T>} is.** <p>It is recommended that the funnel be implemented as a Java enum. This has the benefit of* ensuring proper serialization and deserialization, which is important since {@link #equals}* also relies on object identity of funnels.** @param funnel the funnel of T's that the constructed {@code BloomFilter} will use* @param expectedInsertions the number of expected insertions to the constructed*     {@code BloomFilter}; must be positive* @param fpp the desired false positive probability (must be positive and less than 1.0)* @return a {@code BloomFilter}* @since 19.0*/public static <T> BloomFilter<T> create(Funnel<? super T> funnel, long expectedInsertions, double fpp) {return create(funnel, expectedInsertions, fpp, BloomFilterStrategies.MURMUR128_MITZ_64);}@VisibleForTestingstatic <T> BloomFilter<T> create(Funnel<? super T> funnel, long expectedInsertions, double fpp, Strategy strategy) {checkNotNull(funnel);checkArgument(expectedInsertions >= 0, "Expected insertions (%s) must be >= 0", expectedInsertions);checkArgument(fpp > 0.0, "False positive probability (%s) must be > 0.0", fpp);checkArgument(fpp < 1.0, "False positive probability (%s) must be < 1.0", fpp);checkNotNull(strategy);if (expectedInsertions == 0) {expectedInsertions = 1;}/** TODO(user): Put a warning in the javadoc about tiny fpp values, since the resulting size* is proportional to -log(p), but there is not much of a point after all, e.g.* optimalM(1000, 0.0000000000000001) = 76680 which is less than 10kb. Who cares!*/long numBits = optimalNumOfBits(expectedInsertions, fpp);int numHashFunctions = optimalNumOfHashFunctions(expectedInsertions, numBits);try {return new BloomFilter<T>(new LockFreeBitArray(numBits), numHashFunctions, funnel, strategy);} catch (IllegalArgumentException e) {throw new IllegalArgumentException("Could not create BloomFilter of " + numBits + " bits", e);}

LockFreeBitArray是BloomFilter的静态内部类，通过他的构造器可以发现，实际存储的数据类型为AtomicLongArray

在这里插入图片描述
看到这里的包路径，我们就更可以安心了，线程安全。

元素检测

调用封装好的api，BloomFilter.mightContain

在这里插入图片描述
读注释发现，返回true表示可能含有此数据，返回false表示一定不包含此数据！
上述为使用布隆过滤器的思想。

布隆过滤器的使用与性能表现

package com.example.demo.test;import com.google.common.hash.BloomFilter;
import com.google.common.hash.Funnels;import java.util.ArrayList;
import java.util.List;public class BloomTest {private static int total = 10000000;private static double fpp = 0.0000001;private static BloomFilter<Integer> bloomFilter = BloomFilter.create(Funnels.integerFunnel(), total, fpp);public static void main(String[] args) {System.out.println("=======init========");Long initTime = System.currentTimeMillis();for (int i = 0; i < total; i++) {bloomFilter.put(i);}Long startTime = System.currentTimeMillis();System.out.println("init time:"+(startTime - initTime)+"ms");System.out.println("=======start========");// 存放误判的数据List<Integer> list = new ArrayList<>(1000);for (int i = total+1000; i < total + total ; i++) {if (bloomFilter.mightContain(i)) {list.add(i);}}System.out.println("误判数量:" + list.size());System.out.println("=======done========");Long endTime = System.currentTimeMillis();System.out.println("use time:"+(endTime - startTime)+"ms");}
}

声明一千万的数据量，同时进行一千万的数据检测，误判率为0.0000001

执行结果为

=init==
init time:16554ms
=start==
误判数量:2
=done==
use time:2378ms

可以看出在初始化时，较为耗时，但做一千万数据量与另一个一千万数据量的比对时，仅耗时2.3s，且只有2个误判数量。
对应到实际业务中就是，数据库存在一千万的数据量，恶意请求为一千万个不同的错误id，仅有2个id会打到数据库上，基本可以完成对恶意请求的拦截。
若想要更好的结果，可以继续优化误判别率，或是针对这不多的漏网之鱼定向捕杀。