AC自动机探究（一）

1.初识AC自动机

1.1什么是AC自动机

可以简单将AC自动机理解为字典树的加强版本，使用AC自动机可以存储敏感词数据，去使用AC自动机筛选出大文本中是否有敏感词数据。

所以使用AC自动机可以简单，快速的根据AC自动机寻找到文本中的敏感词数据，统计出敏感词集合并返回。

1.2AC自动机的前身：字典树

1.2.1什么是字典树

AC自动机是由字典树发展而来的，是字典树的加强版本，AC自动机就是由字典树发展而来的。

字典树是可以使用树的方式，用最少的存储空间和较为低廉的时间成本存储/获取/删除字符串的集合结构。

1.2.2为什么要使用字典树

使用集合结构存储字符串：

假设我们使用的是数组存储，存储a，ab，abc，abcd，abcde的空间复杂度是O(N)，而且去查询字符串的时候，时间复杂度最好的情况下也是惊人的O(N^2)，所以使用数组这种结构存储字符串数据，是比较消耗性能的。

空间复杂度：O(N)

时间复杂度(查询)：O(N^2)

使用字典树存储字符串：

使用这种字典树存储结构虽然空间复杂度仍然是O(N)，但是查询速度降低到了O(N)。

字典树可以将增删改查字符串的速度降低到O(N)，是一种非常适合存储大量字符串，并且需要大量增删改查操作的情况。

1.2.3字典树的封装

java">/*** 封装的普通前缀树*/
public class PrefixTree {// 记录根节点private Node root;public PrefixTree() {root = new Node();}/*** 封装的前缀树Node节点*/private static class Node {public Node[] nexts = new Node[26];public int pass;public int end;}/*** 插入单词** @param word*/public void insert(String word) {if (word == null) return;char[] chars = word.toCharArray();Node curNode = root;curNode.pass++;int path;for (char c : chars) {path = c - 'a';if (curNode.nexts[path] == null) {curNode.nexts[path] = new Node();}curNode = curNode.nexts[path];curNode.pass++;}curNode.end++;}/*** 搜索单词** @param word* @return*/public int search(String word) {if (word == null) return -1;char[] chars = word.toCharArray();Node curNode = root;int path;for (char c : chars) {path = c - 'a';if (curNode.nexts[path] == null) {return -1;}curNode = curNode.nexts[path];}return curNode.end;}/*** 搜索前缀的数量** @param prefixWord* @return*/public int searchPrefixCount(String prefixWord) {if (prefixWord == null) return -1;char[] chars = prefixWord.toCharArray();Node curNode = root;int path;for (char c : chars) {path = c - 'a';if (curNode.nexts[path] == null) {return -1;}curNode = curNode.nexts[path];}return curNode.pass;}/*** 删除字典树中的数据** @param word*/public boolean delete(String word) {if (word == null) {throw new NullPointerException("word is null");}char[] chars = word.toCharArray();Node curNode = root;int path;// 先查询是否存在int count = search(word);if (count > 0) {return false;}// 能走到这里说明字典树中一定存在数据for (char c : chars) {path = c - 'a';if (--curNode.nexts[path].end == 0) {curNode.nexts[path] = null;break;}curNode = curNode.nexts[path];}return true;}}

1.2.4字典树结构分析

主要是看字典树中的节点分析，字典树中没什么，主要是保存了根节点，主要去分析树上的每个节点。

java">// 记录根节点
private Node root;public PrefixTree() {root = new Node();
}/*** 封装的前缀树Node节点*/
private static class Node {public Node[] nexts = new Node[26];public int pass;public int end;
}

字典树上的节点中主要有三个字段：nexts字段，pass字段，end字段。

nexts字段中存储的是节点下面的通路，表示字典树下面的子节点数组。

pass字段主要用来表示节点通过的次数，主要可以通过这个字段查找前缀数据。

end字段主要用来表示节点作为结尾的次数，主要可以通过这个字段查找字典树中存储了多少数据。

1.3AC自动机对字典树的加强

AC自动机加强了字典树，主要是体现在fail指针上。

AC自动机的节点上增加了fail指针，节点定义如下：

java">private static class Node {public Node[] nexts;public String endStr;public boolean endUse;public Node fail;
}

其他的先不用管，先分析这个fail指针到底是做什么的：

1.头节点的fail指针指向null。

2.头节点的子节点的fail指针都指向头节点。

3.其他节点的fail节点的指向法则是这样的：先找头节点的fail指针，然后去找头节点fail指针有没有和自己对应路线一样的节点，如果有则指向，没有则继续找fail指针，知道找到fail指针指向null节点为止。

fail指针的设计可以将查询文章中对应敏感词的复杂度大大降低，因为如果仅仅使用字典树去查询文章中的敏感词，是个非常缓慢的过程，接下来分析一下使用字典树查询文章中敏感词。

1.4字典树实现敏感词查询

使用字典树实现敏感词检索是非常复杂的，虽然我们可以借助字典树快速检索字符串是否存在，但是文章分割，太难了。

假设我们的文章一共有四个字abcd，敏感词为：abcd，abc，bc，d。现在尝试检索出所有敏感词：

java">/*** 搜索敏感词** @param article* @return*/
public Set<String> searchTouchy(String article) {if (article == null) throw new NullPointerException();if (article.isEmpty()) return Collections.emptySet();Set<String> res = new HashSet<>();int end = 0;while (end < article.length()) {for (int p1 = 0; p1 <= end; p1++) {for (int p2 = 0; p2 <= p1; p2++) {try {String word = article.substring(p2, p1 + 1);int count = search(word);if (count > 0 && !res.contains(word)) {res.add(word);}} catch (Exception e) {throw e;}}}end++;}return res;
}

这个算法是一个时间复杂度为O(N^3)的算法，不推荐使用，效率太低了。

但是如果我们使用的是AC自动机，就可以将时间复杂度降低到O(N^2)，接下来我们从AC自动机的功能入手，解密AC自动机的前生今世。

2.实战篇

2.1AC自动机的插入和构建

2.1.1初探AC自动机

在了解AC自动机实现查找敏感词之前，需要先构造整个AC自动机，AC自动机说白了就是一个加强版本的字典树，在节点上加强出了fail指针。

构建AC自动机采用的手段是：先按字典树的逻辑插入所有敏感词，最终再将所有节点的fail节点构建起来。

2.1.2AC自动机的插入

AC自动机的插入其实和字典树的插入没有什么区别，不再多说。

java">/*** 插入数据** @param word*/
public void insert(String word) {if (word == null) return;Node curNode = root;int path;char[] chars = word.toCharArray();for (char c : chars) {path = c - 'a';if (curNode.nexts[path] == null) {curNode.nexts[path] = new Node();}curNode = curNode.nexts[path];}curNode.endStr = word;
}

2.1.3AC自动机fail指针构建

code的实现如下，基本思路就是就是，宽度优先遍历，每次遍历到一个节点的时候，在加入到队列的过程中，将子节点的fail指针构建指向。fail指针的构建：1.头节点的所有子节点的fail指针指向头节点。2.其他子节点去找父节点的fail节点，如果父节点的fail节点next节点中有子节点的对应节点，就指向对应节点，如果没有对应节点就继续找fail节点的fail节点，直到fail节点为null。

java">/*** 构建build*/
public void build() {if (root == null) return;// 宽度有限遍历Queue<Node> queue = new LinkedList<>();Node curNode = root;Node fail = null;Node next = null;queue.add(curNode);while (!queue.isEmpty()) {curNode = queue.poll();for (int index = 0; index < curNode.nexts.length; index++) {next = curNode.nexts[index];if (next != null) {// 默认设置为rootnext.fail = root;// 处理非默认情况fail = curNode.fail;while (fail != null) {if (fail.nexts[index] != null) {next.fail = fail.nexts[index];}fail = fail.fail;}queue.add(curNode.nexts[index]);}}}
}

fail指针构建的流程图：

后续会去分析这样做的合理性，先看懂流程，再分析出来原理。

2.2AC自动机实现查找敏感词

2.2.1实现步骤

使用AC自动机可以很轻松查找敏感词，具体步骤如下：

从头节点开始。

1.判断头节点的子节点有没有符合的路径，如果没有则跳过。

2.如果有则向下走，走成功一次之后，就借助fail指针转一个圈子，将所有数据比较一下。

3.走到不能走的时候，就向fail指针走，如果有就继续向下走，没有就继续沿着fail指针继续走，直到走到根节点为止。

4.如果在中途回到了根节点，查看根节点下面是否有路，有路就走，无路跳过。

用文字表述可能有点难以理解，接下来用流程图的方式表述：

简化一下实现的步骤：

其实实现步骤可以简单理解的，首先我们必须要认清的本质是：fail指针指到最后，都能指回到根节点。

只要可以把握住这个本质，很多问题就迎刃而解了：从根节点出发，只要能走下去，每走一步都要顺着fail指针转一圈收集答案，收集到根节点为止，即答案收集完毕。继续沿着原有节点向下走，如果走不动了，就跳转到自己的fail节点，直到跳到能走/根节点（其实这就是转子收缩，记住这个概念，原理分析要用到），跳到根节点就跳到下一个字符重复流程。

2.2.2实现代码

java">/*** 搜索AC自动机中存储的敏感词** @param article* @return*/
public List<String> searchTouchy(String article) {if (article == null) throw new NullPointerException();if (article.isEmpty()) return Collections.emptyList();List<String> res = new ArrayList<>();char[] words = article.toCharArray();Node curNode = root;int path;for (char word : words) {path = word - 'a';// 不符合条件就绕着fail转起来, 走转子判断流程if (curNode.nexts[path] == null && curNode != root) {curNode = curNode.fail;}// 转到不能转为止curNode = curNode.nexts[path] == null ? root : curNode.nexts[path];// 如果不是头节点, 也就是能走/转成功了, 就走转子收集流程Node follow = curNode;while (follow != null) {// 转子已走if (follow.endUse) {break;}if (follow.endStr != null) {res.add(follow.endStr);follow.endUse = true;}// 继续转follow = follow.fail;}}return res;
}

2.2.3测试对比

2.2.3.1封装测试工具

测试工具我们使用SpringAOP和Spring的StopWatch实现。

封装Aspect切面：

java">import org.aspectj.lang.ProceedingJoinPoint;
import org.aspectj.lang.annotation.Around;
import org.aspectj.lang.annotation.Aspect;
import org.aspectj.lang.annotation.Before;
import org.springframework.stereotype.Component;
import org.springframework.util.StopWatch;@Aspect
@Component
public class ExecutionTimeTestAspect {@Around("@annotation(com.xinghai.annotation.TimeTest)")public Object observerExecTime(ProceedingJoinPoint joinPoint) throws Throwable {StopWatch stopWatch = new StopWatch();stopWatch.start();Object res = joinPoint.proceed();stopWatch.stop();System.out.println("执行的时间: " + stopWatch.getLastTaskTimeNanos() + "ns");return res;}}

封装注解：

java">import java.lang.annotation.*;@Documented
@Target(ElementType.METHOD)
@Retention(RetentionPolicy.RUNTIME)
public @interface TimeTest {
}

在测试方法上添加TimeTest注解：

java">/*** 搜索敏感词** @param article* @return*/
@TimeTest
public Set<String> searchTouchy(String article) {
}

java">/*** 搜索AC自动机中存储的敏感词** @param article* @return*/
@TimeTest
public List<String> searchTouchy(String article) {
}

2.2.3.2测试方法

使用以下测试方案：

java">@Test
public void test() {ApplicationContext applicationContext = new AnnotationConfigApplicationContext(SpringConfig.class);ACAutomaton touchy = applicationContext.getBean(ACAutomaton.class);touchy.insert("abcd");touchy.insert("abc");touchy.insert("bc");touchy.insert("d");touchy.insert("fuck");touchy.build();System.out.println(touchy.searchTouchy("adhasuidhoiadjfuckhioasdhjiaddddshdbujikashduiawhdiukasjhbdjkasgjkhyfdgasuydgasuijhdgasjdgahjskdghjaskda"));PrefixTree_SearchTouchyForArticle touchy2 = applicationContext.getBean(PrefixTree_SearchTouchyForArticle.class);touchy2.insert("abcd");touchy2.insert("abc");touchy2.insert("bc");touchy2.insert("d");touchy2.insert("fuck");System.out.println(touchy2.searchTouchy("adhasuidhoiadjfuckhioasdhjiaddddshdbujikashduiawhdiukasjhbdjkasgjkhyfdgasuydgasuijhdgasjdgahjskdghjaskda"));
}

2.2.3.3测试结果

执行时间：AC自动机61700ns，字典树38739300ns，字典树的时间消耗是AC自动机的627倍，指数级增长。

接下来我们进行更专业的测试。随机生成一段文本，使用AC自动机和字典树检测敏感词。

2.2.4测试工具封装

由于这里我使用了策略者模式，过度设计了一下，主要是我考虑到以后随机生成字符串可能具有一定通用性，以后可能还要在其基础上进行封装改善，所以就先过度设计了，这里只给出核心代码：

java">import com.xinghai.strategy.strgenerate.stg.GenerateRandomStr;public class GenerateSmallLetterStr implements GenerateRandomStr {@Overridepublic String generate(int count) {// 准备初始字符char init = 'a';// 准备StringBuilder拼接StringBuilder builder = new StringBuilder();for (int index = 0; index < count; index++) {// 生成0-25的偏移量int offset = (int) (Math.random() * 26);builder.append((char) (init + offset));}return builder.toString();}
}

2.2.5精准测试

使用封装的工具生成一千个字符的文章，看看我们算法的速度怎么样：

java">@Test
public void test() {ApplicationContext applicationContext = new AnnotationConfigApplicationContext(SpringConfig.class);String words = GenerateRandomStr.generateRandomStr(1000, StrTypeEnum.SmallLetter);ACAutomaton touchy = applicationContext.getBean(ACAutomaton.class);touchy.insert("abcd");touchy.insert("abc");touchy.insert("bc");touchy.insert("d");touchy.insert("fuck");touchy.build();System.out.println(touchy.searchTouchy(words));PrefixTree_SearchTouchyForArticle touchy2 = applicationContext.getBean(PrefixTree_SearchTouchyForArticle.class);touchy2.insert("abcd");touchy2.insert("abc");touchy2.insert("bc");touchy2.insert("d");touchy2.insert("fuck");System.out.println(touchy2.searchTouchy(words));
}

这....，字典树的速度整整比AC自动慢了24万5663倍。

可以看到AC自动机相对于字典树的暴力检测来说，速度提升的非常显著，数据量越大越显著。

接下来探究AC自动机的原理。

3.原理篇

原理篇将在下一篇文章中更新，将会以最细节的方式更新，因为作者最近自由意志沉沦，实在是无力爆肝了，而且原理这方面真的需要作者花费时间讲清楚，作者不希望草草了之，下一篇会以万字干活的形式更新。

3.1原理篇总览

1.尽力放弃原则。

2.转子收缩优化。

3.2尽力放弃原则

为什么字典树的复杂度这么高？而且随着word文本量增大，算法时间消耗激增？

因为想要从字典树中找出所有敏感词，需要把传入的文章中的所有的字符串枚举出，时间复杂度是O(M^2)，不管是行不行的，全尝试一遍，这就不符合尽力放弃原则

但是AC自动机整体借助转子收缩优化，尽力去放弃一些不可能的字符串，将时间复杂度大幅度降低。=

4.结语

我们总要花点时间去做自己的事情，人生只有一次，勇敢一点，再差又能如何呢？又怎能甘愿措施机会呢？作者没更新完原理篇纯粹是因为，作者陷入了一个需要花费精力解决的事情，作者不要错失，因为作者渴望一个结果。