何谓滑动窗口分词:
比如原句:woods data hadoop
分词后为:
woodswoods datawoods data hadoopdatadata hadoophadoop
创建索引,自定义分词方式:
使用ik_smart中文分词,并使用shingle过滤器(滑动窗口模式)
put test
{"settings": {"index": {"analysis": {"analyzer": {"shingle1": {"type":"custom","tokenizer":"ik_smart","filter":["shingle-filter"]}},"filter": {"shingle-filter": {"type":"shingle","min_shingle_size":2, -- 设置最小和最大的滑动窗口尺寸"max_shingle_size":10, "output_unigrams":true -- 是否保留原始的单个分词,默认true:保留}}}}},"mappings": {"my_type_name": {"dynamic": false,"_all": {"enabled": false},"properties": {"country": {"type": "keyword"},"age": {"type": "integer"}}}}
}
查询:
post test/_analyze
{"analyzer": "shingle1","text": "飞利浦吸尘器家用自营"
}
结果:
{"tokens": [{"token": "飞利浦","start_offset": 0,"end_offset": 3,"type": "CN_WORD","position": 0},{"token": "飞利浦 吸尘器","start_offset": 0,"end_offset": 6,"type": "shingle","position": 0,"positionLength": 2},{"token": "飞利浦 吸尘器 家用","start_offset": 0,"end_offset": 8,"type": "shingle","position": 0,"positionLength": 3},{"token": "飞利浦 吸尘器 家用 自营","start_offset": 0,"end_offset": 10,"type": "shingle","position": 0,"positionLength": 4},{"token": "吸尘器","start_offset": 3,"end_offset": 6,"type": "CN_WORD","position": 1},{"token": "吸尘器 家用","start_offset": 3,"end_offset": 8,"type": "shingle","position": 1,"positionLength": 2},{"token": "吸尘器 家用 自营","start_offset": 3,"end_offset": 10,"type": "shingle","position": 1,"positionLength": 3},{"token": "家用","start_offset": 6,"end_offset": 8,"type": "CN_WORD","position": 2},{"token": "家用 自营","start_offset": 6,"end_offset": 10,"type": "shingle","position": 2,"positionLength": 2},{"token": "自营","start_offset": 8,"end_offset": 10,"type": "CN_WORD","position": 3}]
}
此种分词方式适用于用户搜索词的模糊匹配,类似于hive中的rlike功能。
相较于使用match、match_phrase、regex的查询,该方式可以使用trem查询,效率大幅提高、且准确性几乎相同。