什么是聚合查询
聚合查询可以对照数据库理解
注意事项
用于进行聚合的字段必须是exact value,分词字段不可进行聚合,对于text字段如果需要使用聚合,需要开启fielddata,但是通常不建议,因为fielddata是将聚合使用的数据结构由磁盘(doc_values)变为了堆内存(field_data),大数据的聚合操作很容易导致OOM
聚合分类
分桶聚合(Bucket agregations):类比SQL中的group by的作用,主要用于统计不同类型数据的数量
指标聚合(Metrics agregations):主要用于最大值、最小值、平均值、字段之和等指标的统计
管道聚合(Pipeline agregations):用于对聚合的结果进行二次聚合,如要统计绑定数量最多的标签bucket,就是要先按照标签进行分桶,再在分桶的结果上计算最大值。
基本语法
GET test_idx_aggs/_search
{"aggs": {"<aggs_name>": {"<agg_type>": {"field": "<field_name>"}}}
}
aggs_name:聚合函数的名称
agg_type:聚合种类,比如是桶聚合(terms)或者是指标聚合(avg、sum、min、max等)
field_name:字段名称或者叫域名。
聚合测试脚本
为了节省你的时间,把用到的脚本提前准备好
测试数据
# 聚合查询
DELETE test_idx_aggs
## 定义索引
PUT test_idx_aggs
{"mappings" : {"properties" : {"createtime" : {"type" : "date"},"date" : {"type" : "date"},"desc" : {"type" : "text","fields" : {"keyword" : {"type" : "keyword","ignore_above" : 256}},"analyzer":"ik_max_word"},"lv" : {"type" : "text","fields" : {"keyword" : {"type" : "keyword","ignore_above" : 256}}},"name" : {"type" : "text","analyzer":"ik_max_word","fields" : {"keyword" : {"type" : "keyword","ignore_above" : 256}}},"price" : {"type" : "long"},"tags" : {"type" : "text","fields" : {"keyword" : {"type" : "keyword","ignore_above" : 256}}},"type" : {"type" : "text","fields" : {"keyword" : {"type" : "keyword","ignore_above" : 256}}}}}
}## 测试数据
PUT /test_idx_aggs/_doc/1
{"name" : "小米手机","desc" : "手机中的战斗机","price" : 3999,"lv":"旗舰机","type":"手机","createtime":"2020-10-01T08:00:00Z","tags": [ "性价比", "发烧", "不卡顿" ]
}
PUT /test_idx_aggs/_doc/2
{"name" : "小米NFC手机","desc" : "支持全功能NFC,手机中的滑翔机","price" : 4999,"lv":"旗舰机","type":"手机","createtime":"2020-05-21T08:00:00Z","tags": [ "性价比", "发烧", "公交卡" ]
}
PUT /test_idx_aggs/_doc/3
{"name" : "NFC手机","desc" : "手机中的轰炸机","price" : 2999,"lv":"高端机","type":"手机","createtime":"2020-06-20","tags": [ "性价比", "快充", "门禁卡" ]
}
PUT /test_idx_aggs/_doc/4
{"name" : "小米耳机","desc" : "耳机中的黄焖鸡","price" : 999,"lv":"百元机","type":"耳机","createtime":"2020-06-23","tags": [ "降噪", "防水", "蓝牙" ]
}
PUT /test_idx_aggs/_doc/5
{"name" : "红米耳机","desc" : "耳机中的肯德基","price" : 399,"type":"耳机","lv":"百元机","createtime":"2020-07-20","tags": [ "防火", "低音炮", "听声辨位" ]
}
PUT /test_idx_aggs/_doc/6
{"name" : "小米手机10","desc" : "充电贼快掉电更快,超级无敌望远镜,高刷电竞屏","price" : "","lv":"旗舰机","type":"手机","createtime":"2020-07-27","tags": [ "120HZ刷新率", "120W快充", "120倍变焦" ]
}
PUT /test_idx_aggs/_doc/7
{"name" : "挨炮 SE2","desc" : "除了CPU,一无是处","price" : "3299","lv":"旗舰机","type":"手机","createtime":"2020-07-21","tags": [ "割韭菜", "割韭菜", "割新韭菜" ]
}
PUT /test_idx_aggs/_doc/8
{"name" : "XS Max","desc" : "听说要出新款12手机了,终于可以换掉手中的4S了","price" : 4399,"lv":"旗舰机","type":"手机","createtime":"2020-08-19","tags": [ "5V1A", "4G全网通", "大" ]
}
PUT /test_idx_aggs/_doc/9
{"name" : "小米电视","desc" : "70寸性价比只选,不要一万八,要不要八千八,只要两千九百九十八","price" : 2998,"lv":"高端机","type":"耳机","createtime":"2020-08-16","tags": [ "巨馍", "家庭影院", "游戏" ]
}
PUT /test_idx_aggs/_doc/10
{"name" : "红米电视","desc" : "我比上边那个更划算,我也2998,我也70寸,但是我更好看","price" : 2999,"type":"电视","lv":"高端机","createtime":"2020-08-28","tags": [ "大片", "蓝光8K", "超薄" ]
}
PUT /test_idx_aggs/_doc/11
{"name": "红米电视","desc": "我比上边那个更划算,我也2998,我也70寸,但是我更好看","price": 2998,"type": "电视","lv": "高端机","createtime": "2020-08-28","tags": ["大片","蓝光8K","超薄"]
}
查询语法
简单查询
GET test_idx_aggs/_search
{// 这个含义是不要显示hit的source数据信息,只要聚合信息"size": 0, "aggs": {"test_aggs_price": {"sum": {"field": "price"}}}
}
桶聚合 Bucket
分类,不同类一个桶; 比如根据标签分类
桶聚合查询
## 按照不同标签统计数量
GET test_idx_aggs/_search
{// 这个含义是不要显示hit的source数据信息,只要聚合信息"size": 0, "aggs": {"test_aggs_tags": {// 使用terms分桶 "terms": {// 这样就是桶查询; 这里为什么用tags.keyword而不是用tags因为tags是text类型默认没有被正排索引所以无法使用聚合,需要设置fielddata为true才可以使用"field": "tags.keyword",// 限制桶数量或者也可以理解为分页"size": 99999,// 还支持很多其他属性,kibana可以看到提示,也可以官网查看手册"order": {"_key": "asc"}}}}
}## 设置text使用聚合查询
POST test_idx_aggs/_mapping
{"properties": {// 你的字段名"tags": {// 之前属性保持一致"type": "text",// 设置fielddata为true即可"fielddata": true}}
}
指标聚合
最大/小; 平均,计数,求和;基数去重, top; 统计聚合
指标查询语法
## 自定义多指标查询
GET test_idx_aggs/_search
{"size": 0, "aggs": {"test_aggs_name_price_max": {"max": {"field": "price"}},"test_aggs_name_price_min": {"min": {"field": "price"}}}
}## 使用`stats`进行统计查询, 会列出所有指标
GET test_idx_aggs/_search
{"size": 0, "aggs": {"test_aggs_name_stats": {"stats": {"field": "price"}}}
}
去重统计
使用cardinality
关键字
GET test_idx_aggs/_search
{"size": 0, "aggs": {"test_aggs_name_distinct_cnt": {// 去重"cardinality": {// 注意这里如果分词就不准确了,并且非正排索引的text类型不支持 "field": "name.keyword"}}}
}
管道聚合
类似linux命令行的管道,对聚合的结果进行多次聚合
管道聚合语法
GET test_idx_aggs/_search
{"size": 0, "aggs": {// 第一层: 按照type聚合"test_agg_name_groupby_type_bucket": {// terms 就是分桶关键字"terms": {"field": "type.keyword","size": 99999},// 第二层: 计算按照type聚合后的平均价格"aggs": {"test_aggname_price_avg": {"avg": {"field": "price"}}}}}
}
继续聚合
GET test_idx_aggs/_search
{"size": 0, "aggs": {"test_agg_name_groupby_type_bucket": {// terms 就是分桶关键字"terms": {"field": "type.keyword","size": 99999},// 第二层聚合"aggs": {"test_aggname_price_avg": {"avg": {"field": "price"}}}},// 要计算按照type聚合后平均价格最小值,就放这个平级位置"test_aggname_min_typebucket": {"min": {"field": "test_agg_name_groupby_type_bucket>test_aggname_price_avg"}}}
}
嵌套聚合
嵌套聚合就是这种场景:比如先按照品牌分桶, 然后不同品牌在按照同品牌下产品矩阵分桶比如旗舰机, 老年机.
嵌套聚合语法
GET test_idx_aggs/_search
{"size": 0, "aggs": {"test_aggname_bucket_type": {"terms": {"field": "type.keyword","size": 9999},"aggs": {"test_aggname_bucket_lv": {"terms": {"field": "lv.keyword","size": 9999}}}}}
}
其他
GET test_idx_aggs/_search
{"size": 0, "aggs": {"type_agg": {"terms": {"field": "type.keyword"},"aggs": {"lv_agg": {"terms": {"field": "lv.keyword"},"aggs": {"price_stats": {"stats": {"field": "price"}},"tags_buckets": {"terms": {"field": "tags.keyword"}}}}}}}
}## 统计每个商品类型中 不同档次分类商品中 平均价格最低的档次
GET test_idx_aggs/_search
{"size": 0,"aggs": {"type_bucket": {"terms": {"field": "type.keyword"},"aggs": {"lv_bucket": {"terms": {"field": "lv.keyword"},"aggs": {"price_avg": {"avg": {"field": "price"}}}},"min_bucket": {"min_bucket": {"buckets_path": "lv_bucket>price_avg"}}}}}
}
聚合和查询的相互关系
基于query或filter的聚合
GET test_idx_aggs/_search
{//"size": 0, "query": {"range": {"price": {"gte": 4000}}},// 要聚合谁就写在谁的平级位置"aggs": {"test_aggname_bucket_type": {"terms": {"field": "type.keyword","size": 9999}}}
}
基于filter聚合
GET test_idx_aggs/_search
{"size": 0, "query": {"constant_score": {"filter": {"range": {"price": {"gte": 10,"lte": 99999}}},"boost": 1.2}},"aggs": {"test_aggname_bucket_types": {"terms": {"field": "type.keyword","size": 99999}}}
}
基于聚合结果的查询
场景: 聚合后再过滤或者查询
但是注意query结果是过滤的,但是统计信息没有过滤
GET test_idx_aggs/_search
{"aggs": {"aggname_bucket_types": {"terms": {"field": "type.keyword","size": 99999}}}
}
## 先聚合后过滤
GET test_idx_aggs/_search
{"aggs": {"aggname_bucket_types": {"terms": {"field": "type.keyword","size": 99999}}}// post_filter 是关键,"post_filter": {"range": {"price": {"gte": 4000}}}
}
查询条件的作用域
场景: 某些统计是基于过滤器比如统计4000以上的max,min,avg,但是还有一个指标想统计所有的价格的avg
GET test_idx_aggs/_search
{"size": 0, "aggs": {"aggname_max_price": {"max": {"field": "price"}},"aggname_avg_price": {"avg": {"field": "price"}},"aggname_avg_all": {// global就表示不受queyr或者filter限制// 当然global也可以改成filter,用来统计特定条件下的聚合数据"global": {},// 注意aggs的层级"aggs": {"aggname_avg_all_inner": {"avg": {"field": "price"}}}}},"query": {"range": {"price": {"gte": 4000}}}
}
聚合排序
前边脚本已经写过了.
这里写一个多层聚合的不同排序
GET test_idx_aggs/_search
{"size": 0,"aggs": {"aggname_bucket_type": {"terms": {"field": "type.keyword","order": {"_count": "asc"}},// 一定要注意这个层级位置"aggs": {"aggname_bck_lv2": {"terms": {"field": "lv.keyword","size": 9999,"order": {"_count": "desc"}}}}}}
}
多层级排序字段 示例
GET test_idx_aggs/_search
{"size": 0,"aggs": {"tag_avg_price": {"terms": {"field": "type.keyword","size": 9999,"order": {// `>`表示其下一级"test_aggname_stats>test_aggname_sum.sum": "desc"}},"aggs": {"test_aggname_stats": {"filter": {"terms": {"type.keyword": ["耳机", "手机", "电视"]}},"aggs": {"test_aggname_sum": {"extended_stats": {"field": "price"}}}}}}}
}
排序规则
多级排序
即排序的优先级,按照外层优先的顺序
多层排序
即按照多层聚合中的里层某个聚合的结果进行排序
常用的查询函数
histogram
## 按价格区间分桶,统计计数
GET test_idx_aggs/_search?size=0
{"aggs": {"test_range_bucket": {"range": {"field": "price","ranges": [{"from": 0,"to": 1000},{"from": 1000,"to": 2000},{"from": 2000,"to": 3000},{"from": 4000,"to": 5000}]}}}
}## histogram实现## histogram实现
GET test_idx_aggs/_search?size=0
{"aggs": {"test_historam": {"histogram": {// field如果date类型, 可以使用fix_interval支持毫秒/秒/分钟/小时/天, 1d表示一天; interval支持month,year,day,hour等等// 甚至支持格式化"format""field": "price","interval": 1000,// 是否用key value形式"keyed": true,// count >= min_doc_count才显示"min_doc_count": 0,// 空值处理,为空设置一个默认值"missing": 999}}}
}## 日期histogram
GET test_idx_aggs/_search?size=0
{"aggs": {"test_historam": {"date_histogram": {// field如果date类型, 可以使用fix_interval支持毫秒/秒/分钟/小时/天, 1d表示一天; calendar_interval支持month,year,day,hour等等// 甚至支持格式化"format""field": "createtime","interval": "day",// 是否用key value形式"keyed": true,// count >= min_doc_count才显示"min_doc_count": 1,"extended_bounds": {"min": "2020-05","max": "2029-12"}}}}
}
GET test_idx_aggs/_search?size=0
{"aggs": {"test_historam": {"date_histogram": {"field": "createtime",// 注意这里的 calendar_interval,fix_interval,interval// 注意这里即使不存在也会生成桶,可能造成大量的数据,造成服务不可用,如设置毫秒"calendar_interval": "day","keyed": true,"min_doc_count": 1,// 这块限制桶产生的区间, 可以避免产生过多的数据"extended_bounds": {"min": "2020-05","max": "2029-12"}}}}
}GET test_idx_aggs/_search?size=0
{"aggs": {"test_historam": {"date_histogram": {"field": "createtime",// 间隔"calendar_interval": "month","keyed": true,// 支持过滤"min_doc_count": 0,// 支持forma"format": "yyyy-MM", // 支持范围限制"extended_bounds": {"min": "2020-01","max": "2020-12"}}}}
}## 支持自动感知分桶
GET test_idx_aggs/_search?size=0
{"aggs": {"test_historam": {"auto_date_histogram": {"field": "createtime","format": "yyyy-MM"}}}
}
cumulate
累加和
GET test_idx_aggs/_search?size=0
{"aggs": {"test_historam": {"date_histogram": {"field": "createtime","interval": "month","format": "yyyy-MM", "keyed": true,"extended_bounds": {"min": "2020-01","max": "2029-12"}}// 基于上边分桶进行统计,比如统计每月销售额,"aggs": {"sum_total": {"sum": {"field": "price"}}// 比如我们要统计累加销售额怎么办呢如2月份统计1+2月份,3月份统计1+2+3月份,以此类推,"my_acumulate":{"cumulative_sum": {// 这个path是要你指定到底使用哪个值,这里我们使用sub_total统计的sum price"buckets_path": "sum_total"}}}}}
}
percentile
饼图,百分比
它的数据先按照指定的字段分段,然后统计每个段极其以内的占比
GET test_idx_aggs/_search?size=0
{"aggs": {"my_percentil_statics": {"percentiles": {"field": "price","percents": [1,5,25,50,75,95,99]}}}
}
## 查出来结果类似,表示每种价格之内的占比"aggregations" : {"my_percentil_statics" : {"values" : {"1.0" : 399.00000000000006, "5.0" : 399.0, // 表示399之内的价格占比"25.0" : 2998.0,// 表示2998以内的占比"50.0" : 2999.0,"75.0" : 3999.0,"95.0" : 4999.0,"99.0" : 4999.0}}}
GET test_idx_aggs/_search?size=0
{"aggs": {"my_percentil_statics": {"percentile_ranks": {"field": "price","values": [1000, // 表示小于1000的占比多少3000, // 小于3000的占比多少5000]}}}
}
聚合原理初步
底层存储
聚合使用doc_values数据结构(正排索引)
只有在正排索引的字段上才支持聚合查询
如text的类型就默认不创建正排索引的.
keyword默认创建正排索引
fielddata是搜索时动态创建,放在内存中,不像doc_values是提前放到磁盘中的,需要注意内存使用问题
另外注意如果使用fieldata会被分词,统计结果可能不准确, 所以还是建议直接keyword,这样不会被分词
设置fielddata
如果强行对text类型进行聚合查询,可以设置field_data
POST test_idx_aggs/_mapping
{"properties": {// 你的字段名"tags": {// 之前属性保持一致"type": "text",// 设置fielddata为true即可"fielddata": true}}
}POST test_idx_aggs/_mapping
{"properties": {"name": {"type": "text",// 如果是中文要注意分词"analyzer": "ik_max_word","fielddata": true}}
}
优化点
仅检索的字段一般不创建正排索引; 需要聚合的才建