Implementing uploaded-file search with Elasticsearch

news/2024/12/11 17:03:09/

Upload a file, read its content as Base64, and let Elasticsearch's ingest-attachment text extraction pipeline convert it to text for storage.
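The first half of that flow (file bytes to Base64, wrapped in a JSON document) can be sketched with the JDK alone. This is only an illustration: the class AttachmentPayload and its methods are hypothetical names that do not appear in the project, and a real project would build the JSON with a library instead of string concatenation.

```java
import java.util.Base64;

/** Hypothetical helper for illustration; not part of the original project. */
final class AttachmentPayload {

    /** Encodes raw file bytes as Base64, the form the attachment processor reads. */
    static String toBase64(byte[] fileBytes) {
        return Base64.getEncoder().encodeToString(fileBytes);
    }

    /** Builds a minimal JSON document whose "content" field holds the Base64 data. */
    static String toJson(String fileName, byte[] fileBytes) {
        // Base64 output only uses [A-Za-z0-9+/=], so it needs no JSON escaping.
        // The file name gets minimal escaping (backslash and double quote only).
        String safeName = fileName.replace("\\", "\\\\").replace("\"", "\\\"");
        return "{\"fileName\":\"" + safeName + "\",\"content\":\"" + toBase64(fileBytes) + "\"}";
    }
}
```

Base64 makes the payload roughly one third larger than the file, which is worth keeping in mind for large uploads.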

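Install the plugin

Install from the command line (recommended)

1. Go to the Elasticsearch installation directory
2. Install with the elasticsearch-plugin command
bin/elasticsearch-plugin install ingest-attachment
3. Restart Elasticsearch
# If it runs as a system service
sudo systemctl restart elasticsearch
# Or restart it directly
./bin/elasticsearch

Check that the installation succeeded

1. Check whether ingest-attachment exists under elasticsearch-7.17.12\plugins
2. List the installed plugins
bin/elasticsearch-plugin list
3. Open the Kibana Dev Tools console and check through the API
GET /_nodes/plugins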
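The directory check from step 1 can also be done from Java, for example in a startup health check. A minimal sketch, assuming you know the Elasticsearch installation directory; PluginCheck and isInstalled are hypothetical names used only for this illustration.

```java
import java.nio.file.Files;
import java.nio.file.Path;

/** Hypothetical helper for illustration; not part of the original project. */
final class PluginCheck {

    /**
     * Returns true if <esHome>/plugins/<pluginName> exists as a directory.
     * esHome is the Elasticsearch installation directory (an assumption of this sketch).
     */
    static boolean isInstalled(Path esHome, String pluginName) {
        return Files.isDirectory(esHome.resolve("plugins").resolve(pluginName));
    }
}
```

This only proves the files are on disk; the node still has to be restarted before the plugin shows up in GET /_nodes/plugins.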

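Alternatively, download the plugin package manually and unzip it into the plugins directory. The plugin version must match your Elasticsearch version exactly; the link below is for 7.17.0.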

https://artifacts.elastic.co/downloads/elasticsearch-plugins/ingest-attachment/ingest-attachment-7.17.0.zip

Add FileEsDTO, the entity used to talk to ES

Keep only the fields that need to take part in queries, preferably ones that rarely change

package com.xiaofei.site.search.model.dto.file;

import lombok.Data;
import org.springframework.data.annotation.Id;
import org.springframework.data.elasticsearch.annotations.Document;
import org.springframework.data.elasticsearch.annotations.Field;
import org.springframework.data.elasticsearch.annotations.FieldType;

import java.io.Serializable;
import java.util.Date;

/**
 * ES wrapper class for files
 */
@Document(indexName = "file_v3")
@Data
public class FileEsDTO implements Serializable {

    private static final String DATE_TIME_PATTERN = "yyyy-MM-dd'T'HH:mm:ss.SSS'Z'";

    /**
     * id
     */
    @Id
    private Long id;

    /**
     * File name
     */
    @Field(type = FieldType.Text, analyzer = "ik_max_word", searchAnalyzer = "ik_smart")
    private String fileName;

    /**
     * File type
     */
    @Field(type = FieldType.Keyword)
    private String fileType;

    /**
     * Extracted text content
     */
    @Field(type = FieldType.Text, analyzer = "ik_max_word", searchAnalyzer = "ik_smart")
    private String content;

    /**
     * File description
     */
    @Field(type = FieldType.Text, analyzer = "ik_max_word", searchAnalyzer = "ik_smart")
    private String description;

    /**
     * ID of the uploading user
     */
    @Field(type = FieldType.Long)
    private Long userId;

    /**
     * Business type
     */
    @Field(type = FieldType.Keyword)
    private String biz;

    /**
     * Download count
     */
    @Field(type = FieldType.Integer)
    private Integer downloadCount;

    /**
     * Creation time
     */
    @Field(index = false, store = true, type = FieldType.Date, format = {}, pattern = DATE_TIME_PATTERN)
    private Date createTime;

    /**
     * Update time
     */
    @Field(index = false, store = true, type = FieldType.Date, format = {}, pattern = DATE_TIME_PATTERN)
    private Date updateTime;

    private Integer isDelete;

    private static final long serialVersionUID = 1L;
}

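Add the text extraction pipeline

The field option points at content, the field that carries the Base64 file data; the extracted text is written to the attachment target field, and the original content field is then removed.

# Text extraction pipeline
PUT /_ingest/pipeline/attachment
{
  "description": "Extract file content",
  "processors": [
    {
      "attachment": {
        "field": "content",
        "target_field": "attachment",
        "indexed_chars": -1
      }
    },
    {
      "remove": {
        "field": "content"
      }
    }
  ]
}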
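Before wiring the pipeline into the upload code, it can be tried out with Elasticsearch's simulate pipeline API (POST /_ingest/pipeline/attachment/_simulate). The sketch below only builds that request with the JDK 11+ HTTP client types and does not send it. PipelineSimulation is a hypothetical class name, and the base URL (for example http://localhost:9200) is an assumption about your environment.

```java
import java.net.URI;
import java.net.http.HttpRequest;
import java.nio.charset.StandardCharsets;
import java.util.Base64;

/** Hypothetical helper for illustration; not part of the original project. */
final class PipelineSimulation {

    /** Builds the _simulate body: one sample document with Base64 data in "content". */
    static String body(byte[] sampleBytes) {
        String base64 = Base64.getEncoder().encodeToString(sampleBytes);
        return "{\"docs\":[{\"_source\":{\"content\":\"" + base64 + "\"}}]}";
    }

    /** Builds (but does not send) the request; baseUrl is an assumption of this sketch. */
    static HttpRequest request(String baseUrl, byte[] sampleBytes) {
        return HttpRequest.newBuilder(URI.create(baseUrl + "/_ingest/pipeline/attachment/_simulate"))
                .header("Content-Type", "application/json")
                .POST(HttpRequest.BodyPublishers.ofString(body(sampleBytes), StandardCharsets.UTF_8))
                .build();
    }
}
```

To run it against a real cluster, pass the request to HttpClient.newHttpClient().send(request, HttpResponse.BodyHandlers.ofString()); a secured cluster would also need authentication headers, which this sketch leaves out.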

Create the ES index

PUT /file_v3
{
  "mappings": {
    "properties": {
      "id": { "type": "long" },
      "fileName": {
        "type": "text",
        "analyzer": "ik_max_word",
        "search_analyzer": "ik_smart",
        "fields": {
          "keyword": { "type": "keyword", "ignore_above": 256 }
        }
      },
      "fileType": { "type": "keyword" },
      "content": { "type": "binary" },
      "attachment": {
        "properties": {
          "content": {
            "type": "text",
            "analyzer": "ik_max_word",
            "search_analyzer": "ik_smart"
          },
          "content_type": { "type": "keyword" },
          "language": { "type": "keyword" },
          "title": { "type": "text" }
        }
      },
      "description": {
        "type": "text",
        "analyzer": "ik_max_word",
        "search_analyzer": "ik_smart"
      },
      "userId": { "type": "long" },
      "biz": { "type": "keyword" },
      "isDelete": { "type": "integer" },
      "createTime": { "type": "date" },
      "updateTime": { "type": "date" }
    }
  }
}

Modify the file upload method

Add a step that syncs the text content to ES

public interface FileService extends IService<FilePo> {
    ...
    FilePo uploadFile(MultipartFile file, FileUploadBizEnum fileUploadBizEnum);
    ...
}

@Service
@Slf4j
public class FileServiceImpl extends ServiceImpl<FileMapper, FilePo> implements FileService {

    @Resource
    private RestHighLevelClient restHighLevelClient;

    @Override
    public FilePo uploadFile(MultipartFile file, FileUploadBizEnum fileUploadBizEnum) {
        try {
            ...
            int insert = fileMapper.insert(filePo);
            if (insert > 0) {
                // Sync to ES
                boolean esUpload = uploadFileToEs(filePo, dest);
                // 4. Delete the temporary file
                dest.delete();
                if (!esUpload) {
                    throw new BusinessException(ErrorCode.OPERATION_ERROR, "File upload failed");
                }
                return filePo;
            }
            return null;
        } catch (IOException e) {
            log.error("File upload failed", e);
            throw new BusinessException(ErrorCode.OPERATION_ERROR, "File upload failed");
        }
    }

    /**
     * Upload a file to ES through the pipeline
     *
     * @param filePo
     * @param file
     * @return
     */
    public boolean uploadFileToEs(FilePo filePo, File file) {
        try {
            // 1. Read the file content and convert it to Base64
            byte[] fileContent = Files.readAllBytes(file.toPath());
            String base64Content = Base64.getEncoder().encodeToString(fileContent);

            // 2. Prepare the document to index
            Map<String, Object> document = new HashMap<>();
            document.put("id", filePo.getId());
            document.put("fileName", filePo.getFileName());
            document.put("fileType", filePo.getFileType());
            document.put("content", base64Content);  // the pipeline reads the content field
            document.put("description", filePo.getDescription());
            document.put("userId", filePo.getUserId());
            document.put("biz", filePo.getBiz());
            document.put("createTime", filePo.getCreateTime());
            document.put("updateTime", filePo.getUpdateTime());

            // 3. Create the index request
            IndexRequest indexRequest = new IndexRequest("file_v3")
                    .id(filePo.getId().toString())
                    .setPipeline("attachment")
                    .source(document);

            // 4. Execute the index request
            IndexResponse indexResponse = restHighLevelClient.index(indexRequest, RequestOptions.DEFAULT);
            return indexResponse.status() == RestStatus.CREATED
                    || indexResponse.status() == RestStatus.OK;
        } catch (Exception e) {
            log.error("Failed to upload file to ES", e);
            return false;
        }
    }

    /**
     * Generate a file name (to avoid duplicates)
     *
     * @param originalFilename
     * @return
     */
    private String generateFileName(String originalFilename) {
        String extension = FilenameUtils.getExtension(originalFilename);
        String uuid = RandomStringUtils.randomAlphanumeric(8);
        return uuid + "." + extension;
    }

    /**
     * Calculate the MD5 of a file
     *
     * @param file
     * @return
     * @throws IOException
     */
    private String calculateMD5(File file) throws IOException {
        // Close the stream once the digest has been computed
        try (InputStream in = new FileInputStream(file)) {
            return DigestUtils.md5Hex(in);
        }
    }
}
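calculateMD5 above relies on DigestUtils from commons-codec. If that dependency is not available, the same digest can be computed with the JDK only. The following is a sketch under that assumption; Md5Sketch is a hypothetical class name, and the streaming loop avoids loading the whole file into memory.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

/** Hypothetical JDK-only alternative to DigestUtils.md5Hex; for illustration. */
final class Md5Sketch {

    /** Streams the input through an MD5 digest and returns lowercase hex. */
    static String md5Hex(InputStream in) throws IOException {
        MessageDigest digest;
        try {
            digest = MessageDigest.getInstance("MD5");
        } catch (NoSuchAlgorithmException e) {
            // MD5 is a required algorithm on every Java platform implementation
            throw new IllegalStateException("MD5 not available", e);
        }
        byte[] buffer = new byte[8192];
        int read;
        while ((read = in.read(buffer)) != -1) {
            digest.update(buffer, 0, read);
        }
        StringBuilder hex = new StringBuilder();
        for (byte b : digest.digest()) {
            hex.append(String.format("%02x", b));
        }
        return hex.toString();
    }

    /** Opens the file, hashes it, and always closes the stream. */
    static String md5Hex(Path file) throws IOException {
        try (InputStream in = Files.newInputStream(file)) {
            return md5Hex(in);
        }
    }
}
```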

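Upload a test document

Query ES to check that the file was parsed into text correctly

The extracted text is stored in attachment.content (the pipeline removes the original content field)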


Add the ES search method

public interface FileService extends IService<FilePo> {
    Page<FileVo> searchFromEs(FileQueryRequest fileQueryRequest);
}

@Service
@Slf4j
public class FileServiceImpl extends ServiceImpl<FileMapper, FilePo> implements FileService {

    @Resource
    private ElasticsearchRestTemplate elasticsearchRestTemplate;

    @Override
    public Page<FileVo> searchFromEs(FileQueryRequest fileQueryRequest) {
        String searchText = fileQueryRequest.getSearchText();
        String fileName = fileQueryRequest.getFileName();
        String content = fileQueryRequest.getContent();
        // ES pages start at 0
        long current = fileQueryRequest.getCurrent() - 1;
        long pageSize = fileQueryRequest.getPageSize();
        String sortField = fileQueryRequest.getSortField();
        String sortOrder = fileQueryRequest.getSortOrder();

        BoolQueryBuilder boolQueryBuilder = QueryBuilders.boolQuery();
        //boolQueryBuilder.filter(QueryBuilders.termQuery("isDelete", 0));

        // Search by keyword
        if (StringUtils.isNotBlank(searchText)) {
            boolQueryBuilder.should(QueryBuilders.matchQuery("fileName", searchText));
            boolQueryBuilder.should(QueryBuilders.matchQuery("description", searchText));
            boolQueryBuilder.should(QueryBuilders.matchQuery("attachment.content", searchText));
            boolQueryBuilder.minimumShouldMatch(1);
        }
        // Search by file name
        if (StringUtils.isNotBlank(fileName)) {
            boolQueryBuilder.should(QueryBuilders.matchQuery("fileName", fileName));
            boolQueryBuilder.minimumShouldMatch(1);
        }
        // Search by content
        if (StringUtils.isNotBlank(content)) {
            boolQueryBuilder.should(QueryBuilders.matchQuery("attachment.content", content));
            boolQueryBuilder.minimumShouldMatch(1);
        }

        // Sorting
        SortBuilder<?> sortBuilder = SortBuilders.scoreSort();
        if (StringUtils.isNotBlank(sortField)) {
            sortBuilder = SortBuilders.fieldSort(sortField);
            sortBuilder.order(CommonConstant.SORT_ORDER_ASC.equals(sortOrder) ? SortOrder.ASC : SortOrder.DESC);
        }

        // Paging
        PageRequest pageRequest = PageRequest.of((int) current, (int) pageSize);

        // Build the query
        NativeSearchQuery searchQuery = new NativeSearchQueryBuilder()
                .withQuery(boolQueryBuilder)
                .withPageable(pageRequest)
                .withSorts(sortBuilder)
                .build();
        SearchHits<FileEsDTO> searchHits = elasticsearchRestTemplate.search(searchQuery, FileEsDTO.class);

        Page<FileVo> page = new Page<>();
        page.setTotal(searchHits.getTotalHits());
        List<FilePo> resourceList = new ArrayList<>();

        // After the search, load the latest dynamic data from the db
        if (searchHits.hasSearchHits()) {
            List<SearchHit<FileEsDTO>> searchHitList = searchHits.getSearchHits();
            List<Long> fileIdList = searchHitList.stream()
                    .map(searchHit -> searchHit.getContent().getId())
                    .collect(Collectors.toList());
            List<FilePo> fileList = baseMapper.selectBatchIds(fileIdList);
            if (fileList != null) {
                Map<Long, List<FilePo>> idPostMap = fileList.stream()
                        .collect(Collectors.groupingBy(FilePo::getId));
                fileIdList.forEach(fileId -> {
                    if (idPostMap.containsKey(fileId)) {
                        resourceList.add(idPostMap.get(fileId).get(0));
                    } else {
                        // Remove from ES the records that were physically deleted from the db
                        String delete = elasticsearchRestTemplate.delete(String.valueOf(fileId), FileEsDTO.class);
                        log.info("delete post {}", delete);
                    }
                });
            }
        }

        List<FileVo> fileVoList = new ArrayList<>();
        if (CollUtil.isNotEmpty(resourceList)) {
            for (FilePo filePo : resourceList) {
                FileVo fileVo = FilePoToVoUtils.poToVo(filePo);
                fileVoList.add(fileVo);
            }
        }
        page.setRecords(fileVoList);
        return page;
    }
}
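The last part of searchFromEs does two things at once: it restores the ES relevance order (the batch database lookup does not guarantee any order) and it spots ids that exist in ES but no longer in the database. That logic can be isolated and tested without Spring or Elasticsearch. A minimal sketch; HitReconciler, Result and reconcile are hypothetical names introduced for this illustration.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

/** Hypothetical helper for illustration; not part of the original project. */
final class HitReconciler {

    /** Rows in ES order, plus the ids that ES returned but the database no longer has. */
    static final class Result<T> {
        final List<T> ordered = new ArrayList<>();
        final List<Long> staleIds = new ArrayList<>();
    }

    /**
     * @param esIds  ids in the order Elasticsearch returned them
     * @param dbRows rows loaded from the database, in any order
     * @param idOf   extracts the id from a row
     */
    static <T> Result<T> reconcile(List<Long> esIds, List<T> dbRows, Function<T, Long> idOf) {
        Map<Long, T> byId = new HashMap<>();
        for (T row : dbRows) {
            byId.put(idOf.apply(row), row);
        }
        Result<T> result = new Result<>();
        for (Long id : esIds) {
            T row = byId.get(id);
            if (row != null) {
                result.ordered.add(row);   // keep the ES ranking
            } else {
                result.staleIds.add(id);   // candidate for deletion from ES
            }
        }
        return result;
    }
}
```

In the service this would be called as reconcile(fileIdList, fileList, FilePo::getId), after which the stale ids can be deleted from ES in one place.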

PO-to-VO conversion utility

package com.xiaofei.site.search.utils;

import com.xiaofei.site.search.model.entity.FilePo;
import com.xiaofei.site.search.model.vo.FileVo;
import org.springframework.stereotype.Component;

/**
 * @author tuaofei
 * @description TODO
 * @date 2024/12/6
 */
@Component
public class FilePoToVoUtils {

    public static FileVo poToVo(FilePo entity) {
        if (entity == null) {
            return null;
        }
        FileVo vo = new FileVo();
        vo.setId(entity.getId());
        vo.setBiz(entity.getBiz());
        vo.setFileName(entity.getFileName());
        vo.setFileType(entity.getFileType());
        vo.setFileSize(entity.getFileSize());
        vo.setFileSizeFormat(formatFileSize(entity.getFileSize()));
        vo.setFileExtension("");
        vo.setUserId(entity.getUserId());
        vo.setDownloadCount(entity.getDownloadCount());
        vo.setDescription(entity.getDescription());
        vo.setCreateTime(entity.getCreateTime());
        vo.setUpdateTime(entity.getUpdateTime());
        vo.setContent(entity.getContent());
        // Set the preview and download URLs
        vo.setPreviewUrl(generatePreviewUrl(entity));
        vo.setDownloadUrl(generateDownloadUrl(entity));
        // Set the permissions
        vo.setCanPreview(checkPreviewPermission(entity));
        vo.setCanDownload(checkDownloadPermission(entity));
        return vo;
    }

    /**
     * Format the file size
     */
    private static String formatFileSize(Long size) {
        if (size == null) {
            return "0B";
        }
        if (size < 1024) {
            return size + "B";
        } else if (size < 1024 * 1024) {
            return String.format("%.2fKB", size / 1024.0);
        } else if (size < 1024 * 1024 * 1024) {
            return String.format("%.2fMB", size / (1024.0 * 1024.0));
        } else {
            return String.format("%.2fGB", size / (1024.0 * 1024.0 * 1024.0));
        }
    }

    /**
     * Generate the preview URL
     */
    private static String generatePreviewUrl(FilePo entity) {
        // Generate the preview URL according to business logic
        return "/api/file/preview/" + entity.getId();
    }

    /**
     * Generate the download URL
     */
    private static String generateDownloadUrl(FilePo entity) {
        // Generate the download URL according to business logic
        return "/api/file/download/" + entity.getId();
    }

    /**
     * Check the preview permission
     */
    private static Boolean checkPreviewPermission(FilePo entity) {
        // Check the preview permission according to business logic
        return true;
    }

    /**
     * Check the download permission
     */
    private static Boolean checkDownloadPermission(FilePo entity) {
        // Check the download permission according to business logic
        return true;
    }
}

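Test that the content is tokenized correctly

Run the text through the analyzer, then search with one of the resulting tokens; the search should return a hit

POST /file_v3/_analyze
{
  "analyzer": "ik_max_word",
  "text": "xxx"
}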

http://www.ppmy.cn/news/1554275.html
