A Complete Guide to Intelligent Document Retrieval with SpringBoot + ElasticSearch

2025-08-01 10:21 | Development | Author: 墨夶

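1. Project Background and Technology Selection

In enterprise applications, intelligent retrieval of document content is a frequent requirement. For example:

Automatically extracting text after a PDF/Word document is uploaded
Supporting Chinese word segmentation and fuzzy matching
Highlighting the keywords in search results

Technology selection

Technology	Role
SpringBoot	Rapidly builds the microservice
ElasticSearch	Provides full-text search and highlighting
Jieba analysis plugin	Chinese word segmentation
Ingest Attachment Processor Plugin	Document content extraction (PDF/Word, etc.)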

2. Environment Setup

2.1 Maven Dependencies

<!-- pom.xml -->
<dependencies>
    <!-- SpringBoot basics -->
    <dependency>
        <groupId>org.springframework.boot</groupId>
        <artifactId>spring-boot-starter-web</artifactId>
    </dependency>

    <!-- Elasticsearch connection -->
    <dependency>
        <groupId>org.springframework.boot</groupId>
        <artifactId>spring-boot-starter-data-elasticsearch</artifactId>
    </dependency>

    <!-- File handling utilities -->
    <dependency>
        <groupId>org.apache.poi</groupId>
        <artifactId>poi-ooxml</artifactId>
        <version>5.2.3</version>
    </dependency>

    <!-- Jieba analysis runs inside ElasticSearch as a server-side plugin
         (installed in 3.1), so the application needs no Maven dependency for it. -->
</dependencies>

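2.2 Configuration File

# application.yml
# The REST client on port 9200 is configured under spring.elasticsearch.rest.
# The older spring.data.elasticsearch.cluster-name / cluster-nodes properties
# belonged to the transport client (port 9300) and are not needed here.
# Newer Spring Boot versions use spring.elasticsearch.uris instead of
# spring.elasticsearch.rest.uris; check the version you run.
spring:
  elasticsearch:
    rest:
      uris: http://localhost:9200
      username: elastic
      password: your_password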

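3. Core Implementation Steps

3.1 Installing the ElasticSearch Plugins

Ingest Attachment Processor Plugin

# Install the plugin (local ES)
elasticsearch-plugin install ingest-attachment

# Install the plugin (inside a docker container)
docker exec -it elasticsearch bin/elasticsearch-plugin install ingest-attachment

Note: make sure the plugin version matches the ES version! The plugin takes effect after ES is restarted.

Jieba Chinese analysis plugin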

elasticsearch-plugin install https://github.com/medcl/elasticsearch-analysis-jieba/releases/download/v7.17.0/elasticsearch-analysis-jieba-7.17.0.zip

Note: confirm that this release archive exists for your exact ES version before running the command; plugin installation fails on a version mismatch.

3.2 Creating the Document Extraction Pipeline

An ElasticSearch ingest pipeline processes the uploaded file content automatically at index time.

3.2.1 Defining the Pipeline

PUT _ingest/pipeline/attachment-extract
{
  "description": "Extract attachment content",
  "processors": [
    {
      "attachment": {
        "field": "content",
        "target_field": "attachment",
        "ignore_missing": true
      }
    },
    {
      "remove": {
        "field": "content",
        "ignore_missing": true
      }
    }
  ]
}

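Key points:

The attachment processor parses the Base64-encoded file content into text.
The remove processor deletes the original binary field and keeps only the extracted text.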
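
Before wiring the pipeline into the application, it can be tried out with the simulate API (POST _ingest/pipeline/attachment-extract/_simulate). The following is a minimal sketch of building that request body in Java, assuming the pipeline above; the class and method names are hypothetical, and sending the request is left to whatever HTTP client you use.

```java
import java.nio.charset.StandardCharsets;
import java.util.Base64;

// Hypothetical helper: builds the JSON body for
// POST _ingest/pipeline/attachment-extract/_simulate
class PipelineSimulateBody {

    // Base64-encodes the raw bytes and wraps them in a one-document simulate request.
    static String build(byte[] fileBytes) {
        String base64 = Base64.getEncoder().encodeToString(fileBytes);
        // Standard Base64 output contains only [A-Za-z0-9+/=], so it needs no JSON escaping.
        return "{\"docs\":[{\"_source\":{\"content\":\"" + base64 + "\"}}]}";
    }

    public static void main(String[] args) {
        byte[] sample = "hello attachment".getBytes(StandardCharsets.UTF_8);
        System.out.println(build(sample));
    }
}
```

The "content" key matches the field name the attachment processor reads; if the pipeline is changed to read another field, the body has to change with it.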

3.3 Defining the Index and Mapping

The index mapping and settings determine how data is stored and how text is analyzed.

3.3.1 Creating the Index

PUT /fileinfo
{
  "mappings": {
    "properties": {
      "id": { "type": "keyword" },
      "fileName": { "type": "text" },
      "fileType": { "type": "keyword" },
      "attachment": {
        "properties": {
          "content": { "type": "text", "analyzer": "jieba" }
        }
      }
    }
  },
  "settings": {
    "analysis": {
      "analyzer": {
        "jieba": {
          "type": "custom",
          "tokenizer": "jieba_tokenizer"
        }
      }
    }
  }
}

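Note: the attachment.content field should use a Chinese-aware analyzer. With the default standard analyzer, Chinese text is split into single characters, so word-level matching and relevance degrade noticeably. The tokenizer name (jieba_tokenizer here) depends on the installed plugin, so check it against the plugin's documentation.

3.4 Document Processing in Java

3.4.1 File Upload Endpoint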

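import org.elasticsearch.action.index.IndexRequest;
import org.elasticsearch.client.RequestOptions;
import org.elasticsearch.client.RestHighLevelClient;
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.http.ResponseEntity;
import org.springframework.web.bind.annotation.PostMapping;
import org.springframework.web.bind.annotation.RequestMapping;
import org.springframework.web.bind.annotation.RequestParam;
import org.springframework.web.bind.annotation.RestController;
import org.springframework.web.multipart.MultipartFile;

import java.io.IOException;
import java.util.Base64;
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.UUID;

@RestController
@RequestMapping("/api/files")
public class FileUploadController {

    // ElasticsearchRestTemplate does not accept a raw IndexRequest, so the
    // request goes through RestHighLevelClient (bean defined in 5.1).
    @Autowired
    private RestHighLevelClient restHighLevelClient;

    @PostMapping("/upload")
    public ResponseEntity<String> uploadFile(@RequestParam("file") MultipartFile file) throws IOException {
        // 1. Encode the file as Base64
        String base64Content = Base64.getEncoder().encodeToString(file.getBytes());

        // 2. Build the document
        String id = UUID.randomUUID().toString();
        Map<String, Object> document = new HashMap<>();
        document.put("id", id);
        document.put("fileName", file.getOriginalFilename());
        document.put("fileType", getFileType(file.getOriginalFilename()));
        document.put("content", base64Content);  // binary field consumed by the pipeline

        // 3. Index the document through the pipeline
        IndexRequest request = new IndexRequest("fileinfo")
                .id(id)
                .setPipeline("attachment-extract")  // key step: bind the pipeline
                .source(document);

        restHighLevelClient.index(request, RequestOptions.DEFAULT);

        return ResponseEntity.ok("File indexed successfully");
    }

    // Returns the lower-case file extension (e.g. "pdf"), or "unknown" if there is none.
    private String getFileType(String fileName) {
        if (fileName == null || !fileName.contains(".")) {
            return "unknown";
        }
        return fileName.substring(fileName.lastIndexOf('.') + 1).toLowerCase(Locale.ROOT);
    }
}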

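Code walkthrough:

Base64.getEncoder() converts the file to a Base64 string so it can travel inside the JSON document.
setPipeline("attachment-extract") routes the document through the predefined pipeline.
The index() call executes the indexing operation.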
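
Base64 encoding makes the request body roughly one third larger than the raw file, which matters when choosing an upload size limit (ElasticSearch rejects HTTP bodies above its http.max_content_length setting, 100mb by default). As a rough sketch, the encoded size can be estimated up front; the class name below is hypothetical.

```java
import java.util.Base64;

// Hypothetical helper: estimates how large a file becomes once Base64-encoded.
class Base64Overhead {

    // Padded Base64 emits 4 characters for every 3 input bytes, rounded up.
    static long encodedLength(long rawBytes) {
        return 4 * ((rawBytes + 2) / 3);
    }

    public static void main(String[] args) {
        long tenMb = 10L * 1024 * 1024;
        System.out.println("raw bytes:     " + tenMb);
        System.out.println("encoded chars: " + encodedLength(tenMb));

        // Cross-check the formula against the real encoder on a small input.
        byte[] sample = new byte[1000];
        int actual = Base64.getEncoder().encodeToString(sample).length();
        System.out.println(actual == encodedLength(sample.length));
    }
}
```

The estimate covers only the content field; the surrounding JSON (file name, id, and so on) adds a little more on top.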

3.5 Full-Text Search and Highlighting

3.5.1 Search Endpoint

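// Imports needed by this method
import org.elasticsearch.action.search.SearchRequest;
import org.elasticsearch.action.search.SearchResponse;
import org.elasticsearch.client.RequestOptions;
import org.elasticsearch.index.query.QueryBuilders;
import org.elasticsearch.search.SearchHit;
import org.elasticsearch.search.builder.SearchSourceBuilder;
import org.elasticsearch.search.fetch.subphase.highlight.HighlightBuilder;
import org.elasticsearch.search.fetch.subphase.highlight.HighlightField;
import org.springframework.web.bind.annotation.GetMapping;

import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Method to add to FileUploadController
@GetMapping("/search")
public ResponseEntity<Map<String, Object>> searchFiles(@RequestParam String keyword) throws IOException {
    // 1. Build the query
    SearchSourceBuilder sourceBuilder = new SearchSourceBuilder();
    sourceBuilder.query(QueryBuilders.matchQuery("attachment.content", keyword)
            .analyzer("jieba")  // analyze the keyword with Jieba
            .fuzziness("AUTO"));

    // 2. Enable highlighting
    HighlightBuilder highlightBuilder = new HighlightBuilder();
    highlightBuilder.field("attachment.content").preTags("<mark>").postTags("</mark>");
    sourceBuilder.highlighter(highlightBuilder);

    // 3. Run the search
    SearchRequest searchRequest = new SearchRequest("fileinfo");
    searchRequest.source(sourceBuilder);
    // restHighLevelClient is an injected RestHighLevelClient bean (see 5.1);
    // ElasticsearchRestTemplate has no search(SearchRequest) method.
    SearchResponse response = restHighLevelClient.search(searchRequest, RequestOptions.DEFAULT);

    // 4. Collect the highlighted results
    List<Map<String, Object>> results = new ArrayList<>();
    for (SearchHit hit : response.getHits().getHits()) {
        Map<String, Object> source = hit.getSourceAsMap();
        Map<String, HighlightField> highlights = hit.getHighlightFields();
        HighlightField contentHighlight = highlights.get("attachment.content");
        if (contentHighlight != null) {
            source.put("highlight", contentHighlight.fragments()[0].string());
        }
        results.add(source);
    }

    return ResponseEntity.ok(Collections.singletonMap("results", results));
}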

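Key points:

matchQuery("attachment.content", keyword) runs an analyzed match against the content field.
HighlightBuilder controls the highlight tags (such as <mark>).
The highlight field in each search result holds the highlighted fragment.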
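
Highlight fragments are usually rendered as HTML, but the fragment text comes from uploaded documents and may itself contain markup. One way to handle this, sketched below with a hypothetical helper, is to escape the fragment and then restore only the <mark> tags. ElasticSearch's highlighter also has an encoder option (html) that escapes on the server side, which may be preferable.

```java
// Hypothetical helper: escapes HTML in a highlight fragment but keeps <mark> tags.
class HighlightSanitizer {

    static String sanitize(String fragment) {
        // Escape '&' first so the entities produced below are not escaped twice.
        String escaped = fragment
                .replace("&", "&amp;")
                .replace("<", "&lt;")
                .replace(">", "&gt;");
        // Restore only the highlight tags configured on the HighlightBuilder.
        return escaped
                .replace("&lt;mark&gt;", "<mark>")
                .replace("&lt;/mark&gt;", "</mark>");
    }

    public static void main(String[] args) {
        System.out.println(sanitize("<b>x</b> <mark>红酒</mark> & more"));
    }
}
```

A limitation of this sketch: a document that literally contains the text <mark> gets that text restored as a tag as well. That is harmless for a plain mark element, but it is worth knowing if you switch to tags that carry attributes.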

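4. Performance Tuning and Caveats

4.1 Caching Strategy

ElasticSearch cache: the shard request cache (request_cache) cuts the cost of repeated queries. By default it caches only size=0 requests (totals, aggregations), so it helps hit-returning searches little unless caching is requested per search.
Application-level cache: use Redis to cache the results of frequent searches.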
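
To illustrate the application-level idea without a running Redis, here is a minimal in-memory LRU cache keyed by search keyword. It is a stand-in under stated assumptions, not the Redis setup itself: the class name is hypothetical, there is no expiry, and it is not thread-safe on its own.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical in-memory stand-in for a Redis cache of search results.
class SearchResultCache<V> extends LinkedHashMap<String, V> {

    private final int maxEntries;

    SearchResultCache(int maxEntries) {
        super(16, 0.75f, true); // accessOrder=true: iteration order follows recency of use
        this.maxEntries = maxEntries;
    }

    // Called by LinkedHashMap after each put; evicts the least recently used entry.
    @Override
    protected boolean removeEldestEntry(Map.Entry<String, V> eldest) {
        return size() > maxEntries;
    }

    public static void main(String[] args) {
        SearchResultCache<String> cache = new SearchResultCache<>(2);
        cache.put("红酒", "results for 红酒");
        cache.put("合同", "results for 合同");
        cache.get("红酒");                      // marks 红酒 as recently used
        cache.put("发票", "results for 发票");  // exceeds capacity, evicts 合同
        System.out.println(cache.keySet());
    }
}
```

In a controller the cache would be consulted before building the SearchRequest. Wrap it with Collections.synchronizedMap (or use a real cache) before sharing it across request threads, and remember that cached results go stale when new files are indexed.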

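4.2 Pagination and Filtering

// Pagination example
sourceBuilder.from(0).size(10);  // first page, 10 hits per page
// Sort by time; assumes a createTime date field, which the fileinfo mapping in 3.3.1 does not define
sourceBuilder.sort(SortBuilders.fieldSort("createTime").order(SortOrder.DESC));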
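
The from value is an offset, not a page number, and ElasticSearch rejects requests where from + size exceeds index.max_result_window (10000 by default). A small sketch of the conversion with that guard follows; the class and method names are hypothetical.

```java
// Hypothetical helper: converts a 1-based page number into the "from" offset.
class PageWindow {

    // Default value of the index.max_result_window index setting.
    static final int MAX_RESULT_WINDOW = 10_000;

    static int fromOffset(int page, int size) {
        if (page < 1 || size < 1) {
            throw new IllegalArgumentException("page and size must be >= 1");
        }
        long from = (long) (page - 1) * size; // long arithmetic avoids int overflow
        if (from + size > MAX_RESULT_WINDOW) {
            throw new IllegalArgumentException("from + size exceeds " + MAX_RESULT_WINDOW);
        }
        return (int) from;
    }

    public static void main(String[] args) {
        System.out.println(fromOffset(1, 10));
        System.out.println(fromOffset(3, 10));
    }
}
```

The result would be passed as sourceBuilder.from(PageWindow.fromOffset(page, size)).size(size). For paging deeper than the window, search_after is the usual alternative.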

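4.3 Security and Fault Tolerance

File type validation: reject disallowed file types at upload time.
Exception handling: catch ElasticsearchException and return a friendly error message.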
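
A file extension alone is easy to fake, so validation is often paired with a check of the leading bytes. The sketch below allows only PDF and DOCX, assuming those are the types you accept; the class name is hypothetical. A PDF starts with "%PDF-", and a DOCX is a ZIP archive that starts with "PK" followed by bytes 3 and 4.

```java
import java.nio.charset.StandardCharsets;
import java.util.Locale;

// Hypothetical validator: checks the extension and the file's leading "magic" bytes.
class UploadValidator {

    private static final byte[] PDF_MAGIC = "%PDF-".getBytes(StandardCharsets.US_ASCII);
    private static final byte[] ZIP_MAGIC = {'P', 'K', 3, 4}; // DOCX is a ZIP container

    static boolean isAllowed(String fileName, byte[] content) {
        if (fileName == null || content == null) {
            return false;
        }
        String lower = fileName.toLowerCase(Locale.ROOT);
        if (lower.endsWith(".pdf")) {
            return startsWith(content, PDF_MAGIC);
        }
        if (lower.endsWith(".docx")) {
            return startsWith(content, ZIP_MAGIC);
        }
        return false; // everything else is rejected
    }

    private static boolean startsWith(byte[] data, byte[] prefix) {
        if (data.length < prefix.length) {
            return false;
        }
        for (int i = 0; i < prefix.length; i++) {
            if (data[i] != prefix[i]) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        byte[] pdf = "%PDF-1.7 ...".getBytes(StandardCharsets.US_ASCII);
        System.out.println(isAllowed("report.pdf", pdf));
        System.out.println(isAllowed("report.exe", pdf));
    }
}
```

The ZIP check only shows that a .docx file is some ZIP archive, not that it is a valid Word document; treat it as a first filter rather than full validation.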

5. Putting the Code Together

5.1 Configuration Class (ElasticSearch Connection)

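import org.apache.http.HttpHost;
import org.elasticsearch.client.RestClient;
import org.elasticsearch.client.RestHighLevelClient;
import org.springframework.beans.factory.annotation.Value;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import org.springframework.data.elasticsearch.core.ElasticsearchRestTemplate;

@Configuration
public class ElasticsearchConfig {

    @Value("${spring.elasticsearch.rest.uris}")
    private String esUri;

    @Bean
    public RestHighLevelClient elasticsearchClient() {
        // esUri includes the scheme (http://localhost:9200), so splitting it on ":"
        // would not yield host and port; HttpHost.create parses scheme, host, and port.
        // Note: this hand-built client does not set the username/password from application.yml.
        return new RestHighLevelClient(RestClient.builder(HttpHost.create(esUri)));
    }

    @Bean
    public ElasticsearchRestTemplate elasticsearchRestTemplate(RestHighLevelClient client) {
        return new ElasticsearchRestTemplate(client);
    }
}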
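
The configured value carries a scheme (http://localhost:9200), so splitting it on ":" yields "http" and "//localhost" rather than a host and a port. If the parts are needed separately, java.net.URI can extract them. The sketch below uses a hypothetical class name, and falling back to port 9200 when none is given is an assumption.

```java
import java.net.URI;

// Hypothetical helper: splits an ElasticSearch URI into scheme, host, and port.
class EsUriParts {

    final String scheme;
    final String host;
    final int port;

    EsUriParts(String scheme, String host, int port) {
        this.scheme = scheme;
        this.host = host;
        this.port = port;
    }

    static EsUriParts parse(String uri) {
        URI parsed = URI.create(uri.trim());
        // getPort() returns -1 when the URI has no explicit port; 9200 is an assumed default.
        int port = parsed.getPort() != -1 ? parsed.getPort() : 9200;
        return new EsUriParts(parsed.getScheme(), parsed.getHost(), port);
    }

    public static void main(String[] args) {
        EsUriParts parts = parse("http://localhost:9200");
        System.out.println(parts.scheme + " " + parts.host + " " + parts.port);
    }
}
```

The three values map directly onto the HttpHost(host, port, scheme) constructor used by the REST client builder.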

5.2 Sample Response with Highlighting

{
  "results": [
    {
      "id": "123",
      "fileName": "进口红酒.pdf",
      "fileType": "pdf",
      "attachment": {
        "content": "这款红酒产自法国波尔多地区，口感醇厚..."
      },
      "highlight": "这款红酒产自法国波尔多地区，<mark>口感醇厚</mark>..."
    }
  ]
}

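6. The Document Search Loop, End to End

Step	Core code/config	Purpose
1. Dependencies	pom.xml	Pulls in the Spring Boot and ElasticSearch dependencies
2. Pipeline definition	PUT _ingest/pipeline/attachment-extract	Extracts file content automatically
3. Index mapping	PUT /fileinfo	Defines field types and analysis rules
4. File upload	FileUploadController.uploadFile()	Encodes the file as Base64 and indexes it
5. Full-text search	FileUploadController.searchFiles()	Searches with Jieba analysis and highlighting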
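7. Call to Action: Try It Yourself!

"Document retrieval is no longer a hard problem! Build your own intelligent search system now!"

Try the basics: upload a PDF and verify that content extraction succeeds.
Tune the segmentation: add a custom Jieba dictionary to improve matching accuracy.
Extend the search dimensions: add filters by file type and time range.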
