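Extracting Headings from a Word Document
Without further ado, here is the code. It uses the classes under com.aspose.words.*; the dependency jar is attached at the end: unzip the zip file and use the jar inside, or pull it in through Maven yourself.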
// Imports needed by the method below
import com.aspose.words.*;
import java.util.ArrayList;
import java.util.List;

/**
 * Heading extraction: keeps only the Heading 1-9 paragraphs of a Word document.
 *
 * @param inputFilePath  path of the source document
 * @param outputFilePath path the stripped-down document is saved to
 */
public static void modifyWordDocument(String inputFilePath, String outputFilePath) {
    try {
        // Load the Word document
        Document doc = new Document(inputFilePath);

        // Walk a snapshot (toArray) of the paragraphs: removing nodes while
        // iterating the live collection can cause nodes to be skipped
        for (Node node : doc.getChildNodes(NodeType.PARAGRAPH, true).toArray()) {
            Paragraph paragraph = (Paragraph) node;
            int styleId = paragraph.getParagraphFormat().getStyle().getStyleIdentifier();
            // HEADING_1 through HEADING_9 are consecutive int constants
            boolean isHeading = styleId >= StyleIdentifier.HEADING_1
                    && styleId <= StyleIdentifier.HEADING_9;
            if (!isHeading) {
                // Not a heading style, so remove the paragraph
                paragraph.remove();
            }
        }

        // Remove all tables (again via a snapshot)
        for (Node node : doc.getChildNodes(NodeType.TABLE, true).toArray()) {
            node.remove();
        }

        // Collect the sections to delete
        List<Section> emptySections = new ArrayList<>();
        for (Section section : doc.getSections()) {
            boolean isEmpty = true;
            // Check every paragraph in the section
            for (Object node : section.getChildNodes(NodeType.PARAGRAPH, true)) {
                Paragraph paragraph = (Paragraph) node;
                // A non-empty paragraph means the section is not empty
                if (!paragraph.getRange().getText().trim().isEmpty()) {
                    isEmpty = false;
                    break;
                }
            }
            if (isEmpty) {
                emptySections.add(section);
            }
        }

        // Remove the empty sections collected above
        for (Section section : emptySections) {
            // Only remove a section that is still attached to a parent
            if (section.getParentNode() != null) {
                section.remove();
            }
        }

        // Save the modified document
        doc.save(outputFilePath);
    } catch (Exception e) {
        e.printStackTrace();
    }
}
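The method removes paragraphs from the same document tree it is looping over, which is the step most likely to go wrong. As a rough illustration in plain Java (no Aspose types; the `Para` class, its `level` field, and the `HeadingFilterSketch` name are hypothetical stand-ins, not part of the library), the sketch below copies the list to an array before removing anything, so the removals cannot disturb the loop:

```java
import java.util.ArrayList;
import java.util.List;

final class HeadingFilterSketch {

    /** Hypothetical stand-in for a paragraph: level 0 is body text, 1-9 are heading levels. */
    static final class Para {
        final int level;
        final String text;

        Para(int level, String text) {
            this.level = level;
            this.text = text;
        }
    }

    /** True for heading levels 1 through 9. */
    static boolean isHeading(Para p) {
        return p.level >= 1 && p.level <= 9;
    }

    /** Removes non-heading entries in place while looping over a snapshot of the list. */
    static void keepHeadings(List<Para> paragraphs) {
        // Copy first: the loop below never iterates the list it is modifying
        Para[] snapshot = paragraphs.toArray(new Para[0]);
        for (Para p : snapshot) {
            if (!isHeading(p)) {
                paragraphs.remove(p);
            }
        }
    }

    /** Returns the text of each remaining entry, in order. */
    static List<String> texts(List<Para> paragraphs) {
        List<String> out = new ArrayList<>();
        for (Para p : paragraphs) {
            out.add(p.text);
        }
        return out;
    }
}
```

With a plain `ArrayList`, removing inside a for-each over the list itself would throw `ConcurrentModificationException`; a document library may behave differently (for example by silently skipping nodes), so treat this only as an analogy for why a snapshot is the safer habit.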
Here is a main method for testing:
public static void main(String[] args) {
    String inputFilePath = "your\\file_path\\test.docx";
    String outputFilePath = "your\\file_path\\test-headings.docx";
    modifyWordDocument(inputFilePath, outputFilePath);
}
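The test above hard-codes both paths with Windows separators. If you would rather derive the output name from the input, something like the following could work. It is a minimal sketch using only `java.nio.file`; the `OutputPaths` class and `withSuffix` method are names made up for this example, and it assumes the input path has a file name component (that is, it is not a bare root).

```java
import java.nio.file.Path;
import java.nio.file.Paths;

final class OutputPaths {

    /**
     * Builds a sibling path whose file name is the input name with a suffix
     * inserted before the extension, e.g. test.docx -> test-headings.docx.
     */
    static Path withSuffix(Path input, String suffix) {
        String name = input.getFileName().toString();
        int dot = name.lastIndexOf('.');
        // A leading dot (dot == 0) or no dot at all means there is no extension to split off
        String base = dot > 0 ? name.substring(0, dot) : name;
        String ext = dot > 0 ? name.substring(dot) : "";
        // resolveSibling keeps the parent directory of the input, if any
        return input.resolveSibling(base + suffix + ext);
    }
}
```

Usage would then look like `Path in = Paths.get("your", "file_path", "test.docx");` followed by `modifyWordDocument(in.toString(), OutputPaths.withSuffix(in, "-headings").toString());`, which also sidesteps the platform-specific `\\` separators.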
Dependency declaration in the pom file; I reference a local jar:
<dependency>
    <groupId>com.aspose-word-cracked</groupId>
    <artifactId>aspose-word-cracked</artifactId>
    <scope>system</scope>
    <version>1.0</version>
    <systemPath>${basedir}/libs/aspose-words-20.12-jdk17-crack.jar</systemPath>
</dependency>