How to extract just plain text from .doc & .docx files? [closed]

2023-02-25 04:42

Closed. This question does not meet Stack Overflow guidelines. It is not currently accepting answers.

We don’t allow questions seeking recommendations for books, tools, software libraries, and more. You can edit the question so it can be answered with facts and citations.

Closed 7 years ago.

Anyone know of anything they can recommend in order to extract just the plain text from a .doc or .docx?

I've found this - wondered if there were any other suggestions?

If you want the pure plain text (my requirement), then all you need is:

unzip -p some.docx word/document.xml | sed -e 's/<[^>]\{1,\}>//g; s/[^[:print:]]\{1,\}//g'

I found this on commandlinefu.

It unzips the .docx file, extracts the main document part (word/document.xml), and then strips all the XML tags. Obviously, all formatting is lost.
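The same idea can be sketched in Java using only the standard library. This is a rough illustration, not a full OOXML parser: the class name DocxPlainText and its methods are made up for this example, it assumes the body text lives in word/document.xml, and it only handles the five predefined XML entities (numeric character references are left as-is). Unlike the sed one-liner, it turns paragraph ends into newlines, since document.xml usually contains few line breaks of its own.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;

// Hypothetical helper: regex-based tag stripping, same spirit as the sed command.
public class DocxPlainText {

    /** Strips XML tags, mapping paragraph ends, line breaks and tabs to whitespace. */
    static String stripTags(String xml) {
        return xml
                .replaceAll("</w:p>", "\n")        // end of paragraph -> newline
                .replaceAll("<w:br[^>]*/>", "\n")  // explicit line break -> newline
                .replaceAll("<w:tab/>", "\t")      // tab character inside a run
                .replaceAll("<[^>]+>", "")         // drop every remaining tag
                .replace("&lt;", "<")
                .replace("&gt;", ">")
                .replace("&quot;", "\"")
                .replace("&apos;", "'")
                .replace("&amp;", "&");            // last, so "&amp;lt;" stays "&lt;"
    }

    /** Reads word/document.xml out of a .docx (ZIP) stream and returns its text. */
    static String extract(InputStream docx) throws IOException {
        try (ZipInputStream zip = new ZipInputStream(docx)) {
            ZipEntry entry;
            while ((entry = zip.getNextEntry()) != null) {
                if (!entry.getName().equals("word/document.xml")) {
                    continue;
                }
                ByteArrayOutputStream buf = new ByteArrayOutputStream();
                byte[] chunk = new byte[8192];
                int n;
                while ((n = zip.read(chunk)) != -1) {
                    buf.write(chunk, 0, n);
                }
                return stripTags(new String(buf.toByteArray(), StandardCharsets.UTF_8));
            }
        }
        throw new IOException("word/document.xml not found; is this a .docx file?");
    }

    public static void main(String[] args) throws IOException {
        try (InputStream in = Files.newInputStream(Paths.get(args[0]))) {
            System.out.print(extract(in));
        }
    }
}
```

Usage would be something like java DocxPlainText some.docx > some.txt. Headers, footers, and footnotes are stored in other parts of the archive and are not included by this sketch.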

LibreOffice

One option is LibreOffice/OpenOffice in headless mode (make sure all other instances of LibreOffice are closed first):

libreoffice --headless --convert-to "txt:Text (encoded):UTF8" mydocument.doc

For more details see e.g. this link: http://ask.libreoffice.org/en/question/2641/convert-to-command-line-parameter/

For a list of libreoffice filters see http://cgit.freedesktop.org/libreoffice/core/tree/filter/source/config/fragments/filters

Since the OpenOffice command-line syntax is a bit complicated, there is a handy wrapper that makes the process easier: unoconv.
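If the conversion has to be driven from Java, the headless command can be wrapped with ProcessBuilder. The following is only a sketch under a few assumptions: the class LibreOfficeToText and its method names are invented for this example, the binary is assumed to be on the PATH as libreoffice (on some systems it is soffice), and the optional -env:UserInstallation argument is used here to give the process its own profile directory, which may help when another instance is already running.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.TimeUnit;

// Hypothetical wrapper around the headless LibreOffice command line.
public class LibreOfficeToText {

    /** Builds the argument list; profileDir may be null to use the default profile. */
    static List<String> buildCommand(Path input, Path outDir, Path profileDir) {
        List<String> cmd = new ArrayList<>();
        cmd.add("libreoffice"); // assumption: binary name on the PATH
        if (profileDir != null) {
            // Separate user profile so this run does not clash with other instances
            cmd.add("-env:UserInstallation=" + profileDir.toUri());
        }
        cmd.add("--headless");
        cmd.add("--convert-to");
        cmd.add("txt:Text (encoded):UTF8");
        cmd.add("--outdir");
        cmd.add(outDir.toString());
        cmd.add(input.toString());
        return cmd;
    }

    /** Runs the conversion and returns the exit code; fails if it exceeds the timeout. */
    static int convert(Path input, Path outDir, Path profileDir, long timeoutSeconds)
            throws IOException, InterruptedException {
        Process process = new ProcessBuilder(buildCommand(input, outDir, profileDir))
                .inheritIO()
                .start();
        if (!process.waitFor(timeoutSeconds, TimeUnit.SECONDS)) {
            process.destroyForcibly();
            throw new IOException("libreoffice timed out after " + timeoutSeconds + " s");
        }
        return process.exitValue();
    }

    public static void main(String[] args) throws Exception {
        Path profile = Files.createTempDirectory("lo-profile");
        int exit = convert(Paths.get(args[0]), Paths.get(args[1]), profile, 120);
        System.out.println("libreoffice exited with code " + exit);
    }
}
```

The timeout matters because a hung converter process would otherwise block the caller indefinitely; the 120 seconds used in main is an arbitrary value.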

Apache POI

Another option is Apache POI, a well-supported Java library which, unlike antiword, can read, create, and convert .doc, .docx, .xls, .xlsx, .ppt, and .pptx files.

Here is the simplest possible Java code for converting a .doc or .docx document to plain text:

import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.Writer;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

import org.apache.poi.POITextExtractor;
import org.apache.poi.extractor.ExtractorFactory;
import org.apache.poi.openxml4j.exceptions.OpenXML4JException;
import org.apache.xmlbeans.XmlException;

public class WordToTextConverter {
    public static void main(String[] args) {
        if (args.length < 2) {
            System.err.println("Usage: java WordToTextConverter <word_file> <text_file>");
            System.exit(1);
        }
        convertWordToText(args[0], args[1]);
    }

    public static void convertWordToText(String src, String dest) {
        // try-with-resources closes the stream, extractor and writer even if extraction fails.
        // The output is written as UTF-8 rather than in the platform default encoding.
        try (InputStream in = new FileInputStream(src);
             POITextExtractor extractor = ExtractorFactory.createExtractor(in);
             Writer out = Files.newBufferedWriter(Paths.get(dest), StandardCharsets.UTF_8)) {
            out.write(extractor.getText());
        } catch (IOException | OpenXML4JException | XmlException e) {
            e.printStackTrace();
        }
    }
}


# Maven dependencies (pom.xml):

<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
<modelVersion>4.0.0</modelVersion>

<groupId>my.wordconv</groupId>
<artifactId>my.wordconv.converter</artifactId>
<version>1.0-SNAPSHOT</version>
<dependencies>
    <dependency>
        <groupId>org.apache.poi</groupId>
        <artifactId>poi</artifactId>
        <version>3.17</version>
    </dependency>
    <dependency>
        <groupId>org.apache.poi</groupId>
        <artifactId>poi-ooxml</artifactId>
        <version>3.17</version>
    </dependency>
    <dependency>
        <groupId>org.apache.poi</groupId>
        <artifactId>poi-scratchpad</artifactId>
        <version>3.17</version>
    </dependency>
</dependencies>
</project>

NOTE: You will need to add the Apache POI libraries to the classpath. On Ubuntu/Debian the libraries can be installed with sudo apt-get install libapache-poi-java, which places them under /usr/share/java. On other systems, download the library and unpack the archive to a folder, then use that folder instead of /usr/share/java. If you use Maven or Gradle (the recommended option), include the org.apache.poi dependencies as shown in the pom.xml above.

The same code will work for both .doc and .docx as the required converter implementation will be chosen by inspecting the binary stream.
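To illustrate what "inspecting the binary stream" can mean, here is a small standalone sketch that tells the two container formats apart by their leading bytes. It is not POI's actual implementation; the class WordFormatSniffer and its enum are hypothetical names. It relies on two well-known signatures: legacy .doc files are OLE2 compound documents starting with D0 CF 11 E0 A1 B1 1A E1, and .docx files are ZIP archives starting with 50 4B 03 04.

```java
import java.io.BufferedInputStream;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;

// Hypothetical illustration of container detection by magic bytes.
public class WordFormatSniffer {

    /** OLE2 covers .doc (and .xls/.ppt); ZIP covers .docx (and any other ZIP file). */
    enum Format { OLE2, ZIP, UNKNOWN }

    private static final int[] OLE2_MAGIC = {0xD0, 0xCF, 0x11, 0xE0, 0xA1, 0xB1, 0x1A, 0xE1};
    private static final int[] ZIP_MAGIC = {0x50, 0x4B, 0x03, 0x04};

    /** Peeks at the first 8 bytes and rewinds, so the stream can still be parsed afterwards. */
    static Format sniff(InputStream in) throws IOException {
        if (!in.markSupported()) {
            throw new IllegalArgumentException("stream must support mark/reset");
        }
        in.mark(8);
        byte[] head = new byte[8];
        int read = 0;
        while (read < head.length) {
            int n = in.read(head, read, head.length - read);
            if (n < 0) {
                break; // stream shorter than 8 bytes
            }
            read += n;
        }
        in.reset();
        if (startsWith(head, read, OLE2_MAGIC)) {
            return Format.OLE2;
        }
        if (startsWith(head, read, ZIP_MAGIC)) {
            return Format.ZIP;
        }
        return Format.UNKNOWN;
    }

    private static boolean startsWith(byte[] data, int length, int[] magic) {
        if (length < magic.length) {
            return false;
        }
        for (int i = 0; i < magic.length; i++) {
            if ((data[i] & 0xFF) != magic[i]) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) throws IOException {
        try (InputStream in = new BufferedInputStream(new FileInputStream(args[0]))) {
            System.out.println(sniff(in));
        }
    }
}
```

Because detection looks at content rather than the file name, a .docx that was renamed to .doc would still be routed to the right extractor.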

Compile the class above (assuming it's in the default package and the Apache POI jars are under /usr/share/java). The classpath is quoted so that the shell passes the wildcard through to Java unexpanded:

javac -cp "/usr/share/java/*:." WordToTextConverter.java

Run the conversion:

java -cp "/usr/share/java/*:." WordToTextConverter doc.docx doc.txt

A clonable Gradle project is also available; it pulls all necessary dependencies and generates the wrapper shell script (with gradle installDist).

Try Apache Tika. It supports most document formats (every MS Office format, OpenOffice/LibreOffice formats, PDF, etc.) using Java-based libraries (among others, Apache POI). It's very simple to use:

java -jar tika-app-1.4.jar --text ./my-document.doc

Try "antiword" or "antiword-xp-rb"

My favorite is antiword:

http://www.winfield.demon.nl/

And here's a similar project which claims support for docx:

https://github.com/rainey/antiword-xp-rb/wiki

I find wv to be better than catdoc or antiword. It can deal with .docx and convert to text or html. Here is a function I added to my .bashrc to temporarily view the file in the terminal. Change it as required.

# Open a Word document in less (e.g. worl document.doc)
worl() {
    local doc
    doc=$(mktemp /tmp/output.XXXXXXXXXX) || return
    # Quote the arguments so file names with spaces work
    wvText "$1" "$doc"
    less "$doc"
    rm -f "$doc"
}

I recently dealt with this issue and found the OpenOffice/LibreOffice command-line tools to be unreliable in production (thousands of docs processed, dozens concurrently).

Ultimately, I built a lightweight wrapper, DocRipper, that is much faster and grabs all text from .doc, .docx, and .pdf files without formatting. DocRipper uses Antiword, grep, and pdftotext to grab the text and return it.
