Is there a way to read a Word document line by line?

2023-03-25 02:26

I am trying to extract all the words in a Word document. I am able to do it all in one go as follows...

Word.Application word = new Word.Application();
doc = word.Documents.Open(@"C:\SampleText.doc");
doc.Activate();

foreach (Word.Range docRange in doc.Words) // loads all words in document
{
    IEnumerable<string> sortedSubstrings = Enumerable.Range(0, docRange.Text.Trim().Length)
        .Select(i => docRange.Text.Substring(i))
        .OrderBy(s => s.Length < 3 ? s : s.Remove(2, Math.Min(s.Length - 2, 2)));

    wordPosition =
        (int)
        docRange.get_Information(
            Microsoft.Office.Interop.Word.WdInformation.wdFirstCharacterColumnNumber);

    foreach (var substring in sortedSubstrings)
    {
        index = docRange.Text.IndexOf(substring) + wordPosition;
        charLocation[index] = substring;
    }
}

However, I would prefer to load the document one line at a time. Is it possible to do so?

I can load it by paragraph; however, I am unable to iterate through the paragraphs to extract all the words.

foreach (Word.Paragraph para in doc.Paragraphs)
{
    foreach (Word.Range docRange in para) // Error: Word.Paragraph is not enumerable
    {
        IEnumerable<string> sortedSubstrings = Enumerable.Range(0, docRange.Text.Trim().Length)
            .Select(i => docRange.Text.Substring(i))
            .OrderBy(s => s.Length < 3 ? s : s.Remove(2, Math.Min(s.Length - 2, 2)));

        wordPosition =
            (int)
            docRange.get_Information(
                Microsoft.Office.Interop.Word.WdInformation.wdFirstCharacterColumnNumber);

        foreach (var substring in sortedSubstrings)
        {
            index = docRange.Text.IndexOf(substring) + wordPosition;
            charLocation[index] = substring;
        }

    }
}

This reads the document's text one line at a time.

    object file = Path.GetDirectoryName(Application.ExecutablePath) + @"\Answer.doc";

    Word.Application wordObject = new Word.ApplicationClass();
    wordObject.Visible = false;

    object nullobject = Missing.Value;
    Word.Document docs = wordObject.Documents.Open
        (ref file, ref nullobject, ref nullobject, ref nullobject,
        ref nullobject, ref nullobject, ref nullobject, ref nullobject,
        ref nullobject, ref nullobject, ref nullobject, ref nullobject,
        ref nullobject, ref nullobject, ref nullobject, ref nullobject);

    String strLine;
    bool bolEOF = false;

    docs.Characters[1].Select();

    int index = 0;
    do
    {
        object unit = Word.WdUnits.wdLine;
        object count = 1;
        wordObject.Selection.MoveEnd(ref unit, ref count);

        strLine = wordObject.Selection.Text;
        richTextBox1.Text += ++index + " - " + strLine + "\r\n"; //for our understanding

        object direction = Word.WdCollapseDirection.wdCollapseEnd;
        wordObject.Selection.Collapse(ref direction);

        if (wordObject.Selection.Bookmarks.Exists(@"\EndOfDoc"))
            bolEOF = true;
    } while (!bolEOF);

    docs.Close(ref nullobject, ref nullobject, ref nullobject);
    wordObject.Quit(ref nullobject, ref nullobject, ref nullobject);
    docs = null;
    wordObject = null;

Here's the genius behind the code. Follow the link for some more explanation on how it works.

I would suggest following the code on this page here

The crux of it is that you read the document through a Word.ApplicationClass object (Microsoft.Office.Interop.Word), although it isn't clear from that page where the "Doc" object comes from. I would assume you create it through the ApplicationClass.

EDIT: Document is retrieved by calling this:

Word.Document doc = wordApp.Documents.Open(ref file, ref nullobj, ref nullobj,
                                      ref nullobj, ref nullobj, ref nullobj,
                                      ref nullobj, ref nullobj, ref nullobj,
                                      ref nullobj, ref nullobj, ref nullobj);

Sadly, the code formatting on the page I linked isn't easy to read.

EDIT2: From there you can loop through the document's paragraphs; however, as far as I can see, there is no way of looping through lines. I would suggest using some pattern matching to find line breaks.

To extract the text of a paragraph, use Word.Paragraph.Range.Text, which returns all the text inside the paragraph. Then search that text for line-break characters; I'd use string.IndexOf().
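
As a rough sketch of that search, the helper below splits a paragraph's text at explicit break characters. The class name ParagraphLines and the method Split are hypothetical, and it uses string.IndexOfAny instead of string.IndexOf so that several break characters can be checked in one pass. It assumes Word reports a manual line break (Shift+Enter) as '\v' and a paragraph mark as '\r' in Range.Text. Lines that exist only because the text wrapped on the page are not marked in the string, so this approach cannot find them.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper for illustration; not part of the Word interop API.
public static class ParagraphLines
{
    // Assumed break characters in Range.Text:
    // '\v' (U+000B) manual line break, '\r' paragraph mark, '\n' just in case.
    private static readonly char[] LineBreaks = { '\v', '\r', '\n' };

    // Scans the text with IndexOfAny and returns each non-empty line.
    public static List<string> Split(string paragraphText)
    {
        var lines = new List<string>();
        int start = 0;
        while (start < paragraphText.Length)
        {
            int end = paragraphText.IndexOfAny(LineBreaks, start);
            if (end < 0) end = paragraphText.Length;   // no more breaks: take the rest
            if (end > start) lines.Add(paragraphText.Substring(start, end - start));
            start = end + 1;                           // skip past the break character
        }
        return lines;
    }
}
```

Inside a loop over doc.Paragraphs, it could be called as ParagraphLines.Split(para.Range.Text).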

Alternatively, if by "lines" you mean extracting one sentence at a time, you can simply iterate through Range.Sentences.
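
A minimal sketch of that iteration, combined with the per-paragraph word loop the question was attempting, might look like the following. It assumes a reference to Microsoft.Office.Interop.Word, a machine with Word installed, and a document that has already been opened; the class name ParagraphWalker and the method Dump are hypothetical.

```csharp
using System;
using Word = Microsoft.Office.Interop.Word;

// Hypothetical helper for illustration; assumes doc was opened via Documents.Open.
public static class ParagraphWalker
{
    // Prints every sentence and every word of the document, paragraph by paragraph.
    public static void Dump(Word.Document doc)
    {
        foreach (Word.Paragraph para in doc.Paragraphs)
        {
            // A Paragraph is not enumerable itself; its Range exposes the collections.
            foreach (Word.Range sentence in para.Range.Sentences)
                Console.WriteLine("Sentence: " + sentence.Text.Trim());

            foreach (Word.Range word in para.Range.Words)
                Console.WriteLine("Word: " + word.Text.Trim());
        }
    }
}
```

Going through para.Range is what avoids the "not enumerable" error from the question: the Words and Sentences collections belong to a Range, not to the Paragraph object.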

        Microsoft.Office.Interop.Word.Application word = new Microsoft.Office.Interop.Word.Application();
        object miss = System.Reflection.Missing.Value;
        object path = @"D:\viewstate.docx";
        object readOnly = true;
        Microsoft.Office.Interop.Word.Document docs = word.Documents.Open(ref path, ref miss, ref readOnly, ref miss, ref miss, ref miss, ref miss, ref miss, ref miss, ref miss, ref miss, ref miss, ref miss, ref miss, ref miss, ref miss);
        string totaltext = "";

        object unit = Microsoft.Office.Interop.Word.WdUnits.wdLine;
        object count = 1;
        word.Selection.MoveEnd(ref unit, ref count);
        totaltext = word.Selection.Text;

        TextBox1.Text = totaltext;
        docs.Close(ref miss, ref miss, ref miss);
        word.Quit(ref miss, ref miss, ref miss);
        docs = null;
        word = null;

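This reads only the first line. Increasing count extends the selection over that many lines, so to walk the document one line at a time, collapse the selection to its end after each read and call MoveEnd again.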

I recommend using the DocX library. It is lightweight and doesn't require Word to be installed on the machine. Here is the code I use to get the text line by line:

using(DocX doc = DocX.Load("sample.docx"))
{
     for (int i = 0; i < doc.Paragraphs.Count; i++ )
     {
          foreach (var item in doc.Paragraphs[i].Text.Split(new string[]{"\n"}
                    , StringSplitOptions.RemoveEmptyEntries))
          {
                Console.WriteLine(item);
          }
     }
}
