Read text file word-by-word using LINQ

I am learning LINQ, and I want to read a text file (let's say an e-book) word by word using LINQ.

This is what I could come up with:

static void Main()
{
    string[] content = File.ReadAllLines("text.txt");

    var query = (from c in content
                 select content);

    foreach (var line in content)
    {
        Console.Write(line + "\n");
    }
}

This reads the file line by line. If I change ReadAllLines to ReadAllText, the file is read character by character.

Any ideas?


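string[] content = File.ReadAllLines("text.txt");
// Passing the separator as a char[] works on every framework version.
var words = content.SelectMany(line => line.Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries));
foreach (string word in words)
{
    // process each word here
}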

Add any other whitespace characters you need to the separator array. Using StringSplitOptions to deal with consecutive whitespace is cleaner than the Where clause I originally used.

In .NET 4 you can use File.ReadLines for lazy evaluation and thus lower RAM usage when working on large files.
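
A minimal sketch of that lazy variant, assuming .NET 4 or later. The class and method names (LazyWords, ReadWords) are made up for illustration, and the separator list is only a starting point:

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;

static class LazyWords
{
    // Hypothetical helper: File.ReadLines yields one line at a time,
    // so the whole file is never held in memory at once.
    public static IEnumerable<string> ReadWords(string path)
    {
        char[] separators = { ' ', '\t' }; // assumption: extend as needed
        return File.ReadLines(path)
                   .SelectMany(line => line.Split(separators, StringSplitOptions.RemoveEmptyEntries));
    }
}
```

Nothing is read until the result is enumerated, so something like LazyWords.ReadWords("text.txt").Take(10) should only touch as many lines as it needs to produce ten words.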


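string str = File.ReadAllText("text.txt");
// '\r' and '\t' are included so Windows line endings and tabs do not end up inside words
char[] separators = { '\n', '\r', ',', '.', ' ', '"', '\t' };    // add your own
var words = str.Split(separators, StringSplitOptions.RemoveEmptyEntries);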


string content = File.ReadAllText("Text.txt");

var words = from word in content.Split(WhiteSpace, StringSplitOptions.RemoveEmptyEntries)
            select word;

You will need to define the array of whitespace characters with your own values, like so:

char[] WhiteSpace = { '\r', '\n', ' ', '\t' };

This code assumes that punctuation (such as a comma) is part of the word.


It's probably better to read all the text using ReadAllText() and then use a regular expression to extract the words. Splitting on the space character alone leaves punctuation (commas, periods, and so on) attached to the words. For example:

string text = File.ReadAllText("text.txt");
Regex re = new Regex("[a-zA-Z0-9_-]+", RegexOptions.Compiled); // You'll need to change the RE to fit your needs
Match m = re.Match(text);
while (m.Success)
{
    // The pattern has no capture groups, so take the whole match.
    string word = m.Value;

    // do your processing here

    m = m.NextMatch();
}
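
Since the question asks about LINQ, the same idea can probably be written as a query over the match collection. This is only a sketch: RegexWords and Extract are hypothetical names, and the pattern is the one from the loop above, which you would adjust to your own definition of a word:

```csharp
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

static class RegexWords
{
    // Same illustrative pattern as above: letters, digits, underscore and hyphen.
    private static readonly Regex WordPattern =
        new Regex("[a-zA-Z0-9_-]+", RegexOptions.Compiled);

    // Hypothetical helper: exposes the matches as a sequence of strings.
    public static IEnumerable<string> Extract(string text)
    {
        // Cast<Match>() turns the non-generic MatchCollection into a LINQ-friendly sequence.
        return WordPattern.Matches(text).Cast<Match>().Select(m => m.Value);
    }
}
```

For the input "Hello, world. It's e-book time!" this yields Hello, world, It, s, e-book, time. Note that the apostrophe splits "It's" in two, which is one reason the pattern needs tuning for real text.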


The following uses an iterator block, so the file is read lazily. Most of the other solutions load the entire file into memory before you can iterate over the words.

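static IEnumerable<string> GetWords(string path){

    foreach (var line in File.ReadLines(path)){
        // A null separator array means "split on any whitespace";
        // RemoveEmptyEntries drops the empty strings left by consecutive whitespace.
        foreach (var word in line.Split((char[])null, StringSplitOptions.RemoveEmptyEntries)){
            yield return word;
        }
    }
}

(Passing null as the separator splits on any whitespace character. On its own it would still return empty strings for consecutive whitespace, hence RemoveEmptyEntries.)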

Use it like this:

foreach (var word in GetWords(@"text.txt")){
    Console.WriteLine(word);
}

It works with the standard LINQ operators too:

var firstWords = GetWords(@"text.txt").Take(25);
var longWords = GetWords(@"text.txt").Where(w => w.Length > 3);

Error handling and the like are left out to keep the example short.


You could write content.ToList().ForEach(p => p.Split(' ').ToList().ForEach(Console.WriteLine)), but that's not much LINQ.
