Read text file word-by-word using LINQ
I am learning LINQ, and I want to read a text file (let's say an e-book) word by word using LINQ.
This is wht I could come up with:
static void Main()
{
string[] content = File.ReadAllLines("text.txt");
var query = (from c in content
select content);
foreach (var line in content)
{
Console.Write(line+"\n");
}
}
This reads the file line by line. If i change ReadAllLines
to ReadAllText
, the file is read letter by letter.
Any idea开发者_高级运维s?
string[] content = File.ReadAllLines("text.txt");
var words=content.SelectMany(line=>line.Split(' ', StringSplitOptions.RemoveEmptyEntries));
foreach(string word in words)
{
}
You'll need to add whatever whitespace characters you need. Using StringSplitOptions to deal with consecutive whitespaces is cleaner than the Where clause I originally used.
In .net 4 you can use File.ReadLines for lazy evaluation and thus lower RAM usage when working on large files.
string str = File.ReadAllText();
char[] separators = { '\n', ',', '.', ' ', '"', ' ' }; // add your own
var words = str.Split(separators, StringSplitOptions.RemoveEmptyEntries);
string content = File.ReadAllText("Text.txt");
var words = from word in content.Split(WhiteSpace, StringSplitOptions.RemoveEmptyEntries)
select word;
You will need to define the array of whitespace chars with your own values like so:
List<char> WhiteSpace = { Environment.NewLine, ' ' , '\t'};
This code assumes that panctuation is a part of the word (like a comma).
It's probably better to read all the text using ReadAllText() then use regular expressions to get the words. Using the space character as a delimiter can cause some troubles as it will also retrieve punctuation (commas, dots .. etc). For example:
Regex re = new Regex("[a-zA-Z0-9_-]+", RegexOptions.Compiled); // You'll need to change the RE to fit your needs
Match m = re.Match(text);
while (m.Success)
{
string word = m.Groups[1].Value;
// do your processing here
m = m.NextMatch();
}
The following uses iterator blocks, and therefore uses deferred loading. Other solutions have you loading the entire file into memory before being able to iterate over the words.
static IEnumerable<string> GetWords(string path){
foreach (var line in File.ReadLines(path)){
foreach (var word in line.Split(null)){
yield return word;
}
}
}
(Split(null) automatically removes whitespace)
Use it like this:
foreach (var word in GetWords(@"text.txt")){
Console.WriteLine(word);
}
Works with standard Linq funness too:
GetWords(@"text.txt").Take(25);
GetWords(@"text.txt").Where(w => w.Length > 3)
Of course error handling etc. left out for sake of learning.
You could write content.ToList().ForEach(p => p.Split(' ').ToList().ForEach(Console.WriteLine))
but that's not a lot of linq.
精彩评论