Dynamically storing and retrieving 3,000,000+ words in C#/.NET using collections
How can I store and retrieve 3,000,000+ words dynamically without using SQL?
Get a word from a document, then check whether that word is already available.
If it is available, increment its count for the corresponding document.
If it is not available, i.e. it is a new word, create a new column, increment the count for the current document, and put zero for all other documents.
For example:
I have 93,000 documents, each containing roughly 5,000 words. Whenever a new word comes, a new column is added. In this way 960000 words have accumulated.
Document      Word1  Word2  Word3  Word4  Word5  ...  New Word  ...  Word96000
Document1         2     19     45     16      7  ...         0  ...  ...
Document2         4      6      3     56      3  ...         0  ...  ...
Document3        56     34      1     67      4  ...         0  ...  ...
Document4         7     45      9     45      6  ...         0  ...  ...
Document5        56     43    234     87     46  ...         0  ...  ...
Document6        56      6      2      5     23  ...         0  ...  ...
...
Document1000      5      9      9     89     34  ...         1  ...  ...
The counts of newly added words are updated dynamically in the corresponding document's entry.
Such a sparse matrix is often best implemented as a dictionary of dictionaries.
Dictionary<string, Dictionary<string, int>> index;
But the question lacks too many details to give more advice.
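A minimal sketch of the dictionary-of-dictionaries idea, assuming the outer key is a document name and the inner key is a word. The class name WordIndex and its methods Add and Count are hypothetical, introduced only for illustration. Pairs that were never seen are simply absent and read back as zero, so no column of zeros has to be stored for a new word.

```csharp
using System.Collections.Generic;

// Hypothetical helper (not from the question): a sparse document-by-word count matrix.
public class WordIndex
{
    // Outer key: document name. Inner key: word. Value: occurrences of the word in that document.
    private readonly Dictionary<string, Dictionary<string, int>> index =
        new Dictionary<string, Dictionary<string, int>>();

    // Records one occurrence of `word` in `document`.
    public void Add(string document, string word)
    {
        if (!index.TryGetValue(document, out var counts))
        {
            counts = new Dictionary<string, int>();
            index[document] = counts;
        }
        counts.TryGetValue(word, out int current); // current stays 0 when the word is new
        counts[word] = current + 1;
    }

    // Returns the stored count, or 0 for any (document, word) pair that was never added.
    public int Count(string document, string word)
    {
        return index.TryGetValue(document, out var counts)
            && counts.TryGetValue(word, out int n) ? n : 0;
    }
}
```

With this shape, calling Add("Document1", "word2") for every word read from Document1 builds that document's row, and Count("Document2", "word2") returns 0 until Document2 contributes the word itself.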
To avoid wasting memory, I would suggest the following:
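class Document {
    public List<int> words = new List<int>();
}
List<Document> documents;
If you have 1000 documents, create the list with that capacity and fill it with empty Document objects (the capacity alone does not create any entries):
List<Document> documents = new List<Document>(1000);
for (int i = 0; i < 1000; i++) documents.Add(new Document());
Now, if document1 contains the words word2, word19 and word45, add the indices of these words to that document:
documents[0].words.Add(2);
documents[0].words.Add(19);
documents[0].words.Add(45);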
You can modify the code to store the words themselves.
To see how many documents contain the word word2, you can go through the entire list of documents and check whether each document contains that word index.
int count = 0;
foreach (Document d in documents) {
    if (d.words.Contains(2)) {
        count++;
    }
}
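The approach above stores word indices but does not say where they come from. One possible way to assign them, sketched here under the assumption that indices are handed out in order of first appearance, is a small vocabulary class. The names Vocabulary, GetOrAdd and WordAt are hypothetical and not part of the answer.

```csharp
using System.Collections.Generic;

// Hypothetical helper: gives each distinct word a stable integer index.
public class Vocabulary
{
    private readonly Dictionary<string, int> indexOf = new Dictionary<string, int>();
    private readonly List<string> words = new List<string>();

    // Returns the existing index of `word`, or assigns the next free index to it.
    public int GetOrAdd(string word)
    {
        if (!indexOf.TryGetValue(word, out int i))
        {
            i = words.Count;
            indexOf[word] = i;
            words.Add(word);
        }
        return i;
    }

    // Maps an index back to its word.
    public string WordAt(int index) => words[index];

    // Number of distinct words seen so far.
    public int Count => words.Count;
}
```

A document's list would then receive vocabulary.GetOrAdd(word) for each word read. Note that List<int>.Contains is a linear scan; if membership checks dominate, a HashSet<int> per document may be a better fit, at the cost of no longer recording repeats.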
var nWords = (from Match m in Regex.Matches(File.ReadAllText("big.txt").ToLower(), "[a-z]+")
group m.Value by m.Value)
.ToDictionary(gr => gr.Key, gr => gr.Count());
This gives you a dictionary keyed by word, with the number of occurrences as the value. You could probably build one such dictionary as each file is read in and then combine them into your final lists.
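A rough sketch of that per-file idea, assuming each document's text is already in memory as a string. The class WordCounter and its methods CountWords and MergeInto are hypothetical names; the counting step is the same regex grouping as the query above, written with method syntax.

```csharp
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

// Hypothetical helper built around the regex word count shown above.
public static class WordCounter
{
    // Counts the a-z words in one document's text, ignoring case.
    public static Dictionary<string, int> CountWords(string text)
    {
        return Regex.Matches(text.ToLower(), "[a-z]+")
            .Cast<Match>()
            .GroupBy(m => m.Value)
            .ToDictionary(g => g.Key, g => g.Count());
    }

    // Adds one document's counts into a running total across all documents.
    public static void MergeInto(Dictionary<string, int> totals, Dictionary<string, int> counts)
    {
        foreach (var pair in counts)
        {
            totals.TryGetValue(pair.Key, out int current); // 0 if the word is new to the totals
            totals[pair.Key] = current + pair.Value;
        }
    }
}
```

Keeping the dictionary returned by CountWords for each file gives the per-document row, while MergeInto maintains the overall count per word.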