
Bag of Words representation problem

Basically I have a dictionary containing all the words of my vocabulary as keys, each with 0 as its value.

To process a document into a bag-of-words representation, I used to copy that dictionary with the appropriate IEqualityComparer and simply check whether it contained each word in the document, incrementing that word's value.

To get the array for the bag-of-words representation, I simply used the ToArray method.

This seemed to work fine, but I was just told that the dictionary doesn't guarantee the same key order, so the resulting arrays might represent the words in different orders, making them useless.

My current idea to solve this is to copy all the keys of the word dictionary into an ArrayList, create an array of the proper size, and then use the ArrayList's IndexOf method to fill the array.

So my question is: is there a better way to solve this? Mine seems kind of crude... and won't I have issues because of the IEqualityComparer?
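
For reference, here is roughly what I'm doing now (a sketch with illustrative names; the case-insensitive comparer just stands in for my IEqualityComparer):

using System;
using System.Collections.Generic;
using System.Linq;

// Vocabulary dictionary: every word maps to 0.
var vocabulary = new Dictionary<string, int>(StringComparer.OrdinalIgnoreCase)
{
    { "apple", 0 }, { "banana", 0 }, { "cherry", 0 }
};

// Copy it and count the words of one document.
var counts = new Dictionary<string, int>(vocabulary, StringComparer.OrdinalIgnoreCase);
foreach (var word in "apple cherry apple".Split(' '))
{
    if (counts.ContainsKey(word))
        counts[word]++;
}

// The bag-of-words vector -- but Values has no guaranteed order across copies.
int[] vector = counts.Values.ToArray();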


Let me see if I understand the problem. You have two documents D1 and D2 each containing a sequence of words drawn from a known vocabulary {W1, W2... Wn}. You wish to obtain two mappings indicating the number of occurrences of each word in each document. So for D1, you might have

W1 --> 0
W2 --> 1
W3 --> 4

indicating that D1 was perhaps "W3 W2 W3 W3 W3". Perhaps D2 is "W2 W1 W2", so its mapping is

W1 --> 1
W2 --> 2
W3 --> 0

You wish to take both mappings and determine the vectors [0, 1, 4] and [1, 2, 0] and then compute the angle between those vectors as a way of determining how similar or different the two documents are.

Your problem is that the dictionary does not guarantee that the key/value pairs are enumerated in any particular order.

OK, so order them.

vector1 = (from pair in map1 orderby pair.Key select pair.Value).ToArray();
vector2 = (from pair in map2 orderby pair.Key select pair.Value).ToArray();

and you're done.

Does that solve your problem, or am I misunderstanding the scenario?
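
For instance, wiring up the example documents above (a sketch; map1 and map2 stand in for your per-document dictionaries, and System.Linq is assumed to be in scope):

var map1 = new Dictionary<string, int> { { "W1", 0 }, { "W2", 1 }, { "W3", 4 } };
var map2 = new Dictionary<string, int> { { "W1", 1 }, { "W2", 2 }, { "W3", 0 } };

// Ordering by key gives the same column order for both documents.
var vector1 = (from pair in map1 orderby pair.Key select pair.Value).ToArray(); // [0, 1, 4]
var vector2 = (from pair in map2 orderby pair.Key select pair.Value).ToArray(); // [1, 2, 0]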


If I understand correctly, you want to break a document into words and count their frequencies.

You could take the document and run a Regex over it to split out the words:

var words = Regex
    .Matches(input, @"\w+")
    .Cast<Match>()
    .Where(m => m.Success)
    .Select(m => m.Value);

To make the frequency map:

var map = words.GroupBy(w => w).Select(g => new { word = g.Key, frequency = g.Count() });

There are overloads of the GroupBy method that allow you to supply an alternative IEqualityComparer if this is important.
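
For example, a minimal sketch assuming a case-insensitive comparer is the kind of IEqualityComparer you have in mind:

// Hypothetical: group case-insensitively, so "Word" and "word" count as one key.
var map = words
    .GroupBy(w => w, StringComparer.OrdinalIgnoreCase)
    .Select(g => new { word = g.Key, frequency = g.Count() });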

Reading your comments, to create a corresponding sequence of only frequencies:

map.Select(a=>a.frequency)

This sequence will be in exactly the same order as the sequence map above.

Is this any help at all?


There is also an OrderedDictionary.

Represents a collection of key/value pairs that are accessible by the key or index.
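
A minimal sketch of how that could look (OrderedDictionary lives in System.Collections.Specialized and is non-generic, so values come back as object and need a cast):

using System.Collections.Specialized;

// OrderedDictionary preserves insertion order and allows access by index.
var counts = new OrderedDictionary();
counts.Add("apple", 0);
counts.Add("banana", 0);

counts["apple"] = (int)counts["apple"] + 1;   // update by key
int firstCount = (int)counts[0];              // read by index (the "apple" slot)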


Something like this might work, although it is definitely ugly and I believe it is similar to what you were suggesting. GetWordCount() does the work.

using System.Collections.Generic;
using System.Linq;

class WordCounter
{
    public Dictionary<string, int> dictionary = new Dictionary<string, int>();

    public void CountWords(string text)
    {
        if (!string.IsNullOrEmpty(text))
        {
            text = text.ToLower();
            string[] words = text.Split(' ');

            // If the first word hasn't been counted yet, count every
            // occurrence of it in the remaining text and record it.
            if (!dictionary.ContainsKey(words[0]))
            {
                int count = words.Count(s => s == words[0]);
                dictionary.Add(words[0], count);
            }

            // Recurse on the rest of the text, past the first word.
            if (text.Length > words[0].Length)
            {
                text = text.Substring(words[0].Length + 1);
                CountWords(text);
            }
        }
    }

    public int[] GetWordCount(string text)
    {
        CountWords(text);
        return dictionary.Values.ToArray();
    }
}
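
A hypothetical usage of the class above; note the returned counts come out in the dictionary's enumeration order, which, as the question points out, is not guaranteed to match any fixed vocabulary order:

var counter = new WordCounter();
int[] counts = counter.GetWordCount("the quick brown fox jumps over the lazy dog");
Console.WriteLine(string.Join(", ", counts));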


Would this be helpful to you:

// SortedDictionary enumerates its keys in sorted order, so the arrays
// below always come out in the same, predictable order.
SortedDictionary<string, int> dic = new SortedDictionary<string, int>();

for (int i = 0; i < 10; i++)
{
    if (dic.ContainsKey("Word" + i))
        dic["Word" + i]++;
    else
        dic.Add("Word" + i, 1);   // first occurrence counts as 1
}

// to get the array of words:
List<string> wordsList = new List<string>(dic.Keys);
string[] wordsArr = wordsList.ToArray();

// to get the array of values:
List<int> valuesList = new List<int>(dic.Values);
int[] valuesArr = valuesList.ToArray();


If all you're trying to do is calculate cosine similarity, you don't need to convert your data to 20,000-element arrays, especially considering the data would likely be sparse, with most entries being zero.

While processing the files, store each file's output data in a Dictionary keyed on the word. Then, to calculate the dot product and magnitudes, iterate through the words in the full word list, look up each word in each file's output data, and use the found value if it exists and zero if it doesn't.
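
A minimal sketch of that idea, assuming each document's counts live in a Dictionary<string, int> (for the dot product it is enough to walk the words of one document, since anything missing from either contributes zero):

using System;
using System.Collections.Generic;
using System.Linq;

static class Similarity
{
    // Cosine similarity computed directly from two sparse word-count dictionaries.
    public static double Cosine(Dictionary<string, int> a, Dictionary<string, int> b)
    {
        // Dot product: only words present in both documents contribute.
        double dot = 0;
        foreach (var pair in a)
        {
            int countInB;
            if (b.TryGetValue(pair.Key, out countInB))
                dot += (double)pair.Value * countInB;
        }

        // Magnitudes come from each document's own counts.
        double magA = Math.Sqrt(a.Values.Sum(v => (double)v * v));
        double magB = Math.Sqrt(b.Values.Sum(v => (double)v * v));

        return (magA == 0 || magB == 0) ? 0 : dot / (magA * magB);
    }
}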

