Bag of Words representation problem

2022-12-22 11:34

Basically, I have a dictionary containing all the words of my vocabulary as keys, each with 0 as its value.

To process a document into a bag-of-words representation, I copied that dictionary with the appropriate IEqualityComparer, checked for each word in the document whether the dictionary contained it, and incremented its value if so.

To get the array for the bag-of-words representation, I simply used the ToArray method.

This seemed to work fine, but I was just told that the dictionary doesn't guarantee the same key order, so the resulting arrays might represent the words in different orders, making them useless.

My current idea to solve this problem is to copy all the keys of the word dictionary into an ArrayList, create an array of the proper size, and then use the IndexOf method of the ArrayList to fill the array.

So my question is: is there a better way to solve this? Mine seems kind of crude. And won't I have issues because of the IEqualityComparer?

Let me see if I understand the problem. You have two documents D1 and D2 each containing a sequence of words drawn from a known vocabulary {W1, W2... Wn}. You wish to obtain two mappings indicating the number of occurrences of each word in each document. So for D1, you might have

W1 --> 0
W2 --> 1
W3 --> 4

indicating that D1 was perhaps "W3 W2 W3 W3 W3". Perhaps D2 is "W2 W1 W2", so its mapping is

W1 --> 1
W2 --> 2
W3 --> 0

You wish to take both mappings and determine the vectors [0, 1, 4] and [1, 2, 0] and then compute the angle between those vectors as a way of determining how similar or different the two documents are.

Your problem is that the dictionary does not guarantee that the key/value pairs are enumerated in any particular order.

OK, so order them.

vector1 = (from pair in map1 orderby pair.Key select pair.Value).ToArray();
vector2 = (from pair in map2 orderby pair.Key select pair.Value).ToArray();

and you're done.

Does that solve your problem, or am I misunderstanding the scenario?
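To make this concrete, here is a minimal sketch of the ordering approach using the D1 and D2 example above. The names `VectorDemo`, `ToVector` and `AngleBetween` are made up for illustration. It assumes both maps contain every vocabulary word as a key (as in the question) and that neither vector is all zeros. It sorts with `StringComparer.Ordinal` so the order does not depend on the current culture; if your maps use a custom IEqualityComparer, the ordering comparer should agree with it about which strings are equal.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class VectorDemo
{
    // Returns the counts ordered by key, so every map built over the same
    // vocabulary yields its values in the same positions.
    public static int[] ToVector(IDictionary<string, int> map)
    {
        return map.OrderBy(pair => pair.Key, StringComparer.Ordinal)
                  .Select(pair => pair.Value)
                  .ToArray();
    }

    // Angle in radians between two equal-length, non-zero vectors.
    public static double AngleBetween(int[] a, int[] b)
    {
        double dot = 0, magA = 0, magB = 0;
        for (int i = 0; i < a.Length; i++)
        {
            dot += (double)a[i] * b[i];
            magA += (double)a[i] * a[i];
            magB += (double)b[i] * b[i];
        }
        double cosine = dot / (Math.Sqrt(magA) * Math.Sqrt(magB));
        // Clamp to [-1, 1] to guard against floating-point rounding.
        return Math.Acos(Math.Max(-1.0, Math.Min(1.0, cosine)));
    }

    public static void Main()
    {
        // Keys are deliberately inserted in a scrambled order.
        var map1 = new Dictionary<string, int> { { "W3", 4 }, { "W1", 0 }, { "W2", 1 } };
        var map2 = new Dictionary<string, int> { { "W2", 2 }, { "W3", 0 }, { "W1", 1 } };

        int[] vector1 = ToVector(map1); // [0, 1, 4]
        int[] vector2 = ToVector(map2); // [1, 2, 0]

        Console.WriteLine(AngleBetween(vector1, vector2));
    }
}
```

Because the sort runs on every conversion, it may be worth sorting the vocabulary once and reusing that order if you convert many documents.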

If I understand correctly, you want to split a document by word frequency.

You could take the document and run a Regex over it to split out the words:

var words=Regex
    .Matches(input,@"\w+")
    .Cast<Match>()
    .Where(m=>m.Success)
    .Select(m=>m.Value);

To make the frequency map:

var map=words.GroupBy(w=>w).Select(g=>new{word=g.Key,frequency=g.Count()});

There are overloads of the GroupBy method that allow you to supply an alternative IEqualityComparer if this is important.
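As a rough sketch of that overload, the following groups words case-insensitively by passing a comparer to `GroupBy`. The names `FrequencyDemo` and `CountWords`, and the choice of `StringComparer.OrdinalIgnoreCase`, are assumptions for illustration; substitute whatever `IEqualityComparer<string>` you already use.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

public static class FrequencyDemo
{
    // Splits the input into words and counts them, treating two words as
    // the same whenever the supplied comparer says they are equal.
    public static Dictionary<string, int> CountWords(
        string input, IEqualityComparer<string> comparer)
    {
        return Regex.Matches(input, @"\w+")
            .Cast<Match>()
            .Select(m => m.Value)
            .GroupBy(w => w, comparer)
            // The same comparer is passed to the dictionary so that
            // lookups behave consistently with the grouping.
            .ToDictionary(g => g.Key, g => g.Count(), comparer);
    }

    public static void Main()
    {
        var map = CountWords("The cat saw the other CAT", StringComparer.OrdinalIgnoreCase);
        foreach (var pair in map)
        {
            Console.WriteLine("{0}: {1}", pair.Key, pair.Value);
        }
    }
}
```

Returning a dictionary instead of a sequence of anonymous objects is a design choice made here so the result can be queried by word afterwards.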

Reading your comments, to create a corresponding sequence of only frequencies:

map.Select(a=>a.frequency)

This sequence will be in exactly the same order as the sequence map above.

Is this any help at all?

There is also an OrderedDictionary (in System.Collections.Specialized). From the documentation:

Represents a collection of key/value pairs that are accessible by the key or index.
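A minimal sketch of how this could be applied to the bag-of-words case follows. The names `OrderedDemo` and `Count` are hypothetical, and the use of `StringComparer.Ordinal` is an assumption; pass your own comparer if it implements the non-generic `IEqualityComparer`. Note that `OrderedDictionary` is non-generic, so values come back as `object` and need a cast.

```csharp
using System;
using System.Collections.Specialized;
using System.Linq;

public static class OrderedDemo
{
    // Builds a count vector whose positions follow the order in which
    // the vocabulary words were added.
    public static int[] Count(string[] vocabulary, string[] documentWords)
    {
        var counts = new OrderedDictionary(StringComparer.Ordinal);
        foreach (string word in vocabulary)
        {
            counts.Add(word, 0);
        }

        foreach (string word in documentWords)
        {
            // Words outside the vocabulary are ignored.
            if (counts.Contains(word))
            {
                counts[word] = (int)counts[word] + 1;
            }
        }

        // Values are enumerated in insertion order.
        return counts.Values.Cast<int>().ToArray();
    }

    public static void Main()
    {
        string[] vocabulary = { "W1", "W2", "W3" };
        int[] vector = Count(vocabulary, "W3 W2 W3 W3 W3".Split(' '));
        Console.WriteLine(string.Join(", ", vector.Select(n => n.ToString()).ToArray()));
    }
}
```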

Something like this might work, although it is definitely ugly, and I believe it is similar to what you were suggesting. GetWordCount() does the work.

using System.Collections.Generic;
using System.Linq;

class WordCounter
{
    public Dictionary<string, int> dictionary = new Dictionary<string, int>();

    // Counts the first word of the text, then recurses on the rest of the text.
    public void CountWords(string text)
    {
        if (!string.IsNullOrEmpty(text))
        {
            text = text.ToLower();
            string[] words = text.Split(' ');
            if (!dictionary.ContainsKey(words[0]))
            {
                // First time this word is seen: count all of its
                // occurrences in the remaining text in one pass.
                int count = words.Count(s => s == words[0]);
                dictionary.Add(words[0], count);
            }
            if (text.Length > words[0].Length)
            {
                // Drop the first word and the space after it, then continue.
                text = text.Substring(words[0].Length + 1);
                CountWords(text);
            }
        }
    }

    // Note: the values come back in the dictionary's enumeration order,
    // which Dictionary does not guarantee.
    public int[] GetWordCount(string text)
    {
        CountWords(text);
        return dictionary.Values.ToArray();
    }
}

Would this be helpful to you?

// requires: using System.Collections.Generic;
SortedDictionary<string, int> dic = new SortedDictionary<string, int>();

for (int i = 0; i < 10; i++)
{
    if (dic.ContainsKey("Word" + i))
        dic["Word" + i]++;
    else
        dic.Add("Word" + i, 1); // first occurrence counts as 1
}

// to get the array of words (always in sorted key order):
List<string> wordsList = new List<string>(dic.Keys);
string[] wordsArr = wordsList.ToArray();

// to get the array of values (in the same order as the keys):
List<int> valuesList = new List<int>(dic.Values);
int[] valuesArr = valuesList.ToArray();

If all you're trying to do is calculate cosine similarity, you don't need to convert your data to 20,000-length arrays, especially considering the data would likely be sparse with most entries being zero.

While processing the files, store each file's output data in a Dictionary keyed on the word. Then, to calculate the dot product and magnitudes, iterate through the words in the full word list, look up each word in each file's output data, and use the found value if it exists and zero if it doesn't.
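The idea can be sketched roughly as follows. The names `SparseCosine` and `Similarity` are hypothetical. As a small variation on the description above, this sketch iterates over the entries of one document rather than the full word list, since any word missing from either document contributes zero to the dot product. It assumes both dictionaries were built with the same comparer.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class SparseCosine
{
    // Cosine similarity of two sparse word-count maps.
    // Words absent from a map are treated as having a count of zero.
    public static double Similarity(
        IDictionary<string, int> a, IDictionary<string, int> b)
    {
        double dot = 0;
        foreach (KeyValuePair<string, int> pair in a)
        {
            int other;
            if (b.TryGetValue(pair.Key, out other))
            {
                dot += (double)pair.Value * other;
            }
        }

        double magA = Math.Sqrt(a.Values.Sum(v => (double)v * v));
        double magB = Math.Sqrt(b.Values.Sum(v => (double)v * v));

        // An empty or all-zero document has no direction; report 0.
        if (magA == 0 || magB == 0)
        {
            return 0;
        }
        return dot / (magA * magB);
    }

    public static void Main()
    {
        // Only non-zero counts are stored.
        var doc1 = new Dictionary<string, int> { { "W2", 1 }, { "W3", 4 } };
        var doc2 = new Dictionary<string, int> { { "W1", 1 }, { "W2", 2 } };
        Console.WriteLine(Similarity(doc1, doc2));
    }
}
```

With this representation the key order of the dictionaries no longer matters, because values are matched by word rather than by position.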
