What is an efficient way to use Google to search for duplicate content?

I am writing software that searches the web for duplicate content (text only). I think I can use Google for this, since it is fast and efficient. I developed an algorithm for it, but it is not efficient.

Here is my idea. The user enters content that is 300-500 characters long. This content is searched for in Google, and only the first page of results is considered.

Example: the content is "The definition of a breed is a matter of some controversy. Some groups use a definition that ultimately requires extreme in-breeding to qualify. Dogs that are bred in this manner often end up with severe health problems. Other organizations define a breed more loosely, such that an individual may be considered of one breed as long as, say, three of its grandparents were of that breed".

First result in Google: Brief History of Dogs and Breeds. Dog usually means the domestic dog, ... Some groups use a definition that ultimately requires extreme in-breeding to qualify. Dogs that are bred in this manner often end up with severe health problems. Other organizations define a breed more loosely, such that an individual may be ...

So from the first result we can say that the content is present on the web.

My algorithm

    bool checkContentVsResult(string googletext, string content)
    {
        // Split the Google snippet into sentence-like fragments on periods.
        string[] ch = new string[] { "." };
        string[] texts = googletext.Split(ch, StringSplitOptions.RemoveEmptyEntries);

        // Longer content must match more fragments to count as found.
        int len = content.Length;
        int count = 0, qualify = 0;
        if (len > 300)
            qualify = 3;
        else if (len > 200)
            qualify = 2;
        else
            qualify = 1;

        foreach (string s in texts)
        {
            // Skip fragments that are empty once surrounding whitespace is removed.
            string fragment = s.Trim();
            if (fragment.Length == 0)
                continue;
            if (content.Contains(fragment))
                count++;
            if (count >= qualify)
                return true;
        }
        return false;
    }

As you can see, the algorithm is not very efficient. How can I make it more efficient?


Try a Google search for "levenshtein distance c"?
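
For reference, here is a minimal sketch of the Levenshtein (edit) distance in C#, the language of the question's code. The class name `FuzzyMatch` and the `Similarity` helper are hypothetical names chosen for illustration, not part of any library. The idea would be to compare a sentence from the snippet against a sentence from the content and treat a high similarity as a match, rather than requiring an exact `Contains` hit.

```csharp
using System;

static class FuzzyMatch
{
    // Edit distance via dynamic programming, keeping only two rows of the table.
    public static int LevenshteinDistance(string a, string b)
    {
        if (a.Length == 0) return b.Length;
        if (b.Length == 0) return a.Length;

        int[] previous = new int[b.Length + 1];
        int[] current = new int[b.Length + 1];

        // Distance from the empty prefix of a to each prefix of b.
        for (int j = 0; j <= b.Length; j++)
            previous[j] = j;

        for (int i = 1; i <= a.Length; i++)
        {
            current[0] = i;
            for (int j = 1; j <= b.Length; j++)
            {
                int cost = a[i - 1] == b[j - 1] ? 0 : 1;
                current[j] = Math.Min(
                    Math.Min(current[j - 1] + 1, previous[j] + 1), // insertion, deletion
                    previous[j - 1] + cost);                       // substitution or match
            }
            // Reuse the two arrays for the next row.
            int[] swap = previous;
            previous = current;
            current = swap;
        }
        return previous[b.Length];
    }

    // Hypothetical helper: 1.0 for identical strings, approaching 0.0 as they diverge.
    public static double Similarity(string a, string b)
    {
        int longest = Math.Max(a.Length, b.Length);
        if (longest == 0) return 1.0;
        return 1.0 - (double)LevenshteinDistance(a, b) / longest;
    }
}
```

Note that the distance costs time proportional to the product of the two string lengths, so it is probably better applied sentence by sentence than to the whole 300-500 character content at once. Any similarity threshold for calling two sentences a match would have to be tuned by experiment.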
