Dataset for URL normalization
I'm working on a project for normal开发者_开发技巧izing URL's.(i.e different URL's that map to the same web page should be identified and redundancy should be reduced as like a search engine).
So I'd like a dataset containing different URL's in order to test my method. Please provide links for normalization dataset(s).
I'm implementing this project in C# and I'd like your suggestions. Thanks in advance.
Since you asked I'd like your suggestions
, leaving your question very open and thus open to which kind of suggestion you might get, I will go ahead and give you my suggestions. Though I admit I am not 100% sure what problem you wish to tackle? Are you asking for a program/code specific suggestion? A strategy for how to setup such a project? or do you wish to collect inspirations/idea's and improve your existing workflow? If you are seeking this third thing, I would suggest to take a look into two scenarios, inspired by a lecture that one of my Artificial Intelligence teachers once gave. Lets dive for a moment to how Ant colonies organise themselves:
top-down approach: a fantasy Imagine a queen in an antcology prescribing for each and every ant their routes to the sub colonies and thereby normalising multiple trace routes that varous ants all undertake to go to the same place, then it seems you want to group the ants together and let each group use just 1 route to their goals, and remove possible duplicate routes. This is one way how to make their routes more efficient. In reality ants actually work differently :
bottom-up approach: the reality: A single ant has little meaning, but when a whole ant colony is studies, an organisation reveals. Thi sis because the ants themselves follow the scent traces of other ants, that way following eachother and ultimately finding their way to the nest. This way, the cleverness does not need to come from above/from a central database, but a tiny bit of intelligence built in each ant will make the same path re-useable. >> In this way you might want to think building your normalisation technique within each hyperlink that needs to be normalized.
I hope this can give you the suggestions you wished, otherwise if your question was not strategy based but specific code-problem related, ask question with program code in it, that is often much easier to solve than finding the best strategy. Good luck! My 2 cents.
精彩评论