Filter duplicate URL domains from a List in C#
I have a list of 100,000 URLs in a List(Of String), which can contain URLs in forms such as:
yahoo.com
http://yahoo.com
http://www.yahoo.com
I have tried using a combination of Regex and the Uri class, but that didn't help, so I dumped the code. I also tried the code below, but it only removes duplicates of the exact same form, since it isn't domain-specific:
list = new ArrayList<T>(new HashSet<T>(list))
How can I filter these duplicates and keep just one URL when they share the same domain name, e.g. yahoo?
Thanks.
[EDIT]
Please note that:
All the URLs are of different domains, but there can be duplicates like in the example above.
Also, I am using .NET 2.0, so I can't use LINQ.
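Since LINQ and HashSet&lt;T&gt; both arrived after .NET 2.0, one way to sketch the dedup on 2.0 is with a Dictionary standing in for a set. This is a hedged sketch, not the accepted answer's code: the `UrlDeduper`/`NormalizeHost` names are made up here, and it assumes schemeless entries should be treated as http and that a leading "www." should be ignored when comparing.

```csharp
using System;
using System.Collections.Generic;

class UrlDeduper
{
    // Normalize a URL to its host, stripping a leading "www." so that
    // yahoo.com, http://yahoo.com and http://www.yahoo.com all compare equal.
    // Assumption: entries with no scheme are treated as http.
    static string NormalizeHost(string url)
    {
        if (!url.StartsWith("http://") && !url.StartsWith("https://"))
            url = "http://" + url;
        string host = new Uri(url).Host;
        if (host.StartsWith("www."))
            host = host.Substring(4);
        return host;
    }

    // .NET 2.0-compatible dedup: Dictionary plays the role of HashSet,
    // which was only added in .NET 3.5.
    public static List<string> DistinctByHost(List<string> urls)
    {
        Dictionary<string, bool> seen = new Dictionary<string, bool>();
        List<string> result = new List<string>();
        foreach (string url in urls)
        {
            string host = NormalizeHost(url);
            if (!seen.ContainsKey(host))
            {
                seen.Add(host, true);
                result.Add(url); // keep the first form encountered
            }
        }
        return result;
    }
}
```

With the three example inputs above, this keeps only the first entry ("yahoo.com").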
This worked for me:
[TestMethod]
public void TestMethod1()
{
    var sites = new List<string> { "yahoo.com", "http://yahoo.com", "http://www.yahoo.com" };
    var result = sites.Select(
        s => s.StartsWith("http://www.")
                 ? s
                 : s.StartsWith("http://")
                     ? "http://www." + s.Substring(7)
                     : "http://www." + s).Distinct();
    Assert.AreEqual(1, result.Count());
}
I think the Uri class would be able to help in this case. I am not at a VS machine where I can test; however, pass the URL string to the Uri constructor and compare the Host property:
List<string> distinctHosts = new List<string>();
foreach (string url in UrlList)
{
    Uri uri = new Uri(url);
    if (!distinctHosts.Contains(uri.Host))
    {
        distinctHosts.Add(uri.Host);
    }
}
This feels a bit primitive and could probably be more elegant, possibly without a foreach; but like I said, I'm not at a development machine where I could work with it.
I think this would handle any variation of a valid URL. Building an ArrayList is not a good idea; in my opinion, Regex would require you to maintain some sort of custom 'MatchList' that could get unwieldy.
As @Damokles points out, you should have some form of validation. The Uri class does require a protocol such as 'http://' or 'ftp://'. You do not want to assume 'badurl.com' is invalid just because it lacks one; however:
if (!url.StartsWith("http://")) { url = "http://" + url; } // then check the Host property as above
...should be sufficient to retrieve a distinct host or domain name. I recommend any option that does not require guessing the index position of any part of the URL, as that is tightly bound to specific formats.
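As a minimal sketch of the scheme-prefixing step described above (the `SchemeFix` class name is made up here, and "badurl.com" is just a sample input):

```csharp
using System;

class SchemeFix
{
    static void Main()
    {
        // Prefix a scheme so the Uri constructor accepts a bare domain.
        string url = "badurl.com";
        if (!url.StartsWith("http://") && !url.StartsWith("https://"))
            url = "http://" + url;
        Console.WriteLine(new Uri(url).Host); // prints "badurl.com"
    }
}
```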
You can do this with the Uri class and Linq/extension methods. The trick is to normalize the Url before using it with the Uri class. Also note that the Uri class requires the scheme, so that will have to be added for ones where it's not present. You can use a different property of the Uri class to achieve different results. The example below returns all unique Urls and treats yahoo.com differently than www.yahoo.com.
string[] urls = new[] {
    "yahoo.com",
    "http://yahoo.com",
    "http://www.yahoo.com" };
var unique = urls.
    Select(url => new System.Uri(
        url.StartsWith("http") ? url : "http://" + url).Host).
    Distinct();
(Edited to clean up formatting and to make the scheme addition part support both "http://" and "https://")
Try a Regex like .*?(\w+\.\w+)$, assuming you don't have anything after the TLD.
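A quick sketch of that regex in action (the `RegexDomain` class name is made up here). Note the pattern captures only the last two dot-separated parts, so multi-part TLDs like .co.uk would come out wrong, as the caveat above implies:

```csharp
using System;
using System.Text.RegularExpressions;

class RegexDomain
{
    static void Main()
    {
        // Pull the trailing "name.tld" pair out of each URL form.
        string[] urls = { "yahoo.com", "http://yahoo.com", "http://www.yahoo.com" };
        foreach (string url in urls)
        {
            Match m = Regex.Match(url, @".*?(\w+\.\w+)$");
            if (m.Success)
                Console.WriteLine(m.Groups[1].Value); // prints "yahoo.com" for each
        }
    }
}
```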