
Determining an a priori ranking of what sites a user has most likely visited

This is for http://cssfingerprint.com

I have a largish database (~100M rows) of websites. This includes both main domains (both 2LD and 3LD) and particular URLs scraped from those domains (whether hosted there [like most blogs] or only linked from it [like Digg], and with a reference to the host domain).

I also scrape the Alexa top million, Bloglines top 1000, Google pagerank, Technorati top 100, and Quantcast top million rankings. Many domains will have no ranking though, or only a partial set; and nearly all sub-domain URLs have no ranking at all other than Google's 0-10 pagerank (some don't even have that).

I can add any new scrapings necessary, assuming it doesn't require a massive amount of spidering.

I also have a fair amount of information about what sites previous users have visited.

What I need is an algorithm that orders these URLs by how likely a visitor is to have visited that URL without any knowledge of the current visitor. (It can, however, use aggregated information about previous users.)
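Since aggregated data about previous users is fair game, one baseline worth stating explicitly is the empirical visit frequency with smoothing, so that URLs no sampled user happened to visit still get nonzero prior mass. This is only an illustrative sketch (the function and data shapes are hypothetical, not from the site's actual code), assuming the logs can be reduced to one set of visited URLs per user:

```python
from collections import Counter

# Hypothetical sketch: turn aggregated logs of previous users' visits
# into an empirical a-priori visit probability per URL, with Laplace
# (add-one) smoothing so unseen URLs still get nonzero mass.

def empirical_prior(visit_logs, all_urls, alpha=1.0):
    """visit_logs: iterable of per-user sets of visited URLs.
    Returns {url: estimated P(a random user has visited url)}."""
    counts = Counter()
    n_users = 0
    for visited in visit_logs:
        counts.update(visited)
        n_users += 1
    # P(visited) ~ (hits + alpha) / (users + 2*alpha)  -- add-one smoothing
    return {url: (counts[url] + alpha) / (n_users + 2 * alpha)
            for url in all_urls}

logs = [{"a.com", "b.com"}, {"a.com"}, {"a.com", "c.com"}]
prior = empirical_prior(logs, ["a.com", "b.com", "c.com", "d.com"])
# "a.com" (seen by all 3 users) ranks above "d.com" (seen by none)
```

The smoothing constant `alpha` trades off between trusting the sample and keeping the long tail alive; with only a modest number of prior users, the tail estimate is dominated by `alpha`, which is where the external rankings below become useful.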

This question is just about the relatively fixed (or at least aggregated) a priori ranking; there's another question that deals with getting a dynamic ranking.

Given that I have limited resources (both computational and financial), what's the best way for me to rank these sites in order of a priori probability of their having been visited?
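To make the question concrete, here is the kind of cheap heuristic I'd consider a baseline to beat: convert each available rank signal to a Zipf-style traffic score (visit probability roughly proportional to 1/rank), use PageRank as a fallback for URLs with no traffic rank, and average whatever signals exist. All field names and the PageRank rescaling are illustrative assumptions, not part of my current setup:

```python
# Hypothetical baseline: combine partial rank signals into one
# a-priori visit score. Assumes Zipf-like traffic (score ~ 1/rank)
# for rank-based lists; treats Google PageRank (0-10, roughly a log
# scale) as a fallback, rescaled so PR-10 lands near the top of the
# same curve. Higher score = more likely to have been visited.

def visit_score(alexa=None, quantcast=None, pagerank=None):
    scores = []
    for rank in (alexa, quantcast):
        if rank is not None and rank >= 1:
            scores.append(1.0 / rank)      # Zipf: traffic ~ 1/rank
    if pagerank is not None:
        scores.append(10 ** (pagerank - 10))  # crude log-scale rescale
    if not scores:
        return 0.0                         # unranked: assume rarely visited
    return sum(scores) / len(scores)       # average available signals

sites = {
    "example-top.com":  dict(alexa=5, quantcast=8, pagerank=9),
    "example-mid.com":  dict(alexa=50_000, pagerank=5),
    "example-tail.com": dict(pagerank=2),
}
ranked = sorted(sites, key=lambda s: visit_score(**sites[s]), reverse=True)
```

The obvious weaknesses (missing signals silently shrink the average; the PageRank rescaling is a guess) are exactly what I'd hope a better answer addresses.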
