What's the best way to map the link connection between blogs?
I wish to perform a social network analysis on a bunch of blogs, plotting who is linking to whom (not just by their blogroll but also inside their posts). What software can perform such crawling, data collecting, and mapping?
Thanks!
By "mapping", I'm not sure whether you mean mapping the raw data to a conventional graph data structure, or mapping that data structure to a graphics library in order to render it. If the former, it should be a straightforward matter of writing a function that translates the raw data (which blogs link to which, and how often) into a graph data structure such as an adjacency matrix. Rendering such a data structure for viewing can be done like this:
library(Rgraphviz)
# create a synthetic adjacency matrix for 10 blogs
M = sapply(rep(10, 10), function(x){sample(c(0, 1), 10, TRUE, c(0.7, 0.3))})
colnames(M) = paste(rep("b", 10), 1:10, sep="-")
rownames(M) = colnames(M)
# 0's down the main diagonal (eliminate self-edges)
diag(M) = rep(0, 10)
# build a graphAM object (class from the 'graph' package, which Rgraphviz loads),
# passing in the adjacency matrix
M_gr = new("graphAM", adjMat=M, edgemode="directed")
g1 = layoutGraph(M_gr)
# (optional) aesthetic parameters for nodes & edges
graph.par(list(
  edges = list(col="gray", lty="dashed", lwd=1),
  nodes = list(col="midnightblue", shape="ellipse",
               textCol="darkred", fill="#B0B7C6", fontsize=11,
               lty="dotted", lwd=2)
))
# call the device driver
png(file='somefilename.png', width=600, height=460, res=128)
# call the plot function
renderGraph(g1)
# kill the device
dev.off()
(Resulting plot: http://img13.imageshack.us/img13/7683/bloggraph.png)
If you want to show not just the connections but also their strength (e.g., the number or frequency of links from one blog to another), you can set the line thickness of each edge individually through the 'lwd' parameter, which I've set to 1 for all edges in this example. Other options are to show connection strength by line type (dotted, dashed, solid) or by color. Of course, these edge weights have to be stored in your adjacency matrix, which is simple enough: instead of 0/1 to represent 'not connected'/'connected', you'll probably want to use 0 and positive integers (the link counts).
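For the first kind of "mapping" (raw link data to a weighted adjacency matrix), here is a minimal sketch. It is in Python rather than R, purely for illustration, and it assumes your crawl has already produced a list of (source blog, target blog) pairs, one per hyperlink found. The function name build_adjacency and the sample data are made up for this example.

```python
from collections import Counter


def build_adjacency(links):
    """Turn (source, target) pairs into a weighted adjacency matrix.

    Returns the sorted list of blog names and a square matrix where
    matrix[i][j] is the number of links from blogs[i] to blogs[j].
    """
    links = list(links)
    blogs = sorted({blog for pair in links for blog in pair})
    index = {blog: i for i, blog in enumerate(blogs)}
    matrix = [[0] * len(blogs) for _ in blogs]
    for (source, target), n in Counter(links).items():
        if source != target:  # keep 0's on the diagonal (no self-edges)
            matrix[index[source]][index[target]] = n
    return blogs, matrix


# Hypothetical sample data: b-1 links to b-2 twice, b-3 links to itself once.
raw_links = [
    ("b-1", "b-2"), ("b-1", "b-2"),
    ("b-2", "b-3"),
    ("b-3", "b-1"),
    ("b-3", "b-3"),
]
blogs, matrix = build_adjacency(raw_links)
for name, row in zip(blogs, matrix):
    print(name, row)
```

The rows could then be written out as CSV and read into R as the matrix M used above, with the counts driving the per-edge 'lwd'.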
You could also do this in R with a combination of something like RCurl or XML (to get the blog posts) and something like igraph (for the SNA). You will need to parse the HTML to get all the links, and the XML package can handle that kind of processing very easily.
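To give an idea of what the link-parsing step involves, here is a rough sketch using only the Python standard library (not RCurl/XML, which would be the R route described above). It assumes you already have a post's HTML as a string and that "one blog = one hostname", which is a simplification. LinkCollector, outbound_hosts and the example.* URLs are names invented for this example.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class LinkCollector(HTMLParser):
    """Collects the absolute URL of every <a href="..."> in a page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                # Resolve relative links against the page's own URL.
                self.links.append(urljoin(self.base_url, href))


def outbound_hosts(page_url, html):
    """Return the hostnames this page links to, excluding its own host."""
    collector = LinkCollector(page_url)
    collector.feed(html)
    own_host = urlparse(page_url).netloc
    hosts = []
    for link in collector.links:
        parts = urlparse(link)
        if parts.scheme in ("http", "https") and parts.netloc and parts.netloc != own_host:
            hosts.append(parts.netloc)
    return hosts


sample_html = """
<p>See <a href="http://blog-b.example/post/1">this post</a>,
<a href="/about">about me</a> and
<a href="http://blog-c.example/">another blog</a>.</p>
"""
hosts = outbound_hosts("http://blog-a.example/2010/01/hello", sample_html)
print(hosts)
```

Each (own host, linked host) pair is one edge for the network; internal links such as "/about" are dropped because they resolve to the page's own host.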
Have a look at this related question for some pointers on the SNA itself, although it is a big field of study.
Nutch is a decent enough crawler, but you'd have to do your own analysis on the indexed data.
For the record, I highly recommend the mechanize library in Python; it makes building your own personalized crawler/scraper a snap.
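Whatever you use to fetch pages, the crawl loop itself is small. Below is a bare-bones sketch of a breadth-first crawl that records who links to whom. It does not use mechanize: the fetching is hidden behind a fetch_links(url) callable that you would supply yourself (with mechanize or anything else), and here it is replaced by a fake in-memory "web" so the example runs offline. The names crawl, fetch_links and fake_web are assumptions for this example.

```python
from collections import deque


def crawl(seeds, fetch_links, max_pages=100):
    """Breadth-first crawl starting from the seed URLs.

    fetch_links(url) must return the list of URLs that url links to.
    Returns a list of (source, target) edges, one per link found.
    """
    queue = deque(seeds)
    seen = set(seeds)
    edges = []
    visited = 0
    while queue and visited < max_pages:
        url = queue.popleft()
        visited += 1
        for target in fetch_links(url):
            edges.append((url, target))
            if target not in seen:  # queue each page only once
                seen.add(target)
                queue.append(target)
    return edges


# Fake "web" standing in for real HTTP requests.
fake_web = {
    "a": ["b", "c"],
    "b": ["a"],
    "c": ["b", "d"],
}
edges = crawl(["a"], lambda url: fake_web.get(url, []))
print(edges)
```

A real crawler would also want politeness delays, robots.txt handling and error handling around the fetch, all of which are left out here.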