Data visualization: Bubble charts, Venn diagrams, and tag clouds (oh my!)

2023-01-07 05:45

Suppose I have a large list of objects (thousands or tens of thousands), each of which is tagged with a handful of tags. There are dozens or hundreds of possible tags and their usage follows a typical power law: some tags are used extremely often but most are rare. All but the most frequent couple dozen tags could typically be ignored, in fact.

Now the problem is how to visualize the relationship between these tags. A tag cloud is a nice visualization of just their frequencies but it ignores which tags occur with which other tags. Suppose tag :bar only occurs on objects also tagged :foo. That should be visually apparent. Similarly for three tags that tend to occur together.
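Before choosing a chart, the ":bar only occurs with :foo" case can be made concrete by computing, for each ordered pair of tags, the fraction of objects tagged A that are also tagged B. What follows is only a minimal Python sketch, assuming the data is available as a list of tag sets; the sample data and the names `objects` and `containment` are made up for illustration.

```python
from collections import Counter
from itertools import permutations

# Hypothetical sample data: each object is the set of tags attached to it.
objects = [
    {"foo", "bar"},
    {"foo", "bar", "baz"},
    {"foo"},
    {"foo", "baz"},
    {"qux"},
]

def containment(objects):
    """Return {(a, b): fraction of objects tagged a that are also tagged b}."""
    tag_counts = Counter()
    pair_counts = Counter()
    for tags in objects:
        tag_counts.update(tags)
        # Ordered pairs, so (a, b) and (b, a) are counted separately.
        pair_counts.update(permutations(sorted(tags), 2))
    return {(a, b): n / tag_counts[a] for (a, b), n in pair_counts.items()}

scores = containment(objects)
# A score of 1.0 for (a, b) means tag a never appears without tag b.
subsets = sorted(pair for pair, score in scores.items() if score == 1.0)
print(subsets)
```

Pairs that score 1.0 in one direction but well below 1.0 in the other are the "bubble inside a bubble" cases that the visualization should make apparent.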

You could make each tag a bubble and let them partially overlap with each other. Technically that's a Venn diagram, but treating it that way might be unwieldy. For example, Google Charts can create Venn diagrams, but only for 3 or fewer sets (tags): http://code.google.com/apis/chart/docs/gallery/venn_charts.html

The reason they limit it to 3 sets is that any more and it looks horrendous. See "Extensions to higher numbers of sets" on the Wikipedia page: http://en.wikipedia.org/wiki/Venn_diagrams

But that's only if every possible intersection is non-empty. If no more than 3 tags ever co-occur (maybe after throwing out the rare tags) then a collection of Venn diagrams could work (with the sizes of the bubbles representing tag frequency).

Or perhaps a graph (as in vertices and edges) with visually thicker or thinner edges to represent frequency of co-occurrence.
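The graph idea could start from a weighted edge list. The sketch below is one possible way to build it in Python, assuming the data is a list of tag sets; the sample data, the function name `cooccurrence_edges`, and the `top_n` and `max_width` parameters are all hypothetical. It drops the rare tags and scales each edge width linearly by co-occurrence count.

```python
from collections import Counter
from itertools import combinations

# Hypothetical sample data: each object is the set of tags attached to it.
objects = [
    {"work", "email"},
    {"work", "email", "meeting"},
    {"work", "meeting"},
    {"sleep"},
    {"work", "email"},
]

def cooccurrence_edges(objects, top_n=3, max_width=8.0):
    """Build (tag_a, tag_b, count, width) edges among the top_n most frequent tags."""
    tag_counts = Counter(tag for tags in objects for tag in tags)
    keep = {tag for tag, _ in tag_counts.most_common(top_n)}
    pair_counts = Counter()
    for tags in objects:
        # Unordered pairs, restricted to the frequent tags.
        pair_counts.update(combinations(sorted(tags & keep), 2))
    if not pair_counts:
        return []
    heaviest = max(pair_counts.values())
    # Width is proportional to the count; the heaviest edge gets max_width.
    return [
        (a, b, n, max_width * n / heaviest)
        for (a, b), n in sorted(pair_counts.items())
    ]

edges = cooccurrence_edges(objects)
for a, b, n, width in edges:
    print(f"{a} -- {b}: {n} objects, line width {width:.1f}")
```

The resulting list can be handed to whatever graph-drawing tool is chosen; a linear width scale is only one option, and a square-root or log scale may read better when counts follow a power law.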

Do you have any ideas, or pointers to tools or libraries? Ideally I'd do this with JavaScript but I'm open to things like R and Mathematica or really anything else. I'm happy to share some actual data (you'll laugh if I tell you what it represents) if anyone is curious.

Addendum: The application I originally had in mind was TagTime, but it occurs to me that this also maps well to the problem of visualizing one's Delicious bookmarks.

If I understand your question correctly, an image matrix should work nicely here. The implementation I have in mind would be an n x m matrix in which the tagged items are rows and each tag is a separate column. Every cell in the matrix would be either a "1" or a "0", i.e., a particular item either has a given tag or it doesn't.

In the matrix below (which I rotated 90 degrees so it would fit better in this window, so columns actually represent tagged items, and each row shows the presence or absence of a given tag across all items), I simulated the scenario in which there are 8 tags and 200 tagged items. A "0" is blue and a "1" is light yellow.

All values in this matrix were randomly selected: each tagged item is eight draws from a box containing two tokens, one blue and one yellow (no tag and tag, respectively). So, not surprisingly, there's no visual evidence of a pattern here, but if there is one in your data, this technique, which is dead simple to implement, can help you find it.

I used R to generate and plot the simulated data, using only base graphics (no external packages or libraries):

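# create the matrix from one random row (one tagged item, 8 tags)
r1 = sample(0:1, 8, replace=TRUE)
A = matrix(data=r1, nrow=1, ncol=8)

# populate it with random data: append 199 more items, for 200 rows in total
for (i in seq_len(199)) {r1 = sample(0:1, 8, replace=TRUE); A = rbind(A, r1)}

# now plot it (rows of A run along the x axis, so items are columns in the image)
image(z=A, ann=F, axes=F, col=topo.colors(12))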
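With real data, rather than random draws, the same 0/1 matrix has to be built from the tagged items first. A rough Python sketch of that step follows, assuming each item is available as a set of tags; the sample data and the helper name `incidence_matrix` are invented for illustration. Ordering the columns by tag frequency is a design choice made here, not part of the approach above.

```python
from collections import Counter

# Hypothetical sample data: each item is the set of tags attached to it.
items = [
    {"foo", "bar"},
    {"foo"},
    {"baz"},
    {"foo", "bar", "baz"},
]

def incidence_matrix(items):
    """Return (tags, rows): tags ordered by frequency, one 0/1 row per item."""
    counts = Counter(tag for tags in items for tag in tags)
    # Most frequent first; ties broken alphabetically for a stable column order.
    tags = sorted(counts, key=lambda t: (-counts[t], t))
    rows = [[1 if t in item else 0 for t in tags] for item in items]
    return tags, rows

tags, rows = incidence_matrix(items)
print(tags)
for row in rows:
    print(row)
```

The rows could then be written out (for example as CSV) and plotted as an image in the tool of your choice. Sorting the items so that similar rows sit next to each other would probably make co-occurrence patterns easier to spot.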


I would create something like this if you are targeting the web. Edges connecting the nodes could be thicker or darker in color, or could exert a stronger attractive force so that tags which often occur together end up closer to each other. I would also add the tag name inside the circle.

Some libraries that would be very good for this include:

Protovis (JavaScript)
Flare (Adobe Flash)

Some other fun JavaScript libraries worth looking into are:

Processing for JavaScript
Raphael

Although this is an old thread, I just came across it today.

You may also want to consider using a Self-Organizing Map.

Here is an example of a self-organizing map for world poverty. It used 39 of what you call your "tags" to arrange what you call your "objects".

http://www.cis.hut.fi/research/som-research/povertymap.gif


Not sure it would work, as I did not test it, but here is how I would start:

You can create a matrix as doug suggests in his answer, but instead of having documents as rows and tags as columns, you take a square matrix where tags are both rows and columns. The value of cell [T1;T2] is the number of documents tagged with both T1 and T2 (note that this gives you a symmetric matrix, because [T1;T2] has the same value as [T2;T1]).
Once you have done that, each row (or column) is a vector locating the tag in a space with T dimensions. Tags near each other in this space often occur together. To visualize co-occurrence, you can then apply a dimensionality-reduction method or any clustering method. For example, you can use a Kohonen self-organizing map to project your T-dimensional space onto a 2D space. You then get a 2D matrix where each cell represents an abstract vector in the tag space (meaning the vector won't necessarily exist in your data set). This vector reflects a topological constraint of your source space, and can be seen as a "model" vector reflecting a significant co-occurrence of some tags. Moreover, cells near each other on this map represent vectors close to each other in the source space, which lets you map the tag space onto a 2D matrix.
The final visualization of the matrix can be done in many ways, but I cannot give you advice on that without first seeing the results of the previous processing.
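The first two steps (the square tag-by-tag matrix, then comparing tags as vectors) can be sketched in Python as below. This is only an illustration under assumptions: the sample documents and function names are made up, the diagonal is filled with each tag's own frequency (the text above does not say what to put there), and cosine similarity stands in as a simple closeness measure. It is not a self-organizing map; the rows of the matrix would be the input to one.

```python
import math
from itertools import combinations

# Hypothetical sample data: each document is the set of tags attached to it.
documents = [
    {"T1", "T2"},
    {"T1", "T2"},
    {"T1", "T3"},
    {"T3"},
]

def cooccurrence_matrix(documents):
    """Return (tags, M) where M[i][j] counts documents carrying both tags."""
    tags = sorted(set().union(*documents))
    index = {t: i for i, t in enumerate(tags)}
    M = [[0] * len(tags) for _ in tags]
    for doc in documents:
        for t in doc:
            M[index[t]][index[t]] += 1  # diagonal: plain tag frequency (assumption)
        for a, b in combinations(sorted(doc), 2):
            M[index[a]][index[b]] += 1
            M[index[b]][index[a]] += 1  # mirror the count to keep M symmetric
    return tags, M

def cosine(u, v):
    """Cosine similarity between two tag vectors (rows of the matrix)."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm if norm else 0.0

tags, M = cooccurrence_matrix(documents)
for tag, row in zip(tags, M):
    print(tag, row)
print(cosine(M[0], M[1]), cosine(M[1], M[2]))
```

In this toy data T1 and T2 come out as much closer to each other than T2 and T3, which matches the fact that T2 and T3 never share a document.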
