开发者

How to make a group by statement (from MySQL) in Stata?

I am not a statistics guy but have to deal with quite some data. In most cases these data sets come from an online survey; hence I do have a MySQL database and know how to get some results out of that.

However, now I got a Stata file and I am required to do some analysis. In MySQL I'd know how to do that, but I am stuck in Stata and ask for your help.

I have a not too small table (roughly 50k rows) containing following columns (there are more cols but these are the ones I have to work with):

id - Object ID, unique values

name - Name of object, string value

class - Class of object, integer range 1 - 6

origin - Origin of object, integer range 1 - 2

Within the 50k rows there are only about 7k different names. In Stata I can retrieve all names with list name and could even restrict it to a single class with list name if class == 2.

Now I want a list of all different names along with a count of objects having that name and have the list sorted by count. In MySQL I'd query SELECT name, COUNT(*) AS cnt FROM objects GROUP BY name ORDER BY cnt DESC. But how would that be done in Stata?

Next steps would be to get such lists for each class or for both origins, i.e. SELECT name, COUNT(*) AS cnt FROM objects WHERE class = 2 GROUP BY name ORDER BY cnt DESC, is that possible with Stata, too?

ps: I don't know if stackoverflow is the right place as Stata is not really a programming language, is it? But I 开发者_如何转开发found some Stata-related questions here; that's why I posted it here. If there's a better place to do so, please point me to the right place.


Keep in mind that Stata only works with rectangular tables of fixed length, so you can only add columns that span the whole 50k rows. Within this setup, this is what you can do.

For the first problem (the list of names and frequencies), you can

   collapse (count) freq = name, by(class)
   sort class freq name
   list class name freq, sepby(class)

Note that collapse will delete the existing data and replace with the summary. (Usually, I hate this command for this aspect of data management, but it should work here.) If you don't want this to happen, here's a more sophisticated trick:

   bysort class name : generate long freq = _N
   bysort class name : generate byte first = (_n==1)
   sort class freq name
   list class name freq if first, sepby(class)

(Explanation: _N is the number of observations in by-group, and _n is the number of the current observation within the by-group.)

You can then subset this to the class of interest with if class==#, as you already know.


Also check out the groups command downloadable using ssc inst groups.

0

上一篇:

下一篇:

精彩评论

暂无评论...
验证码 换一张
取 消

最新问答

问答排行榜