How to make a group by statement (from MySQL) in Stata?
I am not a statistics guy but have to deal with quite some data. In most cases these data sets come from an online survey; hence I do have a MySQL database and know how to get some results out of that.
However, now I got a Stata file and I am required to do some analysis. In MySQL I'd know how to do that, but I am stuck in Stata and ask for your help.
I have a not too small table (roughly 50k rows) containing following columns (there are more cols but these are the ones I have to work with):
id - Object ID, unique values
name - Name of object, string value
class - Class of object, integer range 1 - 6
origin - Origin of object, integer range 1 - 2
Within the 50k rows there are only about 7k different names. In Stata I can retrieve all names with list name
and could even restrict it to a single class with list name if class == 2
.
Now I want a list of all different names along with a count of objects having that name and have the list sorted by count.
In MySQL I'd query SELECT name, COUNT(*) AS cnt FROM objects GROUP BY name ORDER BY cnt DESC
. But how would that be done in Stata?
Next steps would be to get such lists for each class or for both origins, i.e. SELECT name, COUNT(*) AS cnt FROM objects WHERE class = 2 GROUP BY name ORDER BY cnt DESC
, is that possible with Stata, too?
ps: I don't know if stackoverflow is the right place as Stata is not really a programming language, is it? But I 开发者_如何转开发found some Stata-related questions here; that's why I posted it here. If there's a better place to do so, please point me to the right place.
Keep in mind that Stata only works with rectangular tables of fixed length, so you can only add columns that span the whole 50k rows. Within this setup, this is what you can do.
For the first problem (the list of names and frequencies), you can
collapse (count) freq = name, by(class)
sort class freq name
list class name freq, sepby(class)
Note that collapse
will delete the existing data and replace with the summary. (Usually, I hate this command for this aspect of data management, but it should work here.) If you don't want this to happen, here's a more sophisticated trick:
bysort class name : generate long freq = _N
bysort class name : generate byte first = (_n==1)
sort class freq name
list class name freq if first, sepby(class)
(Explanation: _N
is the number of observations in by
-group, and _n
is the number of the current observation within the by
-group.)
You can then subset this to the class of interest with if class==#
, as you already know.
Also check out the groups
command downloadable using ssc inst groups
.
精彩评论