How to find max value of variable for each unique observation in "stacked" dataset
Sorry for the vauge title.
My 开发者_如何学JAVAdata set looks essentially like this:
ID X
18 1
18 1
18 2
18 1
18 2
369 2
369 3
369 3
361 1
what I want is to find the max value of x for each ID. In this dataset, that would be 2 for ID=18 and 3 for ID=361.
Any feedback would be greatly appreciated.
Proc Means with a class statement (so you don't have to sort) and requesting the max statistic is probably the most straightforward approach (untested):
data sample; input id x; datalines; 18 1 18 1 18 2 18 1 18 2 369 2 369 3 369 3 361 1 ; run; proc means data=sample noprint max nway missing; class id; var x; output out=sample_max (drop=_type_ _freq_) max=; run;
Check out the online SAS documentation for more details on Proc Means (http://support.sas.com/onlinedoc/913/docMainpage.jsp).
I don't quite understand your example. I can't imagine that the input data set really has all the values in one observation. Do you instead mean something like this?
data sample;
input myid myvalue;
datalines;
18 1
18 1
18 2
18 1
18 2
369 2
369 3
369 3
361 1
;
proc sort data=sample;
by myid myvalue;
run;
data result;
set sample;
by myid;
if last.myid then output;
run;
proc print data=result;
run;
This would give you this result:
Obs myid myvalue
1 18 2
2 361 1
3 369 3
If you want to keep both all records and the max value of X by id, I would use either the PROC MEANS aproach followed by a merge statement, or you can sort the data by Id and DESCENDING X first, and then use the RETAIN statement to create the max_value directly in the datastep:
PROC SORT DATA=A; BY ID DESCENDING X; RUN;
DATA B; SET A;
BY ID;
RETAIN X_MAX;
IF FIRST.ID THEN X_MAX = X;
ELSE X_MAX = X_MAX;
RUN;
You could try this:
PROC SQL;
CREATE TABLE CHCK AS SELECT MYID, MAX(MYVALUE) FROM SAMPLE
GROUP BY 1;
QUIT;
A couple of more over-engineered options that might be of interest for anyone who needs to do this with a really big dataset, where performance is more of a concern:
- If your dataset is already sorted by ID, but not by X within each ID, you can still do this in a single data step without any sorting, using a retained max within each by group. Alternatively, you can use proc means (as per the top answer) but with a
by
statement rather than aclass
statement - this reduces the memory usage.
data sample; input id x; datalines; 18 1 18 1 18 2 18 1 18 2 369 2 369 3 369 3 361 1 ; run; data want; do until(last.ID); set sample; by ID; xmax = max(x, xmax); end; x = xmax; drop xmax; run;
- Even if your dataset is not sorted by ID, you can still do this in one data step, without sorting it, by using a hash object to keep track of the maximum x value you've found for each ID as you go along. This will be a little faster than
proc means
and will typically use less memory, as proc means does various calculations in the background which are not needed in the output dataset.
data _null_; set sample end = eof; if _n_ = 1 then do; call missing(xmax); declare hash h(ordered:'a'); rc = h.definekey('ID'); rc = h.definedata('ID','xmax'); rc = h.definedone(); end; rc = h.find(); if rc = 0 then do; if x > xmax then do; xmax = x; rc = h.replace(); end; end; else do; xmax = x; rc = h.add(); end; if eof then rc = h.output(dataset:'want2'); run;
In this example, on my PC, the hash approach used this much memory:
memory 966.15k
OS Memory 27292.00k
vs. this much for an equivalent proc summary:
memory 8706.90k
OS Memory 35760.00k
Not a bad saving if you really need it to scale up!
Use an appropriate proc
with the by
statement. For instance,
data sample;
input myid myvalue;
datalines;
18 1
18 1
18 2
18 1
18 2
369 2
369 3
369 3
361 1
;
run;
proc sort data=sample;
by myid;
run;
proc means data=sample;
var myvalue;
by myid;
run;
I would just sort by x and id putting the highest value for each ID at the top. NODUPKEY removes every duplicate below.
proc sort data=yourstacked_data out=yourstacked_data_sorted;
by DECENDING x id;
run;
proc sort data=yourstacked_data NODUPKEY out=top_value_only;
by id;
run;
A multidata hash should be used if you want the result to show each id
at max value
. That is, for the cases when more than one id
is found having a max value
Example code:
Find the ids associated with the max value of 40 different numeric variables.
The code is Proc DS2
data program.
data have;
call streaminit(123);
do id = 1 to 1e5; %* 10,000 rows;
array v v1-v40; %* 40 different variables;
do over v; v=ceil(rand('uniform', 2e5)); end;
output;
end;
run;
proc ds2;
data _null_;
declare char(32) _name_ ; %* global declarations;
declare double value id;
declare package hash result();
vararray double v[*] v:; %* variable based array, limit yourself to 1,000;
declare double max[1000]; %* temporary array for holding the vars maximum values;
method init();
declare package sqlstmt s('drop table want'); %* DS2 version of `delete`;
s.execute();
result.keys([_name_]); %* instantiate a multidata hash;
result.data([_name_ value id]);
result.multidata();
result.ordered('ascending');
result.defineDone();
end;
method term();
result.output('want'); %* write the results to a table;
end;
method run();
declare int index;
set have;
%* process each variable being examined for 'id at max';
do index = 1 to dim(v);
if v[index] > max[index] then do; %* new maximum for this variable ?
_name_ = vname(v[index]); %* retrieve variable name;
value = v[index]; %* move value into hash host variable;
if not missing (max[index]) then do;
result.removeall(); %* remove existing multidata items associated with the variable;
end;
result.add(); %* add new multidata item to hash;
max[index] = v[index]; %* track new maximum;
end;
else
if v[index] = max[index] then do; %* two or more ids have same max;
_name_ = vname(v[index]);
value = v[index];
result.add(); %* add id to the multidata item;
end;
end;
end;
enddata;
run;
quit;
%let syslast=want;
Reminder: Proc DS2
defaults are to not overwrite existing tables. To 'overwrite' a table you need to either:
- Use table option
overwrite=yes
when syntax allows- The
package hash
.output()
method does not recognize the table option
- The
- Drop the table before recreating it
The above code can be used in a Base SAS DATA
step with minor modifications.
精彩评论