开发者

How to find max value of variable for each unique observation in "stacked" dataset

Sorry for the vauge title.

My 开发者_如何学JAVAdata set looks essentially like this:

ID   X
18   1
18   1
18   2
18   1
18   2
369  2
369  3
369  3
361  1

what I want is to find the max value of x for each ID. In this dataset, that would be 2 for ID=18 and 3 for ID=361.

Any feedback would be greatly appreciated.


Proc Means with a class statement (so you don't have to sort) and requesting the max statistic is probably the most straightforward approach (untested):

data sample; 
    input id x; 
datalines; 
18  1 
18  1 
18  2 
18  1 
18  2 
369 2 
369 3 
369 3 
361 1 
; 
run; 


proc means data=sample noprint max nway missing; 
   class id;
   var x;
   output out=sample_max (drop=_type_ _freq_) max=;
run; 

Check out the online SAS documentation for more details on Proc Means (http://support.sas.com/onlinedoc/913/docMainpage.jsp).


I don't quite understand your example. I can't imagine that the input data set really has all the values in one observation. Do you instead mean something like this?

data sample;
    input myid myvalue;
datalines;
18  1
18  1
18  2
18  1
18  2
369 2
369 3
369 3
361 1
;

proc sort data=sample;
    by myid myvalue;
run;

data result;
    set sample;
    by myid;

    if last.myid then output;
run;

proc print data=result;
run;

This would give you this result:

Obs    myid    myvalue

 1       18       2   
 2      361       1   
 3      369       3   


If you want to keep both all records and the max value of X by id, I would use either the PROC MEANS aproach followed by a merge statement, or you can sort the data by Id and DESCENDING X first, and then use the RETAIN statement to create the max_value directly in the datastep:

PROC SORT DATA=A; BY ID DESCENDING X; RUN;

DATA B; SET A;
BY ID;
RETAIN  X_MAX;
IF FIRST.ID  THEN   X_MAX = X;
ELSE                X_MAX = X_MAX;
RUN;


You could try this:

PROC SQL;
CREATE TABLE CHCK AS SELECT MYID, MAX(MYVALUE) FROM SAMPLE
GROUP BY 1;
QUIT;


A couple of more over-engineered options that might be of interest for anyone who needs to do this with a really big dataset, where performance is more of a concern:

  1. If your dataset is already sorted by ID, but not by X within each ID, you can still do this in a single data step without any sorting, using a retained max within each by group. Alternatively, you can use proc means (as per the top answer) but with a by statement rather than a class statement - this reduces the memory usage.
data sample; 
    input id x; 
datalines; 
18  1 
18  1 
18  2 
18  1 
18  2 
369 2 
369 3 
369 3 
361 1 
; 
run; 

data want;
  do until(last.ID);
    set sample;
    by ID;
    xmax = max(x, xmax);
  end;
  x = xmax;
  drop xmax;
run;
  1. Even if your dataset is not sorted by ID, you can still do this in one data step, without sorting it, by using a hash object to keep track of the maximum x value you've found for each ID as you go along. This will be a little faster than proc means and will typically use less memory, as proc means does various calculations in the background which are not needed in the output dataset.
data _null_;
  set sample end = eof;
  if _n_ = 1 then do;
    call missing(xmax);
    declare hash h(ordered:'a');
    rc = h.definekey('ID');
    rc = h.definedata('ID','xmax');
    rc = h.definedone();
  end;
  rc = h.find();
  if rc = 0 then do;
    if x > xmax then do;
        xmax = x;
        rc = h.replace();
    end;
  end;
  else do;
    xmax = x;
    rc = h.add();
  end;
  if eof then rc = h.output(dataset:'want2');
run;

In this example, on my PC, the hash approach used this much memory:

   memory              966.15k
   OS Memory           27292.00k

vs. this much for an equivalent proc summary:

   memory              8706.90k
   OS Memory           35760.00k

Not a bad saving if you really need it to scale up!


Use an appropriate proc with the by statement. For instance,

data sample;
    input myid myvalue;
datalines;
18  1
18  1
18  2
18  1
18  2
369 2
369 3
369 3
361 1
;
run;

proc sort data=sample;
  by myid;
run;

proc means data=sample;
   var myvalue;
   by myid;
run;


I would just sort by x and id putting the highest value for each ID at the top. NODUPKEY removes every duplicate below.

proc sort data=yourstacked_data out=yourstacked_data_sorted;
by DECENDING x id;
run;

proc sort data=yourstacked_data NODUPKEY out=top_value_only;
by id;
run;


A multidata hash should be used if you want the result to show each id at max value. That is, for the cases when more than one id is found having a max value

Example code:

Find the ids associated with the max value of 40 different numeric variables. The code is Proc DS2 data program.

data have;
  call streaminit(123);

  do id = 1 to 1e5;                                  %* 10,000 rows;
    array v v1-v40;                                  %* 40 different variables;
    do over v; v=ceil(rand('uniform', 2e5)); end;
    output;
  end;
run;

proc ds2;
  data _null_;
    declare char(32) _name_ ;        %* global declarations;
    declare double value id;
    declare package hash result();

    vararray double v[*] v:;         %* variable based array, limit yourself to 1,000;
    declare double max[1000];        %* temporary array for holding the vars maximum values;

    method init();
      declare package sqlstmt s('drop table want');  %* DS2 version of `delete`;
      s.execute();

      result.keys([_name_]);                         %* instantiate a multidata hash;
      result.data([_name_ value id]);
      result.multidata();
      result.ordered('ascending');
      result.defineDone();
    end;

    method term();
      result.output('want');                         %* write the results to a table;
    end;

    method run();
      declare int index;
      set have;

      %* process each variable being examined for 'id at max';

      do index = 1 to dim(v);
        if v[index] > max[index] then do;         %* new maximum for this variable ?
          _name_ = vname(v[index]);               %* retrieve variable name;
          value = v[index];                       %* move value into hash host variable;
          if not missing (max[index]) then do;
            result.removeall();                   %* remove existing multidata items associated with the variable;
          end;
          result.add();                           %* add new multidata item to hash;
          max[index] = v[index];                  %* track new maximum;
        end;
        else 
        if v[index] = max[index] then do;         %* two or more ids have same max;
          _name_ = vname(v[index]);
          value = v[index];
          result.add();                           %* add id to the multidata item;
        end;
      end;
    end;
  enddata;
run;
quit;

%let syslast=want;

Reminder: Proc DS2 defaults are to not overwrite existing tables. To 'overwrite' a table you need to either:

  • Use table option overwrite=yes when syntax allows
    • The package hash .output() method does not recognize the table option
  • Drop the table before recreating it

The above code can be used in a Base SAS DATA step with minor modifications.

0

上一篇:

下一篇:

精彩评论

暂无评论...
验证码 换一张
取 消

最新问答

问答排行榜