How to find max value of variable for each unique observation in "stacked" dataset

2023-01-07 12:42 问答作者：

Sorry for the vauge title.

My 开发者_如何学JAVAdata set looks essentially like this:

what I want is to find the max value of x for each ID. In this dataset, that would be 2 for ID=18 and 3 for ID=361.

Any feedback would be greatly appreciated.

Proc Means with a class statement (so you don't have to sort) and requesting the max statistic is probably the most straightforward approach (untested):

data sample; 
    input id x; 
datalines; 
18  1 
18  1 
18  2 
18  1 
18  2 
369 2 
369 3 
369 3 
361 1 
; 
run; 


proc means data=sample noprint max nway missing; 
   class id;
   var x;
   output out=sample_max (drop=_type_ _freq_) max=;
run;

Check out the online SAS documentation for more details on Proc Means (http://support.sas.com/onlinedoc/913/docMainpage.jsp).

I don't quite understand your example. I can't imagine that the input data set really has all the values in one observation. Do you instead mean something like this?

data sample;
    input myid myvalue;
datalines;
18  1
18  1
18  2
18  1
18  2
369 2
369 3
369 3
361 1
;

proc sort data=sample;
    by myid myvalue;
run;

data result;
    set sample;
    by myid;

    if last.myid then output;
run;

proc print data=result;
run;

This would give you this result:

Obs    myid    myvalue

 1       18       2   
 2      361       1   
 3      369       3

If you want to keep both all records and the max value of X by id, I would use either the PROC MEANS aproach followed by a merge statement, or you can sort the data by Id and DESCENDING X first, and then use the RETAIN statement to create the max_value directly in the datastep:

PROC SORT DATA=A; BY ID DESCENDING X; RUN;

DATA B; SET A;
BY ID;
RETAIN  X_MAX;
IF FIRST.ID  THEN   X_MAX = X;
ELSE                X_MAX = X_MAX;
RUN;

You could try this:

PROC SQL;
CREATE TABLE CHCK AS SELECT MYID, MAX(MYVALUE) FROM SAMPLE
GROUP BY 1;
QUIT;

A couple of more over-engineered options that might be of interest for anyone who needs to do this with a really big dataset, where performance is more of a concern:

If your dataset is already sorted by ID, but not by X within each ID, you can still do this in a single data step without any sorting, using a retained max within each by group. Alternatively, you can use proc means (as per the top answer) but with a by statement rather than a class statement - this reduces the memory usage.

data sample; 
    input id x; 
datalines; 
18  1 
18  1 
18  2 
18  1 
18  2 
369 2 
369 3 
369 3 
361 1 
; 
run; 

data want;
  do until(last.ID);
    set sample;
    by ID;
    xmax = max(x, xmax);
  end;
  x = xmax;
  drop xmax;
run;

Even if your dataset is not sorted by ID, you can still do this in one data step, without sorting it, by using a hash object to keep track of the maximum x value you've found for each ID as you go along. This will be a little faster than proc means and will typically use less memory, as proc means does various calculations in the background which are not needed in the output dataset.

data _null_;
  set sample end = eof;
  if _n_ = 1 then do;
    call missing(xmax);
    declare hash h(ordered:'a');
    rc = h.definekey('ID');
    rc = h.definedata('ID','xmax');
    rc = h.definedone();
  end;
  rc = h.find();
  if rc = 0 then do;
    if x > xmax then do;
        xmax = x;
        rc = h.replace();
    end;
  end;
  else do;
    xmax = x;
    rc = h.add();
  end;
  if eof then rc = h.output(dataset:'want2');
run;

In this example, on my PC, the hash approach used this much memory:

   memory              966.15k
   OS Memory           27292.00k

vs. this much for an equivalent proc summary:

   memory              8706.90k
   OS Memory           35760.00k

Not a bad saving if you really need it to scale up!

Use an appropriate proc with the by statement. For instance,

data sample;
    input myid myvalue;
datalines;
18  1
18  1
18  2
18  1
18  2
369 2
369 3
369 3
361 1
;
run;

proc sort data=sample;
  by myid;
run;

proc means data=sample;
   var myvalue;
   by myid;
run;

I would just sort by x and id putting the highest value for each ID at the top. NODUPKEY removes every duplicate below.

proc sort data=yourstacked_data out=yourstacked_data_sorted;
by DECENDING x id;
run;

proc sort data=yourstacked_data NODUPKEY out=top_value_only;
by id;
run;

A multidata hash should be used if you want the result to show each id at max value. That is, for the cases when more than one id is found having a max value

Example code:

Find the ids associated with the max value of 40 different numeric variables. The code is Proc DS2 data program.

data have;
  call streaminit(123);

  do id = 1 to 1e5;                                  %* 10,000 rows;
    array v v1-v40;                                  %* 40 different variables;
    do over v; v=ceil(rand('uniform', 2e5)); end;
    output;
  end;
run;

proc ds2;
  data _null_;
    declare char(32) _name_ ;        %* global declarations;
    declare double value id;
    declare package hash result();

    vararray double v[*] v:;         %* variable based array, limit yourself to 1,000;
    declare double max[1000];        %* temporary array for holding the vars maximum values;

    method init();
      declare package sqlstmt s('drop table want');  %* DS2 version of `delete`;
      s.execute();

      result.keys([_name_]);                         %* instantiate a multidata hash;
      result.data([_name_ value id]);
      result.multidata();
      result.ordered('ascending');
      result.defineDone();
    end;

    method term();
      result.output('want');                         %* write the results to a table;
    end;

    method run();
      declare int index;
      set have;

      %* process each variable being examined for 'id at max';

      do index = 1 to dim(v);
        if v[index] > max[index] then do;         %* new maximum for this variable ?
          _name_ = vname(v[index]);               %* retrieve variable name;
          value = v[index];                       %* move value into hash host variable;
          if not missing (max[index]) then do;
            result.removeall();                   %* remove existing multidata items associated with the variable;
          end;
          result.add();                           %* add new multidata item to hash;
          max[index] = v[index];                  %* track new maximum;
        end;
        else 
        if v[index] = max[index] then do;         %* two or more ids have same max;
          _name_ = vname(v[index]);
          value = v[index];
          result.add();                           %* add id to the multidata item;
        end;
      end;
    end;
  enddata;
run;
quit;

%let syslast=want;

Reminder: Proc DS2 defaults are to not overwrite existing tables. To 'overwrite' a table you need to either:

Use table option overwrite=yes when syntax allows
- The package hash .output() method does not recognize the table option
Drop the table before recreating it

The above code can be used in a Base SAS DATA step with minor modifications.

继续阅读：sas

How to find max value of variable for each unique observation in "stacked" dataset

更多精彩内容

精彩评论

最新问答

宫颈癌术后可以性生活吗？

决战平安京人面树赏金特典皮肤什么时候上线?？

CF2024宠粉节活动入口在哪?？

原神养石任务怎么做?？

射戮骑士什么时候发售?？

问答排行榜

王昌瑞《潜梦追凶》剧组庆生新锐演员未来可期？

Is it allowed to ask users to enter credit card details for own payment method?

Escaping "<" in Perl-generated XML

imessage会显示已读吗？

微信重新建群怎么建？