How to remove duplicated records\observations WITHOUT sorting in SAS？

2023-02-27 04:14 问答作者：

I wonder if there is a way to unduplicate records WI开发者_StackOverflow中文版THOUT sorting?Sometimes, I want to keep original order and just want to remove duplicated records.

Is it possible?

BTW, below are what I know regarding unduplicating records, which does sorting in the end..

proc sql;
   create table yourdata_nodupe as
   select distinct *
   From abc;
quit;

proc sort data=YOURDATA nodupkey;    
    by var1 var2 var3 var4 var5;    
run;

You could use a hash object to keep track of which values have been seen as you pass through the data set. Only output when you encounter a key that hasn't been observed yet. This outputs in the order the data was observed in the input data set.

Here is an example using the input data set "sashelp.cars". The original data was in alphabetical order by Make so you can see that the output data set "nodupes" maintains that same order.

data nodupes (drop=rc);;
  length Make $13.;

  declare hash found_keys();
    found_keys.definekey('Make');
    found_keys.definedone();

  do while (not done);
    set sashelp.cars end=done;
    rc=found_keys.check();
    if rc^=0 then do;      
      rc=found_keys.add(); 
      output;              
    end;
  end;
  stop;
run;

proc print data=nodupes;run;

/* Give each record in the original dataset and row number */
data with_id ;
  set mydata ;
  _id = _n_ ;
run ;

/* Remove dupes */
proc sort data=with_id nodupkey ;
  by var1 var2 var3 ;
run ;

/* Sort back into original order */
proc sort data=with_id ;
  by _id ;
run ;

I think the short answer is no, there isn't, at least not a way that wouldn't have a much bigger performance hit than a method based on sorting.

There may be specific cases where this is possible (a dataset where all variables are indexed? A relatively small dataset that you could reasonably load into memory and work with there?) but this wouldn't help you with a general method.

Something along the lines of Chris J's solution is probably the best way to get the outcome you're after, but that's not an answer to your actual question.

Depending on the number of variables in your data set, the following might be practical:

data abc_nodup;
   set abc;
   retain _var1 _var2 _var3 _var4;
   if _n_ eq 1 then output;
   else do;
      if (var1 eq _var1) and (var2 eq _var2) and
         (var3 eq _var3) and (var4 eq _var4)
         then delete;
      else output;
   end;
   _var1 = var1;
   _var2 = var2;
   _var3 = var3;
   _var4 = var4;
   drop _var:;
run;

Please refer to Usage Note 37581: How can I eliminate duplicate observations from a large data set without sorting, http://support.sas.com/kb/37/581.html . Usage Note 37581 shows how PROC SUMMARY can be used to more efficiently remove duplicates without the use of sorting.

The two examples given in the original post are not identical.

distinct in proc sql only removes lines which are fully identical
nodupkey in proc sort removes any line where key variables are identical (even if other variables are not identical). You need the option noduprecs to remove fully identical lines.

If you are only looking for records having common key variables, another solution I could think of would be to create a dataset with only the key variable(s) and find out which one are duplicates and then apply a format on the original data to flag duplicate records. If more than one key variable is present in the dataset, one would need to create a new variable containing the concatenation of all the key variable values - converted to character if needed.

This is the fastest way I can think of. It requires no sorting.

data output_data_name;
    set input_data_name (
        sortedby = person_id stay
        keep =
            person_id
            stay
            ... more variables ...);
    by person_id stay;
    if first.stay > 0 then output;
run;

data output;
set yourdata;
by var notsorted;
if first.var then output;
run;

This will not sort the data but will remove duplicates within each group.

继续阅读：duplicates sas sorting

How to remove duplicated records\observations WITHOUT sorting in SAS？

更多精彩内容

精彩评论

最新问答

央视是哪个频道？

请问买过的朋友，舒提啦旅行箱实际使用体验如何？？

检查不孕不育需要的费用？

海信ULED电视画质有什么不同的地方?？

钉子可以挂的住画框幕布吗？

问答排行榜

河神2九牛入海钓河妖是第几集河妖什么来历可活吞牛？

性激素六项检查的最佳时间是多久？多少钱？？

Easiest way to get words of one line from istream into a vector?

《梦在燃烧 (《三国演义》动画片主题曲)》MP3歌词-汤子星？

抽烟只抽炫赫门？