
Categorize data without consolidating?

I have a table with about 1000 records and 2000 columns. I want to assign each row a category ID such that all records whose values are equal in every column except 'ID' share the same category ID. My final answer would look like:

ID A  B  C ..... Category ID
1  1  0  3           1
2  2  1  3           2 
3  1  0  3           1
4  2  1  3           2
5  4  5  6           3
6  4  5  6           3

where all columns (besides ID) are equal for IDs 1,3 so they get the same category ID and so on.

I guess my thought was to just write a SQL query that does a group by on every single column besides 'ID' and assign a number to each group and then join back to my original table. My current input is a text file, and I have SAS, MS Access, and Excel to work with. (I could use proc sql from within SAS).

Before I go this route and construct the whole query, I was just wondering if there was a better way to do this? It will take some work just to write the query, and I'm not even sure if it is practical to join on 2000 columns (never tried), so I thought I'd ask for ideas before I got too far down the wrong path.

EDIT: I just realized my title doesn't really make sense. What I was originally thinking was "Is there a way I can group by and categorize at the same time without actually consolidating into groups?"

EDIT2: After importing the table into Excel, I was easily able to determine that only about 200 of the 2000 columns actually varied, so the too-many-columns problem went away. To categorize, I imported only the columns that varied, and did something like the following:

proc sql;
   create table categories as 
   select distinct *
   from inputTable;
quit;

data categories;
   set categories;
   categoryID = _N_;
run;

proc sql;
  create table tableCategorized as
  select a.ID, b.CategoryID
  from inputTable as a, categories as b
  where 
     (
     a.A=b.A and
     a.B=b.B and
     a.C=b.C and
     ...
     a.XYZ=b.XYZ);
quit;

It was a pain to generate all the "=" comparisons, but I just did it using string manipulation techniques in Excel, so it wasn't too bad at all. Thanks for all of the suggestions.
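(As an aside: in a tool like Python/pandas, the whole categorization collapses to a one-liner — group on every column except ID and number the groups. A minimal sketch, using the toy data from the question:)

```python
import pandas as pd

# Toy version of the table from the question.
df = pd.DataFrame({
    "ID": [1, 2, 3, 4, 5, 6],
    "A":  [1, 2, 1, 2, 4, 4],
    "B":  [0, 1, 0, 1, 5, 5],
    "C":  [3, 3, 3, 3, 6, 6],
})

# Group on every column except ID; ngroup() numbers the groups
# 0, 1, 2, ... in order of first appearance (because sort=False).
cols = [c for c in df.columns if c != "ID"]
df["CategoryID"] = df.groupby(cols, sort=False).ngroup() + 1

print(df)
```

This gives category IDs 1, 2, 1, 2, 3, 3 for IDs 1-6, matching the expected output above.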


Well, I can think of an easy way, but I don't know whether you are going to hit SAS's memory/performance limits... I have never tried proc sort with 2000 variables, but maybe someone else has and can comment.

proc sort data= mydata;
    by A B C D /* etc.... */ myLastColumn;
run;

data mydata;
    set mydata;
    by A B C D /* etc....*/ myLastColumn;
    retain categoryID 0;
    if first.myLastColumn then categoryID +1;
run;


Try this (note: it requires a database that supports window functions, which SAS PROC SQL does not; list every column except ID in the over clause):

select *, dense_rank() over (order by a, b, c) as category_id
from mytable
-- order by id
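To see the idea work end to end, here is a small demonstration using Python's built-in sqlite3 (SQLite supports window functions from version 3.25; table and column names are made up for the example):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (id INTEGER, a INTEGER, b INTEGER, c INTEGER)")
conn.executemany(
    "INSERT INTO t VALUES (?, ?, ?, ?)",
    [(1, 1, 0, 3), (2, 2, 1, 3), (3, 1, 0, 3),
     (4, 2, 1, 3), (5, 4, 5, 6), (6, 4, 5, 6)],
)

# DENSE_RANK numbers each distinct (a, b, c) combination 1, 2, 3, ...
# with no gaps, so identical rows share a category id.
rows = conn.execute(
    "SELECT id, DENSE_RANK() OVER (ORDER BY a, b, c) AS category_id "
    "FROM t ORDER BY id"
).fetchall()
print(rows)
```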


Since there is no easy way to do this (that I know of), I wrote a small VBScript to handle preprocessing of the file. Here are my test file and the script I wrote (place them on your desktop and run proc.vbs to generate output.txt):

[input.txt]
id  a   b   c
1   1   0   3
2   2   1   3
3   1   0   3
4   2   1   3
5   4   5   6
6   4   5   6

[proc.vbs]
' init
set fso = createobject("scripting.filesystemobject")
set input = fso.opentextfile("input.txt")
set output = fso.createtextfile("output.txt")
set dict = createobject("scripting.dictionary")

' read columns
columns = split(input.readline,vbtab)

' write columns, and add categoryid
output.writeline join(columns,vbtab) & vbtab & "categoryid"

' read rows
do while not input.atendofstream
    fields1 = split(input.readline,vbtab)
    fields2 = fields1

    ' in fields1. clear columns that we don't need for category
    for x = 0 to ubound(fields1)
        if lcase(columns(x)) = "id" then
            fields1(x) = ""
        end if
    next

    ' from fields1. create unique category string & add to dict if not exists
    unique = join(fields1,"|")
    if (not dict.exists(unique)) then
        dict.add unique, dict.count + 1
    end if

    ' write fields
    output.writeline join(fields2,vbtab) & vbtab & cstr(dict(unique))
loop

output.close
input.close

msgbox "finished!"
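The same dictionary-keyed idea ports to other languages almost line for line. A rough Python equivalent, shown here in-memory rather than against input.txt/output.txt, would be:

```python
# Rows as lists of strings, as if each line of input.txt were split on tabs;
# the first row is the header.
rows = [
    ["id", "a", "b", "c"],
    ["1", "1", "0", "3"],
    ["2", "2", "1", "3"],
    ["3", "1", "0", "3"],
    ["4", "2", "1", "3"],
    ["5", "4", "5", "6"],
    ["6", "4", "5", "6"],
]

header, data = rows[0], rows[1:]
# Positions of any column named "id" (case-insensitive), to be excluded.
id_cols = {i for i, name in enumerate(header) if name.lower() == "id"}

categories = {}  # non-id value tuple -> category id
out = [header + ["categoryid"]]
for row in data:
    key = tuple(v for i, v in enumerate(row) if i not in id_cols)
    # First time we see this combination, assign it the next category id.
    cat = categories.setdefault(key, len(categories) + 1)
    out.append(row + [str(cat)])

for line in out:
    print("\t".join(line))
```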


You can do it with SAS. Since you're working with only 1000 rows, you can do something a little messy but easy to understand that works.

The idea is to have an array with all the variables (two arrays if there are both numeric and character variables), then cycle through the arrays and put the value of each variable into a new container variable with an enormous length that aggregates all the values. In your example, CONT="103" for the first ID, CONT="213" for the second ID, and so on.

Since I'm not sure whether you're working with numbers only, I'd say making the new variable character (thus converting any numerics as you append) is your best bet.

Once you have done this, you'll have a new variable which is the aggregate of everything, and you can easily use it to check whether observations should be in the same category.

Code will follow shortly for your example =)

Here it is; I added some alphanumeric variables for flavor:

data test;
  ID=1; A=1; B=0; C=3; D='HI'; E=54; F='C'; output;
  ID=2; A=2; B=7; C=3; D='KI'; E=3;  F='C'; output;
  ID=3; A=1; B=0; C=3; D='HI'; E=54; F='C'; output;
  ID=4; A=2; B=7; C=3; D='KI'; E=3;  F='C'; output;
  ID=5; A=1; B=7; C=3; D='ZI'; E=3;  F='C'; output;
  ID=6; A=1; B=8; C=3; D='ZI'; E=3;  F='D'; output;
  ID=7; A=1; B=8; C=3; D='SI'; E=3;  F='C'; output;
  ID=8; A=1; B=8; C=3; D='SI'; E=3;  F='C'; output;
run;

data test1;
  set test;
  array numeric(*) _NUMERIC_;
  array chara(*) _CHARACTER_;
  length cont $2000;
  cont='';
  /* append every numeric value except ID */
  do i=1 to dim(numeric);
    if vname(numeric{i})="ID" then continue;
    cont=compress(cont)||compress(putn(numeric{i},"BEST."));
  end;
  /* append every character value */
  do i=1 to dim(chara);
    cont=compress(cont)||compress(chara{i});
  end;
  drop i;
run;

proc sort data=test1;
  by cont;
run;

data test2;
  set test1;
  by cont;
  retain catid 0;
  if first.cont then catid+1;
  drop cont;
run;

proc sort data=test2 out=test3;
  by id;
run;


I don't have SAS in front of me, so I'll have to give you some untested code, but you can do this quite easily like so:

1) Concatenate all variables except ID into a single delimited text string. Ensure that the delimiter does not appear in the data; perhaps use a character like byte(10). Use catx() to do the concatenation so each var is trimmed (leading and trailing) to save on space. As you have a tonne of vars you may need to split this into 2, 3, or 4 strings... no big deal - it's just easier to work with 4 strings than with 2000 vars.

2) Let's assume you managed to fit it all into one big string. Add the value to an array with 1000 elements, since there could potentially be 1000 unique strings for your table (i.e., if every obs had a unique one). Each array element stores the string, and the position of the element in the array serves as the unique id. Before adding each string to the array, loop through the filled part of the array to check whether the string is already there; if it is, use that position as the unique_id for the current row.

Sooo that was kind of wordy but I'll have a crack at doing that in some untested code...

proc sql noprint;
  select name into :vars separated by ','
  from sashelp.vcolumn
  where upcase(name) ne 'ID' and memname eq 'XX' and libname eq 'WORK';
quit;


data yy;

  length str1-str1000 unique_string $32767;

  set xx;

  array arr_unique {1000} str1-str1000;
  retain str1-str1000;    /* keep the lookup array across observations */
  retain unique_counter 1;

  unique_string = catx(byte(10), &vars);

  /* scan the filled part of the array for this string */
  exists_already = 0;
  do i = 1 to (unique_counter - 1);
    if arr_unique[i] eq unique_string then do;
      unique_id = i;
      exists_already = 1;
      leave;
    end;
  end;

  /* new combination: store it and give it the next id */
  if not exists_already then do;
    unique_id = unique_counter;
    arr_unique[unique_counter] = unique_string;
    unique_counter = unique_counter + 1;
  end;

  drop str1-str1000 unique_string exists_already i;

run;

Good luck!

Cheers Rob


This is easy with proc sort. Just list all the variables you want to use in grouping after by, and then increment the group id when the last variable changes value in the next data step. Below worked with 9.2 (TS1M0) on Windows. HTH.

EDIT: did not know Louisa Grey already suggested this. I guess my answer just shows that this indeed works... :-)

/* test data */
%let seed = 1234567;
data one;
  length id 8 v2 $1;
  array v[3:2000] v3-v2000;
  keep id v:;
  do id = 1 to 1e3;
    v2 = substr("ABCDEF", ceil(6*ranuni(&seed)), 1);
    do j = 3 to 9, 11 to 1999;
      v[j] = j;
    end;
    v[10] = ceil(2*ranuni(&seed));
    v[2000] = ceil(3*ranuni(&seed));
    output;
  end;
run;
/* on log
NOTE: The data set WORK.ONE has 1000 observations and 2000 variables.
*/

/* group observations based on all the variables
   values except id */
proc sort data=one;
  by v:;
run;
data two;
  set one;
  by v:;
  if first.v2000 then group + 1;
run;

/* check */
proc freq data=two;
  tables group;
run; 
proc print data=two;
  var id group v2 v10 v2000;
run;
