开发者

Does SAS Proc SQL ever use an index when merging

Consider the following (admittedly long) example.

The sample code creates two data sets, data one with "key" variables i,j,k and data two with key variables j,k and a "value" variable x. I'd like to merge these two data sets as efficiently as possible. Both of the data sets are indexed with respect to j and k: the index for the first data should not be needed but it's there anyway.

Proc SQL does not use the index in the data two, which I suppose would be the case if the data were in a relational database. Is this just a limitation of the query optimizer I have to accept?

EDIT: The answer to this question is yes, SAS can use an index to optimize a PROC SQL join. In the following example, the relative sizes of the data sets matters: If you modify the code so that data two becomes relatively larger than data one, index will be used. Whether the data sets are sorted or not, does not matter.

* Just to control the size of the data;
%let j_max=10000;

* Create data sets;
data one;
    do i=1 to 3;
        do j=1 to &j_max;
            do k=1 to 4;
                if ranuni(0)<0.9 then output;
            end;
        end;
    end;
run;

data two;
    do j=1 to &j_max;
        do k=1 to 4;
            x=ranuni(0);
            if ranuni(0)<0.9 then output;
        end;
    end;
run;

* Create indices;
proc datasets library=work nolist;
    modify one;
    index create idx_j_k=(j 开发者_如何学运维k);
    modify two;
    index create idx_j_k=(j k) / unique;
run;quit;

* Test the use of an index for the other data set:
* Log should display "INFO: Index idx_j_k selected for WHERE clause optimization.";
options msglevel=i;
data _null_;
    set two(where=(j<100));
run;

* Merge the data sets with proc sql - no index is used;
proc sql;
    create table onetwo as
    select
        one.*,
        two.x
    from one, two
    where
        one.j=two.j and
        one.k=two.k;
quit;


You may be comparing apples and oranges. For the join you do with proc sql, the index may not help because the observations are already ordered by j and k and there are faster ways to do "merging" than using indices.

For the subsetting you do with the data _null_ step, on the other hand, an index on j would surely help. If you do the same subsetting with the proc sql, you will see that it is using the index.

proc sql;
  select * from two where j < 100;
quit;
/* on log
INFO: Index idx_j_k selected for WHERE clause optimization.
*/

By the way, you can use the undocumented _method option to examine how proc sql executes your query. On my sas 9.2 on windows, it reports that it is doing what is called a "hash join":

proc sql _method;
  create table onetwo as
  select
    one.*,
    two.x
  from one, two
  where
    one.j=two.j and
    one.k=two.k;
quit;

/* on log
NOTE: SQL execution methods chosen are:

  sqxcrta
      sqxjhsh
          sqxsrc( WORK.ONE )
          sqxsrc( WORK.TWO )
*/

See Paul Kent's Tech note for more information .

0

上一篇:

下一篇:

精彩评论

暂无评论...
验证码 换一张
取 消

最新问答

问答排行榜