How can I speed up queries against huge data warehouse tables with effective-dated data?
So I am querying some extremely large tables. They are so large because PeopleSoft inserts new records every time a change is made to some data, rather than updating existing records. In effect, its transactional tables are also a data warehouse.
This necessitates queries with nested selects in them to get the most recent/current row. Rows are effective-dated and, within each date (truncated to a day), they can also carry an effective sequence. Thus, in order to get the current record for user_id=123, I have to do this:
select * from sometable st
 where st.user_id = 123
   and st.effective_date = (select max(sti.effective_date)
                              from sometable sti
                             where sti.user_id = st.user_id)
   and st.effective_sequence = (select max(sti.effective_sequence)
                                  from sometable sti
                                 where sti.user_id = st.user_id
                                   and sti.effective_date = st.effective_date)
There are a phenomenal number of indexes on these tables, and I can't find anything else that would speed up my queries.
My trouble is that I often want to get data about an individual from these tables for maybe 50 user_ids, but when I join my tables (which have only a few records in them) with a few of these PeopleSoft tables, things just go to crap.
The PeopleSoft tables are on a remote database that I access through a database link. My queries tend to look like this:
select st.* from local_table lt, sometable@remotedb st
 where lt.user_id in ('123', '456', '789')
   and lt.user_id = st.user_id
   and st.effective_date = (select max(sti.effective_date)
                              from sometable@remotedb sti
                             where sti.user_id = st.user_id)
   and st.effective_sequence = (select max(sti.effective_sequence)
                                  from sometable@remotedb sti
                                 where sti.user_id = st.user_id
                                   and sti.effective_date = st.effective_date)
Things get even worse when I have to join several PeopleSoft tables with my local table. Performance is just unacceptable.
What are some things I can do to improve performance? I've tried query hints to ensure that my local table is joined to its PeopleSoft partner first, so Oracle doesn't attempt to join all the remote tables together before narrowing things down to the correct user_ids. I've tried the LEADING hint and toyed around with hints meant to push the processing to the remote database, but the explain plan was opaque: it just said REMOTE for several of the operations and I had no idea what was going on.
Assuming I don't have the power to change PeopleSoft or the location of my tables, are hints my best choice? If I were joining a local table with four remote tables, and the local table joined with two of them, how would I format the hint so that my local table is joined first with each of the remote ones? (The local table is very small; in fact, I can use an inline view so that it contains only the user_ids I'm interested in.)
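For concreteness, this is the shape of hinting I've been experimenting with (LEADING to fix the join order, plus DRIVING_SITE, which is the documented hint for asking Oracle to execute the join at the remote site; I'm not certain I'm placing them correctly):
select /*+ leading(lt) driving_site(st) */ st.*
  from local_table lt, sometable@remotedb st
 where lt.user_id in ('123', '456', '789')
   and lt.user_id = st.user_id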
EDIT: The application needs real-time data so unfortunately a materialized view or other method of caching data will not suffice.
Does refactoring your query into something like this help at all?
SELECT *
  FROM (SELECT st.*,
               MAX(st.effective_date) OVER (PARTITION BY st.user_id) max_dt,
               MAX(st.effective_sequence) OVER (PARTITION BY st.user_id, st.effective_date) max_seq
          FROM local_table lt
          JOIN sometable@remotedb st ON (lt.user_id = st.user_id)
         WHERE lt.user_id IN ('123', '456', '789'))
 WHERE effective_date = max_dt
   AND effective_sequence = max_seq;
I agree with @Mark Baker that performance joining over DB Links really can suck and you're likely to be limited in what you can accomplish with this approach.
One approach would be to stick PL/SQL functions around everything. As an example:
create table remote (user_id number, eff_date date, eff_seq number, value varchar2(10));
create type typ_remote as object (user_id number, eff_date date, eff_seq number, value varchar2(10));
.
/
create type typ_tab_remote as table of typ_remote;
.
/
insert into remote values (1, date '2010-01-02', 1, 'a');
insert into remote values (1, date '2010-01-02', 2, 'b');
insert into remote values (1, date '2010-01-02', 3, 'c');
insert into remote values (1, date '2010-01-03', 1, 'd');
insert into remote values (1, date '2010-01-03', 2, 'e');
insert into remote values (1, date '2010-01-03', 3, 'f');
insert into remote values (2, date '2010-01-02', 1, 'a');
insert into remote values (2, date '2010-01-02', 2, 'b');
insert into remote values (2, date '2010-01-03', 1, 'd');
create function show_remote (i_user_id_1 in number, i_user_id_2 in number) return typ_tab_remote pipelined is
  cursor c_1 is
    select user_id, eff_date, eff_seq, value
      from (select user_id, eff_date, eff_seq, value,
                   rank() over (partition by user_id
                                order by eff_date desc, eff_seq desc) rnk
              from remote
             where user_id in (i_user_id_1, i_user_id_2))
     where rnk = 1;
begin
  for c_rec in c_1 loop
    pipe row (typ_remote(c_rec.user_id, c_rec.eff_date, c_rec.eff_seq, c_rec.value));
  end loop;
  return;
end;
/
select * from table(show_remote(1,null));
select * from table(show_remote(1,2));
Rather than passing user_ids individually as parameters, you could load them into a local table (e.g. a global temporary table). The PL/SQL would then loop through that table, doing the remote select for each row in the local table. No single query would have both local and remote tables; effectively you would be writing your own join code.
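A rough sketch of that variation, assuming a hypothetical global temporary table tmp_user_ids that the caller fills first, and reusing the local stand-in table remote and the types from above:
create global temporary table tmp_user_ids (user_id number)
on commit preserve rows;

create or replace function show_remote_tab return typ_tab_remote pipelined is
begin
  -- one select per user_id in the local temp table;
  -- no single statement mixes local and remote tables
  for u in (select user_id from tmp_user_ids) loop
    for c_rec in (select user_id, eff_date, eff_seq, value
                    from (select user_id, eff_date, eff_seq, value,
                                 rank() over (order by eff_date desc, eff_seq desc) rnk
                            from remote
                           where user_id = u.user_id)
                   where rnk = 1) loop
      pipe row (typ_remote(c_rec.user_id, c_rec.eff_date, c_rec.eff_seq, c_rec.value));
    end loop;
  end loop;
  return;
end;
/
insert into tmp_user_ids values (1);
insert into tmp_user_ids values (2);
select * from table(show_remote_tab);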
One option is to first materialize the remote part of the query using a common table expression, so you can be sure only relevant data is fetched from the remote db. Another improvement would be to merge the two subqueries against the remote db into one analytic-function-based subquery. Such a query can be used in your current query too. I can make other suggestions only after playing with the db. See below:
with remote_query as
(
  select /*+ materialize */ st.*
    from sometable@remotedb st
   where st.user_id in ('123', '456', '789')
     and st.rowid in (select first_value(rowid) over (order by effective_date desc,
                                                      effective_sequence desc)
                        from sometable@remotedb st1
                       where st.user_id = st1.user_id)
)
select lt.*, rt.*
  from local_table lt, remote_query rt
 where lt.user_id = rt.user_id
You haven't mentioned the requirements for the freshness of the data, but one option would be to create materialized views (you'll be restricted to REFRESH COMPLETE since you can't create snapshot logs in the source system) that have data only for the current versioned row of the transaction tables. These materialized view tables will reside in your local system and additional indexing can be added to them to improve query performance.
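A sketch of such a materialized view, assuming hypothetical names and a stand-in column list (some_value here represents whatever payload columns the real table has):
create materialized view mv_sometable_current
refresh complete on demand
as
select user_id, effective_date, effective_sequence, some_value
  from (select st.*,
               rank() over (partition by user_id
                            order by effective_date desc, effective_sequence desc) rnk
          from sometable@remotedb st)
 where rnk = 1;

-- local index tailored to the query pattern
create index mv_sometable_current_ix on mv_sometable_current (user_id);

-- complete refresh on whatever schedule the freshness requirement allows
exec dbms_mview.refresh('MV_SOMETABLE_CURRENT', 'C')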
The performance issue is going to be the access across the link. With part of the query against local tables, it's all being executed locally, so there's no access to the remote indexes and it's pulling all the remote data back to test locally.
You could use materialized views in a local database, refreshed from the PeopleSoft database on a periodic (nightly) basis for the historic data, and access the remote PeopleSoft database only for today's changes (adding an effective_date = today condition to your where clause), merging the two queries.
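Roughly, the merged query might look like this (mv_sometable_hist is hypothetical, holding rows as of last night's refresh):
with combined as
(
  select * from mv_sometable_hist
   where user_id in ('123', '456', '789')
  union all
  select st.* from sometable@remotedb st
   where st.user_id in ('123', '456', '789')
     and st.effective_date >= trunc(sysdate)  -- only today's changes cross the link
)
select *
  from (select c.*,
               rank() over (partition by user_id
                            order by effective_date desc, effective_sequence desc) rnk
          from combined c)
 where rnk = 1;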
Another option might be to use an INSERT INTO X SELECT FROM just for the remote data, to pull it into a temporary local table or materialized view, then run a second query to join that with your local data... similar to josephj1989's suggestion.
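For instance (tmp_sometable is hypothetical):
create global temporary table tmp_sometable
on commit preserve rows
as select * from sometable@remotedb where 1 = 0;  -- copy structure only

insert into tmp_sometable
select st.* from sometable@remotedb st
 where st.user_id in ('123', '456', '789');

-- the join and the current-row filtering now run entirely locally
select lt.*, t.*
  from local_table lt, tmp_sometable t
 where lt.user_id = t.user_id;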
Alternatively (though there may be licensing issues) try RAC Clustering your local db with the remote peoplesoft db.
Instead of using the subqueries, you can try this. I don't know if Oracle will perform better with this or not, since I don't use Oracle much.
SELECT
ST1.col1,
ST1.col2,
...
FROM
Some_Table ST1
LEFT OUTER JOIN Some_Table ST2 ON
ST2.user_id = ST1.user_id AND
(
ST2.effective_date > ST1.effective_date OR
(
ST2.effective_date = ST1.effective_date AND
ST2.effective_sequence > ST1.effective_sequence
)
)
WHERE
ST2.user_id IS NULL
Another possible solution would be:
SELECT
ST1.col1,
ST1.col2,
...
FROM
Some_Table ST1
WHERE
NOT EXISTS
(
SELECT 1
FROM
Some_Table ST2
WHERE
ST2.user_id = ST1.user_id AND
(
ST2.effective_date > ST1.effective_date OR
(
ST2.effective_date = ST1.effective_date AND
ST2.effective_sequence > ST1.effective_sequence
)
)
)
Would it be an option to create a database that you use for non-warehousing type stuff, one you could update on a nightly basis? If so, you could create a nightly process that moves over only the most recent records. That would get rid of the MAX() work you are doing for everyday queries and significantly reduce the number of records.
Also, this depends on whether you can tolerate a one-day lag between the most recent data and what is available.
I'm not super familiar with Oracle, so there may also be a way to get improvements by making changes to your query...
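As a sketch of the nightly move described above, assuming a hypothetical local table current_sometable holding only the latest row per user (some_value stands in for the real payload columns):
merge into current_sometable c
using (select user_id, effective_date, effective_sequence, some_value
         from (select st.*,
                      rank() over (partition by user_id
                                   order by effective_date desc, effective_sequence desc) rnk
                 from sometable@remotedb st)
        where rnk = 1) s
   on (c.user_id = s.user_id)
 when matched then update
      set c.effective_date = s.effective_date,
          c.effective_sequence = s.effective_sequence,
          c.some_value = s.some_value
 when not matched then insert (user_id, effective_date, effective_sequence, some_value)
      values (s.user_id, s.effective_date, s.effective_sequence, s.some_value);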
Can you ETL the rows with the desired user_id's into your own table, creating only the needed indexes to support your queries and perform your queries on it?
Is the PeopleSoft table a delivered one, or is it custom? Are you sure it's a physical table, and not a poorly-written view on the PS side? If it's a delivered record you're going against (example looks much like PS_JOB or a view that references it), maybe you could indicate this. PS_JOB is a beast with tons of indexes delivered, and most sites add even more.
If you know the indexes on the table, you can use Oracle hints to specify a preferred index to use; that sometimes helps.
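For example (the index name here is hypothetical):
select /*+ index(st ps_sometable_userid_ix) */ st.*
  from sometable st
 where st.user_id = 123;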
Have you done an explain plan to see if you can determine where the problem is? Maybe there's a cartesian join, full table scan, etc.?
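A quick way to check, using dbms_xplan:
explain plan for
select st.*
  from local_table lt, sometable@remotedb st
 where lt.user_id = st.user_id;

select * from table(dbms_xplan.display);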
It looks to me like you are dealing with a type 2 dimension in the data warehouse. There are several ways to implement a type 2 dimension, mostly involving columns like ValidFrom, ValidTo, Version, and Status. Not all of them are always present; it would be interesting if you could post the schema for your table. Here is an example of how it might look (John Smith moved from Indiana to Ohio on 2010-06-24):
UserKey  UserBusinessKey  State    ValidFrom   ValidTo     Version  Status
7234     John_Smith_17    Indiana  2005-03-20  2010-06-23  1        expired
9116     John_Smith_17    Ohio     2010-06-24  3000-01-01  2        current
To obtain the latest version of a row, it is common to use
WHERE Status = 'current'
or
WHERE ValidTo = '3000-01-01'
Note that this one relies on a constant date far in the future.
or
WHERE ValidTo > CURRENT_DATE
It seems that your example uses a ValidFrom equivalent (effective_date), so you are forced to find max() in order to locate the latest row. Take a look at the schema -- are there Status or ValidTo equivalents in your tables?