How to split up a massive data query into multiple queries
I have to select all rows from a table with millions of rows (to preload a Coherence data grid). How do I split up this query into multiple queries that can be concurrently executed by multiple threads?
I first thought of getting a count of all records and doing:
SELECT ...
WHERE ROWNUM BETWEEN (packetNo * packetSize) AND ((packetNo + 1) * packetSize)
but that didn't work. Now I'm stuck.
Any help will be greatly appreciated.
If you have the Enterprise Edition license, the easiest way of achieving this objective is parallel query.
For one-off or ad hoc queries use the PARALLEL hint:
select /*+ parallel(your_table, 4) */ *
from your_table
/
The number in the hint is the degree of parallelism you are requesting; in this case the database will run four parallel slave processes.
If you want every query issued on the table to be parallelizable then permanently alter the table definition:
alter table your_table parallel (degree 4)
/
Note that the database won't always use parallel query; the optimizer will decide whether it's appropriate. Parallel query only works with full table scans or index range scans which cross multiple partitions.
There are a number of caveats. Parallel query requires us to have sufficient cores to satisfy the proposed number of threads; if we only have a single dual-core CPU setting a parallel degree of 16 isn't going to magically speed up the query. Also, we need spare CPU capacity; if the server is already CPU bound then parallel execution is only going to make things worse. Finally, the I/O and storage subsystems need to be capable of satisfying the concurrent demand; SANs can be remarkably unhelpful here.
As always in matters of performance, it is crucial to undertake some benchmarking against realistic volumes of data in a representative environment before going into production.
What if you don't have Enterprise Edition? Well, it is possible to mimic parallel execution by hand. Tom Kyte calls it "Do-It-Yourself Parallelism". I have used this technique myself, and it works well.
The key thing is to work out the total range of ROWIDs which apply to the table and split them across multiple jobs. Unlike some of the other solutions proposed in this thread, each job only selects the rows it needs. Mr Kyte summarized the technique in an old AskTom thread, including the vital split script: find it here.
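This isn't Kyte's script itself (his version derives the ranges from dba_extents, so the table need not be scanned up front), but a minimal sketch of the same idea, assuming eight jobs and a table called your_table:

select grp,
       min(rid) as lo_rid,
       max(rid) as hi_rid
from  (
        -- deal every rowid into one of eight evenly sized, contiguous buckets
        select rowid as rid,
               ntile(8) over (order by rowid) as grp
        from   your_table
      )
group by grp
order by grp;

Each job then issues select ... from your_table where rowid between :lo_rid and :hi_rid, so it reads only the blocks in its own slice.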
Splitting the table and starting off threads is a manual task: fine as a one-off but rather tiresome to undertake frequently. So if you are running 11g release 2 you ought to know that there is a new PL/SQL package DBMS_PARALLEL_EXECUTE which automates this for us.
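A minimal sketch of how the package might be driven (the task name, chunk size, and staging insert are illustrative assumptions; substitute whatever per-chunk statement suits your load):

begin
  dbms_parallel_execute.create_task('load_grid');

  -- carve the table into rowid ranges of roughly 10,000 rows each
  dbms_parallel_execute.create_chunks_by_rowid(
      task_name   => 'load_grid',
      table_owner => user,
      table_name  => 'YOUR_TABLE',
      by_row      => true,
      chunk_size  => 10000);

  -- execute the statement once per chunk, four scheduler jobs at a time;
  -- :start_id and :end_id are bound to each chunk's rowid range
  dbms_parallel_execute.run_task(
      task_name      => 'load_grid',
      sql_stmt       => 'insert into staging_table
                           select * from your_table
                          where rowid between :start_id and :end_id',
      language_flag  => dbms_sql.native,
      parallel_level => 4);

  dbms_parallel_execute.drop_task('load_grid');
end;
/

Note that run_task uses the scheduler under the covers, so the executing user needs the CREATE JOB privilege.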
Are you sure a parallel execution of the query will be faster? This will only be the case if the huge table is stored on a disk array with many disks, or if it is partitioned over several disks. In all other cases, a sequential access of the table will be many times faster.
If you really have to split the query, you have to split it in a way that a sequential access for each part is still possible. Please post the DDL of the table so we can give a specific answer.
If the processing of the data or the loading into the data grid is the bottleneck, then you are better off reading the data with a single process and splitting the data before processing it further.
Assuming that reading is fast and further data processing is the bottleneck, you could, for example, read the data and write it into very simple text files (such as fixed-length or CSV). After every 10,000 rows you start a new file and spawn a thread or process to process the just-finished file.
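By way of illustration, a rough PL/SQL sketch of that producer side, assuming a directory object named EXPORT_DIR and illustrative columns id and payload (the workers that consume each finished file live outside the database):

declare
  l_file utl_file.file_type;
  l_rows pls_integer := 0;
  l_part pls_integer := 0;
begin
  l_file := utl_file.fopen('EXPORT_DIR', 'chunk_0.csv', 'w', 32767);
  for r in (select id, payload from your_table) loop
    utl_file.put_line(l_file, r.id || ',' || r.payload);
    l_rows := l_rows + 1;
    if l_rows = 10000 then
      -- the file is complete: a worker can safely pick it up now
      utl_file.fclose(l_file);
      l_part := l_part + 1;
      l_rows := 0;
      l_file := utl_file.fopen('EXPORT_DIR', 'chunk_' || l_part || '.csv', 'w', 32767);
    end if;
  end loop;
  utl_file.fclose(l_file);
end;
/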
Try something like this. ROWNUM is assigned to rows as they pass the filter, so a predicate like ROWNUM BETWEEN 11 AND 20 can never be true: the first candidate row gets ROWNUM 1, fails the test, and ROWNUM never advances. You have to materialize ROWNUM in an inline view first:
select * from
  ( select a.*, ROWNUM rnum from
      ( <your_query_goes_here, with order by> ) a
    where ROWNUM <= :MAX_ROW_TO_FETCH )
where rnum >= :MIN_ROW_TO_FETCH;
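Mapped onto the packetNo/packetSize scheme from the question, each worker would bind, for example:

:MIN_ROW_TO_FETCH = packetNo * packetSize + 1
:MAX_ROW_TO_FETCH = (packetNo + 1) * packetSize

The inner query must carry a deterministic ORDER BY (on a unique key, say); otherwise the packets can overlap or miss rows between executions.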
Have you considered using MOD on ROWNUM to pull the data one tenth at a time? Note that MOD(ROWNUM, 10) straight in the WHERE clause falls into the same trap as the attempt in the question (the failing filter stops ROWNUM from ever advancing past 1), so materialize ROWNUM in an inline view and give each of ten threads its own bucket:
SELECT *
FROM  (SELECT A.*, ROWNUM rnum FROM your_table A)
WHERE MOD(rnum, 10) = :bucket;
with :bucket running from 0 to 9. Bear in mind that each thread still scans the whole table only to discard nine tenths of it.