How to migrate and process data in Java with lower memory consumption and better speed
I have an application that receives DBF files containing a flat table, which it normalizes, converts, and inserts into a MySQL database. The files have grown to between 250,000 and 270,000 records, and the whole process now consumes about 1.5 - 2 GB of memory, to the point that I have to run it in parts.
During the process I check whether a record already exists, keeping only the unique keys in HashMaps so I don't have to store all the record data in them. The problem is that as more DBF files are loaded into the database, the total number of records grows, and every import becomes slower and consumes much more memory; one table easily reaches 770,000 records and another goes past 2,000,000.
example:
The DBF file has 91 fields, and the normalized database has 5 main tables, so I have to go over the DBF file 5 times.
Just migrating a whole DBF file of 250,000 records can take me about 1 hour.
How can I minimize the time and RAM consumption, or at least lower the RAM consumption? At the moment I get several Exception in thread "main" java.lang.OutOfMemoryError: Java heap space errors.
The DBF file structure has a lot of redundant data and looks like this (columns):
- survey code
- overall survey data
- group data within the survey
- person details
example:
| survey_code | survey_columns | group_survey | more_data | person_details |
|:------------|:---------------|:----------------|:----------|:---------------|
| 1 | ... | 1 | ... |1-oscar |
| 1 | ... | 1 | ... |2-juan |
| 1 | ... | 2 | ... |1-catalina |
| 2 | ... | 1 | ... |7-john |
Note: as you can see, there is a lot of data redundancy.
The structure of the MySQL database is like this:
Survey (table)
- id
- survey_data

Group_in_survey (table)
- survey_id
- group_data

Person (table)
- group_id
- person_details
The import process is divided according to the tables, in the order defined by their dependencies; for each table, it runs through all the records in the DBF file.
Before starting, I make a request to the database and load into a HashMap the information necessary to check whether a record already exists before it is added to the MySQL database.
Also, every time a record is added to the MySQL database, that record is also added to the HashMap, since part of the cleanup is removing duplicates. This is done in each of the processes.
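A minimal sketch of that check-then-insert pattern, keeping only keys in memory (the class, table, and method names here are illustrative placeholders, not the real code):

```java
import java.sql.Connection;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;
import java.util.HashSet;
import java.util.Set;

// Sketch of the check-then-insert pattern described above; the table,
// column, and method names are placeholders, not the real schema.
class SurveyKeyCache {
    private final Set<String> existingKeys = new HashSet<>();

    // Pre-load only the unique keys already stored in MySQL, not whole rows.
    void loadExistingKeys(Connection conn) throws SQLException {
        try (Statement st = conn.createStatement();
             ResultSet rs = st.executeQuery("SELECT id FROM Survey")) {
            while (rs.next()) {
                existingKeys.add(rs.getString(1));
            }
        }
    }

    // Returns true when the key is new; it also remembers the key so
    // duplicates seen later in the same run are skipped as well.
    boolean markIfNew(String surveyCode) {
        return existingKeys.add(surveyCode); // add() returns false if already present
    }
}
```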
From what I understand, you are inserting your records into the HashMap first to get rid of duplicates, and then you copy them to the MySQL database. If that is the case, you can try the HugeCollections library and its HugeMap instead of a plain HashMap.
I think I'd prefer processing the flat file first, stripping duplicates and saving the result to a temporary file, then doing a full bulk import. This way you'll never run into memory issues, and you'll have a backup of your first step if anything goes wrong.
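A rough sketch of that approach, assuming the DBF rows are already being read as string arrays and that MySQL's LOAD DATA LOCAL INFILE is enabled (the class name, key position, and column list are assumptions):

```java
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.sql.Connection;
import java.sql.SQLException;
import java.sql.Statement;
import java.util.HashSet;
import java.util.Set;

// Sketch: write deduplicated rows to a temporary CSV, then let MySQL bulk-load it.
// The column list and the key position are placeholders for the real schema.
class DedupThenBulkLoad {

    // Stream the already-parsed DBF rows, keep only unseen keys, write them to a CSV.
    static void writeDeduplicated(Iterable<String[]> rows, Path csvFile) throws IOException {
        Set<String> seenKeys = new HashSet<>();     // only keys stay in memory, never whole rows
        try (BufferedWriter out = Files.newBufferedWriter(csvFile)) {
            for (String[] row : rows) {
                if (seenKeys.add(row[0])) {         // row[0] assumed to be the unique key
                    out.write(String.join(",", row));
                    out.newLine();
                }
            }
        }
    }

    // Bulk-import the CSV in one statement. Needs allowLoadLocalInfile=true on the
    // JDBC URL and local_infile enabled on the MySQL server.
    static void bulkLoad(Connection conn, Path csvFile) throws SQLException {
        String path = csvFile.toAbsolutePath().toString().replace('\\', '/');
        String sql = "LOAD DATA LOCAL INFILE '" + path
                + "' INTO TABLE Survey FIELDS TERMINATED BY ',' (id, survey_data)";
        try (Statement st = conn.createStatement()) {
            st.execute(sql);
        }
    }
}
```

LOAD DATA INFILE is generally much faster than row-by-row inserts, since MySQL parses and loads the whole file in one pass.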
It seems that you are doing something strange there: it looks like you are parsing the whole DBF file, loading it into memory, and then inserting records one by one into the database (hopefully not creating a connection each time).
As far as I know, DBF is easily read with stream readers, so there is no need to load everything into memory. I think it would be better to do it in batches: read 100 records, insert all of them into the database, and repeat. Alternatively, you can create an SQL file with the inserts and run the whole file on MySQL (but that would take loads of HDD space).
That way the whole process should take only as much memory as processing one batch of rows, and memory use should stay stable.
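A minimal sketch of that batched-insert approach using standard JDBC batching (the SQL and the record layout are placeholders):

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;
import java.util.List;

// Sketch of batched inserts: insert a chunk of records in few round trips and
// never hold more than one batch in memory. The SQL and the record layout are
// placeholders for the real schema.
class BatchInsertSketch {
    private static final int BATCH_SIZE = 100;

    static void insertBatch(Connection conn, List<String[]> records) throws SQLException {
        // With MySQL Connector/J, adding rewriteBatchedStatements=true to the JDBC URL
        // lets the driver collapse the batch into multi-row INSERT statements.
        String sql = "INSERT INTO Survey (id, survey_data) VALUES (?, ?)";
        try (PreparedStatement ps = conn.prepareStatement(sql)) {
            int count = 0;
            for (String[] rec : records) {
                ps.setString(1, rec[0]);
                ps.setString(2, rec[1]);
                ps.addBatch();
                if (++count % BATCH_SIZE == 0) {
                    ps.executeBatch();   // flush every BATCH_SIZE rows
                }
            }
            ps.executeBatch();           // flush the remainder
        }
    }
}
```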
I would need to know a bit more about the structure of the DBF file and your tables to say anything more.