
Searching large Excel files and handling large amounts of data efficiently

I've started on a project; here's what I have:

Excel file (exl1) containing 450K records, with 50K more each month.

exl1 format

invoice#|Box#|Serial-#1|Serial-#2|5-val-enum#1|5-val-enum#2|10-val-enum|4-val-enum|timestamp

ex1: abc123|box1|0987654321|A123456789|Plant|Tree|PersonName1|North|DateTime.Now

ex2: qwe345|box9|12345678901234567890|#NA|Animal|Cat|PersonName1|South|DT.Now

Excel file (exl2) containing roughly 50K records (its single column is named searchVal for the purpose of explanation).

exl2 format

Serial1

ex1a: A123456789

ex1b: 0987654321

ex2a: 12345678901234567890

Here's what I have to do:

  1. Compare each value in exl2 (searchVal) to either Serial-#1 or Serial-#2 of exl1, depending on the value of 5-val-enum#1 (example 1 = Plant, example 2 = Animal).

  2. If searchVal starts with [a-z], search Serial-#2, else search Serial-#1; so with searchVal ex1a search col3, and with searchVal ex1b search col2:

        // Fleshed-out version of the pseudocode: nested loops, with the exl2 row index and the
        // exl1 row index kept separate. Assumptions: regex matches a leading letter (e.g. "^[A-Za-z]"),
        // col0/col2/col3/col4 are column indexes, and ResultsDT shares exl1's schema so ImportRow works.
        for (int searchRow = 0; searchRow < exl2.Rows.Count; searchRow++)
        {
            string searchVal = exl2.Rows[searchRow][col0].ToString();

            for (int dataRow = 0; dataRow < exl1.Rows.Count; dataRow++)
            {
                DataRow row = exl1.Rows[dataRow];

                if (row[col4].ToString() == "Plant")
                {
                    if (regex.IsMatch(searchVal.Substring(0, 1)))  // starts with a letter => Serial-#2
                    {
                        if (searchVal == row[col3].ToString())
                            ResultsDT.ImportRow(row);              // add matched row to ResultsDT
                    }
                    else                                           // starts with a digit => Serial-#1
                    {
                        if (searchVal == row[col2].ToString())
                            ResultsDT.ImportRow(row);
                    }
                }
                else                                               // not a Plant row: search col2 (Serial-#1)
                {
                    if (searchVal == row[col2].ToString())
                        ResultsDT.ImportRow(row);
                }
            }
        }
    
  3. For the sake of explanation, let's say Person1 matched 400 Plants, of which 100 were trees, 100 were bushes, 100 were grasses and 100 were flowers, and he matched 400 Animals, of which 100 each were cats, dogs, snakes and birds. With these matches I'd like to summarize to the output SUMMARY1: PersonName|Plants|Animals|Category3|Category4|Category5, with a more detailed one for each of the categories, like SUMMARY2: PersonName|Trees|Bushes|Grasses|Flowers, leading to SUMM1: Person1|400|400|x|n|y and SUMM2 (plants only): Person1|100|100|100|100. (One way to build these summaries is sketched after this list.)

  4. Most importantly: do all of this without killing the PC it's running on for 3 hours while it computes.
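
For step 3, once the matched rows have been collected, a LINQ GroupBy can roll them up per person. This is only a rough sketch: the Match class, its property names and the column mapping in the comments are assumptions made for illustration, not part of the original workbook.

    using System.Collections.Generic;
    using System.Linq;

    // Illustrative shape of one matched row; names are placeholders for this sketch.
    public class Match
    {
        public string Person { get; set; }       // the PersonName column
        public string Category { get; set; }     // 5-val-enum#1: Plant, Animal, ...
        public string SubCategory { get; set; }  // 5-val-enum#2: Tree, Cat, ...
    }

    public static class Summaries
    {
        // SUMMARY1: PersonName|Plants|Animals|... - one count per top-level category.
        public static IEnumerable<string> Summary1(IEnumerable<Match> matches, string[] categories)
        {
            return matches
                .GroupBy(m => m.Person)
                .Select(g => g.Key + "|" + string.Join("|",
                    categories.Select(c => g.Count(m => m.Category == c))));
        }

        // SUMMARY2 for one category, e.g. "Plant": PersonName|Trees|Bushes|Grasses|Flowers.
        public static IEnumerable<string> Summary2(IEnumerable<Match> matches, string category, string[] subCategories)
        {
            return matches
                .Where(m => m.Category == category)
                .GroupBy(m => m.Person)
                .Select(g => g.Key + "|" + string.Join("|",
                    subCategories.Select(s => g.Count(m => m.SubCategory == s))));
        }
    }

With the example numbers above, Summary2(matches, "Plant", new[] { "Tree", "Bush", "Grass", "Flower" }) would yield a line like Person1|100|100|100|100.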


There are at least two options:

  1. Treat Excel as the database and check its performance - here is how to do this: http://www.beansoftware.com/NET-Tutorials/Excel-ADO.NET-Database.aspx (a small sketch follows this list).
  2. If option no. 1 is too slow, import this data into a real database (MS SQL Server, MySQL, PostgreSQL, etc.), add appropriate indexes and perform your searches in the DB. Excel would be treated just as a data source for an initial import.
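
As a rough illustration of option no. 1 only (the file path, sheet name and OLE DB provider version below are placeholders that depend on what is installed), the workbook can be pulled into a DataTable through ADO.NET:

    using System.Data;
    using System.Data.OleDb;

    class ExcelReader
    {
        // Reads one worksheet into a DataTable; HDR=YES treats row 1 as the header row.
        // The data source path and the sheet name "Sheet1" are placeholders.
        public static DataTable ReadSheet()
        {
            const string connStr =
                @"Provider=Microsoft.ACE.OLEDB.12.0;Data Source=C:\data\exl1.xlsx;" +
                "Extended Properties=\"Excel 12.0 Xml;HDR=YES\"";

            var table = new DataTable();
            using (var conn = new OleDbConnection(connStr))
            using (var adapter = new OleDbDataAdapter("SELECT * FROM [Sheet1$]", conn))
            {
                adapter.Fill(table);   // Fill opens and closes the connection itself
            }
            return table;
        }
    }

Whether this is fast enough on 450K rows is exactly what option no. 1 asks you to measure before falling back to option no. 2.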


I'm assuming the question here is "how can I perform these tasks efficiently?"

The answer is, you shouldn't. It sounds like you are trying to do OLAP on the cheap (except that, well, it may not be happening strictly online), and there are a lot of solutions already available for this.

Since you already have an established procedure of using an Excel spreadsheet, PALO may serve your needs (edit: it's free).

Alternatively, what you have there is a denormalized set of records; if you normalize it into several sets and enter it into a database (using a script, obviously), you can let your database take care of the intensive computations. Edit: There are a lot of free databases you can use (SQL is a language, not a brand), e.g. PostgreSQL or MySQL.

If you insist on parsing and analyzing the files yourself, then I suggest you modify your algorithm to do two things:

Firstly, get your 50k set of records to fit into as little memory as is reasonably possible. Obviously, you don't want to store your records as 50k strings: parse them, and build up a memory structure which lets you access only the information you need. Edit: Never mind, I misunderstood your input data.
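
As a concrete (and purely illustrative) version of that idea applied to the original question: the ~50K search values fit comfortably in a HashSet<string>, which turns each comparison against the 450K rows into a constant-time lookup instead of a scan over the whole search list. The names below are placeholders.

    using System;
    using System.Collections.Generic;
    using System.Data;

    static class SearchIndex
    {
        // Collects the single exl2/searchVal column into a set so that membership tests
        // made while walking the 450K exl1 rows are O(1) each.
        public static HashSet<string> Build(DataTable searchTable)
        {
            var values = new HashSet<string>(StringComparer.OrdinalIgnoreCase);
            foreach (DataRow row in searchTable.Rows)
            {
                values.Add(row[0].ToString().Trim());
            }
            return values;
        }
    }

    // Usage: var searchValues = SearchIndex.Build(exl2Table);
    //        if (searchValues.Contains(serial)) { /* matched - add to ResultsDT */ }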

Secondly, modify your algorithm so that it can be run piecemeal. Currently you have one set of 50k records and another set of 450k records, and it sounds like you expect to run your program each month (or more frequently) on the full 450k set plus whatever records have been added to it. If you start storing incremental results, you can structure your script so that it processes, say, up to 10k records from the 450k set per run and run several instances of the script in sequence. That way you avoid re-analyzing the whole 450k records every month, and you also get a handy way to stop and restart the process midway (using some kind of parent script).
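
A minimal sketch of that incremental idea, assuming a plain text file is good enough to remember how far the last run got (the file name and the 10k batch size are arbitrary placeholders):

    using System.IO;

    static class Checkpoint
    {
        // Persists the index of the last processed exl1 row, so the next run (or next
        // month's run) only touches rows added since then.
        const string StateFile = "lastRow.txt";

        public static int Load() =>
            File.Exists(StateFile) ? int.Parse(File.ReadAllText(StateFile)) : 0;

        public static void Save(int lastProcessedRow) =>
            File.WriteAllText(StateFile, lastProcessedRow.ToString());
    }

    // Usage inside the main loop (rowCount = total rows currently in exl1):
    //   int start = Checkpoint.Load();
    //   int end = Math.Min(start + 10000, rowCount);
    //   for (int i = start; i < end; i++) { /* process exl1 row i */ }
    //   Checkpoint.Save(end);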

For a more complex approach, look at Divide and Conquer as it applies to algorithms.


Depending on the ratio of Excel updates to queries run, it might be a good idea to simply read the values into a SQL Server database and query/process the data there. I would imagine that it takes some time to read the values into SQL Server, but the queries should take almost no time...
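
If you go that route, the loading step is only a few lines with SqlBulkCopy; the connection string and destination table name below are placeholders, and this is only a sketch of the import, not of the schema or indexes you would add afterwards:

    using System.Data;
    using System.Data.SqlClient;

    static class Loader
    {
        // Bulk-loads the DataTable read from exl1 into a SQL Server table, after which
        // the matching and the summaries can be done as indexed, set-based queries.
        public static void BulkLoad(DataTable source)
        {
            const string connStr = @"Server=.\SQLEXPRESS;Database=Invoices;Integrated Security=true";

            using (var bulk = new SqlBulkCopy(connStr))
            {
                bulk.DestinationTableName = "dbo.Exl1Records";  // placeholder table name
                bulk.BatchSize = 5000;                          // send rows in chunks
                bulk.WriteToServer(source);                     // column order must match the table
            }
        }
    }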
