
Best way to deal with a 900,000-record database and zip codes?

A company we do business with wants to give us a 1.2 GB CSV file every day containing about 900,000 product listings. Only a small portion of the file changes every day, maybe less than 0.5%, and it's really just products being added or dropped, not modified. We need to display the product listings to our partners.

What makes this more complicated is that our partners should only be able to see product listings available within a 30-500 mile radius of their zip code. Each product listing row has a field specifying that product's radius (some are 30 miles, some 100, some 500; 500 is the max). A partner in a given zip code is likely to get only 20 results or so, meaning there's going to be a ton of unused data. We don't know all the partner zip codes ahead of time.

We have to consider performance, so I'm not sure what the best way to go about this is.

Should I have two databases: one with zip codes and latitude/longitude, using the Haversine formula to calculate distance, and the other with the actual product data? And then what do I do: return all the zip codes within a given radius and look for matches in the product database? For a 500 mile radius that's going to be a ton of zip codes. Or should I write a MySQL function?
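Here's roughly what I'm picturing, in Python just for illustration (the zip_codes table layout, the 69-miles-per-degree bounding-box trick, and the function names are my own rough assumptions, not anything we've built):

```python
import math

EARTH_RADIUS_MILES = 3959.0

def haversine_miles(lat1, lon1, lat2, lon2):
    """Great-circle distance in miles between two lat/lng points."""
    lat1, lon1, lat2, lon2 = map(math.radians, (lat1, lon1, lat2, lon2))
    dlat, dlon = lat2 - lat1, lon2 - lon1
    a = math.sin(dlat / 2) ** 2 + math.cos(lat1) * math.cos(lat2) * math.sin(dlon / 2) ** 2
    return 2 * EARTH_RADIUS_MILES * math.asin(math.sqrt(a))

def bounding_box(lat, lon, radius_miles):
    """Crude lat/lng box around a point, so an index on (lat, lng) can prune
    most zip codes before the exact haversine check."""
    dlat = radius_miles / 69.0                                   # ~69 miles per degree of latitude
    dlon = radius_miles / (69.0 * math.cos(math.radians(lat)))   # degrees of longitude shrink with latitude
    return lat - dlat, lat + dlat, lon - dlon, lon + dlon
```

But even with a bounding-box prefilter, a 500 mile radius still covers a huge number of zip codes, which is exactly what worries me.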

We could use Amazon SimpleDB to store the database... but then I still have this problem with the zip codes. I could make two "domains," as Amazon calls them: one for the products and one for the zip codes? I don't think you can query across multiple SimpleDB domains, though. At least, I don't see that anywhere in their documentation.

I'm open to some other solution entirely. It doesn't have to be PHP/MySQL or SimpleDB. Just keep in mind our dedicated server is a P4 with 2 GB of RAM. We could upgrade the RAM; it's just that we can't throw a ton of processing power at this. Or we could store and process the database every night on a VPS somewhere, where it wouldn't matter if the VPS were unbearably slow while that 1.2 GB CSV is being processed. We could even process the file offline on a desktop computer and then remotely update the database every day... except then I still have this problem with zip codes and product listings needing to be cross-referenced.


You might want to look into PostgreSQL and PostGIS. It has spatial indexing features similar to MySQL's, without the need to use MyISAM (which, in my experience, tends to become corrupt, as opposed to InnoDB).

In particular, Postgres 9.1 allows k-nearest-neighbour search queries using GIST indexes.
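For example, a radius query along these lines could do the whole job in one indexed pass. This is only a sketch: the products/zip_codes schema, column names, and connection string are assumptions on my part; ST_DWithin on geography columns takes a distance in meters and can use the GIST index:

```python
import psycopg2  # assumes PostgreSQL with the PostGIS extension installed

conn = psycopg2.connect("dbname=products")  # placeholder connection string
cur = conn.cursor()

# Hypothetical schema: products(id, title, zip, radius_miles) and
# zip_codes(zip, geog geography(Point, 4326)) with a GIST index on geog.
partner_zip = "90210"
cur.execute("""
    SELECT p.id, p.title
    FROM   products  p
    JOIN   zip_codes pz ON pz.zip = p.zip     -- the product's own location
    JOIN   zip_codes cz ON cz.zip = %s        -- the partner's zip code
    WHERE  ST_DWithin(pz.geog, cz.geog, p.radius_miles * 1609.34)  -- miles -> meters
""", (partner_zip,))

for product_id, title in cur.fetchall():
    print(product_id, title)
```

That way each product's own radius field is honoured without ever materialising the list of zip codes inside the radius.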


Well, that is an interesting problem indeed.

This seems like it's actually two issues: first, how should you index the database, and second, how do you keep it up to date? The first you can achieve as you describe, though normalization may or may not be a problem, depending on how you are storing the zip code. That primarily comes down to what your data looks like.

As for the second one, this is more my area of expertise. You can have your client upload the CSV to you as they currently do, keep a copy of yesterday's file, and run the two through a diff utility, or you can leverage Perl, PHP, Python, Bash, or any other tool you have to find the lines that have changed. Pass those into a second step that updates your database; a sketch of that approach is below. I have dealt with clients with issues along these lines, and scripting it away tends to be the best choice. If you need help organizing such a script, I'm happy to assist.
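A minimal sketch of that in Python (the file names and the product_id key are placeholders for whatever uniquely identifies a listing in your feed):

```python
import csv

KEY = "product_id"  # placeholder: whatever field uniquely identifies a listing

def load_keys(path):
    """Return the set of listing keys in a feed file. Only keys are kept in
    memory, so even a 1.2 GB CSV stays manageable on a 2 GB machine."""
    with open(path, newline="") as f:
        return {row[KEY] for row in csv.DictReader(f)}

old_keys = load_keys("listings_yesterday.csv")
new_keys = load_keys("listings_today.csv")

added_keys   = new_keys - old_keys   # listings to INSERT
dropped_keys = old_keys - new_keys   # listings to DELETE

# Second pass over today's file to pull the full rows that need inserting.
with open("listings_today.csv", newline="") as f:
    rows_to_insert = [row for row in csv.DictReader(f) if row[KEY] in added_keys]
```

Since less than 0.5% of the file changes each day, the nightly database work becomes a few thousand inserts and deletes instead of reloading all 900,000 rows.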
