Best way to get CSV data into App Engine when bulkloader takes too long/generates errors?
I have a 10 MB CSV file of Geolocation data that I tried to upload to my App Engine datastore yesterday. I followed the instructions in this blog post and used the bulkloader/appcfg tool. The datastore indicated that records were uploaded but it took several hours and used up my entire CPU quota for the day. The process broke down in errors towards the end before I actually exceeded my quota. But needless to say, 10 MB of data shouldn't require this much time and power.
So, is there some other way to get this CSV data into my App Engine datastore (for a Java app)?
I saw a post by Ikai Lan about using a mapper tool he created for this purpose but it looks rather complicated.
Instead, what about uploading the CSV to Google Docs - is there a way to transfer it to the App Engine datastore from there?
I do daily uploads of 100000 records (20 megs) through the bulkloader. Settings I played with:
- bulkloader.yaml config: set to auto generate keys.
- include header row in raw csv file.
- speed parameters are set on max (not sure if reducing would reduce cpus consumed)
These settings burn through my 6.5 hrs of free quota in about 4 minutes -- but it gets the data loaded (maybe it's from the indexes being generated).
appcfg.py upload_data --config_file=bulkloader.yaml --url=http://yourapp.appspot.com/remote_api --filename=data.csv --kind=yourtablename --bandwidth_limit=999999 --rps_limit=100 --batch_size=50 --http_limit=15
(I autogenerate this line with a script and use Autohotkey to send my credentials).
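For what it's worth, here is a rough sketch of the kind of script that could generate that line. It is not my actual script; the function name build_upload_command and its defaults are made up for illustration, and it only uses the flags shown in the command above. Having the throttle values as parameters makes it easy to try lower numbers and see whether CPU usage changes.

```python
import shlex


def build_upload_command(url, filename, kind, config_file="bulkloader.yaml",
                         bandwidth_limit=999999, rps_limit=100,
                         batch_size=50, http_limit=15):
    """Return the appcfg.py upload_data call as a list of arguments.

    Hypothetical helper: only the flags from the command above are used,
    and the defaults simply mirror the values shown there.
    """
    return [
        "appcfg.py", "upload_data",
        "--config_file=%s" % config_file,
        "--url=%s" % url,
        "--filename=%s" % filename,
        "--kind=%s" % kind,
        "--bandwidth_limit=%d" % bandwidth_limit,
        "--rps_limit=%d" % rps_limit,
        "--batch_size=%d" % batch_size,
        "--http_limit=%d" % http_limit,
    ]


if __name__ == "__main__":
    cmd = build_upload_command("http://yourapp.appspot.com/remote_api",
                               "data.csv", "yourtablename")
    # Print a shell-safe version of the command line for copy/paste.
    print(" ".join(shlex.quote(arg) for arg in cmd))
```

The list form can also be handed straight to subprocess if you would rather run the upload from the script than paste it into a shell.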
I wrote this gdata connector to pull data out of a Google Docs Spreadsheet and insert it into the datastore, but it uses Bulkloader, so it kind of takes you back to square one of your problem.
http://code.google.com/p/bulkloader-gdata-connector/source/browse/gdata_connector.py
What you could do, however, is take a look at the source to see how I pull data out of gdocs and create one or more tasks that do the same, instead of going through bulkloader.
Also you could upload your document into the blobstore and similarly create a task that reads csv data out of blobstore and creates entities. (I think this would be easier and faster than working with gdata feeds)
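To give an idea of the CSV side of such a task, here is a minimal sketch that reads rows from a file-like object and hands them over in batches. It is plain standard-library Python, not App Engine code: iter_batches, load_csv and the save_batch callback are names I made up, and save_batch stands in for whatever code turns a batch of rows into entities and stores them. Reading from the blobstore and scheduling the task are left out.

```python
import csv
import io
from itertools import islice


def iter_batches(csv_file, batch_size=50):
    """Yield lists of row dicts, at most batch_size rows per list."""
    reader = csv.DictReader(csv_file)  # the first row is used as the header
    while True:
        batch = list(islice(reader, batch_size))
        if not batch:
            return
        yield batch


def load_csv(csv_file, save_batch, batch_size=50):
    """Pass each batch to save_batch and return the number of rows handled.

    save_batch is a hypothetical callback; in a real task it would build
    entities from the row dicts and write them in one batch call.
    """
    total = 0
    for batch in iter_batches(csv_file, batch_size):
        save_batch(batch)
        total += len(batch)
    return total


if __name__ == "__main__":
    sample = io.StringIO(
        "city,lat,lon\nParis,48.85,2.35\nOslo,59.91,10.75\nRome,41.90,12.50\n"
    )
    saved = []  # collects batches in place of a real datastore write
    print(load_csv(sample, saved.append, batch_size=2))
```

Writing entities a batch at a time rather than one by one keeps the number of datastore calls down, which is the same idea as the --batch_size flag in bulkloader.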