Long (and failing) bulk data loads to Google App Engine datastore
I'm developing an application on Google App Engine using the current django-nonrel and the now-default High Replication Datastore. I'm currently trying to bulk load a 180MB CSV file locally on a dev instance with the following command:
appcfg.py upload_data --config_file=bulkloader.yaml --filename=../my_data.csv --kind=Place --num_threads=4 --url=http://localhost:8000/_ah/remote_api --rps_limit=500
bulkloader.yaml
python_preamble:
- import: base64
- import: re
- import: google.appengine.ext.bulkload.transform
- import: google.appengine.ext.bulkload.bulkloader_wizard
- import: google.appengine.ext.db
- import: google.appengine.api.datastore
- import: google.appengine.api.users

transformers:
- kind: Place
  connector: csv
  connector_options:
    encoding: utf-8
    columns: from_header
  property_map:
    - property: __key__
      external_name: appengine_key
      export_transform: transform.key_id_or_name_as_string
    - property: name
      external_name: name
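For context, the property_map above would correspond to a datastore kind roughly like the one below. The model itself isn't shown in the question, so this is only an illustrative sketch with property names taken from the config:

# Hypothetical Place model (not from the question). The csv connector
# maps each row's "name" column to the "name" property, and the
# __key__ mapping exports each entity's key id or name into the
# "appengine_key" column.
from google.appengine.ext import db

class Place(db.Model):
    name = db.StringProperty()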
The bulk load is actually successful for a truncated, 1000-record version of the CSV, but the full set eventually bogs down and starts erroring, "backing off" and waiting longer and longer. The bulkloader log that I tail doesn't reveal anything helpful, and neither does the server's stderr.
Any help in understanding this bulk load process would be appreciated. My plan is to eventually load big data sets into the Google datastore, but this isn't promising.
180MB is a lot of data to load into the dev_appserver - it's not designed for large (or even medium) datasets; it's built entirely for small-scale local testing. Your best bet would be to reduce the size of your test dataset; if you can't do that, try the --use_sqlite command line flag to use the new sqlite-based local datastore, which is more scalable.
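A rough sketch of that suggestion, assuming the old Python SDK dev_appserver (the app path is a placeholder; the port matches the remote_api URL in the question):

dev_appserver.py --use_sqlite --port=8000 /path/to/your/app

With the SQLite-backed datastore stub running, the same appcfg.py upload_data command against http://localhost:8000/_ah/remote_api should hold up better on larger files.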