App Engine Bulk Loader Performance
I am using the App Engine bulk loader (Python runtime) to bulk upload entities to the datastore. The data that I am uploading is stored in a proprietary format, so I have implemented my own connector (registered in bulkload_config.py) to convert it to the intermediate Python dictionary.
from google.appengine.ext.bulkload import connector_interface

class MyCustomConnector(connector_interface.ConnectorInterface):
    ....
    # Overridden method: yields one neutral dictionary per input record
    def generate_import_record(self, filename, bulkload_state=None):
        ....
        yield my_custom_dict
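As a rough illustration of the generator pattern that generate_import_record relies on, here is a minimal standalone sketch. It does not use the App Engine SDK, and the 'key=value;key=value' line format, parse_record and generate_records are all hypothetical stand-ins for the proprietary format and the connector method.

```python
import io

def parse_record(line):
    """Turn one 'key=value;key=value' line into a dict (hypothetical format)."""
    pairs = (pair.split("=", 1) for pair in line.strip().split(";") if pair)
    return {key.strip(): value.strip() for key, value in pairs}

def generate_records(stream):
    """Yield one neutral dict per non-empty line, one record at a time."""
    for line in stream:
        if line.strip():
            yield parse_record(line)

# Hypothetical sample input standing in for the proprietary data file.
sample = io.StringIO("name=a;value=1.5\n\nname=b;value=2.0\n")
records = list(generate_records(sample))
```

Because records are yielded one at a time, the whole input file never has to be held in memory, which matters for inputs in the GB range.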
To convert this neutral Python dictionary to a datastore Entity, I use a custom post-import function that I have defined in my YAML.
def feature_post_import(input_dict, entity_instance, bulkload_state):
    ....
    return [all_entities_to_put]
Note: I am not using entity_instance or bulkload_state in my feature_post_import function. I am just creating new datastore entities (based on my input_dict) and returning them.
Now, everything works great. However, bulk loading the data seems to take far too long. For example, 1 GB (~1,000,000 entities) of data takes ~20 hours. How can I improve the performance of the bulk load process? Am I missing something?
Some of the parameters that I use with appcfg.py: 10 threads with a batch size of 10 entities per thread.
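For reference, the settings above map onto appcfg.py upload_data flags roughly as in the sketch below. The helper build_upload_command, the file names, the kind and the URL are hypothetical placeholders, and depending on the SDK version an application directory argument may also be needed. The throttle flags (rps_limit, bandwidth_limit, http_limit) are only added when a value is given, so the tool's own defaults apply otherwise.

```python
def build_upload_command(config_file, filename, kind, url,
                         num_threads=10, batch_size=10,
                         rps_limit=None, bandwidth_limit=None,
                         http_limit=None):
    """Return an appcfg.py upload_data argument list for one bulk load run."""
    args = [
        "appcfg.py", "upload_data",
        "--config_file=%s" % config_file,
        "--filename=%s" % filename,
        "--kind=%s" % kind,
        "--url=%s" % url,
        "--num_threads=%d" % num_threads,
        "--batch_size=%d" % batch_size,
    ]
    # Throttle flags are optional; leave them out to keep the defaults.
    optional = [("rps_limit", rps_limit),
                ("bandwidth_limit", bandwidth_limit),
                ("http_limit", http_limit)]
    for name, value in optional:
        if value is not None:
            args.append("--%s=%s" % (name, value))
    return args

# All values below are placeholders, not taken from the original setup.
command = build_upload_command(
    config_file="bulkloader.yaml",
    filename="features.dat",
    kind="Feature",
    url="http://localhost:8080/_ah/remote_api",
)
print(" ".join(command))
```

The resulting list can be passed to subprocess.call, or joined into a single shell command as shown.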
Related Google App Engine Python group post: http://groups.google.com/group/google-appengine-python/browse_thread/thread/4c8def071a86c840
Update:
To test the performance of the bulk load process, I loaded entities of a 'Test' kind. Even though this entity has only a very simple FloatProperty, it still took the same amount of time to bulk load those entities.
I am still going to try varying the bulk loader parameters rps_limit, bandwidth_limit and http_limit to see if I can get any more throughput.
There is a parameter called rps_limit that determines the number of entities to upload per second. This was the major bottleneck. The default value is 20.
Also increase the bandwidth_limit to something reasonable.
I increased rps_limit to 500 and everything improved. I achieved 5.5 - 6 seconds per 1000 entities, which is a major improvement over 50 seconds per 1000 entities.
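A quick back-of-the-envelope check shows why rps_limit looked like the bottleneck. The sketch assumes rps_limit acts as a hard cap on entities uploaded per second. The helper names are made up for illustration.

```python
def seconds_per_thousand(rps_limit):
    """Best-case seconds to upload 1000 entities under a given rps_limit."""
    return 1000.0 / rps_limit

def hours_for(total_entities, secs_per_thousand):
    """Total hours to upload total_entities at a given rate per 1000."""
    return total_entities / 1000.0 * secs_per_thousand / 3600.0

default_floor = seconds_per_thousand(20)   # cap imposed by the default limit
raised_floor = seconds_per_thousand(500)   # cap after raising the limit

before_hours = hours_for(1000000, 50)      # at the observed 50 s per 1000
after_hours = hours_for(1000000, 6)        # at the observed ~6 s per 1000

print(default_floor, raised_floor, round(before_hours, 1), round(after_hours, 1))
```

With the default of 20, the cap works out to 50 seconds per 1000 entities, which matches the rate observed before the change. With 500, the cap would be 2 seconds per 1000, so the measured 5.5 - 6 seconds suggests that some other factor (possibly bandwidth_limit, http_limit or the datastore writes themselves) is now the limit.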