Insert thousands of entities into BigTable in a reasonable time
I'm having some issues when I try to insert the 36k French cities into BigTable. I'm parsing a CSV file and putting every row into the datastore using this piece of code:
import csv
from databaseModel import *
from google.appengine.ext.db import GqlQuery

def add_cities():
    # Tab-separated file; column 2 holds the region code, column 11 the city name.
    spamReader = csv.reader(open('datas/cities_utf8.txt', 'rb'),
                            delimiter='\t', quotechar='|')
    mylist = []
    for i in spamReader:
        # One query per row to resolve the region's key.
        region = GqlQuery("SELECT __key__ FROM Region WHERE code=:1",
                          i[2].decode("utf-8"))
        mylist.append(InseeCity(region=region.get(),
                                name=i[11].decode("utf-8"),
                                name_f=strip_accents(i[11].decode("utf-8")).lower()))
    db.put(mylist)
It's taking around 5 minutes (!!!) to do it with the local dev server, and even 10 when deleting them with the db.delete() function. When I try it online by calling a test.py page containing add_cities(), the 30s timeout is reached. I'm coming from the MySQL world and I think it's a real shame not to be able to add 36k entities in less than a second. I may be wrong in the way I'm doing it, so I'm turning to you:
- Why is it so slow?
- Is there any way to do it in a reasonable time?
Thanks :)
First off, it's the datastore, not Bigtable. The datastore uses Bigtable, but it adds a lot more on top of that.
The main reason this is going so slowly is that you're doing a query (on the 'Region' kind) for every record you add. This is inevitably going to slow things down substantially. There are two things you can do to speed things up:
- Use the code of a Region as its key_name, allowing you to do a faster datastore get instead of a query. In fact, since you only need the region's key for the reference property, you needn't fetch the region at all in that case (see the sketch after this list).
- Cache the region list in memory, or skip storing it in the datastore at all. By its nature, I'm guessing regions is both a small list and infrequently changing, so there may be no need to store it in the datastore in the first place.
In addition, you should use the mapreduce framework when loading large amounts of data to avoid timeouts. It has built-in support for reading CSVs from blobstore blobs, too.
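As a rough sketch of that approach (based on the old Python appengine-mapreduce library; the handler name and the yaml wiring around it are assumptions, the column layout comes from the question), the mapper receives one line at a time from the blobstore input reader and yields datastore writes for the framework to batch:

from google.appengine.ext import db

from mapreduce import operation as op

from databaseModel import InseeCity, strip_accents  # helpers from the question


def import_city(data):
    # BlobstoreLineInputReader hands each mapper call a (byte offset, line) tuple.
    offset, line = data
    row = line.rstrip('\n').split('\t')
    name = row[11].decode('utf-8')
    city = InseeCity(region=db.Key.from_path('Region', row[2].decode('utf-8')),
                     name=name,
                     name_f=strip_accents(name).lower())
    # Yield a mutation instead of calling db.put(); the framework batches the writes.
    yield op.db.Put(city)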
Use the Task Queue. If you want your dataset to process quickly, have your upload handler create a task for each subset of 500 rows, using an offset value.
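A minimal sketch of that idea, assuming the old webapp framework and a hypothetical /tasks/import_cities worker URL (the 36k row count and the 500-row chunk come from the question and this answer):

from google.appengine.api import taskqueue
from google.appengine.ext import webapp


TOTAL_ROWS = 36000   # approximate size of the file, from the question
CHUNK = 500


class UploadHandler(webapp.RequestHandler):
    def post(self):
        # Enqueue one task per 500-row slice; each task parses only its own
        # offset range, so no single request hits the deadline.
        for offset in range(0, TOTAL_ROWS, CHUNK):
            taskqueue.add(url='/tasks/import_cities',
                          params={'offset': str(offset), 'limit': str(CHUNK)})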
FWIW we process large CSVs into the datastore using mapreduce, with some initial handling/validation inside a task. Even tasks have a limit (10 mins) at the moment, but that's probably fine for your data size.
Make sure if you're doing inserts, etc. you batch as much as possible - don't insert individual records, and the same goes for lookups - get_by_key_name allows you to pass in an array of keys. (I believe db.put has a limit of 200 records at the moment?)
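For example, a batched lookup and a batched write might look like this (model names reused from the question; the key names and the cities list are purely illustrative):

from google.appengine.ext import db

from databaseModel import Region  # model from the question


# One round trip for many Region entities instead of one get per code.
regions = Region.get_by_key_name(['01', '02', '03'])

# One round trip for many cities instead of one put per city;
# `cities` stands in for a list of InseeCity instances built elsewhere.
cities = []
db.put(cities)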
Mapreduce might be overkill for what you're doing now, but it's definitely worth wrapping your head around; it's a must-have for larger data sets.
Lastly, timing of anything on the SDK is largely pointless - think of it as a debugger more than anything else!