SQLAlchemy: Efficient way of merging duplicate tags
Let's say I have a database full of `Tag` objects. Each `Tag` has an `id` and a `name`. When I first built the database I allowed case-sensitive `Tag`s; however, I later realized I didn't need or want this capability, so I started forcing each `name` to lowercase before storing the `Tag`.
Now I have all these remnants of different `name`s which would now be stored under the same `Tag` but previously weren't. For example:
Trendy, trendy
NotHalfBad, Nothalfbad, nothalfbad
SQL, sql, Sql
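All of those variants collapse onto the same lowercase key, which is easy to see with a plain-Python sketch (the sample names below are the examples above; the grouping dict is just for illustration):

```python
from collections import defaultdict

names = ["Trendy", "trendy", "NotHalfBad", "Nothalfbad", "nothalfbad", "SQL", "sql", "Sql"]

# Group the case variants under their canonical lowercase name
groups = defaultdict(list)
for name in names:
    groups[name.lower()].append(name)

# groups["sql"] now holds ["SQL", "sql", "Sql"], and so on
```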
I am using Python and SQLAlchemy. I have created a function to clean up this mess that looks something like this:
todelete = []
for t1 in Session.query(Tag):
    if t1 not in todelete:  # If we haven't already encountered this tag
        for t2 in Session.query(Tag).filter_by(name_insensitive=t1.name):
            if t1.id != t2.id:
                merge(t1, t2)  # Calls a function I made that merges the two tags
                todelete.append(t2)
        Session.commit()

# Mark everything for deletion
for tag in todelete:
    Session.delete(tag)

# Now commit the deletes
Session.commit()
This is horribly inefficient. Is there a better way?
If it's a tool for one-time use, do you really have to care about efficiency? Just let it run for a minute (or several) rather than spending even more time optimizing it.
That said, queries are more expensive than Python loops, so loading all the `Tag`s into a list first, and then looping over that list in both passes, should speed things up:
todelete = []
tags = Session.query(Tag).all()  # One query up front instead of one per tag
for t1 in tags:
    if t1 not in todelete:
        for t2 in tags:
            if t2.id != t1.id and t2.name_insensitive == t1.name:
                merge(t1, t2)
                todelete.append(t2)
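If the double loop over the list is still too slow, a single pass with a dict keyed by the lowercase name avoids the inner loop entirely. This is a minimal sketch with a stand-in `Tag` class and a hypothetical `merge_duplicates` helper (in the real code you would call your `merge` function where the comment indicates):

```python
class Tag:
    def __init__(self, id, name):
        self.id = id
        self.name = name

def merge_duplicates(tags):
    """One pass: the first tag seen for each lowercase name becomes
    canonical; every later case variant is returned for deletion."""
    canonical = {}
    todelete = []
    for tag in tags:
        key = tag.name.lower()
        if key in canonical:
            # merge(canonical[key], tag) would go here in the real code
            todelete.append(tag)
        else:
            canonical[key] = tag
    return canonical, todelete

tags = [Tag(1, "Trendy"), Tag(2, "trendy"), Tag(3, "SQL"), Tag(4, "sql"), Tag(5, "Sql")]
canonical, todelete = merge_duplicates(tags)
# canonical keys: "trendy" and "sql"; todelete holds the tags with ids 2, 4, 5
```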
Also, remove the `commit` call from inside the loop. Not only is it expensive, but if some other process changes the DB mid-run, the list of tags you're looping over could get out of sync.
Of course, the proper way to make things more efficient is to profile first and then concentrate on the specific hot spots. You should do that if you're serious about performance.