Python Object Storage Memory/Dict/DB/Other Options

2023-01-22 03:59 问答作者：

I'm currently writing a python application that will take a directory of text files and parse them into custom python objects based on the the attributes specified in the text file. As part of my application, I compare the current loaded object data set to a previous dataset (same format) and scan it for possible duplicates, conflicts, updates, etc. However since there can be ~10,000+ objects at a time, I'm not really sure how to approach this.

I'm currently storing the previous data set in a DB as it's being used by another web app. As of now, my python application loads the 'proposed' dataset into memory (creating the rule objects), and then I store those objects in a dictionary (problem #1). Then when it comes time to compare, I use a combination of SQL queries and failed inserts to determine new/existing and exist开发者_StackOverflow中文版ing but updated entries (problem #2).

This is hackish and terrible at best. I'm looking for some advice on restructuring the application and handling the object storage/comparisons.

You can fake what Git does and load the entire set as basically a single file and parse from there. The biggest issue is that dictionaries are not ordered so your comparisons will not always be 1:1. A list of tuples will give you 1:1 comparisons. If a lot has changed this will be difficult.

Here is a basic flow for how you can do this.

Start with both tuple lists at index 0.
Compare a hash of each tuple hashlib.sha1(str(tuple1)) == hashlib.sha1(str(tuple2))
If they are equal, record the matching indexes and add 1 to each index and compare again
If the are unequal, search each side for a match and record the matching indexes
If there are no matches, you can assume there is an insert/update/delete happening and come back to it later

You can map your matching items as reference points to do further investigation into the ones that did not match. This technique can be applied at each level you drill down. You will end up with a map of what is different down to the individual values.

The nice thing is each of the slices that you create can be compared in parallel since they will not correspond to each other... unless you are moving things from one file to another.

Then again, it may be easier to use a diff library to compare the two data sets. Might as well not reinvent the wheel; even if it might be a really shiny wheel.

Check out http://docs.python.org/library/difflib.html

继续阅读：object python

Python Object Storage Memory/Dict/DB/Other Options

更多精彩内容

精彩评论

最新问答

央视是哪个频道？

请问买过的朋友，舒提啦旅行箱实际使用体验如何？？

检查不孕不育需要的费用？

海信ULED电视画质有什么不同的地方?？

钉子可以挂的住画框幕布吗？

问答排行榜

河神2九牛入海钓河妖是第几集河妖什么来历可活吞牛？

性激素六项检查的最佳时间是多久？多少钱？？

Easiest way to get words of one line from istream into a vector?

《梦在燃烧 (《三国演义》动画片主题曲)》MP3歌词-汤子星？

抽烟只抽炫赫门？