Checking for membership inside nested dict

2022-12-31 17:59 问答作者：

This is a followup questions to this one:

Python DictReader - Skipping rows with missing columns?

Turns out I was being silly, and using the wrong ID field.

I'm using Python 3.x here, btw.

I have a dict of employees, indexed by a string, "directory_id". Each value is a nested dict with employee attributes (phone number, surname etc.). One of these values is a secondary ID, say "internal_id", and another is their manager, call it "manager_internal_id". The "internal_id" field is non-mandatory, and not every employee has one.

{'6443410501': {'manager_internal_id': '989634', 'givenName': 'Mary', 'phoneNumber': '+65 3434 3434', 'sn': 'Jones', 'internal_id': '434214'}
'8117062158': {'manager_internal_id': '180682', 'givenName': 'John', 'phoneNumber': '+65 3434 3434', 'sn': 'Ashmore', 'internal_id': ''}
'9227629067': {'manager_internal_id': '347394', 'givenName': 'Wright', 'phoneNumber': '+65 3434 3434', 'sn': 'Earl', 'internal_id': '257839'}
'1724696976': {'manager_internal_id': '907239', 'givenName': 'Jane', 'phoneNumber': '+65 3434 3434', 'sn': 'Bronte', 'internal_id': '629067'}

}

(I've simplified the fields a little, both to make it easier to read, and also for privacy/compliance reasons).

The issue here is that we index (key) each employee by their directory_id, but when we lookup their manager, we need to find managers by their "internal_id".

Before, when our dict was using internal_id as the key, employee.keys() was a list of internal_ids, and I was using a membership check on this. Now, the last part of my if statement won't work, since the internal_ids is part of the dict values, instead of the key itself.

def lookup_supervisor(manager_internal_id, employees):
    if manager_internal_id is not None and manager_internal_id != "" and manager_internal_id in employees.keys():
        return (employees[manager_internal_id]['mail'], employees[manager_internal_id]['givenName'], employees[manager_internal_id]['sn'])
    else:
        return ('Supervisor Not Found', 'Supervisor Not Found', 'Supervisor Not Found')

So the first question is, how do I fix the if statement to check whether the manager_internal_id is present in the dict's list of internal_ids?

I've tried substituting employee.keys() with employee.values(), that didn't work. Also, I'm hoping for something a little more efficient, not sure if there's a way to get a subset of the values, specifically, all the entries for employees[directory_id]['internal_id'].

Hopefully there's some Pythonic way of doing this, without using a massive heap of nested for/if loops.

My second question is, how do I then cleanly return the required employee attributes (mail, givenname, surname etc.). My for loop is iterating over each employee, and calling lookup_supervisor. I'm feeling a bit stupid/stumped here.

def tidy_data(employees):
    for directory_id, data in employees.items():
        # We really shouldnt' be passing employees back and forth like this - hmm, classes?
        data['SupervisorEmail'], data['SupervisorFirstName'], data['SupervisorSurname'] = lookup_supervisor(data['manager_internal_id'], employees)

Should I redesign my data-structure? Or is there another way?

EDIT: I've tweaked the code slightly, see below:

class Employees:

    def import_gd_dump(self, input_file="test.csv"):
        gd_extract = csv.DictReader(open(input_file), dialect='excel')
        self.employees = {row['directory_id']:row for row in gd_extract}

    def write_gd_formatted(self, output_file="gd_formatted.csv"):
        gd_output_fieldnames = ('internal_id', 'mail', 'givenName', 'sn', 'dbcostcenter', 'directory_id', 'manager_internal_id', 'PHFull', 'PHFull_message', 'SupervisorEmail', 'SupervisorFirstName', 'SupervisorSurname')
        try:
            gd_formatted = csv.DictWriter(open(output_file, 'w', newline=''), fieldnames=gd_output_fieldnames, extrasaction='ignore', dialect='excel')
        except IOError:
            print('Unable to open file, IO error (Is it locked?)')
            sys.exit(1)

        headers = {n:n for n in gd_output_fieldnames}
        gd_formatted.writerow(headers)
        for internal_id, data in self.employees.items():
            gd_formatted.writerow(data)

    def tidy_data(self):
        for directory_id, data in self.employees.items():
            data['PHFull'], data['PHFull_message'] = self.clean_phone_number(data['telephoneNumber'])
            data['SupervisorEmail'], data['SupervisorFirstName'], data['SupervisorSurname'] = self.lookup_supervisor(data['manager_internal_id'])

    def clean_phone_number(self, original_telephone_number):
        standard_format = re.compile(r'^\+(?P<intl_prefix>\d{2})\((?P<area_code>\d)\)(?P<local_first_half>\d{4})-(?P<local_second_half>\d{4})')
        extra_zero = re.compile(r'^\+(?P<intl_prefix>\d{2})\(0(?P<area_code>\d)\)(?P<local_first_half>\d{4})-(?P<local_second_half>\d{4})')
        missing_hyphen = re.compile(r'^\+(?P<intl_prefix>\d{2})\(0(?P<area_code>\d)\)(?P<local_first_half>\d{4})(?P<local_second_half>\d{4})')
        if standard_format.search(original_telephone_number):
            result = standard_format.search(original_telephone_number)
            return '0' + result.group('area_code') + result.group('local_first_half') + result.group('local_second_half'), ''
        elif extra_zero.search(original_telephone_number):
            result = extra_zero.search(original_telephone_number)
            return '0' + result.group('area_code') + result.group('local_first_half') + result.group('local_second_half'), 'Extra zero in area code - ask user to remediate. '
        elif missing_hyphen.search(original_telephone_number):
            result = missing_hyphen.search(original_telephone_number)
            return '0' + result.group('area_code') + result.group('local_first_half') + result.group('local_second_half'), 'Missing hyphen in local component - ask user to remediate. '
        else:
            return '', "Number didn't match format. Original text is: " + original_telephone_number    

    def lookup_supervisor(self, manager_internal_id):
        if manager_internal_id is not None and manager_internal_id != "":# and manager_internal_id in self.employees.values():
            return (employees[manager_internal_id]['mail'], employees[manager_internal_id]['givenName'], employees[manager_internal_id]['sn'])
        else:
            return ('Supervisor Not Found', 'Supervisor Not Found', 'Supervisor Not Found')

if __name__ == '__main__':
    our_employees = Employees()
    our_employees.import_gd_dump('test.csv')
    our_employees.tidy_data()
    our_employees.write_gd_formatted()

I guess (1). I'm looking for a better way to structure/store Employee/Employees, and (2) I'm having issues in particular with lookup_supervisor().\

Should I be creating an Employee Class, and nesting these inside Employees?

And should I even be doing what I'm doing with tidy_data(), and calling clean_phone_number() and l开发者_开发知识库ookup_supervisor() on a for loop on the dict's items? Urgh. confused.

You probably will need to do some iteration to get the data. I assume you don't want an extra dict that can get out of date, so it won't be worth it trying to store everything keyed on internal ids.

Try this on for size:

def lookup_supervisor(manager_internal_id, employees):
    if manager_internal_id is not None and manager_internal_id != "":
        manager_dir_ids = [dir_id for dir_id in employees if employees[dir_id].get('internal_id') == manager_internal_id]
        assert(len(manager_dir_ids) <= 1)
        if len(manager_dir_ids) == 1:
            return manager_dir_ids[0]
    return None

def tidy_data(employees):
    for emp_data in employees.values():
        manager_dir_id = lookup_supervisor(emp_data.get('manager_internal_id'), employees)
        for (field, sup_key) in [('Email', 'mail'), ('FirstName', 'givenName'), ('Surname', 'sn')]:
            emp_data['Supervisor'+field] = (employees[manager_dir_id][sup_key] if manager_dir_id is not None else 'Supervisor Not Found')

And you're definitely right that a class is the answer for passing employees around. In fact, I'd recommend against storing the 'Supervisor' keys in the employee dict, and suggest instead getting the supervisor dict fresh whenever you need it, perhaps with a get_supervisor_data method.

Your new OO version all looks reasonable except for the changes I already mentioned and some tweaks to clean_phone_number.

def clean_phone_number(self, original_telephone_number):
    phone_re = re.compile(r'^\+(?P<intl_prefix>\d{2})\((?P<extra_zero>0?)(?P<area_code>\d)\)(?P<local_first_half>\d{4})(?P<hyph>-?)(?P<local_second_half>\d{4})')
    result = phone_re.search(original_telephone_number)
    if result is None:
        return '', "Number didn't match format. Original text is: " + original_telephone_number
    msg = ''
    if result.group('extra_zero'):
        msg += 'Extra zero in area code - ask user to remediate. '
    if result.group('hyph'):    # Note: can have both errors at once
        msg += 'Missing hyphen in local component - ask user to remediate. '
    return '0' + result.group('area_code') + result.group('local_first_half') + result.group('local_second_half'), msg

You could definitely make an individual object for each employee, but seeing how you're using the data and what you need from it, I'm guessing it wouldn't have that much payoff.

My python skills are poor, so I am far too ignorant to write out what I have in mind in any kind of reasonable time. But I do know how to do OO decomposition.

Why does the Employees class to do all the work? There are several types of things that your monolithic Employees class does:

Read and write data from a file - aka serialization
Manage and access data from individual employees
Manage relationships between exmployees.

I suggest that you create a class to handle each task group listed.

Define an Employee class to keep track or employee data and handle field processing/tidying tasks.

Use the Employees class as a container for employee objects. It can handle tasks like tracking down an Employee's supervisor.

Define a virtual base class EmployeeLoader to define an interface (load, store, ?? ). Then implement a subclass for CSV file serialization. (The virtual base class is optional--I'm not sure how Python handles virtual classes, so this may not even make sense.)

So:

create an instance of EmployeeCSVLoader with a file name to work with.
The loader can then build an Employees object and parse the file.
As each record is read, a new Employee object will be created and stored in the Employees object.
Now ask the Employees object to populate supervisor links.
Iterate over the Employees object's collection of employees and ask each one to tidy itself.
Finally, let the serialization object handle updating the data file.

Why is this design worth the effort?

It makes things easier to understand. Smaller, task focused objects are easier to create clean, consistent APIs for.

If you find that you need an XML serialization format, it becomes trivial to add the new format. Subclass your virtual loader class to handle the XML parsing/generation. Now you can seamlessly move between CSV and XML formats.

In summary, use objects to simplify and structure your data. Section off common data and behaviors into separate classes. Keep each class tightly focused on a single type of ability. If your class is a collection, accessor, factory, kitchen sink, the API can never be usable: it will be too big and loaded with dissimilar groups of methods. But if your classes stay on topic, they will be easy to test, maintain, use, reuse, and extend.

继续阅读：python

Checking for membership inside nested dict

更多精彩内容

精彩评论

最新问答

央视是哪个频道？

请问买过的朋友，舒提啦旅行箱实际使用体验如何？？

检查不孕不育需要的费用？

海信ULED电视画质有什么不同的地方?？

钉子可以挂的住画框幕布吗？

问答排行榜

河神2九牛入海钓河妖是第几集河妖什么来历可活吞牛？

性激素六项检查的最佳时间是多久？多少钱？？

Easiest way to get words of one line from istream into a vector?

《梦在燃烧 (《三国演义》动画片主题曲)》MP3歌词-汤子星？

抽烟只抽炫赫门？