Normalizing/validating international data sets in a database?
Let's say you are dealing with your normal contact database (you know... name, phone number, address, email, etc...). If you only care about local data, this is generally not a big issue to deal with, but it becomes one once we look at international data sets.
Looking at the phone number system, you would think it's simple, but it really isn't. In North America, we generally use the 1-222-333-4444 format for calling people. This is of course divided into your international dialing code, area code, exchange prefix and line number. Problem: real phone numbers are limited. There are around 220 area codes in the US out of a potential 1000, each area code only has a limited number of exchanges, and the line numbers are restricted to specific uses within that country (for example, patterns with 911 are restricted, and only about three quarters of the 10,000 are in use). Take this over to the UK: they have their own set of rules for line numbers, such as reserving most of the 0300-0399 block for specific uses, and other restrictions. International codes are also limited. Normalizing area codes and exchanges, and putting data validation checks onto phone numbers, just got complicated. I'm not going into detail about places that are not part of the NPA scheme, but let's just accept that we can't really trust the North American template, kick back, and call it a day.
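To illustrate, here is the kind of naive NANP-only check I mean (a Python sketch; the pattern and the exclusions shown are only a small sample of the real numbering rules):

```python
import re

# Naive NANP-style check, illustrative only.
NANP_PATTERN = re.compile(r"^1-(\d{3})-(\d{3})-(\d{4})$")

def is_plausible_nanp(number: str) -> bool:
    match = NANP_PATTERN.match(number)
    if not match:
        return False
    area_code, exchange, _line = match.groups()
    # Area codes and exchanges may not start with 0 or 1, and N11 codes
    # (211, 311, ..., 911) are reserved for services.
    for part in (area_code, exchange):
        if part[0] in "01" or part.endswith("11"):
            return False
    return True

print(is_plausible_nanp("1-222-333-4444"))  # True
print(is_plausible_nanp("1-911-333-4444"))  # False: reserved pattern
```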
How do we normalize for things like this? How do we validate data? How do we deal with these seemingly ad-hoc extension codes or instructions for internal dialing?
International addresses are not much better: it isn't just the data retained that differs, the output formats aren't the same across the board either. How do we deal with international postal codes, when in Canada the format is A1A1A1 and the USA uses a system such as 55555[-4444]?
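A naive per-country sketch for postal codes looks like this (Python; only two countries are shown, which is exactly the scaling problem):

```python
import re

# Per-country postal code patterns, illustrative only.
POSTAL_PATTERNS = {
    # Canada: letter-digit-letter digit-letter-digit, e.g. "A1A1A1" or "A1A 1A1"
    "CA": re.compile(r"^[A-Za-z]\d[A-Za-z] ?\d[A-Za-z]\d$"),
    # USA: ZIP or ZIP+4, e.g. "55555" or "55555-4444"
    "US": re.compile(r"^\d{5}(-\d{4})?$"),
}

def is_valid_postal_code(country: str, code: str) -> bool:
    pattern = POSTAL_PATTERNS.get(country)
    return bool(pattern and pattern.match(code))

print(is_valid_postal_code("CA", "A1A1A1"))      # True
print(is_valid_postal_code("US", "55555-4444"))  # True
```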
I'm tempted to just write classes for each of these situations as I encounter them and store them in the database as XML/JSON/similar, but then how do I relate fields and easily search my content? I don't want to end up creating a monster of thousands of tables, one per country. I want an easily scalable solution where I can normalize my addresses and validate the content. Is this too much to ask?
A way of approaching this problem may be:
adopt three views of addresses/phone numbers/post codes etc.
The first view is that of addresses (say) as multiple lines of text.
The second view is that of address tags (more on this below).
The third view is that of validation rules for addresses.
The other components that are needed for this approach are a generic procedure (class/trigger) for validation; a formatting routine for printing purposes; and a rule base, together with an administrator's mechanism to update the validation rules. You also need a "catchall" rule that says this is a valid address - it has been validated manually - even if it meets none of the rules in your rule base.
The components: 1. An address is made up of multiple lines, each of which has an associated sequence number and a number of tags (usually one). It is possible to also associate with an address line the set of rules, and the versions of those rules, that it was validated against, but this is a refinement that depends on your update/insert/calculation rates.
2. The address tags are things like city, town, and house number; they identify the different lines of an address. It is possible to have an address line that does not have any tags, but in this case only generic searches (eg for New York) are possible on the full set of lines. Searches for "City = New York" are not possible.
3. Rules written in a domain specific language. This could be regular expressions; a language specific to you; or your normal development language (though this is likely to be the least useful approach, as programming languages can find it difficult to concisely and accurately represent rules of the sort I am talking about). Examples of representative rules might be (dealing with your description of US Zip Codes): the first five characters must be digits.
the first five characters represent an "area code".
the zip code must be the last line of an address.
The rules will be divided into groups and sets (eg US addresses). Rules must be able to refer to other rules as well as to the address data.
Your validation routine needs to take the lines of an address and apply the rules to them (usually by set). It will return a boolean - valid or invalid address - and optionally the set of rules it validated against.
The printing routine again will apply the appropriate rule set (probably different from your validation set) to the address data to produce a formatted address; a rough sketch of both routines, together with some sample rules, follows below.
I hope the other components are obvious from the overall approach.
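To make those components concrete, here is a minimal sketch (Python, purely illustrative - the rule set names, tags and regular expressions are my assumptions, not a real rule base) of rules held as data, a generic validation routine and a formatting routine:

```python
import re

# Rule sets held as data rather than code. Each rule is a (tag, regex) pair
# applied to one line of an address; a set validates an address only if every
# line matches some rule in the set.
RULE_SETS = {
    "US address": [
        ("house number and road", re.compile(r"^\d+ .+$")),
        ("city",                  re.compile(r"^[A-Za-z .'-]+$")),
        ("zip code",              re.compile(r"^\d{5}(-\d{4})?$")),
    ],
    "manual": [],  # catchall: accepted by manual intervention
}

def validate(lines):
    """Return (is_valid, rule_set_name, tags) for a list of address lines."""
    for set_name, rules in RULE_SETS.items():
        if not rules:
            continue
        tags = []
        for line in lines:
            tag = next((t for t, rx in rules if rx.match(line)), None)
            if tag is None:
                break
            tags.append(tag)
        else:
            return True, set_name, tags
    return False, None, []

def format_address(lines, tags, order):
    """Output the lines in the order given by a (country-specific) tag ordering."""
    by_tag = dict(zip(tags, lines))
    return "\n".join(by_tag[t] for t in order if t in by_tag)

ok, which_set, tags = validate(["9 Richmond Road", "New York", "10001"])
print(ok, which_set, tags)
print(format_address(["9 Richmond Road", "New York", "10001"], tags,
                     ["house number and road", "city", "zip code"]))
```

In a real system the rule sets would of course live in the database and be maintained through the administrator's mechanism mentioned above, rather than being hard-coded.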
This approach is intended to deal with these issues identified in your question:
Partitioning of phone codes.
Limited number of possible area codes in use.
Blocks of phone numbers reserved for specific purposes.
Normalisation of data - the data is normalised. However, this type of normalisation (reverse indexing) is not generally used, except in data warehouse software, and databases containing massive real time sensor information. It is possible that in implementing this solution you may end up choosing to (controllably) duplicate data. This is not an inherent part of the solution, but may be convenient.
I would strongly suggest NOT adding classes for each variant - this is neither scalable nor maintainable.
Searching is covered in detail below.
Monster tables are avoided - the rule base is likely to be of the order of hundreds to low thousands of rules additional to the actual data.
The solution is scalable - one simply adds or amends rules.
It also deals with some related problems:
Even if you can apply validation rules to national formats of addresses, there will always be exceptions to the standards for the particular country. My own address is an example - I live on a boat and need additional information included in my address, over and above the Post Office standard address. Anomalies of this sort are always likely to need manual intervention - hence the rule for addresses accepted by manual intervention.
Different countries have different orderings for addresses - addresses in China for instance are written: Country; Post Code; City; City Zone; Street Name; House Number; Person Name.
Acceptance of the first address from an area where you have no rules, and where the rules of the country are different from any you have recorded.
People wanting to use (eg an office) address different from "their" address.
Specific individual addressing concerns - somebody wishing to conceal their correspondence from their nearest and dearest.
The solution can be extended to similar problems - eg referring to a person. This can involve titles (Dr, Rev, etc); multiply hyphenated names (ffoulkes-symthe); qualifications (CPA, BSc, etc); familial qualifications (the third, etc); and multiple modes of naming depending on the person's culture (people from the Indian subcontinent often do not have a family name, and each family member will apparently have a different surname).
Changes to addressing rules (eg addition of a new Zip Code for a new development) can be made easily and quickly.
Problems still arising are:
Searches will be somewhat more complicated - needing to search all lines and associated tags of addresses rather than specific address fields.
It takes time to build up a rule base of this sort - whilst initial rule loading can be done fairly quickly (examples are present in your question), anything like a complete set will only be present after you have dealt with multiple exceptions and anomalies.
EDIT:
I wasn't aware of BigTable when I wrote my response. Having had a very quick look at it, it does seem to be a very similar concept. As for implementing it in an ACID environment, I think this is possible for the contact dataset and for the rule database. From experience I do know it scales easily to 5*10^7 sets of contact details in such an environment (though with background, non-time-critical validation of contact details).
In considering the ACID/SQL case, my use of the word "view" may have started a hare off in a direction I did not intend. A more appropriate word may have been overview or outlook (something with no relational model or DBMS freight attached). In fact, I would consider each of the things I referred to as a view to be a candidate table.
I set out a sketch of a schema for my approach to assist discussion. This schema uses some M:N connections, which would obviously be normalised out in an implementation as associative tables. This is left as an exercise for the reader :-). I have put a few sample attributes and data into some of the tables.
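The original diagram is not reproduced here, so purely as an assumed reading of the tables named in this answer (not the actual sketch), the candidate tables might look roughly like this in Python form:

```python
from dataclasses import dataclass

# Assumed candidate tables, one dataclass per table; the M:N links (rule to
# rule, tag to address line via Tag Usage) would become associative tables.

@dataclass
class AddressUser:
    user_id: int
    name: str

@dataclass
class Address:
    address_id: int
    user_id: int            # -> AddressUser

@dataclass
class AddressLine:
    line_id: int
    address_id: int         # -> Address
    sequence: int           # position of the line within the address
    text: str               # e.g. "9 Richmond Road"

@dataclass
class Tag:
    tag_id: int
    name: str               # e.g. "House Number", "Road Name", "Needs Validation"

@dataclass
class TagUsage:             # associative table: which tags apply to which lines
    tag_id: int
    line_id: int

@dataclass
class RuleSet:
    rule_set_id: int
    name: str               # e.g. "US addresses"

@dataclass
class Rule:
    rule_id: int
    rule_set_id: int        # -> RuleSet
    expression: str         # e.g. a regular expression such as r"^\d{5}(-\d{4})?$"

# Example row, corresponding to the "9 Richmond Road" discussion below:
line = AddressLine(line_id=1, address_id=1, sequence=1, text="9 Richmond Road")
```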
In the sketch, editing of Rule Set, Rule, and rules relating to other rules (the administrative application) can obviously be done with the ACID properties by SQL. Normal CRUD operations on Address User, Address, and Address Line can equally be done in an ACID/SQL manner, taking the address lines as input by the user. You may wish to do some preprocessing of the transaction to separate out (in my example) House Number from Road Name and consequently lose the "Road Name" rule. Other pre-processing that is likely to be useful is the standardisation of capitalisation - though this could also be part of the validation step. It is also possible to simply accept the complete line "9 Richmond Road" as input and tag it as "Needs Validation". In either case there is no problem (that I am aware of) with making this transaction ACID.
Where I am less clear is how the validation and subsequent tagging can be incorporated into the ACID/SQL transaction. It does seem that to make the validation step ACID, it may be necessary to order the application of the testing against the rule sets (with the most common cases tested first, and stopping when a validation is successful). It may also be necessary, in order to make the transaction ACID, to validate only against your most common case, leaving the others tagged as "Needs Validation", which would then be done as a background task.
The actual task of validation involves checking the entire rule base, set by set, until a set of rules is found which validates all the input lines. There are obviously potential guides from the data itself - if you have recorded the country, then this both enables you to tag the line with the country straight away and provides a filter for the rule sets you have to test against. I am sorry, but this is as far as I can take this aspect so far.
Incidentally, this sketch schema only goes part way towards total normalisation. As drawn, each address has its own sequence of address lines, and there is no normalisation of the data beyond the individual lines as input (plus or minus any pre-processing you choose to do). It is possible to take this approach even further - by making the link between Address and Address Line M:N and ensuring that the Line field of the Address Line table is its primary key.
When it comes to more detailed resources for this concept I have two problems.
The first (and trivial) one is that this concept is my original work, based on my experience over more than twenty years as a method consultant and IT Strategy consultant with a special interest in development environment technology and development methods. All my working life has been spent in environments where contact details have been a major concern (for financial and regulatory/legislative reasons). In fact my original response to your question was complete in my mind before I had finished reading your question, even though it did then take me about three-quarters of an hour to get it typed up.
The more important reason is that some of the sources of this idea are confidential or secret. In my working career, part of my job involved keeping up to date with technology developments, and predicting the impact of technology on the business in ten years time. This involved visiting research labs and having discussions with leading researchers about a variety of topics. While I am not, myself, a first class researcher, I do seem to be very good at synthesising the research efforts of others. However, while doing this I was always operating under conditions of commercial confidentiality and/or military secrecy. None of my answer has breached these conditions. As a result of this I can only give vague guidelines as to how the information was derived.
My sources are:
a seminar conducted by C J Date, at IBM, exploring the further reaches of normalisation and the Relational Model (NOT the Relational Model as implemented in SQL). This involved exploration of fifth (?) and sixth (?) normal forms.
a series of discussions with Oracle technical staff over a period of time, discussing meta data; meta meta data; and other such generalisations.
discussions with a UK based military research establishment. Though this was some years ago, I am not sure whether anything has ever been published on the topics we were discussing.
working in a large financial institution whose contact details system was shaped much like my proposal, but arose from non-relational roots; and for which the original technical impetus was space saving in an era when memory, persistent storage, and back-up capacity were major concerns.
EDIT:
I completed the above edit, shut down my computer, did some household chores, went to bed, lay down, closed my eyes, and I had the solution to the part I could not complete in that previous edit.
While there are updates to be done when tagging/validating, much of the work is actually reading and comparison. As such it is a prime candidate for optimistic locking. In pseudo code this would look like:
Start read transaction
Read set of address lines where one or more lines are "Needs Validation"
Read all rule set names
Read all rule lines which belong to the first rule set
If read is not successful then abandon and start process again
End read transaction
Do while address not validated and not end of rule sets
If set of address lines validates against first rule set then
prepare transaction by allocating the tags to be applied to each line of the
address and temporarily recording the rule set that has validated the address
Start validation transaction
Read same set of address lines and rule set (all fields)
Is the address and the rule set identical to what was obtained in the read
transaction?
If identical then
create appropriate set of Tag Usage records
if successful then
commit validation transaction
end overall process
if not successful or
if not identical then
rollback validation Tag and Tag Usage
** This is the point where there might be problems if the rule set has
changed since reading it. The ordering may have changed due to amendment
deletion or creation of rule sets. This is the reason for the read of all
the rule set names initially. However if there is such a change one can
afford to fail to validate and move on, aiming to come back to this address
later.
End of validation transaction
Start read transaction
Read rule lines which belong to the next rule set
End read transaction
End while loop
If end of rule sets reached and address not validated
begin create Tag Usage transaction
create tag usage pointing to Tag "Manual Intervention needed"
end create Tag Usage transaction
All three types of transaction (Address and Address Line updates; Rule Set and Rule updates; and Tag and Tag Usage updates) fulfil the ACID conditions. I strongly believe (but have not proved) that any interleaving, combination, or intersecting set of these three types of transaction will also fulfil the ACID conditions.
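As a rough, in-memory sketch of that optimistic pattern (snapshot, validate outside any lock, re-read and compare before writing) - not tied to any particular DBMS, and with the rule and tag structures assumed:

```python
import re

# Simulated "database": a version number stands in for the re-read-and-compare
# step of optimistic locking.
address = {"version": 1, "lines": ["9 Richmond Road", "New York", "10001"],
           "tags": ["Needs Validation"]}
rule_sets = {"US address": [r"^\d+ .+$", r"^[A-Za-z .'-]+$", r"^\d{5}(-\d{4})?$"]}

def validates(lines, rules):
    return len(lines) == len(rules) and all(
        re.match(rx, line) for rx, line in zip(rules, lines))

def validate_optimistically(address, rule_sets):
    # 1. Read transaction: snapshot the address and the rule set names.
    snapshot_version = address["version"]
    snapshot_lines = list(address["lines"])

    # 2. Outside any lock: find a rule set that validates the snapshot.
    for name, rules in rule_sets.items():
        if validates(snapshot_lines, rules):
            # 3. Validation transaction: re-read and compare before writing.
            if address["version"] == snapshot_version and \
               address["lines"] == snapshot_lines:
                address["tags"] = [name]          # create the Tag Usage records
                address["version"] += 1
                return True
            return False   # changed underneath us: give up and retry later

    # 4. No rule set matched: leave it for manual intervention.
    address["tags"] = ["Manual Intervention needed"]
    return False

print(validate_optimistically(address, rule_sets), address["tags"])
```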
Possibly already answered for phone numbers at least. You could do something similar for postal codes.
If I were to implement this, I would save phone numbers, zip codes, etc. as regular strings. In particular the data should be stored in the format the end user needs it in. (Assuming every end user has the same needs.) E.g. a German address is "Roadname 123" while a U.S. address is "123 Roadname". Do the same for zip codes and combine them with the city name. You could save the addresses as address_line_1 (street name and house number, in the country-specific order the user enters) and address_line_2 (postal code, city name, ...).
If you still need to search your database for specific postal codes, you might write a regular expression or even a function for that. As for city names, you could strip them out of address_line_2 and with high probability you end up with the postal code.
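A small sketch of what that extraction might look like, assuming address_line_2 holds the postal code together with a known city name:

```python
import re

# Assumed helper: strip a known city name out of address_line_2 and treat the
# remainder as the postal code. Purely illustrative.
def extract_postal_code(address_line_2: str, city: str) -> str:
    remainder = re.sub(re.escape(city), "", address_line_2, flags=re.IGNORECASE)
    return remainder.strip(" ,")

print(extract_postal_code("10115 Berlin", "Berlin"))             # "10115"
print(extract_postal_code("Toronto ON M5V 3L9", "Toronto ON"))   # "M5V 3L9"
```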
I think writing validations for each country would be tremendous work - that's about 200 countries... How can you be sure that you didn't miss some local conventions? You could write a function eq that, for instance, evaluates eq("ABCDE-34", "ABCDE.34") == true.
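Such an eq function could be as simple as comparing the two values with case and non-alphanumeric separators stripped; a minimal sketch:

```python
import re

def eq(a: str, b: str) -> bool:
    """Compare two codes, ignoring case and any non-alphanumeric separators."""
    normalize = lambda s: re.sub(r"[^0-9A-Za-z]", "", s).upper()
    return normalize(a) == normalize(b)

print(eq("ABCDE-34", "ABCDE.34"))  # True
```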
Though I don't really see the point in writing both client-side and server-side validations. Even if the client is a web browser, you might reuse the server's validations via AJAX.
In the end it depends on the DBMS you are using (support for Java stored procedures?) and your client-side language. It also depends on how the data is entered (is it entered very inaccurately? in a web browser?) and what you want to do with it. (Are you planning to feed Skype with phone numbers from your database, or are these read by humans who type them into their phone?) Do you need to do some specific join operations? And of course it depends on how many man-hours you are able to spend solving this problem...