Find telephone numbers - finding numbers with and without a phone extension
I have a table with about 130,000 records containing telephone numbers. The numbers are all formatted like this: +4311234567. They always include the international country code, the local area code and then the phone number, and sometimes an extension.
There is a web service which looks up the caller's number in the table. That service already works. But now the client wants the service to also return a result when someone calls from a company whose number is already in the database, even if the caller's extension is not.
Example table:

**id** | **telephonenumber** | **name**
--- | --- | ---
1 | +431234567 | company A
2 | +431234567890 | employee in company A
3 | +4398765432 | company b
Now, if somebody from company A calls from a different extension, for example +43123456777, then it should return id 1. The problem is that I don't know how many digits the extensions have. It could be 3, 4 or more digits.
Are there any patterns for this kind of string matching?
The data is stored in a SQL Server 2005 database.
Thanks
EDIT:
I am getting the telephone numbers from a CRM system. I've talked with the admin of the CRM, and he is trying to send me the data in a different format:

**id** | **telephonenumber** | **extension** | **name**
--- | --- | --- | ---
1 | +431234567 | | company A
2 | +431234567 | 890 | employee in company A
3 | +4398765432 | | company b
Is there a way to determine exactly which part of the stored number is an extension? Or are the "base" numbers without extension stored as well? If yes, you could just check whether a number in your database (without extension) is a prefix of the number to check. A prefix is a substring that starts at the beginning of the string.
But if you only have numbers with extensions in your database and there is no way to find out how many digits belong to the extension, I believe you cannot find an exact solution.
Given that the number of digits in the extension can be different for each company and the number of digits in the number could be different for each country and area code, this is a tricky problem to do efficiently.
Even if you get the data table split into base number and extension, you still have to split the incoming number into base number and extension, which I actually think complicates things.
What I would be inclined to try is:
Original format
- Try to match the incoming number with the database.
  - If it matches one record, you have your answer - a specific person.
  - If it matches more than one record, something has gone wrong, so fail.
  - Otherwise, you have to find the company:
    - Strip off the trailing digit from the incoming number and try to match this with the database again.
    - If the number of digits drops below a threshold (probably 6 digits) then your search should probably fail. This is just to limit the number of database searches performed when the number isn't going to be found.
    - If it matches no records, then you need to try this step again.
    - If it matches more than one record, something has gone wrong, so fail.
    - If it matches exactly one record, you have your next best answer - the company.
For example, searching for "+43123456777":
- +43123456777 matches 0 entries.
- +4312345677 matches 0 entries.
- +431234567 matches 1 entry: "Company A"
The main failure mode of this approach is if a company has variable length extension numbers. For instance consider what happens if both 431234567890 and 43123456789 are valid numbers but only the second one is in the database. If the incoming number is 431234567890, then 43123456789 will be matched in error.
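As a rough illustration, the steps for the original format could be sketched in Python as below. This is only a sketch under assumptions: the in-memory `records` list stands in for the database table, `lookup_original` and `min_digits` are hypothetical names, and the threshold counts digits without the leading `+`.

```python
def lookup_original(incoming, records, min_digits=6):
    """Find a record by stripping trailing digits until exactly one matches.

    records: list of (id, telephonenumber, name) tuples (stand-in for the table).
    Returns the matching record, or None if nothing is found or the match
    is ambiguous.
    """
    candidate = incoming
    # Stop once fewer than min_digits digits remain (the '+' is not counted).
    while len(candidate.lstrip("+")) >= min_digits:
        matches = [r for r in records if r[1] == candidate]
        if len(matches) == 1:
            return matches[0]   # a specific person, or the company
        if len(matches) > 1:
            return None         # duplicates: something has gone wrong
        candidate = candidate[:-1]  # strip the trailing digit and retry
    return None


records = [
    (1, "+431234567", "company A"),
    (2, "+431234567890", "employee in company A"),
    (3, "+4398765432", "company b"),
]
print(lookup_original("+43123456777", records))   # (1, '+431234567', 'company A')
print(lookup_original("+431234567890", records))  # (2, '+431234567890', 'employee in company A')
```

The failure mode described above shows up here too: if only "+43123456789" were stored, a call from "+431234567890" would be matched to it after one digit is stripped.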
Split format
This is a little more complex, but more robust.
- Try to match the incoming number with the database.
  - If it matches one record, you have your answer - the company.
  - If it matches more than one record, match the entry without an extension and you have found the company.
  - Otherwise, you have to find the base company number and extension:
    - Strip off the trailing digit from the incoming number and try to match this with the database again.
    - If the number of digits drops below a threshold (probably 6 digits) then your search should probably fail. This is just to limit the number of database searches performed when the number isn't going to be found.
    - If it matches no records, then you need to try this step again.
    - If it matches one record, then you have found your answer - the company.
    - If it matches more than one record, then you have found the base number of the company and thus now know the extension, so can try to look up the specific person:
      - Strip the base number from the start of the original incoming number and use this to search the extensions of the records with that base number.
      - If it matches exactly one record, you have found a specific person.
      - If it doesn't match a specific person, match the entry without an extension and you have found the company.
For example, searching for "+43123456777":
- +43123456777 matches 0 entries.
- +4312345677 matches 0 entries.
- +431234567 matches 2 entries: "empty:Company A" & "890:employee in company A"
- Within these two matches "77" matches nothing, so return the empty extension: "Company A".
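The split-format steps can be sketched along the same lines. Again this is only an illustration: the `records` list stands in for the table in the new CRM format, an empty string is assumed to mean "no extension", and `lookup_split` and `min_digits` are hypothetical names.

```python
def lookup_split(incoming, records, min_digits=6):
    """Find a person or company when base number and extension are stored separately.

    records: list of (id, telephonenumber, extension, name) tuples, where an
    empty extension marks the company's own entry (an assumption of this sketch).
    """
    candidate = incoming
    while len(candidate.lstrip("+")) >= min_digits:
        matches = [r for r in records if r[1] == candidate]
        if len(matches) == 1:
            return matches[0]  # only one entry for this base number
        if len(matches) > 1:
            # candidate is the base number; whatever was stripped is the extension
            extension = incoming[len(candidate):]
            exact = [r for r in matches if r[2] == extension]
            if len(exact) == 1:
                return exact[0]  # a specific person (or the company itself)
            company = [r for r in matches if r[2] == ""]
            return company[0] if company else None
        candidate = candidate[:-1]  # strip the trailing digit and retry
    return None


records = [
    (1, "+431234567", "", "company A"),
    (2, "+431234567", "890", "employee in company A"),
    (3, "+4398765432", "", "company b"),
]
print(lookup_split("+43123456777", records))   # (1, '+431234567', '', 'company A')
print(lookup_split("+431234567890", records))  # (2, '+431234567', '890', 'employee in company A')
```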
Implementation notes
This algorithm, as noted above, does have some efficiency problems. If the database lookup is expensive, it has a linear cost related to the length of the telephone number, especially in the case where no similar numbers exist in the database (for example, if the incoming number is from Kazakhstan, but there are no Kazakhstan numbers in the database).
You could add some optimisations relatively easily though. If most of the companies you deal with use 3 or 4 digit extensions, you could start by stripping, say, 4 digits off the end and then doing a binary chop until you reach an answer. This would reduce the number of lookups for a 15 digit number to 4 or 5 in many cases, and to at most 6.
Also, every time you narrow the selection, you could select only within the previous selection rather than having to select within the whole database.
Additional implementation notes
Having finally worked out how Unreason's answer works, I can see that it is a much simpler, more elegant solution. I wish I'd thought of simply looking for the database number in the incoming number rather than the other way around.
My only concern is that performing this on every telephone number in the database might impose excessive demands on the server. I would suggest benchmarking that solution under maximum stress to see if it causes problems. If not, fine - use that. If it does, consider implementing the simple form of my algorithm and doing the stress tests again. If the performance is still too low, try my binary search suggestion.
Instead of looking for the telephone number in the database, you could invert the problem and check every number in the database to see if it either matches or prefixes the incoming number.
Assuming you get a phone number such as +431234567891 from caller ID, then
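    -- String literals need single quotes in T-SQL; [Table] is a placeholder name.
    -- = 1 means the stored number must match at the start, i.e. be a prefix.
    SELECT name, id
    FROM [Table]
    WHERE CHARINDEX(telephonenumber, '+431234567891') = 1;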
would return the company, and in case of +431234567890 would return 2 records
- company
- actual extension
If you can deal with two rows returned from the client side you should be fine with the above.
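One possible way to deal with two rows on the client side is to treat the longest stored number as the most specific match. The Python sketch below only illustrates that idea: the `rows` list stands in for the table, and `find_matches` is a hypothetical helper, not part of any existing service.

```python
def find_matches(incoming, rows):
    """Return every row whose stored number equals or prefixes the incoming number.

    rows: list of (id, telephonenumber, name) tuples (stand-in for the table).
    The result is ordered from most specific (longest number) to least specific.
    """
    matches = [r for r in rows if incoming.startswith(r[1])]
    return sorted(matches, key=lambda r: len(r[1]), reverse=True)


rows = [
    (1, "+431234567", "company A"),
    (2, "+431234567890", "employee in company A"),
    (3, "+4398765432", "company b"),
]
# Known extension: two rows, the employee first, then the company.
print([r[0] for r in find_matches("+431234567890", rows)])  # [2, 1]
# Unknown extension: only the company row.
print([r[0] for r in find_matches("+431234567891", rows)])  # [1]
```

Taking the first element gives the employee when the extension is known and falls back to the company otherwise.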
Preprocessing the data is better (performance-wise), but for that you need to describe the data in more detail, for example:
- are extensions only 3 and 4 digits,
- is the base number always 9 or 10 digits,
- do you always have at least one extension number for companies with extensions, etc...
The number of digits in an extension is PBX-specific. The number of digits in an area code plus phone number is country/carrier-specific.
One way to do it would be to define additional rules, for example ...
+43123 | 12
... to say that anything beginning with +43123 is a 12-digit number, and that anything beyond that is an extension: this lets you use (configurable instead of hard-coded) data to specify where an extension would begin.
Another way might be to insist that for any number-with-extension entry there should also be a corresponding number-without-extension, as shown in your example of "company A".
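A rule table like this could be applied as in the following Python sketch. It rests on assumptions that the rule itself does not settle: the length is taken to count digits without the leading `+`, the longest matching rule prefix wins, and `split_number` and `rules` are hypothetical names.

```python
def split_number(incoming, rules):
    """Split an incoming number into (base number, extension) using length rules.

    rules: dict mapping a number prefix to the length of the base number in
    digits (assumed here to exclude the leading '+').
    Numbers that match no rule are returned whole, with an empty extension.
    """
    applicable = [prefix for prefix in rules if incoming.startswith(prefix)]
    if not applicable:
        return incoming, ""
    prefix = max(applicable, key=len)  # the most specific rule wins
    base_len = rules[prefix] + (1 if incoming.startswith("+") else 0)
    return incoming[:base_len], incoming[base_len:]


rules = {"+43123": 12}
print(split_number("+431234567890123", rules))  # ('+431234567890', '123')
print(split_number("+4998765", rules))          # ('+4998765', '')
```

The base number returned this way can then be looked up with a plain equality match.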
Well, my understanding of the phone number system is that no two valid/complete numbers can exist where one is a prefix of the other. A common prank over here is to give out your number as 11 05 32 or something, where 110 is the German emergency police number.
So - if you can change the database structure and preprocess the data, you could look for numbers that share a prefix (sort them first; if a longer number starts with a shorter one, the longer one is an extension). Every match is
- A base number (the shortest one)
- A direct number plus extension (all longer ones)
I'd mark those in the database for faster lookup, if possible.
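The preprocessing step could look roughly like the Python sketch below. It is an illustration only: `mark_base_numbers` is a hypothetical helper, and it relies on plain string sorting placing a number before any longer number that starts with it.

```python
def mark_base_numbers(numbers):
    """Map each stored number to its base number.

    A number is its own base unless a shorter stored number is a prefix of it.
    """
    result = {}
    bases = []
    # Sorting puts a prefix directly before the numbers that extend it.
    for number in sorted(set(numbers)):
        base = next((b for b in bases if number.startswith(b)), None)
        if base is None:
            bases.append(number)  # no shorter prefix stored: this is a base number
            base = number
        result[number] = base
    return result


stored = ["+431234567890", "+431234567", "+4398765432"]
for number, base in mark_base_numbers(stored).items():
    kind = "base" if number == base else "extension of " + base
    print(number, "->", kind)
```

The resulting mapping could be written back to the table as an extra column, so that later lookups only need to search the base numbers.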
This approach falls short for the case where you have a common default extension. Over here lots of companies give out something like 1234567-0 as external number, where 0 can be replaced with the 2-4 digit extension. For these cases my approach would fall short - for your example data it would work though?
If you are dealing with phone numbers from different countries, it will be almost impossible. The length often changes, even within the same country. If you know what the lengths will be (or you want to maintain a list like ChrisW said), you can use the LEFT(field, x) function to truncate the phone number before searching for the company's phone number. Note that if you are doing a join, it will probably run much slower because it has to run the function on every row.
That will be impossible without further information: if your table is structured as above, the system has no means of knowing which part is the base number and which part is the extension. So it would return "company b" for any (unknown) number starting with "+439".
EDIT (@MarkBooth)
I stand by my claim that it's impossible without additional information. Just to make it clearer: say we have the following information in our database
...
+43316852132 - ....
+433168731 - Company A (reception)
+433168739999 - Company A, Mr. X
+433168911321 - ....
...
The structure of these numbers is +43 (316) 873 - 1, which the program doesn't know. So if a number +43316872133 (+43 (316) 87 21 33 with structure) is calling (which is not in the database), you (and therefore your software :)) cannot tell if it belongs to company A or not without further information.
The only solution would be to maintain "base numbers" for companies against which you can do a simple prefix search.