Building postcode address lookup based on Royal Mail PAF Raw data
I am working on custom build software for postcode lookup based on Royal Mail PAF data. Main p开发者_C百科urpose of that software is to replace Quick Address (Third party software vendor).
I have a few questions
How come Quick Address data files including indexes are under 500MB whereas if you look at PAF raw data it's over 2.50GB. What cleanup and compression techniques they have performed on raw data to achieve that. My imported Db size is 2.50GB (sqlite). I have to use some free/open source Db and paid Db is not my option here.
There are 28 million records. How can I improve search by organization name or town for example considering it can be performed using "LIKE" statement?
Any idea?
Don't store information that you dont need such as DPS, occupancy and the various company flags
Rather than hold 28million addresses you could hold 1.8m addresses one for each postcode and have a list of devivery points for each postcode (i.e. house number, house/building names)
I'm not sure which version of the PAF you have, the relational version with keys or expanded version.
The keyed version will reduce file size as you just need to have addresses made up of numbers pointing to lookup tables for locality, street , street end etc. But using keys in your addresses will not help searching by org or town name.
Views will help format your output address from keys. Make sure that the database you use have views that can use indexes otherwise youl'll end up table scanning.
What I have done in the past is indexed the PAF using the full text search engine sphinx http://sphinxsearch.com/ which gives you very powerfull search (including partial words & fuzzy match) on whatever words you decide to index. Try all words in address. The result from sphinx is a list of keys that you can be used to iterate thru a sql result set. The sql query can be against a address table of keys that can be used to build the full address from lookup tables. The sphinx index build is remarkably fast and produces a reasoably small index size.
Mysql might be a better fit than sqlite for a database of this size.
Other things to consider. Are you doing batch processing or just transactional – forget sphinx for batch processing. Frequency of update. If you dont update monthly you will become hopelssly out of date in a very short time.
Note: If you have the keyed version of the PAF there are some horrible rules for formatting addresses and many undocumented exceptions.
Is file size an issue for you? I'd only worry about compression if file size matters - it almost never does anymore, and 2.5 GB is not prohibitive in most situations.
If you really do have to compress the data, you almost certainly won't be able to use an off the shelf database system; I'm guessing that Quick Address uses something like ZIP to compress the data.
As for the second question - can you give an example of your table, and the kind of query you want to optimize? In most post code lookup systems, the only query that matters is searching by postcode and returning the matching addresses; this should be really quick as long as you've indexed the post code column, no matter how many records you have.
You want to try a space-filling-curve or a spatial index. A sfc reduce 2d complexity to a 1d complexity. I've done something similar with a zipcode search. You want to look at my php implementation of a sfc at phpclasses.org (hilbert-curve). You want to look for Nick's hilbert curve quadtree spatial index blog.
For the cityname lookup you want to look for a trie datastructure. A trie is a dictionary datastructure. You want to look at my kart-trie implementation in php at phpclasses.org (kart-trie). It's worst-case complexity is IMO log(n+k) where n is the string length and k is the key length. You want to convert the kart-trie to a nested set because a kart-trie is different from a radix-trie or a crit-bit trie such that it has only 2 leafs per node. You want to look for php trie and wildcards http://phpir.com/tries-and-wildcards.
I second Tom Gurney's view... you are doing a lot of work for very little benefit. Plus you are taking on the responsibility of updating the data all the time - extra work.
I assume you are plugging postcode lookup into a website or an internal application?
There are a few hosted providers that do the work for you, plus your costs might be LOWER than going direct to Royal Mail, plus you hardly need to lift a finger once things are integrated....
I work for CraftyClicks, a PAF solutions provider, so have some vested interest here..... http://www.craftyclicks.co.uk/
Alternatively to the PAF from the Post Office, you can use 192.com website to lookup addresses.
I have been using this method successfully for the last few months and have not had any problems.
Here's my Lookup Class.
Imports System.Net
Imports System.IO
Public Class PCLookup
Property Addresses As List(Of Address)
Public Sub New(Postcode As String)
GetAddresses(CreatDoc(Postcode))
End Sub
Private Function CreatDoc(PostCode As String) As mshtml.HTMLDocument
Dim URL As String = FormatPostcode(PostCode)
If URL = "" Then Return New mshtml.HTMLDocument
Dim request As HttpWebRequest = WebRequest.Create(URL)
Dim response As HttpWebResponse = request.GetResponse()
Dim reader As StreamReader = New StreamReader(response.GetResponseStream())
Dim doc As New mshtml.HTMLDocument
Dim objDoc As mshtml.IHTMLDocument2 = doc
Dim param As Object() = {reader.ReadToEnd()}
objDoc.write(param)
response.Close()
reader.Close()
Return objDoc
End Function
Private Function FormatPostcode(Postcode As String) As String
Dim FullURL As String = "http://www.192.com/places/"
Do Until Postcode.Contains(" ") = False
Postcode = Replace(Postcode, " ", "")
Loop
If Len(Postcode) > 7 Or Len(Postcode) < 5 Then
Return ""
End If
If Len(Postcode) = 5 Then
FullURL &= Mid(Postcode, 1, 1) & "/"
FullURL &= Mid(Postcode, 1, 2) & "-" & Mid(Postcode, 3, 1) & "/"
FullURL &= Mid(Postcode, 1, 2) & "-" & Mid(Postcode, 3) & "/"
End If
If Len(Postcode) = 6 Then
If IsNumeric(Mid(Postcode, 2, 1)) Then
FullURL &= Mid(Postcode, 1, 1) & "/"
FullURL &= Mid(Postcode, 1, 3) & "-" & Mid(Postcode, 4, 1) & "/"
FullURL &= Mid(Postcode, 1, 3) & "-" & Mid(Postcode, 4) & "/"
Else
FullURL &= Mid(Postcode, 1, 2) & "/"
FullURL &= Mid(Postcode, 1, 3) & "-" & Mid(Postcode, 4, 1) & "/"
FullURL &= Mid(Postcode, 1, 3) & "-" & Mid(Postcode, 4) & "/"
End If
End If
If Len(Postcode) = 7 Then
FullURL &= Mid(Postcode, 1, 2) & "/"
FullURL &= Mid(Postcode, 1, 4) & "-" & Mid(Postcode, 5, 1) & "/"
FullURL &= Mid(Postcode, 1, 4) & "-" & Mid(Postcode, 5) & "/"
End If
Return FullURL
End Function
Private Sub GetAddresses(ObjDoc As mshtml.HTMLDocument)
Dim Obj As mshtml.IHTMLElementCollection = ObjDoc.getElementsByTagName("td")
Addresses = New List(Of Address)
For Each TD As mshtml.HTMLTableCell In Obj
If TD.className = "address" Then
Dim FullAddress As String = TD.innerText
Addresses.Add(New Address(FullAddress))
End If
Next
End Sub
End Class
And The Address Class
Public Class Address
Property Line1 As String
Property Line2 As String
Property Line3 As String
Property Line4 As String
Property Postcode As String
Public Sub New(FullAddress As String)
Dim Obj As Object = Split(FullAddress, ", ")
Select Case UBound(Obj)
Case 4
Line1 = Obj(0) & " " & Obj(1)
Line2 = ""
Line3 = Obj(2)
Line4 = Obj(3)
Postcode = Obj(4)
Case 5
Line1 = Obj(0) & " " & Obj(1)
Line2 = Obj(2)
Line3 = Obj(3)
Line4 = Obj(4)
Postcode = Obj(5)
Case 6
Line1 = Obj(0) & " " & Obj(1)
Line2 = Obj(2) & " " & Obj(3)
Line3 = Obj(4)
Line4 = Obj(5)
Postcode = Obj(6)
End Select
End Sub
End Class
I hope this is of use to someone else.
Rich.
Depends on your exact requirements.
frequency: you can get a one-off set of data files, annual or monthly, so depends how up to date your data needs to be. There is a free sample db with 2 towns addresses (York and somewhere else) to try out and start building with.
type: you can get a full set of data files each time, or deltas that you have to apply the changes yourself.
structure: as DGD said, you can get keyed or expanded.
If you need new addresses, using the deltas, considering RM make many thousands of changes each month, not just additions, but removals, combining addresses, and converting (business <-> residential), would be a huge amount of work to apply yourself. Especially in relation to maintaining unique address keys that could also be in use elsewhere in your own applications db.
Based on monthly full data files, expanded, which includes the 'NotYetBuilt' file of newly registered addresses, I built a system to reload the whole set each month, split into 2 parts: 1.) downloaded the last set of data, expand the files, etc, onto disk and 2). load the new data into a db
For part 2 as you load the data you could construct a full address string for each entry as you go (to return for search matches). As there are over 31 million addresses you can't use LIKE or regular search syntax. Build a FULLTEXT Index across the fields you want to use for searching and a stored proc to use CONTAINSTABLE for FREETEXT to search it.
Was easier to build then I expected, the difficulty is with things like: dealing with multiple files each with up to 31+ million records in, current and new addresses being in different files, no 'County' values (they were officially dropped from UK addresses in 2000) they are in a different file to load if needed, etc.
精彩评论