Monday, 9 June 2008

Importing the NLPG

The NLPG is the National Land and Property Gazetteer, a database of over 30 million residential and business properties in England and Wales. The dataset is available to local authorities for free under a licensing agreement, as are updates to the data on a daily, weekly or monthly basis.

The first thing that is apparent when you view the dataset in Excel (Intelligent Addressing, the company that administers the data, provide a small sample in .csv format on their website http://www.nlpg.org.uk/) is that it does not adhere to a common schema - there is a mixture of different row types, each with a different number of elements.

Here are some example rows:
10,"Intelligent Addressing Limited",520,2008-05-01,1,2008-05-01,101427,7.3,"F"
11,"I",1,17800298,1,520,2,2002-07-25,1,8,0,2002-07-25,2003-12-04,2002-07-25,,522749,272912,522601,272879,10
15,"I",2,17800298,"BASSENTHWAITE","","HUNTINGDON","CAMBRIDGESHIRE","ENG"
11,"I",3,17800366,1,520,2,2002-07-25,1,8,0,2002-07-25,2003-10-02,2002-07-25,,522686,272799,522590,273061,10
15,"I",4,17800366,"PROVENCE ROAD","","HUNTINGDON","CAMBRIDGESHIRE","ENG"

The data adheres to a standard called Data Transfer Format 7.3. This contains a range of different record types - for example, type 11 is a Street Record and type 24 is a Land and Property Identifier (essentially a house on a plot of land). This 'jagged' data format makes it extremely difficult to define a 'catch all' XML schema. It's clear then that the file needs some pre-processing before we can even think about loading the data into our hub or passing messages from it into BizTalk.
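To make the 'jagged' shape concrete, here is a minimal sketch that reads the record type from the front of a row and counts its fields. The module and function names are made up for illustration, the rows are taken from the sample above, and the count is naive: it assumes no value contains a comma inside its quotes.

```vbnet
Imports System

Module JaggedRowsSketch
    ' Returns the record type identifier, i.e. the text before the first comma.
    Function RecordType(ByVal row As String) As String
        Dim firstComma As Integer = row.IndexOf(","c)
        If firstComma < 0 Then Return row
        Return row.Substring(0, firstComma)
    End Function

    ' Naive field count: every comma is treated as a separator.
    Function FieldCount(ByVal row As String) As Integer
        Return row.Split(","c).Length
    End Function

    Sub Main()
        Dim rows() As String = {
            "10,""Intelligent Addressing Limited"",520,2008-05-01,1,2008-05-01,101427,7.3,""F""",
            "11,""I"",3,17800366,1,520,2,2002-07-25,1,8,0,2002-07-25,2003-10-02,2002-07-25,,522686,272799,522590,273061,10",
            "15,""I"",4,17800366,""PROVENCE ROAD"","""",""HUNTINGDON"",""CAMBRIDGESHIRE"",""ENG"""
        }
        ' Print the record type and field count of each sample row.
        For Each row As String In rows
            Console.WriteLine("type {0}: {1} fields", RecordType(row), FieldCount(row))
        Next
    End Sub
End Module
```

Rows of different types report different field counts, which is why a single flat schema does not fit the file.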

Here's a short script that will read the .csv file, debatch it into rows, then debatch each individual row into its component elements and finally generate a dynamic SQL statement to get the data into the right table in a database. It's written in VB.NET; check the comments on the top level method to find out what you need to have in place to make it work.



Public Function NLPGDataLoad() As Boolean
'this is the main function in the NLPG debatching process
'the process assumes that you have four tables set up in a SQL Server database
'(define the connection in the web.config file)
' 1) NLPGStreetRecord
' 2) NLPGStreetDescriptor
' 3) NLPGBasicLandAndPropertyUnit
' 4) NLPGLandAndPropertyIdentifier
'It also assumes that you have the NLPG .csv file somewhere on the server
Dim objReader As StreamReader 'streamreader object to hold the contents of the text file
'file path of .csv file (VB.NET strings do not need backslashes escaping)
Dim sNLPGFilePath As String = "c:\documents and settings\perryma.kmbc\desktop\nlpg.csv"
'string variable to pass the file into
Dim sNLPGFile As String
'string array to hold each row of the NLPG file
Dim sNLPGDebatched() As String
Dim bProcessNLPGRecords As Boolean
Try
'load the full NLPG .csv file into the StreamReader object
objReader = New StreamReader(sNLPGFilePath)
sNLPGFile = objReader.ReadToEnd() 'read csv file into a string variable
objReader.Dispose() 'kill the StreamReader
'two more function calls now
'pass the full string into the
'DebatchNLPGFile function and
'get back a string array with all the separate rows
sNLPGDebatched = DebatchNLPGFile(sNLPGFile)
'then pass the string array into the ProcessNLPGDebatched function and
'put each row of the array into the relevant database table
bProcessNLPGRecords = ProcessNLPGRecords(sNLPGDebatched)
Catch ex As Exception
'something went wrong reading or processing the file so report failure
Return False
End Try
Return bProcessNLPGRecords
End Function


'this function takes the full nlpg file in a string variable and debatches it
'into separate array elements - one per row
Function DebatchNLPGFile(ByVal sNLPGFile As String) As String()
'use a counter to run through each character in the NLPG string
Dim iCounter As Integer
'use a marker to record where each cr/lf is in the string
Dim iMarker As Integer = 0
Dim iArrayElements As Integer = 0
'set up a one element (0) array
Dim sNLPGDebatched(iArrayElements) As String
Try
'run through the nlpg string one character at a time
For iCounter = 0 To sNLPGFile.Length - 2
'when the loop hits a cr/lf in the string
If (sNLPGFile.Substring(iCounter, 2) = vbCrLf) Then
'increase the size of the array by one and preserve its existing content
iArrayElements = iArrayElements + 1
ReDim Preserve sNLPGDebatched(iArrayElements)
'add the content of the string before the cr/lf to the array
sNLPGDebatched(iArrayElements) = sNLPGFile.Substring(iMarker, iCounter - iMarker)
'set the marker to the current counter and add 2 to the value for cr/lf
'stops the search from including all previous characters in string
iMarker = iCounter + 2
End If
Next
Catch ex As Exception
End Try
'return the array of debatched rows
Return sNLPGDebatched
End Function
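
For comparison, the same row debatching can be sketched with the built-in String.Split method. This is an alternative under stated assumptions rather than a drop-in replacement: the module and function names are hypothetical, and unlike DebatchNLPGFile the array it returns starts at element 0 and keeps a final row that has no trailing cr/lf.

```vbnet
Imports System
Imports Microsoft.VisualBasic

Module SplitRowsSketch
    ' Splits the file content into rows on CR/LF (or a bare LF) and drops empty rows.
    Function SplitIntoRows(ByVal sNLPGFile As String) As String()
        Dim separators() As String = {vbCrLf, vbLf}
        Return sNLPGFile.Split(separators, StringSplitOptions.RemoveEmptyEntries)
    End Function

    Sub Main()
        ' Three short made-up rows; the last one has no trailing line break.
        Dim sample As String = "10,""header""" & vbCrLf & "11,""I"",1" & vbCrLf & "15,""I"",2"
        For Each row As String In SplitIntoRows(sample)
            Console.WriteLine(row)
        Next
    End Sub
End Module
```

Because ProcessNLPGRecords starts reading at element 1, passing this array to it unchanged would skip the first row. In the sample data that row is the type 10 header, which is ignored anyway, but the loop would be better started at 0.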


'takes a string array of debatched rows and puts each row into the relevant
'database table
Function ProcessNLPGRecords(ByVal sNLPGDebatched() As String) As Boolean
Dim iCounter As Integer
Dim sMessage As String
Dim bInsertMessageIntoDatabase As Boolean
Try
'work through each element in the array one at a time...
'check to see what the 'header' of each row is
For iCounter = 1 To sNLPGDebatched.Length - 1
'********************************************************
'we're interested in four of the row types in the NLPG
'11, 15, 21 and 24
'every time one of these is found pass the row and its header type
'into the InsertMessageIntoDatabase function
'************** MESSAGE 10 (HEADER) *********************
If (sNLPGDebatched(iCounter).Substring(0, 2) = "10") Then
'************** MESSAGE 11 (STREET RECORD) ***********
ElseIf (sNLPGDebatched(iCounter).Substring(0, 2) = "11") Then
sMessage = sNLPGDebatched(iCounter)
bInsertMessageIntoDatabase = InsertMessageIntoDatabase(sMessage, 11)
'************** MESSAGE 15 (STREET DESCRIPTOR) *******
ElseIf (sNLPGDebatched(iCounter).Substring(0, 2) = "15") Then
sMessage = sNLPGDebatched(iCounter)
bInsertMessageIntoDatabase = InsertMessageIntoDatabase(sMessage, 15)
'************** MESSAGE 21 (BLPU) ********************
ElseIf (sNLPGDebatched(iCounter).Substring(0, 2) = "21") Then
sMessage = sNLPGDebatched(iCounter)
bInsertMessageIntoDatabase = InsertMessageIntoDatabase(sMessage, 21)
'************** MESSAGE 24 (LPI) *********************
ElseIf (sNLPGDebatched(iCounter).Substring(0, 2) = "24") Then
sMessage = sNLPGDebatched(iCounter)
bInsertMessageIntoDatabase = InsertMessageIntoDatabase(sMessage, 24)
End If
Next
Catch ex As Exception
End Try
Return True
End Function


'take the message / row content and its type and insert into relevant
'database table
Function InsertMessageIntoDatabase(ByVal sMessage As String, ByVal iMessageType As Integer) As Boolean
Dim iCounter As Integer
Dim iMarker As Integer = 0
Dim iArrayElements As Integer = 0
Dim sMessageDebatched(iArrayElements) As String
sMessageDebatched(iArrayElements) = "Empty"
Dim sQueryHeader As String
Dim sQueryValues As String = ""
Dim connection As SqlConnection
Dim command As SqlCommand
Try
'********************************************************************
'each individual message needs debatching too because it's divided into a number
'of comma separated fields
'so this short algorithm divides each row up into a number of array elements
'it then builds a dynamic sql string and processes this against the database
'********************************************************************
'work through each character in the message
For iCounter = 0 To sMessage.Length - 1
'when the loop hits a comma
If (sMessage.Substring(iCounter, 1) = ",") Then
'increase the size of the message array by 1 and preserve existing content
iArrayElements = iArrayElements + 1
ReDim Preserve sMessageDebatched(iArrayElements)
'add the text before the comma into the array element
sMessageDebatched(iArrayElements) = sMessage.Substring(iMarker, iCounter - iMarker)
'reset the marker
iMarker = iCounter + 1
End If
Next
'this is a little flaw in my rusty undergraduate programming skills
'if i set the loop above to work right to the end of the string it bombs out
'set it any shorter and it misses out the last element of the message
'so I have to start at the end of the message and work backwards to get the last
'element
'so set the marker to look at the end of the message
iMarker = sMessage.Length - 1
'then iterate backwards through the message
For iCounter = sMessage.Length - 1 To 0 Step -1
'hit a comma, resize array, add element and then exit loop - this only
'needs doing once
If (sMessage.Substring(iCounter, 1) = ",") Then
iArrayElements = iArrayElements + 1
ReDim Preserve sMessageDebatched(iArrayElements)
sMessageDebatched(iArrayElements) = sMessage.Substring(iCounter + 1, iMarker - iCounter)
iCounter = 0
End If
Next
'****************************************
'**** CLEAN UP CONTENTS OF ARRAY ********
'we're into the dynamic sql statement building phase now
'i.e. replace any blank elements with the NULL string (better for database)
'also identify any elements which are obviously datetime (NLPG has the yyyy-mm-dd
'format as standard which is great) and set them to be cast to datetime in the
'database otherwise they just end up as 1905-01-01
For iCounter = 1 To sMessageDebatched.Length - 1
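If (sMessageDebatched(iCounter).Length < 1) Then
'blank element so use the NULL string
sQueryValues = sQueryValues & "NULL,"
ElseIf (sMessageDebatched(iCounter).Length = 10 AndAlso sMessageDebatched(iCounter).Substring(4, 1) = "-" AndAlso sMessageDebatched(iCounter).Substring(7, 1) = "-") Then
'yyyy-mm-dd element so cast it to datetime
'(assumed: the check for the two hyphens and the exact CAST wording)
sQueryValues = sQueryValues & "CAST('" & sMessageDebatched(iCounter) & "' AS datetime),"
Else
sQueryValues = sQueryValues & sMessageDebatched(iCounter) & ","
End If
Next
'pick the table that matches the message type
If (iMessageType = 11) Then
sQueryHeader = "INSERT INTO NLPGStreetRecord VALUES ("
ElseIf (iMessageType = 15) Then
sQueryHeader = "INSERT INTO NLPGStreetDescriptor VALUES ("
ElseIf (iMessageType = 21) Then
sQueryHeader = "INSERT INTO NLPGBasicLandAndPropertyUnit VALUES ("
ElseIf (iMessageType = 24) Then
sQueryHeader = "INSERT INTO NLPGLandAndPropertyIdentifier VALUES ("
Else
Return False
End If
'add the values, drop the trailing comma and close the bracket
sQueryHeader = sQueryHeader & sQueryValues
sQueryHeader = sQueryHeader.Substring(0, sQueryHeader.Length - 1)
sQueryHeader = sQueryHeader.Insert(sQueryHeader.Length, ")")
'(assumed) swap the double quotes round text values for single quotes
sQueryHeader = sQueryHeader.Replace("""", "'")
'run the statement against the database
'(assumed: "NLPG" is a placeholder for the connection string name in your web.config,
'read here through System.Configuration.ConfigurationManager)
connection = New SqlConnection(ConfigurationManager.ConnectionStrings("NLPG").ConnectionString)
command = New SqlCommand(sQueryHeader, connection)
connection.Open()
iCounter = command.ExecuteNonQuery()
connection.Close()
Catch ex As Exception
Return False
End Try
Return True
End Function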
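
One thing to watch with the comma by comma split in InsertMessageIntoDatabase: it treats every comma as a field separator, so a quoted value that itself contains a comma would be cut in two. A minimal sketch of a quote-aware split, using the TextFieldParser class that ships with VB.NET, could look like this. The module and function names are hypothetical, the street name in the example row is made up, and the table names are the four listed in the comments of NLPGDataLoad.

```vbnet
Imports System
Imports System.IO
Imports Microsoft.VisualBasic.FileIO

Module FieldSplitSketch
    ' Splits one row into fields, honouring double quotes around text values.
    ' The surrounding quotes are removed from the returned values.
    Function SplitFields(ByVal sMessage As String) As String()
        Using parser As New TextFieldParser(New StringReader(sMessage))
            parser.TextFieldType = FieldType.Delimited
            parser.SetDelimiters(",")
            parser.HasFieldsEnclosedInQuotes = True
            Return parser.ReadFields()
        End Using
    End Function

    ' Maps a record type to its table name; returns Nothing for other types.
    Function TableForRecordType(ByVal iMessageType As Integer) As String
        Select Case iMessageType
            Case 11
                Return "NLPGStreetRecord"
            Case 15
                Return "NLPGStreetDescriptor"
            Case 21
                Return "NLPGBasicLandAndPropertyUnit"
            Case 24
                Return "NLPGLandAndPropertyIdentifier"
            Case Else
                Return Nothing
        End Select
    End Function

    Sub Main()
        ' Made-up type 15 row whose street name contains a comma.
        Dim sMessage As String = "15,""I"",2,17800298,""HIGH STREET, NORTH"","""",""HUNTINGDON"""
        Dim fields() As String = SplitFields(sMessage)
        Console.WriteLine("table: {0}", TableForRecordType(Integer.Parse(fields(0))))
        For Each field As String In fields
            Console.WriteLine("[{0}]", field)
        Next
    End Sub
End Module
```

Because the quotes are stripped, the values would then need to be passed to the database as parameters (or re-quoted) rather than pasted straight into the SQL string.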

5 comments:

Samppa said...

Hi, could you please post the actual link to the sample file? I can't find it on the nlpg.org.uk pages.

Thanks for the explanation

Nyssa and Mike said...

I've been importing NLPG Data into a SQL database.
My first step was to put ALL the .csv files into a separate directory.
Then, using Classic ASP on the NLPG 7.3 data files, I read the files in one at a time, splitting them into 3 new .csv files: File_15, File_21 and File24.

I then read the files in one at a time using DTS and thus creating sql tables for each.


DTS is far quicker than SQL reading down a .csv file record for importing purposes.
I am now looking at SSIS to do the work for me, perhaps using a direct read per NLPG .csv file.

The biggest problem with the SQL is getting the server timeout to be really big, otherwise the batch keeps aborting.

Mike Backhouse

just Google mikbak

Mark Perry said...

The link to the NLPG sample data is here: http://www.iahub.net/docs/1209656492446.csv

Mark Perry said...

Nyssa and Mike - after doing a load more work in this area with different data sets I agree that DTS / SSIS would be a much faster approach to solving this problem. We haven't done anything with the NLPG data and I think that if there was a requirement to import it I would go down the SSIS route.

Thanks for your comments - these are the first comments I have had on my blog and it's all very exciting.

Nyssa and Mike said...

Still importing the data and have imported into 3 sql tables 15,21 and 24
Have a routine written now which has cross table query and puts each postcode ref into a %%%%_postcode table which I then search using Ajax to display the records ...

Job done !!
Mike