Friday, 27 June 2008

Importing the NLPG (2)

Phew! This has been a bit of a nightmare. The individual .csv files are too large for either my laptop or the development server that I use (each has 2GB of RAM). What I have ended up doing is the following:
  1. Written a VB script to split each master .csv file into a number of separate ones, each containing 5000 rows (the last one holds the remainder).
  2. Used a second VB script to 'pre-process' each file ready for import into SQL Server. This involves working through each file created in step 1 line by line and adding the correct number of commas to match the number of columns in the table schema I am importing into.
  3. When I end up with thousands of pre-processed .csv files I will need to import them into SQL Server. My web trawling on how to do this with a T-SQL script has drawn a blank but it looks like I will be able to create an SSIS package to achieve the import. This is something I have never done before so it's yet more learning to go through.
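Step 1 above can be sketched roughly as follows. This is not the script I actually ran, just a minimal VB.NET illustration of the idea: it streams the master file one line at a time, so the whole file never has to fit into RAM. The module name, function name and chunk file naming are all made up for the example, and it assumes .NET 4 or later for File.ReadLines.

```vbnet
Imports System.IO

Module CsvSplitter
    ' Splits sourcePath into numbered chunk files of at most rowsPerFile rows.
    ' Lines are read one at a time, so memory use stays small.
    ' Returns the number of chunk files written.
    Function SplitCsv(ByVal sourcePath As String, ByVal outputFolder As String,
                      Optional ByVal rowsPerFile As Integer = 5000) As Integer
        Dim fileCount As Integer = 0
        Dim rowsInChunk As Integer = 0
        Dim writer As StreamWriter = Nothing
        Try
            For Each line As String In File.ReadLines(sourcePath)
                If writer Is Nothing Then
                    ' start a new chunk file, e.g. master_00001.csv (hypothetical naming)
                    fileCount += 1
                    Dim chunkName As String = String.Format("{0}_{1:D5}.csv",
                        Path.GetFileNameWithoutExtension(sourcePath), fileCount)
                    writer = New StreamWriter(Path.Combine(outputFolder, chunkName))
                End If
                writer.WriteLine(line)
                rowsInChunk += 1
                If rowsInChunk = rowsPerFile Then
                    ' chunk is full - close it so the next line opens a new file
                    writer.Dispose()
                    writer = Nothing
                    rowsInChunk = 0
                End If
            Next
        Finally
            ' close the last, partly filled chunk (the remainder)
            If writer IsNot Nothing Then writer.Dispose()
        End Try
        Return fileCount
    End Function
End Module
```

Calling SplitCsv with a master file path and an existing output folder writes the chunks into that folder and returns how many were created.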

I have tested steps 1 and 2 over the last few days. Step 1 is pretty slow to run (it gets through a file in about two hours and there are 70+ files), so I might have to look at splitting the job between several servers. Step 2 is very fast - I did 177 files in about 10 minutes on my laptop today.
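The padding pass in step 2 boils down to something like the sketch below. Again this is an illustration rather than the real script: PadRow is a made-up name, and the function simply counts raw commas, so it assumes that no field contains a comma inside its quotes.

```vbnet
Module CsvPadder
    ' Appends empty fields (extra commas) to a row until it has columnCount fields.
    ' Rows that already have enough fields are returned unchanged.
    Function PadRow(ByVal row As String, ByVal columnCount As Integer) As String
        Dim fieldCount As Integer = row.Split(","c).Length
        If fieldCount >= columnCount Then Return row
        Return row & New String(","c, columnCount - fieldCount)
    End Function
End Module
```

Each line of a 5000-row file is passed through PadRow with the column count of the target table before being written back out.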

Friday, 20 June 2008

Structure of the initial NLPG dataset

Today I got my first look at the initial NLPG dataset. I accessed the data on the NLPG / Intelligent Addressing FTP server. I was expecting to see one big .csv file but no, there are actually 19 separate .zip files on the server representing large geographic areas, e.g. Greater London, North East, Wales. If you open one of the .zip files the .csv files are inside, but each large geographic area is subdivided, so East Midlands has a different .csv for Leicestershire, Nottinghamshire, Derbyshire etc. Oh, and each large geographic region has a unitary and a non-unitary .zip file, but it looks like there is only one unitary .csv file in each .zip file.

Phew! Wikipedia (http://en.wikipedia.org/wiki/Administrative_counties_of_England) reckons there are 49 administrative counties in England and Wales so I can expect to see 49 .csv files (plus another 9 unitary .zip files) making 58 in total.

I will need to debatch each of these separately into SQL Server and then run a join script to create (hopefully) a unique record for each distinct address in the country. I can then load these as entities into the data hub.

Thursday, 12 June 2008

Processing the NLPG (part 1)

Here's the plan for processing the NLPG update file every night:
  1. File is grabbed from NLPG FTP server and put onto a location on a server on our LAN. This might be something we can do with Biztalk or we might have to find another way.
  2. The file will need debatching and its data inserting into a series of separate tables. Biztalk can automate this process - it can pick up the file, put its contents into a string object, pass this object to the web service that I have written for debatching and get a success/fail message back to report on.
  3. Now I have four tables containing the information that I need. I can run my filtering script against these to produce a collection of unique UPRNs that have changed since the last update (or are new inserts), and I need to pass these into the data hub.
  4. The changes / updates will be fed into an 'NLPGChanges' table and picked up by the Biztalk SQL Adapter. This will require a custom schema to be written for the NLPG data as defined by my database script.
  5. Biztalk will process each individual update / insert, convert it into hub-compliant XML and pass it into the hub via the Biztalk adapter.

That's the plan so far.

What data do I want to take from the NLPG?

There is a wealth of property information in the NLPG and to take everything would be overkill. My aim is to build a series of single row address records for each distinct UPRN in the database - no duplicates allowed! This data can be brought together from four of the record types in the NLPG:

  • Street Record (11)
  • Street Descriptor (15)
  • Basic Land and Property Unit (21)
  • Land and Property Identifier (24)

by writing a database query that joins the tables together. Referential integrity is pretty easy - the join is either made on UPRN or USRN.

The quest for no duplicates is hampered, however, by the fact that each LPI and BLPU has a status. These range from 1 to 9 and indicate that the record is approved, awaiting approval or historic. Obviously I don't want to include all the historic data - I just want the most up to date information about an address. Fortunately each record also has a last updated date - so I can take the most recently updated record - and a processing order - so I can take the highest value of this for each UPRN too. It's great working with a dataset that has been well designed!

By taking this approach you seriously reduce the amount of data that you have to store, and searching is therefore quicker when you're trying to use this data at the enterprise level. The test extract file from Intelligent Addressing contained 1506 rows of LPIs; by using a filtering script only 347 distinct current addresses are created - 77% of the information is discarded because it's not required.
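The real filtering is done by a database script, but the rule itself (per UPRN, keep the most recently updated record and break ties on the highest processing order) can be sketched in VB.NET with LINQ. The LpiRecord class and its field names are hypothetical, cut-down stand-ins for the real LPI columns, and the sketch assumes the rows have already been loaded into memory.

```vbnet
Imports System
Imports System.Collections.Generic
Imports System.Linq

Module CurrentAddressFilter
    ' Hypothetical, simplified shape of an LPI row
    Class LpiRecord
        Public Uprn As Long
        Public LastUpdated As Date
        Public ProcessOrder As Integer
        Public Address As String
    End Class

    ' Picks the newest record from a set that all share one UPRN:
    ' latest LastUpdated first, highest ProcessOrder as the tie-breaker.
    Function PickLatest(ByVal candidates As IEnumerable(Of LpiRecord)) As LpiRecord
        Return candidates.OrderByDescending(Function(r) r.LastUpdated).
                          ThenByDescending(Function(r) r.ProcessOrder).
                          First()
    End Function

    ' Returns exactly one record per distinct UPRN.
    Function LatestPerUprn(ByVal records As IEnumerable(Of LpiRecord)) As List(Of LpiRecord)
        Return records.GroupBy(Function(r) r.Uprn).
                       Select(Function(g) PickLatest(g)).
                       ToList()
    End Function
End Module
```

The result has one entry per UPRN, which is the same shape as the 347 current addresses that came out of the 1506-row test extract.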

Monday, 9 June 2008

Importing the NLPG

The NLPG is the National Land and Property Gazetteer; a database of over 30 million residential and business properties in England and Wales. This dataset is available to local authorities for free under a licensing agreement as are updates to the data on a daily, weekly or monthly basis.

The first thing that is apparent when you view the dataset in Excel (Intelligent Addressing, the company that administers the data, provide a small sample in .csv format on their website http://www.nlpg.org.uk/) is that it does not adhere to a single common schema - it contains a mixture of different row types, each with a different number of elements.

Here are some example rows:
10,"Intelligent Addressing Limited",520,2008-05-01,1,2008-05-01,101427,7.3,"F"
11,"I",1,17800298,1,520,2,2002-07-25,1,8,0,2002-07-25,2003-12-04,2002-07-25,,522749,272912,522601,272879,10
15,"I",2,17800298,"BASSENTHWAITE","","HUNTINGDON","CAMBRIDGESHIRE","ENG"
11,"I",3,17800366,1,520,2,2002-07-25,1,8,0,2002-07-25,2003-10-02,2002-07-25,,522686,272799,522590,273061,10
15,"I",4,17800366,"PROVENCE ROAD","","HUNTINGDON","CAMBRIDGESHIRE","ENG"

The data adheres to a standard called Data Transfer Format 7.3. This contains a range of different record types - for example, type 11 is a Street Record and type 24 is a Land and Property Identifier (essentially a house on a plot of land). This 'jagged' data format makes it extremely difficult to define a 'catch all' XML schema. It's clear then that the file needs some pre-processing before we can even think about loading the data into our hub or passing messages from it into Biztalk.
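Since the first field of every row is the record identifier, the jagged file can be untangled by sorting rows into one bucket per record type before doing anything else with them. The sketch below shows the idea in VB.NET; the module and function names are invented for the example and it is separate from the listing that follows.

```vbnet
Imports System
Imports System.Collections.Generic

Module RecordRouter
    ' Returns the record identifier (the number before the first comma),
    ' or -1 if the row does not start with one.
    Function GetRecordType(ByVal row As String) As Integer
        Dim commaAt As Integer = row.IndexOf(","c)
        Dim recordType As Integer
        If commaAt > 0 AndAlso Integer.TryParse(row.Substring(0, commaAt), recordType) Then
            Return recordType
        End If
        Return -1
    End Function

    ' Groups rows by record identifier so that each group shares one layout.
    Function GroupByRecordType(ByVal rows As IEnumerable(Of String)) As Dictionary(Of Integer, List(Of String))
        Dim groups As New Dictionary(Of Integer, List(Of String))
        For Each row As String In rows
            Dim recordType As Integer = GetRecordType(row)
            If Not groups.ContainsKey(recordType) Then
                groups(recordType) = New List(Of String)
            End If
            groups(recordType).Add(row)
        Next
        Return groups
    End Function
End Module
```

Once the rows are grouped like this, every group has a consistent number of columns and can be mapped to its own table or schema.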

Here's a short script that will read the .csv file, debatch each row, then debatch each of these individual rows into the component elements and finally generate a dynamic SQL statement to get the data into the right table in a database. It's written in VB.net; check the comments on the top level method to find out what you need to have in place to make it work.



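'namespaces used by the functions below (these go at the top of the code file)
Imports System.Data.SqlClient
Imports System.IO

Public Function NLPGDataLoad() As Boolean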
'this is the main function in the NLPG debatching process
'the process assumes that you have four tables set up in a SQL Server database
'(define the connection in the web.config file)
' 1) NLPGStreetRecord
' 2) NLPGStreetDescriptor
' 3) NLPGBasicLandAndPropertyUnit
' 4) NLPGLandAndPropertyIdentifier
'It also assumes that you have the NLPG .csv file somewhere on the server
Dim objReader As StreamReader 'streamreader object to hold the contents of the text file
'file path of .csv file
Dim sNLPGFilePath As String = "c:\documents and settings\perryma.kmbc\desktop\nlpg.csv"
'string variable to pass the file into
Dim sNLPGFile As String
'string array to hold each row of the NLPG file
Dim sNLPGDebatched() As String
Dim bProcessNLPGRecords As Boolean
Try
'load the full NLPG .csv file into the StreamReader object
objReader = New StreamReader(sNLPGFilePath)
sNLPGFile = objReader.ReadToEnd() 'read csv file into a string variable
objReader.Dispose() 'kill the StreamReader
'two more function calls now
'pass the full string into the
'DebatchNLPGFile function and
'get back a string array with all the separate rows
sNLPGDebatched = DebatchNLPGFile(sNLPGFile)
'then pass the string array into the ProcessNLPGDebatched function and
'put each row of the array into the relevant database table
bProcessNLPGRecords = ProcessNLPGRecords(sNLPGDebatched)
Catch ex As Exception
'report the failure to the caller rather than swallowing it silently
Return False
End Try
Return bProcessNLPGRecords
End Function


'this function takes the full nlpg file in a string variable and debatches it
'into separate array elements - one per row
Function DebatchNLPGFile(ByVal sNLPGFile As String) As String()
'use a counter to run through each character in the NLPG string
Dim iCounter As Integer
'use a marker to record where each cr/lf is in the string
Dim iMarker As Integer = 0
Dim iArrayElements As Integer = 0
'set up a one element (0) array
Dim sNLPGDebatched(iArrayElements) As String
Try
'run through the nlpg string one character at a time
For iCounter = 0 To sNLPGFile.Length - 2
'when the loop hits a cr/lf in the string
If (sNLPGFile.Substring(iCounter, 2) = vbCrLf) Then
'increase the size of the array by one and preserve its existing content
iArrayElements = iArrayElements + 1
ReDim Preserve sNLPGDebatched(iArrayElements)
'add the content of the string before the cr/lf to the array
sNLPGDebatched(iArrayElements) = sNLPGFile.Substring(iMarker, iCounter - iMarker)
'set the marker to the current counter and add 2 to the value for cr/lf
'stops the search from including all previous characters in string
iMarker = iCounter + 2
End If
Next
Catch ex As Exception
End Try
'return the array of debatched rows
Return sNLPGDebatched
End Function


'takes a string array of debatched rows and puts each row into the relevant
'database table
Function ProcessNLPGRecords(ByVal sNLPGDebatched() As String) As Boolean
Dim iCounter As Integer
Dim sMessage As String
Dim bInsertMessageIntoDatabase As Boolean
Try
'work through each element in the array one at a time...
'check to see what the 'header' of each row is
For iCounter = 1 To sNLPGDebatched.Length - 1
'********************************************************
'we're interested in four of the row types in the NLPG
'11, 15, 21 and 24
'every time one of these is found pass the row and its header type
'into the InsertMessageIntoDatabase function
'************** MESSAGE 10 (HEADER) *********************
If (sNLPGDebatched(iCounter).Substring(0, 2) = "10") Then
'************** MESSAGE 11 (STREET RECORD) ***********
ElseIf (sNLPGDebatched(iCounter).Substring(0, 2) = "11") Then
sMessage = sNLPGDebatched(iCounter)
bInsertMessageIntoDatabase = InsertMessageIntoDatabase(sMessage, 11)
'************** MESSAGE 15 (STREET DESCRIPTOR) *******
ElseIf (sNLPGDebatched(iCounter).Substring(0, 2) = "15") Then
sMessage = sNLPGDebatched(iCounter)
bInsertMessageIntoDatabase = InsertMessageIntoDatabase(sMessage, 15)
'************** MESSAGE 21 (BLPU) ********************
ElseIf (sNLPGDebatched(iCounter).Substring(0, 2) = "21") Then
sMessage = sNLPGDebatched(iCounter)
bInsertMessageIntoDatabase = InsertMessageIntoDatabase(sMessage, 21)
'************** MESSAGE 24 (LPI) *********************
ElseIf (sNLPGDebatched(iCounter).Substring(0, 2) = "24") Then
sMessage = sNLPGDebatched(iCounter)
bInsertMessageIntoDatabase = InsertMessageIntoDatabase(sMessage, 24)
End If
Next
Catch ex As Exception
'a malformed row (e.g. one shorter than two characters) ends up here
Return False
End Try
Return True
End Function


'take the message / row content and its type and insert into relevant
'database table
Function InsertMessageIntoDatabase(ByVal sMessage As String, ByVal iMessageType As Integer) As Boolean
Dim iCounter As Integer
Dim iMarker As Integer = 0
Dim iArrayElements As Integer = 0
Dim sMessageDebatched(iArrayElements) As String
sMessageDebatched(iArrayElements) = "Empty"
Dim sQueryHeader As String
Dim sQueryValues As String = ""
Dim connection As SqlConnection
Dim command As SqlCommand
Try
'********************************************************************
'each individual message needs debatching too because it's divided into a number
'of comma separated fields
'so this short algorithm divides each row up into a number of array elements
'it then builds a dynamic sql string and processes this against the database
'********************************************************************
'work through each character in the message
For iCounter = 0 To sMessage.Length - 1
'when the loop hits a comma
If (sMessage.Substring(iCounter, 1) = ",") Then
'increase the size of the message array by 1 and preserve existing content
iArrayElements = iArrayElements + 1
ReDim Preserve sMessageDebatched(iArrayElements)
'add the text before the comma into the array element
sMessageDebatched(iArrayElements) = sMessage.Substring(iMarker, iCounter - iMarker)
'reset the marker
iMarker = iCounter + 1
End If
Next
'this is a little flaw in my rusty undergraduate programming skills
'if i set the loop above to work right to the end of the string it bombs out
'set it any shorter and it misses out the last element of the message
'so I have to start at the end of the message and work backwards to get the last
'character
'so set the marker to look at the end of the message
iMarker = sMessage.Length - 1
'then iterate backwards through the message
For iCounter = sMessage.Length - 1 To 0 Step -1
'hit a comma, resize array, add element and then exit loop - this only
'needs doing once
If (sMessage.Substring(iCounter, 1) = ",") Then
iArrayElements = iArrayElements + 1
ReDim Preserve sMessageDebatched(iArrayElements)
sMessageDebatched(iArrayElements) = sMessage.Substring(iCounter + 1, iMarker - iCounter)
Exit For
End If
Next
'****************************************
'**** CLEAN UP CONTENTS OF ARRAY ********
'we're into the dynamic sql statement building phase now
'i.e. replace any blank elements with the NULL string (better for database)
'also identify any elements which are obviously datetime (NLPG has the yyyy-mm-dd
'format as standard which is great) and set them to be cast to datetime in the
'database otherwise they just end up as 1905-01-01
For iCounter = 1 To sMessageDebatched.Length - 1
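'NOTE: the listing is incomplete from this point on. The rest of the function is
'reconstructed from the surviving fragments and the comments above, so treat it
'as a sketch rather than the exact original code.
If (sMessageDebatched(iCounter).Length < 1) Then
'blank element - use the NULL keyword instead
sMessageDebatched(iCounter) = "NULL"
ElseIf (sMessageDebatched(iCounter).Length = 10 AndAlso sMessageDebatched(iCounter).Substring(4, 1) = "-") Then
'looks like a yyyy-mm-dd date - cast it to datetime in the database
sMessageDebatched(iCounter) = "CAST('" & sMessageDebatched(iCounter) & "' AS datetime)"
End If
sQueryValues = sQueryValues & sMessageDebatched(iCounter) & ","
Next
'pick the target table from the message type
If (iMessageType = 11) Then
sQueryHeader = "INSERT INTO NLPGStreetRecord VALUES ("
ElseIf (iMessageType = 15) Then
sQueryHeader = "INSERT INTO NLPGStreetDescriptor VALUES ("
ElseIf (iMessageType = 21) Then
sQueryHeader = "INSERT INTO NLPGBasicLandAndPropertyUnit VALUES ("
ElseIf (iMessageType = 24) Then
sQueryHeader = "INSERT INTO NLPGLandAndPropertyIdentifier VALUES ("
End If
sQueryHeader = sQueryHeader & sQueryValues
'strip the trailing comma and close the bracket
sQueryHeader = sQueryHeader.Substring(0, sQueryHeader.Length - 1)
sQueryHeader = sQueryHeader.Insert(sQueryHeader.Length, ")")
'SQL Server wants single quotes around strings, not double quotes
sQueryHeader = sQueryHeader.Replace("""", "'")
'"NLPG" is a placeholder for whatever the connection string is called in web.config
connection = New SqlConnection(System.Configuration.ConfigurationManager.ConnectionStrings("NLPG").ConnectionString)
command = New SqlCommand(sQueryHeader, connection)
connection.Open()
iCounter = command.ExecuteNonQuery()
connection.Close()
Catch ex As Exception
Return False
End Try
Return True
End Function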
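One caveat with splitting each row on every comma, as InsertMessageIntoDatabase does, is that a comma inside a quoted value would be treated as a field boundary. The .NET Framework ships a parser that copes with this, Microsoft.VisualBasic.FileIO.TextFieldParser, and a possible alternative to the character loop is sketched below. The module and function names are made up, and note that this parser strips the surrounding quotes from each field, unlike the loop above.

```vbnet
Imports System.IO
Imports Microsoft.VisualBasic.FileIO

Module RowParser
    ' Splits one NLPG row into its fields, honouring double-quoted values.
    ' Quotes are removed from the returned fields; empty fields come back as "".
    Function SplitRow(ByVal row As String) As String()
        Using parser As New TextFieldParser(New StringReader(row))
            parser.TextFieldType = FieldType.Delimited
            parser.SetDelimiters(",")
            parser.HasFieldsEnclosedInQuotes = True
            parser.TrimWhiteSpace = False
            Return parser.ReadFields()
        End Using
    End Function
End Module
```

For the street descriptor row shown earlier, SplitRow returns nine fields with PROVENCE ROAD as the fifth. A value such as "FLAT 1, THE MEWS" (an invented example) stays in one piece instead of being cut in two.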