The whole world in a SQL Server database

Some people have the whole world in their hands, but last weekend I tried something slightly less ambitious: getting (part of) the whole world into a SQL Server database.

Loads of spatial data from OpenStreetMap

I’m always looking for interesting datasets, and I was still lacking a large one with spatial data. The biggest public spatial dataset I’m aware of is OpenStreetMap (OSM). It’s a map service comparable to Bing or Google Maps, but its data comes from public sources: some governments provide public location information, and thousands of volunteers help OSM put their street or village ‘on the map’.

The resulting database is public as well. It is available in XML format, and on the OSM wiki I also found an OSM XSD schema. This allows me to load the data with SQL Server Integration Services, using the XML source in a data flow. There might be faster solutions than this, but for the time being SSIS does a rather good job with a very simple setup.

Creating the database

So, let’s create the database in which we want to store our data. Be sure to create it with an initial size big enough to host the data you want to load. The OSM XML file with the data for the whole planet is about 250 GB at the time of writing, so it will probably take about 150 GB in SQL Server (XML is not the most compact storage format). In this blog post I will be loading a subset of the whole planet, so I create a much smaller database (just under 2 GB):

CREATE DATABASE [OSM]
 ON PRIMARY
( NAME = N'OSM', FILENAME = N'F:\Data\OSM.mdf' , SIZE = 2048000KB , MAXSIZE = UNLIMITED, FILEGROWTH = 512000KB )
 LOG ON
( NAME = N'OSM_log', FILENAME = N'F:\Data\OSM_log.ldf' , SIZE = 3170880KB , MAXSIZE = 2048GB , FILEGROWTH = 10%)

If you really plan to create a much larger database, don’t forget to enable instant file initialization to speed up creating and extending the data file(s).
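
On more recent SQL Server builds you can quickly verify whether instant file initialization is active; this is just a convenience check (on older versions, check instead whether the service account has the ‘Perform volume maintenance tasks’ right):

-- Returns Y/N per service; the column is only available on newer builds.
SELECT servicename, instant_file_initialization_enabled
FROM sys.dm_server_services;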

Next I create the tables that will store the different pieces of information from the OSM XML file. Notice that I do not keep all the information in the file: I do not store who entered the data or when it was entered. This reduces the amount of data to be stored, which speeds up the data load. This is the code to create the necessary tables:

USE [OSM]
GO
CREATE TABLE [dbo].[Nodes](
    [node_Id] [bigint] NOT NULL,
    [id] [bigint] NULL,
    [lat] [float] NULL,
    [lon] [float] NULL,
    [geog] [geography] NULL,
 CONSTRAINT [PK_Nodes] PRIMARY KEY CLUSTERED 
(
    [node_Id] ASC
)WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, IGNORE_DUP_KEY = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON, FILLFACTOR = 80) ON [PRIMARY]
) ON [PRIMARY] TEXTIMAGE_ON [PRIMARY]
GO
CREATE TABLE [dbo].[NodeTag](
    [key] [nvarchar](250) NULL,
    [value] [nvarchar](3000) NULL,
    [node_Id] [bigint] NULL
) ON [PRIMARY];
GO
CREATE TABLE [dbo].[Relation](
    [relation_Id] [bigint] NOT NULL,
    [id] [bigint] NULL,
    [visible] [bit] NULL,
 CONSTRAINT [PK_Relation] PRIMARY KEY CLUSTERED 
(
    [relation_Id] ASC
)WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, IGNORE_DUP_KEY = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON, FILLFACTOR = 80) ON [PRIMARY]
) ON [PRIMARY]
GO
CREATE TABLE [dbo].[RelationMember](
    [type] [nvarchar](8) NULL,
    [ref] [bigint] NULL,
    [role] [nvarchar](250) NULL,
    [relation_Id] [bigint] NULL,
    [RelationPosition] [int] IDENTITY(1,1) NOT NULL
) ON [PRIMARY]
GO
CREATE TABLE [dbo].[RelationTag](
    [key] [nvarchar](250) NULL,
    [value] [nvarchar](3000) NULL,
    [relation_Id] [bigint] NULL
) ON [PRIMARY]
GO
CREATE TABLE [dbo].[Way](
    [way_Id] [bigint] NOT NULL,
    [id] [bigint] NULL,
    [visible] [bit] NULL,
 CONSTRAINT [PK_Way] PRIMARY KEY CLUSTERED 
(
    [way_Id] ASC
)WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, IGNORE_DUP_KEY = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON, FILLFACTOR = 80) ON [PRIMARY]
) ON [PRIMARY]
GO
CREATE TABLE [dbo].[WayNode](
    [ref] [bigint] NULL,
    [way_Id] [bigint] NULL,
    [WaiNodePosition] [bigint] IDENTITY(1,1) NOT NULL
) ON [PRIMARY]
GO
CREATE TABLE [dbo].[WayTag](
    [key] [nvarchar](250) NULL,
    [value] [nvarchar](3000) NULL,
    [way_Id] [bigint] NULL
) ON [PRIMARY]
GO

After the destination tables are created, it’s time to download an OSM XML file. I downloaded the file for Belgium from one of the mirror servers. After unzipping, the file is about 2.4 GB. All the remaining timings in this article are based on this file, but of course, feel free to experiment with data from other countries as well.

Loading the data with SSIS

Now it’s time to create an Integration Services package that uses an XML source to read the OSM XML file, parse it with the XSD schema I mentioned earlier, and write the results to the corresponding tables. I used SQL Server destinations; if you do the same, be careful to set the destination timeout high enough, or the package will fail on it. You can reuse the SSIS package I made for SQL Server 2012, but be sure to check and correct the pointers to the appropriate files.

Fingers crossed, run the SSIS package and wait till the data is loaded. On my machine it took about 17 minutes. It loaded about 12 million coordinates (Nodes), and more than 1.5 million road segments and other combined objects (Way).
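
If you want a quick sanity check on the load, a simple count per table will do (a minimal sketch against the tables created above):

-- Row counts per destination table after the load.
SELECT 'Nodes' AS [table], COUNT_BIG(*) AS row_cnt FROM dbo.Nodes
UNION ALL SELECT 'Way', COUNT_BIG(*) FROM dbo.Way
UNION ALL SELECT 'WayNode', COUNT_BIG(*) FROM dbo.WayNode
UNION ALL SELECT 'Relation', COUNT_BIG(*) FROM dbo.Relation;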

The only table that really contains coordinates is the Nodes table. In order to use SQL Server’s spatial functions on these, we must first create geographical objects out of the latitude and longitude info. This can simply be done with this statement:

UPDATE Nodes SET geog = geography::Point(lat, lon, 4326)

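For a single country one big UPDATE is fine, but on a much larger extract it can make the transaction log grow considerably. A batched variant along these lines (just a sketch, with a hypothetical batch size) keeps each transaction small:

-- Convert lat/lon to geography in batches; each batch is its own transaction.
DECLARE @batchsize int = 1000000;
WHILE 1 = 1
BEGIN
    UPDATE TOP (@batchsize) dbo.Nodes
    SET geog = geography::Point(lat, lon, 4326)
    WHERE geog IS NULL;

    IF @@ROWCOUNT < @batchsize BREAK;
END
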
The single UPDATE took about 5 minutes on my machine. From now on we can start querying this information if we want, but my Belgian OSM file contains about 12 million coordinates… without a decent index, every query is going to take forever, so let’s build a spatial index. In the next code example I use the new type of spatial index introduced in SQL Server 2012, but feel free to try other types of spatial indexes as well. Spatial Ed has a short ‘getting started’ paper if you want to play with spatial indexes.

CREATE SPATIAL INDEX [SpatialIndex-Nodes] ON [dbo].[Nodes]
(
    [geog]
) USING GEOGRAPHY_AUTO_GRID
WITH (CELLS_PER_OBJECT = 16) ON [PRIMARY]
GO

After this index is created, we can get started querying this interesting data source. More on that (and how to create more advanced spatial objects to represent roads, areas etc.) in a later blog post.
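
As a small teaser, this is the kind of query such an index makes feasible (a sketch only: the coordinates are approximate, and the join relies on the node_Id key used in the tables above):

-- All named nodes within 1 km of (roughly) the Grand Place in Brussels.
DECLARE @here geography = geography::Point(50.8467, 4.3525, 4326);

SELECT TOP (20)
       t.[value] AS [name],
       n.geog.STDistance(@here) AS distance_in_meters
FROM dbo.Nodes AS n
JOIN dbo.NodeTag AS t ON t.node_Id = n.node_Id
WHERE t.[key] = N'name'
  AND n.geog.STDistance(@here) <= 1000
ORDER BY distance_in_meters;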

Happy querying!
Nico

Validation sets in SQL Server Data Mining

What are validation sets?

Data Mining

Data mining analyses historical data to find patterns that might help us better understand how our business works, or help predict how it might evolve in the future. Instead of doing ‘traditional BI’, where we pick some attributes and ask for aggregated data (“show me the sum of sales amount by country per fiscal quarter”), in data mining we ask questions such as “what is typical for customers who buy bikes?”, and we get answers (models, as we call them) that contain patterns such as “if the age of the customer is less than 29 and they live in the Netherlands, they are more likely to be bike buyers”. This raises a problem, however: how do we know whether one model is ‘better’ than another? Is the model “young people are more frequent bike buyers” better than “people who do not own a car are bike buyers”?

Test set

The typical approach to testing the quality of models is to check how well they behave when we use them to predict the outcome (e.g. whether a customer buys a bike or not) on historical data, for which we already know the outcome. Models whose predicted outcome more frequently corresponds with the actual outcome are better models. However, we need to be careful: if we used the same data to test the models as we used to create them, we would run the risk of overfitting. Overfitting means the model is so tuned to the training set that its patterns are not general enough to be useful on new data. E.g. the model “if the customer name is Ben Carlson, Margareta Wuyts, … or Jeremy Frank, then it is a bike buyer” might make perfect predictions on your historical data, but it is clear that it will be of little help in making predictions about new customers: it is heavily overfitted. This is why we split the historical data in two sets: training data, in which the system searches for patterns, and test data, which we use to test the quality of the model. This is even built into the SQL Server Analysis Services mining model wizard: by default it proposes to keep 30% of the data apart for testing.


Validation set

But test data sets raise an issue as well: we often need to test a lot of different mining models with different parameter settings to find a near-optimal result. This is an iterative process, in which we create a few models, test them on the test set, see which data mining techniques and parameters work best, use that knowledge to set up a second iteration of models to be tested, and so on. But in this way, the data mining developer is introducing knowledge from the test set into the development process: imagine that in our test set age is a strong indicator, then we will favor models which use it. The overall result is that the estimated quality of the predictions made on the test set is no longer a good estimate of the expected quality of predictions on new data. It is already slightly biased towards our test set, and typically overestimates the predictive quality of our model.

This is where validation sets come in: before we get started with any data mining in the first place, we should set some of our historical data (e.g. 20%) apart in a validation set. The remaining 80% is then split into training and test data. Once we’re finished with our data mining, we test our model one last time, on data it has never seen: not as training data, and not as test data. Our validation set is (from the data mining point of view) truly new data, and might give the best impression of the expected predictive quality of our mining model.

How do we create validation sets?

In contrast to test data sets, the mining wizard does not allow us to set apart a validation set, so we need to do this in the data preparation phase (see the CRISP-DM methodology for more info on the different phases in the data mining process). If you prefer to prepare your data with T-SQL statements, you can use an approach based on NEWID() to randomly select a subset of the data, but be careful: if you rerun the statement, a different subset will be selected.
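
A minimal sketch of that NEWID() technique (the table and column names are hypothetical): order the rows randomly and move the top 20 percent into a separate validation table.

-- Randomly set apart roughly 20% of the historical data as a validation set.
-- Rerunning this statement selects a different subset, as mentioned above.
SELECT TOP (20) PERCENT *
INTO dbo.CustomerValidationSet
FROM dbo.Customers
ORDER BY NEWID();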

Another approach is to use SSIS (Integration Services), which has a percentage sampling transformation that is ideal for this job: it gives each row an n% likelihood of being selected, so it doesn’t need to cache all rows in memory (in contrast to the row sampling transformation). An advantage over the NEWID() approach is that we can set the seed of the random number generator, so that results are reproducible if we want.


How do we use validation sets?

Using validation sets is easy. Just make sure that the table with the validation data is in the same data source as the one you used for the SSAS project. Then, in the Mining Accuracy Chart tab of the mining model in SSAS, select just the best performing model(s) and, below, choose the radio button to use a different data set.

Click the ellipsis button (…) and select the table or view which contains the validation set. Join the proper columns from the validation set with the mining model, and you’re set! Now you can create lift or profit charts and build a classification matrix against the validation set.

Happy mining!
Nico