The whole world in a SQL Server database

Some people have the whole world in their hand, but last weekend I tried something slightly less ambitious: Getting (part of) the whole world in a SQL Server database.

Loads of spatial data from OpenStreetMap

I’m always looking for interesting datasets. I was still lacking a large dataset with spatial data. The biggest public spatial dataset I’m aware of is OpenStreetMap (OSM). It’s a map service comparable with Bing or Google Maps. But the data it uses is collected from public data: Some governments provide public location information, but there are also thousands of volunteers who help OSM in putting their street or village ‘on the map’.

The resulting database is public as well. It is described in XML format, and I found on the wiki also an OSM XSD schema. This allows me to load the data with SQL Server Integration Services, using the data flow XML data source. There might be faster solutions than this, but for the time being, SSIS does a rather good job with a very simple setup.

Creating the database

So, let’s create the database in which we want to store our data. Be sure to create a database with an initial size big enough to host the data you want to load. The OSM xml file with all the data from the whole planet is about 250 Gb at the time of writing, so It will probably take about 150 Gb in SQL Server (XML is not the most compact storage format). In this blog post example I will be loading a subset of the whole planet, so I create a much smaller database (just under 2 Gb):

   1: CREATE DATABASE [OSM]
   2:  ON  PRIMARY 
   3: ( NAME = N'OSM', FILENAME = N'F:\Data\OSM.mdf' , SIZE = 2048000KB , MAXSIZE = UNLIMITED, FILEGROWTH = 512000KB )
   4:  LOG ON 
   5: ( NAME = N'OSM_log', FILENAME = N'F:\Data\OSM_log.ldf' , SIZE = 3170880KB , MAXSIZE = 2048GB , FILEGROWTH = 10%)

If you really plan to create a much larger database, don’t forget to enable instant initialization to speed up the process of creating and extending the data file(s).

Next I create all the tables to store the different pieces of information in the OSM XML files. Notice that I do not keep all the information in the XML files. I do not store the person who entered the data, and the date the data was entered. This reduces the amount of data to be stored, which speeds up the data loading. This is the code to create the necessary tables:

   1: USE [OSM]
   2: GO
   3: CREATE TABLE [dbo].[Nodes](
   4:     [node_Id] [bigint] NOT NULL,
   5:     [id] [bigint] NULL,
   6:     [lat] [float] NULL,
   7:     [lon] [float] NULL,
   8:     [geog] [geography] NULL,
   9:  CONSTRAINT [PK_Nodes] PRIMARY KEY CLUSTERED 
  10: (
  11:     [node_Id] ASC
  12: )WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, IGNORE_DUP_KEY = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON, FILLFACTOR = 80) ON [PRIMARY]
  13: ) ON [PRIMARY] TEXTIMAGE_ON [PRIMARY]
  14:  
  15: CREATE TABLE [dbo].[NodeTag](
  16:     [key] [nvarchar](250) NULL,
  17:     [value] [nvarchar](3000) NULL,
  18:     [node_Id] [bigint] NULL
  19: ) ON [PRIMARY];
  20: GO
  21: CREATE TABLE [dbo].[Relation](
  22:     [relation_Id] [bigint] NOT NULL,
  23:     [id] [bigint] NULL,
  24:     [visible] [bit] NULL,
  25:  CONSTRAINT [PK_Relation] PRIMARY KEY CLUSTERED 
  26: (
  27:     [relation_Id] ASC
  28: )WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, IGNORE_DUP_KEY = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON, FILLFACTOR = 80) ON [PRIMARY]
  29: ) ON [PRIMARY]
  30: GO
  31: CREATE TABLE [dbo].[RelationMember](
  32:     [type] [nvarchar](8) NULL,
  33:     [ref] [bigint] NULL,
  34:     [role] [nvarchar](250) NULL,
  35:     [relation_Id] [bigint] NULL,
  36:     [RelationPosition] [int] IDENTITY(1,1) NOT NULL
  37: ) ON [PRIMARY]
  38:  
  39: GO
  40: CREATE TABLE [dbo].[RelationTag](
  41:     [key] [nvarchar](250) NULL,
  42:     [value] [nvarchar](3000) NULL,
  43:     [relation_Id] [bigint] NULL
  44: ) ON [PRIMARY]
  45:  
  46: GO
  47: CREATE TABLE [dbo].[Way](
  48:     [way_Id] [bigint] NOT NULL,
  49:     [id] [bigint] NULL,
  50:     [visible] [bit] NULL,
  51:  CONSTRAINT [PK_Way] PRIMARY KEY CLUSTERED 
  52: (
  53:     [way_Id] ASC
  54: )WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, IGNORE_DUP_KEY = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON, FILLFACTOR = 80) ON [PRIMARY]
  55: ) ON [PRIMARY]
  56:  
  57: GO
  58: CREATE TABLE [dbo].[WayNode](
  59:     [ref] [bigint] NULL,
  60:     [way_Id] [bigint] NULL,
  61:     [WaiNodePosition] [bigint] IDENTITY(1,1) NOT NULL
  62: ) ON [PRIMARY]
  63: GO
  64: CREATE TABLE [dbo].[WayTag](
  65:     [key] [nvarchar](250) NULL,
  66:     [value] [nvarchar](3000) NULL,
  67:     [way_Id] [bigint] NULL
  68: ) ON [PRIMARY]
  69:  
  70: GO

After the destination tables are created, it’s time to download an OSM xml file. I downloaded the file for Belgium from one of the mirror servers. After unzipping the file, it is about 2.4 Gb. All the remaining timings in this article are based on this file, but of course, feel free to experiment with data from other countries as well.

Loading the data with SSIS

Now it’s time to create an Integration Services package which uses an XML source to read the OSM xml file, parse it with the XSD file I mentioned earlier, and write the results to the corresponding tables. I used SQL Server destinations, but if you do be careful to set the destination timeout high enough, or your package will fail on this. You can reuse the SSIS package I made on SQL Server 2012, but be sure to check and correct the pointers to the appropriate files.

Fingers crossed, run the SSIS package and wait till the data is loaded. On my machine it took about 17 minutes. It loaded about 12 million coordinates (Nodes), and more than 1.5 million road segments and other combined objects (Way).

The only table that really contains coordinates is the Nodes table. In order to use SQL Server’s spatial functions on these, we must first create geographical objects out of the latitude and longitude info. This can simply be done with this statement:

   1: UPDATE Nodes SET geog = geography::Point(lat, lon, 4326)

This took about 5 minutes on my machine. From now on, we can start querying this information is we want, but my Belgian OSM file contains about 12 million coordinates… without a decent index, every query is going to take forever, so let’s build a spatial index on this. In the next code example I use the new type of spatial index they introduced in SQL Server 2012, but feel free to try other types of spatial indexes as well. Spatial Ed has a short ‘getting started’ paper if you want to play with spatial indexes.

   1: CREATE SPATIAL INDEX [SpatialIndex-Nodes] ON [dbo].[Nodes]
   2: (
   3:     [geog]
   4: )USING  GEOGRAPHY_AUTO_GRID 
   5: WITH (
   6: CELLS_PER_OBJECT = 16) ON [PRIMARY]
   7: GO

After this index is created, we can get started querying this interesting data source. More on that (and how to create more advanced spatial objects to represent roads, areas etc.) in a later blog post.

Happy querying!
Nico