The whole world in a SQL Server database 29 January 2013 Nico-Jacobs SQL Server, Integration Services Some people have the whole world in their hand, but last weekend I tried something slightly less ambitious: Getting (part of) the whole world in a SQL Server database. Loads of spatial data from OpenStreetMap I’m always looking for interesting datasets. I was still lacking a large dataset with spatial data. The biggest public spatial dataset I’m aware of is OpenStreetMap (OSM). It’s a map service comparable with Bing or Google Maps. But the data it uses is collected from public data: Some governments provide public location information, but there are also thousands of volunteers who help OSM in putting their street or village ‘on the map’. The resulting database is public as well. It is described in XML format, and I found on the wiki also an OSM XSD schema. This allows me to load the data with SQL Server Integration Services, using the data flow XML data source. There might be faster solutions than this, but for the time being, SSIS does a rather good job with a very simple setup. Creating the database So, let’s create the database in which we want to store our data. Be sure to create a database with an initial size big enough to host the data you want to load. The OSM xml file with all the data from the whole planet is about 250 Gb at the time of writing, so It will probably take about 150 Gb in SQL Server (XML is not the most compact storage format). In this blog post example I will be loading a subset of the whole planet, so I create a much smaller database (just under 2 Gb): 1: CREATE DATABASE [OSM] 2: ON PRIMARY 3: ( NAME = N'OSM', FILENAME = N'F:\Data\OSM.mdf' , SIZE = 2048000KB , MAXSIZE = UNLIMITED, FILEGROWTH = 512000KB ) 4: LOG ON 5: ( NAME = N'OSM_log', FILENAME = N'F:\Data\OSM_log.ldf' , SIZE = 3170880KB , MAXSIZE = 2048GB , FILEGROWTH = 10%) If you really plan to create a much larger database, don’t forget to enable instant initialization to speed up the process of creating and extending the data file(s). Next I create all the tables to store the different pieces of information in the OSM XML files. Notice that I do not keep all the information in the XML files. I do not store the person who entered the data, and the date the data was entered. This reduces the amount of data to be stored, which speeds up the data loading. This is the code to create the necessary tables: 1: USE [OSM] 2: GO 3: CREATE TABLE [dbo].[Nodes]( 4: [node_Id] [bigint] NOT NULL, 5: [id] [bigint] NULL, 6: [lat] [float] NULL, 7: [lon] [float] NULL, 8: [geog] [geography] NULL, 9: CONSTRAINT [PK_Nodes] PRIMARY KEY CLUSTERED 10: ( 11: [node_Id] ASC 12: )WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, IGNORE_DUP_KEY = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON, FILLFACTOR = 80) ON [PRIMARY] 13: ) ON [PRIMARY] TEXTIMAGE_ON [PRIMARY] 14: 15: CREATE TABLE [dbo].[NodeTag]( 16: [key] [nvarchar](250) NULL, 17: [value] [nvarchar](3000) NULL, 18: [node_Id] [bigint] NULL 19: ) ON [PRIMARY]; 20: GO 21: CREATE TABLE [dbo].[Relation]( 22: [relation_Id] [bigint] NOT NULL, 23: [id] [bigint] NULL, 24: [visible] [bit] NULL, 25: CONSTRAINT [PK_Relation] PRIMARY KEY CLUSTERED 26: ( 27: [relation_Id] ASC 28: )WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, IGNORE_DUP_KEY = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON, FILLFACTOR = 80) ON [PRIMARY] 29: ) ON [PRIMARY] 30: GO 31: CREATE TABLE [dbo].[RelationMember]( 32: [type] [nvarchar](8) NULL, 33: [ref] [bigint] NULL, 34: [role] [nvarchar](250) NULL, 35: [relation_Id] [bigint] NULL, 36: [RelationPosition] [int] IDENTITY(1,1) NOT NULL 37: ) ON [PRIMARY] 38: 39: GO 40: CREATE TABLE [dbo].[RelationTag]( 41: [key] [nvarchar](250) NULL, 42: [value] [nvarchar](3000) NULL, 43: [relation_Id] [bigint] NULL 44: ) ON [PRIMARY] 45: 46: GO 47: CREATE TABLE [dbo].[Way]( 48: [way_Id] [bigint] NOT NULL, 49: [id] [bigint] NULL, 50: [visible] [bit] NULL, 51: CONSTRAINT [PK_Way] PRIMARY KEY CLUSTERED 52: ( 53: [way_Id] ASC 54: )WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, IGNORE_DUP_KEY = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON, FILLFACTOR = 80) ON [PRIMARY] 55: ) ON [PRIMARY] 56: 57: GO 58: CREATE TABLE [dbo].[WayNode]( 59: [ref] [bigint] NULL, 60: [way_Id] [bigint] NULL, 61: [WaiNodePosition] [bigint] IDENTITY(1,1) NOT NULL 62: ) ON [PRIMARY] 63: GO 64: CREATE TABLE [dbo].[WayTag]( 65: [key] [nvarchar](250) NULL, 66: [value] [nvarchar](3000) NULL, 67: [way_Id] [bigint] NULL 68: ) ON [PRIMARY] 69: 70: GO After the destination tables are created, it’s time to download an OSM xml file. I downloaded the file for Belgium from one of the mirror servers. After unzipping the file, it is about 2.4 Gb. All the remaining timings in this article are based on this file, but of course, feel free to experiment with data from other countries as well. Loading the data with SSIS Now it’s time to create an Integration Services package which uses an XML source to read the OSM xml file, parse it with the XSD file I mentioned earlier, and write the results to the corresponding tables. I used SQL Server destinations, but if you do be careful to set the destination timeout high enough, or your package will fail on this. You can reuse the SSIS package I made on SQL Server 2012, but be sure to check and correct the pointers to the appropriate files. Fingers crossed, run the SSIS package and wait till the data is loaded. On my machine it took about 17 minutes. It loaded about 12 million coordinates (Nodes), and more than 1.5 million road segments and other combined objects (Way). The only table that really contains coordinates is the Nodes table. In order to use SQL Server’s spatial functions on these, we must first create geographical objects out of the latitude and longitude info. This can simply be done with this statement: 1: UPDATE Nodes SET geog = geography::Point(lat, lon, 4326) This took about 5 minutes on my machine. From now on, we can start querying this information is we want, but my Belgian OSM file contains about 12 million coordinates… without a decent index, every query is going to take forever, so let’s build a spatial index on this. In the next code example I use the new type of spatial index they introduced in SQL Server 2012, but feel free to try other types of spatial indexes as well. Spatial Ed has a short ‘getting started’ paper if you want to play with spatial indexes. 1: CREATE SPATIAL INDEX [SpatialIndex-Nodes] ON [dbo].[Nodes] 2: ( 3: [geog] 4: )USING GEOGRAPHY_AUTO_GRID 5: WITH ( 6: CELLS_PER_OBJECT = 16) ON [PRIMARY] 7: GO After this index is created, we can get started querying this interesting data source. More on that (and how to create more advanced spatial objects to represent roads, areas etc.) in a later blog post. Happy querying! Nico