Loading data in Azure Machine Learning

In July 2014 Microsoft made their cloud-based data mining environment (known as Azure Machine Learning, or AzureML) available to the public. With this platform users can analyze large amounts of data without the need to install and configure special software: A browser and a credit card is all you need . With the increasing number of people in a number-crunching job (data scientists) it is nice to see Microsoft focusing on this. In a previous blog post (see http://blogs.u2u.be/u2u/post/2014/07/14/First-Steps-in-Azure-Machine-Learning.aspx) I show how to get started with setting up an AzureML environment. In this blog post we take a look at loading data in AzureML. Supported data formats Currently AzureML is focusing on the most common formats used in the world of machine learning: Text files containing comma separated values (CSV), tab-separated values (TSV), the Attribute-Relation File Format (ARFF) which was introduced by the open-source Weka machine learning framework, RData files or the SVMLight format. Database tables: Hive tables (Hadoop), Azure Tables and SQL Databases in Azure Since AzureML runs in the Azure cloud, all your data must be in the cloud as well. Either you already uploaded your data to Azure (e.g. your data is stored in an Azure SQL Database) or you will upload it explicitly for this project. In both cases be careful to store your data in the same region as where you’re running your AzureML, since in the preview period, AzureML only runs from the South Central US data center. If you store your data in another data center it will be slower and more expensive to run your experiments. Let’s first consider the scenario where you upload your data from a local file directly into AzureML (Uploading a DataSet), then we cover the scenario where your data is already somewhere in Azure (Reading data). Uploading a DataSet A lot of sample machine learning data sets are already available out-of-the-box in Azure ML. But after some experimenting with public data, you probably want to play with your own data. If you didn’t have your data anywhere on Azure yet, you can upload it as a new dataset in AzureML directly. But before we start adding data sets, first a warning: In the current preview we cannot delete uploaded datatsets. We can override an existing data set with new data, but if you create 1001 data sets, they will be in the list forever (that is: until Microsoft fixes this limitation). Because of this, if your dataset is not yet fixed, consider uploading the data file(s) into a custom Azure blob store and then load them with the reader from within your experiment. To add a new dataset, click the +New button at the bottom left of the ML Studio screen, and select DataSet –> From local file. In the next dialog box, we can pick the file to upload, provide a name (choose well, it cannot be altered later on), select the type of data in the file and provide an optional description: If you select the checkbox you select an existing dataset, who’s content will be overwritten by the file you select. It is impossible to delete or rename a datset, but you can always upload an empty file ‘as a new version’ of a large data set to truncate it. If we now want to use this data, we create a new experiment by clicking the +New button. In this new experiment under the Saved Datasets we will find our uploaded dataset among the list. Just drag it to the design surface. Also remember the search box at the top: by typing part of an object name (and a data set is one of the many objects we have in AzureML) we get a filtered list which makes it easier to find an object. Now that we have our data in AzureML we can start interacting with it, such as simply visualizing our data: click in the circle under the data set and select Visualize: This will open up the overview screen, showing basic statistical information on each data field: Reading data Another way to get data in an AzureML experiment is by first uploading your data in a Azure SQL Database, an Hadoop cluster (such as HDInsight) or upload the files with data (same data types as we had in the previous paragraph) into an Azure blob store. In this case you do not need to create a data set, but you can immediately create a new experiment. In this experiment, locate the Reader under Data Input and Output and drag it into the experiment. When we click in the Reader, we get on the right-hand side all the configurable properties of this Reader. The most important property is the data source type. This one determines which other properties are needed. Select over here the location where your data can be found and configure the other properties appropriately When we now run the experiment, we can visualize the data from this reader, just as we could we an uploaded data set. But we have an extra option. By clicking Save as dataset, we can permanently store this data in AzureML. This speeds up the runtime of an experiment, but it increases the storage cost (we store another redundant copy of the data). In a next blog post, I will discuss data preprocessing.

First Steps in Azure Machine Learning

Today Microsoft announces the availability of machine learning (data mining) in Azure. As you can assume, you need an Azure account to get started with this, but there are free trial accounts… you can try before you buy. To get started with the machine learning preview go to http://manage.windowsazure.com and log in with your azure account. In the list of options, close to the bottom, you will find Machine Learning: Click the Create an ML workspace link. Currently there is only a Quick Create option available. Invent a unique workspace name. The Workspace owner must be a valid LiveID account. Location is easy: the machine learning is currently only available in South Central US. I guess I as an European will just need to be a little more patient . If you already have an South Central US storage account you can reuse that, but I put all my storage accounts in Europe, so I now will need to create one on US soil. Those who need to keep there data within Europe for legal reasons will need to wait, because I assume Microsoft will make this service available later on in Europe as well. My final configuration looks like this: Now is the time to start reading the tutorial at http://azure.microsoft.com/en-us/documentation/articles/machine-learning-create-experiment/ while Azure is creating your Machine Learning workspace. Once the workspace is created we can click the right arrow next to it. Then click on the DashBoard link at the top, and next click the Sign-in to ML Studio under quick glance: And now we arrive in the ML Studio: At this point you can get started following the tutorials(http://azure.microsoft.com/en-us/documentation/articles/machine-learning-create-experiment/), play with the sample data or build experiments from scratch with your own data: Have fun!

Validation sets in SQL Server Data Mining

What are validation sets? Data Mining Data mining analyses historical data to find patterns that might help us better understand how our business works, or might help predict how the business might evolve in the future: Instead of doing ‘traditional BI’, where we pick some attributes and ask for aggregated data (“show me the sum of sales amount by country per fiscal quarter”), in data mining we ask questions such as “what is typical for customers who buy bikes”, and we get answers (models, as we call them) that contain patterns such as “if the age of the customer is less than 29 and they live in the Netherlands they are more likely bike buyers”. This however results in a problem: how do we know if a model is ‘better’ than another model? Is the model “Young people are more frequent bike buyers” better than “People who do not own a car are bike buyers”? Test set The typical approach to test the quality of models is by testing how well they behave when we use them to predict the outcome (e.g. whether a customer buys a bike or not) on the historical data, for which we then already know the outcome. Models for which the predicted outcome more frequently corresponds with the actual outcome are better models. However, we need to be careful: if we would use as a test data set the same set of data we use to create the models, we run the risk of overfitting. Overfitting means the model is so tuned on the training set, that the patterns are not general enough to be useful on new data. E.g. the model “If the customer name is Ben Carlson, Margareta Wuyts, … or Jeremy Frank then it is a bike buyer” might make perfect predictions in your historical data, but it is clear that it will be of little help in making predictions on new customers: it is heavily overfitted. This is why we split the historical data in two sets: training data, on which the system search for patterns, and test data, which we use to test the quality of the model. This is even build-in in the SQL Server Analysis Services wizard to construct mining models: It by default proposes to keep 30% of the data separate for testing. Validation set But… also test data sets raise an issue: We often need to test a lot of different mining models with different parameter setting to find a near-optimal result. This is an iterative process, in which we create a few models, test them on the test set, see which data mining techniques and parameters work best, use that knowledge to setup a second iteration of models to be tested etc. But in this way, the data mining developer is introducing knowledge from the test set in the development process: Imagine that in our test set age is a strong indicator, than we will favor models which use this. The overall result is that the estimated quality of the predictions which are made on the test set are no longer a good estimate of the expected quality of the predictions on new data. They are already slightly biased towards our test set, and typically overestimate the predictive quality of our model. This is where validation sets come in: Before we got started with any data mining in the first place, we should have set some of our historical data (e.g. 20% of the data) apart in a validation set. The remaining 80% is then split apart in training and test data. Once we’re finished with our data mining, we test our model one last time, on data it has never seen, not as training data, not as test data. Our validation set is (from the data mining point of view) truly new data, and might give the best impression of the expected predictive quality of our mining model. How do we create validation sets? In contrast to test data sets, the mining wizard does not allow us to set apart a validation set. So we need to do this in the data preparation phase (see CRISP-DM methodology for more info on the different phases in the data mining process). If you prefer to prepare your data with T-SQL statements, you can use this approach based on NEWID() to randomly select a certain set of data, but be careful: if you rerun the statement, a different subset will be selected. Another approach is to use SSIS (Integration Services), which has a percentage sampling transformation which is ideal for this job: it assigns each row an n% likelihood of being selected, so because of that it doesn’t need to cache all rows in memory (in contrast to the row sampling transformation). An advantage over the NEWID() approach is that we can set the seed for the random data generator, such that results are reproducible if we want. How do we use validation sets? Using validation sets is easy. Just make sure that the table you created with the validation data is in the same data source as the data source you used for the SSAS project. Then in the Mining Accuracy Chart tab of the Mining model in SSAS, you select just the best performing model(s) and below you choose the radio button to use a different data set. Click the ellipsis button (…) and select the table or view which contains the validation set. Join the proper columns from the validation set with the mining model, and you’re set! Now you can create lift or profit charts and build a classification matrix against the validation set. Happy mining! Nico