Weird localization problem in DAX editor

One of my students (thanks, Stefan!) discovered during the Power BI course a bizarre localization issue in the DAX editor of Excel and SQL Server Data Tools, which is easy to replicate: install Office 2013 or SQL Server Data Tools on a machine with a locale that uses a comma as decimal separator (I'm from Belgium, so a Belgian locale will do). Then create a Power Pivot model, add a table to it, and in the table editor create a calculated column that uses at least one parameterized function. In this function, provide the constant string "," as an argument. For instance, you could use the expression =IF(1=1;","). As soon as I enter this in the editor, the editor replaces the constant string "," with ";", but the outcome still shows the correct value ",".

Something similar happens when I type a dot as the constant string: after I press Enter in the expression editor, the formula in the editor uses a comma instead of the dot I typed, while the evaluated expression (CalculatedColumn1 in my case) still uses the dot. What I see is not what I get. And to make things worse: when I edit the formula (e.g. adding an else parameter), it really starts to use the comma I never typed, and replaces the comma in the expression with a semicolon I never typed nor see in the evaluated column.

So be careful when using the constant strings "." or "," in DAX!
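To make the round trip concrete, here is roughly what it looks like on a locale that uses ";" as the argument separator (just an illustration of the behaviour described above, using the default column name):

// What I type in the formula bar:
=IF(1=1;".")
// What the editor shows after I press Enter:
=IF(1=1;",")
// At this point CalculatedColumn1 still evaluates to ".", and only after
// editing the formula again does the swapped constant really take effect.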

First Steps in Azure Machine Learning

Today Microsoft announced the availability of machine learning (data mining) in Azure. As you can guess, you need an Azure account to get started with this, but there are free trial accounts… you can try before you buy.

To get started with the machine learning preview, go to the Azure management portal and log in with your Azure account. In the list of options, close to the bottom, you will find Machine Learning. Click the Create an ML workspace link. Currently only a Quick Create option is available. Invent a unique workspace name. The workspace owner must be a valid LiveID account. Location is easy: machine learning is currently only available in South Central US. I guess that as a European I will just need to be a little more patient. If you already have a South Central US storage account you can reuse it, but I put all my storage accounts in Europe, so I now need to create one on US soil. Those who need to keep their data within Europe for legal reasons will need to wait; I assume Microsoft will make this service available in Europe later on as well.

With the configuration complete, now is the time to start reading the tutorial while Azure is creating your Machine Learning workspace. Once the workspace is created, click the right arrow next to it, then click the Dashboard link at the top, and next click Sign-in to ML Studio under quick glance. And now we arrive in ML Studio. At this point you can get started following the tutorials, play with the sample data, or build experiments from scratch with your own data. Have fun!

Validation sets in SQL Server Data Mining

What are validation sets?

Data Mining

Data mining analyses historical data to find patterns that might help us better understand how our business works, or might help predict how the business will evolve in the future. Instead of doing ‘traditional BI’, where we pick some attributes and ask for aggregated data (“show me the sum of sales amount by country per fiscal quarter”), in data mining we ask questions such as “what is typical for customers who buy bikes”, and we get answers (models, as we call them) that contain patterns such as “if the age of the customer is less than 29 and they live in the Netherlands, they are more likely bike buyers”. This however raises a problem: how do we know whether one model is ‘better’ than another? Is the model “young people are more frequent bike buyers” better than “people who do not own a car are bike buyers”?

Test set

The typical approach to testing the quality of models is to test how well they behave when we use them to predict the outcome (e.g. whether a customer buys a bike or not) on the historical data, for which we already know the outcome. Models for which the predicted outcome more frequently corresponds with the actual outcome are better models. However, we need to be careful: if we used the same set of data for testing that we used to create the models, we would run the risk of overfitting. Overfitting means the model is so tuned to the training set that its patterns are not general enough to be useful on new data. For example, the model “if the customer name is Ben Carlson, Margareta Wuyts, … or Jeremy Frank, then it is a bike buyer” might make perfect predictions on your historical data, but it is clear that it will be of little help in making predictions on new customers: it is heavily overfitted. This is why we split the historical data in two sets: training data, in which the system searches for patterns, and test data, which we use to test the quality of the model. This is even built into the SQL Server Analysis Services wizard for constructing mining models, which by default proposes to keep 30% of the data separate for testing.

Validation set

But test data sets raise an issue as well: we often need to test a lot of different mining models with different parameter settings to find a near-optimal result. This is an iterative process, in which we create a few models, test them on the test set, see which data mining techniques and parameters work best, use that knowledge to set up a second iteration of models to be tested, and so on. But in this way, the data mining developer is introducing knowledge from the test set into the development process: imagine that in our test set age is a strong indicator; then we will favor models which use it. The overall result is that the estimated quality of the predictions made on the test set is no longer a good estimate of the expected quality of the predictions on new data. It is already slightly biased towards our test set, and typically overestimates the predictive quality of our model. This is where validation sets come in: before we get started with any data mining in the first place, we should set some of our historical data (e.g. 20%) apart in a validation set. The remaining 80% is then split into training and test data. Once we’re finished with our data mining, we test our model one last time, on data it has never seen: not as training data, not as test data.
Our validation set is (from the data mining point of view) truly new data, and may give the best impression of the expected predictive quality of our mining model.

How do we create validation sets?

In contrast to test data sets, the mining wizard does not allow us to set apart a validation set, so we need to do this in the data preparation phase (see the CRISP-DM methodology for more information on the different phases of the data mining process). If you prefer to prepare your data with T-SQL statements, you can use an approach based on NEWID() to randomly select a certain set of data (a minimal sketch follows at the end of this post), but be careful: if you rerun the statement, a different subset will be selected. Another approach is to use SSIS (Integration Services), which has a percentage sampling transformation that is ideal for this job: it assigns each row an n% likelihood of being selected, and because of that it does not need to cache all rows in memory (in contrast to the row sampling transformation). An advantage over the NEWID() approach is that we can set the seed for the random number generator, so that results are reproducible if we want.

How do we use validation sets?

Using validation sets is easy. Just make sure that the table you created with the validation data is in the same data source as the one you used for the SSAS project. Then, on the Mining Accuracy Chart tab of the mining model in SSAS, select just the best performing model(s) and choose the radio button to use a different data set. Click the ellipsis button (…) and select the table or view that contains the validation set. Join the proper columns from the validation set with the mining model, and you’re set! Now you can create lift or profit charts and build a classification matrix against the validation set. Happy mining!

Nico
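As promised above, here is a minimal T-SQL sketch of a NEWID()-based split; the table and column names (dbo.CustomerHistory, CustomerKey and the two target tables) are made up for the example, and the approach linked in the post may differ in its details:

-- Set roughly 20% of the historical rows apart as a validation set.
SELECT TOP (20) PERCENT *
INTO dbo.CustomerValidation
FROM dbo.CustomerHistory
ORDER BY NEWID();   -- random order, so rerunning the statement picks a different 20%

-- Everything else remains available for the mining structure,
-- which will itself hold out 30% of it as test data.
SELECT h.*
INTO dbo.CustomerTrainAndTest
FROM dbo.CustomerHistory AS h
WHERE NOT EXISTS (SELECT 1
                  FROM dbo.CustomerValidation AS v
                  WHERE v.CustomerKey = h.CustomerKey);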

Using Reporting Services with an Oracle package / stored procedure

Reporting Services can create data sets on top of Oracle stored procedures. Sure, the manual says so, but this afternoon I tried to get things up and running, and I want to share some of the dead ends and problems I ran into; maybe it can help you avoid these obstacles. Don’t panic, not all paths are dead ends: we did get to the stored procedures in Oracle from Reporting Services. Please note that I’m not an Oracle expert, nor did I spend days figuring these things out. So if you can share your experience, please do!

Getting started

There are two important decisions that can make this querying go wrong: which data provider do we use, and how do we create the procedure in Oracle?

First the data provider. I installed Oracle 10g Express Edition on the same machine that was running Microsoft Reporting Services 2008 R2. To test the connection with Oracle independently of the Microsoft product stack, I also installed the freeware version of Toad. A first frustration I ran into was that I was unable to connect to Oracle from Toad on my 64-bit version of Windows 7. I could access Oracle from the web interface, but not from Toad. Installing the client tools for Oracle Express didn’t help. So I moved to a 32-bit box of Windows 7. Installing Oracle Express with client tools on this box and… hurray, Toad can talk to Oracle. Next, I used Integration Services to push the DimGeography table from AdventureWorksDW into my Oracle database. Time to query the data from Reporting Services. I first tried sending an explicit query with the three data providers that I have on my machine: Microsoft OLE DB Provider for Oracle (MSDAORA), Oracle Provider for OLE DB (OraOLEDB) and Oracle (OracleClient .NET).

MSDAORA

Configuring the data provider is straightforward; just remember not to use “.” or “(local)” as the name of the local machine, but localhost (or 127.0.0.1) instead. When I create a dataset with query type Text, I can start the graphical query builder. The query builder lists all tables and views nicely, and is clever enough to put double quotes around my object names. Query type Table is available as well, but unfortunately this one does not list the available tables; I must type the table name myself. Don’t forget to put double quotes around the table name, or Oracle can become angry at you! Query type Stored Procedure is available too, but whatever I tried, I could not get it working. Nothing shows up in the drop-down box, and typing the procedure name results in an error asking for parameters. When I start adding parameters to the call as well, it either claims ORA-06550 PLS-00201 (identifier must be declared), ORA-06550 PLS-00306 (wrong number or types of arguments), or ORA-01008 (not all variables bound). I gave up on this one.

OraOLEDB

This Oracle OLE DB driver is just as easy to configure as MSDAORA. When building a query in the graphical query builder environment, the query builder doesn’t put double quotes around table and column names, which causes error ORA-00942: table or view does not exist. Of course we can add the quotes manually, but it’s nicer if computers work for us, not the other way around. Apart from this, the driver seems to give the same results as MSDAORA, and I can’t get the stored procedure calls up and running either. Next!

OracleClient

Last one down the list. Configuring the data source is comparable to the two previous ones: local connections should refer to localhost or 127.0.0.1, not just “.”. As query type, only Text and Stored Procedure are available, not Table.
The Text graphical query builder behaves as with MSDAORA. But the very good news is that if you create an Oracle stored procedure in the proper way (more on that in a second), you can select the Stored Procedure radio button and the drop-down box will be populated with the Oracle stored procedures! Just select one from the list and it will behave exactly as on SQL Server: all the regular parameters of the stored procedure become report parameters, and a data set is built based on the first table returned by the stored procedure.

Building the Oracle stored procedure

Oracle is somewhat different from SQL Server when it comes to stored procedures. In Oracle you can create a package, a container in which you can declare and implement variables, stored procedures, functions, and so on. In contrast to SQL Server, a stored procedure in Oracle cannot return a table; only functions can do so. But we still need to use a stored procedure anyhow if we want it to show up in the Reporting Services drop-down box. The trick is to make sure this stored procedure builds up the result in a REF CURSOR, and to make this ref cursor an output parameter. To illustrate this, here is the code to get a list of all cities in a particular country (a lot of thanks to Philippe Clercq for writing this!):

CREATE OR REPLACE PACKAGE sp_country AS
    TYPE ref_cursor IS REF CURSOR;
    PROCEDURE get_city(country_name IN varchar2 DEFAULT NULL, po_cursor OUT ref_cursor);
END sp_country;

CREATE OR REPLACE PACKAGE BODY sp_country AS
    PROCEDURE get_city(country_name IN varchar2, po_cursor OUT ref_cursor)
    IS
    BEGIN
        OPEN po_cursor FOR
            SELECT "GeographyKey", "City", "StateProvinceCode",
                   "StateProvinceName", "CountryRegionCode", "EnglishCountryRegionName",
                   "SpanishCountryRegionName", "FrenchCountryRegionName", "PostalCode",
                   "SalesTerritoryKey"
            FROM DW."DimGeography" g
            WHERE upper(g."EnglishCountryRegionName") LIKE upper(country_name || '%');
    END;
END sp_country;

And with that, we could finally build our Reporting Services report on top of Oracle Express Edition (XE) and have the stored procedures show up in the drop-down box. Of course, once your data is in, it doesn’t matter anymore whether you build your reports on top of SQL Server or Oracle.

Nico

Jump to report action hides parameters

Reporting Services offers three types of actions. One of them is the “Jump to report” action, often used for a drill-through scenario where you jump from an overview report into a report that provides detailed data. Usually the detailed report has parameters (such as the month and region for which we want to see the details) which are set by the overview report. So usually end users do not need to see the parameters of the detail report, since they already set those parameters by invoking the action. But sometimes the detail report has parameters that still need to be set explicitly. Unfortunately, the parameter list is always collapsed, which might be confusing for end users, and I haven’t discovered a way to convince Reporting Services to show the parameter area… as long as we are using the Jump to report action. But there is a “Jump to URL” action as well, in which we can control the complete URL. If you combine that with the power of Reporting Services URL rendering, you can get a drill-through report with visible parameters. The price you pay is more work in writing the correct expressions to manually build the URL, which makes your report slightly more expensive to maintain… everything comes at a price…
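For illustration only (the server, folder, report and parameter names below are made up), a Jump to URL expression along these lines would open the detail report with its parameter area visible, using the rc:Parameters URL access parameter:

="http://myserver/ReportServer?/Sales/SalesDetail&Month=" & Fields!Month.Value & "&rc:Parameters=true"

Any parameter the overview report already sets can be appended to the URL in the same way; rc:Parameters=true simply asks the report viewer not to hide the parameter area.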