If you are a new and fashionable Data Scientist or an old-fashioned analyst, have some interest in Data Mining, and pay attention to the new stuff coming from the biggest players in this area, you have probably heard of Microsoft Azure ML.
First, a few words about last year’s Gartner BI Magic Quadrant: although the same companies are still Leaders (which is not surprising, as BI technologies don’t change rapidly), data discovery capabilities look more and more important.
Therefore, the attempt to deliver a brand new data discovery platform looks right, doesn’t it?
Why did I say “brand new”? Well, fifteen years ago, Microsoft released SQL Server 2000, the first version with Data Mining algorithms included (two at that time). SQL Server 2005 was a major breakthrough in this respect (the number of supported DM algorithms went up to nine), and three years later, in SQL Server 2008, the Microsoft Time Series algorithm was completely rewritten. Since then nothing has happened: the quite mature DM platform started to look a little old-fashioned, and the lack of any statistical tools became more and more problematic.
But Machine Learning is often defined as a combination of Data Mining and Statistical Engineering, so everything should be fine now.
Disclaimer. As a Data Mining practitioner who relies heavily on Microsoft tools, I’m excited about this new platform. But my main goal is to solve customer problems as efficiently (in terms of time, money, usability etc.) as possible. So I spent a couple of days replaying some real-world scenarios in Azure ML. I’m not an Azure ML expert. In fact, I hope that I just couldn’t find the right way to achieve something in Azure ML, and I will be more than happy if somebody shows me how to do it or corrects my mistakes.
Why do I like Azure ML?
1. For the “drag and drop” approach to building experiments.
This is not only easy (almost all third-party DM tools are easy to use) but also flexible (e.g. the same inputs and outputs can be used multiple times), and it lets you follow (i.e. inspect) the data flow. This design has some really nice features, e.g. the ability to save a trained model and reuse it in different experiments. Not to mention that it looks fairly appealing.
2. Because of R script support.
If you are a more code-oriented person, instead of building an experiment from multiple blocks you can prepare your data by writing R scripts. And it works seamlessly. Plus, if the R package you need is not available by default, you can upload it to the experiment yourself (this great blog post explains how to do this: http://blogs.technet.com/b/saketbi/archive/2014/08/20/microsoft-azure-ml-amp-r-language-extensibility.aspx).
3. For a wide range of supported DM algorithms.
Ensemble models (models that combine a panel of learners instead of relying on a single one), like Boosted Decision Trees, are very welcome. Finally.
4. Because of embedded data preparation blocks.
These blocks range from simple transformations, through statistical functions, to automated feature selection. Again, better late than never.
5. Due to valuable documentation.
Azure ML is still in beta, so the official documentation is rather sparse, but even now there are interesting materials (like the tutorials at http://azure.microsoft.com/en-us/documentation/services/machine-learning/), examples (you will find them when you launch Azure ML Studio), blog posts (like the one I showed you earlier) and a book (Predictive Analytics with Microsoft Azure Machine Learning: Build and Deploy Actionable Solutions in Minutes by Roger Barga, Valentine Fontama, and Wee Hyong Tok is a nice starting point).
6. Due to easy implementation.
Published models are available as Web services, and sample client code (in C#, Python and R) is provided to you. In a nutshell, calling a prediction query is a copy-paste task.
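To give an idea of how little client code such a call needs, here is a minimal Python sketch. It is not the sample code generated by Azure ML: the URL, the API key placeholder, the function names and, in particular, the JSON payload shape are assumptions made for illustration, so copy the real values and schema from the API help page of your own published service.

```python
import json
import urllib.request

# Hypothetical placeholders: take the real values from your service's API help page.
SCORING_URL = "https://example.invalid/score"
API_KEY = "<your-api-key>"


def build_request(columns, rows, url=SCORING_URL, api_key=API_KEY):
    """Build (but do not send) an HTTP POST request for a prediction query."""
    # Assumed payload shape; the generated sample code shows the real schema.
    payload = {
        "Inputs": {
            "input1": {
                "ColumnNames": list(columns),
                "Values": [list(row) for row in rows],
            }
        },
        "GlobalParameters": {},
    }
    body = json.dumps(payload).encode("utf-8")
    headers = {
        "Content-Type": "application/json",
        "Authorization": "Bearer " + api_key,
    }
    return urllib.request.Request(url, data=body, headers=headers, method="POST")


def score(request):
    """Send the request and return the decoded JSON response."""
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read().decode("utf-8"))
```

Building the request separately from sending it makes it easy to inspect the payload before pointing the code at a real endpoint with `score(build_request(columns, rows))`.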
7. For extensive Text Analytics support.
I’m playing with those components right now and they look promising. Hopefully, I will be able to say more about Text Analytics support soon.
What drives me crazy about Azure ML?
1. Slow performance.
Almost all components are painfully slow. The simplest data manipulation took way longer than on-premises. The same goes for model training and answering prediction queries.
2. Inability to see model content.
The Train Model component has only one output, but instead of a model visualizer or the fitted regression formula, it gives me something like this:
1. Difficulty in separating out some attributes (like email addresses) that I would like to have attached to the results but hidden from the training data.
I suppose this is “by design”: after all, Azure ML should be used for prediction only, not for descriptive purposes. Still, I would love to discuss this design principle.
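To make the wish concrete, this is the behavior I have in mind, sketched in plain Python outside Azure ML. The function names and the sample rows are hypothetical, and the sketch assumes that predictions come back in the same order as the input rows.

```python
def split_passthrough(rows, passthrough):
    """Separate pass-through columns (e.g. email) from the training features."""
    features = [
        {key: value for key, value in row.items() if key not in passthrough}
        for row in rows
    ]
    attached = [{key: row[key] for key in passthrough} for row in rows]
    return features, attached


def reattach(attached, predictions):
    """Join the pass-through columns back onto the predictions, row by row."""
    return [dict(extra, prediction=p) for extra, p in zip(attached, predictions)]


rows = [
    {"email": "a@example.com", "age": 30},
    {"email": "b@example.com", "age": 50},
]
features, attached = split_passthrough(rows, {"email"})
# 'features' is what the model would see; 'attached' never reaches training.
```

The model is trained and scored on `features` only, and `reattach` puts the email addresses back next to each prediction.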
2. Laborious way of creating and comparing multiple copies of the same model.
In SQL Server, to build another copy of a defined model you just press Ctrl+C and Ctrl+V. Then you can tweak some algorithm parameters, train the copies and compare them all at once. In Azure ML, after copy/paste you will probably have to reconnect some blocks. And then comes a surprise: the Evaluate Model block has only two inputs.
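What I mean by comparing them all at once is roughly the following, again sketched in plain Python rather than in Azure ML. The model names and the plain accuracy metric are assumptions chosen for illustration; any number of models can be ranked in one pass.

```python
def accuracy(actual, predicted):
    """Fraction of rows where the predicted label equals the actual one."""
    matches = sum(1 for a, p in zip(actual, predicted) if a == p)
    return matches / len(actual)


def compare_models(actual, predictions_by_model):
    """Rank any number of models by accuracy, best first."""
    scores = {
        name: accuracy(actual, predicted)
        for name, predicted in predictions_by_model.items()
    }
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


actual = [1, 0, 1, 1]
predictions_by_model = {
    "tree": [1, 0, 0, 1],
    "logit": [1, 0, 1, 1],
    "bayes": [0, 0, 0, 1],
}
ranking = compare_models(actual, predictions_by_model)
```

Here `ranking` lists all three hypothetical models side by side, which is the view that a two-input evaluation block cannot give in a single step.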
3. No Azure ML add-in for Excel.
RStudio is great and I’m really a fan of it, but for whatever reason end users prefer Excel…