Author Archives: Cortana Intelligence and ML Blog Team

Deep Learning on the New Ubuntu-Based Data Science Virtual Machine for Linux

Authored by Paul Shealy, Senior Software Engineer, and Gopi Kumar, Principal Program Manager, at Microsoft.

Deep learning has received significant attention recently for its ability to create machine learning models with very high accuracy. It’s especially popular in image and speech recognition tasks, where the availability of massive datasets with rich information makes it feasible to train ever-larger neural networks on powerful GPUs and achieve groundbreaking results. Although a variety of deep learning frameworks are available, getting started with one means taking time to download and install the framework, libraries, and other tools before writing your first line of code.

Microsoft’s Data Science Virtual Machine (DSVM) is a family of popular VM images published on the Azure marketplace with a broad choice of machine learning and data science tools. Microsoft is extending it with the introduction of a brand-new offering in this family – the Data Science Virtual Machine for Linux, based on Ubuntu 16.04 LTS – that also includes a comprehensive set of popular deep learning frameworks.

Deep learning frameworks in the new VM include:

  • Microsoft Cognitive Toolkit
  • Caffe and Caffe2
  • TensorFlow
  • H2O
  • MXNet
  • NVIDIA DIGITS
  • Theano
  • Torch, including PyTorch
  • Keras

The image can be deployed on VMs with GPUs or CPU-only VMs. It also includes OpenCV, matplotlib and many other libraries that you will find useful.

Run dsvm-more-info at a command prompt or visit the documentation for more information about these frameworks and how to get started.

Sample Jupyter notebooks are included for most frameworks. Start Jupyter or log in to JupyterHub to browse the samples for an easy way to explore the frameworks and get started with deep learning.

GPU Support

Training a deep neural network requires considerable computational resources, so training can be made significantly faster by running on one or more GPUs. Azure now offers NC-class VM sizes with 1-4 NVIDIA K80 GPUs for computational workloads. All deep learning frameworks on the VM are compiled with GPU support, and the NVIDIA driver, CUDA and cuDNN are included. You may also choose to run the VM on a CPU-only size if you prefer, and that is supported without code changes. And because this is running on Azure, you can choose a smaller VM size for setup and exploration, then scale up to one or more GPUs for training.

The VM comes with nvidia-smi to monitor GPU usage during training and help optimize parameters to make full use of the GPU. It also includes NVIDIA Docker if you want to run Docker containers with GPU access.
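
As a rough illustration of monitoring GPU usage from a script rather than by eye, the sketch below shells out to nvidia-smi in its CSV query mode and parses the result. The helper names (parse_gpu_stats, query_gpu_stats) and the sample output string are invented for this example; only the nvidia-smi query flags are the tool's own. It assumes every requested field comes back as a plain integer, which may not hold on all GPUs.

```python
import shutil
import subprocess

# Fields requested from nvidia-smi, in the order they are parsed below.
QUERY_FIELDS = "index,utilization.gpu,memory.used,memory.total"


def parse_gpu_stats(csv_text):
    """Parse `nvidia-smi --format=csv,noheader,nounits` output into dicts.

    Assumes each field is an integer; a GPU that reports a field as
    unsupported would raise ValueError here.
    """
    stats = []
    for line in csv_text.strip().splitlines():
        if not line.strip():
            continue
        index, util, mem_used, mem_total = [f.strip() for f in line.split(",")]
        stats.append({
            "index": int(index),
            "utilization_pct": int(util),
            "memory_used_mib": int(mem_used),
            "memory_total_mib": int(mem_total),
        })
    return stats


def query_gpu_stats():
    """Return per-GPU stats, or an empty list when nvidia-smi is not on PATH."""
    if shutil.which("nvidia-smi") is None:
        return []
    result = subprocess.run(
        ["nvidia-smi", f"--query-gpu={QUERY_FIELDS}",
         "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    )
    return parse_gpu_stats(result.stdout)


# Hypothetical sample output for a two-GPU VM (values are illustrative only).
SAMPLE_OUTPUT = "0, 87, 10240, 11441\n1, 3, 120, 11441\n"

for gpu in parse_gpu_stats(SAMPLE_OUTPUT):
    print(f"GPU {gpu['index']}: {gpu['utilization_pct']}% busy, "
          f"{gpu['memory_used_mib']}/{gpu['memory_total_mib']} MiB")
```

Calling query_gpu_stats() periodically during training is one way to spot an under-used GPU, for instance when the batch size is too small to keep it busy.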

Data Science Virtual Machine

The Data Science Virtual Machine family of VM images on Azure includes the DSVM for Windows, a CentOS-based DSVM for Linux, and an Ubuntu-based DSVM for Linux. These images come with popular data science and machine learning tools, including Microsoft R Server Developer Edition, Microsoft R Open, Anaconda Python, Julia, Jupyter notebooks, Visual Studio Code, RStudio, xgboost, and many more. A full list of tools for all editions of the DSVM is available here. The DSVM has proven popular with data scientists as it helps them focus on their tasks and skip mundane steps around tool installation and configuration.


To try deep learning on Windows with GPUs, the Deep Learning Toolkit for DSVM contains all tools from the Windows DSVM plus GPU drivers, CUDA, cuDNN, and GPU versions of CNTK, MXNet, and TensorFlow.

Get Started Today

We invite you to use the new image to explore deep learning frameworks or for your machine learning and data science projects – DSVM for Linux (Ubuntu) is available today through the Marketplace. Free Azure credits are available to help get you started.

Paul & Gopi

Build & Deploy Machine Learning Apps on Big Data Platforms with Microsoft Linux Data Science Virtual Machine

This post is authored by Gopi Kumar, Principal Program Manager in the Data Group at Microsoft.

This post covers our latest additions to the Microsoft Linux Data Science Virtual Machine (DSVM), a custom VM image on Azure, purpose-built for data science, deep learning and analytics. Offered in both Microsoft Windows and Linux editions, DSVM includes a rich collection of tools, seen in the picture below, and makes you more productive when it comes to building and deploying advanced machine learning and analytics apps.

The central theme of our latest Linux DSVM release is to enable the development and testing of ML apps for deployment to distributed scalable platforms such as Spark, Hadoop and Microsoft R Server, for operating on data at a very large scale. In addition, with this release, DSVM also offers Julia Computing’s JuliaPro on both Linux and Windows editions.


Here’s more on the new DSVM components you can use to build and deploy intelligent apps to big data platforms:

Microsoft R Server 9.0

Version 9.0 of Microsoft R Server (MRS) is a major update to enterprise-scale R from Microsoft, supporting parallel and distributed computation. MRS 9.0 supports analytics execution in the Spark 2.0 context. There’s a new architecture and simplified interface for deploying R models and functions as web services via a new library called mrsdeploy, which makes it easy to consume models from other apps using the open Swagger framework.

Local Spark Standalone Instance

Spark is one of the premier platforms for highly scalable big data analytics and machine learning. Spark 2.0 launched in mid-2016 and brings several improvements such as the revised machine learning library (MLlib), scaling and performance optimization, better ANSI SQL compliance and unified APIs. The Linux DSVM now offers a standalone Spark instance (based on the Apache Spark distribution) and a PySpark kernel in Jupyter to help you build and test applications on the DSVM and deploy them on large-scale clusters like Azure HDInsight Spark or your own on-premises Spark cluster. You can develop your code using either Jupyter notebooks or the included community edition of the PyCharm IDE for Python or RStudio for R.
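
To make the build-locally, deploy-to-a-cluster workflow a little more concrete, here is a minimal PySpark sketch. It is an illustration under stated assumptions, not the DSVM's own sample code: the SPARK_MASTER_URL environment variable, the function names and the application name are all hypothetical. The idea is that the same word-count logic runs against a local master while developing and against a cluster master when one is configured.

```python
import os


def resolve_master(env=None):
    """Pick the Spark master URL, defaulting to all local cores.

    SPARK_MASTER_URL is a hypothetical variable name used only in this sketch.
    """
    env = os.environ if env is None else env
    return env.get("SPARK_MASTER_URL", "local[*]")


def count_words(lines, master=None):
    """Count words with Spark's RDD API and return a plain dict."""
    # Imported here so this module still loads where PySpark is not installed.
    from pyspark.sql import SparkSession

    spark = (SparkSession.builder
             .master(master or resolve_master())
             .appName("dsvm-wordcount-sketch")
             .getOrCreate())
    try:
        counts = (spark.sparkContext.parallelize(lines)
                  .flatMap(lambda line: line.split())
                  .map(lambda word: (word.lower(), 1))
                  .reduceByKey(lambda a, b: a + b)
                  .collect())
        return dict(counts)
    finally:
        # Release the local or remote Spark resources when done.
        spark.stop()
```

On the VM, calling count_words(["to be or not to be"]) would exercise the local instance. When a job is launched with spark-submit, the master is normally supplied on the command line, so an explicit .master(...) call like the one above is often dropped in that setting.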

Single Node Local Hadoop (HDFS and YARN) Instance

To make it easier to develop Hadoop programs and/or use HDFS storage locally for development and testing, a single-node Hadoop installation is built into the VM. Also, if you are developing on the Microsoft R Server for execution in Hadoop or Spark remote contexts, you can first test things locally on the Linux DSVM and then deploy the code to a remote scaled-out Hadoop or Spark cluster or to Microsoft R Server. These DSVM additions are designed to help you iterate rapidly when developing and testing your apps, before they get deployed into large-scale production big data clusters.

The DSVM is also a great environment for self-learning and running training classes on big data technologies. We provide sample code and notebooks to help you get started quickly on the different data science tools and technologies offered.

DSVM Resources

New to DSVM? Here are resources to get you started:

Linux Edition

Windows Edition

The goal of DSVM is to make data scientists and developers highly productive in their work and provide a broad array of popular tools. We hope you find it useful to have these new big data tools pre-installed with the DSVM.

We always appreciate feedback, so please send in your comments below or share your thoughts with us at the DSVM community forum.

Gopi

Julia – A Fresh Approach to Numerical Computing

This post is authored by Viral B. Shah, co-creator of the Julia language and co-founder and CEO at Julia Computing, and Avik Sengupta, head of engineering at Julia Computing.

The Julia language provides a fresh new approach to numerical computing, where there is no longer a compromise between performance and productivity. A high-level language that makes writing natural mathematical code easy, with runtime speeds approaching raw C, Julia has been used to model economic systems at the Federal Reserve, drive autonomous cars at the University of California, Berkeley, optimize the power grid, calculate solvency requirements for large insurance firms, model the US mortgage markets and map all the stars in the sky.

It would be no surprise then that Julia is a natural fit in many areas of machine learning. ML, and in particular deep learning, drives some of the most demanding numerical computing applications in use today. And the powers of Julia make it a perfect language to implement these algorithms.

One of the key promises of Julia is to eliminate the so-called “two language problem.” This is the phenomenon of writing prototypes in a high-level language for productivity, but having to dive down into C for performance-critical sections when working on real-life data in production. This is not necessary in Julia, because there is no performance penalty for using high-level or abstract constructs.

This means both the researcher and engineer can now use the same language. One can use, for example, custom kernels written in Julia that will perform as well as kernels written in C. Further, language features such as macros and reflection can be used to create high-level APIs and DSLs that increase the productivity of both the researcher and engineer.

GPU

Modern ML is heavily dependent on running on general-purpose GPUs in order to attain acceptable performance. As a flexible, modern, high-level language, Julia is well placed to take advantage of modern hardware to the fullest.

First, Julia’s exceptional FFI capabilities make it trivial to use the GPU drivers and CUDA libraries to offload computation to the GPU without any additional overhead. This allows Julia deep learning libraries to use GPU computation with very little effort.

Beyond that, libraries such as ArrayFire allow developers to use natural-looking mathematical operations, while performing those operations on the GPU instead of the CPU. This is probably the easiest way to utilize the power of the GPU from code. Julia’s type and function abstractions make this possible with, once again, very little performance overhead.

Julia has a layered code generation and compilation infrastructure that leverages LLVM. (Incidentally, it also provides some amazing introspection facilities into this process.) Based on this, Julia has recently developed the ability to directly compile code onto the GPU. This is an unparalleled feature among high-level programming languages.

While the x86 CPU with a GPU is currently the most popular hardware setup for deep learning applications, there are other hardware platforms that have very interesting performance characteristics. Among them, Julia now fully supports the Power platform, as well as the Intel KNL architecture.

Libraries

The Julia ecosystem has, over the last few years, matured sufficiently to materialize these benefits in many domains of numerical computing. Thus, there is a rich set of libraries for ML available in Julia right now. Deep learning frameworks with natural bindings to Julia include MXNet and TensorFlow. Those wanting to dive into the internals can use the pure Julia libraries, Mocha and Knet. In addition, there are libraries for random forests, SVMs, and Bayesian learning.

Using Julia with all these libraries is now easier than ever. Thanks to the Data Science Virtual Machine (DSVM), running Julia on Azure is just a click away. The DSVM includes a full distribution of JuliaPro, the professional Julia development environment from Julia Computing Inc, along with many popular statistical and ML packages. It also includes the IJulia system, which brings Jupyter notebooks to the Julia language. Together, they create the perfect environment for data science, both for exploration and production.

Viral Shah
@Viral_B_Shah