Author Archives: Julia Computing, Inc.

Automatic Differentiation Meets Conventional Machine Learning

Differentiation is a central problem in many fields, including deep learning, finance, and scientific computing. The two conventional techniques are:

  • Symbolic differentiation, which can result in complex and redundant expressions

  • Numerical differentiation, which can lead to large numerical errors
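As a minimal sketch of the difference (assuming the Zygote package, which is used later in this post): a finite-difference estimate depends on a step size and suffers truncation and round-off error, while AD differentiates the program itself:

using Zygote

f(x) = sin(x)^2

# Numerical differentiation: central difference, accuracy limited by the step size h
numeric_deriv(f, x; h=1e-6) = (f(x + h) - f(x - h)) / (2h)

# Automatic differentiation: differentiates the code itself
ad_deriv(f, x) = gradient(f, x)[1]

x0 = 1.3
numeric_deriv(f, x0)   # approximately 2 * sin(x0) * cos(x0), with finite-difference error
ad_deriv(f, x0)        # matches the analytic derivative to floating-point precision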

Automatic differentiation (AD) has been behind recent advances in deep learning. The paper A Differentiable Programming System to Bridge Machine Learning and Scientific Computing makes AD a first-class feature of the Julia language, allowing us to differentiate whole programs rather than just network architectures.

Julia helps bring AD to the world of machine learning beyond neural networks. We will look at a specific use case and see how Julia allows us to compose functionality from multiple packages seamlessly. (JuliaLang: The Ingredients for a Composable Programming Language)

Introduction

One of the advantages of Julia is its composability. Automatic differentiation (via Zygote) can be applied inside a conventional machine learning tool like XGBoost. The two packages were developed independently, yet they work together seamlessly. To illustrate this, let's look at how easily custom loss functions can be implemented in XGBoost with Zygote.

XGBoost ships with many built-in loss functions, but these can be suboptimal for real-world problems. Consider testing for a rare disease. False positives are undesirable, but they are not costly mistakes in the sense that further tests can easily rule them out. False negatives are far worse: a person who has the condition and needs immediate care is incorrectly identified as healthy. Such asymmetric error costs are common in real-world machine learning problems, and this is where custom loss functions are useful.

The default loss functions available off the shelf might not be optimal for the business objectives. A custom loss function that closely matches the business objectives, on the other hand, can take this asymmetry in error costs into account. In the disease detection scenario above, false negative errors are far more costly than false positives, so a custom loss function that penalizes the model heavily for false negatives will yield a model that is averse to them.
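As a toy illustration of this asymmetry (the weight 1.5 and the scores ±3 are arbitrary; the weighted log loss itself is derived later in this post), a confident false negative costs 1.5 times as much as an equally confident false positive:

σ(x) = 1 / (1 + exp(-x))

# Weighted log loss with weight w on the false-negative term
wll(x, y; w=1.5) = -w * y * log(σ(x)) - (1 - y) * log(1 - σ(x))

wll(-3.0, 1.0)   # actual positive predicted negative: ≈ 4.57
wll( 3.0, 0.0)   # actual negative predicted positive: ≈ 3.05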

Problem formulation

We have chosen to predict the survival chances of passengers on the Titanic using a supervised ML technique called XGBoost. It is an implementation of gradient boosting, in which an ensemble of weak decision tree learners is combined to produce a strong model.

In order to deal with asymmetric penalties, like the case of disease detection that was discussed above, XGBoost permits custom loss functions. First we build a simple baseline model. Let’s see how we do this in Julia.

Load Data

The datasets have been downloaded from Kaggle.

We use CSVFiles to load the training data, and DataFrames for manipulation.

julia> using CSVFiles, DataFrames
julia> df = DataFrame(CSVFiles.load("train.csv"))

The names method will give the features in this dataframe.

  julia> names(df)
  12-element Array{Symbol,1}:
  :PassengerId
  :Survived
  :Pclass
  :Name
  :Sex
  :Age
  :SibSp
  :Parch
  :Ticket
  :Fare
  :Cabin
  :Embarked

Survived is the target variable; we will have to clean and process the rest of the features before a model can be built. We will build a simple model with a few features: Age, Embarked, Sex, Pclass, SibSp, Parch, and Fare.

Data preprocessing

We will have to clean up the data a bit. The following line counts the missing values in the Embarked column, which are stored as empty strings:

julia> sum(df[:,:Embarked] .== "")
  2

XGBoost does support learning with missing values, but in this case there are only 2 of them, so it is better to impute them with the most frequent value in the column.
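As a quick check that "S" is indeed the most frequent port (a small sketch, assuming the StatsBase package is installed):

julia> using StatsBase
julia> countmap(df[:, :Embarked])   # counts per port; "S" is the most common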

Replace missing values with the most frequent port:

julia> df[df[:,:Embarked] .== "", :Embarked] .= "S"

The Age column also has missing values. The following command tells us that there are 177 rows with missing values in Age:

julia> sum(ismissing.(df[:, :Age]))
177

Age being a numerical feature, it is better to use the average value of the Age column to impute the missing values.

Replace missing values with the average:

julia> using Statistics
julia> average_age = mean(df[.!ismissing.(df[:, :Age]), :Age])
julia> df[ismissing.(df[:, :Age]), :Age] .= average_age

We can use one-hot encoding for categorical features such as Pclass and Embarked. One-hot encoding creates a binary feature for each unique value of the categorical feature. For instance, the feature Embarked has 3 values: S, C and Q. One-hot encoding will create 3 binary features, each corresponding to one of these values. The binary feature Embarked_S will have 1s in rows where the value of Embarked was S, and 0s in all other rows.

julia> for i in unique(df.Pclass)
           df[:, Symbol("Pclass_" * string(i))] = Int.(df.Pclass .== i)
       end

julia> for i in unique(df.Embarked)
           df[:, Symbol("Embarked_" * string(i))] = Int.(df.Embarked .== i)
       end

Gender can be encoded as a binary feature:

julia> gender_dict = Dict("male" => 1, "female" => 0);
julia> df.Sex = map(x -> gender_dict[x], df.Sex);
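Before splitting the data, the non-numeric columns need to be dropped so that the remaining features can be converted to a Float32 matrix. This step is left implicit in the walkthrough; a sketch of it, assuming we keep only the target and the features prepared above:

julia> df = select(df, Not([:PassengerId, :Name, :Ticket, :Cabin, :Pclass, :Embarked]))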

Building our baseline model

Let's split the dataset into training and validation sets. The training set can be created as below:

julia> x_train = convert(Matrix{Float32},select(df[1:800,:],Not(:Survived)))
julia> y_train = convert(Array{Float32}, df[1:800,:Survived])

Validation set:

julia> x_val = convert(Matrix{Float32},select(df[801:end,:],Not(:Survived)))
julia> y_val = convert(Array{Float32}, df[801:end,:Survived])

Create a DMatrix (this and the training step below use the XGBoost package):

julia> using XGBoost
julia> train_dmat = DMatrix(x_train, label=y_train)

Train the model:

julia> bst_base = xgboost(train_dmat, 2, eta=0.3, objective="binary:logistic", eval_metric="auc")
[1] train-auc:0.893250
[2] train-auc:0.899080

Get predictions on the validation set:

julia> ŷ = predict(bst_base, x_val)

The following function computes the confusion matrix and prints the accuracy and weighted F1 score:

function evaluate(y, ŷ; threshold=0.5)
    out = zeros(Int64, 2, 2)
    ŷ = Int.(ŷ .>= threshold)
    out[1,1] = sum((y .== 0) .& (ŷ .== 0))
    out[2,2] = sum((y .== 1) .& (ŷ .== 1))
    out[2,1] = sum((y .== 1) .& (ŷ .== 0))
    out[1,2] = sum((y .== 0) .& (ŷ .== 1))
    r0 = out[1,1] / (out[1,1] + out[1,2])
    p0 = out[1,1] / (out[1,1] + out[2,1])
    f0 = 2 * p0 * r0 / (p0 + r0)
    r1 = out[2,2] / (out[2,2] + out[2,1])
    p1 = out[2,2] / (out[2,2] + out[1,2])
    f1 = 2 * r1 * p1 / (p1 + r1)
    println("Weighted f1 = ", round((sum(y .== 0.0)/length(y)) * f0 + (sum(y .== 1.0)/length(y)) * f1, digits=3))
    println("Accuracy = ", (out[2,2] + out[1,1]) / sum(out))
    out
end

Let’s look at the performance of the baseline model:

julia> evaluate(y_val, ŷ)
Weighted f1 = 0.845
Accuracy = 0.8461538461538461
2×2 Array{Int64,2}:
51 6
 8 26

Let's submit this to Kaggle to get a tangible measure of the model's performance. The current model sits at position 12,300 on the leaderboard.
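To generate a submission, test.csv has to go through the same preprocessing as the training data. A condensed sketch, where preprocess is a hypothetical helper that applies the steps above and returns a Float32 feature matrix:

julia> df_test = DataFrame(CSVFiles.load("test.csv"))
julia> x_test = preprocess(df_test)    # hypothetical helper: imputation, encoding, column selection
julia> submission = DataFrame(PassengerId = df_test.PassengerId,
                              Survived = Int.(predict(bst_base, x_test) .>= 0.5))
julia> using FileIO
julia> save("submission.csv", submission)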

Custom loss function

There are 8 false negatives. In order to reduce false negatives, we can weigh them more heavily than false positives in our loss function.

The custom loss function should have the following signature:

function weighted_loss(preds::Vector{Float32}, dtrain::DMatrix)
    gradients = # calculate gradients
    hessians = # calculate hessians
    return gradients, hessians
end

The conventional approach is to calculate the gradients and Hessians by hand and then translate them into code. Let's start by looking at the log loss function used by the model:

L(x, y) = -y ln(σ(x)) - (1 - y) ln(1 - σ(x))

Notice that the term -y ln(σ(x)), which penalizes false negatives, has a coefficient of 1.

We can change this weight to make the model penalize false negatives more than false positives. With a weight w on the false-negative term, the loss becomes:

L(x, y) = -w y ln(σ(x)) - (1 - y) ln(1 - σ(x))

XGBoost admits loss functions with the signature we defined above, so we need the gradient and Hessian of this loss to plug into that function.

The gradient with respect to x turns out to be:

∂L/∂x = σ(x) ((w - 1) y + 1) - w y

Differentiating the above again gives us the second derivative:

∂²L/∂x² = ((w - 1) y + 1) σ(x) (1 - σ(x))

Now we can complete our function with a slightly higher penalty on false negatives, namely w = 1.5:

function weighted_loss(preds::Vector{Float32}, dtrain::DMatrix)
    beta = 1.5
    y = get_info(dtrain, "label")         # true labels
    p = 1.0 ./ (1.0 .+ exp.(-preds))      # sigmoid of the raw scores
    grad = p .* ((beta - 1) .* y .+ 1) .- beta .* y
    hess = ((beta - 1) .* y .+ 1) .* p .* (1.0 .- p)
    return grad, hess
end

Note that this required some cumbersome calculus to implement the custom loss. A smarter approach is to invoke Julia's automatic differentiation to calculate the gradients and Hessians. Apart from not having to do the math, it gives us flexibility: a slight change in the loss function doesn't mean we have to redo the derivation.

Automatic differentiation

XGBoost outputs scores that need to be passed through a sigmoid function. We do this inside the custom loss function that we defined above. Let’s define it here explicitly:

σ(x) = 1/(1+exp(-x))

The weighted log loss can be defined as:

weighted_logistic_loss(x, y) = -1.5 * y * log(σ(x)) - (1 - y) * log(1 - σ(x))

We have added a weight of 1.5 to false negatives.

Now we are ready to take advantage of Zygote to calculate the gradient and Hessian:

using Zygote

grad_logistic(x, y) = gradient(weighted_logistic_loss, x, y)[1]

Zygote's gradient method differentiates the weighted_logistic_loss function with respect to the arguments passed alongside it. We then take the first element with [1] because we are interested in the derivative with respect to the first parameter, x.
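As a quick sanity check (the point x = 2.0, y = 1.0 is arbitrary), the AD gradient agrees with the hand-derived expression σ(x)((w - 1)y + 1) - wy for w = 1.5:

julia> x, y = 2.0, 1.0
julia> grad_logistic(x, y)                       # gradient from Zygote
julia> σ(x) * ((1.5 - 1) * y + 1) - 1.5 * y      # hand-derived gradient, same value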

Similarly, the Hessian can be calculated by differentiating the above gradient function:

hess_logistic(x, y) = gradient(grad_logistic, x, y)[1]

We can define the custom loss function using this gradient and Hessian:

function custom_objective(preds::Vector{Float32}, dtrain::DMatrix)
  y = get_info(dtrain, "label")      # true labels from the DMatrix
  grad = grad_logistic.(preds, y)    # elementwise gradients via Zygote
  hess = hess_logistic.(preds, y)    # elementwise Hessians via Zygote
  return grad, hess
end

This can be passed to the XGBoost trainer:

julia> bst = xgboost(train_dmat, 2, eta=0.3, eval_metric="auc", obj=custom_objective)
[1] train-auc:0.892897
[2] train-auc:0.899565

At this point we can evaluate the model, and compare it against the baseline:

julia> ŷ = predict(bst, x_val)
julia> evaluate(y_val, ŷ)
Weighted f1 = 0.878
Accuracy = 0.8791208791208791
2×2 Array{Int64,2}:
53 4
7 27

That’s quite an improvement for a small change in the loss function. We didn’t have to do any of the taxing math to try out this objective function. All we had to do was define the function and let Zygote take care of calculating the gradient and Hessian.

At this point, we can submit the results from the new model to Kaggle. We have been able to move 6400 places up on the leaderboard with very little effort.

This exercise illustrates the composability of Julia packages and the flexibility that it gives data scientists.

Newsletter February 2020

Julia Computing hosted a free one hour Webinar for 100+ pharmaceutical researchers to discuss pharmacology modeling using Pumas.jl. The Webinar was led by Vijay Ivaturi, Professor of Pharmacology at the University of Maryland School of Pharmacy who initiated and leads the Pumas project.

Julia Computing Webinars for Enterprise Users: Julia Computing provides free one-hour Webinars for enterprise users who want to learn more about Julia’s capabilities and case studies of successful Julia deployment in production.

Parallel Computing for Enterprises in Julia
Dr. Alan Edelman, MIT Professor of Applied Mathematics, Principal Investigator at MIT Julia Lab and MIT Computer Science and Artificial Intelligence Laboratory, co-creator of Julia, co-founder of Julia Computing
Tuesday February 18, 12 noon – 1 pm Eastern (US) Click here to register
Scientific Machine Learning for Enterprises in Julia
Dr. Chris Rackauckas, MIT Applied Mathematics Instructor and University of Maryland, Baltimore School of Pharmacy Senior Research Analyst
Tuesday February 25 12 noon – 1 pm Eastern (US) Click here to register
Pharmacology and Pharmacometrics in Julia with PumasAI and Pumas.jl
Dr. Vijay Ivaturi, University of Maryland, Baltimore School of Pharmacy Research Assistant Professor
Friday March 6 11:30 am – 12:30 pm GMT (12:30-1:30 pm CEST / 1:30-2:30 pm EEST / 5-6 pm IST) Click here to register

Julia Computing Live Online Training Courses: Julia Computing’s Dr. Matt Bauman leads three 8-hour live online training courses this month. Each course consists of two four hour training sessions. Click here to register today.

Introduction to Julia
Course Description
Part 1: Wed Feb 12 11 am – 3 pm (US EST)
Part 2: Thu Feb 13 11 am – 3 pm (US EST)
Click here to register
Introduction to Machine Learning and Artificial Intelligence in Julia
Course Description
Part 1: Wed Feb 19 11 am – 3 pm (US EST)
Part 2: Thu Feb 20 11 am – 3 pm (US EST)
Click here to register
Parallel Computing in Julia
Course Description
Part 1: Wed Feb 26 11 am – 3 pm (US EST)
Part 2: Thu Feb 27 11 am – 3 pm (US EST)
Click here to register

Julia Computing on DM Radio: Julia co-creator and Julia Computing CEO Viral Shah joined DM Radio’s Eric Kavanagh to discuss Artificial Intelligence as the Great Enabler. Click here to listen.

MLSys 2020: Mike Innes (Julia Computing) will present Sense & Sensitivities: The Path to General-Purpose Algorithmic Differentiation at MLSys in Austin March 2-4. An older version is available here.

Julia Machine Learning Workshop Comes to Prague: Avik Sengupta (Julia Computing) and Kevin O’Brien (Coillte) will present a workshop on Machine Learning in Julia at Machine Learning Prague March 20-22.

Universal Differential Equations for Scientific Machine Learning: Chris Rackauckas et al. have submitted a paper on Universal Differential Equations for Scientific Machine Learning. They describe how universal differential equations “augment scientific models with machine-learnable structures for scientifically-based learning” and “how [universal differential equations] can be utilized to discover previously unknown governing equations, accurately extrapolate beyond the original data, and accelerate model simulation, all in a time and data-efficient manner.”

BioJulia Benchmarking: Jakob Nybo Nissen and Ben J. Ward conducted a benchmarking analysis of BioJulia and Seq, a new language for bioinformatics.

JuliaCon 2020 Deadlines: JuliaCon 2020 will take place July 27-31 at ISCTE – Instituto Universitário de Lisboa (ISCTE-IUL) in Lisbon, Portugal.

  1. JuliaCon 2020 Call for Proposals: JuliaCon 2020 proposals are due March 7, 2020. Proposal types include talks, lightning talks, minisymposia, workshops, posters and ‘Birds of a Feather’ breakout sessions. Please review submission guidelines, prepare and submit your proposal no later than March 7, 2020. Mentorship is also available for new presenters.

  2. Financial Assistance to Attend JuliaCon 2020: If financial assistance will impact your ability to attend JuliaCon 2020, please apply no later than March 7, 2020.

  3. Early Bird Ticket Discount: Early Bird Tickets are available for purchase now through April 20, 2020. Please purchase your tickets early to take advantage of discounted pricing.

  4. JuliaCon 2020 Call for Volunteers: JuliaCon runs on volunteers! Please consider signing up to volunteer. JuliaCon volunteer opportunities include:

    • Mentors for new speakers
    • Proceedings reviewers
    • Talk submission reviewers
    • Financial assistance application reviewers
    • Local/onsite volunteers
  5. JuliaCon 2020 Sponsors: JuliaCon relies on the support of sponsors. Click here for more information about becoming a JuliaCon sponsor.

Julia Computing Enterprise Solutions: Contact Julia Computing for more information about putting Julia to work for your organization, deploying Julia more efficiently, effectively and at scale.

  • JuliaSure:
    JuliaSure provides enterprise support and indemnity for organizations using Julia.

  • JuliaTeam:
    JuliaTeam provides enterprise governance including private package development, deployment, management, security, support and indemnity.

  • JuliaRun:
    JuliaRun allows you to scale Julia deployment from a single machine to dozens or hundreds of nodes in a public or private cloud environment, including AWS, Azure or Google Cloud.

JuliaBox 30 Day Free Trial: JuliaBox is now available with a 30 day free trial. JuliaBox is the fastest and easiest way to start using Julia right away with no download required. Register today to start your 30 day free trial.

JuliaBox Academic Discount: Hundreds of students and faculty at universities around the world use JuliaBox for classroom instruction and learning. Use free and open source materials to design your own course using Julia. JuliaBox starts at just $7 per month including a 50% academic discount. Sign up online or contact Julia Computing to take advantage of the academic discount or for more information.

Julia and Julia Computing in the News

  • HPCWire: Julia Programming’s Dramatic Rise in HPC and Elsewhere

  • CleanTechnica: Energy Efficiency Is A Core Target For Machine Learning

  • BuiltInChicago: An Engineering Leader Discusses the Best Programming Languages to Learn

  • IProgrammer: Python As Fast As Go and C++ The Queens Prove It

  • Analytics India: MLDS 2020 – AIM Wraps Up Bangalore Edition of India’s Biggest ML Developer Conference

  • Clare Herald: Clare and Mid-West Best for Tech and Biotech Job Offerings

  • Irish Tech News: Tech on the Wild Atlantic Way Thomond Park, Feb 1

  • Free Press Journal: 10th Aegis Graham Bell Awards Concluded the 2nd Jury Round

  • Yahoo: 10th Aegis Graham Bell Awards Concluded the 2nd Jury Round

  • HPCWire: Julia Computing Brings Machine Learning in Julia Workshop to Prague

Julia Blog Posts

Upcoming Julia Events

Recent Julia Events

Julia Jobs, Fellowships and Internships

Do you work at or know of an organization looking to hire Julia programmers as staff, research fellows or interns? Would your employer be interested in hiring interns to work on open source packages that are useful to their business? Help us connect members of our community to great opportunities by sending us an email, and we’ll get the word out.

There are more than 300 Julia jobs currently listed on Indeed.com, including jobs at Accenture, Airbus, Amazon, AstraZeneca, AT&T, Barnes & Noble, BlackRock, Capital One, CBRE, Charles River Analytics, Citigroup, Comcast, Conde Nast, Cooper Tire & Rubber, Disney, Dow Jones, Facebook, Gallup, Genentech, General Electric, Google, Huawei, Ipsos, Johnson & Johnson, KPMG, Lockheed Martin, Match, McKinsey, NBCUniversal, Netflix, Nielsen, Novartis, OKCupid, Opendoor, Oracle, Pandora, Peapod, Pfizer, Raytheon, Spectrum, Wells Fargo, Zillow, Brown, BYU, Caltech, Dartmouth, Emory, Harvard, Johns Hopkins, Louisiana State University, Massachusetts General Hospital, MIT, Penn State, Princeton, UC Davis, University of Chicago, University of Delaware, University of Kentucky, UNC-Chapel Hill, USC, University of Virginia, Argonne National Laboratory, Federal Reserve Bank, Lawrence Berkeley National Laboratory, Los Alamos National Laboratory, National Renewable Energy Laboratory, Oak Ridge National Laboratory, Pacific Northwest National Laboratory, State of Wisconsin and many more.

Contact Us: Please contact us if
you wish to:

  • Purchase or obtain license information for Julia products such as JuliaSure, JuliaTeam, or JuliaRun

  • Obtain pricing for Julia consulting projects for your organization

  • Schedule Julia training for your organization

  • Share information about exciting new Julia case studies or use cases

  • Spread the word about an upcoming conference, workshop, training,
    hackathon, meetup, talk or presentation involving Julia

  • Partner with Julia Computing to organize a Julia meetup, conference, workshop, training, hackathon, talk or presentation involving Julia

  • Submit a Julia internship, fellowship or job posting

About Julia and Julia Computing

Julia is the fastest high performance open source computing language for data, analytics, algorithmic trading, machine learning, artificial intelligence, and other scientific and numeric computing applications. Julia solves the two language problem by combining the ease of use of Python and R with the speed of C++. Julia provides parallel computing capabilities out of the box and unlimited scalability with minimal effort. Julia has been downloaded more than 13 million times and is used at more than 1,500 universities. Julia co-creators are the winners of the 2019 James H. Wilkinson Prize for Numerical Software and the 2019 Sidney Fernbach Award. Julia has run at petascale on 650,000 cores with 1.3 million threads to analyze over 56 terabytes of data using Cori, one of the ten largest and most powerful supercomputers in the world.

Julia Computing was founded in 2015 by all the creators of Julia to develop products and provide professional services to businesses and researchers using Julia.

Julia Computing Brings Machine Learning in Julia Workshop to Prague

Prague, Czech Republic – Julia Computing's Avik Sengupta (VP Engineering) and Coillte's Kevin O'Brien (Forestry Resource Modeller) will lead a workshop on "Machine Learning in Julia" at the Machine Learning Prague conference on March 20, 2020.

Julia is the fastest high-performance open source computing language for data, analytics, algorithmic trading, machine learning, artificial intelligence, and other scientific and numeric computing applications. Julia solves the two language problem by combining the ease of use of Python and R with the speed of C++.

Julia works with GPUs, TPUs, multithreading and parallel processing to deliver seamless unlimited scalability from a single CPU to thousands of nodes, cores and threads in the public or private cloud. Julia has run at petascale on 9,300 Knights Landing (KNL) nodes with 650,000 cores and 1.3 million threads to analyze over 56 terabytes of data using Cori, one of the ten largest and most powerful supercomputers in the world.

Julia has been downloaded more than 13 million times and is used at more than 1,500 universities. Julia co-creators are the winners of the 2019 James H. Wilkinson Prize for Numerical Software and the 2019 Sidney Fernbach Award.

Julia is used by hundreds of firms worldwide in engineering, biotechnology, pharmaceutical research, aviation, manufacturing and medicine for image and pattern recognition, flight path planning, risk analysis, optimization and more. Examples include:

  • Path BioAnalytics: Path BioAnalytics is a computational biotech company using precision medicine for drug discovery and development and treatment of disease. By switching to Julia, Path BioAnalytics decreased computation 65x, increased accuracy 55% and significantly reduced their code base.

  • Lincoln Labs: Lincoln Labs works with the US Federal Aviation Administration (FAA) to develop and deploy the next generation Airborne Collision Avoidance System (ACAS-X). They use Julia to compute 650 billion decision points within an optimized logic table to identify failures. Julia reduced the time required to conduct these computations by several years.

  • Brazilian National Institute for Space Research: Brazil's space mission planning research institute uses Julia to plan space missions. They leverage Julia's superior speed and ease of use to build a simulator, create multidisciplinary design optimization (MDO) tools for space mission planning and design an attitude and orbit control subsystem (AOCS).

  • Contextflow: Contextflow uses Julia for artificial intelligence to search and analyze medical images to improve the speed and accuracy of medical diagnosis and treatment. Julia reduced the time required for image searching from up to 20 minutes to less than 2 seconds.

  • Aviva: One of Europe’s largest insurers uses Julia for risk analysis, including Solvency II compliance. According to Tim Thornham, Aviva’s Director of Financial Modeling, “Solvency II compliant models in Julia are one thousand times faster, use 93% fewer lines of code and took one-tenth the time to implement” compared with their legacy system.

About Julia Computing

Julia Computing was founded in 2015 by all the creators of Julia to provide products including JuliaTeam, JuliaSure and JuliaRun to businesses and researchers using Julia.