If years of research on streamlining data analytics have shown us anything, it is that working with big data is no cakewalk. However you look at it, creating, maintaining, and building models from large datasets is time consuming and computationally intensive.
Introducing JuliaDB
In Julia v0.6, we aim to take another step towards solving this problem with our new package, JuliaDB.
JuliaDB is a high-performance, distributed, column-oriented data store providing functionality for both in-memory and out-of-core calculations. Because it is implemented entirely in Julia, JuliaDB integrates easily with data loading, analytics, and visualization packages throughout the Julia ecosystem, which makes it possible to develop data- and compute-intensive applications rapidly.
This example uses datasets provided by LendingClub, the world's largest online marketplace connecting borrowers and investors. On their website, they provide publicly available, detailed datasets containing anonymized data on all loans issued through their system, including the current loan status and latest payment information.
The analysis conducted below is similar to that performed on the same datasets in this post by the Microsoft R Server Tiger Team for the Azure Data Science VM.
The first step in conducting this analysis is to download the following files from the website: LoanStats2016_Q1.csv, LoanStats2016_Q2.csv, LoanStats2016_Q3.csv, LoanStats2016_Q4.csv, LoanStats2017_Q1.csv, LoanStats3b.csv, LoanStats3c.csv and LoanStats3d.csv. A basic clean-up of the data files is then performed by deleting the first and last line of descriptive text from each CSV file.
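If you prefer to script this clean-up instead of editing the files by hand, the following sketch does the same thing. It is not part of the original workflow, and the directory path is a placeholder you will need to adjust.
# Strip the first and last line of descriptive text from every CSV in rawdir
rawdir = "/path/to/LendingClubDemo/files"  # placeholder; point this at your download directory
for name in readdir(rawdir)
    endswith(name, ".csv") || continue
    path = joinpath(rawdir, name)
    lines = readlines(path)          # read the whole file into memory
    open(path, "w") do io
        for l in lines[2:end-1]      # drop the first and last lines
            println(io, chomp(l))
        end
    end
end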
Writing the Julia code
Once the file clean-up is done, install the following packages: JuliaDB, TextParse, IndexedTables, NullableArrays, DecisionTree, CoupledFields, Gadfly, Cairo, Fontconfig, ROC, Dagger, Compose, and BenchmarkTools, then load the ones required for this analysis.
# Packages that need to be installed with Julia 0.6
Pkg.add("JuliaDB")
Pkg.add("TextParse")
Pkg.add("IndexedTables")
Pkg.add("NullableArrays")
Pkg.add("DecisionTree")
Pkg.add("CoupledFields")
Pkg.add("Gadfly")
Pkg.add("Cairo") # Needed for PNG creation with Gadfly
Pkg.add("Fontconfig") # Needed for PNG creation with Gadfly
Pkg.clone("https://github.com/AndyGreenwell/ROC.jl.git")
Pkg.add("Dagger")
Pkg.add("Compose")
Pkg.add("BenchmarkTools")
using Dagger, Compose
using ROC, Gadfly
using DecisionTree, JuliaDB, TextParse, NullableArrays
import TextParse: Numeric, NAToken, CustomParser, tryparsenext, eatwhitespaces, Quoted, Percentage
Next, define a variable holding the path to the directory containing the data files, and a dictionary whose keys are the column names in the dataset and whose values are the parsers to use for each column.
dir = "/home/venkat/LendingClubDemo/files"
const floatparser = Numeric(Float64)
const intparser = Numeric(Int)
t = Dict("id" => Quoted(Int),
"member_id" => Quoted(Int),
"loan_amnt" => Quoted(Nullable{Float64}),
"funded_amnt" => Quoted(Nullable{Float64}),
"funded_amnt_inv" => Quoted(Nullable{Float64}),
"term" => Quoted(TextParse.StrRange),
"int_rate" => Quoted(NAToken(Percentage())),
"delinq_2yrs" => Quoted(Nullable{Int}),
"earliest_cr_line" => Quoted(TextParse.StrRange),
"inq_last_6mths" => Quoted(Nullable{Int}),
...and so on
)
Calling the "loadfiles" function from the JuliaDB package parses the data files and constructs the corresponding table. Passing the dictionary above as the colparsers argument helps it choose the right parser for each column, although loadfiles can also infer parsers on its own. Since we specify no index columns, JuliaDB creates an implicit index column that assigns each row a unique integer, starting at 1.
LS = loadfiles(glob("*.csv", dir), indexcols=[], colparsers=t, escapechar='"')
Once the data is loaded, we classify each loan as bad or good depending on whether its payment is late, in default, or charged off. We then split the table into two based on this classification.
bad_status = ("Late (16-30 days)","Late (31-120 days)","Default","Charged Off")
# Determine which loans are bad loans
is_bad = map(status->(status in bad_status),
getdatacol(LS, :loan_status)) |> collect |> Vector{Bool}
# Split the table into two based on the loan classification
LStrue = filter(x->x.loan_status in bad_status, LS)
LSfalse = filter(x->!(x.loan_status in bad_status), LS)
Constructing a useful model requires identifying which factors best separate good loans from bad ones. Here, our feature selection method is a graphical comparison of how each numeric column's values are distributed across the good and bad loan categories: for each numeric column we draw two density plots, one for good loans and one for bad. This first requires determining which columns are numeric, which we do with the following set of "isnumeric" methods.
# Define a function for determining if a value is numeric, whether or not the
# value is a Nullable.
isnumeric(::Number) = true
isnumeric{T<:Number}(::Nullable{T}) = true
isnumeric(::Any) = false
isnumeric{T<:Number}(x::Quoted{T}) = true
isnumeric{T<:Nullable}(x::Quoted{T}) = eltype(T) <: Number
We then use isnumeric to pick out the numeric columns from our parser dictionary, construct a pair of Gadfly density layers (good and bad loans) for each of them, and lay the resulting plots out on a grid for visual feature selection.
# Produce density plots of the numeric columns based on the loan classification
varnames = map(Symbol, collect(keys(filter((k,v)->(k != "id" && k!="member_id" && isnumeric(v)), t))))
layers = Vector{Gadfly.Layer}[]
for s in varnames
    nt = dropnull(collect(getdatacol(LStrue, s)))
    nf = dropnull(collect(getdatacol(LSfalse, s)))
    push!(layers, layer(x = nt, Geom.density, Theme(default_color=colorant"blue")))
    push!(layers, layer(x = nf, Geom.density, Theme(default_color=colorant"red")))
end
# Lay out the individual plots on a 2D grid
N = length(varnames)
M = round(Int,ceil(sqrt(N)))
cs = Array{Compose.Context}(M,M)
for i = 1:N
    cs[i] = render(Gadfly.plot(layers[2i-1], layers[2i],
                               Guide.title(string(varnames[i])),
                               Guide.xlabel("value"), Guide.ylabel("density")))
end
for i = N+1:M^2
    cs[i] = render(Gadfly.plot(x=[0], y=[0]))
end
draw(PNG("featureplot.png",24inch, 24inch), gridstack(cs))
The Gadfly plots would typically look like this:
In order to keep our analysis as close as possible to that conducted by Microsoft, we select the same set of predictor variables that they did: revol_util, int_rate, mths_since_last_record, annual_inc_joint, dti_joint, total_rec_prncp, and all_util.
Creating the predictive model
We create our predictive model using the random forest implementation in the DecisionTree.jl package. There are two steps: first we use a large portion of the data to train the model, and then we use a smaller portion to test it. Accordingly, we randomly split the data into two parts, with 75% of the rows used for training and the remaining 25% used for testing.
# Split the data into 75% training / 25% test
n = length(LS)
srand(1)
p = randperm(n)
m = round(Int,n*3/4)
a = sort(p[1:m])
b = sort(p[m+1:end])
LStrain = LS[a]
LStest = LS[b]
labels_train = is_bad[a]
The random forest model requires two inputs: a vector of labels and the corresponding feature matrix. For the label vector, we reuse the index vector from above, taking the same subset of the is_bad vector that we used to extract the training rows. For the feature matrix, we extract the columns for our selected features from the distributed JuliaDB table, gather those columns to the master process, and finally concatenate the resulting vectors into the feature matrix.
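The per-feature training vectors used below are not shown explicitly; a minimal sketch, assuming each nullable column is gathered with getdatacol and collect as above and missing values are replaced with 0.0, could look like this (getfeature is a hypothetical helper, not part of JuliaDB):
# Gather one feature column from the distributed table, replacing nulls with 0.0
getfeature(tbl, s) = map(x -> isnull(x) ? 0.0 : Float64(get(x)),
                         collect(getdatacol(tbl, s)))
revol_util_train             = getfeature(LStrain, :revol_util)
int_rate_train               = getfeature(LStrain, :int_rate)
mths_since_last_record_train = getfeature(LStrain, :mths_since_last_record)
annual_inc_joint_train       = getfeature(LStrain, :annual_inc_joint)
dti_joint_train              = getfeature(LStrain, :dti_joint)
total_rec_prncp_train        = getfeature(LStrain, :total_rec_prncp)
all_util_train               = getfeature(LStrain, :all_util)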
features_train = [revol_util_train int_rate_train mths_since_last_record_train annual_inc_joint_train dti_joint_train total_rec_prncp_train all_util_train]
Having done this, we can now call the "build_forest" function from the DecisionTree.jl package. The four trailing arguments are the number of features sampled at each split, the number of trees, the fraction of samples used per tree, and the maximum tree depth.
model = build_forest(labels_train, features_train, 3, 10, 0.8, 6)
Should we want to reuse the model at a later time, we can serialize it to disk.
f = open(" loanmodel.jls", "w")
serialize(f, model)
close(f)
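To load the model back in a later session, the same file can be read with Julia's built-in deserialize:
f = open("loanmodel.jls", "r")
model = deserialize(f)
close(f)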
We can now test our model on the rest of the data. To do this, we generate predictions in parallel across all workers by mapping the "apply_forest" function onto every row of the JuliaDB test table.
predictions = collect(map(row -> DecisionTree.apply_forest(model,
                  [row.revol_util.value; row.int_rate.value; row.mths_since_last_record.value;
                   row.annual_inc_joint.value; row.dti_joint.value; row.total_rec_prncp.value;
                   row.all_util.value]),
              LStest)).data
With our set of predictions, we construct an ROC curve using the ROC.jl package and calculate the area under the curve to obtain a single measure of how well the trained model predicts on the test data.
# Receiver Operating Characteristics curve
curve = roc(convert(Vector{Float64},predictions), convert(BitArray{1},is_bad[b]))
# An ROC plot in Gadfly with data calculated using ROC.jl
Gadfly.plot(layer(x = curve.FPR, y = curve.TPR, Geom.line),
            layer(x = linspace(0.0,1.0,101), y = linspace(0.0,1.0,101),
                  Geom.point, Theme(default_color=colorant"red")),
            Guide.title("ROC"),
            Guide.xlabel("False Positive Rate"), Guide.ylabel("True Positive Rate"))
The ROC curve would look like this:
The area under the curve is:
# Area Under Curve
AUC(curve)
0.5878135617540067
And that is how you can use JuliaDB to build a model that predicts the quality of a loan.