Author Archives: Andrew Collier

#MonthOfJulia Day 16: Databases

Julia-Logo-Database

Yesterday we looked at how time series data can be sucked into Julia from Quandl. What happens if your data are sitting in a database? No problem, Julia can handle that too. There are a number of database packages available. I’ll be focusing on SQLite and ODBC, but it might be worthwhile checking out JDBC, LevelDB and LMDB too.

SQLite

SQLite is a lightweight transactional SQL database engine that does not require a server or any major configuration. Installation is straightforward on most platforms.

The first step towards using SQLite from Julia is to load the package.

julia> using SQLite

Next, for illustrative purposes, we’ll create a database (which exists as a single file in the working directory) and add a table which we’ll populate directly from a delimited file.

julia> db = SQLiteDB("passwd.sqlite")
SQLiteDB{UTF8String}("passwd.sqlite",Ptr{Void} @0x00000000059cde38,0)
julia> create(db, "passwd", readdlm("/etc/passwd", ':'), ["username", "password", "UID", "GID",
                                                          "comment", "homedir", "shell"])
1x1 ResultSet
| Row | "Rows Affected" |
|-----|-----------------|
| 1   | 0               |

Then the interesting bit: we execute a simple query.

julia> query(db, "SELECT username, homedir FROM passwd LIMIT 10;")
10x2 ResultSet
| Row | "username" | "homedir"         |
|-----|------------|-------------------|
| 1   | "root"     | "/root"           |
| 2   | "daemon"   | "/usr/sbin"       |
| 3   | "bin"      | "/bin"            |
| 4   | "sys"      | "/dev"            |
| 5   | "sync"     | "/bin"            |
| 6   | "games"    | "/usr/games"      |
| 7   | "man"      | "/var/cache/man"  |
| 8   | "lp"       | "/var/spool/lpd"  |
| 9   | "mail"     | "/var/mail"       |
| 10  | "news"     | "/var/spool/news" |

Most of the expected SQL operations are supported by SQLite (check the documentation) and hence also by the Julia interface. When we’re done we close the database connection.

julia> close(db)

Of course, the database we created in Julia is now available through the shell too.

colliera@propane:~/proj/Z-212-language-julia/src$ ls -l passwd.sqlite 
-rw-r--r-- 1 colliera colliera 6144 Sep 18 07:21 passwd.sqlite
colliera@propane:~/proj/Z-212-language-julia/src$ sqlite3 passwd.sqlite 
SQLite version 3.8.7.4 2014-12-09 01:34:36
Enter ".help" for usage hints.
sqlite> pragma table_info(passwd);
0|username|TEXT|0||0
1|password|TEXT|0||0
2|UID|REAL|0||0
3|GID|REAL|0||0
4|comment|TEXT|0||0
5|homedir|TEXT|0||0
6|shell|TEXT|0||0
sqlite> 

ODBC

If you need to access an enterprise DB (for example, Oracle, PostgreSQL, MySQL, Microsoft SQL Server or DB2) then the ODBC interface will be the way to go. To avoid the overhead of using one of these fancy DBs, I will demonstrate Julia’s ODBC functionality using the SQLite database we created above. Before we do that though, you’ll need to setup ODBC for SQLite. It’s not an onerous procedure at all. Then we fire up the ODBC package and we’re ready to roll.

julia> using ODBC

First we’ll check which drivers are available for ODBC (just SQLite in my case) and what data source names (DSNs) are registered.

julia> listdrivers()
(String["SQLite","SQLite3"],String["Description=SQLite ODBC Driver\0Driver=libsqliteodbc.so\0Setup=libsqliteodbc.so\0UsageCount=1\0","Description=SQLite3 ODBC Driver\0Driver=libsqlite3odbc.so\0Setup=libsqlite3odbc.so\0UsageCount=1\0"])
julia> listdsns()
(String["passwd"],String["SQLite3"])

We see that there is a DSN available for the passwd database. So we create a connection:

julia> db = ODBC.connect("passwd")
ODBC Connection Object
----------------------
Connection Data Source: passwd
passwd Connection Number: 1
Contains resultset(s)? No

At this point I’d like to execute a query. However, somewhat disappointingly, this doesn’t work. No error message but also no results. I’ve logged an issue with the package maintainer, so hopefully this will be resolved soon.

julia> query("SELECT * FROM passwd LIMIT 5;", db)
0x0 DataFrame

What’s promising though is that I can still retrieve the metadata for that query.

julia> querymeta("SELECT * FROM passwd LIMIT 5;", db)
Resultset metadata for executed query
-------------------------------------
Query:   SELECT * FROM passwd LIMIT 5
Columns: 7
Rows:    0
7x5 DataFrame
| Row | Names      | Types                  | Sizes | Digits | Nullable |
|-----|------------|------------------------|-------|--------|----------|
| 1   | "username" | ("SQL_LONGVARCHAR",-1) | 65536 | 0      | 1        |
| 2   | "password" | ("SQL_LONGVARCHAR",-1) | 65536 | 0      | 1        |
| 3   | "UID"      | ("SQL_DOUBLE",8)       | 54    | 0      | 1        |
| 4   | "GID"      | ("SQL_DOUBLE",8)       | 54    | 0      | 1        |
| 5   | "comment"  | ("SQL_LONGVARCHAR",-1) | 65536 | 0      | 1        |
| 6   | "homedir"  | ("SQL_LONGVARCHAR",-1) | 65536 | 0      | 1        |
| 7   | "shell"    | ("SQL_LONGVARCHAR",-1) | 65536 | 0      | 1        |

Again, when we’re done, we close the database connection.

julia> disconnect(db)

We’re now covered a number of means for getting data into Julia. Over the next few days we’ll be looking at Julia’s capabilities for analysing data. Stay tuned. In the meantime you can check out the code for today (and previous days) on github. Also take a look at the talk below.

The post #MonthOfJulia Day 16: Databases appeared first on Exegetic Analytics.

#MonthOfJulia Day 17: Datasets from R

Julia-Logo-RDatasets

R has an extensive range of builtin datasets, which are useful for experimenting with the language. The RDatasets package makes many of these available within Julia. We’ll see another way of accessing R’s datasets in a couple of days’ time too. In the meantime though, check out the documentation for RDatasets and then read on below.

As always, the first thing that we need to do is load the package.

julia> using RDatasets

We can get a list of the R packages which are supported by RDatasets.

julia> RDatasets.packages()
33x2 DataFrame
| Row | Package        | Title                                                                     |
|-----|----------------|---------------------------------------------------------------------------|
| 1   | "COUNT"        | "Functions, data and code for count data."                                |
| 2   | "Ecdat"        | "Data sets for econometrics"                                              |
| 3   | "HSAUR"        | "A Handbook of Statistical Analyses Using R (1st Edition)"                |
| 4   | "HistData"     | "Data sets from the history of statistics and data visualization"         |
| 5   | "ISLR"         | "Data for An Introduction to Statistical Learning with Applications in R" |
| 6   | "KMsurv"       | "Data sets from Klein and Moeschberger (1997), Survival Analysis"         |
| 7   | "MASS"         | "Support Functions and Datasets for Venables and Ripley's MASS"           |
| 8   | "SASmixed"     | "Data sets from \"SAS System for Mixed Models\""                          |
| 9   | "Zelig"        | "Everyone's Statistical Software"                                         |
| 10  | "adehabitatLT" | "Analysis of Animal Movements"                                            |
| 11  | "boot"         | "Bootstrap Functions (Originally by Angelo Canty for S)"                  |
| 12  | "car"          | "Companion to Applied Regression"                                         |
| 13  | "cluster"      | "Cluster Analysis Extended Rousseeuw et al."                              |
| 14  | "datasets"     | "The R Datasets Package"                                                  |
| 15  | "gap"          | "Genetic analysis package"                                                |
| 16  | "ggplot2"      | "An Implementation of the Grammar of Graphics"                            |
| 17  | "lattice"      | "Lattice Graphics"                                                        |
| 18  | "lme4"         | "Linear mixed-effects models using Eigen and S4"                          |
| 19  | "mgcv"         | "Mixed GAM Computation Vehicle with GCV/AIC/REML smoothness estimation"   |
| 20  | "mlmRev"       | "Examples from Multilevel Modelling Software Review"                      |
| 21  | "nlreg"        | "Higher Order Inference for Nonlinear Heteroscedastic Models"             |
| 22  | "plm"          | "Linear Models for Panel Data"                                            |
| 23  | "plyr"         | "Tools for splitting, applying and combining data"                        |
| 24  | "pscl"         | "Political Science Computational Laboratory, Stanford University"         |
| 25  | "psych"        | "Procedures for Psychological, Psychometric, and Personality Research"    |
| 26  | "quantreg"     | "Quantile Regression"                                                     |
| 27  | "reshape2"     | "Flexibly Reshape Data: A Reboot of the Reshape Package."                 |
| 28  | "robustbase"   | "Basic Robust Statistics"                                                 |
| 29  | "rpart"        | "Recursive Partitioning and Regression Trees"                             |
| 30  | "sandwich"     | "Robust Covariance Matrix Estimators"                                     |
| 31  | "sem"          | "Structural Equation Models"                                              |
| 32  | "survival"     | "Survival Analysis"                                                       |
| 33  | "vcd"          | "Visualizing Categorical Data"                                            |

Next we’ll get a list of all datasets supported across all of those R packages. There are a lot of them! Also we see some specific statistics about the number of records and fields in each of them.

julia> sets = RDatasets.datasets();
julia> size(sets)
(733,5)
julia> head(sets)
6x5 DataFrame
| Row | Package | Dataset     | Title       | Rows | Columns |
|-----|---------|-------------|-------------|------|---------|
| 1   | "COUNT" | "affairs"   | "affairs"   | 601  | 18      |
| 2   | "COUNT" | "azdrg112"  | "azdrg112"  | 1798 | 4       |
| 3   | "COUNT" | "azpro"     | "azpro"     | 3589 | 6       |
| 4   | "COUNT" | "badhealth" | "badhealth" | 1127 | 3       |
| 5   | "COUNT" | "fasttrakg" | "fasttrakg" | 15   | 9       |
| 6   | "COUNT" | "lbw"       | "lbw"       | 189  | 10      |

Or we can find out what datasets are available from a particular R package.

julia> RDatasets.datasets("vcd")
31x5 DataFrame
| Row | Package | Dataset           | Title                                      | Rows  | Columns |
|-----|---------|-------------------|--------------------------------------------|-------|---------|
| 1   | "vcd"   | "Arthritis"       | "Arthritis Treatment Data"                 | 84    | 5       |
| 2   | "vcd"   | "Baseball"        | "Baseball Data"                            | 322   | 25      |
| 3   | "vcd"   | "BrokenMarriage"  | "Broken Marriage Data"                     | 20    | 4       |
| 4   | "vcd"   | "Bundesliga"      | "Ergebnisse der Fussball-Bundesliga"       | 14018 | 7       |
| 5   | "vcd"   | "Bundestag2005"   | "Votes in German Bundestag Election 2005"  | 16    | 6       |
| 6   | "vcd"   | "Butterfly"       | "Butterfly Species in Malaya"              | 24    | 2       |
| 7   | "vcd"   | "CoalMiners"      | "Breathlessness and Wheeze in Coal Miners" | 32    | 4       |
| 8   | "vcd"   | "DanishWelfare"   | "Danish Welfare Study Data"                | 180   | 5       |
| 9   | "vcd"   | "Employment"      | "Employment Status"                        | 24    | 4       |
| 10  | "vcd"   | "Federalist"      | "'May' in Federalist Papers"               | 7     | 2       |
| 11  | "vcd"   | "Hitters"         | "Hitters Data"                             | 154   | 4       |
| 12  | "vcd"   | "HorseKicks"      | "Death by Horse Kicks"                     | 5     | 2       |
| 13  | "vcd"   | "Hospital"        | "Hospital data"                            | 3     | 4       |
| 14  | "vcd"   | "JobSatisfaction" | "Job Satisfaction Data"                    | 8     | 4       |
| 15  | "vcd"   | "JointSports"     | "Opinions About Joint Sports"              | 40    | 5       |
| 16  | "vcd"   | "Lifeboats"       | "Lifeboats on the Titanic"                 | 18    | 8       |
| 17  | "vcd"   | "NonResponse"     | "Non-Response Survey Data"                 | 12    | 4       |
| 18  | "vcd"   | "OvaryCancer"     | "Ovary Cancer Data"                        | 16    | 5       |
| 19  | "vcd"   | "PreSex"          | "Pre-marital Sex and Divorce"              | 16    | 5       |
| 20  | "vcd"   | "Punishment"      | "Corporal Punishment Data"                 | 36    | 5       |
| 21  | "vcd"   | "RepVict"         | "Repeat Victimization Data"                | 8     | 9       |
| 22  | "vcd"   | "Saxony"          | "Families in Saxony"                       | 13    | 2       |
| 23  | "vcd"   | "SexualFun"       | "Sex is Fun"                               | 4     | 5       |
| 24  | "vcd"   | "SpaceShuttle"    | "Space Shuttle O-ring Failures"            | 24    | 6       |
| 25  | "vcd"   | "Suicide"         | "Suicide Rates in Germany"                 | 306   | 6       |
| 26  | "vcd"   | "Trucks"          | "Truck Accidents Data"                     | 24    | 5       |
| 27  | "vcd"   | "UKSoccer"        | "UK Soccer Scores"                         | 5     | 6       |
| 28  | "vcd"   | "VisualAcuity"    | "Visual Acuity in Left and Right Eyes"     | 32    | 4       |
| 29  | "vcd"   | "VonBort"         | "Von Bortkiewicz Horse Kicks Data"         | 280   | 4       |
| 30  | "vcd"   | "WeldonDice"      | "Weldon's Dice Data"                       | 11    | 2       |
| 31  | "vcd"   | "WomenQueue"      | "Women in Queues"                          | 11    | 2       |

Finally, the most interesting bit: accessing data from a particular dataset. Below we load up the women dataset from the vcd package.

julia> women = dataset("datasets", "women")
15x2 DataFrame
| Row | Height | Weight |
|-----|--------|--------|
| 1   | 58     | 115    |
| 2   | 59     | 117    |
| 3   | 60     | 120    |
| 4   | 61     | 123    |
| 5   | 62     | 126    |
| 6   | 63     | 129    |
| 7   | 64     | 132    |
| 8   | 65     | 135    |
| 9   | 66     | 139    |
| 10  | 67     | 142    |
| 11  | 68     | 146    |
| 12  | 69     | 150    |
| 13  | 70     | 154    |
| 14  | 71     | 159    |
| 15  | 72     | 164    |

From these data we learn that the average mass of American women of height 66 inches is around 139 pounds. If you are from a country which uses the Metric system (like me!) then these numbers might seem a little mysterious. Come back in a couple of days and we’ll see how Julia can convert pounds and inches in metres and kilograms.

That’s all for now. Code for today is available on github.

The post #MonthOfJulia Day 17: Datasets from R appeared first on Exegetic Analytics.

#MonthOfJulia Day 15: Time Series

Julia-Logo-TimeSeries

Yesterday we looked at Julia’s support for tabular data, which can be represented by a DataFrame. The TimeSeries package implements another common data type: time series. We’ll start by loading the TimeSeries package, but we’ll also add the Quandl package, which provides an interface to a rich source of time series data from Quandl.

julia> using TimeSeries
julia> using Quandl

We’ll start by getting our hands on some data from Yahoo Finance. By default these data will be of type TimeArray, although it is possible to explicitly request a DataFrame instead,

julia> google = quandl("YAHOO/GOOGL");                     # GOOGL at (default) daily intervals
julia> typeof(google)
TimeArray{Float64,2,DataType} (constructor with 1 method)
julia> apple = quandl("YAHOO/AAPL", frequency = :weekly);  # AAPL at weekly intervals
julia> mmm = quandl("YAHOO/MMM", from = "2015-07-01");     # MMM starting at 2015-07-01
julia> rht = quandl("YAHOO/RHT", format = "DataFrame");    # As a DataFramejulia
> typeof(rht)
DataFrame (constructor with 11 methods)

Having a closer look at one of the TimeSeries objects we find that it actually consists of multiple data series, each represented by a separate column. The colnames attribute gives names for each of the component series, while the timestamp and values attributes provide access to the data themselves. We’ll see more convenient means for accessing those data in a moment.

julia> google
100x6 TimeArray{Float64,2,DataType} 2015-04-24 to 2015-09-15

             Open      High      Low       Close     Volume         Adjusted Close  
2015-04-24 | 580.05    584.7     568.35    573.66    4608400        573.66          
2015-04-27 | 572.77    575.52    562.3     566.12    2403100        566.12          
2015-04-28 | 564.32    567.83    560.96    564.37    1858900        564.37          
2015-04-29 | 560.51    565.84    559.0     561.39    1681100        561.39          
⋮
2015-09-10 | 643.9     654.9     641.7     651.08    1384600        651.08          
2015-09-11 | 650.21    655.31    647.41    655.3     1736100        655.3           
2015-09-14 | 655.63    655.92    649.5     652.47    1497100        652.47          
2015-09-15 | 656.71    668.85    653.34    665.07    1761800        665.07          
julia> names(google)
4-element Array{Symbol,1}:
 :timestamp
 :values   
 :colnames 
 :meta     
julia> google.colnames
6-element Array{UTF8String,1}:
 "Open"          
 "High"          
 "Low"           
 "Close"         
 "Volume"        
 "Adjusted Close"
julia> google.timestamp[1:5]
5-element Array{Date,1}:
 2015-04-24
 2015-04-27
 2015-04-28
 2015-04-29
 2015-04-30
julia> google.values[1:5,:]
5x6 Array{Float64,2}:
 580.05  584.7   568.35  573.66  4.6084e6  573.66
 572.77  575.52  562.3   566.12  2.4031e6  566.12
 564.32  567.83  560.96  564.37  1.8589e6  564.37
 560.51  565.84  559.0   561.39  1.6811e6  561.39
 558.56  561.11  546.72  548.77  2.362e6   548.77

The TimeArray type caters for a full range of indexing operations which allow you to slice and dice those data to your exacting requirements. to() and from() extract subsets of the data before or after a specified instant.

julia> google[1:5]
5x6 TimeArray{Float64,2,DataType} 2015-04-24 to 2015-04-30

             Open      High      Low       Close     Volume        Adjusted Close  
2015-04-24 | 580.05    584.7     568.35    573.66    4608400       573.66          
2015-04-27 | 572.77    575.52    562.3     566.12    2403100       566.12          
2015-04-28 | 564.32    567.83    560.96    564.37    1858900       564.37          
2015-04-29 | 560.51    565.84    559.0     561.39    1681100       561.39          
2015-04-30 | 558.56    561.11    546.72    548.77    2362000       548.77    
julia> google[[Date(2015,8,7):Date(2015,8,12)]]
4x6 TimeArray{Float64,2,DataType} 2015-08-07 to 2015-08-12

             Open      High      Low       Close     Volume        Adjusted Close  
2015-08-07 | 667.78    668.8     658.87    664.39    1374100       664.39          
2015-08-10 | 667.09    671.62    660.23    663.14    1403900       663.14          
2015-08-11 | 699.58    704.0     684.32    690.3     5264100       690.3           
2015-08-12 | 694.49    696.0     680.51    691.47    2924900       691.47          
julia> google["High","Low"]
100x2 TimeArray{Float64,2,DataType} 2015-04-24 to 2015-09-15

             High      Low       
2015-04-24 | 584.7     568.35    
2015-04-27 | 575.52    562.3     
2015-04-28 | 567.83    560.96    
2015-04-29 | 565.84    559.0     
⋮
2015-09-10 | 654.9     641.7     
2015-09-11 | 655.31    647.41    
2015-09-14 | 655.92    649.5     
2015-09-15 | 668.85    653.34  
julia> google["Close"][3:5]
3x1 TimeArray{Float64,1,DataType} 2015-04-28 to 2015-04-30

             Close     
2015-04-28 | 564.37    
2015-04-29 | 561.39    
2015-04-30 | 548.77  

We can shift observations forward or backward in time using lag() or lead().

julia> lag(google[1:5])
4x6 TimeArray{Float64,2,DataType} 2015-04-27 to 2015-04-30

             Open      High      Low       Close     Volume        Adjusted Close  
2015-04-27 | 580.05    584.7     568.35    573.66    4608400       573.66          
2015-04-28 | 572.77    575.52    562.3     566.12    2403100       566.12          
2015-04-29 | 564.32    567.83    560.96    564.37    1858900       564.37          
2015-04-30 | 560.51    565.84    559.0     561.39    1681100       561.39   
julia> lead(google[1:5], 3)
2x6 TimeArray{Float64,2,DataType} 2015-04-24 to 2015-04-27

             Open      High      Low       Close     Volume        Adjusted Close  
2015-04-24 | 560.51    565.84    559.0     561.39    1681100       561.39          
2015-04-27 | 558.56    561.11    546.72    548.77    2362000       548.77

We can also calculate the percentage change between observations.

julia> percentchange(google["Close"], method = "log")
99x1 TimeArray{Float64,1,DataType} 2015-04-27 to 2015-09-15

             Close   
2015-04-27 | -0.0132 
2015-04-28 | -0.0031 
2015-04-29 | -0.0053 
2015-04-30 | -0.0227 
⋮
2015-09-10 | 0.0119  
2015-09-11 | 0.0065  
2015-09-14 | -0.0043 
2015-09-15 | 0.0191

Well, that’s the core functionality in TimeSeries. There are also methods for aggregation and moving window operations, as well as time series merging. You can check out some examples in the documentation as well as on github. Finally, watch the video below from JuliaCon 2014.

The post #MonthOfJulia Day 15: Time Series appeared first on Exegetic Analytics.