
Dave Giles on “MCMC for Econometrics Students”

By: Francis Smart

Re-posted from: http://www.econometricsbysimulation.com/2014/04/dave-giles-on-mcmc-for-econometrics.html

In an excellent four-part series of posts in March, Dave Giles introduces Markov Chain Monte Carlo (MCMC) methods and Gibbs samplers.  In these posts he gives a cogent explanation of the reasoning and mechanics involved in this branch of econometrics/statistics, as well as clear simulated examples in R.

If you have not checked it out yet, now is definitely a good time.

Find Dave’s posts:
Post 1: An introduction
Post 2: A demonstration that MCMC “actually works”
Post 3: An additional example, plus how to extract marginal posterior distributions
Post 4: How simple it is to implement MCMC in R

To experiment with MCMC, I have made some modifications to Dave’s R code.  He makes no claim that his code is written for efficiency.

The function gibs defined below is the same as his code, except that it is now wrapped in a function.  In gibs2 I modified the code to work with vectors as much as possible rather than manipulating items one at a time.  I used Noam Ross’s excellent post to inform my understanding of how to improve processing speeds in R.

By vectorizing the random draws, gibs2 runs 2 to 3 times faster than gibs.  The full code can be found on github:

system.time(gibs())

# user system elapsed
# 0.97 0.06   1.17

system.time(gibs(nrep=10^6))
# user system elapsed
# 9.31 0.02   9.64

system.time(gibs2())
# user system elapsed
# 0.35 0.00 0.36
system.time(gibs2(nrep=10^6))
# user system elapsed
# 3.66 0.01   3.80 
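Dave’s model and code are not reproduced here, but the kind of change that separates gibs from gibs2 can be sketched with a standard textbook example: a Gibbs sampler for a bivariate normal with correlation rho.  The function names and the model below are my own illustration, not the code timed above.

```r
# Gibbs sampler for a bivariate normal with correlation rho.
# Conditionals: x | y ~ N(rho*y, 1 - rho^2), and symmetrically for y.

# Item-by-item draws, in the spirit of the original gibs():
gibs_sketch <- function(nrep = 10^5, rho = 0.5) {
  sdc <- sqrt(1 - rho^2)                    # conditional standard deviation
  x <- y <- numeric(nrep)                   # chains start at zero
  for (i in 2:nrep) {
    x[i] <- rnorm(1, rho * y[i - 1], sdc)   # one rnorm() call per draw
    y[i] <- rnorm(1, rho * x[i], sdc)
  }
  cbind(x, y)
}

# Vectorized draws, in the spirit of gibs2(): generate all standard-normal
# deviates up front, then only scale and shift them inside the loop.
gibs2_sketch <- function(nrep = 10^5, rho = 0.5) {
  sdc <- sqrt(1 - rho^2)
  zx <- rnorm(nrep)                         # all deviates drawn in one call
  zy <- rnorm(nrep)
  x <- y <- numeric(nrep)
  for (i in 2:nrep) {
    x[i] <- rho * y[i - 1] + sdc * zx[i]
    y[i] <- rho * x[i] + sdc * zy[i]
  }
  cbind(x, y)
}
```

The chain’s sequential dependence means the loop itself cannot be removed, but replacing thousands of rnorm(1, ...) calls with two vectorized rnorm(nrep) calls is where most of the speedup comes from.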

As an exercise I also coded the same gibs function in Julia; this too can be found on github.

@elapsed yy = gibs()
# 0.063151467
@elapsed gibs(10^6)
# 0.479542057

@elapsed yy = gibs2()
# 0.010729382
@elapsed yy = gibs2(10^6)
# 0.065821774
 

The first thing to notice is that even in its initial form, the Julia version of gibs is considerably faster (10-15x).  The gains from improving the code are also larger in Julia, with gibs2 running much faster than the R version (30-55x).


A Weekend With Julia: An R User’s Reflections

By: Francis Smart

Re-posted from: http://www.econometricsbysimulation.com/2014/04/a-weekend-with-julia-r-users-reflections.html

[Comic: “The Famous Julia” (http://Pixton.com/ic:ngr1sg4e)]

First off, I am not going to talk much about Julia’s speed. Everybody has seen the tables and graphs showing how, in one benchmark or another, Julia is ten or a hundred times faster than R.  Most blog posts about Julia test the generality of these results (Bogumił Kamiński 2013, Randy Zwitch 2013, and Wes McKinney 2012).

Enough said about machine speed!  Let’s talk more about intuitive appeal, compactness of notation, and aesthetics.
Julia has some very thoughtful design features that make it an extremely enjoyable language to program in despite being in a nascent state.
1. Julia has some great ways of handling strings.  In Julia all strings are subsettable.  Thus:
julia> a = "Hello world"; a[3:7]
"llo w"
In R:
R> a <- "Hello world"; substr(a, 3, 7)
[1] "llo w"

Also, “Julia allows interpolation into string literals using $, as in Perl” (doc):

julia>user = “Bob”

julia>"Hello $user. How are you?"
"Hello Bob. How are you?"

2. Julia implements comprehension syntax for defining arrays, which is an incredibly powerful method. Formally it looks something like this: A = [ F(x,y,…) for x=rx, y=ry, … ]

Imagine we would like to define an array equivalent to a multiplication table:

julia> A = [ x*y for x=1:12, y=1:12]; A
12x12 Array{Int64,2}:
  1   2   3   4   5   6   7   8    9   10   11   12
  2   4   6   8  10  12  14  16   18   20   22   24
  3   6   9  12  15  18  21  24   27   30   33   36
  4   8  12  16  20  24  28  32   36   40   44   48
  5  10  15  20  25  30  35  40   45   50   55   60
  6  12  18  24  30  36  42  48   54   60   66   72
  7  14  21  28  35  42  49  56   63   70   77   84
  8  16  24  32  40  48  56  64   72   80   88   96
  9  18  27  36  45  54  63  72   81   90   99  108
 10  20  30  40  50  60  70  80   90  100  110  120
 11  22  33  44  55  66  77  88   99  110  121  132
 12  24  36  48  60  72  84  96  108  120  132  144

Let’s see how we might do this in R.

R> A <- matrix(nrow=12, ncol=12); A <- col(A)*row(A); A
      [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [,11] [,12]
 [1,]    1    2    3    4    5    6    7    8    9    10    11    12
 [2,]    2    4    6    8   10   12   14   16   18    20    22    24
 [3,]    3    6    9   12   15   18   21   24   27    30    33    36
 [4,]    4    8   12   16   20   24   28   32   36    40    44    48
 [5,]    5   10   15   20   25   30   35   40   45    50    55    60
 [6,]    6   12   18   24   30   36   42   48   54    60    66    72
 [7,]    7   14   21   28   35   42   49   56   63    70    77    84
 [8,]    8   16   24   32   40   48   56   64   72    80    88    96
 [9,]    9   18   27   36   45   54   63   72   81    90    99   108
[10,]   10   20   30   40   50   60   70   80   90   100   110   120
[11,]   11   22   33   44   55   66   77   88   99   110   121   132
[12,]   12   24   36   48   60   72   84   96  108   120   132   144
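As an aside not in the original post: for this particular example, base R also has a one-line idiom, outer(), which applies a function (by default "*") to every pair of elements of two vectors.

```r
# Equivalent one-liner using outer(); the default function is "*",
# so this builds the same 12x12 multiplication table
A <- outer(1:12, 1:12)
A[3, 4]
# [1] 12
```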

3. Functions can be written in a mathematically intuitive form.

julia> f(x,y) = x^3 - y + x*y; f(3,2)
31

4. Numeric literals placed in front of variables or parenthesized expressions are automatically interpreted as multiplication.

julia> x=13; 3x
39
julia> (12+4)x
208
R> x <- 13; 3*x
[1] 39
R> (12+4)*x
[1] 208

5. Julia does not worry about deep nesting of functions. Imagine a very silly function that adds up all of the integers between 1 and n.

julia> f(n) = (n>1 ? f(n-1)+n : n); f(100)
5050
R> f <- function(n) ifelse(n>1, f(n-1)+n, n); f(100)
[1] 5050
Both R and Julia seem to work fine.  But what happens when we go a little deeper?
R> f(10^3)
fails, while
julia> f(10^5)
does not even cause a hiccup! Now, I have never been in the situation of needing this many recursions, but this result reflects the general power of the language.

6. Julia has no problem with many Unicode characters. Thus if you want θ(μ,σ,φ)=μ^σ/φ

julia> θ(μ,σ,φ)=μ^σ/φ; θ(1,2,3)
0.3333333333333333

Personally I find this notation extremely appealing, as it succinctly communicates the equation the researcher is actually dealing with, as opposed to what typically happens when I code in R:

R> theta <- function(mu, sigma, phi) mu^sigma/phi; theta(1,2,3)
[1] 0.3333333

Besides being more compact, Julia’s notation requires less mental juggling in order to translate from mathematical concepts to direct implementation.

7. Fast tuple assignment.

I am not sure if I am referring to this properly. Perhaps one of the omniscient badasses working on Julia can correct me if I am describing this feature incorrectly.


I will just show you how it works:

julia> a, b, c, d = 1, 22, 33, 444
julia> b,d
(22, 444)
 

This might seem trivial, but check out how easy it is to pass parameters into a function. Here is the probability function of a four-parameter logistic (4PL) item response theory model, defined with ability parameter θ and item parameters a, b, c, d.

function ppi(θ=0, ipar=(1, 0, 0, 1), D=1.7)

  a, b, c, d = ipar
  # Define exponential factor appearing in
  # numerator and denominator.
  ε = e^(D * a * (θ - b))
  # Define final probability being returned.
  c + ((d-c)*ε)/(1+ε)

end

julia> ppi()
0.5 
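For comparison, a direct R translation of ppi (my own sketch, not taken from the original post) has to unpack the item parameters by index, since base R lacks tuple assignment:

```r
# 4PL item response probability: an R translation of the Julia ppi()
ppi <- function(theta = 0, ipar = c(1, 0, 0, 1), D = 1.7) {
  a <- ipar[1]; b <- ipar[2]; c <- ipar[3]; d <- ipar[4]
  # Exponential factor appearing in numerator and denominator
  eps <- exp(D * a * (theta - b))
  # Final probability being returned
  c + (d - c) * eps / (1 + eps)
}
ppi()
# [1] 0.5
```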
8. The Readability of programming in Julia.

What I mean by this is that when programming in Julia, so long as you follow basic rules regarding programming efficiency, your program is likely to run in Julia at a speed comparable to what it would achieve if programmed in C. For me this is a big deal: it means I do not have to think about programming the same task in two different languages when writing a new library or feature.

Likewise, when reading functions and libraries written in Julia, I will hopefully have to worry less about working across languages. This should make the source code of functions more accessible for learning and for improving my understanding of programming.

Some Critiques

It is at this point that I want to mention a few critiques.  As I see it, these critiques are entirely a product of the novelty of the Julia environment, something the creators themselves acknowledge by listing the current version of the language as 0.2, with 0.3 in testing.  As far as how the language actually performs, I have no complaints.  It does not have as many libraries as R, but they are in development.

1. The documentation is very readable and surprisingly enjoyable.  I found that reading through it not only enhanced my understanding of Julia but significantly enhanced my understanding of programming languages in general.  That said, the documentation seems to be written by programming polyglots for other programming polyglots.  I think it would be very difficult for someone without programming experience to read it and be able to do much with Julia.

2. The documentation is sparse on examples, and the examples it has tend to be at a high level of complexity.  I personally think that the best way to learn to program is through examples (though this might explain my spotty understanding of abstract concepts like scope).  The documentation of Julia has a fair number of examples, but doubling their number would be very helpful.  In addition, quite a number of functions written for Julia have no documentation at all; obviously it would be helpful to document them.

3. Finally, as a Windows user (I know, poor form), it is not very easy to use Julia.  I have been using Forio’s Julia Studio (since I do not like working with the Windows command line) but do not find it that great to work with, plus it is proprietary (though free).  It would be nice if there were a very basic Windows IDE like R’s, so that I could write code in Notepad++ and send it to Julia.  But once again, this is a minor issue.  I am resolved to get a Linux build running on my machine once I make my way back to a country with decent internet.

Getting Social Sciences Out of the Black Box: The Open Access Revolution

By: Francis Smart

Re-posted from: http://www.econometricsbysimulation.com/2014/04/getting-social-sciences-out-of-black.html

Trading Ethos for Logos

Until very recently (the last 10 years), it has been uncommon for social science researchers to share their data, even when sharing would compromise neither the private information of the subjects nor the validity of the study. [1]

It is even more uncommon for researchers to willingly share the code they used to transform their data from its raw state into the state required for their analysis.

Researchers instead refer inquiries to what is often an incomplete “write-up” included in their published work.

As a matter of course, they by and large do not publicly acknowledge the possibility of coding errors, intentionally or unintentionally introduced, except when an error is so gross that a reviewer could detect it through superficial examination.

Of course, as every programmer knows, errors (or bugs) inevitably pop up and are only removed through a combination of careful examination of the code, experience at error detection, and luck.  For researchers not to share both their code and their data is therefore an act of unscientific hubris at best, and of deceit and publication chasing at worst.

Even when code is shared, it is often mangled and unintelligible, without comments or instructions on how to connect the data to the code.  I have experienced this situation first hand; the result, unsurprisingly, is that it is usually more efficient to recode the entire analysis from scratch than to attempt to interpret the baffling and intractable code of another.

Furthermore, social science research is generally ranked by the number of citations it inspires, a measure more related to its ability to create a splash than to its truthfulness.  Thus it might be profitable for researchers to exaggerate or selectively pick results that further their position at the expense of identifying true effects.

In this paradigm, even when it becomes widely known that a paper employed faulty methods, the critique of that research must clear the hurdle of being presented or published in a secondary source at some later point in time, resulting in a large-scale failure to disseminate corrections.

It is ironic that such a lax incentive structure developed in a profession (economics) whose base assumption is that humans maximize utility and are moral only when it is cost effective.

The result of this sordid academic publishing incentive system is a lemons market for academic research (see the recent Economist article), in which it is believed that there is good research “out there,” but it is uncertain how that research is being done and by whom.

So what is the solution?
Enter the Open Access Revolution. This revolution is characterized by a new structure for analytical work, heralded by the rise of open access journals, open data providers, data-sharing requirements at mainstream journals, and the general proliferation of shared research methods through online communities.  Open access journals in particular are exceedingly helpful in improving the ethics of scientific research by providing standards under which research, data, analysis, and critiques are required to be shared.

The hope is that by transitioning from black box research methods to a system in which information is shared, the lemons market will be squeezed to death as unethical and sloppy research is crushed out of the market.  This is not to say that all “bad” research is the result of unethical behavior or carelessness in coding or analysis.  It is simply impossible to know the quality of the research being done until it can be assessed externally.

Open Source Statistical Programming Languages (R, Julia, etc.)
To accomplish the full dissemination of research and research methods, not only the sharing of data and code is needed, but also the sharing of the means of running the analysis.  To this end I believe open source programming languages are a highly effective tool.  Analysis done in these languages and shared through publications is highly accessible to any researcher, even one without a background in the language, since anybody can install these languages for free, run the code, and get help from their highly active communities.

This, however, may not be the case with other statistical packages, which present large barriers to replication by requiring the purchase of expensive or hard-to-find software.  In addition, other software options may not be as well supported, resulting in a situation in which, even if you have someone else’s code, there may be no support network available to help debug and service that code.

It is therefore my belief that open source statistical programming languages should become the standard for statistical analysis and for setting standards in scientific research, especially in the highly subjective fields known as the “social sciences.”

Data Aggregators: (Quandl, the Social Science Research Network, etc.)
To further the goal of open access to data, data aggregators, in particular Quandl, have emerged.  These independent organizations have taken on the mammoth responsibility of gathering, organizing, and redistributing publicly available data for the purpose of advancing scientific research.

Combining code from an open source statistical programming language with data from Quandl, a new data-analysis paradigm has emerged, as different from the old paradigm as mountain spring water is from muddy puddles.  In this new paradigm it is possible to have every piece of an analysis, including the downloading of the data, coded and openly available for review and duplication.  Students and fellow researchers will not have to trudge through the difficult and often impossible work of trying to rebuild the analysis of others from their published works.

Disclosure
For the sake of full disclosure: there is a possibility that I will be a guest blogger on the Quandl blog in the near future, so my favoring of Quandl in this post might be influenced by that anticipated opportunity.  That said, I find Quandl a truly remarkable undertaking and, hopefully, a game changer for the standards of future research.

[1] Purely personal observation.  I do not have data to support this claim.