Re-posted from: https://bkamins.github.io/julialang/2020/06/07/analyze-julia-git.html
Development of the Julia language
During 20 years of my work as a researcher I have used numerous programming
languages to do scientific computing, chiefly R, Python, and Java.
However, when I learned Julia I immediately felt this is a go-to solution,
although I started using it when version 0.3 was released and the language
and its ecosystem were still immature.
Currently Julia has reached version 1.4.2 and in many fields its package
ecosystem provides best-in-class functionality.
A natural question to ask is who has made this happen. It is easy enough to find
out on the GitHub page of the Julia project.
However, the default GitHub interface only lets you see contributions
by number of commits, additions or deletions. We can learn from this that
Jeff Bezanson is the leader by far in all these categories.
However, these statistics cover the whole history of the git repository.
I was always curious who is the author of the current state of the code.
Essentially, what I wanted to do is blame the whole repository and count
the distribution of the number of lines committed by the authors.
The problem is that by default git does not give you such an option.
There are ways to achieve this, which I discuss below. The project was
interesting for me, because I think it nicely shows what Julia offers you
when you have a scripting task at hand.
Before we start
In order to follow the examples below you need to have git installed.
Also you should have git-extras installed. If you are on Ubuntu just write
sudo apt install git-extras
and it should be added.
In order to analyze the repository we need to download it to our local machine
e.g. to julia_src folder.
This can be done using the following command (warning! it takes some time):
~$ git clone https://github.com/JuliaLang/julia.git julia_src
Cloning into 'julia_src'...
remote: Enumerating objects: 83, done.
remote: Counting objects: 100% (83/83), done.
remote: Compressing objects: 100% (78/78), done.
remote: Total 325678 (delta 31), reused 16 (delta 5), pack-reused 325595
Receiving objects: 100% (325678/325678), 181.28 MiB | 1.20 MiB/s, done.
Resolving deltas: 100% (244259/244259), done.
Now switch our working directory to the newly downloaded repository:
~$ cd julia_src
~/julia_src (master)$
Using git
You can get the information we want using the summary command provided by
git-extras. Here is how you can do it:
The whole list is quite long so I have cut it down to show only people with at
least 1.0% contribution. As you can see from the distribution, Jameson Nash is
really close to Jeff Bezanson in the ranking.
As you can see I have additionally added time in front of the command to see
how long the operation took. For such a large repository as this one
(note that it has almost 500,000 lines of code) it is quite time consuming.
The first thing I did was search the Internet, where I found the
following proposal:
The solution finished in 4 minutes and 14 seconds, so it was twice as fast
(the downside is that it does not produce nice percentage information).
This led me to think about writing a Julia script that would do the
job, and to check its speed. In the next section you can find my take on it.
Using Julia
In the solution I use FreqTables.jl, ProgressMeter.jl, and Pipe.jl in the
following versions:
(@v1.4) pkg> status FreqTables ProgressMeter Pipe
Status `~/.julia/environments/v1.4/Project.toml`
[da1fdf0e] FreqTables v0.4.0
[b98c9c47] Pipe v1.2.0
[92933f4c] ProgressMeter v1.3.0
Here is the code that does the job of listing authors of all lines in the git
repository:
As you can see I am using Threads.@threads to use multiple threads for
my computations. In variable p I keep a progress meter that helps me
to visually track how the computations go.
One line in the code that looks innocent but is actually quite relevant is
shuffle!(files). You might wonder why I randomly reorder the files before
processing. The reason is that the files most probably (and in fact also
actually) do not have the same cost of processing using git blame. Therefore
I do not want to have expensive files clumped together. This has two benefits:
- ProgressMeter.jl is able to quickly give me a good estimate of ETA (e.g. if
  cheap files were clumped together at the beginning of processing the
  estimate would be overly optimistic);
- Threads.@threads does static allocation of jobs to threads; this again
  means that we do prefer to shuffle jobs in order to reduce the risk that
  all expensive jobs go to a single thread, which would negatively affect the
  overall processing time.
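To make the second benefit concrete, here is a minimal sketch that is not part of the original script. It assumes that a static schedule hands each thread one contiguous, equal-sized block of jobs, and it uses made-up per-file costs; the helper name chunk_loads is hypothetical.

```julia
using Random

# Total cost of each contiguous block when `costs` is split into `nchunks`
# equal-sized blocks, mimicking a static assignment of jobs to threads.
function chunk_loads(costs::Vector{Int}, nchunks::Int)
    n = length(costs)
    @assert n % nchunks == 0 "this sketch assumes equal-sized chunks"
    len = n ÷ nchunks
    return [sum(costs[((i - 1) * len + 1):(i * len)]) for i in 1:nchunks]
end

# Hypothetical per-file costs: 90 cheap files followed by 10 expensive ones.
costs = vcat(fill(1, 90), fill(50, 10))

# Without shuffling, all expensive files land in the last block.
clumped = chunk_loads(costs, 4)

# After shuffling, the expensive files are spread over the blocks at random.
shuffled = chunk_loads(shuffle(MersenneTwister(1), costs), 4)

println("clumped:  ", clumped, " (slowest block: ", maximum(clumped), ")")
println("shuffled: ", shuffled, " (slowest block: ", maximum(shuffled), ")")
```

The slowest block determines the total running time, so the interesting number is the maximum. In the clumped case one block carries almost all of the work, while a shuffled order can never do worse than that.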
Finally note that I wrap the append! to the auths vector in a lock to avoid
a race condition (different threads potentially might try to update auths at
the same time). This is not needed for the next!(p) operation as
ProgressMeter.jl is thread-safe.
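The pattern described above (shuffle, process files on several threads, append under a lock) can be sketched with the standard library alone. This is an illustration under stated assumptions and not the original script: the names authors_in and collect_authors are hypothetical, the progress meter is left out, and the input is canned text in the shape of git blame --line-porcelain output so that the block runs without a repository.

```julia
using Random

# Extract author names from `git blame --line-porcelain` style text, where
# every blamed line carries a header line of the form "author <name>".
function authors_in(porcelain::AbstractString)
    prefix = "author "
    return [String(chop(line, head=length(prefix), tail=0))
            for line in split(porcelain, '\n') if startswith(line, prefix)]
end

# Run `blame` (a function file -> Vector{String}) over all files on multiple
# threads and gather the results in one vector.
function collect_authors(blame, files::Vector{String})
    order = shuffle(files)            # avoid clumping expensive files
    auths = String[]
    lk = ReentrantLock()
    Threads.@threads for file in order
        found = blame(file)           # expensive part, runs in parallel
        lock(lk) do
            append!(auths, found)     # shared vector: one writer at a time
        end
    end
    return auths
end

# Stand-in for real `git blame` output (assumption, for illustration only).
fake_output = Dict(
    "a.jl" => "author Alice\n\tx = 1\nauthor Bob\n\ty = 2\n",
    "b.jl" => "author Alice\nauthor-mail <alice@example.com>\n\tz = 3\n",
)

auths = collect_authors(f -> authors_in(fake_output[f]),
                        collect(keys(fake_output)))
println(sort(auths))

# In a real repository `blame` could be, for example:
#   file -> authors_in(read(`git blame --line-porcelain $file`, String))
```

Only the append! is inside the lock, so the expensive blame step still runs in parallel. The order of auths depends on thread timing, which does not matter when the next step only counts names.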
Now let us test the above code. First start Julia with four threads
(you can of course choose a different number of threads) using the command:
~/julia_src (master)$ JULIA_NUM_THREADS=4 julia
(on Windows do set JULIA_NUM_THREADS=4 before running Julia)
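Once Julia is running, it is worth checking that the setting was picked up. A minimal sketch using only Base functions (the variable names are illustrative):

```julia
# Number of threads available to Threads.@threads in this session.
n = Threads.nthreads()
println("Julia is running with ", n, " thread(s)")

# Record which thread executed each iteration; if Julia was started with a
# single thread, all entries are the same.
ids = zeros(Int, 8)
Threads.@threads for i in 1:8
    ids[i] = Threads.threadid()
end
println("iterations ran on threads: ", ids)
```

If the first line reports 1 although you asked for more, the environment variable was most likely not set in the shell that started Julia.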
Next load the script I have given above. You are now ready for the test. Here
is the code I have run on my machine:
As you can see I am well under 2 minutes now.
In the last part of code I have used Pipe.jl which greatly facilitates
using pipes in Julia (there is also a very nice package
Underscores.jl which I recommend you
to investigate; it has more functionality but this comes at the cost of being
a bit more complex to master).
What Pipe.jl does is best described by a section of its manual, so I just reuse
it here:
if after @pipe you place a underscore in the right hand of |>, it will be
replaced with the left hand side. So:
@pipe a |> b(x, _) # == b(x, a)
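For comparison, here is a sketch of the same idea in plain Julia, where an anonymous function takes the place of the underscore. It is not the code from the post: it uses a Dict as a stand-in for a frequency table, the helper count_lines is hypothetical, and the author names are made up.

```julia
# Count how many lines each author owns (Base-only stand-in for a
# frequency table).
function count_lines(auths::Vector{String})
    counts = Dict{String,Int}()
    for a in auths
        counts[a] = get(counts, a, 0) + 1
    end
    return counts
end

# Hypothetical input; in the post this would be the vector of blamed authors.
auths = ["Alice", "Bob", "Alice", "Carol", "Alice", "Bob"]

# Each parenthesized anonymous function plays the role of `_` in a @pipe
# chain: it says where the left hand side goes in the next call.
shares = auths |>
    count_lines |>
    (c -> [name => 100 * n / length(auths) for (name, n) in c]) |>
    (v -> sort(v, by=last, rev=true))

# Print authors from largest to smallest share of lines.
for (name, pct) in shares
    println(rpad(name, 8), round(pct, digits=1), "%")
end
```

The parentheses around each anonymous function matter, because otherwise the function body would swallow the rest of the chain. Avoiding this kind of noise is exactly what makes the underscore syntax of Pipe.jl convenient.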
I hope you enjoyed this little exercise (and now we know exactly whose code we
run when using Julia).
P.S. Setting up your environment
As you probably know I am obsessed with proper environment setup. In an earlier
post I discussed that you should always make sure you run proper versions
of the packages. What is a quick way to set up the environment for the project
described in this post?
When you are in Julia REPL (e.g. started as instructed above in the julia_src
directory) switch to the package manager mode by pressing ] and execute the
following commands (I am showing the whole output which is a bit long but allows
you to check which packages got recursively added to Manifest.toml):
(@v1.4) pkg> activate .
Activating new environment at `~/julia_src/Project.toml`
(julia_src) pkg> add FreqTables@0.4.0 Pipe@1.2.0 ProgressMeter@1.3.0
Updating registry at `~/.julia/registries/General`
Updating git-repo `https://github.com/JuliaRegistries/General.git`
Resolving package versions...
Installed Parsers ─ v1.0.5
Updating `~/julia_src/Project.toml`
[da1fdf0e] + FreqTables v0.4.0
[b98c9c47] + Pipe v1.2.0
[92933f4c] + ProgressMeter v1.3.0
Updating `~/julia_src/Manifest.toml`
[324d7699] + CategoricalArrays v0.8.1
[861a8166] + Combinatorics v1.0.2
[9a962f9c] + DataAPI v1.3.0
[864edb3b] + DataStructures v0.17.17
[e2d170a0] + DataValueInterfaces v1.0.0
[da1fdf0e] + FreqTables v0.4.0
[41ab1584] + InvertedIndices v1.0.0
[82899510] + IteratorInterfaceExtensions v1.0.0
[682c06a0] + JSON v0.21.0
[e1d29d7a] + Missings v0.4.3
[86f7a689] + NamedArrays v0.9.4
[bac558e1] + OrderedCollections v1.2.0
[69de0a69] + Parsers v1.0.5
[b98c9c47] + Pipe v1.2.0
[92933f4c] + ProgressMeter v1.3.0
[ae029012] + Requires v1.0.1
[3783bdb8] + TableTraits v1.0.0
[bd369af6] + Tables v1.0.4
[2a0f44e3] + Base64
[ade2ca70] + Dates
[8bb1440f] + DelimitedFiles
[8ba89e20] + Distributed
[9fa8497b] + Future
[b77e0a4c] + InteractiveUtils
[8f399da3] + Libdl
[37e2e46d] + LinearAlgebra
[56ddb016] + Logging
[d6f4376e] + Markdown
[a63ad114] + Mmap
[de0858da] + Printf
[9a3f8284] + Random
[ea8e919c] + SHA
[9e88b42a] + Serialization
[6462fe0b] + Sockets
[2f01184e] + SparseArrays
[10745b16] + Statistics
[8dfed614] + Test
[cf7118a7] + UUIDs
[4ec0a83e] + Unicode
(julia_src) pkg> status
Status `~/julia_src/Project.toml`
[da1fdf0e] FreqTables v0.4.0
[b98c9c47] Pipe v1.2.0
[92933f4c] ProgressMeter v1.3.0
Now you are sure all will work as expected. Just press backspace to leave the
package manager mode and you are ready to run the examples.