
New trends in programming languages

By: Nicolau Leal Werneck

Re-posted from: https://medium.com/julia-notes/new-trends-in-programming-languages-1876e879651d

In this article we implement the same program in different programming languages, trying to understand what sets them apart. That is only after we ponder a bit about what is even the point of all these languages. We look at C++, Go, Rust, Swift, Python and Julia. Among the many aspects we might appreciate in these languages, we focus on scientific programming applications, and on what sets Julia apart. Does Julia really talk like Python and run like C?

Introduction

I love computers, compilers and programming languages. Or at least I used to, until I grew up. I believe it happened some day between college and my first job after grad school that I became an adult. Among the many revelations that came with this, I learned that in the Real World there is a lot of negative feeling around programming languages, a lot of fear and hatred. Sincerely loving programming languages suddenly became difficult for me.

According to the scriptures, languages are a curse cast onto humanity by The Lord: “Come, let’s go down and confuse the people with different languages. Then they won’t be able to understand each other.”–God. Heaven, 2354.

The tower of Babel being assembled under the disapproving sight of the Principal Architect

Science also has something to say about languages. The Indian linguist Kanavillil Rajagopalan writes that: “Linguistic identity is largely a political matter and languages are flags of allegiance. This means that the instrumental view of language is fundamentally flawed. If anything, it is the pre-theoretical sense that communication is possible or desirable (…) that makes us postulate the existence of a common language.”

Programmers today are often quickly and superficially labelled according to the main language they use. There are “Java programmers”, “Ruby coders”, “PHP people”, “Pythonistas” and “Haskellers”. Job ads almost always ask for proficiency in specific programming languages. Cartoons try to guess what each of these individuals should look like. Differences between languages are often emphasized in our daily lives, and only the sporadic reference to general concepts such as algorithms or design patterns allows us to transcend these differences. Professor Rajagopalan also says: “The Greek sense of self-identity crucially depended upon the perception that [the ‘Barbarians’] were just as human as the Greeks themselves, only different.” How often nowadays do you stop to consider that the “barbarians” are programmers too?

Even though this pigeonholing feels like contemporary identity politics, there has always been some kind of stress between users of different programming languages. The book Hackers by Steven Levy tells in great detail how some Assembly programmers were angry with the popularization of Basic. Even the creation of the first compiled languages like Fortran suffered push-back from people who found it pointless. “What a silly waste of time, I can perfectly go on just flicking these flimsy switches forever!”

Microsoft made a fortune by sparing people from flicking switches

I invented this last quote, so let me provide an actual reference to the fact that not everyone leaped into the first languages as soon as they were available: “The 704 and 709 Fortrans were successful quite early — especially Fortran II — but the penetration on users, so to speak, was rather uneven. The most experienced users (…) tended to retain assembly language programming, and the newest and least sophisticated newcomers to computing were most frequently Fortran users.”

Why are there always some people opposed to new programming technologies? Perhaps it’s not all just politics. First of all, it’s not always clear what is a legitimate technological advance instead of mere hype, and some healthy skepticism is a good thing. On top of that, programming is hard, and most people cannot afford the time and energy to learn new languages. If you can do your work with a language, a work that is challenging and that you are proud of, it really can sound stupid, offensive or insensitive if someone suggests you should or even might do things differently. This sense of pride in your work, and also this feeling of belonging to a community of programmers of a specific language, sharing the same values, both seem related to the idea of thumos, very well discussed in the recent book Identity by Francis Fukuyama.

The world is not frozen in time, though. While many ancestral languages like C and LISP have an impressive ability to remain alive and relevant, new languages continue to be created and adopted by programmers. Especially by beginner programmers, who can often choose indiscriminately between new or established languages as their first one. For any programmer, beginner or not, contemplating the current panorama of languages, these questions come up naturally: What’s the difference between them after all? And what language should I learn?

What follows is an attempt to provide an answer, at least one that makes sense for a certain programmer, and that may be useful to some others.

The rise of scientific Python

In the specific context of scientific and numerical programming, data analysis and machine learning, the question is especially difficult right now because there is a lot going on. This is an area where, first of all, there was always this feeling that to reach the ultimate speed you would eventually have to move your code to “good old” C++ or Fortran. Before that happened, though, you would be prototyping and experimenting with more interactive and dynamic languages such as Matlab, R and Mathematica, or maybe those 90s languages like Perl, Python, Ruby, Lua and Tcl, not to mention the occasional Bash script and SQL query. And some tasks are inherently experimental and interactive. Java has also been used in this area in the past decades, although probably motivated more by its previous adoption in companies for other reasons. And apart from languages and libraries there are of course tools like Weka.

Some languages, especially Python, have been gaining attention in the past decade as a result of the growing interest in data science, and more recently in the use of deep learning for working with tasks involving complex natural signals such as images, sound and natural language. Some of the reasons Python seems so suited for the job might be:

1. The ease to inter-operate with efficient, compiled libraries.

2. The succinct, “clean” syntax resulting from features such as dynamic dispatch and garbage collection, not to mention a more sophisticated parser than most people would’ve had the patience to write until the 1980s.

3. The community. Fun and welcoming since the 1990s, Python managed to attract and retain people with different interests, from system administrators, web and game developers to a more scientific crowd such as the creators of packages like numpy, scipy, scikit and matplotlib.

It is not obvious why Python specifically has gained such popularity, though. Other dynamic languages offered something similar to Python, and some did get adopted in data science and deep learning. For instance, Perl offers arrays similar to Numpy through the PDL library, and has statistical packages on CPAN. It also has/had a large community and plenty of “batteries included”. Octave had a pretty solid offering as a free Matlab clone. For deep learning we can cite the Torch framework, based on Lua, which definitely had some traction. At one time Ruby seemed as good as Python for almost anything, but it eventually became big only in web development, never in numerical work.

The book Leaders: Myth and Reality by McChrystal et al narrates the lives of a number of impressive personalities, and raises the question of how those people became great leaders. It ends up concluding that leadership does not depend so much on personality traits or anything the leader might do. While they do always work hard, successful leadership seems to depend more on symbolism than on actual delivered goods, on the environment the leader was in, the organizations around them, and most of all on the will of the leader’s followers to acclaim and follow that leader. Luck can also be a factor.

Robespierre was a great leader. Very popular, and very sure of what was the right way of doing things. He led the ostentatious “Festival of the Supreme Being” above. Only two months later he was sent to the guillotine.

What could that mean for languages? While great languages like Python or JavaScript do offer some amazing features to their users, widespread adoption might depend on more than that, or even more than the famous batteries. It can depend on how institutions back different languages, on what the language comes to symbolize over time to different people, and it fundamentally depends on the decision of individuals to go and pick up a language. By that I mean the existence of a demand, some itch for developers to scratch, and a willingness to look for the answer in a different language. And finally, some luck might be part of everything.

Note I did not mention popularity here. It may be part of what drives some developers in some situations. In other situations it may be the exact opposite: some people may be looking precisely for a unique language. While access to information and to experienced colleagues can certainly be a factor in making a developer pick a language, no language can be popular from the start, and I think we consider popularity to be an important factor in language adoption far more often than it really matters. It’s just a simplistic and naive take on what drives people to follow a leader or to choose a programming language to learn. We are not just “sheeple”!

Rising demands on dynamic languages

Python has definitely made a formidable journey from a systems scripting language to become a big player in numerical programming alongside Matlab and C++. While the success is clear, the language definitely has its limitations, which are understood by its users. A good job interview covering Python will touch on these topics, and some of the most important tools and techniques used in Python relate to these limitations. And some of these limitations have deep roots.

The first one has to be performance. You simply cannot expect great performance from pure Python code when doing numerical programming tasks. Python can fall behind even compared to other interpreted dynamic languages such as JavaScript. It was never meant to be an efficient language, though. Exaggerating a bit, asking for that is like demanding great performance from a bash interpreter, for instance.

It is impressive how Python sought to overcome its performance limitations. While there were misses such as Unladen Swallow, there have also been hits such as IronPython and PyPy. And on top of that, of course, is the widely successful Matlab-ish paradigm of running fast vector operations over arrays, implemented by Numpy. This led the way to more wins with the use of Python on top of lower-level libraries such as OpenCV and TensorFlow. Apart from Python’s success as a “glue language”, projects such as Cython and Numba demonstrate that the platform can go even further.

Another major limitation of Python relates to its laissez-faire type system. Python normally follows the “duck typing” approach, which basically means every function is completely generic: as long as underlying functions exist that can accept the argument types found at run-time, the code will just run. This flexibility has undeniable benefits for the development experience, and is one of the main reasons for the popularity of dynamic languages. There is another side to it, though, and experienced developers know that in time it can create problems. For instance, it can be hard to figure out how to use some piece of code that is too abstract. “What are even the expected variable types??”
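
Julia behaves the same way by default, so a minimal Julia sketch can illustrate the idea (the function and values here are made up for illustration, not taken from any package):

    # No type declarations: this runs for any arguments that support + and /.
    midpoint(a, b) = (a + b) / 2

    midpoint(1, 3)            # 2.0
    midpoint(1.5, 2.5)        # 2.0
    midpoint([1, 2], [3, 4])  # [2.0, 3.0], since arrays also support + and /

    # A declaration restricts the method and documents the expected types.
    midpoint_strict(a::Number, b::Number) = (a + b) / 2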

It is often useful to type-check an input argument to make sure there are no silent bugs happening because a function is not being used as intended. Discussions about compulsory type declarations in programming languages are at least as old as 1963, as this quote from Hoare in the Algol committee shows. While a compelling story, the Mariner I failure was actually caused by a typo in a variable name, although one may argue this mistake could also have been prevented by type-checking.

Just like with performance, the Python community saw that something could be done to improve this situation, and eventually MyPy appeared as a way to provide static type-checking to the language. Again, it is remarkable that this could be accomplished, and the fact they could pull this off says something about the Python architecture and its community.

Considering the evolution of Python and other dynamic languages in the past decades, it seems there was first a massive adoption followed by a craving for performance and a type system. Apart from Python, this can also be observed in JavaScript, where V8 brought great performance improvements first and more recently TypeScript contributed on the other front. Another language that has recently introduced an optional type system is Racket, which has also been aiming at further improving its performance by adopting Chez Scheme as a sort of compiler back-end.

Developments in the static front

Looking beyond dynamic languages, we can see a number of compiled and type-checked languages that have been proposed in the past decade and have been gaining popularity. We can cite Swift, Rust and Go as examples. All of those have the backing of big organizations and plenty of smart people working on them. The fact that they are compiled means they might offer the desired performance for numerical applications, and the fact that they are somehow “modern” means they can offer at least some usability improvements over languages such as C++ and Fortran. Can the scientific community ever favor those languages the same way as Python or Matlab?

Discussing this kind of question is often annoying because of a fundamental fact: Most programming languages can do everything. Having the potential, and even being somehow innovative doesn’t mean you can actually make it happen, though, as this webinar from last year argues. Adoption of a new language depends on the language solving some kind of problem the users feel they have. The conclusion must be that this cannot mean merely whether it is feasible to create some kind of program with the language, otherwise assembly would be “good enough” for anything, and everybody would still be just punching instruction codes into paper tapes in order to write a program. Adopting a new language must bring a productivity benefit, or some other kind of advantage to its users in order to gain traction.

Rust basically brings some of the same improvements over C and C++ that Java did in the 90s. This necessary aggiornamento for low-level, systems languages means offering features such as automatic bounds-checking for array accesses, automatic memory management through reference counting or other forms of garbage collection, better support for immutability, and fewer automagical type conversions such as integer-to-boolean. (It is important to point out that this last point is what many people mean by “strongly/weakly-typed language”, which is different from having explicit argument types in function declarations. It is also different from dynamic/static typing.)

C and C++ are seen by some today as perfectly fine languages, capable of offering all of the above. They just happen to have the wrong defaults, or require caution. This sounds like a nice argument, though it’s not very different from the point above that most languages can do everything. These “defaults” are actually what is crucial in your experience as a programmer of the language. If you are struggling to make something happen as you want, the language isn’t really helping, and you may be better off creating a whole language that outputs C code instead, such as Cython or Nim. While talking about wrong defaults and warning about the necessary work and caution may seem to help fix the problems in C and C++, these are exactly the things that drive developers to pick up Python instead of C, Fortran instead of Assembly, etc.

Some people are often annoying under the pretext of making a joke. If you are always joking, though, this is who you are. You are actually an annoying person; it makes no difference that it is a joke. It may be possible to perform bounds checks, reference counting, enforced immutability and explicit conversions in C++. For some reason people don’t do it; this is the face of practical C++, at least for many frustrated programmers. It’s great that libraries have been created to deal with these issues. It took decades and the influence of other languages, though, and it can feel like an afterthought, similar to Python’s efforts towards performance and type safety.

Go is quite different from Rust because apart from trying to offer the “right defaults”, it offers a kind of asceticism that some programmers seem to be craving. This is epitomized by its lack of parametric polymorphism. The asceticism is not an explicit wish, though. The creators of Go state their goal to be mostly reducing compilation time and delivering overall simplicity. Knowingly or not, the design of Go seems to be exploiting the paradox of choice somehow. Whatever they are trying to do, the project seems to have attained success at it. The cost of these constraints seems to vary a lot among different programmers, meaning there are some people who will just not be able to use the language, while there does seem to be a large public who is happy with it. Even very happy. It is questionable whether it can please some niches, though, including scientific programmers.

While Rust started backed by the Mozilla Foundation and Go by Google, Swift was created by Apple to be the official language for iOS development, replacing Objective-C, whose name indicates some sort of inspiration. Swift seems to retain some baggage from its predecessor, which should be expected. It does offer something new, though; maybe the main thing is the fact that its compiler is based on LLVM, just like Rust and unlike Go.

Swift was not considered a disruptive language in the webinar we cited, maybe because at first it only sought to replace Objective-C, catering to the same audience. Recently, though, some people have been proposing Swift as a language that might offer something to scientific programmers. Central to this idea is the Swift for TensorFlow project.

The timeline goes a bit like this: in the early 2000s Chris Lattner created LLVM at UIUC, with which Clang was developed, and later many other projects. Lattner was later hired by Apple, where he created Swift based on LLVM (and some of that Objective-C baggage). Now he works at Google and is building on top of his successes with the Swift for TensorFlow project.

One of the core developers of Swift for TensorFlow was Richard Wei, who followed Lattner’s steps working at the same UIUC group, Apple and then Google. The seed for Swift for TensorFlow seems to be his graduation project, discussed in this presentation at the 2017 LLVM developers meeting. Apart from the support from Google, the project has been receiving some external praise too, conspicuously from the Silicon Valley-based fast.ai.

The project seems to exert a natural attraction on anyone interested in developing iOS apps with deep learning near San Francisco. S4TF does much more than collect the right bag of buzzwords, though. One big reason for its success may be that similar approaches in other modern compiled languages have actually been tried, and simply dropped. Other languages just don’t seem to be adequate for it.

This blog here is a nice illustration of the difficulties that can be faced by anyone trying to integrate TensorFlow with either Go or Rust, and also how even Swift itself was challenging. Swift, by the way, also had its own bumpy ride going from something strictly used for iOS apps to turn into something that might be used in the server or for scientific programming.

The latest bump in the Swift ride is perhaps Richard Wei announcing last month that he’s leaving the project. Although the project remains with strong institutional backing, some important questions about whether it all could really be integrated into Swift were never answered. It remains unclear what the potential of S4TF is as a practical tool, and not just a well-working prototype that demonstrates many of the things contemporary scientific programmers are looking for.

Building a tool for the job

From S4TF, Python and even C, we have been repeatedly talking here about languages that are being used for scientific programming as an afterthought. This contrasts with languages built with this purpose in mind from the start, such as Mathematica, R, Matlab, Fortran or APL. It doesn’t mean the former languages cannot be good, and it doesn’t mean the latter are intrinsically better: they do face challenges when moving from prototyping to commercial applications, for instance.

One of the most interesting language alternatives available for scientific programming today was developed on top of LLVM, with an explicit goal of allowing top performance to be achievable. It was also designed with all the benefits seen in modern languages that allow for comfortable prototyping, with a type system that can be as flexible or strict as you wish, and also advanced language features that enable the implementation of techniques such as source-to-source automatic differentiation.

This language is Julia. It started at MIT, which provides institutional backing. It has been in development for many years already, and version 1.0, released last year, was an important landmark. The maturity reached by the core language is complemented by some great tools such as a package system and a debugger, not to mention many packages that help developers with everything from differential equations and image processing all the way to writing an HTTP server or scripts to run external processes.

While Julia’s dynamic features paired with its close integration with LLVM allow Julia to solve the so-called “two-language problem”, Julia also offers great integration with other languages, including Python and C. It has a community with a strong academic background, with some quite active online resources such as a forum and a Slack workspace. It also has a great conference, and many of its talks are available on YouTube.

JuliaCon attendees in Baltimore, USA, 2019.

Experiments

This article begins with a promise of an experiment looking into different languages. While it was a long ride to get here, we will finally deliver it!

Julia seems to be accomplishing something that almost sounds like a paradox. In the two histories we discussed before, we saw what we called “dynamical” languages struggling to offer some features from “static” languages, especially high performance and static type-checking, while these “static” languages were following a different path, slowly incorporating features such as polymorphism, meta-programming or dynamic dispatch.

This distinction between these groups of languages can make it seem there is some natural trade-off between them. As if a language can be either fast and reliable to run, or convenient and pleasant to write. As soon as a “dynamic” language offers static analysis as with MyPy or a form of run-time compilation such as a TensorFlow graph, or when C++ starts offering native dynamic dispatch and a template system that effectively allows you to do “duck typing”, does the nomenclature still make sense?

There is no paradox, really. We believe there is just a natural evolution of programming languages, and the specific limitations we see in each of them are incidental, not essential.

Our experiment hypothesis is that Julia allows you to write code that is very similar to Python, a high standard in syntax quality, while also delivering the performance of C++, a high standard in code efficiency. It is not easy to come up with a good objective assessment of these things, though we believe trying it out is better than just waving our arms and leaving you without any kind of practical argument to support all the many words already spent in this article!

Show me the code

We implemented the “change-making problem” in multiple languages, and the code is available at: https://gist.github.com/nlw0/04ec031eaa839d5e358d7ad0d194c497
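
For reference, here is a minimal Julia sketch of one common formulation of the problem, counting the number of ways to make change with dynamic programming; the actual benchmarked implementations live in the gist and may differ in detail:

    # Count how many ways `amount` can be formed from the given coin denominations.
    function count_change(amount::Int, coins::Vector{Int})
        ways = zeros(Int, amount + 1)   # ways[v + 1] = number of ways to make value v
        ways[1] = 1                     # one way to make zero: use no coins
        for c in coins
            for v in c:amount
                ways[v + 1] += ways[v - c + 1]
            end
        end
        return ways[end]
    end

    count_change(907, [11, 10, 1])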

We offer these numbers as a performance evaluation, solving the problem for the number 907 with coin denominations {11, 10, 1}.

C++: 3.775 ms

Julia: 3.315 ms

Python: 15.098 ms

It seems safe to say Julia can achieve the same performance as C++. In this case it even surpassed it by a small margin, a difference that is most probably related to some small implementation detail that a diligent programmer might be able to track down. Python, on the other hand, is at least 4 times slower than either Julia or C++. That is actually quite a good result; Python can be way slower in other cases.

Regarding syntax similarity, we performed a test using diff. The script can be found in the gist above, and all the pairwise similarity results can also be found there. Here are the similarities relative to Python only:

Python: 1.0
Julia: 0.61
JavaScript: 0.55
Swift: 0.46
Go: 0.33
Rust: 0.30
C++: 0.30

The fact that JavaScript scored high while C++ scored low lends some support to this measurement of language similarity. And notably, Julia scored highest in this “Pythonicity” test among all the other languages.
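
For the curious, here is a rough Julia sketch of how a diff-based similarity score can be computed; the file names are illustrative, and the actual script and metric are the ones in the gist:

    # Score two source files by the fraction of lines that diff does not flag.
    function similarity(file_a::AbstractString, file_b::AbstractString)
        total = length(readlines(file_a)) + length(readlines(file_b))
        # diff exits with a non-zero status when the files differ, hence ignorestatus
        changed = count(l -> startswith(l, "<") || startswith(l, ">"),
                        eachline(ignorestatus(`diff $file_a $file_b`)))
        return 1 - changed / total
    end

    similarity("changemaking.py", "changemaking.jl")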

Conclusion

The importance of programming languages in our lives has only increased since their creation, and they have shown to be very enduring entities. While great transformations have happened in computer technology, from processors to networks, some programming languages seem able to live forever after reaching a certain level of popularity. Languages play a role in our lives that goes beyond their strictly technical purpose. They form a rallying point for the formation of programmer communities, and learning specific programming languages is often conflated with learning other general topics, including computer science.

In the first decades of programming languages’ existence different paradigms were investigated, and the many initiatives showed us what is even possible to do with them. Modern programming languages are being created in a different context, and improvements in software development as a whole make it easier every day for programming languages to offer new features that in the past would tend to remain exclusive to the languages where they were first created. Among these technological improvements we can cite LLVM in particular as one important tool that has enabled many of the most interesting languages created in the past decade or so. The JVM has also had a similar effect, not to mention GCC.

Modern scientific programming has been pushing the limits of languages in many ways, requiring solutions for high-performance, distributed and GPU programming, interactive and exploratory work, and also demanding more powerful abstractions. This is exemplified by the need to handle large and complex data pipelines and models, that we want to evaluate, modify, and take all imaginable derivatives along the way!

The great flexibility in programming language development today means many languages can offer more or less the same thing, and inter-operation has been becoming easier with time too. It’s not really all the same, though, and the differences can be subtle and annoying things to talk about. One can go crazy looking at all the differences in our example code and trying to say why the syntax of one language is better than another. We did our best not to fall into this trap in this blog. When debating programming languages, we keep emphasizing small differences, unless it seems convenient to admit a “common language”. Then the difference between languages becomes a silly detail, and the barbarians become people too…

This article started with the intention of advocating the Julia language. Many Julia programmers are scientists and academics, though, and true to that nature, this became a scientific-y text. It turns out, though, that advocating a language may be something strictly rhetorical.

If we try to find very compelling objective reasons to use Julia, we will probably always perform experiments similar to what we showed before. We can show good running times, and if there is such a thing as a good test for a “clean syntax”, it seems the language can stand up to it. It can also definitely deliver modern needs such as running on GPUs and performing automatic differentiation. It checks many boxes.

The main reasons anyone will pick up Julia, though, or any other language, are probably not very scientific in nature. Programming is actually a very social activity, which seems at odds with the very technical and impersonal nature of the problems programmers often concern themselves with. Technical reasons may not be enough to convince anyone to try a new language. It’s more like: if you liked this article, you should probably try the language. Join the community, ask your friends, read more about it, I hope you like it.

Our final conclusion must be that you probably don’t need to hear a scientist telling you why to choose a language, but a poet instead. The Brazilian writer Clarice Lispector once wrote:

Surrender, as I surrendered. Immerse yourself in what you do not know, as I immersed myself. Do not worry about understanding; living surpasses all understanding.

Try Julia out today!



A Tour of the Data Ecosystem in Julia

By: Jacob Quinn

Re-posted from: https://quinnj.home.blog/2019/07/21/a-tour-of-the-data-ecosystem-in-julia/

Julia 1.0 was released at JuliaCon 2018 and it’s been a quick year for the package ecosystem to build upon the first long-term stable release. In a lot of ways, pre-1.0 for packages involved a lot of experimentation; a lot of trying out various ideas, shotgun-style and seeing what sticks, in addition to trying to keep up with the evolving core language. One of my favorite things about Julia is the great efforts that have been made to collaborate, coordinate, and modularize not just the base language and standard libraries, but the entire package ecosystem. Julia was born in the age of GitHub, Discourse, and Slack, which has led to an exceptional amount of public communication from and with core developers, targeted efforts to automatically update package code and test the impact of core changes to the ecosystem, and a proliferation of user-group meetups and domain-specific GitHub organizations. I don’t feel dishonest in saying I think Julia is the most collaborative programming language that exists today.

With all this context, one might be wondering: so what is the current status of working with data in Julia? How do various packages like CSV.jl, DataFrames.jl, JuliaDB.jl, and Query.jl play together? And so I present, “A Tour of the Data Ecosystem in Julia”. Let’s begin…

Data I/O: How do I get data in and out of Julia?

Text Files
I once heard Jeff Bezanson say tongue-in-cheek, “It doesn’t really matter what fancy features you put in a programming language, all people really want to do is read csv files. That’s it, just csv file reading.” Julia has grown a lot since the earliest days of dlmread, with several excellent packages for reading csv and other delimited files, each growing out of a unique idea for approaching this old problem or addressing a specific integration need.

CSV.jl
As blog author, you’ll have to forgive the shameless plug for my own packages: I started CSV.jl in 2015 to make a strong effort to match the csv parsing functionality provided by other popular languages (in particular, pandas in Python and fread in R). Since then, it’s accumulated some ~400 commits from over 25 collaborators, and recently hit issue #464. It also released a significant upgrade with the recent 0.5 release, bringing performance in line with the fastest parsers in other languages, while providing some powerful and unique features not available elsewhere, including: “perfect” column typing without needing to restart parsing at all, auto delimiter detection, automatic handling of invalid rows, and several options/layers for lazily parsing files, all with unparalleled performance (a full feature comparison with pandas and R can be found in the latest discourse announcement post). Part of CSV.jl’s speed comes from the hand-tuned type parsers made available in the Parsers.jl package, which includes extremely performant parsing for ints, floats, bools, and dates/datetimes, all in pure Julia.
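
A minimal sketch of typical usage follows; the file name is made up, and keyword names and the exact read API have shifted a bit between CSV.jl versions, so treat this as illustrative:

    using CSV, DataFrames

    # Materialize a delimited file as a DataFrame (older releases returned a
    # DataFrame from CSV.read without the explicit sink argument).
    df = CSV.read("data.csv", DataFrame)

    # A lazier representation that can feed any Tables.jl-compatible sink.
    f = CSV.File("data.csv"; delim=',', missingstring="NA")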

Tables.jl
We’re going to take a quick detour from I/O packages here to mention a key building block in the I/O story. The Tables.jl package was born during the 2018 JuliaCon hack-a-thon. It was a collaboration between several data-related package authors to come up with essentially two access patterns for “table”-like data: “rows” and “columns”. The core idea is simple, yet powerful: any table-like data format that can implement one of the access patterns (via row iteration or column access) can “automatically” integrate with any downstream package that then uses one of the access patterns, all without needing to take bi-directional dependencies. Even as a relatively new package, the power of this interface can already be seen in the integrations available (a minimal sketch of the consumer side appears after the list below):

  • In-memory datastructures
    • DataFrames.jl
    • TypedTables.jl
    • IndexedTables.jl/JuliaDB.jl
    • FunctionalTables.jl
  • Data Formatting/Processing Packages
    • DataKnots.jl
    • FreqTables.jl
    • Mustache.jl
    • FormattedTables.jl
    • PrettyTables.jl
    • TableView.jl
    • BrowseTables.jl
  • Data File Format Packages
    • CSV.jl
    • Feather.jl
    • StataDTAFiles.jl
    • Taro.jl
    • XLSX.jl
  • Database Packages
    • ODBC.jl
    • SQLite.jl
    • MySQL.jl
    • LibPQ.jl
    • JDBC.jl
  • Statistics Packages
    • StatsModels.jl
    • MLJ.jl
    • GLM.jl
  • Plotting Packages
    • StatsMakie.jl
    • StatsPlots.jl
    • TableWidgets.jl
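
As promised above, here is a minimal sketch of the two access patterns; the toy table is just a NamedTuple of vectors, which itself satisfies the interface:

    using Tables

    # A NamedTuple of vectors is already a valid Tables.jl "table".
    tbl = (a = [1, 2, 3], b = [4.0, 5.0, 6.0])

    rows = Tables.rowtable(tbl)     # Vector of NamedTuples, one per row
    cols = Tables.columntable(tbl)  # NamedTuple of column vectors

    # A consumer can iterate rows without caring what the original source was.
    total_b = sum(r.b for r in Tables.rows(tbl))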

Ok, back to data I/O. We also have the wonderful TextParse.jl, which started as the data ingestion engine for JuliaDB.jl. While blazing fast, it doesn’t have quite the maturity or breadth of functionality compared to CSV.jl, like this long-standing issue of incorrect float parsing. The CSVFiles.jl package also exists to provide FileIO.jl integration, meaning you can just do load("data.csv") and FileIO.jl automatically recognizes the .csv extension and knows how to load the data. (A full feature comparison between CSV.jl and CSVFiles.jl can be found here). Another interesting approach to csv reading popped up recently in the form of TableReader.jl from the ever-talented Kenta Sato, using finite state machines. A pretty decent collection of benchmarks can be found here.

In addition to the excellent packages for handling text-based, delimited files, there are also some great packages for managing/handling other data formats in general.

  • RData.jl: for reading .rda files from R into Julia
  • DBFTables.jl: for reading .dbf files into Julia
  • StatFiles.jl: for reading SAS, SPSS, and Stata files into Julia
  • StataDTAFiles.jl: for reading and writing Stata files
  • DataDeps.jl: for declaring data dependencies and managing reproducible setups for working with data in Julia
  • ExcelFiles.jl: for reading excel files into Julia and integration with FileIO.jl

Another popular option for data storage is binary-based formats, including feather, apache arrow, parquet, orc, avro, and BSON. Julia has pretty good coverage of these formats, including native Julia implementations in Feather.jl (and FileIO.jl integration in FeatherFiles.jl), Arrow.jl, and Parquet.jl (again, with FileIO.jl integration with ParquetFiles.jl). There’s also the BSON.jl package for binary JSON support. Beginning support for avro has been started in Avro.jl, and I’m personally interested in diving into the ORC format.

Database Support
As mentioned above in the Tables.jl integrations, there are also great packages in place to support extracting data from databases (a small usage sketch follows this list), including:

  • ODBC.jl: provides generic support for any database that provides a compatible ODBC driver. Supports parameterized queries for inserting data, as well as extracting data from a query, and with Tables.jl support, “exporting” the results to any Tables.jl-compatible sink. It does require setting up an ODBC administrator tool on OSX and linux (windows has builtin support), and then proper setup/installation of the specific database’s ODBC driver, but once set up, things generally work pretty seamlessly.
  • JDBC.jl: the JDBC.jl package also provides access for databases that support JDBC access. It does so via JavaCall.jl, which requires an active JDK to work and can also be tricky to set up, but database vendors tend to have pretty good support for JDBC drivers.
  • MySQL.jl/LibPQ.jl: specific packages for mysql/postgres databases, respectively. Both provide integration with Tables.jl and aim to provide support for database-specific types and functionality (beyond the more generic interfaces of ODBC/JDBC). They can be easier to setup since there’s no “middle man” interface library and you just need to interact with the database libraries directly.
  • SQLite.jl: a library providing integration with the excellent sqlite database; supports in-memory databases as well as file-based. Supports parameterized queries and Tables.jl integration for loading data and extracting query results.
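
The promised sketch shows roughly what the SQLite.jl workflow looks like; the function names follow my reading of recent versions of the package and the generic DBInterface.jl layer it implements, so check the current docs before copying:

    using SQLite, DBInterface, DataFrames

    db = SQLite.DB()                                  # in-memory database
    df = DataFrame(id = 1:3, name = ["a", "b", "c"])
    SQLite.load!(df, db, "people")                    # load any Tables.jl source

    # Pull query results back out into any Tables.jl sink, here a DataFrame.
    res = DataFrame(DBInterface.execute(db, "SELECT * FROM people WHERE id > 1"))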

Data Processing: what can I do with my data once it’s in Julia?

Once you identify the right I/O package for reading your data, you then have a choice to make with regards to what you need/want to do with it. Clean it? Reshape it? Filter/calculate/group/sort it? Just browse around for a little while? Do advanced statistics, model training, or other machine learning techniques? Here we take a tour of the current landscape of packages that provide various types of functionality with regards to data processing, including table-like structures, query functionality, and statistics/machine learning.

Table Structures
Yes, in Julia we can have an entire section where we talk about table structures, plural. Sometimes when I mention to people that there are several dataframe-like packages in Julia, they are initially confused: why does Julia need more than one table package? Why doesn’t everyone just work on a single package to focus on quality over quantity? My most common response goes something like this: for a project like Pandas, or data.table, or data.frame, most of the actual code is written in what language? Python? Or R? It’s actually C/C++. The common belief is that because R/Python are dynamic, “high-level” languages, they can’t be fast, but that’s no worry, you just write the parts that need to be fast in C. But therein lies a core issue: for languages like Python and R, where the user-to-developer ratio is so high, having core pieces of functionality written in a lower-level language like C makes the code much less accessible to someone who wants to contribute, let alone just take a peek under the hood to see how things work. It automatically splits the code into a lower-level “black box” component, and a higher-level “wrapper” component. I believe it stunts innovation due to such a high “barrier to entry” to take a new approach to a table structure. I might be a budding R or Python user, getting into developing things, but shoot, if I want to make a meaningful contribution to Pandas or data.table, all of a sudden I’m wading into deep make/cmake/build issues, compiler versions, and manual memory management that I’ve never dealt with before. The other alternative is I write something in pure R/Python which will just surely be doomed to performance issues, regardless of how novel my interfaces or APIs might be.
Julia, however, is a famously declared solution to this “two-language problem”. No longer do budding developers need to fear plain for-loops, or vectorize every operation, or rely on C/C++ for the “core stuff”. You can just write plain Julia, down in the trenches, and up at the highest-level user APIs. Julia, all the way down.
I firmly believe this has led to greater innovation in Julia for experimenting with unique table structures, as well as making packages more accessible to those hoping to contribute.
Ok, enough soap-boxing, let’s talk packages.

DataFrames.jl
The DataFrames.jl package is one of the very oldest packages in the Julia ecosystem. This tenure and naming proximity with its cousins in Pandas and R have also made it one of the most popular packages for those hoping to give Julia a try. The package has evolved quite a bit since its early days, and is rapidly approaching its own 1.0 release (expected around JuliaCon 2019). Development over the last year or two has focused on core performance, safety of APIs, and overall consistency with Base APIs. The amount of thought, effort, discussion, and documentation by numerous collaborators makes it the most mature “table” package in my opinion. So what’s unique about DataFrames? I’ll try to give what I deem to be notable highlights of how DataFrames.jl approaches representing tabular data:

  • A DataFrame stores columns internally as a Vector{AbstractVector}; but wait, you might ask, isn’t that type unstable (since we’re essentially lumping all columns, regardless of individual column type, as AbstractVector)? Yes! And on purpose! Experienced Julia developers are quick to point out that sometimes code can get “overly typed”, leading to “compilation overdrive”, where the compiler has to generate very specialized code for every operation, with compiled code reuse rare. A DataFrame can represent any number of columns, with any combination of column types, so it’s a natural scenario where you may want to “hide” some type information from the compiler by slapping the lowest common denominator abstract type as the type label (AbstractVector in this case). This design decision has certainly been extensively discussed, but it remains as-is, serving as a more compiler-friendly option than other “strongly typed” table types.
  • DataFrames.jl includes specialized subtypes for representing SubDataFrames and GroupedDataFrames, as opposed to returning full DataFrames; these “lazy” structures are mostly used in intermediate operations and are a useful way to avoid too much unnecessary data copying/movement
  • Core manipulation operations included in the package itself include grouping, joining, and indexing (a small usage sketch follows this list); in particular, it supports column indexing via regex matches, functional selection, Not indexing (inverted indices), and flexible interfaces similar to Base arrays for filtering/selecting specific indices for rows or columns.
  • A lot of work in recent years has also been to simplify and move code out of the DataFrames.jl package, to focus on the core types, functionality, and reduce the dependency burden, being a common dependency for packages around the ecosystem; this has included notably the creation of the DataFramesMeta.jl package to support other common query/filter/manipulation operations
  • DataFrames.jl supports the Tables.jl interface, which means any I/O package also supporting it can automatically convert its table format into an in-memory DataFrame, and convert back to the format for output
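
As promised above, a small hedged sketch of those core operations, using the verbs of recent DataFrames.jl releases (the exact API was still settling pre-1.0, and the data here is made up):

    using DataFrames

    df = DataFrame(region = ["N", "S", "N", "S"], sales = [10, 20, 30, 40])

    df[df.sales .> 15, :]                       # Base-like row filtering
    gd = groupby(df, :region)                   # lazy GroupedDataFrame
    combine(gd, :sales => sum => :total_sales)  # grouped aggregation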

IndexedTables.jl / JuliaDB.jl

JuliaDB.jl splashed onto the Julia scene in early 2017, touting a new approach to query/table operations that utilized type stability and Julia’s built-in parallelism to provide “out of core” functionality like a database. JuliaDB.jl’s core table type actually lives in the IndexedTables.jl package, which in turn uses the clever StructArrays.jl package to turn NamedTuple rows into an efficient struct-of-arrays structure more suitable for columnar analytics. JuliaDB.jl itself then, with the use of Dagger.jl, adds the “parallel” layer on top of IndexedTables.jl. The benefits of “type stability” come from a table being essentially encoded as Table{Col1T, Col2T, Col3T, ...}, where the full type of each column is encoded in the top-level table type. This allows operations like selecting columns, filtering, and aggregating to be extremely efficient due to the compiler knowing the exact types it’s dealing with (as opposed to having to do “runtime” checks as in the DataFrames.jl case). Now, as discussed in the DataFrames.jl section, this doesn’t come without a cost; indeed, there’s a long-standing issue for dealing with the compilation cost for tables with a large number of columns. But in the case of a manageable number of columns, the ability to scale tables beyond a single machine’s memory limits is powerful functionality. JuliaDB.jl is also sponsored by JuliaComputing, which gives it a nice stamp of support and stability. While it may lack some of the maturity of DataFrames.jl’s long-discussed APIs, I’m excited by the unique, “Julian” approach JuliaDB.jl provides in the big data analytics space. Go check out the docs here and give it a spin. An additional package providing experimental manipulation functions for JuliaDB is JuliaDBMeta.jl, which mostly mirrors the before-mentioned DataFramesMeta.jl, but for JuliaDB tables.
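
A small hedged sketch of the IndexedTables/JuliaDB style, where the column types are carried in the table’s own type (the data is made up):

    using JuliaDB

    t = table((x = 1:5, y = [0.1, 0.2, 0.3, 0.4, 0.5]))

    select(t, :y)              # column selection
    filter(r -> r.x > 2, t)    # row filtering over NamedTuple-like rows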

TypedTables.jl
The TypedTables.jl package is one that started out in “experimentation” mode for a while until recently being declared “ready” by its primary author, the venerable Andy Ferris. Andy has long been known in the Julia community for his eye for API consistency and being able to strike that elusive balance between theoretical ideals and practical use. In TypedTables.jl, the “fully type stable” approach is taken, similar to IndexedTables.jl/JuliaDB.jl, with a Table being defined literally as <: AbstractVector{T} where {T <: NamedTuple}, that is, a collection of “rows” or NamedTuples. While TypedTables.jl includes the @Select and @Compute macros for simple manipulations, some of the more interesting promise comes from the TypedTables.jl “supporting cast” packages (a tiny usage sketch follows the list):

  • AcceleratedArrays.jl: a tidy package to turn any array into an “indexed” array (in the database sense), to provide optimized “search” functions: findall, findfirst, filter, unique, group, join, etc. An AcceleratedArray can thus be used in a TypedTable to provide powerful indexing behavior for an entire table (though note that it works on any AbstractArray, which means these indexed columns could even be used in a DataFrame).
  • SplitApplyCombine.jl: this package provides a powerful set of functions to perform common split, apply, and combine operations on generic collections like mapmany, group, product, groupreduce, and innerjoin. The aim is to provide the basic building blocks of relational algebra functions that work on any kind of collection, obviously including a TypedTable as a specialized type of AbstractVector of NamedTuples.
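
And the promised tiny sketch: because a Table is literally a vector of NamedTuples, rows and columns both fall out of ordinary Julia syntax (the data here is made up):

    using TypedTables

    t = Table(name = ["ada", "grace", "alan"], score = [3, 1, 2])

    t[1]                      # (name = "ada", score = 3), a NamedTuple row
    t.score                   # the score column
    map(r -> r.score + 1, t)  # rows iterate as NamedTuples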

CSV.File
One more honorable mention for table structures (and another shameless plug) actually comes from the CSV.jl package. While CSV.read materializes a file as a DataFrame, a CSV.File, which supports all the same keyword arguments as CSV.read, can be used to transfer data to any other Tables.jl sink, or used directly as a table itself. It supports getproperty for column view access and iterates a CSV.Row type (which acts like a NamedTuple). I think as packages continue to evolve, we’ll see more and more cases of customized structures like this, which allow for certain efficiencies or specialized views into raw data formats; with sufficiently general interfaces like Tables.jl and SplitApplyCombine.jl, users won’t need to worry as much about conforming to a single table structure for everything, but can focus on understanding more generic interfaces, and using data structures optimized for specific use-cases, data formats, and workflows.
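
A short sketch of that usage pattern (the file and column names are illustrative):

    using CSV

    f = CSV.File("data.csv")
    for row in f               # iterates CSV.Row values that act like NamedTuples
        println(row.price)
    end
    f.price                    # getproperty also exposes a view of the whole column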

Query.jl
Another package (a set of packages really) that must be discussed in the Julia data ecosystem is Query.jl. Pioneered by David Anthoff, Query.jl provides a LINQ implementation for Julia. Query.jl and its sister package QueryOperators.jl provide a custom “query dsl” via a set of macros that allow convenient “query context” syntax for common manipulation tasks: selection and projection, filtering, grouping, and joining. These “query verbs” are also able to operate on any iterator, in true LINQ fashion, which makes the processing functions extremely versatile. While currently Query.jl/QueryOperators.jl hold the sole implementations of the query verbs, the grander scheme of having a custom dsl is the ability to represent entire queries in an AST (abstract syntax tree), which could then allow custom implementations that “execute” a structured query. This is most immediately useful when one considers being able to use a single “query dsl” to operate on both DataFrames and database tables, having a query translated into a vendor-specific SQL syntax. While not currently fully fleshed out, the ambitious undertaking is exciting to track.
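
A minimal sketch of the query dsl, following the documented Query.jl syntax (the data frame here is made up):

    using Query, DataFrames

    df = DataFrame(name = ["ada", "grace", "alan"], age = [36, 45, 41])

    result = @from r in df begin
        @where r.age > 40
        @select {r.name, r.age}
        @collect DataFrame
    end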

DataValues.jl
One note for users of Query.jl is the current reliance on the DataValues.jl package to represent missing data. What that means is that query functions in Query.jl/QueryOperators.jl aren’t integrated with the Base-builtin representation of missing, but rely on the DataValues.jl package, which defines a DataValue{T} wrapper type that also holds whether a value is missing or not (along with what “type” of missing value it is). While the history of missing data in Julia is long and storied, Andy Ferris wisely noted that no missing value representation is perfect. missing was included in Base largely due to the ease of working with a single sentinel value, and compiler support for code generation involving Union{T, Missing}. Inherent in the use of Union{T, Missing}, however, is a current compiler complexity involving inference of Union values that are stored in a parametric struct field. This affects the @map macro in Query.jl with NamedTuple inputs to NamedTuple outputs; hence, Query.jl relies on the use of DataValue{T} to more conveniently pass type information through projections. There are also active efforts to explore ways the core language can avoid the need for an explicit wrapper type while still propagating the type information soundly through projections.

Additional Data Efforts

Other efforts integrating with the data ecosystem include (but are certainly not limited to):

  • StatsModels.jl: for specifying (in familiar “formula” notation), fitting and evaluating statistical models; can operate on any Tables.jl-compatible source
  • GLM.jl for working specifically with generalized linear models
  • LightQuery.jl: another budding approach to type-stable querying capabilities
  • Flux.jl for an extremely extensible approach to machine learning modelling, GPU integration, and pure Julia, all the way down
  • Distributions.jl: for sampling, moments, and density/mass functions for a wide variety of distributions
  • StatsMakie.jl for GPU-enabled statistical plotting goodness
  • MLJ.jl: another new holistic approach to representing a variety of machine learning models sponsored by the Alan Turing Institute
  • ScikitLearn.jl for Julia access to the scikit-learn APIs for machine learning, including pure Julia implementations and integration with python models via PyCall.jl
  • OnlineStats.jl: a mature, fully featured statistical package for “online” statistical algorithms, including well-documented source code for published algorithms
  • TextAnalysis.jl: providing algorithms, statistical support, and feature engineering for text analysis
  • Dagger.jl: a dask-like Julia framework for distributed, parallel computation graphs
  • MultivariateStats.jl: stats, but for multiple variables!
  • RCall.jl: package that allows integrating with the R statistical language; transfer objects between languages, call R functions/libraries, etc.
  • StatsPlots.jl: another strong statistical plotting package

Future of Data in Julia

So now JuliaCon 2019 is upon us and we have to wonder: what’s next for working with data in Julia? As I’ve tried to illustrate in this long showcase of Julia packages, the data ecosystem has come a long way since the official 1.0 release of the language itself. Support for the most common data formats is about as mature as in any other language, but there’s always room to improve. The in-memory processing space is an exciting one to watch in Julia due to the number of approaches being fleshed out, with varying degrees of maturity. DataFrames.jl is solid and should be a main utility knife for any Julia programmer, but one of the wonders of Julia, as mentioned above, is the ease of developing high-level, performant solutions in the language itself, so it’s exciting to see alternative approaches that can offer trade-offs or additional features that may enable better workflows depending on the environment. But given all that, here’s a shortlist of things swimming in my head around the future of the data ecosystem in Julia:

  • Ensure the performance and usability of Union{T, Missing} to represent missing data in Julia; currently, 95% of uses and workflows work amazingly well, but we always want to track down corner cases and do everything we can to improve the compiler, core language, or APIs
  • Data format support: the job is never done here, but on my mind are an officially blessed (and integrated) implementation of apache arrow, write support for Parquet.jl, and a Julia package for supporting the ORC data format
  • Working towards a common API package/definitions for common table processing tasks; while exploratory efforts are always encouraged, it could be immensely convenient to users of various table types if there were a common set of operations that “just worked”. While Query.jl currently provides the best solution for this, it has other integration issues in the ecosystem; so working to resolve those or define a new common “table operations” type package
  • Relatedly, defining a full “structured query graph” model could be one way to provide a lower-level shared representation of querying tasks. This could catalyze “frontend” efforts (custom DSLs like Query.jl’s, or new dplyr-like verbs, or even a plain SQL parsing package) to “lower” to this common representation, while allowing similar types of backend innovation in *how* these structured query graphs are executed (with parallel support, out-of-core, etc.). I’ve recently been studying an old effort to do something like this for inspiration

For those who made it this far, kudos! As always, follow and ping me on twitter to chat data in Julia. And if you’ll be at JuliaCon 2019 in Baltimore, hit me up in the official Julia slack to meet up and chat.