Tag Archives: General Programming

Maximizing Julia Development with VSCode Extension

By: Justyn Nissly

Re-posted from: https://blog.glcs.io/julia-vs-code-extension

In this article, we will review some of the features that the Julia extension offers! If you don’t have Julia, VS Code, or the Julia extension already installed, look at this article to help you get set up!

Running a Julia File

Now that we have the extension installed and configured, we can start using it. One of the first things we will examine is the most basic feature: running code! To run code, we do the following:

  1. Click the drop down in the top right of the code window
  2. Click “Julia: Execute Code in REPL”
  3. Enjoy the results!

When you hit Ctrl+Enter, the line your cursor is currently on will run, and the cursor will advance to the next line. This is one way to step through the code line by line.

You are also able to use Ctrl+F5 (Option+F5 for Macs) to run the entire file. You can learn more about running Julia code from the official documentation.
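If you want a file to practice these shortcuts on, a minimal script like the following works well (the filename and contents are just a suggestion):

```julia
# hello.jl: a tiny script to try the run shortcuts on
greet(name) = "Hello, $(name)!"

println(greet("Julia"))
```

Place the cursor on a line and hit Ctrl+Enter to step through it, or use Ctrl+F5 to run the whole file at once.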

Code Navigation

The information in this section applies to any language and is not exclusive to the Julia extension. The features below are helpful to know and can increase your productivity as you code.

Within VS Code, if you use Ctrl+P, a prompt will open at the top of your editor. In that prompt, you can start typing the name of a file you want to jump to in your project. Once you click the option (or press enter), VS Code will jump directly to that file. You can also use Ctrl+G to jump to a specific line number in the file if you know which line you are looking for. Being able to jump back and forth between files without having to search the file tree will greatly enhance your workflow. Imagine all the time you will save by not searching through file trees!

Using Ctrl+Shift+O (be sure to use the letter O, not the number 0) will allow you to navigate through individual symbols within your program. After using the shortcut above, type : and all your symbols will be grouped by type. You are then able to navigate between them all.
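To try this out, you need a file with a few symbols in it. Here is a small, made-up example with a struct and two functions to jump between:

```julia
# A few symbols to practice Ctrl+Shift+O navigation on
struct Point
    x::Float64
    y::Float64
end

# squared distance from the origin
norm2(p::Point) = p.x^2 + p.y^2

# scale a point by a constant factor
scale(p::Point, k) = Point(k * p.x, k * p.y)
```

With this file open, Ctrl+Shift+O followed by : should group Point, norm2, and scale by symbol type.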

Editing Code

Code navigation is a great skill to develop, especially when given some wonderful legacy code to unravel…we’ve all been there. Being proficient in navigating your code does you no good unless you are also proficient at editing the code.

One way to get going quickly is to use the “rename symbol” shortcut. You can either right click on a symbol and press “rename” or hit F2. When you rename the symbol, it will be renamed everywhere else in the file that it exists. Pretty neat, huh?

Change_Symbol

The Plot Viewer

Up until this point in the article we have laid the ground work for working with your code in VS Code. Next, we will look into some of the Julia specific features that the Julia extension offers, starting with the plot viewer.

The plot viewer is a really handy tool that lets you…well…view plots. We can look at an example to see how it works.

First, we will install the plots package if it hasn’t been installed already.

julia> using Pkg
julia> Pkg.add("Plots")

# OR

# We can type the "]" key and use the package interface
(@v1.10) pkg>
(@v1.10) pkg> add Plots

After we do that, we can create a plot to visualize.

using Plots

function make_plot()
    # Create a plot
    p = plot()
    θ = range(0, 2π, length=100)
    x = cos.(θ) * 5
    y = sin.(θ) * 5
    plot!(p, x, y, linecolor=:black, linewidth=2, legend=:topright)
    x_left = 1 .+ 0.5 * cos.(θ)
    y_left = 2 .+ 0.5 * sin.(θ)
    plot!(p, x_left, y_left, linecolor=:black, linewidth=2, fillalpha=0.2)
    x_right = 3 .+ 0.5 * cos.(θ)
    y_right = 2 .+ 0.5 * sin.(θ)
    plot!(p, x_right, y_right, linecolor=:black, linewidth=2, fillalpha=0.2)
    θ_arc = range(0, π, length=10)
    x_arc = 2 .+ cos.(θ_arc) * 2
    y_arc = -1 .+ sin.(θ_arc) * 1
    plot!(p, x_arc, y_arc, linecolor=:black, linewidth=2)
    # Adjust plot limits and display the final plot
    xlims!(-6, 6)
    ylims!(-6, 6)
    display(p)
end

# Execute the function to plot
make_plot()

Next we run this by using the keyboard shortcut we learned earlier (Ctrl+Enter), and we can see the result below!

Make_Plot

Pretty cool! Now we can see our charts generated in real time right inside our editor!

The Table Viewer

I don’t know about you, but I never liked having to dump the contents of an array, matrix, or other data structure to the console and parse through the data. Lucky for us, we don’t have to do that. The Julia extension allows us to view any Tables.jl-compatible table in a special table viewer. There are two ways to do this.

The first way is by clicking the “View in VS Code” button next to your table in the “Workspace” section.

Table_View_Button

The second way to do this is by running vscodedisplay(name_of_table) directly in the REPL.
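For example (the data here is made up, and vscodedisplay is only defined inside a REPL started by the extension), something like this opens a DataFrame in the viewer:

```julia
using DataFrames  # any Tables.jl-compatible source works

scores = DataFrame(name = ["Ada", "Grace"], score = [95, 98])

# Opens the table viewer tab; only available in the VS Code Julia REPL
vscodedisplay(scores)
```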

It’s pretty cool if you ask me. Having the ability to view the data is a nice feature, but to make it even better, you can sort the data in the UI by clicking the column headers. You can also copy data using Ctrl+C, just like you would in Excel!

A word of caution: all the table data you see is cached, so any changes you make to the underlying data will not be reflected in the table viewer. To fix that, just re-display the table and you will see your changes.

Debugging

The last topic we will cover is the debugging tools.

There are many things you can do in the VS Code debugger that are not language-specific. We aren’t going to cover all of those features here, so if you want a more in-depth look, check out the official documentation.

Since the only truly bug-free code is no code, we will start by writing some code that we can test and try to catch bugs in.

function do_math(a, b)
    c = a + b
    d = a * b
    c + d
end

function print_things()
    println("First print")
    println("Second print")
    var = do_math(5, 8)
    println("Third print and math: ", var)
    var2 = do_math("Bug", 3)
    println("Fourth print and bug: ", var2)
end

print_things()

We have code to run, so we can run the debugger and see what we get. First, we must switch to the “Run and Debug” tab. We do this by either clicking on the tab (the one with the bug and play button) or by hitting Ctrl+Shift+D.

Once we are there, we will be greeted with a screen like this:

Debug_Screen

From here we can observe the compiled Julia code, our breakpoints, and several other things as the program runs. We will want to run our code through the debugger, and to do that, we can either click the big “Run and Debug” button or hit F5.

Debug_Error_Message

We step through the code a bit, and see some of what the debugger will show us.

Debugger_Annotated

1: The variables passed into the function

2: The variables local to the function

3: The indicator of which line we are currently on

4: A breakpoint indicator

We can set our breakpoints by double-clicking next to the line numbers. After setting our breakpoints, we will be able to step through the code. As we step through the code line by line, the variables that get created are populated in the top section and their values are shown. Another neat feature of the debugger, not highlighted above but worth noting, is the “CALL STACK” section. As the name suggests, this shows us the entire call stack as we step through the code. These are most likely things we have seen before in other debuggers, but they are useful nonetheless.

Keyboard Shortcuts

To wrap up, let’s look at a list of keyboard shortcuts for effective Julia extension usage. Note that a command like Alt+J Alt+C means press Alt+J followed by Alt+C.

Command: Shortcut
Execute Code in REPL and Move: Shift+Enter
Execute Code in REPL: Ctrl+Enter
Execute Code Cell in REPL: Alt+Enter
Execute Code Cell in REPL and Move: Alt+Shift+Enter
Interrupt Execution: Ctrl+C
Clear Current Inline Result: Escape
Clear Inline Results In Editor: Alt+J Alt+C
Select Current Module: Alt+J Alt+M
New Julia File: Alt+J Alt+N
Start REPL: Alt+J Alt+O
Stop REPL: Alt+J Alt+K
Restart REPL: Alt+J Alt+R
Change Current Environment: Alt+J Alt+E
Show Documentation: Alt+J Alt+D
Show Plot: Alt+J Alt+P
REPLVariables.focus: Alt+J Alt+W
Interrupt Execution: Ctrl+Shift+C
Browse Back Documentation: Left
Browse Forward Documentation: Right
Show Previous Plot: Left
Show Next Plot: Right
Show First Plot: Home
Show Last Plot: End
Delete Plot: Delete
Delete All Plots: Shift+Delete

Summary

We reviewed some of the basics of the Julia VS Code extension. We looked at running a Julia file, the basics of code navigation and editing, the usefulness of the plot and table viewers, and some basic debugging features. This was only an overview, and there is much more that the extension has to offer! If you would like to do a deeper dive into the Julia VS Code extension, visit the official documentation or their GitHub.

If you would like to learn more about Julia so you can fully take advantage of the extension’s features, check out our Julia Basics series!


Getting Started with DataFrames.jl: A Beginner’s Guide

By: Joel Nelson

Re-posted from: https://blog.glcs.io/julia-dataframes

When doing any sort of development, one will often find themselves needing to work with data in a tabular format. This is especially true for those of us in data science or data analysis fields. In the Julia programming language, one of the more popular libraries for this type of data wrangling is DataFrames.jl. In this blog post we’ll explore the beginnings of working with this package.

Introduction

The great thing about a package like DataFrames.jl is that it bridges the gap between traditional programming and SQL (Structured Query Language). Databases are great tools for easily gaining insights into your data by joining, filtering, aggregating, sorting, etc. DataFrames.jl brings those goodies right into your hands by simply adding the package to your Julia session. So, let’s get started!

Getting Started

Adding the package takes just a few simple steps.

julia> using Pkg
julia> Pkg.add("DataFrames")
julia> using DataFrames

The constructor for a DataFrame provides flexibility to create from arrays, tuples, constants, or files. The documentation covers all these, but for this post we’ll just explore one of the more common ways.

julia> df = DataFrame(a = 1:4, b = rand(4), c = "My first DataFrame")
4×3 DataFrame
 Row │ a      b         c
     │ Int64  Float64   String
─────┼─────────────────────────────────────
   1 │     1  0.141874  My first DataFrame
   2 │     2  0.432084  My first DataFrame
   3 │     3  0.47098   My first DataFrame
   4 │     4  0.414639  My first DataFrame

You’ll notice that in the code above we use a mix of datatypes, including a range, an array, and a scalar. The underlying vectors must be of the same size, and the scalar gets broadcast, or repeated, for each row. Also, note that the type of each column is inferred based on the arrays passed into the constructor.

Now, there are also a few different ways to access a column of a DataFrame. Here are a few examples of accessing the first column, “a”.

julia> df.a
4-element Vector{Int64}:
 1
 2
 3
 4

julia> df."a"
4-element Vector{Int64}:
 1
 2
 3
 4

julia> df[!, "a"]
4-element Vector{Int64}:
 1
 2
 3
 4

julia> df[!, :a]
4-element Vector{Int64}:
 1
 2
 3
 4

julia> df[:, :a]
4-element Vector{Int64}:
 1
 2
 3
 4

In these examples, columns can be accessed directly with literals such as df.a, or more dynamically using brackets (since variables can be substituted). You may also find yourself wondering about the difference between ! and :, which is an important distinction!

The ! returns the underlying vector, while : returns a copy. We can showcase this with an example where we attempt to change the second value in column c to “I love Julia!”

julia> df[:, :c][2] = "I love Julia!"
"I love Julia!"

julia> df
4×3 DataFrame
 Row │ a      b         c
     │ Int64  Float64   String
─────┼─────────────────────────────────────
   1 │     1  0.394165  My first DataFrame
   2 │     2  0.809883  My first DataFrame
   3 │     3  0.124035  My first DataFrame
   4 │     4  0.886781  My first DataFrame

julia> df[!, :c][2] = "I love Julia!"
"I love Julia!"

julia> df
4×3 DataFrame
 Row │ a      b         c
     │ Int64  Float64   String
─────┼─────────────────────────────────────
   1 │     1  0.394165  My first DataFrame
   2 │     2  0.809883  I love Julia!
   3 │     3  0.124035  My first DataFrame
   4 │     4  0.886781  My first DataFrame

Notice how the change will only persist to df when we access the column with !.

There is often a tradeoff between returning copies and returning the actual underlying vectors. Returning a copy is generally considered safer, since if the copy is later mutated, the underlying DataFrame remains unchanged. However, with very large DataFrames, copying on every column access will increase memory usage. It is best to weigh those considerations and figure out which approach works best for a given program.
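A quick way to convince yourself of the difference is the identity operator ===, which tells you whether two expressions refer to the same object:

```julia
using DataFrames

df = DataFrame(a = 1:4)

@assert df[!, :a] === df.a   # ! returns the same underlying vector
@assert df[:, :a] !== df.a   # : returns a fresh copy
```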

Data Wrangling

Import / Export

Another great feature of the Julia ecosystem is that many different packages interact well when used together. For instance, DataFrames.jl and CSV.jl can be used to very easily import and export data.

First, we can save the DataFrame from above to CSV.

julia> using CSV
julia> path = joinpath(homedir(), "my_df.csv")
julia> CSV.write(path, df)

And, reading in the DataFrame from file is just as easy!

julia> CSV.read(path, DataFrame)
4×3 DataFrame
 Row │ a      b         c
     │ Int64  Float64   String31
─────┼─────────────────────────────────────
   1 │     1  0.601361  My first DataFrame
   2 │     2  0.178065  My first DataFrame
   3 │     3  0.729591  My first DataFrame
   4 │     4  0.280314  My first DataFrame

There are many keyword arguments to explore when handling CSV files, and the CSV.jl documentation is the best place to cover all of them.

DataFrames.jl also supports writing and reading multiple file types, such as Arrow, JSON, Parquet, and others.
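As a sketch (assuming the Arrow.jl package is installed; the file name is arbitrary), round-tripping a DataFrame through Arrow looks very similar to the CSV example:

```julia
using Arrow, DataFrames

df = DataFrame(a = 1:3, b = ["x", "y", "z"])

Arrow.write("my_df.arrow", df)               # serialize the DataFrame
df2 = DataFrame(Arrow.Table("my_df.arrow"))  # read it back
```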

Joins

A join is a way to merge data from two DataFrames into a single DataFrame. There are several typesand they generally mimic the same types that a database would support.

  • innerjoin
  • leftjoin
  • rightjoin
  • outerjoin
  • semijoin
  • antijoin
  • crossjoin
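For instance (a small made-up sketch), antijoin is a handy way to find rows in one table that have no match in another:

```julia
using DataFrames

students = DataFrame(id = 1:3, name = ["Joe", "Sally", "Jim"])
grades   = DataFrame(id = [1, 3], grade = [0.9, 0.8])

# rows of `students` with no matching id in `grades`
missing_exam = antijoin(students, grades, on = :id)
```

Here missing_exam contains only Sally, the one student without a grade.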

Definitions of each can be found in either the documentation or the docstrings, but let’s take a look at a few examples. Say we have the following DataFrames containing information from a school.

julia> student_df = DataFrame(student_id = 1:10,
                              student_name = ["Joe", "Sally", "Jim", "Sandy", "Beth", "Alex", "Tom", "Liz", "Bill", "Carl"],
                              teacher_id = repeat([1, 2], 5))
10×3 DataFrame
 Row │ student_id  student_name  teacher_id
     │ Int64       String        Int64
─────┼──────────────────────────────────────
   1 │          1  Joe                    1
   2 │          2  Sally                  2
   3 │          3  Jim                    1
   4 │          4  Sandy                  2
   5 │          5  Beth                   1
   6 │          6  Alex                   2
   7 │          7  Tom                    1
   8 │          8  Liz                    2
   9 │          9  Bill                   1
  10 │         10  Carl                   2

julia> teacher_df = DataFrame(teacher_id = 1:2, teacher_name = ["Mr. Jackson", "Ms. Smith"])
2×2 DataFrame
 Row │ teacher_id  teacher_name
     │ Int64       String
─────┼──────────────────────────
   1 │          1  Mr. Jackson
   2 │          2  Ms. Smith

julia> grade_df = DataFrame(exam_id = 1,
                            student_id = vcat(1:3, 5:10),
                            grade = [0.95, 0.93, 0.81, 0.85, 0.73, 0.88, 0.77, 0.75, 0.93])
9×3 DataFrame
 Row │ exam_id  student_id  grade
     │ Int64    Int64       Float64
─────┼──────────────────────────────
   1 │       1           1     0.95
   2 │       1           2     0.93
   3 │       1           3     0.81
   4 │       1           5     0.85
   5 │       1           6     0.73
   6 │       1           7     0.88
   7 │       1           8     0.77
   8 │       1           9     0.75
   9 │       1          10     0.93

If we look at grade_df, we can see there are 9 results, but student_df has 10 students. So, someone must have missed the exam! Let’s find out who, so we can alert the teacher to schedule a makeup.

Let’s do a leftjoin, which means every row from the first DataFrame will persist regardless of whether there is a match in the second DataFrame. The leftjoin function also takes an on keyword argument to signify which column should be used to find matches.

julia> student_grade_df = leftjoin(student_df, grade_df, on=:student_id)
10×5 DataFrame
 Row │ student_id  student_name  teacher_id  exam_id  grade
     │ Int64       String        Int64       Int64?   Float64?
─────┼─────────────────────────────────────────────────────────
   1 │          1  Joe                    1        1      0.95
   2 │          2  Sally                  2        1      0.93
   3 │          3  Jim                    1        1      0.81
   4 │          5  Beth                   1        1      0.85
   5 │          6  Alex                   2        1      0.73
   6 │          7  Tom                    1        1      0.88
   7 │          8  Liz                    2        1      0.77
   8 │          9  Bill                   1        1      0.75
   9 │         10  Carl                   2        1      0.93
  10 │          4  Sandy                  2  missing  missing

We notice Sandy has a missing value for both the exam_id and grade fields. missing is a special value in Julia, similar to a null value in databases. It signifies that there was no match in grade_df, meaning Sandy missed the exam. We can add one more join to get each student’s teacher’s name.

julia> result_df = innerjoin(student_grade_df, teacher_df, on=:teacher_id)
10×6 DataFrame
 Row │ student_id  student_name  teacher_id  exam_id  grade     teacher_name
     │ Int64       String        Int64       Int64?   Float64?  String
─────┼───────────────────────────────────────────────────────────────────────
   1 │          1  Joe                    1        1      0.95  Mr. Jackson
   2 │          2  Sally                  2        1      0.93  Ms. Smith
   3 │          3  Jim                    1        1      0.81  Mr. Jackson
   4 │          5  Beth                   1        1      0.85  Mr. Jackson
   5 │          6  Alex                   2        1      0.73  Ms. Smith
   6 │          7  Tom                    1        1      0.88  Mr. Jackson
   7 │          8  Liz                    2        1      0.77  Ms. Smith
   8 │          9  Bill                   1        1      0.75  Mr. Jackson
   9 │         10  Carl                   2        1      0.93  Ms. Smith
  10 │          4  Sandy                  2  missing  missing   Ms. Smith

We used an innerjoin this time since we know that every student has a teacher assigned. Now we can let Ms. Smith know that she needs to reach out to Sandy to reschedule her exam.

Sorting

Another helpful function for analysis is sort. Let’s sort our result_df by the grade column.

julia> sort(result_df, [:grade])
10×6 DataFrame
 Row │ student_id  student_name  teacher_id  exam_id  grade     teacher_name
     │ Int64       String        Int64       Int64?   Float64?  String
─────┼───────────────────────────────────────────────────────────────────────
   1 │          6  Alex                   2        1      0.73  Ms. Smith
   2 │          9  Bill                   1        1      0.75  Mr. Jackson
   3 │          8  Liz                    2        1      0.77  Ms. Smith
   4 │          3  Jim                    1        1      0.81  Mr. Jackson
   5 │          5  Beth                   1        1      0.85  Mr. Jackson
   6 │          7  Tom                    1        1      0.88  Mr. Jackson
   7 │          2  Sally                  2        1      0.93  Ms. Smith
   8 │         10  Carl                   2        1      0.93  Ms. Smith
   9 │          1  Joe                    1        1      0.95  Mr. Jackson
  10 │          4  Sandy                  2  missing  missing   Ms. Smith

The function takes the DataFrame and an array of columns to sort on. Our sort puts the lowest grade first, but if we want it descending, we can pass the rev keyword argument.

julia> sort(result_df, [:grade], rev=true)
10×6 DataFrame
 Row │ student_id  student_name  teacher_id  exam_id  grade     teacher_name
     │ Int64       String        Int64       Int64?   Float64?  String
─────┼───────────────────────────────────────────────────────────────────────
   1 │          4  Sandy                  2  missing  missing   Ms. Smith
   2 │          1  Joe                    1        1      0.95  Mr. Jackson
   3 │          2  Sally                  2        1      0.93  Ms. Smith
   4 │         10  Carl                   2        1      0.93  Ms. Smith
   5 │          7  Tom                    1        1      0.88  Mr. Jackson
   6 │          5  Beth                   1        1      0.85  Mr. Jackson
   7 │          3  Jim                    1        1      0.81  Mr. Jackson
   8 │          8  Liz                    2        1      0.77  Ms. Smith
   9 │          9  Bill                   1        1      0.75  Mr. Jackson
  10 │          6  Alex                   2        1      0.73  Ms. Smith

In both of these cases, a copy of the DataFrame is returned and result_df is left unchanged. But if we want to sort in place, we can use the sort! function, which updates the passed DataFrame.

julia> sort!(result_df, [:grade], rev=true)
10×6 DataFrame
 Row │ student_id  student_name  teacher_id  exam_id  grade     teacher_name
     │ Int64       String        Int64       Int64?   Float64?  String
─────┼───────────────────────────────────────────────────────────────────────
   1 │          4  Sandy                  2  missing  missing   Ms. Smith
   2 │          1  Joe                    1        1      0.95  Mr. Jackson
   3 │          2  Sally                  2        1      0.93  Ms. Smith
   4 │         10  Carl                   2        1      0.93  Ms. Smith
   5 │          7  Tom                    1        1      0.88  Mr. Jackson
   6 │          5  Beth                   1        1      0.85  Mr. Jackson
   7 │          3  Jim                    1        1      0.81  Mr. Jackson
   8 │          8  Liz                    2        1      0.77  Ms. Smith
   9 │          9  Bill                   1        1      0.75  Mr. Jackson
  10 │          6  Alex                   2        1      0.73  Ms. Smith
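sort and sort! also accept order to mix directions per column. A small sketch with made-up data, sorting ascending on one column and descending on another:

```julia
using DataFrames

df = DataFrame(teacher = [2, 1, 1], grade = [0.5, 0.9, 0.7])

# ascending on :teacher, descending on :grade
sorted = sort(df, [:teacher, order(:grade, rev = true)])
```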

Split-apply-combine

Now that we have some basics down, it’s time to dive into aggregating results. In DataFrames.jl this is referred to as the split-apply-combine strategy. It is a bit of a mouthful, but let’s walk through what exactly it refers to.

Split simply means breaking the DataFrame into groups using the groupby function. In our example, let’s split our DataFrame by the teacher_name column.

julia> grouped_df = groupby(result_df, :teacher_name)
GroupedDataFrame with 2 groups based on key: teacher_name
First Group (5 rows): teacher_name = "Mr. Jackson"
 Row │ student_id  student_name  teacher_id  exam_id  grade     teacher_name
     │ Int64       String        Int64       Int64?   Float64?  String
─────┼───────────────────────────────────────────────────────────────────────
   1 │          1  Joe                    1        1      0.95  Mr. Jackson
   2 │          3  Jim                    1        1      0.81  Mr. Jackson
   3 │          5  Beth                   1        1      0.85  Mr. Jackson
   4 │          7  Tom                    1        1      0.88  Mr. Jackson
   5 │          9  Bill                   1        1      0.75  Mr. Jackson
Last Group (5 rows): teacher_name = "Ms. Smith"
 Row │ student_id  student_name  teacher_id  exam_id  grade     teacher_name
     │ Int64       String        Int64       Int64?   Float64?  String
─────┼───────────────────────────────────────────────────────────────────────
   1 │          2  Sally                  2        1      0.93  Ms. Smith
   2 │          6  Alex                   2        1      0.73  Ms. Smith
   3 │          8  Liz                    2        1      0.77  Ms. Smith
   4 │         10  Carl                   2        1      0.93  Ms. Smith
   5 │          4  Sandy                  2  missing  missing   Ms. Smith

The result of calling groupby is of type GroupedDataFrame, which is basically a wrapper around one or many groups of a DataFrame. In our example we have two teachers, so the resulting GroupedDataFrame has two groups.

Now, let’s try to get an average exam grade for each of our two teachers. This introduces the combine function, which takes a GroupedDataFrame and any number of aggregation functions. Let’s also add the Statistics.jl package so we can take advantage of the mean function.

julia> using Statistics

julia> combine(grouped_df, :grade => mean)
2×2 DataFrame
 Row │ teacher_name  grade_mean
     │ String        Float64?
─────┼──────────────────────────
   1 │ Mr. Jackson       0.848
   2 │ Ms. Smith     missing

The result is a DataFrame where the first column(s) will match our GroupedDataFrame key(s) and the subsequent column(s) will match the function(s) we pass for aggregation. However, Ms. Smith has a grade_mean of missing!?

In our earlier discussion we found that Sandy missed the exam, so her grade was set to missing. A missing value behaves differently than normal numbers, which is problematic for our aggregation function. Take a look at a very simple example.

julia> 1 + missing
missing

We notice that adding 1 to missing equals missing. This is a necessary evil, and you may be wondering: why don’t we just treat it as 0? Let’s see what happens to our results if we replace missing with 0.

julia> combine(grouped_df, :grade => (x -> mean(coalesce.(x, 0))))
2×2 DataFrame
 Row │ teacher_name  grade_function
     │ String        Float64
─────┼──────────────────────────────
   1 │ Mr. Jackson            0.848
   2 │ Ms. Smith              0.672

In the above example, instead of just passing mean as the function, we create an anonymous function. This allows us to get a little more clever by adding a coalesce to replace the missing values with 0. We see from the results that Ms. Smith has a much lower scoring average than Mr. Jackson. But if we think about it, the results are being incorrectly skewed. We know Sandy didn’t actually score a 0; rather, she didn’t take the test at all. Treating her result as a 0 skews the average much lower than it should be.

In some cases replacing with a 0 would make sense, but not in this scenario. Here are a few better options:

We could just drop the rows that contain missing values prior to aggregation. DataFrames.jl provides a dropmissing function specifically for this.

julia> result_no_missing_df = dropmissing(result_df)
9×6 DataFrame
 Row │ student_id  student_name  teacher_id  exam_id  grade    teacher_name
     │ Int64       String        Int64       Int64    Float64  String
─────┼──────────────────────────────────────────────────────────────────────
   1 │          1  Joe                    1        1     0.95  Mr. Jackson
   2 │          2  Sally                  2        1     0.93  Ms. Smith
   3 │          3  Jim                    1        1     0.81  Mr. Jackson
   4 │          5  Beth                   1        1     0.85  Mr. Jackson
   5 │          6  Alex                   2        1     0.73  Ms. Smith
   6 │          7  Tom                    1        1     0.88  Mr. Jackson
   7 │          8  Liz                    2        1     0.77  Ms. Smith
   8 │          9  Bill                   1        1     0.75  Mr. Jackson
   9 │         10  Carl                   2        1     0.93  Ms. Smith

julia> grouped_no_missing_df = groupby(result_no_missing_df, :teacher_name)
GroupedDataFrame with 2 groups based on key: teacher_name
First Group (5 rows): teacher_name = "Mr. Jackson"
 Row │ student_id  student_name  teacher_id  exam_id  grade    teacher_name
     │ Int64       String        Int64       Int64    Float64  String
─────┼──────────────────────────────────────────────────────────────────────
   1 │          1  Joe                    1        1     0.95  Mr. Jackson
   2 │          3  Jim                    1        1     0.81  Mr. Jackson
   3 │          5  Beth                   1        1     0.85  Mr. Jackson
   4 │          7  Tom                    1        1     0.88  Mr. Jackson
   5 │          9  Bill                   1        1     0.75  Mr. Jackson
Last Group (4 rows): teacher_name = "Ms. Smith"
 Row │ student_id  student_name  teacher_id  exam_id  grade    teacher_name
     │ Int64       String        Int64       Int64    Float64  String
─────┼──────────────────────────────────────────────────────────────────────
   1 │          2  Sally                  2        1     0.93  Ms. Smith
   2 │          6  Alex                   2        1     0.73  Ms. Smith
   3 │          8  Liz                    2        1     0.77  Ms. Smith
   4 │         10  Carl                   2        1     0.93  Ms. Smith

julia> combine(grouped_no_missing_df, :grade => mean)
2×2 DataFrame
 Row │ teacher_name  grade_mean
     │ String        Float64
─────┼──────────────────────────
   1 │ Mr. Jackson        0.848
   2 │ Ms. Smith          0.84

We now see that the two teachers’ average test scores are very similar. This approach works well if we never again need the rows containing missing values.

But what if we want to keep those rows around and just exclude them from certain calculations? We can make use of another function, skipmissing, which will simply skip over the missing values.

julia> combine(grouped_df, :grade => (x -> mean(skipmissing(x))))
2×2 DataFrame
 Row │ teacher_name  grade_function
     │ String        Float64
─────┼──────────────────────────────
   1 │ Mr. Jackson            0.848
   2 │ Ms. Smith              0.84

One last thing to note on missing values: it is easy to identify whether one or more of your DataFrame columns can contain missing values. We talked earlier about how DataFrames.jl infers the type of each column and displays it in the output. You’ll notice in result_df that the column teacher_id is of datatype Int64, while exam_id is Int64?. The ? denotes that the column allows missing values, so be careful!
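One way to check programmatically (a sketch with made-up data) is to inspect a column's element type, or scan it with ismissing:

```julia
using DataFrames

df = DataFrame(a = [1, 2], b = [1, missing])

@assert eltype(df.a) == Int64                  # no missings possible
@assert eltype(df.b) == Union{Missing, Int64}  # displayed as Int64? in output
@assert any(ismissing, df.b) && !any(ismissing, df.a)
```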

Conclusion

We’ve touched on some of the topics that make DataFrames.jl such a great general-purpose package. It is a helpful tool for quick data exploration, or for manipulating tabular data in production code. I hope you’ve enjoyed today’s reading, and be sure to check out the rest of our blog posts on blog.glcs.io!

A Million Text Files And A Single Laptop

By: randyzwitch - Articles

Re-posted from: http://randyzwitch.com/gnu-parallel-medium-data/

GNU Parallel Cat Unix

Wait…What? Why?

More often than I would like, I receive datasets where the data has only been partially cleaned, such as the picture on the right: hundreds, thousands…even millions of tiny files. Usually when this happens, the data all have the same format (such as having been generated by sensors or other memory-constrained devices).

The problem with data like this is that 1) it’s inconvenient to think about a dataset as a million individual pieces 2) the data in aggregate are too large to hold in RAM but 3) the data are small enough where using Hadoop or even a relational database seems like overkill.

Surprisingly, with judicious use of GNU Parallel, stream processing and a relatively modern computer, you can efficiently process annoying, “medium-sized” data as described above.

Data Generation

For this blog post, I used a combination of R and Python to generate the data: the “Groceries” dataset from the arules package for sampling transactions (with replacement), and the Python Faker (fake-factory) package for generating fake customer profiles and creating the 1MM+ text files.

The contents of the data itself isn’t important for this blog post, but the data generation code is posted as a GitHub gist should you want to run these commands yourself.

Problem 1: Concatenating (cat * >> out.txt ?!)

The cat utility in Unix-y systems is familiar to most anyone who has ever opened up a Terminal window. Take some or all of the files in a folder, concatenate them together….one big file. But something funny happens once you get enough files…

$ cat * >> out.txt
-bash: /bin/cat: Argument list too long

That’s a fun thought…too many files for the computer to keep track of. As it turns out, many Unix tools will only accept about 10,000 arguments; the use of the asterisk in the `cat` command gets expanded before running, so the above statement passes 1,234,567 arguments to `cat` and you get an error message.

One (naive) solution would be to loop over every file (a completely serial operation):

for f in *; do cat "$f" >> ../transactions_cat/transactions.csv; done

Roughly 10,093 seconds later, you’ll have your concatenated file. Three hours is quite a coffee break…
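Before reaching for parallelism, it's worth noting a common non-parallel workaround (a sketch; the paths follow the article's layout but are assumptions): let find hand the file names to xargs, which batches them into safely-sized invocations under the argument limit automatically.

```shell
# find prints the names itself, so the shell never expands a giant *;
# xargs -0 groups them into cat invocations that fit under the limit
find . -maxdepth 1 -type f -print0 | xargs -0 cat >> ../transactions_cat/transactions.csv
```

This is still mostly serial, but it avoids both the error and the per-file loop overhead.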

Solution 1: GNU Parallel & Concatenation

Above, I mentioned that looping over each file gets you past the error condition of too many arguments, but it is a serial operation. If you look at your computer usage during that operation, you’ll likely see that only a fraction of a core of your computer’s CPU is being utilized. We can greatly improve that through the use of GNU Parallel:

ls | parallel -m -j $f "cat {} >> ../transactions_cat/transactions.csv"

The `$f` argument in the code highlights that you can choose the level of parallelism; however, you will not get infinitely linear scaling, as shown below (graph code, Julia):

Given that the graph represents a single run at each level of parallelism, it's a bit difficult to say exactly where the parallelism gets maxed out, but at roughly 10 concurrent jobs there's no additional benefit. It's also worth pointing out what the `-m` argument does: by specifying `-m`, you allow multiple arguments (i.e. multiple text files) to be passed to each invocation of `cat`, rather than one file per invocation. This alone leads to an 8x speedup over the naive loop solution.
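The batching behavior of `-m` is easy to see with a toy command (GNU Parallel must be installed; this demo is mine, not from the original post):

```shell
# Without -m, each argument becomes its own job (three echo runs);
# with -m, as many arguments as fit are batched into a single run
parallel -j1 echo ::: a b c
parallel -j1 -m echo ::: a b c
```

With one jobslot, the first command prints a, b, and c on separate lines, while the second prints "a b c" on one line; that same batching is what lets `cat` receive many file names per invocation.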

Problem 2: Data > RAM

Now that we have a single file, we’ve removed the “one million files” cognitive dissonance, but now we have a second problem: at 19.93GB, the amount of data exceeds the RAM in my laptop (2014 MBP, 16GB of RAM). So in order to do analysis, either a bigger machine is needed or processing has to be done in a streaming or “chunked” manner (such as using the “chunksize” keyword in pandas).

But continuing on with our use of GNU Parallel, suppose we wanted to answer the following types of questions about our transactions data:

  1. How many unique products were sold?
  2. How many transactions were there per day?
  3. How many total items were sold per store, per month?

If it’s not clear from the list above, in all three questions there is an “embarrassingly parallel” portion of the computation. Let’s take a look at how to answer all three of these questions in a time- and RAM-efficient manner:

Q1: Unique Products

Given the format of the data file (transactions in a single column array), this question is the hardest to parallelize, but using a neat trick with the `tr` (transliterate) utility, we can map our data to one product per row as we stream over the file:
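The original command isn't reproduced here, but a minimal sketch of the idea looks like this (the assumption that the product list starts at the third comma-separated field is mine):

```shell
# Tiny sample: customer,date,product1,product2,...
printf 'c1,2024-01-01,milk,bread\nc2,2024-01-01,milk,eggs\n' > transactions.csv

# Swap commas for newlines to get one product per row,
# then de-dup the stream and count unique lines
cut -d',' -f3- transactions.csv | tr ',' '\n' | sort -u | wc -l
```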

The trick here is that we swap the comma-delimited transactions with the newline character; the effect of this is taking a single transaction row and returning multiple rows, one for each product. Then we pass that down the line, eventually using `sort -u` to de-dup the list and `wc -l` to count the number of unique lines (i.e. products).

In a serial fashion, it takes quite some time to calculate the number of unique products. Incorporating GNU Parallel, just using the defaults, gives nearly a 4x speedup!
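One way to parallelize that pipeline is `parallel --pipe`, which splits stdin into line-aligned chunks and runs the transformation on each chunk; only the final `sort -u | wc -l` stays serial. The post's exact command isn't shown, so treat this as a sketch (field positions are assumptions):

```shell
# Tiny sample: customer,date,product1,product2,...
printf 'c1,2024-01-01,milk,bread\nc2,2024-01-01,milk,eggs\n' > transactions.csv

# Transform chunks in parallel, then de-dup and count serially
cat transactions.csv \
  | parallel --pipe "cut -d',' -f3- | tr ',' '\n'" \
  | sort -u | wc -l
```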

Q2. Transactions By Day

If the file format could be considered undesirable in question 1, for question 2 the format is perfect. Since each row represents a transaction, all we need to do is perform the equivalent of a SQL `Group By` on the date and sum the rows:
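A sketch of the serial version, assuming the transaction date is the second comma-separated field (the post's actual command isn't shown):

```shell
# Tiny sample: customer,date,products...
printf 'c1,2024-01-01,milk\nc2,2024-01-01,eggs\nc3,2024-01-02,bread\n' > transactions.csv

# Pull out the date column, then count rows per date
cut -d',' -f2 transactions.csv | sort | uniq -c
```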

Using GNU Parallel starts to become complicated here, but you do get a 9x speed-up by calculating rows by date in chunks, then "reducing" again by calculating total rows by date (a trick I picked up from this blog post).
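The chunk-then-reduce trick can be sketched like this: count rows per date within each chunk, then sum the per-chunk counts (GNU Parallel required; the field position and flags are my assumptions, not the post's exact command):

```shell
# Tiny sample: customer,date,products...
printf 'c1,2024-01-01,milk\nc2,2024-01-01,eggs\nc3,2024-01-02,bread\n' > transactions.csv

# Map: per-chunk counts per date; reduce: sum the partial counts
cat transactions.csv \
  | parallel --pipe "cut -d',' -f2 | sort | uniq -c" \
  | awk '{sums[$2] += $1} END {for (d in sums) print sums[d], d}' \
  | sort -k2
```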

Q3. Total items Per store, Per month

For this example, it could be that my command-line fu is weak, but the serial method actually turns out to be the fastest. Of course, at a 14 minute run time, the real-time benefits to parallelization aren’t that great.
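For reference, a serial version might look like the following, assuming the store ID is the first field and the date the second (neither the post's field layout nor its command is shown, so this is a guess at the shape):

```shell
# Tiny sample: store,date,product1,product2,...
printf 's1,2024-01-05,milk,bread\ns1,2024-01-20,eggs\ns2,2024-02-01,milk\n' > transactions.csv

# Sum items per store per month; items = product fields per row
awk -F',' '{key = $1 "," substr($2, 1, 7); items[key] += NF - 2}
           END {for (k in items) print k, items[k]}' transactions.csv | sort
```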

It may be that one of you out there knows how to do this better, but an interesting thing to note is that the serial version already uses 40-50% of the available CPU. So parallelization might yield a 2x speedup at best, and saving seven minutes per run isn't worth spending hours trying to find the optimal settings.

But, I’ve got MULTIPLE files…

The three examples above showed that it’s possible to process datasets larger than RAM in a realistic amount of time using GNU Parallel. However, the examples also showed that working with Unix utilities can become complicated rather quickly. Shell scripts can help move beyond the “one-liner” syndrome, when the pipeline gets so long you lose track of the logic, but eventually problems are more easily solved using other tools.

The data that I generated at the beginning of this post represented two concepts: transactions and customers. Once you get to the point where you want to do joins, summarize by multiple columns, estimate models, etc., loading data into a database or an analytics environment like R or Python makes sense. But hopefully this post has shown that a laptop is capable of analyzing WAY more data than most people believe, using many tools written decades ago.