
Six of One (Plot), Half-Dozen of the Other

By: randyzwitch - Articles

Re-posted from: http://badhessian.org/2014/07/six-of-one-plot-half-dozen-of-the-other/

This is a guest post by Randy Zwitch (@randyzwitch), a digital analytics and predictive modeling consultant in the Greater Philadelphia area. Randy blogs regularly about Data Science and related technologies at http://randyzwitch.com. He’s blogged at Bad Hessian before here.

WordPress Stats – Visitors vs. Views

If you have a WordPress blog with the Jetpack Stats module installed, you’re intimately familiar with this chart. There’s nothing particularly special about it, other than that you don’t usually see bar charts with the bars superimposed.

I wanted to see what it would take to replicate this chart in R, Python and Julia. Here’s what I found. (download the data).

R: ggplot2

Although I prefer to use other languages these days for my analytics work, there’s a certain inertia/nostalgia: when I think of making charts, I think of ggplot2 and R. Creating the above chart is pretty straightforward, though I didn’t quite replicate it, as I couldn’t figure out how to keep my custom legend from drawing diagonal lines through its keys.

The R Cookbook talks about a hack to remove the diagonal lines from legends, so I don’t feel too bad about not getting it. I also couldn’t figure out how to force ggplot2 to give me the horizontal line at 10000. If anyone in the R community knows how to fix these, let me know!

(Pythonistas: I’m aware of the ggplot port by Yhat; functionality I used in my R code is still in TODO, so I didn’t pursue plotting with ggplot in Python)

R: Base Graphics

Of course, not everyone finds ggplot2 easy to understand, as it requires a different way of thinking about coding than most ‘base’ R functions. To that end, there are the base graphics built into R, which produced this plot:

[Figure: wordpress-base-r]

While I was able to nearly replicate the WordPress chart (except for making the dark bars slightly narrower than the light ones), the base R syntax is horrid. The abbreviations for plotting arguments are indefensible, the center and width keywords seem to shift the range of the x-axis instead of changing the actual bar width, and in general, plotting with base R was the worst experience of the six libraries I evaluated.

Python: matplotlib

In the past year or so, there’s been quite a lot of activity towards improving the graphics capabilities in Python. Historically, there’s been a lot of teeth-gnashing about matplotlib being too low-level and hard to work with, but with enough effort, the results are quite pleasant. Unlike with ggplot2 and base R, I was able to replicate all the features of the WordPress plot:

[Figure: wordpress-matplotlib]
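As a rough illustration of the idea (not the code used for the plot above), superimposed bars can be drawn in matplotlib by plotting two series at the same x positions with different widths. The sketch below uses made-up numbers rather than the post's data, and `bar_layers` and `draw` are hypothetical helper names.

```python
def bar_layers(views, visitors, wide=0.8, narrow=0.5):
    """Describe two bar series that share x positions but differ in width."""
    if len(views) != len(visitors):
        raise ValueError("views and visitors must have the same length")
    x = list(range(len(views)))
    return [
        # Drawn first, so it sits behind: the lighter, wider bars.
        {"x": x, "height": list(views), "width": wide,
         "color": "#a9cce3", "label": "Views"},
        # Drawn second, on top: the darker, narrower bars.
        {"x": x, "height": list(visitors), "width": narrow,
         "color": "#1f618d", "label": "Visitors"},
    ]


def draw(labels, layers, reference=10000, path="wordpress-stats.png"):
    """Render the layers with matplotlib and save the figure to `path`."""
    # Imported here so bar_layers() stays usable without matplotlib installed.
    import matplotlib
    matplotlib.use("Agg")  # off-screen backend, no display needed
    import matplotlib.pyplot as plt

    fig, ax = plt.subplots(figsize=(8, 4))
    for layer in layers:
        ax.bar(**layer)  # x, height, width, color and label are ax.bar arguments
    ax.axhline(reference, color="grey", linewidth=0.8)  # horizontal reference line
    ax.set_xticks(layers[0]["x"])
    ax.set_xticklabels(labels)
    ax.legend(loc="upper center", ncol=len(layers), frameon=False)
    fig.savefig(path)
    plt.close(fig)
    return path


# Made-up monthly figures for illustration only (not the post's data).
MONTHS = ["Jan", "Feb", "Mar", "Apr", "May", "Jun"]
VIEWS = [9000, 11000, 12500, 10500, 13000, 14000]
VISITORS = [5200, 6100, 7000, 6400, 7600, 8100]
```

Calling `draw(MONTHS, bar_layers(VIEWS, VISITORS))` writes a PNG with the narrow dark bars drawn over the wide light ones, a line at 10000 and a horizontal legend.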

Python: Seaborn

One of the aforementioned improvements to matplotlib is Seaborn, which promises to be a higher-level means of plotting data than matplotlib, as well as adding new plotting functionality common in statistics and research. Re-creating this plot using Seaborn is a waste of the additional functionality of Seaborn, and as such, I found it more difficult to make this plot using Seaborn than I did with matplotlib.

To replicate the plot, I ended up hacking a solution together using both Seaborn functionality and matplotlib in order to be able to set bar width and to create the legend, which defeats the purpose of using Seaborn in the first place.

Julia: Gadfly

In the Julia community, Gadfly is clearly the standard for plotting graphics. Supporting d3.js, PNG, PS, and PDF, Gadfly is built to work with many popular back-end environments. I was able to replicate everything about the WordPress graph except for the legend:

[Figure: wordpress-julia-gadfly]

While Gadfly took a line or two more than base R, which needed the fewest lines of code, I find the Gadfly syntax significantly more pleasant to work with.

Julia: Plot.ly

Plot.ly is an interesting ‘competitor’ in this challenge, as it’s not a language-specific package per se. Rather, Plot.ly is a means of specifying plots using JSON, with lightweight Julia/Python/MATLAB/R wrappers. I was able to replicate nearly everything about the WordPress plot, with three exceptions: there’s no line at 10000, the legend is vertical instead of horizontal, and I couldn’t figure out how to set the bar widths separately.

[Figure: wordpress-julia-plotly]

And The Winner Is…matplotlib?!

If you had told me at the beginning of this exercise that matplotlib (and by extension, Seaborn) would be the only library with which I could replicate all the features of the WordPress graph, I wouldn’t have believed it. And yet, here we are. ggplot2 was certainly very close, and I’m certain that someone knows how to fix the diagonal line issue. I suspect I could submit an issue ticket to Gadfly.jl to get the feature added to create custom legends (and for that matter, make the request of Plot.ly for horizontal legends), so in the future there could be feature parity using these two libraries as well.

I hope we all agree there’s no hope for Base Graphics in R besides quick throwaway plots.

In the end, the best thing I can say from this exercise is that the analytics community is fortunate to have so many talented people working to provide these amazing visualization libraries. This graph was rather pedestrian in nature, so I didn’t even scratch the surface of what these various libraries can do. Even beyond the six libraries I chose, there are others I didn’t choose, including: prettyplotlib (Python), Bokeh (Python), Vincent (Python), rCharts (R), ggvis (R), Winston (Julia), ASCII Plots (Julia) and probably even more that I’m not even aware of! All free and open-source and miles apart from terrible-looking Microsoft graphics in Excel and PowerPoint.

Web development in Julia: A progress report (Warning: Contains benchmarks)

By: Terence Copestake

Re-posted from: http://thenewphalls.wordpress.com/2014/07/11/web-development-in-julia-a-progress-report-warning-contains-benchmarks/

Continuing my quest to explore the idea of using Julia for web development, I wanted to address some of my own questions around performance and implementation. My two biggest concerns were:

  1. Should Julia web pages be served by a Julia HTTP server (such as HttpServer.jl) – or would it be better to have Julia work with existing software such as Apache and nginx?
  2. How would Julia perform on the web compared to the competition?

Addressing the HTTP server question

After some consideration, my personal conclusion is that a server implemented in Julia would be another codebase that would need to be maintained; would mean missing out on tools available to existing server software, such as .htaccess, modules and SDKs; and would ultimately feel like reinventing the wheel. I feel it would be more sensible to leverage existing software that already has active development and has been tried and tested in the wild.

Following from this, I knew that my primary performance concern should be the interface between the server and Julia. In my previous posts, I was using Apache and running Julia via CGI. CGI is slow enough on its own, and Julia compounds the problem: the binary is known to be somewhat slow to start due to internal processes/compilation. I figured that FastCGI would be the next best option, and as there are no existing solutions (except for an incomplete FastCGI library), I set about creating a FastCGI process manager for Julia.

FYI: I’ve decided to release all of my web-Julia-related code under the GitHub organisation Jaylle, which can be found at https://github.com/Jaylle. Currently only the FPM and CGI module are available, but in future that’s where I’ll add the web framework and whatever else gets developed.

I plan to elaborate on the process manager more in a future post, but in short there are two parts:

  • The FastCGI server / process manager (coded in C). This accepts requests and manages and delegates to the workers.
  • The worker (coded in Julia). This listens for TCP connections from the FPM, accepts a bunch of commands and then runs the requested Julia page/code.

This way, there’s always a pre-loaded version of Julia in memory, circumventing any startup concerns (unless a worker crashes, of course).
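To make the worker half of that design concrete, here is a minimal Python sketch of a long-lived process that listens on TCP, accepts a command and runs the requested page. It is not the Jaylle worker (which is written in Julia): the `RUN <page> [query]` line protocol, the `PAGES` registry and the function names are all assumptions made for illustration.

```python
import socket
import socketserver
import threading

# Hypothetical registry of "pages"; a real worker would load and run page files.
PAGES = {
    "/hello": lambda query: "Hello, " + (query or "world"),
}


class WorkerHandler(socketserver.StreamRequestHandler):
    """Handle one line-based command of the (invented) form: RUN <page> [query]."""

    def handle(self):
        line = self.rfile.readline().decode("utf-8").strip()
        parts = line.split(" ", 2)
        if len(parts) >= 2 and parts[0] == "RUN" and parts[1] in PAGES:
            query = parts[2] if len(parts) == 3 else ""
            body = PAGES[parts[1]](query)
            self.wfile.write(("OK " + body + "\n").encode("utf-8"))
        else:
            self.wfile.write(b"ERR unknown command\n")


def start_worker(host="127.0.0.1", port=0):
    """Start the worker in a background thread; port 0 picks a free port."""
    server = socketserver.ThreadingTCPServer((host, port), WorkerHandler)
    server.daemon_threads = True
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server


def send_command(address, command):
    """Play the process manager's role: send one command, read one reply."""
    with socket.create_connection(address, timeout=5) as conn:
        conn.sendall((command + "\n").encode("utf-8"))
        return conn.makefile("r", encoding="utf-8").readline().strip()
```

Because the worker process is started once and then reused for every request, the interpreter's startup cost is paid a single time rather than per request, which is the point of the design described above.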

Some early benchmarks

Now that the FPM is in a usable prerelease state, I wanted to see how it could perform compared to the alternatives. In this case, I chose PHP (obvious) and Python. I chose Python because the name often crops up in Julia discussions and there’s a FastCGI module available for it.

To run these tests, I used the Apache ab tool from my Windows machine. The server is a cheap 1-core VPS running CentOS 6 64-bit.

In all tests, the server software used was nginx. For the languages, I used PHP-FPM for PHP, Web.py for Python and the Jaylle FPM for Julia.

The individual tests are superficial and the results anecdotal, but I just wanted something to give me an idea of how my FPM performed by comparison. To elaborate:

  • Basic output: Printed “Hello, [name]” – with [name] taken from the query string (?name=…)
  • Looped arithmetic: Adding and outputting numbers in a loop with 7000 iterations.
  • Looped method calls: Calling arithmetic-performing methods from within a loop with 7000 iterations.
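The three workloads are only described in prose, so the following is a rough Python rendering of what such test pages might do. The function names and the exact arithmetic (a running sum) are assumptions; only the query-string greeting and the 7000-iteration loops come from the descriptions above.

```python
from urllib.parse import parse_qs

ITERATIONS = 7000  # loop count quoted in the test descriptions


def basic_output(query_string):
    """Return "Hello, [name]" with the name taken from ?name=..."""
    params = parse_qs(query_string)
    name = params.get("name", ["world"])[0]
    return "Hello, " + name


def looped_arithmetic(iterations=ITERATIONS):
    """Add numbers in a loop and output each intermediate total."""
    lines = []
    total = 0
    for i in range(iterations):
        total += i
        lines.append(str(total))
    return "\n".join(lines)


def add(a, b):
    """Trivial arithmetic helper, so the loop below pays for a call each time."""
    return a + b


def looped_method_calls(iterations=ITERATIONS):
    """Same output as looped_arithmetic, but every addition is a function call."""
    lines = []
    total = 0
    for i in range(iterations):
        total = add(total, i)
        lines.append(str(total))
    return "\n".join(lines)
```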

Below is a table of the results. The numbers shown are requests per second; higher is better.

Language           Basic output   Looped arithmetic   Looped method calls
PHP                28.17          11.29               10.92
Web.py (Python)    24.61          7.92                7.25
Jaylle (Julia)     24.85          5.27                5.12

The only thing that I can say from these results is that I’m comforted seeing that my FPM’s performance isn’t obviously terrible compared to the others, but that there’s probably some work that does need to be done to at least get it up to the same level as Python, if not PHP.
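One way to read the gap is to express each result as a fraction of a baseline's throughput. The short Python sketch below does that with the requests-per-second figures reported above; `relative_to` is a hypothetical helper name.

```python
# Requests per second, copied from the results table above.
RESULTS = {
    "PHP": (28.17, 11.29, 10.92),
    "Web.py (Python)": (24.61, 7.92, 7.25),
    "Jaylle (Julia)": (24.85, 5.27, 5.12),
}
TESTS = ("Basic output", "Looped arithmetic", "Looped method calls")


def relative_to(baseline, results=RESULTS):
    """Express every result as a fraction of the baseline's throughput."""
    base = results[baseline]
    return {
        name: tuple(round(value / ref, 2) for value, ref in zip(values, base))
        for name, values in results.items()
    }


if __name__ == "__main__":
    for name, ratios in relative_to("PHP").items():
        print(name, dict(zip(TESTS, ratios)))
```

Against PHP, this puts Jaylle at about 0.88 of the throughput on basic output and about 0.47 on both loop tests.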

In other news, I’ve realised (4 years late) that all the cool people use Twitter now. I therefore have started actively using my account. I can’t promise that following me will improve your quality of life, but feel free to give it a chance: @phollocks

Coming soon: FPM documentation + writeup (as soon as I’m comfortable enough to tag a release).

Using Julia As A ‘Glue’ Language

By: randyzwitch - Articles

Re-posted from: http://randyzwitch.com/julia-odbc-jl/

While much of the focus in the Julia community has been on the performance aspects of Julia relative to other scientific computing languages, Julia is also perfectly suited to ‘glue’ together multiple data sources/languages. In this blog post, I will cover how to create an interactive plot using Gadfly.jl, by first preparing the data using Hadoop and Teradata Aster via ODBC.jl.

The example problem I am going to solve is calculating and visualizing the number of airplanes in the air, by hour, in the U.S. for the year 1987. Because of the structure and storage of the underlying data, I will need to write some custom Hive code, upload the data to Teradata Aster via a command-line utility, re-calculate the number of flights per hour using a built-in Aster function, then use Julia to visualize the data.

Step 1: Getting Data From Hadoop

In a prior set of blog posts, I talked about loading the airline dataset into Hadoop, then analyzing the dataset using Hive or Pig. Using ODBC.jl, we can use Hive via Julia to submit our queries. The hardest part of setting up this process is making sure that you have the appropriate Hive drivers for your Hadoop cluster and credentials (which isn’t covered here). Once you have your DSN set up, running Hive queries is as easy as the following:

In this code, I’ve written my query as a Julia string, to keep my code easily modifiable. Then, I pass the Julia string object to the query() function, along with my ODBC connection object. This query runs on Hadoop through Hive, then streams the result directly to my local hard drive, making this a very RAM-efficient (though I/O-inefficient!) operation.
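The same pattern (query kept in a string, handed to a query function with a connection, results streamed to disk) can be sketched in Python. This is not the ODBC.jl code: it uses the standard-library `sqlite3` module as a stand-in for the Hive ODBC connection, and the table, column and function names are hypothetical.

```python
import csv
import os
import sqlite3
import tempfile

# The query lives in a plain string so it is easy to edit.
# Table and column names are hypothetical.
FLIGHTS_BY_ORIGIN = """
    SELECT origin, COUNT(*) AS flights
    FROM flights
    GROUP BY origin
    ORDER BY origin
"""


def query_to_csv(conn, sql, path, batch_size=1000):
    """Run `sql` and write rows to `path` in batches.

    Only one batch is held in memory at a time, trading RAM for disk I/O.
    """
    cursor = conn.execute(sql)
    rows_written = 0
    with open(path, "w", newline="") as handle:
        writer = csv.writer(handle)
        writer.writerow([column[0] for column in cursor.description])
        while True:
            batch = cursor.fetchmany(batch_size)
            if not batch:
                break
            writer.writerows(batch)
            rows_written += len(batch)
    return rows_written


def demo():
    """Build a tiny in-memory table, run the query, return (row count, file lines)."""
    conn = sqlite3.connect(":memory:")  # stand-in for the Hive ODBC connection
    conn.execute("CREATE TABLE flights (origin TEXT, dest TEXT, dep_hour INTEGER)")
    conn.executemany(
        "INSERT INTO flights VALUES (?, ?, ?)",
        [("PHL", "ORD", 7), ("PHL", "ATL", 9), ("ORD", "PHL", 12)],
    )
    path = os.path.join(tempfile.mkdtemp(), "flights_by_origin.csv")
    rows = query_to_csv(conn, FLIGHTS_BY_ORIGIN, path, batch_size=2)
    conn.close()
    with open(path, newline="") as handle:
        return rows, handle.read().splitlines()
```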

Step 2: Shelling Out To Load Data To Aster

Once I’ve created the file with my Hadoop results in it, I have a decision point: I can either A) do the rest of the analysis in Julia or B) use a different tool for my calculations. Because this is a toy example, I’m going to use Teradata Aster to do my calculations, which provides a convenient function called ‘burst()’ to regularize timestamps into fixed intervals. But before I can use Aster to ‘burst’ my data, I first need to upload it to the database.

While I could loop over the data within Julia and insert each record one at a time, Teradata provides a command-line utility to upload data in parallel. Running command-line scripts from within Julia is as easy as using the run() command, with each command surrounded in backticks:

While I could’ve run this at the command line, having all of this within an IJulia Notebook keeps all my work together, should I need to re-run this in the future.
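For comparison, the Python counterpart to Julia's run() with backticks is `subprocess.run`. The loader's name and flags are not given here, so the sketch below substitutes a child Python process as a stand-in command; `run_loader` and `FAKE_LOADER` are hypothetical names.

```python
import subprocess
import sys


def run_loader(command):
    """Run an external command and return its standard output.

    check=True raises CalledProcessError if the command exits non-zero,
    so a failed upload cannot pass silently.
    """
    completed = subprocess.run(command, check=True, capture_output=True, text=True)
    return completed.stdout


# Stand-in for the real bulk loader: a child Python process that just
# reports the file name it was asked to load.
FAKE_LOADER = [
    sys.executable,
    "-c",
    "import sys; print('loaded', sys.argv[1])",
    "flights.csv",
]
```

Passing the command as a list of arguments, rather than a single shell string, avoids quoting problems with file names.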

Step 3: Using Aster For Calculations

With my data now loaded in Aster, I can normalize the timestamps to UTC, then ‘burst’ the data into regular time intervals. Again, all of this can be done via ODBC from within Julia:

Since it might not be clear what I’m doing here, the ‘burst()’ function in Aster takes a row of data with a start and end timestamp, and potentially returns multiple rows, one for each regular interval between the two timestamps. If you’re familiar with pandas in Python, it’s similar in functionality to ‘resample’ on a series of timestamps.
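To illustrate the idea of ‘bursting’, here is a small Python sketch that splits one departure/arrival pair into a row per clock hour and then counts planes per hour. It only approximates the behaviour described above and is not Aster's implementation; `burst_hourly` and `planes_per_hour` are hypothetical names.

```python
from collections import Counter
from datetime import datetime, timedelta

HOUR = timedelta(hours=1)


def burst_hourly(start, end):
    """Yield (bucket_start, bucket_end) for every clock hour that [start, end) touches."""
    if end <= start:
        return
    bucket = start.replace(minute=0, second=0, microsecond=0)
    while bucket < end:
        yield bucket, bucket + HOUR
        bucket += HOUR


def planes_per_hour(flights):
    """Count how many flights are in the air during each clock hour.

    `flights` is an iterable of (departure, arrival) datetime pairs,
    assumed to be in the same time zone (UTC in the post).
    """
    counts = Counter()
    for departure, arrival in flights:
        for bucket_start, _ in burst_hourly(departure, arrival):
            counts[bucket_start] += 1
    return counts
```

A flight departing at 07:40 and arriving at 09:15 becomes three rows (the 07:00, 08:00 and 09:00 hours), so it is counted once in each of those hours.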

Step 4: Download Smaller Data Into Julia, Visualize

Now that the data has been processed from Hadoop to Aster through a series of queries, we have a much smaller dataset that can be loaded into RAM and processed by Julia:

The Gadfly code above produces the following plot (using a d3.js backend for interactivity):

Since this chart is in UTC, it might not be obvious how to interpret the trend. Because the airline dataset represents flights either leaving or returning to the United States, there are many fewer planes in the air overnight and in the early morning hours (UTC 7-10, 2-5am Eastern). During the hours when the airports are open, there appears to be a limit of roughly 2500 planes per hour in the sky.

Why Not Do All Of This In Julia?

At this point, you might wonder: why go through all of this effort? Couldn’t this all be done in Julia?

Yes, you probably could do all of this work in Julia with a sufficiently large amount of RAM. As a proof-of-concept, I hope I’ve shown that there is much more to Julia than micro-benchmarking Julia’s speed relative to other scientific programming languages. You’ll notice that in none of my code have I used any type annotations, as none would really make sense (nor would they improve performance). And although this is a toy example purposely using multiple systems, I much more frequently use Julia in this manner at work than doing linear algebra or machine learning.

So next time you’re tempted to use Python or R or shell scripting or whatever, consider Julia as well. Julia is just as at home as a scripting language as it is as a scientific computing language.