Tag Archives: Visualization

Analyzing Graphs with Julia

By: DSB

Re-posted from: https://medium.com/coffee-in-a-klein-bottle/analyzing-graphs-with-julia-38e26d1d2f62?source=rss-8bd6ec95ab58------2

A brief tutorial on how to use Julia to analyze graphs using the JuliaGraphs packages

Example of Graph created using LigthsGraphs.jl and VegaLite.jl

First of all, let’s be clear. The goal of this article is to briefly introduce how to use Julia for analyzing graphs. Besides the many types of graphs (undirected, directed, bipartite, weighted…), there are also many methods for analyzing them (degree distribution, centrality measures, clustering measures, visual layouts …). Hence, a comprehensive introduction to Graph Analysis with Julia would be too large of a task.

Therefore, this tutorial focuses on undirected weighted graphs, since they encompass weightless graphs, and are usually more common than directed graphs¹.

JuliaGraphs Project

Almost every package you will need can be found in the JuliaGraphs Project. The project contains specific packages for plotting, network layouts, weighted graphs, and more. In our example, we’ll be using GraphPlot.jl and SimpleWeightedGraphs.jl. The good thing about the project is that these packages work together and are very similar in design. Hence, the functions you use for creating a weighted graph are very similar to the ones you use for creating a simple graph with LightGraphs.jl.

Creating your first Graph

Let’s create a DataFrame using the DataFrames.jl package. Each column will represent a person, and each row will represent an attribute. Therefore, our graph will be composed of nodes (people) and edges (people share the same attribute).

DataFrame used for creating the Graph

After creating the DataFrame, we create the graph. Note here that they are actually separate objects. To create the graph, you only need to specify the number of nodes, which is the number of columns. Below I present the code for the creation of the DataFrame and the Graph.

The nodes were inserted, now we have to create the appropriate edges. This is done by using the command add_edge!() which takes the graph, the nodes that must be connected, and the edge weight.

In our example, the the weight is equal to the number of shared attributes between two columns. For example, the first and second column both have 1’s in rows 1 and 7. Therefore, they share an edge with weight 2:

add_edge!(g,1,2,2) # add_edge!(graph, node_1, node_2, weight)

The following code loops through the data, and adds the edges to the graph.

Visualizing the Graph

With this, our graph is ready to be visualized. This can be easily done with the following command:

gplot(g,nodelabel=names(df),edgelinewidth=ew)
Output from gplot()

There are several different layouts to chose from, just take a look at the GraphPlots.jl page. Here is another example:

gplot(g,nodelabel=names(df),edgelinewidth=ew,layout=circular_layout)
Output of gplot() using another layout

Centrality and Minimum Spanning Tree

Let’s do some analysis in this graph. As I said in the beginning, there are many way to analyze a graph. Two very common methods are studying the centrality of each node, and creating a Minimum Spanning Tree (MST). The goal of this article is not to explain theoretical aspects of graphs, so I’ll assume that the reader knows what I’m talking about.

Here is an example of how to calculate the betweeness centrality and visualizing the results. Note that I added 1 in the first line of code just to enable a better visualization.

Output of code above. The nodes colors represent the centrality.

Finally, one can also easily create a MST. To do so, we must create a new graph, since we’ll be removing edges and adjusting the nodes’ locations. The code below shows how to do this. The only new function here is the kruskal_mst() which will give us the MST. You can choose if the MST is minimizing of maximizing the path. In our case, we want to create a tree the contains only the “strongest” connections, hence, we are maximizing.

MST from the code above.

Conclusion

This is the end of this brief tutorial. There are much more functionalities in the JuliaGraphs project, enabling a much more thorough analysis than the one presented here. Also, one can create more interactive visualizations using VegaLite.jl, but I’ll leave that for another article.

¹ Authors opinion.

² A Jupyter Notebook can also be found on Github.


Analyzing Graphs with Julia was originally published in Coffee in a Klein Bottle on Medium, where people are continuing the conversation by highlighting and responding to this story.

Six of One (Plot), Half-Dozen of the Other

By: randyzwitch - Articles

Re-posted from: http://badhessian.org/2014/07/six-of-one-plot-half-dozen-of-the-other/

This is a guest post by Randy Zwitch (@randyzwitch), a digital analytics and predictive modeling consultant in the Greater Philadelphia area. Randy blogs regularly about Data Science and related technologies at http://randyzwitch.com. He’s blogged at Bad Hessian before here.

WordPress Stats - Visitors vs. Views
WordPress Stats – Visitors vs. Views

For those of you with WordPress blogs and have the Jetpack Stats module installed, you’re intimately familiar with this chart. There’s nothing particularly special about this chart, other than you usually don’t see bar charts with the bars shown superimposed.

I wanted to see what it would take to replicate this chart in R, Python and Julia. Here’s what I found. (download the data).

R: ggplot2

Although I prefer to use other languages these days for my analytics work, there’s a certain inertia/nostalgia that when I think of making charts, I think of using ggplot2 and R. Creating the above chart is pretty straightforward, though I didn’t quite replicate the chart, as I couldn’t figure out how to make my custom legend not do the diagonal bar thing.

The R Cookbook talks about a hack to remove the diagonal lines from legends, so I don’t feel too bad about not getting it. I also couldn’t figure out how to force ggplot2 to give me the horizontal line at 10000. If anyone in the R community knows how to fix these, let me know!

(Pythonistas: I’m aware of the ggplot port by Yhat; functionality I used in my R code is still in TODO, so I didn’t pursue plotting with ggplot in Python)

R: Base Graphics

Of course, not everyone finds ggplot2 to be easy to understand, as it requires a different way of thinking about coding than most ‘base’ R functions. To that end, there are the base graphics built into R, which produced this plot: wordpress-base-rWhile I was able to nearly replicate the WordPress chart (except for the feature of having the dark bars slightly smaller width than the lighter), the base R syntax is horrid. The abbreviations for plotting arguments are indefensible, the center and width keywords seem to shift the range of the x-axis instead of changing the actual bar width, and in general, the experience plotting using base R was the worst of the six libraries I evaluated.

Python: matplotlib

In the past year or so, there’s been quite a lot of activity towards improving the graphics capabilities in Python. Historically, there’s been a lot of teeth-gnashing about matplotlib being too low-level and hard to work with, but with enough effort, the results are quite pleasant. Unlike with ggplot2 and base R, I was able to replicate all the features of the WordPress plot:wordpress-matplotlib

Python: Seaborn

One of the aforementioned improvements to matplotlib is Seaborn, which promises to be a higher-level means of plotting data than matplotlib, as well as adding new plotting functionality common in statistics and research. Re-creating this plot using Seaborn is a waste of the additional functionality of Seaborn, and as such, I found it more difficult to make this plot using Seaborn than I did with matplotlib.

To replicate the plot, I ended up hacking a solution together using both Seaborn functionality and matplotlib in order to be able to set bar width and to create the legend, which defeats the purpose of using Seaborn in the first place.

Julia: Gadfly

In the Julia community, Gadfly is clearly the standard for plotting graphics. Supporting d3.js, PNG, PS, and PDF, Gadfly is built to work with many popular back-end environments. I was able to replicate everything about the WordPress graph except for the legend:wordpress-julia-gadflyWhile Gadfly took a line or two more than base R in terms of fewest lines of code, I find the Gadfly syntax significantly more pleasant to work with.

Julia: Plot.ly

Plot.ly is an interesting ‘competitor’ in this challenge, as it’s not a language-specific package per-se. Rather, Plot.ly is a means of specifying plots using JSON, with lightweight Julia/Python/MATLAB/R wrappers. I was able to replicate nearly everything about the WordPress plot, with the exception of not having a line at 10000, having the legend vertical instead of horizontal and I couldn’t figure out how to set the bar widths separately. wordpress-julia-plotly

And The Winner Is…matplotlib?!

If you told me at the beginning of this exercise that matplotlib (and by extension, Seaborn) would be the only library that I would be able to replicate all the features of the WordPress graph, I wouldn’t have believed it. And yet, here we are. ggplot2 was certainly very close, and I’m certain that someone knows how to fix the diagonal line issue. I suspect I could submit an issue ticket to Gadfly.jl to get the feature added to create custom legends (and for that matter, make the request of Plot.ly for horizontal legends), so in the future there could be feature parity using these two libraries as well.

I hope we all agree there’s no hope for Base Graphics in R besides quick throwaway plots.

In the end, the best thing I can say from this exercise is that the analytics community is fortunate to have so many talented people working to provide these amazing visualization libraries. This graph was rather pedestrian in nature, so I didn’t even scratch the surface of what these various libraries can do. Even beyond the six libraries I chose, there are others I didn’t choose, including: prettyplotlib (Python), Bokeh (Python), Vincent (Python), rCharts (R), ggvis (R), Winston (Julia), ASCII Plots (Julia) and probably even more that I’m not even aware of! All free and open-source and miles apart from terrible looking Microsoft graphics in Excel and Powerpoint.