Tag Archives: R

A bioinformatics walk-through: Accessing protein-protein interaction interfaces for all known protein structures with PDBe PISA

If this summer’s posting became a little infrequent, part of the blame lies with computational research I’ve been working on, regarding the systems biology of chromosomal translocations and the ensuing chimeric proteins at the Medical Research Council Laboratory of Molecular Biology in Cambridge.

A sizeable part of bioinformatics ‘dry lab’ work falls into what has been described in the NYT as ‘data wrangling’ (or the work of a ‘data janitor’). This post is about accessing the data held in the Protein Databank in Europe’s repository of Proteins, Interfaces, Structures and Assemblies (PDBe PISA).

Sent out onto the web to find a source of structural protein-protein interaction data with amino acid-level resolution, my first port of call was the Nucleic Acids Research Molecular Biology Online Database Collection (which I’d read of in the opening chapters of Arthur Lesk’s Introduction to Bioinformatics) where I found a sizeable list of PPI databases.

Not wanting to click through each, I chose to browse this programmatically, using Javascript-automated AJAX requests (effectively asking the website to give me web pages but without displaying them) and just ‘scrape’ what I wanted (full workings here), as follows:

From these results, here’s a little background info on PDBe:

  {
    "name": "PDBe",
    "url": "http://www.ebi.ac.uk/pdbe/",
    "entryurl": "http://www.oxfordjournals.org/nar/database/summary/456",
    "desc": "EMBL-EBI's Protein Data Bank in Europe (PDBe) is the European resource for the collection, organization and dissemination of data about biological macromolecular structures. PDBe is one of four partners in the worldwide Protein Data Bank (wwPDB), the consortium entrusted with the collation, maintenance and distribution of the global repository of macromolecular structure data. PDBe uses a relational database that presents the data derived from the Protein Data Bank (PDB) in a consistent way and allows users to retrieve meaningful data using complex and sophisticated searches including simple textual queries or more complex 3D structure-based queries. PDBe has also developed a number of advanced tools for analysis of macromolecules. The \"Structure Integration with Function, Taxonomy and Sequence\" (SIFTS) initiative integrates data from a number of bioinformatics resources that is used by major global sequence, structure and protein-family resources. Furthermore, PDBe works actively with the X-ray crystallography, Nuclear Magnetic Resonance (NMR) spectroscopy and cryo-Electron Microscopy (EM) communities and is a partner in the Electron Microscopy Data Bank (EMDB). The active involvement with the scientific communities has resulted in improved tools for structure deposition and analysis.",
    "ref": null,
    "absurl": "http://nar.oxfordjournals.org/cgi/content/abstract/42/D1/D285",
    "email": "pdbe@ebi.ac.uk"
  },

Web scraping can feel quite kludgy, and there are doubtless better ways to do the above. Having said that, it’s great for prototyping: you can use Javascript within a web browser console, i.e. without littering your computer with temporary files. What’s more, dedicated communities like the ScraperWiki forum are around to support and develop the associated tools, and in its more elaborate incarnations ‘scraping’ features in journals like Briefings in Bioinformatics (“Web scraping technologies in an API World” was published there just this week).

After having decided on PDBe PISA thanks to my scraped-together report, and finding no guidance on how to tackle the task, I turned to the bioinformatician’s equivalent of [computing/programming Q&A site] Stack Overflow known as Biostars. My question got a grand total of 0 answers(!), so what follows is my approach — which may either be of interest as a peek into the work going under the banner of ‘bioinformatics’ or as a guide to other scientists seeking to access the same information.

First off, a Python script parcelled up a list of every PDB code (the unique identifier to an author-deposited structure from X-ray crystallography, NMR etc.) in PDB into comma-separated chunks of 50, which were stuck onto the end of a web-service query as recommended. The server would process these queries, understood through its “API”: the CGI of cgi-bin in the URL means it’s invoking a script on the server, which in turn expects interfaces.pisa? to be followed by comma-separated PDB codes. Given these expectations, the API will respond in a regular manner each time, enabling reliable scripting.

With over 2000 such queries for interface data (each of them requesting 50 PDB-code-identified structures), this isn’t something you want to be doing manually. It wasn’t clear exactly which PDB entries were needed at the time, so the full complement was downloaded.

This download script just works for one query, putting the received XML in one file – to handle all 2029 queries, a bit of lateral thinking was required. 50 queries (each containing 50 PDB codes) were executed to make up a single interfacesij.xml file, where i is an integer 1 to 4, and likewise j from 1 to 10 (plus a bonus 4-11 to get those final 29). Download scripts (named similarly as getxmlij.py) were written individually by another script — code writing code…

With download scripts written, the task of running each of them consecutively fell to yet another Python script, playing the sound of Super Mario picking up a coin when each file finished downloading, or the Mario pause-game sound upon encountering an error, because I could because clear feedback becomes necessary on something taking days across multiple computers.

Inevitably a minority of the queries failed, and had to be obtained separately.

Once downloaded, various pattern matching text-processing programs were run on the XML from within a shell script — readers unfamiliar with programming may have heard of these this week thanks to the 22 year old security bug(s) being referred to as shellshock. Shell scripts make looping through files in this manner a simple task, and are becoming essential for everyday file manipulation now that I’m a reformed Windows user. For the 41 XML files, a function runprocessor was called, with instructions to:

  1. Split each file successively at every <pisa_interfaces> tag through to the closing </pisa_interfaces> tag, the line numbers of which were stored together in an ordered list (an “array variable”) pisapairs
  2. Write each of these sections to a cache file xmlcache.xml, of suitable size for parsing by a Python XML parser.
  3. Reduce the time spent by the parser by in turn splitting this cache into just the PDB entries in the shortlist of interest with a function extractsubsets
  4. Initiate a Python script to read the entire cachesubset.xml file into memory, and write the pertinent structural data into a report formatted as tab-separated values (TSV). This file is a mere few hundred megabytes compared to the 120 GB grand total for the XML.

Clicking Details for an interface on the list of all interfaces for a given protein structure, e.g. for the only one in spider silk precursor protein spidroin, shows the interfacial residues in yellow:
image

The output threads together all interfacial residues and the associated statistical figures for each on a single line for every interface, but it’s simple enough to separate out each according to commas (then colons) to get a longform residue-per-line output once all XML is processed.

Progress is indicated in terminal output, where the current i and j values are printed followed by the pisapair (i.e. which of the 50 pisa_interfaces tags) is being worked through:

image

As shown in the logfile, there are inevitable errors, such as Entry not found: it’s simple enough to find the difference between the output report file’s list of PDB codes and the input ‘shortlist’, which can be mapped back to the constituent files for any follow-up investigation (the “wrangling” facet of computational science alluded to earlier) since the order of the original 2029 queries is known:

I’m putting these together in a code repository on GitHub, with a disclaimer that it’s not fit for all purposes (for instance if you’re interested in H-bonds, in brown on the PISA website residue table, above).

A lot of this was painfully slow — there’s nothing to be done about the speed of downloading the files, given that its rate is limited by the server. Yes there was a lot of data to get through, but Python’s sluggishness at the final step makes me wonder if I could implement some leaner algorithm, parallelise, etc., but with term recommenced code optimisation on a successfully completed task isn’t top priority. Advice on improvements would be appreciated if you have any.

I’m currently reading Jones & Pevzner’s An Introduction to Bioinformatics Algorithms which gives insight into how you can analyse and improve these types of operations (the book is core reading for a Coursera.org lecture series which kicks off next month), and have been recommended Goldwasser & Tamassia’s Data Structures and Algorithms in Python (a few online resources in a similar vein are available here).

I’ve also been fiddling with Julia, an R-like language with C-like speeds — in a 2012 blog post its creators say they “created Julia, in short, because we are greedy”. Fernando Perez is overseeing its incorporation into IPython Notebooks as ‘Project Jupyter’ and a port of R’s ggplot2 library has recently emerged for Julia under the name of Gadfly (a tutorial IPy NB is up here).

I’m starting a final year undergraduate project as of this week, on mapping small RNA-seq data to miRNAs, under the supervision of the founder of the database central to cataloguing this class of non-coding RNA ‒ super exciting stuff! :¬)

If you’ve got questions on PISA you think I could help with, feel free to ask here, or shoot me an email.

PDBe PISA homepage

✣ Peter Briggs, a scientific programmer at STFC Daresbury Laboratory, has a nice little guide to the service here.

Six of One (Plot), Half-Dozen of the Other

By: randyzwitch - Articles

Re-posted from: http://badhessian.org/2014/07/six-of-one-plot-half-dozen-of-the-other/

This is a guest post by Randy Zwitch (@randyzwitch), a digital analytics and predictive modeling consultant in the Greater Philadelphia area. Randy blogs regularly about Data Science and related technologies at http://randyzwitch.com. He’s blogged at Bad Hessian before here.

WordPress Stats - Visitors vs. Views
WordPress Stats – Visitors vs. Views

For those of you with WordPress blogs and have the Jetpack Stats module installed, you’re intimately familiar with this chart. There’s nothing particularly special about this chart, other than you usually don’t see bar charts with the bars shown superimposed.

I wanted to see what it would take to replicate this chart in R, Python and Julia. Here’s what I found. (download the data).

R: ggplot2

Although I prefer to use other languages these days for my analytics work, there’s a certain inertia/nostalgia that when I think of making charts, I think of using ggplot2 and R. Creating the above chart is pretty straightforward, though I didn’t quite replicate the chart, as I couldn’t figure out how to make my custom legend not do the diagonal bar thing.

The R Cookbook talks about a hack to remove the diagonal lines from legends, so I don’t feel too bad about not getting it. I also couldn’t figure out how to force ggplot2 to give me the horizontal line at 10000. If anyone in the R community knows how to fix these, let me know!

(Pythonistas: I’m aware of the ggplot port by Yhat; functionality I used in my R code is still in TODO, so I didn’t pursue plotting with ggplot in Python)

R: Base Graphics

Of course, not everyone finds ggplot2 to be easy to understand, as it requires a different way of thinking about coding than most ‘base’ R functions. To that end, there are the base graphics built into R, which produced this plot: wordpress-base-rWhile I was able to nearly replicate the WordPress chart (except for the feature of having the dark bars slightly smaller width than the lighter), the base R syntax is horrid. The abbreviations for plotting arguments are indefensible, the center and width keywords seem to shift the range of the x-axis instead of changing the actual bar width, and in general, the experience plotting using base R was the worst of the six libraries I evaluated.

Python: matplotlib

In the past year or so, there’s been quite a lot of activity towards improving the graphics capabilities in Python. Historically, there’s been a lot of teeth-gnashing about matplotlib being too low-level and hard to work with, but with enough effort, the results are quite pleasant. Unlike with ggplot2 and base R, I was able to replicate all the features of the WordPress plot:wordpress-matplotlib

Python: Seaborn

One of the aforementioned improvements to matplotlib is Seaborn, which promises to be a higher-level means of plotting data than matplotlib, as well as adding new plotting functionality common in statistics and research. Re-creating this plot using Seaborn is a waste of the additional functionality of Seaborn, and as such, I found it more difficult to make this plot using Seaborn than I did with matplotlib.

To replicate the plot, I ended up hacking a solution together using both Seaborn functionality and matplotlib in order to be able to set bar width and to create the legend, which defeats the purpose of using Seaborn in the first place.

Julia: Gadfly

In the Julia community, Gadfly is clearly the standard for plotting graphics. Supporting d3.js, PNG, PS, and PDF, Gadfly is built to work with many popular back-end environments. I was able to replicate everything about the WordPress graph except for the legend:wordpress-julia-gadflyWhile Gadfly took a line or two more than base R in terms of fewest lines of code, I find the Gadfly syntax significantly more pleasant to work with.

Julia: Plot.ly

Plot.ly is an interesting ‘competitor’ in this challenge, as it’s not a language-specific package per-se. Rather, Plot.ly is a means of specifying plots using JSON, with lightweight Julia/Python/MATLAB/R wrappers. I was able to replicate nearly everything about the WordPress plot, with the exception of not having a line at 10000, having the legend vertical instead of horizontal and I couldn’t figure out how to set the bar widths separately. wordpress-julia-plotly

And The Winner Is…matplotlib?!

If you told me at the beginning of this exercise that matplotlib (and by extension, Seaborn) would be the only library that I would be able to replicate all the features of the WordPress graph, I wouldn’t have believed it. And yet, here we are. ggplot2 was certainly very close, and I’m certain that someone knows how to fix the diagonal line issue. I suspect I could submit an issue ticket to Gadfly.jl to get the feature added to create custom legends (and for that matter, make the request of Plot.ly for horizontal legends), so in the future there could be feature parity using these two libraries as well.

I hope we all agree there’s no hope for Base Graphics in R besides quick throwaway plots.

In the end, the best thing I can say from this exercise is that the analytics community is fortunate to have so many talented people working to provide these amazing visualization libraries. This graph was rather pedestrian in nature, so I didn’t even scratch the surface of what these various libraries can do. Even beyond the six libraries I chose, there are others I didn’t choose, including: prettyplotlib (Python), Bokeh (Python), Vincent (Python), rCharts (R), ggvis (R), Winston (Julia), ASCII Plots (Julia) and probably even more that I’m not even aware of! All free and open-source and miles apart from terrible looking Microsoft graphics in Excel and Powerpoint.

UPDATE: A Comparison of Programming Languages in Economics

By: Bradley Setzler

Re-posted from: https://juliaeconomics.com/2014/06/16/update-a-comparison-of-programming-languages-in-economics/

Here, you can find the latest version of the paper and all of the codes used in A Comparison of Programming Languages in Economics, by S. Boragan Aruoba and Jesús Fernández-Villaverde.

This is the paper that convinced me to give Julia a try.


Bradley J. Setzler