Re-posted from: https://estadistika.github.io//web/scraping/philippines/julia/programming/packages/2018/10/30/Introduction-to-Web-Scraping-Julia.html
Data nowadays are almost everywhere, stored in anything from simple traditional logbooks to complex interconnected databases. Efficient collection of these datasets is crucial for analytics, since data processing takes up almost 50% of the overall workflow. An example where manual data collection can be automated is the case of datasets published on websites, where the providers are usually government agencies. For example, in the Philippines there is a website dedicated to Open Stat, initiated by the Philippine Statistics Authority (PSA). The site hosts public datasets for researchers to use, well prepared in CSV format, so consumers can simply download the files. Unfortunately, for some agencies this feature is not yet available. That is, users need to either copy-paste the data from the website, or request it from the agency directly (which also takes time). A good example of this is the seismic events data of the Philippine Institute of Volcanology and Seismology (PHIVOLCS).
Data encoded in HTML can be parsed and saved into formats that are workable for analysis (e.g. CSV, TSV, etc.). The task of harvesting and parsing data from the web is called web scraping, and PHIVOLCS' Latest Seismic Events page is a good playground for beginners. There are several tutorials available, especially for Python (see this) and R (see this), but not much for Julia. Hence, this article is primarily for Julia users. However, it also introduces web developer tools and how to use them for inspecting the components of a website, which can be useful for non-Julia users as well.
Why Julia?
The creators of the language described it well in their first announcement (I suggest you read the full post), Why We Created Julia. Here's part of it:
We are greedy: we want more.
We want a language that’s open source, with a liberal license. We want the speed of C with the dynamism of Ruby. We want a language that’s homoiconic, with true macros like Lisp, but with obvious, familiar mathematical notation like Matlab. We want something as usable for general programming as Python, as easy for statistics as R, as natural for string processing as Perl, as powerful for linear algebra as Matlab, as good at gluing programs together as the shell. Something that is dirt simple to learn, yet keeps the most serious hackers happy. We want it interactive and we want it compiled.
(Did we mention it should be as fast as C?)
I used Julia in my master's thesis for my MCMC simulations and benchmarked it against R. Sampling over the posterior distribution took seconds in Julia, while R took more than an hour. I could have optimized my R code using Rcpp (writing the performance-critical part in C++ to speed it up, then wrapping/calling it in R), but I had no time for that. Hence, Julia solves the two-language problem.
Getting to know HTML
Since the data published on websites are usually encoded as tables, it is best to understand the structure of the HTML document before performing web scraping. HTML (Hypertext Markup Language) is a standardized system for tagging text files to achieve font, color, graphic, and hyperlink effects on World Wide Web pages [1]. For example, bold text in HTML is enclosed inside the `<b>` tag, e.g. `<b>text</b>`, which renders the word in bold. A webpage is an HTML document that can be structured in several ways; one possible case is as follows:
Scrapers must be familiar with the hierarchy of the HTML document, as this will be the template for the frontend source code of every website. Following the structure of the above figure, data encoded in an HTML table are placed inside the `td` (table data) tag, where `td` is under `tr` (table row), `tr` is under `tbody` (table body), and so on. `td` is the lowest-level tag (sorting by hierarchy) in the figure above that can contain data. However, `td` can also be a parent of `p` (paragraph), `a` (hyperlink), `b` (bold), `i` (italic), `span` (span), and even `div` (division) tags, so expect to encounter these under `td` as well.
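To make the hierarchy concrete, here is a minimal sketch of such a table fragment; the contents and the `class` and `id` values are made up purely for illustration:

```html
<table class="colors">                    <!-- the table, with a class attribute -->
  <tbody>                                 <!-- table body -->
    <tr>                                  <!-- one table row -->
      <td id="yellow"><p>Yellow</p></td>  <!-- table data, wrapping a paragraph -->
      <td id="orange"><p>Orange</p></td>
    </tr>
  </tbody>
</table>
```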
As indicated in the figure, each HTML tag can have attributes, such as `id` and `class`. To understand how the two differ, consider `id="yellow"` and `id="orange"`: these are unique identities (`id`s) of colors. These `id`s can be grouped into a class, e.g. `class="colors"`. HTML tags are not required to have these attributes, but they are useful for adding custom styles and behavior when doing web development. This article will not dive into the details of the HTML document, but rather aims to give the reader a high-level understanding. There are many resources available on the web, just Google them.
Inspecting the Source of the Website
In order to get an idea of the structure of the website, browsers such as Google Chrome and Mozilla Firefox include tools for web developers. For purposes of illustration, but without loss of generality, this article will only scrape a portion (why? read on and see the explanation below) of the September 2018 earthquake events. The web developer tools can be accessed from Tools > Web Developer in Firefox, and from View > Developer in Google Chrome. The following video shows how to use the inspector tool of Mozilla Firefox.
Scraping using Julia
To perform web scraping, Julia offers three libraries for the job: Cascadia.jl, Gumbo.jl, and HTTP.jl. HTTP.jl is used to download the frontend source code of the website, which is then parsed by Gumbo.jl into a hierarchical structured object; Cascadia.jl provides a CSS selector API for easy navigation.
To start with, the following code will download the frontend source code of PHIVOLCS' Seismic Events page for September 2018.
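A minimal sketch of that step with HTTP.jl; note that the URL below follows PHIVOLCS' monthly-page pattern and is an assumption here, so verify it against the site:

```julia
using HTTP

# URL of the September 2018 seismic events page (assumed; check the PHIVOLCS site for the exact path)
url = "https://www.phivolcs.dost.gov.ph/html/update_SOEPD/EQLatest-Monthly/2018/2018_September.html"

# download the frontend source code of the page
res = HTTP.get(url);
```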
Extract the HTML source code and parse it as follows:
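For example, with Gumbo.jl (naming the parsed document `doc` here):

```julia
using Gumbo

# parse the raw bytes of the response into a structured HTML document
doc = parsehtml(String(res.body));
```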
Now, to extract the header of the HTML table, use the Web Developer Tools for a preliminary inspection of the components of the website. As shown in the screenshot below, the header of the table is enclosed inside the `p` tag of the `td`. Further, the `p` tag is of class `auto-style33`, which can be accessed via a CSS selector by simply prefixing it with `.`, i.e. `.auto-style33`.
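With Cascadia.jl, that query might look like the following sketch (reusing the parsed document `doc` from above):

```julia
using Cascadia

# select every element of class auto-style33 (the <p> tags holding the header names)
qres = eachmatch(Selector(".auto-style33"), doc.root);
```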
`qres` contains the HTML tags that matched the CSS selector's query. The result is further cleaned by removing the tabs, spaces, and line breaks via regular expressions, which is done as follows:
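One way to do that cleaning (a sketch; the exact regular expression in the original post may differ):

```julia
# take the text of each matched node and collapse tabs, line breaks, and repeated spaces
header = [String(strip(replace(nodeText(q), r"\s+" => " "))) for q in qres]
```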
Having the header names, the next step is to extract the data from the HTML table. Upon inspection, the `td`s containing the data next to the header rows seem to have the following classes (see screenshot below): `auto-style21` for the first column (Date-Time), `auto-style81` for the second column (Latitude), `auto-style80` for the third and fourth columns (Longitude and Depth), `auto-style74` for the fifth column (Magnitude), and `auto-style79` for the sixth column (Location). Unfortunately, this is not consistent across rows (`tr`s), so it is best not to rely on these classes with Cascadia.jl. Instead, use Gumbo.jl to navigate down the hierarchy of the Document Object Model of the HTML.
Starting with the `table` tag, which is of class `.MsoNormalTable` (see screenshot below), the extraction proceeds down to `tbody`, then to `tr`, and finally to `td`.
The following code shows how the parsing is done; read the comments:
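A sketch of that navigation, assuming the parsed document `doc` from above and the table indices described in the next section:

```julia
# all tables of class MsoNormalTable; per the next section, the third and fourth
# matches hold the two portions of the September 2018 data
html = eachmatch(Selector(".MsoNormalTable"), doc.root)

# walk down the hierarchy: table -> tbody -> tr -> td
tbody = html[3][1]        # third matched table, then its <tbody>
tr    = tbody[1]          # e.g. the first row of that body
row   = [nodeText(td) for td in tr.children if isa(td, HTMLElement{:td})]
```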
Complete Code for PHIVOLCS’ September 2018 (Portion) Seismic Events
The September 2018 Seismic Events are encoded in two separate HTML tables of the same class, named `MsoNormalTable`. For the sake of simplicity, this article will only scrape the first portion of the data (the third matched table, i.e. `tbody = html[3][1];`), which has 581 rows. The second portion (the fourth matched table, i.e. `tbody = html[4][1];`) is left for the reader to try out and scrape as well.
The following code wraps the parsers into functions, namely `htmldoc` (downloads and parses the HTML source code of the site), `scraper` (scrapes the downloaded HTML document), and `firstcolumn` (logic for parsing the first column of the table, used inside the `scraper` function).
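A rough sketch of how such functions could look, assuming the URL pattern and the `.MsoNormalTable` indexing described above, with the cleaning logic inferred from the earlier steps rather than taken from the original code:

```julia
using Cascadia, Gumbo, HTTP

# download and parse the HTML source code of the site
function htmldoc(url::String)
    res = HTTP.get(url)
    return parsehtml(String(res.body))
end

# collapse tabs, line breaks, and repeated spaces in a node's text
cleantext(node) = String(strip(replace(nodeText(node), r"\s+" => " ")))

# logic for the first column (Date-Time); nodeText already recurses into any
# nested <a> tag, so for this sketch cleaning the text is enough
firstcolumn(td) = cleantext(td)

# scrape the downloaded HTML document
function scraper(doc)
    html  = eachmatch(Selector(".MsoNormalTable"), doc.root)
    tbody = html[3][1]          # first portion of the data; use html[4][1] for the second
    rows  = Vector{Vector{String}}()
    for tr in tbody.children
        isa(tr, HTMLElement{:tr}) || continue
        tds = [td for td in tr.children if isa(td, HTMLElement{:td})]
        length(tds) == 6 || continue        # keep only rows with the six data columns
        push!(rows, [firstcolumn(tds[1]); cleantext.(tds[2:end])])
    end
    return rows        # header rows that slip through may still need filtering
end

url  = "https://www.phivolcs.dost.gov.ph/html/update_SOEPD/EQLatest-Monthly/2018/2018_September.html"
data = scraper(htmldoc(url));
```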
Having the data, analysts can now proceed with exploratory analyses; for example, the following gives the descriptive statistics of the variables:
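Assuming the scraped rows have been collected into a DataFrame (using the cleaned header names from earlier as column names), this could be as simple as the following sketch:

```julia
using DataFrames

# assemble the scraped rows into a DataFrame, one column per header name;
# columns are kept as strings here, so parse Latitude, Longitude, Depth, and
# Magnitude to Float64 if numeric summaries (mean, median) are needed
df = DataFrame([Symbol(h) => getindex.(data, i) for (i, h) in enumerate(header)])

# descriptive statistics for every column
describe(df)
```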
`describe` is clever enough not only to skip the `mean` and `median` for non-continuous variables, but also to report the `min`, `max`, and `nunique` (number of unique values) for these variables (date and location).
End Note
I use Python primarily at work, with BeautifulSoup as my go-to library for web scraping. Compared to Cascadia.jl and Gumbo.jl, BeautifulSoup offers comprehensive documentation and other resources that are useful for figuring out bugs and understanding how the module works. Having said that, I hope this article somehow contributes to the documentation of the said Julia libraries. Further, I am confident to say that Cascadia.jl and Gumbo.jl are stable enough for the job.
Lastly, as a precaution to beginners, make sure to read the privacy policy (if any) of any website you want to scrape.