Towards Tsunami Informatics: Applying Machine Learning to Data Extracted from Twitter

2018 Sulawesi Earthquake & Tsunami

Even in 2018, our ability to provide accurate tsunami advisories and warnings is exceedingly challenged.

In best-case scenarios, advisories and warnings afford inhabitants of low-lying coastal areas minutes or (hopefully) longer to react.

In best-case scenarios, advisories and warnings are based upon in situ measurements via tsunameters – as ocean-bottom changes in seawater pressure serve as reliable precursors for impending tsunami arrival. (By way of analogy, tsunameters ‘see’ tsunamis much as radars ‘see’ precipitation. Based on ‘sight’ then, both offer a reasonable ability to ‘nowcast’.)

In typical scenarios, however, advisories and warnings can communicate mixed messages. In the case of the recent Sulawesi earthquake and tsunami for example, a nearby alert (for the Makassar Strait) was retracted after some 30 minutes, even though Palu, Indonesia experienced a ‘localized’ tsunami that resulted in significant losses – with current estimates placing the number of fatalities at more than 1200 people.

With ultimate regret stemming from significant loss of human life, the recent case for the residents of Palu is particularly painful, as alerting was not informed by tsunameter measurements owing to an ongoing dispute – an unresolved dispute that rendered the deployment of an array of tsunameters incomplete and inoperable. A dispute that, if resolved, could’ve provided this low-lying coastal area with accurate and potentially life-saving alerts.

Lessons from Past Events

It’s been only 5,025 days since the last tsunami affected Indonesia – the also devastating Boxing Day 2004 event in the Indian Ocean. All things considered, it’s truly wonderful that a strategic effort to deploy a network of tsunameters in this part of the planet was in place; of course, it’s well beyond tragic that execution of the project was significantly hampered, and that almost 14 years later, inhabitants of this otherwise idyllic setting are left to suffer loss of such epic proportions.

I’m a huge proponent of tsunameters as last-resort, yet-accurate indicators for tsunami alerting. In their absence, the norm is for advisories and warnings that may deliver accurate alerts – “may” being the operative word here, as it is often the case that alerts are issued only to be retracted at some future time … as was the case again for the recent Sulawesi event. Obviously, tsunami centers that ‘cry wolf’ run the risk of not being taken seriously – perhaps on the very occasion when they have correctly predicted an event of some significance.

It’s not that those scientific teams of geographers, geologists, geophysicists, oceanographers and more are in any way lax in attempting to do their jobs; it’s truly that the matter of tsunami prediction is exceedingly difficult. For example, unless you caught the January 2006 issue of Scientific American as I happened to, you’d likely be unaware that 4,933 days ago an earthquake affected (essentially) the same region as the Boxing Day 2004 event; regarded as a three-month-later aftershock, this event of similar earthquake magnitude and tectonic setting did not result in a tsunami.

Writing in this January 2006 issue of Scientific American, Geist et al. compared the two Indian Ocean events side-by-side – using one of those diagrams that this magazine is lauded for. The similarities between the two events are compelling. The seemingly subtle differences, however, are much more than compelling – as the tsunami-producing earlier of the two events bears testimony.

As a student of theoretical, global geophysics, but not specifically oceanography, seismology, tectonophysics or the like, I was unaware of the ‘shocking differences’ between these two events. However, my interest was captivated instantaneously!

Towards Tsunami Informatics

Graph Analytics?

It would take, however, some 3,000 days for my captivated interest to be transformed into a scientific communication. On the heels of successfully developing a framework and platform for knowledge representation with long-time friend and collaborator Jim Freemantle and others, our initial idea was to apply graph analytics to data extracted from Twitter – thus acknowledging that Twitter has the potential to serve as a source of data that might be of value in the context of tsunami alerting.

In hindsight, it’s fortunate that Jim and I did not spend a lot of time on the graph-analytics approach. In fact, arguably the most-valuable outcome from the poster we presented at a computer-science conference in June 2014 (HPCS, Halifax, Nova Scotia), was Jim’s Perl script (see, e.g., Listing 1 of our subsequent unpublished paper, or Listing 1.1 of our soon-to-be published book chapter) that extracted keyword-specified data (e.g., “#earthquake”) from Twitter streams.
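Jim’s Perl script is not reproduced in this post. Purely to illustrate what keyword-specified extraction amounts to, here is a minimal Python sketch. The function name `filter_tweets`, the sample tweets and the choice of case-insensitive, whole-token matching are all assumptions made for this illustration; nothing here connects to Twitter, and it should not be read as a port of the original script.

```python
import re


def filter_tweets(tweets, keyword):
    """Return the tweets that contain `keyword` as a whole token, ignoring case.

    `tweets` is any iterable of strings; nothing here talks to Twitter.
    """
    # The lookarounds stop "#earthquake" from also matching "#earthquakes".
    pattern = re.compile(r"(?<!\w)" + re.escape(keyword) + r"(?!\w)", re.IGNORECASE)
    return [text for text in tweets if pattern.search(text)]


# Hypothetical sample data, invented for this sketch.
sample = [
    "Strong shaking felt downtown #Earthquake",
    "Lunch was great today",
    "New paper on #earthquakes and hazard maps",
]

matches = filter_tweets(sample, "#earthquake")
```

In a streaming setting the same predicate would presumably be applied to each tweet as it arrives, rather than to a list held in memory.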

Machine Learning: Classification

About two years later, stemming from conversations at the March 2016 Rice University Oil & Gas Conference in Houston, our efforts began to emphasize Machine Learning over graph analytics. As we drove for results to present at a May 2016 Big Data event at Prairie View A&M University (PVAMU, also in the Houston area), a textbook example (literally!) taken from the pages of an O’Reilly book on Learning Spark showed some promise in allowing Jim and me to classify tweets – with hammy tweets encapsulating something deemed geophysically interesting, whereas spammy ones not so much. ‘Not so much’ was determined through supervised learning – in other words, results reported were achieved after a manual classification of tweets for the purpose of training the Machine Learning models. The need for manual training and the absence of semantics struck the two of us as ‘lacking’ from the outset; more specifically, each tokenized word of each tweet was represented as a feature vector – stated differently, data and metadata (e.g., Twitter handles, URLs) were all represented with the same (lacking) degree of semantic expression. Based upon our experience with knowledge-representation frameworks, we immediately sought a semantically richer solution.
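To make the ham-versus-spam idea concrete, here is a minimal sketch of supervised tweet classification. It is a pure-Python stand-in, not the Spark code we ran: tokens are hashed into a fixed-length count vector and a logistic regression is fitted by stochastic gradient descent. The feature-space size, learning rate, epoch count and the four hand-labelled tweets are all assumptions invented for this illustration.

```python
import math
import zlib

NUM_FEATURES = 1024  # assumed size of the hashed feature space


def featurize(text):
    """Hash each whitespace-separated token into a fixed-length count vector."""
    vec = [0.0] * NUM_FEATURES
    for token in text.lower().split():
        # crc32 is stable across runs, unlike Python's built-in hash() for str.
        vec[zlib.crc32(token.encode("utf-8")) % NUM_FEATURES] += 1.0
    return vec


def predict_proba(weights, bias, vec):
    """Logistic-regression probability that a tweet is 'hammy'."""
    z = bias + sum(w * x for w, x in zip(weights, vec))
    return 1.0 / (1.0 + math.exp(-z))


def train(examples, epochs=200, rate=0.5):
    """Fit weights by plain stochastic gradient descent on the log loss."""
    weights, bias = [0.0] * NUM_FEATURES, 0.0
    data = [(featurize(text), label) for text, label in examples]
    for _ in range(epochs):
        for vec, label in data:
            error = predict_proba(weights, bias, vec) - label
            bias -= rate * error
            for i, x in enumerate(vec):
                if x:
                    weights[i] -= rate * error * x
    return weights, bias


# Hypothetical, hand-labelled tweets: 1 = hammy, 0 = spammy.
labelled = [
    ("magnitude 7 earthquake felt near the coast", 1),
    ("tsunami warning issued after offshore earthquake", 1),
    ("win a free phone click this link now", 0),
    ("follow me for free giveaway click here", 0),
]

weights, bias = train(labelled)
predictions = [round(predict_proba(weights, bias, featurize(t))) for t, _ in labelled]
```

Note how every token, whether a word, a Twitter handle or a URL, lands in the same undifferentiated vector of counts. That is exactly the lack of semantic expression described above.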

Machine Learning: Natural Language Processing

It wasn’t until after I’d made a presentation at GTC 2017 in Silicon Valley the following year that the idea of representing words as embedded vectors would register with me. Working with Jim, two unconventional choices were made – namely, GloVe over word2vec and PyTorch over TensorFlow. Whereas academic articles justified our choice of Stanford’s GloVe, the case for PyTorch was made on less-rigorous grounds – grounds expounded in my GTC presentation and our soon-to-be published book chapter.

Our uptake of GloVe and PyTorch addressed our scientific imperative, as results were obtained for the 2017 instantiation of the same HPCS conference where this idea of tsunami alerting (based upon data extracted from Twitter) was originally hatched. In employing Natural Language Processing (NLP), via embedded word vectors, Jim and I were able to quantitatively explore tweets as word-based time series based upon their co-occurrences – stated differently, this word-vector quantification is based upon ‘the company’ (usage associations) that words ‘keep’. By referencing the predigested corpora available from the GloVe project, we were able to explore “earthquake” and “tsunami” in terms of distances, analogies and various kinds of similarities (e.g., cosine similarity).
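For readers unfamiliar with the measure, cosine similarity is the dot product of two vectors divided by the product of their lengths. The sketch below computes it for word vectors stored in the plain-text layout that the GloVe project distributes (a word followed by its space-separated components). The three-dimensional vectors are toy values invented for this illustration; they are not taken from any GloVe corpus, and real pretrained vectors have many more dimensions.

```python
import math


def cosine_similarity(u, v):
    """Dot product of u and v divided by the product of their Euclidean norms."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def parse_glove_line(line):
    """Split one line of a GloVe-style text file into (word, vector)."""
    parts = line.rstrip().split(" ")
    return parts[0], [float(x) for x in parts[1:]]


# Toy vectors, invented for this sketch (not real GloVe values).
toy_lines = [
    "earthquake 0.9 0.1 0.3",
    "tsunami 0.8 0.2 0.4",
    "banana -0.2 0.9 -0.1",
]
vectors = dict(parse_glove_line(line) for line in toy_lines)

sim_related = cosine_similarity(vectors["earthquake"], vectors["tsunami"])
sim_unrelated = cosine_similarity(vectors["earthquake"], vectors["banana"])
```

With these toy values the related pair scores close to 1 and the unrelated pair scores below 0, which is the kind of contrast the measure is meant to expose.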

Event-Reanalysis Examples

Our NLP approach appeared promising enough that we closed out 2017 with a presentation of our findings to date during an interdisciplinary session on tsunami science at the Fall Meeting of the American Geophysical Union held in New Orleans. To emphasize the scientific applicability of our approach, Jim and I focused on reanalyzing two pairs of events (see Slide 10 here). Like the pair identified years previously in the 2006 Scientific American article, the more-recent event pairs we chose included earthquake-only plus tsunamigenic events originating in close geographic proximity, with similar oceanic and tectonic settings.

The most-promising results we reported (see slides 11 and 12 here and below) involved those cosine similarities obtained for earthquake-only versus tsunamigenic events; evident via clustering, the approach appears able to discriminate between the two classes of events based upon data extracted from Twitter. Even in our own estimation however, the clustering is weakly discriminating at best, and we expect to apply more-advanced approaches for NLP to further separate classes of events.

Agile Sprints - Events - 2017 AGU Fall Meeting - Twitter Tsunami - December 8, 2017


Ultimately, the ability to further validate and operationally deploy this alerting mechanism would require the data from Twitter be streamed and processed in real time – a challenge that some containerized implementation of Apache Spark would seem ideally suited to, for example. (Aspects of this Future Work are outlined in the final section of our HPCS 2017 book chapter.)

When it comes to tsunamis, alerting remains a challenge – especially in those parts of the planet under-serviced by networks of tsunameters … and even seismometers, tide gauges, etc. Thus prospects for enhancing the alerting capabilities remain valuable and warranted. Even though inherently fraught with subjectivity, data extracted from Twitter streams in real time appear to hold some promise for providing a data source that complements the objective output from scientific instrumentation. Our approach, based upon Machine Learning via NLP, has demonstrated promising-enough early signs of success that ‘further research is required’. Given that this initiative has already benefited from useful discussions at conferences, suggestions are welcome, as it’s clear that even NLP has a lot more to offer beyond embedded word vectors.

How I Ended Up in Geophysical Fluid Dynamics

Lately, I’ve been disclosing the various biases I bring to practicing and enabling Data Science. Motivated by my decision to (finally) self-curate an online, multimedia portfolio, I felt such biases to be material in providing the context that frames this effort. Elsewhere, I’ve shared my inherently scientific bias. In this post, I want to provide additional details. These details I’ve been able to extract verbatim from a blog post I wrote for Bright Computing in January 2015; once I’d settled on geophysics (see below), I aspired to be a seismologist … but, as you’ll soon find out, things didn’t pan out quite the way I’d expected:

I always wanted to be a seismologist.

Scratch that: I always wanted to be an astronaut. How could I help it? I grew up in suburban London (UK, not Ontario) watching James Burke cover the Apollo missions. (Guess I’m also revealing my age here!)

Although I never gave my childhood dream of becoming an astronaut more than a fleeting consideration, I did pursue a career in science.

As my high-school education drew to a close, I had my choices narrowed down to being an astronomer, geophysicist or a nuclear physicist. In grade 12 at Laurier Collegiate in Scarboro (Ontario, not UK … or elsewhere), I took an optional physics course that introduced me to astronomy and nuclear physics. And although I was taken by both subjects, and influenced by wonderful teachers, I dismissed both of these as areas of focus in university. As I recall, I had concerns that I wouldn’t be employable if I had a degree in astronomy, and I wasn’t ready to confront the ethical/moral/etc. dilemmas I expected would accompany a choice of nuclear physics. Go figure!

And so it was to geophysics I was drawn, again influenced significantly by courses in physical geography taught by a wonderful teacher at this same high school. My desire to be a seismologist persisted throughout my undergraduate degree at Montreal’s McGill University, where I ultimately graduated with a B.Sc. in solid Earth geophysics. Armed with my McGill degree, I was in a position to make seismology a point of focus.

But I didn’t. Instead, at Toronto’s York University, I applied Geophysical Fluid Dynamics (GFD) to Earth’s deep interior – mostly Earth’s fluid outer core. Nothing superficial here (literally), as the core only begins some 3,000 km below where we stand on the surface!

Full disclosure: In graduate school, the emphasis was GFD. However, seismology crept in from time to time. For example, I made use of results from deep-Earth seismology in estimating the viscosity of Earth’s fluid outer core. Since this is such a deeply remote region of our planet, geophysicists need to content themselves with observations accessible via seismic and other methods.

From making use of Apache Spark to improve the performance of seismic processing (search for “Reverse-Time Seismic Migration” or “RTM” in my Portfolio), to the analysis of ‘seismic data’ extracted from Twitter (search for “Twitter” in my Portfolio), seismology has taken center stage in a number of my projects as a practitioner of Data Science. However, so has the geophysical fluid dynamics of Earth’s mantle and outer core. Clearly, you can have your geeky cake and eat it too!

Annotation Modeling: To Appear in Comp & Geosci

What a difference a day makes!
Yesterday I learned that my paper on semantic platforms was rejected.
Today, however, the news was better, as a manuscript on annotation modeling was accepted for publication.
It’s been a long road for this paper:

The abstract of the paper is as follows:

Annotation Modeling with Formal Ontologies:
Implications for Informal Ontologies

L. I. Lumb[1], J. R. Freemantle[2], J. I. Lederman[2] & K. D.
[1] Computing and Network Services, York University, 4700 Keele Street, Toronto, Ontario, M3J 1P3, Canada
[2] Earth & Space Science and Engineering, York University, 4700 Keele Street, Toronto, Ontario, M3J 1P3, Canada
Knowledge representation is increasingly recognized as an important component of any cyberinfrastructure (CI). In order to expediently address scientific needs, geoscientists continue to leverage the standards and implementations emerging from the World Wide Web Consortium’s (W3C) Semantic Web effort. In an ongoing investigation, previous efforts have been aimed towards the development of a semantic framework for the Global Geodynamics Project (GGP). In contrast to other efforts, the approach taken has emphasized the development of informal ontologies, i.e., ontologies that are derived from the successive extraction of Resource Description Framework (RDF) representations from eXtensible Markup Language (XML), and then Web Ontology Language (OWL) from RDF. To better understand the challenges and opportunities for incorporating annotations into the emerging semantic framework, the present effort focuses on knowledge-representation modeling involving formal ontologies. Although OWL’s internal mechanism for annotation is constrained to ensure computational completeness and decidability, externally originating annotations based on the XML Pointer Language (XPointer) can easily violate these constraints. Thus the effort of modeling with formal ontologies allows for recommendations applicable to the case of incorporating annotations into informal ontologies.

I expect the whole paper will be made available in the not-too-distant future …

Earth and Space Science Informatics at the 2007 Fall Meeting of the American Geophysical Union

In a previous post, I referred to Earth Science Informatics as a discipline-in-the-making.

To support this claim, I cited a number of data points. And of these data points, the 2006 Fall Meeting of the American Geophysical Union (AGU) stands out as a key enabler.

With 22 sessions posted, the 2007 Fall Meeting of the AGU is well primed to further enable the development of this discipline.

Because I’m a passionate advocate of this intersection between the Earth Sciences and Informatics, I’m involved in convening three of the 22 Earth and Space Science Informatics sessions:

I encourage you to take a moment to review the calls for participation for these three, as well as the other 19, sessions in Earth and Space Science Informatics at the 2007 Fall Meeting of the AGU.

Digital Terrain Mapping via LIDAR

From the purely scientific (ozone-column mapping, imaging hydrometeors in clouds) to commercial (on-board detection of clear air turbulence, CAT), my exposure to LIDAR applications has been primarily atmospheric.

Of course, other applications of LIDAR technology exist, and one of these is Digital Terrain Mapping (DTM).

Terra Remote Sensing Inc. is a leader in LIDAR-based DTM. Particularly impressive is their ability to perform surface DTM in areas of dense vegetation. As I learned at a very recent meeting of the Ontario Association of Remote Sensing (OARS), Terra has already found a number of very practical applications for LIDAR-based DTM.

Some additional applications that come to mind are:

  • DTM of urban canopies for atmospheric experiments – Terra has already mapped buildings for various purposes. The same approach could be used to better ground (sorry 😉) atmospheric experiments. For example, the boundary-layer modeling that was conducted for Joint Urban 2003 (JU03) employed a digitization of Oklahoma City. A LIDAR-based DTM would’ve made this an even more realistic effort.
  • Monitoring the progress of Global Change in the Arctic – In addition to LIDAR-based DTM, Terra is also having some success characterizing surfaces based on LIDAR intensity measurements. Because open water and a glacier would be expected to have different DTM and intensity characteristics, Terra should also be able to monitor Global Change as nunataks are progressively transformed into traditional islands (land isolated and surrounded by open water). With the Arctic as a bellwether for Global Change, it’s not surprising that the nunatak-to-island transformation is getting attention.

Although my additional examples are (once again) atmospheric in nature, as Terra is demonstrating, there are numerous applications for LIDAR-based technologies.

Annotation Paper Submitted to HPCS 2007 Event

I’ve blogged and presented recently (locally and at an international scientific event) on the topic of annotation and knowledge representation.

Working with co-authors Jerusha Lederman, Jim Freemantle and Keith Aldridge, a written version of the recent AGU presentation has been prepared and submitted to the HPCS 2007 event. The abstract is as follows:

Semantically Enabling the Global Geodynamics Project:
Incorporating Feature-Based Annotations via XML Pointer Language (XPointer)

Earth Science Markup Language (ESML) is efficient and effective in representing scientific data in an XML-based formalism. However, features of the data being represented are not accounted for in ESML. Such features might derive from events, identifications, or some other source. In order to account for features in an ESML context, they are considered from the perspective of annotation. Although it is possible to extend ESML to incorporate feature-based annotations internally, there are complicating factors identified that apply to ESML and most XML dialects. Rather than pursue the ESML-extension approach, an external representation for feature-based annotations via XML Pointer Language (XPointer) is developed. In previous work, it has been shown that it is possible to extract relationships from ESML-based representations, and capture the results in the Resource Description Framework (RDF). Application of this same requirement to XPointer-based annotations of ESML representations results in a revised semantic framework for the Global Geodynamics Project (GGP).
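The abstract’s central idea, keeping feature annotations outside the data document and pointing into it, can be illustrated with a small sketch. Everything in it is assumed for illustration: the XML is an invented stand-in rather than real ESML, the element, attribute and predicate names are hypothetical, and the pointers use the limited XPath subset in Python’s `xml.etree.ElementTree` in place of XPointer itself. The resolved annotations are flattened into subject-predicate-object tuples, only to hint at the kind of relationship that could later be captured in RDF.

```python
import xml.etree.ElementTree as ET

# Invented data document standing in for an ESML-described data set.
DATA_XML = """
<dataset id="station-A">
  <sample t="0">9.8061</sample>
  <sample t="1">9.8063</sample>
  <sample t="2">9.8119</sample>
</dataset>
"""

# Annotations live outside the data document and point into it.
ANNOTATIONS = [
    {
        "pointer": "./sample[@t='2']",
        "feature": "event",
        "note": "possible earthquake signal",
    },
]


def resolve(root, annotations):
    """Resolve each pointer against the data and emit simple triples."""
    triples = []
    for ann in annotations:
        for element in root.findall(ann["pointer"]):
            subject = f"{root.get('id')}#t={element.get('t')}"
            triples.append((subject, "hasFeature", ann["feature"]))
            triples.append((subject, "hasNote", ann["note"]))
    return triples


root = ET.fromstring(DATA_XML.strip())
triples = resolve(root, ANNOTATIONS)
```

The data document is never modified; deleting or replacing the annotation list leaves it untouched, which is the practical appeal of the external approach over extending the data format itself.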

Once the paper is accepted, I’ll make a pre-submission version available online.

Because the AGU session I participated in has also issued a call for papers, I’ll be extending the HPCS 2007 submission in various interesting ways.

And finally, thoughts are starting to gel on how annotations may be worked into the emerging notions I’ve been having on knowledge-based heuristics.

Stay tuned.