Towards Tsunami Informatics: Applying Machine Learning to Data Extracted from Twitter

2018 Sulawesi Earthquake & Tsunami

Even in 2018, our ability to provide accurate tsunami advisories and warnings is exceedingly challenged.

In best-case scenarios, advisories and warnings afford inhabitants of low-lying coastal areas minutes or (hopefully) longer to react.

In best-case scenarios, advisories and warnings are based upon in situ measurements via tsunameters – as ocean-bottom changes in seawater pressure serve as reliable precursors for impending tsunami arrival. (By way of analogy, tsunameters ‘see’ tsunamis as do radars ‘see’ precipitation. Based on ‘sight’ then, both offer a reasonable ability to ‘nowcast’.)

In typical scenarios, however, advisories and warnings can communicate mixed messages. In the case of the recent Sulawesi earthquake and tsunami for example, a nearby alert (for the Makassar Strait) was retracted after some 30 minutes, even though Palu, Indonesia experienced a ‘localized’ tsunami that resulted in significant losses – with current estimates placing the number of fatalities at more than 1200 people.

With ultimate regret stemming from significant loss of human life, the recent case for the residents of Palu is particularly painful, as alerting was not informed by tsunameter measurements owing to an ongoing dispute – an unresolved dispute that rendered the deployment of an array of tsunameters incomplete and inoperable. A dispute that, if resolved, could’ve provided this low-lying coastal area with accurate and potentially life-saving alerts.

Lessons from Past Events

It’s been only 5,025 days since the last tsunami affected Indonesia – the also devastating Boxing Day 2004 event in the Indian Ocean. All things considered, it’s truly wonderful that a strategic effort to deploy a network of tsunameters in this part of the planet was in place; of course, it’s well beyond tragic that execution of the project was significantly hampered, and that almost 14 years later, inhabitants of this otherwise idyllic setting are left to suffer loss of such epic proportions.

I’m a huge proponent of tsunameters as last-resort, yet-accurate indicators for tsunami alerting. In their absence, the norm is for advisories and warnings that may deliver accurate alerts – “may” being the operative word here, as it is often the case that alerts are issued only to be retracted at some future time … as was the case again for the recent Sulawesi event. Obviously, tsunami centers that ‘cry wolf’ run the risk of not being taken seriously, perhaps even when they have correctly predicted an event of some significance.

It’s not that those scientific teams of geographers, geologists, geophysicists, oceanographers and more are in any way lax in attempting to do their jobs; it’s truly that the matter of tsunami prediction is exceedingly difficult. For example, unless you caught the January 2006 issue of Scientific American as I happened to, you’d likely be unaware that 4,933 days ago an earthquake affected (essentially) the same region as the Boxing Day 2004 event; regarded as a three-month-later aftershock, this event of similar earthquake magnitude and tectonic setting did not result in a tsunami.

Writing in this January 2006 issue of Scientific American, Geist et al. compared the two Indian Ocean events side-by-side – using one of those diagrams that this magazine is lauded for. The similarities between the two events are compelling. The seemingly subtle differences, however, are much more than compelling – as the tsunami-producing earlier of the two events bears testimony.

As a student of theoretical, global geophysics, but not specifically oceanography, seismology, tectonophysics or the like, I was unaware of the ‘shocking differences’ between these two events. However, my interest was captivated instantaneously!

Towards Tsunami Informatics

Graph Analytics?

It would take, however, some 3,000 days for my captivated interest to be transformed into a scientific communication. On the heels of successfully developing a framework and platform for knowledge representation with long-time friend and collaborator Jim Freemantle and others, our initial idea was to apply graph analytics to data extracted from Twitter – thus acknowledging that Twitter has the potential to serve as a source of data that might be of value in the context of tsunami alerting.

In hindsight, it’s fortunate that Jim and I did not spend a lot of time on the graph-analytics approach. In fact, arguably the most-valuable outcome from the poster we presented at a computer-science conference in June 2014 (HPCS, Halifax, Nova Scotia) was Jim’s Perl script (see, e.g., Listing 1 of our subsequent unpublished paper, or Listing 1.1 of our soon-to-be published book chapter) that extracted keyword-specified data (e.g., “#earthquake”) from Twitter streams.
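To make keyword-specified extraction concrete, here is a minimal Python sketch. It is not Jim’s Perl script, which is not reproduced here. It assumes that tweets arrive as one JSON object per line with a "text" field; the function name filter_by_keyword and the sample records are hypothetical.

```python
import json
from typing import Iterable, Iterator


def filter_by_keyword(lines: Iterable[str], keyword: str) -> Iterator[dict]:
    """Yield decoded tweets whose text contains `keyword` (case-insensitive).

    Assumption: one JSON object per line with a "text" field. Malformed
    lines and records without usable text are skipped rather than raising.
    """
    needle = keyword.lower()
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            tweet = json.loads(line)
        except json.JSONDecodeError:
            continue
        if not isinstance(tweet, dict):
            continue
        text = tweet.get("text", "")
        if isinstance(text, str) and needle in text.lower():
            yield tweet


# Hypothetical sample stream (invented records, not real tweets).
SAMPLE_STREAM = [
    '{"id": 1, "text": "Strong shaking felt downtown #earthquake"}',
    '{"id": 2, "text": "Lunch was great today"}',
    "not json at all",
    '{"id": 3, "text": "#Earthquake then the sea pulled back"}',
]

matches = list(filter_by_keyword(SAMPLE_STREAM, "#earthquake"))
matched_ids = [tweet["id"] for tweet in matches]
```

Because the function is a generator, it can consume an open file or a network stream line by line without holding the whole stream in memory.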

Machine Learning: Classification

About two years later, stemming from conversations at the March 2016 Rice University Oil & Gas Conference in Houston, our efforts began to emphasize Machine Learning over graph analytics. Driving for results to present at a May 2016 Big Data event at Prairie View A&M University (PVAMU, also in the Houston area), a textbook example (literally!) taken from the pages of an O’Reilly book on Learning Spark showed some promise in allowing Jim and me to classify tweets – with hammy tweets encapsulating something deemed geophysically interesting, whereas spammy ones not so much. ‘Not so much’ was determined through supervised learning – in other words, results reported were achieved after a manual classification of tweets for the purpose of training the Machine Learning models. The need for manual training and the absence of semantics struck the two of us as ‘lacking’ from the outset; more specifically, each tokenized word of each tweet was represented as a feature vector – stated differently, data and metadata (e.g., Twitter handles, URLs) were all represented with the same (lacking) degree of semantic expression. Based upon our experience with knowledge-representation frameworks, we immediately sought a semantically richer solution.
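The supervised ham-versus-spam idea can be sketched in plain Python. This is not the Spark code we ran, nor the book’s example. It assumes hashed term-frequency features fed to logistic regression trained by stochastic gradient descent. The feature-space size, learning rate, epoch count and the six hand-labelled training tweets are all hypothetical.

```python
import math
import zlib

NUM_FEATURES = 256  # assumed size of the hashed feature space


def featurize(text):
    """Map a tweet to a hashed term-frequency vector (the 'hashing trick')."""
    vec = [0.0] * NUM_FEATURES
    for token in text.lower().split():
        # crc32 is stable across runs, unlike Python's built-in hash().
        vec[zlib.crc32(token.encode("utf-8")) % NUM_FEATURES] += 1.0
    return vec


def sigmoid(z):
    """Numerically safe logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def train(samples, epochs=200, lr=0.1):
    """Fit logistic-regression weights by plain stochastic gradient descent."""
    weights = [0.0] * NUM_FEATURES
    bias = 0.0
    for _ in range(epochs):
        for text, label in samples:
            x = featurize(text)
            p = sigmoid(bias + sum(w * xi for w, xi in zip(weights, x)))
            err = p - label  # gradient of the log-loss w.r.t. the logit
            bias -= lr * err
            weights = [w - lr * err * xi for w, xi in zip(weights, x)]
    return weights, bias


def predict(model, text):
    """Return the estimated probability that a tweet is 'ham'."""
    weights, bias = model
    x = featurize(text)
    return sigmoid(bias + sum(w * xi for w, xi in zip(weights, x)))


# Hypothetical, manually labelled training set: 1 = ham, 0 = spam.
TRAINING = [
    ("magnitude 7.5 earthquake strikes offshore", 1),
    ("tsunami warning issued for coastal areas", 1),
    ("strong shaking felt sea level rising fast", 1),
    ("win a free phone click this link now", 0),
    ("earthquake sale prices slashed click now", 0),
    ("follow me for free followers and likes", 0),
]

model = train(TRAINING)
```

Note that every token, whether a word, a handle or a URL, lands in the same kind of hashed bucket, which is exactly the lack of semantic distinction described above.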

Machine Learning: Natural Language Processing

It wasn’t until after I’d made a presentation at GTC 2017 in Silicon Valley the following year that the idea of representing words as embedded vectors would register with me. Working with Jim, I made two unconventional choices – namely, GloVe over word2vec and PyTorch over TensorFlow. Whereas academic articles justified our choice of Stanford’s GloVe, the case for PyTorch was made on less-rigorous grounds – grounds expounded in my GTC presentation and our soon-to-be published book chapter.

Our uptake of GloVe and PyTorch addressed our scientific imperative, as results were obtained for the 2017 instantiation of the same HPCS conference where this idea of tsunami alerting (based upon data extracted from Twitter) was originally hatched. In employing Natural Language Processing (NLP), via embedded word vectors, Jim and I were able to quantitatively explore tweets as word-based time series based upon their co-occurrences – stated differently, this word-vector quantification is based upon ‘the company’ (usage associations) that words ‘keep’. By referencing the predigested corpora available from the GloVe project, we were able to explore “earthquake” and “tsunami” in terms of distances, analogies and various kinds of similarities (e.g., cosine similarity).
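For readers unfamiliar with cosine similarity between word vectors, a minimal Python sketch follows. It assumes the plain-text layout used by the pretrained GloVe files (a word followed by its space-separated components on each line). The three-dimensional vectors are invented for illustration and are not real GloVe values, which have tens to hundreds of dimensions. The helper names are hypothetical.

```python
import math


def parse_glove_lines(lines):
    """Parse 'word v1 v2 ... vn' lines into a {word: [floats]} dictionary."""
    vectors = {}
    for line in lines:
        parts = line.rstrip().split(" ")
        if len(parts) < 2:
            continue
        vectors[parts[0]] = [float(x) for x in parts[1:]]
    return vectors


def cosine_similarity(u, v):
    """Cosine of the angle between u and v; 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def most_similar(vectors, word, top_n=3):
    """Rank the other words in the vocabulary by cosine similarity to `word`."""
    target = vectors[word]
    scored = [
        (other, cosine_similarity(target, vec))
        for other, vec in vectors.items()
        if other != word
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:top_n]


# Toy vectors, invented for illustration only.
TOY_LINES = [
    "earthquake 0.9 0.1 0.2",
    "tsunami 0.8 0.3 0.1",
    "aftershock 0.85 0.05 0.3",
    "coffee 0.0 0.9 0.8",
]

vectors = parse_glove_lines(TOY_LINES)
neighbours = most_similar(vectors, "earthquake")
```

With real pretrained vectors, parse_glove_lines could be handed an open file object in place of TOY_LINES.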

Event-Reanalysis Examples

Our NLP approach appeared promising enough that we closed out 2017 with a presentation of our findings to date during an interdisciplinary session on tsunami science at the Fall Meeting of the American Geophysical Union held in New Orleans. To emphasize the scientific applicability of our approach, Jim and I focused on reanalyzing two pairs of events (see Slide 10 here). Like the pair identified years previously in the 2006 Scientific American article, the more-recent event pairs we chose included earthquake-only plus tsunamigenic events originating in close geographic proximity, with similar oceanic and tectonic settings.

The most-promising results we reported (see slides 11 and 12 here and below) involved those cosine similarities obtained for earthquake-only versus tsunamigenic events; evident via clustering, the approach appears able to discriminate between the two classes of events based upon data extracted from Twitter. Even in our own estimation however, the clustering is weakly discriminating at best, and we expect to apply more-advanced approaches for NLP to further separate classes of events.
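One way such an event-level comparison could be set up is sketched below in Python. It is not the analysis behind the slides. It assumes that each tweet is reduced to the mean of its known word vectors, and that an event is scored by the average cosine similarity of its tweets to a reference word. The two-dimensional vectors and the tweets are invented, and the function names are hypothetical.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between u and v; 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def mean_vector(tokens, vectors):
    """Average the vectors of known tokens; return None if none are known."""
    known = [vectors[t] for t in tokens if t in vectors]
    if not known:
        return None
    dim = len(known[0])
    return [sum(vec[i] for vec in known) / len(known) for i in range(dim)]


def event_score(tweets, vectors, reference):
    """Mean cosine similarity between pooled tweet vectors and a reference word."""
    ref = vectors[reference]
    scores = []
    for tweet in tweets:
        pooled = mean_vector(tweet.lower().split(), vectors)
        if pooled is not None:
            scores.append(cosine_similarity(pooled, ref))
    return sum(scores) / len(scores) if scores else 0.0


# Toy 2-d vectors and tweets, invented for illustration only.
TOY_VECTORS = {
    "earthquake": [0.9, 0.1],
    "shaking": [0.8, 0.2],
    "tsunami": [0.2, 0.9],
    "wave": [0.1, 0.8],
    "water": [0.3, 0.7],
}
EARTHQUAKE_ONLY = ["earthquake shaking everywhere", "strong shaking earthquake"]
TSUNAMIGENIC = ["earthquake then tsunami wave", "water rising tsunami"]

score_quake = event_score(EARTHQUAKE_ONLY, TOY_VECTORS, "tsunami")
score_tsunami = event_score(TSUNAMIGENIC, TOY_VECTORS, "tsunami")
```

In this toy setting the tsunamigenic event scores higher against "tsunami" than the earthquake-only event does; real tweets are far noisier, which is consistent with the weak discrimination noted above.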

Agile Sprints - Events - 2017 AGU Fall Meeting - Twitter Tsunami - December 8, 2017

Discussion

Ultimately, the ability to further validate and operationally deploy this alerting mechanism would require the data from Twitter be streamed and processed in real time – a challenge that some containerized implementation of Apache Spark would seem ideally suited to, for example. (Aspects of this Future Work are outlined in the final section of our HPCS 2017 book chapter.)

When it comes to tsunamis, alerting remains a challenge – especially in those parts of the planet under-serviced by networks of tsunameters … and even seismometers, tide gauges, etc. Thus prospects for enhancing the alerting capabilities remain valuable and warranted. Even though inherently fraught with subjectivity, data extracted from Twitter streams in real time appear to hold some promise for providing a data source that complements the objective output from scientific instrumentation. Our approach, based upon Machine Learning via NLP, has demonstrated promising-enough early signs of success that ‘further research is required’. Given that this initiative has already benefited from useful discussions at conferences, suggestions are welcome, as it’s clear that even NLP has a lot more to offer beyond embedded word vectors.

On Knowledge-Based Representations for Actionable Data …

I bumped into a professional acquaintance last week. After describing briefly a presentation I was about to give, he offered to broker introductions to others who might have an interest in the work I’ve been doing. To initiate the introductions, I crafted a brief description of what I’ve been up to for the past 5 years in this area. I’ve also decided to share it here as follows: 

As always, [name deleted], I enjoyed our conversation at the recent AGU meeting in Toronto. Below, I’ve tried to provide some context for the work I’ve been doing in the area of knowledge representations over the past few years. I’m deeply interested in any introductions you might be able to broker with others at York who might have an interest in applications of the same.

Since 2004, I’ve been interested in expressive representations of data. My investigations started with a representation of geophysical data in the eXtensible Markup Language (XML). Although this was successful, use of the approach exposed an oversight: the importance of metadata (data about data). To address this oversight, a subsequent effort introduced a relationship-centric representation via the Resource Description Framework (RDF). RDF, by the way, forms the underpinnings of the next-generation Web – variously known as the Semantic Web, Web 3.0, etc. In addition to taking care of issues around metadata, use of RDF paved the way for increasingly expressive representations of the same geophysical data. For example, to represent features in and of the geophysical data, an RDF-based scheme for annotation was introduced using XML Pointer Language (XPointer). Somewhere around this point in my research, I placed all of this into a framework.
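As a rough illustration of moving from XML to a relationship-centric form, the Python sketch below turns a small XML record into subject-predicate-object triples and serializes them as N-Triples. The XML record, its element names and the example.org namespace are all invented; this is not the framework’s actual extraction step.

```python
import xml.etree.ElementTree as ET

# Hypothetical XML record; element names and values are invented.
SAMPLE_XML = """
<station id="XX">
  <name>Example Station</name>
  <instrument>gravimeter</instrument>
  <sampling unit="s">1</sampling>
</station>
"""

BASE = "http://example.org/geo/"  # placeholder namespace


def xml_to_triples(xml_text, base=BASE):
    """Turn each child element of the root into one (subject, predicate, object).

    Attributes on child elements (e.g. unit="s") are ignored in this sketch.
    """
    root = ET.fromstring(xml_text.strip())
    subject = f"{base}{root.tag}/{root.get('id')}"
    triples = []
    for child in root:
        predicate = f"{base}vocab#{child.tag}"
        triples.append((subject, predicate, (child.text or "").strip()))
    return triples


def to_ntriples(triples):
    """Serialize triples with literal objects in N-Triples syntax."""
    lines = []
    for s, p, o in triples:
        escaped = o.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")
        lines.append(f'<{s}> <{p}> "{escaped}" .')
    return "\n".join(lines)


triples = xml_to_triples(SAMPLE_XML)
ntriples = to_ntriples(triples)
```

Each triple states one relationship explicitly, whereas the XML nesting only implies it.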

A data-centric framework for knowledge representation.


 In addition to applying my Semantic Framework to use cases in Internet Protocol (IP) networking, I’ve continued to tease out increasingly expressive representations of data. Most recently, these representations have been articulated in RDFS – i.e., RDF Schema. And although I have not reached the final objective of an ontological representation in the Web Ontology Language (OWL), I am indeed progressing in this direction. (Whereas schemas capture the vocabulary of an application domain in geophysics or IT, for example, ontologies allow for knowledge-centric conceptualizations of the same.)  

From niche areas of geophysics to IP networking, the Semantic Framework is broadly applicable. As a workflow for systematically enhancing the expressivity of data, the Framework is based on open standards emerging largely from the World Wide Web Consortium (W3C). Because there is significant interest in this next-generation Web from numerous parties and angles, implementation platforms allow for increasingly expressive representations of data today. In making data actionable, the ultimate value of the Semantic Framework is in providing a means for integrating data from seemingly incongruous disciplines. For example, such representations are actually responsible for providing new results – derived by querying the representation through SPARQL, an SQL-like query language for RDF.
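The kind of cross-discipline query this enables can be mimicked in a few lines of Python. The sketch below is not SPARQL and does not use an RDF store; it evaluates a conjunction of triple patterns over an in-memory list of triples, with strings beginning with "?" acting as variables. The prefixed names and the data are invented.

```python
def is_var(term):
    """Variables are strings that start with '?', as in SPARQL."""
    return isinstance(term, str) and term.startswith("?")


def match_pattern(pattern, triple, bindings):
    """Unify one triple pattern with one triple; return new bindings or None."""
    new = dict(bindings)
    for p_term, t_term in zip(pattern, triple):
        if is_var(p_term):
            if new.setdefault(p_term, t_term) != t_term:
                return None  # variable already bound to something else
        elif p_term != t_term:
            return None  # constant term does not match
    return new


def query(triples, patterns):
    """Evaluate a conjunction of triple patterns (a basic graph pattern)."""
    solutions = [{}]
    for pattern in patterns:
        next_solutions = []
        for bindings in solutions:
            for triple in triples:
                extended = match_pattern(pattern, triple, bindings)
                if extended is not None:
                    next_solutions.append(extended)
        solutions = next_solutions
    return solutions


# Invented triples mixing a geophysics vocabulary with a networking one.
TRIPLES = [
    ("geo:quake1", "rdf:type", "geo:Earthquake"),
    ("geo:quake1", "ex:locatedIn", "ex:siteA"),
    ("geo:quake2", "rdf:type", "geo:Earthquake"),
    ("geo:quake2", "ex:locatedIn", "ex:siteB"),
    ("net:router7", "rdf:type", "net:Router"),
    ("net:router7", "ex:locatedIn", "ex:siteA"),
]

# Roughly: SELECT ?event ?device WHERE {
#   ?event rdf:type geo:Earthquake . ?event ex:locatedIn ?place .
#   ?device ex:locatedIn ?place .    ?device rdf:type net:Router . }
PATTERNS = [
    ("?event", "rdf:type", "geo:Earthquake"),
    ("?event", "ex:locatedIn", "?place"),
    ("?device", "ex:locatedIn", "?place"),
    ("?device", "rdf:type", "net:Router"),
]

results = query(TRIPLES, PATTERNS)
```

The shared ?place variable is what joins the two vocabularies: neither the geophysical nor the networking triples alone relate an earthquake to a router.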

I’ve spoken formally and informally about this research to audiences in the sciences, IT, and elsewhere. With York co-authors spanning academic and non-academic staff, I’ve also published four refereed journal papers on aspects of the Framework, and have an invited book chapter currently under review – interestingly, this chapter has been contributed to a book focusing on data management in the Semantic Web. Of course, I’d be pleased to share any of my publications and discuss aspects of this work with those finding it of interest.

With thanks in advance for any connections you’re able to facilitate, Ian. 

If anything comes of this, I’m sure I’ll write about it here – eventually!

In the meantime, feedback is welcome.

Annotation Modeling: To Appear in Comp & Geosci

What a difference a day makes!
Yesterday I learned that my paper on semantic platforms was rejected.
Today, however, the news was better, as a manuscript on annotation modeling was accepted for publication.
It’s been a long road for this paper:

The abstract of the paper is as follows:

Annotation Modeling with Formal Ontologies: Implications for Informal Ontologies

L. I. Lumb[1], J. R. Freemantle[2], J. I. Lederman[2] & K. D. Aldridge[2]
[1] Computing and Network Services, York University, 4700 Keele Street, Toronto, Ontario, M3J 1P3, Canada
[2] Earth & Space Science and Engineering, York University, 4700 Keele Street, Toronto, Ontario, M3J 1P3, Canada
Knowledge representation is increasingly recognized as an important component of any cyberinfrastructure (CI). In order to expediently address scientific needs, geoscientists continue to leverage the standards and implementations emerging from the World Wide Web Consortium’s (W3C) Semantic Web effort. In an ongoing investigation, previous efforts have been aimed towards the development of a semantic framework for the Global Geodynamics Project (GGP). In contrast to other efforts, the approach taken has emphasized the development of informal ontologies, i.e., ontologies that are derived from the successive extraction of Resource Description Format (RDF) representations from eXtensible Markup Language (XML), and then Web Ontology Language (OWL) from RDF. To better understand the challenges and opportunities for incorporating annotations into the emerging semantic framework, the present effort focuses on knowledge-representation modeling involving formal ontologies. Although OWL’s internal mechanism for annotation is constrained to ensure computational completeness and decidability, externally originating annotations based on the XML Pointer Language (XPointer) can easily violate these constraints. Thus the effort of modeling with formal ontologies allows for recommendations applicable to the case of incorporating annotations into informal ontologies.

I expect the whole paper will be made available in the not-too-distant future …

AGU Poster: Relationship-Centric Ontology Integration

Later today in San Francisco, at the 2007 Fall Meeting of the American Geophysical Union (AGU), one of my co-authors will be presenting our poster entitled “Relationship-Centric Ontology Integration” (abstract).

This poster will appear in a session that I co-convened and that is described elsewhere.

A PDF-version of the poster is available elsewhere (agu07_the_poster_v2.pdf).

Earth and Space Science Informatics at the 2007 Fall Meeting of the American Geophysical Union

In a previous post, I referred to Earth Science Informatics as a discipline-in-the-making.

To support this claim, I cited a number of data points. And of these data points, the 2006 Fall Meeting of the American Geophysical Union (AGU) stands out as a key enabler.

With 22 sessions posted, the 2007 Fall Meeting of the AGU is well primed to further enable the development of this discipline.

Because I’m a passionate advocate of this intersection between the Earth Sciences and Informatics, I’m involved in convening three of the 22 Earth and Space Science Informatics sessions:

I encourage you to take a moment to review the calls for participation for these three, as well as the other 19, sessions in Earth and Space Science Informatics at the 2007 Fall Meeting of the AGU.