Towards Tsunami Informatics: Applying Machine Learning to Data Extracted from Twitter

2018 Sulawesi Earthquake & Tsunami

Even in 2018, our ability to provide accurate tsunami advisories and warnings is exceedingly challenged.

In best-case scenarios, advisories and warnings afford inhabitants of low-lying coastal areas minutes or (hopefully) longer to react.

In best-case scenarios, advisories and warnings are based upon in situ measurements via tsunameters – as ocean-bottom changes in seawater pressure serve as reliable precursors for impending tsunami arrival. (By way of analogy, tsunameters ‘see’ tsunamis as do radars ‘see’ precipitation. Based on ‘sight’ then, both offer a reasonable ability to ‘nowcast’.)

In typical scenarios, however, advisories and warnings can communicate mixed messages. In the case of the recent Sulawesi earthquake and tsunami for example, a nearby alert (for the Makassar Strait) was retracted after some 30 minutes, even though Palu, Indonesia experienced a ‘localized’ tsunami that resulted in significant losses – with current estimates placing the number of fatalities at more than 1200 people.

With ultimate regret stemming from significant loss of human life, the recent case for the residents of Palu is particularly painful, as alerting was not informed by tsunameter measurements owing to an ongoing dispute – an unresolved dispute that rendered the deployment of an array of tsunameters incomplete and inoperable. A dispute that, if resolved, could’ve provided this low-lying coastal area with accurate and potentially life-saving alerts.

Lessons from Past Events

It’s been only 5,025 days since the last tsunami affected Indonesia – the also devastating Boxing Day 2004 event in the Indian Ocean. All things considered, it’s truly wonderful that a strategic effort to deploy a network of tsunameters in this part the planet was in place; of course, it’s well beyond tragic that execution of the project was significantly hampered, and that almost 14 years later, inhabitants of this otherwise idyllic setting are left to suffer loss of such epic proportions.

I’m a huge proponent of tsunameters as last-resort, yet-accurate indicators for tsunami alerting. In their absence, the norm is for advisories and warnings that may deliver accurate alerts – “may” being the operative word here, as it often the case that alerts are issued only to be retracted at some future time … as was the case again for the recent Sulawesi event. Obviously, tsunami centers that ‘cry wolf’, run the risk of not being taken seriously – seriously, perhaps, in the case when they have correctly predicted an event of some significance.

It’s not that those scientific teams of geographers, geologists, geophysicists, oceanographers and more are in any way lax in attempting to do their jobs; it’s truly that the matter of tsunami prediction is exceedingly difficult. For example, unless you caught the January 2006 issue of Scientific American as I happened to, you’d likely be unaware that 4,933 days ago an earthquake affected (essentially) the same region as the Boxing Day 2004 event; regarded as a three-month-later aftershock, this event of similar earthquake magnitude and tectonic setting did not result in a tsunami.

Writing in this January 2006 issue of Scientific American, Geist et al. compared the two Indian Ocean events side-by-side – using one of those diagrams that this magazine is lauded for. The similarities between the two events are compelling. The seemingly subtle differences, however, are much more than compelling – as the tsunami-producing earlier of the two events bears testimony.

As a student of theoretical, global geophysics, but not specifically oceanography, seismology, tectonophysics or the like, I was unaware of the ‘shocking differences’ between these two events. However, my interest was captivated instantaneously!

Towards Tsunami Informatics

Graph Analytics?

It would take, however, some 3,000 days for my captivated interest to be transformed into a scientific communication. On the heels of successfully developing a framework and platform for knowledge representation with long-time friend and collaborator Jim Freemantle and others, our initial idea was to apply graph analytics to data extracted from Twitter – thus acknowledging that Twitter has the potential to serve as a source of data that might be of value in the context of tsunami alerting.

In hindsight, it’s fortunate that Jim and I did not spend a lot of time on the graph-analytics approach. In fact, arguably the most-valuable outcome from the poster we presented at a computer-science conference in June 2014 (HPCS, Halifax, Nova Scotia), was Jim’s Perl script (see, e.g., Listing 1 of our subsequent unpublished paper, or Listing 1.1 of our soon-to-be published book chapter) that extracted keyword-specified data (e.g., “#earthquake”) from Twitter streams.

Machine Learning: Classification

About two years later, stemming from conversations at the March 2016 Rice University Oil & Gas Conference in Houston, our efforts began to emphasize Machine Learning over graph analytics. Driving for results to present at a May 2016 Big Data event at Prairie View A&M University (PVAMU, also in the Houston area), a textbook example (literally!) taken from the pages of an O’Reilly book on Learning Spark showed some promise in allowing Jim and I to classify tweets – with hammy tweets encapsulating something deemed geophysically interesting, whereas spammy ones not so much. ‘Not so much’ was determined through supervised learning – in other words, results reported were achieved after a manual classification of tweets for the purpose of training the Machine Learning models. The need for manual training, and absence of semantics struck the two of us as ‘lacking’ from the outset; more specifically, each tokenized word of each tweet was represented as a feature vector – stated differently, data and metadata (e.g., Twitter handles, URLs) were all represented with the same (lacking) degree of semantic expression. Based upon our experience with knowledge-representation frameworks, we immediately sought a semantically richer solution.

Machine Learning: Natural Language Processing

It wasn’t until after I’d made a presentation at GTC 2017 in Silicon Valley the following year that the idea of representing words as embedded vectors would register with me. Working with Jim, two unconventional choices were made – namely, GloVe over word2vec and PyTorch over TensorFlow. Whereas academic articles justified our choice of Stanford’s GloVe, the case for PyTorch was made on less-rigorous grounds – grounds expounded in my GTC presentation and our soon-to-be published book chapter.

Our uptake of GloVe and PyTorch addressed our scientific imperative, as results were obtained for the 2017 instantiation of the same HPCS conference where this idea of tsunami alerting (based upon data extracted from Twitter) was originally hatched. In employing Natural Language Processing (NLP), via embedded word vectors, Jim and I were able to quantitatively explore tweets as word-based time series based upon their co-occurrences – stated differently, this word-vector quantification is based upon ‘the company’ (usage associations) that words ‘keep’. By referencing the predigested corpora available from the GloVe project, we were able to explore “earthquake” and “tsunami” in terms of distances, analogies and various kinds of similarities (e.g., cosine similarity).

Event-Reanalysis Examples

Our NLP approach appeared promising enough that we closed out 2017 with a presentation of our findings to date during an interdisciplinary session on tsunami science at the Fall Meeting of the American Geophysical Union held in New Orleans. To emphasize the scientific applicability of our approach, Jim and I focused on reanalyzing two-pairs of events (see Slide 10 here). Like the pair identified years previously in the 2006 Scientific American article, the more-recent event pairs we chose included earthquake-only plus tsunamigenic events originating in close geographic proximity, with similar oceanic and tectonic settings.

The most-promising results we reported (see slides 11 and 12 here and below) involved those cosine similarities obtained for earthquake-only versus tsunamigenic events; evident via clustering, the approach appears able to discriminate between the two classes of events based upon data extracted from Twitter. Even in our own estimation however, the clustering is weakly discriminating at best, and we expect to apply more-advanced approaches for NLP to further separate classes of events.

Agile Sprints - Events - 2017 AGU Fall Meeting - Twitter Tsunami - December 8, 2017

Discussion

Ultimately, the ability to further validate and operationally deploy this alerting mechanism would require the data from Twitter be streamed and processed in real time – a challenge that some containerized implementation of Apache Spark would seem ideally suited to, for example. (Aspects of this Future Work are outlined in the final section of our HPCS 2017 book chapter.)

When it comes to tsunamis, alerting remains a challenge – especially in those parts of the planet under-serviced by networks of tsunameters … and even seismometers, tide gauges, etc. Thus prospects for enhancing the alerting capabilities remain valuable and warranted. Even though inherently fraught with subjectivity, data extracted from streamed Twitter data in real time appears to hold some promise for providing a data source that compliments the objective output from scientific instrumentation. Our approach, based upon Machine Learning via NLP, has demonstrated promising-enough early signs of success that ‘further research is required’. Given that this initiative has already benefited from useful discussions at conferences, suggestions are welcome, as it’s clear that even NLP has a lot more to offer beyond embedded word vectors.

How I Ended Up in Geophysical Fluid Dynamics

How I Ended Up in Geophysical Fluid Dynamics

Lately, I’ve been disclosing the various biases I bring to practicing and enabling Data Science. Motivated by my decision to (finally) self-curate an online, multimedia portfolio, I felt such biases to be material in providing the context that frames this effort. Elsewhere, I’ve shared my inherently scientific bias. In this post, I want to provide additional details. These details I’ve been able to extract verbatim from a blog post I wrote for Bright Computing in January 2015; once I’d settled on geophysics (see below), I aspired to be a seismologist … but, as you’ll soon find out, things didn’t pan out quite the way I’d expected:

I always wanted to be a seismologist.

Scratch that: I always wanted to be an astronaut. How could I help it? I grew up in suburban London (UK, not Ontario) watching James Burke cover the Apollo missions. (Guess I’m also revealing my age here!)

Although I never gave my childhood dream of becoming an astronaut more than a fleeting consideration, I did pursue a career in science.

As my high-school education drew to a close, I had my choices narrowed down to being an astronomer, geophysicist or a nuclear physicist. In grade 12 at Laurier Collegiate in Scarboro (Ontario, not UK … or elsewhere), I took an optional physics course that introduced me to astronomy and nuclear physics. And although I was taken by both subjects, and influenced by wonderful teachers, I dismissed both of these as areas of focus in university. As I recall, I had concerns that I wouldn’t be employable if I had a degree in astronomy, and I wasn’t ready to confront the ethical/moral/etc. dilemmas I expected would accompany a choice of nuclear physics. Go figure!

And so it was to geophysics I was drawn, again influenced significantly by courses in physical geography taught by a wonderful teacher at this same high school. My desire to be a seismologist persisted throughout my undergraduate degree at Montreal’s McGill Universitywhere I ultimately graduated with a B.Sc. in solid Earth geophysics. Armed with my McGill degree, I was in a position to make seismology a point of focus.

But I didn’t. Instead, at Toronto’s York University, I applied Geophysical Fluid Dynamics (GFD) to Earth’s deep interior – mostly Earth’s fluid outer core. Nothing superficial here (literally), as the core only begins some 3,000 km below where we stand on the surface!

Full disclosure: In graduate school, the emphasis was GFD. However, seismology crept in from time to time. For example, I made use of results from deep-Earth seismology in estimating the viscosity of Earth’s fluid outer core. Since this is such a deeply remote region of our planet, geophysicists need to content themselves with observations accessible via seismic and other methods.

From making use of Apache Spark to improve the performance of seismic processing (search for “Reverse-Time Seismic Migration” or “RTM” in my Portfolio), to the analysis of ‘seismic data’ extracted from Twitter (search for “Twitter”in my Portfolio), seismology has taken center stage in a number of my projects as a practitioner of Data Science. However, so has the geophysical fluid dynamics of Earth’s mantle and outer core. Clearly, you can have your geeky cake and eat it too!

Annotation Modeling: To Appear in Comp & Geosci

What a difference a day makes!
Yesterday I learned that my paper on semantic platforms was rejected.
Today, however, the news was better as a manuscript on annotation modeling was
accepted for publication.
It’s been a long road for this paper:

The abstract of the paper is as follows:

Annotation Modeling with Formal Ontologies:
Implications for Informal Ontologies

L. I. Lumb[1], J. R. Freemantle[2], J. I. Lederman[2] & K. D.
Aldridge[2]
[1] Computing and Network Services, York University, 4700 Keele Street,
Toronto, Ontario, M3J 1P3, Canada
[2] Earth & Space Science and Engineering, York University, 4700 Keele
Street, Toronto, Ontario, M3J 1P3, Canada
Knowledge representation is increasingly recognized as an important component of any cyberinfrastructure (CI). In order to expediently address scientific needs, geoscientists continue to leverage the standards and implementations emerging from the World Wide Web Consortium’s (W3C) Semantic Web effort. In an ongoing investigation, previous efforts have been aimed towards the development of a semantic framework for the Global Geodynamics Project (GGP). In contrast to other efforts, the approach taken has emphasized the development of informal ontologies, i.e., ontologies that are derived from the successive extraction of Resource Description Format (RDF) representations from eXtensible Markup Language (XML), and then Web Ontology Language (OWL) from RDF. To better understand the challenges and opportunities for incorporating annotations into the emerging semantic framework, the present effort focuses on knowledge-representation modeling involving formal ontologies. Although OWL’s internal mechanism for annotation is constrained to ensure computational completeness and decidability, externally originating annotations based on the XML Pointer Language (XPointer) can easily violate these constraints. Thus the effort of modeling with formal ontologies allows for recommendations applicable to the case of incorporating annotations into informal ontologies.

I expect the whole paper will be made available in the not-too-distant future …

Earth and Space Science Informatics at the 2007 Fall Meeting of the American Geophysical Union

In a previous post, I referred to Earth Science Informatics as a discipline-in-the-making.

To support this claim, I cited a number of data points. And of these data points, the 2006 Fall Meeting of the American Geophysical Union (AGU) stands out as a key enabler.

With 22 sessions posted, the 2007 Fall Meeting of the AGU is well primed to further enable the development of this discipline.

Because I’m a passionate advocate of this intersection between the Earth Sciences and Informatics, I’m involved in convening three of the 22 Earth and Space Science Informatics sessions:

I encourage you to take a moment to review the calls for participation for these three, as well as the other 19, sessions in Earth and Space Science Informatics at the 2007 Fall Meeting of the AGU.

Digital Terrain Mapping via LIDAR

From the purely scientific (ozone-column mapping, imaging hydrometeors in clouds) to commercial (on-board detection of clear air turbulence, CAT), my exposure to LIDAR applications has been primarily atmospheric.

Of course, other applications of LIDAR technology exist, and one of these is Digital Terrain Mapping (DTM).

Terra Remote Sensing Inc. is a leader in LIDAR-based DTM. Particularly impressive is their ability to perform surface DTM in areas of dense vegetation. As I learned at a very recent meeting of the Ontario Association of Remote Sensing (OARS), Terra has already found a number of very practical applications for LIDAR-based DTM.

Some additional applications that come to mind are:

  • DTM of urban canopies for atmospheric experiments – Terra has already mapped buildings for various purposes. The same approach could be used to better ground (sorry 😉 atmospheric experiments. For example, the boundary-layer modeling that was conducted for Joint Urban 2003 (JU03) employed a digitization of Oklahoma City. A LIDAR-based DTM would’ve made this an even-more realistic effort.
  • Monitoring the progress of Global Change in the Arctic – In addition to LIDAR-based DTM, Terra is also having some success characterizing surfaces based on LIDAR intensity measurements. Because open water and a glacier would be expected to have different DTM and intensity characteristics, Terra should also be able to monitor Global Change as nunataks are progressively transformed into traditional islands (land isolated and surrounded by open water). With the Arctic as a bellwether for Global Change, it’s not surprising that the nunatak-to-island transformation is getting attention.

Although my additional examples are (once again) atmospheric in nature, as Terra is demonstrating, there are numerous applications for LIDAR-based technologies.

Annotation Paper Submitted to HPCS 2007 Event

I’ve blogged and presented recently (locally and at an international scientific event) on the topic of annotation and knowledge representation.

Working with co-authors Jerusha Lederman, Jim Freemantle and Keith Aldridge, a written version of the recent AGU presentation has been prepared and submitted to the HPCS 2007 event. The abstract is as follows:

Semantically Enabling the Global Geodynamics Project:
Incorporating Feature-Based Annotations via XML Pointer Language (XPointer)

Earth Science Markup Language (ESML) is efficient and effective in representing scientific data in an XML-based formalism. However, features of the data being represented are not accounted for in ESML. Such features might derive from events, identifications, or some other source. In order to account for features in an ESML context, they are considered from the perspective of annotation. Although it is possible to extend ESML to incorporate feature-based annotations internally, there are complicating factors identified that apply to ESML and most XML dialects. Rather than pursue the ESML-extension approach, an external representation for feature-based annotations via XML Pointer Language (XPointer) is developed. In previous work, it has been shown that it is possible to extract relationships from ESML-based representations, and capture the results in the Resource Description Format (RDF). Application of this same requirement to XPointer-based annotations of ESML representations results in a revised semantic framework for the Global Geodynamics Project (GGP).

Once the paper is accepted, I’ll make a pre-submission version available online.

Because the AGU session I participated in has also issued a call for papers, I’ll be extending the HPCS 2007 submission in various interesting ways.

And finally, thoughts are starting to gel on how annotations may be worked into the emerging notions I’ve been having on knowledge-based heuristics.

Stay tuned.

Genetic Aesthetics: Generative Software Meets Genetic Algorithms

I’m still reading Cloninger’s book, and just read a section on Generative Software (GS) – software used by contemporary designers to “… automate an increasingly large portion of the creative process.” As implied by the name, GS can produce a tremendous amount of output. It’s then up to the designer to be creatively stimulated as they sift through the GS output.

As I was reading Cloninger’s description, I couldn’t help but make my own connections with Genetic Algorithms (GAs). I’ve seen GAs applied in the physical sciences. For example, GAs can be used to generate models to fit data. The scientist provides an ancestor (a starting model), and then variations are derived through genetic processes such as mutation. Only the models with appropriate levels of fitness survive subsequent generations. Ultimately, what results is the best (i.e., most fit) model that explains the data according to the GA process.

In an analogous way, this is also what happens with the output from GS. Of course, in the GS case, it is the designer her/himself who determines what survives according to their own criteria.

The GS-GA connection is even stronger than my own association may cause you to believe.

In interviewing Joshua Davis for his book, Cloninger states:

At one point, you talked about creating software that would parse through the output of your generative software and select the iterations you were most likely to choose.

Davis responds:

That’s something [programmer] Branden Hall and I worked on called Genetic Aesthetic. It uses a neural network and genetic algorithms to create a “hot or not” situation. It says, “Rate this composition I generated on a scale from 1 to 10.” If I give it a 1, it says, “This isn’t beautiful. I should look at what kind of numbers were generated in this iteration and record those as unfavorable.” You have to train the software. Because the process is based on variables and numbers, over a very short period of time it’s able to learn what numbers are unsatisfactory and what numbers are satisfactory to that individual human critic. It changes per individual.

That certainly makes the GS-GA connection explicit and poetic, Genetic Aesthetic – I like that!

I’ve never worked with GAs. However, I did lead a project at KelResearch where our objective was to classify hydrometeors (i.e., raindrops, snowflakes, etc.). The hydrometeors were observed in situ by a sensor deployed on the wing of an airplane. Data was collected as the plane flew through winter storms. (Many of these campaigns were spearheaded by Prof. R. E. Stewart.) What we attempted to do was automate the classification of the hydrometeors on the basis of their shape. More specifically, we attempted to estimate the fractal dimension of each observed hydrometeor in the hopes of providing at automated classification scheme. Although this was feasible in principle, the resolution offered by the sensor made this impractical. Nonetheless, it was a interesting opportunity for me to personally explore the natural Genetic Aesthetics afforded by Canadian winter storms!

Annotating at the AGU

Perhaps two years ago, it was a challenge to find appropriate sessions at the American Geophysical Union Fall Meeting for submissions that addressed the intersection between geophysics and knowledge representation.

A year ago, there were quite a few to choose from.

This year, I was almost overwhelmed by choice.

I ended up selecting the “Earth and Space Science Cyberinfrastructure: Application and Theory of Knowledge Representation” session in the “Earth and Space Science Informatics” section. The work I intend to present, co-authored with Jerusha Lederman and Keith Aldridge also of York University, is described via an abstract elsewhere. I’ll need to prepare well as I’m presenting in good company and have only 15 minutes!

The makings for a productive and stimulating meeting are clearly present.

And for a Canadian in December, it’s pretty difficult not to enjoy the Bay Area!

Alternate Mechanism for Earth’s Magnetic Field

About 3,000 km beneath our feet lies Earth’s third ocean. Quite unlike the water-based first and air-based second, this third ocean is an iron-based alloy. Because this liquid-state alloy (aka. Earth’s fluid/liquid outer core) is an electrical conductor, currents can exist. Owing to a well-known reciprocity between electrical currents and magnetic fields, this third ocean factors significantly in an observable known to any of us surface dwellers who have ever wielded a compass.

Although there’s no question that Earth possesses a magnetic field, there are a number of open scientific questions about this pervasive, natural phenomenon. A number of these questions are directed at the sustainability of this magnetic field over geologically significant timescales. Well evidenced in the geologic record over a few billion years, Earth’s magnetic field requires a self-sustaining mechanism (aka. a geodynamo) to account for its very existence and peculiarities (such as reversals).

The starting point for scientific investigators is that this electrically conducting region is under rotation. (Taken from the appropriate scientific perspective, a perspective that takes into account planetary scales and fluid dynamical properties, it turns out this region is rotating rapidly.) The same Earth’s rotation that causes deflection of trade winds in the atmosphere (via the Coriolis effect) also figures significantly in the energetics of Earth’s magnetic field. Rotational effects alone, however, cannot account for the existence, longevity and peculiarities of Earth’s magnetic field over the visible geologic past.

This conundrum has forced the scientists who study this phenomenon to speculate on mechanisms for Earth’s magnetic field. Many of the suggested mechanisms call from some degree of additional stirring. This additional stirring causes deviations from the otherwise steady state of solid body rotation. Simply put, these deviations cause motion in the electrically conducting fluid that in turn result in magnetic fields.

Since the late 1970s, the favored mechanism for additional stirring has been based on buoyancy. In the case of Earth’s third ocean, buoyancy is thought to result from solidification. More specifically, as Earth’s centremost region (know as the inner core) grows by iron crystallization, the residual light element(s) in the alloy is/are released buoyantly. This combined effect of chemistry and fluid dynamics is thought to result in compositional convection.

Compositional convection, however, can be challenged on a number of fronts. Rather than pursue that here, my present purpose is to relate another mechanism for Earth’s magnetic field. As with the previous mechanism, deviations from an otherwise steady state of solid body rotation are key in this case as well. Rotation enforces cylindrical symmetry. This enforcement even applies when the body that’s under rotation (Earth in this case) has an almost spherical symmetry to it. Almost is definitely the operative word here, as Earth isn’t perfectly spherically symmetric. In fact, Earth is an oblate spheroid that bulges at the equator and is depressed at the poles. This combination results in an opposition of symmetries, cylindrical (owing to Earth’s rotation) versus spheroidal (owing to the boundary that contains Earth’s third ocean). As has been demonstrated experimentally and theoretically, the introduction of deviations in the presence of such symmetry oppositions can cause significant instabilities. In Earth’s case, periodic deviations might originate from precession and/or tides.

Experimental, theoretical, numerical and observational studies of such instabilities have been one of Keith Aldridge‘s research themes for over a decade at Toronto’s York University. In addition to ongoing experimental studies with graduate student Ross Baker, Keith has been collaborating with post-doc David McMillan and I on supportive observational evidence. Because the instabilities we’re all interested in are periodic, we’ve been looking for indirect evidence in historical records of Earth’s magnetic field. Deep-ocean sediments, extracted as drill cores, are proving useful in our attempts to analyze relative variations in Earth’s magnetic field intensity over the past 70,000 years. In short, our analysis of variations in paleointensity allows us to further support fluid-flow instabilities as a viable mechanism for Earth’s magnetic field.

A scientific account of this investigation has recently been accepted for publication in an appropriate journal, Physics of the Earth and Planetary Interiors. A preprint is currently available online.