Towards Tsunami Informatics: Applying Machine Learning to Data Extracted from Twitter

2018 Sulawesi Earthquake & Tsunami

Even in 2018, our ability to provide accurate tsunami advisories and warnings is exceedingly challenged.

In best-case scenarios, advisories and warnings afford inhabitants of low-lying coastal areas minutes or (hopefully) longer to react.

In best-case scenarios, advisories and warnings are based upon in situ measurements via tsunameters – as ocean-bottom changes in seawater pressure serve as reliable precursors for impending tsunami arrival. (By way of analogy, tsunameters ‘see’ tsunamis as do radars ‘see’ precipitation. Based on ‘sight’ then, both offer a reasonable ability to ‘nowcast’.)

In typical scenarios, however, advisories and warnings can communicate mixed messages. In the case of the recent Sulawesi earthquake and tsunami for example, a nearby alert (for the Makassar Strait) was retracted after some 30 minutes, even though Palu, Indonesia experienced a ‘localized’ tsunami that resulted in significant losses – with current estimates placing the number of fatalities at more than 1200 people.

With ultimate regret stemming from significant loss of human life, the recent case for the residents of Palu is particularly painful, as alerting was not informed by tsunameter measurements owing to an ongoing dispute – an unresolved dispute that rendered the deployment of an array of tsunameters incomplete and inoperable. A dispute that, if resolved, could’ve provided this low-lying coastal area with accurate and potentially life-saving alerts.

Lessons from Past Events

It’s been only 5,025 days since the last tsunami affected Indonesia – the also devastating Boxing Day 2004 event in the Indian Ocean. All things considered, it’s truly wonderful that a strategic effort to deploy a network of tsunameters in this part of the planet was in place; of course, it’s well beyond tragic that execution of the project was significantly hampered, and that almost 14 years later, inhabitants of this otherwise idyllic setting are left to suffer loss of such epic proportions.

I’m a huge proponent of tsunameters as last-resort, yet-accurate indicators for tsunami alerting. In their absence, the norm is for advisories and warnings that may deliver accurate alerts – “may” being the operative word here, as it is often the case that alerts are issued only to be retracted at some future time … as was the case again for the recent Sulawesi event. Obviously, tsunami centers that ‘cry wolf’ run the risk of not being taken seriously – seriously, perhaps, in the case when they have correctly predicted an event of some significance.

It’s not that those scientific teams of geographers, geologists, geophysicists, oceanographers and more are in any way lax in attempting to do their jobs; it’s truly that the matter of tsunami prediction is exceedingly difficult. For example, unless you caught the January 2006 issue of Scientific American as I happened to, you’d likely be unaware that 4,933 days ago an earthquake affected (essentially) the same region as the Boxing Day 2004 event; regarded as a three-month-later aftershock, this event of similar earthquake magnitude and tectonic setting did not result in a tsunami.

Writing in this January 2006 issue of Scientific American, Geist et al. compared the two Indian Ocean events side-by-side – using one of those diagrams that this magazine is lauded for. The similarities between the two events are compelling. The seemingly subtle differences, however, are much more than compelling – as the tsunami-producing earlier of the two events bears testimony.

As a student of theoretical, global geophysics, but not specifically oceanography, seismology, tectonophysics or the like, I was unaware of the ‘shocking differences’ between these two events. However, my interest was captivated instantaneously!

Towards Tsunami Informatics

Graph Analytics?

It would take, however, some 3,000 days for my captivated interest to be transformed into a scientific communication. On the heels of successfully developing a framework and platform for knowledge representation with long-time friend and collaborator Jim Freemantle and others, our initial idea was to apply graph analytics to data extracted from Twitter – thus acknowledging that Twitter has the potential to serve as a source of data that might be of value in the context of tsunami alerting.

In hindsight, it’s fortunate that Jim and I did not spend a lot of time on the graph-analytics approach. In fact, arguably the most-valuable outcome from the poster we presented at a computer-science conference in June 2014 (HPCS, Halifax, Nova Scotia), was Jim’s Perl script (see, e.g., Listing 1 of our subsequent unpublished paper, or Listing 1.1 of our soon-to-be published book chapter) that extracted keyword-specified data (e.g., “#earthquake”) from Twitter streams.

Machine Learning: Classification

About two years later, stemming from conversations at the March 2016 Rice University Oil & Gas Conference in Houston, our efforts began to emphasize Machine Learning over graph analytics. Driving for results to present at a May 2016 Big Data event at Prairie View A&M University (PVAMU, also in the Houston area), a textbook example (literally!) taken from the pages of an O’Reilly book on Learning Spark showed some promise in allowing Jim and me to classify tweets – with hammy tweets encapsulating something deemed geophysically interesting, whereas spammy ones not so much. ‘Not so much’ was determined through supervised learning – in other words, results reported were achieved after a manual classification of tweets for the purpose of training the Machine Learning models. The need for manual training, and the absence of semantics, struck the two of us as ‘lacking’ from the outset; more specifically, each tokenized word of each tweet was represented as a feature vector – stated differently, data and metadata (e.g., Twitter handles, URLs) were all represented with the same (lacking) degree of semantic expression. Based upon our experience with knowledge-representation frameworks, we immediately sought a semantically richer solution.
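
To make the ham-versus-spam idea concrete, here is a minimal sketch of supervised tweet classification in plain Python. It is not the Spark code behind the results reported above: the tweets, the labels, the feature-space size (NUM_FEATURES) and the training settings are all invented for illustration. What it does share with the approach described is the limitation noted above – every whitespace-separated token is hashed into the same kind of feature slot, so words, handles and URLs all carry the same (lacking) degree of semantic expression.

```python
import math
import zlib

NUM_FEATURES = 1024  # hypothetical size of the hashed feature space


def featurize(text, num_features=NUM_FEATURES):
    """Map a tweet to a hashed term-frequency vector (all tokens treated alike)."""
    vec = [0.0] * num_features
    for token in text.lower().split():
        # crc32 is deterministic across runs, unlike Python's built-in hash()
        vec[zlib.crc32(token.encode("utf-8")) % num_features] += 1.0
    return vec


def sigmoid(z):
    z = max(-60.0, min(60.0, z))  # clamp to avoid overflow in exp()
    return 1.0 / (1.0 + math.exp(-z))


def train(examples, epochs=200, lr=0.1):
    """Fit logistic regression by stochastic gradient descent.

    examples: list of (text, label) pairs; label 1 = ham, 0 = spam.
    """
    weights = [0.0] * NUM_FEATURES
    bias = 0.0
    data = [(featurize(text), label) for text, label in examples]
    for _ in range(epochs):
        for x, y in data:
            p = sigmoid(bias + sum(w * xi for w, xi in zip(weights, x)))
            err = p - y  # gradient of the log loss with respect to the score
            bias -= lr * err
            weights = [w - lr * err * xi for w, xi in zip(weights, x)]
    return weights, bias


def predict(model, text):
    """Return the estimated probability that a tweet is ham."""
    weights, bias = model
    x = featurize(text)
    return sigmoid(bias + sum(w * xi for w, xi in zip(weights, x)))


# Manually labelled training tweets (made up): the 'supervised' part
TRAINING = [
    ("strong earthquake felt near the coast tsunami warning issued", 1),
    ("magnitude 7 earthquake shaking buildings in the city", 1),
    ("win a free phone click this link now", 0),
    ("follow me for free followers and prizes", 0),
]

model = train(TRAINING)
print(predict(model, "earthquake shaking felt near the coast"))
print(predict(model, "click this link for a free phone"))
```

With only four labelled tweets this is a toy, but it shows why the manual-labelling step matters: the model can only separate the two classes along lines that the hand-assigned labels draw for it.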

Machine Learning: Natural Language Processing

It wasn’t until after I’d made a presentation at GTC 2017 in Silicon Valley the following year that the idea of representing words as embedded vectors would register with me. Working with Jim, two unconventional choices were made – namely, GloVe over word2vec and PyTorch over TensorFlow. Whereas academic articles justified our choice of Stanford’s GloVe, the case for PyTorch was made on less-rigorous grounds – grounds expounded in my GTC presentation and our soon-to-be published book chapter.

Our uptake of GloVe and PyTorch addressed our scientific imperative, as results were obtained for the 2017 instantiation of the same HPCS conference where this idea of tsunami alerting (based upon data extracted from Twitter) was originally hatched. In employing Natural Language Processing (NLP), via embedded word vectors, Jim and I were able to quantitatively explore tweets as word-based time series based upon their co-occurrences – stated differently, this word-vector quantification is based upon ‘the company’ (usage associations) that words ‘keep’. By referencing the predigested corpora available from the GloVe project, we were able to explore “earthquake” and “tsunami” in terms of distances, analogies and various kinds of similarities (e.g., cosine similarity).
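
As a hedged illustration of the cosine-similarity comparison mentioned above, the sketch below uses three tiny, invented vectors (TOY_VECTORS) in place of real GloVe embeddings, which typically have 50 or more dimensions. The function names are hypothetical, and the numbers are chosen only so that "earthquake" and "tsunami" point in similar directions.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Toy 3-dimensional vectors, invented for illustration (not GloVe values)
TOY_VECTORS = {
    "earthquake": [0.9, 0.8, 0.1],
    "tsunami": [0.8, 0.9, 0.3],
    "pizza": [0.1, 0.0, 0.9],
}


def most_similar(word, vectors):
    """Rank every other word by cosine similarity to `word`, closest first."""
    target = vectors[word]
    scored = [
        (other, cosine_similarity(target, vec))
        for other, vec in vectors.items()
        if other != word
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


for other, score in most_similar("earthquake", TOY_VECTORS):
    print(other, round(score, 3))
```

With real embeddings, the dictionary would instead be filled from one of the pretrained GloVe text files, in which each line holds a word followed by its vector components; the similarity computation itself stays the same.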

Event-Reanalysis Examples

Our NLP approach appeared promising enough that we closed out 2017 with a presentation of our findings to date during an interdisciplinary session on tsunami science at the Fall Meeting of the American Geophysical Union held in New Orleans. To emphasize the scientific applicability of our approach, Jim and I focused on reanalyzing two pairs of events (see Slide 10 here). Like the pair identified years previously in the 2006 Scientific American article, the more-recent event pairs we chose included earthquake-only plus tsunamigenic events originating in close geographic proximity, with similar oceanic and tectonic settings.

The most-promising results we reported (see slides 11 and 12 here and below) involved those cosine similarities obtained for earthquake-only versus tsunamigenic events; evident via clustering, the approach appears able to discriminate between the two classes of events based upon data extracted from Twitter. Even in our own estimation however, the clustering is weakly discriminating at best, and we expect to apply more-advanced approaches for NLP to further separate classes of events.

Agile Sprints - Events - 2017 AGU Fall Meeting - Twitter Tsunami - December 8, 2017


Ultimately, the ability to further validate and operationally deploy this alerting mechanism would require the data from Twitter be streamed and processed in real time – a challenge that some containerized implementation of Apache Spark would seem ideally suited to, for example. (Aspects of this Future Work are outlined in the final section of our HPCS 2017 book chapter.)

When it comes to tsunamis, alerting remains a challenge – especially in those parts of the planet under-serviced by networks of tsunameters … and even seismometers, tide gauges, etc. Thus prospects for enhancing the alerting capabilities remain valuable and warranted. Even though inherently fraught with subjectivity, data extracted from Twitter streams in real time appears to hold some promise for providing a data source that complements the objective output from scientific instrumentation. Our approach, based upon Machine Learning via NLP, has demonstrated promising-enough early signs of success that ‘further research is required’. Given that this initiative has already benefited from useful discussions at conferences, suggestions are welcome, as it’s clear that even NLP has a lot more to offer beyond embedded word vectors.

Developing Your Expertise in Machine Learning: Podcasts for Breadth vs. Depth

From ad hoc to highly professional, there’s no shortage of resources when it comes to learning Machine Learning. Not only should podcasts be blatantly regarded as both viable and valuable resources, the two I cover in this post present opportunities for improving your breadth and/or depth in Machine Learning.

Machine Learning Guide

As a component of his own process for ramping up his knowledge and skills in the area of Machine Learning, OCDevel’s Tyler Renelle has developed an impressive resource of some 30 podcasts. Through this collection of episodes, Tyler’s is primarily a breadth play when it comes to the matter of learning Machine Learning, though he alludes to depth as well in how he positions his podcasts:

Where your other resources provide the machine learning trees, I provide the forest. Consider me your syllabus. At the end of every episode I provide high-quality curated resources for learning each episode’s details.

As I expect you’ll agree, with Tyler’s Guide, the purely audio medium of podcasting permits the breadth of Machine Learning to be communicated extremely effectively; in his own words, Tyler states:

Audio may seem inferior, but it’s a great supplement during exercise/commute/chores.

I couldn’t agree more. Even from the earliest of those episodes in this series, Tyler demonstrates the viability and value of this medium. In my opinion, he is particularly effective for at least three reasons:

  1. Repetition – Extremely important in any learning process, regardless of the medium, repetition is critical when podcasting is employed as a tool for learning.
  2. Analogies – Again, useful in learning regardless of the medium involved, yet extremely so in the case of podcasting. Imagine effective, simple, highly visual and sometimes fun analogies being introduced to explain, for example, a particular algorithm for Machine Learning.
  3. Enthusiasm – Perhaps a no-brainer, but enthusiasm serves to captivate interest and motivate action.

As someone who’s listened to each and every one of those 30 or so episodes, I can state with some assuredness that: We are truly fortunate that Tyler has expended the extra effort to share what he has learned in the hope that it’ll also help others. The quality of the Guide is excellent. If anything, I recall occasionally taking exception to some of the mathematical details related by Tyler. Because Tyler approaches this Guide from the perspective of an experienced developer, lapses mathematical in nature are extremely minor, and certainly do not detract from the overall value of the podcast.

After sharing his Guide, Tyler started up Machine Learning Applied:

an exclusive podcast series on practical/applied tech side of the same. Smaller, more frequent episodes.

Unfortunately, with only six episodes starting from May 2018, and none since mid-July, this more-applied series hasn’t yet achieved the stature of its predecessor. I share this more as a statement of fact than criticism, as sustaining the momentum to deliver such involved content on a regular cadence is not achieved without considerable effort – and, let’s be realistic, more than just a promise of monetization.

This Week in Machine Learning and AI

Whereas OCDevel’s Guide manifests itself as a one-person, breadth play, This Week in Machine Learning and AI (TWiML&AI) exploits the interview format in probing for depth. Built upon the seemingly tireless efforts of knowledgeable and skilled interviewer Sam Charrington, TWiML&AI podcasts allow those at the forefront of Machine Learning to share the details of their work – whether that translates to their R&D projects, business ventures or some combination thereof.

Like Tyler Renelle, Sam has a welcoming and nurturing style that allows him to ensure his guests are audience-centric in their responses – even if that means an episode is tagged with a ‘geek alert’ for those conversations that include mathematical details, for example. As someone who engages in original research in Machine Learning, I have learned a lot from TWiML&AI. Specifically, after listening to a number of episodes, I’ve followed up on show notes by delving a little deeper into something that sounded interesting; and on more than a few occasions, I’ve unearthed something of value for those projects I’m working on. Though Sam has interviewed some of the most well known in this rapidly evolving field, it is truly wonderful that TWiML&AI serves as an equal-opportunity platform – a platform that allows voices that might otherwise be marginalized to also be heard.

At this point, Sam and his team at TWiML&AI have developed a community around the podcast. The opportunity for deeper interaction exists through meetups, for example – meetups that have ranged from focused discussion on a particularly impactful research paper, to a facilitated study group in support of a course. In addition to all of this online activity, Sam and his team participate actively in a plethora of events, and have even been known to host events in person as well.

One last thought regarding TWiML&AI: The team here takes significant effort to ensure that each of the 185 episodes (and counting!) is well documented. While this is extremely useful, I urge you not to merely make your decision on what to listen to based upon teasers and notes alone. Stated differently, I can relate countless examples for which I perceived a very low level of interest prior to actually listening to an episode, only to be both surprised and delighted when I did. As I recall well from my running days, run for that first kilometre or so (0.6214 of a mile 😉 ) before you make the decision as to how far you’ll run that day.

From the understandably predictable essentials of breadth, to the sometimes surprising and delightful details of depth, these two podcasts well illustrate the complementarity between the schools of breadth and depth. Based upon my experience, you’ll be well served by taking in both of these podcasts – whether you need to jumpstart or engage-in-continuous learning. Have a listen.

Pencasting with a Wacom tablet: Time to revisit this option

Around the start of the Fall term in September 2014, I found myself in a bit of a bind: My level of frustration with Livescribe pencasting had peaked, and I was desperately seeking alternatives. To be clear, it was changes to the Livescribe platform that were the source of this frustration, rather than pencasting as a means for visual communication. In fact, if anything, a positive aspect of the Livescribe experience was that I was indeed SOLD on pencasting as an extremely effective means for communicating visually – an approach that delivered significant value in instructional settings such as the large classes I was teaching at the university level.

In an attempt to make use of an alternative to the Livescribe platform then, I discovered and acquired a small Wacom tablet. Whereas I rapidly became proficient in use of the Livescribe Echo smartpen, because it was truly like making use of a regular pen, my own learning curve with the Wacom solution was considerably steeper.

To be concrete, you can view on Youtube a relatively early attempt. As one viewer commented:

Probably should practice the lecture. Too many pauses um er ah.

Honestly, that was more a reflection of my grasp of the Wacom platform than my expertise with the content I was attempting to convey through this real-time screen capture. In other words, my comfort level with this technology was so low that I was distracted by it. Given that many, many thousands of visual (art) professionals make use of this or similar solutions from Wacom, I’m more than willing to admit that this one was ‘on me’ – I wasn’t ‘a natural’.

With the Wacom solution, you need to train your eyes to be fixed on your screen, while your hand writes/draws/etc. on the tablet. Not exactly known for my hand-eye coordination in general, it’s evident that I struggled with this technology. As I look at the results some four years later, I’m not quite as dismayed as I expected to be. My penmanship isn’t all that bad – even though I still find writing and drawing with this tablet to be a taxing exercise in humility. In hindsight, I’m also fairly pleased with the Wacom tablet’s ability to permit use of colour, as well as lines of different thicknesses. This flexibility, completely out of scope in the solution from Livescribe, introduces a whole next level of prospects for visual communication.

Knowing that others have mastered the Wacom platform, and having some personal indication of its potential to produce useful results, I’m left with the idea of giving this approach another try – soon. I’ll let you know how it goes.

Livescribe Pencasting: Seizing Uncertainty from Success

Echo’es of a Glorified Past

[Optional musical accompaniment: From their album Meddle, Pink Floyd’s Echoes (via their Youtube channel).]

I first learned about pencasting from an elementary-school teacher at a regional-networking summit in 2011.

It took me more than a year to acquire the technology and start experimenting. My initial experiences, in making use of this technology in a large, first-year course at the university level, were extremely encouraging; and after only a week’s worth of experimentation, my ‘findings’ were summarized as follows:

  • Decent-quality pencasts can be produced with minimal effort
  • Pencasts complement other instructional media
  • Pencasts allow the teacher to address both content and skills-oriented objectives
  • Pencasts are learner-centric
  • I’m striving for authentic, not perfect pencasts

The details that support each of these findings are provided in my April 2012 blog post. With respect to test driving the pencasts in a large-lecture venue, I subsequently shared:

  • The visual aspects of the pencast are quite acceptable
  • The audio quality of the pencasts is very good to excellent
  • One-to-many live streaming of pencasts works well
  • Personal pencasting works well

Again, please refer to my original post in October 2012 for the details.

Over the next year or so, I must’ve developed on the order of 20 pencasts using the Livescribe Echo smartpen – please see below for a sample. Given that the shortest of these pencasts ran 15-20 minutes, my overall uptake and investment in the technology was significant. I unexpectedly became ‘an advocate for the medium’, as I shared my pencasts with students in the courses I was teaching, colleagues who were also instructing in universities and high schools, plus textbook publishers. At one point, I even had interest from both the publisher and an author of the textbook I was using in my weather and climate class to develop a few pencasts – pencasts that would subsequently be made available as instructional media to any other instructor who was making use of this same textbook.

[Sample pencast: Download a mathematical example – namely, Hydrostatic Equation – Usage Example – 2013-06-15T12-10-06-0. Then, use the Livescribe player, desktop, iOS or Android app to view/listen.]

The Slings and Arrows of Modernization

Unfortunately, all of this changed in the Summer of 2015. Anticipating the impending demise of Adobe Flash technology, in what was marketed as a ‘modernization’ effort, Livescribe rejected this one-time staple in favour of their own proprietary appropriation of the Adobe PDF. Along with the shift to the Livescribe-proprietary format for pencasts then, was an implicit requirement to make use of browser, desktop or mobile apps from this sole-source vendor. As if these changes weren’t enough, Livescribe then proceeded to close its online community – the vehicle through which many of us were sharing our pencasts with our students, colleagues, etc. My frustration was clearly evident in a comment posted to Livescribe’s blog in September 2014:

This may be the tipping point for me and Livescribe products – despite my investment in your products and in pencast development … I’ve been using virtual machines on my Linux systems to run Windows so that I can use your desktop app. The pay off for this inconvenience was being able to share pencasts via the platform-neutral Web. Your direction appears to introduce complexities that translate to diminishing returns from my increasingly marginalized Linux/Android perspective …

From the vantage point of hindsight in 2018, and owing to the ongoing realization of the demise of Flash, I fully appreciate that Livescribe had to do something about the format they employed to encode their pencasts; and, given that there aren’t any open standards available (are there?), they needed to develop their own, proprietary format. What remains unfortunate, however, is the implicit need to make use of their proprietary software to actually view and listen to the pencasts. As far as I can tell, their browser-based viewer still doesn’t work on popular Linux-based platforms (e.g., Ubuntu), while you’ll need to have a Microsoft Windows or Apple Mac OS X based platform to make use of their desktop application. Arguably, the most-positive outcome from ‘all of this’ is that their apps for iOS and Android devices are quite good. (Of course, it took them some time before the Android app followed the release of the iOS app.)

Formats aside, the company’s decision to close its community still, from the vantage point of 2018, strikes me as a strategic blunder of epic proportions. (Who turns their back on their community and expects to survive?) Perhaps they (Livescribe) didn’t want to be in the community-hosting business themselves. And while I can appreciate and respect that position, alternatives were available at the time, and abound today.

Pencasting Complexified

[Full disclosure: I neither own, nor have I used the Livescribe 3 smartpen alluded to in the following paragraph. In other words, this is my hands-off take on the smartpen. I will happily address factual errors.]

At one point, and in my opinion, the simplicity of the Livescribe Echo smartpen was its greatest attribute. As a content producer, all I needed was the pen and one of Livescribe’s proprietary notebooks, plus a quiet place in which to record my pencasts. Subsequent innovation from the company resulted in the Livescribe 3 smartpen. Though it may well be designed “… to work and write like a premium ballpoint pen …”, the complexity introduced now requires the content producer to have the pen, the notebook, a bluetooth headset plus an iOS or Android device to capture pencasts. In this case, there is a serious price to be paid for modernization – both figuratively and literally.

According to Wikipedia, the Livescribe 3 smartpen was introduced in November 2013. And despite the acquisition by Anoto about two years later, innovation appears to have ceased. So much for first-mover advantage, and Jim Marggraff’s enviable track record of innovation.

My need to pencast remains strong – even in 2018. If you’ve read this far, I’m sure you’ll understand why I might be more than slightly reluctant to fork out the cash for a Livescribe 3 smartpen. There may be alternatives, however; and I do expect that future posts may share my findings, lessons learned, best practices, etc.

Feel free to weigh in on this post via the comments – especially if you have alternatives to suggest. Please note: Support for Linux highly desirable.

Demonstrating Your Machine Learning Expertise: Optimizing Breadth vs. Depth

Developing Expertise

When it comes to developing your expertise in Machine Learning, there seem to be two schools of thought:

  • Exemplified by articles that purport to have listed, for example, the 10-most important methods you need to know to ace a Machine Learning interview, the School of Breadth emphasizes content-oriented objectives. By amping up with courses/workshops to programs (e.g., certificates, degrees) then, the justification for broadening your knowledge of Machine Learning is self-evident.
  • Find data that interests you, and work with it using a single approach for Machine Learning. Thus the School of Depth emphasizes skills-oriented objectives that are progressively mastered as you delve into data, or better yet, a problem of interest.

Depending upon whichever factors you currently have under consideration then (e.g., career stage, employment status, desired employment trajectory, …), breadth versus depth may result in an existential crisis when it comes to developing and ultimately demonstrating your expertise in Machine Learning – with a modicum of apologies if that strikes you as a tad melodramatic.

Demonstrating Expertise

Somewhat conflicted, at least, is in all honesty how I feel at the moment myself.

On Breadth

Even a rapid perusal of the Machine Learning specific artifacts I’ve self-curated into my online, multimedia Data Science Portfolio makes one thing glaringly evident: The breadth of my exposure to Machine Learning has been somewhat limited. Specifically, I have direct experience with classification and Natural Language Processing in Machine Learning contexts from the practitioner’s perspective. The more-astute reviewer, however, might look beyond the ‘pure ML’ sections of my portfolio and afford me additional merit for (say) my mathematical and/or physical sciences background, plus my exposure to concepts directly or indirectly applicable to Machine Learning – e.g., my experience as a scientist with least-squares modeling counting as exposure at a conceptual level to regression (just to keep this focused on breadth, for the moment).

True confession: I’ve started more than one course in Machine Learning in a blunt-instrument attempt to address this known gap in my knowledge of relevant methods. Started is, unfortunately, the operative word, as (thus far) any attempt I’ve made has not been followed through – even when there are options for community, accountability, etc. to better-ensure success. (Though ‘life got in the way’ of me participating fully in the study group facilitated by the wonderful team that delivers the This Week in Machine Learning & AI Podcast, such approaches to learning Machine Learning are appealing in principle – even though my own engagement was grossly inconsistent.)

On Depth

What then about depth? Taking the self-serving but increasingly concrete example of my own Portfolio, it’s clear that (at times) I’ve demonstrated depth. Driven by an interesting problem aimed at improving tsunami alerting by processing data extracted from Twitter, for example, the deepening progression with co-author Jim Freemantle has been as follows:

  1. Attempt to apply existing knowledge-representation framework to the problem by extending it (the framework) to include graph analytics
  2. Introduce tweet classification via Machine Learning
  3. Address the absence of semantics in the classification-based approach through the introduction of Natural Language Processing (NLP) in general, and embedded word vectors in particular
  4. Next steps …

(Again, please refer to my Portfolio for content relating to this use case.) Going deeper, in this case, is not a demonstration of a linear progression; rather, it is a sequence of outcomes realized through experimentation, collaboration, consultation, etc. For example, the seed to introduce Machine Learning into this tsunami-alerting initiative was planted on the basis of informal discussions at an oil and gas conference … and later, the introduction of embedded word vectors, was similarly the outcome of informal discussions at a GPU technology conference.

Whereas these latter examples are intended primarily to demonstrate the School of Depth, it is clear that the two schools of thought aren’t mutually exclusive. For example, in delving into a problem of interest, Jim and I may have deepened our mastery of specific skills within NLP; however, we have also broadened our knowledge within this important subdomain of Machine Learning.

One last thought here on depth. At the outset, neither Jim nor I had as an objective any innate desire to explore NLP. Rather, the problem, and more importantly the demands of the problem, caused us to ‘gravitate’ towards NLP. In other words, we are wedded more to making scientific progress (on tsunami alerting) than a specific method for Machine Learning (e.g., NLP).

Next Steps

Net-net then, it appears to be that which motivates us that dominates in practice – in spite, perhaps, of our best intentions. In my own case, my existential crisis derives from being driven by problems into depth, while at the same time seeking to demonstrate a broader portfolio of expertise with Machine Learning. To be more specific, there’s a part of me that wants to apply LSTMs (for example) to the tsunami-alerting use case, whereas another part knows I must broaden (at least a little!) my portfolio when it comes to methods applicable to Machine Learning.

Finally then, how do I plan to address this crisis? For me, it’ll likely manifest itself as a two-pronged approach:

  1. Enrol and follow through on a course (at least!) that exposes me to one or more methods of Machine Learning that complement my existing exposure to classification and NLP.
  2. Identify a problem, or problems of interest, that allow me to deepen my mastery of one or more of these ‘newly introduced’ methods of Machine Learning.

In a perfect situation, perhaps we’d emphasize breadth and depth. However, when you’re attempting to introduce, pivot, re-position, etc. yourself, a trade off between breadth versus depth appears to be inevitable. An introspective reflection, based upon the substance of a self-curated portfolio, appears to be an effective and efficient means for roadmapping how gaps can be identified and ultimately addressed.


In many settings/environments, Machine Learning, and Data Science in general, are team sports. Clearly then, a viable way to address the challenges and opportunities presented by depth versus breadth is to hire accordingly – i.e., hire for depth and breadth in your organization.

Revisiting the Estimation of Fractal Dimension for Image Classification

Classification is a well-established use case for Machine Learning. Though textbook examples abound, standard examples include the classification of email into ham versus spam, or images of cats versus dogs.

Circa 1994, I was unaware of Machine Learning, but I did have a use case for quantitative image classification. I expect you’re familiar with those brave souls known as The Hurricane Hunters – brave because they explicitly seek to locate the eyes of hurricanes using an appropriately tricked out, military-grade aircraft. Well, these hunters aren’t the only brave souls when it comes to chasing down storms in the pursuit of atmospheric science. In an effort to better understand Atlantic storms (i.e., East Coast, North America), a few observational campaigns featured aircraft flying through blizzards at various times during Canadian winters.

In addition to standard instrumentation for atmospheric and navigational observables, these planes were tricked out in an exceptional way:

For about two-and-a-half decades, Knollenberg-type [ref 4] optical array probes have been used to render in-situ digital images of hydrometeors. Such hydrometeors are represented as a two-dimensional matrix, whose individual elements depend on the intensity of transmitted light, as these hydrometeors pass across a linear optical array of photodiodes. [ref 5]

In other words, the planes were equipped with underwing optical sensors that had the capacity to obtain in-flight images of

hydrometeor type, e.g. plates, stellar crystals, columns, spatial dendrites, capped columns, graupel, and raindrops. [refs 1,7]

(Please see the original paper for the references alluded to here.)

Even though this is hardly a problem in Big Data, a single flight might produce tens to hundreds to thousands of hydrometeor images that needed to be manually classified by atmospheric scientists. Working for a boutique consultancy focused on atmospheric science, and having excellent relationships with Environment Canada scientists who make Cloud Physics their express passion, an opportunity to automate the classification of hydrometeors presented itself.

Around this same time, I became aware of fractal geometry, a visually arresting and quantitative description of nature popularized by proponents such as Benoit Mandelbrot. Whereas simple objects (e.g., lines, planes, cubes) can be associated with an integer dimension (e.g., 1, 2 and 3, respectively), objects in nature (e.g., a coastline, a cloud outline) can be better characterized by a fractional dimension – a real-valued fractal dimension that lies between the integer value for a line (i.e., 1) and the two-dimensional (i.e., 2) value for a plane.
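To make the idea of a fractional dimension a little more concrete, here is a minimal sketch of box counting, one common way of estimating fractal dimension from a binary image. To be clear, this is an illustration only: it is not necessarily one of the algorithms used in the original paper, and the function names, the list-of-rows image format and the default box sizes are all assumptions made for the example. The estimate is the negative slope of log(box count) against log(box size).

```python
import math


def box_count(image, box_size):
    """Count the boxes of side `box_size` that contain at least one set pixel.

    `image` is assumed to be a non-empty list of equal-length rows of 0/1 values.
    """
    rows, cols = len(image), len(image[0])
    count = 0
    for r in range(0, rows, box_size):
        for c in range(0, cols, box_size):
            if any(
                image[i][j]
                for i in range(r, min(r + box_size, rows))
                for j in range(c, min(c + box_size, cols))
            ):
                count += 1
    return count


def box_counting_dimension(image, box_sizes=(1, 2, 4, 8, 16)):
    """Estimate fractal dimension as the negative least-squares slope of
    log(count) versus log(box size). The box sizes are an illustrative choice."""
    counts = [box_count(image, s) for s in box_sizes]
    if min(counts) == 0:
        raise ValueError("image contains no set pixels")
    xs = [math.log(s) for s in box_sizes]
    ys = [math.log(n) for n in counts]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sum(
        (x - mean_x) ** 2 for x in xs
    )
    return -slope


# Two sanity checks on simple objects in a 64 x 64 image.
size = 64
filled_square = [[1] * size for _ in range(size)]
straight_line = [[1 if r == 0 else 0 for _ in range(size)] for r in range(size)]

print(box_counting_dimension(filled_square))
print(box_counting_dimension(straight_line))
```

For the filled square and the straight line, the estimates should come out very close to 2 and 1 respectively, matching the integer dimensions mentioned above; an irregular outline would be expected to land somewhere in between.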

Armed with an approach for estimating fractal dimension then, my colleagues and I sought to classify hydrometeors based on their subtle to significant geometrical expressions. Although the idea was appealing in principle, the outcome on a per-hydrometeor basis was a single, scalar result that attempted to capture geometrical uniqueness. In isolation, this approach was simply not enough to deliver an automated scheme for quantitatively classifying hydrometeors.

I well recall some of the friendly conversations I had with my scientific and engineering peers who attended the conference at Montreal’s Ecole Polytechnique. Essentially, the advice I was given was to regard the work I’d done as a single dimension of the hydrometeor classification problem. What I really needed to do was develop additional dimensions for classifying hydrometeors. With enough dimensions then, the resulting multidimensional classification scheme would have a much better chance of delivering the automated solution sought by the atmospheric scientists.
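As a rough sketch of what ‘additional dimensions’ might look like in practice, the snippet below computes a few simple shape descriptors from a binary hydrometeor image. These particular descriptors (area, aspect ratio, fill fraction), the function name and the list-of-rows image format are hypothetical choices for illustration; none of them come from the original paper. An estimated fractal dimension would simply be one more entry alongside them.

```python
def shape_features(image):
    """Return a few simple shape descriptors for a binary image.

    `image` is assumed to be a list of equal-length rows of 0/1 values.
    The descriptors chosen here are illustrative, not the paper's.
    """
    pixels = [
        (r, c)
        for r, row in enumerate(image)
        for c, value in enumerate(row)
        if value
    ]
    if not pixels:
        raise ValueError("image contains no set pixels")

    row_indices = [r for r, _ in pixels]
    col_indices = [c for _, c in pixels]
    height = max(row_indices) - min(row_indices) + 1  # bounding-box height
    width = max(col_indices) - min(col_indices) + 1   # bounding-box width
    area = len(pixels)                                # number of set pixels

    return {
        "area": area,
        "aspect_ratio": width / height,
        "fill_fraction": area / (width * height),
    }


# A compact blob versus a thin column: same area, different aspect ratio.
blob = [
    [0, 1, 1, 0],
    [0, 1, 1, 0],
    [0, 0, 0, 0],
]
column = [[1], [1], [1], [1]]

print(shape_features(blob))
print(shape_features(column))
```

The point is only that two shapes which look identical along one dimension (here, area) can separate along another; with several such descriptors per image, each hydrometeor becomes a feature vector that a classifier can work with.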

In my research, fractal dimensions were estimated using various algorithms; they were not learned. However, they could be – as is clear from the efforts of others (e.g., the prediction of fractal dimension via Machine Learning). And though my pursuit of such a suggestion will have to wait for a subsequent research effort, a learned approach might allow for the introduction of much more of a multidimensional scheme for quantitative classification of hydrometeors via Machine Learning. Of course, from the hindsight of 2018, there are a number of possibilities for quantitative classification via Machine Learning – possibilities that I fully expect would result in more useful outcomes.

Whereas fractals don’t receive as much attention these days as they once did, and certainly not anything close to the deserved hype that seems to pervade most discussions of Machine Learning, there may still be some value in incorporating their ability to quantify geometry into algorithms for Machine Learning. From a very different perspective, it might be interesting to see if the architecture of deep neural networks can be characterized through an estimation of their fractal dimension – if only to tease out geometrical similarities that might be otherwise completely obscured.

While I, or (hopefully) others, ponder such thoughts, there is no denying the stunning expression of the fractal geometry of nature that fractals have rendered visual.

Prob & Stats Gaps: Sprinting for Closure

Prob & Stats Gap

When it comes to the mathematical underpinnings for Deep Learning, I’m extremely passionate. In fact, my perspective can be summarized succinctly:

Deep Learning – Deep Math = Deep Gap.

In reflecting upon my own mathematical credentials for Deep Learning, when it came to probability and statistics, I previously stated:

Through a number of courses in Time Series Analysis (TSA), my background affords me an appreciation for prob & stats. In other words, I have enough context to appreciate this need, and through use of quality, targeted resources (e.g., Goodfellow et al.’s textbook), I can close out the gaps sufficiently – in my case, for example, Bayes’ Rule and information theory.

Teaching to Learn

Although I can certainly leverage quality, targeted resources, I wanted to share here a complementary approach. One reason for doing this is that resources such as Goodfellow et al.’s textbook may not be readily accessible to everyone – in other words, some homework is required before some of us are ready to crack open this excellent resource, and make sense of the prob & stats summary provided there.

So, in the spirit of progressing towards being able to leverage appropriate references such as Goodfellow et al.’s textbook, please allow me to share here a much-more pragmatic suggestion:

Tutor a few high school students in prob & stats to learn prob & stats.

Just in case the basic premise of this suggestion isn’t evident, it is: By committing to teaching prob & stats, you must be able to understand prob & stats. And as an added bonus, this commitment of tutoring each of a few students (say) once a week, establishes and reinforces a habit – a habit that is quite likely, in this case, to ensure you stick with your objective to broaden and deepen your knowledge/skills when it comes to probability and statistics.

As a further bonus, this is a service for which you could charge a fee – anywhere from full rate for tutoring math at the high-school level to gratis, depending upon the value you’ll be able to offer your students … of course, a rate you could adjust over time, as your expertise with prob & stats develops.

Agile Sprints

Over recent years, I’ve found it particularly useful to frame initiatives such as this one in the form of Agile Sprints – an approach I’ve adopted and adapted from the pioneering efforts of J D Meier. To try this for yourself, I suggest the following two-step procedure:

  1. Review JD’s blog post on sprints – there’s also an earlier post of his that is both useful and relevant.
  2. Apply the annotated template I’ve prepared here to a sprint of your choosing. Because the sample template I’ve shared is specific to the prob & stats example I’ve been focused on in this post, I’ve also included a blank version of the sprint template here.


Before you go, there’s one final point I’d like to draw your attention to – and that’s lead and lag measures. Whereas lag measures focus on your (wildly) important goal (WIG), lead measures emphasize those behaviors that’ll get you there. To draw from the example I shared for addressing a math gap in prob & stats, the lag measure is:

MUST have enhanced my knowledge/skills in the area of prob & stats such that I am better prepared to review Deep Learning staples such as Goodfellow et al.’s textbook

In contrast, examples of lead measures are each of the following:

SHOULD have sought tutoring positions with local and/or online services

COULD have acquired the textbook relevant for high-school level prob & stats

With appropriately crafted lead measures then, the likelihood that your WIG will be achieved is significantly enhanced. Kudos to Cal Newport for emphasizing the importance of acting on lead measures in his Deep Work book. For all four disciplines of execution, you can have a closer look at Newport’s book, go to the 4DX source (the book itself), or simply Google for resources on “the 4 disciplines of execution”.

Of course, the approach described here can be applied to much more than a gap in your knowledge/skills of prob & stats. And as I continue the process of self-curating my Data Science Portfolio, I expect to unearth additional challenges and opportunities – challenges and opportunities that can be well approached through 4DX’d Agile Sprints.