My Next Chapter in Distributed Computing: Joining Sylabs to Containerize HPC and Deep Learning

HPC in the Cloud?

Back in 2015, my long-time friend and scientific collaborator James (Jim) Freemantle suggested I give a talk to his local association of remote-sensing professionals. In hindsight, his suggestion to juxtapose Cloud computing and High Performance Computing (HPC) in this talk would turn out to be much more important for me. The abstract for my talk, High Performance Computing in the Cloud?, is still available via the Ontario Association of Remote Sensing (OARS) website here; it read:

High Performance Computing (HPC) in the Cloud is viable in numerous applications. Common to all successful applications for cloud-based HPC is the ability to embrace latency. Early successes were achieved with embarrassingly parallel HPC applications involving minimal amounts of data – in other words, there was little or no latency to be hidden. More recently the HPC-cloud community has become increasingly adept in its ability to ‘hide’ latency and, in the process, support increasingly more sophisticated HPC applications in public and private clouds. In this presentation, real-world applications, deemed relevant to remote sensing, will illustrate aspects of these sophistications for hiding latency in accounting for large volumes of data, the need to pass messages between simultaneously executing components of distributed-memory parallel applications, as well as (processing) workflows/pipelines. Finally, the impact of containerizing HPC for the cloud will be considered through the relatively recent creation of the Cloud Native Computing Foundation.

I delivered the talk in November 2015 at Toronto’s Ryerson University to a small, but engaged group, and made the slides available via the OARS website and Slideshare.

As you can read for yourself from the abstract and slides, or hear in the Univa webinar that followed in February 2016, I placed a lot of emphasis on latency in juxtaposing the cloud and HPC; from the perspective of HPC workloads, that emphasis remains justifiable today. Much later, however, while working hands-on with various cloud projects for Univa, I’d come to appreciate the challenges and opportunities that data introduces; but more on that (data) another time …
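
One common way to ‘hide’ latency is to overlap data transfer with computation, so that the next chunk of data is already on its way while the current one is being processed. The following is a minimal sketch of that idea, assuming a simulated transfer (a short sleep) in place of a real network or storage fetch; the names fetch, compute, serial and pipelined are my own and purely illustrative.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def fetch(i, delay=0.01):
    """Stand-in for a high-latency data transfer (the sleep is hypothetical)."""
    time.sleep(delay)
    return list(range(i, i + 4))


def compute(chunk):
    """Stand-in for the numerical work done on one chunk."""
    return sum(x * x for x in chunk)


def serial(n):
    """Fetch, then compute, one chunk at a time: latency is fully exposed."""
    return [compute(fetch(i)) for i in range(n)]


def pipelined(n):
    """Prefetch chunk i+1 in a background thread while computing on chunk i."""
    if n == 0:
        return []
    results = []
    with ThreadPoolExecutor(max_workers=1) as pool:
        pending = pool.submit(fetch, 0)
        for i in range(n):
            chunk = pending.result()              # wait only if the data is late
            if i + 1 < n:
                pending = pool.submit(fetch, i + 1)  # start the next transfer now
            results.append(compute(chunk))
    return results


# Both strategies produce the same answers; only the waiting differs.
print(pipelined(5) == serial(5))
```

With real workloads the benefit depends on how the compute time compares with the transfer time; when the two are similar, most of the transfer cost can disappear behind the computation.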

Cloud-Native HPC?

Also in hindsight, I am pleased to see that I made this relatively early connection (for me, anyway) with the ‘modern’ notion of what it means to be cloud native. Unfortunately, this is a phrase bandied about with reckless abandon at times – usage that causes the phrase to become devoid of meaning. So, in part, in my OARS talk and the Univa webinar, I related cloud native as understood by the Cloud Native Computing Foundation:

Cloud native computing uses an open source software stack to deploy applications as microservices, packaging each part into its own container, and dynamically orchestrating those containers to optimize resource utilization.

If you possess even a working knowledge of HPC, you’ll immediately appreciate that more than a little tension is likely to surface in realizing any vision of ‘cloud-native HPC’. Why? HPC applications have not traditionally been architected with microservices in mind; in fact, their implementations are the polar opposite of microservices. Therefore, the notion of taking existing HPC applications and simply executing them within a Docker container can present some challenges and opportunities – even though numerous examples of successfully containerized HPC applications do exist (see, for example, the impressive array of case studies over at UberCloud).

In some respects, when it comes to containerizing HPC applications, this is just the tip of the iceberg. In following up on the Univa webinar with a Q&A blog on HPC in the Cloud in February 2016, I quoted Univa CTO Fritz Ferstl in regard to a question on checkpointing Docker containers:

The mainstream of the container ecosystem views them as ephemeral – i.e., you can just kill them, restart them (whether on the same node or elsewhere), and then they somehow re-establish ‘service’ (i.e., what they are supposed to do … even though this isn’t an intrinsic capability of a Docker container).

Whereas ephemeral resonates soundly with microservices-based applications, this is hardly a ‘good fit’ for HPC applications. And because they share numerous characteristics with traditional HPC applications, emerging applications and workloads in AI, Deep Learning and Machine Learning suffer a similar fate: They aren’t always a good fit for traditional containers along the implementation lines of Docker. From nvidia-docker to the relatively recent and impressive integration between Univa Grid Engine and Docker, it’s blatantly evident that there are significant technical gymnastics required to leverage GPUs for applications that one seeks to execute within a Docker container. For years now for traditional HPC applications and workflows, and more recently for Deep Learning use cases, there is an implied requirement to tap into GPUs as computational resources.
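
To make the ‘ephemeral’ mismatch concrete: a long-running computation only survives being killed and restarted if it periodically saves its own state. The following is a minimal sketch of application-level checkpoint/restart, not something that Docker or Singularity provides out of the box; the function name, the JSON checkpoint format and the fail_at parameter (which simulates the container being killed) are all hypothetical.

```python
import json
import os
import tempfile


def run_with_checkpoint(total_steps, checkpoint_path, fail_at=None):
    """Sum the integers 0..total_steps-1, saving state after every step.

    If a checkpoint file exists, the run resumes from it rather than from zero.
    fail_at (hypothetical) raises an error at that step to mimic a killed container.
    """
    state = {"step": 0, "total": 0}
    if os.path.exists(checkpoint_path):
        with open(checkpoint_path) as f:
            state = json.load(f)

    while state["step"] < total_steps:
        if fail_at is not None and state["step"] == fail_at:
            raise RuntimeError("simulated container kill")
        state["total"] += state["step"]
        state["step"] += 1
        # Write to a temporary file, then rename, so that a kill in mid-write
        # never replaces the last good checkpoint with a partial one.
        tmp_path = checkpoint_path + ".tmp"
        with open(tmp_path, "w") as f:
            json.dump(state, f)
        os.replace(tmp_path, checkpoint_path)
    return state["total"]


with tempfile.TemporaryDirectory() as workdir:
    path = os.path.join(workdir, "checkpoint.json")
    try:
        run_with_checkpoint(10, path, fail_at=6)   # first 'container' dies part-way
    except RuntimeError:
        pass
    print(run_with_checkpoint(10, path))           # restart resumes from step 6
```

A microservice can usually be restarted with no such bookkeeping; a simulation or training run that has been executing for hours cannot, which is why the ephemeral view sits uneasily with HPC.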

A Singular Fit

For these and other reasons, Singularity has been developed as a ‘vehicle’ for containerization that is simply a better fit for HPC and Deep Learning applications and their corresponding workflows. Because I have very recently joined the team at Sylabs, Inc. as a technical writer, you can expect to hear a whole lot more from me on containerization via Singularity – here, or even more frequently, over at the Sylabs blog and in Lab Notes.

Given that my acknowledged bias towards Distributed computing includes a significant dose of Cloud computing, I’m sure you can appreciate that it’s with so much more than a casual degree of enthusiasm that I regard my new opportunity with Sylabs – a startup that is literally defining how GOV/EDU as well as enterprise customers can sensibly and easily acquire the benefits of containerizing their applications and workflows on everything from isolated PCs (laptops and desktops) to servers, VMs and/or instances that exist in their datacenters and/or clouds.

From my ‘old’ friend/collaborator Jim to the team at Sylabs that I’ve just joined, and to everyone in between, it’s hard not to feel a sense of gratitude at this juncture. With HPC’s premier event less than a month away in Dallas, I look forward to reconnecting with ‘my peeps’ at SC18, and ensuring they are aware of the exciting prospects Singularity brings to their organizations.

Preserving Content for Your Portfolio: Kudos to The Internet Archive

Preserving Science

I’ve been publishing articles since the last century.

In fact, my first legitimate publication was a letter to the science journal Nature with my then thesis supervisor (Keith Aldridge) in 1987 … that’s 31 years ago. Armed with nothing more than Google Scholar, searching for “aldridge lumb nature 1987” yields access to the article via Nature’s website in fractions of a second. Moreover, since the introduction of Digital Object Identifiers (DOIs) around the turn of the last century (circa 2000), articles such as this one are uniquely identifiable and findable via a URL – e.g., the URL for our Nature letter is http://dx.doi.org/10.1038/325421a0.
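
Because a DOI is resolved simply by appending it to a resolver address, the mapping from identifier to URL is mechanical. The following is a minimal sketch, assuming the current https://doi.org/ resolver rather than the older dx.doi.org form used above (which, to my knowledge, still redirects); the function name doi_to_url is my own and purely illustrative.

```python
from urllib.parse import quote

DOI_RESOLVER = "https://doi.org/"  # assumption: the current public DOI resolver


def doi_to_url(doi: str) -> str:
    """Return a resolvable URL for a bare DOI such as '10.1038/325421a0'."""
    doi = doi.strip()
    # Accept the common 'doi:' prefix as well as a bare identifier.
    if doi.lower().startswith("doi:"):
        doi = doi[4:].strip()
    # Percent-encode unusual characters, but keep the '/' that separates
    # the registrant prefix from the item suffix.
    return DOI_RESOLVER + quote(doi, safe="/")


print(doi_to_url("10.1038/325421a0"))  # https://doi.org/10.1038/325421a0
```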

In this letter to Nature, Keith and I cite an 1880 publication authored by Lord Kelvin – who, it appears, is known for fluid dynamics in addition to the temperature scale that bears his name … and, of course, much more! Through this and other citations, Keith and I explicitly acknowledged how the contributions of others enabled us to produce our letter – in other words, we made it clear how we have been able to stand on the shoulders of giants.

In addition to assigning intellectual credit where it is due, this personal reflection underscores the importance of preserving contributions over the long haul – make that 138 years in the case of Kelvin’s 1880 paper. Preservation is a well-established practice in the case of scientific journals, for example, even though it may be necessary to draw upon analog renditions captured via print or microfiche rather than some digital representation.

In self-curating portfolios recently, it’s been made increasingly clear to me that content preservation has not been a focal point in the digital realm.

Digital Properties

Let’s make use of Grid Computing for the purpose of providing an illustrative example. In its heyday, a popular and reputable online magazine was GRIDtoday: “DAILY NEWS AND INFORMATION FOR THE GLOBAL GRID COMMUNITY”. Other than a passing reference in pioneering publisher Tom Tabor’s BIO (you can search for his BIO here), I expect you’ll be hard pressed to locate very much at all regarding this once-thriving online property. Like Grid Computing itself: GRIDtoday, gone tomorrow; RIP GRIDtoday. Of course, Grid Computing Planet (GCP) suffered a similar fate.

My purpose here is not to question those extremely reasonable business decisions that resulted in closing down operations on GCP or GRIDtoday – Tabor Communications, for example, boasts three, prized ‘properties’ as of this writing … one of which (HPCwire) predates the inception of GRIDtoday, and remains a go-to source for all things HPC.

Grid Computing remains an important chapter in my professional life – especially given my claims for genetic imprinting via Distributed Computing. However, based upon my desire to assemble a portfolio of my work that includes Grid Computing, the /dev/null redirection of those bits that collectively represented GRIDtoday and GCP is problematical. In particular, and even though I collaborated upon articles and book chapters that have been preserved in analog and/or digital representations, articles contributed to GRIDtoday and GCP still retain value to me personally – value that I’d like to incorporate into my Portfolio.

Enter The Internet Archive

Fortunately, also since close to the end of the last century, The Internet Archive has been:

… building a digital library of Internet sites and other cultural artifacts in digital form. Like a paper library, [they] provide free access to researchers, historians, scholars, the print disabled, and the general public. Our mission is to provide Universal Access to All Knowledge.

I’m not intending to imply that those items I was able to have published via GRIDtoday and GCP carry ‘a Kelvin of clout’; however, for more than purely sentimental reasons, it’s truly wonderful that The Internet Archive has attempted to preserve those artifacts that collectively comprised these publications in their heyday. Although I haven’t yet attempted to locate an article I wrote for GCP, I was able to retrieve two articles from the archive for GRIDtoday:

  • Towards The Telecosmic Grid – Published originally in December 2002, in this article I ‘channeled’ George Gilder in asserting that: “Isolating and manipulating discrete wavelengths of visible light across intelligent optical transport media results in the grid – a specific instance of The Telecosmic Grid. Several examples serve as beacons of possibility.” More on this soon (I hope) in a separate post that revisits this possibility.
  • Open Grid Forum: Necessary … but Sufficient? – Published originally in June 2006, this may be the most-opinionated article I’ve ever had appear in any media format! It generated a decent amount of traffic for GRIDtoday, as well as an interesting accusation – an accusation ‘leaked’, incidentally, through a mailing list archive.
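
Retrieval of this kind can be sketched in a few lines. The Wayback Machine serves captures at addresses of the form https://web.archive.org/web/<timestamp>/<original URL>, and also offers an availability lookup at https://archive.org/wayback/available. The following minimal sketch only builds those two addresses and makes no network calls; the function names are my own, and the example.com address is a placeholder rather than the actual location of either GRIDtoday article.

```python
from urllib.parse import urlencode


def wayback_snapshot_url(original_url: str, timestamp: str) -> str:
    """Build a Wayback Machine address for a page.

    timestamp is YYYYMMDD (optionally followed by hhmmss); the archive
    redirects to the capture closest to that moment.
    """
    return f"https://web.archive.org/web/{timestamp}/{original_url}"


def wayback_availability_query(original_url: str, timestamp: str) -> str:
    """Build the lookup address that reports the closest capture as JSON."""
    query = urlencode({"url": original_url, "timestamp": timestamp})
    return "https://archive.org/wayback/available?" + query


# Placeholder page (hypothetical) and a date in December 2002.
page = "http://www.example.com/article.html"
print(wayback_snapshot_url(page, "20021201"))
print(wayback_availability_query(page, "20021201"))
```

Pasting the first address into a browser is enough to see whether a capture exists; the second is more convenient when checking many pages from a script.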

Because these two GRIDtoday articles are currently accessible via The Internet Archive, I can include each of them directly in my Portfolio, and update my blog posts that make reference to them. Having laid intellectual claim (in 2002, I’ll have you know!!! 😉) to various possibilities telecosmic in nature, I’ll soon be able to revisit the same through the lens of hindsight. Whereas I fully appreciate that business decisions need to be made, and as a consequence once-popular landing pages necessarily disappear, it’s truly fortunate that The Internet Archive has our collective backs on this. So, if this post has a key takeaway, it’s simply this:

Please donate to The Internet Archive.

Thanks Brewster!

Disclosures Regarding My Portfolios: Attributing the Contributions of Others

‘Personal’ Achievement?

October 8, 2018 was an extremely memorable night for Drew Brees at the Mercedes-Benz Superdome in New Orleans. Under the intense scrutiny of Monday Night Football, the quarterback of the New Orleans Saints became the leading passer in the history of the National Football League. (For those not familiar with this sport, you can think of his 72,103-yard milestone as a lifetime-achievement accomplishment of ultramarathon’ic proportions.) The narrative on Brees’ contributions to ‘the game’ is anything but complete. In fact, the longer he plays, the more impressive this milestone becomes, as he continues to place distance between himself and every other NFL QB.

Of course the record books, and Brees’ inevitable induction into the Pro Football Hall of Fame, will all position this as an individual-achievement award. Whenever given the opportunity to reflect upon seemingly personal achievements such as the all-time passing leader, Brees is quick to acknowledge those who have enabled him to be so stunningly successful in such a high-profile, high-pressure role – from family and friends, to teammates, coaches, and more.

As I wrote about another NFL quarterback in a recent post, like Tom Brady, Brees remains a student-of-the-game. He is also known for his off-the-field work ethic that he practices with the utmost intensity in preparing for those moments when he takes the main stage along with his team. Therefore, when someone like Brees shares achievements with those around him, it’s clearly an act that is sincerely authentic.

Full Disclosure

At the very least, self-curating and sharing in public some collection of your work has more than the potential to come across as an act of blatant self-indulgence – and, of course, to some degree it is! At the very worst, however, is the potential for such an effort to come across as a purely individual contribution. Because contribution matters so much to me personally, I wanted to ensure that any portfolio I self-curate includes appropriate disclosures; disclosures that acknowledge the importance of collaboration, opportunity, support, and so on, from my family, friends and acquaintances, peers and co-workers, employers, customers and partners, sponsors, and more. In other words, and though in a very different context, like Brees I want to ensure that what comes across as ‘My Portfolio’ rightly acknowledges that this too is a team sport.

In the interests of generic disclosures then, the following is an attempt to ensure the efforts of others are known explicitly:

  • Articles, book chapters and posters – Based on authorships, affiliations and acknowledgements, portfolio artifacts such as articles, book chapters and posters make explicit collaborators, enablers and supporters/influencers, respectively. In this case, there’s almost no need for further disclosure.
  • Blog posts – Less formal than the written and oral forms of communication already alluded to above and below, it’s through the words themselves and/or hyperlinks introduced that the contributions of others are gratefully and willingly acknowledged. Fortunately, it is common practice for page-ranking algorithms to take into account the words and metadata that collectively comprise blog posts, and appropriately afford Web pages stronger rankings based upon these and other metrics.
  • Presentations – My intention here is to employ Presentations as a disclosure category for talks, webinars, workshops, courses, etc. – i.e., all kinds of oral communications that may or may not be recorded. With respect to this category, my experience is ‘varied’ – e.g., in not always allowing for full disclosure regarding collaborators, though less so regarding affiliations. Therefore, to make collaborators as well as supporters/influencers explicit, contribution attributions are typically included in the materials I’ve shared (e.g., the slides corresponding to my GTC17 presentation) and/or through the words I’ve spoken. Kudos are also warranted for the organizations I’ve represented in some of these cases as well, as it has been a byproduct of this representation that numerous opportunities have fallen into my lap – though often owing to a sponsorship fee, to be completely frank. Finally, sponsoring organizations are also deserving of recognition, as it is often their mandate (e.g., a lead-generation marketing program that requires a webinar, a call for papers/proposals) that inspires what ultimately manifests itself as some artifact in one of my portfolios; having been on the event-sponsor’s side more than a few times, I am only too well aware of the effort involved in creating the space for presentations … a contribution that cannot be ignored.

Disclosures regarding contribution thus range from explicit to vague, and from clearly to barely evident. Regardless, for those portfolios shared via my personal blog (Data Science Portfolio and Cloud Computing Portfolio), suffice it to say that there were always others involved. I’ve done my best to make those contributions clear; however, I’m sure that unintentional omissions, errors and/or (mis)representations exist. Given that these portfolios are intentionally positioned and executed as works-in-progress, I look forward to addressing matters as they arise.

Ian Lumb’s Cloud Computing Portfolio

When I first introduced it, it made sense to me (at the time, at least!) to divide my Data Science Portfolio into two parts; the latter part was “… intended to showcase those efforts that have enabled other Data Scientists” – in other words, my contributions as a Data Science Enabler.

As of today, most of what was originally placed in that latter part of my Data Science Portfolio has been transferred to a new portfolio – namely one that emphasizes Cloud computing. Thus my Cloud Computing Portfolio is a self-curated, online, multimedia effort intended to draw together into a cohesive whole my efforts in Cloud computing; specifically this new Portfolio is organized as follows:

  • Strictly Cloud – A compilation of contributions in which Cloud computing takes centerstage
  • Cloud-Related – A compilation of contributions drawn from clusters and grids to miscellany. Also drawn out in this section, however, are contributions relating to containerization.

As with my Data Science Portfolio, you’ll find in my Cloud Computing Portfolio everything from academic articles and book chapters, to blog posts, to webinars and conference presentations – in other words, this Portfolio also lives up to its multimedia billing!

Since this is intentionally a work-in-progress, like my Data Science Portfolio, feedback is always welcome as there will definitely be revisions applied!

PyTorch Then & Now: A Highly Viable Framework for Deep Learning

Why PyTorch Then?

In preparing for a GTC 2017 presentation, I was driven to emphasize CUDA-enabled GPUs as the platform upon which I’d run my Machine Learning applications. Although I’d already had some encouraging experience with Apache Spark’s MLlib in a classification problem, ‘porting’ from in-memory computations based upon use of CPUs to GPUs was and remains ‘exploratory’ – with, perhaps, the notable exception of a cloud-based offering from Databricks themselves. Instead, in ramping up for this Silicon Valley event, I approached this ‘opportunity’ with an open mind and began my GPU-centric effort by starting at an NVIDIA page for developers. As I wrote post-event in August 2017:

Frankly, the outcome surprised me: As a consequence of my GTC-motivated reframing, I ‘discovered’ Natural Language Processing (NLP) – broadly speaking, the use of human languages by a computer. Moreover, by reviewing the breadth and depth of possibilities for actually doing some NLP on my Twitter data, I subsequently ‘discovered’ PyTorch – a Python-based framework for Deep Learning that can readily exploit GPUs. It’s important to note that PyTorch is not the only choice available for engaging in NLP on GPUs, and it certainly isn’t the most-obvious choice. As I allude to in my GTC presentation, however, I was rapidly drawn to PyTorch.

Despite that most-obvious choice (I expect) of TensorFlow, I selected PyTorch for reasons that included the following:

Not bad for version 0.1 of a framework, I’d say! In fact, by the time I was responding to referee’s feedback in revising a book chapter (please see “Refactoring Earthquake-Tsunami Causality with Big Data Analytics” under NLP in my Data Science Portfolio), PyTorch was revised to version 0.2.0. This was a very welcome release in the context of the chapter revision, as it included a built-in method for computing cosine similarity (“cosine_similarity”) – the key discriminator for quantitatively assessing the semantic similarity between two word vectors.
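
For readers unfamiliar with the measure, cosine similarity is the dot product of two vectors divided by the product of their lengths, so it ranges from -1 (opposite directions) through 0 (orthogonal) to 1 (same direction). The following is a minimal sketch in plain Python rather than a call into PyTorch itself (in current releases the built-in is, to my knowledge, torch.nn.functional.cosine_similarity); the three-component ‘word vectors’ are made-up values for illustration only, not embeddings from my chapter.

```python
import math


def cosine_similarity(u, v, eps=1e-8):
    """Cosine of the angle between two equal-length vectors.

    eps guards against division by zero when either vector has zero length.
    """
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / max(norm_u * norm_v, eps)


# Hypothetical toy vectors; real word vectors have hundreds of components.
earthquake = [0.9, 0.1, 0.3]
tsunami = [0.8, 0.2, 0.4]
recipe = [-0.1, 0.9, -0.2]

print(cosine_similarity(earthquake, tsunami))  # close to 1: similar direction
print(cosine_similarity(earthquake, recipe))   # much lower: dissimilar direction
```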

Perhaps my enthusiasm for PyTorch isn’t all that surprising, as I do fit into one of their identified user profiles:

PyTorch has gotten its biggest adoption from researchers, and it’s gotten about a moderate response from data scientists. As we expected, we did not get any adoption from product builders because PyTorch models are not easy to ship into mobile, for example. We also have people who we did not expect to come on board, like folks from OpenAI and several universities.

Towards PyTorch 1.0

In this same August 2017 O’Reilly podcast (from which I extracted the above quote on user profiles), Facebook’s Soumith Chintala stated:

Internally at Facebook, we have a unified strategy. We say PyTorch is used for all of research and Caffe 2 is used for all of production. This makes it easier for us to separate out which team does what and which tools do what. What we are seeing is, users first create a PyTorch model. When they are ready to deploy their model into production, they just convert it into a Caffe 2 model, then ship into either mobile or another platform.

Perhaps it’s not entirely surprising then that the 1.0 release intends to “… marry PyTorch and Caffe2 which gives the production-level readiness for PyTorch.” My understanding is that researchers (and others) retain the highly favorable benefit of developing in PyTorch but then, via the new JIT compiler, acquire the ability to deploy into production via Caffe2 or “… [export] to C++-only runtimes for use in larger projects”; thus PyTorch 1.0’s production reach extends to runtimes other than just Python-based ones – e.g., those runtimes that drive iOS, Android and other mobile devices. With TensorFlow already having emerged as the ‘gorilla of all frameworks’, the productionizing choice in the implementation of PyTorch will be well received by Facebook and other proponents of Caffe2.

The productionization of PyTorch also includes:

  • A C++ frontend – “… a pure C++ interface to the PyTorch backend that follows the API and architecture of the established Python frontend …” that “… is intended to enable research in high performance, low latency and bare metal C++ applications.”
  • Distributed PyTorch enhancements – Originally introduced in version 0.2.0 of PyTorch, “… the torch.distributed package … allows you to exchange Tensors among multiple machines.” Otherwise a core-competence of distributed TensorFlow, this ability to introduce parallelism via distributed processing becomes increasingly important as Deep Learning applications and their workflows transition from prototypes into production – e.g., as the demands of training escalate. In PyTorch 1.0, use of a new library (“C10D”) is expected to significantly enhance performance, while asynchronously enabling communications – even when use is made of the familiar-to-HPC-types Message Passing Interface (MPI).
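
To illustrate why exchanging Tensors among machines matters for training, consider the averaging step at the heart of data-parallel training: each worker computes gradients on its own share of the data, and then every worker ends up with the mean of all of them. The following is a single-process simulation of that averaging (‘all-reduce’) step in plain Python; it does not call torch.distributed, the function name is my own, and the gradient values are made up for illustration.

```python
def all_reduce_mean(per_worker_grads):
    """Simulate an averaging all-reduce.

    per_worker_grads holds one equal-length list of floats per worker.
    Returns what each worker would hold afterwards: the element-wise mean.
    """
    n_workers = len(per_worker_grads)
    if n_workers == 0:
        return []
    length = len(per_worker_grads[0])
    if any(len(g) != length for g in per_worker_grads):
        raise ValueError("all workers must contribute the same number of values")
    mean = [sum(g[i] for g in per_worker_grads) / n_workers for i in range(length)]
    # Every worker receives its own copy of the same averaged result.
    return [list(mean) for _ in range(n_workers)]


# Two hypothetical workers, each with gradients for two parameters.
print(all_reduce_mean([[1.0, 2.0], [3.0, 4.0]]))  # [[2.0, 3.0], [2.0, 3.0]]
```

In a real deployment the lists live on different machines, so the cost of this step is dominated by communication; that is where a faster communications library, and asynchronous operation, would be expected to pay off.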

In May 2018, over on Facebook’s developer-centric blog, Bill Jia posted:

Over the coming months, we’re going to refactor and unify the codebases of both the Caffe2 and PyTorch 0.4 frameworks to deduplicate components and share abstractions. The result will be a unified framework that supports efficient graph-mode execution with profiling, mobile deployment, extensive vendor integrations, and more.

As of this writing, a release candidate for PyTorch 1.0 is available via GitHub.

Stable releases for previous versions are available for ‘local’ or cloud use.

Key Takeaway: Why PyTorch Now!

Whereas it might’ve been a no-brainer to adopt TensorFlow as your go-to framework for all of your Deep Learning needs, I found early releases of PyTorch to be an effective enabler over a year ago – when it was only at the 0.2.0 release stage! Fortunately, the team behind PyTorch has continued to advance the capabilities offered – capabilities that are soon to officially include production-ready distributed processing. If you’re unaware of PyTorch, or bypassed it in the past, it’s likely worth another look right now.

4DXing Your Procrastination with Lead Measures

When it comes to productivity, there’s a whole self-help industry, culture, etc., that’s established itself. No wonder: more than ever, many of us really do need ‘the help’ … especially when it comes to ‘regulars’ like procrastination – “… the avoidance of doing a task that needs to be accomplished.” From the highly academic to the exceedingly practical, procrastination rightly serves as the target for everything from research investigations to hacks. That being the case, there is hardly the need to add to this mountainous and at times unexpected terrain that today characterizes the landscape that is procrastination – except to share what might be a serendipitous intersection between procrastination and lead measures. For those seeking more-comprehensive resources on procrastination, you’ll need to look elsewhere (via Google for example), as our focus here needs to be squarely on this intersection.

Before making the procrastination connection, we need to ensure we’re on the same page when it comes to lead measures.

Lead Measures

As I understand them, based upon Cal Newport’s introduction in Deep Work, the notion of lead measures derives from McChesney et al.’s The 4 Disciplines of Execution. Briefly, and in my own words, the approach requires you shift your emphasis onto lead measures when chasing your Wildly Important Goals (WIGs) – i.e., it requires you to emphasize those behaviors that’ll allow you to ensure the success of your WIGs.

Suppose your WIG, a lag measure, is to find a job. A fine example of a lead measure then is networking. In other words, if you ‘merely’ ensure you network with three people per week for example, you’ll be executing a behavior that will contribute towards your WIG of finding a job. In fact, your lead measures with respect to networking could be even-more granular – for example:

Reach out to (say) 15 people in my LinkedIn network to secure introductions to people at employers of interest

The act of reaching out to those you know, via the enabling means provided by LinkedIn (for example), serves as an effective behavior for securing the introductions you require for networking. As an effective lead measure, it has an embarrassingly low activation energy. Whereas I’m not purporting to be an expert here on job seeking, networking, etc., I’m sure you get the gist of lead measures through this example. (In the case of networking, for example, one cannot overemphasize the importance of meeting in person – a behavior that has an even greater potential to serve as a lead measure, I’m sure many might argue.)

I shared another concrete example of a very different nature recently. At its core, the idea was to chase the WIG of propping up math skills in probability and statistics, for the purpose of engaging in the field of Deep Learning. Amongst the lead measures suggested, was that of tutoring prob & stats to high-school students as a vehicle for acquiring valuable knowledge and skills in this mathematical cornerstone for Deep Learning.

Whereas even a SMART’ly crafted WIG whose core objective is “getting a job” or “learning some math” has the potential to be overwhelming, teasing out lead measures results in things that are readily actionable … and that allows us to return to the matter of procrastination …

Turfing Procrastination

By employing the execution-discipline of acting on lead measures then, I believe we have herein the potential for a very different take on addressing the challenges and opportunities rendered through procrastination. With its execution focus that emphasizes the ‘right behaviors’, teasing out easily actionable lead measures is so much more than merely breaking down WIGs into digestible chunks – it’s a strategic approach that, when executed, has a much greater likelihood for success. In some respects then, lead measures serve as a foil for those lag measures corresponding to your WIG.

As the examples above illustrate, effective lead measures are the means through which lag measures that encapsulate your WIGs are achieved. Since we are concerned here specifically with the matter of procrastination, successful application of the approach described here rests upon striving for appropriately teased out lead measures. Stated simply, you’ll know you’ve ‘arrived’ at the right articulation of your lead measures when you can honestly state: “I can do that, right now”. In the examples used here, that would mean reaching out to LinkedIn contacts and booking math-tutoring sessions.

The process of ‘extracting’ those lead measures that best ensure the success of your WIG takes some concentrated effort and practice. For example, networking has been deemed by qualified others to be of value to job seekers at certain stages of their career; it may not be appropriate to others whose circumstances are quite different. Tutoring math to learn math will only be useful to those who have some experience both as a tutor and in math; it may be almost useless to others. In other words, the need for personalization is also critical in teasing out lead measures – and this is especially so when it comes to procrastination. So important is the matter of personalization to effectively crafted lead measures that I suspect the need for additional posts, courses/workshops, coaching, etc. Of course, this’ll demand an even deeper level of appreciation of the very nature of procrastination. For example, many have alluded to the emotional angles of procrastination. Owing to the inherently behavioral foundation of execution mediated via lead measures then, there exists an approach here that has the potential for mounting a targeted attack on procrastination at an emotional level! I remain optimistic then, that lead measures could become our superpower for addressing procrastination – as they have the potential to collectively serve as the foil for the lag measures that encapsulate our WIGs.

Accountable Scoreboards

The most-comprehensive template I’ve ever run across for ensuring the success of habits-oriented goals can be found in Michael Hyatt’s Your Best Year Ever. A less-comprehensive, yet effective ‘template’ is provided by your calendar – a fine tool for tracking your streaks. Popularized by comedian Jerry Seinfeld, streak tracking addresses the final two of four execution disciplines head on – namely the need to keep a compelling scoreboard, as well as a cadence of accountability.
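
Streak tracking on a calendar amounts to counting runs of consecutive days on which the behavior was carried out. The following is a minimal sketch of that count; the function name and the sample dates are my own, chosen purely for illustration.

```python
from datetime import date, timedelta


def longest_streak(days):
    """Return the length of the longest run of consecutive calendar days."""
    best = current = 0
    previous = None
    for day in sorted(set(days)):  # ignore duplicates and ordering
        if previous is not None and day - previous == timedelta(days=1):
            current += 1           # the chain continues
        else:
            current = 1            # the chain was broken; start again
        best = max(best, current)
        previous = day
    return best


# Hypothetical log: three days in a row, a missed day, then one more.
log = [date(2018, 10, 1), date(2018, 10, 2), date(2018, 10, 3), date(2018, 10, 5)]
print(longest_streak(log))  # 3
```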

To make use of our two examples above for one final time, tracking might translate to:

  • Logging weekly the number of times you’ve reached out to LinkedIn contacts for networking referrals, and then weekly logging the referrals acquired, and finally the (monthly) conversations held.
  • Logging weekly the number of math students tutored, and the time spent with each, over some period of time (e.g., a term, semester, session, course).

Such quantitative measures render your actions visible, without ambiguity. If your outcomes aren’t matching your expectations, you have at your fingertips the data to validate your hypotheses – e.g.:

Is reaching out to 15 LinkedIn contacts weekly producing the number (3) of referrals I need?

With evidence then, you can always re-examine your lead measures to ensure they are appropriately aligned with your encapsulation of your WIG via your lag measure.
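The LinkedIn question above can be put to a weekly log directly. This is a minimal sketch, assuming a hypothetical `Week` record and made-up numbers; only the targets of 15 contacts and 3 referrals come from the example.

```python
from dataclasses import dataclass

@dataclass
class Week:
    contacts_reached: int  # lead measure: a behavior you control
    referrals: int         # the result that behavior is meant to drive

def scoreboard(weeks, contact_target=15, referral_target=3):
    """Summarize whether the lead measure was executed, and whether it paid off."""
    on_target = [w for w in weeks if w.contacts_reached >= contact_target]
    paid_off = [w for w in on_target if w.referrals >= referral_target]
    return {
        "weeks_logged": len(weeks),
        "weeks_on_target": len(on_target),
        "weeks_paid_off": len(paid_off),
    }

# Hypothetical four-week log.
weeks = [Week(15, 3), Week(16, 2), Week(9, 1), Week(15, 4)]
summary = scoreboard(weeks)
print(summary["weeks_on_target"], summary["weeks_paid_off"])  # 3 2
```

In this made-up log, the target was met in three of four weeks but produced the needed referrals in only two of them, which is the kind of evidence that prompts a second look at the lead measure.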

It’s a no-brainer that professional athletes engage intensely in the habit of daily practice – and the best, NFL football’s Tom Brady for example, never stop! Even though he, his teammates and the coaching staff of the New England Patriots have collectively won five Super Bowl championships, Brady remains a ‘student of the game’, practising with the utmost intensity in an effort to repeat this feat for a record-setting sixth time. The same could be said of world-class musicians irrespective of musical genre.

Quantifying actions, whether through streak tracking on calendars or thorough templates, allows each of us to confront procrastination through the discipline of execution – thus elevating our game, our level of play, to the level of a professional. Stated differently, armed with the right lead measures, these final two steps ensure that failure is not an option – or that we’ll, at the very least, land a whole lot closer to our WIGs.

Key Takeaways

Having WIGs is great. However, achieving WIGs is better. The difference is in execution. By focusing on lead measures, your prospects for actually achieving your WIGs will be significantly enhanced in practice (literally). Appropriately teased out lead measures have the potential to significantly inhibit procrastination – by focusing on behaviors that you can implement immediately … behaviors you can make use of to evaluate your progress in an objective way. Net-net, the four disciplines of execution can be especially valuable to those of us ‘prone’ towards procrastination; they can become a highly effective mitigation strategy.

Bonus Takeaway

Committing to a WIG, or some aspect of a WIG, that can be achieved in about a month’s time can be extremely appealing. Whereas the focus of a 30-day effort may reduce symptoms of procrastination in some of us, interweaving lead measures into the context of an Agile Sprint is even more likely to drive most of us to the next level of achievement. For a concrete example, based upon the teaching-math-to-learn-math ‘use case’ above, please have a look at the Agile Sprints section in this previous post.

Towards Tsunami Informatics: Applying Machine Learning to Data Extracted from Twitter

2018 Sulawesi Earthquake & Tsunami

Even in 2018, our ability to provide accurate tsunami advisories and warnings is exceedingly challenged.

In best-case scenarios, advisories and warnings afford inhabitants of low-lying coastal areas minutes or (hopefully) longer to react.

In best-case scenarios, advisories and warnings are based upon in situ measurements via tsunameters – as ocean-bottom changes in seawater pressure serve as reliable precursors for impending tsunami arrival. (By way of analogy, tsunameters ‘see’ tsunamis much as radars ‘see’ precipitation. Based on ‘sight’ then, both offer a reasonable ability to ‘nowcast’.)

In typical scenarios, however, advisories and warnings can communicate mixed messages. In the case of the recent Sulawesi earthquake and tsunami for example, a nearby alert (for the Makassar Strait) was retracted after some 30 minutes, even though Palu, Indonesia experienced a ‘localized’ tsunami that resulted in significant losses – with current estimates placing the number of fatalities at more than 1200 people.

With ultimate regret stemming from significant loss of human life, the recent case for the residents of Palu is particularly painful, as alerting was not informed by tsunameter measurements owing to an ongoing dispute – an unresolved dispute that rendered the deployment of an array of tsunameters incomplete and inoperable. A dispute that, if resolved, could’ve provided this low-lying coastal area with accurate and potentially life-saving alerts.

Lessons from Past Events

It’s been only 5,025 days since the last tsunami affected Indonesia – the also devastating Boxing Day 2004 event in the Indian Ocean. All things considered, it’s truly wonderful that a strategic effort to deploy a network of tsunameters in this part of the planet was in place; of course, it’s well beyond tragic that execution of the project was significantly hampered, and that almost 14 years later, inhabitants of this otherwise idyllic setting are left to suffer loss of such epic proportions.

I’m a huge proponent of tsunameters as last-resort, yet-accurate indicators for tsunami alerting. In their absence, the norm is for advisories and warnings that may deliver accurate alerts – “may” being the operative word here, as it is often the case that alerts are issued only to be retracted at some future time … as was the case again for the recent Sulawesi event. Obviously, tsunami centers that ‘cry wolf’ run the risk of not being taken seriously, perhaps even when they have correctly predicted an event of some significance.

It’s not that those scientific teams of geographers, geologists, geophysicists, oceanographers and more are in any way lax in attempting to do their jobs; it’s truly that the matter of tsunami prediction is exceedingly difficult. For example, unless you caught the January 2006 issue of Scientific American as I happened to, you’d likely be unaware that 4,933 days ago an earthquake affected (essentially) the same region as the Boxing Day 2004 event; regarded as a three-month-later aftershock, this event of similar earthquake magnitude and tectonic setting did not result in a tsunami.
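The two day counts quoted so far can be cross-checked with simple date arithmetic. This is a minimal sketch; the calendar date used for the later earthquake (28 March 2005) is an assumption not stated in this post.

```python
from datetime import date

boxing_day_event = date(2004, 12, 26)
later_event = date(2005, 3, 28)  # assumed date of the 'three-month-later' earthquake

gap_in_days = (later_event - boxing_day_event).days
print(gap_in_days)   # 92
print(5025 - 4933)   # 92, the difference between the two counts quoted in the text
```

Under that assumption the two figures agree: the events are 92 days, or roughly three months, apart.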

Writing in this January 2006 issue of Scientific American, Geist et al. compared the two Indian Ocean events side-by-side – using one of those diagrams that this magazine is lauded for. The similarities between the two events are compelling. The seemingly subtle differences, however, are much more than compelling – as the tsunami-producing earlier of the two events bears testimony.

As a student of theoretical, global geophysics, but not specifically oceanography, seismology, tectonophysics or the like, I was unaware of the ‘shocking differences’ between these two events. However, my interest was captivated instantaneously!

Towards Tsunami Informatics

Graph Analytics?

It would take, however, some 3,000 days for my captivated interest to be transformed into a scientific communication. On the heels of successfully developing a framework and platform for knowledge representation with long-time friend and collaborator Jim Freemantle and others, our initial idea was to apply graph analytics to data extracted from Twitter – thus acknowledging that Twitter has the potential to serve as a source of data that might be of value in the context of tsunami alerting.

In hindsight, it’s fortunate that Jim and I did not spend a lot of time on the graph-analytics approach. In fact, arguably the most-valuable outcome from the poster we presented at a computer-science conference in June 2014 (HPCS, Halifax, Nova Scotia), was Jim’s Perl script (see, e.g., Listing 1 of our subsequent unpublished paper, or Listing 1.1 of our soon-to-be published book chapter) that extracted keyword-specified data (e.g., “#earthquake”) from Twitter streams.
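To illustrate the kind of keyword-specified extraction described here, the following is a minimal Python sketch. It is not Jim’s Perl script. It assumes tweets arrive as one JSON object per line with a `text` field, and the function name `extract_keyword_tweets` and the sample records are hypothetical.

```python
import json

def extract_keyword_tweets(lines, keyword="#earthquake"):
    """Yield tweets whose text mentions the keyword (case-insensitive)."""
    needle = keyword.lower()
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            tweet = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip malformed records rather than halting the stream
        if not isinstance(tweet, dict):
            continue
        if needle in str(tweet.get("text", "")).lower():
            yield tweet

# Hypothetical stream of newline-delimited JSON records.
sample = [
    '{"id": 1, "text": "Strong shaking felt downtown #Earthquake"}',
    '{"id": 2, "text": "Lunch was great today"}',
    'not json',
    '{"id": 3, "text": "Is a #tsunami possible after this #earthquake?"}',
]
matches = list(extract_keyword_tweets(sample))
print([t["id"] for t in matches])  # [1, 3]
```

Because the function is a generator, it can consume an open-ended stream one record at a time without holding the whole stream in memory.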

Machine Learning: Classification

About two years later, stemming from conversations at the March 2016 Rice University Oil & Gas Conference in Houston, our efforts began to emphasize Machine Learning over graph analytics. Driving for results to present at a May 2016 Big Data event at Prairie View A&M University (PVAMU, also in the Houston area), a textbook example (literally!) taken from the pages of an O’Reilly book on Learning Spark showed some promise in allowing Jim and me to classify tweets – with hammy tweets encapsulating something deemed geophysically interesting, whereas spammy ones not so much. ‘Not so much’ was determined through supervised learning – in other words, results reported were achieved after a manual classification of tweets for the purpose of training the Machine Learning models. The need for manual training, and the absence of semantics, struck the two of us as ‘lacking’ from the outset; more specifically, each tokenized word of each tweet was represented as a feature vector – stated differently, data and metadata (e.g., Twitter handles, URLs) were all represented with the same (lacking) degree of semantic expression. Based upon our experience with knowledge-representation frameworks, we immediately sought a semantically richer solution.
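The supervised ham/spam approach can be sketched in plain Python. This is a stand-in for illustration only, not the book’s Spark code nor the code behind our reported results: tokens are hashed into a fixed number of buckets to form term-frequency features, and a logistic-regression model is fitted by stochastic gradient descent. All names (`featurize`, `train`, `predict`), the bucket count and the six hand-labelled tweets are hypothetical.

```python
import math
import zlib

NUM_FEATURES = 4096  # size of the hashed feature space (illustrative)

def featurize(text):
    """Hash each whitespace token into a sparse term-frequency vector."""
    counts = {}
    for token in text.lower().split():
        idx = zlib.crc32(token.encode("utf-8")) % NUM_FEATURES
        counts[idx] = counts.get(idx, 0.0) + 1.0
    return counts

def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

def predict(model, text):
    """Return the modelled probability that a tweet is 'ham'."""
    weights, bias = model
    features = featurize(text)
    return sigmoid(bias + sum(weights[i] * v for i, v in features.items()))

def train(examples, epochs=200, lr=0.5):
    """Fit logistic-regression weights by stochastic gradient descent."""
    weights, bias = [0.0] * NUM_FEATURES, 0.0
    for _ in range(epochs):
        for text, label in examples:
            err = predict((weights, bias), text) - label  # log-loss gradient
            bias -= lr * err
            for i, v in featurize(text).items():
                weights[i] -= lr * err * v
    return weights, bias

# Manually labelled training tweets: 1 = ham (geophysically interesting), 0 = spam.
labelled = [
    ("magnitude 7.5 earthquake strikes near palu", 1),
    ("tsunami warning issued for coastal areas", 1),
    ("strong aftershock felt after earthquake", 1),
    ("win a free phone click here now", 0),
    ("follow me for daily discount codes", 0),
    ("check out my new music video", 0),
]
model = train(labelled)
print(predict(model, "earthquake and tsunami warning") > 0.5)
print(predict(model, "free discount codes click now") > 0.5)
```

Note that a Twitter handle or URL would be hashed into a bucket exactly like any other token, which is the lack of semantic distinction described above.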

Machine Learning: Natural Language Processing

It wasn’t until after I’d made a presentation at GTC 2017 in Silicon Valley the following year that the idea of representing words as embedded vectors would register with me. Working with Jim, I made two unconventional choices – namely, GloVe over word2vec and PyTorch over TensorFlow. Whereas academic articles justified our choice of Stanford’s GloVe, the case for PyTorch was made on less-rigorous grounds – grounds expounded in my GTC presentation and our soon-to-be published book chapter.

Our uptake of GloVe and PyTorch addressed our scientific imperative, as results were obtained for the 2017 instantiation of the same HPCS conference where this idea of tsunami alerting (based upon data extracted from Twitter) was originally hatched. In employing Natural Language Processing (NLP), via embedded word vectors, Jim and I were able to quantitatively explore tweets as word-based time series based upon their co-occurrences – stated differently, this word-vector quantification is based upon ‘the company’ (usage associations) that words ‘keep’. By referencing the predigested corpora available from the GloVe project, we were able to explore “earthquake” and “tsunami” in terms of distances, analogies and various kinds of similarities (e.g., cosine similarity).
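Cosine similarity between word vectors can be sketched in a few lines. The parser below assumes the plain-text layout in which each line holds a word followed by its vector components; the three-dimensional vectors are made up for illustration and are not values from the GloVe corpora. The names `parse_vectors`, `cosine_similarity` and `nearest` are hypothetical.

```python
import math

def parse_vectors(lines):
    """Parse lines of the form 'word v1 v2 ...' into a word -> vector mapping."""
    vectors = {}
    for line in lines:
        parts = line.split()
        if len(parts) < 2:
            continue
        vectors[parts[0]] = [float(x) for x in parts[1:]]
    return vectors

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def nearest(word, vectors, k=2):
    """Rank the other words by cosine similarity to `word`, most similar first."""
    target = vectors[word]
    ranked = sorted(
        ((cosine_similarity(target, vec), other)
         for other, vec in vectors.items() if other != word),
        reverse=True,
    )
    return [other for _, other in ranked[:k]]

# Made-up 3-dimensional vectors, for illustration only.
toy = [
    "earthquake 0.9 0.8 0.1",
    "tsunami 0.8 0.9 0.2",
    "aftershock 0.9 0.7 0.0",
    "pizza 0.0 0.1 0.9",
]
vectors = parse_vectors(toy)
print(nearest("earthquake", vectors))  # ['aftershock', 'tsunami']
```

With real pretrained vectors the same functions apply unchanged; only the dimensionality and the vocabulary grow.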

Event-Reanalysis Examples

Our NLP approach appeared promising enough that we closed out 2017 with a presentation of our findings to date during an interdisciplinary session on tsunami science at the Fall Meeting of the American Geophysical Union held in New Orleans. To emphasize the scientific applicability of our approach, Jim and I focused on reanalyzing two pairs of events (see Slide 10 here). Like the pair identified years previously in the 2006 Scientific American article, the more-recent event pairs we chose included earthquake-only plus tsunamigenic events originating in close geographic proximity, with similar oceanic and tectonic settings.

The most-promising results we reported (see slides 11 and 12 here and below) involved those cosine similarities obtained for earthquake-only versus tsunamigenic events; evident via clustering, the approach appears able to discriminate between the two classes of events based upon data extracted from Twitter. Even in our own estimation however, the clustering is weakly discriminating at best, and we expect to apply more-advanced approaches for NLP to further separate classes of events.

[Embedded slides: 2017 AGU Fall Meeting, Twitter Tsunami, December 8, 2017]

Discussion

Ultimately, the ability to further validate and operationally deploy this alerting mechanism would require the data from Twitter be streamed and processed in real time – a challenge that some containerized implementation of Apache Spark would seem ideally suited to, for example. (Aspects of this Future Work are outlined in the final section of our HPCS 2017 book chapter.)
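The flavor of real-time processing involved can be sketched without Spark. The class below keeps a sliding-window count of keyword mentions as tweets arrive; it is a single-process illustration of windowed aggregation, not a Spark implementation. The name `SlidingWindowCounter`, the 60-second window and the sample stream are hypothetical, and timestamps are assumed to arrive in non-decreasing order.

```python
from collections import deque

class SlidingWindowCounter:
    """Count keyword mentions seen within the last `window_seconds`."""

    def __init__(self, keyword, window_seconds=60):
        self.keyword = keyword.lower()
        self.window_seconds = window_seconds
        self._hits = deque()  # timestamps of matching tweets, oldest first

    def observe(self, timestamp, text):
        """Record one tweet and return the current in-window count."""
        if self.keyword in text.lower():
            self._hits.append(timestamp)
        # Drop matches that have aged out of the window.
        while self._hits and self._hits[0] <= timestamp - self.window_seconds:
            self._hits.popleft()
        return len(self._hits)

# Hypothetical stream of (seconds since start, tweet text).
stream = [
    (0, "Whoa #earthquake!"),
    (10, "Lunch was great today"),
    (20, "#Earthquake felt across the city"),
    (70, "Still shaking #earthquake"),
    (200, "Quiet now"),
]
counter = SlidingWindowCounter("#earthquake", window_seconds=60)
counts = [counter.observe(ts, text) for ts, text in stream]
print(counts)  # [1, 1, 2, 2, 0]
```

A sudden rise in the windowed count is the kind of signal a streaming deployment would need to surface within seconds.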

When it comes to tsunamis, alerting remains a challenge – especially in those parts of the planet under-serviced by networks of tsunameters … and even seismometers, tide gauges, etc. Thus prospects for enhancing alerting capabilities remain valuable and warranted. Even though inherently fraught with subjectivity, data extracted from Twitter streams in real time appears to hold some promise as a data source that complements the objective output from scientific instrumentation. Our approach, based upon Machine Learning via NLP, has demonstrated promising-enough early signs of success that ‘further research is required’. Given that this initiative has already benefited from useful discussions at conferences, suggestions are welcome, as it’s clear that even NLP has a lot more to offer beyond embedded word vectors.