Ian Lumb’s Cloud Computing Portfolio

When I first introduced my Data Science Portfolio, it made sense to me (at the time, at least!) to divide it into two parts; the latter part was “… intended to showcase those efforts that have enabled other Data Scientists” – in other words, my contributions as a Data Science Enabler.

As of today, most of what was originally placed in that latter part of my Data Science Portfolio has been transferred to a new portfolio – namely one that emphasizes Cloud computing. Thus my Cloud Computing Portfolio is a self-curated, online, multimedia effort intended to draw my efforts in Cloud computing together into a cohesive whole; specifically, this new Portfolio is organized as follows:

  • Strictly Cloud – A compilation of contributions in which Cloud computing takes centerstage
  • Cloud-Related – A compilation of contributions ranging from clusters and grids to miscellany. This section also draws out contributions relating to containerization.

As with my Data Science Portfolio, you’ll find in my Cloud Computing Portfolio everything from academic articles and book chapters, to blog posts, to webinars and conference presentations – in other words, this Portfolio also lives up to its multimedia billing!

Like my Data Science Portfolio, this is intentionally a work in progress, so feedback is always welcome – revisions will definitely be applied!

Data Science: Identifying My Professional Bias

In the Summer of 1984, I arrived at Toronto’s York University as a graduate student in Physics & Astronomy. (Although my grad programme was Physics & Astronomy, my research emphasized the application of fluid dynamics to Earth’s deep interior.) Some time after that, I ran my first non-interactive computation on a cluster of VAX computers. I’m not sure whether this was my first exposure to Distributed Computing; I am, however, fairly certain that this was the first time it (Distributed Computing) registered with me as something exceedingly cool, and exceedingly powerful.

Even back in those days, armed with nothing more than a VT100 terminal ultimately connected to a serial interface on one of the VAXes, I could be logged in and able to submit a computational job that might run on some other VAX participating in the cluster. The implied connectedness, the innate ability to make use of compute cycles on some ‘remote’ system was intellectually intoxicating – and I wasn’t even doing any parallel computing (yet)!

More than a decade later, while serving in a staff role as a computer coordinator, I became involved in procuring a modest supercomputer for those members of York’s Faculty of Pure & Applied Science who made High Performance Computing (HPC) a critical component of their research. If memory serves me correctly, this exercise resulted in the purchase of a NUMA-architecture system from SGI powered by MIPS CPUs. Though isolated initially, Platform LSF was included as a component of the overall solution to manage the computational workloads that would soon consume the resources of this SGI system.

The more I learned about Platform LSF, the more I was smitten by the promise and reality of Distributed Computing – a capability to be leveraged from a resource-centric perspective with this Load Sharing Facility (LSF). [Expletive deleted], Platform founder Songnian Zhou expressed the ramifications of his technical vision for this software as Utopia in a 1993 publication. Although buying the company wasn’t an option, I did manage to be hired by Platform, and work there in various roles for about seven-and-a-half years.

Between my time at Platform (now an IBM company) and, much more recently, Univa, over a decade of my professional experience has been spent focused on managing workloads in Distributed Computing environments. From a small handful of VAXes to core counts that have reached seven figures, these environments have included clusters, grids and clouds.

My professional bias towards Distributed Computing was further enhanced through the experience of being employed by two software vendors who emphasized the management of clusters – namely Scali (Scali Manage) and subsequently Bright Computing (Bright Cluster Manager). Along with Univa (Project Tortuga and Navops Launch), Bright extended their reach to the management of HPC resources in various cloud configurations.

If it weren’t for a technical role at Allinea (subsequently acquired by ARM), I might have ended up ‘stuck in the middle’ of the computational stack – as workload and cluster management is regarded by the HPC community (at least) as middleware … software that exists between the operating environment (i.e., the compute node and its operating system) and the toolchain (e.g., binaries, libraries) that ultimately supports applications and end users (e.g., Figure 5 here).

Allinea focuses on tools that enable HPC developers. Although they were in the process of broadening their product portfolio to include a profiling capability around the time of my departure, during my tenure there the emphasis was on a debugger – a debugger capable of handling code targeted for (you guessed it) Distributed Computing environments.

Things always seemed so much bigger when we were children. Whereas Kid Ian was impressed by a three-node VAX cluster, and later ‘blown away’ by a modest NUMA-architecture ‘supercomputer’, Adult Ian had the express privilege of running Allinea DDT on some of the largest supercomputers on the planet (at the time) – tracking down a bug that only showed up when more than 20K cores were used in parallel on one of Argonne’s Blue Genes, and demonstrating scalable, parallel debugging during a tutorial on some 700K cores of NCSA’s Blue Waters supercomputer. In hindsight, I can’t help but feel humbled by the impressive capability of Allinea DDT to scale to these extremes. Because HPC’s appetite for scale has extended beyond tera and petascale capabilities, and is seriously eyeing the demand to perform at the exascale, software like Allinea DDT also needs to match this penchant for extreme scale.

At this point, suffice it to say that scalable Distributed Computing has been firmly encoded into my professional DNA. As with my scientifically based academic bias, it’s difficult not to frame my predisposition towards Distributed Computing in a positive light within the current context of Data Science. Briefly, it’s a common experience for the transition from prototype to production to include the introduction of Distributed Computing – if only to execute applications and/or their workflows on more powerful computers, or perhaps to scale them out in parallel at the same time.
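To make that prototype-to-production step a little more concrete, below is a minimal, hypothetical sketch in Python – the function, data and worker count are placeholders of my own invention, not drawn from any project mentioned here. The same per-item analysis is run first serially, then fanned out across local cores with concurrent.futures, which is the simplest flavour of the parallel scaling alluded to above.

```python
# A minimal, hypothetical prototype-to-production sketch: the same per-item
# analysis, first run serially, then scaled out across local cores.
from concurrent.futures import ProcessPoolExecutor


def analyse(n: int) -> int:
    """Placeholder for a compute-intensive, per-item analysis."""
    return sum(i * i for i in range(n))


def run_serial(items):
    # The prototype: one item after another, on a single core.
    return [analyse(item) for item in items]


def run_parallel(items, workers=4):
    # The production-leaning variant: the same function, distributed across
    # local cores; a cluster- or cloud-backed executor could be swapped in
    # without touching analyse() itself.
    with ProcessPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(analyse, items))


if __name__ == "__main__":
    data = [100_000, 200_000, 300_000, 400_000]
    assert run_serial(data) == run_parallel(data)
    print("Serial and parallel runs agree.")
```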

I anticipate the need to return to this disclosure regarding the professional bias I bring to Data Science. For now though, calling out the highly influential impact Distributed Computing has had on my personal trajectory appears warranted within the context of my Data Science Portfolio.

Foraging for Resources in the Multicore Present and Future

HPC consultant Wolfgang Gentzsch has thoughtfully updated the case for multicore architectures in the HPC context. Over on LinkedIn, via one of the HPC discussion groups, I responded with:

I also enjoyed your article, Wolfgang – thank you. Notwithstanding the drive towards cluster-on-a-chip architectures, HPC customers will require workload managers (WLMs) that interface effectively and efficiently with O/S-level features/functionalities (e.g., MCOPt Multicore Manager from eXludus for Linux, to re-state your example). To me, this is a need well evidenced in the past: For example, various WLMs were tightly integrated with IRIX’s cpuset functionality (http://www.sgi.com/products/software/irix/releases/irix658.html) to allow for topology-aware scheduling in this NUMA-based offering from SGI. In present and future multicore contexts, the appetite for petascale and exascale computing will drive the need for such WLM-O/S integrations. In addition to the multicore paradigm, what makes ‘this’ future particularly interesting is that some of these multicore architectures will exist in a hybrid (CPU/GPU) cloud – a cloud that may complement in-house resources via some bursting capability (e.g., Bright’s cloud bursting, http://www.brightcomputing.com/Linux-Cluster-Cloud-Bursting.php). As you also well indicated in your article, it is incumbent upon all stakeholders to ensure that this future is as friendly as possible (e.g., for developers and users). To update a phrase originally spun by Herb Sutter (http://www.gotw.ca/publications/concurrency-ddj.htm) in the multicore context, not only is the free lunch over, it’s getting tougher to find and ingest lunches you’re willing to pay for!
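To give a flavour of the WLM–O/S integration point raised above, here is a small, Linux-only Python sketch of my own – it is not MCOPt, cpusets or any workload manager’s actual code. It simply pins the calling process to a chosen set of cores, which is the kind of topology-aware placement decision a workload manager would want to make through the operating system.

```python
# A hypothetical, Linux-only illustration of OS-level core placement --
# the kind of decision a topology-aware workload manager might make.
import os


def pin_to_cores(cores):
    """Restrict the calling process (and future children) to the given cores."""
    os.sched_setaffinity(0, set(cores))  # 0 denotes the calling process
    return os.sched_getaffinity(0)


if __name__ == "__main__":
    print("Allowed cores before:", sorted(os.sched_getaffinity(0)))
    print("Allowed cores after: ", sorted(pin_to_cores([0, 1])))
```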

We certainly live in interesting times!

Platform Acquires Scali Manage

From the joint release:

Platform Computing announced today it has acquired the Scali Manage business from Massachusetts-based Scali Inc. Scali Manage is an integrated and flexible High Performance Computing (HPC) cluster management and monitoring system. This strategic acquisition supports Platform’s vision to be the partner of choice for HPC infrastructure software worldwide. The Scali Manage product complements Platform’s existing HPC offerings and extends Platform’s products’ cluster and grid management capabilities.

As someone who worked for both companies, I can honestly state that this really does sound like a win-win outcome.

Scali has chosen to focus on its industry-leading MPI product.

Platform has broadened its cluster-management offering in a very significant way.

I remain a huge fan of Scali Manage more than a year after my departure from Scali.

Why?

Scali Manage is standards-based.

To appreciate the depth of this statement, please read my blog post from March 2006.

Moreover, Scali Manage is likely still the only software that can make this claim. Yes, there are open source offerings. But none of these are based on open standards like WBEM and Eclipse.
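For readers who haven’t encountered WBEM, the following is a small, hypothetical sketch using the open source pywbem library – it is not Scali Manage code, and the host name and credentials are placeholders – showing what a standards-based management query looks like: CIM instances are enumerated over a WBEM connection rather than through a vendor-specific API.

```python
# A hypothetical WBEM/CIM query via the open source pywbem library --
# illustrative of a standards-based management interface, not of Scali Manage.
import pywbem


def list_operating_systems(host, user, password):
    """Enumerate CIM_OperatingSystem instances exposed by a WBEM server."""
    conn = pywbem.WBEMConnection(
        f"https://{host}",            # WBEM server URL (placeholder host)
        (user, password),             # credentials (placeholders)
        default_namespace="root/cimv2",
    )
    for instance in conn.EnumerateInstances("CIM_OperatingSystem"):
        print(instance["Name"])


if __name__ == "__main__":
    list_operating_systems("mgmt.example.com", "admin", "secret")
```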

With people and technology transferring from Scali to Platform, I expect a very rosy future for Scali Manage.

Grid Computing’s Identity Crisis

Hanoch Eiron, Open Grid Forum (OGF) vice president of marketing, recently contributed a special feature to GRIDtoday. Even though Eiron’s contribution spans a mere three paragraphs, there is ample content to comment on.

Eiron opens with:

Let’s face it — the Grid hype by commercial vendors in the past few years was premature. Some would say that it has actually slowed the development of grids as it created customer expectations that could not be met.

IBM’s arrival on the Grid Computing scene, publicly marked by their endorsement of the Open Source Globus Toolkit, signified the dawn of vendor-generated hype. However, long before IBM sought to paint Grid Computing blue, it was Global Grid Forum (GGF) and Globus Project representatives who were the source of hype. Back in those BBB (Before Big Blue) days, academic gridders evangelized that Grid Computing represented the next phase in the ongoing evolution of Distributed Computing. And specifically with respect to Grid Computing standards and the Globus Toolkit:

This evolution in standards has wreaked havoc on the implementation front. For example, in moving from Versions 2 (protocol-specific implementation based on FTP, HTTP, LDAP, etc.) to 3 (introduction of Web services via OGSI) to 4 (refinement of previously introduced OGSI Web Services to WS-RF), the Open Source Globus Toolkit has undergone significant changes. When such changes break forward-compatibility in subsequent versions of the software, standards evolution becomes an impediment to adoption.

For a specific example, consider CERN’s gamble with Grid Computing:

The standards flux, which resulted in evolving variants of the Globus Toolkit, caused CERN and its affiliates some grief for at least two reasons.

  • First, projects like the LHC require significant advance planning. Evolving standards and implementations make advance planning even more challenging, and the allusions to gambling quite appropriate.
  • Second, despite the fact that CERN’s primary activity is academic research, CERN needs to provide a number of production-quality services. Again, such service levels are difficult to deliver on when standards and implementations are in a state of continuous change.

In other words, it’s not just vendors who have been guilty of hype and over-promising on deliverables.

Later in his first paragraph, Eiron states: “… it is clear that from a public perception standpoint, grids are now in a trough.” I couldn’t agree more. As the recent GridWorld event has ably demonstrated, considerable confusion exists about Grid Computing. Newbies, early adopters and even the Griderati are uncomfortable with the term, unclear on what it means and how it fits into the broader context of clustering, cyberinfrastructure, Distributed Computing, High Performance Computing (HPC), Service Oriented Architecture (SOA), Utility Computing, virtualization, Web Services, etc. (That adaptive enterprise and autonomic computing don’t receive much play is of mild consolation.) Grid Computing is in a trough because it is suffering from a serious identity crisis. Fortunately, Eiron and OGF are not in denial, and have plans to address this situation.

Eiron refers to Grid Computing’s latest poster child, eBay. And although I haven’t had the benefit of a deep dive on the technical aspects of the eBay Grid, I expect it to be a grid more in positioning than substance. In a GRIDtoday Q&A with Paul Strong, distinguished research scientist at eBay Research Labs, there is evidence of cluster-level workload management, clustered databases, farms of Web servers, and other examples of Distributed Computing technologies. However, nothing that Strong discusses seems that griddy. All of this echoes what I wrote previously in a GRIDtoday article:

The highest-profile demonstrations of Grid computing run the risk of trivializing Grid computing. It may seem harsh to paint the well-intentioned World Community Grid as technologically trivial, but in terms of full disclosure, this is not the most sophisticated demonstration of Grid computing. Equally damaging are those clustered applications (like Oracle 10g) that masquerade as Grid-enabled. Taking such license serves only to confuse and dilute the very essence of Grid computing.

Eiron’s own words serve well in summing up here:

It is clear that the community needs to do a better job of explaining the role of grids within the landscape of close and perhaps somewhat overlapping technologies, such as virtualization, service-oriented architecture (SOA), automation, etc. The Grid community also needs to better articulate how the architectures, industry standards and products can help customers reap the benefits of grids. It can use the perception trough as an opportunity to re-group and create a solid story that can be delivered upon, or morph into something else. It seems that much of the influence on how things will evolve is now in the Grid community’s own hands.

Of course, only time will tell if this window of opportunity is still open, and if the Grid Computing community is able to capitalize on it.

Grid: Early Adopters Prefer A Better Term

In a recent GRIDtoday article, William Fellows, a savvy principal analyst with The 451 Group, states:

When asked, 70 percent of early adopters who responded to a survey said there is a better term than “Grid” to describe their distributed computing architectures: 23 percent said virtualization, 23 percent said HPC, 19 percent said utility computing, 19 percent said clustering, and 15 percent said SOA.

Sadly, this serves only to underline much of what I’ve been blogging about lately.