Data Science: Identifying My Professional Bias

In the Summer of 1984, I arrived at Toronto’s York University as a graduate student in Physics & Astronomy. (Although my grad programme was Physics & Astronomy, my research emphasized the application of fluid dynamics to Earth’s deep interior.) Some time after that, I ran my first non-interactive computation on a cluster of VAX computers. I’m not sure if this was my first exposure to Distributed Computing; I am, however, fairly certain that this was the first time Distributed Computing registered with me as something exceedingly cool, and exceedingly powerful.

Even back in those days, armed with nothing more than a VT100 terminal ultimately connected to a serial interface on one of the VAXes, I could log in and submit a computational job that might run on some other VAX participating in the cluster. The implied connectedness, the innate ability to make use of compute cycles on some ‘remote’ system, was intellectually intoxicating – and I wasn’t even doing any parallel computing (yet)!

More than a decade later, while serving in a staff role as a computer coordinator, I became involved in procuring a modest supercomputer for those members of York’s Faculty of Pure & Applied Science who made High Performance Computing (HPC) a critical component of their research. If memory serves me correctly, this exercise resulted in the purchase of a NUMA-architecture system from SGI powered by MIPS CPUs. Although the SGI system was initially isolated, Platform LSF was included as a component of the overall solution to manage the computational workloads that would soon consume its resources.

The more I learned about Platform LSF, the more I was smitten by the promise and reality of Distributed Computing – a capability to be leveraged from a resource-centric perspective with this Load Sharing Facility (LSF). [Expletive deleted], Platform founder Songnian Zhou expressed the ramifications of his technical vision for this software as Utopia in a 1993 publication. Although buying the company wasn’t an option, I did manage to get hired by Platform, and worked there in various roles for about seven-and-a-half years.

Between my time at Platform (now an IBM company) and, much more recently, Univa, over a decade of my professional experience has been spent focused on managing workloads in Distributed Computing environments. From a small handful of VAXes to core counts that have reached seven figures, these environments have included clusters, grids and clouds.

My professional bias towards Distributed Computing was further enhanced through the experience of being employed by two software vendors that emphasized the management of clusters – namely Scali (Scali Manage) and subsequently Bright Computing (Bright Cluster Manager). Like Univa (Project Tortuga and Navops Launch), Bright extended its reach to the management of HPC resources in various cloud configurations.

If it weren’t for a technical role at Allinea (subsequently acquired by ARM), I might have ended up ‘stuck in the middle’ of the computational stack – as workload and cluster management is regarded by the HPC community (at least) as middleware: software that exists between the operating environment (i.e., the compute node and its operating system) and the toolchain (e.g., binaries, libraries) that ultimately supports applications and end users (e.g., Figure 5 here).

Allinea focused on tools to enable HPC developers. Although the company was broadening its product portfolio to include a profiling capability around the time of my departure, during my tenure the emphasis was on a debugger – a debugger capable of handling code targeted for (you guessed it) Distributed Computing environments.

Things always seemed so much bigger when we were children. Whereas Kid Ian was impressed by a three-node VAX cluster, and later ‘blown away’ by a modest NUMA-architecture ‘supercomputer’, Adult Ian had the privilege of running Allinea DDT on some of the largest supercomputers on the planet (at the time) – tracking down a bug that only showed up when more than 20K cores were used in parallel on one of Argonne’s Blue Genes, and demonstrating scalable, parallel debugging during a tutorial on some 700K cores of NCSA’s Blue Waters supercomputer. In hindsight, I can’t help but feel humbled by Allinea DDT’s capability to scale to these extremes. Because HPC’s appetite for scale has extended beyond tera- and petascale capabilities, and is seriously eyeing the demands of exascale, software like Allinea DDT must also match this penchant for extreme scale.

At this point, suffice it to say that scalable Distributed Computing has been firmly encoded into my professional DNA. As with my scientifically based academic bias, it’s difficult not to frame my predisposition towards Distributed Computing in a positive light within the current context of Data Science. Briefly, it’s a common experience for the transition from prototype to production to include the introduction of Distributed Computing – if only to execute applications and/or their workflows on more powerful computers, or perhaps to simultaneously scale these in parallel.

I anticipate the need to return to this disclosure regarding the professional bias I bring to Data Science. For now, though, calling out the highly influential impact Distributed Computing has had on my personal trajectory appears warranted within the context of my Data Science Portfolio.

Over “On the Bright side …”: Cloud Use Cases from Bio-IT World

As some of you know, I’ve recently joined Bright Computing.

Last week, I attended Bio-IT World 2013 in Boston. Bright had an excellent show – lots of great conversations, and even an award!

During numerous conversations, the notion of extending on-site IT infrastructure into the cloud was raised. Bright has an excellent solution for this.

What also emerged during the conversations were two use cases for this extension of local IT resources via the cloud. I thought this was worth capturing and sharing. You can read about the use cases I identified over at “On the Bright side …”.

Foraging for Resources in the Multicore Present and Future

HPC consultant Wolfgang Gentzsch has thoughtfully updated the case for multicore architectures in the HPC context. Over on LinkedIn, via one of the HPC discussion groups, I responded with:

I also enjoyed your article, Wolfgang – thank you. Notwithstanding the drive towards cluster-on-a-chip architectures, HPC customers will require workload managers (WLMs) that interface effectively and efficiently with O/S-level features/functionalities (e.g., the MCOPt Multicore Manager from eXludus for Linux, to re-state your example). To me, this is a need well evidenced in the past: for example, various WLMs were tightly integrated with IRIX’s cpuset functionality (http://www.sgi.com/products/software/irix/releases/irix658.html) to allow for topology-aware scheduling in this NUMA-based offering from SGI. In present and future multicore contexts, the appetite for petascale and exascale computing will drive the need for such WLM-O/S integrations. In addition to the multicore paradigm, what makes ‘this’ future particularly interesting is that some of these multicore architectures will exist in a hybrid (CPU/GPU) cloud – a cloud that may complement in-house resources via some bursting capability (e.g., Bright’s cloud bursting, http://www.brightcomputing.com/Linux-Cluster-Cloud-Bursting.php). As you also well indicated in your article, it is incumbent upon all stakeholders to ensure that this future is as friendly as possible (e.g., for developers and users). To update a phrase originally spun by Herb Sutter (http://www.gotw.ca/publications/concurrency-ddj.htm) in the multicore context: not only is the free lunch over, it’s getting tougher to find and ingest lunches you’re willing to pay for!
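
To make this kind of WLM–O/S integration concrete, here’s a minimal sketch of topology-aware job placement via Linux cpusets – the present-day analogue of the IRIX feature noted above. It assumes a cgroup-v1 cpuset controller mounted at /sys/fs/cgroup/cpuset, root privileges, and illustrative core/node values; it is not any particular WLM’s implementation.

```python
import os

CPUSET_ROOT = "/sys/fs/cgroup/cpuset"  # assumes a cgroup-v1 cpuset mount (illustrative)

def make_cpuset(name: str, cpus: str, mems: str) -> str:
    """Create a named cpuset confined to the given CPUs and NUMA memory nodes."""
    path = os.path.join(CPUSET_ROOT, name)
    os.makedirs(path, exist_ok=True)
    with open(os.path.join(path, "cpuset.cpus"), "w") as f:
        f.write(cpus)   # e.g. "0-3": four cores on one socket
    with open(os.path.join(path, "cpuset.mems"), "w") as f:
        f.write(mems)   # e.g. "0": memory from the same NUMA node only
    return path

def bind_job(path: str, pid: int) -> None:
    """Attach a dispatched job's process to the cpuset (topology-aware placement)."""
    with open(os.path.join(path, "tasks"), "w") as f:
        f.write(str(pid))

# A WLM dispatching (hypothetical) job 1234 onto NUMA node 0 might do:
# bind_job(make_cpuset("job-1234", "0-3", "0"), 1234)
```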

We certainly live in interesting times!

IBM-Acquired Platform: Plan for Sustained, Partner-Friendly HPC Innovation Required

Over on LinkedIn, there’s an interesting discussion taking place in the “High Performance & Super Computing” group on the recently announced acquisition of Markham-based Platform Computing by IBM. My comment (below) was stimulated by concerns regarding the implications of this acquisition for IBM’s traditional competitors (i.e., other system vendors such as Cray, Dell, HP, etc.):

It could be argued:

“IBM groks vendor-neutral software and services (e.g., IBM Global Services), and therefore coopetition.”

At face value then, it’ll be business-as-usual for IBM-acquired Platform – and therefore its pre-acquisition partners and customers.

While business-as-usual plausibly applies to porting Platform products to offerings from IBM’s traditional competitors, I believe the sensitivity to the new business relationship (Platform as an IBM business unit) escalates rapidly for any solution that has value in HPC.

Why?

To deliver a value-rich solution in the HPC context, Platform has to work (extremely) closely with the ‘system vendor’. In many cases, this closeness requires that Intellectual Property (IP) of a technical and/or business nature be communicated – often well before solutions are introduced to the marketplace and made available for purchase. Thus Platform’s new status as an IBM entity has the potential to seriously complicate matters regarding risk, trust, etc., relating to the exchange of IP.

Although it’s been stated elsewhere that IBM will allow Platform measures of post-acquisition independence, I doubt that this’ll provide sufficient comfort for matters relating to IP. While NDAs specific to the new (and independent) Platform business unit within IBM may offer some measure of additional comfort, I believe that technically oriented approaches offer the greatest promise for mitigating concerns relating to risk, trust, etc., in the exchange of IP.

In principle, one possibility is the adoption of open standards by all stakeholders. Such standards hold the promise of allowing for integration between products via documented interfaces and protocols, while allowing (proprietary) implementation specifics to remain opaque. Although this may sound appealing, the availability of such standards remains elusive – despite various well-intended efforts (by the HPC, Grid, Cloud, etc., communities).
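
As a concrete illustration of what such a standard can offer, the Grid community’s DRMAA (Distributed Resource Management Application API) lets an application submit and monitor jobs through a documented interface while each vendor’s implementation remains opaque. Here’s a minimal sketch, assuming the drmaa Python bindings and a DRMAA-enabled cluster (e.g., one running Grid Engine); the submitted command is purely illustrative.

```python
import drmaa

# Submit a job through the vendor-neutral DRMAA interface; the underlying
# workload manager's implementation specifics remain opaque to this code.
with drmaa.Session() as session:
    jt = session.createJobTemplate()
    jt.remoteCommand = "/bin/sleep"   # illustrative payload
    jt.args = ["10"]
    job_id = session.runJob(jt)
    print("Submitted job", job_id)

    # Block until completion; the same call works on any DRMAA-compliant WLM.
    info = session.wait(job_id, drmaa.Session.TIMEOUT_WAIT_FOREVER)
    print("Job", info.jobId, "exited with status", info.exitStatus)

    session.deleteJobTemplate(jt)
```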

While Platform’s traditional competitors will predictably and understandably spread FUD, it obviously behooves both Platform and IBM to expend some effort allaying the concerns of their customers and partner ecosystem.

I’d be interested to hear of others’ suggestions as to how this new business relationship might allow for sustained innovation in the HPC context from IBM-acquired Platform.

Disclaimer: Although I do not have a vested financial interest in this acquisition, I did work for Platform from 1998-2005.

To reiterate here then:

How can this new business relationship allow for sustained, partner-friendly innovation in the HPC context from IBM-acquired Platform?

Please feel free to share your thoughts on this via comments to this post.

Multi-Touch Computational Steering

About 1:35 into his TED talk, Jeff Han impressively demonstrates a lava-lamp application on a multi-touch user interface.

Having spent considerable time in the past pondering the fluid dynamics (e.g., convection) of the Earth’s atmosphere and deep interior (i.e., mantle and core), Han’s demonstration immediately triggered a scientific use case: Is it possible to computationally steer scientific simulations via multi-touch user interfaces?

A quick search via Google returns almost 20,000 hits … In other words, I’m likely not the first to make this connection 😦

In my copious spare time, I plan to investigate further …
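
In the meantime, here’s a minimal sketch of what multi-touch computational steering might look like in code. Everything in it is hypothetical – the toy ConvectionSim class, the parameter mappings, and the event queue that a real multi-touch toolkit would populate from hardware.

```python
import queue

class ConvectionSim:
    """Toy stand-in for a mantle-convection solver with steerable parameters."""
    def __init__(self) -> None:
        self.rayleigh = 1.0e6   # vigour of convection
        self.viscosity = 1.0    # dimensionless viscosity contrast

    def step(self) -> None:
        pass  # advance the solution one timestep (solver omitted)

# A real multi-touch toolkit would populate this queue from hardware events.
touch_events: queue.Queue = queue.Queue()

def steer(sim: ConvectionSim) -> None:
    """Drain pending touches and map normalized (x, y) onto parameters."""
    while not touch_events.empty():
        x, y = touch_events.get_nowait()
        sim.rayleigh = 10 ** (4 + 4 * x)   # x in [0, 1] -> Ra in [1e4, 1e8]
        sim.viscosity = 0.1 + 9.9 * y      # y in [0, 1] -> viscosity in [0.1, 10]

def main_loop(sim: ConvectionSim, steps: int = 100) -> None:
    for _ in range(steps):
        steer(sim)  # apply the user's latest input between timesteps...
        sim.step()  # ...then advance the (steered) simulation

touch_events.put((0.5, 0.25))  # simulated touch at mid-screen
main_loop(ConvectionSim())
```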

Also of note is how this connection was made: a friend sent me a link to an article on Apple’s anticipated tablet product. Since so much of the anticipation of the Apple offering relates to the user interface, it’s not surprising that reference was made to Jeff Han’s TED talk (above). Cool.

If you have any thoughts to share on multi-touch computational steering, please feel free to chime in.

One more thought … I would imagine that the gaming industry would be quite interested in such a capability – if it isn’t already!

Evolving Semantic Frameworks into Platforms: Unpublished ms.

I learned yesterday that the manuscript I submitted to HPCS 2008 was not accepted 😦
It may take my co-authors and me some time to revise and re-submit this manuscript.
This anticipated re-submission latency, along with the fact that we believe the content needs to be shared in a timely fashion, provides the motivation for sharing the manuscript online.
To whet your appetite, the abstract is as follows:

Evolving a Semantic Framework into a Network-Enabled Semantic Platform
A data-oriented semantic framework has been developed previously for a project involving a network of globally distributed scientific instruments. Through the use of this framework, the semantic expressivity and richness of the project’s ASCII data is systematically enhanced as it is successively represented in XML (eXtensible Markup Language), RDF (Resource Description Framework) and finally as an informal ontology in OWL (Web Ontology Language). In addition to this representational transformation, there is a corresponding transformation from data into information into knowledge. Because this framework is broadly applicable to ASCII and binary data of any origin, it is appropriate to develop a network-enabled semantic platform that identifies the enabling semantic components and interfaces that already exist, as well as the key gaps that need to be addressed to completely implement the platform. After briefly reviewing the semantic framework, a J2EE (Java 2 Enterprise Edition) based implementation for a network-enabled semantic platform is provided. And although the platform is in principle usable, ongoing adoption suggests that strategies aimed at processing XML via parallel I/O techniques are likely an increasingly pressing requirement.
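
To give a flavour of the representational transformation the abstract describes, here’s a minimal sketch of the first two steps (ASCII to XML to RDF) using Python’s standard library and rdflib. The record format, field names and namespace are illustrative rather than the project’s actual schema, and the OWL step is omitted.

```python
import xml.etree.ElementTree as ET
from rdflib import Graph, Literal, Namespace, RDF

# An illustrative ASCII record from a hypothetical instrument: id, time, lat, lon
record = "STN-042 2008-01-15T00:00Z 43.77 -79.50"
stn, ts, lat, lon = record.split()

# Step 1: ASCII -> XML (more expressive: fields become named elements)
obs_xml = ET.Element("observation", station=stn, time=ts)
ET.SubElement(obs_xml, "latitude").text = lat
ET.SubElement(obs_xml, "longitude").text = lon
print(ET.tostring(obs_xml, encoding="unicode"))

# Step 2: XML -> RDF (more expressive still: typed, linkable statements)
EX = Namespace("http://example.org/instruments#")  # placeholder namespace
g = Graph()
obs = EX["obs-" + stn]
g.add((obs, RDF.type, EX.Observation))
g.add((obs, EX.station, Literal(stn)))
g.add((obs, EX.latitude, Literal(float(lat))))
g.add((obs, EX.longitude, Literal(float(lon))))
print(g.serialize(format="turtle"))
```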

Parsing XML: Commercial Interest

Over the past few months, a topic I’ve become quite interested in is parsing XML. And more specifically, parsing XML in parallel.

Although I won’t take this opportunity to expound in any detail on what I’ve been up to, I did want to state that this topic is receiving interest from significant industry players. For example, here are two data points:

Parsing of XML documents has been recognized as a performance bottleneck when processing XML. One cost-effective way to improve parsing performance is to use parallel algorithms and leverage the use of multi-core processors. Parallel parsing for XML Document Object Model (DOM) has been proposed, but the existing schemes do not scale up well with the number of processors. Further, there is little discussion of parallel parsing methods for other parsing models. The question is: how can we improve parallel parsing for DOM and other XML parsing models, when multi-core processors are available?

Intel Corp. released a new software product suite that is designed to enhance the performance of XML in service-oriented architecture (SOA) environments, or other environments where XML handling needs optimization. Intel XML Software Suite 1.0, which was announced earlier this month, provides libraries to help accelerate XSLT, XPath, XML schemas and XML parsing. XML performance was found to be twice that of open source solutions when Intel tested its product …

As someone with a vested interest in XML, I regard data points such as these as very positive overall.
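
For a flavour of the simplest strategy, the sketch below fans many small, self-contained XML documents out across worker processes. To be clear, this is coarse-grained data parallelism over a collection of documents – not the finer-grained, intra-document parallel DOM parsing described in the first data point above – and the document contents are illustrative.

```python
import multiprocessing as mp
import xml.etree.ElementTree as ET

def parse_record(xml_text: str) -> dict:
    """Parse one self-contained XML document; runs in a worker process."""
    el = ET.fromstring(xml_text)
    return {"id": el.get("id"), "value": el.findtext("value")}

def parse_in_parallel(documents: list, workers: int = 4) -> list:
    """Fan the documents out across processes; each is parsed independently."""
    with mp.Pool(processes=workers) as pool:
        return pool.map(parse_record, documents)

if __name__ == "__main__":
    docs = ['<record id="%d"><value>%d</value></record>' % (i, i * i)
            for i in range(1000)]
    results = parse_in_parallel(docs)
    print(len(results), results[0])
```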

HP Labs Innovates to Sustain Moore’s Law

I had the fortunate opportunity to attend a presentation by Dr. R. Stanley Williams at an HP user forum event in March 2000 in San Jose.

Subsequently, I ran across mention of his work at HP Labs in various places, including Technology Review.

So, when I read in PC World that

Hewlett-Packard researchers may have figured out a way to prolong Moore’s Law by making chips more powerful and less power-hungry

I didn’t immediately dismiss this as marketing hyperbole.

Because of Moore’s Law, transistor density has traditionally grabbed all the attention when it comes to next-generation chips. However, power consumption, and the resulting heat generated, are also gating (sorry, I couldn’t resist that!) factors when it comes to chip design. This is why there is a well-established trend in the development of multicore chip architectures. With the multicore paradigm, it’s the aggregate of cores that assumes responsibility for keeping Moore’s Law relevant.

Williams has found a way to sustain the relevance of Moore’s Law without having to make use of a multicore architecture.

Working with Gregory S. Snider, the HP team has redesigned the Field Programmable Gate Array (FPGA) by introducing a nano-scale interconnect (a field-programmable nanowire interconnect, FPNI). As hinted earlier in this posting, the net effect is to allow for significantly increased transistor density and reduced power consumption. A very impressive combination!

Although Sun executive Scott McNealy is usually associated with the aphorism “The network is the computer”, it may now be HP’s turn to recapitulate.

A Bayesian-Ontological Approach for Fighting Spam

When it comes to fighting spam, Bayesian and ontological approaches are not mutually exclusive.

They could be used together in a highly complementary fashion.

For example, you could use Bayesian approaches, as they are implemented today, to build a spam ontology. In other words, the Bayesian approach would be extended through the addition of knowledge-representation methods for fighting spam – see the sketch below.
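
As a minimal sketch of what I mean – with toy corpora, an illustrative threshold, and a plain dictionary standing in for a real OWL/RDF ontology – a Graham-style per-token spam probability is computed first, and the most spam-indicative tokens are then promoted into ontology concepts.

```python
from collections import Counter

# Toy labelled corpora; a real filter would train on large mail archives.
spam_docs = ["cheap meds now", "win cash now", "cheap cash offer"]
ham_docs = ["meeting agenda attached", "lunch tomorrow", "project status report"]

def token_spam_probability(spam, ham):
    """Per-token P(spam | token) with Laplace smoothing."""
    s_counts = Counter(w for d in spam for w in d.split())
    h_counts = Counter(w for d in ham for w in d.split())
    vocab = set(s_counts) | set(h_counts)
    probs = {}
    for w in vocab:
        p_s = (s_counts[w] + 1) / (sum(s_counts.values()) + len(vocab))
        p_h = (h_counts[w] + 1) / (sum(h_counts.values()) + len(vocab))
        probs[w] = p_s / (p_s + p_h)
    return probs

def build_spam_ontology(probs, threshold=0.7):
    """Promote the most spam-indicative tokens into (stand-in) ontology concepts."""
    return {"spam:" + w: round(p, 3) for w, p in probs.items() if p >= threshold}

print(build_spam_ontology(token_spam_probability(spam_docs, ham_docs)))
```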

This is almost the opposite of the Wikipedia-based approach I blogged about recently.

In the Wikipedia-based approach, the ontology consists of ham-based knowledge.

In the alternative I’ve presented here, the ontology consists of spam-based knowledge.

Both approaches are technically viable. However, it’d be interesting to see which one actually works better in practice.

Either way, automated approaches for constructing ontologies, as I’ve outlined elsewhere, are likely to be of significant applicability here.

Another point is also clear: Either way, this will be a computationally intensive activity!