Data Science: Identifying My Professional Bias

Data Science: Identifying My Professional Bias

In the Summer of 1984, I arrived at Toronto’s York University as a graduate student in Physics & Astronomy. (Although my grad programme was Physics & Astronomy, my research emphasized the application of fluid dynamics to Earth’s deep interior.) Some time after that, I ran my first non-interactive computation on a cluster of VAX computers. I’m not sure if this was my first exposure to Distributed Computing or not not; I am, however, fairly certain that this was the first time it (Distributed Computing) registered with me as something exceedingly cool, and exceedingly powerful.

Even back in those days, armed with nothing more than a VT100 terminal ultimately connected to a serial interface on one of the VAXes, I could be logged in and able to submit a computational job that might run on some other VAX participating in the cluster. The implied connectedness, the innate ability to make use of compute cycles on some ‘remote’ system was intellectually intoxicating – and I wasn’t even doing any parallel computing (yet)!

More than a decade later, while serving in a staff role as a computer coordinator, I became involved in procuring a modest supercomputer for those members of York’s Faculty of Pure & Applied Science who made High Performance Computing (HPC) a critical component of their research. If memory serves me correctly, this exercise resulted in the purchase of a NUMA-architecture system from SGI powered by MIPS CPUs. Though isolated initially, and as a component of the overall solution, Platform LSF was included to manage the computational workloads that would soon consume the resources of this SGI system.

The more I learned about Platform LSF, the more I was smitten by the promise and reality of Distributed Computing – a capability to be leveraged from a resource-centric perspective with this Load Sharing Facility (LSF). [Expletive deleted], Platform founder Songnian Zhou expressed the ramifications of his technical vision for this software as Utopia in a 1993 publication. Although buying the company wasn’t an option, I did manage to be hired by Platform, and work there in various roles for about seven-and-a-half years.

Between my time at Platform (now an IBM company) and much-more recently Univa, over a decade of my professional experience has been spent focused on managing workloads in Distributed Computing environments. From a small handful of VAXes, to core counts that have reached 7 figures, these environments have included clusters, grids and clouds.

My professional bias towards Distributed Computing was further enhanced through the experience of being employed by two software vendors who emphasized the management of clusters – namely Scali (Scali Manage) and subsequently Bright Computing (Bright Cluster Manager). Along with Univa (Project Tortuga and Navops Launch), Bright extended their reach to the management of HPC resources in various cloud configurations.

If it wasn’t for a technical role at Allinea (subsequently acquired by ARM), I might have ended up ‘stuck in the middle’ of the computational stack – as workload and cluster management is regarded by the HPC community (at least) as middleware … software that exists between the operating environment (i.e., the compute node and its operating system) and the toolchain (e.g., binaries, libraries) that ultimately support applications and end users (e.g., Figure 5 here).

Allinea’s focuses on tools to enable HPC developers. Although they were in the process of broadening their product portfolio to include a profiling capability around the time of my departure, in my tenure there the emphasis was on a debugger – a debugger capable of handling code targeted for (you guessed it) Distributed Computing environments.

Things always seemed so much bigger when we were children. Whereas Kid Ian was impressed by a three-node VAX cluster, and later ‘blown away’ by a modest NUMA-architecture ‘supercomputer’, Adult Ian had the express privilege of running Allinea DDT on some of the largest supercomputers on the planet (at the time) – tracking down a bug that only showed up when more than 20K cores were used in parallel on one of Argonne’s Blue Genes, and demonstrating scalable, parallel debugging during a tutorial on some 700K cores of NCSA’s Blue Waters supercomputer. In hindsight, I can’t help but feel humbled by this impressive capability of Allinea DDT to scale to these extremes. Because HPC’s appetite for scale has extended beyond tera and petascale capabilities, and is eyeing seriously the demand to perform at the exascale, software like Allinea DDT needs also to match this penchant for extremely extreme scale.

At this point, suffice it to say that scalable Distributed Computing has been firmly encoded into my professional DNA. As with my scientifically based academic bias, it’s difficult not to frame my predisposition towards Distributed Computing in a positive light within the current context of Data Science. Briefly, it’s a common experience for the transition from prototype-to-production to include the introduction of Distributed Computing – if not only to merely execute applications and/or their workflows on more powerful computers, but perhaps to simultaneously scale these in parallel.

I anticipate the need to return to this disclosure regarding the professional bias I bring to Data Science. For now though, calling out the highly influential impact Distributed Computing has had on my personal trajectory, appears warranted within the context of my Data Science Portfolio.

Incorporate the Cloud into Existing IT Infrastructure => Progress ( Life Sciences )

I still have lots to share after recently attending Bio-IT World in  Boston … The latest comes as a Bright Computing contributed article to the April 2013 issue of the IEEE Life Sciences Newsletter.

The upshot of this article is:

Progress in the Life Sciences demands extension of IT infrastructure from the ground into the cloud.

Feel free to respond here with your comments.

Synthetic Life and Evolution of Earth’s Second Atmosphere

I have the pleasure of teaching the science of weather and climate to non-scientists again this Fall/Winter session at Toronto’s York University. In the Fall 2011 Term, time was spent discussing the origin and evolution of Earth’s atmosphere. What follows is a post I just shared with the class via Moodle (our LMS):
Photosynthesizing anaerobic lifeforms in Earth’s oceans were likely responsible for systematically enriching Earth’s atmosphere with respect to O2. Through chemical reactions in Earth’s atmosphere, O3 and the O3 layer were systematically derived from this same source of O2. The O3 layer’s ability to minimize the impact of harmful UV radiation, in tandem with the ascent of [O2] to current values of about 21% by volume, were and remain crucial to life as we experience it today.

In tracing the evolution of Earth’s second atmosphere from a composition based on volcanic outgassing to its present state, the role of life was absolutely critical.

On my drive home tonight after today’s lecture, I happened upon a broadcast regarding synthetic life on CBC Radio‘s Ideas. Based upon annotated excerpts from a Craig Venter lecture, this broadcast is well worth the listen in and of itself. And although I’m no life scientist, I can’t help but predict that Venter’s work will ultimately lead to refinements, if not a complete rewrite, of life’s role in the evolution of Earth’s second atmosphere.
If you have any thoughts on this prediction, please feel free to share them here via a comment.

Foraging for Resources in the Multicore Present and Future

HPC consultant Wolfgang Gentzsch has thoughtfully updated the case of multicore architectures in the HPC context. Over on LinkedIn, via one of the HPC discussion groups, I responded with:

I also enjoyed your article, Wolfgang – thank you. Notwithstanding the drive towards cluster-on-a-chip architectures, HPC customers will require workload managers (WLMs) that interface effectively and efficiently with O/S-level features/functionalities (e.g., MCOPt Multicore Manager from eXludus for Linux, to re-state your example). To me, this is a need well evidenced in the past: For example, various WLMs were tightly integrated with IRIX’s cpuset functionality (http://www.sgi.com/products/software/irix/releases/irix658.html) to allow for topology-aware scheduling in this NUMA-based offering from SGI. In present and future multicore contexts, the appetite for petascale and exascale computing will drive the need for such WLM-O/S integrations. In addition to the multicore paradigm, what makes ‘this’ future particularly interesting, is that some of these multicore architectures will exist in a hybrid (CPU/GPU) cloud – a cloud that may compliment in-house resources via some bursting capability (e.g., Bright’s cloud bursting, http://www.brightcomputing.com/Linux-Cluster-Cloud-Bursting.php). As you also well indicated in your article, it is incumbent upon all stakeholders to ensure that this future is a friendly as possible (e.g., for developers and users). To update a phrase originally spun by Herb Sutter (http://www.gotw.ca/publications/concurrency-ddj.htm) in the multicore context, not only is the free lunch over, its getting tougher to find and ingest lunches you’re willing to pay for!

We certainly live in interesting times!

Multi-Touch Computational Steering

About 1:35 into

Jeff Han impressively demonstrates a lava-lamp application on a multi-touch user interface.

Having spent considerable time in the past pondering the fluid dynamics (e.g., convection) of the Earth’s atmosphere and deep interior (i.e., mantle and core), Han’s demonstration immediately triggered a scientific use case: Is it possible to computationally steer scientific simulations via multi-touch user interfaces?

A quick search via Google returns almost 20,000 hits … In other words, I’m likely not the first to make this connection 😦

In my copious spare time, I plan to investigate further …

Also of note is how this connection was made: A friend sent me a link to an article on Apple’s anticipated tablet product. Since so much of the anticipation of the Apple offering relates to the user interface, it’s not surprising that reference was made to Jeff Han’s TED talk (the video above). Cool.

If you have any thoughts to share on multi-touch computational steering, please feel free to chime in.

One more thought … I would imagine that the gaming industry would be quite interested in such a capability – if it isn’t already!

On Knowledge-Based Representations for Actionable Data …

I bumped into a professional acquaintance last week. After describing briefly a presentation I was about to give, he offered to broker introductions to others who might have an interest in the work I’ve been doing. To initiate the introductions, I crafted a brief description of what I’ve been up to for the past 5 years in this area. I’ve also decided to share it here as follows: 

As always, [name deleted], I enjoyed our conversation at the recent AGU meeting in Toronto. Below, I’ve tried to provide some context for the work I’ve been doing in the area of knowledge representations over the past few years. I’m deeply interested in any introductions you might be able to broker with others at York who might have an interest in applications of the same.

Since 2004, I’ve been interested in expressive representations of data. My investigations started with a representation of geophysical data in the eXtensible Markup Language (XML). Although this was successful, use of the approach underlined the importance of metadata (data about data) as an oversight. To address this oversight, a subsequent effort introduced a relationship-centric representation via the Resource Description Format (RDF). RDF, by the way, forms the underpinnings of the next-generation Web – variously known as the Semantic Web, Web 3.0, etc. In addition to taking care of issues around metadata, use of RDF paved the way for increasingly expressive representations of the same geophysical data. For example, to represent features in and of the geophysical data, an RDF-based scheme for annotation was introduced using XML Pointer Language (XPointer). Somewhere around this point in my research, I placed all of this into a framework.

A data-centric framework for knowledge representation.

A data-centric framework for knowledge representation.

 In addition to applying my Semantic Framework to use cases in Internet Protocol (IP) networking, I’ve continued to tease out increasingly expressive representations of data. Most recently, these representations have been articulated in RDFS – i.e., RDF Schema. And although I have not reached the final objective of an ontological representation in the Web Ontology Language (OWL), I am indeed progressing in this direction. (Whereas schemas capture the vocabulary of an application domain in geophysics or IT, for example, ontologies allow for knowledge-centric conceptualizations of the same.)  

From niche areas of geophysics to IP networking, the Semantic Framework is broadly applicable. As a workflow for systematically enhancing the expressivity of data, the Framework is based on open standards emerging largely from the World Wide Web Consortium (W3C). Because there is significant interest in this next-generation Web from numerous parties and angles, implementation platforms allow for increasingly expressive representations of data today. In making data actionable, the ultimate value of the Semantic Framework is in providing a means for integrating data from seemingly incongruous disciplines. For example, such representations are actually responsible for providing new results – derived by querying the representation through a ‘semantified’ version of the Structured Query Language (SQL) known as SPARQL. 

I’ve spoken formally and informally about this research to audiences in the sciences, IT, and elsewhere. With York co-authors spanning academic and non-academic staff, I’ve also published four refereed journal papers on aspects of the Framework, and have an invited book chapter currently under review – interestingly, this chapter has been contributed to a book focusing on data management in the Semantic Web. Of course, I’d be pleased to share any of my publications and discuss aspects of this work with those finding it of interest.

With thanks in advance for any connections you’re able to facilitate, Ian. 

If anything comes of this, I’m sure I’ll write about it here – eventually!

In the meantime, feedback is welcome.

April’s Contributions on Bright Hub

In April, I contributed two articles to the Web Development channel over on Bright Hub: