Preserving Content for Your Portfolio: Kudos to The Internet Archive

Preserving Science

I’ve been publishing articles since the last century.

In fact, my first legitimate publication was a letter to the science journal Nature with my then thesis supervisor (Keith Aldridge) in 1987 … that’s 31 years ago. Armed with nothing more than Google Scholar, searching for “aldridge lumb nature 1987” yields access to the article via Nature’s website in fractions of a second. Moreover, since the introduction of Digital Object Identifiers (DOIs) around the turn of the century (circa 2000), articles such as this one are uniquely identifiable and findable via a URL – e.g., the URL for our Nature letter is http://dx.doi.org/10.1038/325421a0.
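
To make that findability concrete, here’s a minimal sketch (Python, standard library only) of what a DOI-based URL does: prefixing the identifier with a resolver such as doi.org yields a URL that redirects to the publisher’s current landing page, wherever that happens to be today. Note that some publishers reject requests lacking a browser-like user agent, so treat this as an illustration of the mechanism rather than a guaranteed one-liner.

```python
import urllib.request

# The DOI for our 1987 Nature letter; the doi.org resolver redirects it
# to wherever Nature currently hosts the article.
doi = "10.1038/325421a0"
url = "https://doi.org/" + doi

with urllib.request.urlopen(url) as response:
    # geturl() reports the final URL after any redirects, i.e. the
    # publisher's current landing page for the article.
    print("Resolved to:", response.geturl())
```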

In this letter to Nature, Keith and I cite an 1880 publication authored by Lord Kelvin – who, it appears, is known for fluid dynamics in addition to the temperature scale that bears his name … and, of course, much more! Through this and other citations, Keith and I explicitly acknowledged how the contributions of others enabled us to produce our letter – in other words, we made it clear how we were able to stand on the shoulders of giants.

In addition to assigning intellectual credit where it is due, this personal reflection underscores the importance of preserving contributions over the long haul – make that 138 years in the case of Kelvin’s 1880 paper. Preservation is a well-established practice in the case of scientific journals, for example, even though it may be necessary to draw upon analog renditions captured via print or microfiche rather than some digital representation.

While self-curating portfolios recently, it’s become increasingly clear to me that content preservation has not been a focal point in the digital realm.

Digital Properties

Let’s use Grid Computing as an illustrative example. In its heyday, GRIDtoday was a popular and reputable online magazine: “DAILY NEWS AND INFORMATION FOR THE GLOBAL GRID COMMUNITY”. Other than a passing reference in pioneering publisher Tom Tabor’s BIO (you can search for his BIO here), I expect you’ll be hard pressed to locate very much at all regarding this once-thriving online property. Like Grid Computing itself: GRIDtoday, gone tomorrow; RIP GRIDtoday. Of course, Grid Computing Planet (GCP) suffered a similar fate.

My purpose here is not to question those extremely reasonable business decisions that resulted in closing down operations on GCP or GRIDtoday – Tabor Communications, for example, boasts three prized ‘properties’ as of this writing … one of which (HPCwire) predates the inception of GRIDtoday, and remains a go-to source for all things HPC.

Grid Computing remains an important chapter in my professional life – especially given my claims for genetic imprinting via Distributed Computing. However, based upon my desire to assemble a portfolio of my work that includes Grid Computing, the /dev/null redirection of those bits that collectively represented GRIDtoday and GCP is problematic. In particular, and even though I collaborated on articles and book chapters that have been preserved in analog and/or digital representations, the articles I contributed to GRIDtoday and GCP still retain value to me personally – value that I’d like to incorporate into my Portfolio.

Enter The Internet Archive

Fortunately, and also since close to the end of the last century, The Internet Archive has been:

… building a digital library of Internet sites and other cultural artifacts in digital form. Like a paper library, [they] provide free access to researchers, historians, scholars, the print disabled, and the general public. Our mission is to provide Universal Access to All Knowledge.

I’m not intending to imply that those items I was able to have published via GRIDtoday and GCP carry ‘a Kelvin of clout’; however, for more than purely sentimental reasons, it’s truly wonderful that The Internet Archive has attempted to preserve those artifacts that collectively comprised these publications in their heyday. Although I haven’t yet attempted to locate an article I wrote for GCP, I was able to retrieve two articles from the archive for GRIDtoday:

  • Towards The Telecosmic Grid – Published originally in December 2002, in this article I ‘channeled’ George Gilder in asserting that: “Isolating and manipulating discrete wavelengths of visible light across intelligent optical transport media results in the grid – a specific instance of The Telecosmic Grid. Several examples serve as beacons of possibility.” More on this soon (I hope) in a separate post that revisits this possibility.
  • Open Grid Forum: Necessary … but Sufficient? – Published originally in June 2006, this may be the most opinionated article I’ve ever had appear in any media format! It generated a decent amount of traffic for GRIDtoday, as well as an interesting accusation – an accusation ‘leaked’, incidentally, through a mailing list archive.
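
As an aside for the technically inclined, the lookup behind retrievals like these can be reproduced programmatically via the Wayback Machine’s availability API. Below is a minimal sketch (Python, standard library only); the gridtoday.com address is purely illustrative – substitute the original URL of whatever defunct page you’re trying to recover.

```python
import json
import urllib.parse
import urllib.request

# Ask the Wayback Machine for the archived snapshot closest to a URL.
# NOTE: the target below is illustrative, not necessarily GRIDtoday's
# actual original address.
target = "http://www.gridtoday.com/"
api = ("https://archive.org/wayback/available?url="
       + urllib.parse.quote(target, safe=""))

with urllib.request.urlopen(api) as response:
    data = json.load(response)

closest = data.get("archived_snapshots", {}).get("closest")
if closest:
    print("Archived copy:", closest["url"])
    print("Captured at:  ", closest["timestamp"])  # YYYYMMDDhhmmss
else:
    print("No snapshot found for", target)
```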

Because these two GRIDtoday articles are currently accessible via The Internet Archive, I can include each of them directly in my Portfolio, and update my blog posts that make reference to them. Having laid intellectual claim (in 2002, I’ll have you know!!! 😉) to various possibilities telecosmic in nature, I’ll soon be able to revisit them with the benefit of hindsight. Whereas I fully appreciate that business decisions need to be made, and as a consequence once-popular landing pages necessarily disappear, it’s truly fortunate that The Internet Archive has our collective backs on this. So, if this post has a key takeaway, it’s simply this:

Please donate to The Internet Archive.

Thanks Brewster!

Ian Lumb’s Cloud Computing Portfolio

When I first introduced my Data Science Portfolio, it made sense to me (at the time, at least!) to divide it into two parts; the latter part was “… intended to showcase those efforts that have enabled other Data Scientists” – in other words, my contributions as a Data Science Enabler.

As of today, most of what was originally placed in that latter part of my Data Science Portfolio has been transferred to a new portfolio – namely, one that emphasizes Cloud computing. Thus my Cloud Computing Portfolio is a self-curated, online, multimedia effort intended to draw together into a cohesive whole my efforts in Cloud computing; specifically, this new Portfolio is organized as follows:

  • Strictly Cloud – A compilation of contributions in which Cloud computing takes centerstage
  • Cloud-Related – A compilation of contributions ranging from clusters and grids to miscellany; this section also draws out contributions relating to containerization.

As with my Data Science Portfolio, you’ll find in my Cloud Computing Portfolio everything from academic articles and book chapters, to blog posts, to webinars and conference presentations – in other words, this Portfolio also lives up to its multimedia billing!

Like my Data Science Portfolio, this is intentionally a work in progress, so feedback is always welcome – there will definitely be revisions applied!

Data Science: Identifying My Professional Bias

In the Summer of 1984, I arrived at Toronto’s York University as a graduate student in Physics & Astronomy. (Although my grad programme was Physics & Astronomy, my research emphasized the application of fluid dynamics to Earth’s deep interior.) Some time after that, I ran my first non-interactive computation on a cluster of VAX computers. I’m not sure whether this was my first exposure to Distributed Computing; I am, however, fairly certain that this was the first time it (Distributed Computing) registered with me as something exceedingly cool, and exceedingly powerful.

Even back in those days, armed with nothing more than a VT100 terminal ultimately connected to a serial interface on one of the VAXes, I could log in and submit a computational job that might run on some other VAX participating in the cluster. The implied connectedness, the innate ability to make use of compute cycles on some ‘remote’ system, was intellectually intoxicating – and I wasn’t even doing any parallel computing (yet)!

More than a decade later, while serving in a staff role as a computer coordinator, I became involved in procuring a modest supercomputer for those members of York’s Faculty of Pure & Applied Science who made High Performance Computing (HPC) a critical component of their research. If memory serves me correctly, this exercise resulted in the purchase of a NUMA-architecture system from SGI powered by MIPS CPUs. Though initially an isolated system, it included Platform LSF as a component of the overall solution, to manage the computational workloads that would soon consume the SGI system’s resources.

The more I learned about Platform LSF, the more I was smitten by the promise and reality of Distributed Computing – a capability to be leveraged from a resource-centric perspective with this Load Sharing Facility (LSF). [Expletive deleted], Platform founder Songnian Zhou expressed the ramifications of his technical vision for this software as Utopia in a 1993 publication. Although buying the company wasn’t an option, I did manage to be hired by Platform, where I worked in various roles for about seven-and-a-half years.
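
For readers who have never touched a workload manager, the day-to-day experience LSF delivers is roughly the following – a minimal sketch in Python that shells out to LSF’s bsub command. The queue name and executable are illustrative assumptions, not details from any system I actually ran.

```python
import subprocess

# Submit a 4-slot job to LSF; the scheduler decides which node(s) in
# the cluster actually run it -- the 'remote execution' that made
# Distributed Computing so compelling in the first place.
result = subprocess.run(
    ["bsub",
     "-q", "normal",       # illustrative queue name
     "-n", "4",            # number of slots requested
     "-o", "job.%J.out",   # %J expands to the LSF job ID
     "./my_model"],        # illustrative executable
    capture_output=True, text=True, check=True,
)

# LSF replies with a line like: Job <12345> is submitted to queue <normal>.
print(result.stdout.strip())
```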

Between my time at Platform (now an IBM company) and, much more recently, Univa, over a decade of my professional experience has been spent focused on managing workloads in Distributed Computing environments. From a small handful of VAXes to core counts that have reached seven figures, these environments have included clusters, grids and clouds.

My professional bias towards Distributed Computing was further enhanced through the experience of being employed by two software vendors who emphasized the management of clusters – namely Scali (Scali Manage) and subsequently Bright Computing (Bright Cluster Manager). Along with Univa (Project Tortuga and Navops Launch), Bright extended their reach to the management of HPC resources in various cloud configurations.

If it weren’t for a technical role at Allinea (subsequently acquired by ARM), I might have ended up ‘stuck in the middle’ of the computational stack – as workload and cluster management is regarded by the HPC community (at least) as middleware … software that exists between the operating environment (i.e., the compute node and its operating system) and the toolchain (e.g., binaries, libraries) that ultimately supports applications and end users (e.g., Figure 5 here).

Allinea focused on tools to enable HPC developers. Although they were in the process of broadening their product portfolio to include a profiling capability around the time of my departure, during my tenure there the emphasis was on a debugger – a debugger capable of handling code targeted for (you guessed it) Distributed Computing environments.

Things always seemed so much bigger when we were children. Whereas Kid Ian was impressed by a three-node VAX cluster, and later ‘blown away’ by a modest NUMA-architecture ‘supercomputer’, Adult Ian had the express privilege of running Allinea DDT on some of the largest supercomputers on the planet (at the time) – tracking down a bug that only showed up when more than 20K cores were used in parallel on one of Argonne’s Blue Genes, and demonstrating scalable, parallel debugging during a tutorial on some 700K cores of NCSA’s Blue Waters supercomputer. In hindsight, I can’t help but feel humbled by this impressive capability of Allinea DDT to scale to these extremes. Because HPC’s appetite for scale has extended beyond tera and petascale capabilities, and is seriously eyeing the demand to perform at the exascale, software like Allinea DDT also needs to match this penchant for extremely extreme scale.

At this point, suffice it to say that scalable Distributed Computing has been firmly encoded into my professional DNA. As with my scientifically based academic bias, it’s difficult not to frame my predisposition towards Distributed Computing in a positive light within the current context of Data Science. Briefly, it’s a common experience for the transition from prototype to production to include the introduction of Distributed Computing – if only to execute applications and/or their workflows on more powerful computers, and perhaps also to scale these in parallel.

I anticipate the need to return to this disclosure regarding the professional bias I bring to Data Science. For now, though, calling out the highly influential impact Distributed Computing has had on my personal trajectory appears warranted within the context of my Data Science Portfolio.

Early Win Required for Partner-Friendly, Post-Acquisition Platform Computing

Further to the LinkedIn discussion on the relatively recent acquisition of Platform by IBM, I just posted:

Platform CEO and Founder Songnian Zhou has this to say regarding the kernel of this discussion:

“IBM expects Platform to operate as a coherent business unit within its Systems and Technology Group. We got some promises from folks at IBM. We will accelerate our investments and growth. We will deliver on our product roadmaps. We will continue to provide our industry-best support and services. We will work even harder to add value to our partners, including IBM’s competitors. We want to make new friends while keeping the old, for one is silver while the other is gold. We might even get to keep our brand name. After all, distributed computing needs a platform, and there is only one Platform Computing. We are an optimistic bunch. We want to deliver to you the best of both worlds – you know what I mean. Give us a chance to show you what we can do for you tomorrow. Our customers and partners have journeyed with Platform all these years and have not regretted it. We are grateful to them eternally.”

Unsurprisingly upbeat, Zhou, Platform and IBM really do require that customers and partners give them a chance to prove themselves under the new business arrangement. As noted in my previous comment in this discussion, this’ll require some seriously skillful stickhandling to skirt around challenging issues such as IP (Intellectual Property) – a challenge that is particularly exacerbated by the demands of the tightly coupled integrations required to deliver tangible value in the HPC context.

How might IBM-acquired Platform best demonstrate that it’s true to its collective word:

“Give us a chance to show you what we can do for you tomorrow.”

Certainly one way is to strike an early win with a partner – a win that demonstrates that they (Zhou, Platform and IBM) are true to their collective word. Aspects of this demonstration should likely include:

  • IP handling disclosures. Post-acquisition Platform and the partner should be as forthcoming as possible with respect to IP (Intellectual Property) handling – i.e., they should collectively communicate how business and technical IP challenges were handled in practice.
  • Customer validation. Self-evidently, such a demonstration has negligible value without validation by a customer willing to publicly state why they are willing to adopt the corresponding solution.
  • HPC depth. This demonstration has to comprise a whole lot more than merely porting a Platform product to a partner’s platform that would traditionally be viewed as competitive to IBM. As stated previously, herein lies the conundrum: “To deliver a value-rich solution in the HPC context, Platform has to work (extremely) closely with the ‘system vendor’. In many cases, this closeness requires that Intellectual Property (IP) of a technical and/or business nature be communicated …”

Thus, as attention shifts in the fullness of time to post-acquisition Platform, trust becomes the watchword for continued success – particularly in HPC.

For without trust, there will be no opportunity for demonstrations such as the early win outlined here.

How else might IBM-acquired Platform demonstrate that it’s business-better-than-usual?

Feel free to add your $0.02.

IBM-Acquired Platform: Plan for Sustained, Partner-Friendly HPC Innovation Required

Over on LinkedIn, there’s an interesting discussion taking place in the “High Performance & Super Computing” group on the recently announced acquisition of Markham-based Platform Computing by IBM. My comment (below) was stimulated by concerns regarding the implications of this acquisition for IBM’s traditional competitors (i.e., other system vendors such as Cray, Dell, HP, etc.):

It could be argued:

“IBM groks vendor-neutral software and services (e.g., IBM Global Services), and therefore coopetition.”

At face value then, it’ll be business-as-usual for IBM-acquired Platform – and therefore its pre-acquisition partners and customers.

While business-as-usual plausibly applies to porting Platform products to offerings from IBM’s traditional competitors, I believe the sensitivity to the new business relationship (Platform as an IBM business unit) escalates rapidly for any solution that has value in HPC.

Why?

To deliver a value-rich solution in the HPC context, Platform has to work (extremely) closely with the ‘system vendor’. In many cases, this closeness requires that Intellectual Property (IP) of a technical and/or business nature be communicated – often well before solutions are introduced to the marketplace and made available for purchase. Thus Platform’s new status as an IBM entity has the potential to seriously complicate matters regarding risk, trust, etc., relating to the exchange of IP.

Although it’s been stated elsewhere that IBM will allow Platform measures of post-acquisition independence, I doubt that this’ll provide sufficient comfort for matters relating to IP. While NDAs specific to the new (and independent) Platform business unit within IBM may offer some measure of additional comfort, I believe that technically oriented approaches offer the greatest promise for mitigating concerns relating to risk, trust, etc., in the exchange of IP.

In principle, one possibility is the adoption of open standards by all stakeholders. Such standards hold the promise of allowing for integration between products via documented interfaces and protocols, while allowing (proprietary) implementation specifics to remain opaque. Although this may sound appealing, the availability of such standards remains elusive – despite various well-intentioned efforts (by HPC, Grid, Cloud, etc., communities).

While Platform’s traditional competitors predictably and understandably gorge themselves on FUD, it obviously behooves both Platform and IBM to expend some effort allaying the concerns of their customers and partner ecosystem.

I’d be interested to hear of others’ suggestions as to how this new business relationship might allow for sustained innovation in the HPC context from IBM-acquired Platform.

Disclaimer: Although I do not have a vested financial interest in this acquisition, I did work for Platform from 1998-2005.

To reiterate here then:

How can this new business relationship allow for sustained, partner-friendly innovation in the HPC context from IBM-acquired Platform?

Please feel free to share your thoughts on this via comments to this post.

April’s Contributions on Bright Hub

In April, I contributed two articles to the Web Development channel over on Bright Hub:

ORION/CANARIE National Summit

Just in case you haven’t heard:

… join us for an exciting national summit on innovation and technology, hosted by ORION and CANARIE, at the Metro Toronto Convention Centre, Nov. 3 and 4, 2008.

“Powering Innovation – a National Summit” brings over 55 keynotes, speakers and panelists from across Canada and the US, including best-selling author of Innovation Nation, Dr. John Kao; President/CEO of Internet2 Dr. Doug Van Houweling; chancellor of the University of California at Berkeley Dr. Robert J. Birgeneau; advanced visualization guru Dr. Chaomei Chen of Philadelphia’s Drexel University; and many more. Sara Diamond, President of the Ontario College of Art & Design, chairs “A Boom with View”, a session on visualization technologies. Dr. Gail Anderson presents on forensic science research. Other speakers include the host of CBC Radio’s Spark, Nora Young; Delvinia Interactive’s Adam Froman; and the President and CEO of Zerofootprint, Ron Dembo.

This is an excellent opportunity to meet and network with up to 250 researchers, scientists, educators, and technologists from across Ontario, Canada and the international community. Attend sessions on the very latest in e-science: network-enabled platforms, cloud computing, the greening of IT, applications in the “cloud”, innovative visualization technologies, teaching and learning in a web 2.0 universe, and more. Don’t miss exhibitors and showcases ranging from holographic 3D imaging, to IP-based television platforms, to advanced networking.

For more information, visit http://www.orioncanariesummit.ca.

Cyberinfrastructure: Worth the Slog?

If what I’ve been reading over the past few days has any validity to it at all, there will continue to be increasing interest in cyberinfrastructure (CI). Moreover, this interest will come from an increasingly broader demographic.

At this point, you might be asking yourself what, exactly, cyberinfrastructure is. The Atkins Report defines CI this way:

The term infrastructure has been used since the 1920s to refer collectively to the roads, power grids, telephone systems, bridges, rail lines, and similar public works that are required for an industrial economy to function. … The newer term cyberinfrastructure refers to infrastructure based upon distributed computer, information, and communication technology. If infrastructure is required for an industrial economy, then we could say that cyberinfrastructure is required for a knowledge economy. [p. 5]

[Cyberinfrastructure] can serve individuals, teams and organizations in ways that revolutionize what they can do, how they do it, and who participates. [p. 17]

If this definition leaves you wanting, don’t feel too bad, as anyone whom I’ve ever spoken to on the topic feels the same way. What doesn’t help is that the Atkins Report, and others I’ve referred to below, also bandy about terms like e-Science, Grid Computing, Service Oriented Architectures (SOAs), etc. Add to these newer terms such as Cooperative Computing, Network-Enabled Platforms and Cell Computing, and it’s clear that the opportunity for obfuscation is about all that’s being guaranteed.

Consensus on the inadequacy of the terminology aside, there is also consensus that this is a very exciting time with very interesting possibilities.

So where, pragmatically, does this leave us?

Until we collectively sort out the terminology, my suggestion is that the time is ripe for immediate immersion in what cyberinfrastructure and the like might feel like, or actually are. In other words, I highly recommend reviewing the sources cited below, in order:

  1. The Wikipedia entry for cyberinfrastructure – A great starting point with a number of references that is, of course, constantly updated.
  2. The Atkins Report – The NSF’s original CI document.
  3. Cyberinfrastructure Vision for 21st Century Discovery – A slightly more concrete update from the NSF as of March 2007.
  4. Community-specific content – There is content emerging on the intersection between CI and specific communities, disciplines, etc. These frontiers are helping to better define the transformative aspects and possibilities for CI in a much more concrete way.

Frankly, it’s a bit of a slog to wade through all of this content for a variety of reasons …

Ultimately, however, I believe it’s worth the undertaking at the present time as the possibilities are very exciting.

Earth and Space Science Informatics at the 2007 Fall Meeting of the American Geophysical Union

In a previous post, I referred to Earth Science Informatics as a discipline-in-the-making.

To support this claim, I cited a number of data points. And of these data points, the 2006 Fall Meeting of the American Geophysical Union (AGU) stands out as a key enabler.

With 22 sessions posted, the 2007 Fall Meeting of the AGU is well primed to further enable the development of this discipline.

Because I’m a passionate advocate of this intersection between the Earth Sciences and Informatics, I’m involved in convening three of the 22 Earth and Space Science Informatics sessions:

I encourage you to take a moment to review the calls for participation for these three, as well as the other 19, sessions in Earth and Space Science Informatics at the 2007 Fall Meeting of the AGU.

CANARIE’s Network-Enabled Platforms Workshop: Follow Up

I spent a few days in Ottawa last week participating in CANARIE’s Network-Enabled Platforms Workshop.

As the pre-workshop agenda indicated, there’s a fair amount of activity in this area already, and much of it originates from within Canada.

Now that the workshop is over, most of the presentations are available online.

In my case, I’ve made available a discussion document entitled “Evolving Semantic Frameworks into Network-Enabled Semantic Platforms”. This document is very much a work in progress and feedback is welcome here (as comments to this blog post), to me personally (via email to ian AT yorku DOT ca), or via CANARIE’s wiki.

Although a draft of the CANARIE RFP funding opportunity was provided in hard-copy format, there was no soft-copy version made available. If this is of interest, I’d suggest you keep checking the CANARIE site.

Finally, a few shots I took of Ottawa are available online.