Data Science: Identifying My Professional Bias

In the Summer of 1984, I arrived at Toronto’s York University as a graduate student in Physics & Astronomy. (Although my grad programme was Physics & Astronomy, my research emphasized the application of fluid dynamics to Earth’s deep interior.) Some time after that, I ran my first non-interactive computation on a cluster of VAX computers. I’m not sure if this was my first exposure to Distributed Computing or not; I am, however, fairly certain that this was the first time it (Distributed Computing) registered with me as something exceedingly cool and exceedingly powerful.

Even back in those days, armed with nothing more than a VT100 terminal ultimately connected to a serial interface on one of the VAXes, I could log in and submit a computational job that might run on some other VAX participating in the cluster. The implied connectedness, the innate ability to make use of compute cycles on some ‘remote’ system, was intellectually intoxicating – and I wasn’t even doing any parallel computing (yet)!

More than a decade later, while serving in a staff role as a computer coordinator, I became involved in procuring a modest supercomputer for those members of York’s Faculty of Pure & Applied Science who made High Performance Computing (HPC) a critical component of their research. If memory serves me correctly, this exercise resulted in the purchase of a NUMA-architecture system from SGI powered by MIPS CPUs. Though the system was isolated initially, Platform LSF was included as a component of the overall solution to manage the computational workloads that would soon consume its resources.

The more I learned about Platform LSF, the more I was smitten by the promise and reality of Distributed Computing – a capability to be leveraged from a resource-centric perspective with this Load Sharing Facility (LSF). [Expletive deleted], Platform founder Songnian Zhou expressed the ramifications of his technical vision for this software as Utopia in a 1993 publication. Although buying the company wasn’t an option, I did manage to be hired by Platform, and work there in various roles for about seven-and-a-half years.

Between my time at Platform (now an IBM company) and, much more recently, Univa, over a decade of my professional experience has been focused on managing workloads in Distributed Computing environments. From a small handful of VAXes to core counts that have reached seven figures, these environments have included clusters, grids and clouds.

My professional bias towards Distributed Computing was further enhanced through the experience of being employed by two software vendors who emphasized the management of clusters – namely Scali (Scali Manage) and subsequently Bright Computing (Bright Cluster Manager). Along with Univa (Project Tortuga and Navops Launch), Bright extended their reach to the management of HPC resources in various cloud configurations.

If it weren’t for a technical role at Allinea (subsequently acquired by ARM), I might have ended up ‘stuck in the middle’ of the computational stack – as workload and cluster management is regarded by the HPC community (at least) as middleware … software that exists between the operating environment (i.e., the compute node and its operating system) and the toolchain (e.g., binaries, libraries) that ultimately supports applications and end users (e.g., Figure 5 here).

Allinea focused on tools to enable HPC developers. Although they were in the process of broadening their product portfolio to include a profiling capability around the time of my departure, during my tenure there the emphasis was on a debugger – a debugger capable of handling code targeted for (you guessed it) Distributed Computing environments.

Things always seemed so much bigger when we were children. Whereas Kid Ian was impressed by a three-node VAX cluster, and later ‘blown away’ by a modest NUMA-architecture ‘supercomputer’, Adult Ian had the express privilege of running Allinea DDT on some of the largest supercomputers on the planet (at the time) – tracking down a bug that only showed up when more than 20K cores were used in parallel on one of Argonne’s Blue Genes, and demonstrating scalable, parallel debugging during a tutorial on some 700K cores of NCSA’s Blue Waters supercomputer. In hindsight, I can’t help but feel humbled by the impressive capability of Allinea DDT to scale to these extremes. Because HPC’s appetite for scale has extended beyond tera and petascale capabilities, and is seriously eyeing the demand to perform at the exascale, software like Allinea DDT also needs to match this penchant for extremely extreme scale.

At this point, suffice it to say that scalable Distributed Computing has been firmly encoded into my professional DNA. As with my scientifically based academic bias, it’s difficult not to frame my predisposition towards Distributed Computing in a positive light within the current context of Data Science. Briefly, it’s a common experience for the transition from prototype to production to include the introduction of Distributed Computing – if only to execute applications and/or their workflows on more powerful computers, or perhaps to scale them out in parallel at the same time.
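
To make that prototype-to-production point a little more concrete, here is a minimal sketch in Python. Everything in it is hypothetical – the score function and its data are stand-ins – and it only fans work out across the cores of a single machine; a cluster-scale scheduler generalizes the same map-style pattern across many machines.

    from multiprocessing import Pool

    def score(record):
        """Stand-in for a per-record computation (hypothetical workload)."""
        return sum(record) / len(record)

    if __name__ == "__main__":
        data = [[i, i + 1, i + 2] for i in range(100_000)]

        # Prototype: a serial loop over the records.
        serial = [score(r) for r in data]

        # A first step towards production scale: the same function mapped
        # across local cores; distributed schedulers extend this pattern
        # from one node to many.
        with Pool() as pool:
            parallel = pool.map(score, data)

        assert serial == parallel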

I anticipate the need to return to this disclosure regarding the professional bias I bring to Data Science. For now, though, calling out the highly influential impact Distributed Computing has had on my personal trajectory appears warranted within the context of my Data Science Portfolio.

Confronting the Fear of Public Speaking via Virtual Environments

Confession: In the past, I’ve been extremely quick to dismiss the value of Second Life in the context of teaching and learning.

Even worse, my dismissal was not fact-based … and, if truth be told, I’ve gone out of my way to avoid opportunities to ‘gather the facts’ by attending presentations at conferences, conducting my own research online, speaking with my colleagues, etc.

So I, dear reader, am as surprised as any of you to have had an egg-on-my-face epiphany this morning …

Please allow me to elaborate:

It was at some point during this morning’s brainstorming session that the egg hit me squarely in the face:

Why not use Nortel web.alive to prepare graduate students for presenting their research?

Often feared more than death and taxes, public speaking is an essential aspect of academic research – regardless of the discipline.

Enter Nortel web.alive with its virtual environment of a large lecture hall – complete with a podium, projection screen for sharing slides, and most importantly an audience!

As a former graduate student, I could easily ‘see’ myself in this environment with increasingly realistic audiences comprised of friends, family and/or pets, fellow graduate students, my research supervisor, my supervisory committee, etc. Because Nortel web.alive only requires a Web browser, my audience isn’t geographically constrained. This geographical freedom is important as it allows for participation – e.g., between graduate students at York in Toronto and their supervisor who just happens to be on sabbatical in the UK. (Trust me, this happens!)

As the manager of Network Operations at York, I’m always keen to encourage novel use of our campus network. The public-speaking use case I’ve described here has the potential to make innovative use of our campus network, regional network (GTAnet), provincial network (ORION), and even national network (CANARIE) that would ultimately allow for global connectivity.

While I busy myself scraping the egg off my face, please chime in with your feedback. Does this sound useful? Are you aware of other efforts to use virtual environments to confront the fear of public speaking? Are there related applications that come to mind for you? (As someone who’s taught classes of about 300 students in large lecture halls, a little bit of a priori experimentation in a virtual environment would’ve been greatly appreciated!)

Update (November 13, 2009): I just Google’d the title of this article and came up with a few relevant hits; further research is required.

ORION/CANARIE National Summit

Just in case you haven’t heard:

… join us for an exciting national summit on innovation and technology, hosted by ORION and CANARIE, at the Metro Toronto Convention Centre, Nov. 3 and 4, 2008.

“Powering Innovation – a National Summit” brings over 55 keynotes, speakers and panelists from across Canada and the US, including best-selling author of Innovation Nation, Dr. John Kao; President/CEO of Internet2, Dr. Doug Van Houweling; Chancellor of the University of California at Berkeley, Dr. Robert J. Birgeneau; advanced visualization guru Dr. Chaomei Chen of Philadelphia’s Drexel University; and many more. The Ontario College of Art & Design’s President, Sara Diamond, chairs “A Boom with View”, a session on visualization technologies. Dr. Gail Anderson presents on forensic science research. Other speakers include the host of CBC Radio’s Spark, Nora Young; Delvinia Interactive’s Adam Froman; and the President and CEO of Zerofootprint, Ron Dembo.

This is an excellent opportunity to meet and network with up to 250 researchers, scientists, educators, and technologists from across Ontario and Canada and the international community. Attend sessions on the very latest in e-science, network-enabled platforms, cloud computing, the greening of IT, applications in the “cloud”, innovative visualization technologies, teaching and learning in a Web 2.0 universe, and more. Don’t miss exhibitors and showcases ranging from holographic 3D imaging, to IP-based television platforms, to advanced networking.

For more information, visit http://www.orioncanariesummit.ca.

Annotation Modeling: In Press

Our manuscript on annotation modeling is one step closer to publication now, as late last night my co-authors and I received sign-off on the copy-editing phase. The journal, Computers & Geosciences, is now preparing proofs.
For the most part then, as authors, we’re essentially done.
However, we may not be able to resist the urge to include a “Note Added in Proof”. At the very least, this note will allude to:

  • The work being done to refactor Annozilla for use in a Firefox 3 context; and
  • How annotation is figuring in OWL2 (Google “W3C OWL2” for more).

Stay tuned …

CANHEIT 2008: York Involvement

York University will be well represented at CANHEIT 2008.
Although you’ll find the details in CANHEIT’s online programme, allow me to whet your appetite regarding our contributions:

Annotation Modeling: To Appear in Comp & Geosci

What a difference a day makes!
Yesterday I learned that my paper on semantic platforms was rejected.
Today, however, the news was better, as a manuscript on annotation modeling was accepted for publication.
It’s been a long road for this paper:

The abstract of the paper is as follows:

Annotation Modeling with Formal Ontologies: Implications for Informal Ontologies

L. I. Lumb [1], J. R. Freemantle [2], J. I. Lederman [2] & K. D. Aldridge [2]

[1] Computing and Network Services, York University, 4700 Keele Street, Toronto, Ontario, M3J 1P3, Canada
[2] Earth & Space Science and Engineering, York University, 4700 Keele Street, Toronto, Ontario, M3J 1P3, Canada

Knowledge representation is increasingly recognized as an important component of any cyberinfrastructure (CI). In order to expediently address scientific needs, geoscientists continue to leverage the standards and implementations emerging from the World Wide Web Consortium’s (W3C) Semantic Web effort. In an ongoing investigation, previous efforts have been aimed towards the development of a semantic framework for the Global Geodynamics Project (GGP). In contrast to other efforts, the approach taken has emphasized the development of informal ontologies, i.e., ontologies that are derived from the successive extraction of Resource Description Format (RDF) representations from eXtensible Markup Language (XML), and then Web Ontology Language (OWL) from RDF. To better understand the challenges and opportunities for incorporating annotations into the emerging semantic framework, the present effort focuses on knowledge-representation modeling involving formal ontologies. Although OWL’s internal mechanism for annotation is constrained to ensure computational completeness and decidability, externally originating annotations based on the XML Pointer Language (XPointer) can easily violate these constraints. Thus the effort of modeling with formal ontologies allows for recommendations applicable to the case of incorporating annotations into informal ontologies.
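
For readers who would like a feel for what such an annotation looks like in practice, here is a minimal, hypothetical sketch using Python’s rdflib: the namespace, class and property names are invented for illustration, and the XPointer-style reference merely mimics the kind of externally originating annotation discussed above – it is not drawn from the actual GGP framework.

    from rdflib import Graph, Literal, Namespace, URIRef
    from rdflib.namespace import OWL, RDF, RDFS

    EX = Namespace("http://example.org/ggp#")  # hypothetical namespace

    g = Graph()
    g.bind("ex", EX)
    g.bind("owl", OWL)

    # A formal (OWL) class carrying an internal annotation via rdfs:comment,
    # one of OWL's built-in annotation properties.
    g.add((EX.GravimeterRecord, RDF.type, OWL.Class))
    g.add((EX.GravimeterRecord, RDFS.comment,
           Literal("Hourly superconducting-gravimeter observations.")))

    # An externally originating annotation that points back into the source
    # XML via an XPointer expression; OWL itself places no constraints on,
    # and does no reasoning over, this target.
    g.add((EX.GravimeterRecord, EX.annotatedBy,
           URIRef("http://example.org/station.xml#xpointer(/record[1])")))

    print(g.serialize(format="turtle"))

Serializing to Turtle simply makes the triples, including the annotation, easy to inspect; no reasoning is implied.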

I expect the whole paper will be made available in the not-too-distant future …

QoS On My Mind …

QoS has been on my mind lately.
Why?
I suppose there are a number of reasons.
We’re in the process of re-architecting our data network at York. We’re starting off by adding redundancy in various ways, and anticipate the need to address QoS in preparation for our future deployment of a VoIP service.
Of course, that doesn’t mean we don’t already have VoIP or VoIP-like protocols present on our existing, undifferentiated network. In addition to Skype, there are groups that have already embraced videoconferencing solutions that make use of protocols like RTP. And given that there’s already a Top 50 list of Open Source VoIP applications to choose from, I’m sure these aren’t the only examples of VoIP-like applications on our network.
At the moment, I have more questions about QoS than answers. 
For example:
  • If we introduce protocol-based QoS, won’t this give any application that uses the protocol access to differentiated treatment? (A minimal sketch of this concern follows the list.) I sense that QoS can be applied in a very granular fashion, but do I really want to turn my entire team of network specialists into QoS specialists? (From an operational perspective, I know I can’t afford to!)
  • When is the right time to introduce QoS? Users are clamoring for QoS ASAP, as it’s often perceived as a panacea – a panacea that often masks the root cause of what really ails them … From a routing and switching perspective, do we wait for tangible signs of congestion before implementing QoS? I certainly have the impression that others managing campus as well as regional networks plan to do this.
  • And what about standards? QoS isn’t baked into IPv4, but there are implementations – DiffServ markings in the IP header, for example – that promote interoperability between vendors. Should MPLS, used frequently in service providers’ networks, be employed as a vehicle for QoS in the campus network context?
  • QoS presupposes that use is to be made of an existing network. Completely segmenting networks – i.e., dedicating a network to a VoIP deployment – is also an option, and one that has the potential to bypass the need for QoS altogether.
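
Regarding the first concern above, here is a minimal, hypothetical Python sketch of just how easily any application can mark its own traffic for preferential treatment once the network honours such markings; the address and port are placeholders, and the IP_TOS socket option behaves as shown on a Linux host.

    import socket

    # DSCP Expedited Forwarding (EF) is decimal 46; DSCP occupies the upper
    # six bits of the (former) ToS byte, hence the shift by two.
    DSCP_EF = 46 << 2

    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.setsockopt(socket.IPPROTO_IP, socket.IP_TOS, DSCP_EF)

    # Any UDP payload sent from this socket now carries an EF marking that a
    # naively configured network could treat as voice traffic.
    sock.sendto(b"hello, priority queue", ("192.0.2.10", 5004))
    sock.close()

In other words, marking is trivial; the real QoS work lies in deciding which markings to trust and where to enforce them.
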
I know that as I dig deeper into the collective brain trust, answers – and more questions – will emerge.
And even though there are a number of successful deployments of VoIP that can be pointed to, there still seems to be a need to have a deeper discussion on QoS – starting from a strategic level. 
As I reflect more and more on QoS I’m thinking that a suitably targeted BoF, at CANHEIT 2008 for example, might provide a fertile setting for an honest discussion.     

Net@EDU 2008: Key Takeaways

Earlier this week, I participated in the Net@EDU Annual Meeting 2008: The Next 10 Years.   For me, the key takeaways are:

  • The Internet can be improved. IP, its transport protocols (RTP, SIP, TCP and UDP), and especially HTTP, are stifling innovation at the edges – everything (device-oriented) on IP and everything (application-oriented) on the Web. There are a number of initiatives that seek to improve the situation. One of these, with tangible outcomes, is the Stanford Clean Slate Internet Design Program.
  • Researchers and IT organizations need to be reunited. In the 1970s and 1980s, these demographics worked closely together and delivered a number of significant outcomes. Since the 1990s, however, these groups have remained separate and distinct. This separation has not benefited either group. As the manager of a team focused on the operation of a campus network who still manages to conduct a modest amount of research, this takeaway resonates particularly strongly with me.
  • DNSSEC is worth investigating now. DNS is a mission-critical service. It is often, however, an orphaned service in many IT organizations. DNSSEC comprises four standards that extend the original concept in security-savvy ways – e.g., they will harden your DNS infrastructure against DNS-targeted attacks. Although production implementations remain in the future, the time to get involved is now.
  • The US is lagging behind in the case of broadband. An EDUCAUSE blueprint details the current situation, and offers a prescription for rectifying it. As a Canadian, I find it noteworthy that Canada’s progress in this area is exceptional, even though Canada is regarded as a much more rural nation than the US. The key to the Canadian success, and a key component of the blueprint’s prescription, is a funding model that shares costs evenly between two levels of government (federal and provincial) as well as the network builder/owner.
  • Provisioning communications infrastructures for emergency situations is a sobering task. Virginia Tech experienced 100-3000% increases in the demands on their communications infrastructure as a consequence of their April 16, 2007 event. Such stress factors are exceedingly difficult to estimate and account for. In some cases, responding in real time allowed for adequate provisioning through a tremendous amount of collaboration. Mass notification remains a challenge.
  • Today’s and tomorrow’s students are different from yesterday’s. Although this may sound obvious, the details are interesting. Ultimately, this difference derives from the fact that today’s and tomorrow’s students have more intimately integrated technology into their lives from a very young age.
  • Cyberinfrastructure remains a focus. EDUCAUSE has a Campus Cyberinfrastructure Working Group. Some of their deliverables are soon to include a CI digest, plus contributions from their Framing and Information Management Focus Groups. In addition to the working-group session, Don Middleton of NCAR discussed the role of CI in the atmospheric sciences. I was particularly pleased that Middleton made a point of showcasing semantic aspects of virtual observatories such as the Virtual Solar-Terrestrial Observatory (VSTO).
  • The Tempe Mission Palms Hotel is an outstanding venue for a conference. Net@EDU has themed its annual meetings around this hotel, Tempe, Arizona, and the month of February – a strategic choice that the venue delivers on in spades. From individual rooms to conference food and logistics to the mini gym and pool, the Tempe Mission Palms Hotel delivers.

AGU Poster: Relationship-Centric Ontology Integration

Later today in San Francisco, at the 2007 Fall Meeting of the American Geophysical Union (AGU), one of my co-authors will be presenting our poster entitled “Relationship-Centric Ontology Integration” (abstract).

This poster will be in a session that I co-convened; the session is described elsewhere.

A PDF version of the poster is available elsewhere (agu07_the_poster_v2.pdf).

Cyberinfrastructure: Worth the Slog?

If what I’ve been reading over the past few days has any validity to it at all, there will continue to be increasing interest in cyberinfrastructure (CI). Moreover, this interest will come from an increasingly broad demographic.

At this point, you might be asking yourself what, exactly, cyberinfrastructure is. The Atkins Report defines CI this way:

    The term infrastructure has been used since the 1920s to refer collectively to the roads, power grids, telephone systems, bridges, rail lines, and similar public works that are required for an industrial economy to function. … The newer term cyberinfrastructure refers to infrastructure based upon distributed computer, information, and communication technology. If infrastructure is required for an industrial economy, then we could say that cyberinfrastructure is required for a knowledge economy. [p. 5]

    [Cyberinfrastructure] can serve individuals, teams and organizations in ways that revolutionize what they can do, how they do it, and who participates. [p. 17]

If this definition leaves you wanting, don’t feel too bad, as anyone whom I’ve ever spoken to on the topic feels the same way. What doesn’t help is that the Atkins Report, and the others I’ve referred to below, also bandy about terms like e-Science, Grid Computing, Service Oriented Architectures (SOAs), etc. Add to these newer terms such as Cooperative Computing, Network-Enabled Platforms and Cell Computing, and it’s clear that the opportunity for obfuscation is about all that’s being guaranteed.

Consensus on the inadequacy of the terminology aside, there is also consensus that this is a very exciting time with very interesting possibilities.

So where, pragmatically, does this leave us?

Until we collectively sort out the terminology, my suggestion is that the time is ripe for immediate immersion in what cyberinfrastructure and the like are, or at least might feel like. In other words, I highly recommend reviewing the sources cited below, in order:

1. The Wikipedia entry for cyberinfrastructure – A great starting point with a number of references that is, of course, constantly updated.
2. The Atkins Report – The NSF’s original CI document.
3. Cyberinfrastructure Vision for 21st Century Discovery – A slightly more concrete update from the NSF as of March 2007.
4. Community-specific content – Content is emerging on the intersection between CI and specific communities, disciplines, etc. These frontiers are helping to define the transformative aspects and possibilities of CI in a much more concrete way.

Frankly, it’s a bit of a slog to wade through all of this content, for a variety of reasons …

Ultimately, however, I believe it’s worth the undertaking at the present time, as the possibilities are very exciting.