Data Science: Identifying My Professional Bias

In the summer of 1984, I arrived at Toronto’s York University as a graduate student in Physics & Astronomy. (Although my grad programme was Physics & Astronomy, my research emphasized the application of fluid dynamics to Earth’s deep interior.) Some time after that, I ran my first non-interactive computation on a cluster of VAX computers. I’m not sure whether this was my first exposure to Distributed Computing; I am, however, fairly certain that it was the first time Distributed Computing registered with me as something exceedingly cool, and exceedingly powerful.

Even back in those days, armed with nothing more than a VT100 terminal ultimately connected to a serial interface on one of the VAXes, I could log in and submit a computational job that might run on some other VAX participating in the cluster. The implied connectedness, the innate ability to make use of compute cycles on some ‘remote’ system, was intellectually intoxicating – and I wasn’t even doing any parallel computing (yet)!

More than a decade later, while serving in a staff role as a computer coordinator, I became involved in procuring a modest supercomputer for those members of York’s Faculty of Pure & Applied Science who made High Performance Computing (HPC) a critical component of their research. If memory serves me correctly, this exercise resulted in the purchase of a NUMA-architecture system from SGI powered by MIPS CPUs. Though the system was initially isolated, Platform LSF was included as a component of the overall solution, to manage the computational workloads that would soon consume its resources.

The more I learned about Platform LSF, the more I was smitten by the promise and reality of Distributed Computing – a capability to be leveraged from a resource-centric perspective with this Load Sharing Facility (LSF). [Expletive deleted], Platform founder Songnian Zhou expressed the ramifications of his technical vision for this software as Utopia in a 1993 publication. Although buying the company wasn’t an option, I did manage to get hired by Platform, and I worked there in various roles for about seven and a half years.

Between my time at Platform (now an IBM company) and, much more recently, Univa, over a decade of my professional experience has been spent managing workloads in Distributed Computing environments. From a small handful of VAXes to core counts that have reached seven figures, these environments have included clusters, grids and clouds.

My professional bias towards Distributed Computing was further enhanced through the experience of being employed by two software vendors who emphasized the management of clusters – namely Scali (Scali Manage) and subsequently Bright Computing (Bright Cluster Manager). Along with Univa (Project Tortuga and Navops Launch), Bright extended their reach to the management of HPC resources in various cloud configurations.

If it weren’t for a technical role at Allinea (subsequently acquired by ARM), I might have ended up ‘stuck in the middle’ of the computational stack – as workload and cluster management is regarded by the HPC community (at least) as middleware … software that exists between the operating environment (i.e., the compute node and its operating system) and the toolchain (e.g., binaries, libraries) that ultimately supports applications and end users (e.g., Figure 5 here).

Allinea focuses on tools that enable HPC developers. Although the company was in the process of broadening its product portfolio to include a profiling capability around the time of my departure, during my tenure the emphasis was on a debugger – a debugger capable of handling code targeted for (you guessed it) Distributed Computing environments.

Things always seemed so much bigger when we were children. Whereas Kid Ian was impressed by a three-node VAX cluster, and later ‘blown away’ by a modest NUMA-architecture ‘supercomputer’, Adult Ian had the express privilege of running Allinea DDT on some of the largest supercomputers on the planet (at the time) – tracking down a bug that only showed up when more than 20K cores were used in parallel on one of Argonne’s Blue Genes, and demonstrating scalable, parallel debugging during a tutorial on some 700K cores of NCSA’s Blue Waters supercomputer. In hindsight, I can’t help but feel humbled by this impressive capability of Allinea DDT to scale to these extremes. Because HPC’s appetite for scale has extended beyond tera and petascale capabilities, and is seriously eyeing the demand to perform at the exascale, software like Allinea DDT also needs to match this penchant for extremely extreme scale.

At this point, suffice it to say that scalable Distributed Computing has been firmly encoded into my professional DNA. As with my scientifically based academic bias, it’s difficult not to frame my predisposition towards Distributed Computing in a positive light within the current context of Data Science. Briefly, it’s a common experience for the transition from prototype to production to include the introduction of Distributed Computing – if only to execute applications and/or their workflows on more powerful computers, or perhaps to scale them out in parallel at the same time.

I anticipate the need to return to this disclosure regarding the professional bias I bring to Data Science. For now, though, calling out the highly influential impact Distributed Computing has had on my personal trajectory appears warranted within the context of my Data Science Portfolio.

Incorporate the Cloud into Existing IT Infrastructure => Progress (Life Sciences)

I still have lots to share after recently attending Bio-IT World in Boston … The latest comes as a Bright Computing contributed article to the April 2013 issue of the IEEE Life Sciences Newsletter.

The upshot of this article is:

Progress in the Life Sciences demands extension of IT infrastructure from the ground into the cloud.

Feel free to respond here with your comments.

Over at “On the Bright side …”: Cloud Use Cases from Bio-IT World

As some of you know, I’ve recently joined Bright Computing.

Last week, I attended Bio-IT World 2013 in Boston. Bright had an excellent show – lots of great conversations, and even an award!

During numerous conversations, the notion of extending on-site IT infrastructure into the cloud was raised. Bright has an excellent solution for this.

What also emerged during the conversations were two uses for this extension of local IT resources via the cloud. I thought this was worth capturing and sharing. You can read about the use cases I identified over at “On the Bright side …”.

April’s Contributions on Bright Hub

In April, I contributed two articles to the Web Development channel over on Bright Hub:

Recent Articles on Bright Hub

I’ve added a few more articles over on Bright Hub:

Google Chrome for Linux on Bright Hub: Series Expanded

I recently posted about a new article series on Google Chrome for Linux that I’ve been developing over on Bright Hub. My exploration has turned out to be more engaging than I anticipated! At the moment, there are six articles in the series:

I anticipate a few more …

It’s also important to share that Google Chrome for Linux does not yet exist as an end-user application. Under the auspices of the Chromium Project, however, there is a significant amount of work underway. And because this work is taking place out in the open (Chromium is an Open Source project), now is an excellent time to engage – especially for serious enthusiasts.

Google Chrome for Linux Articles on Bright Hub

I’ve recently started an article series over on Bright Hub. The theme of the series is Google Chrome for Linux, and the series blurb states:

Google Chrome is shaking up the status quo for Web browsers. This series explores and expounds Chrome as it evolves for the Linux platform.

So far, there are the following three articles in the series:

I intend to add more … and hope you’ll drop by to read the articles.

Google Should Not Be Making Mac and Linux Users Wait for Chrome!

Google should not be making Mac and Linux users wait for Chrome.

I know:

  • There’s a significant guerrilla-marketing campaign in action – the officially unstated competition with Microsoft for ‘world domination’. First Apple (with Safari), and now Google (with Chrome), are besting Microsoft Internet Explorer on Windows platforms. In this revisiting of the browser wars of the late nineties, it’s crucial for Google Chrome to go toe-to-toe with the competition. And whether we like to admit it or not, that competition is Microsoft Internet Explorer on the Microsoft Windows platform.
  • The Mac and Linux ports will come from the open-sourcing of Chrome … and we need to wait for this … Optimistically, that’s short-term pain for long-term gain.

BUT:

  • Google is risking alienating its Mac and Linux faithful … and this is philosophically at odds with all things Google.
  • It’s 2008, not 1998. In the past, as an acknowledged fringe community, Mac users were accustomed to a 6-18 month lag in software availability. Linux users, on the other hand, were often satiated by me-too features/functionality made available by the Open Source community. In 2008, however, we have come to expect support to appear simultaneously on Mac, Linux and Windows platforms. For example, Open Source Mozilla releases its flagship Firefox browser (as well as its Thunderbird email application) simultaneously on Mac and Linux as well as Windows. Why not Chrome?

So, what should Google do in the interim?

  • Provide progress updates on a regular basis. Google requested email addresses from those Mac and Linux users interested in Chrome … Now they need to use them!
  • Continue to engage Mac/Linux users. The Chromium Blog, Chromium-Announce, Chromium-discuss, Chromium – Google Code, etc., comprise an excellent start. Alpha and beta programs, along the lines of Mozilla’s, might also be a good idea …
  • Commence work on ‘Browser War’ commercials. Apple’s purposefully understated commercials exploit weaknesses inherent in Microsoft-based PCs to promote their Macs. Microsoft’s fired back with (The Real) Bill Gates and comedian Jerry Seinfeld to … well … confuse us??? Shift to browsers. Enter Google. Enter Mozilla. Just think how much fun we’d all have! Surely Google can afford a few million to air an ad during Super Bowl XLII! Excessive? Fine. I’ll take the YouTube viral version at a fraction of the cost then … Just do it!

For now, the Pareto (80-20) principle remains in play. And although this drives a laser-sharp focus on Microsoft Internet Explorer on the Microsoft Windows platform at the outset, Google has to shift swiftly to Mac and Linux if Chrome’s competitive volley is to be truly disruptive.

And I, for one, can’t wait!

Notes 8.5 Public Beta 1 Client for Mac OS X and Linux

I stumbled across this announcement earlier today:

On 30 May, 2008 a newer version of the Notes 8.5 Mac OS X client was posted as part of the FULL Notes/Domino 8.5 Beta 1 release. 

Based on comments on a post I made in March 2008, I decided to download the 355 MB tarball for Mac OS X.
On my first pass, I attempted to install over top of the private-beta client I described in that earlier post. Unfortunately, the provisioning step was only partially successful. When I launched the Notes client, Eclipse started up … and shut down …
I used the uninstall app that came with this latest tarball to remove the private-beta client. I then reinstalled the public-beta client, got acknowledgement that provisioning was successful, and ran the Notes client. 
In the words of Borat: “Great success!” 
The 500-MB-plus public-beta client looks similar to the private-beta client, but it feels snappier. Your mileage may vary.
Regardless, it’s encouraging to witness this progress. 
In addition to installing IBM Lotus Notes 8.5 Public Beta 1 on Mac OS X (Leopard), I also installed it on a Dell laptop running Ubuntu Hardy Heron – IBM offers a build of the Notes client packaged as a number of .deb files. This was my first experience with a native Notes client for Linux. So far, so good. 
Thanks IBM!
P.S. I expect the Release Notes cover off some of the silliness I’ve shared here …

Book Review: Building Telephony Systems with Asterisk

Asterisk is an Open Source framework for building feature/functionality-rich telephony solutions. And although it competes solidly with offerings from commercial providers, establishing an Asterisk deployment is an involved undertaking. Thus D. Gomillion and B. Dempster’s book Building Telephony Systems with Asterisk will be useful to anyone intending to delve into Asterisk. The book comprises nine chapters, whose content can be summarized as follows:
  • In providing an introduction, Chapter 1 enumerates what Asterisk is and isn’t. Asterisk is a PBX plus IVR, voicemail and VoIP system. Asterisk is not an off-the-shelf phone system, SIP proxy or multi-platform solution. This enumeration leads to a welcome discussion of Asterisk’s fit for your needs. The authors quantify this fit in terms of flexibility vs. ease of use, configuration-management UI, TCO and ROI. Although the latter two topics are covered only briefly, the authors’ coverage will certainly serve to stimulate the right kinds of discussions.
  • Chapter 2 begins by enumerating the ways in which the Asterisk solution might connect to the PSTN. Next, in discussing the four types of terminal equipment (hard phones, soft phones, communications devices and PBXs), the major protocols supported by Asterisk are revealed – namely H.323, SIP and IAX. Whereas H.323 is well known to many of those who’ve delved into videoconferencing, and SIP to anyone who’s done any reading on VoIP, IAX is an interesting addition specific to Asterisk. The Inter-Asterisk eXchange (IAX) protocol attempts to address limitations inherent in H.323 and SIP relating to, e.g., NAT support, configurability, call trunking, information sharing amongst Asterisk servers, and codec support. IAX is not a standard, and device support is somewhat limited but on the rise. As of this book’s writing, September 2005, IAX2 had superseded IAX – and that still appears to be the case. Guidelines for device choice, compatibility with Asterisk, sound-quality analysis and usability all receive attention in this chapter. The chapter closes with a useful discussion on the choice of extension length. Highly noteworthy, and already provided as a link above, is the voip-info.org wiki.
  • The installation of Asterisk is the focus of Chapter 3. After reviewing the required prerequisites, none of which are especially obscure, attention shifts to the Asterisk-specific components. In turn, Zaptel (device drivers), libpri (PRI libraries) and Asterisk itself are installed from source. (I expect packaged versions of these components are now available for various Linux distributions.) Asterisk includes a plethora of configuration files, and these are given an overview in this chapter. And although it’s not mentioned, disciplined use of a revision control system like RCS is strongly advised. The chapter concludes with sections on running Asterisk and interacting with its CLI to ensure correct operation, start and stop the service, and so on.
  • With Asterisk installed, attention shifts to interface configuration in Chapter 4. In working through line and terminal configurations for Zaptel interfaces, one is humbled by the edifice that is the pre-IP world of voice. Our introduction to the intersection between the pre-IP and VoIP universes is completed by consideration of SIP and IAX configuration (a minimal configuration sketch follows this list). Again humbling, the authors’ treatment affords us an appreciation of how acknowledged standards like SIP (with the media itself carried over RTP) are applied in an actual implementation. The final few sections of the chapter further emphasize the convergence capabilities of VoIP platforms by exposing us to voicemail, music-on-hold, message queues and conference rooms.
  • Through the creation of a dialplan, Asterisk’s features and functionality can be customized for use. Dialplans are illustrated in Chapter 5 by establishing contexts, incoming/outgoing-call extensions, call queues, call parking, direct inward dialing, voicemail, an automated phone directory and conference rooms; the dialplan sketch after this list gives a flavor of the syntax involved. Customization is involved, and it is in chapters such as this one that the authors deliver significant value in their ability to move us swiftly towards a dialplan solution. Also evident from this chapter, and to paraphrase the authors, is Asterisk’s power and flexibility as a feature/functionality-rich telephony solution.
  • Under Asterisk, calls are tracked with Call Detail Records (CDRs). Data pertaining to each call can be logged locally to a flat file, or to a database running (preferably) on a remote server. The database-oriented approach to managing CDR data is more flexible and powerful, even though it takes more effort to set up, as this solution is based on databases such as PostgreSQL or MySQL; a sketch of such a configuration also follows this list. CDRs comprise the least-invasive approach to quality assurance. The remainder of Chapter 6 focuses on more-invasive approaches such as monitoring and recording calls.
  • Based only on the context provided by this review, it is likely apparent that an Asterisk deployment requires considerable effort. Thus in Chapter 7, the authors introduce us to the turnkey solution known as Asterisk@Home. Asterisk@Home favors convenience at the expense of flexibility – e.g., the flavor of Linux (CentOS) as well as support components such as the database (MySQL) are predetermined. The Asterisk Management Portal (AMP), a key addition in Asterisk@Home, Webifies access to a number of user and administrator features/functionalities – voicemail, CRM, the Flash Operator Panel (FOP, a real-time activity monitor), MeetMe control, plus management of the Asterisk@Home portal and server via AMP itself. Before completing the chapter with an introduction to the powerful SugarCRM component bundled with Asterisk@Home, the authors detail the steps required to complete the deployment of Asterisk@Home for a simple use case. It’s chapters like this that allow us to appreciate, all at once, the potential of the Asterisk platform. (Packt has recently released a book on AsteriskNOW, the new name for Asterisk@Home.)
  • The SOHO, small business and hosted PBX are the three case studies that collectively comprise Chapter 8. Sequentially, the authors present the case-study scenario, some discussion, Asterisk configuration specifics, and conclusions. In taking this approach, the authors make clear the application of Asterisk to real-world scenarios of increasing complexity. In the SOHO case, the SIP shared object (chan_sip.so) is not loaded as this functionality is not required. This is but one example of how the authors attempt to convey best practices in the deployment of a production solution based on Asterisk.
  • Maintenance and security are considered in the final chapter of the book (Chapter 9). The chapter begins with a useful discussion on automating backups and system maintenance, plus time synchronization. Those familiar with systems administration can focus on the Asterisk-specific pieces that will require their attention. This focus leads naturally to a discussion of recovering the Asterisk deployment in the event of a disaster. Security gets well-deserved consideration in this chapter, from both the server and the network perspectives. For example, there is very useful and interesting content on securing the protocols used by Asterisk with a firewall. Before the chapter closes by identifying both the Open Source and commercial support offerings for Asterisk, the scalability of Asterisk is given attention.
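
To give a concrete flavor of the configuration work Chapters 4 and 5 walk through, here is a minimal sketch of a SIP device entry and a matching dialplan. The device name (6001), secret and extension number are hypothetical, and the syntax reflects my recollection of Asterisk 1.2-era configuration files; consult the book for authoritative details.

    ; sip.conf – a hypothetical SIP phone (Chapter 4)
    [6001]
    type=friend       ; this device may both place and receive calls
    secret=changeme   ; hypothetical shared secret
    host=dynamic      ; the phone registers its own IP address
    context=internal  ; calls from this device enter [internal] below

    ; extensions.conf – a minimal dialplan (Chapter 5)
    [internal]
    exten => 6001,1,Dial(SIP/6001,20)  ; ring the SIP phone for 20 seconds
    exten => 6001,2,Voicemail(u6001)   ; no answer: play the ‘unavailable’ greeting
    exten => 6001,3,Hangup()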
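
Similarly, for the database-backed CDR logging covered in Chapter 6, the sketch below assumes the MySQL CDR backend distributed via the asterisk-addons package; the host, credentials and database name are all hypothetical placeholders.

    ; cdr_mysql.conf – log CDRs to a (preferably remote) MySQL server (Chapter 6)
    [global]
    hostname=db.example.com  ; hypothetical database host
    dbname=asteriskcdr       ; hypothetical database name
    table=cdr                ; one row is written per completed call
    user=asterisk            ; hypothetical credentials
    password=changeme
    port=3306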

This book was first published in September 2005 and is based on version 1.2.1 of Asterisk. As of this writing, Asterisk’s production version is 1.4.x, and the version 1.6 beta release is also available (see http://www.asterisk.org/ for more). Even though the book is somewhat dated, it remains useful in acquainting readers with Asterisk, and I have no reservations in strongly recommending it.
Disclaimer: The author was kindly provided with a copy of this book for review by the publisher.