Demonstrating Your Machine Learning Expertise: Optimizing Breadth vs. Depth

Developing Expertise

When it comes to developing your expertise in Machine Learning, there seem to be two schools of thought:

  • Exemplified by articles that purport to list, for example, the 10 most important methods you need to know to ace a Machine Learning interview, the School of Breadth emphasizes content-oriented objectives. Whether you ramp up through courses and workshops or through full programs (e.g., certificates, degrees), the justification for broadening your knowledge of Machine Learning is self-evident.
  • Find data that interests you, and work with it using a single approach for Machine Learning. Thus the School of Depth emphasizes skills-oriented objectives that are progressively mastered as you delve into data, or better yet, a problem of interest.

Depending upon whichever factors you currently have under consideration then (e.g., career stage, employment status, desired employment trajectory, …), breadth versus depth may result in an existential crisis when it comes to developing and ultimately demonstrating your expertise in Machine Learning – with a modicum of apologies if that strikes you as a tad melodramatic.

Demonstrating Expertise

In all honesty, somewhat conflicted is how I feel at the moment myself.

On Breadth

Even a rapid perusal of the Machine Learning specific artifacts I’ve self-curated into my online, multimedia Data Science Portfolio makes one thing glaringly evident: The breadth of my exposure to Machine Learning has been somewhat limited. Specifically, I have direct experience with classification and Natural Language Processing in Machine Learning contexts from the practitioner’s perspective. The more-astute reviewer, however, might look beyond the ‘pure ML’ sections of my portfolio and afford me additional merit for (say) my mathematical and/or physical sciences background, plus my exposure to concepts directly or indirectly applicable to Machine Learning – e.g., my experience as a scientist with least-squares modeling counting as exposure at a conceptual level to regression (just to keep this focused on breadth, for the moment).

True confession: I’ve started more than one course in Machine Learning in a blunt-instrument attempt to address this known gap in my knowledge of relevant methods. Started is, unfortunately, the operative word, as (thus far) any attempt I’ve made has not been followed through – even when there are options for community, accountability, etc. to better-ensure success. (Though ‘life got in the way’ of me participating fully in the fast.ai study group facilitated by the wonderful team that delivers the This Week in Machine Learning & AI Podcast, such approaches to learning Machine Learning are appealing in principle – even though my own engagement was grossly inconsistent.)

On Depth

What then about depth? Taking the self-serving but increasingly concrete example of my own Portfolio, it’s clear that (at times) I’ve demonstrated depth. Driven by an interesting problem aimed at improving tsunami alerting by processing data extracted from Twitter, for example, the deepening progression with co-author Jim Freemantle has been as follows:

  1. Attempt to apply existing knowledge-representation framework to the problem by extending it (the framework) to include graph analytics
  2. Introduce tweet classification via Machine Learning
  3. Address the absence of semantics in the classification-based approach through the introduction of Natural Language Processing (NLP) in general, and embedded word vectors in particular
  4. Next steps …

(Again, please refer to my Portfolio for content relating to this use case.) Going deeper, in this case, is not a demonstration of a linear progression; rather, it is a sequence of outcomes realized through experimentation, collaboration, consultation, etc. For example, the seed to introduce Machine Learning into this tsunami-alerting initiative was planted on the basis of informal discussions at an oil and gas conference … and later, the introduction of embedded word vectors was similarly the outcome of informal discussions at a GPU technology conference.

Whereas these latter examples are intended primarily to demonstrate the School of Depth, it is clear that the two schools of thought aren’t mutually exclusive. For example, in delving into a problem of interest, Jim and I may have deepened our mastery of specific skills within NLP; however, we have also broadened our knowledge within this important subdomain of Machine Learning.

One last thought here on depth. At the outset, neither Jim nor I had as an objective any innate desire to explore NLP. Rather, the problem, and more importantly the demands of the problem, caused us to ‘gravitate’ towards NLP. In other words, we are wedded more to making scientific progress (on tsunami alerting) than a specific method for Machine Learning (e.g., NLP).

Next Steps

Net-net then, it appears to be that which motivates us that dominates in practice – in spite, perhaps, of our best intentions. In my own case, my existential crisis derives from being driven by problems into depth, while at the same time seeking to demonstrate a broader portfolio of expertise with Machine Learning. To be more specific, there’s a part of me that wants to apply LSTMs (for example) to the tsunami-alerting use case, whereas another part knows I must broaden (at least a little!) my portfolio when it comes to methods applicable to Machine Learning.

Finally then, how do I plan to address this crisis? For me, it’ll likely manifest itself as a two-pronged approach:

  1. Enrol in and follow through on a course (at least!) that exposes me to one or more methods of Machine Learning that complement my existing exposure to classification and NLP.
  2. Identify a problem, or problems of interest, that allow me to deepen my mastery of one or more of these ‘newly introduced’ methods of Machine Learning.

In a perfect situation, perhaps we’d emphasize breadth and depth. However, when you’re attempting to introduce, pivot, re-position, etc. yourself, a trade-off between breadth and depth appears to be inevitable. An introspective reflection, based upon the substance of a self-curated portfolio, appears to be an effective and efficient means for roadmapping how gaps can be identified and ultimately addressed.

Postscript

In many settings/environments, Machine Learning, and Data Science in general, are team sports. Clearly then, a viable way to address the challenges and opportunities presented by depth versus breadth is to hire accordingly – i.e., hire for depth and breadth in your organization.

Data Scientist: Believe. Behave. Become.

A Litmus Test

When do you legitimately get to call yourself a Data Scientist?

How about a litmus test? You’re at a gathering of some type, and someone asks you:

So, what do you do?

At which point can you (or me, or anyone) respond with confidence:

I’m a Data Scientist.

I think the responding-with-confidence part is key here for any of us with a modicum of humility, education, experience, etc. I don’t know about you, but I’m certainly not interested in this declaration being greeted by judgmental guffaws, coughing spasms, involuntary eye motion, etc. Instead of all this overt ‘body language’, I’m sure we’d all prefer to receive an inquiring response along the lines of:

Oh, just what the [expletive deleted] is that?

Or, at least:

Dude, seriously, did you like, just make that up?

Responses to this very legitimate, potentially disarming question will need to be saved for another time – though I’m sure a quick Google search will reveal a just-what-the-[expletive deleted]-is-a-Data-Scientist elevator pitch.

To return to the question intended for this post however, let’s focus for a moment on how a best-selling author ‘became’ a writer.

“I’m a Writer”

I was recently listening to best-selling author Jeff Goins being interviewed by podcast host Srini Rao on an episode of the Unmistakable Creative. Although the entire episode (and the podcast in general, frankly) is well worth the listen, my purpose here is to extract the discussion relating to Goins’ own process of becoming a writer. In this episode of the podcast, Goins recalls the moment when he believed he was a writer. He then set about behaving as a writer – essentially, the hard work of showing up every single day just to write. Goins continues by explaining how, based upon his belief (“I am a writer”) and his behavior (i.e., the practice of writing on a daily basis), he ultimately realized his belief through his actions (behavior) and became a writer. With five best-selling books to his credit, plus a high-traffic blog property, and I’m sure much more, it’s difficult now to dispute Goins’ claim of being a writer.

Believe. Behave. Become. Sounds like a simple enough algorithm, so in the final section of this post, I’ll apply it to the question posed at the outset – namely:

When do you legitimately get to call yourself a Data Scientist?

I’m a Data Scientist?

I suppose, then, that by direct application of Goins’ algorithm, you can start the process merely by believing you’re a Data Scientist. Of course, I think we all know that that’ll only get you so far, and probably not even to a first interview. More likely, I think that most would agree that we need to have some Data Science chops before we would even entertain such an affirmation – especially in public.

And this is where my Data Science Portfolio enters the picture – in part, allowing me to self-validate, to legitimize whether or not I can call myself a Data Scientist in public without the laughing, choking or winking. What’s interesting though is that in order to work through Goins’ algorithm, engaging in active curation of a Data Science portfolio is causing me to work backwards – making use of hindsight to validate that I have ‘arrived’ as a Data Scientist:

  • Become – Whereas I don’t have best sellers or even a high-traffic blog site to draw upon, I have been able to assemble a variety of relevant artifacts into a Portfolio. Included in the Portfolio are peer-reviewed articles that have appeared in published journals with respectable impact factors. This, for a Data Scientist, is arguably a most-stringent validation of an original contribution to the field. However, chapters in books, presentations at academic and industry events, and so on, also serve as valuable demonstrations of having become a Data Scientist. Though it doesn’t apply to me (yet?), the contribution of code would also serve as a resounding example – with frameworks such as Apache Hadoop, Apache Spark, PyTorch, and TensorFlow serving as canonical and compelling examples.
  • Behave – Not since the time I was a graduate student have I been able to show up every day. However, recognizing the importance of deliberate practice, there have been extended periods during which I have shown up every day (even if only for 15 minutes) to advance some Data Science project. In my own case, this was most often the consequence of holding down a full-time job at the same time – though in some cases, as is evident in the Portfolio, I have been able to work on such projects as a part of my job. Such win-win propositions can be especially advantageous for the aspiring Data Scientist and the organization s/he represents.
  • Believe – Perhaps the most important outcome of engaging in the deliberate act of putting together my Data Science Portfolio, is that I’m already in a much more informed position, and able to make a serious ‘gut check’ on whether or not I can legitimately declare myself a Data Scientist right here and right now.

The seemingly self-indulgent pursuit of developing my own Data Science Portfolio, an engagement of active self-curation, has (quite honestly) both surprised and delighted me; I clearly have been directly involved in the production of a number of artifacts that can be used to legitimately represent myself as ‘active’ in the area of Data Science. The part-time nature of this pursuit, especially since the completion of grad school (though with a few notable exceptions), has produced a number of outcomes that can be diplomatically described as works (still) in progress … and in some cases, that is unfortunate.

Net-net, there is some evidence to support a self-declaration as a Data Scientist – based upon artifacts produced, and implied (though inconsistent) behaviors. However, when asked the question “What do you do?”, I am more likely to respond that:

I am a demonstrably engaged and passionate student of Data Science – an aspiring Data Scientist, per se … one who’s actively working on becoming, behaving and ultimately believing he’s a Data Scientist.

Based on my biases, that’s what I currently feel owing to the very nature of Data Science itself.

Teaching/Learning Weather and Climate via Pencasting

I first heard about pencasting a few years ago, and thought it sounded interesting … and then, this past Summer, I did a little more research and decided to purchase a Livescribe 8 GB Echo(TM) Pro Pack. Over the Summer, I took notes with the pen from time to time and found it to be somewhat useful/interesting.

Just this week, however, I decided it was time to use the pen for the originally intended purpose: Making pencasts for the course I’m currently teaching in weather and climate at Toronto’s York University. Before I share some sample pencasts, please allow me to share my findings based on less than a week’s worth of ‘experience’:

  • Decent-quality pencasts can be produced with minimal effort – I figured out the basics (e.g., how to record my voice) in a few minutes, and started on my first pencast. Transferring the pencast from the pen to the desktop software to the Web (where it can be shared with my students) also requires minimal effort. “Decent quality” here refers to both the visual and audio elements. The fact that this is both a very natural (writing with a pen while speaking!) and speedy (efficient/effective) undertaking means that I am predisposed towards actually using the technology whenever it makes sense – more on that below. Net-net: This solution is teacher-friendly.
  • Pencasts complement other instructional media – This is my current perspective … Pencasts complement the textbook readings I assign, the lecture slides plus video/audio captures I provide, the Web sites we all share, the Moodle discussion forums we engage in, the Tweets I issue, etc. In the spirit of blended learning it is my hope that pencasts, in concert with these other instructional media, will allow my TAs and me to ‘reach’ most of the students in the course.
  • Pencasts allow the teacher to address both content and skills-oriented objectives – Up to this point, my pencasts have started from a blank page. This forces me to be focused, and systematically develop towards some desired content (e.g., conceptually introducing the phase diagram for H2O) and/or skills (e.g., how to calculate the slope of a line on a graph) oriented outcome. Because students can follow along, they have the opportunity to be fully engaged as the pencast progresses. Of course, what this also means is that this technology can be effective not only in the first-year university level course I’m currently teaching, but also at the academic levels that precede (e.g., grade school, high school, etc.) and follow (senior undergraduate and graduate) this level.
  • Pencasts are learner-centric – In addition to being teacher-friendly, pencasts are learner-centric. Although a student could passively watch and listen to a pencast as it plays out in a linear, sequential fashion, the technology almost begs you to interact with it. As noted previously, this means a student can easily replay some aspect of the pencast that they missed. Even more interestingly, however, students can interact with pencasts in a random-access mode – a mode that would almost certainly be useful when they are attempting to apply the content/skills conveyed through the pencast to a tutorial or assignment they are working on, or a quiz or exam they are actively studying for. It is important to note that both the visual and audio elements of the pencast can be manipulated with impressive responsiveness to random-access input from the student.
  • I’m striving for authentic, not perfect pencasts – With a little more practice and some planning/scripting, I’d be willing to bet that I could produce an extremely polished pencast. Based on past experience teaching today’s first-year university students, I’m fairly convinced that this is something they couldn’t care less about. Let’s face it, my in-person lectures aren’t perfectly polished, and neither are my pencasts. Because I can easily go back to existing pencasts and add to them, I don’t need to fret too much about being perfect the first time. Too much time spent fussing here would diminish the natural and speedy aspects of the technology.

Findings aside, on to samples:

  • Calculating the lapse rate for Earth’s troposphere – This is largely a skills-oriented example. It was my first pencast. I returned twice to the original pencast to make changes – once to correct a spelling mistake, and the second time to add in a bracket (“Run”) that I forgot. I communicated these changes to the students in the course via an updated link shared through a Moodle forum dedicated to pencasts. If you were to experience the updates, you’d almost be unaware of the lapse of time between the original pencast and the updates, as all of this is presented seamlessly as a single pencast to the students.
  • Introducing the pressure-temperature phase diagram for H2O – This is largely a content-oriented example. I got a little carried away in this one, and ended up packing in a little too much – the pencast is fairly long, and by the time I’m finished, the visual element is … a tad on the busy side. Experience gained.
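For readers who prefer code to pen strokes, the skill practised in the lapse-rate pencast boils down to a rise-over-run slope calculation. What follows is only a minimal sketch: the function name is made up for illustration, and the two data points are textbook standard-atmosphere values (15 °C at the surface, -56.5 °C at 11 km) that I am assuming here; they are not taken from the pencast itself.

```python
def lapse_rate(t_low_c, t_high_c, z_low_km, z_high_km):
    """Average lapse rate in degrees C per km.

    Positive when temperature falls with height, which is the usual
    sign convention for the troposphere.
    """
    rise = t_high_c - t_low_c    # change in temperature (degrees C)
    run = z_high_km - z_low_km   # change in altitude (km)
    if run == 0:
        raise ValueError("the two altitudes must differ")
    return -rise / run           # negate so that cooling with height is positive


# Assumed illustrative values: surface and approximate tropopause.
rate = lapse_rate(t_low_c=15.0, t_high_c=-56.5, z_low_km=0.0, z_high_km=11.0)
print(f"{rate:.1f} degrees C per km")
```

With these assumed inputs the temperature drops 71.5 °C over 11 km, which works out to 6.5 °C per km.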

Anecdotally, initial reaction from the students has been positive. Time will tell.

Next steps:

  • Monday (October 1, 2012), I intend to use a pencast during my lecture – to introduce aspects of the stability of Earth’s atmosphere. I’ll try to share here how it went. For this intended use of the pencast, I will use a landscape mode for presentation – as I expect that’ll work well in the large lecture hall I teach in. I am, however, a little concerned that the lines I’ll be drawing will be a little too thin/faint for the students at the back of the lecture theatre to see …
  • I have two sections of the NATS 1780 Weather and Climate course to teach this year. One section is taught the traditional way – almost 350 students in a large lecture theatre, 25-student tutorial groups, supported by Moodle, etc. In striking contrast to the approach taken in the meatspace section is the second section, where almost everything takes place online via Moodle. Although I have yet to support this hypothesis with any data, it is my belief that these pencasts are an excellent way to reach out to the students in the Internet-only section of the course. More on this over the fullness of time (i.e., the current academic session).

Feel free to comment on this post or share your own experiences with pencasts.

Remembering Steve Jobs

I was doing some errands earlier this evening (Toronto time) … While I was in the car, the all-news station (680news) I had on played some of Steve Jobs’ 2005 commencement address to Stanford grads. As I listened, and later re-read my own blog post on discovering the same address, I was struck, on the occasion of his passing, by the importance of valuing every experience in life. In Jobs’ case, he eventually leveraged his experience with calligraphy to design the typography for the Apple Mac – after a ten-year incubation period!

I think it’s time to read that Stanford commencement address again …

RIP Steve – and thanks much.

Targeting Public Speaking Skills via Virtual Environments

Recently I shared an a-ha! moment on the use of virtual environments for confronting the fear of public speaking.

The more I think about it, the more I’m inclined to claim that the real value of such technology is in targeted skills development.

Once again, I’ll use myself as an example here to make my point.

If I think back to my earliest attempts at public speaking as a graduate student, I’d claim that I did a reasonable job of delivering my presentation. And given that the content of my presentation was likely vetted with my research peers (fellow graduate students) and supervisor ahead of time, this left me with a targeted opportunity for improvement: The Q&A session.

Countless times I can recall having a brilliant answer to a question long after my presentation was finished – e.g., on my way home from the event. Not very useful … and exceedingly frustrating.

I would also assert that this lag, between question and appropriate answer, had a whole lot less to do with my expertise in a particular discipline, and a whole lot more to do with my degree of nervousness – how else can I explain the ability to fashion perfect answers on the way home!

Over time, I like to think that I’ve improved my ability to deliver better-quality answers in real time. How have I improved? Experience. I would credit my experience teaching science to non-scientists at York, as well as my public-sector experience as a vendor representative at industry events, as particularly edifying in this regard.

Rather than submit to such baptisms of fire, and because hindsight is 20/20, I would’ve definitely appreciated the opportunity to develop my Q&A skills in virtual environments such as Nortel web.alive. Why? Such environments can easily facilitate the focused effort I required to target the development of my Q&A skills. And, of course, as my skills improve, so can the challenges brought to bear via the virtual environment.

All speculation at this point … Reasonable speculation that needs to be validated …

If you were to embrace such a virtual environment for the development of your public-speaking skills, which skills would you target? And how might you make use of the virtual environment to do so?

On Knowledge-Based Representations for Actionable Data …

I bumped into a professional acquaintance last week. After describing briefly a presentation I was about to give, he offered to broker introductions to others who might have an interest in the work I’ve been doing. To initiate the introductions, I crafted a brief description of what I’ve been up to for the past 5 years in this area. I’ve also decided to share it here as follows: 

As always, [name deleted], I enjoyed our conversation at the recent AGU meeting in Toronto. Below, I’ve tried to provide some context for the work I’ve been doing in the area of knowledge representations over the past few years. I’m deeply interested in any introductions you might be able to broker with others at York who might have an interest in applications of the same.

Since 2004, I’ve been interested in expressive representations of data. My investigations started with a representation of geophysical data in the eXtensible Markup Language (XML). Although this was successful, use of the approach revealed an oversight: the importance of metadata (data about data). To address this oversight, a subsequent effort introduced a relationship-centric representation via the Resource Description Framework (RDF). RDF, by the way, forms the underpinnings of the next-generation Web – variously known as the Semantic Web, Web 3.0, etc. In addition to taking care of issues around metadata, use of RDF paved the way for increasingly expressive representations of the same geophysical data. For example, to represent features in and of the geophysical data, an RDF-based scheme for annotation was introduced using XML Pointer Language (XPointer). Somewhere around this point in my research, I placed all of this into a framework.

A data-centric framework for knowledge representation.

In addition to applying my Semantic Framework to use cases in Internet Protocol (IP) networking, I’ve continued to tease out increasingly expressive representations of data. Most recently, these representations have been articulated in RDFS – i.e., RDF Schema. And although I have not reached the final objective of an ontological representation in the Web Ontology Language (OWL), I am indeed progressing in this direction. (Whereas schemas capture the vocabulary of an application domain in geophysics or IT, for example, ontologies allow for knowledge-centric conceptualizations of the same.)

From niche areas of geophysics to IP networking, the Semantic Framework is broadly applicable. As a workflow for systematically enhancing the expressivity of data, the Framework is based on open standards emerging largely from the World Wide Web Consortium (W3C). Because there is significant interest in this next-generation Web from numerous parties and angles, implementation platforms allow for increasingly expressive representations of data today. In making data actionable, the ultimate value of the Semantic Framework is in providing a means for integrating data from seemingly incongruous disciplines. For example, such representations are actually responsible for providing new results – derived by querying the representation through a ‘semantified’ version of the Structured Query Language (SQL) known as SPARQL. 

I’ve spoken formally and informally about this research to audiences in the sciences, IT, and elsewhere. With York co-authors spanning academic and non-academic staff, I’ve also published four refereed journal papers on aspects of the Framework, and have an invited book chapter currently under review – interestingly, this chapter has been contributed to a book focusing on data management in the Semantic Web. Of course, I’d be pleased to share any of my publications and discuss aspects of this work with those finding it of interest.

With thanks in advance for any connections you’re able to facilitate, Ian. 
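As an aside for the technically curious, the relationship-centric idea behind RDF and SPARQL mentioned in the note above can be illustrated in a few lines. This is purely a toy sketch in plain Python, not the Semantic Framework itself and not a real RDF or SPARQL implementation: the triples, the prefixes (station:, geo:, region:) and the helper names match and query are all hypothetical, invented only for illustration.

```python
# Hypothetical (subject, predicate, object) triples; not real project data.
triples = {
    ("station:A", "geo:locatedIn", "region:Pacific"),
    ("station:B", "geo:locatedIn", "region:Atlantic"),
    ("reading:1", "geo:recordedBy", "station:A"),
    ("reading:2", "geo:recordedBy", "station:B"),
}


def match(pattern, store):
    """Yield variable bindings for one triple pattern.

    Terms that start with '?' are variables; everything else must match exactly.
    """
    for triple in store:
        bindings = {}
        for want, got in zip(pattern, triple):
            if want.startswith("?"):
                # A variable repeated within one pattern must bind consistently.
                if bindings.setdefault(want, got) != got:
                    break
            elif want != got:
                break
        else:
            yield bindings


def query(patterns, store):
    """Join several triple patterns, loosely in the spirit of a SPARQL graph pattern."""
    results = [{}]
    for pattern in patterns:
        joined = []
        for bound in results:
            # Substitute variables that earlier patterns have already bound.
            concrete = tuple(bound.get(term, term) for term in pattern)
            for new in match(concrete, store):
                joined.append({**bound, **new})
        results = joined
    return results


# Which readings were recorded by a station located in the (hypothetical) Pacific region?
rows = query(
    [
        ("?reading", "geo:recordedBy", "?station"),
        ("?station", "geo:locatedIn", "region:Pacific"),
    ],
    triples,
)
print(rows)
```

The answer to the question is not stored in any single triple; it emerges from joining two relationships, which is a small-scale version of how querying such a representation can surface new results.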

If anything comes of this, I’m sure I’ll write about it here – eventually!

In the meantime, feedback is welcome.

Survey on How Scientists Use Their Computers

How do scientists actually use computers in their day-to-day work?

A Canadian team is conducting a survey to find out:

Computers are as important to modern scientists as test tubes, but we know surprisingly little about how scientists develop and use software in their research. To find out, the University of Toronto, Simula Research Laboratory, and the National Research Council of Canada have launched an online survey in conjunction with “American Scientist” magazine. If you have 20 minutes to take part, please go to:

http://softwareresearch.ca/seg/SCS/scientific-computing-survey.html

Thanks in advance for your help!

Jo Hannay (Simula Research Laboratory)
Hans Petter Langtangen (Simula Research Laboratory)
Dietmar Pfahl (Simula Research Laboratory)
Janice Singer (National Research Council of Canada)
Greg Wilson (University of Toronto)

The results of the survey will be shared via American Scientist.