Data Scientist: Believe. Behave. Become.

A Litmus Test

When do you legitimately get to call yourself a Data Scientist?

How about a litmus test? You’re at a gathering of some type, and someone asks you:

So, what do you do?

At which point can you (or I, or anyone) respond with confidence:

I’m a Data Scientist.

I think the responding-with-confidence part is key here for any of us with a modicum of humility, education, experience, etc. I don’t know about you, but I’m certainly not interested in this declaration being greeted by judgmental guffaws, coughing spasms, involuntary eye motion, etc. Instead of all this overt ‘body language’, I’m sure we’d all prefer to receive an inquiring response along the lines of:

Oh, just what the [expletive deleted] is that?

Or, at least:

Dude, seriously, did you like, just make that up?

Responses to this very legitimate, potentially disarming question will need to be saved for another time – though I’m sure a quick Google search will reveal a just-what-the-[expletive deleted]-is-a-Data-Scientist elevator pitch.

To return to the question intended for this post, however, let’s focus for a moment on how a best-selling author ‘became’ a writer.

“I’m a Writer”

I was recently listening to best-selling author Jeff Goins being interviewed by podcast host Srini Rao on an episode of the Unmistakable Creative. Although the entire episode (and the podcast in general, frankly) is well worth the listen, my purpose here is to extract the discussion relating to Goins’ own process of becoming a writer. In this episode of the podcast, Goins recalls the moment when he believed he was a writer. He then set about behaving as a writer – essentially, the hard work of showing up every single day just to write. Goins continues by explaining how, based upon his belief (“I am a writer”) and his behavior (i.e., the practice of writing on a daily basis), he ultimately realized his belief through his actions and became a writer. With five best-selling books to his credit, plus a high-traffic blog property, and I’m sure much more, it’s difficult now to dispute Goins’ claim of being a writer.

Believe. Behave. Become. Sounds like a simple enough algorithm, so in the final section of this post, I’ll apply it to the question posed at the outset – namely:

When do you legitimately get to call yourself a Data Scientist?

I’m a Data Scientist?

I suppose, then, that by direct application of Goins’ algorithm, you can start the process merely by believing you’re a Data Scientist. Of course, I think we all know that that’ll only get you so far, and probably not even to a first interview. More likely, I think that most would agree that we need to have some Data Science chops before we would even entertain such an affirmation – especially in public.

And this is where my Data Science Portfolio enters the picture – in part, allowing me to self-validate, to legitimize whether or not I can call myself a Data Scientist in public without the laughing, choking or winking. What’s interesting though is that in order to work through Goins’ algorithm, engaging in active curation of a Data Science portfolio is causing me to work backwards – making use of hindsight to validate that I have ‘arrived’ as a Data Scientist:

  • Become – Whereas I don’t have best sellers or even a high-traffic blog site to draw upon, I have been able to assemble a variety of relevant artifacts into a Portfolio. Included in the Portfolio are peer-reviewed articles that have appeared in published journals with respectable impact factors. This, for a Data Scientist, is arguably a most-stringent validation of an original contribution to the field. However, chapters in books, presentations at academic and industry events, and so on, also serve as valuable demonstrations of having become a Data Scientist. Though it doesn’t apply to me (yet?), the contribution of code would also serve as a resounding example – with frameworks such as Apache Hadoop, Apache Spark, PyTorch, and TensorFlow serving as canonical and compelling examples.
  • Behave – Not since the time I was a graduate student have I been able to show up every day. However, recognizing the importance of deliberate practice, there have been extended periods during which I have shown up every day (even if only for 15 minutes) to advance some Data Science project. In my own case, the limited time was most often the consequence of holding down a full-time job at the same time – though in some cases, as is evident in the Portfolio, I have been able to work on such projects as a part of my job. Such win-win propositions can be especially advantageous for the aspiring Data Scientist and the organization s/he represents.
  • Believe – Perhaps the most important outcome of engaging in the deliberate act of putting together my Data Science Portfolio is that I’m already in a much more informed position, and able to make a serious ‘gut check’ on whether or not I can legitimately declare myself a Data Scientist right here and right now.

The seemingly self-indulgent pursuit of developing my own Data Science Portfolio, an engagement of active self-curation, has (quite honestly) both surprised and delighted me; I clearly have been directly involved in the production of a number of artifacts that can be used to legitimately represent myself as ‘active’ in the area of Data Science. The part-time nature of this pursuit, especially since the completion of grad school (though with a few notable exceptions), has produced a number of outcomes that can be diplomatically described as works (still) in progress … and in some cases, that is unfortunate.

Net-net, there is some evidence to support a self-declaration as a Data Scientist – based upon artifacts produced, and implied (though inconsistent) behaviors. However, when asked the question “What do you do?”, I am more likely to respond that:

I am a demonstrably engaged and passionate student of Data Science – an aspiring Data Scientist, per se … one who’s actively working on becoming, behaving and ultimately believing he’s a Data Scientist.

Based on my biases, that’s what I currently feel owing to the very nature of Data Science itself.

Remembering a Supportive Sibling

Less than a week before I was scheduled to deliver my first presentation on a novel way for approaching an outstanding challenge in seismic processing, my younger sister Deborah passed away. She was only 50. Thanks to medical care that included extensive chemotherapy, Debbie recovered from lymphoma once, and was declared cancer-free. However, a second wave of lymphoma, accompanied by leukemia, proved to be more than she could handle – and we lost her during a procedure that (ironically) was attempting to provide more information about the cancers that had literally taken over her body.

Between Debbie’s passing and her funeral was not only about a week’s lapse of time, but also the need for me to make a decision – a decision to present as scheduled at the 2015 Rice University Oil & Gas Conference in Houston or miss the event entirely. A complicating factor in my ability to make this decision was that I truly was the only person who could deliver the presentation. That’s more a pragmatic statement than a boastful one, as I had combined my background in geophysics with an increasing knowledge of Big Data Analytics; in so doing, I’d arrived at a submission for the RiceU Conference that was as uniquely of my creation as it was a disruptive suggestion – in other words, something I felt strongly to be well suited to the Conference’s Disruptive Technology Track. With the Conference being less than a week away, most of the real work had already been completed; in other words, all I needed to do was show up, make a two-minute presentation, and discuss the poster I’d prepared with those who expressed an interest.

Debbie was always highly supportive and encouraging when it came to ‘things’ like this – the expression of something worth sharing. This, despite the fact that she and I were on completely different trajectories when it came to our intellectual interests and pursuits – me in the physical sciences and technology, while Debbie favoured English literature. Despite these differences, Debbie often made a point of trying to understand and appreciate what I was working on – no matter how geekily obscure.

In recalling these traits of hers, her sincere interest in what I was doing (I suppose) just because we were siblings, my decision to follow through with the presentation at the RiceU Conference was a relatively easy one. Executing it, however, was at times challenging … and I could not have followed through without the support of my colleagues from Bright Computing.

You can still review my two-minute presentation here thanks to the wonderful people who run this industry-leading event on an annual basis at Rice. The poster I alluded to is available here. The ideas hatched through these 2015 communications proved instrumental in spinning off additional contributions. Equally important were the interactions initiated at this 2015 RiceU Conference. Some of these interactions resulted in relationships that persist through today – relationships that have, for example, resulted in me applying Machine Learning to problems of scientific interest.

And so it is, on the occasion of what would’ve been Debbie’s 54th birthday, that I wistfully remember my sister. Without knowing that I’d have had her support and encouragement, I likely wouldn’t have followed through with that March 2015 presentation at the RiceU Conference – a decision that had immediate and long-lasting implications for my progression as a Data Scientist.

Targeting Public Speaking Skills via Virtual Environments

Recently I shared an a-ha! moment on the use of virtual environments for confronting the fear of public speaking.

The more I think about it, the more I’m inclined to claim that the real value of such technology is in targeted skills development.

Once again, I’ll use myself as an example here to make my point.

If I think back to my earliest attempts at public speaking as a graduate student, I’d claim that I did a reasonable job of delivering my presentation. And given that the content of my presentation was likely vetted with my research peers (fellow graduate students) and supervisor ahead of time, this left me with a targeted opportunity for improvement: The Q&A session.

Countless times I can recall having a brilliant answer to a question long after my presentation was finished – e.g., on my way home from the event. Not very useful … and exceedingly frustrating.

I would also assert that this lag, between question and appropriate answer, had a whole lot less to do with my expertise in a particular discipline, and a whole lot more to do with my degree of nervousness – how else can I explain the ability to fashion perfect answers on the way home?

Over time, I like to think that I’ve improved my ability to deliver better-quality answers in real time. How have I improved? Experience. I would credit my experience teaching science to non-scientists at York, as well as my public-sector experience as a vendor representative at industry events, as particularly edifying in this regard.

Rather than submit to such baptisms of fire, and because hindsight is 20/20, I would’ve definitely appreciated the opportunity to develop my Q&A skills in virtual environments such as Nortel web.alive. Why? Such environments can easily facilitate the focused effort I required to target the development of my Q&A skills. And, of course, as my skills improve, so can the challenges brought to bear via the virtual environment.

All speculation at this point … Reasonable speculation that needs to be validated …

If you were to embrace such a virtual environment for the development of your public-speaking skills, which skills would you target? And how might you make use of the virtual environment to do so?

On Knowledge-Based Representations for Actionable Data …

I bumped into a professional acquaintance last week. After describing briefly a presentation I was about to give, he offered to broker introductions to others who might have an interest in the work I’ve been doing. To initiate the introductions, I crafted a brief description of what I’ve been up to for the past 5 years in this area. I’ve also decided to share it here as follows: 

As always, [name deleted], I enjoyed our conversation at the recent AGU meeting in Toronto. Below, I’ve tried to provide some context for the work I’ve been doing in the area of knowledge representations over the past few years. I’m deeply interested in any introductions you might be able to broker with others at York who might have an interest in applications of the same.

Since 2004, I’ve been interested in expressive representations of data. My investigations started with a representation of geophysical data in the eXtensible Markup Language (XML). Although this was successful, use of the approach exposed an oversight: the importance of metadata (data about data). To address this oversight, a subsequent effort introduced a relationship-centric representation via the Resource Description Framework (RDF). RDF, by the way, forms the underpinnings of the next-generation Web – variously known as the Semantic Web, Web 3.0, etc. In addition to taking care of issues around metadata, use of RDF paved the way for increasingly expressive representations of the same geophysical data. For example, to represent features in and of the geophysical data, an RDF-based scheme for annotation was introduced using XML Pointer Language (XPointer). Somewhere around this point in my research, I placed all of this into a framework.

A data-centric framework for knowledge representation.

 In addition to applying my Semantic Framework to use cases in Internet Protocol (IP) networking, I’ve continued to tease out increasingly expressive representations of data. Most recently, these representations have been articulated in RDFS – i.e., RDF Schema. And although I have not reached the final objective of an ontological representation in the Web Ontology Language (OWL), I am indeed progressing in this direction. (Whereas schemas capture the vocabulary of an application domain in geophysics or IT, for example, ontologies allow for knowledge-centric conceptualizations of the same.)  

From niche areas of geophysics to IP networking, the Semantic Framework is broadly applicable. As a workflow for systematically enhancing the expressivity of data, the Framework is based on open standards emerging largely from the World Wide Web Consortium (W3C). Because there is significant interest in this next-generation Web from numerous parties and angles, implementation platforms allow for increasingly expressive representations of data today. In making data actionable, the ultimate value of the Semantic Framework is in providing a means for integrating data from seemingly incongruous disciplines. For example, such representations are actually responsible for providing new results – derived by querying the representation through SPARQL, a ‘semantified’ counterpart to the Structured Query Language (SQL).

I’ve spoken formally and informally about this research to audiences in the sciences, IT, and elsewhere. With York co-authors spanning academic and non-academic staff, I’ve also published four refereed journal papers on aspects of the Framework, and have an invited book chapter currently under review – interestingly, this chapter has been contributed to a book focusing on data management in the Semantic Web. Of course, I’d be pleased to share any of my publications and discuss aspects of this work with those finding it of interest.

With thanks in advance for any connections you’re able to facilitate, Ian. 

If anything comes of this, I’m sure I’ll write about it here – eventually!

In the meantime, feedback is welcome.
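
For readers who haven’t met relationship-centric representations before, here is a minimal sketch of the idea in plain Python. To be clear, this is not my Semantic Framework, nor RDF or SPARQL proper: the triples, the prefixes (geo:, net:, region:) and the helper names are all hypothetical, invented purely for illustration. It simply shows how subject-predicate-object statements from two unrelated domains can sit in one collection and be queried by pattern, which is roughly what a basic SPARQL query does.

```python
from typing import Iterator, Optional, Set, Tuple

# A statement is a (subject, predicate, object) triple, in the spirit of RDF.
Triple = Tuple[str, str, str]

# Hypothetical data: every identifier below is made up for illustration.
# Note that geophysics and IP networking facts share one collection.
TRIPLES: Set[Triple] = {
    ("station:A", "rdf:type", "geo:GravityStation"),
    ("station:A", "geo:locatedIn", "region:Ontario"),
    ("station:B", "rdf:type", "geo:GravityStation"),
    ("station:B", "geo:locatedIn", "region:Quebec"),
    ("router:R1", "rdf:type", "net:Router"),
    ("router:R1", "geo:locatedIn", "region:Ontario"),
}


def match(
    triples: Set[Triple],
    s: Optional[str] = None,
    p: Optional[str] = None,
    o: Optional[str] = None,
) -> Iterator[Triple]:
    """Yield triples that fit the pattern; None acts as a wildcard."""
    for subj, pred, obj in triples:
        if (s is None or subj == s) and (p is None or pred == p) and (o is None or obj == o):
            yield (subj, pred, obj)


def subjects_located_in(triples: Set[Triple], region: str) -> list:
    """Return, in sorted order, every subject located in the given region."""
    return sorted(subj for subj, _, _ in match(triples, p="geo:locatedIn", o=region))


if __name__ == "__main__":
    # One query spans both domains because both use the same relationship.
    print(subjects_located_in(TRIPLES, "region:Ontario"))
```

The point of the sketch is the last line: because the station and the router are described with the same kind of relationship, a single pattern retrieves both, without either dataset having been designed with the other in mind.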

Blended Learning Panel

York University’s Institute for Research on Learning Technologies is sponsoring a panel discussion on blended learning:

“A recent workplace survey reported by Brandon Hall Publishing (2008) indicates that employing a mix of web-technologies with face-to-face learning is more effective than either e-learning or face-to-face instructional approaches alone. To explore the use and potential of “blended learning” further, please join us for a panel discussion featuring experts from various fields …”

This event has been re-scheduled for April 2, 2009 at 12:15 pm in TEL 1009 at York’s Keele Campus. I anticipate a lively and interesting discussion!

(Please check the IRLT Web site for the latest updates on the event.)

An Eight Pack of Leadership Traits

I recently came across an article by Hank Marquis on effective leadership traits for those in IT.

Marquis distills the following eight pack of traits:
  1. Leadership means focusing on the needs of others, not yourself
  2. Leadership comes from your actions, not your title
  3. Leadership makes you accountable, even if it’s not your fault
  4. Leadership is not a 9-to-5 activity
  5. Leadership takes trust from your followers
  6. Leaders get their best ideas from their team
  7. Leadership thrives on diversity
  8. Leadership comes from continuous communication
Marquis elaborates on each of these traits in the article.
And as two final nuggets to further whet your appetite, consider the following quotes:

Effective leaders build a trusted team and then follow the team’s advice.

… always give the credit to the team. The leader’s credit comes only by crediting the team he or she leads.