Joost H. van der Linden

Applied Mathematician and Data Scientist.

I presented this talk at the Data Science Melbourne Meetup and at the RMIT kick-off for the Master of Data Science, Master of Analytics and Master of Statistics and Operations Research in February 2017. Having worked as a data scientist for about one year while simultaneously pursuing a PhD gave me some perspective on "both sides of the fence" (academia and industry), which I shared in this talk. These are the differences between studying and working as a data scientist that I personally experienced:

  • Communication. The Harvard Business Review wrote “The Best Data Scientists Get Out and Talk to People.” Wow. Rocket science? Not really! While you can get away with hiding behind your computer as a student, when you reach the real world, you’ll obviously have to start talking to other people. I was there too. They teach you a lot in Applied Mathematics, but communicating well was definitely not in the curriculum. I learned that the best way to start talking is to start listening. Prepare questions about the data, put them on the table, and let the conversation go from there. Be curious. Ask more questions. Work your way from ambiguity to clarity.
  • More specifically, communicating complexity. Say you are doing a technical assignment at university. Your professor is smarter than you. While you still have to explain yourself properly, you can assume quite a bit of prior knowledge, write a technical report, maybe even try a few state-of-the-art methods and get a good grade. As a data scientist, that is obviously not going to work. A lot of the time, you are going to have to explain what you did to a non-technical audience. Those fancy state-of-the-art methods might be tempting to use, but if you can't explain them to your boss, they are going to be hard to justify.
  • Assignments. When it comes to figuring out what you need to do, there are clear differences between being a student and working as a data scientist. In assignments, the questions and the steps to answer them are usually well-defined. In real life, there are two extremes: either nobody tells you anything, or everybody has an opinion and tells you everything. In practice, I often find myself starting with nothing, slowly gathering information, working my way from ambiguity to clarity in the business case, and ending up with a plan. Not coincidentally, this is also the format of the Melbourne Datathon, in which you get very little initial guidance on the dataset. You have to discover the questions, goals and value yourself.
  • Working with real data. Another obvious one. In my PhD, I mostly work with data from numerical models and experiments. Physical laws and equations dictate most of what I do with that data. In a PhD, or in assignments as a student, the environment you work in is much more controlled: things are usually well-defined, with the challenges laid out and the steps prescribed. In my job, on the other hand, all the data I work with stems from people, or at least has a strong human element to it. Because the real world is messy, the data is inherently much more unstructured. Challenges are ill-defined and the data doesn't come in a nice little package. Understanding the people involved, the assumptions made and the data itself easily takes up more than half of my time.
  • Producing actionable insights. I found this quote by Andrew Means in one of the Our Community newsletters: “It is easy to get caught up in paralysis by analysis, where we go down these rabbit holes of interesting factoids that can never be translated into action.” At the time I read this, I was analysing 500k+ donations from GiveNow. I looked at my browser and I started counting the number of Python notebooks I had open: 12, plus an instance of RStudio. I was looking at the GiveNow donations from every conceivable angle, without having a clear plan of what I was looking for. I realized that Andrew was literally talking to me. I was in that rabbit hole and I wasn’t separating “interesting” from “actionable”. Because we wanted this study to be useful and actionable for not-for-profits, I brought in my colleague to sort through what I had at the time. We stripped down the initial analysis to keep only the (potentially) actionable stuff. Perhaps this is a natural progression from exploration to action, but it was definitely a useful lesson.
  • Preventing over-engineering. As a student, if you enjoy what you are learning, you often want to go the extra mile on your assignments. If you implement that state-of-the-art method or try lots of different approaches, you'll probably be rewarded with some extra credit. I went into one of the projects at my job with this mindset. I was presented with 30,000 text fields and asked if I could find the common topics. Having wanted to learn more about text analytics for a while, I found a fantastic NLP tutorial on phrase modeling, LDA, Word2Vec (word algebra!) and t-SNE, and I went all out. The results were… OK, but not great. I got some insights, but it wasn't really enough to see the commonalities in the text. So I had to take a step back and simplify. I found that calculating word frequencies, plotting some word clouds and reading the text itself (wouldn't you know!) was just as useful, if not more so (a rough sketch of that simpler approach follows this list). These simpler things were also much easier to communicate. Another lesson learned.
  • No right answers. Unlike assignments, business cases often have no "right" or "wrong" answers. You find that there are ten answers instead, each with upsides and downsides, and the goal is to find the "least-worst" option with the most benefits. This is obviously a universal statement, going well beyond data science, but I reckon we have to deal with it a lot, in ways that you don't always expect. In the not-for-profit sector, for example, there is always a big push to make projects quantifiable. People want to see where the money is going and what the outcomes are. Data science can help. However, an unexpected side effect of quantifying projects is that you increase the risk of funders focusing only on easily quantifiable, short-term, quick-success projects (e.g. 10 mosquito nets delivered), rather than long-term, messy projects (e.g. a big policy overhaul with a big impact). As a data scientist working for a not-for-profit, you might find yourself in the middle of these counteracting forces.
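
For what it's worth, here is roughly what that simpler text-analysis approach boils down to. This is not my original notebook; it is a minimal Python sketch, assuming the free-text responses are available as a list of strings and that the third-party wordcloud package is installed. The variable names are made up for illustration.

```python
# A minimal sketch of the "simpler is better" approach: word frequencies
# and a word cloud over a collection of free-text fields.
# Assumes `responses` is a list of strings and that the third-party
# `wordcloud` package is installed (pip install wordcloud).
import re
from collections import Counter

from wordcloud import STOPWORDS, WordCloud

responses = [
    "Great cause, happy to donate again",
    "Donated because a friend recommended the campaign",
    # ... in practice, the 30,000 free-text fields go here
]

# Tokenise crudely: lowercase words of 3+ letters, minus common stopwords.
tokens = [
    word
    for text in responses
    for word in re.findall(r"[a-z]{3,}", text.lower())
    if word not in STOPWORDS
]

# Plain word frequencies already say a lot about the common topics.
frequencies = Counter(tokens)
print(frequencies.most_common(20))

# A word cloud of the same frequencies is easy to put in front of
# a non-technical audience.
cloud = WordCloud(width=800, height=400, background_color="white")
cloud.generate_from_frequencies(frequencies)
cloud.to_file("topics.png")
```

A frequency table and a word cloud like this were far easier to explain than an LDA topic model or a t-SNE plot, which was exactly the point.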

I also shared some advice on getting started in the sector: