Data Science at ODSC West 2019

Not long ago, Robert Chen from Software Engineering wrote a blog post about the Strata Data Conference.  As he mentions, one of the nice learning opportunities at Enova is conference travel.  I’m here to discuss the Open Data Science Conference (ODSC) West, which several of us from Enova’s Analytics team attended.

The first two days focused on a Business Summit and longer workshops and trainings, while the last two days were devoted to shorter talks. (There was also one pre-conference day of bootcamps and a career fair.)  Talks covered a variety of focus areas, which I’d categorize as ML/DS techniques, platforms, ethics, MLOps, and business needs.  I tried to attend talks from all of them, and below I’ll aggregate some key takeaways; with roughly 200 talks on the program, other attendees may have left with very different impressions.

The lineup of talks also looked great: when I exported my first-draft schedule to my calendar, a number of time slots were quadruple-booked!

Data Ethics

All three keynote talks had a significant focus on ethics in ML/AI.  Several other talks and workshops focused on ethics as well, and even more at least mentioned ethical concerns.  Speakers offered several famous (and less famous) examples of AI tools failing ethical or security tests:

  • learning bias from training data
  • being susceptible to adversarial attacks (poisoning a training set, or simply probing a model in production, even a black-box one, and constructing adversarial inputs [video])

    Innocuous-looking graffiti on a stop sign (right) tricks a computer vision model into thinking it’s a speed limit sign.
  • failing to protect private data (reverse-engineering training data from a black-box model [BAIR blog])
    xkcd comic: person at computer types 'Long live the revolution. Our next meeting will be at' and autocomplete suggests '...the docks at midnight on june 28'. Caption 'When you train predictive models on input from your users, it can leak information in unexpected ways'

In response, there has been a push to create tools that monitor models for bias and better anonymize data.  There is also a more general push for interpretable models, or explainability wrappers around black-box models, to make decisioning more transparent and auditable.

ML Techniques and Best Practices

Talks ran the gamut here, from missing-value imputation to bleeding-edge computer vision research.
Some favorites:

  • CatBoost – I’ve personally been skeptical of some of the methods used in the package, but it’s probably time to run some experiments on our own datasets (a minimal sketch of what that might look like follows this list).
  • adversarial noise for testing/improving model robustness – It’s not clear to me whether training on the noised data is sound (pun unintended, but enjoyed), but at least this seems useful for understanding where your model’s decisions are highly volatile.  [IBM open-source package]
  • graph/network applications – My academic research was in graph theory, so I’d love to look into applications at Enova.  That’s been on my to-do list for a while, but some of the conversations at ODSC generated new ideas and bumped the topic a little higher on the list.

    Graph data may make more complex relationships easier to explore.
  • dabl, a(n extremely) young human-in-the-loop ML package by sklearn core dev Andreas Mueller.  It seems promising; with just a few lines we’re already duplicating some of our simple internal tools (a quick sketch also follows this list).
  • Shanghang Zhang gave a really great talk on neural networks that can generalize to new domains and unseen categories [one relevant paper].  Most of this work is advanced computer vision and not immediately applicable to what we do, but it was super interesting.
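
To make the CatBoost item concrete, here’s a minimal sketch of what a first experiment might look like.  The dataset, column names, and hyperparameters are hypothetical, not from any talk or from our actual workflow:

```python
# Minimal sketch of a first CatBoost experiment (hypothetical dataset and columns).
import pandas as pd
from sklearn.model_selection import train_test_split
from catboost import CatBoostClassifier

df = pd.read_csv("loans.csv")                      # hypothetical dataset
X, y = df.drop(columns=["defaulted"]), df["defaulted"]
cat_features = ["state", "channel"]                # columns CatBoost should treat as categorical

X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=0)

model = CatBoostClassifier(iterations=500, learning_rate=0.05, verbose=100)
model.fit(X_train, y_train, cat_features=cat_features, eval_set=(X_test, y_test))
print(model.get_best_score())                      # best train/validation scores seen during fitting
```

The main draw is that categorical columns are passed through directly rather than one-hot encoded up front, which is exactly the piece I want to compare against our existing pipelines.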
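And to show why dabl caught my eye: a first-pass model is only a few lines.  The sketch below reuses the same hypothetical dataset and leans on dabl’s quick-start API, which (given how young the package is) may well change:

```python
# Rough sketch of a dabl quick start (hypothetical dataset; dabl's API is young and may change).
import pandas as pd
import dabl

df = pd.read_csv("loans.csv")                      # hypothetical dataset
df_clean = dabl.clean(df)                          # detect column types and fix obvious issues
dabl.plot(df_clean, target_col="defaulted")        # quick exploratory plots against the target

model = dabl.SimpleClassifier().fit(df_clean, target_col="defaulted")
```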

MLOps

A number of talks leaned toward software engineering, with tools and processes for standardizing and productionizing the machine learning workflow.

MLflow, Cookiecutter, and Project Orbyter stood out as worth adopting, at least in part, into our data science workflow (a minimal MLflow tracking sketch appears after the figure below).
Also of interest was the presentation on Uber’s experimentation tracking tool.

Schematic of the Orbyter development setup.
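
As a taste of why MLflow stood out, here’s a minimal sketch of tracking a run; the experiment name, parameters, and metric values are all hypothetical:

```python
# Minimal sketch of MLflow experiment tracking (hypothetical experiment, params, and metrics).
import mlflow

mlflow.set_experiment("catboost-baseline")         # hypothetical experiment name

with mlflow.start_run():
    mlflow.log_param("learning_rate", 0.05)
    mlflow.log_param("iterations", 500)
    mlflow.log_metric("test_auc", 0.78)            # stand-in value
    mlflow.log_artifact("feature_importance.png")  # any local file you want to keep with the run
```

Runs logged this way show up in the MLflow tracking UI, which makes comparing experiments across a team much easier than digging through notebooks.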

The Business Side

I tried to make it to a few of the Business Summit talks, though their schedule didn’t line up well with the main-conference talks.

  • Cassie Kozyrkov, Google’s Chief Decision Scientist and an excellent public speaker, gave a great talk describing her taxonomy of the (famously ill-defined) field of Data Science. There’s a blog version at TowardsDataScience.
  • Michael Xiao of BCBS gave a nice talk on data science Centers of Excellence and the extent to which different processes should be centralized into such a team.  [video of a similar talk]

Repositories and slides are available from ODSC.

Go forth and Science some Data!  (And, if you’re in the San Francisco area, go grab yourself an It’s It.)