Data Science at ODSC West 2019
Not long ago, Robert Chen from Software Engineering wrote a blog post about the Strata Data Conference. As he mentions, conference travel is one of the nice learning opportunities at Enova. I’m here to discuss the Open Data Science Conference that several of us from Enova’s Analytics team attended.
The first two days focused on a Business Summit and longer workshops/trainings. The last two days focused on shorter talks. (There was also one pre-conference day of bootcamps and a career fair.) Talks had a variety of focus areas, which I’d categorize as: ML/DS techniques, platforms, ethics, MLOps, and business needs. I tried to attend talks from all the focus areas, and below I’ll try to aggregate key takeaways; it’s worth noting that there were ~200 talks, so other attendees may have left with very different experiences.
It also seemed like a great lineup of talks. When I exported my first-draft schedule to my calendar, I had a number of time slots that were quadruple-booked!
Data Ethics
All three keynote talks had a significant focus on ethics in ML/AI. Several other talks and workshops also had a focus on ethics, and even more talks at least mentioned ethical concerns. We were provided with several famous and less-famous examples of where AI tools have failed ethical or security tests:
- learning bias from training data
- being susceptible to adversarial attacks (poisoning a training set, or just probing a model [even a black-box one] in production and constructing adversarial inputs [video])
- failing to protect private data (reverse-engineering training data from a black-box model [BAIR blog])
There has been a push, then, to create tools to monitor models for bias and better anonymize data. There is also a more general push for more interpretable models or explainability wrappers for black-box models, to make the decisioning more transparent and auditable.
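To make the monitoring idea concrete, here is a minimal sketch of one check such a tool might run: comparing approval rates across groups (often called demographic parity). This is not taken from any of the talks or from a specific package; the function names and the toy data are made up for illustration, and a real monitor would look at several fairness metrics rather than this one alone.

```python
from collections import defaultdict


def approval_rates(decisions, groups):
    """Return the fraction of positive decisions (1s) within each group."""
    totals = defaultdict(int)
    positives = defaultdict(int)
    for decision, group in zip(decisions, groups):
        totals[group] += 1
        positives[group] += decision
    return {g: positives[g] / totals[g] for g in totals}


def demographic_parity_gap(decisions, groups):
    """Largest difference in approval rate between any two groups."""
    rates = approval_rates(decisions, groups)
    return max(rates.values()) - min(rates.values())


# Toy data (hypothetical): 1 = approved, 0 = declined.
decisions = [1, 0, 1, 1, 0, 0, 1, 0]
groups = ["a", "a", "a", "a", "b", "b", "b", "b"]

rates = approval_rates(decisions, groups)
gap = demographic_parity_gap(decisions, groups)
```

A gap near zero means the groups are approved at similar rates; a large gap is a prompt to investigate, not proof of bias on its own.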
ML Techniques and Best Practices
Talks ran the gamut here, from missing-value imputation to bleeding-edge computer vision research.
Some favorites:
- CatBoost – I’ve personally been skeptical of some of the methods used in the package, but it’s probably time to do some experiments on our datasets.
- adversarial noise for testing/improving model robustness – It’s not clear to me whether training on the noised data is sound (pun unintended, but enjoyed), but at least this seems useful for understanding where your model’s decisions are highly volatile. [IBM open-source package]
- graph/network applications – My academic research was in graph theory, so it’s great to look into applications at Enova. That’s been on my to-do list for a while, but some of the conversations at ODSC have generated new ideas and bumped the topic a little higher on the list.
- dabl, a(n extremely) young human-in-the-loop ML package by sklearn core dev Andreas Mueller. It seems promising, and we’re already duplicating some simple internal tools.
- Shanghang Zhang gave a really great talk on neural networks that can generalize to new domains and unseen categories [one relevant paper]. Most of this work is advanced computer vision and not immediately useful to what we do, but it was super interesting.
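For the noise-based robustness item in the list above, a rough sketch of the "where are decisions volatile?" idea follows. It uses plain random Gaussian noise, which is much weaker than a true adversarial attack, and it does not use the IBM package's API; `flip_rate` and `toy_model` are hypothetical names, and the threshold model is only a stand-in for a real classifier.

```python
import random


def flip_rate(predict, row, scale=0.05, trials=200, seed=0):
    """Fraction of small random perturbations of `row` that change the prediction."""
    rng = random.Random(seed)
    baseline = predict(row)
    flips = 0
    for _ in range(trials):
        # Add independent Gaussian noise to every feature.
        noisy = [x + rng.gauss(0.0, scale) for x in row]
        if predict(noisy) != baseline:
            flips += 1
    return flips / trials


def toy_model(row):
    """Stand-in classifier (hypothetical): approve when the features sum past 1.0."""
    return int(sum(row) > 1.0)


# A point close to the decision boundary versus one far from it.
near_boundary = flip_rate(toy_model, [0.5, 0.51])
far_from_boundary = flip_rate(toy_model, [0.9, 0.9])
```

Inputs with a high flip rate are the ones where the model's decision is fragile, which is useful to know whether or not you go on to train on the noised data.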
MLOps
A number of talks leaned in the direction of software engineering, with tools and processes for standardizing and productionalizing the machine learning workflow.
MLflow, Cookiecutter, and Project Orbyter stood out as worth using, or at least worth borrowing pieces from, in our data science workflow.
Also of interest was the presentation on Uber’s experimentation tracking tool.
The Business Side
I tried to make it to a few of the Business Summit talks, though the schedules for those talks didn’t line up well with the main event talks.
- Cassie Kozyrkov, Google’s Chief Decision Scientist and excellent public speaker, gave a great talk describing her taxonomy of the (famously ill-defined) field of Data Science. There’s a blog version at TowardsDataScience.
- Michael Xiao of BCBS gave a nice talk on data science Centers of Excellence and the extent to which different processes should be centralized into such a team. [video of a similar talk]
Repositories and slides are available from ODSC.
Go forth and Science some Data! (And, if you’re in the San Francisco area, go grab yourself an It’s It.)