“I’m in a Data State of Mind” – Thoughts from the Strata Data Conference in NYC
“You are here not to learn Kafka, but to learn how to build data products that add value to your company.”
With these first words from the instructor in the “Professional Kafka Development” class, I knew that the O’Reilly Strata Data Conference in NYC would be a different experience. That proved to be the case. Over the next 4 days, I learned a lot about what other companies are doing to make valuable data products.
At Enova, everyone can attend one conference a year of their choosing (one of the things I especially like about working here). I chose this one and wanted to share a few thoughts on what the conference was like, some of the most interesting / useful takeaways, and a few tips for those who might attend next year.
The conference started with a 2-day course on Kafka, a technology we use here at Enova. One thing that became immediately apparent was Kafka’s reach — my fellow classmates were from Azerbaijan, Brazil, China, Germany, Mexico, Saudi Arabia, and Switzerland. Companies represented included BMW, Capital One, Costco, Petrobras, PWC, Quicken, Steelcase, Uline, Union Pacific, and Visa. From Chicago there was a trading company and folks from ComEd — apparently over 1 terabyte of data comes in each day from those smart meters in the Chicagoland area.
The class was a good introduction to the basic Producer – Topic – Partition – Broker – Consumer Group – Consumer concepts of Kafka, with hands-on programming exercises interspersed so you could see what was “really going on.” It was a good opportunity to break out a VirtualBox image and use the Eclipse IDE to write Java — reminding me of those days writing Hadoop MapReduce jobs in Java and working with Hive and Pig. It was a very useful class for this relative Kafka newbie. Perhaps even more valuable were the various insights on using Kafka from the instructor (Jesse Anderson), who is a major voice in the Data Engineering world today.
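To give a flavor of the hands-on exercises, here is a minimal sketch of the kind of producer we wrote in class (the broker address and topic name below are placeholders, and it assumes the kafka-clients library is on the classpath):

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.Producer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class HelloKafkaProducer {
    public static void main(String[] args) {
        // Minimal producer configuration: where the brokers are, and how to
        // serialize keys and values into bytes on the wire.
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // placeholder broker
        props.put("key.serializer",
                "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer",
                "org.apache.kafka.common.serialization.StringSerializer");

        try (Producer<String, String> producer = new KafkaProducer<>(props)) {
            // Each record is sent to a topic; the key determines which
            // partition it lands on, and consumers in a consumer group
            // divide the partitions among themselves.
            producer.send(new ProducerRecord<>("hello-topic", "key-1", "hello, Strata"));
        }
    }
}
```

The comments map directly onto the concepts from the class: the record goes to a topic, the key picks the partition, brokers store it, and a consumer group reads it.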
Days 3 and 4 consisted of 7–10 keynote speeches in the morning followed by 5–6 breakout session slots; for each slot you had a choice of 15 different topics. I focused on threads involving Data Engineering, Compliance / PII, Data Catalogs / Data Lineage, and a few general / miscellaneous talks. Here are the most interesting thoughts / takeaways from those 2 days:
(1) “Basketball Play Diagrams as a Query Language”
The most interesting keynote for me was a talk on “Interactive Sports Analytics” by Pat Lucey, Chief Scientist of AI at StatsPerform. One thing he talked about was how best to search a video library of plays — often a coach will want to find all similar plays against a certain team. This was traditionally done with a long series of English words (example: “dribble, pass@upper three point arc, 3Pshot@arc”). In the talk, he described how you can instead use the tracking diagram of the play itself as the input to find similar plays:
There is a brief video demonstrating this here. The whole idea of using something visual as the basis for querying a database was fascinating (and, now that I think of it, is an extension into the sports world of what is being done with facial recognition and other Machine Learning efforts to solve visual problems …).
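The talk didn’t go into the matching algorithm itself, but purely as a toy illustration of what “diagram as query” could mean, imagine each play reduced to a fixed-length sequence of (x, y) ball positions and the library ranked by distance to the sketched query. Everything below (the Play type, the naive pointwise metric) is my own invention, not StatsPerform’s method:

```java
import java.util.Comparator;
import java.util.List;

public class PlaySearch {
    // A play reduced to a fixed-length sequence of (x, y) ball positions.
    record Play(String id, double[][] track) {}

    // Toy similarity: mean Euclidean distance between corresponding points.
    // (A real system would align, resample, and normalize trajectories first.)
    static double distance(double[][] a, double[][] b) {
        double sum = 0;
        for (int i = 0; i < a.length; i++) {
            sum += Math.hypot(a[i][0] - b[i][0], a[i][1] - b[i][1]);
        }
        return sum / a.length;
    }

    // Return the library sorted by similarity to the sketched query diagram.
    static List<Play> mostSimilar(double[][] query, List<Play> library) {
        return library.stream()
                .sorted(Comparator.comparingDouble((Play p) -> distance(query, p.track())))
                .toList();
    }
}
```

The point is simply that the query is a trajectory, not a string of keywords — the sketched diagram itself becomes the search input.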
Other interesting things he talked about were capturing tendencies of players and teams to simulate what games / one-on-one matchups would be like (ever wonder who would win that 1-on-1 between Michael Jordan and LeBron James …) and capturing limb movements along with player location to refine game analysis. For example, say you know that James Harden normally makes shots 93% of the time from a given location — why did he miss on a particular play? You can see in that play that his hands were unusually low when he got the ball — so he probably got a bad pass.
(2) “Visualizing Data with Sound”
The second most fascinating moment was the last keynote of the second day — when Alan Smith, the “Chart Doctor” columnist for the Financial Times, talked about his work using sound to communicate trends in the US bond yield curve. The yield curve is basically a series of points describing the going interest rate for bonds of different maturities (often 1, 5, 7, 10, 20, and 30 year bonds are used):
It is of particular interest because it has been shown to be a very reliable indicator of whether the economy is headed into a recession (an “inverted” curve, where short-term rates exceed long-term rates, has historically preceded recessions). The issue is that with 6 data points for each day, it’s hard to clearly capture trends over multiple days / weeks / months / years. Usually the best one can do visually is pick 2 of those points and graph the trend of their difference over time. Mr. Smith talked about his experiments mapping each value to a specific musical tone so that each day becomes a “musical series” — I found the results remarkably interesting and a potentially useful way to “visualize” multi-point data in the future. The best way to understand this is by seeing (hearing?) it in action. A brief video can be seen/heard at this link. As a cool aside, here is an example of the Chart Doctor’s work (and also one of my favorite slides from the conference) [link to expandable version]:
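Mr. Smith didn’t spell out his exact mapping, but the general idea is easy to sketch with the JDK’s built-in MIDI synthesizer. In this made-up example, one day’s six yields are each mapped linearly onto a two-octave range above middle C and played in sequence (the rates and the 0–5% range are assumptions for illustration):

```java
import javax.sound.midi.MidiChannel;
import javax.sound.midi.MidiSystem;
import javax.sound.midi.Synthesizer;

public class YieldCurveSonifier {
    public static void main(String[] args) throws Exception {
        // One day's yield curve: made-up rates for the 1, 5, 7, 10, 20, 30 year points.
        double[] yields = {1.8, 1.6, 1.7, 1.6, 1.9, 2.1};

        Synthesizer synth = MidiSystem.getSynthesizer();
        synth.open();
        MidiChannel channel = synth.getChannels()[0];
        for (double y : yields) {
            // Assume yields fall in 0%..5% and map onto 24 semitones above middle C (note 60).
            int note = 60 + (int) Math.round((y / 5.0) * 24);
            channel.noteOn(note, 80);  // start the tone (velocity 80)
            Thread.sleep(250);         // let it sound briefly
            channel.noteOff(note);
        }
        synth.close();
    }
}
```

Play one such six-note series per trading day in quick succession, and the shape of the curve — and its drift over months and years — becomes something you can hear.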
(3) “Everything is Connected and the Clock is Ticking”
This was the title of a keynote presentation by Gro Intelligence, which uses data and machine learning to answer global agricultural questions. They talked about how they are using “55 million data streams and 650 trillion data points” to answer questions such as “what will be the consumption of pork in China 6, 12, and 24 months from now.” This is a harder and more tumultuous question than you’d think, because China lost 30% of its pigs last year to swine fever. Furthermore, the main thing pigs in China eat — soybeans — is very volatile this year, due both to floods in the US Midwest and to soybeans being right in the middle of the US–China trade dispute. With China eating 60% of the world’s pork, any effect is magnified. One reason the number of data points is so high is that it includes “every pixel on satellite photos” to help determine which areas in each region of the world are farmland. It was a fascinating story told compellingly (I will probably never look at my plate of mu shu pork the same way again …) [link].
Other interesting data stories included a breakout session by the Data Science team at Major League Baseball showing how they use modeling to determine next year’s schedule (and on what day of the week that bobblehead promotion should run), a startup called iKure looking to use Machine Learning to improve the health care of 840 million people in rural India, and Project Debater — an effort by IBM to use AI to ingest data on a topic and then have Watson successfully debate a human.
(4) “Kafka, Kafka, everywhere”
One of the major takeaways from the conference was the prevalence of Kafka in enterprises for handling data movement needs. Walmart gave a presentation on how they use it as the basis for a model inferencing platform. Uber uses it to process over 2 trillion (!) messages a day. AppsFlyer gave an especially good presentation on how Kafka has been the one constant through their “hockey stick” growth. They showed the following slide:
and talked about how they use 20 clusters running on 400 AWS instances to process over 1 million messages a second. The presenter (a tech lead at AppsFlyer) went through how their Kafka architecture evolved and gave examples of things they did to improve performance, summing it up well: “If your product is a body, Kafka is the circulatory system, and data is the blood.” Yelp, Pinterest, and Yahoo all use Kafka as well (and in fact offer open source Kafka tools we can explore using). It is clear that Kafka is being used as the backbone for many data pipelines out there.
Tempering this was a note from the Kafka instructor that, while Kafka is very versatile, one should be careful about using it where a simpler solution would be just fine. In his consulting engagements he has seen many cases where a company used Kafka when a simple queueing system would do the trick — the key question is whether the scalability and concurrency Kafka offers are really needed. As we start to use Kafka more and more (while still using SQS/SNS when appropriate), these were very helpful guidelines and thoughts to hear.
(5) “Data Catalogs, Data Lineage, and the Data Revolution”
One of the surprises at the conference was the number of talks on Data Catalogs and Data Lineage. It turns out there are currently 26 companies offering products in this space. Uber gave a presentation on the infrastructure they’ve developed to track who is using what data and how heavily. The Chief Data Scientist at O’Reilly talked about the common industry problem of Data Lakes falling victim to “Garbage In, Garbage Out”: making data more discoverable is a key goal in addressing this, the equally important goal of ensuring everything is compliant works in direct opposition to it, and data catalogs (a mechanism for recording what the data is and where it came from) are key to reconciling both. Comcast gave what I felt was a really creative keynote speech comparing data accessibility trends to the American Revolution, saying that Data Catalogs are a key way to move forward from the “chaos of the revolution” (suddenly everyone has access to the data) to a more regulated, thriving organization:
They then took the analogy further, stating that by “providing for the common defence” (providing data catalogs and information on data lineage) and “securing the blessings of liberty” (ensuring schemas are kept in the hands of the people), we are in a position to “insure domestic tranquility and promote the general welfare” (discover and integrate data across silos, and trace data’s journey throughout the enterprise):
Deutsche Bank, among other organizations, mentioned data cataloging as a key pillar of their data roadmap for making sure compliance needs are met.
On the Data Engineering team, one of the things we are working on is a cooperative initiative with the Analytics and Data Services teams to build an internal service we call DSED (Data Set Element Dictionary) — a rudimentary data catalog that can serve as a source of truth for the variables extracted from third party sources such as credit reports. It was good to confirm that we are on the right track with this.
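The real DSED schema is internal, but to make the “rudimentary data catalog” idea concrete, an entry might look something like the sketch below — the field names are hypothetical, chosen purely for illustration:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class ElementDictionary {
    // Hypothetical shape of a catalog entry: what the element is, where it
    // came from, and what it means. The actual DSED fields are internal.
    public record ElementEntry(String name, String source, String dataType, String description) {}

    private final Map<String, ElementEntry> entries = new ConcurrentHashMap<>();

    public void register(ElementEntry entry) {
        entries.put(entry.name(), entry);
    }

    public ElementEntry lookup(String name) {
        return entries.get(name);
    }
}
```

Even a shape this trivial gives downstream users a single place to ask “what is this variable, and where did it come from?” — which is the heart of what the data catalog talks were advocating.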
(6) “Machine Learning is growing but let’s be careful how we go about it — Ethical ML, MLOps, and MDLC”
One of the best speakers was Cassie Kozyrkov, Chief Decision Scientist at Google. She gave a really good talk on how, even though ML makes it easy to not pay attention to how the data is being created, it is all the more important to put thought into this — because it can have real-world consequences — especially at the scale at which companies such as Google are using it today:
This has created a whole “Ethical ML” movement — being conscious of how the data is gathered and how the models are used (and making sure privacy is respected). There was a sense that the industry has in many ways mastered the process of making Machine Learning models — and now we need to take a step back and make sure they are being used in a way that helps people. Another keynote I found fascinating along these lines was a talk by Microsoft on how they decided what Cortana would say in response to certain questions — and how they made a conscious effort to build their core values of inclusivity, sensitivity, and transparency into those responses (feel free to ping me if you’re interested in specific examples of this).
Another large trend in Machine Learning was the concept of “MLOps” (an extension of DevOps to the Machine Learning world) and “MDLC” (the Machine Learning Development Life Cycle — applying the concepts of the Software Development Life Cycle to Machine Learning in order to add structure and simplify model development). These were buttressed by the stat that the typical Data Scientist today spends only about 25% of their time doing algorithm / Machine Learning work, with the rest spent on getting the data they need, making sure the model is the right version, and so on:
The idea behind MLOps / MDLC is to decrease the “non-algorithm” time so Data Scientists can focus on the activities that really add value to the organization.
This was exciting to me because, in many ways, I view working with our Analytics team at Enova to make “the rest” easier and more seamless as the underlying mission of the Data Engineering team.
Final Thoughts
Needless to say, it was a very exciting conference. A few other random quotes and ideas from the conference that I found interesting:
- “There can be no AI without IA (Information Architecture)” — IBM
- “Natural Language Processing / AI will transform the relationships financial companies have with their customers as much as ATMs did” — American Express
- “The new fast is quick time to value by making decisions now” — memSQL
- AppsFlyer’s mention of “Sleep-driven design” (how they rearchitected their Kafka implementation to reduce PagerDuty calls …)
- Periscope took approximately 2 quarters to transition to Kubernetes.
- IBM offers “quantum safe” security for its tape drives.
- “If you want your kids to have a job, have them learn Python.”
- “In 1 mile of autonomous car driving, 10GB of data is generated” — Cisco
- Brent Spiner’s favorite hat shop in Chicago is Optimo — just down the street from Enova! (SAP had an ingenious “meet Data at the data conference” promotion in the evening — Mr. Spiner played Data on Star Trek.)
Finally, a few tips for those that might consider going next year:
- Do attend the speed networking sessions in the morning — it’s a great way to quickly meet a lot of interesting people. One person I met created a Data Science business in Bangladesh and dedicates himself to building up Data Science talent there (even as Google and Amazon keep poaching it 🙂).
- At lunch, consider sitting at the “financial topics” table. On the days I went, I sat next to the Data Engineering Manager at Chartered Bank and next to a Data Engineer at Bloomberg. It was really useful bouncing thoughts off them and hearing about their experiences with AWS / Google Cloud / Kafka, etc.
- For breakout sessions, you really want to get there 5 minutes early — popular sessions can overflow, with people being turned away at the door.
- There is a large area in the convention center on the last day to store your luggage — so flying out on the last day of the conference straight from there is a possibility.
- The conference takes place at the Javits Center in Hudson Yards in Manhattan, near Madison Square Garden, Hell’s Kitchen, and Chelsea (picture below). The High Line — a former elevated train line, now filled with greenery-filled sitting areas and artwork — is a terrific walking trail. Also, “the Vessel” nearby (picture below) is well worth checking out and climbing to the top of as the sun sets behind the Hudson River.
I usually gauge a conference by 2 things — how much it “wows” me with new revelations and ideas, and how many practical tips it offers for my day-to-day. Honestly, I thought it would be hard for any conference to match what I experienced at AWS re:Invent last year. But I have to say, in terms of how much it expanded my mind and the practical considerations it gave me, Strata matched it. I am very grateful to Enova for the opportunity to attend.