2018 was a big year for our data science team, internally known as RAP (Research-Architecture-Platform), because we advanced our machine learning capabilities significantly by addressing most of the challenges mentioned here. With these additional capabilities, our analysts can now deploy almost any advanced machine learning algorithm they want into production. The flexibility of our platform encourages them to weigh the pros and cons of both traditional and advanced algorithms carefully when they develop the predictive models that power our business (e.g., credit, conversion, response, collection, and lifetime value models).
This is not an easy feat, though, because there is no free lunch. That means there is no single algorithm that works best for every problem; it is up to us to determine the right tool (model / algorithm) for the job (problem).
We see traditional and advanced Machine Learning (ML) algorithms as different points along a spectrum of tools; the best option depends on the problem at hand. Technical texts usually treat model / algorithm selection as a single-objective optimization problem. The assumption in these texts is that analysts pick the model that minimizes or maximizes a performance metric such as accuracy, mean squared error, or area under the curve (AUC). In reality, however, it is a multi-objective optimization problem with objectives far more diverse than the straightforward and easily quantifiable performance measures.
So, in this post we will explore the factors that influence model / algorithm selection decisions and share some of our insights.
Thinking About Error and Cost More Broadly
If you are familiar with predictive modeling, there is a good chance that you know what the bias-variance trade-off is.
If your model / algorithm is too flexible, then you learn too much from training data and you end up with high variance (see the graph on the left). That’s not good!
If your model / algorithm is not flexible, then you don’t learn much from training data and you end up with high bias (see the graph in the middle). That’s not good either!
This suggests that there must be a sweet spot somewhere in the middle. You want to learn just enough from training data so that the model generalizes well (see the graph on the right).
Here is an alternative way to look at this trade-off:
In sum, the levels of bias and variance change as the flexibility of the model / algorithm changes. Our job is to find that sweet spot—the optimum model complexity—where total error is minimal.
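The trade-off can be reproduced on a toy problem. What follows is a minimal sketch, assuming synthetic data drawn from y = sin(x) plus noise and a k-nearest-neighbors regressor in which a smaller k means a more flexible model. The function names (make_data, knn_predict, mse) and all parameter values are hypothetical and chosen only for illustration.

```python
import math
import random

def make_data(n, rng, noise=0.3):
    """Draw n points from y = sin(x) + Gaussian noise on [0, 2*pi]."""
    xs = [rng.uniform(0.0, 2.0 * math.pi) for _ in range(n)]
    ys = [math.sin(x) + rng.gauss(0.0, noise) for x in xs]
    return xs, ys

def knn_predict(train_x, train_y, x, k):
    """Average the targets of the k training points nearest to x."""
    nearest = sorted(range(len(train_x)), key=lambda i: abs(train_x[i] - x))[:k]
    return sum(train_y[i] for i in nearest) / k

def mse(train_x, train_y, xs, ys, k):
    """Mean squared error of the k-NN model on the points (xs, ys)."""
    errors = [(knn_predict(train_x, train_y, x, k) - y) ** 2 for x, y in zip(xs, ys)]
    return sum(errors) / len(errors)

rng = random.Random(0)
train_x, train_y = make_data(100, rng)
test_x, test_y = make_data(200, rng)

# k = 1 is the most flexible model; k = 100 (the whole training set) the least.
results = {}
for k in (1, 10, 100):
    train_error = mse(train_x, train_y, train_x, train_y, k)
    test_error = mse(train_x, train_y, test_x, test_y, k)
    results[k] = (train_error, test_error)
    print(f"k={k:3d}  train MSE={train_error:.3f}  test MSE={test_error:.3f}")
```

With k = 1 the model memorizes the training set, so its training error is zero while its test error is not. With k equal to the size of the training set, it predicts the same average everywhere. The sweet spot typically sits somewhere in between.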
The graphs above were drawn with relatively straightforward performance measures in mind, such as mean squared error, accuracy, or AUC. But we cannot restrict ourselves to these measures, because we live in the real world with complex dependencies. We should therefore think of error in a broader sense: as any cost related to the model. In real life, then, there are cost curves beyond those shown on the graph, and adding them to the existing ones might shift the location of the optimal model complexity. As we said, deciding which model / algorithm to use is no easy feat.
Factors to Consider
Let’s look at the main factors that influence our decisions on where we should settle along the model complexity spectrum:
1 – Lift
Lift denotes the improvement of one model over another on the straightforward and easily quantifiable performance measures mentioned above. For instance, if model B has a higher accuracy than model A, then we say model B provides a lift. It is certainly one of the most important factors in the selection of models / algorithms, but it’s not the only one. One key point is to translate the lift into a dollar value or a key business metric. Just because the difference between the two accuracy values is statistically significant does not mean that this difference has practical significance. Lift is therefore a necessary but not sufficient condition to justify higher model complexity.
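To make the translation into dollars concrete, here is a back-of-the-envelope sketch. It assumes a deliberately simple value model in which every additional correct decision is worth a fixed amount and the more complex model carries a fixed extra yearly cost. The function name annual_net_value and every number below are hypothetical.

```python
def annual_net_value(decisions_per_year, baseline_accuracy, candidate_accuracy,
                     value_per_correct_decision, extra_annual_cost):
    """Estimate the yearly dollar value of an accuracy lift, net of added cost."""
    lift = candidate_accuracy - baseline_accuracy
    gross_value = decisions_per_year * lift * value_per_correct_decision
    return gross_value - extra_annual_cost

# Hypothetical inputs, for illustration only.
net = annual_net_value(
    decisions_per_year=100_000,
    baseline_accuracy=0.80,
    candidate_accuracy=0.82,
    value_per_correct_decision=50.0,
    extra_annual_cost=40_000.0,
)
print(f"Estimated net annual value: ${net:,.0f}")
```

The same function returns a negative number when the lift is too small to cover the extra cost, which is the case where a statistically significant lift still has no practical significance.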
2 – Concept Drift
The real world is messy. Whether we like it or not, things change all the time. That is especially true with business problems. Marketing mix, operations, customer behavior and the overall economy never stand still. The stability of the underlying problem is very important, however, because the representativeness of historical data is a key assumption that we make when we build models. Concept drift is a fancy term that captures the change in relationships between input variables and the target variable over time. If there is a significant risk of concept drift, then it is usually better to stick with less flexible models so that we don’t learn too much from (rely too much on) historical data. We consider concept drift often while working with new products.
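One simple way to watch for the symptoms of concept drift is to track model performance over time once actual outcomes are known. The sketch below assumes scored records of the form (period, predicted label, actual label) and flags periods whose accuracy falls noticeably below a baseline. The function names, the tolerance, and the sample records are all hypothetical, and a drop in accuracy is only consistent with drift; it does not prove it.

```python
from collections import defaultdict

def accuracy_by_period(records):
    """records: iterable of (period, predicted_label, actual_label) tuples."""
    hits, totals = defaultdict(int), defaultdict(int)
    for period, predicted, actual in records:
        totals[period] += 1
        hits[period] += int(predicted == actual)
    return {period: hits[period] / totals[period] for period in sorted(totals)}

def drifting_periods(records, baseline_accuracy, tolerance=0.05):
    """Return periods whose accuracy is more than `tolerance` below the baseline."""
    by_period = accuracy_by_period(records)
    return [p for p, acc in by_period.items() if acc < baseline_accuracy - tolerance]

# Hypothetical scored records: accuracy of 0.9, 0.8 and 0.6 in three months.
records = (
    [("2018-01", 1, 1)] * 9 + [("2018-01", 1, 0)] * 1
    + [("2018-02", 1, 1)] * 8 + [("2018-02", 0, 1)] * 2
    + [("2018-03", 1, 1)] * 6 + [("2018-03", 1, 0)] * 4
)
flagged = drifting_periods(records, baseline_accuracy=0.9)
print(accuracy_by_period(records))
print("Periods to investigate:", flagged)
```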
3 – Sample Size
The number of observations in training data is another factor that informs our decisions on model complexity. If we only have a few hundred observations, then building a deep neural network is probably not a good idea. When the training data is small, in other words, it is usually better to go with high-bias/low-variance classifiers. Put differently, work with simpler models.
4 – Deployment
How easily can you deploy models into production? Do you have an analytics platform where you can deploy your models, or do you rely on other teams to embed your models into other applications? Are those applications written in the same language as the one you used for model-building? How do you test the models before they go live? Deploying models into production deserves its own blog post. It’s a big and important topic. The point here is that your technology stack might restrict you. That was the case for us six years ago. We were using a domain-specific language that was developed internally. It got the job done, but it was limited to rule-based and regression models. Developing xgboost models with a few hundred trees was not practical at all in such an environment. Today we are at a point where the deployment cost is almost fixed regardless of model / algorithm complexity. So, deployment cost is no longer a major constraint for us.
5 – Run-time Performance
Even if you have a solid tech stack, that doesn’t mean you can run all types of models for all kinds of problems in production. In certain cases, speed can be a competitive advantage. We have models that run in milliseconds; we don’t want to make our customers wait. We also have models where the execution time can go up to a couple of seconds. That’s fine, too, because those models answer different types of questions and don’t necessarily impact the customer experience directly. If the run-time is a critical consideration for you, then it’s a good idea to simplify your model without sacrificing accuracy. “Simplifying” doesn’t necessarily mean using a different algorithm, however. Using smaller trees and/or fewer trees in your ensemble models, or using fewer features, are also simplifications that can make your model run faster.
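Before simplifying a model for speed, it helps to measure how long a single prediction actually takes. The sketch below is a minimal timing harness that uses only the standard library. The make_ensemble function is a hypothetical stand-in that loops over a given number of "trees" and is not a real model; in practice the predict function of the actual model would be passed in instead.

```python
import statistics
import time

def measure_latency_ms(predict, rows, repeats=5):
    """Time predict(row) for every row, `repeats` times; report median and p95 in ms."""
    timings = []
    for _ in range(repeats):
        for row in rows:
            start = time.perf_counter()
            predict(row)
            timings.append((time.perf_counter() - start) * 1000.0)
    timings.sort()
    p95_index = min(len(timings) - 1, int(0.95 * len(timings)))
    return {"median_ms": statistics.median(timings), "p95_ms": timings[p95_index]}

def make_ensemble(n_trees):
    """Hypothetical stand-in for an ensemble: work grows with the number of trees."""
    def predict(row):
        votes = sum(row[i % len(row)] > 0.5 for i in range(n_trees))
        return votes / n_trees
    return predict

rows = [[0.1, 0.7, 0.4]] * 200
small = measure_latency_ms(make_ensemble(50), rows)
large = measure_latency_ms(make_ensemble(500), rows)
print("50 trees: ", small)
print("500 trees:", large)
```

Reporting a high percentile alongside the median matters when customers are waiting, because the slowest requests shape the experience more than the typical one.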
6 – Time to Build the Model
Not all models are equally important, and no team has an unlimited number of analysts. So, the time that we can spend on a specific model is usually limited. It can vary from days to weeks or even months depending on the importance of the problem. If we have only a few days to build and deploy a model, then there is a pretty good chance that we would go with a regularized regression model and keep things relatively simple. Building more complex models might require more time and effort because we usually need to spend more time on tuning hyperparameters and building explainer models.
7 – Explainability
In cases like credit underwriting, we might have to explain decisions made by our models and communicate them back to the customer. That means we need to generate decline reasons that are accurate and can easily be understood by our customers. This is one aspect of explainability. The other aspect has to do with working with our compliance teams and getting them on board with the methodology that we used to generate these reasons. When it comes to explainability, the audience is the key. In some cases, algorithms like SHAP or LIME might work fine. In other cases, you might need to develop your own custom models / algorithms. As you can guess, models with different complexities have different types of explainability challenges. Simpler models usually don’t require a separate model to explain their decisions.
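To illustrate why simpler models often need no separate explainer, consider a linear score. The sketch below assumes a score in which higher means more favorable, and it ranks features by how much they pull an applicant's score below that of a reference profile. This is an illustration only, not a description of how we generate decline reasons or of what a compliance team would accept. The function name, feature names, coefficients, and values are all hypothetical.

```python
def decline_reasons(coefficients, applicant, reference, top_n=2):
    """Rank features by how strongly they lower a linear score versus a reference."""
    contributions = {
        name: coefficients[name] * (applicant[name] - reference[name])
        for name in coefficients
    }
    # Keep only the features that push the score down, most harmful first.
    negative = [(name, value) for name, value in contributions.items() if value < 0]
    negative.sort(key=lambda item: item[1])
    return negative[:top_n]

# Hypothetical standardized features and coefficients.
coefficients = {"income": 0.8, "utilization": -1.2, "delinquencies": -0.9}
reference = {"income": 0.0, "utilization": 0.0, "delinquencies": 0.0}
applicant = {"income": -0.5, "utilization": 1.5, "delinquencies": 0.0}

reasons = decline_reasons(coefficients, applicant, reference)
for name, value in reasons:
    print(f"{name}: lowers the score by {abs(value):.2f}")
```

Because each contribution is read directly off the model's own coefficients, the explanation is exact for a linear model. A tree ensemble or neural network has no such direct decomposition, which is where separate explainer methods come in.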
Lift is important, but it is not the only metric that we use to choose a model. Concept drift, sample size, deployment, run-time performance, the time needed to build the model, and explainability are other crucial factors that we take into account. The cost-benefit profile changes based on these factors as we move along the model complexity axis.
As a rule of thumb, it’s a good idea to start with traditional methods and justify increasing model complexity based on performance metrics and costs. Regularized logistic regression and linear regression are great for baseline models since they are fast to train, easy to interpret, and usually have decent performance.
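To show how little is needed for such a baseline, here is a minimal sketch of L2-regularized logistic regression trained with plain gradient descent. It is written with the standard library only so that it stays self-contained; in practice you would normally reach for a well-tested library implementation. The synthetic two-cluster data, the function names, and the hyperparameter values are all assumptions made for illustration.

```python
import math
import random

def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def predict_proba(w, b, x):
    """Probability of class 1 for a single feature vector x."""
    return sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + b)

def fit_logistic_l2(X, y, l2=0.1, lr=0.1, epochs=300):
    """Batch gradient descent on the mean log loss plus an L2 penalty on w."""
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * d, 0.0
        for xi, yi in zip(X, y):
            err = predict_proba(w, b, xi) - yi
            for j in range(d):
                grad_w[j] += err * xi[j]
            grad_b += err
        # The intercept is left unregularized, a common convention.
        w = [wj - lr * (gw / n + l2 * wj) for wj, gw in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b

# Synthetic data: two Gaussian clusters, one per class.
rng = random.Random(0)
X, y = [], []
for _ in range(100):
    X.append([rng.gauss(-2.0, 1.0), rng.gauss(-2.0, 1.0)])
    y.append(0)
    X.append([rng.gauss(2.0, 1.0), rng.gauss(2.0, 1.0)])
    y.append(1)

w, b = fit_logistic_l2(X, y)
# Accuracy on the training data, used here only as a sanity check.
accuracy = sum((predict_proba(w, b, xi) >= 0.5) == bool(yi) for xi, yi in zip(X, y)) / len(y)
print("weights:", [round(wj, 3) for wj in w], "intercept:", round(b, 3))
print("training accuracy:", accuracy)
```

The fitted weights can be read directly as the direction and strength of each feature's effect on the log-odds, which is part of what makes this kind of model a convenient baseline.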
If there is business value to it, then go for more complex models aggressively. That would justify the complexity of the model and the effort that goes into it. That’s how you move the needle.