Can Loan Defaults Be Predicted?

In my last post (The Problem With Most Investment Strategies) I discussed a few of the problems associated with the manually built static filters used to find loans for potential investment. At their core, all of them try to find loans that achieve a certain return (by targeting an interest rate) and then apply additional filters to remove the loans that may default. But what if we could skip the process of building a filter to find "bad" loans and let an algorithm figure this out from historical data? Thankfully for us, there are techniques to do just that. They come from a scientific discipline known as "machine learning", which is concerned with building algorithms that learn to classify objects based on historical data. Leveraging these techniques will allow us to circumvent many of the disadvantages mentioned in my last post.

Machine Learning

Machine learning is a type of artificial intelligence that gives computers the ability to learn the solution to a problem without explicit programming. Think of it as automated programming by example. Often we have a specific task in mind, such as email spam filtering. Rather than programming the computer to solve the task directly by entering patterns associated with spam emails, in machine learning the computer comes up with its own program based on examples that we provide. We start by gathering a large collection of past email messages and labeling each one as "good" or "spam". These are then used as input to a general-purpose learning algorithm, which outputs a model. This model is a representation of our classification logic. Each new, never-before-seen email is then run through this model, which predicts whether it is good or spam.
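To make the spam example concrete, here is a minimal sketch of "programming by example" in plain Python. It uses a tiny naive Bayes word-count classifier, which is just one possible learning algorithm and is my choice for illustration only. The four training emails and the function names train and predict are made up.

```python
import math
from collections import Counter

def train(examples):
    """Learn per-label word counts from (text, label) pairs."""
    word_counts = {"good": Counter(), "spam": Counter()}
    label_counts = Counter()
    for text, label in examples:
        label_counts[label] += 1
        word_counts[label].update(text.lower().split())
    return word_counts, label_counts

def predict(model, text):
    """Return the label with the highest naive Bayes log-score."""
    word_counts, label_counts = model
    vocab = set(word_counts["good"]) | set(word_counts["spam"])
    total = sum(label_counts.values())
    scores = {}
    for label in label_counts:
        n_words = sum(word_counts[label].values())
        # Start from the log prior: how common this label is overall.
        score = math.log(label_counts[label] / total)
        for word in text.lower().split():
            # Add-one smoothing so unseen words do not zero out the score.
            score += math.log((word_counts[label][word] + 1) / (n_words + len(vocab)))
        scores[label] = score
    return max(scores, key=scores.get)

# Hypothetical labeled examples; a real filter would need far more.
examples = [
    ("win free money now", "spam"),
    ("free prize claim now", "spam"),
    ("meeting agenda for monday", "good"),
    ("lunch on monday", "good"),
]
model = train(examples)
print(predict(model, "claim free money"))
print(predict(model, "monday meeting"))
```

Notice that no spam patterns were typed in by hand. The only inputs are labeled examples, and the model is whatever the algorithm extracted from them.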

Machine Learning Applied To Lending Club

From this example, it should be easy to see how machine learning can help us pick loans. Historical data from Lending Club already tells us which loans are "good" and "bad", so it’s not necessary to manually label any examples. The good ones are those that were fully paid. Those that defaulted are obviously bad. Using this data, we build a classification model that can tell us which new loans currently in funding should be considered for possible investment. A simplified diagram representing the flow of data is shown below.
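The labeling step itself can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the field name loan_status, the status strings, and the sample records are hypothetical and may not match the actual Lending Club export. Loans that are still being repaid have no known outcome yet, so this sketch leaves them out of the training set.

```python
# Assumed status values; check them against the real data before relying on this.
GOOD_STATUSES = {"Fully Paid"}
BAD_STATUSES = {"Charged Off", "Default"}

def label_loans(loans):
    """Turn historical loan records into (loan, label) training pairs.

    Label 1 means the loan was fully paid ("good"), 0 means it defaulted ("bad").
    Loans with any other status are skipped because their outcome is unknown.
    """
    labeled = []
    for loan in loans:
        status = loan["loan_status"]
        if status in GOOD_STATUSES:
            labeled.append((loan, 1))
        elif status in BAD_STATUSES:
            labeled.append((loan, 0))
    return labeled

# Made-up records for illustration only.
history = [
    {"id": 1, "loan_status": "Fully Paid"},
    {"id": 2, "loan_status": "Charged Off"},
    {"id": 3, "loan_status": "Current"},
    {"id": 4, "loan_status": "Default"},
]
training_set = label_loans(history)
print([(loan["id"], label) for loan, label in training_set])
```

The resulting pairs play the same role as the hand-labeled "good" and "spam" emails in the earlier example.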

Benefits

This approach helps us avoid the problems previously mentioned. We don’t have to figure out which attributes are important or what the thresholds should be. These values are learned from the historical data. It’s also easy for us to efficiently and automatically retrain our model to account for any underlying changes in the economy or lending standards. Lastly, depending on which machine learning algorithm we use, complex non-linear models can be learned that take into account interactions between variables.

Support Vector Machines

Based on some research into various machine learning algorithms, I decided to select Support Vector Machines (SVM) as the basis for my first attempt at the loan default classification engine. During training, a support vector machine finds the hyperplane in a high-dimensional space that maximizes the distance to the nearest training data points. This hyperplane forms the boundary between classes, and classifying new data is simply a matter of determining which side of the hyperplane it falls on. SVM models have a similar functional form to neural networks, another popular type of machine learning. However, the quality of generalization, the ability to work with large datasets that have many attributes, and the ease of training make this approach a good one for problems of this type. There is also some prior academic research on loan credit scoring that indicates this technique would perform well.
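For readers who want to see the hyperplane idea in code, below is a rough sketch of a linear SVM trained by subgradient descent on the regularized hinge loss. It is not the implementation behind this project, and a production system would normally rely on an established SVM library, possibly with a non-linear kernel. The two-feature toy data (assumed to be already centered around zero), the hyperparameters lam, lr and epochs, and the function names are all invented for illustration.

```python
def decision_value(w, b, x):
    """Signed score w.x + b; its sign says which side of the hyperplane x is on."""
    return sum(wj * xj for wj, xj in zip(w, x)) + b

def train_linear_svm(X, y, lam=0.01, lr=0.1, epochs=200):
    """Fit a linear SVM by subgradient descent on the regularized hinge loss.

    X is a list of feature tuples and y holds labels of +1 or -1.
    """
    w = [0.0] * len(X[0])
    b = 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            if yi * decision_value(w, b, xi) < 1:
                # Point is inside the margin or misclassified:
                # shrink w slightly and push the hyperplane away from the point.
                w = [wj - lr * (lam * wj - yi * xj) for wj, xj in zip(w, xi)]
                b += lr * yi
            else:
                # Point is safely outside the margin: only the regularizer acts,
                # which favors a smaller w and therefore a wider margin.
                w = [wj - lr * lam * wj for wj in w]
    return w, b

def classify(w, b, x):
    """Return +1 or -1 depending on the side of the hyperplane."""
    return 1 if decision_value(w, b, x) >= 0 else -1

# Toy, linearly separable data: +1 could stand for "good" loans, -1 for "bad".
X = [(-2.0, -1.0), (-1.0, -2.0), (-2.0, -2.0), (2.0, 1.0), (1.0, 2.0), (2.0, 2.0)]
y = [-1, -1, -1, 1, 1, 1]
w, b = train_linear_svm(X, y)
print(classify(w, b, (3.0, 3.0)), classify(w, b, (-3.0, -3.0)))
```

Once w and b are learned, scoring a new loan is a single dot product, which is why classification reduces to checking which side of the hyperplane a point lands on.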

In my next post, I’ll go into a few more details on the specifics of adapting SVM to the Lending Club data.

Comments

  1. Hi, thank you for all your work. I did a similar (but much smaller) experiment, but I used logistic regression instead of SVM. You can see my webpage for details. I would be interested to know your reasons for choosing SVM over other methods.

  2. @Alex I didn't necessarily choose SVM over logistic regression. I've tried both of these techniques and several others over the past 6 months. This one has proven to perform very well. And I found that a classification model is easier to understand and use than one based on probabilities. At some point I may publish other techniques. But right now I am focused on enhancing the tools and techniques peer-to-peer investors need to give them an edge on the Lending Club platform.
