Loan Default Predictor

Machine-learning model that estimates the probability a loan applicant defaults.

Python
scikit-learn
pandas
XGBoost

Overview

In this project I built a binary classifier that predicts whether a loan applicant will default, using applicant demographics, financial features, and credit history. The model is a RandomForestClassifier wrapped in a scikit-learn pipeline with imputation, ordinal encoding for education, and one-hot encoding for categorical features. Hyperparameters are tuned with RandomizedSearchCV using stratified 5-fold cross-validation.
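The pipeline described above can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the data is synthetic, the column names `person_education` and the education ordering are assumptions, and the search space, `n_iter`, and the `f1` scoring choice are placeholders picked for the example.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.model_selection import RandomizedSearchCV, StratifiedKFold
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

# Synthetic stand-in data (assumption): the real dataset is not reproduced here.
rng = np.random.default_rng(0)
n = 300
X = pd.DataFrame({
    "person_income": rng.normal(60_000, 15_000, n),
    "loan_int_rate": rng.uniform(5, 20, n),
    "loan_percent_income": rng.uniform(0.05, 0.6, n),
    "person_education": rng.choice(["High School", "Bachelor", "Master"], n),
    "previous_loan_defaults_on_file": rng.choice(["Yes", "No"], n),
})
# Knock out a few values so the imputation step has something to do.
X.loc[rng.choice(n, 15, replace=False), "loan_int_rate"] = np.nan
# Toy target: default is more likely when the loan is large relative to income.
y = (X["loan_percent_income"] + rng.normal(0, 0.1, n) > 0.4).astype(int)

numeric = ["person_income", "loan_int_rate", "loan_percent_income"]
ordinal = ["person_education"]
nominal = ["previous_loan_defaults_on_file"]
education_order = [["High School", "Bachelor", "Master"]]  # assumed ordering

preprocess = ColumnTransformer([
    ("num", SimpleImputer(strategy="median"), numeric),
    ("edu", OrdinalEncoder(categories=education_order), ordinal),
    ("cat", OneHotEncoder(handle_unknown="ignore"), nominal),
])

model = Pipeline([
    ("preprocess", preprocess),
    ("clf", RandomForestClassifier(random_state=0)),
])

# Placeholder search space; the real one may differ.
param_distributions = {
    "clf__n_estimators": [25, 50, 100],
    "clf__max_depth": [None, 5, 10],
    "clf__min_samples_leaf": [1, 2, 5],
}

search = RandomizedSearchCV(
    model,
    param_distributions,
    n_iter=4,
    scoring="f1",  # assumption: the README reports F1 but does not name the search metric
    cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=0),
    random_state=0,
)
search.fit(X, y)
print(search.best_params_)
```

Keeping the preprocessing inside the pipeline means the imputer and encoders are refit on each training fold, so no information from the validation fold leaks into the transformation.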

Results

Train scores from the best CV-selected model:
• F1: 0.82 · Precision: 0.89 · Recall: 0.75

Test scores from the best CV-selected model:
• F1: 0.82 · Precision: 0.90 · Recall: 0.76

The train and test scores are nearly identical, which suggests the model generalises well and strikes a reasonable bias-variance trade-off.

The imbalanced split between loan defaulters and non-defaulters matters for how precision and recall are evaluated, and which metric is more important depends on the business objective. For example, if the primary goal is to avoid losses from defaults, recall is the more important metric, because it measures the share of actual defaulters the model catches; every missed defaulter is a false negative. At 0.76 recall, roughly a quarter of defaulters go undetected, so depending on the cost of a default, the decision threshold may need to be shifted downward to trade some precision for more recall.

Top features (those that most affect whether a customer will default on a loan):
• loan_percent_income: Loan size relative to income. Higher ratios mean the repayment burden eats more of the applicant's earnings, so default risk goes up.
• previous_loan_defaults_on_file: People who have defaulted before are more likely to default again.
• loan_int_rate: Higher rates both reflect the lender's prior risk assessment and make repayment harder, so this feature correlates strongly with default.
• person_income: Higher income gives more cushion against shocks and is therefore likely to lower default probability.
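The threshold shift mentioned in the results can be illustrated with a small sketch. The labels and probabilities below are synthetic placeholders, not the project's predictions; in practice `proba` would come from something like `model.predict_proba(X_test)[:, 1]`, and the candidate thresholds are arbitrary examples.

```python
import numpy as np
from sklearn.metrics import precision_score, recall_score

# Synthetic stand-in (assumption): imbalanced labels and noisy default probabilities.
rng = np.random.default_rng(0)
y_true = (rng.random(1000) < 0.2).astype(int)
proba = np.clip(0.25 + 0.4 * y_true + rng.normal(0, 0.2, 1000), 0, 1)


def metrics_at(threshold):
    """Return (precision, recall) when predicting default for proba >= threshold."""
    y_pred = (proba >= threshold).astype(int)
    precision = precision_score(y_true, y_pred, zero_division=0)
    recall = recall_score(y_true, y_pred, zero_division=0)
    return precision, recall


# Lowering the threshold flags more applicants as likely defaulters.
for t in (0.5, 0.4, 0.3):
    p, r = metrics_at(t)
    print(f"threshold={t:.1f}  precision={p:.2f}  recall={r:.2f}")
```

Recall can only stay the same or rise as the threshold drops, because every applicant flagged at a higher threshold is still flagged at a lower one. Precision usually falls in exchange, so the right threshold depends on the relative cost of a missed default versus a wrongly rejected applicant.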