Loan Default Predictor
Machine-learning model that estimates the probability a loan applicant defaults.
- Python
- scikit-learn
- pandas
- XGBoost

Overview
In this project I built a binary classifier that predicts whether a loan applicant will default, using applicant demographics, financial features, and credit history. The model is a RandomForestClassifier wrapped in a scikit-learn pipeline with imputation, ordinal encoding for education, and one-hot encoding for the other categorical features. Hyperparameters are tuned with RandomizedSearchCV using stratified 5-fold cross-validation.
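A minimal sketch of how a pipeline like this could be assembled. The column lists, imputation strategies, step names, search space, and `scoring="f1"` are all assumptions for illustration; the actual project settings are not stated here.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.model_selection import RandomizedSearchCV, StratifiedKFold
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder


def build_search(numeric_cols, education_col, education_levels, nominal_cols,
                 n_iter=20, random_state=42):
    """Return an unfitted RandomizedSearchCV around preprocessing + random forest.

    education_levels must list the education categories from lowest to highest,
    because OrdinalEncoder maps them to 0, 1, 2, ... in that order.
    """
    # Numeric features: fill missing values with the median (assumed strategy).
    numeric = SimpleImputer(strategy="median")

    # Education: impute, then encode with an explicit low-to-high ordering.
    education = Pipeline([
        ("impute", SimpleImputer(strategy="most_frequent")),
        ("encode", OrdinalEncoder(categories=[education_levels])),
    ])

    # Other categoricals: impute, then one-hot encode.
    nominal = Pipeline([
        ("impute", SimpleImputer(strategy="most_frequent")),
        ("encode", OneHotEncoder(handle_unknown="ignore")),
    ])

    preprocess = ColumnTransformer([
        ("num", numeric, numeric_cols),
        ("edu", education, [education_col]),
        ("cat", nominal, nominal_cols),
    ])

    pipeline = Pipeline([
        ("preprocess", preprocess),
        ("model", RandomForestClassifier(random_state=random_state)),
    ])

    # Hypothetical search space; the "model__" prefix targets the forest step.
    param_distributions = {
        "model__n_estimators": [100, 200, 400],
        "model__max_depth": [None, 5, 10, 20],
        "model__min_samples_leaf": [1, 2, 5],
    }

    # Stratified folds keep the defaulter ratio similar in every split.
    cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=random_state)

    return RandomizedSearchCV(
        pipeline,
        param_distributions,
        n_iter=n_iter,
        scoring="f1",
        cv=cv,
        random_state=random_state,
    )
```

Calling `build_search(...).fit(X_train, y_train)` runs the search, and `best_estimator_` then holds the refitted pipeline. Keeping preprocessing inside the pipeline means imputation and encoding are refit on each training fold, which avoids leaking validation data into the transforms.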
Results
Train Scores from the best CV-selected model:
• F1: 0.82 · Precision: 0.89 · Recall: 0.75
Test Scores from the best CV-selected model:
• F1: 0.82 · Precision: 0.90 · Recall: 0.76
The train and test scores are nearly identical, which suggests the model generalises well and is not overfitting to the training data.
The classes are imbalanced between defaulters and non-defaulters, which affects how precision and recall should be read. Which metric matters more depends on the purpose of the model and the business objective.
For example, if the primary business goal is to avoid losses from defaults, recall is the more important metric, because it measures the share of actual defaulters the model catches (every missed defaulter is a false negative). With a test recall of 0.76, roughly a quarter of defaulters go undetected, so depending on the cost of a default, the decision threshold may need to be lowered to trade some precision for more recall.
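The threshold trade-off can be illustrated with a small sketch. The labels and probabilities below are made-up toy values, not outputs of this model, and `scores_at_threshold` is a hypothetical helper; in practice the probabilities would come from something like `predict_proba(X_test)[:, 1]`.

```python
import numpy as np
from sklearn.metrics import precision_score, recall_score


def scores_at_threshold(y_true, default_proba, threshold):
    """Classify as default when the predicted probability is >= threshold."""
    y_pred = (np.asarray(default_proba) >= threshold).astype(int)
    precision = precision_score(y_true, y_pred, zero_division=0)
    recall = recall_score(y_true, y_pred, zero_division=0)
    return precision, recall


# Toy data for illustration only: 6 non-defaulters, 4 defaulters.
y_true = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
default_proba = [0.05, 0.10, 0.20, 0.35, 0.45, 0.55, 0.30, 0.48, 0.70, 0.90]

# Default threshold: 2 of 4 defaulters caught, 1 false alarm.
p_default, r_default = scores_at_threshold(y_true, default_proba, 0.5)

# Lower threshold: all 4 defaulters caught, but 3 false alarms.
p_low, r_low = scores_at_threshold(y_true, default_proba, 0.3)

print(f"threshold 0.5: precision={p_default:.2f}, recall={r_default:.2f}")
print(f"threshold 0.3: precision={p_low:.2f}, recall={r_low:.2f}")
```

On this toy data, lowering the threshold from 0.5 to 0.3 raises recall from 0.50 to 1.00 while precision falls from 0.67 to 0.57. The threshold should be chosen on a validation set rather than the test set, so the reported test scores stay unbiased.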
Top features (those that most affect whether a customer will default on a loan):
• loan_percent_income: loan size relative to income. Higher ratios mean the repayment burden eats more of the applicant's earnings, so default risk goes up.
• previous_loan_defaults_on_file: People who have defaulted before are more likely to default again.
• loan_int_rate: Higher rates both reflect the lender's prior risk assessment and make repayment harder, so it correlates strongly with default.
• person_income: Higher income gives more cushion against shocks and is therefore likely to lower default probability.