The effects of data preprocessing on probability of default model fairness

Di Wu *

Naveen Jindal School of Management, The University of Texas at Dallas, 800 W Campbell Rd, Richardson, Texas 75080, USA.
 
Research Article
World Journal of Advanced Engineering Technology and Sciences, 2024, 12(02), 872–878.
Article DOI: 10.30574/wjaets.2024.12.2.0354
Publication history: 
Received on 08 July 2024; revised on 20 August 2024; accepted on 22 August 2024
 
Abstract: 
In the context of financial credit risk evaluation, the fairness of machine learning models has become a critical concern, especially given the potential for biased predictions that disproportionately affect certain demographic groups. This study investigates the impact of data preprocessing, with a specific focus on Truncated Singular Value Decomposition (SVD), on the fairness and performance of probability of default models. Using a comprehensive dataset sourced from Kaggle, various preprocessing techniques, including SVD, were applied to assess their effect on model accuracy, discriminatory power, and fairness.
The findings reveal that while SVD effectively reduces the dimensionality of the data, it does not necessarily improve the fairness of the models. Specifically, applying SVD degraded the model’s ability to correctly classify loan defaults, particularly for minority classes. This outcome suggests that information pertinent to fair predictions may be lost during dimensionality reduction. Furthermore, the analysis of fairness across demographic groups, such as age and marital status, indicated that SVD did not help reduce disparate impact or balance error rates across groups.
These results underscore the complexities of using dimensionality reduction techniques in fair lending applications and highlight the need for more tailored approaches to preprocessing that prioritize both accuracy and fairness. Future research should explore alternative methods that preserve the integrity of sensitive information while enhancing the equitable performance of credit risk models.
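To make the fairness notions named in the abstract concrete, the following is a minimal sketch of how disparate impact and per-group error rates could be computed for a binary default classifier. It is an illustration only, not the paper's code: the function names (`group_fairness_report`, `disparate_impact`), the group labels, and the toy labels and predictions are all hypothetical, and the sketch assumes that a predicted default corresponds to a denied loan.

```python
from collections import defaultdict


def group_fairness_report(y_true, y_pred, groups):
    """Per-group approval rate, false positive rate, and false negative rate.

    y_true: 1 = defaulted, 0 = repaid
    y_pred: 1 = predicted default (assumed: loan denied), 0 = predicted repay
    groups: protected-attribute value per applicant (e.g. an age band)
    """
    counts = defaultdict(
        lambda: {"n": 0, "approved": 0, "tp": 0, "fp": 0, "tn": 0, "fn": 0}
    )
    for truth, pred, group in zip(y_true, y_pred, groups):
        c = counts[group]
        c["n"] += 1
        if pred == 0:
            c["approved"] += 1
        if truth == 1 and pred == 1:
            c["tp"] += 1
        elif truth == 0 and pred == 1:
            c["fp"] += 1
        elif truth == 0 and pred == 0:
            c["tn"] += 1
        else:
            c["fn"] += 1

    report = {}
    for group, c in counts.items():
        negatives = c["fp"] + c["tn"]  # applicants who actually repaid
        positives = c["tp"] + c["fn"]  # applicants who actually defaulted
        report[group] = {
            "approval_rate": c["approved"] / c["n"],
            # Rates are undefined (None) when a group has no such cases.
            "fpr": c["fp"] / negatives if negatives else None,
            "fnr": c["fn"] / positives if positives else None,
        }
    return report


def disparate_impact(report, protected, reference):
    """Ratio of approval rates: protected group over reference group.

    Assumes the reference group has a nonzero approval rate.
    """
    return report[protected]["approval_rate"] / report[reference]["approval_rate"]


# Toy data (hypothetical, not taken from the study's dataset).
groups = ["young"] * 5 + ["older"] * 5
y_true = [0, 0, 1, 0, 1, 0, 0, 1, 0, 0]
y_pred = [1, 0, 1, 1, 0, 0, 0, 1, 0, 0]

report = group_fairness_report(y_true, y_pred, groups)
di = disparate_impact(report, protected="young", reference="older")
print(report)
print(di)
```

A disparate impact ratio close to 1 indicates similar approval rates across the two groups, while large gaps between the groups' false positive or false negative rates indicate unbalanced error rates. Under the assumptions above, running the same report on predictions from models trained with and without SVD would be one way to compare their fairness.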
 
Keywords: 
Data preprocessing; Machine learning; Probability of default; Credit risk; Fair lending; Safe AI
 