Predicting loan default in peer‐to‐peer lending using narrative data

Published date: 01 March 2020
Authors: Lingyun He, Yanlin Ding, Yufei Xia, Yinguo Li, Nana Liu
DOI: http://doi.org/10.1002/for.2625
RESEARCH ARTICLE
Yufei Xia¹ | Lingyun He² | Yinguo Li¹ | Nana Liu² | Yanlin Ding³
¹ Business School, Jiangsu Normal University, Xuzhou, Jiangsu, China
² School of Management, China University of Mining and Technology, Xuzhou, Jiangsu, China
³ Faculty of Economics and Management, Shangluo University, Shangluo, Shaanxi, China
Correspondence
Yufei Xia, Business School, Jiangsu Normal University, Xuzhou, Jiangsu 221116, China.
Email: 6020180093@jsnu.edu.cn
Funding information
National Natural Science Foundation of China, Grant/Award Numbers: 71874185 and 71573110;
National Social Science Fund Project of China, Grant/Award Number: 15BTJ033;
Ministry of Education of Humanities and Social Science Project of China, Grant/Award Number: 16YJAZH015;
Research Support Project for Doctoral Degree Teachers of Jiangsu Normal University, Grant/Award Number: 18XWRX021
Abstract
Peer-to-peer (P2P) lending is facing severe information asymmetry problems and depends
highly on the internal credit scoring system. This paper provides a novel credit scoring
model, which forecasts the probability of default for each applicant and guides the
lenders' decision-making in P2P lending. The proposal is expected to improve the existing
credit scoring models in P2P lending from two aspects, namely the classifier and the usage
of narrative data. We utilize an advanced gradient boosting decision tree technique
(i.e., CatBoost) to predict default loans. Moreover, a soft information extraction
technique based on keyword clustering is developed to compensate for the insufficient hard
credit data. Validated on three real-world datasets, the experimental results demonstrate
that variables extracted from narrative data are powerful features, and the utilization of
narrative data significantly improves the predictability relative to solely using hard
information. The results of sensitivity analysis reveal that CatBoost outperforms the
industry benchmark under different cluster numbers of extracted soft information;
meanwhile, a small number of clusters (e.g., three) is preferred for consideration of
model performance, computational cost, and comprehensibility. We finally facilitate a
discussion on practical implications and explanatory considerations.
KEYWORDS
credit scoring, narrative data, P2P lending, probability of default, soft information
1 | INTRODUCTION
Peer-to-peer (P2P) lending, a type of crowdfunding in a broader sense, is emerging quickly
across the world (Short, Ketchen, McKenny, Allison, & Ireland, 2017). Borrowers and lenders
bypass commercial banks and are matched via online platforms. Concretely, borrowers submit
a loan application on a platform, and potential lenders can freely browse the listings and
decide which loans to invest in (bidding), as well as the bidding amount. Owing to the
absence of commercial banks, P2P lending incurs lower overhead costs and attracts
individuals and small businesses (Guo, Zhou, Luo, Liu, & Xiong, 2016).
Similar to traditional consumer lending, however, P2P lending faces severe information
asymmetry problems (Lee & Lee, 2012). The possible reasons are as follows. First, on most
P2P lending platforms, borrowers and lenders are usually anonymous and matched online.
Although part of the credit information may be available to the public, the anonymity may
cause information failure in P2P lending (Freedman & Jin, 2017). Second, the unskilled and
unprofessional lenders in P2P lending may fail to screen the creditworthiness of borrowers
properly by employing quantifiable information (Mild, Waitz, & Wöckl, 2015). Finally, P2P
lending is usually not secured by collateral. Consequently, lenders have little knowledge
of a borrower's ability and willingness to repay for a specific loan application.
Received: 2 January 2019 | Revised: 3 June 2019 | Accepted: 24 July 2019
Journal of Forecasting. 2020;39:260–280. wileyonlinelibrary.com/journal/for © 2019 John Wiley & Sons, Ltd.
To diminish the risk of information failure, platforms have adopted several measures, of
which the most popular is probably the internal credit scoring system. Methodologically, a
credit scoring model estimates the probability of default (PD) to predict loan default. On
most platforms, each borrower is assigned a credit grade based on the borrower's hard
credit data. These data mainly include information concerning solvency, demographic
variables, and loan characteristics. Internal credit scoring is regarded as an efficient
decision support system for lenders, and many studies have confirmed the effects of the
internal credit grade on lenders' behavior and on predicting defaults (Iyer, Khwaja,
Luttmer, & Shue, 2015; Malekipirbazari & Aksakalli, 2015; Xia, Liu, & Liu, 2017).
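The link between hard credit data, PD, and a credit grade described above can be illustrated with a minimal sketch. It assumes a logistic scoring function, which is only one of several possible model forms and is not taken from any platform. The coefficient values, feature names (`debt_to_income`, `loan_amount_10k`, `years_employed`), and grade cut-offs are all hypothetical and chosen purely for illustration.

```python
import math

# Hypothetical coefficients for illustration only; not estimated from any dataset.
COEFFICIENTS = {
    "intercept": -2.0,
    "debt_to_income": 3.0,    # higher leverage -> higher PD
    "loan_amount_10k": 0.2,   # loan amount in units of 10,000
    "years_employed": -0.1,   # longer employment -> lower PD
}

# Hypothetical grade bands: (upper bound on PD, grade label), in ascending order.
GRADE_BANDS = [(0.05, "A"), (0.10, "B"), (0.20, "C"), (0.35, "D")]


def probability_of_default(features):
    """Logistic model: PD = 1 / (1 + exp(-(b0 + sum_i b_i * x_i)))."""
    score = COEFFICIENTS["intercept"]
    for name, value in features.items():
        score += COEFFICIENTS[name] * value
    return 1.0 / (1.0 + math.exp(-score))


def credit_grade(pd_value):
    """Map a PD to the first grade band whose upper bound covers it."""
    for upper, grade in GRADE_BANDS:
        if pd_value <= upper:
            return grade
    return "E"  # anything above the last band


# A made-up applicant described only by hard credit data.
applicant = {"debt_to_income": 0.3, "loan_amount_10k": 1.5, "years_employed": 4}
pd_value = probability_of_default(applicant)
grade = credit_grade(pd_value)
print(f"PD = {pd_value:.3f}, grade = {grade}")
```

In practice the coefficients would be estimated from historical loan outcomes, and the grade bands would be set by the platform.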
Enhancing the efficiency of predicting default loans is a permanent pursuit of credit risk
management. Even a slight improvement in predicting potential defaults may lead to huge
economic benefits (G. Wang, Hao, Ma, & Jiang, 2011). To improve the predictive performance
of credit scoring models, the current literature mainly contributes from two aspects,
namely method and data. Regarding classification algorithms, logistic regression (LR)
(Emekter, Tu, Jirasakuldech, & Lu, 2015), neural networks (NN) (Byanjankar, Heikkilä, &
Mezei, 2015), random forests (RF) (Malekipirbazari & Aksakalli, 2015; Z. Wang, Jiang,
Ding, Lv, & Liu, 2018), gradient boosting decision trees (GBDT) (Liu & Wu, 2017), and deep
learning approaches (Luo, Wu, & Wu, 2017) have been applied to predict default loans.
Moreover, Serrano-Cinca and Gutiérrez-Nieto (2016), Guo et al. (2016), and Xia, Liu, and
Liu (2017) have proposed profit scoring models to guide lenders' investment through
portfolio optimization strategies.
From the perspective of credit-related data, most studies calculated credit scores based
on hard information. Some scholars highlighted the sample bias caused by the large number
of rejected applications in P2P lending and proposed reject inference techniques to infer
the true loan status of rejected applications and integrate them into credit risk
assessment (Li, Tian, Li, Zhou, & Yang, 2017; Xia, Yang, & Zhang, 2018). Some interesting
studies utilized soft information to boost the predictability of credit scoring models.
Though many studies have discussed the usage of soft and hard information, there is no
precise definition of soft information at present. Liberti and Petersen (2019) summarized
the characteristics of soft and hard information to make them classifiable. These
characteristics include the following:
1. Soft information is often communicated as text, whereas hard information is typically
recorded in numbers.
2. The context in soft information is more important relative to that of hard information.
3. The collector and decision maker of soft information is often the same person, whereas
the hard information collection and decision making are separated.
4. Soft information is typically gauged in terms of verifiability, whereas some hard
information is verifiable. This is one of the defining characteristics of soft information
in P2P lending, since some information self-reported by borrowers is usually unverifiable
(Michels, 2012).
Some scholars have attempted to use soft information in the credit risk assessment of P2P
lending. Social networks and descriptive text (narratives) remain the two major sources of
soft information. Concerning social networks, Lin, Prabhala, and Viswanathan (2013)
revealed that friendships positively affected the probability of loan funding and
decreased PD. Moreover, Wei, Yildirim, Van den Bulte, and Dellarocas (2015) developed a
variety of methods to examine the effect of social network data in predicting potential
default borrowers. The results revealed that social network data can improve the
performance of credit scores in some cases. X. Chen, Zhou, and Wan (2016) observed that
the borrower's social capital had a considerable impact on repayment. Regarding
narratives, Dorfleitner et al. (2016) found that text descriptions affected funding
probability but hardly predicted PD in P2P lending. Utilizing data from a representative
Chinese platform, X. Chen, Huang, and Ye (2018) claimed that the amount of punctuation in
loan descriptions affected both funding probability and interest rate. Jiang, Wang, Wang,
and Ding (2018) proposed a credit scoring model integrating soft information (extracted
from narratives) and hard credit data. The empirical analysis showed that the proposal
boosted the predictive performance relative to methods solely using hard credit data.
Furthermore, Yao, Chen, Wei, Chen, and Yang (2018) examined the role of soft information
contained in loan titles for online lending.
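As a rough illustration of how narrative text might be turned into numeric features of the kind discussed above (word counts, punctuation counts, keyword-group frequencies), consider the following sketch. It is not the method of any cited study. The keyword groups in `KEYWORD_CLUSTERS` are hand-written and hypothetical, the sample description is invented, and the punctuation count covers ASCII punctuation only, so it would need adapting for Chinese-language narratives.

```python
import re
import string

# Hypothetical keyword groups, written by hand for illustration; a real system
# would derive such clusters from the corpus rather than hard-code them.
KEYWORD_CLUSTERS = {
    "purpose": {"business", "tuition", "medical", "wedding", "renovation"},
    "repayment": {"salary", "income", "repay", "stable", "job"},
    "urgency": {"urgent", "emergency", "immediately", "need"},
}


def narrative_features(text):
    """Turn a free-text loan description into a small dict of numeric features."""
    words = re.findall(r"[a-z']+", text.lower())
    features = {
        "n_words": len(words),
        # Counts ASCII punctuation only (an assumption of this sketch).
        "n_punctuation": sum(ch in string.punctuation for ch in text),
    }
    # One count per keyword group: how many words fall into that group.
    for cluster, keywords in KEYWORD_CLUSTERS.items():
        features[f"kw_{cluster}"] = sum(word in keywords for word in words)
    return features


# An invented loan description.
description = (
    "I need this loan for tuition. My job is stable, "
    "and I will repay from my salary!"
)
features = narrative_features(description)
print(features)
```

The resulting dictionary could be appended to the hard credit variables before fitting any classifier.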
Although abundant studies have focused on credit scoring in P2P lending, some improvements
can be made from the following aspects. First, given that ensemble techniques such as RF
and GBDT have been applied in credit scoring of P2P lending, powerful ensemble classifiers
are always welcome in this area. Second, some applicants seldom have social media
accounts, which hinders