Abstract
In the current climate of uncertainty, corporate management is extremely challenging, and the risk of bankruptcy is increasing as business performance deteriorates. Corporate bankruptcies cause losses to stakeholders such as business partners, investors, and financial institutions, so models that can detect bankruptcies early, or help prevent them, are needed. Many studies on bankruptcy discrimination models have been conducted, including Altman's research, and in recent years machine learning models, which offer objective and highly accurate predictions, have become mainstream. This study focuses on the number of features. I construct bankruptcy discrimination models using two machine learning methods, Random Forest (hereafter: RF) and Light Gradient Boosting Machine (hereafter: Light GBM), and examine how differences in the number of features affect their discrimination accuracy. The target data consist of publicly traded companies in the six industries with the highest number of bankruptcies under the Tokyo Stock Exchange's 33-sector classification. Excluding Electric Power and Gas, Banking, Securities and Commodity Futures, Insurance, and Other Financial industries, these six industries are Construction, Real Estate, Services, Retail, Electric Equipment, and Wholesale. Financial indicators were obtained from Nikkei NEEDS-Financial QUEST, a comprehensive economic database service by Nikkei Inc. The dataset was created from the financial indicator data of companies in the six industries with the highest number of bankruptcies between 1990 and 2021. Financial indicators and NW indicators were used as features. For comprehensiveness, 161 financial indicators and 12 NW indicators were included, for a total of 173 indicators. I create three types of datasets with different numbers of features to train the machine learning models.
As AI technology develops, datasets are in many cases essential to improving the performance and accuracy of AI models, and their importance is increasing. The first dataset uses only financial indicators (hereafter: financial dataset). The second uses both financial indicators and NW indicators (hereafter: investment dataset). Companies for which NW indicators could not be calculated were excluded from it, so it contains less data than the financial dataset. The third removes the NW indicators from the investment dataset (hereafter: comparative dataset). Next, I describe the overall picture of the constructed models. A total of 108 models were constructed by combining six industries, three types of datasets, three resampling methods for addressing imbalanced data, and two machine learning methods (6 × 3 × 3 × 2 = 108). The resampling methods are k-means undersampling (hereafter: k-means), SMOTE, and SMOTE + Edited Nearest Neighbor (hereafter: SMOTE+ENN). The machine learning methods are RF and Light GBM. Model evaluation places particular emphasis on recall, defined as Recall = TP / (TP + FN), where TP is the number of true positives and FN is the number of false negatives. This study uses the number of TPs to determine the features for bankruptcy discrimination; in other words, the model with the highest number of TPs is taken to be the optimal bankruptcy discrimination model. The financial indicators and NW indicators selected as useful are then confirmed, and the useful combination of machine learning method and resampling method is verified for each industry. To obtain a highly accurate bankruptcy discrimination model, the number of features is gradually reduced.
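The model grid and the recall definition described above can be made concrete with a short sketch. It uses only the labels given in the text; the variable and function names are illustrative assumptions, and it enumerates the combinations without training any model.

```python
from itertools import product

# Labels are taken from the abstract; the names of these variables are illustrative.
INDUSTRIES = ["Construction", "Real Estate", "Services",
              "Retail", "Electric Equipment", "Wholesale"]
DATASETS = ["financial", "investment", "comparative"]
RESAMPLERS = ["k-means", "SMOTE", "SMOTE+ENN"]
LEARNERS = ["RF", "Light GBM"]


def recall(tp: int, fn: int) -> float:
    """Recall = TP / (TP + FN); returns 0.0 when there are no positive cases."""
    total = tp + fn
    return tp / total if total else 0.0


# One entry per (industry, dataset, resampling method, learner) combination.
model_grid = list(product(INDUSTRIES, DATASETS, RESAMPLERS, LEARNERS))
print(len(model_grid))  # 6 * 3 * 3 * 2 = 108
```

Each tuple in model_grid identifies one model to be trained and then scored by its TP count and recall.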
Specifically, I conduct bankruptcy discrimination analysis while gradually reducing the number of indicators from 173 to 2, compare the accuracy at each number of features, and determine the optimal number of features. In this analysis, the model with seven features using RF had the best recall and the highest number of true positives (TP). The selected features were cash flow to net interest-bearing debt ratio, net working capital amount, dividend to cash flow ratio, cash flow to long-term debt ratio, net interest burden to sales ratio, interest and discount fee to sales ratio, cash flow to fixed liabilities ratio, and equity growth rate. The financial dataset gave the highest recall, the investment dataset (financial and NW indicators) the second highest, and the comparative dataset (the investment dataset without NW indicators) the lowest. Among the resampling methods, SMOTE+ENN performed best, followed by k-means and then SMOTE alone. As for industry-specific results, recall for the Electric Equipment industry was 82.61%: the model correctly identified 19 of 23 bankrupt companies in a dataset of 9,013 non-bankrupt and 23 bankrupt companies. These results confirm that creating appropriate datasets and determining the optimal number of features can improve the accuracy of bankruptcy discrimination.
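The search over the number of features can be sketched as a simple loop. This is a minimal illustration under stated assumptions, not the procedure used in the study: the abstract does not say how features are ranked or removed, so the sketch assumes a pre-ranked feature list and a hypothetical callback, count_true_positives, that trains and evaluates a model on a given feature subset. The toy scorer and its numbers are invented for demonstration only.

```python
from typing import Callable, Sequence, Tuple


def best_feature_count(
    ranked_features: Sequence[str],
    count_true_positives: Callable[[Sequence[str]], int],
    min_features: int = 2,
) -> Tuple[int, int]:
    """Return (number_of_features, true_positives) for the best subset size.

    ranked_features is ordered from most to least useful; the ranking
    criterion is an assumption. count_true_positives is a hypothetical
    callback standing in for model training and evaluation.
    """
    best_k, best_tp = 0, -1
    # Shrink from the full feature set down to min_features, one at a time.
    for k in range(len(ranked_features), min_features - 1, -1):
        tp = count_true_positives(ranked_features[:k])
        # ">=" prefers the smaller feature set when TP counts tie.
        if tp >= best_tp:
            best_k, best_tp = k, tp
    return best_k, best_tp


# Toy scorer for demonstration only; it is not a trained model.
def toy_scorer(features: Sequence[str]) -> int:
    return 3 if len(features) == 5 else 1


features = [f"indicator_{i}" for i in range(173)]
result = best_feature_count(features, toy_scorer)
print(result)  # (5, 3)
```

Selecting by TP count mirrors the evaluation criterion stated in the text; breaking ties in favor of fewer features is a design choice of this sketch.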