Abstract
Almost all statistical and machine learning models are built on a known dataset (the learning dataset) collected in the past or from other regions or domains. The precondition for applying these models to new instances is that the data of the new instances are comparable to the learning dataset. This comparability also has a strong impact on the performance of statistical and machine learning models. There are three main types of comparability: comparability over region (geographical comparability), comparability over other domains, and comparability over time. Comparability over time refers to the extent to which statistics are comparable or reconcilable over time. Because some social and economic indexes are clearly affected by changes over time in economic or social phenomena, comparability over time is an important factor in improving the quality of models in the social and economic fields. However, most studies have emphasized analysis methods or algorithms, and little research has dealt with comparability over time.
This study addresses the issue of comparability over time. Our emphasis is on (1) proposing a method to assess the comparability over time among datasets collected in different periods, and (2) clarifying the impact of comparability over time on statistical models by examining how model performance changes with the extent of comparability between the learning dataset and the fitting datasets. As an example of a statistical model, we consider the credit rating problem of Japanese regional banks. We collect financial indicators of Japanese regional banks in 2012 (R2012), 2015 (R2015), and 2017 (R2017), conduct the following examinations, and make several new contributions to data analytics research:
- No published research has assessed comparability over time. This study demonstrates that the number of variables with significantly different means and/or variances provides a measure of the extent to which two datasets are comparable.
- We apply linear discriminant analysis (LDA) to construct two credit rating models, based on R2012 (Model A) and R2015 (Model B) respectively, and then use these two models to obtain the credit rating for each bank in R2017. By comparing the forecasting results of Model A and Model B and examining the extent of comparability among the three datasets, we clarify the impact of comparability over time on the performance of credit rating models.
- As a data preprocessing tool, data normalization has been applied widely in various fields. Its main goal is to guarantee the quality of the data by scaling every feature to the same range of values, so that bias of one feature over another is minimized. In this study, we propose a new viewpoint: applying data normalization as a tool for improving comparability over time between datasets.
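
As an illustrative sketch of the first and third points above (not the procedure used in this study), the following Python code counts the variables whose means or spreads differ significantly between two periods, then repeats the count after standardizing each period separately. Several choices here are assumptions: the permutation tests stand in for whichever significance tests are actually applied, the spread comparison is Levene-style (absolute deviations from the median) rather than a direct variance test, the per-period z-score is only one possible form of normalization, and the indicator names and values are synthetic placeholders rather than bank data.

```python
import random
import statistics


def permutation_p_value(a, b, n_perm=1000, seed=0):
    """Two-sided permutation p-value for a difference in means (assumed test)."""
    rng = random.Random(seed)
    observed = abs(statistics.fmean(a) - statistics.fmean(b))
    pooled = list(a) + list(b)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        left, right = pooled[:len(a)], pooled[len(a):]
        if abs(statistics.fmean(left) - statistics.fmean(right)) >= observed:
            hits += 1
    # Add-one correction keeps the p-value strictly positive.
    return (hits + 1) / (n_perm + 1)


def abs_deviations(values):
    """Absolute deviations from the median, used as a proxy for spread."""
    med = statistics.median(values)
    return [abs(v - med) for v in values]


def differs(a, b, alpha=0.05):
    """True if the mean or the spread differs significantly at level alpha."""
    mean_p = permutation_p_value(a, b)
    spread_p = permutation_p_value(abs_deviations(a), abs_deviations(b))
    return mean_p < alpha or spread_p < alpha


def count_differing_variables(old, new, alpha=0.05):
    """Number of shared variables that differ; a smaller count suggests higher comparability."""
    return sum(differs(old[name], new[name], alpha) for name in old)


def zscore(values):
    """Standardize one variable to zero mean and unit sample standard deviation."""
    mu = statistics.fmean(values)
    sd = statistics.stdev(values)
    return [(v - mu) / sd for v in values]


def normalize(dataset):
    """Apply z-score standardization to every variable within one period."""
    return {name: zscore(values) for name, values in dataset.items()}


# Synthetic placeholder data (hypothetical indicators, NOT the bank datasets).
rng = random.Random(42)
n = 60
old = {
    "indicator_1": [rng.gauss(0, 1) for _ in range(n)],
    "indicator_2": [rng.gauss(0, 1) for _ in range(n)],
    "indicator_3": [rng.gauss(0, 1) for _ in range(n)],
}
new = {
    "indicator_1": [rng.gauss(0, 1) for _ in range(n)],  # same distribution
    "indicator_2": [rng.gauss(2, 1) for _ in range(n)],  # shifted mean
    "indicator_3": [rng.gauss(0, 4) for _ in range(n)],  # larger spread
}

before = count_differing_variables(old, new)
after = count_differing_variables(normalize(old), normalize(new))
print(f"differing variables before normalization: {before} of {len(old)}")
print(f"differing variables after normalization:  {after} of {len(old)}")
```

Per-period standardization removes differences in level and scale by construction, so a sketch like this can only show how the count reacts to those two kinds of change. It says nothing about shifts in distribution shape or in the relationships between variables.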