Track: Doctoral Dissertation Competition
Abstract
Recent advances in supervised machine learning have shown that such tools can serve as accurate surrogate models for complex chemical production processes. In this approach, complex unit models are replaced with surrogate models built from actual chemical plant data. Nevertheless, real data should be handled with caution: it is rarely free of missing points, outliers, and faulty measurements, and using it without pre-processing could lead to inaccurate prediction models. Moreover, it is well known that real data without any outliers is almost nonexistent. Hence, removing outliers is a very important step in developing data-driven models. In this study, therefore, different statistical and machine learning outlier detection methods are implemented and compared for cleaning actual plant data before it is introduced to the data-driven surrogate models. Outliers are observations that do not follow the bulk pattern of the data points and are unlikely observations of the data. It is worth mentioning that identifying outliers by simple inspection and visualization of a data set is challenging. Different methods can be used to identify outliers: some are based on univariate statistics (the Interquartile Range method), and others are based on unsupervised machine learning (Local Outlier Factor, Isolation Forest, and One-Class Support Vector Machine). The performance of these outlier detection methods on the data sets under study is evaluated using linear regression models that predict certain process variables. Results show that removing outliers with these methods before training the surrogate models can enhance the prediction accuracy of the machine learning approximation models.
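As an illustration of the univariate Interquartile Range method named above, the following is a minimal sketch and not the implementation used in the study. The function names, the fence factor `k = 1.5` (the conventional Tukey choice), and the sample readings are assumptions made up for the example.

```python
import statistics


def iqr_bounds(values, k=1.5):
    """Return the (lower, upper) fences of the IQR rule.

    k = 1.5 is the conventional fence factor; it is an assumption here,
    since the abstract does not state the value used.
    """
    # Quartiles computed with the "inclusive" method of the standard library.
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


def remove_outliers_iqr(values, k=1.5):
    """Keep only the observations that fall inside the IQR fences."""
    low, high = iqr_bounds(values, k)
    return [v for v in values if low <= v <= high]


# Hypothetical sensor readings (illustrative values, not plant data).
readings = [50.1, 49.8, 50.3, 50.0, 49.9, 50.2, 75.0, 50.1, 49.7, 20.0]
cleaned = remove_outliers_iqr(readings)
print(cleaned)
```

In this made-up series the two readings far from the bulk of the data (75.0 and 20.0) fall outside the fences and are dropped, while the remaining eight are kept. The rule treats one variable at a time, which is why multivariate methods such as those listed above are compared against it.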