Large complex systems often incorporate multiple subsystems and a multitude of sensors, and system monitoring is essential for discovering faults and facilitating diagnostics across those subsystems and sensors. At the same time, exploring large systems with numerous multivariate data sets is challenging. The Large Hadron Collider (LHC) at CERN, the world's most powerful particle collider, is not just a scientific marvel but a beacon of human curiosity and ingenuity. It provides thousands of researchers with unparalleled opportunities for groundbreaking discoveries by colliding particles at nearly the speed of light, unveiling new physics to answer the grandest questions about the nature of matter. The scientific discoveries and technological advances at CERN have made tremendous contributions to our daily lives; notable examples include touchscreen technology, hadron therapy for tumors, PET scanners for medical imaging, and the World Wide Web. The LHC has recently undergone crucial upgrades, leveraging new technologies to realize its high-luminosity program; its components are growing more complex to withstand high radiation exposure and strong magnetic fields and to sustain extreme particle acquisition rates. Such extraordinary attributes pose tremendous challenges for operating the experiments and processing the enormous volumes of data they produce. Ensuring the quality of physics data requires timely monitoring and prompt resolution of system anomalies.
Machine Learning (ML) tools have gained immense popularity for monitoring and diagnostic applications in various industrial domains, driven by the proliferation of sensor data. The growing complexity of the LHC and the volume of its monitoring data accentuate the need for automation through advanced ML tools. Detection, identification, and resolution of anomalies are essential for collecting more physics collision data of the highest quality. Developing ML tools for complex systems often involves expensive data curation and modeling efforts; it requires adequate, cleaned, and annotated data sets and must address the heterogeneity and curse of dimensionality of large data sets. The Compact Muon Solenoid (CMS) experiment, one of the large general-purpose particle detectors at the LHC, has dedicated substantial effort to monitoring its detector systems and physics data quality; the detector control and safety systems (DCS/DSS) actively monitor safety-critical problems, and the data quality monitoring (DQM) system mitigates data loss by identifying and diagnosing physics data problems. The existing monitoring systems nevertheless need to incorporate a wider range of monitoring variables and adapt to the evolving conditions of the detectors.
This dissertation focuses on the development of unsupervised anomaly detection (AD), anomaly prediction (AP), and root-cause analysis (RCA) on multivariate time series data sets. We have developed deep learning models for the frontend electronics of the Hadron Calorimeter (HCAL) of the CMS detector using diagnostic sensor data and high-dimensional particle-acquisition channel-monitoring data sets. The research has advanced deep learning methods and tackled the challenges of monitoring complex systems with thousands of sensors through a divide-and-conquer approach and modeling that combines temporal, multivariate, explainable, adaptive, online real-time, and causal graph discovery methods. Our scientific contributions to the challenges of complex system monitoring are the following: 1) enhanced multivariate sensor AD; 2) a promising AP approach; 3) context-aware high-dimensional spatio-temporal AD; 4) transfer learning on multi-network deep learning models; 5) lightweight interconnection and divergence discovery for multiple systems with multivariate sensors; and 6) improved computational efficiency of anomaly causality discovery on binary anomaly data. Beyond these scientific contributions, the monitoring tools developed during the PhD research have been deployed in the CMS production system, where they have detected and identified previously unknown and hard-to-monitor anomalies. The tools have extended real-time monitoring, diagnostics, and prognostics automation across the multitude of sensors in several subsystems of the CMS detector. The PhD study thus takes a practical step toward enhancing the efficiency and reliability of the CMS detector, contributing to the broader scientific community and our understanding of the universe, and paving the way for new technological advancements.