Unsupervised analytics for multi-source time series data: Enabling trend analytics, context-aware profiling and real-time state forecasting
 
 
Abstract 

There has been an explosion of impressive success stories recently with deep learning approaches in various fields such as natural language processing, computer vision, healthcare, and robotics. The advent of transformers has further amplified the capabilities of deep learning models to understand and generate complex patterns, establishing them as a cornerstone of modern AI advancements across a broad spectrum of applications. Initially, transformers revolutionised large language models (LLMs) like GPT-4 and BERT, enabling them to process and generate human-like text with remarkable coherence and accuracy. Now, their impressive performance is also being demonstrated in other domains, extending their impact beyond just language processing. Given sufficient high-quality labelled data and computational resources, deep learning models are able to achieve levels of accuracy that were previously unattainable. Consequently, much of the AI research nowadays is devoted to improving deep learning architectures, leading to the creation of computational models that are increasingly precise, lighter, faster, etc.

Unfortunately, most of the real-world application contexts (e.g., industrial asset operations, production processes, and mobility management) generate datasets which significantly diverge from the idealised benchmark datasets used to validate novel AI methodologies. Real-world data is typically characterised by the presence of noise, missing values, complicated parameter names, different data types, lack of ground truth, context-dependent features, etc. These characteristics make it very challenging to immediately dive into any AI model application since it is often not clear which modelling paradigm would best suit the problem at hand.
This PhD research is built around the conception and validation of a heuristic data analytics methodology with the primary aim of benefiting maximally from the different facets of real-world datasets, while at the same time mitigating their ‘imperfections’.

Nowadays, most of the available datasets originating from industrial activities are composed of a multitude of different parameters. The inherent multi-source nature of such datasets makes it impossible to directly integrate different data types without information loss. For instance, the performance of an industrial asset is impacted by a diverse set of factors (multiple views) such as different operating modes and settings concerned with the internal working of the asset, and many exogenous factors, such as human operators or weather conditions. However, it is not always possible to directly link or trace back certain performance to a distinct operating context due to numerous influencing factors, which are often also highly interdependent. To address this challenge, a multi-view data integration approach has been devised as a part of this PhD work, which identifies and considers different data views explicitly, making it possible to fully harness the richness of heterogeneous datasets while retaining all the relevant information.

The ongoing trend of increasingly more data being captured and stored goes parallel with an increasing complexity of interpreting and extracting valuable insights from it. For instance, the remote monitoring of infrastructures (e.g., roads, buildings, and power supplies) or portfolios of industrial assets (e.g., wind turbines, compressors, and pumps) typically generates complex spatio-temporal data streams captured at a high sampling rate across a multitude of different locations. Combining and making sense of such data streams, while still being able to capture and preserve the temporal dynamics per spatial context, is not trivial.
In this PhD research, an elegant spatio-temporal profiling methodology is proposed, making it possible to uncover insightful spatial patterns and dependencies while taking full advantage of the temporal dimension. However, it is crucial to acknowledge that solely relying on intelligent analysis techniques often falls short in fully uncovering pertinent patterns and relationships in real-world data. In contrast, the human eye can outperform algorithms in grasping and interpreting subtle patterns, provided it is supported by intelligent visualisations. Therefore, the exciting domain of visual analytics research has also been explored in this PhD thesis, resulting in the conception of several novel visualisation approaches, blending advanced visualisation with intelligent analysis to effectively reveal key patterns and relationships in the dataset of interest.

By far, the hardest challenge associated with the analysis of real-world datasets is the lack of ground truth, which limits the choice of learning paradigms to only unsupervised ones. Consequently, the potential of deriving meaningful insights from such datasets is far from being fully exploited since it requires creative data science approaches beyond the mere application of AI algorithms. In this PhD research, a novel data mining and modelling framework is conceived, capable of extracting semantically interpretable states from unlabelled real-world datasets. This facilitates a better understanding of system behaviour in terms of state transitions and also makes it possible to convert the initially unsupervised data modelling problem into a supervised one, enabling the construction of forecasting models.
Several different neural and neuro-symbolic learning workflows have been proposed for this purpose in this PhD work. Thus, thanks to the creative data analysis phase preceding model construction, these paradigms are endowed with the capability to perform advanced supervised tasks such as modelling transition dynamics, forecasting future states, predicting forthcoming events, and identifying anomalies.

To evaluate the effectiveness of the conceived methodologies, real-world datasets from two fundamentally different application domains have been considered. The first domain relates to wind energy production, for which high-quality SCADA data collected from an onshore wind farm is leveraged. The second domain pertains to mobility, where diverse datasets for vehicle detection are utilised, obtained from ANPR cameras and inductive loops. These practical applications highlight the main contribution of this PhD research: the development of innovative heuristic data mining methodologies that bridge the gap between the clean and perfect benchmark datasets used in research nowadays and the reality of noisy and complex data streams originating from diverse real-world applications.