Data Science & Soft Computing Lab




Dr Daniel Stamate, Data Scientist and the Lab Leader, University of London – Goldsmiths, and University of Manchester

Prof Fionn Murtagh, Data Scientist, Director Centre for Mathematics and Data Science - University of Huddersfield, and University of London - Goldsmiths

Dr Ida Pu, Computer Scientist, University of London - Goldsmiths

Mr Rapheal Olaniyan, Data Scientist at Deutsche Bank, PT PhD Candidate Data Science at Goldsmiths, working in NLP and Machine Learning Approaches to Sentiment Analysis & Stock Market Prediction

Mr Mohamed Saber, Data Scientist at Deutsche Bank, PT PhD Candidate Data Science at Goldsmiths, working in Financial Fraud Detection

Dr Jiri Marek, Financial Trader, PT PhD Candidate Data Science at Goldsmiths, working in Behavioural Finance

Mr Richard Smith, Data Scientist at Deutsche Bank, PT PhD Candidate Data Science at Goldsmiths

Mr Mihai Ermaliuc, Machine Learning Engineer, PT PhD Candidate Data Science at Goldsmiths, working in Generative Adversarial Networks

Mr John Langham, Data Scientist, Goldsmiths, University of London

Associated members

Prof Daniel Stahl, Professor Medical Statistics and Statistical Learning, Lead of Precision Medicine and Statistical Learning Group, King's College London

Prof Doina Logofatu, Computer Scientist and Mathematician, Frankfurt University of Applied Sciences

Prof Alexander Zamyatin, Computer Scientist, Director Institute of Applied Mathematics and Computer Science, National Research Tomsk State University

Mr Frederic Marechal, Data Scientist, Santander Bank

Dr Mihaela Breaban, Computer Scientist, University of Iasi

Dr Sinan Guloksuz, Psychiatry researcher with interests in Machine Learning, Maastricht University, and Yale University

MSc students and interns

Rubaida Easmin, Gozde Orhan, Mazy Carneiro, Data Science MSc, University of London, Goldsmiths

Dinara Kanarina, Olga Khudoleeva, Ruslan Tsygankov, Erasmus+ interns, National Research Tomsk State University

Research Themes

NLP, text mining and sentiment analysis approaches to stock market forecasting and fraud detection

Machine Learning and Prediction Modelling Approaches to Medical Data Mining, eHealth and Precision Medicine

Soft Computing, Evolutionary Computing and Algorithms

Decision trees and ensemble based methods with parameterised impurity families and statistical pruning

NLP, text mining and sentiment analysis approaches to stock market forecasting and fraud detection
Participants: Daniel Stamate, Rapheal Olaniyan, Mohamed Saber, Frederic Marechal, Dinara Kanarina

There has been an increasing interest recently in examining the possible relationships between emotions expressed online and stock markets. Most of the previous studies claiming that emotions have predictive influence on the stock market do so by developing various machine learning predictive models, but do not validate their claims rigorously by analysing the statistical significance of their findings. In turn, the few works that attempt to statistically validate such claims suffer from important limitations of their approaches.

A growing body of research analyses the relationship between sentiment-laden online information and the stock market, and shows a tendency for the former to predict the latter. But little is known about whether this information's predictive power resolves uncertainty. Rather, it is believed to induce volatility, because investors over-react or under-react to new information as a result of sentiment contagion.

In particular, stock market data exhibit erratic volatility, and this time-varying volatility makes any possible relationship between these variables non-linear. Our work investigates and proposes novel frameworks based on approaches that account for non-linearity and heteroscedasticity. We also study the asymmetric nature of the influences of positive and negative sentiments on stock market volatility.
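To make the notions of heteroscedasticity and asymmetry concrete, the following is a minimal sketch of one standard model family, GJR-GARCH(1,1), in which the conditional variance changes over time and reacts more strongly to negative shocks than to positive ones. It is given purely as an illustration and is not necessarily one of the frameworks proposed by the Lab; the function name, the parameter values and the choice of the sample variance as the starting value are all assumptions.

```python
def gjr_garch_variance(returns, omega, alpha, gamma, beta):
    """Conditional variance path of a GJR-GARCH(1,1) model (illustrative sketch).

    sigma2[t] = omega + (alpha + gamma * [r[t-1] < 0]) * r[t-1]**2 + beta * sigma2[t-1]

    Assumption: the recursion is started from the sample variance of `returns`.
    """
    n = len(returns)
    mean = sum(returns) / n
    sigma2 = [sum((r - mean) ** 2 for r in returns) / n]  # starting value
    for t in range(1, n):
        shock = returns[t - 1]
        # The extra term gamma applies only after a negative shock (asymmetry).
        leverage = gamma if shock < 0 else 0.0
        sigma2.append(omega + (alpha + leverage) * shock ** 2 + beta * sigma2[-1])
    return sigma2


# Hypothetical parameter values, chosen only for the illustration.
params = dict(omega=1e-6, alpha=0.05, gamma=0.10, beta=0.85)
after_gain = gjr_garch_variance([0.02, 0.0], **params)[1]
after_loss = gjr_garch_variance([-0.02, 0.0], **params)[1]
```

With a positive `gamma`, `after_loss` exceeds `after_gain` even though the two shocks have the same magnitude, which is the kind of asymmetric volatility response described above.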

Current research is also being extended towards financial fraud detection with NLP and ML approaches; more details to follow.

Machine Learning and Prediction Modelling Approaches to Medical Data Mining, eHealth and Precision Medicine

A. Predicting risk of dementia and AD
Participants: Daniel Stamate, Fionn Murtagh, John Langham, Richard Smith, in collaboration with Dr David Reeves and team at the Centre for Primary Care in the Institute for Population Health, University of Manchester, and Ruslan Tsygankov, Rubaida Easmin, Gozde Orhan, Mazy Carneiro.

Our Lab's team leads on the Machine Learning aspects of the study based on our newly funded Alzheimer's Research UK project on Predicting the risk of dementia using routine primary care records, which is developed in collaboration with the University of Manchester and other academic partners. The project recently received media coverage from the BBC. The research work concerns the development of novel synergistic approaches to predicting dementia based on Machine Learning (AI) and Statistical methods, and the development of a prediction tool. There are currently almost 1 million people in the UK living with dementia. There is currently no cure, and the condition has higher health and social care costs than cancer, stroke and chronic heart disease taken together (dementia costs the UK £26 billion per year). Current thinking suggests that 35% of cases of dementia could be prevented. Our research project aims to contribute to prevention, and to help improve diagnosis rates (currently at least one third of expected patients do not receive a dementia diagnosis), by predicting the risk of dementia with new machine learning and statistics-based approaches. The main source of data to be analysed in this project is the Clinical Practice Research Datalink (CPRD).

B. Data-driven Computational Psychiatry Research in Predicting Mental Illness
Participants: Daniel Stamate, Daniel Stahl, Wajdi Alghamdi, Andrea Katrinecz, in collaboration with Institute of Psychiatry, Psychology & Neuroscience, King's College London, and Department of Psychiatry and Neuropsychology, Maastricht University Medical Centre

C. Machine and Statistical Learning Modelling to Understand Heterogeneous Manifestations of Asthma in Early Life
Participants: Daniel Stamate, in collaboration with the team of Prof Adnan Custovic, Department of Medicine at Imperial College London

Wheezing is common among children, and ~50% of those under 6 years of age are thought to experience at least one episode of wheeze. However, the heterogeneity of symptoms makes these children difficult to diagnose and treat. ‘Phenotype specific therapy’ is one possible avenue of treatment, whereby we use significant pathology and physiology to identify and treat pre-schoolers with wheeze. By applying feature selection algorithms and predictive modelling techniques, this study attempts to determine whether it is possible to robustly distinguish patient diagnostic categories among pre-school children. Univariate feature analysis identified more objective variables as important in distinguishing between patient categories, whereas recursive feature elimination identified a larger number of subjective variables. Predictive modelling sees a drop in performance when subjective variables are removed from the analysis, indicating that these variables are important in distinguishing wheeze classes. Current results show 90%+ performance in AUC, sensitivity, specificity and accuracy, and 80%+ in the kappa statistic, in distinguishing ill from healthy patients. Developed in a synergistic statistical and machine learning approach, our methodologies also propose a novel ROC Cross Evaluation method for model post-processing and evaluation. The stability of the predictive modelling is assessed in computationally intensive Monte Carlo simulations.
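For readers less familiar with the evaluation measures named above, the following is a minimal sketch of how sensitivity, specificity, accuracy, Cohen's kappa and AUC can be computed for a binary ill versus healthy classifier. It is not the study's evaluation code and it does not implement the ROC Cross Evaluation method; the function names and all the numbers are hypothetical and are used only for illustration.

```python
def binary_metrics(tp, fp, fn, tn):
    """Standard measures derived from a 2x2 confusion matrix."""
    n = tp + fp + fn + tn
    sensitivity = tp / (tp + fn)   # proportion of ill patients correctly flagged
    specificity = tn / (tn + fp)   # proportion of healthy patients correctly cleared
    accuracy = (tp + tn) / n
    # Agreement expected by chance, from the row and column marginals.
    p_chance = ((tp + fp) / n) * ((tp + fn) / n) + ((fn + tn) / n) * ((fp + tn) / n)
    kappa = (accuracy - p_chance) / (1 - p_chance)
    return {"sensitivity": sensitivity, "specificity": specificity,
            "accuracy": accuracy, "kappa": kappa}


def auc(scores_ill, scores_healthy):
    """AUC as the probability that an ill case scores higher than a healthy one.

    Ties count as one half (the Mann-Whitney formulation).
    """
    wins = sum(1.0 if s_i > s_h else 0.5 if s_i == s_h else 0.0
               for s_i in scores_ill for s_h in scores_healthy)
    return wins / (len(scores_ill) * len(scores_healthy))


# Hypothetical confusion matrix and scores, not taken from the study.
metrics = binary_metrics(tp=45, fp=5, fn=5, tn=45)
example_auc = auc([0.9, 0.8], [0.1, 0.8])
```

Kappa corrects accuracy for the agreement expected by chance, which is why it is typically lower than accuracy for the same classifier.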

Forthcoming work concerns proposing and expanding a novel methodology based on unsupervised learning / clustering to address the heterogeneous nature of asthma and the identification of its sub-categories. This work is to be developed in collaboration with the team of Prof Adnan Custovic, Department of Medicine at Imperial College London.

Soft Computing, Evolutionary Computing and Algorithms
Participants: Doina Logofatu, Daniel Stamate, Ida Pu

Soft Computing involves various advances in Algorithmics which are specific to the nature of this computing paradigm. This theme addresses the need for efficiency in solving optimisation problems or the need for offering tractable solutions for specific NP-hard problems by employing Evolutionary Computing approaches, in particular using hybrid evolutionary approaches or parallel evolutionary approaches.

On the other hand, devising efficient algorithms for integrating, querying and performing inferences with imperfect information benefits from Soft Computing approaches such as those based on multi-valued logics, and this is another direction we follow in our research. We provide algorithms for computing the semantics of the integration, querying or inference rules that describe the result of these processes, and for deciding the query equivalence problem, which is useful in query optimisation.

Moreover, statistical simulations are a useful Soft Computing tool that we employ for assessing new algorithms we propose for improving the time-efficiency in blocking expanding ring search for mobile ad hoc networks, or for various concurrency problems.

Decision trees and ensemble based methods with parameterised impurity families and statistical pruning
Participants: Daniel Stamate, Wajdi Alghamdi, Daniel Stahl, Doina Logofatu, Alexander Zamyatin

In the process of constructing decision trees, the criteria for selecting the splitting attributes influence the performance of the model produced by the decision tree algorithm. The most well-known criteria, such as Shannon entropy and the Gini index, suffer from a lack of adaptability to the datasets. This project investigates families of parameterised impurities that we propose, to be used in the construction of optimised decision trees. These criteria rely on families of strictly concave functions that define the new generalised parameterised impurity measures, which we applied in devising and implementing our novel PIDT decision tree algorithm. We also investigate novel statistics-based approaches for preventing overfitting with pruning, and we proposed the so-called S-pruning procedure. The PIDT algorithm was evaluated on a number of simulated and benchmark datasets with good results. Experimental results suggest that by tuning the parameters of the impurity measures and by using our S-pruning method, we obtain better decision tree classifiers. Ongoing work investigates the extension of these techniques to ensemble based predictive models based on parameterised families of impurities.
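As an illustration of the general idea of a parameterised impurity, the sketch below places Shannon entropy and the Gini index next to the Tsallis entropy, a well-known one-parameter family built from concave functions. This is not the PIDT algorithm and the Tsallis family is not claimed to be the family the project proposes; it is swapped in here as an assumption, and all function names are hypothetical.

```python
import math


def shannon_entropy(p):
    """Shannon entropy (in bits) of a class-probability vector p."""
    return -sum(pi * math.log2(pi) for pi in p if pi > 0)


def gini_index(p):
    """Gini index of a class-probability vector p."""
    return 1.0 - sum(pi * pi for pi in p)


def tsallis_impurity(p, q):
    """Tsallis entropy with parameter q > 0 (illustrative family, not PIDT's).

    q = 2 gives the Gini index; as q tends to 1 it tends to Shannon entropy in nats.
    """
    if abs(q - 1.0) < 1e-12:
        return -sum(pi * math.log(pi) for pi in p if pi > 0)
    return (1.0 - sum(pi ** q for pi in p)) / (q - 1.0)


def class_probs(labels):
    """Empirical class probabilities of a list of labels."""
    counts = {}
    for y in labels:
        counts[y] = counts.get(y, 0) + 1
    return [c / len(labels) for c in counts.values()]


def impurity_decrease(parent, left, right, impurity):
    """Impurity of the parent minus the size-weighted impurity of the two children."""
    n = len(parent)
    return (impurity(class_probs(parent))
            - len(left) / n * impurity(class_probs(left))
            - len(right) / n * impurity(class_probs(right)))


# A split that separates the two classes perfectly, scored with q = 1.5.
gain = impurity_decrease([0, 0, 1, 1], [0, 0], [1, 1],
                         lambda p: tsallis_impurity(p, 1.5))
```

Varying `q` changes how candidate splits are ranked, which is the sense in which a parameterised impurity can be tuned to a dataset, whereas the fixed criteria cannot.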