Data Science & Soft Computing Lab



Current Members

Daniel Stamate (PhD University of Paris-Sud), Data Scientist, Computer Scientist and Mathematician, Goldsmiths, University of London, Data Science & Soft Computing Lab Leader

Rapheal Olaniyan, part time PhD candidate in Computer Science, working in NLP and Machine Learning Approaches to Sentiment Analysis & Stock Market Prediction. Currently Core Modeller Data Scientist at Deutsche Bank.

Mohamed Saber, part time PhD candidate in Computer Science, working in Financial Fraud Detection. Currently Data Scientist at Deutsche Bank.

Mihai Ermaliuc, part time PhD candidate in Computer Science, working in Generative Adversarial Networks.

Jiri Marek, part time PhD candidate in Computer Science, working in Behavioural Finance.

Wajdi Alghamdi (PhD Goldsmiths, University of London), Data Scientist and forthcoming Postdoc Researcher in AI and Prediction Modelling Applications in eHealth

Andy Page (PhD University of Bath), Data Science MSc intern in the Lab and Mizuho Investment Bank, dissertation work in Outlier Detection, supervisor Dr Daniel Stamate

Pedro Lopez, Data Science MSc intern in the Lab and ForceDecks, dissertation work in Sport Analytics, supervisor Dr Daniel Stamate

Jeremy Ogg, Data Science MSc intern, dissertation work in Predicting Risk of Dementia, supervisor Dr Daniel Stamate

Associated members

Ida Pu (PhD University of Warwick), Computer Scientist, Goldsmiths, University of London

Doina Logofatu (PhD University of Cluj), Computer Scientist and Mathematician, Frankfurt University of Applied Sciences

Sinan Guloksuz (MD, PhD Maastricht University), Psychiatrist with interests in Applications of Machine Learning in Psychiatry, Maastricht University, and Yale University

Alexander Zamyatin (PhD Tomsk University), Computer Scientist, National Research Tomsk State University

Marc Atlan (PhD Pierre and Marie Curie University / Paris 6), Mathematician, finance industry, and Birkbeck, University of London

Mihaela Breaban (PhD University of Iasi), Computer Scientist, University of Iasi

Frederic Marechal (MSc Data Science Goldsmiths), Data Scientist, Santander Bank

Andrea Katrinecz (MSc Data Science Goldsmiths), Data Scientist

Research Themes

NLP, text mining and sentiment analysis approaches to stock market forecasting and fraud detection

Machine Learning and Prediction Modelling Approaches to Medical Data Mining, eHealth and Precision Medicine

Soft Computing, Evolutionary Computing and Algorithms

Decision trees and ensemble based methods with parameterised impurity families and statistical pruning

NLP, text mining and sentiment analysis approaches to stock market forecasting and fraud detection
Participants: Daniel Stamate, Rapheal Olaniyan, Frederic Marechal, Mohamed Saber, Alexander Zamyatin, Marc Atlan.

There has been increasing interest recently in examining the possible relationships between emotions expressed online and stock markets. Most previous studies claiming that emotions have a predictive influence on the stock market do so by developing various machine learning predictive models, but do not validate their claims rigorously by analysing the statistical significance of their findings. In turn, the few works that attempt to validate such claims statistically suffer from important limitations in their approaches.

A growing body of research analyses the relationship between sentiment-laden online information and the stock market, and shows a tendency for the former to predict the latter. But little is known about whether this information's predictive power resolves uncertainty. Rather, it is believed to induce volatility, because investors over-react or under-react to new information as a result of sentiment contagion.

In particular, stock market data exhibit erratic volatility, and this time-varying volatility makes any possible relationship between these variables non-linear. Our work investigates and proposes novel frameworks based on approaches that account for non-linearity and heteroscedasticity. We also study the asymmetric nature of the influences of positive and negative sentiments on stock market volatility.
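To make the idea of asymmetric sentiment effects under time-varying volatility concrete, here is a minimal sketch in Python. It is not the Lab's framework: it assumes a GARCH(1,1)-style variance recursion extended with separate, hypothetical coefficients (gamma_pos, gamma_neg) for positive and negative sentiment scores. The function name, parameter names and the choice of the sample variance as the starting value are all illustrative assumptions.

```python
def conditional_variances(returns, sentiment, omega, alpha, beta,
                          gamma_pos, gamma_neg):
    """Illustrative GARCH(1,1)-style recursion with asymmetric sentiment terms.

    Assumed (hypothetical) model, for t >= 1:
        h[t] = omega + alpha * r[t-1]**2 + beta * h[t-1]
               + gamma_pos * max(s[t-1], 0) + gamma_neg * max(-s[t-1], 0)
    where r is the return series and s the sentiment score series.
    """
    if len(returns) != len(sentiment):
        raise ValueError("returns and sentiment must have the same length")
    n = len(returns)
    if n == 0:
        return []

    # Start the recursion from the sample variance of the returns
    # (a simple initialisation chosen for this sketch).
    mean = sum(returns) / n
    h = [sum((r - mean) ** 2 for r in returns) / n]

    for t in range(1, n):
        r_prev = returns[t - 1]
        s_prev = sentiment[t - 1]
        positive_part = max(s_prev, 0.0)    # magnitude of positive sentiment
        negative_part = max(-s_prev, 0.0)   # magnitude of negative sentiment
        h.append(omega
                 + alpha * r_prev ** 2
                 + beta * h[t - 1]
                 + gamma_pos * positive_part
                 + gamma_neg * negative_part)
    return h
```

With gamma_neg larger than gamma_pos, a negative sentiment score raises the next period's conditional variance more than a positive score of the same size, which is one simple way to express asymmetry. In practice the coefficients would be estimated from data rather than set by hand.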

Current research also extends towards financial fraud detection with NLP and ML approaches; more details to follow.

Machine Learning and Prediction Modelling Approaches to Medical Data Mining, eHealth and Precision Medicine

A. Predicting risk of dementia
Participants: Daniel Stamate, Jeremy Ogg, in collaboration with Dr David Reeves, and the project team he leads at the Centre for Primary Care in the Institute for Population Health, University of Manchester, and other academic partners.

Our Lab's team leads on the Machine Learning aspects of the study, based on our newly funded Alzheimer's Research UK project on Predicting the risk of dementia using routine primary care records, which is developed in collaboration with the University of Manchester and other academic partners. The project recently received media coverage from the BBC. The research work concerns the development of novel synergistic approaches to predicting dementia based on Machine Learning (AI) and Statistical methods, and the development of a prediction tool. There are currently almost 1 million people in the UK living with dementia. There is currently no cure, and the condition has higher health and social care costs than cancer, stroke and chronic heart disease combined (dementia costs the UK £26 billion per year). Current thinking suggests that 35% of cases of dementia could be prevented. Our research project aims to contribute to prevention, and to help improve diagnosis rates (currently at least one third of expected patients do not receive a dementia diagnosis), by predicting the risk of dementia with new machine learning and statistics-based approaches. The main source of data to be analysed in this project is the Clinical Practice Research Datalink (CPRD).

B. Data-driven Computational Psychiatry Research in Predicting Mental Illness
Participants: Daniel Stamate, Wajdi Alghamdi, Andrea Katrinecz, in collaboration with the Institute of Psychiatry, Psychology & Neuroscience, King's College London, and the Department of Psychiatry and Neuropsychology, Maastricht University Medical Centre

C. Machine and Statistical Learning Modelling to Understand Heterogeneous Manifestations of Asthma in Early Life
Participants: Daniel Stamate, in collaboration with the team of Prof Adnan Custovic, Department of Medicine at Imperial College London

Wheezing is common among children, and ~50% of those under 6 years of age are thought to experience at least one episode of wheeze. However, the heterogeneity of symptoms makes these children difficult to diagnose and treat. ‘Phenotype specific therapy’ is one possible avenue of treatment, whereby significant pathology and physiology are used to identify and treat pre-schoolers with wheeze. By applying feature selection algorithms and predictive modelling techniques, this study attempts to determine whether it is possible to robustly distinguish patient diagnostic categories among pre-school children. Univariate feature analysis identified more objective variables, whereas recursive feature elimination identified a larger number of subjective variables, as important in distinguishing between patient categories. Predictive modelling sees a drop in performance when subjective variables are removed from the analysis, indicating that these variables are important in distinguishing wheeze classes. Current results show 90%+ performance in AUC, sensitivity, specificity and accuracy, and 80%+ in the kappa statistic, in distinguishing ill from healthy patients. Developed in a synergistic statistical and machine learning approach, our methodologies also propose a novel ROC Cross Evaluation method for model post-processing and evaluation. The stability of the predictive modelling is assessed in computationally intensive Monte Carlo simulations.
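For readers less familiar with the evaluation measures named above, the following Python sketch shows how sensitivity, specificity, accuracy, Cohen's kappa and AUC can be computed for a binary ill/healthy classifier. It is a generic illustration using the standard definitions, not the study's code and not the ROC Cross Evaluation method. The function names and the label coding (1 for ill, 0 for healthy) are assumptions made for this example.

```python
def classification_metrics(y_true, y_pred):
    """Sensitivity, specificity, accuracy and Cohen's kappa for 0/1 labels.

    Assumes label 1 = ill and label 0 = healthy (an illustrative coding),
    and that both classes occur in y_true.
    """
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("both classes must be present in y_true")

    n = tp + tn + fp + fn
    accuracy = (tp + tn) / n
    # Agreement expected by chance, from the marginal totals.
    chance = ((tp + fp) / n) * ((tp + fn) / n) + ((tn + fn) / n) * ((tn + fp) / n)
    return {
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "accuracy": accuracy,
        "kappa": (accuracy - chance) / (1 - chance),
    }


def auc(y_true, scores):
    """AUC as the probability that a random ill case outscores a random
    healthy case, counting ties as one half (rank-based definition)."""
    positives = [s for t, s in zip(y_true, scores) if t == 1]
    negatives = [s for t, s in zip(y_true, scores) if t == 0]
    if not positives or not negatives:
        raise ValueError("both classes must be present in y_true")
    wins = sum(1.0 if p > q else 0.5 if p == q else 0.0
               for p in positives for q in negatives)
    return wins / (len(positives) * len(negatives))
```

Kappa corrects accuracy for the agreement expected by chance, which is why it can sit well below accuracy on the same predictions. The pairwise AUC computation is quadratic in the number of cases, which is acceptable for a sketch but not for large datasets.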

Forthcoming work concerns proposing and expanding a novel methodology based on unsupervised learning / clustering to address the heterogeneous nature of asthma and the identification of its sub-categories. This work is to be developed in collaboration with the team of Prof Adnan Custovic, Department of Medicine at Imperial College London.

Soft Computing, Evolutionary Computing and Algorithms
Participants: Doina Logofatu, Daniel Stamate, Ida Pu

Soft Computing involves various advances in Algorithmics which are specific to the nature of this computing paradigm. This theme addresses the need for efficiency in solving optimisation problems or the need for offering tractable solutions for specific NP-hard problems by employing Evolutionary Computing approaches, in particular using hybrid evolutionary approaches or parallel evolutionary approaches.

On the other hand, devising efficient algorithms for integrating, querying and performing inferences with imperfect information benefits from Soft Computing approaches such as those based on multi-valued logics, and this is another direction we follow in our research. We provide algorithms for computing the semantics of the integration, querying or inference rules that describe the result of these processes, and for deciding the query equivalence problem, which is useful in query optimisation.

Moreover, statistical simulations are a useful Soft Computing tool that we employ for assessing new algorithms we propose for improving the time-efficiency in blocking expanding ring search for mobile ad hoc networks, or for various concurrency problems.

Decision trees and ensemble based methods with parameterised impurity families and statistical pruning
Participants: Daniel Stamate, Wajdi Alghamdi, Doina Logofatu, Alexander Zamyatin, in collaboration with Daniel Stahl, Department of Biostatistics, King's College London

In the process of constructing decision trees, the criteria for selecting the splitting attributes influence the performance of the model produced by the decision tree algorithm. The best-known criteria, such as Shannon entropy and the Gini index, suffer from a lack of adaptability to the datasets. This project investigates families of parameterised impurities that we propose, to be used in the construction of optimised decision trees. These criteria rely on families of strictly concave functions that define the new generalised parameterised impurity measures, which we applied in devising and implementing our novel PIDT decision tree algorithm. We also investigate novel statistics-based approaches for preventing overfitting with pruning, and we proposed the so-called S-pruning procedure. The PIDT algorithm was evaluated on a number of simulated and benchmark datasets with good results. Experimental results suggest that by tuning the parameters of the impurity measures and by using our S-pruning method, we obtain better decision tree classifiers. Ongoing work investigates the extension of these techniques to ensemble based predictive models built on parameterised families of impurities.
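As a rough illustration of what a parameterised impurity family looks like, the Python sketch below uses the Tsallis entropy, a well-known one-parameter family that reduces to the Gini index at q = 2 and to Shannon entropy (in nats) as q approaches 1. This is a stand-in chosen for the example: it is not the impurity family used in PIDT, and the sketch does not implement S-pruning. All function names here are hypothetical.

```python
import math
from collections import Counter


def tsallis_impurity(labels, q):
    """Tsallis entropy of the class distribution of `labels`, for q > 0.

    q = 2 gives the Gini index; q = 1 is handled as the Shannon limit.
    Used here only as an example of a tunable impurity family.
    """
    if not labels:
        raise ValueError("labels must be non-empty")
    if q <= 0:
        raise ValueError("q must be positive")
    n = len(labels)
    probs = [count / n for count in Counter(labels).values()]
    if abs(q - 1.0) < 1e-9:
        return -sum(p * math.log(p) for p in probs)
    return (1.0 - sum(p ** q for p in probs)) / (q - 1.0)


def split_gain(parent, left, right, q):
    """Impurity reduction achieved by splitting `parent` into two children."""
    n = len(parent)
    weighted = (len(left) / n) * tsallis_impurity(left, q) \
        + (len(right) / n) * tsallis_impurity(right, q)
    return tsallis_impurity(parent, q) - weighted


def best_threshold(values, labels, q):
    """Scan midpoints of one numeric attribute and return
    (threshold, gain) for the split with the largest impurity reduction."""
    pairs = sorted(zip(values, labels), key=lambda pair: pair[0])
    best, best_gain = None, 0.0
    for i in range(1, len(pairs)):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # no threshold separates equal attribute values
        threshold = (pairs[i - 1][0] + pairs[i][0]) / 2
        left = [label for _, label in pairs[:i]]
        right = [label for _, label in pairs[i:]]
        gain = split_gain(labels, left, right, q)
        if gain > best_gain:
            best, best_gain = threshold, gain
    return best, best_gain
```

Treating q as a hyperparameter, for instance selecting it by cross-validation, is one way such a criterion could adapt to a dataset where a fixed Gini or Shannon criterion cannot.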