Data Science & Soft Computing Lab



Regular members

Dr Daniel Stamate (PhD University of Paris-Sud), Data Scientist, Computer Scientist and Mathematician, Goldsmiths, University of London, Data Science & Soft Computing Lab Leader

Dr Sinan Guloksuz, MD (PhD Maastricht University), Psychiatrist with interests in Applications of Machine Learning in Psychiatry, Maastricht University and Yale University

Prof Doina Logofatu (PhD University of Cluj), Computer Scientist and Mathematician, Frankfurt University of Applied Sciences

Dr Ida Pu (PhD University of Warwick), Computer Scientist, Goldsmiths, University of London

Prof Alexander Zamyatin, Computer Scientist, National Research Tomsk State University

Dr Marc Atlan (PhD Pierre and Marie Curie University / Paris 6), Mathematician, finance industry, and Birkbeck, University of London

Dr Mihaela Breaban (PhD University of Iasi), Computer Scientist, University of Iasi

Associated members

Andrea Katrinecz, Data Scientist, MSc graduate at Goldsmiths

Frederic Marechal, Data Scientist, MSc graduate at Goldsmiths

Lucia Acuna-Avendano, Data Scientist, MSc graduate at Goldsmiths

Anna Vashkel, Data Scientist, MSc graduate at National Research Tomsk State University

Research students

Rapheal Olaniyan, part-time PhD candidate in Computer Science –
Machine Learning Approaches to Sentiment Analysis & Stock Market Forecasting at Goldsmiths, and Core Modeller Data Scientist at Deutsche Bank, supervisor Dr Daniel Stamate, co-supervisors Dr Ida Pu and Dr Rodger Kibble

Wajdi Alghamdi, PhD candidate in Computer Science –
Prediction Modelling Approaches to Data-driven Computational Psychiatry at Goldsmiths, supervisor Dr Daniel Stamate, co-supervisor Dr Daniel Stahl (Biostatistics and Health Informatics Department, IoPPN, King's College London)

Dr Jiri Marek, part-time PhD candidate in Computer Science – Behavioural Finance at Goldsmiths, supervisor Dr Daniel Stamate, co-supervisors Prof Alan Pickering and Dr Caspar Addyman (Psychology Dept, Goldsmiths)

Mihai Ermaliuc, Data Science MSc student, Goldsmiths

Jeremy Ogg, Data Science MSc student, Goldsmiths

Seehyun Park, Data Science MSc student, Goldsmiths

Research Themes

Statistical & Machine Learning with Sentiment Analysis Modelling for Stock Market Trends Prediction

Machine Learning and Prediction Modelling Approaches to Medical Data Mining, eHealth and Precision Medicine

Mobility Big Data Analytics

Soft Computing, Evolutionary Computing and Algorithms

Fuzzy Approaches to Reasoning with Imperfect Data, Imperfect Data Integration and Optimal Querying

Statistical & Machine Learning with Sentiment Analysis Modelling for Stock Market Trends Prediction
Participants: Daniel Stamate, Rapheal Olaniyan, Marc Atlan, Alexander Zamyatin, Anna Vashkel, Frederic Marechal.

There has been an increasing interest recently in examining the possible relationships between emotions expressed online and stock markets. Most of the previous studies claiming that emotions have predictive influence on the stock market do so by developing various machine learning predictive models, but do not validate their claims rigorously by analysing the statistical significance of their findings. In turn, the few works that attempt to statistically validate such claims suffer from important limitations of their approaches.

A growing body of research analyses the relationship between sentiment-filled online information and the stock market, and shows a tendency for the former to predict the latter. But little is known about whether this information's predictive power resolves uncertainty. Rather, it is believed to induce volatility, because investors over-react or under-react to new information as a result of sentiment contagion.

In particular, stock market data exhibit erratic volatility, and this time-varying volatility makes any possible relationship between these variables non-linear. Our work investigates and proposes novel frameworks based on approaches that account for non-linearity and heteroscedasticity. We also study the asymmetric nature of the influences of positive and negative sentiments on stock market volatility.
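As a minimal illustration of what accounting for heteroscedasticity and asymmetric sentiment effects can look like, the sketch below runs a GARCH(1,1)-style variance recursion with separate coefficients for positive and negative sentiment. The text above does not name a specific model, so the function name, the coefficient values and the sentiment scale are all assumptions made for illustration only.

```python
def conditional_variance(returns, sentiment, init_var,
                         omega=1e-6, alpha=0.05, beta=0.90,
                         d_pos=1e-5, d_neg=3e-5):
    """Conditional variance series of a GARCH(1,1)-style model with
    asymmetric sentiment terms. All coefficients are hypothetical.

    sigma2[t] = omega + alpha * r[t-1]**2 + beta * sigma2[t-1]
                + d_pos * max(s[t-1], 0) + d_neg * max(-s[t-1], 0)
    """
    if len(returns) != len(sentiment):
        raise ValueError("returns and sentiment must have the same length")
    if not returns:
        return []
    sigma2 = [init_var]  # variance assumed known for the first period
    for t in range(1, len(returns)):
        r_prev, s_prev = returns[t - 1], sentiment[t - 1]
        sigma2.append(
            omega
            + alpha * r_prev ** 2              # reaction to the last shock
            + beta * sigma2[-1]                # persistence of volatility
            + d_pos * max(s_prev, 0.0)         # effect of positive sentiment
            + d_neg * max(-s_prev, 0.0)        # effect of negative sentiment
        )
    return sigma2


# Same returns, opposite sentiment on day 0 (sentiment assumed to lie in [-1, 1]).
returns = [0.0, 0.0]
after_bad_news = conditional_variance(returns, [-1.0, 0.0], init_var=1e-4)
after_good_news = conditional_variance(returns, [1.0, 0.0], init_var=1e-4)
print(after_bad_news[1] > after_good_news[1])
```

With `d_neg` larger than `d_pos`, negative sentiment raises next-period variance more than positive sentiment of the same size, which is one simple way to encode the asymmetry discussed above. In practice the coefficients would be estimated from data rather than fixed by hand.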

Novel extensions of this research are currently being developed with the team of Prof. Helyette Geman, Commodity Finance Centre at Birkbeck, University of London, and Johns Hopkins University.

Machine Learning and Prediction Modelling Approaches to Medical Data Mining, eHealth and Precision Medicine

A. Predicting risk of dementia using routine primary care records
Participants: Daniel Stamate, one Research Associate (to be appointed), Jeremy Ogg, Seehyun Park, in collaboration with the team of Dr David Reeves, Centre for Primary Care in the Institute for Population Health, University of Manchester.

Our newly funded Alzheimer's Research UK project on Predicting the risk of dementia, in collaboration with the University of Manchester and other partner universities, concerns the development of novel synergistic approaches to predicting dementia based on Machine Learning techniques and Statistical methods, and the development of a prediction tool. There are currently almost 1 million people in the UK living with dementia. There is currently no cure, and the condition has higher health and social care costs than cancer, stroke and chronic heart disease taken together (dementia costs the UK £26 billion per year). Current thinking suggests that 35% of cases of dementia could be prevented. Our research project aims to contribute to prevention, and to help improve diagnosis rates (currently at least one third of expected patients don't receive a dementia diagnosis), by predicting the risk of dementia with new machine learning and statistics-based approaches. Our team at Goldsmiths leads on the Machine Learning aspects of the research study.

B. Data-driven Computational Psychiatry Research in Predicting Mental Illness
Participants: Daniel Stamate, Wajdi Alghamdi, Andrea Katrinecz, in collaboration with the Institute of Psychiatry, Psychology & Neuroscience, King's College London, and the Department of Psychiatry and Neuropsychology, Maastricht University Medical Centre

C. Machine and Statistical Learning Modelling to Understand Heterogeneous Manifestations of Asthma in Early Life
Participants: Daniel Stamate, in collaboration with Danielle Belgrave, Rachel Cassidy, and the team of Prof Adnan Custovic, Department of Medicine at Imperial College London

Wheezing is common among children, and ~50% of those under 6 years of age are thought to experience at least one episode of wheeze. However, due to the heterogeneity of symptoms, there are difficulties in diagnosing and treating these children. ‘Phenotype specific therapy’ is one possible avenue of treatment, whereby we use significant pathology and physiology to identify and treat pre-schoolers with wheeze. By applying feature selection algorithms and predictive modelling techniques, this study attempts to determine whether it is possible to robustly distinguish patient diagnostic categories among pre-school children. Univariate feature analysis identified more objective variables, and recursive feature elimination a larger number of subjective variables, as important in distinguishing between patient categories. Predictive modelling sees a drop in performance when subjective variables are removed from the analysis, indicating that these variables are important in distinguishing wheeze classes. Current results show 90%+ performance in AUC, sensitivity, specificity, and accuracy, and 80%+ in the kappa statistic, in distinguishing ill from healthy patients. Developed in a synergistic statistical and machine learning approach, our methodologies also propose a novel ROC Cross Evaluation method for model post-processing and evaluation. The stability of the predictive modelling is assessed in computationally intensive Monte Carlo simulations.
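For readers unfamiliar with the performance measures quoted above, the following sketch shows how sensitivity, specificity, accuracy, Cohen's kappa and AUC can be computed for a binary ill-versus-healthy classifier. It is a generic illustration rather than the study's own evaluation code: the function names and the toy labels and scores are assumptions, and it does not implement the ROC Cross Evaluation method, which is not described here.

```python
def binary_metrics(y_true, y_pred):
    """Sensitivity, specificity, accuracy and Cohen's kappa for 0/1 labels.
    Assumes both classes occur in y_true and predictions are not all identical."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    n = tp + tn + fp + fn
    accuracy = (tp + tn) / n
    # Agreement expected by chance, from the marginal class frequencies.
    p_chance = ((tp + fp) / n) * ((tp + fn) / n) + ((tn + fn) / n) * ((tn + fp) / n)
    return {
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "accuracy": accuracy,
        "kappa": (accuracy - p_chance) / (1 - p_chance),
    }


def auc(y_true, scores):
    """Area under the ROC curve as the probability that a randomly chosen
    positive case scores higher than a randomly chosen negative one."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > q else 0.5 if p == q else 0.0 for p in pos for q in neg)
    return wins / (len(pos) * len(neg))


# Toy data: 1 = ill, 0 = healthy (made up for illustration).
y_true = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.8, 0.4, 0.3, 0.2, 0.6]
y_pred = [1 if s >= 0.5 else 0 for s in scores]
print(binary_metrics(y_true, y_pred))
print(auc(y_true, scores))
```

Kappa corrects accuracy for the agreement expected by chance, which is why it is typically lower than the other figures on the same predictions.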

Forthcoming work concerns proposing and expanding a novel methodology based on unsupervised learning / clustering to address the heterogeneous nature of asthma and the identification of its sub-categories. This work is to be developed in collaboration with the team of Prof Adnan Custovic, Department of Medicine at Imperial College London.

Mobility Big Data Analytics
Participants: Daniel Stamate, Ida Pu, Mihai Ermaliuc, Lucia Acuna-Avendano, in collaboration with Fionn Murtagh

The research we have recently started to tackle in this project is based on building computational models to identify particular patterns in smart card (Oyster) big data and open data provided by Transport for London.

In particular, we are looking into profiling segments of public transport users, and corroborating these segments with patterns in the trajectories they adopt. Further work envisages extending the study by incorporating web mining to look for correlations between public transport usage and information extracted from news websites, blogs and social media such as Twitter. Specifically, we plan to study how significant past events in London, as reflected in online social media and news, impacted public transport usage in certain areas at certain times. These insights can contribute to improving the forecasting and optimisation of public transport usage.

The particularly demanding computational tasks in this study are tackled by devising new software in Python for certain segments of the data pre-processing phase, and by using Big Data Analytics tools such as Spark, Hadoop and its ecosystem, RapidMiner, and R on the big data management and analytics cluster of servers.

Soft Computing, Evolutionary Computing and Algorithms
Participants: Doina Logofatu, Daniel Stamate, Ida Pu

Soft Computing involves various advances in Algorithmics which are specific to the nature of this computing paradigm. This theme addresses the need for efficiency in solving optimisation problems or the need for offering tractable solutions for specific NP-hard problems by employing Evolutionary Computing approaches, in particular using hybrid evolutionary approaches or parallel evolutionary approaches.

On the other hand, devising efficient algorithms for integrating, querying and performing inferences with imperfect information benefits from Soft Computing approaches such as those based on multi-valued logics, and this is another direction we follow in our research. We provide algorithms for computing the semantics of the integration, querying or inference rules that describe the result of these processes, and for deciding the query equivalence problem, which is useful in query optimisation.

Moreover, statistical simulations are a useful Soft Computing tool that we employ for assessing new algorithms we propose for improving the time-efficiency in blocking expanding ring search for mobile ad hoc networks, or for various concurrency problems.

Fuzzy Approaches to Reasoning with Imperfect Data, Imperfect Data Integration and Optimal Querying
Participants: Daniel Stamate, Ida Pu

This theme addresses the problem of representing and reasoning with imperfect information using a logical approach based on fuzzy / multivalued logics, as well as the integration and querying of information coming from different sources, in a distributed environment.

Motivation comes from the area of knowledge acquisition, representation, and reasoning based on imperfect knowledge. Indeed, in the real world information may be incomplete or may have a bounded level of certainty. On the other hand, contradictions may occur during the process of integrating information coming from various sources, as is the case when collecting knowledge from different experts. In multi-agent systems, different agents may give different answers to the same query. It is then important to be able to process the answers so as to extract the maximum of information on which the various agents agree, or to detect the items on which the agents give conflicting answers. Incompleteness, uncertainty and inconsistency of the information may be treated by using ready-to-employ hypotheses when information is completely missing, and multivalued logics with the particular algebraic structures of semilattice, lattice and bilattice when information is incomplete, uncertain or inconsistent.

In our framework the information concerns the truth values of information items, and is obtained through queries to the relevant sources. The answers to such queries are combined or integrated using a set of rules. In such a setting, imperfect information, i.e. incomplete or uncertain information from a source, or contradictory information coming from different sources, can be elegantly expressed and dealt with using bilattices, for instance, together with a rule-based approach to reasoning whose semantics is natural in these bilattice-based multivalued logics.
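To make the bilattice idea concrete, the sketch below uses the smallest standard example, Belnap's four-valued bilattice, in which a truth value is a pair (evidence for, evidence against). It is offered only as an illustration of how answers from several sources can be combined; the encoding, the function names and the choice of this particular bilattice are assumptions, not a description of the framework's actual rules.

```python
# Truth values as (evidence_for, evidence_against).
UNKNOWN = (0, 0)   # no source says anything
TRUE = (1, 0)
FALSE = (0, 1)
CONFLICT = (1, 1)  # sources contradict each other


def gather(a, b):
    """Knowledge-order join: accept everything both sources claim."""
    return (max(a[0], b[0]), max(a[1], b[1]))


def consensus(a, b):
    """Knowledge-order meet: keep only what both sources agree on."""
    return (min(a[0], b[0]), min(a[1], b[1]))


def conj(a, b):
    """Truth-order meet (logical AND)."""
    return (min(a[0], b[0]), max(a[1], b[1]))


def disj(a, b):
    """Truth-order join (logical OR)."""
    return (max(a[0], b[0]), min(a[1], b[1]))


def neg(a):
    """Negation swaps the evidence for and against."""
    return (a[1], a[0])


def integrate(answers):
    """Combine the answers of several sources to the same query."""
    result = UNKNOWN
    for answer in answers:
        result = gather(result, answer)
    return result


# Two experts disagree, a third has no opinion.
print(integrate([TRUE, FALSE, UNKNOWN]) == CONFLICT)
print(consensus(TRUE, FALSE) == UNKNOWN)
```

The two orderings play different roles: `gather` and `consensus` work along the knowledge order and are what detect conflicts or extract agreement between sources, while `conj`, `disj` and `neg` work along the truth order and act as the logical connectives inside rules.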

A connected research direction we tackle concerns the problem of optimal querying of knowledge bases and databases with imperfect information. Conventional techniques based on the concept of homomorphism have traditionally been used in database research to study the containment of queries evaluated against conventional data. We have extended and generalized these techniques such that the problem of query containment and equivalence (essential in optimal query evaluation) can be successfully studied in the context of sources containing imperfect information.
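For context, the classical homomorphism technique for conventional data can be sketched as follows: a conjunctive query Q1 is contained in Q2 exactly when there is a homomorphism from Q2 to Q1 that maps head to head and every body atom of Q2 onto a body atom of Q1. The code covers only this classical case, not the extension to imperfect information; the query encoding (variables written with a leading `?`) and the function names are assumptions made for illustration.

```python
def is_var(term):
    """Variables are strings starting with '?'; anything else is a constant."""
    return isinstance(term, str) and term.startswith("?")


def unify(args_from, args_to, mapping):
    """Extend mapping so that args_from is sent onto args_to, or return None."""
    if len(args_from) != len(args_to):
        return None
    extended = dict(mapping)
    for a, b in zip(args_from, args_to):
        if is_var(a):
            if extended.setdefault(a, b) != b:
                return None          # variable already mapped elsewhere
        elif a != b:
            return None              # constants must map to themselves
    return extended


def find_homomorphism(q_from, q_to):
    """Search for a homomorphism from query q_from to query q_to.
    A query is (head_terms, [(predicate, args), ...])."""
    head_from, body_from = q_from
    head_to, body_to = q_to
    start = unify(head_from, head_to, {})
    if start is None:
        return None

    def extend(i, mapping):
        if i == len(body_from):
            return mapping
        pred, args = body_from[i]
        for pred_to, args_to in body_to:
            if pred_to != pred:
                continue
            candidate = unify(args, args_to, mapping)
            if candidate is not None:
                result = extend(i + 1, candidate)
                if result is not None:
                    return result
        return None                  # backtrack

    return extend(0, start)


def contained_in(q1, q2):
    """True if every answer of q1 is also an answer of q2 on any database."""
    return find_homomorphism(q2, q1) is not None


# Q1(x) :- E(x, y), E(y, x)      Q2(x) :- E(x, y)
q1 = (("?x",), [("E", ("?x", "?y")), ("E", ("?y", "?x"))])
q2 = (("?x",), [("E", ("?x", "?y"))])
print(contained_in(q1, q2), contained_in(q2, q1))
```

Two queries are equivalent when containment holds in both directions, which is the check a query optimiser needs before replacing a query with a cheaper one.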

The applications of these approaches are in knowledge acquisition and representation, uncertain knowledge bases and databases, intelligent systems, and imperfect information integration and querying.