Interactive machine learning

In the RAPID-MIX API, inputs are processed by models to create outputs. These models are built using interactive machine learning.

Interactive machine learning provides a workflow for users to build these models by demonstrating human actions and computer responses, rather than by programming. This means you can quickly create complex and responsive interactive systems.

You could be using biosensor data to play a synthesiser, or using colour tracking in your webcam to control an avatar in a game: interactive machine learning lets you intuitively connect a diverse range of inputs to outputs without coding.

What is a Model? What are Training and Training Data?

In order for our interactive machine learning system to “learn”, it needs to be trained. Training Data is the set of examples that our interactive machine learning system learns from. This could be a set of recorded actions, such as making an “open hand” or a “closed fist” at the webcam, so that the computer learns what an “open hand” or a “closed fist” looks like. It could also associate inputs with outputs: making an “open hand” whilst playing a loud sound, and a “closed fist” whilst playing a quiet sound, so that the system learns to play quietly when it detects “closed fist” gestures and loudly when it detects “open hand” gestures.
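The RAPID-MIX API has its own training-set format, which is not reproduced here; as a sketch of the idea, a training set can be represented as a list of input/output example pairs. The field names and feature below (a single “hand openness” value) are illustrative only:

```python
# A hypothetical training set for the "open hand" / "closed fist" example:
# each entry pairs an input (here, one hand-openness feature from a webcam
# tracker) with the output demonstrated alongside it (a volume level).
# The actual RAPID-MIX data format may differ; this only illustrates the idea.
training_set = [
    {"input": [1.0], "output": [0.9]},  # open hand   -> loud
    {"input": [0.0], "output": [0.1]},  # closed fist -> quiet
]
```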


A Learning Algorithm uses the examples in the Training Data to output a Model, i.e. a function that performs Classification or Regression on live input data.

The RAPID-MIX API features both static and temporal models, performing classification and regression.

Classification vs Regression tasks

Classification identifies different types of inputs. With classification, our system could learn, for instance, an “open hand” gesture and a “closed fist” gesture. If someone gestured at our system with their hand, it could tell us whether it thought they were making an “open hand” or a “closed fist” gesture, and respond accordingly.

With regression, our system can respond to new types of input with new types of output. Knowing what “open hand” and “closed fist” mean, our system could, with regression, create a new output for a “pointing hand” based upon what it has been shown.
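A minimal sketch of that idea, assuming a single hand-openness feature and a model fitted from just two demonstrated examples (real regression models handle many examples, many dimensions, and non-linear mappings):

```python
def train_linear(examples):
    """Fit output = a*x + b from two (input, output) example pairs.
    A stand-in for regression, to show interpolation between examples."""
    (x0, y0), (x1, y1) = examples
    a = (y1 - y0) / (x1 - x0)
    b = y0 - a * x0
    return lambda x: a * x + b

# closed fist (0.0) -> quiet (0.1); open hand (1.0) -> loud (0.9)
model = train_linear([(0.0, 0.1), (1.0, 0.9)])

# A half-open "pointing hand" (0.5) gets a new, in-between volume
# that was never explicitly demonstrated.
pointing_volume = model(0.5)
```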

Static vs Temporal data

What sort of things do you want to identify with interactive machine learning? If it is a pose, it is not moving, it’s static: for instance, a “closed fist” pose. If it is a gesture, it’s moving, it is temporal: for instance, a waving gesture.

Typically, a model that deals with temporal data will be able to learn from a set of gestures, then estimate which one you are doing and follow it in real time (e.g. report how far through the gesture you are).



Machine Learning Algorithms

The RAPID-MIX API provides users with a number of different machine learning algorithms.

Algorithm                    Classification  Regression  Following  Variation Tracking
Neural Networks                              ✓
Dynamic Time Warping (DTW)   ✓
Hierarchical HMM (HHMM)      ✓                           ✓
Multimodal HMM (MHMM)                        ✓           ✓


KNN (static classification) K-Nearest Neighbours is a simple algorithm that stores all available examples and classifies new examples based on how close (i.e. by Euclidean distance) they are to the training examples. It works well on basic recognition tasks, and training is quick. However, it can run slowly when the number of examples is large, and it may not cope well with noisy data.
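A stdlib-only Python sketch of the KNN idea described above (not the RAPID-MIX implementation; the hand-shape features are invented for illustration):

```python
import math
from collections import Counter

def knn_classify(examples, point, k=3):
    """Classify `point` by majority vote among the k nearest training
    examples, using Euclidean distance. `examples` is a list of
    (feature_vector, label) pairs."""
    nearest = sorted(examples, key=lambda ex: math.dist(ex[0], point))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Toy hand-shape features: [finger spread, hand height]
examples = [
    ([0.9, 0.8], "open hand"),   ([0.8, 0.9], "open hand"),
    ([0.1, 0.2], "closed fist"), ([0.2, 0.1], "closed fist"),
]
knn_classify(examples, [0.85, 0.85])  # -> "open hand"
```

Note that classifying requires comparing against every stored example, which is why KNN slows down as the training set grows.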

SVM (static classification) A Support Vector Machine is a discriminative classifier formally defined by a separating hyperplane. In other words, given labelled training data (supervised learning), the algorithm outputs an optimal hyperplane which categorises new examples. SVMs tend to work well on more complex problems than KNN can handle. However, the algorithm has a number of parameters that can strongly affect the accuracy of the output.

NN (static regression) The RAPID-MIX API implements a form of neural network called a multilayer perceptron: a feedforward neural network trained with back-propagation. This is generally a good algorithm for regression and mapping tasks. The training time scales with the number of examples, while the running time scales with the number of inputs and outputs.
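As a sketch of the multilayer-perceptron idea: one hidden tanh layer, trained by gradient descent with back-propagation, learning a loudness mapping from a few demonstrated examples. This illustrates the mechanism only; the RAPID-MIX implementation differs in detail.

```python
import math, random

def train_mlp(data, hidden=4, epochs=5000, lr=0.2, seed=0):
    """Minimal 1-input, 1-output multilayer perceptron with one hidden
    tanh layer, trained by per-example gradient descent (back-propagation).
    Returns the trained model as a function of one input."""
    rnd = random.Random(seed)
    w1 = [rnd.uniform(-1, 1) for _ in range(hidden)]  # input -> hidden weights
    b1 = [0.0] * hidden
    w2 = [rnd.uniform(-1, 1) for _ in range(hidden)]  # hidden -> output weights
    b2 = 0.0

    def forward(x):
        h = [math.tanh(w1[i] * x + b1[i]) for i in range(hidden)]
        return h, sum(w2[i] * h[i] for i in range(hidden)) + b2

    for _ in range(epochs):
        for x, t in data:
            h, y = forward(x)
            err = y - t                              # dLoss/dy for 0.5*(y-t)^2
            for i in range(hidden):
                dh = err * w2[i] * (1 - h[i] ** 2)   # error at hidden unit i
                w2[i] -= lr * err * h[i]
                w1[i] -= lr * dh * x
                b1[i] -= lr * dh
            b2 -= lr * err
    return lambda x: forward(x)[1]

# Demonstrated (hand openness, volume) examples:
data = [(0.0, 0.1), (0.5, 0.5), (1.0, 0.9)]
net = train_mlp(data)
```

Once trained, the returned model is just a function from input to output: running it (`net(0.75)`) is cheap, which is why running time depends on the number of inputs and outputs rather than the number of training examples.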

DTW (temporal classification) Dynamic Time Warping measures the similarity between different temporal sequences that might vary in speed. It can be used to recognise patterns over time. For instance, you might want to play one note every time you draw a circle in the air with your hand, and another note every time you draw a square. If you’re not drawing either one, or if you’re in the middle of drawing, you don’t want anything to happen.
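The core of DTW is a dynamic-programming recurrence that aligns two sequences while allowing stretching in time. A minimal Python sketch, using invented 1-D “gesture” sequences:

```python
def dtw_distance(a, b):
    """Dynamic Time Warping distance between two 1-D sequences, via the
    classic dynamic-programming recurrence. A lower distance means more
    similar, even if the sequences are performed at different speeds."""
    INF = float("inf")
    n, m = len(a), len(b)
    D = [[INF] * (m + 1) for _ in range(n + 1)]
    D[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(a[i - 1] - b[j - 1])
            # Extend the cheapest of: insertion, deletion, or match.
            D[i][j] = cost + min(D[i - 1][j], D[i][j - 1], D[i - 1][j - 1])
    return D[n][m]

# A slow and a fast performance of the same rising gesture still match,
# while a falling gesture does not:
slow = [0, 1, 2, 3, 4]
fast = [0, 2, 4]
falling = [4, 3, 2, 1, 0]
dtw_distance(slow, fast) < dtw_distance(slow, falling)  # -> True
```

A recogniser can compute this distance against each recorded template and pick the nearest one, or report “no match” if every distance is too large.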

GVF Gesture Variation Follower allows for real-time gesture recognition and variation estimation. In other words, the library provides methods to easily learn a gesture vocabulary, recognise a gesture as soon as it is performed, and estimate its variations (e.g. in scale, orientation, or dynamics). It has been designed for human-computer interaction (HCI) mixing discrete and continuous commands, and specifically for creative applications such as controlling sounds and visuals.

GMM (static classification) Gaussian Mixture Models are instantaneous movement models. The input data associated with a class defined by the training sets is abstracted by a mixture (i.e. a weighted sum) of Gaussian distributions. This representation allows recognition in the performance phase: for each input frame, the model calculates the likelihood of each class.
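In the special case of a single Gaussian component per class, the GMM recipe reduces to fitting one Gaussian to each class’s training samples and, for each input frame, picking the class with the highest likelihood. A one-dimensional stdlib sketch with invented hand-openness data:

```python
import math, statistics

def fit_gaussian(samples):
    """Fit a 1-D Gaussian (a one-component 'mixture') to a class's samples."""
    return statistics.mean(samples), statistics.stdev(samples)

def log_likelihood(x, params):
    """Log-density of x under a 1-D Gaussian (mu, sigma)."""
    mu, sigma = params
    return -math.log(sigma * math.sqrt(2 * math.pi)) - (x - mu) ** 2 / (2 * sigma ** 2)

# One feature (hand openness), one Gaussian per class:
classes = {
    "open hand":   fit_gaussian([0.80, 0.90, 0.95, 0.85]),
    "closed fist": fit_gaussian([0.10, 0.20, 0.15, 0.05]),
}

frame = 0.88  # one live input frame
best = max(classes, key=lambda c: log_likelihood(frame, classes[c]))
```

A full GMM uses a weighted sum of several such Gaussians per class, which lets it capture classes whose examples cluster in more than one region.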

GMR (static regression) Gaussian Mixture Regression is a straightforward extension of Gaussian Mixture Models used for regression. Trained with multimodal data, GMR allows for predicting the features of one modality (e.g. sound) from the features of another (e.g. movement) through non-linear regression between both feature sets.

Hierarchical HMM (HHMM) integrates a high-level structure that governs the transitions between classical HMM structures representing the temporal evolution of low-level movement segments. In the performance phase of the system, the hierarchical model estimates the likeliest gesture according to the transitions defined by the user. The system continuously estimates the likelihood for each model, as well as the time progression within the original training phrases.

Multimodal Hierarchical HMM (MHMM) allows for predicting a stream of sound parameters from a stream of movement features. It simultaneously takes into account the temporal evolution of movement and sound as well as their dynamic relationship according to the given example phrases. In this way, it guarantees the temporal consistency of the generated sound, while realizing the trained temporal movement-sound mappings.