The incredible power of machine learning
I know what you are thinking!
Have you ever wondered what it would be like to be able to read people’s minds? Imagine finding out what your friends are thinking without a single word being spoken. Wouldn’t it be nice to know instantly what a baby wants when it starts crying? A device that can read minds could help the police identify the perpetrator of a crime in a split second. According to research conducted at the Toyohashi University of Technology in Japan, it is only a matter of time until brain decoding is introduced to the world.
Researchers at the Toyohashi University of Technology in Japan are convinced that a device that can translate people’s thoughts into single words, sentences or pictures will be ready for use in the near future. This device, which can be connected to a smartphone, will detect people’s brainwaves and decode them into recognizable words, pictures or numbers. Besides private use, it could also have a big impact on several sectors, such as healthcare and government institutions. For example, it could enable people who have lost the ability to speak to communicate.
How does brain decoding work?
To translate people’s brainwaves into recognizable words or pictures, a large database first needs to be built. This database consists of matches between brainwaves and the corresponding thoughts or images shown at the moment the brainwaves were recorded by an electroencephalogram (EEG). In machine learning terms, a database in which both the inputs (brainwaves) and the desired outcomes (words, thoughts, pictures) are known is called the training set. The information gathered in this stage is used to develop an algorithm or model that can make accurate data-driven predictions of people’s thoughts, given inputs for which the desired outcome is unknown. The model generated in the training phase will eventually be used to produce advanced mind-reading technologies. The model can be improved further by extending the training set over time.
Machine learning is the way to go in the revolutionary world of brain decoding. Let’s talk about this subfield in the domain of Artificial Intelligence (AI) in more detail.
Machine learning is all about models built upon a training database that can be used to make accurate data-driven predictions and decisions without following strictly static program instructions. In other words, machine learning techniques can be used for applications for which it is difficult or even impossible to explicitly program an algorithm. The only requirement is that the training set contains a (possibly hidden) relation that the model can learn. It is all about learning the structure in the data!
With the rise of big data over the last few years, machine learning is more relevant today than ever before. Having people in your company with extensive knowledge of machine learning and its possibilities is more important than ever. Imagine what insights can be extracted from the data gathered by your company or your competitors. A few examples:
- Knowing your customers and their preferences will lead to a more effective advertising campaign, increased sales and customer satisfaction.
- Efficient stock management following from accurate sales predictions will result in higher profits.
- Optimization of vehicle routing in advanced logistics systems will result in lower operating costs and lower emissions of air pollutants.
In the remainder of this article, we describe the three categories of machine learning algorithms on a high level and we highlight commonly used machine learning methods. In addition to the methods listed below, many other machine learning methods exist that find applications in business. Many more techniques will be developed in the coming years, given the fact that it is becoming increasingly important to understand data. Making accurate data-driven predictions and decisions will result in cost savings or increased customer satisfaction.
The three categories of machine learning algorithms are supervised learning, unsupervised learning and reinforcement learning.
Supervised learning algorithms are machine learning models that are built upon an input sample containing both the inputs, also called independent variables, and their desired outputs. The desired output is often called the label or dependent variable. This learning technique uses the assigned labels to generate a function that maps inputs to desired outputs. Generalization of the model is extremely important: high-accuracy predictions for the input sample do not necessarily result in high-accuracy predictions for unobserved datasets. A commonly used method to check that the model generated in the training phase is sufficiently robust is to split the input sample into two datasets, the training set and the test set. The training set is used to build the model, while the test set is used to evaluate the accuracy of the model on unseen data. When the accuracy on the test set is significantly lower than the accuracy on the training set, the model is not robust: it overfits the training set.
How to partition the input sample into a training and test set?
There is no strict rule for how best to split the input sample into a training and test set. This depends, among other things, on the amount of data available and on the patterns (hidden or not) in the data. Differences in data patterns between the training set and the test set are undesirable, so both sets should be random selections of the available data. There are two competing concerns when partitioning the data into non-overlapping datasets. To estimate the model parameters reliably, the training set should contain as many data points as possible: with less training data, the parameter estimates will have greater variance. To estimate the performance of the model reliably, the test set should also contain as many data points as possible: with less test data, the performance statistics will have greater variance. Broadly speaking, you should partition the data such that neither variance is too high.
The steps below help you get a handle on both variances, assuming there is enough data to do a proper division into training and test set (rather than cross-validation).
- Split the input sample into a training and test set with X percent and Y percent of the data points, respectively. It should hold that X + Y = 100 and X >> Y. A commonly used partition is X = 80 and Y = 20;
- Split the training set into a training and validation set with P percent and Q percent of the data points, respectively. It should hold that P + Q = 100 and P >> Q, for example P = 80 and Q = 20;
- Subsample random selections of your training data, train the algorithm with this selection and record performance on the validation set. Try a series of runs with different amounts of training data;
- To get a handle on variance due to the size of test data, perform the same procedure in reverse. Train on all of your training data, then randomly sample a percentage of your validation data a number of times, and observe performance. The mean performance on small samples of your validation data should be roughly the same as the performance on all the validation data, but the variance is much higher with smaller numbers of test samples;
- When all parameters are set, run the algorithm on your test data. This shows the performance on unobserved data.
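The steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the input sample is synthetic, the "model" is a deliberately trivial one that always predicts the training mean, and the helper names (split, fit_mean, mean_abs_error) are made up for this example.

```python
import random
import statistics


def split(data, first_fraction, rng):
    """Shuffle a copy of `data` and cut it into two non-overlapping parts."""
    shuffled = list(data)
    rng.shuffle(shuffled)
    cut = round(len(shuffled) * first_fraction)
    return shuffled[:cut], shuffled[cut:]


def fit_mean(training_values):
    """Hypothetical stand-in for a real model: always predicts the training mean."""
    return statistics.fmean(training_values)


def mean_abs_error(prediction, values):
    """Average absolute difference between the prediction and the observed values."""
    return statistics.fmean(abs(v - prediction) for v in values)


rng = random.Random(42)  # fixed seed so the run is repeatable
sample = [rng.gauss(100, 15) for _ in range(1000)]  # synthetic input sample

# Step 1: 80/20 split into training and test set.
train_full, test = split(sample, 0.8, rng)
# Step 2: 80/20 split of the training set into training and validation set.
train, validation = split(train_full, 0.8, rng)

# Step 3: train on random subsamples of growing size, score on the validation set.
for size in (10, 50, 200, len(train)):
    model = fit_mean(rng.sample(train, size))
    print("training size", size, "validation error", round(mean_abs_error(model, validation), 2))

# Step 4: train on all training data, score on small random validation subsamples.
model = fit_mean(train)
scores = [mean_abs_error(model, rng.sample(validation, 20)) for _ in range(30)]
print("spread of the error on 20-point samples:", round(statistics.stdev(scores), 2))

# Step 5: one final evaluation on the untouched test set.
print("test error:", round(mean_abs_error(model, test), 2))
```

With a real model, fit_mean and mean_abs_error would be replaced by the training routine and accuracy measure of your choice; the splitting logic stays the same.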
Regression analysis describes the relation between the dependent variable and an independent variable, given that the other independent variables (if any) are held fixed. It is assumed that the independent variables are linearly independent, i.e. none of the independent variables can be expressed as a linear combination of the other independent variables in the regression model. Regression analysis is most often used in those cases where the dependent variable is continuous and unconstrained.
Simple (single-variable) regression analysis considers the relation between the dependent variable and a single independent variable. It is often more interesting to study the relation between a dependent variable and multiple independent variables. This is called multiple regression analysis, and it is often loosely referred to as multivariate regression. The general multiple regression formula is
$$y = \alpha + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_n x_n + \varepsilon$$
where $y$ is the value of the dependent variable, $\alpha$ is the intercept, $\beta_i$ is the coefficient of independent variable $x_i$ for $i = 1, \dots, n$, and $\varepsilon$ is the unpredictable random disturbance term. The model parameters $\alpha$ and $\beta_i$ are estimated using the sample input data. Let $a$ be the estimate for $\alpha$ and $b_i$ the estimate for $\beta_i$ for $i = 1, \dots, n$. The estimated regression equation then equals
$$y^* = a + b_1 x_1 + b_2 x_2 + \dots + b_n x_n$$
The vertical difference between the observed value of the dependent variable ($y$) and the prediction based on the regression line ($y^*$) is called the residual ($e$), i.e. $e = y - y^*$.
A regression model is often built in such a way that the sum of squared differences between the observed values and the values predicted by the model is as small as possible. This is known as the method of least squares. The residuals are squared so that positive and negative error terms do not cancel each other out in the summation. It is important to remove those independent variables from the equation that are not significantly correlated with the dependent variable, one by one, starting with the variable that is least correlated. Keeping those uncorrelated variables in will result in a regression equation that overfits the training data.
Example: predict monthly sales (continuous dependent variable) given the monthly advertising expenditures and GDP.
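To make the method of least squares concrete, here is a minimal sketch for the simplest case of one independent variable, where the estimates have a closed form. The advertising and sales figures are made up for illustration, GDP is left out to keep the example short, and the function names are hypothetical.

```python
def fit_line(xs, ys):
    """Least-squares estimates (a, b) for y = a + b*x with one independent variable."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    b = sxy / sxx             # slope that minimizes the sum of squared residuals
    a = mean_y - b * mean_x   # intercept: the fitted line passes through the means
    return a, b


def residuals(xs, ys, a, b):
    """Observed value minus predicted value for every data point."""
    return [y - (a + b * x) for x, y in zip(xs, ys)]


# Made-up monthly figures (in thousands), for illustration only.
advertising = [10, 12, 15, 17, 20, 22]
sales = [118, 131, 149, 158, 181, 189]

a, b = fit_line(advertising, sales)
e = residuals(advertising, sales, a, b)
print("intercept a:", round(a, 2), "slope b:", round(b, 2))
print("sum of squared residuals:", round(sum(r * r for r in e), 2))
```

With several independent variables the same idea applies, but the estimates are usually obtained with a linear algebra routine rather than by hand.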
Support Vector Machines
Support Vector Machines (SVMs) are supervised learning models that analyze data for binary classification. Each data point in the training set belongs to one of two categories. The model built with an SVM training algorithm is used to assign new data points to one of the two categories, given the values of the independent variables, making it a non-probabilistic binary linear classifier.
Let N be the number of independent variables. An SVM generates an (N-1)-dimensional hyperplane to separate the points in the input sample into two classes. The hyperplane is constructed in such a way that it lies as far as possible from the nearest points of both classes. New examples are then mapped into that same space and predicted to belong to a category based on which side of the gap they fall on.
Let us assume that N equals two, so that the result of the SVM is a one-dimensional hyperplane (a straight line). In the graph below, two hyperplanes are drawn that separate the two types (blue squares and green circles) perfectly. The SVM algorithm will select line A as the optimal hyperplane. Can you see why?
The light orange area around the hyperplane is called the margin. The margin is defined as two times the distance from the hyperplane to the nearest observation. As mentioned before, the optimal hyperplane is the one that divides the points of the two categories by a clear gap that is as wide as possible. In other words, the hyperplane with the largest margin is the optimal hyperplane.
Example: predict the outcome of an exam, given the number of times the student attends the classes and the number of hours studied.
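The margin idea can be checked numerically. The sketch below does not train an SVM; it only compares two hand-picked candidate lines on a tiny made-up dataset and computes the margin of each, using the definition given above. The points, the two lines and the function names are all assumptions for illustration.

```python
import math


def signed_distance(w, b, point):
    """Signed distance from `point` to the hyperplane w.x + b = 0."""
    dot = sum(wi * xi for wi, xi in zip(w, point))
    return (dot + b) / math.hypot(*w)


def margin(w, b, points, labels):
    """Twice the distance to the nearest observation, or None if the
    hyperplane does not separate the two classes (labels are +1 and -1)."""
    distances = [label * signed_distance(w, b, p) for p, label in zip(points, labels)]
    if min(distances) <= 0:
        return None  # at least one point lies on the wrong side (or on the line)
    return 2 * min(distances)


# Two small, linearly separable groups (made-up coordinates).
points = [(3, 3), (4, 3), (3, 4), (0, 0), (1, 0), (0, 1)]
labels = [1, 1, 1, -1, -1, -1]

line_a = ((1, 1), -3.5)  # x1 + x2 = 3.5
line_b = ((1, 0), -2.0)  # x1 = 2

margin_a = margin(*line_a, points, labels)
margin_b = margin(*line_b, points, labels)
print("margin of line A:", round(margin_a, 3))
print("margin of line B:", round(margin_b, 3))
```

Both lines separate the two groups, but line A leaves a wider gap, which is the criterion an SVM optimizes over all possible separating hyperplanes.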
In contrast to supervised learning, unsupervised learning algorithms can be used when there is no predefined desired outcome. The training dataset given to the learning algorithm contains only the inputs, i.e. the labels for the data are missing. It is up to the algorithm to find the (hidden) patterns in the input dataset. Given the data, the model tries to find a general structure in the data, independent of any labeling.
Unsupervised learning algorithms can, for example, be used in the process of dividing the total market into groups of customers who have similar characteristics. The algorithm is built upon the raw data. A manual review for relevance and coding is important before the algorithm is brought to production. Model parameters need to be modified until the result achieves the desired properties. See the graphical representation below.
The aim of cluster analysis is to group the objects in such a way that objects with similar attributes end up in the same cluster. Objects in the same group have more in common with each other than with objects in other clusters. Cluster analysis is an iterative process of knowledge discovery or interactive multi-objective optimization. A commonly used technique is k-means clustering, whose aim is to partition the observations into k clusters such that each observation is assigned to the cluster with the nearest mean.
The k-means clustering algorithm works as follows.
- Define the value of k;
- Initialize the k cluster centroids. This can be done by using the first k observations as cluster centers, or by using seeds that are randomly selected or explicitly defined;
- Assign all data points to the nearest cluster seed, i.e. a data point is assigned to the cluster with the minimum distance from the data point to the cluster center;
- Recompute the cluster centroids using the current cluster memberships;
- If the convergence criterion is not met, return to the assignment step (step 3). Examples of convergence criteria are: no reassignment of data points, or a minimal change in the cluster centers.
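The procedure above translates almost line by line into code. The following is a minimal sketch, assuming Euclidean distance, initialization with the first k observations, and "no reassignment" as the convergence criterion; the data points are made up.

```python
import math


def kmeans(points, k, max_iter=100):
    """Plain k-means on a list of coordinate tuples.

    Returns (centroids, labels). The first k observations are used as
    the initial cluster centers.
    """
    centroids = [tuple(p) for p in points[:k]]
    labels = None
    for _ in range(max_iter):
        # Assignment step: each point goes to the nearest centroid.
        new_labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        if new_labels == labels:  # convergence: no point changed cluster
            break
        labels = new_labels
        # Update step: recompute each centroid as the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # an empty cluster keeps its previous centroid
                centroids[c] = tuple(sum(axis) / len(members) for axis in zip(*members))
    return centroids, labels


# Two visually obvious groups (made-up data).
data = [(0, 0), (10, 10), (0, 1), (1, 0), (10, 11), (11, 10)]
centroids, labels = kmeans(data, k=2)
print("centroids:", centroids)
print("labels:", labels)
```

Because the outcome depends on the initial centroids, in practice the algorithm is often run several times with different seeds and the best partition is kept.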
Reinforcement learning is something we all use in our daily lives without even noticing it. It is all about learning the most beneficial or efficient behavior within a particular environment. By ‘earning rewards’, we learn that the choice we made was the right one. Conversely, punishment teaches us that we should make a different choice in the future. Reinforcement learning can, for example, be used to find the shortest path in a maze from the entrance to the exit by trial and error (i.e. by running into dead ends). Techniques associated with this style of learning include Markov Decision Processes, Dynamic Programming and Genetic Algorithms. In the remainder of this article, we will focus on Genetic Algorithms.
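To give the maze example some shape, here is a small sketch based on Dynamic Programming (value iteration). Note the assumptions: the maze layout is known in advance, so this is planning with rewards rather than true trial-and-error learning, every step costs a reward of -1, and the maze itself and all function names are made up for illustration.

```python
MAZE = [
    "S.#",  # S = entrance, # = wall, . = corridor
    ".##",
    "..E",  # E = exit
]
MOVES = [(-1, 0), (1, 0), (0, -1), (0, 1)]  # up, down, left, right


def open_cells(maze):
    """All cells that are not walls."""
    return {(r, c) for r, row in enumerate(maze) for c, ch in enumerate(row) if ch != "#"}


def find(maze, symbol):
    """Coordinates of the first cell containing `symbol`."""
    for r, row in enumerate(maze):
        c = row.find(symbol)
        if c != -1:
            return (r, c)
    raise ValueError(f"{symbol!r} not found in maze")


def neighbours(cell, cells):
    r, c = cell
    return [(r + dr, c + dc) for dr, dc in MOVES if (r + dr, c + dc) in cells]


def value_iteration(maze):
    """Value of a cell = minus the number of steps still needed to reach the exit."""
    cells = open_cells(maze)
    exit_cell = find(maze, "E")
    values = {cell: 0 for cell in cells}
    for _ in range(len(cells)):  # enough sweeps for every reachable cell
        updated = {
            cell: 0 if cell == exit_cell
            else max((-1 + values[n] for n in neighbours(cell, cells)), default=values[cell])
            for cell in cells
        }
        if updated == values:
            break
        values = updated
    return values


def greedy_path(maze, values):
    """Follow the highest-valued neighbour from the entrance to the exit."""
    cells = open_cells(maze)
    cell, exit_cell = find(maze, "S"), find(maze, "E")
    path = [cell]
    while cell != exit_cell and len(path) <= len(cells):
        cell = max(neighbours(cell, cells), key=values.get)
        path.append(cell)
    return path


values = value_iteration(MAZE)
path = greedy_path(MAZE, values)
print("steps needed from the entrance:", -values[find(MAZE, "S")])
print("path:", path)
```

The dead end next to the entrance receives a lower value than the corridor leading to the exit, which is the "punishment" that steers the path away from it.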
Genetic Algorithms can be used to solve problems that can be written as mathematical formulations, with an objective function and one or more constraints. Depending on the problem, the objective function should be either minimized or maximized. The overall goal of the Genetic Algorithm is to emulate the natural selection process. The population of candidate solutions, further referred to as the individuals of the population, will evolve into a population with better solutions by repeatedly applying mutations and crossovers. Natural selection implies that individuals who are better adapted to their environment have a higher chance of surviving than weaker individuals. The same mechanism can be used to find a solution for the Vehicle Routing Problem and its variations.
The figure below shows a simplified flow chart of the most general Genetic Algorithm. The first step is to create an initial population of candidate solutions, either manually or with a specific algorithm. The fitness of each of these solutions is evaluated. That is, the value of the objective function is calculated for each individual. A penalty is given to those candidate solutions that are not feasible. The group of candidate solutions needs to be well-diversified. For efficient search, it is important that there is a proper balance between genetic quality and diversity within the population.
New individuals are created using crossover between two or more candidates. Small mutations are applied to these children to keep the population sufficiently diversified. The resulting candidate solutions will only be considered if they are unique in the population. In general, the new offspring will replace the least adapted population member. Another option is to add the offspring to the population without any replacement. A Genetic Algorithm is an iterative process, which terminates when a pre-specified number of generations has been produced, when the solution quality is sufficient or when the objective value has barely improved over the previous X generations.
Example: Find reasonably good solutions for complex (capacitated) vehicle routing problems.
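A stripped-down version of such a Genetic Algorithm is sketched below for a single vehicle that has to visit every customer once and return to the depot. Several simplifications are assumed: there are no capacity constraints (so no penalties for infeasible solutions), parents are picked uniformly at random, the crossover is a simple order crossover, and the run stops after a fixed number of generations. The customer coordinates and all names are made up for illustration.

```python
import math
import random


def route_length(route, coords, depot=(0.0, 0.0)):
    """Objective function: depot -> customers in `route` order -> depot."""
    stops = [depot] + [coords[i] for i in route] + [depot]
    return sum(math.dist(p, q) for p, q in zip(stops, stops[1:]))


def order_crossover(parent_a, parent_b, rng):
    """Keep a slice of parent_a, fill the remaining stops in parent_b's order."""
    i, j = sorted(rng.sample(range(len(parent_a) + 1), 2))
    kept = parent_a[i:j]
    rest = [gene for gene in parent_b if gene not in kept]
    return rest[:i] + kept + rest[i:]


def mutate(route, rng, rate=0.2):
    """With probability `rate`, swap two randomly chosen stops."""
    route = list(route)
    if rng.random() < rate:
        i, j = rng.sample(range(len(route)), 2)
        route[i], route[j] = route[j], route[i]
    return route


def genetic_algorithm(coords, pop_size=30, generations=500, seed=0):
    """Return (best_route, history), where history holds the best length per generation."""
    rng = random.Random(seed)
    customers = list(range(len(coords)))

    def fitness(route):
        return route_length(route, coords)

    # Initial population: random visiting orders.
    population = [rng.sample(customers, len(customers)) for _ in range(pop_size)]
    history = []
    for _ in range(generations):
        parent_a, parent_b = rng.sample(population, 2)
        child = mutate(order_crossover(parent_a, parent_b, rng), rng)
        if child not in population:  # only unique offspring are considered
            worst = max(range(pop_size), key=lambda i: fitness(population[i]))
            population[worst] = child  # replace the least adapted member
        history.append(min(fitness(r) for r in population))
    return min(population, key=fitness), history


# Made-up customer locations around a depot at (0, 0).
CUSTOMERS = [(1, 0), (1, 1), (0, 1), (-1, 1), (-1, 0), (-1, -1), (0, -1), (1, -1)]
best_route, history = genetic_algorithm(CUSTOMERS)
print("best route found:", best_route)
print("route length:", round(route_length(best_route, CUSTOMERS), 3))
```

Because the offspring always replaces the weakest member, the best solution in the population never gets worse from one generation to the next. A capacitated variant would add a penalty term to route_length for routes that violate the vehicle capacity.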
A glorious future for machine learning
Data storage costs are incredibly low nowadays. As a result, companies store large amounts of data on a daily basis. This data can give insights that might lead to an improved customer experience, company growth and higher profits. Although machine learning is already commonly used, there are still sectors and companies that could benefit considerably from data analytics.
Machine learning is a powerful tool for applying advanced analytics to big datasets and for creating truly meaningful intelligence that can help a company grow and remain operational in the future. We expect that machine learning will develop rapidly in the coming years. Improved and new machine learning techniques will result in powerful data insights that can be used to optimize business processes. It is important to set up a business case when using machine learning. Describe what information you would like to gather from the dataset, what actions you will take given this information and what the benefits are for your company. Experts in this field of Artificial Intelligence can help you select the machine learning technique that best fits your data and goals. Machine learning techniques are powerful, so make sure that your company uses them effectively and efficiently!