Machine learning is a technique that combines traditional mathematics with modern, powerful computational processing to learn the patterns contained in data sets. The goal is to produce an algorithm that can use these patterns to perform a specified task.
In the case of supervised machine learning, the goal may be to develop a model that identifies which category or class an input belongs to, or that estimates a continuous value such as the price of a house.
In this article, I am going to cover some key concepts in machine learning. If you are new to machine learning, it will give you a good understanding of some of the terminology and techniques used in this field.
Top 6 Machine Learning Concepts
1. Features
In machine learning, the inputs we talked about above are called features. Features are a set of attributes assigned to a data point.
The example used here is a well-known data set, commonly used for machine learning practice problems, known as “Boston housing prices”.
It contains a set of features related to a house, such as the average number of rooms and the property tax value, along with the price of the house.
In order for a machine learning model to be successful in performing its task, a statistical relationship must exist between at least some of these features and the price of the house.
2. Feature selection and engineering
One important step in developing a machine learning model is optimization. The model we develop needs to perform as well as it possibly can, and one way to ensure this is to train it using the best features.
It is not always useful to include every feature. Some features may not have any meaningful statistical relationship with the variable we are trying to predict, while others may be closely related to each other. Both of these scenarios introduce noise into the training phase, which may degrade model performance.
Feature selection is the process of selecting the optimal features to include in the training phase.
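One simple way to sketch feature selection is to rank each feature by the strength of its correlation with the value being predicted and keep only those above a cut-off. The toy data, the feature names (including the made-up house_id column) and the 0.5 threshold below are all assumptions for illustration, and this is only one of many possible selection methods.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Hypothetical toy data: each list holds one feature for six houses.
features = {
    "avg_rooms": [4.0, 5.0, 6.0, 6.5, 7.0, 8.0],
    "tax_rate": [400, 380, 300, 310, 250, 220],
    "house_id": [4, 1, 6, 2, 5, 3],  # arbitrary identifier, unrelated to price
}
price = [15.0, 18.0, 24.0, 25.0, 31.0, 38.0]

def select_features(features, target, threshold=0.5):
    """Keep features whose absolute correlation with the target meets the threshold."""
    scores = {name: pearson(values, target) for name, values in features.items()}
    selected = [name for name, r in scores.items() if abs(r) >= threshold]
    return selected, scores

selected, scores = select_features(features, price)
```

Here the identifier column is dropped because it carries almost no information about price, while the two house-related features are kept. Note that this sketch does not handle the second scenario, where two features are closely related to each other.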
Similarly, features in their raw form may not provide enough meaningful information to train a performant model. In addition, some features cannot be used in their raw form at all; a good example of this is date and time-based features. A machine learning model cannot use a date or timestamp directly, so we first need to derive meaningful features from it to be able to include this information.
We can use parts of a date in integer form, such as the month, day, or week number, or calculate the difference between two dates to give the algorithm a pattern it can understand. This is known as feature engineering.
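As a minimal sketch of date-based feature engineering, the function below turns two dates into integer features using only Python's standard datetime module. The idea of a sale date and a build date, and the feature names, are hypothetical and not taken from the Boston data set.

```python
from datetime import date

def date_features(sale_date, built_date):
    """Derive numeric features from two dates (hypothetical sale and build dates)."""
    return {
        "sale_month": sale_date.month,
        "sale_day_of_week": sale_date.weekday(),        # Monday = 0, Sunday = 6
        "sale_week_of_year": sale_date.isocalendar()[1],  # ISO week number
        "age_in_days": (sale_date - built_date).days,   # difference between two dates
    }

example = date_features(date(2020, 3, 15), date(2010, 3, 15))
```

Each value in the returned dictionary is a plain integer, so it can be passed to a model alongside the other numeric features.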
3. Labels
Supervised machine learning requires something known as labelled data. This means data where each example has a corresponding label. These labels can be a category or type, such as cat or dog, or a continuous value, as is the case in the Boston housing price data set, where the label is the price.
When developing machine learning models, features are often referred to as X and labels as Y.
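To make the X and Y naming concrete, here is a small sketch that separates a list of records into a feature matrix and a list of labels. The records and their values are invented for illustration.

```python
# Hypothetical records: two features plus the price label for each house.
records = [
    {"avg_rooms": 6.5, "tax_rate": 296, "price": 24.0},
    {"avg_rooms": 6.4, "tax_rate": 242, "price": 21.6},
    {"avg_rooms": 7.2, "tax_rate": 242, "price": 34.7},
]

feature_names = ["avg_rooms", "tax_rate"]

# X holds one row of feature values per data point; Y holds the matching labels.
X = [[row[name] for name in feature_names] for row in records]
Y = [row["price"] for row in records]
```

Row i of X and element i of Y always describe the same house, which is what lets an algorithm relate features to labels.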
4. Training
Supervised machine learning requires labelled data because algorithms use these example feature values and their corresponding labels to ‘learn’ patterns which, if learning is successful, enable the model to accurately predict labels for new, unlabelled data.
This stage of the machine learning process is known as the training phase. At the end of this step, you have a model that can be used to predict the label or value for new, unlabelled data. The training phase is often referred to as fitting a model.
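The training phase can be illustrated with about the simplest model there is: a straight line fitted to one feature by ordinary least squares. This is a sketch under the assumption of a single feature and made-up data; real models are usually fitted with a library and many features.

```python
def fit_line(xs, ys):
    """Fit y = slope * x + intercept by ordinary least squares."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    slope = (
        sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
        / sum((x - mean_x) ** 2 for x in xs)
    )
    intercept = mean_y - slope * mean_x
    return slope, intercept

def predict(model, x):
    """Use the fitted parameters to estimate a label for a new feature value."""
    slope, intercept = model
    return slope * x + intercept

# Hypothetical training data: average number of rooms (feature) and price (label).
rooms = [4.0, 5.0, 6.0, 7.0, 8.0]
price = [10.0, 15.0, 20.0, 25.0, 30.0]

model = fit_line(rooms, price)   # training phase: learn the pattern from labelled data
estimate = predict(model, 6.5)   # prediction for a new, unlabelled example
```

The fitted slope and intercept are the "pattern" the model has learned; prediction simply applies them to a feature value the model has not seen.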
5. Tuning
I previously discussed an optimization process when describing feature selection in this post. Another part of this process is known as tuning, and involves optimizing the parameters of an algorithm to find the best combination for your specific data set.
All machine learning models have parameters that can take multiple values. For example, a random forest model has many tunable parameters. One example is n_estimators, which determines the number of trees in the forest.
Usually, the larger the number of trees the better the result, but beyond a certain point (which depends on the data set) the improvement diminishes as you add more trees. Finding the optimal number of trees for your data set is one way to tune the parameters of a random forest algorithm.
Each algorithm has several tunable parameters and each parameter has a potentially large number of options. Fortunately, there are automated methods to find the optimal combination of these parameters and this is known as hyperparameter optimization.
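One possible way to automate this search is a grid search with cross-validation. The sketch below assumes the scikit-learn library, which the article does not name explicitly, and uses synthetic data in place of a real data set; the candidate values for n_estimators are arbitrary.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

# Synthetic regression data stands in for a real data set.
X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)

# Candidate values to try; these particular numbers are arbitrary.
param_grid = {"n_estimators": [10, 50, 100]}

search = GridSearchCV(
    RandomForestRegressor(random_state=0),
    param_grid,
    cv=3,  # 3-fold cross-validation for each candidate value
    scoring="neg_mean_absolute_error",
)
search.fit(X, y)

# The candidate that scored best on average across the folds.
best_n_estimators = search.best_params_["n_estimators"]
```

Adding more keys to param_grid makes the search try every combination of values, so the cost grows quickly with the number of parameters being tuned.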
6. Validation
Once a model has been built, we need to determine how well it performs the given task. In our example, we would like to understand how accurately the model can estimate the price of a house. In machine learning, it is important to establish the best performance metric, and this will vary depending on the problem we are solving.
Usually, when starting a machine learning project, we first divide the data set we are working with into two parts. One is used for training the model and the other is used for the testing phase.
Testing in machine learning is commonly referred to as validation. We use the model to make predictions on the reserved test data set and measure the chosen performance metric to determine how well the model performs the given task.
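The split-then-measure workflow can be sketched in a few lines of plain Python. Everything here is an assumption for illustration: the data is made up, mean absolute error is just one possible metric for a price estimate, and the "model" is a placeholder that always predicts the average training price.

```python
import random

def split_train_test(rows, test_fraction=0.25, seed=0):
    """Shuffle the rows and hold back a fraction of them for testing."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    n_test = max(1, int(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]

def mean_absolute_error(actual, predicted):
    """Average absolute difference between true and predicted values."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)

# Hypothetical (avg_rooms, price) pairs.
data = [(4.0, 11.0), (4.5, 12.0), (5.0, 15.5), (5.5, 17.0),
        (6.0, 20.5), (6.5, 22.0), (7.0, 25.5), (8.0, 29.0)]

train, test = split_train_test(data)

# Placeholder "model": always predict the mean price seen in training.
mean_price = sum(price for _, price in train) / len(train)
predictions = [mean_price for _ in test]

# Score the predictions only on data the model never saw during training.
mae = mean_absolute_error([price for _, price in test], predictions)
```

A trivial baseline like this is a useful reference point: a trained model is only worth keeping if its error on the test set is clearly lower.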