Models
In this section, we go through the types of AI models available in MindsDB. These are regression models, classification models, time series models, and large language models (LLMs).
Disclaimer
In this section, we describe the default behavior using the Lightwood ML engine for regression, classification, and time series models. Other ML handlers may behave differently. For example, some may not perform validation automatically when creating a model, as numerous behaviors are handler-specific.
What is an AI Model?
A machine learning (ML) model is a program trained on available data to recognize patterns and behaviors and predict future data. There are various types of AI models that use different learning paradigms, but MindsDB models are all supervised: they learn from pairs of input data and expected output.
You input data into an AI model. It processes this input data, searching for patterns and correlations. After that, the AI model returns output derived from the input data.
Features
Features are variables that an AI model uses as input data to search for patterns and predict the target variable. In tabular datasets, features usually correspond to single columns.
Target
The target is the variable of interest that an AI model predicts based on the information extracted from the features.
Training Dataset
The training dataset is used during the training phase of an AI model. It contains both the feature variables and the target variable and, as its name indicates, is used to train the model.
The AI model takes the entire training dataset as input. It learns the patterns and relationships between feature variables and target values.
Once the training process is complete, one can move on to the validation phase.
Validation Dataset
The validation dataset is used during the validation phase of an AI model. Like the training dataset, it contains both the feature variables and the target variable, but, as its name indicates, it is used to validate the predictions made by the model. It has no overlap with the training dataset; it is a held-out set that simulates a real scenario where the model makes predictions for novel input data.
The AI model takes only the feature variables from the validation dataset as input. Based on what it learned during the training process, it makes predictions for the values of the target variable.
Now comes the validation step. To assess the accuracy of the AI model, one compares the target variable values from the validation dataset with the target variable values predicted by the model. The closer these values are to each other, the better the accuracy of the model.
Input Dataset
After completing the training and validation phases, one can provide the input dataset consisting of only the feature variables to predict the target variable values.
How is an AI Model Created?
In MindsDB, we use the `CREATE MODEL` statement to create, train, and validate a model.
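The general form of the statement is sketched below; the project, integration, table, and column names are placeholders rather than values taken from this guide.

```sql
-- General form of a MindsDB model definition (all names are placeholders)
CREATE MODEL mindsdb.my_model                 -- project_name.model_name
FROM my_integration                           -- a connected data source
    (SELECT feature_1, feature_2, target_column FROM my_table)
PREDICT target_column;                        -- the column to be predicted
```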
Training Phase
Let’s look at our training dataset. It contains both features and a target.
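A query along the lines of the sketch below can be used to preview it; the `example_db` integration and the `salary_data` table are illustrative names.

```sql
-- Preview the training data (integration and table names are assumed)
SELECT *
FROM example_db.salary_data
LIMIT 10;
```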
On execution, we get a preview of the data rows.

Here, the features are `companyId`, `jobType`, `degree`, `major`, `industry`, `yearsExperience`, and `milesFromMetropolis`.

And the target variable is `salary`.
Let’s create and train an AI model using this training dataset.
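A statement such as the following does that; the model name and the data source are assumed for illustration.

```sql
-- Create and train a model on the salary data (names are assumed)
CREATE MODEL mindsdb.salary_predictor
FROM example_db
    (SELECT * FROM salary_data)
PREDICT salary;
```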
On execution, we get a record describing the newly created model and its status.
Progress
Here is how to check whether the training process is completed:
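One way to do it is to query the project's `models` table, assuming the model was named `salary_predictor` as in the sketch above.

```sql
-- Check the training status of the model
SELECT name, status
FROM mindsdb.models
WHERE name = 'salary_predictor';
```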
Once the status is `complete`, the training phase is finished.
Validation Phase
By default, the `CREATE MODEL` statement performs validation of the model.
Additionally, we can validate the model manually by querying it and providing the feature values in the `WHERE` clause like this:
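The feature values below are arbitrary examples, and the model name `salary_predictor` is the one assumed in the sketches above.

```sql
-- Predict the salary for a single set of feature values (values are arbitrary)
SELECT salary
FROM mindsdb.salary_predictor
WHERE companyId = 'COMP37'
  AND jobType = 'MANAGER'
  AND degree = 'MASTERS'
  AND major = 'MATH'
  AND industry = 'FINANCE'
  AND yearsExperience = 10
  AND milesFromMetropolis = 25;
```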
On execution, we get the predicted salary values. By comparing the real salary values of the defined individuals with the predicted salary values, one can assess the accuracy of the AI model.
Please note that MindsDB calculates the model’s accuracy by default while running the `CREATE MODEL` statement. However, it is not guaranteed that all ML engines do this.
By default, the `CREATE MODEL` statement does the following:
- it creates a model,
- it divides the input data into training and validation datasets,
- it trains a model using the training dataset,
- it validates a model using the validation dataset,
- it compares the true and predicted values of a target to define the model’s accuracy.
Let’s look at the basic types of AI models.
AI Model Types
Regression Models
Regression is a type of predictive modeling that analyzes the input data, namely the relationships between the independent variables (features) and the dependent variable (target) that is to be predicted.
In the case of regression models, the target variable belongs to a set of continuous values. For example, having data on real estate, such as the number of rooms, location, and rental price, one can predict the rental price using regression. The rental price is predicted based on the input data, and its value falls within the range between the minimum and maximum rental prices observed in the training data.
Example
First, let’s look at our input data.
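A preview query might look as follows; the `example_db.demo_data.home_rentals` table is an assumed name for the rental dataset.

```sql
-- Preview the home rentals data (table name is assumed)
SELECT *
FROM example_db.demo_data.home_rentals
LIMIT 10;
```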
On execution, we get a preview of the data rows.

Here, the features are `number_of_rooms`, `number_of_bathrooms`, `sqft`, `location`, `days_on_market`, and `neighborhood`.

And the target variable is `rental_price`.
Let’s create and train an AI model.
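A sketch of the statement, assuming the same table name as above:

```sql
-- Create and train a regression model for rental prices (names are assumed)
CREATE MODEL mindsdb.home_rentals_model
FROM example_db
    (SELECT * FROM demo_data.home_rentals)
PREDICT rental_price;
```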
On execution, we get a record describing the newly created model and its status.
Once the training process is completed, we can query for predictions.
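For example, assuming the `home_rentals_model` from the sketch above, with arbitrary feature values:

```sql
-- Predict the rental price for a single property (feature values are arbitrary)
SELECT rental_price
FROM mindsdb.home_rentals_model
WHERE number_of_rooms = 2
  AND number_of_bathrooms = 1
  AND sqft = 1000
  AND location = 'great'
  AND days_on_market = 10
  AND neighborhood = 'south_side';
```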
On execution, we get the predicted rental price value.
For details, check out this tutorial.
Classification Models
Classification is a type of predictive modeling that analyzes the input data, namely the relationships between the independent variables (features) and the dependent variable (target) that is to be predicted.
In the case of classification models, the target variable belongs to a set of discrete values. For example, having data on each customer of a telecom company, one can predict churn using classification. The churn value is predicted based on the input data, and it is either `Yes` or `No`. This is a special case called binary classification.
Example
First, let’s look at our input data.
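A preview query might look as follows; the `example_db.demo_data.customer_churn` table is an assumed name for the churn dataset.

```sql
-- Preview the customer churn data (table name is assumed)
SELECT *
FROM example_db.demo_data.customer_churn
LIMIT 10;
```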
On execution, we get a preview of the data rows.

Here, the features are `customerid`, `gender`, `seniorcitizen`, `partner`, `dependents`, `tenure`, `phoneservice`, `multiplelines`, `internetservice`, `onlinesecurity`, `onlinebackup`, `deviceprotection`, `techsupport`, `streamingtv`, `streamingmovies`, `contract`, `paperlessbilling`, `paymentmethod`, `monthlycharges`, and `totalcharges`.

And the target variable is `churn`.
Let’s create and train an AI model.
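A sketch of the statement, assuming the same table name as above:

```sql
-- Create and train a classification model for churn (names are assumed)
CREATE MODEL mindsdb.customer_churn_predictor
FROM example_db
    (SELECT * FROM demo_data.customer_churn)
PREDICT churn;
```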
On execution, we get a record describing the newly created model and its status.
Once the training process is completed, we can query for predictions.
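For example, assuming the `customer_churn_predictor` model from the sketch above; only a subset of the feature columns is passed, and the values are arbitrary:

```sql
-- Predict churn for a single customer (feature values are arbitrary)
SELECT churn
FROM mindsdb.customer_churn_predictor
WHERE seniorcitizen = 0
  AND partner = 'Yes'
  AND dependents = 'No'
  AND tenure = 1
  AND internetservice = 'DSL'
  AND contract = 'Month-to-month'
  AND monthlycharges = 29.85;
```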
On execution, we get the predicted churn value.
For details, check out this tutorial.
Time Series Models
Time series models fall under the regression or classification category. What’s distinct about them is that the data is ordered by date, time, or any other value defining the sequential order of events. Usually, predictions made by time series models are referred to as forecasts.
A time series model predicts a target that comes from a continuous set (regression) or a discrete set (classification).
There is a mandatory `ORDER BY` clause followed by a sequential column, such as a date. It orders all the rows accordingly.
If you want to group your predictions, there is an optional `GROUP BY` clause. By following this clause with a column name, or multiple column names, one can make predictions for partitions of data defined by these columns.
In the case of time series models, one should define how many rows the model looks back at when making a forecast. The `WINDOW` clause followed by an integer does just that.
There is an optional `HORIZON` clause where you can define how many rows, or how far into the future, you want to predict. By default, it is one.
Example
First, let’s look at our input data.
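A preview query might look as follows; the `example_db.demo_data.house_sales` table is an assumed name for the house sales dataset.

```sql
-- Preview the house sales data (table name is assumed)
SELECT *
FROM example_db.demo_data.house_sales
LIMIT 10;
```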
On execution, we get a preview of the data rows.

Here, the features are `saledate`, `type`, and `bedrooms`.

And the target variable is `ma`.
Let’s create and train an AI model.
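A sketch of the statement, assuming the table name above; the `WINDOW` and `HORIZON` values are illustrative.

```sql
-- Create and train a time series model (names and window/horizon values are illustrative)
CREATE MODEL mindsdb.house_sales_model
FROM example_db
    (SELECT * FROM demo_data.house_sales)
PREDICT ma
ORDER BY saledate
GROUP BY bedrooms, type
WINDOW 8      -- look back at the last 8 rows per group
HORIZON 4;    -- forecast 4 rows into the future
```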
On execution, we get a record describing the newly created model and its status.
Once the training process is completed, we can query for predictions.
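For time series models, predictions are typically obtained by joining the model with the input data; the sketch below assumes the names used above and forecasts beyond the latest available date for one group.

```sql
-- Forecast future values for one partition of the data (names are assumed)
SELECT m.saledate AS date, m.ma AS forecast
FROM mindsdb.house_sales_model AS m
JOIN example_db.demo_data.house_sales AS t
WHERE t.saledate > LATEST
  AND t.type = 'house'
  AND t.bedrooms = 2
LIMIT 4;
```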
On execution, we get the forecasted values.
For details, check out this tutorial.
Large Language Models
Large language models are advanced artificial intelligence systems designed to process and generate human-like language. These models leverage deep learning techniques, such as transformer architectures, to analyze vast amounts of text data and learn complex patterns and relationships within the language.
Large language models have applications in chatbots, content generation, language translation, sentiment analysis, and various natural language processing tasks.
Example
Check out examples here:
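As a minimal illustration, and assuming the OpenAI engine is configured in your MindsDB instance, an LLM can be created and queried through the same SQL interface; the model and parameter values below are examples only.

```sql
-- Create an LLM-backed model using the OpenAI engine (values are examples)
CREATE MODEL mindsdb.question_answerer
PREDICT answer
USING
    engine = 'openai',
    model_name = 'gpt-4o',
    prompt_template = 'Answer the following question concisely: {{question}}';

-- Query it like any other model
SELECT question, answer
FROM mindsdb.question_answerer
WHERE question = 'What is a large language model?';
```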
How it Works in the Background
MindsDB uses the Lightwood ML engine by default. This section takes a closer look at how this package automatically chooses what type of model to use.
Models in Lightwood follow an encoder-mixer-decoder pattern, where refined, or encoded, representations of all features are mixed to produce target predictions. Here are the mixers used by Lightwood.
Please note that there is an ensembling step after training all mixers in case multiple mixers are used. Read on to learn more.
To give you some details on how MindsDB creates a model using different mixers, here is the full code.
And here comes the breakdown:
- This piece of code adds mixers to the `submodels` array depending on the model type and the data type of the target variable.
- And here, we choose the best of the submodels to be used to create, train, and validate our AI model.
Let’s dive into the details of how MindsDB picks the mixers.
Here is the piece of code being analyzed.
If we deal with a simple encoder/decoder pair performing the task, we use the Unit mixer that can be thought of as a bypass mixer.
A good example is the Spam Classifier model of Hugging Face because it uses a single column as input.
Otherwise, we choose from a range of other mixers depending on the following conditions:
- If it is not a time series case, we use the Neural mixer. A good example is the Customer Churn model.
- If it is a time series case, we use the NeuralTs mixer. A good example is the House Sales model.
MindsDB may use one or multiple mixers while preparing a model. Depending on the model type and the data type of the target variable, either one mixer is chosen or a set of mixers is ensembled to create, train, and validate an AI model.
The three cases above describe how MindsDB chooses the mixer candidates and stores them in the `submodels` array.
By default, after training all relevant mixers in the `submodels` array, MindsDB uses the BestOf ensemble to single out the best mixer as the final model.
But you can always use a different ensemble that may aggregate multiple mixers per model, such as the MeanEnsemble, ModeEnsemble, StackedEnsemble, TsStackedEnsemble, or WeightedMeanEnsemble ensemble type. Here, you’ll find implementations of all ensemble types.
Next Steps
Below are the links to help you explore further.