Predicting Die Hard fans with ML.NET and C#

We have recently been digging deeper into ML.NET, which will result in a range of new features on elmah.io. While the documentation from Microsoft is good, it is split into multiple pieces, which can make it hard to figure out how to build a real-world example with ML.NET. In this post, I will show you one of the pieces that I believe is missing: how to train and retrain a model.

My previous blog post about ML.NET, Find anomalies with spike detection and ML.NET, was based on a simple fire-and-forget part of ML.NET where you don't need to train a model. In this post, the goal is to train a model to make movie recommendations. To simplify the example, I want to predict whether a person likes Die Hard, based on their movie preferences.

I know, nobody hates Die Hard, but play along to learn something new (hopefully) about ML.NET 😂

To start building the example, create a new .NET console app:

dotnet new console

Next, install the Microsoft.ML NuGet package that we will need to train the model and make suggestions:

dotnet add package Microsoft.ML

Launch the project in Visual Studio or Visual Studio Code. To train a model and make predictions, we need a class to hold the input data:

class MoviePreferenceInput
{
    public float StarWarsScore { get; set; }
    public float ArmageddonScore { get; set; }
    public float SleeplessInSeattleScore { get; set; }

    public bool ILikeDieHard { get; set; }
}

That's a simple POCO with the user's score from 0-10 for three popular movies (one of them featuring Bruce Willis), as well as a bool indicating whether the user likes Die Hard.

Then add the following overall structure to the Main method:

// Requires: using System.Collections.Generic; using System.IO; using Microsoft.ML;
var mlContext = new MLContext();
var trainingData = new List<MoviePreferenceInput>();
var dieHardLover = new MoviePreferenceInput
{
    StarWarsScore = 8,
    ArmageddonScore = 10,
    SleeplessInSeattleScore = 1,
    ILikeDieHard = true
};
var dieHardHater = new MoviePreferenceInput
{
    StarWarsScore = 1,
    ArmageddonScore = 1,
    SleeplessInSeattleScore = 9,
    ILikeDieHard = false
};

for (var i = 0; i < 100; i++)
{
    trainingData.Add(dieHardLover);
    trainingData.Add(dieHardHater);
}

IDataView trainingDataView = mlContext.Data.LoadFromEnumerable(trainingData);

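// Declared up front so that both branches can assign it.
ITransformer model;
// Train from scratch unless both files from an earlier run exist on disk.
if (!File.Exists("./diehard-model.zip") || !File.Exists("./diehard-pipeline.zip"))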
{
    model = TrainNewModel(mlContext, trainingDataView);
}
else
{
    model = RetrainModel(mlContext, trainingDataView);
}

The first line creates the MLContext that every interaction with ML.NET needs (similar to the data context in Entity Framework). Next, I have hardcoded two types of people: one that loves Die Hard and one that doesn't. For a real-life sample, you would get this data from a source like IMDb or similar. I'm adding 100 of each instance to the list of training data just to have something to train on. Again, in a real-life scenario, different users would have different ratings. Next, I've added an if/else to handle both the initial training of the model and retraining on subsequent runs. I'm saving the trained model on disk, but it could just as well be Azure Blob Storage, Amazon S3, or similar.
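To make the point about varied ratings a bit more concrete, here is a minimal sketch of how the two hardcoded archetypes could be replaced with noisy, synthetic ratings. This is not part of the original sample: the SyntheticRatings class, the Jitter helper, the noise range, and the seed are all assumptions made for illustration, and synthetic data is still no substitute for real ratings.

```csharp
using System;
using System.Collections.Generic;

class MoviePreferenceInput
{
    public float StarWarsScore { get; set; }
    public float ArmageddonScore { get; set; }
    public float SleeplessInSeattleScore { get; set; }

    public bool ILikeDieHard { get; set; }
}

// Hypothetical helper class, not part of ML.NET or the original sample.
static class SyntheticRatings
{
    // Adds integer noise in [-spread, spread] to a base score and clamps the result to 0-10.
    private static float Jitter(Random random, float baseScore, int spread)
    {
        float noisy = baseScore + random.Next(-spread, spread + 1);
        return Math.Clamp(noisy, 0f, 10f);
    }

    // Returns usersPerGroup Die Hard lovers and usersPerGroup Die Hard haters.
    // The fixed seed is an assumption that keeps runs reproducible.
    public static List<MoviePreferenceInput> Generate(int usersPerGroup, int seed = 42)
    {
        var random = new Random(seed);
        var data = new List<MoviePreferenceInput>();

        for (var i = 0; i < usersPerGroup; i++)
        {
            data.Add(new MoviePreferenceInput
            {
                StarWarsScore = Jitter(random, 8, 2),
                ArmageddonScore = Jitter(random, 10, 2),
                SleeplessInSeattleScore = Jitter(random, 1, 2),
                ILikeDieHard = true
            });
            data.Add(new MoviePreferenceInput
            {
                StarWarsScore = Jitter(random, 1, 2),
                ArmageddonScore = Jitter(random, 1, 2),
                SleeplessInSeattleScore = Jitter(random, 9, 2),
                ILikeDieHard = false
            });
        }

        return data;
    }
}
```

Calling SyntheticRatings.Generate(100) returns 200 rows that could be passed to mlContext.Data.LoadFromEnumerable in place of the trainingData list.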

Let's start by looking at how to train the model on the first run:

private static ITransformer TrainNewModel(MLContext mlContext, IDataView trainingDataView)
{
    var dataPrepPipeline = mlContext
        .Transforms
        .Concatenate(
            outputColumnName:"Features",
            "StarWarsScore",
            "ArmageddonScore",
            "SleeplessInSeattleScore")
        .AppendCacheCheckpoint(mlContext);

    var prepPipeline = dataPrepPipeline.Fit(trainingDataView);

    mlContext.Model.Save(prepPipeline, trainingDataView.Schema, "./diehard-pipeline.zip");

    var trainer = dataPrepPipeline.Append(mlContext
        .BinaryClassification
        .Trainers
        .AveragedPerceptron(
            labelColumnName: "ILikeDieHard",
            numberOfIterations: 10,
            featureColumnName: "Features"));

    var preprocessedData = prepPipeline.Transform(trainingDataView);
    var model = trainer.Fit(preprocessedData);
    mlContext.Model.Save(model, trainingDataView.Schema, "./diehard-model.zip");
    return model;
}

If you haven't seen ML.NET code before, this can look complicated. Let's go through it line by line. I start by creating a data preparation pipeline for ML.NET to translate the input data into something that can later become a model to make predictions with. The Concatenate method takes the property names of the three movie scores from the MoviePreferenceInput class and concatenates them into a single property (a column in ML.NET lingo) named Features. There's also a call to AppendCacheCheckpoint, which I won't go into detail about. It's an optional method that can improve performance.
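If you want to see what Concatenate actually produces, the sketch below reads the Features column back as plain float arrays. It is only meant as a debugging aid under a few assumptions: the FeaturesRow class and the FeaturesPeek helper are hypothetical names that are not part of the sample, and the cache checkpoint is left out since it doesn't change the data.

```csharp
using System.Collections.Generic;
using System.Linq;
using Microsoft.ML;

class MoviePreferenceInput
{
    public float StarWarsScore { get; set; }
    public float ArmageddonScore { get; set; }
    public float SleeplessInSeattleScore { get; set; }

    public bool ILikeDieHard { get; set; }
}

// Hypothetical class used only to read the Features column back out of the IDataView.
class FeaturesRow
{
    public float[] Features { get; set; }
}

// Hypothetical helper, not part of ML.NET or the original sample.
static class FeaturesPeek
{
    public static List<float[]> ReadFeatures(
        MLContext mlContext,
        IEnumerable<MoviePreferenceInput> rows)
    {
        IDataView data = mlContext.Data.LoadFromEnumerable(rows);

        // Same Concatenate call as in TrainNewModel, without the cache checkpoint.
        IDataView transformed = mlContext
            .Transforms
            .Concatenate(
                "Features",
                "StarWarsScore",
                "ArmageddonScore",
                "SleeplessInSeattleScore")
            .Fit(data)
            .Transform(data);

        // Each row's Features array holds the three scores in the order they were listed.
        return mlContext.Data
            .CreateEnumerable<FeaturesRow>(transformed, reuseRowObject: false)
            .Select(row => row.Features)
            .ToList();
    }
}
```

For the dieHardLover instance from the Main method, the returned array should contain the three scores 8, 10, and 1 in that order.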

In the following line, I call the Fit method on the data preparation pipeline that we just configured. This differs from some of the other examples out there, but I do it to be able to save the data preparation pipeline for retraining the model later. Saving is done by calling mlContext.Model.Save.

Next, I decide which trainer to use to train the model. Since we want to predict a boolean (ILikeDieHard), we dig into the list of binary classification trainers. A binary classification trainer is good at answering questions with two possible outcomes, which maps well to the prediction in this example. I'm using the Averaged Perceptron trainer since it supports retraining. When specifying the trainer, I tell ML.NET the property/column containing the value we want to predict as well as the property/column containing the input data (in this case the Features column that contains the concatenated movie scores).

Finally, I train the model using the training data (the 200 hardcoded rows of movie scores and ILikeDieHard values) and save the model.

With the trained model I should be able to predict if a user likes Die Hard or not. Before doing this, let's fill in the RetrainModel method. If you remember the if/else from the Main method, this method will be called if a trained model is already found on disk:

private static ITransformer RetrainModel(MLContext mlContext, IDataView trainingDataView)
{
    // Requires: using System.Linq; using Microsoft.ML.Trainers;
    DataViewSchema dataPrepPipelineSchema, modelSchema;
    var trainedModel = mlContext.Model.Load("./diehard-model.zip", out modelSchema);
    var dataPrePipeline =
        mlContext.Model.Load("./diehard-pipeline.zip", out dataPrepPipelineSchema);

    IDataView transformedData = dataPrePipeline.Transform(trainingDataView);
    IEnumerable<ITransformer> chain = trainedModel as IEnumerable<ITransformer>;
    ISingleFeaturePredictionTransformer<object> predictionTransformer =
        chain.Last() as ISingleFeaturePredictionTransformer<object>;
    var originalModelParameters = predictionTransformer.Model as LinearBinaryModelParameters;

    var model = dataPrePipeline
        .Append(mlContext
            .BinaryClassification
            .Trainers
            .AveragedPerceptron(
                labelColumnName: "ILikeDieHard",
                numberOfIterations: 10,
                featureColumnName: "Features")
            .Fit(transformedData, originalModelParameters));

    mlContext.Model.Save(model, trainingDataView.Schema, "./diehard-model.zip");

    return model;
}

In a real application, the RetrainModel method would be called, for example, the next day, once new user ratings have accumulated since the last run. For this example, I'm using the same training set for simplicity.

In the first lines, I load both the pipeline and the model that we saved in the previous method. The next code block looks a bit strange. Figuring out how to obtain the original model parameters using a range of casts and LINQ code was definitely the hardest part of creating this code. Microsoft only shows how to do this with a single trainer type, and the code you need to write varies from trainer to trainer. I hope the documentation there will improve over time.
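Since every as cast in that block silently yields null when the types don't line up, it can help to make each step fail loudly. The following is a minimal sketch of the same extraction with explicit checks. The ModelParameterExtractor class and its method name are hypothetical, and the sketch assumes a linear binary trainer such as Averaged Perceptron; other trainers expose other parameter types.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using Microsoft.ML;
using Microsoft.ML.Trainers;

// Hypothetical helper, not part of ML.NET or the original sample.
static class ModelParameterExtractor
{
    public static LinearBinaryModelParameters ExtractLinearParameters(ITransformer loadedModel)
    {
        // A model saved from a pipeline is loaded back as a chain of transformers.
        if (!(loadedModel is IEnumerable<ITransformer> chain))
        {
            throw new InvalidOperationException(
                "The loaded model is not a transformer chain.");
        }

        // The prediction transformer is normally the last link in the chain.
        var predictionTransformer = chain
            .OfType<ISingleFeaturePredictionTransformer<object>>()
            .LastOrDefault();
        if (predictionTransformer == null)
        {
            throw new InvalidOperationException(
                "No prediction transformer was found in the chain.");
        }

        // Averaged Perceptron produces linear binary model parameters.
        if (!(predictionTransformer.Model is LinearBinaryModelParameters parameters))
        {
            throw new InvalidOperationException(
                "The trained model does not contain linear binary model parameters.");
        }

        return parameters;
    }
}
```

With a helper like this, the three cast lines in RetrainModel could be replaced by a single call that passes in trainedModel, and a mismatch would surface as an exception with a message instead of a NullReferenceException further down.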

Once we have the parameters from the previously trained model, we can configure an Averaged Perceptron trainer similar to the one used in the TrainNewModel method and call Fit, passing in the original parameters, to retrain the model. Again, I save the retrained model so it can be loaded and retrained on the newest data the next time we run the console app.

The final thing missing is making the predictions. I'm creating two new MoviePreferenceInput objects to use for a test. Add them after the if/else in the Main method:

var input1 = new MoviePreferenceInput
{
    StarWarsScore = 7,
    ArmageddonScore = 9,
    SleeplessInSeattleScore = 0
};
var input2 = new MoviePreferenceInput
{
    StarWarsScore = 0,
    ArmageddonScore = 0,
    SleeplessInSeattleScore = 10
};

The first object represents someone who likes Star Wars, loves Armageddon, but hates Sleepless in Seattle. The second one hates the first two movies but loves the last one. Notice that the ILikeDieHard boolean isn't specified for these inputs since that is the boolean we want ML.NET to predict. The prediction isn't added to the MoviePreferenceInput but is specified in a new class:

class LikeDieHardPrediction
{
    // Requires: using Microsoft.ML.Data;
    [ColumnName("PredictedLabel")]
    public bool Prediction { get; set; }
}

Then use the model to create a prediction engine and output the result:

PredictionEngine<MoviePreferenceInput, LikeDieHardPrediction> predictionEngine =
    mlContext.Model.CreatePredictionEngine<MoviePreferenceInput, LikeDieHardPrediction>(model);
var prediction = predictionEngine.Predict(input1);
Console.WriteLine($"First user loves Die Hard: {prediction.Prediction}");
prediction = predictionEngine.Predict(input2);
Console.WriteLine($"Second user loves Die Hard: {prediction.Prediction}");

I start by using the context to create a new prediction engine. This takes the input and output as generic types. Next, I call the Predict method with each input and output the value of the Prediction property:

First user loves Die Hard: True
Second user loves Die Hard: False

The first user is predicted to like Die Hard and the second one not to. Looking at the training data, it's pretty clear why: existing users who rate Armageddon highly also love Die Hard, which teaches ML.NET that other users with high ratings for Armageddon will probably like Die Hard as well.
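A plain True/False doesn't say how far a user is from the decision boundary. Binary classification models in ML.NET also produce a Score column, so a prediction class can expose it next to the predicted label. The sketch below assumes this; the LikeDieHardScoredPrediction class and the ScoredPredictionSketch helper are hypothetical names. The score from Averaged Perceptron is a raw, uncalibrated value rather than a probability.

```csharp
using Microsoft.ML;
using Microsoft.ML.Data;

class MoviePreferenceInput
{
    public float StarWarsScore { get; set; }
    public float ArmageddonScore { get; set; }
    public float SleeplessInSeattleScore { get; set; }

    public bool ILikeDieHard { get; set; }
}

// Hypothetical variant of LikeDieHardPrediction that also exposes the raw score.
class LikeDieHardScoredPrediction
{
    [ColumnName("PredictedLabel")]
    public bool Prediction { get; set; }

    // Raw output of the linear model. With the default threshold, a positive
    // score corresponds to a true prediction. It is not a probability.
    public float Score { get; set; }
}

// Hypothetical helper, not part of ML.NET or the original sample.
static class ScoredPredictionSketch
{
    public static LikeDieHardScoredPrediction Predict(
        MLContext mlContext,
        ITransformer model,
        MoviePreferenceInput input)
    {
        // Creating an engine per call is fine for a demo; reuse it when predicting in a loop.
        var engine = mlContext.Model
            .CreatePredictionEngine<MoviePreferenceInput, LikeDieHardScoredPrediction>(model);
        return engine.Predict(input);
    }
}
```

Passing the model from the Main method together with input1 or input2 would return both the boolean and the score, which makes it easier to spot users the model is barely sure about.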

The full source code example for this blog post can be found in the ILikeDieHard GitHub repo.
