Track your Experiments

Track your model training experiments by recording hyper-parameters and metrics such as loss and accuracy.

Model training is an iterative process: you go through cycles of tuning hyper-parameters and studying how each change affects model convergence, accuracy against the validation set, system resource utilization, and so on. Understanding how a model trains against a given set of hyper-parameters is crucial for getting the most out of your experiments.

MarkovML facilitates this process by allowing you to record your model training Experiments using the Python SDK. You can then inspect metrics like loss and accuracy and system metrics like CPU and memory usage against the set of hyper-parameters provided.

Create Experiment Recorder

Markov uses Recorder objects to record data into the MarkovML backend. Recording data from a model training experiment with MarkovML is a three-step process.

  1. Create an experiment recorder object and register it with the MarkovML backend.
  2. Add experiment records to the recorder to send them to the MarkovML backend.
  3. Call finish on the recorder object to signal the end of the recording.

📘

Integrations with ML Frameworks

To make the above process easier, Markov integrates with many popular machine-learning frameworks. Please refer to the Integrations page.

When you create an experiment recorder, you'll provide a name for the training session, the ID of the relevant Project, as well as any relevant hyper-parameter values used in the model training. You may also add any notes you'd like to save with the experiment. See the code example below for more details.

📘

Experiments are stored in Projects

Before creating the experiment recorder, get a reference to a Project (specify the project ID) and define the hyper-parameters used for your model training.

Follow the sample code below to create an experiment tracking recorder:

import markov

# Get the existing project by name or id
try:
    my_project = markov.Project.get_by_name(project_name="My first project")
    # my_project = markov.Project.get_by_id(project_id="vT6C9Qmddissku")
except markov.exceptions.ResourceNotFoundException:
    my_project = markov.Project(name="My first project")
    my_project.register()

# Define the model here
# hyper_parameters = {"learning_rate": 0.1, "n_input": 100, ...}
# model = torch.nn.Sequential()

# Create the recorder through the project's create_experiment_recorder method.
# In this case, the project_id argument is inferred from the project.
recorder = my_project.create_experiment_recorder(
    experiment_name="Test_Experiment_Tracking_With_MarkovML",
    experiment_notes="This is a test experiment",
    hyper_parameters={
        "learning_rate": 0.1,
        "n_input": 100,
        "n_hidden": 50,
        "n_output": 10
    },
    # Optional key-value pairs for storing custom values with this experiment
    # in MarkovML, in the form meta_data={"KEY": "VALUE"}. For example:
    meta_data={
        "exp_code": "AGI",
        "config": {
            "feature_store": 1.2,
            "lexicon": "/path/secret_store/table.id"
        }
    }
)

# Register the experiment recorder with the MarkovML backend. Only a registered
# experiment recorder can be used to add records.
recorder.register()

Note: The hyper_parameters field for the recorder can take any key-value pairs you want to register with the recorder as a hyper-parameter.
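Because hyper_parameters is a plain dictionary of key-value pairs, one convenient pattern is to keep the training configuration in a single object and derive the dictionary from it, so that the values you record are the values you train with. The sketch below is only an illustration: TrainingConfig and its fields are hypothetical names, not part of the Markov SDK.

```python
from dataclasses import dataclass, asdict


@dataclass(frozen=True)
class TrainingConfig:
    """Hypothetical container for the values used in one training run."""
    learning_rate: float = 0.1
    n_input: int = 100
    n_hidden: int = 50
    n_output: int = 10


# Override only the values that differ from the defaults.
config = TrainingConfig(learning_rate=0.05)

# asdict() converts the dataclass into a plain dict of key-value pairs,
# which is the shape used for hyper_parameters in the example above.
hyper_parameters = asdict(config)
print(hyper_parameters)
```

The resulting dictionary could then be passed as the hyper_parameters argument when creating the recorder.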

When an experiment recorder is instantiated, two new resources are created in MarkovML:

  1. An Experiment.

  2. A Model object.

This Model object is a placeholder to store a pointer to the artifact generated from the training session in the future. After creating the recorder, we can send the experimentation data to MarkovML.

Add Records to an Experiment Recorder

This section covers recording experiment data through the experiment recorder during the model training (experiment) step. For instructions on creating an experiment recorder, see the Create Experiment Recorder section above.

An ExperimentRecorder has the following key methods:

  • start(): Marks the start of model training. Call it just before the model training code begins executing.
  • stop(): Marks the end of model training. Call it right after the model training code has finished.
  • add_record({"key": value}): Records metrics with MarkovML. The key can be any string and the value can be any numeric value. Call add_record inside the training loop to record metrics such as loss and accuracy. The recorded metrics are available as charts in the MarkovML app.
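Since add_record takes numeric values, it can help to convert framework-specific scalars (for example, a single-element loss tensor) to plain Python floats before recording them. The helper below is a hypothetical convenience function, not part of the Markov SDK. It assumes that tensor-like objects expose an item() method, as PyTorch tensors and NumPy scalars do; FakeTensor is a stand-in used only to keep the sketch self-contained.

```python
import numbers


def to_metric_value(value):
    """Return value as a plain float, or raise TypeError if it is not numeric.

    Hypothetical helper: objects with an item() method (such as
    single-element tensors) are unwrapped first.
    """
    if hasattr(value, "item") and callable(value.item):
        value = value.item()
    if not isinstance(value, numbers.Real):
        raise TypeError(f"metric value must be numeric, got {type(value).__name__}")
    return float(value)


class FakeTensor:
    """Stand-in for a single-element tensor; not a real framework class."""

    def __init__(self, v):
        self._v = v

    def item(self):
        return self._v


# Build a record whose values are guaranteed to be plain floats.
record = {"loss": to_metric_value(FakeTensor(0.25)), "epoch": to_metric_value(3)}
print(record)
```

In a training loop, this would read recorder.add_record({"loss": to_metric_value(loss)}).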

📘

Add-on Info

The recorder instance can also be utilized as a context manager using the with statement to avoid explicit calls to the start() and stop() methods.
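To illustrate what the with statement does here, the sketch below uses a stand-in class rather than the real markov.ExperimentRecorder. It assumes only the behavior described above: entering the block calls start() and leaving it calls stop(), even if the training code raises an exception.

```python
class SketchRecorder:
    """Stand-in recorder used to show the context-manager pattern.

    This is NOT the Markov SDK class; it only mimics the start(), stop()
    and add_record() methods described in this section.
    """

    def __init__(self):
        self.events = []

    def start(self):
        self.events.append("start")

    def stop(self):
        self.events.append("stop")

    def add_record(self, record):
        self.events.append(("record", dict(record)))

    def __enter__(self):
        self.start()
        return self

    def __exit__(self, exc_type, exc_value, traceback):
        self.stop()
        return False  # do not swallow exceptions raised by the training code


recorder = SketchRecorder()
with recorder:  # start() runs here
    for epoch in range(2):
        recorder.add_record({"loss": 1.0 / (epoch + 1)})
# stop() has run by this point

print(recorder.events)
```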

Follow the sample code pattern below to add records to the experiment tracking recorder.

import markov

# Your model definition 
# model = ...

# Create the recorder
recorder = markov.ExperimentRecorder(
    name="MarkovML Experiment #1"
)
# The register method registers this recorder with the MarkovML backend. 
# You can record data with the backend only through a registered recorder.
recorder.register()

with recorder:
    for epoch in range(500):
        pred = model(x)  
        # Calculate and record loss
        loss = loss_function(pred, y)
        recorder.add_record({"loss": loss})
        
        # calculate and record accuracy
        accuracy = accuracy_function(pred, y)
        recorder.add_record({"accuracy": accuracy})
        
        # REST OF THE MODEL TRAINING CODE GOES HERE

📘

Integrations with ML Frameworks

Markov integrates with many popular machine-learning frameworks. Please refer to the Integrations page.

Add Summary to an Experiment Recorder

A summary gives a quick overview of an experiment and makes it easier to compare multiple experiments.
You can set the summary of an experiment using the Markov SDK.

For a new experiment

import markov

# Your model definition 
# model = ...

# create recorder
recorder = markov.ExperimentRecorder(
    name="MarkovML experiment #1"
)
# The register method registers this recorder with the MarkovML backend. 
# You can record data with the backend only through a registered recorder.
recorder.register()

with recorder:
    for epoch in range(500):
        pred = model(x)  
        # Calculate and record loss
        loss = loss_function(pred, y)
        recorder.add_record({"loss": loss})
        
        # calculate and record accuracy
        accuracy = accuracy_function(pred, y)
        recorder.add_record({"accuracy": accuracy})
    
        # REST OF THE MODEL TRAINING CODE GOES HERE
            
    recorder.summary.add_training_loss(value=str(loss))
    recorder.summary.add_training_accuracy(value=str(accuracy))
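The summary calls above are passed string values. One way to prepare them, sketched below under the assumption that you keep a per-epoch history of your metrics, is to pick the final loss and the best accuracy and format both consistently. The function name, dictionary keys, and the sample numbers are hypothetical illustrations, not part of the Markov SDK and not real results.

```python
def summarize_history(losses, accuracies, digits=4):
    """Return summary strings: the final loss and the best accuracy."""
    if not losses or not accuracies:
        raise ValueError("history must not be empty")
    return {
        "training_loss": f"{losses[-1]:.{digits}f}",
        "training_accuracy": f"{max(accuracies):.{digits}f}",
    }


# Made-up per-epoch values, used only to exercise the helper.
losses = [0.92, 0.61, 0.48]
accuracies = [0.55, 0.71, 0.69]

summary = summarize_history(losses, accuracies)
print(summary)
```

The two strings could then be passed as value to recorder.summary.add_training_loss and recorder.summary.add_training_accuracy.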

Create a virtual environment with all experiment packages

MarkovML stores all the packages that were used while running an experiment. When you re-run the experiment, you can use the MarkovML CLI to create a virtual environment with all of these packages already installed.

virtualenv

To create a virtual environment using virtualenv, make sure virtualenv is installed on your system. If not, you can follow the instructions here.

Install the virtualenv environment in the current directory

Note: By default, the virtualenv is created with the name <experiment_id>_venv.

mkv experiment virtualenv -e <experiment_id>

Install the virtualenv environment at a specified path

mkv experiment virtualenv -e <experiment_id> -p <path>

conda

To create a conda environment, make sure conda is installed on your system. If not, you can follow the instructions here.

Install the conda environment in the current directory

Note: By default, the conda environment is created with the name <experiment_id>_conda_env.

mkv experiment conda-env -e <experiment_id>

Install the conda environment at a specified path

mkv experiment conda-env -e <experiment_id> -p <path>
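If you re-create environments from a script, the commands above can be assembled in Python. The sketch below only builds the argument list from the subcommands and flags shown in this section. The helper name is hypothetical, EXPERIMENT_ID is a placeholder, and actually running the command assumes the mkv CLI is installed and on your PATH.

```python
import shlex


def build_env_command(experiment_id, tool="virtualenv", path=None):
    """Build the mkv command line for re-creating an experiment environment.

    tool is "virtualenv" or "conda"; path is optional and maps to the -p flag.
    """
    subcommands = {"virtualenv": "virtualenv", "conda": "conda-env"}
    if tool not in subcommands:
        raise ValueError(f"unsupported tool: {tool!r}")
    command = ["mkv", "experiment", subcommands[tool], "-e", experiment_id]
    if path is not None:
        command += ["-p", str(path)]
    return command


command = build_env_command("EXPERIMENT_ID", tool="conda", path="./envs")
print(shlex.join(command))

# To execute it (requires the mkv CLI):
#   import subprocess
#   subprocess.run(command, check=True)
```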

Complete sample code

This section provides an example of model training using a PyTorch convolutional neural net. We will be using the model described in the PyTorch documentation here for model training.

import torch
import torchvision
import torchvision.transforms as transforms
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim

import markov

# Get and load the dataset (CIFAR10 dataset)
transform = transforms.Compose(
    [transforms.ToTensor(),
     transforms.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5))])

batch_size = 4

trainset = torchvision.datasets.CIFAR10(root='./data', train=True,
                                        download=True, transform=transform)
trainloader = torch.utils.data.DataLoader(trainset, batch_size=batch_size,
                                          shuffle=True, num_workers=2)

testset = torchvision.datasets.CIFAR10(root='./data', train=False,
                                       download=True, transform=transform)
testloader = torch.utils.data.DataLoader(testset, batch_size=batch_size,
                                         shuffle=False, num_workers=2)

classes = ('plane', 'car', 'bird', 'cat',
           'deer', 'dog', 'frog', 'horse', 'ship', 'truck')
           

# Define the model
class Net(nn.Module):
    def __init__(self):
        super().__init__()
        self.conv1 = nn.Conv2d(3, 6, 5)
        self.pool = nn.MaxPool2d(2, 2)
        self.conv2 = nn.Conv2d(6, 16, 5)
        self.fc1 = nn.Linear(16 * 5 * 5, 120)
        self.fc2 = nn.Linear(120, 84)
        self.fc3 = nn.Linear(84, 10)

    def forward(self, x):
        x = self.pool(F.relu(self.conv1(x)))
        x = self.pool(F.relu(self.conv2(x)))
        x = torch.flatten(x, 1) # flatten all dimensions except batch
        x = F.relu(self.fc1(x))
        x = F.relu(self.fc2(x))
        x = self.fc3(x)
        return x


net = Net()

criterion = nn.CrossEntropyLoss()
optimizer = optim.SGD(net.parameters(), lr=0.001, momentum=0.9)

# Fetch or create project
project_name = "my_markov_project"
try:
    project = markov.Project.get_by_name(project_name=project_name)
except markov.exceptions.ResourceNotFoundException:
    project = markov.Project(name=project_name)
    project.register()

# Create the ExperimentRecorder
recorder = markov.ExperimentRecorder(
    name="Image classifier using CNN for CIFAR10",
    note="We are using a convolutional neural net to "
         "train a classifier with the CIFAR10 dataset. "
         "CIFAR10 has the classes: ‘airplane’, ‘automobile’, "
         "‘bird’, ‘cat’, ‘deer’, ‘dog’, ‘frog’, ‘horse’, "
         "‘ship’, ‘truck’. The images in CIFAR-10 are of "
         "size 3x32x32, i.e. 3-channel color images of 32x32 pixels in size.",
    hyper_parameters={
        "lr": 0.001,
        "momentum": 0.9,
        "batch_size": 4,
        "optimizer": "SGD"
    },
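    project_id=project.project_id
)

# Register the recorder with the MarkovML backend. Only a registered
# recorder can be used to add records.
recorder.register()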

# Start model training
with recorder:
    for epoch in range(10):  # Loop over the dataset multiple times
        running_loss = 0.0  # Reset each epoch so the average covers one epoch only
        data_points = 0
        for i, data in enumerate(trainloader, 0):
            # Get the inputs; data is a list of [inputs, labels]
            inputs, labels = data
    
            # Zero the parameter gradients
            optimizer.zero_grad()
    
            # Forward + backward + optimize
            outputs = net(inputs)
            loss = criterion(outputs, labels)
            loss.backward()
            optimizer.step() 

            running_loss += loss.item()
            data_points += 1
        
        avg_running_loss = running_loss / data_points
        
        # Record running loss
        recorder.add_record({"average running loss": avg_running_loss}) 
   
    # Add summary (average loss from the last recorded epoch)
    recorder.summary.add_training_loss(value=str(avg_running_loss))
