Track your Experiments
Track your model training experiments by recording hyper-parameters and metrics like loss and accuracy
Model training is an iterative process in which you go through cycles of tuning hyper-parameters and understanding how changes impact model convergence, accuracy against the validation set, system resource utilization, and more. Understanding how a model trains against a set of hyper-parameters is crucial for optimizing experimentation results.
MarkovML facilitates this process by allowing you to record your model training Experiments using the Python SDK. You can then inspect metrics like loss and accuracy and system metrics like CPU and memory usage against the set of hyper-parameters provided.
Create Experiment Recorder
Markov uses Recorder objects to record data into the MarkovML backend. Recording data from a model training experiment with MarkovML is a three-step process.
- Create an experiment recorder object and register it with the MarkovML backend.
- Add experiment records to the recorder to send to the MarkovML backend.
- Call finish on the recorder object to signal the end of the recording (see the minimal sketch below).
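Put together, the flow looks roughly like the minimal sketch below. The experiment name and the loss values are placeholders, and the fully worked patterns are covered in the sections that follow.

import markov

# Step 1: create an experiment recorder and register it with the MarkovML backend
recorder = markov.ExperimentRecorder(name="my_first_experiment")
recorder.register()

# Steps 2 and 3: add records during training; leaving the `with` block
# signals that the recording has finished
with recorder:
    for epoch in range(3):
        recorder.add_record({"loss": 1.0 / (epoch + 1)})  # placeholder metric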
Integrations with ML Frameworks
To make the above process easier, Markov has easy integrations with many popular machine-learning frameworks. Please refer to the Integrations page.
When you create an experiment recorder, you'll provide a name for the training session, the ID of the relevant Project, and any hyper-parameter values used in the model training. You may also add any notes you'd like to save with the experiment. See the code example below for more details.
Experiments are stored in Projects
Before creating the experiment recorder, get a reference to a Project (specify the project ID) and define the hyper-parameters used for your model training.
Follow the sample code below to create an experiment tracking recorder:
import markov

# Get the existing project by name or ID
try:
    my_project = markov.Project.get_by_name(project_name="My first project")
    # my_project = markov.Project.get_by_id(project_id="vT6C9Qmddissku")
except markov.exceptions.ResourceNotFoundException:
    my_project = markov.Project(name="My first project")
    my_project.register()

# Define the model here
# hyper_parameters = {"learning_rate": 0.1, "n_input": 100, ...}
# model = torch.nn.Sequential()

# Alternatively, you can use a project's create_experiment_recorder method.
# In this case, the project_id argument will be inferred.
recorder = my_project.create_experiment_recorder(
    experiment_name="Test_Experiment_Tracking_With_MarkovML",
    experiment_notes="This is a test experiment",
    hyper_parameters={
        "learning_rate": 0.1,
        "n_input": 100,
        "n_hidden": 50,
        "n_output": 10
    },
    # Optional key/value pairs to store values for custom variables with MarkovML
    # for this experiment. For example:
    meta_data={
        "exp_code": "AGI",
        "config": {
            "feature_store": 1.2,
            "lexicon": "/path/secret_store/table.id"
        }
    }
    # meta_data={"KEY": "VALUE_PAIR"}
)

# Register the experiment recorder with the MarkovML backend. Only a registered
# experiment recorder can be used to add records.
recorder.register()
Note: The hyper_parameters field of the recorder can take any key-value pairs you want to register as hyper-parameters for the experiment.
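For illustration, the keys do not have to be the usual training knobs; any descriptive key-value pairs can be recorded. The sketch below reuses my_project from the example above; the experiment name and keys are made up, and it assumes the other recorder arguments (notes, meta_data) are optional.

recorder = my_project.create_experiment_recorder(
    experiment_name="Hyper_Parameter_Sweep_01",
    hyper_parameters={
        "optimizer": "adam",                # strings are fine
        "dropout": 0.3,                     # so are floats
        "use_pretrained_embeddings": True,  # and booleans
        "vocab_size": 30000
    }
)
recorder.register()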
When an experiment recorder is instantiated, two new resources are created in MarkovML:
- An Experiment.
- A Model object.
This Model object is a placeholder that will store a pointer to the model artifact generated from the training session. After creating the recorder, we can send the experimentation data to MarkovML.
Add Records to an Experiment Recorder
In this section, we will go over recording experiment data through the experiment recorder during the model training (experiment) step. The instructions to create an experiment recorder are covered above.
An ExperimentRecorder has the following key methods:
- start(): starts recording the model training and should be called just before the model training code begins executing.
- stop(): ends recording the model training and should be called right after the model training code has finished.
- add_record({"key": value}): the API used to record metrics with MarkovML. The key can be any string, and the value can be any numeric value. add_record should be called inside the training loop to record metrics like loss and accuracy. These recorded metrics will be available as charts in the MarkovML app.
Add-on Info
The recorder instance can also be used as a context manager via the with statement to avoid explicit calls to the start() and stop() methods.
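If you prefer not to use the context manager, a minimal sketch of the explicit pattern looks like this (it assumes the recorder has already been registered, and uses model, x, y, and loss_function as placeholders for your own training code):

recorder.start()  # call just before the training code begins executing

for epoch in range(500):
    pred = model(x)
    loss = loss_function(pred, y)
    recorder.add_record({"loss": loss})

recorder.stop()  # call right after the training code has finished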
Follow the sample code pattern below to add records to the experiment tracking recorder.
import markov

# Your model definition
# model = ...

# Create recorder
recorder = markov.ExperimentRecorder(
    name="MarkovML Experiment #1"
)

# The register method registers this recorder with the MarkovML backend.
# You can record data with the backend only through a registered recorder.
recorder.register()

with recorder:
    for epoch in range(500):
        pred = model(x)

        # Calculate and record loss
        loss = loss_function(pred, y)
        recorder.add_record({"loss": loss})

        # Calculate and record accuracy
        accuracy = accuracy_function(pred, y)
        recorder.add_record({"accuracy": accuracy})

        # REST OF THE MODEL TRAINING CODE GOES HERE
Integrations with ML Frameworks
Markov has easy integrations with many popular machine-learning frameworks. Please refer to the Integrations page.
Add Summary to an Experiment Recorder
A summary gives a quick glimpse of an experiment and makes it easy to compare multiple experiments.
You can set the summary of an experiment using the Markov SDK.
For a new experiment
import markov

# Your model definition
# model = ...

# Create recorder
recorder = markov.ExperimentRecorder(
    name="MarkovML experiment #1"
)

# The register method registers this recorder with the MarkovML backend.
# You can record data with the backend only through a registered recorder.
recorder.register()

with recorder:
    for epoch in range(500):
        pred = model(x)

        # Calculate and record loss
        loss = loss_function(pred, y)
        recorder.add_record({"loss": loss})

        # Calculate and record accuracy
        accuracy = accuracy_function(pred, y)
        recorder.add_record({"accuracy": accuracy})

        # REST OF THE MODEL TRAINING CODE GOES HERE

    recorder.summary.add_training_loss(value=str(loss))
    recorder.summary.add_training_accuracy(value=str(accuracy))
Create a virtual environment with all experiment packages
MarkovML stores all the packages that were used while running an experiment. When you want to re-run an experiment, you can use the MarkovML CLI to create a virtual environment with all of these packages already installed.
virtualenv
To create a virtual environment using virtualenv, make sure virtualenv is installed on your system. If not, you can follow the instructions here.

Install the virtualenv environment in the current directory
Note: the virtualenv will be created with the name <experiment_id>_venv by default.

mkv experiment virtualenv -e <experiment_id>

Install the virtualenv environment at a specified path

mkv experiment virtualenv -e <experiment_id> -p <path>
conda
To create a conda environment, make sure conda is installed on your system. If not, you can follow the instructions here.

Install the conda environment in the current directory
Note: the conda environment will be created with the name <experiment_id>_conda_env by default.

mkv experiment conda-env -e <experiment_id>

Install the conda environment at a specified path

mkv experiment conda-env -e <experiment_id> -p <path>
Complete sample code
This section provides a complete example of model training using a PyTorch convolutional neural network. We will be using the model described in the PyTorch documentation here.
import torch
import torchvision
import torchvision.transforms as transforms
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim

import markov

# Get and load the dataset (CIFAR10 dataset)
transform = transforms.Compose(
    [transforms.ToTensor(),
     transforms.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5))])

batch_size = 4

trainset = torchvision.datasets.CIFAR10(root='./data', train=True,
                                        download=True, transform=transform)
trainloader = torch.utils.data.DataLoader(trainset, batch_size=batch_size,
                                          shuffle=True, num_workers=2)

testset = torchvision.datasets.CIFAR10(root='./data', train=False,
                                       download=True, transform=transform)
testloader = torch.utils.data.DataLoader(testset, batch_size=batch_size,
                                         shuffle=False, num_workers=2)

classes = ('plane', 'car', 'bird', 'cat',
           'deer', 'dog', 'frog', 'horse', 'ship', 'truck')


# Define the model
class Net(nn.Module):
    def __init__(self):
        super().__init__()
        self.conv1 = nn.Conv2d(3, 6, 5)
        self.pool = nn.MaxPool2d(2, 2)
        self.conv2 = nn.Conv2d(6, 16, 5)
        self.fc1 = nn.Linear(16 * 5 * 5, 120)
        self.fc2 = nn.Linear(120, 84)
        self.fc3 = nn.Linear(84, 10)

    def forward(self, x):
        x = self.pool(F.relu(self.conv1(x)))
        x = self.pool(F.relu(self.conv2(x)))
        x = torch.flatten(x, 1)  # flatten all dimensions except batch
        x = F.relu(self.fc1(x))
        x = F.relu(self.fc2(x))
        x = self.fc3(x)
        return x


net = Net()
criterion = nn.CrossEntropyLoss()
optimizer = optim.SGD(net.parameters(), lr=0.001, momentum=0.9)

# Fetch or create project
project_name = "my_markov_project"
try:
    project = markov.Project.get_by_name(project_name)
except markov.exceptions.ResourceNotFoundException:
    project = markov.Project(project_name)
    project.register()

# Create the ExperimentRecorder
recorder = markov.ExperimentRecorder(
    name="Image classifier using CNN for CIFAR10",
    note="We are using a convolutional neural net to "
         "train a classifier with the CIFAR10 dataset. "
         "CIFAR10 has the classes: 'airplane', 'automobile', "
         "'bird', 'cat', 'deer', 'dog', 'frog', 'horse', "
         "'ship', 'truck'. The images in CIFAR-10 are of "
         "size 3x32x32, i.e. 3-channel color images of 32x32 pixels in size.",
    hyper_parameters={
        "lr": 0.001,
        "momentum": 0.9,
        "batch_size": 4,
        "optimizer": "SGD"
    },
    project_id=project.project_id
)

# Register the recorder with the MarkovML backend before recording
recorder.register()

# Start model training
with recorder:
    for epoch in range(10):  # Loop over the dataset multiple times
        running_loss = 0.0
        data_points = 0
        for i, data in enumerate(trainloader, 0):
            # Get the inputs; data is a list of [inputs, labels]
            inputs, labels = data

            # Zero the parameter gradients
            optimizer.zero_grad()

            # Forward + backward + optimize
            outputs = net(inputs)
            loss = criterion(outputs, labels)
            loss.backward()
            optimizer.step()

            running_loss += loss.item()
            data_points += 1

        # Record the average running loss for this epoch
        avg_running_loss = running_loss / data_points
        recorder.add_record({"average running loss": avg_running_loss})

    # Add summary
    recorder.summary.add_training_loss(value=str(avg_running_loss))