Compare Datasets
Visualize and highlight similarities and differences amongst multiple datasets
MarkovML allows the comparison of a primary dataset against multiple datasets. The comparison consists of the following:
- Basic info, such as the number of records in each dataset, data family names, and dataset source
- Class distribution for the datasets
- Dataset segment similarity based on statistical measures. The following measures are computed:
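The class distribution part of the comparison can be illustrated without the SDK. The sketch below is a minimal illustration, not MarkovML's implementation: it computes the per-class fraction of records for two hypothetical label columns and summarizes how far apart they are. The total variation distance used here is an assumed stand-in measure chosen for simplicity, and the function names and sample labels are hypothetical.

```python
from collections import Counter


def class_distribution(labels):
    """Return the fraction of records that fall into each class."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {cls: n / total for cls, n in counts.items()}


def total_variation_distance(p, q):
    """Half the L1 distance between two class distributions.

    0.0 means the distributions are identical; 1.0 means they share no class.
    """
    classes = set(p) | set(q)
    return 0.5 * sum(abs(p.get(c, 0.0) - q.get(c, 0.0)) for c in classes)


# Hypothetical label columns for a primary and a secondary dataset
primary_labels = ["spam", "ham", "ham", "ham"]
secondary_labels = ["spam", "spam", "ham", "ham"]

primary_dist = class_distribution(primary_labels)
secondary_dist = class_distribution(secondary_labels)
distance = total_variation_distance(primary_dist, secondary_dist)

print(primary_dist)    # {'spam': 0.25, 'ham': 0.75}
print(secondary_dist)  # {'spam': 0.5, 'ham': 0.5}
print(distance)        # 0.25
```

A small distance suggests the two datasets have a similar label balance; a large one flags a shift worth inspecting before the datasets are used interchangeably.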
Steps
- Register your datasets first (see Register Datasets). You will need dataset IDs (ds_id) to trigger a comparison, and only registered datasets can be compared.
- Trigger a comparison run between a primary dataset and multiple secondary datasets.
import markov
# Fetch the registered primary dataset by its dataset ID
primary_dataset = markov.dataset.get_by_id("ds_id1")
# Trigger a comparison run against the secondary datasets, identified by their dataset IDs
primary_dataset.compare(compare_input=["ds_id2", "ds_id3", "ds_id4"])
The comparison takes some time to finish. Once it is complete, you can see the results on the Runs page. You'll also receive an email notification when the comparison is complete.
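Because each comparison is a compute job that takes time, it can help to tidy the list of secondary dataset IDs before triggering a run. The helper below is a hedged sketch, not part of the SDK: `trigger_comparison` is a hypothetical name, and dropping duplicates and the primary ID is a defensive assumption, since how the service treats such input is not documented here. It relies only on the two calls shown above (`get_by_id` and `compare`) and takes the lookup function as a parameter, so the example runs with a stand-in object when the SDK is not installed.

```python
def trigger_comparison(get_dataset, primary_id, secondary_ids):
    """Clean the secondary IDs, then trigger a comparison run.

    get_dataset: callable that returns a dataset object for a dataset ID
                 (with the SDK, this would be markov.dataset.get_by_id).
    Returns the list of secondary IDs that was actually submitted.
    """
    # Drop duplicates and the primary ID itself, preserving order
    seen = {primary_id}
    cleaned = []
    for ds_id in secondary_ids:
        if ds_id not in seen:
            seen.add(ds_id)
            cleaned.append(ds_id)
    if not cleaned:
        raise ValueError("at least one secondary dataset ID is required")

    primary = get_dataset(primary_id)
    primary.compare(compare_input=cleaned)
    return cleaned


class FakeDataset:
    """Stand-in for an SDK dataset object; records what compare() receives."""

    def __init__(self, ds_id):
        self.ds_id = ds_id
        self.compared_with = None

    def compare(self, compare_input):
        self.compared_with = list(compare_input)


fake_store = {}


def fake_get_by_id(ds_id):
    # Return the same stand-in object for repeated lookups of one ID
    if ds_id not in fake_store:
        fake_store[ds_id] = FakeDataset(ds_id)
    return fake_store[ds_id]


submitted = trigger_comparison(
    fake_get_by_id, "ds_id1", ["ds_id2", "ds_id1", "ds_id3", "ds_id2"]
)
print(submitted)  # ['ds_id2', 'ds_id3']
```

With the SDK installed, the same helper would be called as `trigger_comparison(markov.dataset.get_by_id, "ds_id1", ["ds_id2", "ds_id3"])`.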
View Dataset Comparison Jobs
Any compute jobs that run for Dataset Comparison will be listed on the Runs page.
Updated 9 months ago