Dataset Comparison
Visualize and highlight similarities and differences amongst multiple datasets
MarkovML allows the comparison of a primary dataset against multiple datasets. The comparison consists of the following:
- Basic information, including the number of records in each dataset, data family names, dataset source, and more.
- Class distribution for the datasets.
- Dataset segment similarity based on statistical measures. The following measures are computed:
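To make the idea of comparing class distributions concrete, here is a minimal, generic sketch in plain Python. It is not MarkovML's implementation, and total variation distance is used only as an illustrative statistical measure, not necessarily one that MarkovML computes. The function names `class_distribution` and `total_variation_distance` and the sample labels are hypothetical.

```python
from collections import Counter


def class_distribution(labels):
    """Return the fraction of records per class label (empty dict for no records)."""
    counts = Counter(labels)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {label: count / total for label, count in counts.items()}


def total_variation_distance(dist_a, dist_b):
    """Half the summed absolute difference between two distributions.

    0.0 means identical class distributions; 1.0 means no overlap.
    """
    classes = set(dist_a) | set(dist_b)
    return 0.5 * sum(abs(dist_a.get(c, 0.0) - dist_b.get(c, 0.0)) for c in classes)


# Hypothetical label columns from a primary and a secondary dataset.
primary_labels = ["spam", "ham", "ham", "ham"]
secondary_labels = ["spam", "spam", "ham", "ham"]

primary_dist = class_distribution(primary_labels)
secondary_dist = class_distribution(secondary_labels)
distance = total_variation_distance(primary_dist, secondary_dist)
print(primary_dist, secondary_dist, distance)
```

A small distance suggests the two datasets have similar label proportions; a large one points to class imbalance that may be worth inspecting on the comparison page.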
Run Dataset Comparison
If you have not yet registered your dataset, refer to the Register Dataset page and register it with MarkovML. You will need the dataset ID (ds_id) to start a comparison.
Note
Only registered datasets can be compared.
Use the compare() method to trigger a comparison between the primary dataset and multiple secondary datasets, as shown in the code below:
Sample Code
import markov
# Trigger a dataset comparison from the SDK.
# Fetch the primary dataset by its dataset ID.
primary_dataset = markov.dataset.get_by_id("ds_id1")
# Compare it against the secondary datasets, passed as a list of dataset IDs.
primary_dataset.compare(compare_input=["ds_id2", "ds_id3", "ds_id4"])
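When the list of secondary dataset IDs is assembled programmatically, it may help to clean it before calling compare(). The sketch below is a plain-Python helper, not part of the MarkovML SDK; `build_compare_input` is a hypothetical name, and whether the SDK itself tolerates duplicate IDs or the primary ID in the list is not stated here, so treat this as a defensive assumption.

```python
def build_compare_input(primary_id, candidate_ids):
    """Return candidate IDs with duplicates and the primary ID removed, keeping order."""
    seen = {primary_id}
    cleaned = []
    for ds_id in candidate_ids:
        if ds_id not in seen:
            seen.add(ds_id)
            cleaned.append(ds_id)
    return cleaned


# Hypothetical input containing a duplicate and the primary dataset's own ID.
secondary_ids = build_compare_input(
    "ds_id1", ["ds_id2", "ds_id1", "ds_id3", "ds_id2"]
)
print(secondary_ids)

# The cleaned list can then be passed to the call from the sample code:
# primary_dataset.compare(compare_input=secondary_ids)
```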
Note
The comparison takes some time to complete after you start it. You will receive an email notification when it is done, and you can then view the results on the Runs page in the MarkovML UI.
View Dataset Comparison Jobs
Any compute jobs that run for Dataset Comparison will be listed on the Runs page.
Comparison Result
We selected four datasets at random and ran a comparison. Here is what the comparison page looks like.