What Happens During Dataset Registration?
Dataset Type Identification
Registration begins with MarkovML identifying the type of the dataset being registered.
Analysis and Processing of Datasets
For datasets containing text, categorical, and numerical features, MarkovML runs a set of analyses to extract insights. Users can customize these analyses in the MarkovML user interface (UI).
Pre-processing for Text Datasets
Before analyzing a text dataset, we perform specific pre-processing steps to clean, normalize, and prepare the data. These steps include:
- Data Cleansing: Removing duplicates, numbers, URLs, etc.
- Data Normalization: Eliminating stop words, converting to lowercase, removing punctuation, stemming, etc.
These pre-processing steps do not alter the original dataset.
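The cleansing and normalization steps above can be sketched roughly as follows. This is a minimal illustration using only the Python standard library, not MarkovML's actual implementation: the function names, the regular expressions, and the tiny stop-word list are all assumptions, and stemming is omitted. The function returns a new list, so the input records stay unchanged.

```python
import re
import string

# Hypothetical, deliberately tiny stop-word list; a real pipeline would use a fuller one.
STOP_WORDS = {"a", "an", "and", "the", "is", "of", "to", "in", "at"}

URL_RE = re.compile(r"https?://\S+|www\.\S+")
NUMBER_RE = re.compile(r"\d+")
PUNCT_TABLE = str.maketrans("", "", string.punctuation)


def clean_text(text: str) -> str:
    """Return a normalized copy of one record; the input string is not modified."""
    text = URL_RE.sub(" ", text)        # cleansing: drop URLs
    text = NUMBER_RE.sub(" ", text)     # cleansing: drop numbers
    text = text.lower()                 # normalization: lowercase
    text = text.translate(PUNCT_TABLE)  # normalization: strip punctuation
    tokens = [t for t in text.split() if t not in STOP_WORDS]
    return " ".join(tokens)


def preprocess(records: list[str]) -> list[str]:
    """Drop exact duplicates and clean each record, returning a new list."""
    seen = set()
    cleaned = []
    for record in records:
        if record in seen:
            continue
        seen.add(record)
        cleaned.append(clean_text(record))
    return cleaned


raw = [
    "Visit https://example.com for the 3 best offers!",
    "Visit https://example.com for the 3 best offers!",
    "The model is trained in 10 epochs.",
]
processed = preprocess(raw)
print(processed)
```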
Data Analysis
Text Datasets
For text datasets, you can apply a range of analyzers from the MarkovML Analyzer library. These include N-gram analysis, keyword extraction, topic modeling, and data profiling.
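To give a feel for what N-gram analysis computes, here is a small standard-library sketch that counts the most frequent N-grams across a set of already pre-processed documents. The helper names `ngrams` and `top_ngrams` are hypothetical and do not correspond to the MarkovML Analyzer library's API.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return all contiguous n-token tuples from a token list."""
    return list(zip(*(tokens[i:] for i in range(n))))


def top_ngrams(documents, n=2, k=3):
    """Count n-grams over whitespace-tokenized documents and return the k most common."""
    counts = Counter()
    for doc in documents:
        counts.update(ngrams(doc.split(), n))
    return counts.most_common(k)


docs = [
    "machine learning model",
    "machine learning pipeline",
    "deep learning systems",
]
print(top_ngrams(docs, n=2, k=1))
```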
Mixed-Categorical Datasets (Text + Categorical Values)
If a dataset contains both categorical and text columns, you can selectively apply analyzers to the text columns.
Numerical Datasets
Datasets primarily composed of numerical data undergo data profiling to extract extensive summary statistics.
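As an illustration of the kind of summary statistics a data profile typically contains, the sketch below profiles one numerical column with Python's `statistics` module. The function name `profile_column` and the chosen set of statistics are assumptions for illustration; the profile MarkovML produces may include different or additional measures.

```python
import statistics


def profile_column(values):
    """Summarize a numerical column; None entries are treated as missing values."""
    present = [v for v in values if v is not None]
    return {
        "count": len(present),
        "missing": len(values) - len(present),
        "mean": statistics.mean(present),
        "stdev": statistics.stdev(present),  # sample standard deviation
        "min": min(present),
        "median": statistics.median(present),
        "max": max(present),
    }


column = [2.0, 4.0, None, 6.0, 8.0]
profile = profile_column(column)
print(profile)
```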
AutoML-Baseline Model and Quality Analysis
For datasets with a categorical target column, MarkovML conducts two additional analyses:
Baseline Model Analyzer
We train a baseline model using AutoML to establish a benchmark for performance evaluation.
Markov Quality Analyzer
This tool assesses the labeling quality of the dataset using various label error estimation algorithms.
Details of Baseline Model Training:
When creating the baseline model, MarkovML handles the following cases:
- If train/test splits are specified during dataset registration, we use the train set for model training and the test set for evaluating final metrics.
- For datasets without specified splits, we automatically split the data into train and test sets using stratified sampling, so that both sets preserve the class distribution. We apply k-fold cross-validation during training to reduce overfitting, and keep the test set isolated for computing the final metrics.
- The baseline model is trained via an AutoML pipeline, which automatically selects the best-performing model (e.g., XGBoost, LightGBM, Random Forest, logistic regression) and tunes its hyperparameters using search techniques such as CFO and BlendSearch.
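The splitting logic described above can be sketched as follows. This is a simplified standard-library illustration, not MarkovML's pipeline: the function names, the 20% test fraction, and the choice of five folds are assumptions, the folds here are plain (not stratified) k-fold, and model selection and hyperparameter search are left out entirely. The point it shows is that the folds are drawn only from the train indices, so the test set never takes part in training or validation.

```python
import random
from collections import defaultdict


def stratified_split(labels, test_fraction=0.2, seed=0):
    """Return (train_idx, test_idx), sampling the test set per class."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        # Take the same fraction from every class (at least one example).
        n_test = max(1, round(len(indices) * test_fraction))
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)


def k_folds(indices, k=5, seed=0):
    """Yield (fit_idx, validation_idx) pairs drawn from the given indices only."""
    shuffled = indices[:]
    random.Random(seed).shuffle(shuffled)
    folds = [shuffled[i::k] for i in range(k)]
    for i in range(k):
        validation = folds[i]
        fit = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield fit, validation


labels = ["spam"] * 20 + ["ham"] * 80
train_idx, test_idx = stratified_split(labels, test_fraction=0.2)
folds = list(k_folds(train_idx, k=5))
print(len(train_idx), len(test_idx), len(folds))
```

In a full pipeline, each candidate model would be fit on `fit_idx` and scored on `validation_idx` for every fold, and only the finally selected model would be evaluated once on `test_idx`.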
NOTE
For Enterprise Customers with a Hybrid deployment setup, MarkovML does not store datasets, and all analyzers run within the customer's VPC.