How split data can affect you data evaluation

Random Split, Stratified Slit in 'How your metrics could be affect'

From a binary class pespective

1+1=3

As simple as Logistic Regression

Validation model

When to split

Preprocessing

import numpy as np
import seaborn as sns;
import matplotlib.pyplot as plt; plt.style.use('ggplot')
N = int(10e6)
log_normal = np.random.lognormal(size=N)
var_log, mean_log = log_normal.var(), log_normal.mean()
var_log, mean_log
(4.658509594268878, 1.6486913090041344)
dist = sns.distplot(log_normal)
Cover
Read more ArchLinux page.

png

log_normal_sample = np.random.choice(log_normal, size=int(0.3*N))
var_log_sample, mean_log_sample = log_normal_sample.var(), log_normal_sample.mean()
var_log_sample, mean_log_sample
(4.617970633233202, 1.6470042949353507)
dist_sample = sns.distplot(log_normal_sample)
Cover
Read more ArchLinux page.

png


Trainnig

Metrics

Fine Tunning