A simple K-Fold cross-validator with 'group' awareness. It ensures that a group never appears in the train set and the test set at the same time.
The best example is medical data: when performing cross-validation, all data related to one patient must be kept together in the same split and must not leak into the other.
This class is inspired by scikit-learn's K-Fold and GroupKFold cross-validators, but (subjectively) easier to use and with better support for pandas DataFrame objects.
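To make the 'group' idea concrete, here is a minimal sketch of group-aware fold generation in plain Python. It is not the implementation used by PatientKFold: the function name `group_k_fold` and its parameters are hypothetical, and it only illustrates the property described above, namely that whole groups, not individual rows, are assigned to folds.

```python
import random


def group_k_fold(groups, n_splits=5, seed=42):
    """Yield (train_idx, test_idx) index lists so that no group spans both sides.

    Hypothetical helper for illustration only; `groups[i]` is the group
    (for example, the patient id) of sample i.
    """
    unique = sorted(set(groups))
    if n_splits > len(unique):
        raise ValueError("n_splits cannot exceed the number of distinct groups")
    # Shuffle the group ids, not the rows, so a group always moves as a unit.
    random.Random(seed).shuffle(unique)
    # Deal the shuffled group ids into n_splits buckets of near-equal size.
    buckets = [unique[i::n_splits] for i in range(n_splits)]
    for bucket in buckets:
        held_out = set(bucket)
        test_idx = [i for i, g in enumerate(groups) if g in held_out]
        train_idx = [i for i, g in enumerate(groups) if g not in held_out]
        yield train_idx, test_idx


# One patient id per row, with several rows belonging to the same patient.
patients = [1, 2, 2, 3, 4, 5, 5, 5]
for train_idx, test_idx in group_k_fold(patients, n_splits=5):
    train_ids = {patients[i] for i in train_idx}
    test_ids = {patients[i] for i in test_idx}
    # The defining property: no patient appears on both sides of a split.
    assert train_ids.isdisjoint(test_ids)
    print(sorted(train_ids), sorted(test_ids))
```

Splitting on the distinct ids first and mapping back to rows afterwards is what prevents leakage; shuffling the rows directly would scatter one patient's records across both sides.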
Required libraries and how to install them:
numpy>=1.8.0
You can install them either manually or through the command:
pip install -r requirements.txt
If you want to use this class, you have two options:
A) Simply copy and paste it into your project;
B) Or install it through pip with the command below:
pip install git+git://github.com/danilown/PatientKFold#egg=PatientKFold
Then, using it is as simple as:
from PatientKFold import PatientKFold
Note 1: As noted by David Winterbottom, if you freeze the environment to export the dependencies, the specific commit will be added to your requirements, so it might be a good idea to delete the commit ID from it.
Note 2: Due to the simplicity of this "package", this installation method was preferred over the more traditional PyPI.
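For Note 1, the commit ID can be removed by hand, or with a small script along the following lines. This is only a sketch: the helper name `strip_commit_pin` is hypothetical, the commit hash is a made-up placeholder, and it assumes the frozen line has the form `...@<commit>#egg=...`, which may differ depending on your pip version.

```python
import re

# Matches a pinned "@<commit hash>" that sits right before "#egg=".
# Assumption: the frozen requirement line follows this layout.
_COMMIT_PIN = re.compile(r"@[0-9a-fA-F]{7,40}(?=#egg=)")


def strip_commit_pin(requirement_line):
    """Return the requirement line without the pinned commit ID (hypothetical helper)."""
    return _COMMIT_PIN.sub("", requirement_line)


# The hash below is a placeholder, not a real commit of the repository.
pinned = (
    "git+git://github.com/danilown/PatientKFold"
    "@0123456789abcdef0123456789abcdef01234567#egg=PatientKFold"
)
print(strip_commit_pin(pinned))
# prints: git+git://github.com/danilown/PatientKFold#egg=PatientKFold
```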
The following examples show how you could use this class.
The first example splits a list of patient ids into 5 folds.
Example 1:
from PatientKFold import PatientKFold
patients = [1,2,3,4,5,6,7,8,9,10,11,12,13]
p = PatientKFold(patients, random_state=42)
for train_patients, test_patients in p:
    print(train_patients, test_patients)
    print('===')
Output:
# [10, 13, 6, 12, 9, 4, 5, 1, 2, 11] [8, 7, 3]
# ===
# [8, 7, 3, 12, 9, 4, 5, 1, 2, 11] [10, 13, 6]
# ===
# [8, 7, 3, 10, 13, 6, 5, 1, 2, 11] [12, 9, 4]
# ===
# [8, 7, 3, 10, 13, 6, 12, 9, 4, 2, 11] [5, 1]
# ===
# [8, 7, 3, 10, 13, 6, 12, 9, 4, 5, 1] [2, 11]
# ===
In the second example we split a pd.DataFrame into 5 folds, specifying which column holds the patient id.
Example 2:
from PatientKFold import PatientKFold
import pandas as pd
patient_df = {
    'patient': [1, 2, 2, 3, 4, 5, 5, 5],
    'other_columns': ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h']
}
patient_df = pd.DataFrame.from_dict(patient_df)
# patient_df =
# patient | other_columns
# 0 1 a
# 1 2 b
# 2 2 c
# 3 3 d
# 4 4 e
# 5 5 f
# 6 5 g
# 7 5 h
p = PatientKFold(patient_df, col_patient_id='patient', random_state=42)
for n_fold, (train_patients, test_patients) in enumerate(p):
    print('===')
    print(f'FOLD: {n_fold}')
    print("train_patients:")
    print(train_patients)
    print("test_patients:")
    print(test_patients)
Output:
# FOLD: 0
# train_patients:
# patient | other_columns
# 0 1 a
# 1 2 b
# 2 2 c
# 3 3 d
# 5 5 f
# 6 5 g
# 7 5 h
# test_patients:
# patient | other_columns
# 4 4 e
# ===
#
# FOLD: 1
# train_patients:
# patient | other_columns
# 0 1 a
# 3 3 d
# 4 4 e
# 5 5 f
# 6 5 g
# 7 5 h
# test_patients:
# patient | other_columns
# 1 2 b
# 2 2 c
# ===
#
# FOLD: 2
# train_patients:
# patient | other_columns
# 0 1 a
# 1 2 b
# 2 2 c
# 4 4 e
# 5 5 f
# 6 5 g
# 7 5 h
# test_patients:
# patient | other_columns
# 3 3 d
# ===
#
# FOLD: 3
# train_patients:
# patient | other_columns
# 0 1 a
# 1 2 b
# 2 2 c
# 3 3 d
# 4 4 e
# test_patients:
# patient | other_columns
# 5 5 f
# 6 5 g
# 7 5 h
# ===
#
# FOLD: 4
# train_patients:
# patient | other_columns
# 1 2 b
# 2 2 c
# 3 3 d
# 4 4 e
# 5 5 f
# 6 5 g
# 7 5 h
# test_patients:
# patient | other_columns
# 0 1 a
In order to test this class, just run:
python -m unittest
If you would like to see new functionality, have a suggestion on how to make the documentation clearer, or want to report a problem, you can open an issue here on GitHub or send me an e-mail at danilownunes@gmail.com.