PatientKFold

A simple K-Fold cross-validator with 'group' awareness. It ensures that a group never appears in the train set and the test set at the same time.

The typical use case is medical data. When performing cross-validation, all data belonging to one patient must be kept together in the same split and must not leak into the other.

This class is inspired by scikit-learn's KFold and GroupKFold cross-validators, but is (subjectively) easier to use and has better support for pandas DataFrame objects.
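To make the idea of group awareness concrete, here is a minimal sketch of how a group-aware K-Fold split could work using only the standard library. The function name `group_k_fold` and its parameters are hypothetical and chosen for illustration; this is not PatientKFold's actual implementation, and its shuffling will not reproduce the outputs shown in the Usage section.

```python
import random


def group_k_fold(group_ids, n_splits=5, seed=None):
    """Yield (train_groups, test_groups); each group is in exactly one test fold.

    Hypothetical helper for illustration only, not PatientKFold's own code.
    """
    # Deduplicate while keeping first-seen order, so repeated rows of one
    # patient collapse into a single group.
    unique_groups = list(dict.fromkeys(group_ids))
    if n_splits < 2 or n_splits > len(unique_groups):
        raise ValueError("n_splits must be between 2 and the number of groups")

    # Shuffle with a local RNG so the global random state is left untouched.
    random.Random(seed).shuffle(unique_groups)

    # Spread the remainder over the first folds so sizes differ by at most one.
    base, extra = divmod(len(unique_groups), n_splits)
    start = 0
    for fold in range(n_splits):
        stop = start + base + (1 if fold < extra else 0)
        test_groups = unique_groups[start:stop]
        train_groups = unique_groups[:start] + unique_groups[stop:]
        yield train_groups, test_groups
        start = stop


if __name__ == "__main__":
    row_patients = [1, 2, 2, 3, 4, 5, 5, 5]  # one entry per data row
    for train, test in group_k_fold(row_patients, n_splits=5, seed=0):
        # No patient may appear on both sides of a split.
        assert not set(train) & set(test)
        print(train, test)
```

Because the split is made over unique patient IDs rather than over rows, every row of a given patient necessarily follows that patient into the same side of the split.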

Requirements

The libraries you need and how to install them:

numpy>=1.8.0

You can install them either manually or through the command:

pip install -r requirements.txt

Package Installation

If you want to use this class, you have two options:

A) Simply copy and paste it in your project;

B) Or install it through pip with the command below:

pip install git+https://github.com/danilown/PatientKFold#egg=PatientKFold

Then, using it is as simple as:

from PatientKFold import PatientKFold

Note 1: As noted by David Winterbottom, if you freeze the environment to export your dependencies, the specific commit will be added to your requirements, so it might be a good idea to delete the commit ID from that entry.
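As an illustration of Note 1, the sketch below removes a pinned commit ID from a requirements line. The helper name `strip_commit_pin`, the example line, and the commit hash in it are all made up for this example, and the sketch assumes the commit appears as `@<hex>` right before `#egg=` or at the end of the line; check it against the lines your own freeze actually produces.

```python
import re

# Matches "@<commit>" where <commit> is a 7-40 character hex string,
# followed by "#egg=..." or the end of the line.
_COMMIT_PIN = re.compile(r"@[0-9a-fA-F]{7,40}(?=#|$)")


def strip_commit_pin(requirement_line):
    """Return the requirement line without a pinned commit ID, if it has one."""
    return _COMMIT_PIN.sub("", requirement_line.strip())


if __name__ == "__main__":
    # Hypothetical frozen line; the hash is a placeholder, not a real commit.
    frozen = (
        "git+https://github.com/danilown/PatientKFold"
        "@0123456789abcdef0123456789abcdef01234567#egg=PatientKFold"
    )
    print(strip_commit_pin(frozen))
```

Lines without a commit pin, such as `numpy>=1.8.0`, pass through unchanged.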

Note 2: Due to the simplicity of this "package", this installation method was preferred over the more traditional PyPI.

Usage

The following examples show how to use this class.

The first example splits a list of patient IDs into 5 folds.

Example 1:

from PatientKFold import PatientKFold

patients = [1,2,3,4,5,6,7,8,9,10,11,12,13]
p = PatientKFold(patients, random_state=42)

for train_patients, test_patients in p:
    print(train_patients, test_patients)
    print('===')

Output:

# [10, 13, 6, 12, 9, 4, 5, 1, 2, 11] [8, 7, 3]
# ===
# [8, 7, 3, 12, 9, 4, 5, 1, 2, 11] [10, 13, 6]
# ===
# [8, 7, 3, 10, 13, 6, 5, 1, 2, 11] [12, 9, 4]
# ===
# [8, 7, 3, 10, 13, 6, 12, 9, 4, 2, 11] [5, 1]
# ===
# [8, 7, 3, 10, 13, 6, 12, 9, 4, 5, 1] [2, 11]
# ===
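As a sanity check on the output above, the following sketch verifies that no patient appears in both the train and the test side of a fold, and that every patient is held out exactly once. The fold data is copied from the Example 1 output; the helper `check_folds` is a hypothetical name introduced here, not part of PatientKFold.

```python
# Folds copied from the Example 1 output: (train_patients, test_patients).
folds = [
    ([10, 13, 6, 12, 9, 4, 5, 1, 2, 11], [8, 7, 3]),
    ([8, 7, 3, 12, 9, 4, 5, 1, 2, 11], [10, 13, 6]),
    ([8, 7, 3, 10, 13, 6, 5, 1, 2, 11], [12, 9, 4]),
    ([8, 7, 3, 10, 13, 6, 12, 9, 4, 2, 11], [5, 1]),
    ([8, 7, 3, 10, 13, 6, 12, 9, 4, 5, 1], [2, 11]),
]
all_patients = set(range(1, 14))


def check_folds(folds, all_patients):
    """Raise AssertionError on leakage or if test folds do not partition the patients."""
    seen_in_test = []
    for train, test in folds:
        # A patient on both sides of a split would be a leak.
        assert not set(train) & set(test), "patient in both train and test"
        # Together, train and test must cover every patient.
        assert set(train) | set(test) == all_patients, "patient missing from fold"
        seen_in_test.extend(test)
    # Every patient is held out exactly once across the folds.
    assert sorted(seen_in_test) == sorted(all_patients)
    return True


if __name__ == "__main__":
    print(check_folds(folds, all_patients))
```

The same check can be applied to the pairs yielded by a PatientKFold instance built from a list of IDs, since each iteration yields a (train, test) pair as in Example 1.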

The second example splits a pd.DataFrame into 5 folds, using col_patient_id to indicate which column holds the patient ID.

Example 2:

from PatientKFold import PatientKFold
import pandas as pd

patient_df = {
    'patient': [1,2,2,3,4,5,5,5],
    'other_columns': ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h']
} 

patient_df = pd.DataFrame.from_dict(patient_df)

# patient_df =
#       patient | other_columns
# 0        1             a
# 1        2             b
# 2        2             c
# 3        3             d
# 4        4             e
# 5        5             f
# 6        5             g
# 7        5             h

p = PatientKFold(patient_df, col_patient_id='patient', random_state=42)

for n_fold, (train_patients, test_patients) in enumerate(p):
    print('===')
    print(f'FOLD: {n_fold}')
    print("train_patients:")
    print(train_patients)
    print("test_patients:")
    print(test_patients)

Output:

# FOLD: 0
# train_patients:
#       patient | other_columns
# 0        1             a
# 1        2             b
# 2        2             c
# 3        3             d
# 5        5             f
# 6        5             g
# 7        5             h
# test_patients:
#       patient | other_columns
# 4        4             e
# ===
#
# FOLD: 1
# train_patients:
#       patient | other_columns
# 0        1             a
# 3        3             d
# 4        4             e
# 5        5             f
# 6        5             g
# 7        5             h
# test_patients:
#       patient | other_columns
# 1        2             b
# 2        2             c
# ===
#
# FOLD: 2
# train_patients:
#       patient | other_columns
# 0        1             a
# 1        2             b
# 2        2             c
# 4        4             e
# 5        5             f
# 6        5             g
# 7        5             h
# test_patients:
#       patient | other_columns
# 3        3             d
# ===
#
# FOLD: 3
# train_patients:
#       patient | other_columns
# 0        1             a
# 1        2             b
# 2        2             c
# 3        3             d
# 4        4             e
# test_patients:
#       patient | other_columns
# 5        5             f
# 6        5             g
# 7        5             h
# ===
#
# FOLD: 4
# train_patients:
#       patient | other_columns
# 1        2             b
# 2        2             c
# 3        3             d
# 4        4             e
# 5        5             f
# 6        5             g
# 7        5             h
# test_patients:
#       patient | other_columns
# 0        1             a

Testing

In order to test this class, just run:

python -m unittest

Support

If you would like to see new functionality, have a suggestion on how to make the documentation clearer, or want to report a problem, you can open an issue here on GitHub or send me an e-mail at danilownunes@gmail.com.
