ratiopath.model_selection.split
ArrayLike = np.typing.ArrayLike
Float = float | np.float16 | np.float32 | np.float64
Int = int | np.int8 | np.int16 | np.int32 | np.int64
MatrixLike = np.ndarray | pd.DataFrame | spmatrix
StratifiedGroupShuffleSplit
Bases: GroupsConsumerMixin, BaseShuffleSplit
Stratified shuffle split with non-overlapping groups.
Provides train/test indices to split data such that both stratification (preserving class distribution) and grouping (non-overlapping groups between splits) are maintained.
This splitter combines the functionality of StratifiedShuffleSplit and GroupShuffleSplit. It attempts to create splits that preserve the percentage of samples from each class while ensuring that samples from the same group never appear in both the train and test sets.
Read more in the User Guide (cross_validation).
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| n_splits | Int | Number of re-shuffling and splitting iterations. | 5 |
| test_size | None \| Float | If float, should be between 0.0 and 1.0 and represent the proportion of the dataset to include in the test split. If int, represents the absolute number of test samples. If None, the value is set to the complement of the train size. | None |
| train_size | None \| Float | If float, should be between 0.0 and 1.0 and represent the proportion of the dataset to include in the train split. If int, represents the absolute number of train samples. If None, the value is automatically set to the complement of the test size. | None |
| random_state | RandomState \| None \| Int | Controls the randomness of the training and testing indices. Pass an int for reproducible output across multiple function calls. See :term: | None |
Examples:
>>> import numpy as np
>>> from ratiopath.model_selection import StratifiedGroupShuffleSplit
>>> X = np.array([[1, 2], [3, 4], [5, 6], [7, 8], [9, 10], [11, 12]])
>>> y = np.array([0, 0, 1, 1, 0, 1])
>>> groups = np.array([1, 1, 2, 2, 3, 3])
>>> sgss = StratifiedGroupShuffleSplit(n_splits=2, random_state=42)
>>> for train_index, test_index in sgss.split(X, y, groups):
...     print(f"Train: {train_index}, Test: {test_index}")
Train: [0 1 2 3], Test: [4 5]
Train: [2 3 4 5], Test: [0 1]
Notes
The implementation finds the best stratification split by trying multiple splits and selecting the one that minimizes the difference between the class distributions in the original data and the test split.
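The selection idea described in this note can be sketched with NumPy alone. This is a minimal illustration, not the code in ratiopath/model_selection/split.py: the helper names `class_proportions` and `best_group_split`, the `n_trials` and `test_fraction` parameters, the rule for how many groups to hold out, and the use of an L1 distance are all assumptions made for the example. It expects at least two distinct groups.

```python
import numpy as np


def class_proportions(labels, classes):
    """Return the fraction of samples belonging to each class."""
    counts = np.array([(labels == c).sum() for c in classes], dtype=float)
    return counts / max(len(labels), 1)


def best_group_split(y, groups, test_fraction=0.33, n_trials=20, seed=0):
    """Try several random group-level splits and keep the one whose test
    class distribution is closest to the overall class distribution."""
    y = np.asarray(y)
    groups = np.asarray(groups)
    rng = np.random.default_rng(seed)
    classes = np.unique(y)
    unique_groups = np.unique(groups)
    # Assumed sizing rule: hold out a fraction of the groups, at least one.
    n_test_groups = max(1, int(round(test_fraction * len(unique_groups))))
    target = class_proportions(y, classes)

    best_score, best_split = np.inf, None
    for _ in range(n_trials):
        # Whole groups go to the test side, so groups never straddle the split.
        test_groups = rng.choice(unique_groups, size=n_test_groups, replace=False)
        test_mask = np.isin(groups, test_groups)
        # Assumed metric: L1 distance between overall and test proportions.
        score = np.abs(class_proportions(y[test_mask], classes) - target).sum()
        if score < best_score:
            best_score = score
            best_split = (np.flatnonzero(~test_mask), np.flatnonzero(test_mask))
    return best_split, best_score


y = np.array([0, 0, 1, 1, 0, 1])
groups = np.array([1, 1, 2, 2, 3, 3])
(train_index, test_index), score = best_group_split(y, groups)
print(train_index, test_index, score)
```

Because candidates are drawn at the group level, every candidate already satisfies the grouping constraint; the search only trades off how well the class balance is preserved.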
Source code in ratiopath/model_selection/split.py
__init__(n_splits=5, *, test_size=None, train_size=None, random_state=None)
Source code in ratiopath/model_selection/split.py
split(X, y=None, groups=None)
Generate indices to split data into training and test set.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| X | list[str] \| MatrixLike | Training data, where | required |
| y | ArrayLike \| None | The target variable for supervised learning problems. Stratification is done based on the y labels. | None |
| groups | Any | Group labels for the samples used while splitting the dataset into train and test set. Must be provided. | None |
Yields:
| Name | Type | Description |
|---|---|---|
| train | Any | The training set indices for that split. |
| test | Any | The testing set indices for that split. |
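Each yielded pair can be checked against the two guarantees this splitter aims for. The helper below is a hypothetical utility, not part of ratiopath; it only assumes that the indices are integer positions into `y` and `groups`. The sample indices are taken from the first split printed in the class example.

```python
import numpy as np


def describe_split(train_idx, test_idx, y, groups):
    """Summarise one (train, test) index pair: groups shared by both sides
    and the share of each class on each side."""
    y = np.asarray(y)
    groups = np.asarray(groups)
    classes = np.unique(y)
    shared = np.intersect1d(groups[train_idx], groups[test_idx])
    return {
        "shared_groups": shared.tolist(),
        "train_class_share": {
            c.item(): float(np.mean(y[train_idx] == c)) for c in classes
        },
        "test_class_share": {
            c.item(): float(np.mean(y[test_idx] == c)) for c in classes
        },
    }


y = np.array([0, 0, 1, 1, 0, 1])
groups = np.array([1, 1, 2, 2, 3, 3])
summary = describe_split(np.array([0, 1, 2, 3]), np.array([4, 5]), y, groups)
print(summary)
# {'shared_groups': [], 'train_class_share': {0: 0.5, 1: 0.5},
#  'test_class_share': {0: 0.5, 1: 0.5}}
```

An empty `shared_groups` list confirms that no group leaks across the split, and matching class shares indicate that stratification held for that split.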
Source code in ratiopath/model_selection/split.py
train_test_split(*arrays, test_size=None, train_size=None, random_state=None, shuffle=True, stratify=None, groups=None)
Split arrays or matrices into random train and test subsets.
This is an extended version of sklearn.model_selection.train_test_split that
adds support for stratified splits with non-overlapping groups. When both
stratify and groups are provided, uses StratifiedGroupShuffleSplit to
ensure both class distributions and group separation are preserved.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| \*arrays | Any | Sequence of indexables with the same length / shape[0]. Allowed inputs are lists, numpy arrays, scipy-sparse matrices or pandas dataframes. | () |
| test_size | None \| Float | If float, should be between 0.0 and 1.0 and represent the proportion of the dataset to include in the test split. If int, represents the absolute number of test samples. If None, the value is set to the complement of the train size. If | None |
| train_size | None \| Float | If float, should be between 0.0 and 1.0 and represent the proportion of the dataset to include in the train split. If int, represents the absolute number of train samples. If None, the value is automatically set to the complement of the test size. | None |
| random_state | RandomState \| None \| Int | Controls the randomness of the training and testing indices. Pass an int for reproducible output across multiple function calls. See :term: | None |
| shuffle | bool | Whether or not to shuffle the data before splitting. If False, stratify must be None. | True |
| stratify | None \| ArrayLike | If not None, data is split in a stratified fashion, using this as the class labels. For binary or multiclass classification, this ensures that the test and training sets have approximately the same percentage of samples of each target class as the complete set. | None |
| groups | None \| ArrayLike | Group labels for the samples used while splitting the dataset into train and test set. When provided with | None |
Returns:
| Name | Type | Description |
|---|---|---|
| splitting | list | List containing train-test split of inputs. If |
Examples:
>>> import numpy as np
>>> from ratiopath.model_selection import train_test_split
>>> X = np.array([[1, 2], [3, 4], [5, 6], [7, 8]])
>>> y = np.array([0, 0, 1, 1])
>>> groups = np.array([1, 1, 2, 2])
>>> X_train, X_test, y_train, y_test = train_test_split(
...     X, y, test_size=0.25, random_state=42, stratify=y, groups=groups
... )
>>> X_train
array([[1, 2],
       [5, 6],
       [7, 8]])
>>> X_test
array([[3, 4]])
Notes
When shuffle=True and both stratify and groups are provided, uses
StratifiedGroupShuffleSplit to split the data, ensuring that:
- The class distribution is preserved in train and test sets
- No group appears in both train and test sets
When only one of stratify or groups is provided, uses the appropriate
single-constraint splitter.
When shuffle=False, a stratified split is not supported and stratify
must be None.
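The rules in these notes amount to a small decision table, sketched below as a pure function. This is an illustration only: `choose_strategy` is a hypothetical name, the labels returned for the single-constraint and unconstrained cases are placeholders rather than confirmed class names, and raising `ValueError` is an assumption (the notes only state that stratify must be None). The notes do not say how groups is handled when shuffle=False, so the sketch does not model that case.

```python
def choose_strategy(shuffle=True, stratify=None, groups=None):
    """Return a label for the splitting strategy implied by the arguments."""
    if not shuffle:
        if stratify is not None:
            # Assumed error type; the notes only require stratify to be None.
            raise ValueError("stratify must be None when shuffle=False")
        # Behaviour of groups without shuffling is not described in the notes.
        return "sequential split"
    if stratify is not None and groups is not None:
        return "StratifiedGroupShuffleSplit"
    if stratify is not None:
        return "stratified splitter"  # placeholder label
    if groups is not None:
        return "group splitter"  # placeholder label
    return "plain shuffle split"  # placeholder label


print(choose_strategy(stratify=[0, 0, 1, 1], groups=[1, 1, 2, 2]))
# StratifiedGroupShuffleSplit
```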
Source code in ratiopath/model_selection/split.py