K fold cross validation Python code without sklearn

Alyssa 76 Published: 10/31/2024

K fold cross validation Python code without sklearn

Here is an example of how you can implement k-fold cross-validation in Python without using the scikit-learn library:

import numpy as np
from itertools import cycle
def kfold_cv(X, y, num_folds):
folds = list(cycle(range(num_folds)))
for fold in range(num_folds):
test_index = [i for i, x in enumerate(folds) if x == fold]
train_index = [i for i, x in enumerate(folds) if x != fold]
X_test = X[test_index]
y_test = y[test_index]
X_train = np.delete(X, test_index, axis=0)
y_train = np.delete(y, test_index, axis=0)
yield X_train, X_test, y_train, y_test
Example usage:
X = np.random.rand(100, 20)  # Assume you have your training data here
y = np.random.randint(2, size=100)
num_folds = 5  # You can change this to whatever number of folds you want
for train_X, test_X, train_y, test_y in kfold_cv(X, y, num_folds):
print("Training on fold {}.".format(train_y))
print("Testing on fold {}.".format(test_y))

In the above code:

We first import the necessary libraries: NumPy and itertools for handling cycles. The kfold_cv function takes in the input data X, target y, and number of folds num_folds. It then creates a list folds that cycles through all the possible fold numbers (0 to num_folds-1). For each fold, it identifies the test samples (test_index) and train samples (train_index) based on the current fold. It then selects the test samples from both X and y, as well as the training samples. The np.delete function is used to remove the test samples from the original arrays. The yield statement is used to return the train and test data for each fold, one at a time.

Example usage: In this example:

We first create some random training data (X) and target variable (y). You can replace these with your actual data. Set the number of folds you want to use (5 in this case). Call kfold_cv with your data and desired fold count. The function will then yield train and test data for each fold, which you can then use to train and evaluate your model.

Note that the above code is a simple example, and you may want to add more error checking and handling depending on your specific needs.

How to implement k-fold cross-validation from scratch in Python?

K-fold cross-validation is a widely used technique for evaluating the performance of machine learning models by splitting the data into multiple subsets, training and testing the model on each subset, and averaging the results. Here's how to implement k-fold cross-validation from scratch in Python:

Data Preparation: Load your dataset into a Pandas DataFrame or NumPy array. Make sure it's partitioned into features (X) and target variable (y). K-Fold Initialization: Define the number of folds (k). For example, let's set k = 5. Splitting Data: Split your dataset into k subsets using stratified sampling or random sampling with replacement. This will ensure that each subset has approximately the same distribution as the original data. Training Loop: Initialize an empty list to store the model performance metrics (e.g., accuracy, F1 score, etc.). Then, iterate over each fold: Train your machine learning model on k-1 folds. Use the remaining 1/fold of the data as test set. Evaluate the model's performance on this test set. Record the metric(s) you're interested in (e.g., accuracy, F1 score). Repeat this process for each fold. Averaging Performance: Calculate the average performance across all k folds. This gives you a robust estimate of your model's performance on unseen data.

Here's a sample implementation using Python:

import numpy as np
from sklearn.metrics import accuracy_score
def k_fold_cross_validation(X, y, k=5):
Initialize empty list to store performance metrics
scores = []
Split data into k subsets
indices = np.array_split(np.arange(len(y)), k)
for i in range(k):
Train model on k-1 folds
train_idx = np.concatenate((indices[0], *indices[1:i], *indices[i+1:]))
X_train, y_train = X[train_idx], y[train_idx]
model = SomeMachineLearningModel()
model.fit(X_train, y_train)
Use remaining 1/fold as test set
test_idx = indices[i]
X_test, y_test = X[test_idx], y[test_idx]
Evaluate model performance on this test set
y_pred = model.predict(X_test)
score = accuracy_score(y_test, y_pred)
scores.append(score)
Calculate average performance across all k folds
avg_score = np.mean(scores)
return avg_score
Example usage:
X = ...  # Load your dataset's features
y = ...  # Load your dataset's target variable
k = 5
avg_accuracy = k_fold_cross_validation(X, y, k)
print(f"Average accuracy across {k} folds: {avg_accuracy:.4f}")

Remember to replace SomeMachineLearningModel() with the actual machine learning model you're using (e.g., Logistic Regression, Random Forest, etc.).

This basic implementation assumes a simple accuracy metric. Feel free to modify it to accommodate your specific use case and performance metrics.