How to split a Dataset into Train and Test Sets using Python
Last Updated: 18 Apr, 2025
One of the most important steps in preparing data for training a machine learning (ML) model is splitting the dataset into training and testing sets. This means dividing the data into two parts: one to train the model (the training set), and another to evaluate how well it performs on unseen data (the testing set). The training set is used to fit the model, so the model sees both its features and its known target values. The test set is held back and used only to generate predictions, which are then compared with the true values to estimate how well the model generalizes.
We'll see how to split a dataset into train and test sets using Python. We'll use the scikit-learn library to perform the split efficiently. Whether you're working with numerical data, text, or images, this is an essential part of any supervised machine learning workflow.
Installation:
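The scikit-learn library can be installed using pip by running pip install scikit-learn from the command line.
Alternatively, it can be installed by following the instructions in the scikit-learn installation documentation.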
Dataset Splitting
Scikit-learn is one of the most widely used machine learning libraries in Python. It provides a range of tools for building models, pre-processing data, and evaluating performance. For splitting datasets, it provides a handy function called train_test_split() within the model_selection module, making it simple to divide your data into training and testing sets.
Syntax:
train_test_split(*arrays, test_size=None, train_size=None, random_state=None, shuffle=True, stratify=None)
Parameters:
- *arrays: The data you want to split. This can be lists, NumPy arrays, pandas DataFrames, or SciPy sparse matrices. All of them must have the same number of rows.
- test_size: Either a float between 0.0 and 1.0 giving the proportion of the data that goes into the test set (for example, 0.2 means 20% of the data is used for testing), or an integer giving the absolute number of test samples. If both test_size and train_size are left as None, it defaults to 0.25.
- train_size: Either a float between 0.0 and 1.0 giving the proportion of the data that goes into the training set, or an integer giving the absolute number of training samples. If not set, it is calculated as the complement of test_size.
- random_state: An integer seed for the shuffle. Passing the same value makes the split identical every time you run the code.
- shuffle: If True (the default), the data is shuffled before splitting, so the split does not depend on the original row order.
- stratify: If set to an array of class labels, the split keeps the same class distribution in both the train and test sets. This is especially useful for classification problems.
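The effect of these parameters can be seen on a small toy dataset. The following is a minimal sketch, not part of the original example: the arrays X and y are made up for illustration (20 samples with an imbalanced binary label), and the seed 42 is an arbitrary choice.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data, made up for illustration: 20 samples, 2 features,
# and an imbalanced binary label (16 zeros, 4 ones).
X = np.arange(40).reshape(20, 2)
y = np.array([0] * 16 + [1] * 4)

# test_size=0.25 sends 5 of the 20 samples to the test set.
# stratify=y keeps the 80/20 class ratio in both parts.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42, stratify=y
)
print(X_train.shape, X_test.shape)  # (15, 2) (5, 2)
print(np.bincount(y_train))         # [12  3]
print(np.bincount(y_test))          # [4 1]

# Repeating the call with the same random_state gives the same split.
_, X_test_again, _, _ = train_test_split(
    X, y, test_size=0.25, random_state=42, stratify=y
)
print(np.array_equal(X_test, X_test_again))  # True

# With shuffle=False the original order is kept, so the test set is
# simply the last 5 rows (stratify must stay None in this case).
_, X_test_ordered, _, _ = train_test_split(
    X, y, test_size=0.25, shuffle=False
)
print(np.array_equal(X_test_ordered, X[15:]))  # True
```

Without stratify, a test set of only 5 samples could easily end up with no examples of the minority class, which is why stratification matters most for small or imbalanced datasets.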
Example
Let us take a sample dataset, available as a CSV file, and split it.

In the example, we first import pandas and sklearn. Then, we load the CSV file using the read_csv() function, which stores the data in a DataFrame called df. We want to predict the house price, which is in the last column, so we set that as y (the target). All the other columns are used as features, stored in X.
We use train_test_split() to split the data:
- test_size=0.05 means 5% of the data is used for testing, and 95% for training.
- random_state=0 ensures the split is the same every time we run the code.
Python
# import modules
import pandas as pd
from sklearn.model_selection import train_test_split

# read the dataset
df = pd.read_csv('Real-estate.csv')

# separate the features (all columns but the last) from the target (last column)
X = df.iloc[:, :-1]
y = df.iloc[:, -1]

# split the dataset: 95% for training, 5% for testing
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.05, random_state=0)
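To check the result of a split like this without the CSV file at hand, the same steps can be run on a stand-in DataFrame. This is only a sketch under stated assumptions: the column names (house_age, distance_to_station, price) and the random values are hypothetical and are not taken from Real-estate.csv.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical stand-in for Real-estate.csv: 100 rows of random values.
# Column names are assumptions made for this illustration only.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "house_age": rng.uniform(0, 40, size=100),
    "distance_to_station": rng.uniform(50, 5000, size=100),
    "price": rng.uniform(10, 80, size=100),  # target in the last column
})

# Same slicing as in the example: features first, target last.
X = df.iloc[:, :-1]
y = df.iloc[:, -1]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.05, random_state=0)

# 5% of 100 rows is 5 test rows, leaving 95 for training.
print(X_train.shape, X_test.shape)  # (95, 2) (5, 2)
print(y_train.shape, y_test.shape)  # (95,) (5,)

# No row appears in both sets, and features stay aligned with targets.
print(set(X_train.index).isdisjoint(X_test.index))  # True
print(X_test.index.equals(y_test.index))            # True
```

Printing the shapes after splitting is a quick sanity check that the proportions are what you intended before any model is trained.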
Hence, we have split our dataset into training and testing sets. To learn more about improving your machine learning workflow, you may explore:
- Stratified sampling
- Cross validation
- Handling imbalanced datasets
- Pre-processing before splitting
- Machine Learning Models