Handling Missing Values in a CSV File in Python
In Python, missing values in a CSV file can be handled using the pandas library, which provides methods such as fillna(), dropna(), and interpolate(). These methods replace, remove, or estimate missing values, respectively. In this tutorial, we will explore different techniques to handle missing values in a CSV file.
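Of the three methods named above, interpolate() is the one that estimates values rather than replacing or removing them. The following is a minimal sketch of how it behaves with its default linear method. The readings series is hypothetical and is not taken from data.csv.

```python
import numpy as np
import pandas as pd

# Hypothetical evenly spaced readings with two gaps (not from data.csv)
readings = pd.Series([10.0, np.nan, 30.0, np.nan, 50.0])

# The default linear method estimates each gap from its neighbours
estimated = readings.interpolate()

print(estimated.tolist())  # [10.0, 20.0, 30.0, 40.0, 50.0]
```

Linear interpolation only makes sense for ordered numeric data, such as measurements taken at regular intervals.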
Examples to Handle Missing Values in a CSV File
1. Detecting Missing Values in a CSV File
Before handling missing values, we need to identify where they exist. In this example, we will read a CSV file using pandas and check for missing values using isnull() and sum().
main.py
import pandas as pd
# Reading the CSV file
df = pd.read_csv("data.csv")
# Checking for missing values
missing_values = df.isnull().sum()
# Printing missing values count for each column
print("Missing values in each column:\n", missing_values)
Explanation:
- pd.read_csv("data.csv"): Reads the CSV file into a DataFrame.
- df.isnull(): Returns a DataFrame with True for missing values and False otherwise.
- df.isnull().sum(): Counts the number of missing values in each column.
- print(): Displays the missing values count for each column.
Output:
Missing values in each column:
Name 0
Age 2
Salary 1
City 3
dtype: int64
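To try the detection step without a data.csv on disk, the CSV can be supplied as text. The sketch below assumes hypothetical contents chosen so that the per-column counts match the output above. It also shows two related checks: the total number of missing cells and the rows that contain at least one gap.

```python
import io

import pandas as pd

# Hypothetical CSV content standing in for data.csv
csv_text = """Name,Age,Salary,City
Alice,30,50000,Delhi
Bob,,42000,
Carol,25,,Pune
Dave,,61000,
Eve,41,58000,
"""

# Empty fields are read as missing values (NaN)
df = pd.read_csv(io.StringIO(csv_text))

# Missing values per column
missing_per_column = df.isnull().sum()
print(missing_per_column)

# Total number of missing cells across the whole DataFrame
total_missing = int(missing_per_column.sum())
print("Total missing cells:", total_missing)

# Rows that contain at least one missing value
rows_with_gaps = df[df.isnull().any(axis=1)]
print(rows_with_gaps)
```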
2. Removing Rows with Missing Values
Sometimes it is necessary to remove rows that contain missing values, for example when incomplete records are not useful for the analysis. We use the dropna() method to achieve this.
main.py
# Removing rows with missing values
df_cleaned = df.dropna()
# Printing the cleaned DataFrame
print(df_cleaned)
Explanation:
- df.dropna(): Removes all rows containing at least one missing value.
- df_cleaned: Stores the cleaned DataFrame without missing values.
- print(df_cleaned): Displays the DataFrame after removing missing values.
Output:
(DataFrame output without missing values)
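Dropping every row that has any gap can discard a lot of data. As a sketch of two narrower options, dropna() also accepts subset (only look at certain columns) and thresh (minimum number of non-missing values a row needs to be kept). The DataFrame below is hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical data; the real data.csv is not shown in the tutorial
df = pd.DataFrame({
    "Name": ["Alice", "Bob", "Carol", "Dave"],
    "Age": [30, np.nan, 25, np.nan],
    "City": ["Delhi", None, "Pune", "Mumbai"],
})

# Drop a row only if 'Age' is missing; gaps in other columns are tolerated
has_age = df.dropna(subset=["Age"])

# Keep rows that have at least 2 non-missing values
mostly_complete = df.dropna(thresh=2)

print(has_age)
print(mostly_complete)
```

Both calls return a new DataFrame and leave df unchanged.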
3. Replacing Missing Values with a Default Value
Instead of removing rows, we can replace missing values with a default value using the fillna() method.
main.py
# Replacing missing values with a default value
df_filled = df.fillna("Unknown")
# Printing the updated DataFrame
print(df_filled)
Explanation:
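- df.fillna("Unknown"): Replaces all missing values with "Unknown". This applies to every column, so numeric columns that had gaps end up holding a mix of numbers and text.
- df_filled: Stores the updated DataFrame with replaced values.
- print(df_filled): Displays the DataFrame after replacing missing values.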
Output:
(DataFrame output with "Unknown" replacing missing values)
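A single default rarely suits every column. One possible refinement, sketched here with hypothetical data and hypothetical default values, is to pass fillna() a dictionary that maps column names to their own replacement values.

```python
import numpy as np
import pandas as pd

# Hypothetical data; the real data.csv is not shown in the tutorial
df = pd.DataFrame({
    "Name": ["Alice", "Bob", "Carol"],
    "Age": [30.0, np.nan, 25.0],
    "City": ["Delhi", None, "Pune"],
})

# Hypothetical per-column defaults: a number for 'Age', text for 'City'
defaults = {"Age": 0, "City": "Unknown"}
df_filled = df.fillna(value=defaults)

print(df_filled)
print(df_filled.dtypes)  # 'Age' remains a numeric column
```

Columns that are not listed in the dictionary are left untouched.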
4. Filling Missing Values with the Column Mean
When dealing with numerical data, filling missing values with the column mean is a common approach.
main.py
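# Filling missing values in the 'Age' column with its mean
df["Age"] = df["Age"].fillna(df["Age"].mean())
# Printing the updated DataFrame
print(df)
Explanation:
- df["Age"].mean(): Computes the mean of the ‘Age’ column, ignoring missing values.
- df["Age"] = df["Age"].fillna(df["Age"].mean()): Fills missing values in ‘Age’ with its mean and assigns the result back to the column. Assigning back is more reliable than calling fillna() with inplace=True on a selected column, which recent pandas versions warn about and which may leave df unchanged.
- print(df): Displays the DataFrame with missing values replaced by the mean.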
Output:
(DataFrame output with missing 'Age' values replaced by mean)
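When several numeric columns have gaps, the same idea can be applied to all of them at once. This is a minimal sketch with hypothetical data. It relies on fillna() accepting a Series keyed by column name, so text columns such as Name are left alone.

```python
import numpy as np
import pandas as pd

# Hypothetical data; the real data.csv is not shown in the tutorial
df = pd.DataFrame({
    "Name": ["Alice", "Bob", "Carol"],
    "Age": [30.0, np.nan, 20.0],
    "Salary": [50000.0, 40000.0, np.nan],
})

# Mean of each numeric column; non-numeric columns are skipped
column_means = df.mean(numeric_only=True)

# Each numeric column is filled with its own mean
df_filled = df.fillna(column_means)

print(df_filled)
```

The mean is sensitive to outliers, so the median (df.median(numeric_only=True)) may be a better fill value for skewed columns such as salaries.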
Conclusion
Handling missing values in a CSV file is essential for accurate data analysis. Here are the key techniques:
- Detecting missing values using isnull().sum().
- Removing missing values using dropna().
- Replacing missing values with a default value using fillna().
- Filling missing numerical values with the column mean.