Read Large Files Efficiently in Python
To read large files efficiently in Python, use memory-efficient techniques such as reading the file line by line with open() and readline(), reading the file in fixed-size chunks with read(), or using libraries like csv and pandas for structured data. These approaches keep memory consumption low while processing large files.
Examples
1. Reading a Large File Line by Line
Reading a file line by line prevents loading the entire file into memory, which is useful for large files.

main.py
# Open the file in read mode
with open("sample.txt", "r") as file:
    for line in file:
        print('Line:', line.strip())  # Process the line
Explanation:
Here, we open the file using with open("sample.txt", "r"), which ensures the file is properly closed after the block finishes. The for loop reads one line at a time, keeping memory usage low. The strip() method removes surrounding whitespace and the trailing newline character.
Output:
Each line of sample.txt is printed, prefixed with Line: and stripped of surrounding whitespace.
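The readline() method mentioned in the introduction offers the same line-at-a-time behavior with explicit control over each read, which is handy if you want to stop early. A minimal sketch, reusing the sample.txt placeholder from the example above:
# Read lines one at a time with readline(); an empty string signals the end of the file
with open("sample.txt", "r") as file:
    line = file.readline()
    while line:
        print('Line:', line.strip())  # Process the line
        line = file.readline()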
2. Reading a Large File in Chunks
Reading a file in chunks is useful for binary files or when processing large text data without loading everything into memory.
main.py
# Open the file and read it in fixed-size chunks
chunk_size = 1024  # Read 1KB at a time
chunk_number = 1
with open("large_file.txt", "r") as file:
    while True:
        chunk = file.read(chunk_size)
        if not chunk:
            break
        print(f"Chunk {chunk_number}:", chunk)
        chunk_number += 1
Explanation:
We define chunk_size = 1024 to read 1KB at a time. file.read(chunk_size) reads and processes the data in small pieces, and the while True loop keeps reading until an empty chunk (not chunk) is returned, indicating the end of the file. The chunk_number counter simply labels each chunk in the output.
Output:
Chunk 1: Data...
Chunk 2: More Data...
Chunk 3: Even More Data...
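Chunked reading is especially useful for binary files, where the file is opened in "rb" mode and read() returns bytes. As a sketch of that use case, the following computes a SHA-256 checksum of a hypothetical large binary file (large_file.bin is just a placeholder name) without loading it into memory:
import hashlib

# Hash a large binary file chunk by chunk
chunk_size = 1024 * 1024  # 1 MB per chunk
sha256 = hashlib.sha256()
with open("large_file.bin", "rb") as file:
    while True:
        chunk = file.read(chunk_size)
        if not chunk:
            break
        sha256.update(chunk)  # Feed each chunk into the running hash
print("SHA-256:", sha256.hexdigest())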
3. Using the csv Module for Large CSV Files
For structured text data, the csv module allows efficient row-by-row reading.
main.py
import csv

# Open the CSV file and read it row by row
with open("large_file.csv", "r", newline="") as file:  # newline="" is recommended with the csv module
    reader = csv.reader(file)
    for row in reader:
        print(row)
Explanation:
We import the csv module and use csv.reader(file) to process the CSV file row by row, which prevents excessive memory usage and handles large datasets efficiently.
Output:
['ID', 'Name', 'Age']
['1', 'Alice', '25']
['2', 'Bob', '30']
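If the CSV file has a header row, csv.DictReader reads rows in the same streaming fashion but returns each row as a dictionary keyed by column name. A minimal sketch, assuming the same hypothetical large_file.csv with ID, Name and Age columns:
import csv

# Read rows as dictionaries keyed by the header row
with open("large_file.csv", "r", newline="") as file:
    reader = csv.DictReader(file)
    for row in reader:
        print(row["Name"], row["Age"])  # Access fields by column name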
4. Using pandas for Large CSV Files
The pandas library provides a memory-efficient way to process large CSV files using the chunksize parameter.
main.py
import pandas as pd

# Read the CSV file in chunks
chunk_size = 1000  # Process 1000 rows at a time
for chunk in pd.read_csv("large_file.csv", chunksize=chunk_size):
    print(chunk.head())  # Process the chunk
Explanation:
We define chunk_size = 1000 to load 1000 rows at a time. With chunksize set, pd.read_csv("large_file.csv", chunksize=chunk_size) returns an iterable of DataFrames, so the data is processed chunk by chunk instead of loading the entire file at once.
Output:
   ID   Name  Age
0   1  Alice   25
1   2    Bob   30
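In practice, chunked reading is paired with an aggregation that is updated per chunk, so only one chunk is in memory at a time. Below is a minimal sketch, assuming the hypothetical large_file.csv has a numeric Age column as in the output above:
import pandas as pd

# Accumulate a running total and row count across chunks to compute a mean
total_age = 0
row_count = 0
for chunk in pd.read_csv("large_file.csv", chunksize=1000):
    total_age += chunk["Age"].sum()
    row_count += len(chunk)

print("Average age:", total_age / row_count)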
Conclusion
To efficiently read large files in Python, choose an approach based on the file type:
- Line-by-Line Reading: Best for large text files without structure.
- Chunk-Based Reading: Ideal for binary and text files needing memory efficiency.
- csv.reader(): Best for structured CSV files.
- pandas.read_csv() with chunksize: Optimal for large structured datasets.