Read Large Files into a List in Python

To read large files into a list in Python efficiently, we can read line by line with with open(), call readlines() with a size hint to limit memory consumption, read the file in fixed-size chunks, or use pandas for structured data. Below are different ways to achieve this, with detailed explanations.


Examples

1. Reading a Large File Line by Line Using with open()

Using with open() and iterating over the file object allows reading large files without consuming excessive memory.

# Open the file and read line by line
file_path = "large_file.txt"

# Using with open() to handle file operations safely
data_list = []
with open(file_path, "r") as file:
    for line in file:
        data_list.append(line.strip())  # strip() removes surrounding whitespace, including the newline

# Print the first 5 lines for verification
print("First 5 lines:", data_list[:5])

Explanation:

  • open(file_path, "r"): Opens the file in read mode.
  • for line in file: Iterates through the file line by line.
  • line.strip(): Removes leading and trailing whitespace, including the newline character.
  • data_list.append(line.strip()): Stores the cleaned lines in a list.

Output:

First 5 lines: ['Line 1 content', 'Line 2 content', 'Line 3 content', 'Line 4 content', 'Line 5 content']
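
The same list can be built more compactly with a list comprehension over the file object. This is an equivalent sketch, assuming the same large_file.txt:

# Equivalent approach: build the list in a single expression
with open("large_file.txt", "r") as file:
    data_list = [line.strip() for line in file]

print("First 5 lines:", data_list[:5])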

2. Using readlines() with Memory Optimization

The readlines() method normally loads every line into a list at once, but it accepts an optional size hint: file.readlines(hint) reads complete lines until roughly hint bytes have been consumed, so each call returns a small batch.

# Open the file and read in chunks
file_path = "large_file.txt"

# Reading the file using readlines() with a size hint
data_list = []
with open(file_path, "r") as file:
    while True:
        lines = file.readlines(10000)  # Reads whole lines totaling roughly 10,000 bytes
        if not lines:
            break
        data_list.extend([line.strip() for line in lines])  # Process each chunk

# Print the first 5 lines
print("First 5 lines:", data_list[:5])

Explanation:

  • readlines(10000): Reads complete lines until about 10,000 bytes have been consumed, keeping each batch small.
  • if not lines: break: Stops reading when the end of the file is reached.
  • data_list.extend([line.strip() for line in lines]): Processes each chunk and stores clean lines in the list.

Output:

First 5 lines: ['Line 1 content', 'Line 2 content', 'Line 3 content', 'Line 4 content', 'Line 5 content']
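
If you only need a bounded number of lines rather than the whole file, itertools.islice combines well with this pattern. A minimal sketch, assuming only the first 1,000 lines are needed:

import itertools

# Take only the first 1,000 lines; the rest of the file is never read
with open("large_file.txt", "r") as file:
    first_lines = [line.strip() for line in itertools.islice(file, 1000)]

print("Lines read:", len(first_lines))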

3. Reading Large CSV Files with pandas

If you are working with structured data such as CSV, pandas offers a convenient way to read the file in manageable chunks.

import pandas as pd

# Define file path
file_path = "large_data.csv"

# Read the CSV file as an iterator of DataFrame chunks
chunk_iter = pd.read_csv(file_path, chunksize=10000)

# Convert the first chunk into a list
first_chunk = next(chunk_iter)  # Get the first 10,000-row chunk
list_data = first_chunk.values.tolist()  # Convert the DataFrame to a list of rows

# Print first 5 rows
print("First 5 rows:", list_data[:5])

Explanation:

  • pd.read_csv(file_path, chunksize=10000): Returns an iterator that yields the CSV file in 10,000-row DataFrame chunks, avoiding memory overflow.
  • next(chunk_iter): Retrieves the first chunk as a DataFrame.
  • first_chunk.values.tolist(): Converts the DataFrame rows into a list of lists.

Output:

First 5 rows: [['Row1_Column1', 'Row1_Column2'], ['Row2_Column1', 'Row2_Column2'], ...]
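
To load every row instead of just the first chunk, iterate over the reader and extend the list chunk by chunk. A sketch under the same assumptions (the hypothetical large_data.csv):

import pandas as pd

data_list = []
# Each iteration yields a 10,000-row DataFrame, so memory stays bounded
for chunk in pd.read_csv("large_data.csv", chunksize=10000):
    data_list.extend(chunk.values.tolist())

print("Total rows:", len(data_list))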

4. Reading a Large File in Fixed-Size Chunks

Reading a file in fixed-size chunks works for both binary and text files. Note that in text mode, read(n) returns up to n characters, and a chunk boundary may fall in the middle of a line or word.

# Open the file in read mode
file_path = "large_file.txt"

# Read file in fixed-size chunks
data_list = []
with open(file_path, "r") as file:
    while True:
        chunk = file.read(1024)  # Read up to 1024 characters at a time
        if not chunk:
            break
        data_list.append(chunk.strip())  # Store cleaned chunk

# Print first 5 elements
print("First 5 chunks:", data_list[:5])

Explanation:

  • file.read(1024): Reads up to 1024 characters at a time (or 1024 bytes in binary mode).
  • if not chunk: break: Stops when the end of the file is reached.
  • data_list.append(chunk.strip()): Stores each cleaned chunk in a list.

Output:

First 5 chunks: ['Chunk1 data...', 'Chunk2 data...', 'Chunk3 data...', 'Chunk4 data...', 'Chunk5 data...']
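
The same loop works for binary data by opening the file in "rb" mode, where read(1024) returns up to 1024 bytes per call. A minimal sketch, assuming a hypothetical large_file.bin; stripping is skipped here because trimming bytes could corrupt binary content:

# Read a binary file in 1 KB chunks
data_chunks = []
with open("large_file.bin", "rb") as file:
    while True:
        chunk = file.read(1024)  # Returns a bytes object of up to 1024 bytes
        if not chunk:
            break
        data_chunks.append(chunk)

print("Chunks read:", len(data_chunks))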

Conclusion

To efficiently read large files into a list, you can use:

  • Line-by-line reading: Memory efficient for text files.
  • readlines() with a size hint: Controls memory usage.
  • pandas.read_csv(): Best for structured data.
  • Fixed-size chunks: Works well for binary files.