Are you a beginner whose system or application crashes and runs out of memory every time you load a huge dataset?
Worry not. This brief guide will show you how to handle large datasets in Python like a pro.
Every data professional, beginner or expert, has run into the dreaded Pandas memory error. It happens when your dataset is too large for Pandas to load at once: RAM usage spikes to 99%, and suddenly the IDE crashes. Beginners assume they need a more powerful computer, but the pros know that performance is about working smarter, not harder.
So, what is the real solution? It is about loading only what is necessary rather than everything. This article explains how to work with large datasets in Python.
Common Techniques to Handle Large Datasets
Here are some common techniques you can use when a dataset is too large for Pandas, so you get the most out of the data without crashing the system.
1. Master the Art of Memory Optimization
The first thing a real data science expert does is change how they use their tool, not the tool itself. By default, Pandas is a memory-intensive library that assigns 64-bit types even where 8-bit types would be sufficient.
So, what do you need to do?
- Downcast numerical types – a column of integers ranging from 0 to 100 doesn't need int64 (8 bytes). Converting it to int8 (1 byte) reduces that column's memory footprint by 87.5%.
- Categorical advantage – if a column has millions of rows but only ten unique values, convert it to the category dtype. It replaces bulky strings with small integer codes.
# Pro Tip: Optimize on the fly
import pandas as pd

df['status'] = df['status'].astype('category')
df['age'] = pd.to_numeric(df['age'], downcast='integer')
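To see how much you actually saved, you can compare Pandas' memory report before and after the conversion. Here is a minimal, self-contained sketch (the column values are made up purely for illustration):

import pandas as pd

# Small illustrative DataFrame standing in for your real data
df = pd.DataFrame({
    'status': ['active', 'inactive', 'pending'] * 100_000,
    'age': [25, 40, 33] * 100_000,
})

print(df.memory_usage(deep=True).sum(), 'bytes before optimization')

df['status'] = df['status'].astype('category')
df['age'] = pd.to_numeric(df['age'], downcast='integer')

print(df.memory_usage(deep=True).sum(), 'bytes after optimization')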
2. Reading Data in Bits and Pieces
One of the easiest ways to explore a large dataset in Python is to process it in smaller pieces rather than loading the entire file at once.
In this example, let us try to find the total revenue from a large dataset. You need to use the following code:
import pandas as pd

# Define chunk size (number of rows per chunk)
chunk_size = 100000
total_revenue = 0

# Read and process the file in chunks
for chunk in pd.read_csv('large_sales_data.csv', chunksize=chunk_size):
    # Process each chunk
    total_revenue += chunk['revenue'].sum()

print(f"Total Revenue: ${total_revenue:,.2f}")
This holds only 100,000 rows in memory at a time, irrespective of how large the dataset is. Even if there are 10 million rows, Pandas loads 100,000 rows at a time, and each chunk's sum is added to the running total.
This technique works best for aggregations or filtering on large files.
3. Switch to Modern File Formats like Parquet & Feather
Pros use Apache Parquet. CSVs are row-based text files, so the computer has to read every row in full even when it only needs one column. Apache Parquet is a column-based storage format, which means that if you only need 3 columns out of 100, the system only touches the data for those 3.
It also comes with built-in compression that can shrink a 1 GB CSV to around 100 MB without losing a single row of data.
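Here is a minimal sketch of that workflow, reusing the sales file from the chunking example ('revenue' appears there; 'date' is a placeholder column name). Pandas' Parquet support requires pyarrow or fastparquet to be installed:

import pandas as pd

# One-time conversion: read the CSV once and store it as compressed Parquet
df = pd.read_csv('large_sales_data.csv')
df.to_parquet('large_sales_data.parquet', compression='snappy')

# Later reads only touch the columns you actually ask for
sales = pd.read_parquet('large_sales_data.parquet', columns=['date', 'revenue'])
print(sales.head())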
4. Filtering During Reading
In most scenarios, you only need a subset of the rows. In such cases, loading everything is not the right option; instead, filter during the load process.
Here is an example that keeps only the transactions from 2024:
import pandas as pd

# Read in chunks and filter
chunk_size = 100000
filtered_chunks = []

for chunk in pd.read_csv('transactions.csv', chunksize=chunk_size):
    # Filter each chunk before storing it
    filtered = chunk[chunk['year'] == 2024]
    filtered_chunks.append(filtered)

# Combine the filtered chunks
df_2024 = pd.concat(filtered_chunks, ignore_index=True)
print(f"Loaded {len(df_2024)} rows from 2024")
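The same idea applies to columns: pd.read_csv can skip columns at load time via usecols, so unneeded columns never reach memory. A short sketch, assuming the transactions file has 'year' and 'amount' columns ('amount' is a placeholder):

import pandas as pd

# Parse only the columns you need; the rest of each row is discarded while reading
slim = pd.read_csv('transactions.csv', usecols=['year', 'amount'])
print(slim.head())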
5. Using Dask for Parallel Processing
Dask provides a Pandas-like API for huge datasets and handles tasks like chunking and parallel processing automatically.
Here is a simple example of using Dask to calculate the average of a column:
import dask.dataframe as dd

# Read with Dask (it handles chunking automatically)
df = dd.read_csv('huge_dataset.csv')

# Operations look just like pandas
result = df['sales'].mean()

# Dask is lazy – compute() actually executes the calculation
average_sales = result.compute()
print(f"Average Sales: ${average_sales:,.2f}")
Dask creates a plan to process data in small pieces instead of loading the entire file into memory. This tool can also use multiple CPU cores to speed up computation.
Here is a summary of when you can use these techniques:
| Technique | When to Use | Key Benefit |
| --- | --- | --- |
| Downcasting Types | When you have numerical data that fits in smaller ranges (e.g., ages, ratings, IDs). | Reduces memory footprint by up to 80% without losing data. |
| Categorical Conversion | When a column has repetitive text values (e.g., "Gender," "City," or "Status"). | Dramatically speeds up sorting and shrinks string-heavy DataFrames. |
| Chunking (chunksize) | When your dataset is larger than your RAM, but you only need a sum or average. | Prevents "Out of Memory" crashes by keeping only a slice of data in RAM at a time. |
| Parquet / Feather | When you frequently read/write the same data or only need specific columns. | Columnar storage allows the CPU to skip unneeded data and saves disk space. |
| Filtering During Load | When you only need a specific subset (e.g., "Current Year" or "Region X"). | Saves time and memory by never loading the irrelevant rows into Python. |
| Dask | When your dataset is massive (multi-GB/TB) and you need multi-core speed. | Automates parallel processing and handles data larger than your local memory. |
Conclusion
Remember, handling large datasets shouldn't be a complex task, even for beginners. You also don't need a very powerful computer to load and analyze huge datasets. With these common techniques, you can handle large datasets in Python like a pro, and the summary table above tells you which technique fits which scenario. Practice these techniques regularly with sample datasets, and consider earning top data science certifications to learn these methodologies properly. Work smarter, and you can make the most of your datasets with Python without breaking a sweat.
