Parquet Files: What They Are and When to Use Them Instead of CSV
CSVs are everywhere, but they're slow, bloated, and lossy. Parquet is the format that data engineers actually use — and it's easier to work with than you think.
You’ve been emailing CSVs around. You’ve waited for a 2GB file to load into a spreadsheet. You’ve watched column types get mangled every time you reopen a file.
There’s a better way. It’s called Parquet, and once you understand it, you’ll wonder why anyone still uses CSV for anything serious.
What’s Wrong with CSV
- Every time you open it, the program has to guess what your columns mean
- Dates become strings. Numbers become text. Booleans become chaos
- A 500MB CSV might be 50MB as Parquet — same data, fraction of the size
- Want to read just one column out of fifty? Too bad — CSV makes you load the whole thing
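The type-mangling point is easy to see for yourself. Here's a minimal sketch using pandas (the column names are invented for illustration): a date column that is a real datetime in memory comes back as plain strings after a round trip through CSV, because CSV has nowhere to store the type.

```python
import io
import pandas as pd

# A tiny frame with a real date column
df = pd.DataFrame({
    "signup": pd.to_datetime(["2024-01-05", "2024-02-10"]),
    "active": [True, False],
})

# Round-trip through CSV: types aren't stored, so they must be re-guessed
buf = io.StringIO()
df.to_csv(buf, index=False)
buf.seek(0)
back = pd.read_csv(buf)

print(df.dtypes["signup"])    # datetime64[ns]
print(back.dtypes["signup"])  # object -- the dates came back as plain strings
```

You can recover the dates with `parse_dates=["signup"]`, but you have to remember to do it every single time the file is opened, by every tool and every person.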
What Makes Parquet Different
Parquet is a columnar file format. Instead of storing data row by row (like CSV), it stores it column by column. That means:
- It remembers your types — integers stay integers, dates stay dates
- It compresses beautifully — often 5-10x smaller than equivalent CSVs
- It’s fast to query — tools can read just the columns they need without touching the rest
- It’s the standard — Spark, BigQuery, Snowflake, DuckDB, Pandas — everything speaks Parquet
What We’ll Do
- Compare loading the same dataset as CSV vs Parquet and see the difference
- Convert a messy CSV into a clean Parquet file
- Query a Parquet file directly with DuckDB and Python
- Explore a multi-gigabyte dataset that would be painful as CSV but is instant as Parquet
- Talk about when to use Parquet, when CSV is fine, and how to make the switch
Who This Is For
Anyone who works with data files. If you’ve ever emailed a CSV, downloaded a spreadsheet export, or waited too long for a file to load — this is for you. Some Python familiarity helps but isn’t required.
Date and time coming soon — join the Meetup group to get notified.