Parquet Files: What They Are and When to Use Them Instead of CSV
CSVs are everywhere, but they're slow, bloated, and lossy. Parquet is the format that data engineers actually use — and it's easier to work with than you think.
You’ve been emailing CSVs around. You’ve waited for a 2GB file to load into a spreadsheet. You’ve watched column types get mangled every time you reopen a file.
There’s a better way. It’s called Parquet, and once you understand it, you’ll wonder why anyone still uses CSV for anything serious.
What’s Wrong with CSV
- Every time you open it, the program has to guess what your columns mean
- Dates become strings. Numbers become text. Booleans become chaos
- A 500MB CSV might be 50MB as Parquet — same data, fraction of the size
- Want to read just one column out of fifty? Too bad — CSV makes you load the whole thing
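The type-mangling point is easy to see for yourself. Here's a minimal sketch using pandas (the column names are invented for illustration): a date column that is a real datetime in memory comes back as plain strings after a round trip through CSV, because CSV has nowhere to store the type.

```python
import io
import pandas as pd

# A tiny frame with a real date column
df = pd.DataFrame({
    "signup": pd.to_datetime(["2024-01-05", "2024-02-10"]),
    "active": [True, False],
})

# Round-trip through CSV: types aren't stored, so they must be re-guessed
buf = io.StringIO()
df.to_csv(buf, index=False)
buf.seek(0)
back = pd.read_csv(buf)

print(df.dtypes["signup"])    # datetime64[ns]
print(back.dtypes["signup"])  # object -- the dates came back as plain strings
```

You can recover the dates with `parse_dates=["signup"]`, but you have to remember to do it every single time the file is opened, by every tool and every person.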
What Makes Parquet Different
Parquet is a columnar file format. Instead of storing data row by row (like CSV), it stores it column by column. That means:
- It remembers your types — integers stay integers, dates stay dates
- It compresses beautifully — often 5-10x smaller than equivalent CSVs
- It’s fast to query — tools can read just the columns they need without touching the rest
- It’s the standard — Spark, BigQuery, Snowflake, DuckDB, Pandas — everything speaks Parquet
What We’ll Do
- Compare loading the same dataset as CSV vs Parquet and see the difference
- Convert a messy CSV into a clean Parquet file
- Query a Parquet file directly with DuckDB and Python
- Explore a multi-gigabyte dataset that would be painful as CSV but is instant as Parquet
- Talk about when to use Parquet, when CSV is fine, and how to make the switch
Who This Is For
Anyone who works with data files. If you’ve ever emailed a CSV, downloaded a spreadsheet export, or waited too long for a file to load — this is for you. Some Python familiarity helps but isn’t required.
Date and time coming soon — join the Meetup group to get notified.