Leuven.cool Citizen Science Urban Weather Station Network¶
Category: Meteorology · Size: 12.9 GB · Format: CSV, CSV.gz · License: CC-BY-NC-4.0 (Non-commercial) · Zenodo record · Data sheet on the CSDH
Over 1 billion raw observations from ~110 low-cost weather stations spread across Leuven (Belgium), at 16-second resolution from June 2019 to March 2025.
The data is mounted read-only at /srv/data/leuven-urban-weather/.
Save anything you produce in your personal folder (~/).
⚠️ Large dataset (12.9 GB). Your Hub session has 4 GB RAM — do not load the whole file into memory or the kernel will crash. Work like the pros: read only the columns you need, process the file in chunks, or query it straight from disk with DuckDB (no full load). Copy-paste patterns are in "Working with data larger than memory" near the end of this notebook.
What's in the dataset¶
from pathlib import Path
DATA = Path('/srv/data/leuven-urban-weather')
for f in sorted(DATA.rglob('*')):
    if f.is_file():
        print(f"{f.relative_to(DATA)} ({f.stat().st_size/1e6:,.1f} MB)")
RAWDATA2019Q2.csv.gz (6.4 MB)
RAWDATA2019Q3.csv.gz (185.2 MB)
RAWDATA2019Q4.csv.gz (444.2 MB)
RAWDATA2020Q1.csv.gz (505.7 MB)
RAWDATA2020Q2.csv.gz (500.3 MB)
RAWDATA2020Q3.csv.gz (523.5 MB)
RAWDATA2020Q4.csv.gz (486.9 MB)
RAWDATA2021Q1.csv.gz (527.1 MB)
RAWDATA2021Q2.csv.gz (635.0 MB)
RAWDATA2021Q3.csv.gz (602.4 MB)
RAWDATA2021Q4.csv.gz (561.7 MB)
RAWDATA2022Q1.csv.gz (605.8 MB)
RAWDATA2022Q2.csv.gz (686.8 MB)
RAWDATA2022Q3.csv.gz (707.5 MB)
RAWDATA2022Q4.csv.gz (615.0 MB)
RAWDATA2023Q1.csv.gz (619.3 MB)
RAWDATA2023Q2.csv.gz (695.5 MB)
RAWDATA2023Q3.csv.gz (648.9 MB)
RAWDATA2023Q4.csv.gz (610.2 MB)
RAWDATA2024Q1.csv.gz (580.3 MB)
RAWDATA2024Q2.csv.gz (601.9 MB)
RAWDATA2024Q3.csv.gz (599.8 MB)
RAWDATA2024Q4.csv.gz (505.9 MB)
RAWDATA2025Q1.csv.gz (439.3 MB)
STATIONS.csv (0.0 MB)
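The listing shows one observation file per quarter, named RAWDATA<year>Q<quarter>.csv.gz. A minimal sketch of how you might pick only the quarters you need, so you never touch more files than necessary. The helper names `parse_quarter` and `select_quarters` are hypothetical (they are not part of the dataset or of any library), and the pattern is inferred only from the filenames printed above.

```python
import re
from pathlib import Path

# Pattern inferred from the file listing, e.g. RAWDATA2019Q2.csv.gz
PATTERN = re.compile(r'RAWDATA(\d{4})Q([1-4])\.csv\.gz$')

def parse_quarter(name):
    """Return (year, quarter) for a RAWDATA file name, or None if it does not match."""
    m = PATTERN.search(str(name))
    return (int(m.group(1)), int(m.group(2))) if m else None

def select_quarters(paths, start, end):
    """Keep the files whose (year, quarter) lies in [start, end], inclusive, in time order."""
    picked = []
    for p in paths:
        yq = parse_quarter(Path(p).name)
        if yq is not None and start <= yq <= end:
            picked.append(p)
    return sorted(picked, key=lambda p: parse_quarter(Path(p).name))

# Demo on a few names taken from the listing above
names = ['RAWDATA2020Q1.csv.gz', 'RAWDATA2019Q2.csv.gz',
         'RAWDATA2019Q3.csv.gz', 'STATIONS.csv']
print(select_quarters(names, (2019, 3), (2020, 1)))
# → ['RAWDATA2019Q3.csv.gz', 'RAWDATA2020Q1.csv.gz']
```

On the Hub you could pass `DATA.glob('RAWDATA*.csv.gz')` instead of the demo list to get real paths back.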
Load the data¶
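The dataset comes as CSV. Real-world CSV files aren't always uniform (the separator may be , or ; and the encoding UTF-8 or Latin-1), so we use a loader that detects the separator and tries both encodings. The first file it finds is STATIONS.csv, the station metadata table, which has only 155 rows. The nrows=100_000 limit matters for the much larger observation files, where it keeps exploration fast; drop nrows only when the file fits within the 4 GB RAM limit.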
import pandas as pd
# '*.csv.gz' already covers every compressed file; plain .csv files come first in the list
csvs = sorted(DATA.rglob('*.csv')) + sorted(DATA.rglob('*.csv.gz'))
print('Using:', csvs[0].name)
def load_csv(path, **kw):
    """Robust reader: detects the separator and tries utf-8 then latin-1."""
    for enc in ('utf-8', 'latin-1'):
        try:
            # sep=None makes the python engine sniff the separator
            return pd.read_csv(path, sep=None, engine='python', encoding=enc, **kw)
        except UnicodeDecodeError:
            continue
df = load_csv(csvs[0], nrows=100_000)
df.head()
Using: STATIONS.csv
| | ID | WOWID | LATITUDE | LONGITUDE | ALTITUDE |
|---|---|---|---|---|---|
| 0 | GARMON01 | b1f9850c-d686-e911-80e7-0003ff59889d | 50.8710 | 4.694 | 21 |
| 1 | GARMON002 | 83a89aa6-2695-e911-80e7-0003ff59883f | 50.8468 | 4.756 | 47 |
| 2 | GARMON003 | 2a3596b2-2795-e911-80e7-0003ff59889d | 50.8700 | 4.728 | 44 |
| 3 | GARMON004 | 7d43d8ab-2895-e911-80e7-0003ff59883f | 50.8708 | 4.685 | 31 |
| 4 | GARMON005 | 3f67b310-5597-e911-80e7-0003ff59883f | 50.8814 | 4.713 | 26 |
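The loader above picked STATIONS.csv, so the table shows station metadata rather than observations. A minimal sketch of how you might peek at the first rows of an observation file without reading the rest of it. The function name `peek_raw` is hypothetical, and nothing here assumes particular column names, since the notebook does not show the columns of the RAWDATA files.

```python
from pathlib import Path
import pandas as pd

DATA = Path('/srv/data/leuven-urban-weather')

def peek_raw(path, n=5):
    """Read only the first n rows of a CSV; gzip compression is inferred from the .gz suffix."""
    # sep=None lets the python engine sniff the separator, as in load_csv above
    return pd.read_csv(path, sep=None, engine='python', nrows=n)

# Only runs where the dataset is mounted (e.g. on the Hub)
raw_files = sorted(DATA.glob('RAWDATA*.csv.gz'))
if raw_files:
    smallest = min(raw_files, key=lambda f: f.stat().st_size)
    sample = peek_raw(smallest)
    print(smallest.name, list(sample.columns))
```

Starting with the smallest file (RAWDATA2019Q2.csv.gz at 6.4 MB, according to the listing) keeps the first look cheap.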
First look¶
Shape, types and basic statistics.
df.info()
df.describe(include='all').T.head(20)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 155 entries, 0 to 154
Data columns (total 5 columns):
 #   Column     Non-Null Count  Dtype
---  ------     --------------  -----
 0   ID         155 non-null    object
 1   WOWID      150 non-null    object
 2   LATITUDE   155 non-null    float64
 3   LONGITUDE  155 non-null    float64
 4   ALTITUDE   155 non-null    int64
dtypes: float64(2), int64(1), object(2)
memory usage: 6.2+ KB
| | count | unique | top | freq | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|---|---|---|
| ID | 155 | 155 | GARMON01 | 1 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| WOWID | 150 | 149 | 23cb50b2-6aa2-ea11-8b71-0003ff59b0d7 | 2 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| LATITUDE | 155.0 | NaN | NaN | NaN | 50.871586 | 0.061385 | 50.1936 | 50.86645 | 50.8765 | 50.8846 | 50.9993 |
| LONGITUDE | 155.0 | NaN | NaN | NaN | 4.706471 | 0.115391 | 4.357 | 4.6865 | 4.701 | 4.7215 | 5.45 |
| ALTITUDE | 155.0 | NaN | NaN | NaN | 37.535484 | 28.09776 | 11.0 | 21.5 | 32.0 | 42.5 | 295.0 |
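The summary hints at a few things worth checking: five stations have no WOWID, one WOWID appears twice, and the coordinate extremes (latitude 50.1936, longitude 5.45) lie well away from the medians. A minimal sketch of how you might list those rows. The function name `station_checks` and the bounding box are assumptions chosen for illustration, not official limits of the network, and the demo rows are made up.

```python
import pandas as pd

def station_checks(stations, lat=(50.7, 51.0), lon=(4.5, 4.9)):
    """Return (outside_box, duplicated_wowid, missing_wowid) subsets of the station table.

    The default lat/lon box is an illustrative assumption, not a boundary
    defined by the dataset.
    """
    in_box = stations['LATITUDE'].between(*lat) & stations['LONGITUDE'].between(*lon)
    outside = stations.loc[~in_box]
    wow = stations['WOWID']
    # keep=False marks every row that shares a WOWID, not only the repeats
    duplicated = stations.loc[wow.notna() & wow.duplicated(keep=False)]
    missing = stations.loc[wow.isna()]
    return outside, duplicated, missing

# Made-up rows with the same columns as STATIONS.csv
demo = pd.DataFrame({
    'ID': ['S1', 'S2', 'S3', 'S4'],
    'WOWID': ['aaa', 'aaa', None, 'bbb'],
    'LATITUDE': [50.87, 50.88, 50.19, 50.86],
    'LONGITUDE': [4.70, 4.71, 4.69, 5.45],
    'ALTITUDE': [21, 47, 295, 31],
})
outside, duplicated, missing = station_checks(demo)
print(list(outside['ID']), list(duplicated['ID']), list(missing['ID']))
# → ['S3', 'S4'] ['S1', 'S2'] ['S3']
```

On the real table, `station_checks(df)` would return the rows to inspect by hand; whether a distant station is an error or a deliberate reference site is something only the data sheet can tell you.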
A first chart¶
Histogram of the first numeric column — swap it for the variable you care about.
import matplotlib.pyplot as plt
num = df.select_dtypes('number')
if num.shape[1]:
    col = num.columns[0]
    num[col].plot.hist(bins=50, figsize=(8, 4), title=col)
    plt.tight_layout()
else:
    print('No direct numeric columns: explore df on your own.')
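For the station table, the first numeric column is LATITUDE, so the histogram above only shows how latitudes are distributed. A scatter plot of the coordinates may be a more useful first picture. This is a rough sketch, with `plot_stations` as a hypothetical helper name; the demo rows are the five stations shown in the df.head() output.

```python
import matplotlib.pyplot as plt
import pandas as pd

def plot_stations(stations):
    """Scatter station positions, coloured by the ALTITUDE column."""
    fig, ax = plt.subplots(figsize=(6, 5))
    sc = ax.scatter(stations['LONGITUDE'], stations['LATITUDE'],
                    c=stations['ALTITUDE'], cmap='viridis', s=20)
    fig.colorbar(sc, ax=ax, label='ALTITUDE')
    ax.set_xlabel('LONGITUDE')
    ax.set_ylabel('LATITUDE')
    ax.set_title('Station locations')
    fig.tight_layout()
    return ax

# The five rows printed by df.head() earlier in the notebook
head = pd.DataFrame({
    'ID': ['GARMON01', 'GARMON002', 'GARMON003', 'GARMON004', 'GARMON005'],
    'LATITUDE': [50.8710, 50.8468, 50.8700, 50.8708, 50.8814],
    'LONGITUDE': [4.694, 4.756, 4.728, 4.685, 4.713],
    'ALTITUDE': [21, 47, 44, 31, 26],
})
ax = plot_stations(head)
```

In the notebook you would call `plot_stations(df)` on the full table. Note that plain longitude/latitude axes are not an equal-distance map projection.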
Working with data larger than memory¶
Your session has a 4 GB RAM limit, but you can analyse files of 10 GB or more without loading them whole:
- Read only the columns you need: pd.read_csv(f, usecols=[...]) / pd.read_parquet(f, columns=[...]).
- Process in chunks and keep only the result:
    total = 0
    for chunk in pd.read_csv(file, chunksize=1_000_000):
        total += len(chunk)
- Query with SQL without loading anything: DuckDB (already installed) reads CSV and Parquet straight from disk and only brings the result into memory:
    import duckdb
    duckdb.sql("SELECT column, count(*) FROM '/srv/data/.../file.parquet' GROUP BY column").df()
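The first two patterns combine naturally. Here is a minimal sketch of a per-group mean computed chunk by chunk, keeping only running sums and counts in memory. Averaging the chunk means directly would be wrong when chunks hold different numbers of rows per group, which is why sums and counts are accumulated separately. The function name `chunked_group_mean` and the demo columns `ID` and `TEMP` are hypothetical: the notebook does not show the real column names of the observation files, and the sketch assumes a comma-separated file.

```python
import os
import tempfile
import pandas as pd

def chunked_group_mean(path, key, value, chunksize=1_000_000):
    """Mean of `value` per `key`, reading only those two columns, one chunk at a time."""
    sums, counts = None, None
    for chunk in pd.read_csv(path, usecols=[key, value], chunksize=chunksize):
        grouped = chunk.groupby(key)[value]
        s, c = grouped.sum(), grouped.count()   # both skip missing values
        # fill_value=0 handles groups that appear in only some chunks
        sums = s if sums is None else sums.add(s, fill_value=0)
        counts = c if counts is None else counts.add(c, fill_value=0)
    return sums / counts

# Tiny synthetic file with made-up columns, just to show the call
demo = pd.DataFrame({'ID': ['A', 'B', 'A', 'B', 'A'],
                     'TEMP': [1.0, 2.0, 3.0, 4.0, 8.0]})
demo_path = os.path.join(tempfile.mkdtemp(), 'demo.csv.gz')
demo.to_csv(demo_path, index=False)   # gzip is inferred from the .gz suffix

result = chunked_group_mean(demo_path, 'ID', 'TEMP', chunksize=2)
print(result)   # A → 4.0, B → 3.0
```

Memory use depends on the chunk size and the number of distinct groups, not on the file size, so the same call should stay within the 4 GB limit on a full quarterly file once you substitute the real column names.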
Your turn¶
This is just the starting point. Some ideas:
- Check the dataset challenge on its CSDH data sheet.
- Work on a copy: right-click the file → Duplicate (or Save Notebook As…). Your changes only live in your Hub space — they're never pushed to GitHub.
- Edited this notebook and want the original back? Use the Restore cell below (or the restore.ipynb notebook).
- Questions and results: on the platform forum.
Attribution: data from Leuven.cool Citizen Science Urban Weather Station Network, license CC-BY-NC-4.0. Notebook from the Citizen Science Data Hub (CSDH) — Fundación Ibercivis.
# ⚠️ RESTORE: this DISCARDS YOUR CHANGES to this notebook and resets it to the original.
# 1. Uncomment the line below (remove the #)
# 2. Run this cell
# 3. Then: menu File → Reload Notebook from Disk
# !git -C ~/citizen-science-data fetch -q origin && git -C ~/citizen-science-data checkout origin/main -- leuven-urban-weather.ipynb && echo "Restored. Now: File → Reload Notebook from Disk"