Leuven.cool Citizen Science Urban Weather Station Network¶

Category: Meteorology · Size: 12.9 GB · Format: CSV, CSV.gz · License: CC-BY-NC-4.0 (Non-commercial) · Zenodo record · Data sheet on the CSDH

Over 1 billion raw observations from ~110 low-cost weather stations spread across Leuven (Belgium), at 16-second resolution from June 2019 to March 2025.
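
As a rough plausibility check of the figures above, the sketch below works out how many observations a 16-second cadence would produce if every station reported without interruption. The exact start and end days are assumptions (the description only gives months), and 110 is the approximate station count quoted above, so the result is an upper bound rather than the real row count.

```python
from datetime import date

SECONDS_PER_DAY = 24 * 60 * 60
SAMPLE_INTERVAL_S = 16   # resolution quoted in the description
N_STATIONS = 110         # "~110" in the description, so approximate

# Assumed endpoints for "June 2019 to March 2025" (exact days are not given).
start, end = date(2019, 6, 1), date(2025, 3, 31)
n_days = (end - start).days + 1

obs_per_station_per_day = SECONDS_PER_DAY // SAMPLE_INTERVAL_S
upper_bound = N_STATIONS * obs_per_station_per_day * n_days

print(f"{obs_per_station_per_day:,} observations per station per day")
print(f"{n_days:,} days in the period")
print(f"upper bound: {upper_bound / 1e9:.2f} billion observations")
```

The bound lands a little above one billion, which is consistent with "over 1 billion" once you allow for stations that joined late or had gaps.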

The data is mounted read-only at /srv/data/leuven-urban-weather/. Save anything you produce in your personal folder (~/).

⚠️ Large dataset (12.9 GB). Your Hub session has 4 GB RAM — do not load the whole file into memory or the kernel will crash. Work like the pros: read only the columns you need, process the file in chunks, or query it straight from disk with DuckDB (no full load). Copy-paste patterns are in "Working with data larger than memory" near the end of this notebook.

What's in the dataset¶

In [1]:
from pathlib import Path

DATA = Path('/srv/data/leuven-urban-weather')

for f in sorted(DATA.rglob('*')):
    if f.is_file():
        print(f"{f.relative_to(DATA)}  ({f.stat().st_size/1e6:,.1f} MB)")
RAWDATA2019Q2.csv.gz  (6.4 MB)
RAWDATA2019Q3.csv.gz  (185.2 MB)
RAWDATA2019Q4.csv.gz  (444.2 MB)
RAWDATA2020Q1.csv.gz  (505.7 MB)
RAWDATA2020Q2.csv.gz  (500.3 MB)
RAWDATA2020Q3.csv.gz  (523.5 MB)
RAWDATA2020Q4.csv.gz  (486.9 MB)
RAWDATA2021Q1.csv.gz  (527.1 MB)
RAWDATA2021Q2.csv.gz  (635.0 MB)
RAWDATA2021Q3.csv.gz  (602.4 MB)
RAWDATA2021Q4.csv.gz  (561.7 MB)
RAWDATA2022Q1.csv.gz  (605.8 MB)
RAWDATA2022Q2.csv.gz  (686.8 MB)
RAWDATA2022Q3.csv.gz  (707.5 MB)
RAWDATA2022Q4.csv.gz  (615.0 MB)
RAWDATA2023Q1.csv.gz  (619.3 MB)
RAWDATA2023Q2.csv.gz  (695.5 MB)
RAWDATA2023Q3.csv.gz  (648.9 MB)
RAWDATA2023Q4.csv.gz  (610.2 MB)
RAWDATA2024Q1.csv.gz  (580.3 MB)
RAWDATA2024Q2.csv.gz  (601.9 MB)
RAWDATA2024Q3.csv.gz  (599.8 MB)
RAWDATA2024Q4.csv.gz  (505.9 MB)
RAWDATA2025Q1.csv.gz  (439.3 MB)
STATIONS.csv  (0.0 MB)

Load the data¶

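The dataset comes as CSV. Real-world CSV files aren't always uniform (separator , or ;, UTF-8 or Latin-1 encoding), so we use a loader that detects the separator and falls back from UTF-8 to Latin-1. The first file found here is STATIONS.csv, the station metadata table, so that is what gets loaded. We limit to 100,000 rows to explore quickly; drop nrows only for files that fit in memory, and read the quarterly RAWDATA files in chunks instead.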

In [2]:
import pandas as pd

# Plain CSVs first, then the gzip-compressed ones (each file listed once).
csvs = sorted(DATA.rglob('*.csv')) + sorted(DATA.rglob('*.csv.gz'))
print('Using:', csvs[0].name)

def load_csv(path, **kw):
    """Robust reader: detects the separator and tries utf-8 then latin-1."""
    for enc in ('utf-8', 'latin-1'):
        try:
            return pd.read_csv(path, sep=None, engine='python', encoding=enc, **kw)
        except UnicodeDecodeError:
            continue

df = load_csv(csvs[0], nrows=100_000)
df.head()
Using: STATIONS.csv
Out[2]:
          ID                                 WOWID  LATITUDE  LONGITUDE  ALTITUDE
0   GARMON01  b1f9850c-d686-e911-80e7-0003ff59889d   50.8710      4.694        21
1  GARMON002  83a89aa6-2695-e911-80e7-0003ff59883f   50.8468      4.756        47
2  GARMON003  2a3596b2-2795-e911-80e7-0003ff59889d   50.8700      4.728        44
3  GARMON004  7d43d8ab-2895-e911-80e7-0003ff59883f   50.8708      4.685        31
4  GARMON005  3f67b310-5597-e911-80e7-0003ff59883f   50.8814      4.713        26
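
The cell above loads the station table, not the observations. To see what a quarterly observation file looks like without reading it whole, one option is to read just its first few rows. This is a minimal sketch: `peek` is a hypothetical helper, and the demo runs on a tiny synthetic file whose column names are invented, because the real column names are not shown in this notebook.

```python
import gzip
import tempfile
from pathlib import Path

import pandas as pd


def peek(path, n=5):
    """Return the first n rows of a (possibly gzip-compressed) CSV."""
    # pandas infers gzip compression from the .gz suffix; nrows stops the read early.
    return pd.read_csv(path, nrows=n)


# Self-contained demo on a synthetic file (the columns here are made up).
with tempfile.TemporaryDirectory() as tmp:
    demo_path = Path(tmp) / "demo.csv.gz"
    with gzip.open(demo_path, "wt", encoding="utf-8") as fh:
        fh.write("station,value\n")
        fh.writelines(f"S{i % 3},{i}\n" for i in range(100))
    head = peek(demo_path, n=5)

print(head.dtypes)
print(head)
```

On the Hub, something like `peek(DATA / 'RAWDATA2019Q2.csv.gz')` would show the real columns; that file is the smallest one in the listing (6.4 MB).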

First look¶

Shape, types and basic statistics.

In [3]:
df.info()
df.describe(include='all').T.head(20)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 155 entries, 0 to 154
Data columns (total 5 columns):
 #   Column     Non-Null Count  Dtype  
---  ------     --------------  -----  
 0   ID         155 non-null    object 
 1   WOWID      150 non-null    object 
 2   LATITUDE   155 non-null    float64
 3   LONGITUDE  155 non-null    float64
 4   ALTITUDE   155 non-null    int64  
dtypes: float64(2), int64(1), object(2)
memory usage: 6.2+ KB
Out[3]:
           count  unique                                   top  freq       mean       std      min       25%      50%      75%      max
ID           155     155                              GARMON01     1        NaN       NaN      NaN       NaN      NaN      NaN      NaN
WOWID        150     149  23cb50b2-6aa2-ea11-8b71-0003ff59b0d7     2        NaN       NaN      NaN       NaN      NaN      NaN      NaN
LATITUDE   155.0     NaN                                   NaN   NaN  50.871586  0.061385  50.1936  50.86645  50.8765  50.8846  50.9993
LONGITUDE  155.0     NaN                                   NaN   NaN   4.706471  0.115391    4.357    4.6865    4.701   4.7215     5.45
ALTITUDE   155.0     NaN                                   NaN   NaN  37.535484  28.09776     11.0      21.5     32.0     42.5    295.0
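
The summary shows 150 non-null WOWID values out of 155 rows, of which 149 are unique, so a few stations have no WOWID and one value appears twice. The sketch below shows one way to pull those rows out. `wowid_issues` is a hypothetical helper, and the demo frame is a synthetic stand-in with invented values.

```python
import pandas as pd


def wowid_issues(stations):
    """Return (rows with a missing WOWID, rows whose WOWID is shared)."""
    missing = stations[stations["WOWID"].isna()]
    has_id = stations["WOWID"].notna()
    # duplicated() also flags repeated missing values, hence the has_id mask.
    shared = stations[has_id & stations["WOWID"].duplicated(keep=False)]
    return missing, shared


# Synthetic stand-in for the station table (IDs and WOWIDs are invented).
demo = pd.DataFrame({
    "ID": ["A", "B", "C", "D", "E"],
    "WOWID": ["x1", "x2", "x2", None, None],
})
missing, shared = wowid_issues(demo)

print("missing WOWID:", missing["ID"].tolist())
print("shared WOWID:", shared["ID"].tolist())
```

With the real table this would be `wowid_issues(df)`. Whether a shared WOWID is a data-entry slip or a relocated station is not something the table alone can tell you.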

A first chart¶

Histogram of the first numeric column — swap it for the variable you care about.

In [4]:
import matplotlib.pyplot as plt

num = df.select_dtypes('number')
if num.shape[1]:
    col = num.columns[0]
    num[col].plot.hist(bins=50, figsize=(8, 4), title=col)
    plt.tight_layout()
else:
    print('No direct numeric columns: explore df on your own.')
[Figure: histogram of LATITUDE, 50 bins]
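
For a station table, a scatter plot of positions is often more telling than a histogram of one coordinate. The sketch below plots LONGITUDE against LATITUDE and colours the points by ALTITUDE. `plot_station_map` is a hypothetical helper, and the demo frame reuses the five rows shown in Out[2].

```python
import matplotlib.pyplot as plt
import pandas as pd


def plot_station_map(stations):
    """Scatter the stations by position, coloured by altitude."""
    fig, ax = plt.subplots(figsize=(6, 6))
    points = ax.scatter(
        stations["LONGITUDE"], stations["LATITUDE"],
        c=stations["ALTITUDE"], cmap="viridis", s=25,
    )
    fig.colorbar(points, ax=ax, label="ALTITUDE")
    ax.set_xlabel("LONGITUDE")
    ax.set_ylabel("LATITUDE")
    ax.set_title("Station locations")
    fig.tight_layout()
    return ax


# The first five stations from Out[2].
demo = pd.DataFrame({
    "LATITUDE": [50.8710, 50.8468, 50.8700, 50.8708, 50.8814],
    "LONGITUDE": [4.694, 4.756, 4.728, 4.685, 4.713],
    "ALTITUDE": [21, 47, 44, 31, 26],
})
ax = plot_station_map(demo)
```

Calling `plot_station_map(df)` on the full table should also make outliers easy to spot, such as the rows behind the minimum latitude of 50.1936 and the maximum longitude of 5.45 in the summary above.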

Working with data larger than memory¶

Your session has a 4 GB RAM limit, but you can analyse files of 10 GB or more without loading them whole:

  • Read only the columns you need: pd.read_csv(f, usecols=[...]) / pd.read_parquet(f, columns=[...]).
  • Process in chunks and keep only the result:
    total = 0
    for chunk in pd.read_csv(file, chunksize=1_000_000):
        total += len(chunk)
    
  • Query with SQL without loading anything: DuckDB (already installed) reads CSV (gzip-compressed .csv.gz included) and Parquet straight from disk and only brings the result into memory:
    import duckdb
    duckdb.sql("SELECT column, count(*) FROM '/srv/data/.../file.parquet' GROUP BY column").df()
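
The first two patterns combine naturally: read a single column, chunk by chunk, and keep only a running sum and count. The sketch below computes a column mean that way. `chunked_mean` is a hypothetical helper, and the demo uses a synthetic gzip-compressed file with invented column names, since the real ones are not shown here.

```python
import gzip
import tempfile
from pathlib import Path

import pandas as pd


def chunked_mean(path, column, chunksize=1_000_000):
    """Mean of one numeric column, holding at most `chunksize` rows in memory."""
    total, count = 0.0, 0
    # usecols keeps a single column; chunksize yields one DataFrame per batch.
    for chunk in pd.read_csv(path, usecols=[column], chunksize=chunksize):
        values = chunk[column].dropna()
        total += values.sum()
        count += len(values)
    return total / count if count else float("nan")


# Self-contained demo: 1,000 synthetic rows read 100 at a time.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = Path(tmp) / "demo.csv.gz"
    with gzip.open(demo_path, "wt", encoding="utf-8") as fh:
        fh.write("station,value\n")
        fh.writelines(f"S{i % 3},{i}\n" for i in range(1000))
    result = chunked_mean(demo_path, "value", chunksize=100)

print(result)
```

Only the two accumulators survive between chunks, so memory use depends on `chunksize` rather than on file size. The same shape works for counts, minima, maxima, or per-station sums kept in a dictionary.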
    

Your turn¶

This is just the starting point. Some ideas:

  • Check the dataset challenge on its CSDH data sheet.
  • Work on a copy: right-click the file → Duplicate (or Save Notebook As…). Your changes only live in your Hub space — they're never pushed to GitHub.
  • Edited this notebook and want the original back? Use the Restore cell below (or the restore.ipynb notebook).
  • Questions and results: on the platform forum.

Attribution: data from Leuven.cool Citizen Science Urban Weather Station Network, license CC-BY-NC-4.0. Notebook from the Citizen Science Data Hub (CSDH) — Fundación Ibercivis.

In [5]:
# ⚠️ RESTORE: this DISCARDS YOUR CHANGES to this notebook and resets it to the original.
# 1. Uncomment the line below (remove the #)   2. Run this cell
# 3. Then: menu File → Reload Notebook from Disk

# !git -C ~/citizen-science-data fetch -q origin && git -C ~/citizen-science-data checkout origin/main -- leuven-urban-weather.ipynb && echo "Restored. Now: File → Reload Notebook from Disk"