xAire: High-Resolution NO₂ Citizen Science Dataset (Barcelona)

Category: Air Quality · Size: 77.6 kB · Format: CSV · License: CC-BY-4.0 · Zenodo record · Data sheet on the CSDH

725 nitrogen dioxide measurements collected by 1,650 participants from 18 Barcelona schools using Palmes diffusion tubes, with asthma health-impact analysis.

The data is mounted read-only at /srv/data/xaire-no2-barcelona/. Save anything you produce in your personal folder (~/).

What's in the dataset

In [1]:
from pathlib import Path

DATA = Path('/srv/data/xaire-no2-barcelona')

for f in sorted(DATA.rglob('*')):
    if f.is_file():
        print(f"{f.relative_to(DATA)}  ({f.stat().st_size/1e6:,.1f} MB)")
xaire_datainbrief.csv  (0.1 MB)

Load the data

The dataset comes as CSV. Real-world CSV files aren't always uniform (separator , or ;, UTF-8 or Latin-1 encoding), so we use a loader that detects the separator and falls back from UTF-8 to Latin-1. We cap the read at 100,000 rows to explore quickly; drop nrows when you want to work with everything. This file has only 725 rows, so the cap has no effect here.

In [2]:
import pandas as pd

# '*.csv.gz' already covers compressed CSVs; a bare '*.gz' pattern would list them twice.
csvs = sorted(DATA.rglob('*.csv')) + sorted(DATA.rglob('*.csv.gz'))
print('Using:', csvs[0].name)

def load_csv(path, **kw):
    """Robust reader: detects the separator and tries utf-8 then latin-1."""
    for enc in ('utf-8', 'latin-1'):
        try:
            return pd.read_csv(path, sep=None, engine='python', encoding=enc, **kw)
        except UnicodeDecodeError:
            continue

df = load_csv(csvs[0], nrows=100_000)
df.head()
Using: xaire_datainbrief.csv
Out[2]:
   tube_code  school        lat      long                address        type    no2_raw  no2_unbiased  no2_2017
0    1003091   xAire  41.397720  2.149740      carrer denia, 2-4     traffic  38.290447            41        39
1    1003092   xAire  41.398180  2.148380     plaça cardona, 1-2  background  42.617421            46        44
2    1003093   xAire  41.399155  2.145611      carrer aribau 265     traffic  26.454660            28        27
3    1003094   xAire  41.401366  2.148259       via augusta, 114     traffic  60.009157            64        62
4    1003095   xAire  41.403742  2.142602  carrer balmes 350-352     traffic  98.931474           106       102
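
The separator detection that load_csv relies on can be checked in isolation. The sketch below is only an illustration: read_sniffed is a hypothetical helper, and the two text samples are built in memory from the first two rows of the preview above, not read from the dataset. With sep=None and engine='python', pandas lets Python's csv.Sniffer pick the delimiter from the first valid row.

```python
import io

import pandas as pd

# Illustrative in-memory samples: two rows copied from the preview above,
# written once with commas and once with semicolons.
comma_text = (
    "tube_code,type,no2_raw\n"
    "1003091,traffic,38.290447\n"
    "1003092,background,42.617421\n"
)
semicolon_text = comma_text.replace(",", ";")


def read_sniffed(text):
    """Parse CSV text and let pandas detect the separator.

    sep=None is only supported by the slower engine='python'.
    """
    return pd.read_csv(io.StringIO(text), sep=None, engine="python")


df_comma = read_sniffed(comma_text)
df_semicolon = read_sniffed(semicolon_text)

# Both variants should end up with the same columns and values.
print(list(df_comma.columns))
print(df_comma.equals(df_semicolon))
```

Detection costs speed, so once you know a file's separator it is usually better to pass it explicitly (sep=',' or sep=';') and use the default engine.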

First look

Shape, types and basic statistics.

In [3]:
df.info()
df.describe(include='all').T.head(20)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 725 entries, 0 to 724
Data columns (total 9 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   tube_code     725 non-null    int64  
 1   school        725 non-null    object 
 2   lat           725 non-null    float64
 3   long          725 non-null    float64
 4   address       725 non-null    object 
 5   type          725 non-null    object 
 6   no2_raw       725 non-null    float64
 7   no2_unbiased  725 non-null    int64  
 8   no2_2017      725 non-null    int64  
dtypes: float64(3), int64(3), object(3)
memory usage: 51.1+ KB
Out[3]:
              count  unique      top  freq            mean           std        min        25%        50%        75%         max
tube_code     725.0     NaN      NaN   NaN  1089424.386207  10902.337857  1003091.0  1090624.0  1090825.0  1091029.0   1091241.0
school          725      19    xAire    77             NaN           NaN        NaN        NaN        NaN        NaN         NaN
lat           725.0     NaN      NaN   NaN       41.403482      0.020399  41.356773  41.386892   41.40235   41.42319   41.445149
long          725.0     NaN      NaN   NaN        2.164603      0.023831     2.1105   2.148791   2.168509     2.1812    2.210294
address         725     673  traffic    13             NaN           NaN        NaN        NaN        NaN        NaN         NaN
type            725       4  traffic   482             NaN           NaN        NaN        NaN        NaN        NaN         NaN
no2_raw       725.0     NaN      NaN   NaN       46.528329     15.156252  11.724384  37.373208  43.114328  52.212016  122.537611
no2_unbiased  725.0     NaN      NaN   NaN        49.77931      16.22633       13.0       40.0       46.0       56.0       131.0
no2_2017      725.0     NaN      NaN   NaN       47.875862     15.629994       13.0       38.0       44.0       54.0       126.0
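
The summary above shows that type has 4 categories, with traffic covering 482 of the 725 rows, so a natural next step is to compare NO₂ levels per site type. The sketch below is one possible way to do that; summarise_by_group is a hypothetical helper, and the small sample frame only repeats the five rows shown in the df.head() preview so that the block runs on its own.

```python
import pandas as pd

# The five rows from the df.head() preview, reduced to the two columns used here.
sample = pd.DataFrame({
    "type": ["traffic", "background", "traffic", "traffic", "traffic"],
    "no2_unbiased": [41, 46, 28, 64, 106],
})


def summarise_by_group(frame, group_col="type", value_col="no2_unbiased"):
    """Count, mean and median of value_col for each category in group_col."""
    return (
        frame.groupby(group_col)[value_col]
        .agg(["count", "mean", "median"])
        .sort_values("mean", ascending=False)
    )


print(summarise_by_group(sample))
```

In the notebook, summarise_by_group(df) would run the same summary on all 725 rows. Five rows say nothing about the real differences between site types, so treat the sample output purely as a smoke test.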

A first chart

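Histogram of the first numeric column. Here that column is tube_code, an identifier, so its distribution says little about air quality; set col to the variable you care about, for example no2_unbiased.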

In [4]:
import matplotlib.pyplot as plt

num = df.select_dtypes('number')
if num.shape[1]:
    col = num.columns[0]
    num[col].plot.hist(bins=50, figsize=(8, 4), title=col)
    plt.tight_layout()
else:
    print('No direct numeric columns: explore df on your own.')
[Figure: histogram of tube_code, 50 bins]
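
Since the first numeric column is only a tube identifier, a histogram of one of the NO₂ columns is probably more informative. The following is a minimal sketch: plot_hist is a hypothetical helper, and the preview frame just reuses the five no2_unbiased values from the df.head() output so the block can run without the dataset.

```python
import matplotlib.pyplot as plt
import pandas as pd


def plot_hist(frame, col="no2_unbiased", bins=50):
    """Histogram of one named column; returns the Axes for further tweaking."""
    ax = frame[col].plot.hist(bins=bins, figsize=(8, 4), title=col)
    ax.set_xlabel(col)
    plt.tight_layout()
    return ax


# Demo on the five no2_unbiased values from the df.head() preview;
# in the notebook, call plot_hist(df) to plot all rows.
preview = pd.DataFrame({"no2_unbiased": [41, 46, 28, 64, 106]})
ax = plot_hist(preview, bins=10)
```

Passing the column by name rather than by position keeps the chart stable if the column order of the file ever changes.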

Working with data larger than memory

Your session has a 4 GB RAM limit, but you can analyse files of 10 GB or more without loading them whole:

  • Read only the columns you need: pd.read_csv(f, usecols=[...]) / pd.read_parquet(f, columns=[...]).
  • Process in chunks and keep only the result:
    total = 0
    for chunk in pd.read_csv(file, chunksize=1_000_000):
        total += len(chunk)
    
  • Query with SQL without loading anything — DuckDB (already installed) reads CSV and Parquet straight from disk and only brings the result into memory:
    import duckdb
    duckdb.sql("SELECT column, count(*) FROM '/srv/data/.../file.parquet' GROUP BY column").df()
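
The chunked loop above only counts rows. For anything that must be combined across chunks, such as a mean per group, one approach is to keep running sums and counts and divide at the end. This is a sketch under assumptions: chunked_group_mean is a hypothetical helper, and it uses pandas' default comma separator unless you pass sep=... through read_kw.

```python
from collections import defaultdict

import pandas as pd


def chunked_group_mean(path, group_col, value_col, chunksize=1_000_000, **read_kw):
    """Mean of value_col per group_col, reading the CSV in chunks.

    Only running sums and counts stay in memory, never the whole file.
    Extra keyword arguments (e.g. sep=';') are passed to pd.read_csv.
    """
    sums = defaultdict(float)
    counts = defaultdict(int)
    reader = pd.read_csv(
        path, usecols=[group_col, value_col], chunksize=chunksize, **read_kw
    )
    for chunk in reader:
        # Partial aggregate for this chunk; "count" ignores missing values.
        partial = chunk.groupby(group_col)[value_col].agg(["sum", "count"])
        for key, row in partial.iterrows():
            sums[key] += row["sum"]
            counts[key] += int(row["count"])
    # Skip groups that had no non-missing values at all.
    return {key: sums[key] / counts[key] for key in sums if counts[key]}
```

With this dataset the call would look like chunked_group_mean(DATA / 'xaire_datainbrief.csv', 'type', 'no2_unbiased'), assuming the file is comma-separated. Averaging the per-chunk means instead would give a wrong answer whenever chunks contain different numbers of rows per group, which is why the sketch accumulates sums and counts.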
    

Your turn

This is just the starting point. Some ideas:

  • Check the dataset challenge on its CSDH data sheet.
  • Work on a copy: right-click the file → Duplicate (or Save Notebook As…). Your changes only live in your Hub space — they're never pushed to GitHub.
  • Edited this notebook and want the original back? Use the Restore cell below (or the restore.ipynb notebook).
  • Questions and results: on the platform forum.

Attribution: data from xAire: High-Resolution NO₂ Citizen Science Dataset (Barcelona), license CC-BY-4.0. Notebook from the Citizen Science Data Hub (CSDH) — Fundación Ibercivis.

In [5]:
# ⚠️ RESTORE: this DISCARDS YOUR CHANGES to this notebook and resets it to the original.
# 1. Uncomment the line below (remove the #)   2. Run this cell
# 3. Then: menu File → Reload Notebook from Disk

# !git -C ~/citizen-science-data fetch -q origin && git -C ~/citizen-science-data checkout origin/main -- xaire-no2-barcelona.ipynb && echo "Restored. Now: File → Reload Notebook from Disk"