CitieS-Health Barcelona: Air Pollution & Health Survey Results
Category: Public Health · Size: 123.2 kB · Format: CSV · License: CC-BY-4.0 · Zenodo record · Data sheet on the CSDH
Online survey responses on Barcelona citizens' knowledge, perceptions and preferences regarding air pollution and its health effects.
The data is mounted read-only at /srv/data/cities-health-barcelona/.
Save anything you produce in your personal folder (~/).
What's in the dataset
from pathlib import Path

DATA = Path('/srv/data/cities-health-barcelona')
# List every file under the data folder with its size
for f in sorted(DATA.rglob('*')):
    if f.is_file():
        print(f"{f.relative_to(DATA)} ({f.stat().st_size/1e6:,.1f} MB)")
CitieSHealth_BCN_DATA_Survey-Results_20200609_V01.csv (0.1 MB)
Load the data
The dataset ships as a single CSV file. Real-world CSV files vary in separator (, or ;) and in encoding (UTF-8 or Latin-1), so we use a loader that detects the separator and falls back from UTF-8 to Latin-1. We cap the read at 100,000 rows to explore quickly (this file has only 581 rows, so the cap has no effect here); drop nrows when you want to work with everything.
import pandas as pd

# Plain and gzipped CSVs; a separate '*.gz' pattern would list .csv.gz files twice
csvs = sorted(DATA.rglob('*.csv')) + sorted(DATA.rglob('*.csv.gz'))
print('Using:', csvs[0].name)

def load_csv(path, **kw):
    """Robust reader: detects the separator and tries utf-8 then latin-1."""
    for enc in ('utf-8', 'latin-1'):
        try:
            # sep=None lets the python engine detect the separator
            return pd.read_csv(path, sep=None, engine='python', encoding=enc, **kw)
        except UnicodeDecodeError:
            continue

df = load_csv(csvs[0], nrows=100_000)
df.head()
Using: CitieSHealth_BCN_DATA_Survey-Results_20200609_V01.csv
| | idnum | Q1_age | Q2_district | Q3 | Q4__calle | Q4__deporte | Q4__bicicleta | Q4__niños | Q4__casa | Q4__trabajo | ... | Q6__rendimiento | Q6__aparatorespiratorio | Q6__aparatodigestivo | Q6__envejecimiento | Q6__corazónyarterias | Q6__concentracion | Q6__estrés | Q6_fertilidad | Q6__saludmental | Q6__alergias |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 33 | NaN | Ciutat Vella | 4 | Quan camino pel carrer | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | Corazón y arterias | Concentración y desarrollo cognitivo | NaN | Fertilidad / Aparato reproductor | NaN | NaN |
| 1 | 26 | NaN | Ciutat Vella | 4 | Quan camino pel carrer | Cuando hago deporte al aire libre | Cuando voy en bicicleta | NaN | NaN | NaN | ... | NaN | Aparato respiratorio | Aparato digestivo | NaN | NaN | Concentración y desarrollo cognitivo | NaN | NaN | NaN | NaN |
| 2 | 18 | 38-47 | Ciutat Vella | 3 | Quan camino pel carrer | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | Fertilidad / Aparato reproductor | NaN | NaN |
| 3 | 20 | 28-37 | Ciutat Vella | 4 | NaN | Cuando hago deporte al aire libre | Cuando voy en bicicleta | Cuando paseo con niños pequeños | NaN | NaN | ... | NaN | Aparato respiratorio | NaN | Envejecimiento | NaN | NaN | NaN | NaN | Salut mental | NaN |
| 4 | 575 | NaN | Ciutat Vella | 5 | Quan camino pel carrer | NaN | Cuando voy en bicicleta | NaN | NaN | Cuando estoy en el trabajo | ... | NaN | NaN | Aparato digestivo | Envejecimiento | NaN | NaN | NaN | NaN | NaN | NaN |
5 rows × 32 columns
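To make the two detection steps in load_csv less of a black box, here is a minimal sketch using only the standard library. pandas documents that sep=None with the python engine relies on Python's csv.Sniffer, so the first part calls the sniffer directly; the second part shows why a Latin-1 file makes the UTF-8 attempt fail. The sample text is made up for illustration and is not taken from the dataset.

```python
import csv

# Made-up sample mimicking a semicolon-separated export (not real survey rows).
sample = "idnum;Q2_district;Q3\n33;Ciutat Vella;4\n26;Eixample;3\n"

# Restrict the candidates to the two separators mentioned above.
dialect = csv.Sniffer().sniff(sample, delimiters=",;")
print("detected separator:", repr(dialect.delimiter))

# A Latin-1 encoded 'ñ' is not valid UTF-8, so decoding as UTF-8 raises
# UnicodeDecodeError; that is the signal load_csv uses to retry with latin-1.
raw = "Niños".encode("latin-1")
try:
    raw.decode("utf-8")
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False
print("valid UTF-8:", utf8_ok, "| decoded as latin-1:", raw.decode("latin-1"))
```

Note that Latin-1 can decode any byte sequence, so the fallback never fails; if accented characters look wrong after loading, the file is probably in a third encoding.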
First look
Shape, types and basic statistics.
df.info()
df.describe(include='all').T.head(20)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 581 entries, 0 to 580
Data columns (total 32 columns):
 #   Column                   Non-Null Count  Dtype
---  ------                   --------------  -----
 0   idnum                    581 non-null    int64
 1   Q1_age                   335 non-null    object
 2   Q2_district              581 non-null    object
 3   Q3                       581 non-null    int64
 4   Q4__calle                430 non-null    object
 5   Q4__deporte              176 non-null    object
 6   Q4__bicicleta            238 non-null    object
 7   Q4__niños                169 non-null    object
 8   Q4__casa                 85 non-null     object
 9   Q4__trabajo              21 non-null     object
 10  Q4__conduzco             69 non-null     object
 11  Q4__metro                127 non-null    object
 12  Q5__personasmayores      273 non-null    object
 13  Q5__niños                452 non-null    object
 14  Q5__estudiantes          20 non-null     object
 15  Q5__personasalergicas    101 non-null    object
 16  Q5__personasasmaticas    272 non-null    object
 17  Q5__embarazadas          220 non-null    object
 18  Q5__deportistas          33 non-null     object
 19  Q5__repartidores         47 non-null     object
 20  Q5__peatones             155 non-null    object
 21  Q6__pielcabello          51 non-null     object
 22  Q6__rendimiento          45 non-null     object
 23  Q6__aparatorespiratorio  398 non-null    object
 24  Q6__aparatodigestivo     65 non-null     object
 25  Q6__envejecimiento       142 non-null    object
 26  Q6__corazónyarterias     215 non-null    object
 27  Q6__concentracion        221 non-null    object
 28  Q6__estrés               187 non-null    object
 29  Q6_fertilidad            93 non-null     object
 30  Q6__saludmental          150 non-null    object
 31  Q6__alergias             83 non-null     object
dtypes: int64(2), object(30)
memory usage: 145.4+ KB
| | count | unique | top | freq | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|---|---|---|
| idnum | 581.0 | NaN | NaN | NaN | 292.089501 | 168.012055 | 2.0 | 147.0 | 292.0 | 437.0 | 584.0 |
| Q1_age | 335 | 6 | 38-47 | 113 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| Q2_district | 581 | 12 | Eixample | 153 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| Q3 | 581.0 | NaN | NaN | NaN | 3.72117 | 0.98158 | 1.0 | 3.0 | 4.0 | 4.0 | 5.0 |
| Q4__calle | 430 | 1 | Quan camino pel carrer | 430 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| Q4__deporte | 176 | 1 | Cuando hago deporte al aire libre | 176 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| Q4__bicicleta | 238 | 1 | Cuando voy en bicicleta | 238 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| Q4__niños | 169 | 1 | Cuando paseo con niños pequeños | 169 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| Q4__casa | 85 | 1 | Cuando estoy en casa | 85 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| Q4__trabajo | 21 | 1 | Cuando estoy en el trabajo | 21 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| Q4__conduzco | 69 | 1 | Cuando conduzco | 69 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| Q4__metro | 127 | 1 | Cuando entro/viajo en el metro | 127 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| Q5__personasmayores | 273 | 1 | Personas mayores | 273 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| Q5__niños | 452 | 1 | Niños | 452 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| Q5__estudiantes | 20 | 1 | Estudiantes | 20 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| Q5__personasalergicas | 101 | 1 | Personas alérgicas | 101 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| Q5__personasasmaticas | 272 | 1 | Personas asmáticas o con problemas respiratorios | 272 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| Q5__embarazadas | 220 | 1 | Embarazadas | 220 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| Q5__deportistas | 33 | 1 | Deportistas | 33 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| Q5__repartidores | 47 | 1 | Repartidores en bicicleta o motos | 47 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
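In the summary above, every Q4__ and Q5__ column has a single unique value and a count equal to its frequency, which suggests that each column is one tick-box option: the option label when ticked, NaN otherwise. Under that assumption, a minimal sketch for turning one multi-select question into shares of respondents could look like this. The toy frame and the option_shares helper are hypothetical and only mimic the layout.

```python
import pandas as pd

# Tiny made-up frame with the same layout as the survey; values are illustrative only.
toy = pd.DataFrame({
    "idnum": [1, 2, 3, 4],
    "Q4__calle": ["Quan camino pel carrer", None,
                  "Quan camino pel carrer", "Quan camino pel carrer"],
    "Q4__bicicleta": [None, "Cuando voy en bicicleta",
                      None, "Cuando voy en bicicleta"],
    "Q4__casa": [None, None, None, "Cuando estoy en casa"],
})

def option_shares(frame, prefix):
    """Share of rows in which each option of one multi-select question is ticked."""
    cols = [c for c in frame.columns if c.startswith(prefix)]
    ticked = frame[cols].notna()  # True where the option label is present
    return ticked.mean().sort_values(ascending=False)

shares = option_shares(toy, "Q4_")
print(shares)
```

On the real data this would be option_shares(df, 'Q4_'). A single-underscore prefix such as 'Q6_' is the safer choice, because one column (Q6_fertilidad) is named with one underscore instead of two.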
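A first chart
Histogram of the first numeric column. Here that is idnum, which looks like a respondent identifier rather than a measurement, so swap col for the variable you care about (Q3 is the only other numeric column).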
import matplotlib.pyplot as plt

num = df.select_dtypes('number')
if num.shape[1]:
    col = num.columns[0]
    num[col].plot.hist(bins=50, figsize=(8, 4), title=col)
    plt.tight_layout()
else:
    print('No direct numeric columns: explore df on your own.')
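For a column that only takes a handful of integer values, such as Q3 (1 to 5 according to the statistics above), counting answers per level is usually easier to read than a 50-bin histogram. The sketch below assumes a 1-to-5 scale; the scale_counts helper and the toy answers are hypothetical.

```python
import pandas as pd

def scale_counts(series, levels=range(1, 6)):
    """Count answers per level, keeping levels nobody chose as zero."""
    return series.value_counts().reindex(list(levels), fill_value=0)

# Made-up answers, only to show the shape of the result.
toy = pd.Series([4, 4, 3, 5, 1, 4, 3], name="Q3")
counts = scale_counts(toy)
print(counts.to_dict())
```

In the notebook, scale_counts(df['Q3']).plot.bar(figsize=(8, 4), title='Q3') would draw one bar per level.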
Working with data larger than memory
Your session has a 4 GB RAM limit, but you can analyse files of 10 GB or more without loading them whole:
- Read only the columns you need: pd.read_csv(f, usecols=[...]) / pd.read_parquet(f, columns=[...]).
- Process in chunks and keep only the result:
      total = 0
      for chunk in pd.read_csv(file, chunksize=1_000_000):
          total += len(chunk)
- Query with SQL without loading anything. DuckDB (already installed) reads CSV and Parquet straight from disk and only brings the result into memory:
      import duckdb
      duckdb.sql("SELECT column, count(*) FROM '/srv/data/.../file.parquet' GROUP BY column").df()
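The first two ideas combine naturally: read a single column in chunks and keep only a running tally. The sketch below uses a made-up in-memory CSV (the district and score columns are illustrative) and a tiny chunksize so it runs anywhere; with a real file you would pass its path and a much larger chunk size.

```python
import io
from collections import Counter

import pandas as pd

# Made-up CSV held in memory so the sketch is self-contained.
csv_text = (
    "district,score\n"
    "Eixample,4\nGràcia,3\nEixample,5\nGràcia,4\nEixample,2\n"
)

counts = Counter()
for chunk in pd.read_csv(io.StringIO(csv_text), usecols=["district"], chunksize=2):
    # Each chunk is a small DataFrame; only its per-value counts are kept.
    counts.update(chunk["district"].value_counts().to_dict())

print(counts.most_common())
```

Only one chunk is in memory at a time, so peak usage depends on chunksize and the selected columns rather than on the size of the file.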
Your turn
This is just the starting point. Some ideas:
- Check the dataset challenge on its CSDH data sheet.
- Work on a copy: right-click the file → Duplicate (or Save Notebook As…). Your changes only live in your Hub space — they're never pushed to GitHub.
- Edited this notebook and want the original back? Use the Restore cell below (or the restore.ipynb notebook).
- Questions and results: on the platform forum.
Attribution: data from CitieS-Health Barcelona: Air Pollution & Health Survey Results, license CC-BY-4.0. Notebook from the Citizen Science Data Hub (CSDH) — Fundación Ibercivis.
# ⚠️ RESTORE: this DISCARDS YOUR CHANGES to this notebook and resets it to the original.
# 1. Uncomment the last line of this cell (remove the leading #)
# 2. Run this cell
# 3. Then: menu File → Reload Notebook from Disk
# !git -C ~/citizen-science-data fetch -q origin && git -C ~/citizen-science-data checkout origin/main -- cities-health-barcelona.ipynb && echo "Restored. Now: File → Reload Notebook from Disk"