CitieS-Health Barcelona: Air Pollution & Health Survey Results¶

Category: Public Health · Size: 123.2 kB · Format: CSV · License: CC-BY-4.0 · Zenodo record · Data sheet on the CSDH

Online survey responses on Barcelona citizens' knowledge, perceptions and preferences regarding air pollution and its health effects.

The data is mounted read-only at /srv/data/cities-health-barcelona/. Save anything you produce in your personal folder (~/).

What's in the dataset¶

In [1]:
from pathlib import Path

DATA = Path('/srv/data/cities-health-barcelona')

for f in sorted(DATA.rglob('*')):
    if f.is_file():
        print(f"{f.relative_to(DATA)}  ({f.stat().st_size/1e6:,.1f} MB)")
CitieSHealth_BCN_DATA_Survey-Results_20200609_V01.csv  (0.1 MB)

Load the data¶

The dataset comes as CSV. Real-world CSV files are not always uniform (the separator may be , or ; and the encoding UTF-8 or Latin-1), so we use a loader that detects the separator and falls back from UTF-8 to Latin-1. We cap the read at 100,000 rows to explore quickly; this file has only 581 rows, so the cap has no effect here. Drop nrows when you want to work with everything.

In [2]:
import pandas as pd

csvs = sorted(DATA.rglob('*.csv')) + sorted(DATA.rglob('*.csv.gz')) + sorted(DATA.rglob('*.gz'))
print('Using:', csvs[0].name)

def load_csv(path, **kw):
    """Robust reader: detects the separator and tries utf-8 then latin-1."""
    for enc in ('utf-8', 'latin-1'):
        try:
            return pd.read_csv(path, sep=None, engine='python', encoding=enc, **kw)
        except UnicodeDecodeError:
            continue

df = load_csv(csvs[0], nrows=100_000)
df.head()
Using: CitieSHealth_BCN_DATA_Survey-Results_20200609_V01.csv
Out[2]:
|  | idnum | Q1_age | Q2_district | Q3 | Q4__calle | Q4__deporte | Q4__bicicleta | Q4__niños | Q4__casa | Q4__trabajo | ... | Q6__rendimiento | Q6__aparatorespiratorio | Q6__aparatodigestivo | Q6__envejecimiento | Q6__corazónyarterias | Q6__concentracion | Q6__estrés | Q6_fertilidad | Q6__saludmental | Q6__alergias |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 33 | NaN | Ciutat Vella | 4 | Quan camino pel carrer | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | Corazón y arterias | Concentración y desarrollo cognitivo | NaN | Fertilidad / Aparato reproductor | NaN | NaN |
| 1 | 26 | NaN | Ciutat Vella | 4 | Quan camino pel carrer | Cuando hago deporte al aire libre | Cuando voy en bicicleta | NaN | NaN | NaN | ... | NaN | Aparato respiratorio | Aparato digestivo | NaN | NaN | Concentración y desarrollo cognitivo | NaN | NaN | NaN | NaN |
| 2 | 18 | 38-47 | Ciutat Vella | 3 | Quan camino pel carrer | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | Fertilidad / Aparato reproductor | NaN | NaN |
| 3 | 20 | 28-37 | Ciutat Vella | 4 | NaN | Cuando hago deporte al aire libre | Cuando voy en bicicleta | Cuando paseo con niños pequeños | NaN | NaN | ... | NaN | Aparato respiratorio | NaN | Envejecimiento | NaN | NaN | NaN | NaN | Salut mental | NaN |
| 4 | 575 | NaN | Ciutat Vella | 5 | Quan camino pel carrer | NaN | Cuando voy en bicicleta | NaN | NaN | Cuando estoy en el trabajo | ... | NaN | NaN | Aparato digestivo | Envejecimiento | NaN | NaN | NaN | NaN | NaN | NaN |

5 rows × 32 columns
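
The two behaviours load_csv relies on can be sketched with the standard library alone. The pandas documentation describes sep=None with engine='python' as delegating separator detection to Python's csv.Sniffer; the helper names below (decode_with_fallback, detect_separator) and the sample bytes are hypothetical and only illustrate the idea, not how pandas is implemented internally.

```python
import csv

def decode_with_fallback(raw: bytes) -> str:
    """Decode bytes as UTF-8, falling back to Latin-1 (which accepts any byte)."""
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        return raw.decode("latin-1")

def detect_separator(sample: str, candidates: str = ",;\t|") -> str:
    """Guess the field separator from a text sample using csv.Sniffer."""
    return csv.Sniffer().sniff(sample, delimiters=candidates).delimiter

# Hypothetical sample: semicolon-separated and Latin-1 encoded (not the real file).
raw = "district;niños\nEixample;3\nGràcia;1\n".encode("latin-1")
text = decode_with_fallback(raw)
sep = detect_separator(text)
print(repr(sep), text.splitlines()[0])
```

Because Latin-1 maps every byte to a character, the fallback never raises; the trade-off is that a file in some other encoding would load with wrong characters rather than fail.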

First look¶

Shape, types and basic statistics.

In [3]:
df.info()
df.describe(include='all').T.head(20)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 581 entries, 0 to 580
Data columns (total 32 columns):
 #   Column                   Non-Null Count  Dtype 
---  ------                   --------------  ----- 
 0   idnum                    581 non-null    int64 
 1   Q1_age                   335 non-null    object
 2   Q2_district              581 non-null    object
 3   Q3                       581 non-null    int64 
 4   Q4__calle                430 non-null    object
 5   Q4__deporte              176 non-null    object
 6   Q4__bicicleta            238 non-null    object
 7   Q4__niños                169 non-null    object
 8   Q4__casa                 85 non-null     object
 9   Q4__trabajo              21 non-null     object
 10  Q4__conduzco             69 non-null     object
 11  Q4__metro                127 non-null    object
 12  Q5__personasmayores      273 non-null    object
 13  Q5__niños                452 non-null    object
 14  Q5__estudiantes          20 non-null     object
 15  Q5__personasalergicas    101 non-null    object
 16  Q5__personasasmaticas    272 non-null    object
 17  Q5__embarazadas          220 non-null    object
 18  Q5__deportistas          33 non-null     object
 19  Q5__repartidores         47 non-null     object
 20  Q5__peatones             155 non-null    object
 21  Q6__pielcabello          51 non-null     object
 22  Q6__rendimiento          45 non-null     object
 23  Q6__aparatorespiratorio  398 non-null    object
 24  Q6__aparatodigestivo     65 non-null     object
 25  Q6__envejecimiento       142 non-null    object
 26  Q6__corazónyarterias     215 non-null    object
 27  Q6__concentracion        221 non-null    object
 28  Q6__estrés               187 non-null    object
 29  Q6_fertilidad            93 non-null     object
 30  Q6__saludmental          150 non-null    object
 31  Q6__alergias             83 non-null     object
dtypes: int64(2), object(30)
memory usage: 145.4+ KB
Out[3]:
|  | count | unique | top | freq | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|---|---|---|
| idnum | 581.0 | NaN | NaN | NaN | 292.089501 | 168.012055 | 2.0 | 147.0 | 292.0 | 437.0 | 584.0 |
| Q1_age | 335 | 6 | 38-47 | 113 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| Q2_district | 581 | 12 | Eixample | 153 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| Q3 | 581.0 | NaN | NaN | NaN | 3.72117 | 0.98158 | 1.0 | 3.0 | 4.0 | 4.0 | 5.0 |
| Q4__calle | 430 | 1 | Quan camino pel carrer | 430 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| Q4__deporte | 176 | 1 | Cuando hago deporte al aire libre | 176 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| Q4__bicicleta | 238 | 1 | Cuando voy en bicicleta | 238 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| Q4__niños | 169 | 1 | Cuando paseo con niños pequeños | 169 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| Q4__casa | 85 | 1 | Cuando estoy en casa | 85 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| Q4__trabajo | 21 | 1 | Cuando estoy en el trabajo | 21 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| Q4__conduzco | 69 | 1 | Cuando conduzco | 69 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| Q4__metro | 127 | 1 | Cuando entro/viajo en el metro | 127 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| Q5__personasmayores | 273 | 1 | Personas mayores | 273 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| Q5__niños | 452 | 1 | Niños | 452 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| Q5__estudiantes | 20 | 1 | Estudiantes | 20 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| Q5__personasalergicas | 101 | 1 | Personas alérgicas | 101 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| Q5__personasasmaticas | 272 | 1 | Personas asmáticas o con problemas respiratorios | 272 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| Q5__embarazadas | 220 | 1 | Embarazadas | 220 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| Q5__deportistas | 33 | 1 | Deportistas | 33 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| Q5__repartidores | 47 | 1 | Repartidores en bicicleta o motos | 47 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
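
In the summary above, every Q4, Q5 and Q6 column has a single unique value and is otherwise NaN, which suggests a multi-select encoding: a cell holds the option label when the respondent ticked it and is empty otherwise. Under that assumption, counting non-null cells per column ranks the options. The helper name multiselect_counts and the tiny demo frame are hypothetical; this is a minimal sketch, not the dataset authors' method.

```python
import pandas as pd

def multiselect_counts(frame: pd.DataFrame, prefix: str) -> pd.Series:
    """Count non-null answers per option column whose name starts with `prefix`."""
    cols = [c for c in frame.columns if c.startswith(prefix)]
    return frame[cols].notna().sum().sort_values(ascending=False)

# Tiny stand-in frame with the same kind of encoding (not the real data).
demo = pd.DataFrame({
    "Q3": [4, 3, 5],
    "Q4__calle": ["Quan camino pel carrer", None, "Quan camino pel carrer"],
    "Q4__casa": [None, "Cuando estoy en casa", None],
})
counts = multiselect_counts(demo, "Q4_")
print(counts)
```

On the real frame the call would be multiselect_counts(df, "Q6_"). A single-underscore prefix is used deliberately, since Q6_fertilidad is named with one underscore while its siblings use two.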

A first chart¶

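Histogram of the first numeric column. Here that column is idnum, which looks like a respondent identifier rather than a measurement, so swap in the variable you care about (for example Q3).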

In [4]:
import matplotlib.pyplot as plt

num = df.select_dtypes('number')
if num.shape[1]:
    col = num.columns[0]
    num[col].plot.hist(bins=50, figsize=(8, 4), title=col)
    plt.tight_layout()
else:
    print('No direct numeric columns: explore df on your own.')
[Figure: histogram of idnum, 50 bins]
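
The histogram above uses whichever numeric column comes first, which in this dataset is idnum. For a rating-style column such as Q3, a count per scale point is usually easier to read. The sketch below assumes Q3 is a 1-to-5 rating, which the min and max in the summary suggest but the notebook does not state; scale_distribution is a hypothetical helper and the demo values are made up.

```python
import pandas as pd

def scale_distribution(values: pd.Series, low: int = 1, high: int = 5) -> pd.Series:
    """Count answers per scale point, keeping points that nobody chose."""
    return values.value_counts().reindex(range(low, high + 1), fill_value=0)

# Made-up ratings, only to show the shape of the result.
demo = pd.Series([4, 4, 3, 5, 4, 1])
dist = scale_distribution(demo)
print(dist.to_dict())
```

With the real data, scale_distribution(df["Q3"]).plot.bar(title="Q3") would draw one bar per scale point, including any point with zero answers.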

Working with data larger than memory¶

Your session has a 4 GB RAM limit, but you can analyse files of 10 GB or more without loading them whole:

  • Read only the columns you need: pd.read_csv(f, usecols=[...]) / pd.read_parquet(f, columns=[...]).
  • Process in chunks and keep only the result:
    total = 0
    for chunk in pd.read_csv(file, chunksize=1_000_000):
        total += len(chunk)
    
  • Query with SQL without loading anything — DuckDB (already installed) reads CSV and Parquet straight from disk and only brings the result into memory:
    import duckdb
    duckdb.sql("SELECT column, count(*) FROM '/srv/data/.../file.parquet' GROUP BY column").df()
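
The first two ideas in the list above can be combined: read a single column in chunks and keep only a running tally. This is a minimal sketch; chunked_value_counts is a hypothetical helper, and the inline CSV text stands in for a file too large to load at once.

```python
import io
from collections import Counter

import pandas as pd

def chunked_value_counts(source, column: str, chunksize: int = 1_000_000) -> pd.Series:
    """Tally the values of one column chunk by chunk, keeping only the running totals."""
    totals = Counter()
    for chunk in pd.read_csv(source, usecols=[column], chunksize=chunksize):
        # Only the per-chunk counts survive; the chunk itself is discarded.
        totals.update(chunk[column].value_counts().to_dict())
    return pd.Series(totals, dtype="int64").sort_values(ascending=False)

# Stand-in data; chunksize=2 forces several chunks on this tiny input.
csv_text = "district,score\nEixample,4\nGràcia,3\nEixample,5\nSants,4\nEixample,2\n"
result = chunked_value_counts(io.StringIO(csv_text), "district", chunksize=2)
print(result)
```

Memory use is bounded by the chunk size plus the number of distinct values, so this works well for low-cardinality columns and less well for columns where nearly every value is unique.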
    

Your turn¶

This is just the starting point. Some ideas:

  • Check the dataset challenge on its CSDH data sheet.
  • Work on a copy: right-click the file → Duplicate (or Save Notebook As…). Your changes only live in your Hub space — they're never pushed to GitHub.
  • Edited this notebook and want the original back? Use the Restore cell below (or the restore.ipynb notebook).
  • Questions and results: on the platform forum.

Attribution: data from CitieS-Health Barcelona: Air Pollution & Health Survey Results, license CC-BY-4.0. Notebook from the Citizen Science Data Hub (CSDH) — Fundación Ibercivis.

In [5]:
# ⚠️ RESTORE: this DISCARDS YOUR CHANGES to this notebook and resets it to the original.
# 1. Uncomment the line below (remove the #)   2. Run this cell
# 3. Then: menu File → Reload Notebook from Disk

# !git -C ~/citizen-science-data fetch -q origin && git -C ~/citizen-science-data checkout origin/main -- cities-health-barcelona.ipynb && echo "Restored. Now: File → Reload Notebook from Disk"