Pan-European Common Bird Monitoring Scheme (PECBMS)
Category: Ornithology · Size: 1.3 MB · Format: CSV, XLSX · License: CC-BY-4.0 (use open subset only; ES/CY 2016-17 restricted) · Zenodo record · Data sheet on the CSDH
Population indices for 170 breeding bird species across 28 European countries, produced by ~15,000 volunteers counting birds with standardised protocols every year.
The data is mounted read-only at /srv/data/pecbms-birds/.
Save anything you produce in your personal folder (~/).
What's in the dataset
from pathlib import Path

DATA = Path('/srv/data/pecbms-birds')
for f in sorted(DATA.rglob('*')):
    if f.is_file():
        print(f"{f.relative_to(DATA)} ({f.stat().st_size/1e6:,.1f} MB)")
indices2017.csv (0.2 MB)
monitoring_schemes.xlsx (0.0 MB)
national_indices2017.csv (1.0 MB)
species_country.csv (0.0 MB)
trends2017.csv (0.0 MB)
trends_short2017.csv (0.0 MB)
Load the data
Most of the dataset comes as CSV. Real-world CSV files are not always uniform (the separator may be , or ; and the encoding UTF-8 or Latin-1), so we use a loader that detects the separator and falls back to Latin-1 when UTF-8 decoding fails. We read at most 100,000 rows to explore quickly; drop nrows when you want to work with everything.
import pandas as pd

csvs = sorted(DATA.rglob('*.csv')) + sorted(DATA.rglob('*.csv.gz')) + sorted(DATA.rglob('*.gz'))
print('Using:', csvs[0].name)

def load_csv(path, **kw):
    """Robust reader: detects the separator and tries utf-8 then latin-1."""
    for enc in ('utf-8', 'latin-1'):
        try:
            # sep=None with the python engine lets pandas sniff the delimiter
            return pd.read_csv(path, sep=None, engine='python', encoding=enc, **kw)
        except UnicodeDecodeError:
            continue

df = load_csv(csvs[0], nrows=100_000)
df.head()
Using: indices2017.csv
| | species | euring_code | year | index | se |
|---|---|---|---|---|---|
| 0 | Tachybaptus ruficollis | 70 | 1990 | 100 | 0.0 |
| 1 | Tachybaptus ruficollis | 70 | 1991 | 77 | 18.0 |
| 2 | Tachybaptus ruficollis | 70 | 1992 | 106 | 26.0 |
| 3 | Tachybaptus ruficollis | 70 | 1993 | 103 | 21.0 |
| 4 | Tachybaptus ruficollis | 70 | 1994 | 128 | 27.0 |
First look
Shape, types and basic statistics.
df.info()
df.describe(include='all').T.head(20)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5628 entries, 0 to 5627
Data columns (total 5 columns):
 #   Column       Non-Null Count  Dtype
---  ------       --------------  -----
 0   species      5628 non-null   object
 1   euring_code  5628 non-null   int64
 2   year         5628 non-null   int64
 3   index        5628 non-null   int64
 4   se           5628 non-null   float64
dtypes: float64(1), int64(3), object(1)
memory usage: 220.0+ KB
| | count | unique | top | freq | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|---|---|---|
| species | 5628 | 170 | Emberiza calandra | 38 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| euring_code | 5628.0 | NaN | NaN | NaN | 10893.665956 | 4699.543011 | 70.0 | 7240.0 | 11870.0 | 14640.0 | 18820.0 |
| year | 5628.0 | NaN | NaN | NaN | 2000.059701 | 10.680107 | 1980.0 | 1991.0 | 2001.0 | 2009.0 | 2017.0 |
| index | 5628.0 | NaN | NaN | NaN | 111.312011 | 104.486863 | 1.0 | 68.0 | 97.0 | 123.0 | 2175.0 |
| se | 5628.0 | NaN | NaN | NaN | 30.746722 | 95.895804 | -1.040816 | 7.0 | 12.0 | 22.0 | 2103.0 |
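The summary reports a minimum se of about -1.04, and a standard error cannot be negative, so those rows deserve a look before any analysis. Here is a minimal sketch for listing them. The helper name flag_negative_se is hypothetical. The demo frame reuses the five rows printed by df.head() and adds one made-up 1995 row with a negative se purely to show what gets flagged. Note that the column has to be accessed as frame["index"], because frame.index is the DataFrame's row index.

```python
import pandas as pd


def flag_negative_se(frame: pd.DataFrame) -> pd.DataFrame:
    """Return the rows whose standard error is below zero."""
    return frame.loc[frame["se"] < 0]


# First five rows as shown by df.head(), plus ONE HYPOTHETICAL row (1995)
# with a negative se, added only so the demo has something to flag.
sample = pd.DataFrame({
    "species": ["Tachybaptus ruficollis"] * 6,
    "euring_code": [70] * 6,
    "year": [1990, 1991, 1992, 1993, 1994, 1995],
    "index": [100, 77, 106, 103, 128, 110],
    "se": [0.0, 18.0, 26.0, 21.0, 27.0, -1.0],
})

suspect = flag_negative_se(sample)
print(suspect[["species", "year", "index", "se"]])
```

In the notebook itself you would call flag_negative_se(df) on the loaded data. Whether to drop, correct or keep such rows depends on how the source defines se, which this sketch does not assume.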
A first chart
Histogram of the first numeric column — swap it for the variable you care about.
import matplotlib.pyplot as plt

num = df.select_dtypes('number')
if num.shape[1]:
    col = num.columns[0]
    num[col].plot.hist(bins=50, figsize=(8, 4), title=col)
    plt.tight_layout()
else:
    print('No direct numeric columns: explore df on your own.')
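For this file the first numeric column is euring_code, a species identifier, so its histogram says little about the birds. A more informative first chart might be the index of one species over time. The sketch below is one way to draw it, assuming se is the standard error of index in the same units, so that index ± se makes a reasonable shaded band. The function name plot_species_index is hypothetical, and the demo frame contains only the five rows printed by df.head().

```python
import matplotlib.pyplot as plt
import pandas as pd


def plot_species_index(frame, species, ax=None):
    """Plot index against year for one species, with a +/- 1 se band."""
    sub = frame.loc[frame["species"] == species].sort_values("year")
    years = sub["year"].to_numpy()
    index = sub["index"].to_numpy()  # the column, not the row index
    se = sub["se"].to_numpy()
    if ax is None:
        _, ax = plt.subplots(figsize=(8, 4))
    ax.plot(years, index, marker="o")
    # Assumption: se is in the same units as index.
    ax.fill_between(years, index - se, index + se, alpha=0.3)
    ax.set_xlabel("year")
    ax.set_ylabel("index")
    ax.set_title(species)
    return ax


# Demo on the five rows shown by df.head().
sample = pd.DataFrame({
    "species": ["Tachybaptus ruficollis"] * 5,
    "year": [1990, 1991, 1992, 1993, 1994],
    "index": [100, 77, 106, 103, 128],
    "se": [0.0, 18.0, 26.0, 21.0, 27.0],
})
ax = plot_species_index(sample, "Tachybaptus ruficollis")
plt.tight_layout()
```

In the notebook, pass the loaded df and any name from df["species"].unique() instead of the demo frame.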
Working with data larger than memory
Your session has a 4 GB RAM limit, but you can analyse files of 10 GB or more without loading them whole:
- Read only the columns you need: pd.read_csv(f, usecols=[...]) / pd.read_parquet(f, columns=[...]).
- Process in chunks and keep only the result:
    total = 0
    for chunk in pd.read_csv(file, chunksize=1_000_000):
        total += len(chunk)
- Query with SQL without loading anything: DuckDB (already installed) reads CSV and Parquet straight from disk and only brings the result into memory:
    import duckdb
    duckdb.sql("SELECT column, count(*) FROM '/srv/data/.../file.parquet' GROUP BY column").df()
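The first two ideas combine naturally: read only the needed columns, chunk by chunk, and keep running totals rather than the rows themselves. Here is a minimal sketch that computes the mean index per species this way. It assumes a comma-separated UTF-8 file with species and index columns. The function name mean_index_by_species and the demo file name are hypothetical, and the demo writes the five rows from df.head() to a temporary file so the block runs anywhere.

```python
import tempfile
from pathlib import Path

import pandas as pd


def mean_index_by_species(path, chunksize=1_000_000):
    """Mean of `index` per species, holding one chunk in memory at a time."""
    sums = pd.Series(dtype="float64")
    counts = pd.Series(dtype="float64")
    reader = pd.read_csv(path, usecols=["species", "index"], chunksize=chunksize)
    for chunk in reader:
        grouped = chunk.groupby("species")["index"]
        # fill_value=0 lets species first seen in a later chunk join the totals
        sums = sums.add(grouped.sum(), fill_value=0)
        counts = counts.add(grouped.count(), fill_value=0)
    return sums / counts


# Demo: the five rows shown by df.head(), read back two rows at a time.
sample = pd.DataFrame({
    "species": ["Tachybaptus ruficollis"] * 5,
    "year": [1990, 1991, 1992, 1993, 1994],
    "index": [100, 77, 106, 103, 128],
    "se": [0.0, 18.0, 26.0, 21.0, 27.0],
})
with tempfile.TemporaryDirectory() as tmp:
    demo_path = Path(tmp) / "demo_indices.csv"  # hypothetical file name
    sample.to_csv(demo_path, index=False)
    means = mean_index_by_species(demo_path, chunksize=2)
print(means)
```

Keeping sums and counts separately, and dividing only at the end, gives the same answer as a single pass over the whole file. Averaging per-chunk means would not, because chunks can hold different numbers of rows per species.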
Your turn
This is just the starting point. Some ideas:
- Check the dataset challenge on its CSDH data sheet.
- Work on a copy: right-click the file → Duplicate (or Save Notebook As…). Your changes only live in your Hub space — they're never pushed to GitHub.
- Edited this notebook and want the original back? Use the Restore cell below (or the restore.ipynb notebook).
- Questions and results: on the platform forum.
Attribution: data from Pan-European Common Bird Monitoring Scheme (PECBMS), license CC-BY-4.0. Notebook from the Citizen Science Data Hub (CSDH) — Fundación Ibercivis.
# ⚠️ RESTORE: this DISCARDS YOUR CHANGES to this notebook and resets it to the original.
# 1. Uncomment the line below (remove the leading #)
# 2. Run this cell
# 3. Then: menu File → Reload Notebook from Disk
# !git -C ~/citizen-science-data fetch -q origin && git -C ~/citizen-science-data checkout origin/main -- pecbms-birds.ipynb && echo "Restored. Now: File → Reload Notebook from Disk"