Pan-European Common Bird Monitoring Scheme (PECBMS)

Category: Ornithology · Size: 1.3 MB · Format: CSV, XLSX · License: CC-BY-4.0 (use the open subset only; ES/CY 2016-17 restricted) · Zenodo record · Data sheet on the CSDH

Population indices for 170 breeding bird species across 28 European countries, derived from yearly counts made by ~15,000 volunteers following standardised protocols.

The data is mounted read-only at /srv/data/pecbms-birds/. Save anything you produce in your personal folder (~/).

What's in the dataset

In [1]:
from pathlib import Path

DATA = Path('/srv/data/pecbms-birds')

for f in sorted(DATA.rglob('*')):
    if f.is_file():
        print(f"{f.relative_to(DATA)}  ({f.stat().st_size/1e6:,.1f} MB)")
indices2017.csv  (0.2 MB)
monitoring_schemes.xlsx  (0.0 MB)
national_indices2017.csv  (1.0 MB)
species_country.csv  (0.0 MB)
trends2017.csv  (0.0 MB)
trends_short2017.csv  (0.0 MB)

Load the data

The dataset comes as CSV. Real-world CSV files aren't always uniform (separator , or ;, UTF-8 or Latin-1 encoding), so we use a loader that detects the separator and falls back from UTF-8 to Latin-1. We read at most 100,000 rows to explore quickly; drop nrows when you want to work with everything. The file loaded below has only 5,628 rows, so the limit has no effect on it.

In [2]:
import pandas as pd

# Plain CSVs first, then gzip-compressed ones (an extra '*.gz' pattern would list '*.csv.gz' files twice).
csvs = sorted(DATA.rglob('*.csv')) + sorted(DATA.rglob('*.csv.gz'))
print('Using:', csvs[0].name)

def load_csv(path, **kw):
    """Robust reader: detects the separator and tries utf-8 then latin-1."""
    for enc in ('utf-8', 'latin-1'):
        try:
            return pd.read_csv(path, sep=None, engine='python', encoding=enc, **kw)
        except UnicodeDecodeError:
            continue

df = load_csv(csvs[0], nrows=100_000)
df.head()
Using: indices2017.csv
Out[2]:
species euring_code year index se
0 Tachybaptus ruficollis 70 1990 100 0.0
1 Tachybaptus ruficollis 70 1991 77 18.0
2 Tachybaptus ruficollis 70 1992 106 26.0
3 Tachybaptus ruficollis 70 1993 103 21.0
4 Tachybaptus ruficollis 70 1994 128 27.0
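
To see what the loader's detection does in practice, here is a minimal sketch that writes a small semicolon-separated, Latin-1 encoded file and reads it back. The file name and its three rows are made up for illustration and are not part of the dataset; load_csv is repeated only so the block runs on its own.

```python
import tempfile
from pathlib import Path

import pandas as pd


def load_csv(path, **kw):
    """Same loader as above: sniff the separator, try utf-8 then latin-1."""
    for enc in ('utf-8', 'latin-1'):
        try:
            return pd.read_csv(path, sep=None, engine='python', encoding=enc, **kw)
        except UnicodeDecodeError:
            continue


# Hypothetical sample: ';' as separator and a Latin-1 encoded 'ñ',
# which is not valid UTF-8 and so forces the fallback encoding.
sample = "country;year;index\nEspaña;1990;100\nEspaña;1991;97\n"

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / 'sample_latin1.csv'
    path.write_bytes(sample.encode('latin-1'))
    demo = load_csv(path)

print(demo.columns.tolist())
print(demo.loc[0, 'country'])
```

Because Latin-1 can decode any byte sequence, the second attempt does not raise a decoding error; a file in some other encoding would load with wrong characters rather than raise, so it is worth eyeballing text columns after loading.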

First look

Shape, types and basic statistics.

In [3]:
df.info()
df.describe(include='all').T.head(20)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5628 entries, 0 to 5627
Data columns (total 5 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   species      5628 non-null   object 
 1   euring_code  5628 non-null   int64  
 2   year         5628 non-null   int64  
 3   index        5628 non-null   int64  
 4   se           5628 non-null   float64
dtypes: float64(1), int64(3), object(1)
memory usage: 220.0+ KB
Out[3]:
count unique top freq mean std min 25% 50% 75% max
species 5628 170 Emberiza calandra 38 NaN NaN NaN NaN NaN NaN NaN
euring_code 5628.0 NaN NaN NaN 10893.665956 4699.543011 70.0 7240.0 11870.0 14640.0 18820.0
year 5628.0 NaN NaN NaN 2000.059701 10.680107 1980.0 1991.0 2001.0 2009.0 2017.0
index 5628.0 NaN NaN NaN 111.312011 104.486863 1.0 68.0 97.0 123.0 2175.0
se 5628.0 NaN NaN NaN 30.746722 95.895804 -1.040816 7.0 12.0 22.0 2103.0
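
Two things in this summary are worth a closer look before any analysis: the minimum se is -1.040816, although a standard error should not be negative, and species do not all have the same number of rows (Emberiza calandra has 38, while 5,628 rows over 170 species averages about 33). The sketch below shows one way to list such rows. The helper names and the small stand-in frame are hypothetical; the helpers only surface rows for inspection and make no assumption about why a negative value is present.

```python
import pandas as pd


def negative_se_rows(frame):
    """Rows whose standard error is below zero."""
    return frame[frame['se'] < 0]


def first_year_rows(frame):
    """Earliest row of each species, to inspect where each series starts."""
    first = frame.groupby('species')['year'].idxmin()
    return frame.loc[first].reset_index(drop=True)


# Hypothetical stand-in with the same columns as indices2017.csv.
toy = pd.DataFrame({
    'species': ['Species a', 'Species a', 'Species b', 'Species b'],
    'euring_code': [1, 1, 2, 2],
    'year': [1990, 1991, 1980, 1981],
    'index': [100, 77, 100, 95],
    'se': [0.0, 18.0, 0.0, -1.0],
})

bad = negative_se_rows(toy)
starts = first_year_rows(toy)
print(bad)
print(starts)
```

On the real data, call negative_se_rows(df) and first_year_rows(df) instead of passing the stand-in frame.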

A first chart

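Histogram of the first numeric column. In this file that column is euring_code, a species identifier, so the plot only confirms that plotting works; swap in the variable you care about, such as index.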

In [4]:
import matplotlib.pyplot as plt

num = df.select_dtypes('number')
if num.shape[1]:
    col = num.columns[0]
    num[col].plot.hist(bins=50, figsize=(8, 4), title=col)
    plt.tight_layout()
else:
    print('No direct numeric columns: explore df on your own.')
[Figure: histogram of euring_code, 50 bins]
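
A chart that is closer to how this table is organised is one species' index over time. The sketch below is one possible way to draw it, with a shaded band of plus or minus one se around the line. The function name plot_species_index is hypothetical, the band assumes se is expressed on the same scale as index (not confirmed here), and the stand-in rows are the five rows shown in the head() output above.

```python
import matplotlib.pyplot as plt
import pandas as pd


def plot_species_index(frame, species, ax=None):
    """Plot one species' index by year with a +/- 1 se band."""
    sub = frame.loc[frame['species'] == species].sort_values('year')
    if ax is None:
        _, ax = plt.subplots(figsize=(8, 4))
    ax.plot(sub['year'], sub['index'], marker='o')
    # Assumption: se is on the same scale as index.
    ax.fill_between(sub['year'], sub['index'] - sub['se'],
                    sub['index'] + sub['se'], alpha=0.3)
    ax.set(xlabel='year', ylabel='index', title=species)
    return ax


# Stand-in rows copied from the head() output of indices2017.csv.
toy = pd.DataFrame({
    'species': ['Tachybaptus ruficollis'] * 5,
    'euring_code': [70] * 5,
    'year': [1990, 1991, 1992, 1993, 1994],
    'index': [100, 77, 106, 103, 128],
    'se': [0.0, 18.0, 26.0, 21.0, 27.0],
})

ax = plot_species_index(toy, 'Tachybaptus ruficollis')
plt.tight_layout()
```

With the full table loaded, plot_species_index(df, 'Emberiza calandra') draws the longest series in the file.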

Working with data larger than memory

Your session has a 4 GB RAM limit, but you can analyse files of 10 GB or more without loading them whole:

  • Read only the columns you need: pd.read_csv(f, usecols=[...]) / pd.read_parquet(f, columns=[...]).
  • Process in chunks and keep only the result:
    total = 0
    for chunk in pd.read_csv(file, chunksize=1_000_000):
        total += len(chunk)
    
  • Query with SQL without loading anything — DuckDB (already installed) reads CSV and Parquet straight from disk and only brings the result into memory:
    import duckdb
    duckdb.sql("SELECT column, count(*) FROM '/srv/data/.../file.parquet' GROUP BY column").df()
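
The chunked pattern above counts rows; for a grouped statistic the partial results need to be combined with some care. Here is a minimal sketch, assuming a CSV with species and index columns as in indices2017.csv, that computes the mean index per species by summing per-chunk sums and counts. The function name and the tiny temporary file are hypothetical, and the chunk size of 2 is chosen only to force several chunks.

```python
import tempfile
from pathlib import Path

import pandas as pd


def mean_index_by_species(path, chunksize=1_000_000):
    """Mean of `index` per species, reading the CSV chunk by chunk."""
    partials = []
    for chunk in pd.read_csv(path, usecols=['species', 'index'], chunksize=chunksize):
        # Keep sums and counts: averaging per-chunk means would weight chunks wrongly.
        partials.append(chunk.groupby('species')['index'].agg(['sum', 'count']))
    totals = pd.concat(partials).groupby(level=0).sum()
    return totals['sum'] / totals['count']


# Hypothetical stand-in file with five rows.
rows = "species,index\nA,100\nA,80\nB,50\nA,120\nB,70\n"

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / 'toy.csv'
    path.write_text(rows)
    result = mean_index_by_species(path, chunksize=2)

print(result)
```

Only the small per-chunk summaries stay in memory, so the same function works on a file far larger than the 4 GB session limit.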
    

Your turn

This is just the starting point. Some ideas:

  • Check the dataset challenge on its CSDH data sheet.
  • Work on a copy: right-click the file → Duplicate (or Save Notebook As…). Your changes only live in your Hub space — they're never pushed to GitHub.
  • Edited this notebook and want the original back? Use the Restore cell below (or the restore.ipynb notebook).
  • Questions and results: on the platform forum.

Attribution: data from Pan-European Common Bird Monitoring Scheme (PECBMS), license CC-BY-4.0. Notebook from the Citizen Science Data Hub (CSDH) — Fundación Ibercivis.

In [5]:
# ⚠️ RESTORE: this DISCARDS YOUR CHANGES to this notebook and resets it to the original.
# 1. Uncomment the line below (remove the #)   2. Run this cell
# 3. Then: menu File → Reload Notebook from Disk

# !git -C ~/citizen-science-data fetch -q origin && git -C ~/citizen-science-data checkout origin/main -- pecbms-birds.ipynb && echo "Restored. Now: File → Reload Notebook from Disk"