Urban Plant Phenology and Frost Risk (Digitized Herbarium Images)¶

Category: Botany · Size: 1.1 GB · Format: CSV · License: CC-BY-4.0 · Zenodo record · Data sheet on the CSDH

Digitised herbarium image dataset for 200 plant species from the eastern USA, with four reproductive phenological phases plus PRISM climate data and population density.

The data is mounted read-only at /srv/data/urban-plant-phenology/. Save anything you produce in your personal folder (~/).

⚠️ Large dataset (1.1 GB). Your Hub session has 4 GB RAM — do not load the whole file into memory or the kernel will crash. Work like the pros: read only the columns you need, process the file in chunks, or query it straight from disk with DuckDB (no full load). Copy-paste patterns are in "Working with data larger than memory" near the end of this notebook.

What's in the dataset¶

In [1]:

from pathlib import Path

DATA = Path('/srv/data/urban-plant-phenology')

for f in sorted(DATA.rglob('*')):
    if f.is_file():
        print(f"{f.relative_to(DATA)}  ({f.stat().st_size/1e6:,.1f} MB)")

crowdsourcedData_upload.csv  (1,044.1 MB)
flowering_modeling_data (1).csv  (7.9 MB)
fruiting_modeling_data (1).csv  (4.5 MB)
peakflowering_modeling_data (1).csv  (7.9 MB)
peakfruiting_modeling_data (1).csv  (4.5 MB)

Load the data¶

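The dataset comes as CSV. Real-world CSV files are not always uniform (separator , or ;, encoding UTF-8 or Latin-1), so we use a loader that detects the separator and tries UTF-8 before falling back to Latin-1. We read only the first 100,000 rows to explore quickly. The main file is about 1 GB, so dropping nrows would load all of it and can crash the 4 GB kernel; for full-file work, read only the columns you need, process the file in chunks, or query it with DuckDB.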

In [2]:

import pandas as pd

# '*.gz' already matches '*.csv.gz', so listing both patterns would add those files twice.
csvs = sorted(DATA.rglob('*.csv')) + sorted(DATA.rglob('*.gz'))
print('Using:', csvs[0].name)

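def load_csv(path, **kw):
    """Robust reader: detects the separator and tries utf-8 then latin-1."""
    for enc in ('utf-8', 'latin-1'):
        try:
            # sep=None makes the Python engine sniff the delimiter.
            return pd.read_csv(path, sep=None, engine='python', encoding=enc, **kw)
        except UnicodeDecodeError:
            # latin-1 accepts any byte sequence, so the second pass always decodes.
            continue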

df = load_csv(csvs[0], nrows=100_000)
df.head()

Using: crowdsourcedData_upload.csv

Out[2]:

	coreid	flower	username	hit.id	is.duplicated	link1	link2	binomial_species	workerID
0	188	3	A2EHH2ZFIRF1BF	368IUKXGA52RR6COXIMSFR2B56W6P8	False	http://portal.neherbaria.org/imglib/cnh/UConn_...	http://bgbaseserver.eeb.uconn.edu/DATABASEIMAG...	Anemone_virginiana	76297ff4f482b17b978cbe67a7836940
1	188	3	A1EK1RN2IS5MJM	368IUKXGA52RR6COXIMSFR2B56W6P8	False	http://portal.neherbaria.org/imglib/cnh/UConn_...	http://bgbaseserver.eeb.uconn.edu/DATABASEIMAG...	Anemone_virginiana	441e4940b62d4a9ca6bfe06719ec7305
2	188	3	A2NYCAWYA7F29S	368IUKXGA52RR6COXIMSFR2B56W6P8	False	http://portal.neherbaria.org/imglib/cnh/UConn_...	http://bgbaseserver.eeb.uconn.edu/DATABASEIMAG...	Anemone_virginiana	7ec23de1b1c495eefdc356f4db64fc57
3	188	3	A2EHH2ZFIRF1BF	368IUKXGA52RR6COXIMSFR2B56W6P8	False	http://portal.neherbaria.org/imglib/cnh/UConn_...	http://bgbaseserver.eeb.uconn.edu/DATABASEIMAG...	Anemone_virginiana	76297ff4f482b17b978cbe67a7836940
4	188	3	A2WJ1KQW2UBTG6	368IUKXGA52RR6COXIMSFR2B56W6P8	False	http://portal.neherbaria.org/imglib/cnh/UConn_...	http://bgbaseserver.eeb.uconn.edu/DATABASEIMAG...	Anemone_virginiana	7e1af15b1626fee69e404890f1449c11
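
Once the sample looks right, the full file can be summarised without holding it in memory. The sketch below is one possible way to do that: it parses a single column and keeps only a running tally per chunk. It assumes the column name binomial_species from the preview above and a comma-separated file; count_by_column is a hypothetical helper, not part of this notebook, and the tiny in-memory CSV with made-up rows only stands in for the real file so the block runs anywhere.

```python
import io
from collections import Counter

import pandas as pd


def count_by_column(source, column, chunksize=200_000, **read_kw):
    """Count rows per value of `column`, reading `source` chunk by chunk.

    Hypothetical helper: only `column` is parsed (usecols) and only the
    running tally is kept, so memory use stays close to one chunk.
    """
    counts = Counter()
    for chunk in pd.read_csv(source, usecols=[column], chunksize=chunksize, **read_kw):
        # value_counts() skips missing values by default.
        counts.update(chunk[column].value_counts().to_dict())
    return pd.Series(counts, name="n_rows").sort_values(ascending=False)


# Stand-in data with made-up rows; in the Hub you would call something like
# count_by_column(DATA / "crowdsourcedData_upload.csv", "binomial_species")
demo = io.StringIO(
    "coreid,binomial_species\n"
    "1,Anemone_virginiana\n"
    "2,Arisaema_triphyllum\n"
    "3,Arisaema_triphyllum\n"
)
species_counts = count_by_column(demo, "binomial_species", chunksize=2)
print(species_counts)
```

If the real file turns out to use another separator, pass it through, for example count_by_column(path, column, sep=';').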

First look¶

Shape, types and basic statistics.

In [3]:

df.info()
df.describe(include='all').T.head(20)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100000 entries, 0 to 99999
Data columns (total 11 columns):
 #   Column            Non-Null Count   Dtype 
---  ------            --------------   ----- 
 0   coreid            100000 non-null  int64 
 1   bud               100000 non-null  int64 
 2   flower            100000 non-null  int64 
 3   fruit             100000 non-null  int64 
 4   username          100000 non-null  object
 5   hit.id            99958 non-null   object
 6   is.duplicated     99748 non-null   object
 7   link1             100000 non-null  object
 8   link2             100000 non-null  object
 9   binomial_species  100000 non-null  object
 10  workerID          100000 non-null  object
dtypes: int64(4), object(7)
memory usage: 8.4+ MB

Out[3]:

	count	unique	top	freq	mean	std	min	25%	50%	75%	max
coreid	100000.0	NaN	NaN	NaN	70721.83906	36525.124027	188.0	30598.0	70596.0	102905.0	119278.0
bud	100000.0	NaN	NaN	NaN	2.06367	9.600903	0.0	0.0	0.0	1.0	380.0
flower	100000.0	NaN	NaN	NaN	4.96669	16.856964	0.0	0.0	2.0	4.0	567.0
fruit	100000.0	NaN	NaN	NaN	6.18891	15.202529	0.0	0.0	1.0	6.0	177.0
username	100000	763	A2BCNRHZU9V7C4	4431	NaN	NaN	NaN	NaN	NaN	NaN	NaN
hit.id	99958	156	A	13797	NaN	NaN	NaN	NaN	NaN	NaN	NaN
is.duplicated	99748	2	False	89395	NaN	NaN	NaN	NaN	NaN	NaN	NaN
link1	100000	1403	http://deliver.odai.yale.edu/content/repositor...	210	NaN	NaN	NaN	NaN	NaN	NaN	NaN
link2	100000	1403	http://deliver.odai.yale.edu/content/repositor...	210	NaN	NaN	NaN	NaN	NaN	NaN	NaN
binomial_species	100000	108	Arisaema_triphyllum	14343	NaN	NaN	NaN	NaN	NaN	NaN	NaN
workerID	100000	763	ca730ba0a75e4b492da32217f717318d	4431	NaN	NaN	NaN	NaN	NaN	NaN	NaN
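
The "8.4+ MB" reported by info() is a lower bound: the "+" means the text columns were not fully measured. With a 4 GB session it can help to check the real footprint and shrink repetitive text columns. The following is a minimal sketch on made-up rows that reuse the column names coreid and binomial_species from the output above; the same two calls should work on df, but the savings on the real data are not shown here.

```python
import pandas as pd

# Made-up rows that reuse column names from the preview above.
sample = pd.DataFrame({
    "coreid": range(10_000),
    "binomial_species": ["Anemone_virginiana", "Arisaema_triphyllum"] * 5_000,
})

# deep=True also measures the stored strings, which the default estimate
# for text columns leaves out.
before = sample.memory_usage(deep=True).sum()

# A category column stores each distinct string once plus small integer codes.
compact = sample.assign(
    binomial_species=sample["binomial_species"].astype("category")
)
after = compact.memory_usage(deep=True).sum()

print(f"strings: {before / 1e6:.2f} MB  ->  category: {after / 1e6:.2f} MB")
```

Columns with few distinct values relative to their length (here binomial_species has 108 distinct values in 100,000 rows) are the usual candidates; near-unique columns gain little from this conversion.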

A first chart¶

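Histogram of the first numeric column. In this sample that column is coreid, an identifier rather than a measurement, so swap in the variable you care about (for example bud, flower or fruit).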

In [4]:

import matplotlib.pyplot as plt

num = df.select_dtypes('number')
if num.shape[1]:
    col = num.columns[0]
    num[col].plot.hist(bins=50, figsize=(8, 4), title=col)
    plt.tight_layout()
else:
    print('No direct numeric columns: explore df on your own.')

[Figure: histogram of the first numeric column, 50 bins]
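
The summary table shows that flower is strongly right-skewed (median 2, maximum 567), so a linear histogram would collapse into a single tall bar. One option, sketched below, is a log-scaled y-axis. plot_count_hist is a hypothetical helper and the demo values are made up; in the notebook the call would be plot_count_hist(df, "flower").

```python
import matplotlib.pyplot as plt
import pandas as pd


def plot_count_hist(frame, column, bins=50):
    """Histogram of one column with a log-scaled y-axis (hypothetical helper)."""
    ax = frame[column].plot.hist(bins=bins, figsize=(8, 4), title=column, logy=True)
    ax.set_xlabel(column)
    ax.set_ylabel("rows (log scale)")
    plt.tight_layout()
    return ax


# Made-up, right-skewed values standing in for the real column.
demo = pd.DataFrame(
    {"flower": [0] * 60 + [1] * 20 + [2] * 10 + [5] * 6 + [40] * 3 + [300]}
)
ax = plot_count_hist(demo, "flower")
```

The log axis keeps the rare large values visible next to the many zeros; empty bins are simply not drawn.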

Working with data larger than memory¶

Your session has a 4 GB RAM limit, but you can analyse files of 10 GB or more without loading them whole:

Read only the columns you need: pd.read_csv(f, usecols=[...]) / pd.read_parquet(f, columns=[...]).

Process in chunks and keep only the result:

total = 0
for chunk in pd.read_csv(file, chunksize=1_000_000):
    total += len(chunk)

Query with SQL without loading anything — DuckDB (already installed) reads CSV and Parquet straight from disk and only brings the result into memory:
```
import duckdb
# Placeholders: replace some_column and the path with your own (CSV paths work too).
duckdb.sql("SELECT some_column, count(*) FROM '/srv/data/.../file.parquet' GROUP BY some_column").df()
```

Your turn¶

This is just the starting point. Some ideas:

Check the dataset challenge on its CSDH data sheet.
Work on a copy: right-click the file → Duplicate (or Save Notebook As…). Your changes only live in your Hub space — they're never pushed to GitHub.
Edited this notebook and want the original back? Use the Restore cell below (or the restore.ipynb notebook).
Questions and results: on the platform forum.

Attribution: data from Urban Plant Phenology and Frost Risk (Digitized Herbarium Images), license CC-BY-4.0. Notebook from the Citizen Science Data Hub (CSDH) — Fundación Ibercivis.

In [5]:

# ⚠️ RESTORE: this DISCARDS YOUR CHANGES to this notebook and resets it to the original.
# 1. Uncomment the line below (remove the #)   2. Run this cell
# 3. Then: menu File → Reload Notebook from Disk

# !git -C ~/citizen-science-data fetch -q origin && git -C ~/citizen-science-data checkout origin/main -- urban-plant-phenology.ipynb && echo "Restored. Now: File → Reload Notebook from Disk"