Urban Plant Phenology and Frost Risk (Digitized Herbarium Images)¶
Category: Botany · Size: 1.1 GB · Format: CSV · License: CC-BY-4.0 · Zenodo record · Data sheet on the CSDH
Digitised herbarium image dataset for 200 plant species from the eastern USA, with four reproductive phenological phases plus PRISM climate data and population density.
The data is mounted read-only at /srv/data/urban-plant-phenology/.
Save anything you produce in your personal folder (~/).
⚠️ Large dataset (1.1 GB). Your Hub session has 4 GB RAM — do not load the whole file into memory or the kernel will crash. Work like the pros: read only the columns you need, process the file in chunks, or query it straight from disk with DuckDB (no full load). Copy-paste patterns are in "Working with data larger than memory" near the end of this notebook.
What's in the dataset¶
from pathlib import Path

DATA = Path('/srv/data/urban-plant-phenology')
for f in sorted(DATA.rglob('*')):
    if f.is_file():
        print(f"{f.relative_to(DATA)} ({f.stat().st_size/1e6:,.1f} MB)")
crowdsourcedData_upload.csv (1,044.1 MB)
flowering_modeling_data (1).csv (7.9 MB)
fruiting_modeling_data (1).csv (4.5 MB)
peakflowering_modeling_data (1).csv (7.9 MB)
peakfruiting_modeling_data (1).csv (4.5 MB)
Load the data¶
The dataset comes as CSV. In the real world CSV files aren't always uniform (separator , or ;, UTF-8 or Latin-1 encoding), so we use a loader that detects it. We limit to 100,000 rows to explore quickly — drop nrows when you want to work with everything.
import pandas as pd

# Plain and gzip-compressed CSVs; pandas infers the compression from the extension.
csvs = sorted(DATA.rglob('*.csv')) + sorted(DATA.rglob('*.csv.gz'))
print('Using:', csvs[0].name)

def load_csv(path, **kw):
    """Robust reader: detects the separator and tries utf-8 then latin-1."""
    for enc in ('utf-8', 'latin-1'):
        try:
            # sep=None makes the python engine sniff the delimiter.
            return pd.read_csv(path, sep=None, engine='python', encoding=enc, **kw)
        except UnicodeDecodeError:
            continue

df = load_csv(csvs[0], nrows=100_000)
df.head()
Using: crowdsourcedData_upload.csv
| | coreid | bud | flower | fruit | username | hit.id | is.duplicated | link1 | link2 | binomial_species | workerID |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 188 | 0 | 3 | 0 | A2EHH2ZFIRF1BF | 368IUKXGA52RR6COXIMSFR2B56W6P8 | False | http://portal.neherbaria.org/imglib/cnh/UConn_... | http://bgbaseserver.eeb.uconn.edu/DATABASEIMAG... | Anemone_virginiana | 76297ff4f482b17b978cbe67a7836940 |
| 1 | 188 | 0 | 3 | 0 | A1EK1RN2IS5MJM | 368IUKXGA52RR6COXIMSFR2B56W6P8 | False | http://portal.neherbaria.org/imglib/cnh/UConn_... | http://bgbaseserver.eeb.uconn.edu/DATABASEIMAG... | Anemone_virginiana | 441e4940b62d4a9ca6bfe06719ec7305 |
| 2 | 188 | 0 | 3 | 0 | A2NYCAWYA7F29S | 368IUKXGA52RR6COXIMSFR2B56W6P8 | False | http://portal.neherbaria.org/imglib/cnh/UConn_... | http://bgbaseserver.eeb.uconn.edu/DATABASEIMAG... | Anemone_virginiana | 7ec23de1b1c495eefdc356f4db64fc57 |
| 3 | 188 | 0 | 3 | 0 | A2EHH2ZFIRF1BF | 368IUKXGA52RR6COXIMSFR2B56W6P8 | False | http://portal.neherbaria.org/imglib/cnh/UConn_... | http://bgbaseserver.eeb.uconn.edu/DATABASEIMAG... | Anemone_virginiana | 76297ff4f482b17b978cbe67a7836940 |
| 4 | 188 | 0 | 3 | 0 | A2WJ1KQW2UBTG6 | 368IUKXGA52RR6COXIMSFR2B56W6P8 | False | http://portal.neherbaria.org/imglib/cnh/UConn_... | http://bgbaseserver.eeb.uconn.edu/DATABASEIMAG... | Anemone_virginiana | 7e1af15b1626fee69e404890f1449c11 |
First look¶
Shape, types and basic statistics.
df.info()
df.describe(include='all').T.head(20)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100000 entries, 0 to 99999
Data columns (total 11 columns):
 #   Column            Non-Null Count   Dtype
---  ------            --------------   -----
 0   coreid            100000 non-null  int64
 1   bud               100000 non-null  int64
 2   flower            100000 non-null  int64
 3   fruit             100000 non-null  int64
 4   username          100000 non-null  object
 5   hit.id            99958 non-null   object
 6   is.duplicated     99748 non-null   object
 7   link1             100000 non-null  object
 8   link2             100000 non-null  object
 9   binomial_species  100000 non-null  object
 10  workerID          100000 non-null  object
dtypes: int64(4), object(7)
memory usage: 8.4+ MB
| | count | unique | top | freq | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|---|---|---|
| coreid | 100000.0 | NaN | NaN | NaN | 70721.83906 | 36525.124027 | 188.0 | 30598.0 | 70596.0 | 102905.0 | 119278.0 |
| bud | 100000.0 | NaN | NaN | NaN | 2.06367 | 9.600903 | 0.0 | 0.0 | 0.0 | 1.0 | 380.0 |
| flower | 100000.0 | NaN | NaN | NaN | 4.96669 | 16.856964 | 0.0 | 0.0 | 2.0 | 4.0 | 567.0 |
| fruit | 100000.0 | NaN | NaN | NaN | 6.18891 | 15.202529 | 0.0 | 0.0 | 1.0 | 6.0 | 177.0 |
| username | 100000 | 763 | A2BCNRHZU9V7C4 | 4431 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| hit.id | 99958 | 156 | A | 13797 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| is.duplicated | 99748 | 2 | False | 89395 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| link1 | 100000 | 1403 | http://deliver.odai.yale.edu/content/repositor... | 210 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| link2 | 100000 | 1403 | http://deliver.odai.yale.edu/content/repositor... | 210 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| binomial_species | 100000 | 108 | Arisaema_triphyllum | 14343 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| workerID | 100000 | 763 | ca730ba0a75e4b492da32217f717318d | 4431 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
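The preview above shows several rows sharing one coreid, each with a different username, which suggests that the same specimen image was scored by more than one volunteer. Assuming that reading is right, one simple way to get a single value per specimen is the median count across its rows. The sketch below is only an illustration: the helper name consensus_counts is hypothetical, the demo values are made up, and it does not treat rows flagged in is.duplicated any differently.

```python
import pandas as pd

COUNT_COLS = ['bud', 'flower', 'fruit']

def consensus_counts(df):
    """Median bud/flower/fruit count per (species, coreid) across annotation rows."""
    return (
        df.groupby(['binomial_species', 'coreid'])[COUNT_COLS]
          .median()
          .reset_index()
    )

# Made-up rows that only mimic the column layout of the real file.
demo = pd.DataFrame({
    'coreid': [188, 188, 188, 200, 200],
    'binomial_species': ['Anemone_virginiana'] * 3 + ['Arisaema_triphyllum'] * 2,
    'bud': [0, 0, 0, 1, 3],
    'flower': [3, 3, 5, 10, 20],
    'fruit': [0, 1, 2, 0, 0],
})
print(consensus_counts(demo))
```

In the notebook, consensus_counts(df) would apply the same grouping to the 100,000-row sample loaded earlier. The median is used here because the count columns have long right tails, but a mean or a majority vote would be just as easy to swap in.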
A first chart¶
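Histogram of the first numeric column. In this file that is coreid, which looks like an identifier rather than a measurement, so swap it for the variable you care about (for example flower).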
import matplotlib.pyplot as plt

num = df.select_dtypes('number')
if num.shape[1]:
    col = num.columns[0]
    num[col].plot.hist(bins=50, figsize=(8, 4), title=col)
    plt.tight_layout()
else:
    print('No direct numeric columns: explore df on your own.')
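The summary statistics show that the count columns are strongly skewed (flower has a median of 2 and a maximum of 567), so a plain histogram would squeeze almost everything into the first bin. A possible alternative, sketched here with a hypothetical helper named plot_count_hist and made-up demo values, is to plot log(1 + count) instead.

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

def plot_count_hist(counts, name='flower', bins=30, ax=None):
    """Histogram of log(1 + count); log1p keeps zero counts on the axis."""
    if ax is None:
        _, ax = plt.subplots(figsize=(8, 4))
    values = np.log1p(counts.dropna().to_numpy(dtype=float))
    ax.hist(values, bins=bins)
    ax.set_xlabel(f'log(1 + {name} count)')
    ax.set_ylabel('annotation rows')
    ax.set_title(f'{name} counts per annotation (log scale)')
    return ax

# Made-up values for illustration; in the notebook pass df['flower'] instead.
demo_counts = pd.Series([0, 0, 2, 4, 567])
ax = plot_count_hist(demo_counts, name='flower', bins=10)
```

Calling plot_count_hist(df['fruit'], name='fruit') would draw the same view for another column.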
Working with data larger than memory¶
Your session has a 4 GB RAM limit, but you can analyse files of 10 GB or more without loading them whole:
- Read only the columns you need:
  pd.read_csv(f, usecols=[...]) / pd.read_parquet(f, columns=[...]).
- Process in chunks and keep only the result:
  total = 0
  for chunk in pd.read_csv(file, chunksize=1_000_000):
      total += len(chunk)
- Query with SQL without loading anything. DuckDB (already installed) reads
  CSV and Parquet straight from disk and only brings the result into memory:
  import duckdb
  duckdb.sql("SELECT column, count(*) FROM '/srv/data/.../file.parquet' GROUP BY column").df()
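Applied to this dataset, the first two patterns can be combined: read a single column and walk through the file chunk by chunk. The sketch below counts annotation rows per species in the large crowdsourced file. It assumes the file is comma-separated and UTF-8, which is what the preview earlier in the notebook suggests; the function name count_rows_per_species and the chunk size are illustrative choices, not part of the dataset's documentation.

```python
from collections import Counter
from pathlib import Path

import pandas as pd

CSV = Path('/srv/data/urban-plant-phenology/crowdsourcedData_upload.csv')

def count_rows_per_species(path, chunksize=200_000):
    """Count annotation rows per species, one chunk at a time."""
    counts = Counter()
    # usecols keeps a single column in memory; chunksize bounds the rows held at once.
    reader = pd.read_csv(path, usecols=['binomial_species'], chunksize=chunksize)
    for chunk in reader:
        counts.update(chunk['binomial_species'].value_counts().to_dict())
    return pd.Series(counts, name='rows').sort_values(ascending=False)

# Only runs where the dataset is mounted (for example inside the Hub session).
if CSV.exists():
    print(count_rows_per_species(CSV).head(10))
```

Only the running tally and one chunk of one column are ever in memory, so the peak footprint stays far below the 4 GB limit regardless of the file size.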
Your turn¶
This is just the starting point. Some ideas:
- Check the dataset challenge on its CSDH data sheet.
- Work on a copy: right-click the file → Duplicate (or Save Notebook As…). Your changes only live in your Hub space — they're never pushed to GitHub.
- Edited this notebook and want the original back? Use the Restore cell below (or the restore.ipynb notebook).
- Questions and results: on the platform forum.
Attribution: data from Urban Plant Phenology and Frost Risk (Digitized Herbarium Images), license CC-BY-4.0. Notebook from the Citizen Science Data Hub (CSDH) — Fundación Ibercivis.
# ⚠️ RESTORE: this DISCARDS YOUR CHANGES to this notebook and resets it to the original.
# 1. Uncomment the line below (remove the #)
# 2. Run this cell
# 3. Then: menu File → Reload Notebook from Disk
# !git -C ~/citizen-science-data fetch -q origin && git -C ~/citizen-science-data checkout origin/main -- urban-plant-phenology.ipynb && echo "Restored. Now: File → Reload Notebook from Disk"