Urban Plant Phenology and Frost Risk (Digitized Herbarium Images)

Category: Botany · Size: 1.1 GB · Format: CSV · License: CC-BY-4.0 · Zenodo record · Data sheet on the CSDH

Digitised herbarium image dataset covering 200 plant species from the eastern USA, annotated with four reproductive phenological phases and accompanied by PRISM climate data and population density.

The data is mounted read-only at /srv/data/urban-plant-phenology/. Save anything you produce in your personal folder (~/).

⚠️ Large dataset (1.1 GB). Your Hub session has 4 GB of RAM, so loading the whole file into memory is likely to crash the kernel. Instead, read only the columns you need, process the file in chunks, or query it directly from disk with DuckDB (no full load). Copy-paste patterns are in "Working with data larger than memory" near the end of this notebook.

What's in the dataset

In [1]:
from pathlib import Path

DATA = Path('/srv/data/urban-plant-phenology')

for f in sorted(DATA.rglob('*')):
    if f.is_file():
        print(f"{f.relative_to(DATA)}  ({f.stat().st_size/1e6:,.1f} MB)")
crowdsourcedData_upload.csv  (1,044.1 MB)
flowering_modeling_data (1).csv  (7.9 MB)
fruiting_modeling_data (1).csv  (4.5 MB)
peakflowering_modeling_data (1).csv  (7.9 MB)
peakfruiting_modeling_data (1).csv  (4.5 MB)

Load the data

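The dataset comes as CSV. Real-world CSV files aren't always uniform (separator , or ;, UTF-8 or Latin-1 encoding), so we use a loader that detects the separator and falls back from UTF-8 to Latin-1. We read only the first 100,000 rows to explore quickly. Dropping nrows would load the full 1 GB file, which risks exceeding the 4 GB session limit; for full-file work, use the patterns in "Working with data larger than memory".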

In [2]:
import pandas as pd

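# Candidate files: plain and gzip-compressed CSVs ('*.gz' already matches '*.csv.gz')
csvs = sorted(DATA.rglob('*.csv')) + sorted(DATA.rglob('*.gz'))
print('Using:', csvs[0].name)

def load_csv(path, **kw):
    """Robust reader: detects the separator and tries utf-8 then latin-1."""
    for enc in ('utf-8', 'latin-1'):
        try:
            # sep=None lets the python engine sniff the delimiter
            return pd.read_csv(path, sep=None, engine='python', encoding=enc, **kw)
        except UnicodeDecodeError:
            continue
    # Fail loudly instead of silently returning None
    raise ValueError(f'Could not decode {path} as utf-8 or latin-1')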

df = load_csv(csvs[0], nrows=100_000)
df.head()
Using: crowdsourcedData_upload.csv
Out[2]:
coreid bud flower fruit username hit.id is.duplicated link1 link2 binomial_species workerID
0 188 0 3 0 A2EHH2ZFIRF1BF 368IUKXGA52RR6COXIMSFR2B56W6P8 False http://portal.neherbaria.org/imglib/cnh/UConn_... http://bgbaseserver.eeb.uconn.edu/DATABASEIMAG... Anemone_virginiana 76297ff4f482b17b978cbe67a7836940
1 188 0 3 0 A1EK1RN2IS5MJM 368IUKXGA52RR6COXIMSFR2B56W6P8 False http://portal.neherbaria.org/imglib/cnh/UConn_... http://bgbaseserver.eeb.uconn.edu/DATABASEIMAG... Anemone_virginiana 441e4940b62d4a9ca6bfe06719ec7305
2 188 0 3 0 A2NYCAWYA7F29S 368IUKXGA52RR6COXIMSFR2B56W6P8 False http://portal.neherbaria.org/imglib/cnh/UConn_... http://bgbaseserver.eeb.uconn.edu/DATABASEIMAG... Anemone_virginiana 7ec23de1b1c495eefdc356f4db64fc57
3 188 0 3 0 A2EHH2ZFIRF1BF 368IUKXGA52RR6COXIMSFR2B56W6P8 False http://portal.neherbaria.org/imglib/cnh/UConn_... http://bgbaseserver.eeb.uconn.edu/DATABASEIMAG... Anemone_virginiana 76297ff4f482b17b978cbe67a7836940
4 188 0 3 0 A2WJ1KQW2UBTG6 368IUKXGA52RR6COXIMSFR2B56W6P8 False http://portal.neherbaria.org/imglib/cnh/UConn_... http://bgbaseserver.eeb.uconn.edu/DATABASEIMAG... Anemone_virginiana 7e1af15b1626fee69e404890f1449c11

First look

Shape, types and basic statistics.

In [3]:
df.info()
df.describe(include='all').T.head(20)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100000 entries, 0 to 99999
Data columns (total 11 columns):
 #   Column            Non-Null Count   Dtype 
---  ------            --------------   ----- 
 0   coreid            100000 non-null  int64 
 1   bud               100000 non-null  int64 
 2   flower            100000 non-null  int64 
 3   fruit             100000 non-null  int64 
 4   username          100000 non-null  object
 5   hit.id            99958 non-null   object
 6   is.duplicated     99748 non-null   object
 7   link1             100000 non-null  object
 8   link2             100000 non-null  object
 9   binomial_species  100000 non-null  object
 10  workerID          100000 non-null  object
dtypes: int64(4), object(7)
memory usage: 8.4+ MB
Out[3]:
count unique top freq mean std min 25% 50% 75% max
coreid 100000.0 NaN NaN NaN 70721.83906 36525.124027 188.0 30598.0 70596.0 102905.0 119278.0
bud 100000.0 NaN NaN NaN 2.06367 9.600903 0.0 0.0 0.0 1.0 380.0
flower 100000.0 NaN NaN NaN 4.96669 16.856964 0.0 0.0 2.0 4.0 567.0
fruit 100000.0 NaN NaN NaN 6.18891 15.202529 0.0 0.0 1.0 6.0 177.0
username 100000 763 A2BCNRHZU9V7C4 4431 NaN NaN NaN NaN NaN NaN NaN
hit.id 99958 156 A 13797 NaN NaN NaN NaN NaN NaN NaN
is.duplicated 99748 2 False 89395 NaN NaN NaN NaN NaN NaN NaN
link1 100000 1403 http://deliver.odai.yale.edu/content/repositor... 210 NaN NaN NaN NaN NaN NaN NaN
link2 100000 1403 http://deliver.odai.yale.edu/content/repositor... 210 NaN NaN NaN NaN NaN NaN NaN
binomial_species 100000 108 Arisaema_triphyllum 14343 NaN NaN NaN NaN NaN NaN NaN
workerID 100000 763 ca730ba0a75e4b492da32217f717318d 4431 NaN NaN NaN NaN NaN NaN NaN
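
The preview above shows the same coreid (188) on several rows with different usernames, which suggests that each specimen image was scored by more than one crowd worker. One simple way to collapse those scores into a single value per specimen is the median. The sketch below assumes that reading of the data; consensus_counts is a hypothetical helper, and the median is an illustrative choice, not necessarily the method used by the dataset authors.

```python
import pandas as pd

def consensus_counts(df, id_col='coreid', count_cols=('bud', 'flower', 'fruit')):
    """Median count per specimen, plus the number of scores behind it.

    Assumes each row is one worker's score of the specimen in `id_col`.
    """
    grouped = df.groupby(id_col)
    out = grouped[list(count_cols)].median()
    out['n_scores'] = grouped.size()  # rows contributing to each median
    return out

# In the notebook: consensus_counts(df).head()
```

The n_scores column makes it easy to filter out specimens with too few scores for the median to be meaningful.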

A first chart

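Histogram of the first numeric column. Here that is coreid, which appears to be an identifier rather than a measurement, so swap in the variable you care about (for example bud, flower or fruit).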

In [4]:
import matplotlib.pyplot as plt

num = df.select_dtypes('number')
if num.shape[1]:
    col = num.columns[0]
    num[col].plot.hist(bins=50, figsize=(8, 4), title=col)
    plt.tight_layout()
else:
    print('No direct numeric columns: explore df on your own.')
[Figure: histogram of coreid, 50 bins]
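
The count columns are strongly right-skewed (in the 100,000-row sample, flower has a median of 2 and a maximum of 567), so a plain histogram puts almost everything in the first bin. A log(1 + x) transform is one common way to make such a distribution readable. The following is a minimal sketch; log_hist_counts and plot_log_hist are hypothetical helper names, and the bin count is arbitrary.

```python
import numpy as np
import pandas as pd

def log_hist_counts(series, bins=30):
    """Histogram (counts, bin edges) of log(1 + x) for a count column."""
    values = np.log1p(series.dropna().to_numpy(dtype=float))
    return np.histogram(values, bins=bins)

def plot_log_hist(series, bins=30):
    """Draw the log-transformed histogram as a bar chart."""
    import matplotlib.pyplot as plt  # imported here so the helper above works without it
    counts, edges = log_hist_counts(series, bins)
    fig, ax = plt.subplots(figsize=(8, 4))
    ax.bar(edges[:-1], counts, width=np.diff(edges), align='edge')
    ax.set_xlabel(f'log(1 + {series.name})')
    ax.set_ylabel('rows')
    fig.tight_layout()
    return ax

# In the notebook: plot_log_hist(df['flower'])
```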

Working with data larger than memory

Your session has a 4 GB RAM limit, but you can analyse files of 10 GB or more without loading them whole:

  • Read only the columns you need: pd.read_csv(f, usecols=[...]) / pd.read_parquet(f, columns=[...]).
  • Process in chunks and keep only the result:
    total = 0
    for chunk in pd.read_csv(file, chunksize=1_000_000):
        total += len(chunk)
    
  • Query with SQL without loading anything — DuckDB (already installed) reads CSV and Parquet straight from disk and only brings the result into memory:
    import duckdb
    duckdb.sql("SELECT column, count(*) FROM '/srv/data/.../file.parquet' GROUP BY column").df()
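
As a concrete instance of the column-selection and chunking patterns above, the sketch below counts rows per species in the large crowdsourced file while holding only one chunk of one column in memory at a time. It assumes the file is comma-separated (the loader earlier in this notebook sniffs the separator instead); rows_per_species is a hypothetical helper name, and the chunk size is an arbitrary choice.

```python
from collections import Counter

import pandas as pd

def rows_per_species(path, chunksize=200_000, species_col='binomial_species'):
    """Count rows per species without loading the whole CSV.

    Assumes a comma-separated file; pass a different reader setup if not.
    """
    totals = Counter()
    # usecols keeps a single column; chunksize bounds the rows held at once
    for chunk in pd.read_csv(path, usecols=[species_col], chunksize=chunksize):
        totals.update(chunk[species_col].value_counts().to_dict())
    return pd.Series(dict(totals), name='rows').sort_values(ascending=False)

# In the notebook (path taken from the file listing at the top):
# rows_per_species('/srv/data/urban-plant-phenology/crowdsourcedData_upload.csv').head(10)
```

Only the running totals survive between chunks, so peak memory depends on the chunk size rather than on the file size.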
    

Your turn

This is just the starting point. Some ideas:

  • Check the dataset challenge on its CSDH data sheet.
  • Work on a copy: right-click the file → Duplicate (or Save Notebook As…). Your changes only live in your Hub space — they're never pushed to GitHub.
  • Edited this notebook and want the original back? Use the Restore cell below (or the restore.ipynb notebook).
  • Post questions and results on the platform forum.

Attribution: data from Urban Plant Phenology and Frost Risk (Digitized Herbarium Images), license CC-BY-4.0. Notebook from the Citizen Science Data Hub (CSDH) — Fundación Ibercivis.

In [5]:
# ⚠️ RESTORE: this DISCARDS YOUR CHANGES to this notebook and resets it to the original.
# 1. Uncomment the line below (remove the #)   2. Run this cell
# 3. Then: menu File → Reload Notebook from Disk

# !git -C ~/citizen-science-data fetch -q origin && git -C ~/citizen-science-data checkout origin/main -- urban-plant-phenology.ipynb && echo "Restored. Now: File → Reload Notebook from Disk"