PhenoVision: Machine Annotations of Phenology for iNaturalist Plant Photos
Category: Phenology · Size: 12.9 GB · Format: CSV · License: CC-BY-4.0 · Zenodo record · Data sheet on the CSDH
Global flower and fruit presence data produced by automated labelling of iNaturalist images up to March 2024, with per-observation detection metrics.
The data is mounted read-only at /srv/data/phenovision/.
Save anything you produce in your personal folder (~/).
⚠️ Large dataset (12.9 GB). Your Hub session has 4 GB RAM — do not load the whole file into memory or the kernel will crash. Work like the pros: read only the columns you need, process the file in chunks, or query it straight from disk with DuckDB (no full load). Copy-paste patterns are in "Working with data larger than memory" near the end of this notebook.
What's in the dataset
from pathlib import Path
DATA = Path('/srv/data/phenovision')
# List every file under the data folder with its size in MB.
for f in sorted(DATA.rglob('*')):
    if f.is_file():
        print(f"{f.relative_to(DATA)} ({f.stat().st_size/1e6:,.1f} MB)")
annotations_all_w_headers_9cf8ad8.csv (12,870.8 MB)
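Load the data
The dataset comes as CSV. Real-world CSV files aren't always uniform (separator , or ;, UTF-8 or Latin-1 encoding), so we use a loader that detects the separator and tries both encodings. We read only the first 100,000 rows to explore quickly; these are the start of the file, not a random sample. Do not simply drop nrows to work with everything: the full 12.9 GB file will not fit in the 4 GB session. Use the patterns in "Working with data larger than memory" instead.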
import pandas as pd
csvs = sorted(DATA.rglob('*.csv')) + sorted(DATA.rglob('*.csv.gz')) + sorted(DATA.rglob('*.gz'))
print('Using:', csvs[0].name)
def load_csv(path, **kw):
    """Robust reader: detects the separator and tries utf-8 then latin-1."""
    for enc in ('utf-8', 'latin-1'):
        try:
            # sep=None with the python engine lets pandas sniff the separator.
            return pd.read_csv(path, sep=None, engine='python', encoding=enc, **kw)
        except UnicodeDecodeError:
            continue
df = load_csv(csvs[0], nrows=100_000)
df.head()
Using: annotations_all_w_headers_9cf8ad8.csv
| | machine_learning_annotation_id | datasource | verbatim_date | day_of_year | year | latitude | longitude | coordinate_uncertainty_meters | family | count_family | ... | observed_image_guid | observed_image_url | observed_metadata_url | certainty | model_uri | prediction_probability | prediction_class | proportion_low_certainty_family | accuracy_excluding_low_certainty_family | accuracy_family |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0cf717af-4366-48df-8909-ea5fe454e1a0 | iNaturalist | 1806-12-31 | 365 | 1806 | -34.007000 | 18.450000 | 110.0 | Proteaceae | 4339.0 | ... | https://inaturalist-open-data.s3.amazonaws.com... | https://www.inaturalist.org/photos/15571385 | https://www.inaturalist.org/observations/c3400... | High | 10.57967/hf/2763 | 0.965820 | Detected | 0.080292 | 0.966742 | 0.932221 |
| 1 | d49811d4-e860-4bb3-a2af-370d3928a2dd | iNaturalist | 1893-06-30 | 181 | 1893 | -12.292183 | 49.209821 | 31092.0 | Fabaceae | 54776.0 | ... | https://inaturalist-open-data.s3.amazonaws.com... | https://www.inaturalist.org/photos/248874915 | https://www.inaturalist.org/observations/af674... | High | 10.57967/hf/2763 | 0.895508 | Detected | 0.047553 | 0.982105 | 0.959193 |
| 2 | 3a9bfeee-6841-4e73-9147-919462b94144 | iNaturalist | 1893-06-30 | 181 | 1893 | -17.626906 | 49.570417 | 30716.0 | Fabaceae | 54776.0 | ... | https://inaturalist-open-data.s3.amazonaws.com... | https://www.inaturalist.org/photos/248875771 | https://www.inaturalist.org/observations/6e83e... | High | 10.57967/hf/2763 | 0.949219 | Detected | 0.060385 | 0.962783 | 0.943855 |
| 3 | fa1be567-d680-4fce-aa23-fbca0521092a | iNaturalist | 1893-07-18 | 199 | 1893 | -17.213798 | 44.611626 | 30748.0 | Fabaceae | 54776.0 | ... | https://inaturalist-open-data.s3.amazonaws.com... | https://www.inaturalist.org/photos/248874282 | https://www.inaturalist.org/observations/f5295... | High | 10.57967/hf/2763 | 0.943359 | Detected | 0.060385 | 0.962783 | 0.943855 |
| 4 | f0cd84a5-8c0f-4fb0-81a2-976f0eec6fe4 | iNaturalist | 1893-07-18 | 199 | 1893 | -17.213798 | 44.611626 | 30748.0 | Fabaceae | 54776.0 | ... | https://inaturalist-open-data.s3.amazonaws.com... | https://www.inaturalist.org/photos/248874304 | https://www.inaturalist.org/observations/f5295... | High | 10.57967/hf/2763 | 0.805176 | Detected | 0.060385 | 0.962783 | 0.943855 |
5 rows × 25 columns
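Most of the 25 columns are long strings (UUIDs and URLs) that many analyses never touch. The sketch below shows one way to read only a handful of columns with compact dtypes. The helper name `load_slim` and the column selection in `SLIM_DTYPES` are illustrative choices, not part of the dataset; the column names are the ones printed by this notebook. It assumes a comma-separated UTF-8 file, so it skips the separator sniffing done by `load_csv`.

```python
import pandas as pd

# Illustrative selection: column names come from this notebook's output,
# the dtypes are assumptions chosen to save memory.
SLIM_DTYPES = {
    "year": "int16",
    "day_of_year": "int16",
    "latitude": "float64",
    "longitude": "float64",
    "family": "category",            # few distinct values, many rows
    "trait": "category",
    "prediction_class": "category",
    "prediction_probability": "float32",
}

def load_slim(path, nrows=None):
    """Read only the columns in SLIM_DTYPES, using compact dtypes.

    Assumes a comma-separated, UTF-8 encoded CSV with a header row.
    """
    return pd.read_csv(
        path,
        usecols=list(SLIM_DTYPES),   # every other column is skipped while parsing
        dtype=SLIM_DTYPES,
        nrows=nrows,
    )
```

For example, `load_slim(csvs[0], nrows=100_000)` should return the same rows as the sample above with far fewer columns. Whether the full file fits in memory this way depends on its row count, so keep `nrows` or read in chunks until you have checked.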
First look
Shape, types and basic statistics.
df.info()
df.describe(include='all').T.head(20)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100000 entries, 0 to 99999
Data columns (total 25 columns):
 #   Column                                   Non-Null Count   Dtype
---  ------                                   --------------   -----
 0   machine_learning_annotation_id           100000 non-null  object
 1   datasource                               100000 non-null  object
 2   verbatim_date                            100000 non-null  object
 3   day_of_year                              100000 non-null  int64
 4   year                                     100000 non-null  int64
 5   latitude                                 100000 non-null  float64
 6   longitude                                100000 non-null  float64
 7   coordinate_uncertainty_meters            83534 non-null   float64
 8   family                                   100000 non-null  object
 9   count_family                             99984 non-null   float64
 10  genus                                    100000 non-null  object
 11  scientific_name                          100000 non-null  object
 12  taxon_rank                               100000 non-null  object
 13  basis_of_record                          100000 non-null  object
 14  trait                                    100000 non-null  object
 15  observed_image_guid                      100000 non-null  object
 16  observed_image_url                       100000 non-null  object
 17  observed_metadata_url                    100000 non-null  object
 18  certainty                                100000 non-null  object
 19  model_uri                                100000 non-null  object
 20  prediction_probability                   100000 non-null  float64
 21  prediction_class                         100000 non-null  object
 22  proportion_low_certainty_family          99972 non-null   float64
 23  accuracy_excluding_low_certainty_family  99951 non-null   float64
 24  accuracy_family                          99972 non-null   float64
dtypes: float64(8), int64(2), object(15)
memory usage: 19.1+ MB
| | count | unique | top | freq | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|---|---|---|
| machine_learning_annotation_id | 100000 | 100000 | 0cf717af-4366-48df-8909-ea5fe454e1a0 | 1 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| datasource | 100000 | 1 | iNaturalist | 100000 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| verbatim_date | 100000 | 6751 | 2012-09-16 | 1005 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| day_of_year | 100000.0 | NaN | NaN | NaN | 214.98648 | 78.877365 | 1.0 | 165.0 | 231.0 | 271.0 | 366.0 |
| year | 100000.0 | NaN | NaN | NaN | 2001.10994 | 11.45197 | 1806.0 | 1993.0 | 2003.0 | 2012.0 | 2012.0 |
| latitude | 100000.0 | NaN | NaN | NaN | 13.238691 | 36.716359 | -54.961287 | -32.239031 | 32.307999 | 44.08858 | 79.91178 |
| longitude | 100000.0 | NaN | NaN | NaN | -2.697333 | 82.555869 | -176.62087 | -81.416518 | 7.769936 | 29.954922 | 178.320079 |
| coordinate_uncertainty_meters | 83534.0 | NaN | NaN | NaN | 5121.909893 | 22043.2927 | 0.0 | 20.0 | 200.0 | 1912.0 | 1785820.0 |
| family | 100000 | 309 | Asteraceae | 10995 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| count_family | 99984.0 | NaN | NaN | NaN | 26029.161376 | 31257.86504 | 1.0 | 3692.0 | 12091.0 | 28821.0 | 104342.0 |
| genus | 100000 | 4358 | Carex | 2686 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| scientific_name | 100000 | 20203 | Chamaenerion angustifolium | 255 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| taxon_rank | 100000 | 3 | species | 93180 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| basis_of_record | 100000 | 1 | MachineObservation | 100000 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| trait | 100000 | 2 | flower | 82111 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| observed_image_guid | 100000 | 99845 | https://inaturalist-open-data.s3.amazonaws.com... | 2 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| observed_image_url | 100000 | 99845 | https://www.inaturalist.org/photos/205047997 | 2 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| observed_metadata_url | 100000 | 62113 | https://www.inaturalist.org/observations/c2d1b... | 21 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| certainty | 100000 | 1 | High | 100000 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| model_uri | 100000 | 1 | 10.57967/hf/2763 | 100000 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
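Because `describe(include='all')` mixes text and numeric columns, most cells in the table above are NaN. A more compact first look could report the share of missing values per column and the value counts of the low-cardinality columns. The helper below is a minimal sketch of that idea; `quick_profile` and its `max_unique` threshold are hypothetical names introduced here for illustration.

```python
import pandas as pd

def quick_profile(df, max_unique=10):
    """Summarise missing values and low-cardinality columns.

    Returns a tuple (missing, low_card):
      missing  - Series with the fraction of missing values per column,
                 sorted from most to least missing
      low_card - dict mapping each column with at most `max_unique`
                 distinct values (NaN counted as a value) to its value counts
    """
    missing = df.isna().mean().sort_values(ascending=False)
    low_card = {
        col: df[col].value_counts(dropna=False)
        for col in df.columns
        if df[col].nunique(dropna=False) <= max_unique
    }
    return missing, low_card
```

On the 100,000-row sample, `missing` should put `coordinate_uncertainty_meters` first (83,534 of 100,000 values are non-null, so about 16% are missing), and `low_card` should include columns such as `trait`, `taxon_rank` and `certainty`.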
A first chart
Histogram of the first numeric column (day_of_year in this dataset); swap it for the variable you care about.
import matplotlib.pyplot as plt
num = df.select_dtypes('number')
if num.shape[1]:
    # Plot the first numeric column found in the sample.
    col = num.columns[0]
    num[col].plot.hist(bins=50, figsize=(8, 4), title=col)
    plt.tight_layout()
else:
    print('No direct numeric columns: explore df on your own.')
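As one possible next step, the histogram can be split by trait to compare when flower and fruit annotations occur during the year. The sketch below bins `day_of_year` into 7-day weeks (plain bins of seven days, not ISO weeks) and counts annotations per week and trait. The function names `counts_by_week` and `plot_counts_by_week` are hypothetical; only the column names come from the dataset.

```python
import pandas as pd

def counts_by_week(df):
    """Count annotations per 7-day bin of the year (1 to 53) and trait."""
    # Days 1-7 fall in week 1, days 8-14 in week 2, ..., days 365-366 in week 53.
    week = ((df["day_of_year"] - 1) // 7 + 1).rename("week")
    return df.groupby([week, "trait"]).size().unstack(fill_value=0)

def plot_counts_by_week(df):
    """Line chart of the weekly counts, one line per trait."""
    import matplotlib.pyplot as plt  # imported here so counting works without it
    ax = counts_by_week(df).plot(figsize=(8, 4), title="Annotations per week of year")
    ax.set_xlabel("week of year")
    ax.set_ylabel("annotations")
    plt.tight_layout()
    return ax
```

Calling `plot_counts_by_week(df)` on the sample draws the chart. Keep in mind that observations from both hemispheres are pooled, so seasonal peaks may blur unless you first filter by latitude.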
Working with data larger than memory
Your session has a 4 GB RAM limit, but you can analyse files of 10 GB or more without loading them whole:
- Read only the columns you need: pd.read_csv(f, usecols=[...]) / pd.read_parquet(f, columns=[...]).
- Process in chunks and keep only the result:
total = 0
for chunk in pd.read_csv(file, chunksize=1_000_000):
    total += len(chunk)
- Query with SQL without loading anything. DuckDB (already installed) reads CSV and Parquet straight from disk and only brings the result into memory:
import duckdb
duckdb.sql("SELECT column, count(*) FROM '/srv/data/.../file.parquet' GROUP BY column").df()
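The first two patterns combine naturally: read a single column, chunk by chunk, and keep only a running tally. The sketch below counts the values of one column over a whole CSV this way. The helper name `count_by_chunks` is hypothetical, and it assumes a comma-separated UTF-8 file with a header row.

```python
import pandas as pd

def count_by_chunks(path, column, chunksize=1_000_000):
    """Count the values of one column across a CSV, one chunk at a time.

    Only `column` is parsed and only `chunksize` rows are held in memory
    at once. Missing values are not counted.
    """
    total = pd.Series(dtype="int64")
    for chunk in pd.read_csv(path, usecols=[column], chunksize=chunksize):
        # Add this chunk's counts to the running tally; values seen for the
        # first time start from 0.
        total = total.add(chunk[column].value_counts(), fill_value=0)
    return total.astype("int64").sort_values(ascending=False)
```

For instance, `count_by_chunks(DATA / 'annotations_all_w_headers_9cf8ad8.csv', 'family')` would give the number of annotations per plant family for the full file. Expect it to take a while, since all 12.9 GB still have to be read from disk once. A DuckDB `GROUP BY` query expresses the same count in a single statement; the chunked version only needs pandas.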
Your turn
This is just the starting point. Some ideas:
- Check the dataset challenge on its CSDH data sheet.
- Work on a copy: right-click the file → Duplicate (or Save Notebook As…). Your changes only live in your Hub space — they're never pushed to GitHub.
- Edited this notebook and want the original back? Use the Restore cell below (or the restore.ipynb notebook).
- Questions and results: on the platform forum.
Attribution: data from PhenoVision: Machine Annotations of Phenology for iNaturalist Plant Photos, license CC-BY-4.0. Notebook from the Citizen Science Data Hub (CSDH) — Fundación Ibercivis.
# ⚠️ RESTORE: this DISCARDS YOUR CHANGES to this notebook and resets it to the original.
# 1. Uncomment the line below (remove the #) 2. Run this cell
# 3. Then: menu File → Reload Notebook from Disk
# !git -C ~/citizen-science-data fetch -q origin && git -C ~/citizen-science-data checkout origin/main -- phenovision.ipynb && echo "Restored. Now: File → Reload Notebook from Disk"