PhenoVision: Machine Annotations of Phenology for iNaturalist Plant Photos¶

Category: Phenology · Size: 12.9 GB · Format: CSV · License: CC-BY-4.0 · Zenodo record · Data sheet on the CSDH

Global flower and fruit presence data produced by automated labelling of iNaturalist images up to March 2024, with per-observation detection metrics.

The data is mounted read-only at /srv/data/phenovision/. Save anything you produce in your personal folder (~/).

⚠️ Large dataset (12.9 GB). Your Hub session has 4 GB RAM — do not load the whole file into memory or the kernel will crash. Work like the pros: read only the columns you need, process the file in chunks, or query it straight from disk with DuckDB (no full load). Copy-paste patterns are in "Working with data larger than memory" near the end of this notebook.

What's in the dataset¶

In [1]:

from pathlib import Path

DATA = Path('/srv/data/phenovision')

for f in sorted(DATA.rglob('*')):
    if f.is_file():
        print(f"{f.relative_to(DATA)}  ({f.stat().st_size/1e6:,.1f} MB)")

annotations_all_w_headers_9cf8ad8.csv  (12,870.8 MB)
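Before loading anything, it can help to peek at the first few kilobytes of the file to see which delimiter it uses and what the columns are called. The following is a minimal sketch using only the standard library; `peek_csv` is a hypothetical helper name and the 64 KB sample size is an arbitrary assumption.

```python
import csv

def peek_csv(path, sample_bytes=65_536):
    """Guess the delimiter and return the header names from the start of a CSV file.

    Only `sample_bytes` are read, so this stays cheap on multi-gigabyte files.
    """
    with open(path, 'rb') as fh:
        raw = fh.read(sample_bytes)
    # latin-1 maps every byte to a character, so decoding cannot fail;
    # accented characters may look wrong if the file is really UTF-8.
    text = raw.decode('latin-1')
    # Restrict the guess to common separators (an assumption about the file).
    dialect = csv.Sniffer().sniff(text, delimiters=',;\t|')
    header = next(csv.reader(text.splitlines(), dialect))
    return dialect.delimiter, header

# Hypothetical usage with the file listed above:
# sep, columns = peek_csv(DATA / 'annotations_all_w_headers_9cf8ad8.csv')
# print(repr(sep), len(columns), columns[:5])
```

`csv.Sniffer` is a heuristic and can raise `csv.Error` on unusual files, so treat its answer as a hint rather than a guarantee.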

Load the data¶

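The dataset comes as CSV. Real-world CSV files aren't always uniform (the separator may be , or ;, the encoding UTF-8 or Latin-1), so we use a loader that detects both. We limit the read to 100,000 rows to explore quickly. Do not simply drop nrows to load everything: the full 12.9 GB file will not fit in the session's 4 GB of RAM. For the whole dataset, use the patterns in "Working with data larger than memory".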

In [2]:

import pandas as pd

# Plain CSV files first, then gzip-compressed ones
csvs = sorted(DATA.rglob('*.csv')) + sorted(DATA.rglob('*.csv.gz'))
print('Using:', csvs[0].name)

def load_csv(path, **kw):
    """Robust reader: detects the separator and tries utf-8 then latin-1."""
    for enc in ('utf-8', 'latin-1'):
        try:
            return pd.read_csv(path, sep=None, engine='python', encoding=enc, **kw)
        except UnicodeDecodeError:
            continue

df = load_csv(csvs[0], nrows=100_000)
df.head()

Using: annotations_all_w_headers_9cf8ad8.csv

Out[2]:

	machine_learning_annotation_id	datasource	verbatim_date	day_of_year	year	latitude	longitude	coordinate_uncertainty_meters	family	count_family	...	observed_image_guid	observed_image_url	observed_metadata_url	certainty	model_uri	prediction_probability	prediction_class	proportion_low_certainty_family	accuracy_excluding_low_certainty_family	accuracy_family
0	0cf717af-4366-48df-8909-ea5fe454e1a0	iNaturalist	1806-12-31	365	1806	-34.007000	18.450000	110.0	Proteaceae	4339.0	...	https://inaturalist-open-data.s3.amazonaws.com...	https://www.inaturalist.org/photos/15571385	https://www.inaturalist.org/observations/c3400...	High	10.57967/hf/2763	0.965820	Detected	0.080292	0.966742	0.932221
1	d49811d4-e860-4bb3-a2af-370d3928a2dd	iNaturalist	1893-06-30	181	1893	-12.292183	49.209821	31092.0	Fabaceae	54776.0	...	https://inaturalist-open-data.s3.amazonaws.com...	https://www.inaturalist.org/photos/248874915	https://www.inaturalist.org/observations/af674...	High	10.57967/hf/2763	0.895508	Detected	0.047553	0.982105	0.959193
2	3a9bfeee-6841-4e73-9147-919462b94144	iNaturalist	1893-06-30	181	1893	-17.626906	49.570417	30716.0	Fabaceae	54776.0	...	https://inaturalist-open-data.s3.amazonaws.com...	https://www.inaturalist.org/photos/248875771	https://www.inaturalist.org/observations/6e83e...	High	10.57967/hf/2763	0.949219	Detected	0.060385	0.962783	0.943855
3	fa1be567-d680-4fce-aa23-fbca0521092a	iNaturalist	1893-07-18	199	1893	-17.213798	44.611626	30748.0	Fabaceae	54776.0	...	https://inaturalist-open-data.s3.amazonaws.com...	https://www.inaturalist.org/photos/248874282	https://www.inaturalist.org/observations/f5295...	High	10.57967/hf/2763	0.943359	Detected	0.060385	0.962783	0.943855
4	f0cd84a5-8c0f-4fb0-81a2-976f0eec6fe4	iNaturalist	1893-07-18	199	1893	-17.213798	44.611626	30748.0	Fabaceae	54776.0	...	https://inaturalist-open-data.s3.amazonaws.com...	https://www.inaturalist.org/photos/248874304	https://www.inaturalist.org/observations/f5295...	High	10.57967/hf/2763	0.805176	Detected	0.060385	0.962783	0.943855

5 rows × 25 columns
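The sample above keeps all 25 columns, including three long URL strings per row. A lighter way to load more rows is to read only the columns an analysis needs and to store repeated text as categories. The sketch below assumes the column names shown above and a comma-separated file; `load_slim` and the chosen column set are illustrative, not part of the dataset's tooling.

```python
import pandas as pd

# Columns picked for a seasonal analysis (an assumption; adjust to your question).
SLIM_COLUMNS = ['day_of_year', 'year', 'latitude', 'longitude',
                'family', 'trait', 'prediction_probability']

# Repeated text values are stored once when read as 'category';
# float32 halves the memory used by the probability column.
SLIM_DTYPES = {'family': 'category', 'trait': 'category',
               'prediction_probability': 'float32'}

def load_slim(path, nrows=None, sep=','):
    """Read only SLIM_COLUMNS, with compact dtypes, from a delimited text file."""
    return pd.read_csv(path, sep=sep, usecols=SLIM_COLUMNS,
                       dtype=SLIM_DTYPES, nrows=nrows)

# Hypothetical usage on the file found above:
# slim = load_slim(csvs[0], nrows=1_000_000)
# slim.info(memory_usage='deep')
```

Calling `info(memory_usage='deep')` on both frames is one way to check how much the slimmer load actually saves.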

First look¶

Shape, types and basic statistics.

In [3]:

df.info()
df.describe(include='all').T.head(20)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100000 entries, 0 to 99999
Data columns (total 25 columns):
 #   Column                                   Non-Null Count   Dtype  
---  ------                                   --------------   -----  
 0   machine_learning_annotation_id           100000 non-null  object 
 1   datasource                               100000 non-null  object 
 2   verbatim_date                            100000 non-null  object 
 3   day_of_year                              100000 non-null  int64  
 4   year                                     100000 non-null  int64  
 5   latitude                                 100000 non-null  float64
 6   longitude                                100000 non-null  float64
 7   coordinate_uncertainty_meters            83534 non-null   float64
 8   family                                   100000 non-null  object 
 9   count_family                             99984 non-null   float64
 10  genus                                    100000 non-null  object 
 11  scientific_name                          100000 non-null  object 
 12  taxon_rank                               100000 non-null  object 
 13  basis_of_record                          100000 non-null  object 
 14  trait                                    100000 non-null  object 
 15  observed_image_guid                      100000 non-null  object 
 16  observed_image_url                       100000 non-null  object 
 17  observed_metadata_url                    100000 non-null  object 
 18  certainty                                100000 non-null  object 
 19  model_uri                                100000 non-null  object 
 20  prediction_probability                   100000 non-null  float64
 21  prediction_class                         100000 non-null  object 
 22  proportion_low_certainty_family          99972 non-null   float64
 23  accuracy_excluding_low_certainty_family  99951 non-null   float64
 24  accuracy_family                          99972 non-null   float64
dtypes: float64(8), int64(2), object(15)
memory usage: 19.1+ MB

Out[3]:

	count	unique	top	freq	mean	std	min	25%	50%	75%	max
machine_learning_annotation_id	100000	100000	0cf717af-4366-48df-8909-ea5fe454e1a0	1	NaN	NaN	NaN	NaN	NaN	NaN	NaN
datasource	100000	1	iNaturalist	100000	NaN	NaN	NaN	NaN	NaN	NaN	NaN
verbatim_date	100000	6751	2012-09-16	1005	NaN	NaN	NaN	NaN	NaN	NaN	NaN
day_of_year	100000.0	NaN	NaN	NaN	214.98648	78.877365	1.0	165.0	231.0	271.0	366.0
year	100000.0	NaN	NaN	NaN	2001.10994	11.45197	1806.0	1993.0	2003.0	2012.0	2012.0
latitude	100000.0	NaN	NaN	NaN	13.238691	36.716359	-54.961287	-32.239031	32.307999	44.08858	79.91178
longitude	100000.0	NaN	NaN	NaN	-2.697333	82.555869	-176.62087	-81.416518	7.769936	29.954922	178.320079
coordinate_uncertainty_meters	83534.0	NaN	NaN	NaN	5121.909893	22043.2927	0.0	20.0	200.0	1912.0	1785820.0
family	100000	309	Asteraceae	10995	NaN	NaN	NaN	NaN	NaN	NaN	NaN
count_family	99984.0	NaN	NaN	NaN	26029.161376	31257.86504	1.0	3692.0	12091.0	28821.0	104342.0
genus	100000	4358	Carex	2686	NaN	NaN	NaN	NaN	NaN	NaN	NaN
scientific_name	100000	20203	Chamaenerion angustifolium	255	NaN	NaN	NaN	NaN	NaN	NaN	NaN
taxon_rank	100000	3	species	93180	NaN	NaN	NaN	NaN	NaN	NaN	NaN
basis_of_record	100000	1	MachineObservation	100000	NaN	NaN	NaN	NaN	NaN	NaN	NaN
trait	100000	2	flower	82111	NaN	NaN	NaN	NaN	NaN	NaN	NaN
observed_image_guid	100000	99845	https://inaturalist-open-data.s3.amazonaws.com...	2	NaN	NaN	NaN	NaN	NaN	NaN	NaN
observed_image_url	100000	99845	https://www.inaturalist.org/photos/205047997	2	NaN	NaN	NaN	NaN	NaN	NaN	NaN
observed_metadata_url	100000	62113	https://www.inaturalist.org/observations/c2d1b...	21	NaN	NaN	NaN	NaN	NaN	NaN	NaN
certainty	100000	1	High	100000	NaN	NaN	NaN	NaN	NaN	NaN	NaN
model_uri	100000	1	10.57967/hf/2763	100000	NaN	NaN	NaN	NaN	NaN	NaN	NaN
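The Non-Null Count column of `df.info()` shows that a few columns have gaps; coordinate_uncertainty_meters, for example, has 83,534 values out of 100,000 in this sample. A small sketch to rank columns by their share of missing values follows; `missing_share` is a hypothetical helper name.

```python
import pandas as pd

def missing_share(frame):
    """Return the fraction of missing values per column, largest first.

    Columns without any missing value are left out.
    """
    share = frame.isna().mean()
    return share[share > 0].sort_values(ascending=False)

# Hypothetical usage on the sample loaded above:
# print(missing_share(df))
```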

A first chart¶

Histogram of the first numeric column — swap it for the variable you care about.

In [4]:

import matplotlib.pyplot as plt

num = df.select_dtypes('number')
if num.shape[1]:
    col = num.columns[0]
    num[col].plot.hist(bins=50, figsize=(8, 4), title=col)
    plt.tight_layout()
else:
    print('No direct numeric columns: explore df on your own.')

[Figure: histogram of day_of_year (50 bins) for the 100,000-row sample]
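In this sample the first numeric column is day_of_year, so the chart shows when in the year the annotated photos were taken. To compare flowers with fruits, one option is to count annotations per week and trait. This is a minimal sketch assuming the `day_of_year`, `trait` and `latitude` columns listed earlier; `weekly_counts` is a hypothetical helper, and restricting to one hemisphere is an assumption made here to avoid mixing opposite seasons.

```python
import pandas as pd

def weekly_counts(frame, northern=True):
    """Count annotations per week of the year and trait for one hemisphere."""
    if northern:
        part = frame[frame['latitude'] >= 0]
    else:
        part = frame[frame['latitude'] < 0]
    # Days 1-7 fall in week 0, days 8-14 in week 1, and so on.
    week = ((part['day_of_year'] - 1) // 7).rename('week')
    return pd.crosstab(week, part['trait'])

# Hypothetical usage on the sample loaded above:
# weekly_counts(df).plot(figsize=(8, 4), title='Annotations per week, northern hemisphere')
```

The result has one row per week and one column per trait, so `.plot()` draws one line per trait.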

Working with data larger than memory¶

Your session has a 4 GB RAM limit, but you can analyse files of 10 GB or more without loading them whole:

Read only the columns you need: pd.read_csv(f, usecols=[...]) / pd.read_parquet(f, columns=[...]).

Process in chunks and keep only the result:

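total = 0
# csvs[0] is the CSV path found in "Load the data"; only one chunk is in memory at a time
for chunk in pd.read_csv(csvs[0], chunksize=1_000_000):
    total += len(chunk)  # keep the running result, discard the chunk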

Query with SQL without loading anything — DuckDB (already installed) reads CSV and Parquet straight from disk and only brings the result into memory:
```
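import duckdb
# Generic pattern: duckdb.sql("SELECT col, count(*) FROM '/srv/data/.../file.parquet' GROUP BY col").df()
# For this dataset (CSV), annotations per family:
duckdb.sql("SELECT family, count(*) AS n FROM '/srv/data/phenovision/annotations_all_w_headers_9cf8ad8.csv' GROUP BY family ORDER BY n DESC").df()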
```
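The first two patterns combine well: the sketch below streams a CSV in chunks, reads only the columns it groups by, and adds up per-chunk counts so that only the small result table stays in memory. It assumes the `trait` and `prediction_class` columns shown earlier; `count_by_chunks` is a hypothetical helper and the chunk size is a guess to tune for your session.

```python
import pandas as pd

def count_by_chunks(path, columns, chunksize=1_000_000):
    """Count rows per combination of `columns` without loading the whole file."""
    total = None
    for chunk in pd.read_csv(path, usecols=columns, chunksize=chunksize):
        part = chunk.groupby(columns).size()
        # Series.add aligns on the index; fill_value=0 covers groups
        # that appear in only one of the two operands.
        total = part if total is None else total.add(part, fill_value=0)
    if total is None:  # no chunk was produced
        return pd.Series(dtype='int64')
    return total.astype('int64').sort_values(ascending=False)

# Hypothetical usage on the file found in "Load the data":
# counts = count_by_chunks(csvs[0], ['trait', 'prediction_class'])
# print(counts)
```

Rows with a missing value in any of the grouping columns are skipped, because `groupby` drops missing keys by default.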

Your turn¶

This is just the starting point. Some ideas:

Check the dataset challenge on its CSDH data sheet.
Work on a copy: right-click the file → Duplicate (or Save Notebook As…). Your changes only live in your Hub space — they're never pushed to GitHub.
Edited this notebook and want the original back? Use the Restore cell below (or the restore.ipynb notebook).
Questions and results: on the platform forum.

Attribution: data from PhenoVision: Machine Annotations of Phenology for iNaturalist Plant Photos, license CC-BY-4.0. Notebook from the Citizen Science Data Hub (CSDH) — Fundación Ibercivis.

In [5]:

# ⚠️ RESTORE: this DISCARDS YOUR CHANGES to this notebook and resets it to the original.
# 1. Uncomment the line below (remove the #)   2. Run this cell
# 3. Then: menu File → Reload Notebook from Disk

# !git -C ~/citizen-science-data fetch -q origin && git -C ~/citizen-science-data checkout origin/main -- phenovision.ipynb && echo "Restored. Now: File → Reload Notebook from Disk"