PhenoVision: Machine Annotations of Phenology for iNaturalist Plant Photos

Category: Phenology · Size: 12.9 GB · Format: CSV · License: CC-BY-4.0 · Zenodo record · Data sheet on the CSDH

Global flower and fruit presence data produced by automated labelling of iNaturalist images up to March 2024, with per-observation detection metrics.

The data is mounted read-only at /srv/data/phenovision/. Save anything you produce in your personal folder (~/).

⚠️ Large dataset (12.9 GB). Your Hub session has 4 GB RAM — do not load the whole file into memory or the kernel will crash. Work like the pros: read only the columns you need, process the file in chunks, or query it straight from disk with DuckDB (no full load). Copy-paste patterns are in "Working with data larger than memory" near the end of this notebook.

What's in the dataset

In [1]:
from pathlib import Path

DATA = Path('/srv/data/phenovision')

for f in sorted(DATA.rglob('*')):
    if f.is_file():
        print(f"{f.relative_to(DATA)}  ({f.stat().st_size/1e6:,.1f} MB)")
annotations_all_w_headers_9cf8ad8.csv  (12,870.8 MB)

Load the data

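The dataset comes as CSV. Real-world CSV files are not always uniform (the separator may be , or ;, the encoding UTF-8 or Latin-1), so we use a loader that detects the separator and falls back from UTF-8 to Latin-1. We limit the read to 100,000 rows to explore quickly. Do not simply drop nrows to work with everything: the full file is 12.9 GB and will not fit in the 4 GB session, so use the patterns in "Working with data larger than memory" instead.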

In [2]:
import pandas as pd

csvs = sorted(DATA.rglob('*.csv')) + sorted(DATA.rglob('*.gz'))  # '*.gz' already covers '*.csv.gz'
print('Using:', csvs[0].name)

def load_csv(path, **kw):
    """Robust reader: detects the separator and tries utf-8 then latin-1."""
    for enc in ('utf-8', 'latin-1'):
        try:
            return pd.read_csv(path, sep=None, engine='python', encoding=enc, **kw)
        except UnicodeDecodeError:
            continue

df = load_csv(csvs[0], nrows=100_000)
df.head()
Using: annotations_all_w_headers_9cf8ad8.csv
Out[2]:
machine_learning_annotation_id datasource verbatim_date day_of_year year latitude longitude coordinate_uncertainty_meters family count_family ... observed_image_guid observed_image_url observed_metadata_url certainty model_uri prediction_probability prediction_class proportion_low_certainty_family accuracy_excluding_low_certainty_family accuracy_family
0 0cf717af-4366-48df-8909-ea5fe454e1a0 iNaturalist 1806-12-31 365 1806 -34.007000 18.450000 110.0 Proteaceae 4339.0 ... https://inaturalist-open-data.s3.amazonaws.com... https://www.inaturalist.org/photos/15571385 https://www.inaturalist.org/observations/c3400... High 10.57967/hf/2763 0.965820 Detected 0.080292 0.966742 0.932221
1 d49811d4-e860-4bb3-a2af-370d3928a2dd iNaturalist 1893-06-30 181 1893 -12.292183 49.209821 31092.0 Fabaceae 54776.0 ... https://inaturalist-open-data.s3.amazonaws.com... https://www.inaturalist.org/photos/248874915 https://www.inaturalist.org/observations/af674... High 10.57967/hf/2763 0.895508 Detected 0.047553 0.982105 0.959193
2 3a9bfeee-6841-4e73-9147-919462b94144 iNaturalist 1893-06-30 181 1893 -17.626906 49.570417 30716.0 Fabaceae 54776.0 ... https://inaturalist-open-data.s3.amazonaws.com... https://www.inaturalist.org/photos/248875771 https://www.inaturalist.org/observations/6e83e... High 10.57967/hf/2763 0.949219 Detected 0.060385 0.962783 0.943855
3 fa1be567-d680-4fce-aa23-fbca0521092a iNaturalist 1893-07-18 199 1893 -17.213798 44.611626 30748.0 Fabaceae 54776.0 ... https://inaturalist-open-data.s3.amazonaws.com... https://www.inaturalist.org/photos/248874282 https://www.inaturalist.org/observations/f5295... High 10.57967/hf/2763 0.943359 Detected 0.060385 0.962783 0.943855
4 f0cd84a5-8c0f-4fb0-81a2-976f0eec6fe4 iNaturalist 1893-07-18 199 1893 -17.213798 44.611626 30748.0 Fabaceae 54776.0 ... https://inaturalist-open-data.s3.amazonaws.com... https://www.inaturalist.org/photos/248874304 https://www.inaturalist.org/observations/f5295... High 10.57967/hf/2763 0.805176 Detected 0.060385 0.962783 0.943855

5 rows × 25 columns
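
The loader above relies on pandas' Python parsing engine (sep=None, engine='python'), which is flexible but slower than the default C engine on large files. One possible alternative, sketched below, is to inspect only the first few kilobytes with the standard library and then pass the detected values to pd.read_csv explicitly. The helper name sniff_csv, the 64 KB sample size and the list of candidate delimiters are illustrative assumptions, not part of the original notebook.

```python
import codecs
import csv


def sniff_csv(path, sample_bytes=64 * 1024, delimiters=",;\t|"):
    """Guess (separator, encoding) from the first bytes of a CSV file.

    Hypothetical helper: only a small sample is read, so the cost does not
    depend on the size of the file.
    """
    with open(path, "rb") as fh:
        raw = fh.read(sample_bytes)

    # An incremental decoder tolerates a multi-byte character that is cut off
    # at the end of the sample; a genuinely invalid byte still raises.
    try:
        text = codecs.getincrementaldecoder("utf-8")().decode(raw, final=False)
        encoding = "utf-8"
    except UnicodeDecodeError:
        text = raw.decode("latin-1")  # latin-1 maps every byte, so this cannot fail
        encoding = "latin-1"

    # If the sample filled the buffer, the last line is probably truncated.
    lines = text.splitlines()
    if len(raw) == sample_bytes and len(lines) > 1:
        lines = lines[:-1]

    dialect = csv.Sniffer().sniff("\n".join(lines), delimiters=delimiters)
    return dialect.delimiter, encoding
```

The result could then be used as sep, enc = sniff_csv(csvs[0]) followed by pd.read_csv(csvs[0], sep=sep, encoding=enc, nrows=100_000), which lets pandas use its default parser. The guess is based on the sample only, so a non-UTF-8 byte further down the file would still raise an error at read time.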

First look

Shape, types and basic statistics.

In [3]:
df.info()
df.describe(include='all').T.head(20)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100000 entries, 0 to 99999
Data columns (total 25 columns):
 #   Column                                   Non-Null Count   Dtype  
---  ------                                   --------------   -----  
 0   machine_learning_annotation_id           100000 non-null  object 
 1   datasource                               100000 non-null  object 
 2   verbatim_date                            100000 non-null  object 
 3   day_of_year                              100000 non-null  int64  
 4   year                                     100000 non-null  int64  
 5   latitude                                 100000 non-null  float64
 6   longitude                                100000 non-null  float64
 7   coordinate_uncertainty_meters            83534 non-null   float64
 8   family                                   100000 non-null  object 
 9   count_family                             99984 non-null   float64
 10  genus                                    100000 non-null  object 
 11  scientific_name                          100000 non-null  object 
 12  taxon_rank                               100000 non-null  object 
 13  basis_of_record                          100000 non-null  object 
 14  trait                                    100000 non-null  object 
 15  observed_image_guid                      100000 non-null  object 
 16  observed_image_url                       100000 non-null  object 
 17  observed_metadata_url                    100000 non-null  object 
 18  certainty                                100000 non-null  object 
 19  model_uri                                100000 non-null  object 
 20  prediction_probability                   100000 non-null  float64
 21  prediction_class                         100000 non-null  object 
 22  proportion_low_certainty_family          99972 non-null   float64
 23  accuracy_excluding_low_certainty_family  99951 non-null   float64
 24  accuracy_family                          99972 non-null   float64
dtypes: float64(8), int64(2), object(15)
memory usage: 19.1+ MB
Out[3]:
                                count     unique  top                                                freq    mean          std          min         25%         50%        75%        max
machine_learning_annotation_id  100000    100000  0cf717af-4366-48df-8909-ea5fe454e1a0               1       NaN           NaN          NaN         NaN         NaN        NaN        NaN
datasource                      100000    1       iNaturalist                                        100000  NaN           NaN          NaN         NaN         NaN        NaN        NaN
verbatim_date                   100000    6751    2012-09-16                                         1005    NaN           NaN          NaN         NaN         NaN        NaN        NaN
day_of_year                     100000.0  NaN     NaN                                                NaN     214.98648     78.877365    1.0         165.0       231.0      271.0      366.0
year                            100000.0  NaN     NaN                                                NaN     2001.10994    11.45197     1806.0      1993.0      2003.0     2012.0     2012.0
latitude                        100000.0  NaN     NaN                                                NaN     13.238691     36.716359    -54.961287  -32.239031  32.307999  44.08858   79.91178
longitude                       100000.0  NaN     NaN                                                NaN     -2.697333     82.555869    -176.62087  -81.416518  7.769936   29.954922  178.320079
coordinate_uncertainty_meters   83534.0   NaN     NaN                                                NaN     5121.909893   22043.2927   0.0         20.0        200.0      1912.0     1785820.0
family                          100000    309     Asteraceae                                         10995   NaN           NaN          NaN         NaN         NaN        NaN        NaN
count_family                    99984.0   NaN     NaN                                                NaN     26029.161376  31257.86504  1.0         3692.0      12091.0    28821.0    104342.0
genus                           100000    4358    Carex                                              2686    NaN           NaN          NaN         NaN         NaN        NaN        NaN
scientific_name                 100000    20203   Chamaenerion angustifolium                         255     NaN           NaN          NaN         NaN         NaN        NaN        NaN
taxon_rank                      100000    3       species                                            93180   NaN           NaN          NaN         NaN         NaN        NaN        NaN
basis_of_record                 100000    1       MachineObservation                                 100000  NaN           NaN          NaN         NaN         NaN        NaN        NaN
trait                           100000    2       flower                                             82111   NaN           NaN          NaN         NaN         NaN        NaN        NaN
observed_image_guid             100000    99845   https://inaturalist-open-data.s3.amazonaws.com...  2       NaN           NaN          NaN         NaN         NaN        NaN        NaN
observed_image_url              100000    99845   https://www.inaturalist.org/photos/205047997       2       NaN           NaN          NaN         NaN         NaN        NaN        NaN
observed_metadata_url           100000    62113   https://www.inaturalist.org/observations/c2d1b...  21      NaN           NaN          NaN         NaN         NaN        NaN        NaN
certainty                       100000    1       High                                               100000  NaN           NaN          NaN         NaN         NaN        NaN        NaN
model_uri                       100000    1       10.57967/hf/2763                                   100000  NaN           NaN          NaN         NaN         NaN        NaN        NaN
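
The summary above shows several text columns with very few distinct values in this 100,000-row sample: datasource, basis_of_record, certainty and model_uri each hold a single value, and trait holds two. A possible way to cut memory use, sketched below, is to store such columns as pandas categoricals and to downcast the integer columns. The helper name shrink and the 5% cardinality threshold are illustrative assumptions, and the small frame is synthetic stand-in data, not rows from the dataset.

```python
import pandas as pd


def shrink(df, max_unique_ratio=0.05):
    """Return a copy of df with a smaller memory footprint (hypothetical helper)."""
    out = df.copy()
    for col in out.columns:
        s = out[col]
        if pd.api.types.is_integer_dtype(s):
            # Pick the smallest signed integer type that holds every value.
            out[col] = pd.to_numeric(s, downcast="integer")
        elif pd.api.types.is_float_dtype(s):
            # Left untouched: float32 would reduce the precision of coordinates.
            continue
        elif pd.api.types.is_string_dtype(s) and s.nunique() <= max_unique_ratio * len(s):
            # Few distinct strings: store each one once and keep integer codes.
            out[col] = s.astype("category")
    return out


# Synthetic stand-in data with the same column names as the dataset.
n = 1_000
toy = pd.DataFrame({
    "trait": ["flower", "fruit"] * (n // 2),
    "year": [2012] * n,
    "day_of_year": [(i % 365) + 1 for i in range(n)],
    "latitude": [10.5] * n,
})

small = shrink(toy)
before = toy.memory_usage(deep=True).sum()
after = small.memory_usage(deep=True).sum()
print(f"{before:,} bytes -> {after:,} bytes")
print(small.dtypes)
```

A column that looks constant in the first 100,000 rows may take more values further down the file, so it is worth re-checking cardinalities on a larger sample. The same types can also be requested at load time through the dtype argument of pd.read_csv (for example dtype={'trait': 'category'}), which avoids creating the wider columns in the first place.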

A first chart

Histogram of the first numeric column (day_of_year in this dataset). Swap it for the variable you care about.

In [4]:
import matplotlib.pyplot as plt

num = df.select_dtypes('number')
if num.shape[1]:
    col = num.columns[0]
    num[col].plot.hist(bins=50, figsize=(8, 4), title=col)
    plt.tight_layout()
else:
    print('No direct numeric columns: explore df on your own.')
[Figure: histogram of day_of_year, 50 bins]

Working with data larger than memory

Your session has a 4 GB RAM limit, but you can analyse files of 10 GB or more without loading them whole:

  • Read only the columns you need: pd.read_csv(f, usecols=[...]) / pd.read_parquet(f, columns=[...]).
  • Process in chunks and keep only the result:
    total = 0
    for chunk in pd.read_csv(file, chunksize=1_000_000):
        total += len(chunk)
    
  • Query with SQL without loading anything — DuckDB (already installed) reads CSV and Parquet straight from disk and only brings the result into memory:
    import duckdb
    duckdb.sql("SELECT some_column, count(*) FROM '/srv/data/.../file.parquet' GROUP BY some_column").df()
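
The first two patterns combine naturally: read only a few columns, chunk by chunk, and fold each chunk into a running aggregate. The sketch below counts rows per trait and year. The function name count_by, the choice of columns and the chunk size are illustrative assumptions, and the demo runs on a tiny synthetic CSV rather than on the real file.

```python
import io

import pandas as pd


def count_by(source, keys, chunksize=1_000_000, **read_kw):
    """Count rows per combination of `keys` without loading the whole file.

    `source` is anything pd.read_csv accepts (a path or a file-like object).
    """
    total = None
    for chunk in pd.read_csv(source, usecols=keys, chunksize=chunksize, **read_kw):
        part = chunk.groupby(keys).size()
        # Align on the group labels; a group missing from one side counts as 0.
        total = part if total is None else total.add(part, fill_value=0)
    if total is None:  # no chunk was produced
        return pd.Series(dtype="int64", name="n")
    return total.astype("int64").rename("n").sort_index()


# Synthetic stand-in data (not real PhenoVision rows), read in chunks of 2 rows.
toy = io.StringIO(
    "trait,year,prediction_class\n"
    "flower,2011,Detected\n"
    "flower,2012,Detected\n"
    "fruit,2012,Detected\n"
    "flower,2012,Detected\n"
    "fruit,2011,Detected\n"
)
counts = count_by(toy, ["trait", "year"], chunksize=2)
print(counts)
```

On the real file this could be called as count_by(csvs[0], ['trait', 'year']), passing sep= or encoding= through if needed. Peak memory then depends on the chunk size and on the number of distinct groups, not on the size of the file.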
    

Your turn

This is just the starting point. Some ideas:

  • Check the dataset challenge on its CSDH data sheet.
  • Work on a copy: right-click the file → Duplicate (or Save Notebook As…). Your changes only live in your Hub space — they're never pushed to GitHub.
  • Edited this notebook and want the original back? Use the Restore cell below (or the restore.ipynb notebook).
  • Questions and results: on the platform forum.

Attribution: data from PhenoVision: Machine Annotations of Phenology for iNaturalist Plant Photos, license CC-BY-4.0. Notebook from the Citizen Science Data Hub (CSDH) — Fundación Ibercivis.

In [5]:
# ⚠️ RESTORE: this DISCARDS YOUR CHANGES to this notebook and resets it to the original.
# 1. Uncomment the line below (remove the #)   2. Run this cell
# 3. Then: menu File → Reload Notebook from Disk

# !git -C ~/citizen-science-data fetch -q origin && git -C ~/citizen-science-data checkout origin/main -- phenovision.ipynb && echo "Restored. Now: File → Reload Notebook from Disk"