Insect Identification in the Wild: The AMI Dataset

Category: Entomology · Size: 14.4 GB · Format: ZIP · License: MIT · Zenodo record · Data sheet on the CSDH

~2.5 million insect images from citizen science platforms (AMI-GBIF) and 2,893 annotated images from global automated camera traps (AMI-Traps) for automated insect monitoring.

The data is mounted read-only at /srv/data/ami-insects/. Save anything you produce in your personal folder (~/).

⚠️ Large dataset (14.4 GB). Your session has 4 GB RAM and your home folder is shared — don't extract the whole archive. Read the entries you need straight from inside the ZIP (see below); if you must extract, take only specific files, not everything.
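
If extraction is unavoidable, a single entry can be copied out on its own instead of unpacking everything. A minimal sketch, assuming a hypothetical helper named `extract_one` and a destination folder of your choosing inside your home directory:

```python
import zipfile
from pathlib import Path


def extract_one(zip_path, member, dest_dir):
    """Extract a single archive member into dest_dir and return its path.

    `extract_one` is an illustrative helper, not part of the dataset tooling.
    """
    dest_dir = Path(dest_dir).expanduser()
    dest_dir.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(zip_path) as zf:
        # ZipFile.extract writes only the named member, keeping its
        # internal folder structure underneath dest_dir.
        return Path(zf.extract(member, path=dest_dir))
```

For example, `extract_one('/srv/data/ami-insects/ami_dataset.zip', 'ami_gbif/readme.txt', '~/ami')` would copy only the readme into a folder `~/ami` (a made-up name) and leave the rest of the archive untouched.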

What's in the dataset

In [1]:
from pathlib import Path

DATA = Path('/srv/data/ami-insects')

for f in sorted(DATA.rglob('*')):
    if f.is_file():
        print(f"{f.relative_to(DATA)}  ({f.stat().st_size/1e6:,.1f} MB)")
ami_dataset.zip  (14,356.4 MB)

Explore the ZIP

The dataset comes compressed. We list its contents without extracting; if it contains CSVs, pandas can read them straight from inside the ZIP. Remember: /srv/data is read-only — if you need to extract, do it into your folder (~/).

In [2]:
import zipfile
import pandas as pd

zips = sorted(DATA.rglob('*.zip'))
z = zipfile.ZipFile(zips[0])
print('Using:', zips[0].name)
names = z.namelist()
print(f'{len(names)} files inside; first 20:')
for n in names[:20]:
    print('  ', n)

csv_inside = [n for n in names if n.lower().endswith('.csv')]
if csv_inside:
    df = pd.read_csv(z.open(csv_inside[0]), nrows=100_000, low_memory=False)
    display(df.head())
Using: ami_dataset.zip
119918 files inside; first 20:
   ami_gbif/
   __MACOSX/._ami_gbif
   ami_gbif/binary_classification/
   __MACOSX/ami_gbif/._binary_classification
   ami_gbif/fine-grained_classification/
   __MACOSX/ami_gbif/._fine-grained_classification
   ami_gbif/readme.txt
   __MACOSX/ami_gbif/._readme.txt
   ami_gbif/binary_classification/0122862-240321170329656_nonmoths.zip
   __MACOSX/ami_gbif/binary_classification/._0122862-240321170329656_nonmoths.zip
   ami_gbif/binary_classification/metadata/
   __MACOSX/ami_gbif/binary_classification/._metadata
   ami_gbif/fine-grained_classification/0019051-231002084531237.zip
   __MACOSX/ami_gbif/fine-grained_classification/._0019051-231002084531237.zip
   ami_gbif/fine-grained_classification/metadata/
   __MACOSX/ami_gbif/fine-grained_classification/._metadata
   ami_gbif/binary_classification/metadata/ami-gbif_binary_category_map.json
   __MACOSX/ami_gbif/binary_classification/metadata/._ami-gbif_binary_category_map.json
   ami_gbif/binary_classification/metadata/ami-gbif_binary_test.csv
   __MACOSX/ami_gbif/binary_classification/metadata/._ami-gbif_binary_test.csv

|   | image_path | width | height | fetch_date | coreid | identifier | id | datasetKey | speciesKey | acceptedTaxonKey | lifeStage | decimalLatitude | decimalLongitude | eventDate | life_stage_prediction | binary |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 50c9509d-22c7-4a22-a47d-8c48425ef4a7/281389972... | 623 | 565 | 2023-10-19 23:32:21 | 2813899724 | https://inaturalist-open-data.s3.amazonaws.com... | 2813899724 | 50c9509d-22c7-4a22-a47d-8c48425ef4a7 | 1806737.0 | 1806737.0 | Adult | 38.413667 | -81.585664 | 2020-06-20T00:13:00 | NaN | moth |
| 1 | 8a863029-f435-446a-821e-275f4f641165/371742203... | 450 | 600 | 2024-04-08 23:57:50 | 3717422034 | https://observation.org/photos/28040148.jpg | 3717422034 | 8a863029-f435-446a-821e-275f4f641165 | NaN | 1708152.0 | Imago | 52.150000 | 5.400000 | 2020-06-21 | NaN | nonmoth |
| 2 | 8a863029-f435-446a-821e-275f4f641165/382322684... | 600 | 450 | 2024-04-08 23:54:48 | 3823226846 | https://observation.org/photos/49126261.jpg | 3823226846 | 8a863029-f435-446a-821e-275f4f641165 | 4480502.0 | 4480502.0 | Imago | 51.250000 | 5.400000 | 2022-05-07 | NaN | nonmoth |
| 3 | b124e1e0-4755-430f-9eab-894f25a9b59c/240669345... | 505 | 640 | 2023-10-20 15:24:18 | 2406693453 | https://www.artsobservasjoner.no/MediaLibrary/... | 2406693453 | b124e1e0-4755-430f-9eab-894f25a9b59c | 1798971.0 | 1798971.0 | NaN | 58.076286 | 7.975989 | 2011-04-10T00:00:00 | Adult | moth |
| 4 | 50c9509d-22c7-4a22-a47d-8c48425ef4a7/332779761... | 2048 | 1598 | 2023-10-19 23:57:55 | 3327797610 | https://inaturalist-open-data.s3.amazonaws.com... | 3327797610 | 50c9509d-22c7-4a22-a47d-8c48425ef4a7 | 1861887.0 | 1861887.0 | Adult | 44.340081 | -73.076875 | 2021-07-15T00:57:00 | NaN | moth |
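
The listing above interleaves real files with `__MACOSX/` entries, which are macOS metadata stubs rather than data, and `nrows=100_000` loads only the start of a table. The following is a minimal sketch, using only the standard library, that skips those stubs and streams a CSV row by row so memory use stays small. `data_members` and `count_column` are hypothetical helper names, UTF-8 is an assumed encoding, and the `binary` column is taken from the preview above.

```python
import csv
import io
import zipfile
from collections import Counter


def data_members(names, suffix=".csv"):
    """Return archive entries ending in `suffix`, skipping macOS stubs."""
    return [
        n for n in names
        if n.lower().endswith(suffix) and not n.startswith("__MACOSX/")
    ]


def count_column(zip_path, member, column):
    """Count the values of one CSV column without loading the whole table."""
    counts = Counter()
    with zipfile.ZipFile(zip_path) as zf, zf.open(member) as raw:
        # Wrap the binary stream so csv can decode it line by line.
        # UTF-8 is an assumption about the file encoding.
        text = io.TextIOWrapper(raw, encoding="utf-8", newline="")
        for row in csv.DictReader(text):
            counts[row[column]] += 1
    return counts
```

With the names from the cell above, `count_column(zips[0], data_members(names)[0], 'binary')` would tally the `moth` and `nonmoth` labels across the whole file rather than only its first 100,000 rows.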

Your turn

This is just the starting point. Some ideas:

  • Check the dataset challenge on its CSDH data sheet.
  • Work on a copy: right-click the file → Duplicate (or Save Notebook As…). Your changes only live in your Hub space — they're never pushed to GitHub.
  • Edited this notebook and want the original back? Use the Restore cell below (or the restore.ipynb notebook).
  • Questions and results: on the platform forum.

Attribution: data from Insect Identification in the Wild: The AMI Dataset, license MIT. Notebook from the Citizen Science Data Hub (CSDH) — Fundación Ibercivis.

In [3]:
# ⚠️ RESTORE: this DISCARDS YOUR CHANGES to this notebook and resets it to the original.
# 1. Uncomment the line below (remove the #)   2. Run this cell
# 3. Then: menu File → Reload Notebook from Disk

# !git -C ~/citizen-science-data fetch -q origin && git -C ~/citizen-science-data checkout origin/main -- ami-insects.ipynb && echo "Restored. Now: File → Reload Notebook from Disk"