Insect Identification in the Wild: The AMI Dataset
Category: Entomology · Size: 14.4 GB · Format: ZIP · License: MIT · Zenodo record · Data sheet on the CSDH
~2.5 million insect images from citizen science platforms (AMI-GBIF) and 2,893 annotated images from global automated camera traps (AMI-Traps) for automated insect monitoring.
The data is mounted read-only at /srv/data/ami-insects/.
Save anything you produce in your personal folder (~/).
⚠️ Large dataset (14.4 GB). Your session has 4 GB RAM and your home folder is shared — don't extract the whole archive. Read the entries you need straight from inside the ZIP (see below); if you must extract, take only specific files, not everything.
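If a file really has to live on disk, one way to take a single entry is sketched below. It streams the entry out of the archive in small chunks, so memory use stays low and nothing else is unpacked. The helper name `extract_member` and the destination folder `~/ami-work` are hypothetical choices for illustration; the entry `ami_gbif/readme.txt` is one of the names the archive listing in this notebook shows.

```python
import shutil
import zipfile
from pathlib import Path, PurePosixPath


def extract_member(archive, member, dest_dir):
    """Copy one archive entry into dest_dir and return the path of the copy.

    Hypothetical helper, not part of the dataset tooling.
    """
    dest_dir = Path(dest_dir).expanduser()
    dest_dir.mkdir(parents=True, exist_ok=True)
    # ZIP entry names use forward slashes; keep only the file name (flat copy).
    target = dest_dir / PurePosixPath(member).name
    with zipfile.ZipFile(archive) as z, z.open(member) as src, open(target, "wb") as dst:
        shutil.copyfileobj(src, dst)  # copies in chunks, never the whole file at once
    return target


ARCHIVE = Path("/srv/data/ami-insects/ami_dataset.zip")
if ARCHIVE.exists():  # only runs where the dataset is mounted
    # "~/ami-work" is an assumed folder name inside the personal folder.
    out = extract_member(ARCHIVE, "ami_gbif/readme.txt", "~/ami-work")
    print(out, out.stat().st_size, "bytes")
```

Flattening to the bare file name keeps the home folder tidy, but two entries with the same name in different directories would overwrite each other; keep the relative path if that matters.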
What's in the dataset
from pathlib import Path

DATA = Path('/srv/data/ami-insects')
# List every file under the dataset mount with its size in MB
for f in sorted(DATA.rglob('*')):
    if f.is_file():
        print(f"{f.relative_to(DATA)} ({f.stat().st_size/1e6:,.1f} MB)")
ami_dataset.zip (14,356.4 MB)
Explore the ZIP
The dataset comes compressed. We list its contents without extracting; if it contains CSVs, pandas can read them straight from inside the ZIP. Remember: /srv/data is read-only — if you need to extract, do it into your folder (~/).
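import zipfile
import pandas as pd

zips = sorted(DATA.rglob('*.zip'))
z = zipfile.ZipFile(zips[0])
print('Using:', zips[0].name)

# List the archive members without extracting anything
names = z.namelist()
print(f'{len(names)} files inside; first 20:')
for n in names[:20]:
    print(' ', n)

# Read the first CSV straight from the archive, capped at 100,000 rows to limit memory.
# Entries under __MACOSX/ are macOS metadata stubs, not real CSVs, so they are skipped.
csv_inside = [n for n in names
              if n.lower().endswith('.csv') and not n.startswith('__MACOSX/')]
if csv_inside:
    df = pd.read_csv(z.open(csv_inside[0]), nrows=100_000, low_memory=False)
    display(df.head())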
Using: ami_dataset.zip
119918 files inside; first 20:
  ami_gbif/
  __MACOSX/._ami_gbif
  ami_gbif/binary_classification/
  __MACOSX/ami_gbif/._binary_classification
  ami_gbif/fine-grained_classification/
  __MACOSX/ami_gbif/._fine-grained_classification
  ami_gbif/readme.txt
  __MACOSX/ami_gbif/._readme.txt
  ami_gbif/binary_classification/0122862-240321170329656_nonmoths.zip
  __MACOSX/ami_gbif/binary_classification/._0122862-240321170329656_nonmoths.zip
  ami_gbif/binary_classification/metadata/
  __MACOSX/ami_gbif/binary_classification/._metadata
  ami_gbif/fine-grained_classification/0019051-231002084531237.zip
  __MACOSX/ami_gbif/fine-grained_classification/._0019051-231002084531237.zip
  ami_gbif/fine-grained_classification/metadata/
  __MACOSX/ami_gbif/fine-grained_classification/._metadata
  ami_gbif/binary_classification/metadata/ami-gbif_binary_category_map.json
  __MACOSX/ami_gbif/binary_classification/metadata/._ami-gbif_binary_category_map.json
  ami_gbif/binary_classification/metadata/ami-gbif_binary_test.csv
  __MACOSX/ami_gbif/binary_classification/metadata/._ami-gbif_binary_test.csv
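Half of the entries listed above sit under `__MACOSX/`, which is macOS metadata rather than data. A rough way to see what the archive really holds is to tally the remaining entries by file extension, using only the archive's index (nothing is decompressed). This is a minimal sketch; the helper name `summarize` is made up for this example.

```python
import zipfile
from collections import Counter
from pathlib import Path, PurePosixPath


def summarize(z):
    """Return (file count, uncompressed bytes) per extension, skipping macOS residue.

    Hypothetical helper; z is an open zipfile.ZipFile.
    """
    counts, sizes = Counter(), Counter()
    for info in z.infolist():
        name = info.filename
        if info.is_dir() or name.startswith("__MACOSX/"):
            continue  # directories and macOS metadata are not data files
        ext = PurePosixPath(name).suffix.lower() or "(none)"
        counts[ext] += 1
        sizes[ext] += info.file_size  # uncompressed size recorded in the index
    return counts, sizes


ARCHIVE = Path("/srv/data/ami-insects/ami_dataset.zip")
if ARCHIVE.exists():  # only runs where the dataset is mounted
    with zipfile.ZipFile(ARCHIVE) as z:
        counts, sizes = summarize(z)
    for ext, n in counts.most_common():
        print(f"{ext:>8} {n:>8,d} files {sizes[ext] / 1e6:>12,.1f} MB")
```

Note that the `.zip` entries in the listing are archives nested inside the main one; this summary counts each of them as a single file and does not look inside.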
| | image_path | width | height | fetch_date | coreid | identifier | id | datasetKey | speciesKey | acceptedTaxonKey | lifeStage | decimalLatitude | decimalLongitude | eventDate | life_stage_prediction | binary |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 50c9509d-22c7-4a22-a47d-8c48425ef4a7/281389972... | 623 | 565 | 2023-10-19 23:32:21 | 2813899724 | https://inaturalist-open-data.s3.amazonaws.com... | 2813899724 | 50c9509d-22c7-4a22-a47d-8c48425ef4a7 | 1806737.0 | 1806737.0 | Adult | 38.413667 | -81.585664 | 2020-06-20T00:13:00 | NaN | moth |
| 1 | 8a863029-f435-446a-821e-275f4f641165/371742203... | 450 | 600 | 2024-04-08 23:57:50 | 3717422034 | https://observation.org/photos/28040148.jpg | 3717422034 | 8a863029-f435-446a-821e-275f4f641165 | NaN | 1708152.0 | Imago | 52.150000 | 5.400000 | 2020-06-21 | NaN | nonmoth |
| 2 | 8a863029-f435-446a-821e-275f4f641165/382322684... | 600 | 450 | 2024-04-08 23:54:48 | 3823226846 | https://observation.org/photos/49126261.jpg | 3823226846 | 8a863029-f435-446a-821e-275f4f641165 | 4480502.0 | 4480502.0 | Imago | 51.250000 | 5.400000 | 2022-05-07 | NaN | nonmoth |
| 3 | b124e1e0-4755-430f-9eab-894f25a9b59c/240669345... | 505 | 640 | 2023-10-20 15:24:18 | 2406693453 | https://www.artsobservasjoner.no/MediaLibrary/... | 2406693453 | b124e1e0-4755-430f-9eab-894f25a9b59c | 1798971.0 | 1798971.0 | NaN | 58.076286 | 7.975989 | 2011-04-10T00:00:00 | Adult | moth |
| 4 | 50c9509d-22c7-4a22-a47d-8c48425ef4a7/332779761... | 2048 | 1598 | 2023-10-19 23:57:55 | 3327797610 | https://inaturalist-open-data.s3.amazonaws.com... | 3327797610 | 50c9509d-22c7-4a22-a47d-8c48425ef4a7 | 1861887.0 | 1861887.0 | Adult | 44.340081 | -73.076875 | 2021-07-15T00:57:00 | NaN | moth |
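The preview above loads at most 100,000 rows. To compute a statistic over a whole metadata CSV within the 4 GB session, one option is to stream it in chunks and load only the column of interest. The sketch below counts the values of the `binary` column shown in the table; the helper name `count_values` and the chunk size are illustrative assumptions, and rows with a missing value in that column are not counted.

```python
import zipfile
from collections import Counter
from pathlib import Path

import pandas as pd


def count_values(archive, member, column, chunksize=100_000):
    """Count the values of one CSV column inside a ZIP, reading chunk by chunk.

    Hypothetical helper; missing values are ignored by value_counts().
    """
    counts = Counter()
    with zipfile.ZipFile(archive) as z, z.open(member) as fh:
        # usecols keeps a single column in memory; chunksize bounds the rows per step
        for chunk in pd.read_csv(fh, usecols=[column], chunksize=chunksize):
            counts.update(chunk[column].value_counts().to_dict())
    return pd.Series(counts, dtype="int64").sort_values(ascending=False)


ARCHIVE = Path("/srv/data/ami-insects/ami_dataset.zip")
MEMBER = "ami_gbif/binary_classification/metadata/ami-gbif_binary_test.csv"
if ARCHIVE.exists():  # only runs where the dataset is mounted
    print(count_values(ARCHIVE, MEMBER, "binary"))
```

The same pattern works for any per-chunk aggregate that can be merged afterwards, such as sums, minima and maxima.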
Your turn
This is just the starting point. Some ideas:
- Check the dataset challenge on its CSDH data sheet.
- Work on a copy: right-click the file → Duplicate (or Save Notebook As…). Your changes only live in your Hub space — they're never pushed to GitHub.
- Edited this notebook and want the original back? Use the Restore cell below (or the restore.ipynb notebook).
- Questions and results: on the platform forum.
Attribution: data from Insect Identification in the Wild: The AMI Dataset, license MIT. Notebook from the Citizen Science Data Hub (CSDH) — Fundación Ibercivis.
# ⚠️ RESTORE: this DISCARDS YOUR CHANGES to this notebook and resets it to the original.
# 1. Uncomment the line below (remove the #) 2. Run this cell
# 3. Then: menu File → Reload Notebook from Disk
# !git -C ~/citizen-science-data fetch -q origin && git -C ~/citizen-science-data checkout origin/main -- ami-insects.ipynb && echo "Restored. Now: File → Reload Notebook from Disk"