Fruit-Bearing Plant Species Observations in Brazilian Cities (iNaturalist)

Category: Urban Botany · Size: 6.0 MB · Format: CSV · License: CC0-1.0 · Zenodo record · Data sheet on the CSDH

iNaturalist observations of fruit-bearing plants across all 27 Brazilian state capitals, with 47 columns of taxonomic, temporal and spatial metadata.

The data is mounted read-only at /srv/data/fruit-plants-brazil/. Save anything you produce in your personal folder (~/).

What's in the dataset

In [1]:
from pathlib import Path

DATA = Path('/srv/data/fruit-plants-brazil')

for f in sorted(DATA.rglob('*')):
    if f.is_file():
        print(f"{f.relative_to(DATA)}  ({f.stat().st_size/1e6:,.1f} MB)")
observations_22-08-2023.csv  (6.0 MB)

Load the data

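The dataset ships as a single CSV file. Real-world CSV files are not always uniform (the separator may be , or ; and the encoding UTF-8 or Latin-1), so we use a loader that infers the separator and falls back from UTF-8 to Latin-1. We cap the read at 100,000 rows to explore quickly (this file has only 10,942 rows, so nothing is cut off); drop nrows when you want to be sure you are working with everything.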

In [2]:
import pandas as pd

# '*.gz' already matches '*.csv.gz', so listing both would add compressed files twice.
csvs = sorted(DATA.rglob('*.csv')) + sorted(DATA.rglob('*.gz'))
print('Using:', csvs[0].name)

def load_csv(path, **kw):
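    """Robust reader: detects the separator and tries utf-8 then latin-1."""
    # sep=None with the python engine lets pandas sniff the delimiter.
    for enc in ('utf-8', 'latin-1'):
        try:
            return pd.read_csv(path, sep=None, engine='python', encoding=enc, **kw)
        except UnicodeDecodeError:
            # latin-1 maps every byte to a character, so the second pass cannot raise this.
            continue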

df = load_csv(csvs[0], nrows=100_000)
df.head()
Using: observations_22-08-2023.csv
Out[2]:
id observed_on_string observed_on time_observed_at time_zone user_id user_login user_name created_at updated_at ... taxon_id taxon_kingdom_name taxon_phylum_name taxon_class_name taxon_order_name taxon_family_name taxon_genus_name taxon_species_name taxon_subspecies_name taxon_variety_name
0 26052 2011-07-24 24/07/2011 NaN Brasilia 1370 designonze Gabriela Castro 2011-07-25 23:51:48 UTC 2020-10-24 03:33:16 UTC ... 85098 Plantae Tracheophyta Magnoliopsida Sapindales Anacardiaceae Spondias Spondias mombin NaN NaN
1 122013 2012-08-16 16/08/2012 NaN Brasilia 9421 mauricio_mercadante Mauricio Mercadante 2012-09-13 01:36:04 UTC 2022-08-31 20:50:10 UTC ... 84838 Plantae Tracheophyta Magnoliopsida Myrtales Myrtaceae Syzygium Syzygium jambos NaN NaN
2 122037 2012-06-26 26/06/2012 NaN Hawaii 9421 mauricio_mercadante Mauricio Mercadante 2012-09-13 03:35:07 UTC 2022-08-31 20:50:10 UTC ... 153017 Plantae Tracheophyta Magnoliopsida Fabales Fabaceae Dipteryx Dipteryx alata NaN NaN
3 123165 2012-04-22 22/04/2012 NaN Hawaii 9421 mauricio_mercadante Mauricio Mercadante 2012-09-15 22:50:10 UTC 2022-08-31 20:50:10 UTC ... 118964 Plantae Tracheophyta Magnoliopsida Malvales Malvaceae Sterculia Sterculia striata NaN NaN
4 220970 2013-03-22 22/03/2013 NaN Santiago 14169 netosevero Francisco Severo Neto 2013-03-22 14:37:46 UTC 2023-06-28 13:52:28 UTC ... 153017 Plantae Tracheophyta Magnoliopsida Fabales Fabaceae Dipteryx Dipteryx alata NaN NaN

5 rows × 47 columns
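
The preview shows observed_on stored as day/month/year text (24/07/2011), which pandas will not treat as a date on its own. A minimal sketch of converting it, assuming every value follows that format; the three-row sample frame is a stand-in built from the preview, and its last row (id 999999, "not a date") is made up to show what happens to a value that does not parse.

```python
import pandas as pd

# Stand-in for df: two rows from the preview plus one made-up bad value.
sample = pd.DataFrame({
    "id": [26052, 122013, 999999],
    "observed_on": ["24/07/2011", "16/08/2012", "not a date"],
})

# Assumption: observed_on is always day/month/year, as in the preview rows.
# errors="coerce" turns anything that does not match into NaT instead of raising.
sample["observed_date"] = pd.to_datetime(
    sample["observed_on"], format="%d/%m/%Y", errors="coerce"
)

print(sample[["observed_on", "observed_date"]])
print("Unparsed rows:", sample["observed_date"].isna().sum())
```

On the real data the same call would be applied to df["observed_on"]; checking the count of unparsed rows afterwards is a cheap way to test the format assumption.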

First look

Shape, types and basic statistics.

In [3]:
df.info()
df.describe(include='all').T.head(20)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10942 entries, 0 to 10941
Data columns (total 47 columns):
 #   Column                            Non-Null Count  Dtype  
---  ------                            --------------  -----  
 0   id                               10942 non-null  int64  
 1   observed_on_string                10839 non-null  object 
 2   observed_on                       10837 non-null  object 
 3   time_observed_at                  10150 non-null  object 
 4   time_zone                         10940 non-null  object 
 5   user_id                           10942 non-null  int64  
 6   user_login                        10942 non-null  object 
 7   user_name                         9037 non-null   object 
 8   created_at                        10942 non-null  object 
 9   updated_at                        10942 non-null  object 
 10  quality_grade                     10942 non-null  object 
 11  license                           6170 non-null   object 
 12  url                               10942 non-null  object 
 13  image_url                         10942 non-null  object 
 14  tag_list                          329 non-null    object 
 15  description                       1338 non-null   object 
 16  num_identification_agreements     10942 non-null  int64  
 17  num_identification_disagreements  10942 non-null  int64  
 18  captive_cultivated                10942 non-null  bool   
 19  oauth_application_id              8222 non-null   float64
 20  place_guess                       10941 non-null  object 
 21  latitude                          10942 non-null  object 
 22  longitude                         10942 non-null  object 
 23  positional_accuracy               8832 non-null   float64
 24  private_place_guess               0 non-null      float64
 25  private_latitude                  0 non-null      float64
 26  private_longitude                 0 non-null      float64
 27  public_positional_accuracy        8899 non-null   float64
 28  geoprivacy                        67 non-null     object 
 29  taxon_geoprivacy                  3613 non-null   object 
 30  coordinates_obscured              10942 non-null  bool   
 31  positioning_method                3445 non-null   object 
 32  positioning_device                3595 non-null   object 
 33  species_guess                     9681 non-null   object 
 34  scientific_name                   10942 non-null  object 
 35  common_name                       8616 non-null   object 
 36  iconic_taxon_name                 10942 non-null  object 
 37  taxon_id                          10942 non-null  int64  
 38  taxon_kingdom_name                10942 non-null  object 
 39  taxon_phylum_name                 10942 non-null  object 
 40  taxon_class_name                  10942 non-null  object 
 41  taxon_order_name                  10942 non-null  object 
 42  taxon_family_name                 10942 non-null  object 
 43  taxon_genus_name                  10936 non-null  object 
 44  taxon_species_name                9702 non-null   object 
 45  taxon_subspecies_name             28 non-null     object 
 46  taxon_variety_name                66 non-null     object 
dtypes: bool(2), float64(6), int64(5), object(34)
memory usage: 3.8+ MB
Out[3]:
count unique top freq mean std min 25% 50% 75% max
id 10942.0 NaN NaN NaN 90096010.014257 49771355.303694 26052.0 47738730.75 86225883.5 137934586.0 179628795.0
observed_on_string 10839 10517 2022-11-19 38 NaN NaN NaN NaN NaN NaN NaN
observed_on 10837 2271 02/05/2021 170 NaN NaN NaN NaN NaN NaN NaN
time_observed_at 10150 9970 2020-10-15 17:56:00 UTC 9 NaN NaN NaN NaN NaN NaN NaN
time_zone 10940 62 Brasilia 8107 NaN NaN NaN NaN NaN NaN NaN
user_id 10942.0 NaN NaN NaN 2932202.384939 1932926.099828 1370.0 1347391.0 2581998.0 4187166.0 7269362.0
user_login 10942 2765 ericfischerrempe 566 NaN NaN NaN NaN NaN NaN NaN
user_name 9037 2028 Eric Fischer Rempe 566 NaN NaN NaN NaN NaN NaN NaN
created_at 10942 10834 2020-09-04 13:37:39 UTC 3 NaN NaN NaN NaN NaN NaN NaN
updated_at 10942 10702 2023-07-27 18:27:45 UTC 57 NaN NaN NaN NaN NaN NaN NaN
quality_grade 10942 3 research 4572 NaN NaN NaN NaN NaN NaN NaN
license 6170 6 CC-BY-NC 5284 NaN NaN NaN NaN NaN NaN NaN
url 10942 10942 http://www.inaturalist.org/observations/26052 1 NaN NaN NaN NaN NaN NaN NaN
image_url 10942 10941 https://static.inaturalist.org/photos/34889661... 2 NaN NaN NaN NaN NaN NaN NaN
tag_list 329 228 UnB, Universidade de Brasília, taxonomy:binomi... 8 NaN NaN NaN NaN NaN NaN NaN
description 1338 1062 Fotografado pela Patrulha Centaurus 19 NaN NaN NaN NaN NaN NaN NaN
num_identification_agreements 10942.0 NaN NaN NaN 0.866112 0.853433 0.0 0.0 1.0 1.0 6.0
num_identification_disagreements 10942.0 NaN NaN NaN 0.007677 0.090371 0.0 0.0 0.0 0.0 2.0
captive_cultivated 10942 2 False 7736 NaN NaN NaN NaN NaN NaN NaN
oauth_application_id 8222.0 NaN NaN NaN 9.53819 48.331838 2.0 2.0 2.0 3.0 524.0
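
The info() listing reports latitude and longitude as object rather than float64, so they cannot be plotted or aggregated as numbers until converted. The sketch below shows one hedged way to coerce them. The sample rows are made up, and the comma-decimal and empty values are guesses at what might keep the real columns from parsing, not something observed in the file; to_float is a hypothetical helper name.

```python
import pandas as pd

# Made-up rows for illustration only; the real cause of the object dtype is unknown.
sample = pd.DataFrame({
    "latitude": ["-15.7801", "-23,5505", ""],
    "longitude": ["-47.9292", "-46,6333", "-38.5014"],
})

def to_float(col: pd.Series) -> pd.Series:
    """Convert text coordinates to floats; unparseable values become NaN."""
    cleaned = col.astype(str).str.strip().str.replace(",", ".", regex=False)
    return pd.to_numeric(cleaned, errors="coerce")

for name in ("latitude", "longitude"):
    sample[name + "_num"] = to_float(sample[name])

print(sample.dtypes)
print(sample[["latitude_num", "longitude_num"]].isna().sum())
```

Writing to new columns keeps the original text available, which helps when inspecting the rows that came out as NaN.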

A first chart

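Histogram of the first numeric column. Here that column is id, an observation identifier, so the plot is only a smoke test; swap in a variable you care about, such as positional_accuracy or num_identification_agreements.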

In [4]:
import matplotlib.pyplot as plt

num = df.select_dtypes('number')
if num.shape[1]:
    col = num.columns[0]
    num[col].plot.hist(bins=50, figsize=(8, 4), title=col)
    plt.tight_layout()
else:
    print('No direct numeric columns: explore df on your own.')
[Figure: histogram of id, 50 bins]
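
For a dataset that is mostly categorical, frequency counts are often a more useful first chart than a histogram. A small sketch, assuming you want the most common plant families: top_counts is a hypothetical helper, and the sample frame reuses family names from the preview with made-up row counts.

```python
import pandas as pd

def top_counts(frame: pd.DataFrame, column: str, n: int = 10) -> pd.Series:
    """Return the n most frequent values of a column, most common first."""
    return frame[column].value_counts().head(n)

# Family names taken from the preview rows; how often each appears is made up.
sample = pd.DataFrame({
    "taxon_family_name": ["Fabaceae", "Fabaceae", "Myrtaceae",
                          "Anacardiaceae", "Malvaceae", "Fabaceae"],
})

top = top_counts(sample, "taxon_family_name", n=3)
print(top)
```

In the notebook, top_counts(df, "taxon_family_name").sort_values().plot.barh(figsize=(8, 4)) should turn the same counts into a horizontal bar chart.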

Working with data larger than memory

Your session has a 4 GB RAM limit, but you can analyse files of 10 GB or more without loading them whole:

  • Read only the columns you need: pd.read_csv(f, usecols=[...]) / pd.read_parquet(f, columns=[...]).
  • Process in chunks and keep only the result:
    total = 0
    for chunk in pd.read_csv(file, chunksize=1_000_000):
        total += len(chunk)
    
  • Query with SQL without loading anything — DuckDB (already installed) reads CSV and Parquet straight from disk and only brings the result into memory:
    import duckdb
    duckdb.sql("SELECT column, count(*) FROM '/srv/data/.../file.parquet' GROUP BY column").df()
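
The first two ideas above combine naturally: read one column, one chunk at a time, and keep only a running tally. A minimal sketch using pandas alone; count_by_chunks is a hypothetical helper, and the demo CSV is a throwaway file with made-up rows so the example runs anywhere.

```python
import tempfile
from collections import Counter
from pathlib import Path

import pandas as pd

def count_by_chunks(path, column, chunksize=1_000_000):
    """Tally the values of one column while holding only one chunk in memory."""
    totals = Counter()
    # usecols limits parsing to the single column we need.
    for chunk in pd.read_csv(path, usecols=[column], chunksize=chunksize):
        totals.update(chunk[column].value_counts().to_dict())
    return pd.Series(totals).sort_values(ascending=False)

# Demo on a small temporary CSV; the rows are made up for illustration.
with tempfile.TemporaryDirectory() as tmp:
    demo = Path(tmp) / "demo.csv"
    pd.DataFrame({
        "id": range(5),
        "quality_grade": ["research", "needs_id", "research", "casual", "research"],
    }).to_csv(demo, index=False)
    # chunksize=2 forces several chunks even on this tiny file.
    counts = count_by_chunks(demo, "quality_grade", chunksize=2)

print(counts)
```

Peak memory depends on the chunk size and the number of distinct values, not on the file size, which is what makes this pattern workable under the session's RAM limit.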
    

Your turn

This is just the starting point. Some ideas:

  • Check the dataset challenge on its CSDH data sheet.
  • Work on a copy: right-click the file → Duplicate (or Save Notebook As…). Your changes only live in your Hub space — they're never pushed to GitHub.
  • Edited this notebook and want the original back? Use the Restore cell below (or the restore.ipynb notebook).
  • Questions and results: on the platform forum.

Attribution: data from Fruit-Bearing Plant Species Observations in Brazilian Cities (iNaturalist), license CC0-1.0. Notebook from the Citizen Science Data Hub (CSDH) — Fundación Ibercivis.

In [5]:
# ⚠️ RESTORE: this DISCARDS YOUR CHANGES to this notebook and resets it to the original.
# 1. Uncomment the line below (remove the #)   2. Run this cell
# 3. Then: menu File → Reload Notebook from Disk

# !git -C ~/citizen-science-data fetch -q origin && git -C ~/citizen-science-data checkout origin/main -- fruit-plants-brazil.ipynb && echo "Restored. Now: File → Reload Notebook from Disk"