Tropical Rainforest Birds: Acoustic + eBird Distribution Models

Category: Ornithology · Size: 8.3 GB · Format: CSV, R, ZIP · License: CC-BY-4.0 · Zenodo record · Data sheet on the CSDH

Automated acoustic monitoring data and eBird observations combined for distribution models of suboscine birds in southwestern Amazonia.

The data is mounted read-only at /srv/data/tropical-rainforest-birds/. Save anything you produce in your personal folder (~/).

⚠️ Large dataset (8.3 GB). Your Hub session has 4 GB RAM — do not load the whole file into memory or the kernel will crash. Work like the pros: read only the columns you need, process the file in chunks, or query it straight from disk with DuckDB (no full load). Copy-paste patterns are in "Working with data larger than memory" near the end of this notebook.

What's in the dataset

In [1]:
from pathlib import Path

DATA = Path('/srv/data/tropical-rainforest-birds')

for f in sorted(DATA.rglob('*')):
    if f.is_file():
        print(f"{f.relative_to(DATA)}  ({f.stat().st_size/1e6:,.1f} MB)")
analysis.R  (0.1 MB)
calc_er_summary_metrics.R  (0.0 MB)
calc_occupancy_metrics.R  (0.0 MB)
data-raw.zip  (16.2 MB)
data.zip  (39.2 MB)
det_prob_test.R  (0.0 MB)
environmental-variables_checklists.csv  (17.6 MB)
gis.zip  (1,552.8 MB)
logistic_cicra.R  (0.0 MB)
logistic_samples_testset.zip  (0.1 MB)
logit_cicra.R  (0.0 MB)
logit_postprocessing_results.zip  (0.7 MB)
main_cicra_only.R  (0.1 MB)
main_cicra_only_audio.R  (0.1 MB)
occ_model_selection_audio.R  (0.0 MB)
occ_model_selection_eBird.R  (0.0 MB)
occ_model_selection_pooled.R  (0.0 MB)
run_occ_temp.R  (0.0 MB)
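
Several entries in the listing are ZIP archives (data.zip, data-raw.zip, gis.zip). A minimal sketch for peeking inside one without extracting it, using only the standard library; the helper name list_zip_members and its limit parameter are illustrative and not part of the dataset's own scripts:

```python
import zipfile
from pathlib import Path

def list_zip_members(zip_path, limit=20):
    """Return (name, uncompressed size in bytes) for up to `limit` files in the archive."""
    with zipfile.ZipFile(zip_path) as zf:
        members = [(info.filename, info.file_size)
                   for info in zf.infolist() if not info.is_dir()]
    return members[:limit]

DATA = Path('/srv/data/tropical-rainforest-birds')
archive = DATA / 'data.zip'
if archive.exists():  # skipped when the dataset is not mounted
    for name, size in list_zip_members(archive):
        print(f"{name}  ({size/1e6:,.1f} MB)")
```

Only the archive's index is read, so this stays cheap even for gis.zip. Because the data folder is read-only, anything you do extract has to go to your personal folder.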

Load the data

The tabular data comes as CSV. Real-world CSV files are not always uniform (the separator may be , or ;, the encoding UTF-8 or Latin-1), so we use a loader that detects the separator and tries UTF-8 before falling back to Latin-1. We cap the read at 100,000 rows to explore quickly; drop nrows when you want to work with everything.

In [2]:
import pandas as pd

# Plain and gzip-compressed CSV files only; a bare '*.gz' glob would also match non-CSV archives
csvs = sorted(DATA.rglob('*.csv')) + sorted(DATA.rglob('*.csv.gz'))
print('Using:', csvs[0].name)

def load_csv(path, **kw):
    """Robust reader: detects the separator and tries utf-8 then latin-1."""
    for enc in ('utf-8', 'latin-1'):
        try:
            return pd.read_csv(path, sep=None, engine='python', encoding=enc, **kw)
        except UnicodeDecodeError:
            continue

df = load_csv(csvs[0], nrows=100_000)
df.head()
Using: environmental-variables_checklists.csv
Out[2]:
checklist_id elevation_mean_300m elevation_sd_300m HAND_mean_300m HAND_sd_300m canopy_mean_300m canopy_sd_300m ed_c01_open_water_300m pland_c01_open_water_300m ed_c02_floodplain_300m ... ed_c10_scrub_shrub_1k pland_c10_scrub_shrub_1k ed_c11_built_area_1k pland_c11_built_area_1k ed_c12_bare_ground_1k pland_c12_bare_ground_1k ed_c13_snow_ice_1k pland_c13_snow_ice_1k ed_c14_sparsely_vegetated_1k pland_c14_sparsely_vegetated_1k
0 G7641541 207.530899 15.869239 22.530880 15.869239 16.383030 11.567419 28.112308 29.545455 46.249281 ... 0 0 0.0 0.0 3.908009 0.780696 0 0 0.0 0.0
1 G7641542 207.530899 15.869239 22.530880 15.869239 16.383030 11.567419 28.112308 29.545455 46.249281 ... 0 0 0.0 0.0 3.908009 0.780696 0 0 0.0 0.0
2 G7641547 207.530899 15.869239 22.530880 15.869239 16.383030 11.567419 28.112308 29.545455 46.249281 ... 0 0 0.0 0.0 3.908009 0.780696 0 0 0.0 0.0
3 G7678829 207.530899 15.869239 22.530880 15.869239 16.383030 11.567419 28.112308 29.545455 46.249281 ... 0 0 0.0 0.0 3.908009 0.780696 0 0 0.0 0.0
4 G7206243 289.131989 14.737938 19.887278 14.575183 21.488491 13.826680 32.564318 28.967254 18.995852 ... 0 0 0.0 0.0 0.000000 0.000000 0 0 0.0 0.0

5 rows × 69 columns
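
Most of the 69 columns follow a visible naming pattern: a metric prefix (ed or pland), a class code (c01 to c14), a land-cover name, and a buffer suffix (300m or 1k). A small sketch that splits those names into parts, which can help when selecting or reshaping columns. The regular expression is inferred from the column names shown above, and reading ed as edge density and pland as percentage of landscape is an assumption based on common landscape-metric naming, not something this notebook confirms:

```python
import re

# Assumed pattern: <metric>_c<NN>_<land cover>_<buffer>, e.g. pland_c04_terra_firme_300m
LANDCOVER = re.compile(r'^(ed|pland)_c(\d{2})_(.+)_(300m|1k)$')

def parse_landcover_column(name):
    """Split a land-cover column name into its parts; return None if it does not match."""
    m = LANDCOVER.match(name)
    if m is None:
        return None
    metric, code, cover, buffer = m.groups()
    return {'metric': metric, 'class_code': int(code),
            'land_cover': cover, 'buffer': buffer}
```

For example, [c for c in df.columns if (p := parse_landcover_column(c)) and p['buffer'] == '1k'] would pick out the land-cover columns computed at the 1k buffer.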

First look

Shape, types and basic statistics.

In [3]:
df.info()
df.describe(include='all').T.head(20)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 30580 entries, 0 to 30579
Data columns (total 69 columns):
 #   Column                             Non-Null Count  Dtype  
---  ------                             --------------  -----  
 0   checklist_id                       30580 non-null  object 
 1   elevation_mean_300m                30580 non-null  float64
 2   elevation_sd_300m                  30580 non-null  float64
 3   HAND_mean_300m                     30580 non-null  float64
 4   HAND_sd_300m                       30580 non-null  float64
 5   canopy_mean_300m                   30580 non-null  float64
 6   canopy_sd_300m                     30580 non-null  float64
 7   ed_c01_open_water_300m             30580 non-null  float64
 8   pland_c01_open_water_300m          30580 non-null  float64
 9   ed_c02_floodplain_300m             30580 non-null  float64
 10  pland_c02_floodplain_300m          30580 non-null  float64
 11  ed_c03_transition_forest_300m      30580 non-null  float64
 12  pland_c03_transition_forest_300m   30580 non-null  float64
 13  ed_c04_terra_firme_300m            30580 non-null  float64
 14  pland_c04_terra_firme_300m         30580 non-null  float64
 15  ed_c05_premontane_forest_300m      30580 non-null  int64  
 16  pland_c05_premontane_forest_300m   30580 non-null  int64  
 17  ed_c06_montane_forest_300m         30580 non-null  int64  
 18  pland_c06_montane_forest_300m      30580 non-null  int64  
 19  ed_c07_grass_300m                  30580 non-null  int64  
 20  pland_c07_grass_300m               30580 non-null  int64  
 21  ed_c08_flooded_vegetation_300m     30580 non-null  float64
 22  pland_c08_flooded_vegetation_300m  30580 non-null  float64
 23  ed_c09_crops_300m                  30580 non-null  float64
 24  pland_c09_crops_300m               30580 non-null  float64
 25  ed_c10_scrub_shrub_300m            30580 non-null  int64  
 26  pland_c10_scrub_shrub_300m         30580 non-null  int64  
 27  ed_c11_built_area_300m             30580 non-null  float64
 28  pland_c11_built_area_300m          30580 non-null  float64
 29  ed_c12_bare_ground_300m            30580 non-null  float64
 30  pland_c12_bare_ground_300m         30580 non-null  float64
 31  ed_c13_snow_ice_300m               30580 non-null  int64  
 32  pland_c13_snow_ice_300m            30580 non-null  int64  
 33  ed_c14_sparsely_vegetated_300m     30580 non-null  float64
 34  pland_c14_sparsely_vegetated_300m  30580 non-null  float64
 35  elevation_mean_1k                  30580 non-null  float64
 36  elevation_sd_1k                    30580 non-null  float64
 37  HAND_mean_1k                       30580 non-null  float64
 38  HAND_sd_1k                         30580 non-null  float64
 39  canopy_mean_1k                     30580 non-null  float64
 40  canopy_sd_1k                       30580 non-null  float64
 41  ed_c01_open_water_1k               30580 non-null  float64
 42  pland_c01_open_water_1k            30580 non-null  float64
 43  ed_c02_floodplain_1k               30580 non-null  float64
 44  pland_c02_floodplain_1k            30580 non-null  float64
 45  ed_c03_transition_forest_1k        30580 non-null  float64
 46  pland_c03_transition_forest_1k     30580 non-null  float64
 47  ed_c04_terra_firme_1k              30580 non-null  float64
 48  pland_c04_terra_firme_1k           30580 non-null  float64
 49  ed_c05_premontane_forest_1k        30580 non-null  float64
 50  pland_c05_premontane_forest_1k     30580 non-null  float64
 51  ed_c06_montane_forest_1k           30580 non-null  int64  
 52  pland_c06_montane_forest_1k        30580 non-null  int64  
 53  ed_c07_grass_1k                    30580 non-null  int64  
 54  pland_c07_grass_1k                 30580 non-null  int64  
 55  ed_c08_flooded_vegetation_1k       30580 non-null  float64
 56  pland_c08_flooded_vegetation_1k    30580 non-null  float64
 57  ed_c09_crops_1k                    30580 non-null  float64
 58  pland_c09_crops_1k                 30580 non-null  float64
 59  ed_c10_scrub_shrub_1k              30580 non-null  int64  
 60  pland_c10_scrub_shrub_1k           30580 non-null  int64  
 61  ed_c11_built_area_1k               30580 non-null  float64
 62  pland_c11_built_area_1k            30580 non-null  float64
 63  ed_c12_bare_ground_1k              30580 non-null  float64
 64  pland_c12_bare_ground_1k           30580 non-null  float64
 65  ed_c13_snow_ice_1k                 30580 non-null  int64  
 66  pland_c13_snow_ice_1k              30580 non-null  int64  
 67  ed_c14_sparsely_vegetated_1k       30580 non-null  float64
 68  pland_c14_sparsely_vegetated_1k    30580 non-null  float64
dtypes: float64(50), int64(18), object(1)
memory usage: 16.1+ MB
Out[3]:
count unique top freq mean std min 25% 50% 75% max
checklist_id 30580 30510 G11200839 8 NaN NaN NaN NaN NaN NaN NaN
elevation_mean_300m 30580.0 NaN NaN NaN 259.416624 23.507895 169.163605 247.568634 257.509918 269.676544 502.928345
elevation_sd_300m 30580.0 NaN NaN NaN 10.588889 6.652001 0.0 4.082106 10.058463 17.622892 26.432089
HAND_mean_300m 30580.0 NaN NaN NaN 28.07559 16.539799 0.0 14.94241 25.06822 40.058731 119.920013
HAND_sd_300m 30580.0 NaN NaN NaN 10.834183 6.624898 0.0 4.206368 10.202206 18.314869 25.497375
canopy_mean_300m 30580.0 NaN NaN NaN 28.349423 3.712381 0.0 27.086662 28.927591 30.048012 35.838051
canopy_sd_300m 30580.0 NaN NaN NaN 3.827301 2.691151 0.0 1.868741 2.890475 4.805346 15.560221
ed_c01_open_water_300m 30580.0 NaN NaN NaN 8.084925 15.617845 0.0 0.0 0.0 7.058714 118.79025
pland_c01_open_water_300m 30580.0 NaN NaN NaN 3.337372 7.59246 0.0 0.0 0.0 0.737101 100.0
ed_c02_floodplain_300m 30580.0 NaN NaN NaN 27.647269 21.84691 0.0 6.161236 27.352516 40.603123 130.178123
pland_c02_floodplain_300m 30580.0 NaN NaN NaN 39.331729 34.715929 0.0 4.914005 37.009804 76.22549 100.0
ed_c03_transition_forest_300m 30580.0 NaN NaN NaN 37.924055 26.702645 0.0 11.700393 45.88164 55.94208 138.604656
pland_c03_transition_forest_300m 30580.0 NaN NaN NaN 13.678431 15.670454 0.0 2.70936 9.1133 19.65602 94.472362
ed_c04_terra_firme_300m 30580.0 NaN NaN NaN 18.169868 16.422677 0.0 0.0 22.004416 27.419887 102.994069
pland_c04_terra_firme_300m 30580.0 NaN NaN NaN 42.357551 36.071288 0.0 2.0 45.652174 70.515971 100.0
ed_c05_premontane_forest_300m 30580.0 NaN NaN NaN 0.0 0.0 0.0 0.0 0.0 0.0 0.0
pland_c05_premontane_forest_300m 30580.0 NaN NaN NaN 0.0 0.0 0.0 0.0 0.0 0.0 0.0
ed_c06_montane_forest_300m 30580.0 NaN NaN NaN 0.0 0.0 0.0 0.0 0.0 0.0 0.0
pland_c06_montane_forest_300m 30580.0 NaN NaN NaN 0.0 0.0 0.0 0.0 0.0 0.0 0.0
ed_c07_grass_300m 30580.0 NaN NaN NaN 0.0 0.0 0.0 0.0 0.0 0.0 0.0
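
Two things stand out in the summary: several columns are zero everywhere (for example ed_c05_premontane_forest_300m has min, max and std of 0), and checklist_id has 30,510 unique values across 30,580 rows, so some ids repeat. A minimal sketch for finding both; the helper names constant_columns and duplicated_ids are illustrative, and whether repeated ids are errors or legitimate repeats is not stated in the source:

```python
import pandas as pd

def constant_columns(frame):
    """Names of columns that hold a single distinct value (NaN counts as a value)."""
    return [c for c in frame.columns if frame[c].nunique(dropna=False) <= 1]

def duplicated_ids(frame, id_col='checklist_id'):
    """All rows whose id appears more than once, sorted by id."""
    dup = frame[frame.duplicated(subset=id_col, keep=False)]
    return dup.sort_values(id_col)
```

On the loaded frame, df.drop(columns=constant_columns(df)) would remove the uninformative columns before modelling, and duplicated_ids(df) lets you inspect the repeated checklists before deciding how to treat them.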

A first chart

Histogram of the first numeric column — swap it for the variable you care about.

In [4]:
import matplotlib.pyplot as plt

num = df.select_dtypes('number')
if num.shape[1]:
    col = num.columns[0]
    num[col].plot.hist(bins=50, figsize=(8, 4), title=col)
    plt.tight_layout()
else:
    print('No direct numeric columns: explore df on your own.')
[Figure: histogram of elevation_mean_300m, the first numeric column, with 50 bins.]

Working with data larger than memory

Your session has a 4 GB RAM limit, but you can analyse files of 10 GB or more without loading them whole:

  • Read only the columns you need: pd.read_csv(f, usecols=[...]) / pd.read_parquet(f, columns=[...]).
  • Process in chunks and keep only the result:
    total = 0
    for chunk in pd.read_csv(file, chunksize=1_000_000):
        total += len(chunk)
    
  • Query with SQL without loading anything — DuckDB (already installed) reads CSV and Parquet straight from disk and only brings the result into memory:
    import duckdb
    duckdb.sql("SELECT some_column, count(*) FROM '/srv/data/.../file.parquet' GROUP BY some_column").df()
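
The patterns above can be combined into one small aggregate. The sketch below computes the mean of a single column by reading only that column, one chunk at a time, and then shows the same aggregate with DuckDB reading the CSV from disk. The function names are illustrative, the pandas version assumes a comma-separated file, and the DuckDB version assumes the duckdb package is installed as the notebook states:

```python
import pandas as pd

def chunked_mean(path, column, chunksize=10_000):
    """Mean of one column, reading only that column and one chunk at a time."""
    total, count = 0.0, 0
    for chunk in pd.read_csv(path, usecols=[column], chunksize=chunksize):
        values = chunk[column].dropna()
        total += values.sum()   # keep only the running result, not the chunk
        count += len(values)
    return total / count if count else float('nan')

def duckdb_mean(path, column):
    """Same aggregate computed by DuckDB straight from the CSV on disk."""
    import duckdb  # imported here so chunked_mean works without it
    query = f"""SELECT avg("{column}") FROM read_csv_auto('{path}')"""
    return duckdb.sql(query).fetchone()[0]
```

With the file used earlier, chunked_mean(DATA / 'environmental-variables_checklists.csv', 'canopy_mean_300m') should agree with the mean reported by df.describe(). That file is small enough to load whole, so the pattern only pays off on larger inputs.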
    

Your turn

This is just the starting point. Some ideas:

  • Check the dataset challenge on its CSDH data sheet.
  • Work on a copy: right-click the file → Duplicate (or Save Notebook As…). Your changes only live in your Hub space — they're never pushed to GitHub.
  • Edited this notebook and want the original back? Use the Restore cell below (or the restore.ipynb notebook).
  • Questions and results: on the platform forum.

Attribution: data from Tropical Rainforest Birds: Acoustic + eBird Distribution Models, license CC-BY-4.0. Notebook from the Citizen Science Data Hub (CSDH) — Fundación Ibercivis.

In [5]:
# ⚠️ RESTORE: this DISCARDS YOUR CHANGES to this notebook and resets it to the original.
# 1. Uncomment the line below (remove the #)   2. Run this cell
# 3. Then: menu File → Reload Notebook from Disk

# !git -C ~/citizen-science-data fetch -q origin && git -C ~/citizen-science-data checkout origin/main -- tropical-rainforest-birds.ipynb && echo "Restored. Now: File → Reload Notebook from Disk"