About the Citizen Science Data Hub
The Citizen Science Data Hub (CSDH) is an open platform by Fundación Ibercivis that brings together citizen science datasets and a ready-to-use environment to work on them. It is the data side of the project; the training modules — the Citizen Science Data Academy — live on the ECS Academy (Moodle).
Where each thing lives
| Piece | Where | What it is |
|---|---|---|
| Public gallery | data.ibercivis.es | Static catalogue of the datasets: data sheet, license, download and challenge per dataset. No account, no cookies. |
| Work environment | jupyterhub.ibercivis.es | JupyterHub. Sign in with GitHub, get your own workspace with the data mounted read-only. Per-user limits: 4 GB RAM, 2 CPUs. |
| Example notebooks | github.com/Ibercivis/citizen-science-data | One notebook per dataset. The "Work on this dataset" button clones them into your Hub workspace via nbgitpuller. |
| Datasets (files) | Zenodo · /srv/data on the server | Downloads link to Zenodo (citable DOIs). The same files are mounted read-only inside the Hub. |
| Forum & questions | GitHub Discussions | Where challenges, questions and results are discussed. |
| Training | ECS Academy (Moodle) | The Data Analysis course series. The Hub links to it; it does not reproduce it. |
Working with large datasets
Each Hub session has 4 GB of RAM and 2 CPUs, and the datasets
are mounted read-only in /srv/data — a single shared copy, so opening one
never duplicates it to your space. The catch: some datasets are far bigger than 4 GB
(the largest single file is ~13 GB). Opening a file is free; loading it whole
into memory is what crashes the kernel. You can still analyse a 10–20 GB file
on a 4 GB session — you just work like the pros:
- Read only the columns you need: `pd.read_csv(f, usecols=[…])` or `pd.read_parquet(f, columns=[…])`. Parquet is columnar, so reading 3 of 50 columns barely touches the rest.
- Process in chunks: `for chunk in pd.read_csv(f, chunksize=1_000_000): …` walks the whole file keeping only a slice in memory at a time.
- Query on disk with DuckDB (pre-installed): run SQL straight against a CSV or Parquet file, and only the result comes into memory, e.g. `duckdb.sql("SELECT … FROM '/srv/data/…/file.parquet' GROUP BY …").df()`. This handles files much larger than RAM.
- For ZIP datasets, read entries from inside the archive instead of extracting the whole thing to your shared home folder.
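The column-selection and chunking advice above can be combined in a single pass. The following is a minimal sketch, not one of the Hub's own notebook snippets: the helper `count_by_column`, the demo file `observations.csv` and its column names are all hypothetical, and the tiny CSV is written to a temporary folder only so the example runs anywhere. On the Hub, the path would point at a real file under /srv/data.

```python
import os
import tempfile

import pandas as pd


def count_by_column(path, column, chunksize=1_000_000):
    """Count rows per value of `column`, reading the CSV one chunk at a time."""
    totals = None
    # usecols parses only the column we need; chunksize turns read_csv
    # into an iterator of DataFrames instead of one big DataFrame.
    for chunk in pd.read_csv(path, usecols=[column], chunksize=chunksize):
        counts = chunk[column].value_counts()
        # Merge this chunk's counts into the running totals.
        totals = counts if totals is None else totals.add(counts, fill_value=0)
    if totals is None:
        return pd.Series(dtype="int64")
    return totals.astype("int64").sort_index()


# Hypothetical stand-in data (assumption: not a real CSDH dataset).
demo_dir = tempfile.mkdtemp()
demo_path = os.path.join(demo_dir, "observations.csv")
pd.DataFrame(
    {
        "species": ["robin", "wren", "robin", "owl", "wren", "robin"],
        "observer": ["a", "b", "c", "a", "b", "c"],
        "count": [1, 2, 1, 1, 3, 2],
    }
).to_csv(demo_path, index=False)

# chunksize=2 forces three chunks so the merging logic is actually exercised.
result = count_by_column(demo_path, "species", chunksize=2)
print(result)
```

Only one chunk of one column is held in memory at any time, so peak memory depends on `chunksize` rather than on the size of the file. The same pattern works for sums or minima and maxima; statistics that need all rows at once, such as an exact median, are better handled by the on-disk query approach.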
Every example notebook for a large dataset opens with a reminder of this, and the CSV and Parquet ones include ready-to-use snippets. Downloads on the gallery point to Zenodo, so you never need to pull tens of gigabytes onto your own machine to get started.
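For ZIP datasets, reading an entry in place might look like the sketch below, which uses only the Python standard library. The archive `dataset.zip`, the entry `records.csv` and the helper `count_rows_in_zip` are hypothetical; the demo archive is built in a temporary folder so the example is self-contained, and the entry is assumed to be UTF-8 CSV with a header row.

```python
import csv
import io
import os
import tempfile
import zipfile


def count_rows_in_zip(zip_path, member):
    """Count the data rows of one CSV entry, streamed straight from the archive."""
    with zipfile.ZipFile(zip_path) as archive:
        # archive.open() returns a binary file-like object that decompresses
        # as it is read, so nothing is extracted to disk.
        with archive.open(member) as raw:
            text = io.TextIOWrapper(raw, encoding="utf-8", newline="")
            reader = csv.reader(text)
            next(reader, None)  # skip the header row
            return sum(1 for _ in reader)


# Hypothetical stand-in archive (assumption: not a real CSDH dataset).
demo_dir = tempfile.mkdtemp()
demo_zip = os.path.join(demo_dir, "dataset.zip")
with zipfile.ZipFile(demo_zip, "w", compression=zipfile.ZIP_DEFLATED) as archive:
    archive.writestr("records.csv", "id,value\n1,10\n2,20\n3,30\n")
    archive.writestr("README.txt", "demo archive\n")

# List the entries first, then stream just the one that is needed.
with zipfile.ZipFile(demo_zip) as archive:
    entries = archive.namelist()
n_rows = count_rows_in_zip(demo_zip, "records.csv")
print(entries, n_rows)
```

The file object returned by `archive.open()` can also be passed to `pd.read_csv`, so the column and chunk options can be applied to a zipped CSV without unpacking it.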
Who runs it
The platform is operated by Fundación Ibercivis. See the legal notice for provider details, the privacy policy for how personal data is handled, and the terms of use for the work environment.
Funding
Funded by the European Union (Horizon Europe, grant agreement No. 101058509 — ECS project). Views and opinions expressed are however those of the author(s) only and do not necessarily reflect those of the European Union or the REA. Neither the European Union nor the granting authority can be held responsible for them.