CS Track Database: Citizen Science Projects Catalog¶
Category: Citizen Science Metadata · Size: 2.4 MB · Format: CSV · License: CC-BY-4.0 · Zenodo record · Data sheet on the CSDH
Comprehensive database of citizen science projects with titles, topic areas, languages, URLs and alignment with the UN Sustainable Development Goals.
The data is mounted read-only at /srv/data/cs-track-database/.
Save anything you produce in your personal folder (~/).
What's in the dataset¶
from pathlib import Path

DATA = Path('/srv/data/cs-track-database')

# List every file under the dataset folder with its size in MB.
for f in sorted(DATA.rglob('*')):
    if f.is_file():
        print(f"{f.relative_to(DATA)} ({f.stat().st_size/1e6:,.1f} MB)")
CSTrack_database.csv (2.4 MB)
Load the data¶
The dataset ships as a single CSV file. Real-world CSV files are not always uniform (separator , or ;, UTF-8 or Latin-1 encoding), so we use a loader that detects the separator and tries both encodings. We cap the read at 100,000 rows to explore quickly; drop nrows when you want to be sure you are working with everything.
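import pandas as pd

# Plain CSVs first, then gzip-compressed files (the '*.gz' pattern already covers '*.csv.gz').
csvs = sorted(DATA.rglob('*.csv')) + sorted(DATA.rglob('*.gz'))
print('Using:', csvs[0].name)

def load_csv(path, **kw):
    """Robust reader: detects the separator and tries utf-8 then latin-1."""
    for enc in ('utf-8', 'latin-1'):
        try:
            # sep=None with the python engine lets pandas sniff the delimiter.
            return pd.read_csv(path, sep=None, engine='python', encoding=enc, **kw)
        except UnicodeDecodeError:
            continue
    # Fail loudly instead of silently returning None.
    raise ValueError(f'Could not decode {path} as utf-8 or latin-1')

df = load_csv(csvs[0], nrows=100_000)
df.head()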
Using: CSTrack_database.csv
| | Insert Date | Language | Title | URL Platform | Research Areas | Sdg |
|---|---|---|---|---|---|---|
| 0 | 2020-06-16 | German | Virenmonitoring | https://www.citizen-science.at/projekte/virenm... | ["Life Sciences & Biomedicine, Virology, 0.962... | [] |
| 1 | 2020-06-16 | German | Deutsch in Österreich | https://www.citizen-science.at/projekte/deutsc... | ["Social Sciences, Linguistics, 0.716406357773... | [] |
| 2 | 2020-06-16 | German | Citree | https://www.citizen-science.at/projekte/citree | ["Physical Sciences, Sustainability Science, 0... | ["SDG, SDG #11, 0.37277650533619905" "SDG, SDG... |
| 3 | 2020-06-16 | German | Fossilfinder | ["https://www.citizen-science.at/projekte/foss... | ["Technology, Remote Sensing, 0.40562888841322... | ["SDG, SDG #1, 0.2798758512519837" "SDG, SDG #... |
| 4 | 2020-06-16 | German | Roadkill | ["https://boku.ac.at/citizen-science/projekte"... | ["Life Sciences & Biomedicine, Veterinary Scie... | ["SDG, SDG #15, 0.26473126252405066"] |
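The preview shows that Research Areas and Sdg are stored as text that looks like a list of quoted "group, label, score" entries. The sketch below is one possible way to turn such a cell into Python tuples. It assumes every entry follows that three-part pattern, which the preview suggests but does not guarantee; the helper name parse_scored_list is made up for this example, and the two-entry test string is assembled from values seen in different rows purely for illustration.

```python
import re

def parse_scored_list(cell):
    """Parse a cell such as '["SDG, SDG #15, 0.26"]' into (group, label, score) tuples.

    Assumption: each quoted entry has the form "group, label, score".
    Entries that do not fit are skipped rather than raising.
    """
    if not isinstance(cell, str):
        return []  # missing values (NaN) carry no entries
    parsed = []
    for entry in re.findall(r'"([^"]*)"', cell):
        # Split from the right so commas inside the group name are preserved.
        parts = [p.strip() for p in entry.rsplit(',', 2)]
        if len(parts) != 3:
            continue
        group, label, score = parts
        try:
            parsed.append((group, label, float(score)))
        except ValueError:
            continue  # score was not a number
    return parsed

# Single entry copied from the preview (row 4).
one = parse_scored_list('["SDG, SDG #15, 0.26473126252405066"]')

# Illustrative two-entry cell; entries are separated by a space, as in the preview.
two = parse_scored_list('["SDG, SDG #11, 0.37277650533619905" "SDG, SDG #1, 0.2798758512519837"]')

empty = parse_scored_list('[]')
```

On the real frame this could be applied with df['Sdg'].apply(parse_scored_list), after which the number of SDGs per project is simply the length of each list.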
First look¶
Shape, types and basic statistics.
df.info()
df.describe(include='all').T.head(20)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4949 entries, 0 to 4948
Data columns (total 6 columns):
 #   Column          Non-Null Count  Dtype
---  ------          --------------  -----
 0   Insert Date     4949 non-null   object
 1   Language        4929 non-null   object
 2   Title           4949 non-null   object
 3   URL Platform    4947 non-null   object
 4   Research Areas  4849 non-null   object
 5   Sdg             4849 non-null   object
dtypes: object(6)
memory usage: 232.1+ KB
| | count | unique | top | freq |
|---|---|---|---|---|
| Insert Date | 4949 | 56 | 2021-01-10 | 1095 |
| Language | 4929 | 58 | English | 2771 |
| Title | 4949 | 4946 | DigiVol | 2 |
| URL Platform | 4947 | 4623 | ["https://eu-citizen.science/projects"] | 114 |
| Research Areas | 4849 | 4502 | [] | 30 |
| Sdg | 4849 | 2628 | [] | 2083 |
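The summary raises three things worth checking: Insert Date is stored as text, some columns have missing values, and at least one title appears twice. The sketch below shows one way to look at each of these. It runs on a tiny stand-in frame whose values are illustrative (not real counts from the catalogue), so the same calls would need to be pointed at df instead of sample.

```python
import pandas as pd

# Stand-in frame with the catalogue's column names; the values are illustrative only.
sample = pd.DataFrame({
    "Insert Date": ["2020-06-16", "2020-06-16", "2021-01-10", "2021-01-10"],
    "Language": ["German", None, "English", "English"],
    "Title": ["Virenmonitoring", "Citree", "DigiVol", "DigiVol"],
})

# Convert the text dates to real timestamps; unparseable values become NaT.
sample["Insert Date"] = pd.to_datetime(sample["Insert Date"], errors="coerce")

# Missing values per column.
missing = sample.isna().sum()

# Number of projects inserted per calendar month.
per_month = sample["Insert Date"].dt.to_period("M").value_counts().sort_index()

# Every row whose title occurs more than once.
dupes = sample[sample.duplicated("Title", keep=False)]

print(missing, per_month, dupes, sep="\n\n")
```

Whether repeated titles are true duplicates or distinct projects listed on several platforms is something to decide by comparing their URL Platform values.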
A first chart¶
Histogram of the first numeric column — swap it for the variable you care about.
import matplotlib.pyplot as plt

num = df.select_dtypes('number')
if num.shape[1]:
    # Plot the first numeric column found.
    col = num.columns[0]
    num[col].plot.hist(bins=50, figsize=(8, 4), title=col)
    plt.tight_layout()
else:
    print('No direct numeric columns: explore df on your own.')
No direct numeric columns: explore df on your own.
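Since every column here is text, a histogram has nothing to work with. One alternative, sketched below, is to count the values of a categorical column and draw a bar chart. The example uses a small made-up Series so that it runs on its own; on the real data the natural substitute would be df['Language'].

```python
import matplotlib.pyplot as plt
import pandas as pd

# Made-up stand-in for df['Language']; replace it with the real column.
languages = pd.Series(
    ["English", "German", "English", "Spanish", "English", "German"],
    name="Language",
)

# Keep the ten most frequent values (value_counts sorts by frequency, descending).
counts = languages.value_counts().head(10)

# Horizontal bars read better for category names; sort so the largest bar is on top.
ax = counts.sort_values().plot.barh(figsize=(8, 4), title="Projects per language (top 10)")
ax.set_xlabel("Number of projects")
plt.tight_layout()
```

value_counts ignores missing values by default, so projects without a language are left out of the chart.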
Working with data larger than memory¶
Your session has a 4 GB RAM limit, but you can analyse files of 10 GB or more without loading them whole:
- Read only the columns you need: pd.read_csv(f, usecols=[...]) / pd.read_parquet(f, columns=[...]).
- Process in chunks and keep only the result:
total = 0
for chunk in pd.read_csv(file, chunksize=1_000_000):
    total += len(chunk)
- Query with SQL without loading anything: DuckDB (already installed) reads CSV and Parquet straight from disk and only brings the result into memory:
import duckdb
duckdb.sql("SELECT column, count(*) FROM '/srv/data/.../file.parquet' GROUP BY column").df()
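The first two tips can be combined: read one column, chunk by chunk, and keep only a running tally. The sketch below is a minimal illustration under stated assumptions. It reads from an in-memory string with a deliberately tiny chunksize so that it runs anywhere; for a real file you would pass a path and a much larger chunksize. The sample rows are illustrative.

```python
import io
from collections import Counter

import pandas as pd

# In-memory stand-in for a large CSV file; pass a file path in real use.
csv_text = (
    "Language,Title\n"
    "German,Virenmonitoring\n"
    "German,Citree\n"
    "English,DigiVol\n"
)

counts = Counter()
# usecols limits parsing to one column; chunksize=2 is tiny only for demonstration.
for chunk in pd.read_csv(io.StringIO(csv_text), usecols=["Language"], chunksize=2):
    # Only the running tally survives each iteration, so memory use stays flat.
    counts.update(chunk["Language"].dropna())

print(counts.most_common())
```

Because each chunk is discarded after its values are counted, peak memory depends on the chunk size rather than on the size of the file.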
Your turn¶
This is just the starting point. Some ideas:
- Check the dataset challenge on its CSDH data sheet.
- Work on a copy: right-click the file → Duplicate (or Save Notebook As…). Your changes only live in your Hub space — they're never pushed to GitHub.
- Edited this notebook and want the original back? Use the Restore cell below (or the restore.ipynb notebook).
- Questions and results: on the platform forum.
Attribution: data from CS Track Database: Citizen Science Projects Catalog, license CC-BY-4.0. Notebook from the Citizen Science Data Hub (CSDH) — Fundación Ibercivis.
# ⚠️ RESTORE: this DISCARDS YOUR CHANGES to this notebook and resets it to the original.
# 1. Uncomment the line below (remove the #)
# 2. Run this cell
# 3. Then: menu File → Reload Notebook from Disk
# !git -C ~/citizen-science-data fetch -q origin && git -C ~/citizen-science-data checkout origin/main -- cs-track-database.ipynb && echo "Restored. Now: File → Reload Notebook from Disk"