CS Track Database: Citizen Science Projects Catalog¶

Category: Citizen Science Metadata · Size: 2.4 MB · Format: CSV · License: CC-BY-4.0 · Zenodo record · Data sheet on the CSDH

Comprehensive database of citizen science projects with titles, topic areas, languages, URLs and alignment with the UN Sustainable Development Goals.

The data is mounted read-only at /srv/data/cs-track-database/. Save anything you produce in your personal folder (~/).

What's in the dataset¶

In [1]:

from pathlib import Path

DATA = Path('/srv/data/cs-track-database')

for f in sorted(DATA.rglob('*')):
    if f.is_file():
        print(f"{f.relative_to(DATA)}  ({f.stat().st_size/1e6:,.1f} MB)")

CSTrack_database.csv  (2.4 MB)

Load the data¶

The dataset comes as a CSV file. Real-world CSV files aren't always uniform (separator , or ;, UTF-8 or Latin-1 encoding), so we use a loader that detects the separator and falls back from UTF-8 to Latin-1. We cap the read at 100,000 rows to explore quickly; this catalog has only 4,949 rows, so the cap has no effect here. Drop nrows when you want to read all of a larger file.

In [2]:

import pandas as pd

# Plain and gzip-compressed CSVs. A separate '*.gz' pattern would list every '.csv.gz' file twice.
csvs = sorted(DATA.rglob('*.csv')) + sorted(DATA.rglob('*.csv.gz'))
print('Using:', csvs[0].name)

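def load_csv(path, **kw):
    """Robust reader: detects the separator and tries utf-8 then latin-1."""
    for enc in ('utf-8', 'latin-1'):
        try:
            # sep=None lets the python engine sniff the delimiter (',' or ';').
            return pd.read_csv(path, sep=None, engine='python', encoding=enc, **kw)
        except UnicodeDecodeError:
            continue  # not valid UTF-8: retry with latin-1, which accepts any byte
    # Fail loudly instead of silently returning None.
    raise ValueError(f'Could not decode {path}')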

df = load_csv(csvs[0], nrows=100_000)
df.head()

Using: CSTrack_database.csv

Out[2]:

	Insert Date	Language	Title	URL Platform	Research Areas	Sdg
0	2020-06-16	German	Virenmonitoring	https://www.citizen-science.at/projekte/virenm...	["Life Sciences & Biomedicine, Virology, 0.962...	[]
1	2020-06-16	German	Deutsch in Österreich	https://www.citizen-science.at/projekte/deutsc...	["Social Sciences, Linguistics, 0.716406357773...	[]
2	2020-06-16	German	Citree	https://www.citizen-science.at/projekte/citree	["Physical Sciences, Sustainability Science, 0...	["SDG, SDG #11, 0.37277650533619905" "SDG, SDG...
3	2020-06-16	German	Fossilfinder	["https://www.citizen-science.at/projekte/foss...	["Technology, Remote Sensing, 0.40562888841322...	["SDG, SDG #1, 0.2798758512519837" "SDG, SDG #...
4	2020-06-16	German	Roadkill	["https://boku.ac.at/citizen-science/projekte"...	["Life Sciences & Biomedicine, Veterinary Scie...	["SDG, SDG #15, 0.26473126252405066"]
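The Research Areas and Sdg columns in the preview above hold text that looks like a list of quoted "group, label, score" items. A minimal sketch for turning such a cell into Python tuples follows. It assumes that every item is wrapped in double quotes and ends with a numeric score, which is inferred from the five rows shown and should be checked against the full file. The helper name parse_scored_labels and the sample values are hypothetical.

```python
import re

import pandas as pd

# Assumption: a cell looks like '["A, B, 0.12" "C, D, 0.34"]', i.e. quoted
# items of the form "<group>, <label>, <score>". Verify on the real data.
_ITEM = re.compile(r'"([^"]*)"')


def parse_scored_labels(cell):
    """Return a list of (group, label, score) tuples; [] for empty or missing cells."""
    if not isinstance(cell, str):
        return []  # NaN / None
    out = []
    for item in _ITEM.findall(cell):
        head, _, score = item.rpartition(', ')   # score is the last field
        group, _, label = head.partition(', ')   # group is the first field
        try:
            out.append((group, label, float(score)))
        except ValueError:
            continue  # skip items that do not end in a number
    return out


# Illustrative values only, shaped like the Sdg column.
sample = pd.Series([
    '["SDG, SDG #11, 0.37" "SDG, SDG #15, 0.26"]',
    '[]',
    None,
])
parsed = sample.map(parse_scored_labels)
print(parsed.tolist())
```

On the real frame the same helper could be applied with df['Sdg'].map(parse_scored_labels), followed by explode() to get one row per label.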

First look¶

Shape, types and basic statistics.

In [3]:

df.info()
df.describe(include='all').T.head(20)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4949 entries, 0 to 4948
Data columns (total 6 columns):
 #   Column          Non-Null Count  Dtype 
---  ------          --------------  ----- 
 0   Insert Date     4949 non-null   object
 1   Language        4929 non-null   object
 2   Title           4949 non-null   object
 3   URL Platform    4947 non-null   object
 4   Research Areas  4849 non-null   object
 5   Sdg             4849 non-null   object
dtypes: object(6)
memory usage: 232.1+ KB

Out[3]:

	count	unique	top	freq
Insert Date	4949	56	2021-01-10	1095
Language	4929	58	English	2771
Title	4949	4946	DigiVol	2
URL Platform	4947	4623	["https://eu-citizen.science/projects"]	114
Research Areas	4849	4502	[]	30
Sdg	4849	2628	[]	2083
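The info() output above reports every column as object, including Insert Date, and a few columns have missing values. The sketch below shows one way to convert the date column and count gaps per column. It runs on a small made-up frame that reuses two of the catalog's column names; the ISO date format is an assumption based on the values in the preview.

```python
import pandas as pd

# Toy frame with the catalog's column names; the values are made up.
toy = pd.DataFrame({
    'Insert Date': ['2020-06-16', '2021-01-10', 'not a date'],
    'Language': ['German', 'English', None],
})

# errors='coerce' turns unparseable strings into NaT instead of raising.
toy['Insert Date'] = pd.to_datetime(
    toy['Insert Date'], format='%Y-%m-%d', errors='coerce'
)

missing = toy.isna().sum()  # missing values per column
print(toy.dtypes)
print(missing)
```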

A first chart¶

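Histogram of the first numeric column; swap it for the variable you care about. In this catalog every column is read as text (object dtype), so the cell prints the fallback message instead of drawing a chart.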

In [4]:

import matplotlib.pyplot as plt

num = df.select_dtypes('number')
if num.shape[1]:
    col = num.columns[0]
    num[col].plot.hist(bins=50, figsize=(8, 4), title=col)
    plt.tight_layout()
else:
    print('No direct numeric columns: explore df on your own.')

No direct numeric columns: explore df on your own.
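With no numeric column available, a bar chart of category counts is one possible first plot. The sketch below uses a short made-up Series as a stand-in for df['Language']; the chart title and axis label are illustrative choices.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Stand-in for df['Language']; the values here are illustrative only.
languages = pd.Series(['English', 'German', 'English', 'Spanish', 'English', None])

# value_counts() drops missing values by default; keep the 10 most frequent.
counts = languages.value_counts().head(10)

# Sort ascending so the largest bar ends up at the top of the horizontal chart.
ax = counts.sort_values().plot.barh(figsize=(8, 4), title='Projects per language')
ax.set_xlabel('Number of projects')
plt.tight_layout()
```

On the real data, replacing languages with df['Language'] should give the same kind of chart for the catalog.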

Working with data larger than memory¶

Your session has a 4 GB RAM limit, but you can analyse files of 10 GB or more without loading them whole:

Read only the columns you need: pd.read_csv(f, usecols=[...]) / pd.read_parquet(f, columns=[...]).

Process in chunks and keep only the result:

```
total = 0
for chunk in pd.read_csv(file, chunksize=1_000_000):
    total += len(chunk)
```

Query with SQL without loading anything — DuckDB (already installed) reads CSV and Parquet straight from disk and only brings the result into memory:
```
import duckdb
duckdb.sql("SELECT column, count(*) FROM '/srv/data/.../file.parquet' GROUP BY column").df()
```
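The first two tips can be combined. The sketch below reads a single column in chunks and accumulates a frequency table, so only one chunk plus the running counts is in memory at a time. It writes a tiny temporary CSV so that it runs anywhere; the file name, the column names and the chunk size of 2 are placeholders chosen for the demonstration.

```python
import tempfile
from collections import Counter
from pathlib import Path

import pandas as pd

# Write a tiny CSV so the example is self-contained; in practice `tmp`
# would be a large file under /srv/data/.
tmp = Path(tempfile.mkdtemp()) / 'toy.csv'
pd.DataFrame({
    'Language': ['English', 'German', 'English', 'Spanish', 'English'],
    'Title': list('abcde'),
}).to_csv(tmp, index=False)

counts = Counter()
# usecols loads only the column being aggregated; chunksize bounds memory use.
for chunk in pd.read_csv(tmp, usecols=['Language'], chunksize=2):
    counts.update(chunk['Language'].value_counts().to_dict())

print(counts.most_common())
```

A Counter works here because per-chunk counts can simply be added together. Statistics that cannot be merged this way, such as a median, need a different approach, for example the SQL route described above.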

Your turn¶

This is just the starting point. Some ideas:

Check the dataset challenge on its CSDH data sheet.
Work on a copy: right-click the file → Duplicate (or Save Notebook As…). Your changes only live in your Hub space — they're never pushed to GitHub.
Edited this notebook and want the original back? Use the Restore cell below (or the restore.ipynb notebook).
Questions and results: on the platform forum.

Attribution: data from CS Track Database: Citizen Science Projects Catalog, license CC-BY-4.0. Notebook from the Citizen Science Data Hub (CSDH) — Fundación Ibercivis.

In [5]:

# ⚠️ RESTORE: this DISCARDS YOUR CHANGES to this notebook and resets it to the original.
# 1. Uncomment the line below (remove the #)   2. Run this cell
# 3. Then: menu File → Reload Notebook from Disk

# !git -C ~/citizen-science-data fetch -q origin && git -C ~/citizen-science-data checkout origin/main -- cs-track-database.ipynb && echo "Restored. Now: File → Reload Notebook from Disk"