CS Track Database: Citizen Science Projects Catalog¶

Category: Citizen Science Metadata · Size: 2.4 MB · Format: CSV · License: CC-BY-4.0 · Zenodo record · Data sheet on the CSDH

Comprehensive database of citizen science projects with titles, topic areas, languages, URLs and alignment with the UN Sustainable Development Goals.

The data is mounted read-only at /srv/data/cs-track-database/. Save anything you produce in your personal folder (~/).

What's in the dataset¶

In [1]:
from pathlib import Path

DATA = Path('/srv/data/cs-track-database')

for f in sorted(DATA.rglob('*')):
    if f.is_file():
        print(f"{f.relative_to(DATA)}  ({f.stat().st_size/1e6:,.1f} MB)")
CSTrack_database.csv  (2.4 MB)

Load the data¶

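The dataset comes as CSV. Real-world CSV files are not always uniform: the separator may be , or ; and the encoding UTF-8 or Latin-1. The loader below therefore lets pandas infer the separator (sep=None) and tries UTF-8 before falling back to Latin-1. We cap the read at 100,000 rows to explore quickly; this file has only 4,949 rows, so the cap has no effect here. Drop nrows when you want to work with everything in a larger file.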

In [2]:
import pandas as pd

# Plain CSVs first, then gzip-compressed ones (a bare '*.gz' pattern would list '*.csv.gz' twice).
csvs = sorted(DATA.rglob('*.csv')) + sorted(DATA.rglob('*.csv.gz'))
print('Using:', csvs[0].name)

def load_csv(path, **kw):
    """Robust reader: detects the separator and tries utf-8 then latin-1."""
    for enc in ('utf-8', 'latin-1'):
        try:
            return pd.read_csv(path, sep=None, engine='python', encoding=enc, **kw)
        except UnicodeDecodeError:
            continue

df = load_csv(csvs[0], nrows=100_000)
df.head()
Using: CSTrack_database.csv
Out[2]:
  | Insert Date | Language | Title | URL Platform | Research Areas | Sdg
0 | 2020-06-16 | German | Virenmonitoring | https://www.citizen-science.at/projekte/virenm... | ["Life Sciences & Biomedicine, Virology, 0.962... | []
1 | 2020-06-16 | German | Deutsch in Österreich | https://www.citizen-science.at/projekte/deutsc... | ["Social Sciences, Linguistics, 0.716406357773... | []
2 | 2020-06-16 | German | Citree | https://www.citizen-science.at/projekte/citree | ["Physical Sciences, Sustainability Science, 0... | ["SDG, SDG #11, 0.37277650533619905" "SDG, SDG...
3 | 2020-06-16 | German | Fossilfinder | ["https://www.citizen-science.at/projekte/foss... | ["Technology, Remote Sensing, 0.40562888841322... | ["SDG, SDG #1, 0.2798758512519837" "SDG, SDG #...
4 | 2020-06-16 | German | Roadkill | ["https://boku.ac.at/citizen-science/projekte"... | ["Life Sciences & Biomedicine, Veterinary Scie... | ["SDG, SDG #15, 0.26473126252405066"]
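
The Research Areas and Sdg columns arrive as strings rather than Python lists. Judging only from the preview above, each cell looks like a bracketed run of double-quoted items of the form "group, label, score", and the items seem to be separated by whitespace rather than commas, so a strict JSON parser would probably reject them. The sketch below works under those assumptions. parse_scored_items is a hypothetical helper, not part of the dataset tooling, and the two-item sample cell is made up in the same style.

```python
import re

# Assumption: every item is a double-quoted string "group, label, score".
ITEM = re.compile(r'"([^"]*)"')

def parse_scored_items(cell):
    """Turn one raw cell into a list of (group, label, score) tuples."""
    if not isinstance(cell, str):      # missing values arrive as NaN (a float)
        return []
    items = []
    for raw in ITEM.findall(cell):
        # Split from the right so commas inside the group name are preserved.
        parts = raw.rsplit(", ", 2)
        if len(parts) != 3:
            continue                   # skip anything that does not fit the pattern
        group, label, score = parts
        try:
            items.append((group, label, float(score)))
        except ValueError:
            continue                   # the last field was not a number
    return items

# Value copied from row 4 of the preview.
print(parse_scored_items('["SDG, SDG #15, 0.26473126252405066"]'))
# Made-up two-item cell in the same style (illustrative values only).
print(parse_scored_items('["SDG, SDG #11, 0.5" "SDG, SDG #13, 0.25"]'))
# An empty list stays empty.
print(parse_scored_items('[]'))
```

In the notebook this could be applied with df['Sdg'].apply(parse_scored_items). Labels that themselves contain ", " would be split in the wrong place, so it is worth spot-checking a few parsed rows against the raw text.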

First look¶

Shape, types and basic statistics.

In [3]:
df.info()
df.describe(include='all').T.head(20)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4949 entries, 0 to 4948
Data columns (total 6 columns):
 #   Column          Non-Null Count  Dtype 
---  ------          --------------  ----- 
 0   Insert Date     4949 non-null   object
 1   Language        4929 non-null   object
 2   Title           4949 non-null   object
 3   URL Platform    4947 non-null   object
 4   Research Areas  4849 non-null   object
 5   Sdg             4849 non-null   object
dtypes: object(6)
memory usage: 232.1+ KB
Out[3]:
               | count | unique | top | freq
Insert Date    | 4949 | 56 | 2021-01-10 | 1095
Language       | 4929 | 58 | English | 2771
Title          | 4949 | 4946 | DigiVol | 2
URL Platform   | 4947 | 4623 | ["https://eu-citizen.science/projects"] | 114
Research Areas | 4849 | 4502 | [] | 30
Sdg            | 4849 | 2628 | [] | 2083
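
The summary hints at a few follow-up checks: every column is stored as object, several columns have missing values, and Title has slightly fewer unique values than rows. A minimal sketch of those checks follows. It runs on a tiny stand-in frame so that it is self-contained; the stand-in values are illustrative, only the column names come from the dataset, and quality_report is a hypothetical helper.

```python
import pandas as pd

# Stand-in frame with the dataset's column names (values are illustrative).
sample = pd.DataFrame({
    "Insert Date": ["2020-06-16", "2020-06-16", "2021-01-10", "2021-01-10"],
    "Language": ["German", "English", None, "English"],
    "Title": ["Roadkill", "DigiVol", "Citree", "DigiVol"],
})

def quality_report(frame):
    """Return missing-value counts, repeated titles and the date range."""
    dates = pd.to_datetime(frame["Insert Date"], errors="coerce")  # unparseable dates become NaT
    repeated = frame[frame.duplicated("Title", keep=False)]        # every row that shares a title
    return {
        "missing": frame.isna().sum().to_dict(),
        "repeated_titles": sorted(repeated["Title"].unique()),
        "first_date": dates.min(),
        "last_date": dates.max(),
    }

report = quality_report(sample)
print(report)
```

In the notebook, quality_report(df) would run the same checks on the loaded data. A repeated title is not necessarily a duplicate project, so comparing the URL Platform values of the matching rows is a sensible next step.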

A first chart¶

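Histogram of the first numeric column; swap it for the variable you care about. In this dataset all six columns are stored as text (object dtype), so the cell prints the fallback message instead of drawing a chart.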

In [4]:
import matplotlib.pyplot as plt

num = df.select_dtypes('number')
if num.shape[1]:
    col = num.columns[0]
    num[col].plot.hist(bins=50, figsize=(8, 4), title=col)
    plt.tight_layout()
else:
    print('No direct numeric columns: explore df on your own.')
No direct numeric columns: explore df on your own.
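
With no numeric columns available, a bar chart of category counts is one possible substitute for the histogram. The sketch below counts the most frequent values of a text column and, optionally, plots them. top_values and plot_top_values are hypothetical helpers, and the stand-in series holds illustrative values rather than real rows.

```python
import pandas as pd

def top_values(series, n=10):
    """Count the n most frequent values, ignoring missing entries."""
    return series.dropna().value_counts().head(n)

def plot_top_values(series, n=10):
    """Draw a horizontal bar chart of the n most frequent values."""
    import matplotlib.pyplot as plt  # imported here so counting works without matplotlib

    counts = top_values(series, n).sort_values()  # ascending, so the largest bar ends up on top
    ax = counts.plot.barh(figsize=(8, 4), title=f"Top {n} values of {series.name}")
    ax.set_xlabel("Number of projects")
    plt.tight_layout()
    return ax

# Stand-in data (illustrative values, not taken from the dataset).
languages = pd.Series(
    ["English", "German", "English", None, "Spanish", "English"],
    name="Language",
)
print(top_values(languages, n=2))
```

In the notebook, plot_top_values(df['Language']) would chart the most common project languages, and the same call should work for Insert Date.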

Working with data larger than memory¶

Your session has a 4 GB RAM limit, but you can analyse files of 10 GB or more without loading them whole:

  • Read only the columns you need: pd.read_csv(f, usecols=[...]) / pd.read_parquet(f, columns=[...]).
  • Process in chunks and keep only the result:
    total = 0
    for chunk in pd.read_csv(file, chunksize=1_000_000):
        total += len(chunk)
    
  • Query with SQL without loading anything — DuckDB (already installed) reads CSV and Parquet straight from disk and only brings the result into memory:
    import duckdb
    duckdb.sql("SELECT column, count(*) FROM '/srv/data/.../file.parquet' GROUP BY column").df()
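
The chunked loop above counts rows; the same pattern extends to per-value counts. The sketch below combines usecols with chunksize so that only one column of one chunk is in memory at a time. count_by_column is a hypothetical helper, and the demo CSV is a throwaway file with illustrative values, written only so the example runs anywhere.

```python
import tempfile
from collections import Counter
from pathlib import Path

import pandas as pd

def count_by_column(path, column, chunksize=1_000_000):
    """Count the values of one column while holding a single chunk in memory."""
    totals = Counter()
    # usecols keeps the other columns out of the resulting frames.
    for chunk in pd.read_csv(path, usecols=[column], chunksize=chunksize):
        # value_counts skips missing values; Counter.update adds the counts up.
        totals.update(chunk[column].value_counts().to_dict())
    return totals

# Small throwaway CSV (illustrative values only).
with tempfile.TemporaryDirectory() as tmp:
    demo = Path(tmp) / "demo.csv"
    pd.DataFrame({
        "Language": ["English", "German", "English", "Spanish", "English"],
        "Title": ["a", "b", "c", "d", "e"],
    }).to_csv(demo, index=False)
    counts = count_by_column(demo, "Language", chunksize=2)  # tiny chunks force several iterations

print(counts.most_common())
```

Only the running totals survive between iterations, so memory use depends on the chunk size and on the number of distinct values, not on the size of the file.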
    

Your turn¶

This is just the starting point. Some ideas:

  • Check the dataset challenge on its CSDH data sheet.
  • Work on a copy: right-click the file → Duplicate (or Save Notebook As…). Your changes only live in your Hub space — they're never pushed to GitHub.
  • Edited this notebook and want the original back? Use the Restore cell below (or the restore.ipynb notebook).
  • Questions and results: on the platform forum.

Attribution: data from CS Track Database: Citizen Science Projects Catalog, license CC-BY-4.0. Notebook from the Citizen Science Data Hub (CSDH) — Fundación Ibercivis.

In [5]:
# ⚠️ RESTORE: this DISCARDS YOUR CHANGES to this notebook and resets it to the original.
# 1. Uncomment the line below (remove the #)   2. Run this cell
# 3. Then: menu File → Reload Notebook from Disk

# !git -C ~/citizen-science-data fetch -q origin && git -C ~/citizen-science-data checkout origin/main -- cs-track-database.ipynb && echo "Restored. Now: File → Reload Notebook from Disk"