TESS Network Motivation Survey (Light Pollution Citizen Science)
Category: Light Pollution · Size: 5.4 MB · Format: CSV, HTML, JSON, R, TTL · License: CC-BY-4.0 · Zenodo record · Data sheet on the CSDH
Motivation survey data from ~120 volunteers hosting TESS photometers to measure night-sky brightness and light pollution at a global scale.
The data is mounted read-only at /srv/data/tess-light-pollution/.
Save anything you produce in your personal folder (~/).
What's in the dataset
from pathlib import Path

DATA = Path('/srv/data/tess-light-pollution')

# List every file in the dataset with its size in MB
for f in sorted(DATA.rglob('*')):
    if f.is_file():
        print(f"{f.relative_to(DATA)} ({f.stat().st_size/1e6:,.1f} MB)")
ro-crate-metadata.json (0.0 MB)
ro-crate-preview.html (0.0 MB)
tess-network-analysis-results.csv (0.0 MB)
tess-network-analysis-script.R (0.0 MB)
tess-network-procedure.ttl (0.1 MB)
tess-network-results.csv (0.5 MB)
tess-network-results.ttl (2.3 MB)
tess-network-survey.ttl (2.4 MB)
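The listing includes an ro-crate-metadata.json file. RO-Crate metadata is JSON-LD whose entities sit in a top-level "@graph" list, so a quick way to see how the files are described is to print each entity's identifier, type and name. The sketch below is a minimal reader under that assumption; the helper name list_crate_entities is hypothetical, and the actual fields present in this particular crate have not been checked here.

```python
import json
from pathlib import Path


def list_crate_entities(crate_path):
    """Return (id, type, name) for every entity in an RO-Crate metadata file."""
    with open(crate_path, encoding="utf-8") as fh:
        crate = json.load(fh)
    rows = []
    # RO-Crate keeps all described entities in the top-level "@graph" list
    for entity in crate.get("@graph", []):
        rows.append((entity.get("@id"), entity.get("@type"), entity.get("name")))
    return rows


# Only runs where the dataset is mounted; elsewhere the block does nothing
crate_file = Path("/srv/data/tess-light-pollution/ro-crate-metadata.json")
if crate_file.exists():
    for entity_id, entity_type, name in list_crate_entities(crate_file):
        print(entity_id, entity_type, name, sep=" | ")
```

Entities without a name simply show None in the last column.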
Load the data
The tabular part of the dataset comes as CSV. Real-world CSV files are not always uniform (the separator may be , or ; and the encoding UTF-8 or Latin-1), so we use a loader that sniffs the separator and tries both encodings. We read at most 100,000 rows to explore quickly; drop nrows when you want to work with everything.
import pandas as pd

# Plain and gzip-compressed CSV files (pandas infers gzip from the extension)
csvs = sorted(DATA.rglob('*.csv')) + sorted(DATA.rglob('*.csv.gz'))
print('Using:', csvs[0].name)

def load_csv(path, **kw):
    """Robust reader: detects the separator and tries utf-8 then latin-1."""
    for enc in ('utf-8', 'latin-1'):
        try:
            # sep=None makes the python engine sniff the separator
            return pd.read_csv(path, sep=None, engine='python', encoding=enc, **kw)
        except UnicodeDecodeError:
            continue

df = load_csv(csvs[0], nrows=100_000)
df.head()
Using: tess-network-analysis-results.csv
| | variable | mean | stdev | correlation | pvalue | significance |
|---|---|---|---|---|---|---|
| 0 | achievement | 4.160256 | 0.815808 | 0.423953 | 1.097718e-04 | *** |
| 1 | belongingness | 3.730769 | 0.710278 | 0.456155 | 2.702843e-05 | *** |
| 2 | benevolence | 4.461538 | 0.596359 | 0.619622 | 1.463416e-09 | *** |
| 3 | conformity | 2.307692 | 0.957644 | 0.075494 | 5.112308e-01 | NaN |
| 4 | hedonism | 4.205128 | 0.811588 | 0.588373 | 1.471785e-08 | *** |
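To see what the loader's two fallbacks do, here is a minimal, self-contained sketch that writes a small semicolon-separated, Latin-1 encoded file to a temporary folder and reads it back with the same approach. The sample rows are made up for the demonstration and are not TESS data; the function is repeated so the block runs on its own.

```python
import tempfile
from pathlib import Path

import pandas as pd


def load_csv(path, **kw):
    """Same idea as the loader above: sniff the separator, try utf-8 then latin-1."""
    for enc in ("utf-8", "latin-1"):
        try:
            return pd.read_csv(path, sep=None, engine="python", encoding=enc, **kw)
        except UnicodeDecodeError:
            continue


with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "sample.csv"
    # Made-up rows: ';' as separator and accented letters stored as Latin-1 bytes
    text = "city;value\nMálaga;1.5\nCórdoba;2.5\nSevilla;3.0\n"
    sample.write_bytes(text.encode("latin-1"))
    demo = load_csv(sample)

print(demo.columns.tolist())
print(demo)
```

The utf-8 attempt fails on the accented bytes, so the second pass with latin-1 is the one that returns the frame. Note that latin-1 can decode any byte sequence, so a file in some other encoding would load without an error but with wrong characters.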
First look
Shape, types and basic statistics.
df.info()
df.describe(include='all').T.head(20)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10 entries, 0 to 9
Data columns (total 6 columns):
 #   Column        Non-Null Count  Dtype
---  ------        --------------  -----
 0   variable      10 non-null     object
 1   mean          10 non-null     float64
 2   stdev         10 non-null     float64
 3   correlation   10 non-null     float64
 4   pvalue        10 non-null     float64
 5   significance  8 non-null      object
dtypes: float64(4), object(2)
memory usage: 608.0+ bytes
| | count | unique | top | freq | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|---|---|---|
| variable | 10 | 10 | achievement | 1 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| mean | 10.0 | NaN | NaN | NaN | 3.776923 | 0.7744 | 2.307692 | 3.230769 | 4.169872 | 4.330128 | 4.474359 |
| stdev | 10.0 | NaN | NaN | NaN | 0.805613 | 0.144445 | 0.596359 | 0.721084 | 0.794575 | 0.815291 | 1.117624 |
| correlation | 10.0 | NaN | NaN | NaN | 0.417705 | 0.196864 | 0.075494 | 0.309438 | 0.440054 | 0.564128 | 0.672226 |
| pvalue | 10.0 | NaN | NaN | NaN | 0.070149 | 0.164198 | 0.0 | 0.000001 | 0.000068 | 0.012106 | 0.511231 |
| significance | 8 | 2 | *** | 7 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
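One way to read this summary table is to keep only the rows whose p-value falls below a chosen threshold and rank them by correlation. The sketch below rebuilds the five rows printed by df.head() so that it runs on its own. The 0.05 cutoff and the name ALPHA are assumptions rather than something the dataset prescribes, and the table itself does not say what each variable is correlated with.

```python
import pandas as pd

# The five rows shown by df.head(), copied by hand so the sketch is self-contained
summary = pd.DataFrame({
    "variable": ["achievement", "belongingness", "benevolence", "conformity", "hedonism"],
    "mean": [4.160256, 3.730769, 4.461538, 2.307692, 4.205128],
    "stdev": [0.815808, 0.710278, 0.596359, 0.957644, 0.811588],
    "correlation": [0.423953, 0.456155, 0.619622, 0.075494, 0.588373],
    "pvalue": [1.097718e-04, 2.702843e-05, 1.463416e-09, 5.112308e-01, 1.471785e-08],
})

ALPHA = 0.05  # assumed significance threshold; adjust to your own convention

# Keep rows below the threshold, strongest correlation first
ranked = (
    summary[summary["pvalue"] < ALPHA]
    .sort_values("correlation", ascending=False)
    .reset_index(drop=True)
)
print(ranked[["variable", "correlation", "pvalue"]])
```

On the real data, the same filter and sort can be applied directly to df to cover all ten variables.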
A first chart
Histogram of the first numeric column — swap it for the variable you care about.
import matplotlib.pyplot as plt

num = df.select_dtypes('number')
if num.shape[1]:
    # Plot the first numeric column
    col = num.columns[0]
    num[col].plot.hist(bins=50, figsize=(8, 4), title=col)
    plt.tight_layout()
else:
    print('No direct numeric columns: explore df on your own.')
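With only ten rows in this summary table, a histogram says little. A bar chart with one bar per variable may be easier to read. The sketch below is one possible alternative, again using the five rows printed by df.head() so it is self-contained; treating stdev as the error-bar length is a presentation choice made here, not something taken from the original analysis.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Rows copied from the df.head() output; on the hub you could use df instead
summary = pd.DataFrame({
    "variable": ["achievement", "belongingness", "benevolence", "conformity", "hedonism"],
    "mean": [4.160256, 3.730769, 4.461538, 2.307692, 4.205128],
    "stdev": [0.815808, 0.710278, 0.596359, 0.957644, 0.811588],
}).sort_values("mean")

fig, ax = plt.subplots(figsize=(8, 4))
# One horizontal bar per variable, with stdev drawn as the error bar
ax.barh(
    summary["variable"].to_numpy(),
    summary["mean"].to_numpy(),
    xerr=summary["stdev"].to_numpy(),
    capsize=3,
)
ax.set_xlabel("mean")
ax.set_title("Mean per variable (error bars: stdev)")
fig.tight_layout()
```

Sorting by mean before plotting puts the highest-scoring variable at the top of the chart.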
Working with data larger than memory
Your session has a 4 GB RAM limit, but you can analyse files of 10 GB or more without loading them whole:
- Read only the columns you need: pd.read_csv(f, usecols=[...]) / pd.read_parquet(f, columns=[...]).
- Process in chunks and keep only the result:
      total = 0
      for chunk in pd.read_csv(file, chunksize=1_000_000):
          total += len(chunk)
- Query with SQL without loading anything: DuckDB (already installed) reads CSV and Parquet straight from disk and only brings the result into memory:
      import duckdb
      duckdb.sql("SELECT column, count(*) FROM '/srv/data/.../file.parquet' GROUP BY column").df()
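The first two ideas in the list combine naturally: read a single column, in chunks, and keep only running totals. The sketch below computes a column mean that way. The helper name chunked_mean is hypothetical, and the demo file is synthetic (made-up values, not TESS data) with a deliberately tiny chunksize so that several chunks are actually processed.

```python
import tempfile
from pathlib import Path

import pandas as pd


def chunked_mean(path, column, chunksize=1_000_000):
    """Mean of one column, reading only that column and one chunk at a time."""
    total, count = 0.0, 0
    for chunk in pd.read_csv(path, usecols=[column], chunksize=chunksize):
        values = chunk[column].dropna()
        total += values.sum()   # running sum across chunks
        count += len(values)    # running count of non-missing values
    return total / count if count else float("nan")


# Demo on a small synthetic file
with tempfile.TemporaryDirectory() as tmp:
    demo_path = Path(tmp) / "demo.csv"
    pd.DataFrame({"value": range(1, 11), "other": list("abcdefghij")}).to_csv(
        demo_path, index=False
    )
    result = chunked_mean(demo_path, "value", chunksize=3)

print(result)
```

Only the running sum and count stay in memory, so peak usage depends on chunksize rather than on the file size. Statistics that need all values at once, such as an exact median, do not decompose this simply.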
Your turn
This is just the starting point. Some ideas:
- Check the dataset challenge on its CSDH data sheet.
- Work on a copy: right-click the file → Duplicate (or Save Notebook As…). Your changes only live in your Hub space — they're never pushed to GitHub.
- Edited this notebook and want the original back? Use the Restore cell below (or the restore.ipynb notebook).
- Questions and results: on the platform forum.
Attribution: data from TESS Network Motivation Survey (Light Pollution Citizen Science), license CC-BY-4.0. Notebook from the Citizen Science Data Hub (CSDH) — Fundación Ibercivis.
# ⚠️ RESTORE: this DISCARDS YOUR CHANGES to this notebook and resets it to the original.
# 1. Uncomment the last line of this cell (remove the #)
# 2. Run this cell
# 3. Then: menu File → Reload Notebook from Disk
# !git -C ~/citizen-science-data fetch -q origin && git -C ~/citizen-science-data checkout origin/main -- tess-light-pollution.ipynb && echo "Restored. Now: File → Reload Notebook from Disk"