SuperWASP Variable Star Photometry Archive (VeSPA)
Category: Astronomy · Size: 137.5 MB · Format: CSV, YAML · License: CC-BY-4.0 · Zenodo record · Data sheet on the CSDH
Metadata for periodic variable stars classified by citizen scientists in the Zooniverse SuperWASP Variable Stars project, with folded light curves and period parameters.
The data is mounted read-only at /srv/data/superwasp-vespa/.
Save anything you produce in your personal folder (~/).
What's in the dataset
from pathlib import Path
DATA = Path('/srv/data/superwasp-vespa')
for f in sorted(DATA.rglob('*')):
    if f.is_file():
        print(f"{f.relative_to(DATA)} ({f.stat().st_size/1e6:,.1f} MB)")
export.csv (137.5 MB)
fields.yaml (0.0 MB)
params.yaml (0.0 MB)
Load the data
The dataset comes as CSV. Real-world CSV files are not always uniform (the separator may be , or ;, the encoding UTF-8 or Latin-1), so we use a loader that detects the separator and tries both encodings. We limit the read to 100,000 rows to explore quickly; drop nrows when you want to work with everything.
import pandas as pd
# Plain and gzip-compressed CSVs. A bare '*.gz' pattern is left out: it would
# list every '.csv.gz' file twice and also match non-CSV archives.
csvs = sorted(DATA.rglob('*.csv')) + sorted(DATA.rglob('*.csv.gz'))
print('Using:', csvs[0].name)
def load_csv(path, **kw):
    """Robust reader: detects the separator and tries utf-8 then latin-1."""
    for enc in ('utf-8', 'latin-1'):
        try:
            # sep=None with the Python engine lets pandas sniff the separator
            return pd.read_csv(path, sep=None, engine='python', encoding=enc, **kw)
        except UnicodeDecodeError:
            continue
df = load_csv(csvs[0], nrows=100_000)
df.head()
Using: export.csv
| | SuperWASP ID | Period Length | RA | Dec | Maximum magnitude | Minimum magnitude | Mean magnitude | Amplitude | Classification | Classification count | Folding flag | Sigma | Chi squared | FITS URL | JSON URL | Unfolded plot URL | Folded plot URL |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1SWASPJ000126.73-001344.2 | 24007.58594 | 0h01m26.73s | -0d13m44.2s | 14.023840 | 21.350196 | 14.870849 | 7.326356 | Pulsator | 4 | Uncertain | 6.22 | 37.37 | https://www.superwasp.org/media/sources/1SWASP... | https://www.superwasp.org/media/sources/1SWASP... | https://www.superwasp.org/media/sources/1SWASP... | https://www.superwasp.org/media/sources/1SWASP... |
| 1 | 1SWASPJ000236.45+002446.3 | 20553.37500 | 0h02m36.45s | 0d24m46.3s | 13.998806 | 16.670559 | 14.678380 | 2.671753 | EA/EB | 8 | Half | 5.88 | 30.41 | https://www.superwasp.org/media/sources/1SWASP... | https://www.superwasp.org/media/sources/1SWASP... | https://www.superwasp.org/media/sources/1SWASP... | https://www.superwasp.org/media/sources/1SWASP... |
| 2 | 1SWASPJ000236.45+002446.3 | 16595.36914 | 0h02m36.45s | 0d24m46.3s | 13.998806 | 16.670559 | 14.678380 | 2.671753 | EW | 7 | Half | 4.51 | 21.07 | https://www.superwasp.org/media/sources/1SWASP... | https://www.superwasp.org/media/sources/1SWASP... | https://www.superwasp.org/media/sources/1SWASP... | https://www.superwasp.org/media/sources/1SWASP... |
| 3 | 1SWASPJ000236.45+002446.3 | 27404.40430 | 0h02m36.45s | 0d24m46.3s | 13.998806 | 16.670559 | 14.678380 | 2.671753 | EA/EB | 7 | Certain | 4.90 | 25.00 | https://www.superwasp.org/media/sources/1SWASP... | https://www.superwasp.org/media/sources/1SWASP... | https://www.superwasp.org/media/sources/1SWASP... | https://www.superwasp.org/media/sources/1SWASP... |
| 4 | 1SWASPJ000236.20+002516.3 | 27404.29492 | 0h02m36.2s | 0d25m16.3s | 14.079385 | 17.870184 | 14.824788 | 3.790799 | EA/EB | 7 | Certain | 4.54 | 21.15 | https://www.superwasp.org/media/sources/1SWASP... | https://www.superwasp.org/media/sources/1SWASP... | https://www.superwasp.org/media/sources/1SWASP... | https://www.superwasp.org/media/sources/1SWASP... |
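The RA and Dec columns arrive as sexagesimal strings (0h01m26.73s, -0d13m44.2s), which pandas reads as plain text. Below is a minimal sketch of converting them to decimal degrees, assuming every value follows the pattern visible in the preview rows. The helper names ra_to_deg and dec_to_deg are illustrative; they are not part of the dataset or of any library.

```python
import re

# Patterns inferred from the preview rows only, e.g. "0h01m26.73s" and "-0d13m44.2s".
_RA = re.compile(r"^(\d+)h(\d+)m([\d.]+)s$")
_DEC = re.compile(r"^([+-]?)(\d+)d(\d+)m([\d.]+)s$")


def ra_to_deg(text):
    """Right ascension 'XhYmZs' -> decimal degrees (1 hour = 15 degrees)."""
    m = _RA.match(text.strip())
    if m is None:
        raise ValueError(f"unexpected RA format: {text!r}")
    hours, minutes, seconds = m.groups()
    return 15.0 * (int(hours) + int(minutes) / 60 + float(seconds) / 3600)


def dec_to_deg(text):
    """Declination '[+-]XdYmZs' -> decimal degrees."""
    m = _DEC.match(text.strip())
    if m is None:
        raise ValueError(f"unexpected Dec format: {text!r}")
    sign, degrees, minutes, seconds = m.groups()
    value = int(degrees) + int(minutes) / 60 + float(seconds) / 3600
    # The sign is read from the string itself: "-0d13m44.2s" is negative
    # even though its degrees field parses to 0.
    return -value if sign == "-" else value


print(ra_to_deg("0h01m26.73s"))
print(dec_to_deg("-0d13m44.2s"))
```

On the loaded frame the helpers could be applied with df['RA'].map(ra_to_deg) and df['Dec'].map(dec_to_deg). Raising on an unexpected format is a deliberate choice here, so that rows not matching the assumed pattern surface immediately instead of silently becoming wrong coordinates.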
First look
Shape, types and basic statistics.
df.info()
df.describe(include='all').T.head(20)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100000 entries, 0 to 99999
Data columns (total 17 columns):
 #   Column                Non-Null Count   Dtype
---  ------                --------------   -----
 0   SuperWASP ID          100000 non-null  object
 1   Period Length         100000 non-null  float64
 2   RA                    100000 non-null  object
 3   Dec                   100000 non-null  object
 4   Maximum magnitude     100000 non-null  float64
 5   Minimum magnitude     96941 non-null   float64
 6   Mean magnitude        100000 non-null  float64
 7   Amplitude             96941 non-null   float64
 8   Classification        100000 non-null  object
 9   Classification count  100000 non-null  int64
 10  Folding flag          100000 non-null  object
 11  Sigma                 100000 non-null  float64
 12  Chi squared           100000 non-null  float64
 13  FITS URL              100000 non-null  object
 14  JSON URL              100000 non-null  object
 15  Unfolded plot URL     100000 non-null  object
 16  Folded plot URL       100000 non-null  object
dtypes: float64(7), int64(1), object(9)
memory usage: 13.0+ MB
/opt/tljh/user/lib/python3.10/site-packages/pandas/core/nanops.py:1016: RuntimeWarning: invalid value encountered in subtract
  sqr = _ensure_numeric((avg - values) ** 2)
/opt/tljh/user/lib/python3.10/site-packages/pandas/core/nanops.py:1016: RuntimeWarning: invalid value encountered in subtract
  sqr = _ensure_numeric((avg - values) ** 2)
| | count | unique | top | freq | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|---|---|---|
| SuperWASP ID | 100000 | 61358 | 1SWASPJ005420.20+391224.1 | 15 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| Period Length | 100000.0 | NaN | NaN | NaN | 885344.136422 | 2205434.898275 | 3597.0249 | 19721.416992 | 37551.52539 | 290681.210938 | 37475808.0 |
| RA | 100000 | 60595 | 4h03m36.4s | 15 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| Dec | 100000 | 60820 | 7d00m59.5s | 15 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| Maximum magnitude | 100000.0 | NaN | NaN | NaN | 12.323996 | 1.303957 | 6.016183 | 11.516335 | 12.516113 | 13.311662 | 16.555348 |
| Minimum magnitude | 96941.0 | NaN | NaN | NaN | inf | NaN | 6.675255 | 12.184347 | 13.399917 | 14.720874 | inf |
| Mean magnitude | 100000.0 | NaN | NaN | NaN | 12.7389 | 1.444923 | 6.502205 | 11.81065 | 12.870022 | 13.788994 | 17.87681 |
| Amplitude | 96941.0 | NaN | NaN | NaN | inf | NaN | 0.05284 | 0.483938 | 0.869866 | 1.5963 | inf |
| Classification | 100000 | 5 | Rotator | 32737 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| Classification count | 100000.0 | NaN | NaN | NaN | 3.0714 | 2.545951 | 1.0 | 1.0 | 2.0 | 6.0 | 10.0 |
| Folding flag | 100000 | 3 | Certain | 67610 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| Sigma | 100000.0 | NaN | NaN | NaN | 5.079453 | 1.172522 | 3.5 | 4.13 | 4.84 | 5.8 | 10.7 |
| Chi squared | 100000.0 | NaN | NaN | NaN | 293.435658 | 975.328887 | 20.0 | 42.61 | 86.52 | 214.9025 | 72992.79 |
| FITS URL | 100000 | 61358 | https://www.superwasp.org/media/sources/1SWASP... | 15 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| JSON URL | 100000 | 61358 | https://www.superwasp.org/media/sources/1SWASP... | 15 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| Unfolded plot URL | 100000 | 61358 | https://www.superwasp.org/media/sources/1SWASP... | 15 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| Folded plot URL | 100000 | 100000 | https://www.superwasp.org/media/sources/1SWASP... | 1 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
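The summary reports inf as the mean and max of Minimum magnitude and Amplitude, which is also what triggers the RuntimeWarning while pandas computes the standard deviation. Below is a minimal sketch of masking infinite values before computing statistics, shown on a small frame with invented values rather than on the real data. Treating inf as "missing" is an assumption about what those entries mean; the dataset's own documentation (fields.yaml) should be checked before relying on it.

```python
import numpy as np
import pandas as pd

# Toy frame with invented values, mimicking the two affected columns.
toy = pd.DataFrame({
    "Minimum magnitude": [13.4, np.inf, 12.1, np.nan],
    "Amplitude": [0.9, np.inf, 0.5, np.nan],
})

# Count the infinite entries first, so the masking step is not silent.
n_inf = int(np.isinf(toy.to_numpy()).sum())
print(f"infinite entries masked: {n_inf}")

# Replace +/-inf with NaN so mean, std and max skip them like other missing values.
clean = toy.replace([np.inf, -np.inf], np.nan)
print(clean.describe().T[["count", "mean", "max"]])
```

The same replace call can be applied to the loaded frame (df.replace([np.inf, -np.inf], np.nan)); columns of other types are left unchanged.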
A first chart
Histogram of the first numeric column — swap it for the variable you care about.
import matplotlib.pyplot as plt
num = df.select_dtypes('number')
if num.shape[1]:
    col = num.columns[0]
    num[col].plot.hist(bins=50, figsize=(8, 4), title=col)
    plt.tight_layout()
else:
    print('No direct numeric columns: explore df on your own.')
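In this dataset the first numeric column is Period Length, whose values run from about 3,597 to 37,475,808 in the summary table, so a linear histogram puts almost every row in the first bin. Below is a sketch of a histogram with logarithmically spaced bins, run on synthetic values so that it stands alone. The conversion to days assumes Period Length is given in seconds, which is not stated in this notebook and should be confirmed in fields.yaml. The helper name log_bins is illustrative.

```python
import numpy as np
import matplotlib.pyplot as plt


def log_bins(values, n_bins=50):
    """Return n_bins + 1 log-spaced edges spanning the positive, finite values."""
    v = np.asarray(values, dtype=float)
    v = v[np.isfinite(v) & (v > 0)]
    return np.logspace(np.log10(v.min()), np.log10(v.max()), n_bins + 1)


# Synthetic stand-in for df['Period Length']; the real column would replace it.
rng = np.random.default_rng(0)
period_s = 10 ** rng.uniform(3.5, 7.5, size=5_000)

# ASSUMPTION: periods are in seconds, so dividing by 86,400 gives days.
period_d = period_s / 86_400

edges = log_bins(period_d)
fig, ax = plt.subplots(figsize=(8, 4))
ax.hist(period_d, bins=edges)
ax.set_xscale("log")
ax.set_xlabel("Period (days, assuming seconds in the source)")
ax.set_ylabel("Rows")
fig.tight_layout()
```

Log-spaced edges give each decade of period the same width on the axis, which makes short-period and long-period populations visible in the same figure.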
Working with data larger than memory
Your session has a 4 GB RAM limit, but you can analyse files of 10 GB or more without loading them whole:
- Read only the columns you need: pd.read_csv(f, usecols=[...]) / pd.read_parquet(f, columns=[...]).
- Process in chunks and keep only the result:
  total = 0
  for chunk in pd.read_csv(file, chunksize=1_000_000):
      total += len(chunk)
- Query with SQL without loading anything: DuckDB (already installed) reads CSV and Parquet straight from disk and only brings the result into memory:
  import duckdb
  duckdb.sql("SELECT column, count(*) FROM '/srv/data/.../file.parquet' GROUP BY column").df()
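The chunked loop in the list only counts rows. Below is a slightly fuller sketch, assuming the goal is category frequencies (for instance of the Classification column) without holding the whole file in memory: each chunk is reduced to a small tally, and only the tallies are accumulated. The function name count_values_chunked is hypothetical, and the demonstration reads a tiny in-memory CSV with invented rows instead of export.csv.

```python
import io
from collections import Counter

import pandas as pd


def count_values_chunked(source, column, chunksize=100_000):
    """Tally the values of one column by streaming a CSV in chunks."""
    total = Counter()
    # usecols keeps each chunk small: only the requested column is parsed.
    for chunk in pd.read_csv(source, usecols=[column], chunksize=chunksize):
        total.update(chunk[column].value_counts().to_dict())
    return pd.Series(total, dtype="int64").sort_values(ascending=False)


# Invented rows, used only to show the function running end to end.
demo = io.StringIO(
    "SuperWASP ID,Classification\n"
    "a,EW\nb,Rotator\nc,EW\nd,Pulsator\ne,EW\n"
)
counts = count_values_chunked(demo, "Classification", chunksize=2)
print(counts)
```

Memory use is bounded by the chunk size plus the number of distinct values, so the same call should scale to files far larger than the session limit as long as the column has few categories.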
Your turn
This is just the starting point. Some ideas:
- Check the dataset challenge on its CSDH data sheet.
- Work on a copy: right-click the file → Duplicate (or Save Notebook As…). Your changes only live in your Hub space; they're never pushed to GitHub.
- Edited this notebook and want the original back? Use the Restore cell below (or the restore.ipynb notebook).
- Questions and results: on the platform forum.
Attribution: data from SuperWASP Variable Star Photometry Archive (VeSPA), license CC-BY-4.0. Notebook from the Citizen Science Data Hub (CSDH) — Fundación Ibercivis.
# ⚠️ RESTORE: this DISCARDS YOUR CHANGES to this notebook and resets it to the original.
# 1. Uncomment the line below (remove the #)
# 2. Run this cell
# 3. Then: menu File → Reload Notebook from Disk
# !git -C ~/citizen-science-data fetch -q origin && git -C ~/citizen-science-data checkout origin/main -- superwasp-vespa.ipynb && echo "Restored. Now: File → Reload Notebook from Disk"