SuperWASP Variable Star Photometry Archive (VeSPA)¶

Category: Astronomy · Size: 137.5 MB · Format: CSV, YAML · License: CC-BY-4.0 · Zenodo record · Data sheet on the CSDH

Metadata for periodic variable stars classified by citizen scientists in the Zooniverse SuperWASP Variable Stars project, with folded light curves and period parameters.

The data is mounted read-only at /srv/data/superwasp-vespa/. Save anything you produce in your personal folder (~/).

What's in the dataset¶

In [1]:
from pathlib import Path

DATA = Path('/srv/data/superwasp-vespa')

for f in sorted(DATA.rglob('*')):
    if f.is_file():
        print(f"{f.relative_to(DATA)}  ({f.stat().st_size/1e6:,.1f} MB)")
export.csv  (137.5 MB)
fields.yaml  (0.0 MB)
params.yaml  (0.0 MB)

Load the data¶

The dataset comes as CSV. CSV files in the wild are not uniform: the separator may be , or ; and the encoding UTF-8 or Latin-1, so we use a loader that detects the separator and falls back from UTF-8 to Latin-1. We read only the first 100,000 rows to keep exploration fast; drop nrows when you want to work with everything.

In [2]:
import pandas as pd

# Plain and gzip-compressed CSVs ('*.gz' alone would list .csv.gz files twice and match non-CSV archives).
csvs = sorted(DATA.rglob('*.csv')) + sorted(DATA.rglob('*.csv.gz'))
print('Using:', csvs[0].name)

def load_csv(path, **kw):
    """Robust reader: detects the separator and tries utf-8 then latin-1."""
    for enc in ('utf-8', 'latin-1'):
        try:
            return pd.read_csv(path, sep=None, engine='python', encoding=enc, **kw)
        except UnicodeDecodeError:
            continue

df = load_csv(csvs[0], nrows=100_000)
df.head()
Using: export.csv
Out[2]:
SuperWASP ID Period Length RA Dec Maximum magnitude Minimum magnitude Mean magnitude Amplitude Classification Classification count Folding flag Sigma Chi squared FITS URL JSON URL Unfolded plot URL Folded plot URL
0 1SWASPJ000126.73-001344.2 24007.58594 0h01m26.73s -0d13m44.2s 14.023840 21.350196 14.870849 7.326356 Pulsator 4 Uncertain 6.22 37.37 https://www.superwasp.org/media/sources/1SWASP... https://www.superwasp.org/media/sources/1SWASP... https://www.superwasp.org/media/sources/1SWASP... https://www.superwasp.org/media/sources/1SWASP...
1 1SWASPJ000236.45+002446.3 20553.37500 0h02m36.45s 0d24m46.3s 13.998806 16.670559 14.678380 2.671753 EA/EB 8 Half 5.88 30.41 https://www.superwasp.org/media/sources/1SWASP... https://www.superwasp.org/media/sources/1SWASP... https://www.superwasp.org/media/sources/1SWASP... https://www.superwasp.org/media/sources/1SWASP...
2 1SWASPJ000236.45+002446.3 16595.36914 0h02m36.45s 0d24m46.3s 13.998806 16.670559 14.678380 2.671753 EW 7 Half 4.51 21.07 https://www.superwasp.org/media/sources/1SWASP... https://www.superwasp.org/media/sources/1SWASP... https://www.superwasp.org/media/sources/1SWASP... https://www.superwasp.org/media/sources/1SWASP...
3 1SWASPJ000236.45+002446.3 27404.40430 0h02m36.45s 0d24m46.3s 13.998806 16.670559 14.678380 2.671753 EA/EB 7 Certain 4.90 25.00 https://www.superwasp.org/media/sources/1SWASP... https://www.superwasp.org/media/sources/1SWASP... https://www.superwasp.org/media/sources/1SWASP... https://www.superwasp.org/media/sources/1SWASP...
4 1SWASPJ000236.20+002516.3 27404.29492 0h02m36.2s 0d25m16.3s 14.079385 17.870184 14.824788 3.790799 EA/EB 7 Certain 4.54 21.15 https://www.superwasp.org/media/sources/1SWASP... https://www.superwasp.org/media/sources/1SWASP... https://www.superwasp.org/media/sources/1SWASP... https://www.superwasp.org/media/sources/1SWASP...
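
The RA and Dec columns in the preview above are strings such as 0h01m26.73s and -0d13m44.2s, so they cannot be plotted or compared numerically as loaded. Here is a minimal sketch of a conversion to decimal degrees, assuming every value follows the two patterns seen in these rows. The helper names ra_to_deg and dec_to_deg are illustrative and not part of the dataset or of pandas.

```python
import re

# Patterns for the string formats seen in the preview,
# e.g. '0h01m26.73s' (RA) and '-0d13m44.2s' (Dec).
_RA = re.compile(r"^\s*(\d+)h(\d+)m([\d.]+)s\s*$")
_DEC = re.compile(r"^\s*([+-]?)(\d+)d(\d+)m([\d.]+)s\s*$")


def ra_to_deg(text: str) -> float:
    """Convert an 'HhMMmSS.Ss' string to decimal degrees (1 hour = 15 degrees)."""
    match = _RA.match(text)
    if match is None:
        raise ValueError(f"unrecognised RA string: {text!r}")
    hours, minutes, seconds = (float(g) for g in match.groups())
    return 15.0 * (hours + minutes / 60.0 + seconds / 3600.0)


def dec_to_deg(text: str) -> float:
    """Convert a '[-]DdMMmSS.Ss' string to decimal degrees."""
    match = _DEC.match(text)
    if match is None:
        raise ValueError(f"unrecognised Dec string: {text!r}")
    # The sign is captured separately so that '-0d13m44.2s' stays negative.
    sign = -1.0 if match.group(1) == "-" else 1.0
    degrees, minutes, seconds = (float(g) for g in match.groups()[1:])
    return sign * (degrees + minutes / 60.0 + seconds / 3600.0)
```

In the notebook this could be applied with df['RA'].map(ra_to_deg) and df['Dec'].map(dec_to_deg). A ValueError on some row would mean the format assumption does not hold for the whole file.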

First look¶

Shape, types and basic statistics.

In [3]:
df.info()
df.describe(include='all').T.head(20)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100000 entries, 0 to 99999
Data columns (total 17 columns):
 #   Column                Non-Null Count   Dtype  
---  ------                --------------   -----  
 0   SuperWASP ID          100000 non-null  object 
 1   Period Length         100000 non-null  float64
 2   RA                    100000 non-null  object 
 3   Dec                   100000 non-null  object 
 4   Maximum magnitude     100000 non-null  float64
 5   Minimum magnitude     96941 non-null   float64
 6   Mean magnitude        100000 non-null  float64
 7   Amplitude             96941 non-null   float64
 8   Classification        100000 non-null  object 
 9   Classification count  100000 non-null  int64  
 10  Folding flag          100000 non-null  object 
 11  Sigma                 100000 non-null  float64
 12  Chi squared           100000 non-null  float64
 13  FITS URL              100000 non-null  object 
 14  JSON URL              100000 non-null  object 
 15  Unfolded plot URL     100000 non-null  object 
 16  Folded plot URL       100000 non-null  object 
dtypes: float64(7), int64(1), object(9)
memory usage: 13.0+ MB
/opt/tljh/user/lib/python3.10/site-packages/pandas/core/nanops.py:1016: RuntimeWarning: invalid value encountered in subtract
  sqr = _ensure_numeric((avg - values) ** 2)
Out[3]:
count unique top freq mean std min 25% 50% 75% max
SuperWASP ID 100000 61358 1SWASPJ005420.20+391224.1 15 NaN NaN NaN NaN NaN NaN NaN
Period Length 100000.0 NaN NaN NaN 885344.136422 2205434.898275 3597.0249 19721.416992 37551.52539 290681.210938 37475808.0
RA 100000 60595 4h03m36.4s 15 NaN NaN NaN NaN NaN NaN NaN
Dec 100000 60820 7d00m59.5s 15 NaN NaN NaN NaN NaN NaN NaN
Maximum magnitude 100000.0 NaN NaN NaN 12.323996 1.303957 6.016183 11.516335 12.516113 13.311662 16.555348
Minimum magnitude 96941.0 NaN NaN NaN inf NaN 6.675255 12.184347 13.399917 14.720874 inf
Mean magnitude 100000.0 NaN NaN NaN 12.7389 1.444923 6.502205 11.81065 12.870022 13.788994 17.87681
Amplitude 96941.0 NaN NaN NaN inf NaN 0.05284 0.483938 0.869866 1.5963 inf
Classification 100000 5 Rotator 32737 NaN NaN NaN NaN NaN NaN NaN
Classification count 100000.0 NaN NaN NaN 3.0714 2.545951 1.0 1.0 2.0 6.0 10.0
Folding flag 100000 3 Certain 67610 NaN NaN NaN NaN NaN NaN NaN
Sigma 100000.0 NaN NaN NaN 5.079453 1.172522 3.5 4.13 4.84 5.8 10.7
Chi squared 100000.0 NaN NaN NaN 293.435658 975.328887 20.0 42.61 86.52 214.9025 72992.79
FITS URL 100000 61358 https://www.superwasp.org/media/sources/1SWASP... 15 NaN NaN NaN NaN NaN NaN NaN
JSON URL 100000 61358 https://www.superwasp.org/media/sources/1SWASP... 15 NaN NaN NaN NaN NaN NaN NaN
Unfolded plot URL 100000 61358 https://www.superwasp.org/media/sources/1SWASP... 15 NaN NaN NaN NaN NaN NaN NaN
Folded plot URL 100000 100000 https://www.superwasp.org/media/sources/1SWASP... 1 NaN NaN NaN NaN NaN NaN NaN
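
In the summary above, Minimum magnitude and Amplitude have a mean and max of inf and a standard deviation of NaN, and pandas emits a RuntimeWarning. Together these suggest that the two columns contain infinite values. Here is a minimal sketch of one way to get usable statistics, assuming it is acceptable for your analysis to treat infinities as missing. The helper finite_describe and the small demo frame are illustrative only.

```python
import numpy as np
import pandas as pd


def finite_describe(frame: pd.DataFrame) -> pd.DataFrame:
    """Describe the numeric columns after turning +/-inf into NaN."""
    numeric = frame.select_dtypes("number")
    cleaned = numeric.replace([np.inf, -np.inf], np.nan)
    return cleaned.describe().T


# Made-up values, only to show the effect of the replacement.
demo = pd.DataFrame(
    {
        "Minimum magnitude": [12.0, 13.0, np.inf, np.nan],
        "Amplitude": [0.5, 1.0, np.inf, np.nan],
    }
)
summary = finite_describe(demo)
print(summary[["count", "mean", "max"]])
```

On the real data, finite_describe(df) should give finite means for both columns, with the count reduced by the number of infinite entries.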

A first chart¶

Histogram of the first numeric column (here Period Length); swap in whichever variable you care about.

In [4]:
import matplotlib.pyplot as plt

num = df.select_dtypes('number')
if num.shape[1]:
    col = num.columns[0]
    num[col].plot.hist(bins=50, figsize=(8, 4), title=col)
    plt.tight_layout()
else:
    print('No direct numeric columns: explore df on your own.')
[Figure: histogram of Period Length, 50 bins]
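
Period Length spans several orders of magnitude: in the statistics table the median is about 3.8e4 and the maximum about 3.7e7. On a linear axis almost every star therefore falls into the first few bins. Here is a sketch of a log-scaled alternative. It assumes that Period Length is given in seconds, which this notebook does not state, so check the data sheet before relying on the conversion to days. The helper log10_period_days is illustrative.

```python
import numpy as np
import pandas as pd

SECONDS_PER_DAY = 86_400.0  # assumption: Period Length is in seconds


def log10_period_days(period_s: pd.Series) -> pd.Series:
    """Return log10(period in days), skipping non-positive or missing values."""
    days = period_s / SECONDS_PER_DAY
    return np.log10(days[days > 0])


# A few Period Length values copied from the outputs above.
periods = pd.Series([3597.0249, 24007.58594, 37551.52539, 290681.210938, 37475808.0])
log_days = log10_period_days(periods)

# Bin the log-periods; equal-width bins in log space are logarithmic bins in days.
counts, edges = np.histogram(log_days, bins=5)
print(counts, edges.round(2))
```

In the notebook, log10_period_days(df['Period Length']).plot.hist(bins=50) would draw the same distribution for the loaded rows.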

Working with data larger than memory¶

Your session has a 4 GB RAM limit, but you can analyse files of 10 GB or more without loading them whole:

  • Read only the columns you need: pd.read_csv(f, usecols=[...]) / pd.read_parquet(f, columns=[...]).
  • Process in chunks and keep only the result:
    total = 0
    for chunk in pd.read_csv(file, chunksize=1_000_000):
        total += len(chunk)
    
  • Query with SQL without loading anything — DuckDB (already installed) reads CSV and Parquet straight from disk and only brings the result into memory:
    import duckdb
    duckdb.sql("SELECT column, count(*) FROM '/srv/data/.../file.parquet' GROUP BY column").df()
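
The first two ideas in the list above can be combined: read a single column, in chunks, and keep only a running tally. Here is a minimal sketch that counts rows per Classification, assuming a comma-separated file with a Classification header as in this dataset. The function name count_classifications and the sample rows are made up for illustration.

```python
import io
from collections import Counter

import pandas as pd


def count_classifications(source, chunksize=100_000):
    """Tally the Classification column without holding the whole file in memory."""
    totals = Counter()
    # usecols limits parsing to one column; chunksize yields DataFrames piece by piece.
    reader = pd.read_csv(source, usecols=["Classification"], chunksize=chunksize)
    for chunk in reader:
        totals.update(chunk["Classification"].value_counts().to_dict())
    return totals


# Tiny in-memory stand-in for export.csv (IDs are placeholders).
sample = io.StringIO(
    "SuperWASP ID,Classification\n"
    "A,EW\n"
    "B,EA/EB\n"
    "C,EW\n"
    "D,Rotator\n"
    "E,EW\n"
)
counts = count_classifications(sample, chunksize=2)
print(counts.most_common())
```

On the Hub, count_classifications(DATA / 'export.csv') would run the same tally over the full file. Memory use is bounded by the chunk size rather than by the file size.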
    

Your turn¶

This is just the starting point. Some ideas:

  • Check the dataset challenge on its CSDH data sheet.
  • Work on a copy: right-click the file → Duplicate (or Save Notebook As…). Your changes only live in your Hub space — they're never pushed to GitHub.
  • Edited this notebook and want the original back? Use the Restore cell below (or the restore.ipynb notebook).
  • Questions and results: on the platform forum.

Attribution: data from SuperWASP Variable Star Photometry Archive (VeSPA), license CC-BY-4.0. Notebook from the Citizen Science Data Hub (CSDH) — Fundación Ibercivis.

In [5]:
# ⚠️ RESTORE: this DISCARDS YOUR CHANGES to this notebook and resets it to the original.
# 1. Uncomment the line below (remove the #)   2. Run this cell
# 3. Then: menu File → Reload Notebook from Disk

# !git -C ~/citizen-science-data fetch -q origin && git -C ~/citizen-science-data checkout origin/main -- superwasp-vespa.ipynb && echo "Restored. Now: File → Reload Notebook from Disk"