Galaxy Zoo DESI: Detailed Morphology Classifications

Category: Astronomy · Size: 11.5 GB · Format: CSV, Parquet · License: CC-BY-4.0 · Zenodo record · Data sheet on the CSDH

Detailed morphological classifications of 8.7 million galaxies from the DESI Legacy Imaging Survey, produced with deep learning trained on Galaxy Zoo volunteers.

The data is mounted read-only at /srv/data/galaxy-zoo-desi/. Save anything you produce in your personal folder (~/).

⚠️ Large dataset (11.5 GB). Your Hub session has 4 GB RAM — do not load the whole file into memory or the kernel will crash. Work like the pros: read only the columns you need, process the file in chunks, or query it straight from disk with DuckDB (no full load). Copy-paste patterns are in "Working with data larger than memory" near the end of this notebook.

What's in the dataset

In [1]:
from pathlib import Path

DATA = Path('/srv/data/galaxy-zoo-desi')

for f in sorted(DATA.rglob('*')):
    if f.is_file():
        print(f"{f.relative_to(DATA)}  ({f.stat().st_size/1e6:,.1f} MB)")
external_catalog.parquet  (1,616.7 MB)
gz_desi_deep_learning_catalog_advanced.parquet  (7,558.1 MB)
gz_desi_deep_learning_catalog_friendly.csv  (1,612.4 MB)
gz_desi_deep_learning_catalog_friendly.parquet  (658.8 MB)
gz_desi_gzd8_volunteer_core_catalog.parquet  (6.4 MB)
gz_desi_gzd8_volunteer_extended_catalog.parquet  (5.5 MB)

Load the data

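This notebook loads the CSV version of the friendly catalog (the same table is also available as Parquet). In the real world CSV files aren't always uniform (separator , or ;, UTF-8 or Latin-1 encoding), so we use a loader that detects both. We limit the read to 100,000 rows to explore quickly. The full CSV is about 1.6 GB on disk and may not fit in the 4 GB session once parsed, so if you drop nrows, also restrict the columns or read in chunks.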

In [2]:
import pandas as pd

# Plain and gzip-compressed CSV files, each listed once.
csvs = sorted(set(DATA.rglob('*.csv')) | set(DATA.rglob('*.csv.gz')))
print('Using:', csvs[0].name)

def load_csv(path, **kw):
    """Robust reader: detects the separator and tries utf-8 then latin-1."""
    for enc in ('utf-8', 'latin-1'):
        try:
            # sep=None with the python engine makes pandas sniff the delimiter.
            return pd.read_csv(path, sep=None, engine='python', encoding=enc, **kw)
        except UnicodeDecodeError:
            continue

df = load_csv(csvs[0], nrows=100_000)
df.head()
Using: gz_desi_deep_learning_catalog_friendly.csv
Out[2]:
dr8_id ra dec brickid objid smooth-or-featured_smooth_fraction smooth-or-featured_featured-or-disk_fraction smooth-or-featured_artifact_fraction disk-edge-on_yes_fraction disk-edge-on_no_fraction ... spiral-arm-count_3_fraction spiral-arm-count_4_fraction spiral-arm-count_more-than-4_fraction spiral-arm-count_cant-tell_fraction merging_none_fraction merging_minor-disturbance_fraction merging_major-disturbance_fraction merging_merger_fraction catalog_version legacy_survey_data_release
0 100000_1081 32.084931 -44.311422 100000 1081 0.69 0.25 0.06 NaN NaN ... NaN NaN NaN NaN 0.84 0.12 0.02 0.01 1.0.0 DR8
1 100000_1401 32.140085 -44.293668 100000 1401 0.77 0.12 0.11 NaN NaN ... NaN NaN NaN NaN 0.60 0.15 0.05 0.21 1.0.0 DR8
2 100000_1483 32.275015 -44.288957 100000 1483 0.81 0.10 0.08 NaN NaN ... NaN NaN NaN NaN 0.59 0.17 0.04 0.19 1.0.0 DR8
3 100000_1509 32.045648 -44.287172 100000 1509 0.64 0.27 0.09 NaN NaN ... NaN NaN NaN NaN 0.17 0.07 0.05 0.71 1.0.0 DR8
4 100000_1869 32.170627 -44.267273 100000 1869 0.88 0.05 0.07 NaN NaN ... NaN NaN NaN NaN 0.68 0.25 0.05 0.02 1.0.0 DR8

5 rows × 41 columns
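
Most analyses need only a few of the 41 columns. The sketch below reads a chosen subset, using column names from the table above. `load_columns` is a hypothetical helper, and unlike the sniffing loader it assumes a plain comma-separated file:

```python
from pathlib import Path

import pandas as pd


def load_columns(path, columns, nrows=None):
    """Read only `columns` from a CSV file, optionally stopping after `nrows` rows."""
    return pd.read_csv(path, usecols=columns, nrows=nrows)


cols = ['dr8_id', 'ra', 'dec', 'smooth-or-featured_smooth_fraction']
csv_path = Path('/srv/data/galaxy-zoo-desi/gz_desi_deep_learning_catalog_friendly.csv')
if csv_path.exists():  # guarded so the sketch also runs off the Hub
    small = load_columns(csv_path, cols, nrows=100_000)
    print(small.shape)
```

pandas returns the selected columns in file order, not in the order given to usecols.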

First look

Shape, types and basic statistics.

In [3]:
df.info()
df.describe(include='all').T.head(20)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100000 entries, 0 to 99999
Data columns (total 41 columns):
 #   Column                                        Non-Null Count   Dtype  
---  ------                                        --------------   -----  
 0   dr8_id                                        100000 non-null  object 
 1   ra                                            100000 non-null  float64
 2   dec                                           100000 non-null  float64
 3   brickid                                       100000 non-null  int64  
 4   objid                                         100000 non-null  int64  
 5   smooth-or-featured_smooth_fraction            100000 non-null  float64
 6   smooth-or-featured_featured-or-disk_fraction  100000 non-null  float64
 7   smooth-or-featured_artifact_fraction          100000 non-null  float64
 8   disk-edge-on_yes_fraction                     11948 non-null   float64
 9   disk-edge-on_no_fraction                      11948 non-null   float64
 10  has-spiral-arms_yes_fraction                  7351 non-null    float64
 11  has-spiral-arms_no_fraction                   7351 non-null    float64
 12  bar_strong_fraction                           7351 non-null    float64
 13  bar_weak_fraction                             7351 non-null    float64
 14  bar_no_fraction                               7351 non-null    float64
 15  bulge-size_dominant_fraction                  7351 non-null    float64
 16  bulge-size_large_fraction                     7351 non-null    float64
 17  bulge-size_moderate_fraction                  7351 non-null    float64
 18  bulge-size_small_fraction                     7351 non-null    float64
 19  bulge-size_none_fraction                      7351 non-null    float64
 20  how-rounded_round_fraction                    81439 non-null   float64
 21  how-rounded_in-between_fraction               81439 non-null   float64
 22  how-rounded_cigar-shaped_fraction             81439 non-null   float64
 23  edge-on-bulge_boxy_fraction                   2460 non-null    float64
 24  edge-on-bulge_none_fraction                   2460 non-null    float64
 25  edge-on-bulge_rounded_fraction                2460 non-null    float64
 26  spiral-winding_tight_fraction                 4685 non-null    float64
 27  spiral-winding_medium_fraction                4685 non-null    float64
 28  spiral-winding_loose_fraction                 4685 non-null    float64
 29  spiral-arm-count_1_fraction                   4685 non-null    float64
 30  spiral-arm-count_2_fraction                   4685 non-null    float64
 31  spiral-arm-count_3_fraction                   4685 non-null    float64
 32  spiral-arm-count_4_fraction                   4685 non-null    float64
 33  spiral-arm-count_more-than-4_fraction         4685 non-null    float64
 34  spiral-arm-count_cant-tell_fraction           4685 non-null    float64
 35  merging_none_fraction                         100000 non-null  float64
 36  merging_minor-disturbance_fraction            100000 non-null  float64
 37  merging_major-disturbance_fraction            100000 non-null  float64
 38  merging_merger_fraction                       100000 non-null  float64
 39  catalog_version                               100000 non-null  object 
 40  legacy_survey_data_release                    100000 non-null  object 
dtypes: float64(36), int64(2), object(3)
memory usage: 31.3+ MB
Out[3]:
count unique top freq mean std min 25% 50% 75% max
dr8_id 100000 100000 100000_1081 1 NaN NaN NaN NaN NaN NaN NaN
ra 100000.0 NaN NaN NaN 158.501302 139.958766 0.000383 40.929024 79.160883 323.514389 359.992159
dec 100000.0 NaN NaN NaN -43.343537 0.581483 -44.374993 -43.84433 -43.34548 -42.841351 -42.125242
brickid 100000.0 NaN NaN NaN 104152.80792 2432.567319 100000.0 102041.0 104129.0 106249.0 108350.0
objid 100000.0 NaN NaN NaN 2608.71779 1685.785361 0.0 1214.0 2448.0 3796.0 8488.0
smooth-or-featured_smooth_fraction 100000.0 NaN NaN NaN 0.676405 0.214371 0.02 0.6 0.77 0.83 0.91
smooth-or-featured_featured-or-disk_fraction 100000.0 NaN NaN NaN 0.213579 0.208996 0.02 0.07 0.12 0.28 0.95
smooth-or-featured_artifact_fraction 100000.0 NaN NaN NaN 0.110018 0.095357 0.03 0.07 0.09 0.11 0.91
disk-edge-on_yes_fraction 11948.0 NaN NaN NaN 0.309472 0.387722 0.01 0.03 0.06 0.77 0.99
disk-edge-on_no_fraction 11948.0 NaN NaN NaN 0.690528 0.387722 0.01 0.23 0.94 0.97 0.99
has-spiral-arms_yes_fraction 7351.0 NaN NaN NaN 0.815897 0.177276 0.05 0.75 0.88 0.94 0.99
has-spiral-arms_no_fraction 7351.0 NaN NaN NaN 0.184103 0.177276 0.01 0.06 0.12 0.25 0.95
bar_strong_fraction 7351.0 NaN NaN NaN 0.163595 0.14871 0.02 0.06 0.11 0.22 0.87
bar_weak_fraction 7351.0 NaN NaN NaN 0.32223 0.104038 0.04 0.24 0.32 0.4 0.61
bar_no_fraction 7351.0 NaN NaN NaN 0.514122 0.20164 0.04 0.36 0.54 0.68 0.94
bulge-size_dominant_fraction 7351.0 NaN NaN NaN 0.014595 0.012033 0.01 0.01 0.01 0.02 0.24
bulge-size_large_fraction 7351.0 NaN NaN NaN 0.059431 0.067475 0.01 0.02 0.03 0.07 0.56
bulge-size_moderate_fraction 7351.0 NaN NaN NaN 0.41169 0.171707 0.04 0.27 0.41 0.55 0.8
bulge-size_small_fraction 7351.0 NaN NaN NaN 0.432016 0.188136 0.03 0.29 0.43 0.57 0.91
bulge-size_none_fraction 7351.0 NaN NaN NaN 0.0821 0.122954 0.01 0.01 0.03 0.09 0.84
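
The non-null counts reported by df.info() differ by question: every row has smooth-or-featured and merging fractions, but only 7,351 of the 100,000 rows have bar fractions. One plausible reading, which the data sheet should confirm, is that follow-up questions are filled in only where they apply. The sketch below summarises coverage per question. It assumes columns follow the '<question>_<answer>_fraction' pattern seen above; `question_coverage` is a hypothetical helper, and the small demo frame stands in for df:

```python
import pandas as pd


def question_coverage(df):
    """Share of rows with a non-null answer, per question.

    Assumes columns are named '<question>_<answer>_fraction'.
    """
    fraction_cols = [c for c in df.columns if c.endswith('_fraction')]
    coverage = df[fraction_cols].notna().mean()
    # Group answer columns by the question name (text before the first underscore)
    # and report the least-covered answer of each question.
    return coverage.groupby(lambda c: c.split('_', 1)[0], sort=False).min()


demo = pd.DataFrame({
    'dr8_id': ['a', 'b', 'c', 'd'],
    'merging_none_fraction': [0.8, 0.6, 0.6, 0.2],
    'merging_merger_fraction': [0.2, 0.4, 0.4, 0.8],
    'bar_strong_fraction': [0.3, None, None, None],
    'bar_no_fraction': [0.7, None, None, None],
})
print(question_coverage(demo))
```

On the Hub, call question_coverage(df) on the frame loaded earlier.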

A first chart

Histogram of the first numeric column (here ra, the right ascension). Swap it for the variable you care about.

In [4]:
import matplotlib.pyplot as plt

num = df.select_dtypes('number')
if num.shape[1]:
    col = num.columns[0]
    num[col].plot.hist(bins=50, figsize=(8, 4), title=col)
    plt.tight_layout()
else:
    print('No direct numeric columns: explore df on your own.')
[Figure: histogram of ra, 50 bins]
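
To chart a morphology column instead, pass its name explicitly. This is a sketch rather than part of the original notebook: `plot_fraction_hist` is a hypothetical helper, and the tiny demo frame (values copied from the head of the table) only keeps the block runnable on its own:

```python
import matplotlib.pyplot as plt
import pandas as pd


def plot_fraction_hist(df, col, bins=50):
    """Histogram of one column, ignoring rows where it is NaN."""
    values = df[col].dropna()
    ax = values.plot.hist(bins=bins, figsize=(8, 4), title=f'{col} (n={len(values):,})')
    ax.set_xlabel(col)
    plt.tight_layout()
    return ax


# Stand-in data; on the Hub pass the df loaded earlier.
demo = pd.DataFrame({'smooth-or-featured_smooth_fraction': [0.69, 0.77, 0.81, 0.64, 0.88, None]})
ax = plot_fraction_hist(demo, 'smooth-or-featured_smooth_fraction', bins=10)
```

Putting the non-null count in the title makes it obvious when a follow-up question covers only a small part of the sample.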

Working with data larger than memory

Your session has a 4 GB RAM limit, but you can analyse files of 10 GB or more without loading them whole:

  • Read only the columns you need: pd.read_csv(f, usecols=[...]) / pd.read_parquet(f, columns=[...]).
  • Process in chunks and keep only the result:
    total = 0
    for chunk in pd.read_csv(file, chunksize=1_000_000):
        total += len(chunk)
    
  • Query with SQL without loading anything — DuckDB (already installed) reads CSV and Parquet straight from disk and only brings the result into memory:
    import duckdb
    duckdb.sql("SELECT column, count(*) FROM '/srv/data/.../file.parquet' GROUP BY column").df()
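
The three patterns above can be combined into one concrete task: the mean of a single column over the whole file. This is a sketch under assumptions: the helper names are hypothetical, the CSV is taken to be comma-separated, and the column name comes from the table shown earlier.

```python
from pathlib import Path

import pandas as pd


def chunked_mean(path, col, chunksize=1_000_000):
    """Mean of one CSV column, holding at most `chunksize` rows in memory."""
    total, count = 0.0, 0
    for chunk in pd.read_csv(path, usecols=[col], chunksize=chunksize):
        values = chunk[col].dropna()
        total += values.sum()
        count += len(values)
    return total / count if count else float('nan')


def duckdb_mean(path, col):
    """Same statistic computed by DuckDB directly from a CSV or Parquet file."""
    import duckdb  # imported here so chunked_mean works without DuckDB installed

    # Double quotes keep hyphenated column names valid as SQL identifiers.
    query = f"SELECT avg(\"{col}\") FROM '{path}'"
    return duckdb.sql(query).fetchone()[0]


csv_path = Path('/srv/data/galaxy-zoo-desi/gz_desi_deep_learning_catalog_friendly.csv')
if csv_path.exists():  # guarded so the sketch also runs off the Hub
    print(chunked_mean(csv_path, 'merging_merger_fraction'))
```

Both functions skip missing values, so they should agree up to floating-point rounding. The DuckDB version also accepts the Parquet files, which are smaller on disk than the CSV.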
    

Your turn

This is just the starting point. Some ideas:

  • Check the dataset challenge on its CSDH data sheet.
  • Work on a copy: right-click the file → Duplicate (or Save Notebook As…). Your changes only live in your Hub space — they're never pushed to GitHub.
  • Edited this notebook and want the original back? Use the Restore cell below (or the restore.ipynb notebook).
  • Questions and results: on the platform forum.

Attribution: data from Galaxy Zoo DESI: Detailed Morphology Classifications, license CC-BY-4.0. Notebook from the Citizen Science Data Hub (CSDH) — Fundación Ibercivis.

In [5]:
# ⚠️ RESTORE: this DISCARDS YOUR CHANGES to this notebook and resets it to the original.
# 1. Uncomment the line below (remove the #)   2. Run this cell
# 3. Then: menu File → Reload Notebook from Disk

# !git -C ~/citizen-science-data fetch -q origin && git -C ~/citizen-science-data checkout origin/main -- galaxy-zoo-desi.ipynb && echo "Restored. Now: File → Reload Notebook from Disk"