ASTR199 — Stellar Parameter Prediction via Deep Learning
Two-stage deep learning pipeline predicting stellar Teff, metallicity, and log g from a 1M+-row multi-survey photometric catalog, driving log g R² from 0.519 to 0.833.
Role: Machine Learning Researcher · Advisor: Prof. Theissen, UC San Diego · Stack: Python, DuckDB, PyTorch, Pandas, Scikit-learn · Year: 2026
Overview
This project develops a two-stage deep learning pipeline to predict fundamental stellar parameters — Effective Temperature (T_eff), Metallicity ([Fe/H]), and Surface Gravity (log g) — from photometric catalog data. The core challenges are data quality at scale, multi-task loss collapse in joint training, and color-log g degeneracy that standard feature sets cannot resolve. The final pipeline drives log g R² from 0.519 to 0.833 by injecting physics-informed features at inference time.
Data Engineering
- Cross-survey integration: Engineered an ETL pipeline using DuckDB for high-throughput cross-matching and NA handling across ~1M rows spanning 7 photometric catalogs (Gaia, SDSS, WISE, and others).
- Integrated with Pandas for sigma clipping and strict inner joins, retaining only observations with complete, non-conflicting labels.
- Validated data integrity via color-magnitude diagrams and 2D color histograms to confirm expected physical correlations before modeling.
Physics-Informed Feature Engineering
Raw photometric colors are highly collinear — naive PCA loses discriminating information in the correlated subspace. To break this degeneracy:
- Engineered 171 color indices and 19 absolute magnitudes from the raw photometric bands, constructing a physics-informed feature space in Pandas.
- Applied 1-sigma outlier clipping independently across F/G/K/M spectral cohorts to robustly filter background noise without discarding minority-class stars.
- Engineered orthogonal luminosity features to break the prediction degeneracy that PCA diagnosed in the highly collinear color inputs.
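The feature-construction and per-cohort clipping steps can be sketched in Pandas as follows. Everything here is illustrative: the four band names, the `cohort` column, the representative `g-r` clipping column, and the synthetic magnitudes are assumptions, not the project's data. The same pairwise-difference construction over 19 bands yields the 171 color indices cited above (19 × 18 / 2 = 171).

```python
from itertools import combinations

import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
bands = ["g", "r", "i", "z"]  # hypothetical band set; the project uses 19
df = pd.DataFrame(rng.normal(14, 1, size=(300, len(bands))), columns=bands)
df["cohort"] = rng.choice(list("FGKM"), size=len(df))

# Color indices: every pairwise band difference (4 bands -> 6 colors).
for b1, b2 in combinations(bands, 2):
    df[f"{b1}-{b2}"] = df[b1] - df[b2]

# Per-cohort sigma clipping on a representative color: each star is
# judged against its own cohort's mean and spread, so minority classes
# (e.g. M dwarfs) are not clipped away by the global distribution.
k = 1.0
mu = df.groupby("cohort")["g-r"].transform("mean")
sigma = df.groupby("cohort")["g-r"].transform("std")
kept = df[(df["g-r"] - mu).abs() <= k * sigma]
```

Using `groupby(...).transform` keeps the per-cohort statistics aligned row-by-row with the original frame, so the clip reduces to a single vectorized boolean mask.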
Two-Stage Pipeline Architecture
Joint multi-target training collapsed: the model sacrificed the less-correlated target (log g) to minimize total loss on the dominant target (T_eff). To resolve this:
- Stage 1: Trained a dedicated model to predict T_eff from the photometric feature space.
- Stage 2: Injected Stage-1 T_eff predictions as Hertzsprung-Russell diagram proxies, providing a physics-grounded signal that breaks the color-log g degeneracy. This drove log g R² from 0.519 → 0.833.
- Applied a homoscedastic-uncertainty loss (learned per-task log-variance weighting) to balance the per-target terms of the joint loss against their differing residual scales.
- Strictly enforced train/val/test isolation between pipeline stages to prevent data leakage.
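The two-stage scheme and the uncertainty-weighted joint loss can be sketched in PyTorch as follows. This is a toy sketch under stated assumptions: the data are synthetic, the network sizes, step counts, and learning rates are placeholders, and the loss uses the Kendall-et-al.-style learned log-variance weighting, which may differ in detail from the project's formulation. The key mechanics shown are the `detach()` that freezes Stage 1 when its T_eff predictions are injected into Stage 2, and the per-task weights that keep one target from swamping the joint objective.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Synthetic stand-in for the photometric feature space.
n, d = 512, 8
X = torch.randn(n, d)
W = torch.randn(d, 3)
Y = X @ W + 0.1 * torch.randn(n, 3)  # columns: T_eff, [Fe/H], log g


def mlp(d_in, d_out):
    # Placeholder architecture; the real model is not specified here.
    return nn.Sequential(nn.Linear(d_in, 32), nn.ReLU(), nn.Linear(32, d_out))


# --- Stage 1: dedicated T_eff regressor on the photometric features ---
stage1 = mlp(d, 1)
opt1 = torch.optim.Adam(stage1.parameters(), lr=1e-2)
for _ in range(300):
    opt1.zero_grad()
    nn.functional.mse_loss(stage1(X), Y[:, :1]).backward()
    opt1.step()

# --- Stage 2: [Fe/H] and log g, with Stage-1 T_eff injected as an
# extra feature (an H-R-diagram-like proxy). detach() freezes Stage 1
# so no gradients flow back through it. ---
teff_hat = stage1(X).detach()
X2 = torch.cat([X, teff_hat], dim=1)

stage2 = mlp(d + 1, 2)
# One learned s_t = log(sigma_t^2) per target, as in homoscedastic-
# uncertainty multi-task weighting.
log_vars = nn.Parameter(torch.zeros(2))
opt2 = torch.optim.Adam(list(stage2.parameters()) + [log_vars], lr=1e-2)
for _ in range(300):
    opt2.zero_grad()
    per_task_mse = ((stage2(X2) - Y[:, 1:]) ** 2).mean(dim=0)  # shape (2,)
    # L = sum_t [ 0.5 * exp(-s_t) * MSE_t + 0.5 * s_t ]
    loss2 = (0.5 * torch.exp(-log_vars) * per_task_mse + 0.5 * log_vars).sum()
    loss2.backward()
    opt2.step()
```

In a real run, Stage 1 would be trained only on the training split, and Stage 2 would consume Stage-1 predictions (never Stage-1 labels) on every split, which is what the train/val/test isolation bullet above enforces.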
Residual Analysis
- Conducted systematic residual analysis to define model boundaries: diagnosed specific prediction failures on K-dwarfs as a consequence of narrow target variance and long-tail sample imbalance, not model capacity constraints.
- This distinction directed further data collection effort rather than wasted compute on architectural changes.
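The K-dwarf diagnosis above hinges on a property of R² worth making concrete: identical absolute errors yield a much lower (even negative) R² on a cohort whose true values span a narrow range. The sketch below is purely synthetic; the cohort labels, spreads, and 0.1-dex error are invented to illustrate the failure mode, not measured project numbers.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
n = 2000
cohort = rng.choice(list("FGKM"), size=n, p=[0.3, 0.3, 0.1, 0.3])

# Broad true log g spread for most cohorts, deliberately narrow for K
# to mimic the narrow-target-variance failure mode.
spread = np.where(cohort == "K", 0.05, 0.3)
logg_true = 4.4 + spread * rng.standard_normal(n)
logg_pred = logg_true + 0.1 * rng.standard_normal(n)  # same 0.1-dex error everywhere

df = pd.DataFrame({"cohort": cohort, "logg_true": logg_true, "logg_pred": logg_pred})
df["resid"] = df["logg_pred"] - df["logg_true"]

# Per-cohort bias, scatter, and sample count.
summary = df.groupby("cohort")["resid"].agg(bias="mean", scatter="std", n="size")


def r2(g):
    ss_res = ((g["logg_pred"] - g["logg_true"]) ** 2).sum()
    ss_tot = ((g["logg_true"] - g["logg_true"].mean()) ** 2).sum()
    return 1.0 - ss_res / ss_tot


summary["r2"] = df.groupby("cohort")[["logg_true", "logg_pred"]].apply(r2)
print(summary)
```

The K cohort ends up with comparable bias and scatter to the others but a collapsed R², which is the signature distinguishing a data-distribution limitation from a model-capacity one, and why more K-dwarf data (not a bigger model) is the right fix.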