ASTR199 — Stellar Parameter Prediction via Deep Learning
Two-stage deep learning pipeline predicting stellar Teff, metallicity, and log g from a 1M+-row multi-survey photometric catalog, driving log g R² from 0.519 to 0.833.
Role: Machine Learning Researcher · Advisor: Prof. Theissen, UC San Diego · Stack: Python, DuckDB, PyTorch, Pandas, Scikit-learn · Year: 2026
Overview
This project develops a two-stage deep learning pipeline to predict fundamental stellar parameters — Effective Temperature (T_eff), Metallicity ([Fe/H]), and Surface Gravity (log g) — from photometric catalog data. The core challenges are data quality at scale, multi-task loss collapse in joint training, and color-log g degeneracy that standard feature sets cannot resolve. The final pipeline drives log g R² from 0.519 to 0.833 by injecting physics-informed features at inference time.
Data Engineering
- Cross-survey integration: Engineered an ETL pipeline using DuckDB for high-throughput cross-matching and NA handling across ~1M rows spanning 7 photometric catalogs (Gaia, SDSS, WISE, and others).
- Integrated with Pandas for sigma clipping and strict inner joins, retaining only observations with complete, non-conflicting labels.
- Validated data integrity via color-magnitude diagrams and 2D color histograms to confirm expected physical correlations before modeling.
Physics-Informed Feature Engineering
Raw photometric colors are highly collinear — naive PCA loses discriminating information in the correlated subspace. To break this degeneracy:
- Engineered 171 color indices and 19 absolute magnitudes from the raw photometric bands, constructing a physics-informed feature space in Pandas.
- Applied 1-sigma outlier clipping independently across F/G/K/M spectral cohorts to robustly filter background noise without discarding minority-class stars.
- Engineered orthogonal luminosity features to break the prediction degeneracy that PCA diagnosed in the highly collinear color inputs.
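The feature-construction and per-cohort clipping steps can be sketched in Pandas as follows. Everything here is illustrative: the four band names, the `cohort` column, the representative `g-r` clipping column, and the synthetic magnitudes are assumptions, not the project's data. The same pairwise-difference construction over 19 bands yields the 171 color indices cited above (19 × 18 / 2 = 171).

```python
from itertools import combinations

import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
bands = ["g", "r", "i", "z"]  # hypothetical band set; the project uses 19
df = pd.DataFrame(rng.normal(14, 1, size=(300, len(bands))), columns=bands)
df["cohort"] = rng.choice(list("FGKM"), size=len(df))

# Color indices: every pairwise band difference (4 bands -> 6 colors).
for b1, b2 in combinations(bands, 2):
    df[f"{b1}-{b2}"] = df[b1] - df[b2]

# Per-cohort sigma clipping on a representative color: each star is
# judged against its own cohort's mean and spread, so minority classes
# (e.g. M dwarfs) are not clipped away by the global distribution.
k = 1.0
mu = df.groupby("cohort")["g-r"].transform("mean")
sigma = df.groupby("cohort")["g-r"].transform("std")
kept = df[(df["g-r"] - mu).abs() <= k * sigma]
```

Using `groupby(...).transform` keeps the per-cohort statistics aligned row-by-row with the original frame, so the clip reduces to a single vectorized boolean mask.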
Two-Stage Pipeline Architecture
Joint multi-target training collapsed: the model sacrificed the less-correlated target (log g) to minimize total loss on the dominant target (T_eff). To resolve this:
- Stage 1: Trained a dedicated model to predict T_eff from the photometric feature space.
- Stage 2: Injected Stage-1 T_eff predictions as Hertzsprung-Russell diagram proxies, providing a physics-grounded signal that breaks the color-log g degeneracy. This drove log g R² from 0.519 → 0.833.
- Applied a homoscedastic-uncertainty loss (learned per-task log-variance weighting) to balance the per-target terms of the joint loss against their differing residual scales.
- Strictly enforced train/val/test isolation between pipeline stages to prevent data leakage.
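The two-stage scheme and the uncertainty-weighted joint loss can be sketched in PyTorch as follows. This is a toy sketch under stated assumptions: the data are synthetic, the network sizes, step counts, and learning rates are placeholders, and the loss uses the Kendall-et-al.-style learned log-variance weighting, which may differ in detail from the project's formulation. The key mechanics shown are the `detach()` that freezes Stage 1 when its T_eff predictions are injected into Stage 2, and the per-task weights that keep one target from swamping the joint objective.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Synthetic stand-in for the photometric feature space.
n, d = 512, 8
X = torch.randn(n, d)
W = torch.randn(d, 3)
Y = X @ W + 0.1 * torch.randn(n, 3)  # columns: T_eff, [Fe/H], log g


def mlp(d_in, d_out):
    # Placeholder architecture; the real model is not specified here.
    return nn.Sequential(nn.Linear(d_in, 32), nn.ReLU(), nn.Linear(32, d_out))


# --- Stage 1: dedicated T_eff regressor on the photometric features ---
stage1 = mlp(d, 1)
opt1 = torch.optim.Adam(stage1.parameters(), lr=1e-2)
for _ in range(300):
    opt1.zero_grad()
    nn.functional.mse_loss(stage1(X), Y[:, :1]).backward()
    opt1.step()

# --- Stage 2: [Fe/H] and log g, with Stage-1 T_eff injected as an
# extra feature (an H-R-diagram-like proxy). detach() freezes Stage 1
# so no gradients flow back through it. ---
teff_hat = stage1(X).detach()
X2 = torch.cat([X, teff_hat], dim=1)

stage2 = mlp(d + 1, 2)
# One learned s_t = log(sigma_t^2) per target, as in homoscedastic-
# uncertainty multi-task weighting.
log_vars = nn.Parameter(torch.zeros(2))
opt2 = torch.optim.Adam(list(stage2.parameters()) + [log_vars], lr=1e-2)
for _ in range(300):
    opt2.zero_grad()
    per_task_mse = ((stage2(X2) - Y[:, 1:]) ** 2).mean(dim=0)  # shape (2,)
    # L = sum_t [ 0.5 * exp(-s_t) * MSE_t + 0.5 * s_t ]
    loss2 = (0.5 * torch.exp(-log_vars) * per_task_mse + 0.5 * log_vars).sum()
    loss2.backward()
    opt2.step()
```

In a real run, Stage 1 would be trained only on the training split, and Stage 2 would consume Stage-1 predictions (never Stage-1 labels) on every split, which is what the train/val/test isolation bullet above enforces.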
Residual Analysis
- Conducted systematic residual analysis to define model boundaries: diagnosed specific prediction failures on K-dwarfs as a consequence of narrow target variance and long-tail sample imbalance, not model capacity constraints.
- This distinction directed further data collection effort rather than wasted compute on architectural changes.
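The K-dwarf diagnosis above hinges on a property of R² worth making concrete: identical absolute errors yield a much lower (even negative) R² on a cohort whose true values span a narrow range. The sketch below is purely synthetic; the cohort labels, spreads, and 0.1-dex error are invented to illustrate the failure mode, not measured project numbers.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
n = 2000
cohort = rng.choice(list("FGKM"), size=n, p=[0.3, 0.3, 0.1, 0.3])

# Broad true log g spread for most cohorts, deliberately narrow for K
# to mimic the narrow-target-variance failure mode.
spread = np.where(cohort == "K", 0.05, 0.3)
logg_true = 4.4 + spread * rng.standard_normal(n)
logg_pred = logg_true + 0.1 * rng.standard_normal(n)  # same 0.1-dex error everywhere

df = pd.DataFrame({"cohort": cohort, "logg_true": logg_true, "logg_pred": logg_pred})
df["resid"] = df["logg_pred"] - df["logg_true"]

# Per-cohort bias, scatter, and sample count.
summary = df.groupby("cohort")["resid"].agg(bias="mean", scatter="std", n="size")


def r2(g):
    ss_res = ((g["logg_pred"] - g["logg_true"]) ** 2).sum()
    ss_tot = ((g["logg_true"] - g["logg_true"].mean()) ** 2).sum()
    return 1.0 - ss_res / ss_tot


summary["r2"] = df.groupby("cohort")[["logg_true", "logg_pred"]].apply(r2)
print(summary)
```

The K cohort ends up with comparable bias and scatter to the others but a collapsed R², which is the signature distinguishing a data-distribution limitation from a model-capacity one, and why more K-dwarf data (not a bigger model) is the right fix.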