ASTR199 — Stellar Parameter Prediction via Deep Learning
Multi-target MLP regression on 1M+ row multi-survey astronomical catalog, with LoRA fine-tuning for M-dwarf sub-populations.
Role: Machine Learning Researcher · Advisor: Prof. Theissen, UC San Diego · Stack: Python, SQL, PyTorch, PEFT/LoRA, Scikit-learn · Year: 2026
Overview
This project develops a deep learning pipeline to predict fundamental stellar parameters — Effective Temperature (Teff), Metallicity ([Fe/H]), and Surface Gravity (log g) — from photometric catalog data. The core challenge is data quality: cross-matching disparate astronomical surveys at scale, then building models that generalize across spectral types without overfitting to the well-represented majority classes.
Data Engineering
- Cross-survey integration: Engineered a large-scale SQL pipeline to extract and cross-match 1M+ rows across disparate astronomical catalogs: Gaia, SDSS, WISE, and additional photometric surveys.
- Applied strict inner joins to ensure only observations with complete, non-conflicting labels across all surveys were retained.
- Implemented sigma clipping to filter statistical outliers, ensuring high signal-to-noise inputs before any modeling began.
- Validated data integrity by plotting color-magnitude diagrams (observational HR diagrams) and 2D color histograms to confirm the expected physical correlations.
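The outlier-filtering step can be sketched in plain NumPy. This is a minimal illustration, not the project's actual code: the function name, thresholds, and synthetic data are assumptions, and a production pipeline might instead use a library routine such as astropy.stats.sigma_clip.

```python
import numpy as np

def sigma_clip(values, n_sigma=3.0, max_iters=5):
    """Iteratively mask points more than n_sigma standard deviations
    from the mean, recomputing the statistics on surviving points."""
    values = np.asarray(values, dtype=float)
    mask = np.ones(values.shape, dtype=bool)
    for _ in range(max_iters):
        mu = values[mask].mean()
        sigma = values[mask].std()
        new_mask = np.abs(values - mu) <= n_sigma * sigma
        if np.array_equal(new_mask, mask):
            break  # converged: no further points rejected
        mask = new_mask
    return mask

# Illustrative example: a photometric color column with two catastrophic
# outliers (e.g. bad cross-matches) appended to a clean distribution.
rng = np.random.default_rng(42)
colors = np.concatenate([rng.normal(0.8, 0.05, 1000), [9.0, -5.0]])
keep = sigma_clip(colors)
clean = colors[keep]
```

Because the mean and standard deviation are recomputed after each rejection pass, extreme outliers that inflate the initial statistics are removed first, and milder ones can then be caught on later iterations.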
Baseline Model
- Developed a Multi-Layer Perceptron (MLP) for simultaneous multi-target regression predicting Teff, [Fe/H], and log g in a single forward pass.
- Applied stratified train-test splits across spectral types (F, G, K) so the model would not overfit to the most common stellar classes.
- Evaluated with MSE and R² computed per target parameter.
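The baseline setup can be sketched with scikit-learn (part of the project's stack). Everything here is illustrative: the data is synthetic, the architecture and hyperparameters are placeholders, and the actual model is a PyTorch MLP. What the sketch does show faithfully is the structure of the approach: one model regressing all three targets at once, a split stratified on spectral type, and per-target metrics.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPRegressor
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import mean_squared_error, r2_score

rng = np.random.default_rng(0)

# Synthetic stand-in: 8 photometric features, 3 targets
# (Teff, [Fe/H], log g), plus a spectral-type label per star
# that is used only for stratifying the split.
n = 3000
X = rng.normal(size=(n, 8))
W = rng.normal(size=(8, 3))
y = X @ W + 0.1 * rng.normal(size=(n, 3))
spectral_type = rng.choice(["F", "G", "K"], size=n, p=[0.2, 0.5, 0.3])

# Stratified split: each spectral class keeps its proportion in both sets,
# so the test metrics are not dominated by the majority class.
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, stratify=spectral_type, random_state=0
)

scaler = StandardScaler().fit(X_tr)
model = MLPRegressor(hidden_layer_sizes=(64, 64), max_iter=500, random_state=0)
model.fit(scaler.transform(X_tr), y_tr)  # one model, three targets at once

pred = model.predict(scaler.transform(X_te))
for i, name in enumerate(["Teff", "[Fe/H]", "log g"]):
    print(name,
          "MSE:", mean_squared_error(y_te[:, i], pred[:, i]),
          "R2:", r2_score(y_te[:, i], pred[:, i]))
```

MLPRegressor handles multi-output regression natively when `y` is 2-D, which is the same "single forward pass" structure the PyTorch model uses with a three-unit output layer.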
M-Dwarf Challenge & LoRA Fine-Tuning
M-dwarf stars (spectral type M) present a long-tail distribution problem: they are the most common stars in the galaxy but severely underrepresented in labeled catalogs, and their complex molecular absorption features confuse models trained on hotter F/G/K stars.
- Applied parameter-efficient fine-tuning (PEFT) via LoRA to adapt the base MLP to M-dwarf sub-populations, adding a small number of trainable low-rank adapter parameters while keeping the base weights frozen.
- Fine-tuned on a specialized low-resource M-dwarf dataset.
- Result: ~20% reduction in RMSE for Teff predictions on the M-dwarf test set, demonstrating LoRA’s effectiveness as a targeted adaptation strategy for long-tail regression problems.
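The LoRA update itself reduces to a small amount of linear algebra. Below is a minimal NumPy sketch of a single adapted layer; the project actually uses the PEFT library on top of PyTorch, and all dimensions and names here are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Frozen base layer from the pretrained MLP: W has shape (out, in).
d_in, d_out, rank, alpha = 256, 128, 8, 16
W = rng.normal(size=(d_out, d_in))  # frozen during fine-tuning

# LoRA adapters: only A and B are trained. B is initialized to zero,
# so at the start of fine-tuning the adapted layer reproduces the
# base model exactly.
A = rng.normal(scale=0.01, size=(rank, d_in))
B = np.zeros((d_out, rank))
scale = alpha / rank

def adapted_forward(x):
    # y = x W^T + scale * x A^T B^T  ==  x (W + scale * B A)^T
    return x @ W.T + scale * (x @ A.T) @ B.T

x = rng.normal(size=(4, d_in))
assert np.allclose(adapted_forward(x), x @ W.T)  # identity at init

full_params = W.size                # what full fine-tuning would train
lora_params = A.size + B.size       # what LoRA actually trains
print(f"trainable fraction: {lora_params / full_params:.3%}")
```

Because only `A` and `B` receive gradients, the fine-tune touches a small fraction of the base layer's parameters, which is what makes the approach practical on a low-resource M-dwarf dataset without risking catastrophic forgetting of the F/G/K behavior.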