ASTR199 — Stellar Parameter Prediction via Deep Learning

Multi-target MLP regression on a 1M+ row, multi-survey astronomical catalog, with LoRA fine-tuning for M-dwarf sub-populations.

Role: Machine Learning Researcher  ·  Advisor: Prof. Theissen, UC San Diego  ·  Stack: Python, SQL, PyTorch, PEFT/LoRA, Scikit-learn  ·  Year: 2026


Overview

This project develops a deep learning pipeline to predict fundamental stellar parameters — Effective Temperature (Teff), Metallicity ([Fe/H]), and Surface Gravity (log g) — from photometric catalog data. The core challenge is data quality: cross-matching disparate astronomical surveys at scale, then building models that generalize across spectral types without overfitting to the well-represented majority classes.

Data Engineering

  • Cross-survey integration: Engineered a large-scale SQL pipeline to extract and cross-match 1M+ rows across the Gaia, SDSS, and WISE catalogs, plus additional photometric surveys.
  • Applied strict inner joins to ensure only observations with complete, non-conflicting labels across all surveys were retained.
  • Implemented sigma clipping for statistical outlier filtering, ensuring high signal-to-noise inputs before any modeling began.
  • Validated data integrity by plotting color-magnitude diagrams (the observational analog of HR diagrams) and 2D color histograms to confirm expected physical correlations.
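
The outlier-filtering step above can be sketched as an iterative sigma clip. This is a minimal NumPy version for illustration (the `g_minus_r` data are made up); in practice `astropy.stats.sigma_clip` provides the same behavior with more options:

```python
import numpy as np

def sigma_clip(values, sigma=3.0, max_iters=5):
    """Iteratively mask points more than `sigma` standard deviations
    from the mean, recomputing mean/std after each pass."""
    values = np.asarray(values, dtype=float)
    mask = np.ones(values.shape, dtype=bool)  # True = keep
    for _ in range(max_iters):
        mean = values[mask].mean()
        std = values[mask].std()
        new_mask = np.abs(values - mean) <= sigma * std
        if np.array_equal(new_mask, mask):  # converged, stop early
            break
        mask = new_mask
    return mask

# Illustrative photometric color with one catastrophic outlier
g_minus_r = np.array([0.51, 0.48, 0.55, 0.50, 0.47, 0.52,
                      0.49, 0.53, 0.46, 0.50, 9.99])
keep = sigma_clip(g_minus_r, sigma=3.0)
clean = g_minus_r[keep]  # the 9.99 measurement is rejected
```

The iteration matters: a single extreme point inflates the standard deviation, so recomputing the statistics after each pass catches outliers a one-shot cut would miss.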

Baseline Model

  • Developed a Multi-Layer Perceptron (MLP) for simultaneous multi-target regression predicting Teff, [Fe/H], and log g in a single forward pass.
  • Applied stratified train-test splits across spectral types (F, G, and K stars) to keep the model from overfitting to the most common stellar classes.
  • Evaluated per-target performance with MSE and R².
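
A minimal sketch of the multi-target setup in PyTorch — layer sizes and feature count are illustrative, not the project's actual architecture. The key point is a single shared trunk with one three-unit output head, so Teff, [Fe/H], and log g come out of one forward pass:

```python
import torch
import torch.nn as nn

class StellarMLP(nn.Module):
    """Shared-trunk MLP with a single 3-unit regression head."""
    def __init__(self, n_features: int, hidden: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_features, hidden),
            nn.ReLU(),
            nn.Linear(hidden, hidden),
            nn.ReLU(),
            nn.Linear(hidden, 3),  # three targets: Teff, [Fe/H], log g
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x)

model = StellarMLP(n_features=8)        # e.g. 8 colors/magnitudes per star
x = torch.randn(32, 8)                  # a batch of 32 stars
preds = model(x)                        # shape (32, 3)
loss = nn.MSELoss()(preds, torch.randn(32, 3))  # joint MSE over all targets
```

Training on a joint MSE lets correlated targets regularize each other; in practice the targets should be standardized first so no single parameter (e.g. Teff in kelvin) dominates the loss.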

M-Dwarf Challenge & LoRA Fine-Tuning

M-dwarf stars (spectral type M) present a long-tail distribution problem: they are the most common stars in the galaxy but severely underrepresented in labeled catalogs, and their complex molecular absorption features confuse models trained on hotter F/G/K stars.

  • Applied PEFT via LoRA to adapt the base MLP for M-dwarf sub-populations, adding a small number of trainable low-rank adapter parameters without retraining the full model.
  • Fine-tuned on a specialized low-resource M-dwarf dataset.
  • Result: ~20% reduction in RMSE for Teff predictions on the M-dwarf test set, demonstrating LoRA’s effectiveness as a targeted adaptation strategy for long-tail regression problems.
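
The adaptation idea can be shown with a hand-rolled LoRA layer — a frozen base linear layer plus a trainable low-rank update W + (alpha/r)·BA. Rank, scaling, and layer sizes here are illustrative; the project used the PEFT library rather than this sketch:

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Wrap a frozen nn.Linear with trainable rank-r adapters A and B."""
    def __init__(self, base: nn.Linear, r: int = 4, alpha: float = 8.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False  # pretrained weights stay frozen
        self.A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, r))  # zero-init:
        self.scale = alpha / r       # adapted layer starts identical to base

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.base(x) + self.scale * (x @ self.A.T @ self.B.T)

base = nn.Linear(8, 64)
adapted = LoRALinear(base, r=4)
trainable = sum(p.numel() for p in adapted.parameters() if p.requires_grad)
full = sum(p.numel() for p in base.parameters())
# trainable = 4*8 + 64*4 = 288 adapter params vs 8*64 + 64 = 576 in the base
```

Because B is zero-initialized, the adapted layer reproduces the base model exactly before fine-tuning, so the M-dwarf adaptation starts from the F/G/K-trained behavior and only the small adapter matrices move.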