ASTR199 — Stellar Parameter Prediction via Deep Learning

Multi-target MLP regression on a 1M+ row, multi-survey astronomical catalog, with LoRA fine-tuning for M-dwarf sub-populations.

Role: Machine Learning Researcher  ·  Advisor: Prof. Theissen, UC San Diego  ·  Stack: Python, SQL, PyTorch, PEFT/LoRA, Scikit-learn  ·  Year: 2026


Overview

This project develops a deep learning pipeline to predict fundamental stellar parameters — Effective Temperature (Teff), Metallicity ([Fe/H]), and Surface Gravity (log g) — from photometric catalog data. The core challenge is data quality: cross-matching disparate astronomical surveys at scale, then building models that generalize across spectral types without overfitting to the well-represented majority classes.

Data Engineering

  • Cross-survey integration: Engineered a large-scale SQL pipeline to extract and cross-match 1M+ rows across the Gaia, SDSS, and WISE catalogs, plus additional photometric surveys.
  • Applied strict inner joins to ensure only observations with complete, non-conflicting labels across all surveys were retained.
  • Implemented sigma clipping for statistical outlier filtering, ensuring high signal-to-noise inputs before any modeling began.
  • Validated data integrity by plotting color-magnitude diagrams (the observational analog of HR diagrams) and 2D color histograms to confirm expected physical correlations.
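
The outlier-filtering step above can be sketched as an iterative sigma clip. This is a minimal NumPy version for illustration (the `g_minus_r` data are made up); in practice `astropy.stats.sigma_clip` provides the same behavior with more options:

```python
import numpy as np

def sigma_clip(values, sigma=3.0, max_iters=5):
    """Iteratively mask points more than `sigma` standard deviations
    from the mean, recomputing mean/std after each pass."""
    values = np.asarray(values, dtype=float)
    mask = np.ones(values.shape, dtype=bool)  # True = keep
    for _ in range(max_iters):
        mean = values[mask].mean()
        std = values[mask].std()
        new_mask = np.abs(values - mean) <= sigma * std
        if np.array_equal(new_mask, mask):  # converged, stop early
            break
        mask = new_mask
    return mask

# Illustrative photometric color with one catastrophic outlier
g_minus_r = np.array([0.51, 0.48, 0.55, 0.50, 0.47, 0.52,
                      0.49, 0.53, 0.46, 0.50, 9.99])
keep = sigma_clip(g_minus_r, sigma=3.0)
clean = g_minus_r[keep]  # the 9.99 measurement is rejected
```

The iteration matters: a single extreme point inflates the standard deviation, so recomputing the statistics after each pass catches outliers a one-shot cut would miss.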

Baseline Model

  • Developed a Multi-Layer Perceptron (MLP) for simultaneous multi-target regression predicting Teff, [Fe/H], and log g in a single forward pass.
  • Applied stratified train-test splits across spectral types (F, G, and K stars) to keep the model from overfitting to the most common stellar classes.
  • Evaluated per-target performance with MSE and R².
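
A minimal sketch of the multi-target setup in PyTorch — layer sizes and feature count are illustrative, not the project's actual architecture. The key point is a single shared trunk with one three-unit output head, so Teff, [Fe/H], and log g come out of one forward pass:

```python
import torch
import torch.nn as nn

class StellarMLP(nn.Module):
    """Shared-trunk MLP with a single 3-unit regression head."""
    def __init__(self, n_features: int, hidden: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_features, hidden),
            nn.ReLU(),
            nn.Linear(hidden, hidden),
            nn.ReLU(),
            nn.Linear(hidden, 3),  # three targets: Teff, [Fe/H], log g
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x)

model = StellarMLP(n_features=8)        # e.g. 8 colors/magnitudes per star
x = torch.randn(32, 8)                  # a batch of 32 stars
preds = model(x)                        # shape (32, 3)
loss = nn.MSELoss()(preds, torch.randn(32, 3))  # joint MSE over all targets
```

Training on a joint MSE lets correlated targets regularize each other; in practice the targets should be standardized first so no single parameter (e.g. Teff in kelvin) dominates the loss.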

M-Dwarf Challenge & LoRA Fine-Tuning

M-dwarf stars (spectral type M) present a long-tail distribution problem: they are the most common stars in the galaxy but severely underrepresented in labeled catalogs, and their complex molecular absorption features confuse models trained on hotter F/G/K stars.

  • Applied PEFT via LoRA to adapt the base MLP for M-dwarf sub-populations, adding a small number of trainable low-rank adapter parameters without retraining the full model.
  • Fine-tuned on a specialized low-resource M-dwarf dataset.
  • Result: ~20% reduction in RMSE for Teff predictions on the M-dwarf test set, demonstrating LoRA’s effectiveness as a targeted adaptation strategy for long-tail regression problems.
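
The adaptation idea can be shown with a hand-rolled LoRA layer — a frozen base linear layer plus a trainable low-rank update W + (alpha/r)·BA. Rank, scaling, and layer sizes here are illustrative; the project used the PEFT library rather than this sketch:

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Wrap a frozen nn.Linear with trainable rank-r adapters A and B."""
    def __init__(self, base: nn.Linear, r: int = 4, alpha: float = 8.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False  # pretrained weights stay frozen
        self.A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, r))  # zero-init:
        self.scale = alpha / r       # adapted layer starts identical to base

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.base(x) + self.scale * (x @ self.A.T @ self.B.T)

base = nn.Linear(8, 64)
adapted = LoRALinear(base, r=4)
trainable = sum(p.numel() for p in adapted.parameters() if p.requires_grad)
full = sum(p.numel() for p in base.parameters())
# trainable = 4*8 + 64*4 = 288 adapter params vs 8*64 + 64 = 576 in the base
```

Because B is zero-initialized, the adapted layer reproduces the base model exactly before fine-tuning, so the M-dwarf adaptation starts from the F/G/K-trained behavior and only the small adapter matrices move.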