KataGo × LLM — Explainable Go AI

Fine-tuning LLMs with RL to explain Go strategies using KataGo's superhuman analysis.

Role: Project Lead  ·  Stack: Python, C++, Qwen3-8B, Hugging Face TRL, GRPO, KataGo  ·  Year: 2025 – Present

GitHub →


Motivation

Traditional Go AIs such as KataGo are opaque: they output top-k candidate moves with win-rate estimates but cannot explain why a move fits the global context of the game, so beginner and intermediate players receive no transferable insight. This project fine-tunes a large language model to translate KataGo’s raw policy and value signals into human-interpretable strategic reasoning, lifting the bot’s effective strength from a 10k baseline to roughly 7k in real-world testing.

Technical Approach

RLAIF Fine-tuning Pipeline

  • Fine-tuned Qwen3-8B via GRPO using Hugging Face TRL on a 113k-row dataset of KataGo-annotated game positions.
  • Applied 4-bit GGUF quantization for edge deployment on an 8 GB VRAM GPU, achieving ~42.5 tok/s throughput and 0.52 s time-to-first-token (TTFT).
  • Engineered a regime-switching RL reward, adaptively overweighting rank-based signals in high-uncertainty states (win-rate ~0.5) and policy priors in deterministic positions to optimize risk-adjusted decision making.
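The regime-switching reward can be sketched as a simple uncertainty-weighted blend. This is an illustrative reconstruction, not the project's actual code: the function and argument names (`regime_switching_reward`, `rank_reward`, `policy_reward`) are assumptions, and a linear uncertainty schedule is assumed.

```python
def regime_switching_reward(winrate: float,
                            rank_reward: float,
                            policy_reward: float) -> float:
    """Blend rank-based and policy-prior rewards by position uncertainty.

    Uncertainty peaks when the win rate is near 0.5; there the rank-based
    signal is overweighted, while in near-decided positions the weight
    shifts to KataGo's policy prior.
    """
    # Uncertainty in [0, 1]: 1.0 at winrate 0.5, 0.0 at 0.0 or 1.0.
    uncertainty = 1.0 - 2.0 * abs(winrate - 0.5)
    return uncertainty * rank_reward + (1.0 - uncertainty) * policy_reward
```

At a 0.5 win rate the reward is entirely rank-based; at a decided 0.0 or 1.0 win rate it is entirely policy-prior driven, with a linear crossfade between the two regimes.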

Reward Hacking Mitigation

  • Resolved severe reward hacking across the 113k-row dataset by implementing a strict -0.5 format-validation penalty gate to enforce output legality before any score-based reward is applied.
  • Dynamically scaled policy/score-lead weights for lopsided positions (downsampling low-information data), eliminating the model’s incentive to exploit imbalanced game states.
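The penalty gate described above can be sketched as a reward wrapper that short-circuits malformed outputs before any scoring. The expected output format (`MOVE: ... | REASON: ...`) and the regex are hypothetical placeholders; only the -0.5 penalty value comes from the project write-up.

```python
import re

FORMAT_PENALTY = -0.5  # flat penalty from the write-up; gate runs before scoring

# Hypothetical completion format: a GTP board coordinate (columns A–T,
# skipping I; rows 1–19) plus an explanation. The regex is an assumption.
MOVE_PATTERN = re.compile(
    r"^MOVE:\s*[A-HJ-T](?:1[0-9]|[1-9])\s*\|\s*REASON:\s*\S"
)

def gated_reward(completion: str, score_reward) -> float:
    """Apply the format-validation gate before any score-based reward.

    Malformed outputs receive the flat -0.5 penalty and never reach the
    score-based reward, removing the incentive to hack the scorer with
    illegal or degenerate completions.
    """
    if not MOVE_PATTERN.match(completion.strip()):
        return FORMAT_PENALTY
    return score_reward(completion)
```

Because the gate returns early, the score-based reward is computed only for legal outputs, which is what makes the gate "strict" rather than an additive penalty term.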

C++ GTP Proxy

  • Engineered a C++ Go Text Protocol (GTP) proxy that bridges the Lizzie GUI and the local LLM in-process, intercepting GTP commands with zero network overhead.
  • Serializes live game states and KataGo’s analysis in real time, enabling sub-second move rationale generation during live play.
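The proxy's interception logic is roughly the following; this is a minimal Python sketch of the control flow (the real implementation is C++). `engine` stands in for the KataGo process and `explain` for the LLM call; both are placeholders.

```python
def proxy_command(cmd: str, engine, explain) -> str:
    """Forward one GTP command; attach an LLM rationale to genmove replies.

    GTP success replies start with '='. The rationale is appended on a
    '#' line, which GTP treats as a comment, so GUIs like Lizzie still
    parse the underlying reply normally (an assumption of this sketch).
    """
    reply = engine(cmd)                    # pass-through to the engine
    verb = cmd.split()[0] if cmd.split() else ""
    if verb == "genmove" and reply.startswith("="):
        move = reply[1:].strip()
        rationale = explain(move)          # LLM call (placeholder)
        return f"{reply}\n# {rationale}"
    return reply
```

Every command other than `genmove` is forwarded untouched, which keeps the proxy transparent to the GUI and confines the LLM round-trip to move generation.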

Demo: Auto-play on Lizzie

Auto-play on Lizzie: Human vs. Model Gameplay