KataGo × LLM — Explainable Go AI

Fine-tuning LLMs with RL to explain Go strategies using KataGo's superhuman analysis.

Role: Project Lead  ·  Stack: Python, C++, Qwen3-8B, Hugging Face TRL, GRPO, KataGo  ·  Year: 2025 – Present

GitHub →


Motivation

Traditional Go AIs such as KataGo are opaque: they output top-k candidate moves with win-rate estimates but cannot explain why a move fits the global context of the game, so beginner and intermediate players receive no transferable insight. This project fine-tunes a large language model to translate KataGo’s raw policy and value signals into human-interpretable strategic reasoning, lifting the bot’s effective strength from a 10k baseline to roughly 7k in real-world testing.

Technical Approach

RLAIF Fine-tuning Pipeline

  • Fine-tuned Qwen3-8B via GRPO using Hugging Face TRL on a 113k-row dataset of KataGo-annotated game positions.
  • Applied 4-bit GGUF quantization for edge deployment on an 8 GB VRAM GPU, achieving ~42.5 tok/s throughput and 0.52 s time-to-first-token (TTFT).
  • Engineered a regime-switching RL reward, adaptively overweighting rank-based signals in high-uncertainty states (win-rate ~0.5) and policy priors in deterministic positions to optimize risk-adjusted decision making.
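The regime-switching reward can be sketched as a simple uncertainty-weighted blend. This is an illustrative reconstruction, not the project's actual code: the function and argument names (`regime_switching_reward`, `rank_reward`, `policy_reward`) are assumptions, and a linear uncertainty schedule is assumed.

```python
def regime_switching_reward(winrate: float,
                            rank_reward: float,
                            policy_reward: float) -> float:
    """Blend rank-based and policy-prior rewards by position uncertainty.

    Uncertainty peaks when the win rate is near 0.5; there the rank-based
    signal is overweighted, while in near-decided positions the weight
    shifts to KataGo's policy prior.
    """
    # Uncertainty in [0, 1]: 1.0 at winrate 0.5, 0.0 at 0.0 or 1.0.
    uncertainty = 1.0 - 2.0 * abs(winrate - 0.5)
    return uncertainty * rank_reward + (1.0 - uncertainty) * policy_reward
```

At a 0.5 win rate the reward is entirely rank-based; at a decided 0.0 or 1.0 win rate it is entirely policy-prior driven, with a linear crossfade between the two regimes.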

Reward Hacking Mitigation

  • Resolved severe reward hacking across the 113k-row dataset by implementing a strict -0.5 format-validation penalty gate to enforce output legality before any score-based reward is applied.
  • Dynamically scaled policy/score-lead weights for lopsided positions (downsampling low-information data), eliminating the model’s incentive to exploit imbalanced game states.
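The penalty gate described above can be sketched as a reward wrapper that short-circuits malformed outputs before any scoring. The expected output format (`MOVE: ... | REASON: ...`) and the regex are hypothetical placeholders; only the -0.5 penalty value comes from the project write-up.

```python
import re

FORMAT_PENALTY = -0.5  # flat penalty from the write-up; gate runs before scoring

# Hypothetical completion format: a GTP board coordinate (columns A–T,
# skipping I; rows 1–19) plus an explanation. The regex is an assumption.
MOVE_PATTERN = re.compile(
    r"^MOVE:\s*[A-HJ-T](?:1[0-9]|[1-9])\s*\|\s*REASON:\s*\S"
)

def gated_reward(completion: str, score_reward) -> float:
    """Apply the format-validation gate before any score-based reward.

    Malformed outputs receive the flat -0.5 penalty and never reach the
    score-based reward, removing the incentive to hack the scorer with
    illegal or degenerate completions.
    """
    if not MOVE_PATTERN.match(completion.strip()):
        return FORMAT_PENALTY
    return score_reward(completion)
```

Because the gate returns early, the score-based reward is computed only for legal outputs, which is what makes the gate "strict" rather than an additive penalty term.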

C++ GTP Proxy

  • Engineered a C++ Go Text Protocol (GTP) proxy that bridges the Lizzie GUI and the local LLM in-process, intercepting GTP commands with zero network overhead.
  • Serializes live game states and KataGo’s analysis in real time, enabling sub-second move rationale generation during live play.
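The proxy's interception logic is roughly the following; this is a minimal Python sketch of the control flow (the real implementation is C++). `engine` stands in for the KataGo process and `explain` for the LLM call; both are placeholders.

```python
def proxy_command(cmd: str, engine, explain) -> str:
    """Forward one GTP command; attach an LLM rationale to genmove replies.

    GTP success replies start with '='. The rationale is appended on a
    '#' line, which GTP treats as a comment, so GUIs like Lizzie still
    parse the underlying reply normally (an assumption of this sketch).
    """
    reply = engine(cmd)                    # pass-through to the engine
    verb = cmd.split()[0] if cmd.split() else ""
    if verb == "genmove" and reply.startswith("="):
        move = reply[1:].strip()
        rationale = explain(move)          # LLM call (placeholder)
        return f"{reply}\n# {rationale}"
    return reply
```

Every command other than `genmove` is forwarded untouched, which keeps the proxy transparent to the GUI and confines the LLM round-trip to move generation.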

Demo: Auto-play on Lizzie

Auto-play on Lizzie: Human vs. Model Gameplay