Hey! I'm Alex Wa, a second-year Math and CS major at Yale and a YES Scholar. My research currently spans RL and NLP, and I'm also interested in LLM architecture and ML systems.
Currently, I'm developing RL environments in Prime Intellect's RL Residency and researching RL4LLMs and rubrics as rewards with the Yale NLP Lab. Previously, I've done research on rubrics (Judgment Labs), geometric algebra (APOLLO Labs), algebraic topology (SUMaC '23), abstract algebra (SUMaC '22), and biostatistics (Emory).
In my free time, I draw, run The Veritas Search, play board games like Catan, and enjoy photography.
Posts
[WIP] frontier model training methodologies
How do labs train a frontier, multi-billion-parameter model? We look to Hugging Face’s SmolLM3, Prime Intellect’s INTELLECT-3, Nous Research’s Hermes 4, OpenAI’s gpt-oss-120b, Moonshot AI’s Kimi K2, DeepSeek’s DeepSeek-R1, and Qwen’s Qwen3. This post is an attempt to distill the techniques, motivations, and considerations used to train these models, with an emphasis on training methodology over infrastructure.
activation engineering for privacy protection in LLMs
LLMs trained on web-scale corpora inadvertently memorize and leak personally identifiable information (PII) present in their training data. We investigate inference-time interventions to suppress this privacy leakage. We evaluate three editing strategies: activation patching with computed steering vectors (APNEAP), random Gaussian noise steering, and Spectral Editing of Activations (SEA). Using the Enron email corpus with GPT-Neo-1.3B and finetuned Qwen3-8B-enron, we measure targeted PII suppression via exposure metrics like mean reciprocal rank (MRR), and utility via perplexity.
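To give a taste of the mechanism behind these interventions, here is a minimal sketch of inference-time activation steering in PyTorch: a forward hook that adds a precomputed vector to one layer's hidden states. The helper name `add_steering_hook`, the `alpha` scale, and the assumption that `steering_vector` is already computed (e.g., a mean-activation difference between PII and non-PII prompts) are illustrative, not the exact APNEAP or SEA implementations from the post.

```python
import torch

def add_steering_hook(layer: torch.nn.Module,
                      steering_vector: torch.Tensor,
                      alpha: float = 4.0):
    """Shift `layer`'s output by `alpha * steering_vector` at inference time.

    Illustrative sketch: `steering_vector` is assumed precomputed
    (e.g., a difference of mean activations on PII vs. non-PII prompts).
    """
    def hook(module, inputs, output):
        # Transformer blocks often return a tuple; hidden states come first.
        hidden = output[0] if isinstance(output, tuple) else output
        hidden = hidden + alpha * steering_vector.to(hidden.device, hidden.dtype)
        return (hidden, *output[1:]) if isinstance(output, tuple) else hidden

    # Keep the handle; call handle.remove() to restore the unsteered model.
    return layer.register_forward_hook(hook)
```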
combinatorial reasoning environments for LLMs and RL
Can RL agents learn to play spatial reasoning puzzle games as well as, or better than, LLMs? We build a complete RL pipeline: an environment for fruit box (a grid-based reasoning game) written with Prime Intellect’s verifiers library, benchmarks of LLMs like gpt-5.1 and gemini-3-pro, and RL agents trained to play with SFT and GRPO. Repo here.
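The heart of GRPO is its critic-free advantage estimate: sample a group of completions per prompt, score each with the environment's reward, and normalize within the group. A minimal sketch of that normalization step (the `grpo_advantages` name is mine; the verifiers library and the training loop handle everything around it):

```python
import torch

def grpo_advantages(rewards: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """Group-relative advantages: each completion's reward is normalized
    against the other completions sampled for the same prompt (one group
    per row)."""
    mean = rewards.mean(dim=-1, keepdim=True)
    std = rewards.std(dim=-1, keepdim=True)
    return (rewards - mean) / (std + eps)

# Example: 2 prompts, 4 sampled completions each, binary verifier rewards.
rewards = torch.tensor([[1.0, 0.0, 0.0, 1.0],
                        [0.0, 0.0, 0.0, 1.0]])
print(grpo_advantages(rewards))
```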
whirlwind of PPO and RLHF for LLMs from scratch
RLHF with PPO from scratch, plus lots of fine-tuning of GPT-2 models for movie sentiment classification. Transformer environments, adaptive KL control, logit/temperature scaling, whitening, and more. Full implementation here.
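For instance, adaptive KL control follows Ziegler et al. (2019): measure the policy-to-reference KL each batch and nudge the KL penalty coefficient up or down so the measured KL tracks a target. A compact sketch of that controller (class and parameter names here are illustrative, not the repo's exact API):

```python
class AdaptiveKLController:
    """Adaptive KL coefficient (Ziegler et al., 2019): raise the penalty
    when measured KL overshoots the target, lower it when it undershoots."""

    def __init__(self, init_kl_coef: float, target: float, horizon: int):
        self.kl_coef = init_kl_coef  # current penalty weight
        self.target = target         # desired policy/reference KL
        self.horizon = horizon       # controls how fast the coef adapts

    def update(self, current_kl: float, n_steps: int) -> float:
        # Proportional error, clipped to [-0.2, 0.2] for stability.
        error = max(min(current_kl / self.target - 1.0, 0.2), -0.2)
        self.kl_coef *= 1.0 + error * n_steps / self.horizon
        return self.kl_coef

# Example: KL above target, so the coefficient ticks upward.
ctl = AdaptiveKLController(init_kl_coef=0.2, target=6.0, horizon=10_000)
print(ctl.update(current_kl=9.0, n_steps=256))
```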