Posts by Collection

publications

Non-negative matrix factorization algorithms greatly improve topic model fits

Published in arXiv preprint, 2021

We report on the potential for using algorithms for non-negative matrix factorization (NMF) to improve parameter estimation in topic models. While several papers have studied connections between NMF and topic models, none have suggested leveraging these connections to develop new algorithms for fitting topic models. NMF avoids the “sum-to-one” constraints on the topic model parameters, resulting in an optimization problem with simpler structure and more efficient computations. Building on recent advances in optimization algorithms for NMF, we show that first solving the NMF problem then recovering the topic model fit can produce remarkably better fits, and in less time, than standard algorithms for topic models. While we focus primarily on maximum likelihood estimation, we show that this approach also has the potential to improve variational inference for topic models. Our methods are implemented in the R package fastTopics.
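As a rough illustration of the "recover the topic model fit" step, the sketch below rescales a non-negative factorization so that it satisfies the sum-to-one constraints. It assumes a Poisson-style NMF in which the count matrix is approximated by `L @ F.T` with strictly positive column sums in `F` and row sums in `L`. The function name and variable names are hypothetical and are not taken from fastTopics.

```python
import numpy as np

def nmf_to_topic_model(L, F):
    """Rescale an NMF fit (L: n x K, F: m x K, both non-negative) so that
    L @ F.T == diag(s) @ L_topic @ F_topic.T, where every column of F_topic
    and every row of L_topic sums to one. Hypothetical helper for illustration."""
    L = np.asarray(L, dtype=float)
    F = np.asarray(F, dtype=float)
    u = F.sum(axis=0)                         # scale of each factor (length K)
    F_topic = F / u                           # word probabilities within each topic
    L_scaled = L * u                          # move the factor scales into the loadings
    s = L_scaled.sum(axis=1, keepdims=True)   # per-document size factors
    L_topic = L_scaled / s                    # topic proportions within each document
    return L_topic, F_topic, s.ravel()

# Small synthetic check with strictly positive entries.
rng = np.random.default_rng(0)
L_demo = rng.uniform(0.1, 1.0, size=(5, 3))
F_demo = rng.uniform(0.1, 1.0, size=(7, 3))
L_topic, F_topic, s = nmf_to_topic_model(L_demo, F_demo)
```

The rescaling only moves scale between the two factors, so the fitted rate matrix is unchanged up to the per-document size factors `s`.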

Recommended citation: Peter Carbonetto, Abhishek Sarkar, Zihao Wang, Matthew Stephens. "Non-negative matrix factorization algorithms greatly improve topic model fits." arXiv preprint arXiv:2105.13440 (2021). https://arxiv.org/abs/2105.13440

A Unified Causal View of Domain Invariant Representation Learning

Published in arXiv preprint, 2022

Machine learning methods can be unreliable when deployed in domains that differ from the domains on which they were trained. To address this, we may wish to learn representations of data that are domain-invariant in the sense that we preserve data structure that is stable across domains, but throw out spuriously-varying parts. There are many representation-learning approaches of this type, including methods based on data augmentation, distributional invariances, and risk invariance. Unfortunately, when faced with any particular real-world domain shift, it is unclear which, if any, of these methods might be expected to work. The purpose of this paper is to show how the different methods relate to each other, and clarify the real-world circumstances under which each is expected to succeed. The key tool is a new notion of domain shift relying on the idea that causal relationships are invariant, but non-causal relationships (e.g., due to confounding) may vary.

Recommended citation: Zihao Wang and Victor Veitch. "A Unified Causal View of Domain Invariant Representation Learning." arXiv preprint arXiv:2208.06987 (2022). https://arxiv.org/abs/2208.06987

Concept Algebra for (Score-Based) Text-Controlled Generative Models

Published in Thirty-seventh Conference on Neural Information Processing Systems, 2023

This paper concerns the structure of learned representations in text-guided generative models, focusing on score-based models. A key property of such models is that they can compose disparate concepts in a ‘disentangled’ manner. This suggests these models have internal representations that encode concepts in a ‘disentangled’ manner. Here, we focus on the idea that concepts are encoded as subspaces of some representation space. We formalize what this means, show there’s a natural choice for the representation, and develop a simple method for identifying the part of the representation corresponding to a given concept. In particular, this allows us to manipulate the concepts expressed by the model through algebraic manipulation of the representation. We demonstrate the idea with examples using Stable Diffusion.

Recommended citation: Zihao Wang, Lin Gui, Jeffrey Negrea, and Victor Veitch. "Concept Algebra for (Score-Based) Text-Controlled Generative Models." https://openreview.net/pdf?id=SGlrCuwdsB

Transforming and Combining Rewards for Aligning Large Language Models

Published in Proceedings of the 41st International Conference on Machine Learning, 2024

A common approach for aligning language models to human preferences is to first learn a reward model from preference data, and then use this reward model to update the language model. We study two closely related problems that arise in this approach. First, any monotone transformation of the reward model preserves preference ranking; is there a choice that is “better” than others? Second, we often wish to align language models to multiple properties: how should we combine multiple reward models? Using a probabilistic interpretation of the alignment procedure, we identify a natural choice for transformation for (the common case of) rewards learned from Bradley-Terry preference models. This derived transformation has two important properties. First, it emphasizes improving poorly-performing outputs, rather than outputs that already score well. This mitigates both underfitting (where some prompts are not improved) and reward hacking (where the model learns to exploit misspecification of the reward model). Second, it enables principled aggregation of rewards by linking summation to logical conjunction: the sum of transformed rewards corresponds to the probability that the output is “good” in all measured properties, in a sense we make precise. Experiments aligning language models to be both helpful and harmless using RLHF show substantial improvements over the baseline (non-transformed) approach.
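The link between summation and conjunction can be sketched in a few lines. The abstract does not spell out the transformation, so the code assumes one plausible form: the log-sigmoid of the reward minus a per-prompt reference value, which under a Bradley-Terry model is the log-probability that the output is preferred to the reference. Treating the properties as independent is a further assumption, and all function names are hypothetical.

```python
import math

def log_sigmoid(z):
    """Numerically stable log(sigmoid(z))."""
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))

def transformed_reward(reward, reference_reward):
    # Assumed form: log-probability (under Bradley-Terry) that the output
    # beats a reference output whose reward is `reference_reward`.
    return log_sigmoid(reward - reference_reward)

def combined_reward(rewards, reference_rewards):
    # Summing log-probabilities multiplies the probabilities, so the sum is
    # the log-probability that the output is "good" on every property,
    # assuming the properties are independent.
    return sum(
        transformed_reward(r, ref) for r, ref in zip(rewards, reference_rewards)
    )

# Hypothetical helpfulness and harmlessness scores for one response.
score = combined_reward([1.5, -0.5], [0.0, 0.0])
```

Because the log-sigmoid is concave and bounded above by zero, raising a low reward by one unit changes the transformed value far more than raising an already high reward by the same amount, which matches the emphasis on poorly-performing outputs described above.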

Recommended citation: Zihao Wang, Chirag Nagpal, Jonathan Berant, Jacob Eisenstein, Alex D'Amour, Sanmi Koyejo, and Victor Veitch. "Transforming and Combining Rewards for Aligning Large Language Models." https://openreview.net/pdf/1057e32f12e63e37ed2cead5d230e21b6fc1a66f.pdf

Does Editing Provide Evidence for Localization?

Published in ICML 2024 Workshop on Mechanistic Interpretability, 2024

A basic aspiration for interpretability research in large language models is to “localize” semantically meaningful behaviors to particular components within the LLM. There are various heuristics for finding candidate locations within the LLM. Once a candidate localization is found, it can be assessed by editing the internal representations at the corresponding localization and checking whether this induces model behavior that is consistent with the semantic interpretation of the localization. The question we address here is: how strong is the evidence provided by such edits? To assess localization, we want to assess the effect of the optimal intervention at a particular location. The key new technical tool is a way of adapting LLM alignment techniques to find such optimal localized edits. With this tool in hand, we give an example where the edit-based evidence for localization appears strong, but where localization clearly fails. Indeed, we find that optimal edits at random localizations can be as effective as aligning the full model. In aggregate, our results suggest that merely observing that localized edits induce targeted changes in behavior provides little to no evidence that these locations actually encode the target behavior.

Recommended citation: Zihao Wang and Victor Veitch. "Does Editing Provide Evidence for Localization?" https://openreview.net/pdf?id=oZXcwWTCfe

Chasing the Tail: Effective Rubric-based Reward Modeling for Large Language Model Post-Training

Published in International Conference on Learning Representations (ICLR), 2026

Reinforcement fine-tuning often suffers from reward over-optimization, where a policy model learns to hack reward signals and receive high scores while producing lower-quality outputs. We show that a key issue is reward misspecification in the high-reward tail: the reward model struggles to distinguish excellent responses from merely great ones. To address this, we study rubric-based rewards, which can leverage off-policy examples while remaining less sensitive to their artifacts. We introduce a workflow for eliciting rubrics that capture distinctions among strong and diverse responses, and empirically show that rubric-based rewards substantially mitigate reward over-optimization and improve LLM post-training.

Recommended citation: Junkai Zhang^*, Zihao Wang^*, Lin Gui^*, Swarnashree Mysore Sathyendra, Jaehwan Jeong, Victor Veitch, Wei Wang, Yunzhong He, Bing Liu, and Lifeng Jin. "Chasing the Tail: Effective Rubric-based Reward Modeling for Large Language Model Post-Training." International Conference on Learning Representations (ICLR), 2026. ^*Equal contribution. https://openreview.net/forum?id=pBjy4ek2QV

Reward Hacking in Rubric-Based Reinforcement Learning

Published in arXiv preprint, 2026

Rubric-based rewards are useful for reinforcement learning in open-ended settings, but they can introduce new reward-hacking risks. We study how policy optimization exploits rubric-based verifiers, distinguishing failures caused by weak verification from limitations in rubric design itself. Across medical and science tasks, we find that stronger verifiers reduce but do not eliminate exploitation, and that incomplete rubrics can still drive gains on rubric criteria while degrading broader response quality. We also introduce a verifier-free diagnostic based on policy log-probabilities for tracking when training quality stops improving.

Recommended citation: Anas Mahmoud, MohammadHossein Rezaei, Zihao Wang, Anisha Gunjal, Bing Liu, and Yunzhong He. "Reward Hacking in Rubric-Based Reinforcement Learning." arXiv preprint arXiv:2605.12474 (2026). https://arxiv.org/abs/2605.12474

Online Rubrics Elicitation from Pairwise Comparisons

Published in Proceedings of the 43rd International Conference on Machine Learning (ICML), 2026

Rubrics provide a flexible way to train LLMs on open-ended long-form answers where verifiable rewards are not available and human preferences provide only coarse signals. Static rubrics can become vulnerable to reward hacking and fail to capture new desiderata that emerge during training. We introduce Online Rubrics Elicitation, a method that dynamically curates evaluation criteria through pairwise comparisons between responses from current and reference policies. This online process continuously identifies and mitigates errors as training proceeds, yielding consistent improvements over training exclusively with static rubrics.

Recommended citation: MohammadHossein Rezaei, Robert Vacareanu, Zihao Wang, Clinton Wang, Bing Liu, Yunzhong He, and Afra Feyza Akyürek. "Online Rubrics Elicitation from Pairwise Comparisons." Proceedings of the 43rd International Conference on Machine Learning (ICML), 2026. https://icml.cc/virtual/2026/poster/65623
