Posts by Collection

portfolio

publications

Non-negative matrix factorization algorithms greatly improve topic model fits

Published in arXiv preprint, 2021

We report on the potential for using algorithms for non-negative matrix factorization (NMF) to improve parameter estimation in topic models. While several papers have studied connections between NMF and topic models, none have suggested leveraging these connections to develop new algorithms for fitting topic models. NMF avoids the “sum-to-one” constraints on the topic model parameters, resulting in an optimization problem with simpler structure and more efficient computations. Building on recent advances in optimization algorithms for NMF, we show that first solving the NMF problem then recovering the topic model fit can produce remarkably better fits, and in less time, than standard algorithms for topic models. While we focus primarily on maximum likelihood estimation, we show that this approach also has the potential to improve variational inference for topic models. Our methods are implemented in the R package fastTopics.

Recommended citation: Peter Carbonetto, Abhishek Sarkar, Zihao Wang, Matthew Stephens. "Non-negative matrix factorization algorithms greatly improve topic model fits." arXiv preprint arXiv:2105.13440 (2021). https://arxiv.org/abs/2105.13440
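
The "fit NMF first, then recover the topic model" recipe described in the abstract can be illustrated with a short sketch. The code below is not the paper's fastTopics (R) implementation; it is a minimal Python illustration using scikit-learn's KL-divergence NMF (which matches a Poisson likelihood up to constants), assuming the standard correspondence between a Poisson NMF fit X ≈ LFᵀ and the multinomial topic model, in which each topic's word weights are normalized to a probability vector and the scale is absorbed into the document loadings.

```python
# Minimal sketch: fit NMF, then rescale to recover topic-model parameters.
# NOT the paper's fastTopics implementation; assumptions noted above.
import numpy as np
from sklearn.decomposition import NMF

X = np.random.poisson(1.0, size=(200, 500))   # toy document-by-word count matrix
K = 10                                        # number of topics

nmf = NMF(n_components=K, beta_loss="kullback-leibler", solver="mu",
          init="nndsvda", max_iter=500)
L = nmf.fit_transform(X)      # documents x topics (unconstrained scale)
F = nmf.components_.T         # words x topics (unconstrained scale)

# Recover topic-model parameters: normalize each topic's word weights to a
# probability vector, and absorb the topic scales into the document loadings.
scale = F.sum(axis=0)
word_topic = F / scale                                        # columns sum to 1
doc_topic = L * scale
doc_topic = doc_topic / doc_topic.sum(axis=1, keepdims=True)  # rows sum to 1
```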

A Unified Causal View of Domain Invariant Representation Learning

Published in arXiv preprint, 2022

Machine learning methods can be unreliable when deployed in domains that differ from the domains on which they were trained. To address this, we may wish to learn representations of data that are domain-invariant in the sense that we preserve data structure that is stable across domains, but throw out spuriously-varying parts. There are many representation-learning approaches of this type, including methods based on data augmentation, distributional invariances, and risk invariance. Unfortunately, when faced with any particular real-world domain shift, it is unclear which, if any, of these methods might be expected to work. The purpose of this paper is to show how the different methods relate to each other, and clarify the real-world circumstances under which each is expected to succeed. The key tool is a new notion of domain shift relying on the idea that causal relationships are invariant, but non-causal relationships (e.g., due to confounding) may vary.

Recommended citation: Zihao Wang and Victor Veitch. "A Unified Causal View of Domain Invariant Representation Learning." arXiv preprint arXiv:2208.06987 (2022). https://arxiv.org/abs/2208.06987

Concept Algebra for (Score-Based) Text-Controlled Generative Models

Published in Thirty-seventh Conference on Neural Information Processing Systems, 2023

This paper concerns the structure of learned representations in text-guided generative models, focusing on score-based models. A key property of such models is that they can compose disparate concepts in a ‘disentangled’ manner. This suggests these models have internal representations that encode concepts in a ‘disentangled’ manner. Here, we focus on the idea that concepts are encoded as subspaces of some representation space. We formalize what this means, show there’s a natural choice for the representation, and develop a simple method for identifying the part of the representation corresponding to a given concept. In particular, this allows us to manipulate the concepts expressed by the model through algebraic manipulation of the representation. We demonstrate the idea with examples using Stable Diffusion.

Recommended citation: Zihao Wang, Lin Gui, Jeffrey Negrea, and Victor Veitch. "Concept Algebra for (Score-Based) Text-Controlled Generative Models." https://openreview.net/pdf?id=SGlrCuwdsB
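
The "concepts as subspaces" idea lends itself to a small illustration. The sketch below is not the paper's procedure for choosing the representation or identifying the subspace; it only shows the algebraic edit itself, assuming an orthonormal basis Q for a concept subspace is already in hand (for example, estimated from prompts that vary only in that concept).

```python
# Illustrative sketch of editing a concept encoded as a linear subspace of a
# representation. Assumes an orthonormal basis Q for the subspace is given.
import numpy as np

def edit_concept(rep, rep_target, Q):
    """Replace rep's component in span(Q) with that of rep_target,
    leaving the orthogonal complement of the subspace untouched."""
    P = Q @ Q.T                        # projector onto the concept subspace
    return rep - P @ rep + P @ rep_target

# Toy usage: a 2-dimensional concept subspace inside a 64-dimensional space.
rng = np.random.default_rng(0)
Q, _ = np.linalg.qr(rng.standard_normal((64, 2)))
rep, rep_target = rng.standard_normal(64), rng.standard_normal(64)
edited = edit_concept(rep, rep_target, Q)
```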

Transforming and Combining Rewards for Aligning Large Language Models

Published in Proceedings of the 41st International Conference on Machine Learning, 2024

A common approach for aligning language models to human preferences is to first learn a reward model from preference data, and then use this reward model to update the language model. We study two closely related problems that arise in this approach. First, any monotone transformation of the reward model preserves preference ranking; is there a choice that is “better” than others? Second, we often wish to align language models to multiple properties: how should we combine multiple reward models? Using a probabilistic interpretation of the alignment procedure, we identify a natural choice for transformation for (the common case of) rewards learned from Bradley-Terry preference models. This derived transformation has two important properties. First, it emphasizes improving poorly-performing outputs, rather than outputs that already score well. This mitigates both underfitting (where some prompts are not improved) and reward hacking (where the model learns to exploit misspecification of the reward model). Second, it enables principled aggregation of rewards by linking summation to logical conjunction: the sum of transformed rewards corresponds to the probability that the output is “good” in all measured properties, in a sense we make precise. Experiments aligning language models to be both helpful and harmless using RLHF show substantial improvements over the baseline (non-transformed) approach.

Recommended citation: Zihao Wang, Chirag Nagpal, Jonathan Berant, Jacob Eisenstein, Alex D'Amour, Sanmi Koyejo, and Victor Veitch. "Transforming and Combining Rewards for Aligning Large Language Models." https://openreview.net/pdf/1057e32f12e63e37ed2cead5d230e21b6fc1a66f.pdf
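
The "transform, then sum" aggregation described in the abstract can be sketched as follows. The snippet assumes a log-sigmoid transform of each Bradley-Terry reward centered at a per-prompt reference reward, so that the sum of transformed rewards is the log-probability (under independence) that the output beats the reference on every measured property; the specific choice of reference rewards here is an illustrative placeholder, not the paper's.

```python
# Hedged sketch of transforming and combining Bradley-Terry rewards.
# The per-prompt reference rewards are illustrative placeholders.
import numpy as np

def log_sigmoid(x):
    # Numerically stable log(sigmoid(x)) = -log(1 + exp(-x)).
    return -np.logaddexp(0.0, -x)

def aggregated_reward(rewards, reference_rewards):
    """rewards, reference_rewards: per-property rewards for one prompt/response."""
    r = np.asarray(rewards, dtype=float)
    r_ref = np.asarray(reference_rewards, dtype=float)
    # Sum of log-sigmoid-transformed rewards = log of the product of
    # per-property "better than reference" probabilities.
    return np.sum(log_sigmoid(r - r_ref))

# Toy usage: helpfulness and harmlessness rewards for a single response.
print(aggregated_reward([1.2, 0.3], [0.0, 0.5]))
```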

Does Editing Provide Evidence for Localization?

Published in ICML 2024 Workshop on Mechanistic Interpretability, 2024

A basic aspiration for interpretability research in large language models is to “localize” semantically meaningful behaviors to particular components within the LLM. There are various heuristics for finding candidate locations within the LLM. Once a candidate localization is found, it can be assessed by editing the internal representations at the corresponding localization and checking whether this induces model behavior that is consistent with the semantic interpretation of the localization. The question we address here is: how strong is the evidence provided by such edits? To assess localization, we want to assess the effect of the optimal intervention at a particular location. The key new technical tool is a way of adapting LLM alignment techniques to find such optimal localized edits. With this tool in hand, we give an example where the edit-based evidence for localization appears strong, but where localization clearly fails. Indeed, we find that optimal edits at *random* localizations can be as effective as aligning the full model. In aggregate, our results suggest that merely observing that localized edits induce targeted changes in behavior provides little to no evidence that these locations actually encode the target behavior.

Recommended citation: Zihao Wang and Victor Veitch. "Does Editing Provide Evidence for Localization?" https://openreview.net/pdf?id=oZXcwWTCfe

talks

teaching

Teaching experience 1

Undergraduate course, University 1, Department, 2014

This is a description of a teaching experience. You can use markdown like any other post.

Teaching experience 2

Workshop, University 1, Department, 2015

This is a description of a teaching experience. You can use markdown like any other post.