Valentin Hofmann is a final-year DPhil student at the University of Oxford and a research assistant at LMU Munich. His work broadly focuses on the intersection of natural language processing, linguistics, and computational social science, with specific interests in tokenization, socially and temporally aware language models, and graph-based methods. He has previously spent time as a research intern at DeepMind and as a visiting scholar at Stanford University.
Language models (LMs) like ChatGPT have achieved unprecedented levels of performance in natural language processing. One common characteristic of these models is that they segment text into a sequence of tokens from a fixed-size vocabulary, a step commonly referred to as tokenization.
In this talk, I will take a closer look at how the linguistic properties of tokenization affect the way LMs process complex words (e.g., “superbizarre”). I will first give an overview of different forms of complex word processing in humans and AI systems. I will then present recent computational studies showing that the tokenization used by LMs can produce linguistically invalid segmentations (e.g., “superb-iza-rre”) that severely affect how LMs interpret complex words. Finally, I will discuss potential solutions to this problem.
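To make the “superb-iza-rre” example concrete, here is a minimal sketch of how a greedy longest-match-first tokenizer (similar in spirit to WordPiece, but simplified) can split a word across morpheme boundaries. The function name `greedy_tokenize` and the toy vocabulary are assumptions made for illustration; they are not taken from the talk or from any real model's vocabulary.

```python
def greedy_tokenize(word, vocab):
    """Split `word` by repeatedly taking the longest vocabulary entry
    that matches at the current position (greedy longest-match-first)."""
    tokens = []
    start = 0
    while start < len(word):
        end = len(word)
        # Shrink the candidate span until it is found in the vocabulary.
        while end > start and word[start:end] not in vocab:
            end -= 1
        if end == start:
            # Nothing matched: fall back to a single character.
            end = start + 1
        tokens.append(word[start:end])
        start = end
    return tokens


# Hypothetical toy vocabulary (an assumption, not a real LM vocabulary).
TOY_VOCAB = {"super", "superb", "bizarre", "iza", "rre"}

print("-".join(greedy_tokenize("superbizarre", TOY_VOCAB)))
```

Because “superb” is longer than “super”, the greedy strategy consumes the “b” of “bizarre” first, so the morphologically valid split “super-bizarre” is never reached even though both pieces are in the vocabulary.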