April 1, 2026 · 6 min read

The Perplexity Trap: Why Molecular Language Models Need to Rethink Pretraining Objectives

Introduction

The hunt for novel therapeutic molecules confronts one of computational chemistry's most daunting statistics: the space of synthesizable drug-like molecules spans an estimated 10^23 to 10^60 candidates. Navigating this expanse requires generative models that can efficiently explore chemical space while respecting the hard constraints of molecular validity and synthetic accessibility. In this context, Molecular Large Language Models (Mol-LLMs) have emerged as a scalable paradigm, treating molecular structures as strings (SMILES, SELFIES, or SAFE formats) and applying autoregressive transformers to generate novel compounds.

The recent work by Chitsaz et al., "NovoMolGen: Rethinking Molecular Language Model Pretraining," presents the most systematic investigation to date of how standard NLP pretraining practices transfer to molecular generation. Pretraining on 1.5 billion molecules, the authors achieve state-of-the-art results in both unconstrained and goal-directed generation. Yet beneath these headline results lies a more troubling finding: performance metrics measured during pretraining, such as perplexity, correlate only weakly with downstream molecular optimization performance. This observation suggests that the field's current reliance on next-token prediction objectives may constitute a fundamental misalignment between how we train models and what we actually need them to do.

The Syntax-Function Disconnect

NovoMolGen rigorously examines the pipeline from raw molecular strings to functional molecules, testing four representations (SMILES, DeepSMILES, SELFIES, and SAFE) and two tokenization strategies (Byte-Pair Encoding and atomwise tokenization). The scale of the study is unprecedented: models ranging from small architectures to large transformers, trained on 1.5 billion molecules drawn from GDB-13 and ZINC. This scale allows the authors to probe a question that has lingered at the margins of the field: does optimizing for next-token prediction on chemical strings actually produce models capable of generating molecules with desired biological properties?

Their answer is nuanced but concerning. While pretraining loss continues to decrease with scale, as expected from neural scaling laws, this improvement does not reliably translate to better performance on property optimization tasks. In natural language processing, lower perplexity generally signals better linguistic understanding and improved downstream task performance. In molecular language modeling, however, chemical syntax and chemical function appear to be distinct optimization landscapes. A model can become expert at predicting the next character in a SMILES string, mastering the grammar of aromatic rings and bond notation, while remaining agnostic to whether the resulting molecule is synthesizable, stable, or biologically active.
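To make the disconnect concrete, recall that perplexity is just the exponentiated mean negative log-likelihood per token. A minimal sketch (my own illustration, not code from the paper):

```python
import math

def perplexity(token_log_probs):
    """Perplexity = exp of the mean negative log-likelihood per token."""
    nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(nll)

# Toy case: a model assigning probability 0.5 to each of four SMILES tokens.
logps = [math.log(0.5)] * 4
print(perplexity(logps))  # 2.0
```

Nothing in this quantity references binding affinity, stability, or synthesizability: a model can drive it arbitrarily low by mastering string grammar alone.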

This finding challenges the implicit assumption underlying much of the recent work in this space, including SAFE-GPT and GP-MoLFormer. These models, impressive in their scale (87M and 46.8M parameters respectively), treat molecular generation primarily as a sequence modeling problem. NovoMolGen reveals that this framing may be insufficient. The highly structured syntax of molecular strings, with their shorter sequence lengths and constrained vocabularies compared to natural language, creates a training dynamic where the model can achieve low perplexity by memorizing local chemical motifs rather than learning global structure-property relationships.

Tokenization Bottlenecks and Architectural Limitations

The NovoMolGen study exposes critical limitations in how molecular information is compressed into token sequences. The comparison between BPE tokenization, which merges frequently co-occurring character sequences, and atomwise tokenization, which treats each atom and bond symbol as a discrete token, reveals that tokenization strategy significantly impacts validity and diversity metrics. Yet both approaches suffer from a fundamental constraint: they reduce three-dimensional molecular graphs to one-dimensional strings, destroying spatial information that is crucial for predicting protein binding and other biological activities.
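Atomwise tokenization is usually implemented with a regular expression over SMILES symbols. The sketch below uses a widely circulated community pattern; it is an illustrative assumption, not NovoMolGen's exact tokenizer:

```python
import re

# Common atomwise SMILES pattern: bracketed atoms, two-letter elements,
# organic-subset atoms, bonds, branches, and ring-closure digits.
SMILES_TOKEN = re.compile(
    r"(\[[^\]]+]|Br?|Cl?|N|O|S|P|F|I|b|c|n|o|s|p|"
    r"\(|\)|\.|=|#|-|\+|\\|/|:|~|@|\?|>|\*|\$|%[0-9]{2}|[0-9])"
)

def atomwise_tokenize(smiles: str) -> list[str]:
    """Split a SMILES string into one token per atom/bond/ring symbol."""
    tokens = SMILES_TOKEN.findall(smiles)
    # Sanity check: the tokens must reconstruct the input exactly.
    assert "".join(tokens) == smiles, "unrecognized characters in SMILES"
    return tokens

print(atomwise_tokenize("c1ccccc1O"))  # phenol: aromatic ring + oxygen
```

A BPE tokenizer trained on the same corpus would instead merge frequent motifs (e.g. the whole aromatic ring opening) into single tokens, shortening sequences at the cost of a larger, corpus-dependent vocabulary.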

Current Mol-LLMs operate within a representational straitjacket. While graph neural networks preserve topological information, they struggle to scale to billions of molecules. String-based methods scale efficiently but lose geometric and electronic-structure information. NovoMolGen's systematic comparison of SMILES variants (DeepSMILES eliminating parentheses, SELFIES guaranteeing validity through a constrained grammar) shows that while these representations improve validity rates, they do not resolve the core issue: the model learns to generate syntactically correct strings, but correctness does not imply utility.
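The kind of constraint SELFIES enforces by construction can be seen in a crude syntactic check for SMILES, where unbalanced branches or unpaired ring closures make a string unparseable. This is a deliberately naive sketch (a real validity check requires a chemistry toolkit such as RDKit, and this ignores two-digit `%NN` ring closures):

```python
def smiles_syntax_ok(smiles: str) -> bool:
    """Crude check: balanced parentheses and paired single-digit ring closures.

    Illustrative only; it says nothing about valence or chemical sanity.
    """
    depth = 0
    open_rings = set()
    for ch in smiles:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth < 0:          # closing a branch that was never opened
                return False
        elif ch.isdigit():
            open_rings ^= {ch}     # toggle: first digit opens, second closes
    return depth == 0 and not open_rings

print(smiles_syntax_ok("c1ccccc1"))  # True: ring closure 1 is paired
print(smiles_syntax_ok("CC(C"))      # False: unclosed branch
```

An unconstrained autoregressive model can emit either string; SELFIES sidesteps the problem by defining a grammar in which every token sequence decodes to some valid molecule, but, as the paper shows, guaranteed validity still does not guarantee usefulness.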

The paper also illuminates the novelty-memorization tradeoff at extreme scale. As training datasets grow into the billions, models face increasing pressure to memorize training distributions rather than learn compositional rules for generating novel chemistry. This is particularly problematic for drug discovery, where the most valuable molecules often lie at the periphery of known chemical space, not in the dense clusters of frequently occurring scaffolds.
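The novelty side of that tradeoff is typically measured as the fraction of unique generated molecules absent from the training set. A minimal sketch, assuming the SMILES on both sides are already canonicalized (real pipelines canonicalize with a toolkit such as RDKit before comparing):

```python
def novelty(generated: list[str], training_set: list[str]) -> float:
    """Fraction of unique generated molecules not seen during training.

    Assumes canonical SMILES; string equality stands in for molecule identity.
    """
    train = set(training_set)
    unique = set(generated)
    return len(unique - train) / len(unique)

gen = ["CCO", "CCN", "c1ccccc1", "CCO"]   # one duplicate, one memorized
train = ["CCO", "CCC"]
print(novelty(gen, train))  # 2 novel out of 3 unique
```

A model that memorizes dense scaffold clusters can score well on validity and similarity metrics while this number quietly collapses.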

Original Insights: Beyond Autoregressive Pretraining

The findings from NovoMolGen suggest that the field is approaching an inflection point. Within the next two to three years, I predict we will see a shift away from pure autoregressive pretraining for molecular generation. The evidence is compelling: if perplexity does not predict functional performance, then we are optimizing the wrong objective function.

The successor paradigm will likely embrace multiobjective pretraining that jointly optimizes for chemical validity, retrosynthetic accessibility, and property prediction during the pretraining phase itself, not merely during fine-tuning or reinforcement learning. Rather than treating molecular generation as a unimodal sequence prediction task, future models will incorporate auxiliary tasks such as binding affinity prediction, synthetic route feasibility scoring, and geometric constraint satisfaction directly into the pretraining loss.
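The simplest form such a joint objective could take is a weighted sum of the language-modeling loss and auxiliary heads. The term names and weights below are my own illustrative assumptions, not a recipe from the paper:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class LossWeights:
    lm: float = 1.0        # next-token prediction (the current sole objective)
    validity: float = 0.5  # penalty from a syntactic/valence checker
    synth: float = 0.3     # retrosynthetic-accessibility surrogate
    prop: float = 0.5      # auxiliary property-prediction head

def multiobjective_loss(lm_loss: float, validity_loss: float,
                        synth_loss: float, prop_loss: float,
                        w: LossWeights = LossWeights()) -> float:
    """Weighted-sum sketch of a joint pretraining objective."""
    return (w.lm * lm_loss + w.validity * validity_loss
            + w.synth * synth_loss + w.prop * prop_loss)

print(multiobjective_loss(2.0, 0.4, 0.1, 0.6))  # 2.0 + 0.2 + 0.03 + 0.3 = 2.53
```

In practice the auxiliary terms would come from differentiable surrogates or learned heads, and the weights themselves become hyperparameters that trade syntax mastery against functional signal.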

Equally important will be the emergence of hybrid architectures that fuse graph neural networks with string-based transformers. These multimodal systems will process molecular graphs directly while leveraging the scalability of transformer architectures, avoiding the information loss inherent in linearizing chemical structures. We may see architectures that alternate between graph convolutions for local chemical-environment modeling and attention mechanisms for long-range dependency capture across molecular sequences.

The tokenization bottleneck will also break. Future molecular language models may operate on learnable continuous representations of atoms and bonds, moving beyond discrete token vocabularies entirely. This would allow gradients to flow through chemical space more smoothly, enabling optimization in a continuous latent space rather than the discrete, highly constrained space of valid SMILES strings.

Conclusion

NovoMolGen represents a necessary correction to the trajectory of molecular generative modeling. By demonstrating that scale and pretraining loss are poor proxies for downstream utility, Chitsaz and colleagues force the field to confront a difficult question. Are we building language models that happen to process molecules, or are we building chemical models that leverage language architecture? The distinction matters profoundly for therapeutic discovery.

The study's limitations, which the authors acknowledge candidly, point toward future research directions. Current evaluation metrics (validity, FCD, Tanimoto similarity) remain synthetic proxies for the true test of chemical utility: experimental validation of binding affinity, selectivity, and synthetic tractability. As the field moves forward, we must develop pretraining objectives that mirror the actual constraints of drug discovery pipelines.
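Of the metrics named above, Tanimoto similarity is the most mechanical: it is the Jaccard index over binary structural fingerprints. A minimal sketch, representing each fingerprint as the set of its "on" bit indices (real pipelines derive these from Morgan/ECFP fingerprints via a toolkit such as RDKit):

```python
def tanimoto(fp_a: set[int], fp_b: set[int]) -> float:
    """Tanimoto (Jaccard) similarity between two binary fingerprints,
    given as sets of set-bit indices: |A ∩ B| / |A ∪ B|."""
    if not fp_a and not fp_b:
        return 1.0  # convention: two empty fingerprints are identical
    inter = len(fp_a & fp_b)
    return inter / (len(fp_a) + len(fp_b) - inter)

print(tanimoto({1, 4, 9, 16}, {4, 9, 25}))  # 2 shared bits / 5 total = 0.4
```

The metric's convenience is exactly its weakness as a proxy: two molecules can share most fingerprint bits while diverging completely in binding affinity or selectivity, which is why experimental validation remains the true test.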

The open questions are substantial. How do we define a pretraining objective that balances exploration of chemical space with the exploitation of known bioactive scaffolds? Can we construct unsupervised objectives that capture the thermodynamic and kinetic constraints of molecular stability without relying on expensive labeled datasets? NovoMolGen provides the empirical foundation to address these questions, establishing that current methods have hit a ceiling defined by their training objectives rather than their capacity.

The next generation of molecular foundation models will likely look less like GPT and more like specialized physical simulators, incorporating inductive biases from chemistry and biology from the ground up. NovoMolGen marks the beginning of this transition, showing us that in the molecular domain, language is a representation, not the reality we seek to model.
