Hi Lab,
As you know from my catch-up post, I recently joined Electric Twin as a scientific advisor. Here’s an article I wrote on what simulation offers science in general and what simulating humans could offer behavioral science. The original article is here.
Simulating Ourselves
The Promise of AI and Behavioral Science
Interventions are to the social sciences what inventions are to the physical sciences – an application of science as technology1. But psychological and behavioral science has problems that prevent it from being sufficiently reliable, trustworthy, and immediately useful in the real world. More widespread preregistration with fewer undocumented deviations2,3, better Bayesian statistics4,5, or less overgeneralization from Western, educated, industrialized, rich, and democratic (WEIRD) societies6–8 would go a long way to increasing the trustworthiness and reproducibility of psychological and behavioral research. But if I could wave a magic wand, these would not be the changes I would make. Because in the parlance of biology9,10, these are proximate symptoms of ultimate problems—problems that lie in how we theorize, how we measure, and, ultimately, how we connect social science to the real world.
Theory, measurement, and generalizability are the three ultimate problems in psychological and behavioral science
1. A Problem in Theory
Many, if not most, psychological and behavioral science papers are motivated by mini-theories and hypotheses based on past data or on intuitions and observations from the particular, often WEIRD life experience of researchers, or by verbal explanations labeled “theories” with no formal models or falsifiable predictions11. Here’s an example. For the last twenty-five years, behavioral scientists have agonized over whether people prefer fewer choices to many choices. A 2000 study suggested that people were more likely to purchase gourmet jams or chocolates when offered 6 choices rather than 24 or 30. It led to TED talks with millions of views and an industry of research around the so-called paradox of choice or choice overload12,13. But as I explain in my paper, A problem in theory11, the question of whether people prefer fewer or more choices
“is either nonsensical or underspecified… our species has had to make decisions with different numbers of choices to survive. Our decision-making strategy will be affected by the importance of the choice (for example, when faced with 30 mortgage options a consumer reaction of “I guess I don’t need to buy a house” is clearly suboptimal given the importance of the decision), by the information they have available (for example, the decisions of others), and indeed, by the number of choices.”
Subsequent meta-analyses of all studies on the paradox of choice have failed to replicate the original jam study, suggesting either no effect14 or a variety of factors that might moderate the effect15. To their credit, the key researchers wrote about these many problems in a 2022 Behavioral Scientist article, concluding, in essence, with “it’s complicated”…
I don’t mean to single out this particular line of research. It’s just one example of a common lack of attention to theory in psychological and behavioral science. Another prominent example, if only because it remains popular in the business world, is the idea of a growth mindset versus a fixed mindset. The difference between a growth and a fixed mindset has repeatedly failed to replicate or has, at best, shown weak effects far smaller than the original claims and than those of alternative interventions. Training a growth mindset is intuitively appealing, but it is poorly theorized and doesn’t actually do much.
As Henri Poincaré so eloquently put it:
“science is built up of facts, as a house is built of stones; but an accumulation of facts is no more a science than a heap of stones is a house”16.
Few papers attempt to accumulate scientific knowledge through general theories or overarching theoretical frameworks, leaving us with a lot of stones of varying sturdiness but little attempt to build the house17.
2. A Problem in Measurement
The rarity of theory built from first principles makes it challenging to “carve nature at its joints” when it comes to psychological constructs. Are self-esteem, well-being, or neuroticism truly universal constructs, or are they specific to WEIRD cultures and recent generations? Take neuroticism as a personality trait, for example. Even widely studied frameworks like the Big Five personality traits—openness, conscientiousness, extraversion, agreeableness, and neuroticism—do not appear consistently across all societies. In some smaller, traditional societies, only two core traits—extraversion and conscientiousness—emerge. Moreover, the extent to which these traits split into five distinct dimensions often correlates with the complexity of a given society18,19.
Beyond the question of whether these psychological constructs truly exist in all cultures, reliably measuring them across cultural contexts is a challenge. This difficulty is compounded by questionable measurement practices20. Self-report measures, which may be more reliable than implicit methods like response times (commonly used for implicit bias)21, can be poorly correlated with actual behavior. For instance, self-reported social media use correlates only modestly (r = .38) with actual logged usage22 (a correlation that size means self-reports explain only about 14% of the variance in real usage)—a serious limitation for research aiming to explore social media’s effects on mental health. Stated preferences, too, frequently diverge from the “revealed preferences” of actually observed behavior; what people say they will buy is often not what they actually buy.
Surveys and psychological studies often rely on self-reported data, yet these self-reports are typically a blend of perception and reality: wishful thinking, social desirability, unstable on-the-spot responses, and even random noise. People are often unreliable narrators, reflecting not only what they think they want but also what they believe they should want and how they wish to be seen by others.
3. A Problem in Generalizability and Application
The lack of robust theory and reliable measurement hinders the ability to generalize and apply findings in psychological and behavioral science. Without a strong theoretical foundation, it’s difficult to distinguish results that are unusual and interesting from results that are unusual and probably wrong. It’s difficult to know how to apply them to real-world situations1—or to know which cultures, demographics, or time periods they might be relevant to1,23,24. And without reliable measurement, mapping findings onto statistical models becomes problematic25,26. Behavioral scientists often claim that “context matters” for the effectiveness of nudges and interventions, but without a sound theoretical framework and reliable measurement, we have no way of understanding how or why context matters. And if the science doesn’t work in the real world, then in reality it doesn’t work at all.
AI as a Solution
These are all problems where breakthroughs in AI, and in particular Large Language Models (LLMs), will be transformational. LLMs give us new ways to rethink the field—its theories, its methods, and its applicability to diverse populations—in ways that were unimaginable just three years ago.
Trained on vast, open-ended datasets—like online discourse, literature, historical records, and social media—LLMs capture an almost incomprehensibly broad range of human interactions and cultural norms across societies and over time. These are stored in a latent space that researchers can explore to refine, test, falsify, and confirm pre-existing theories, or to expose yet-to-be-considered emergent patterns, revealing associations that are invisible to traditional methods and can motivate new theories. Imagine trying to theorize about empathy without access to diverse contexts: an LLM trained on massive, multilingual datasets can reveal how empathy manifests in different social contexts, offering insights that no controlled experiment could easily capture.
The holy grail of psychological and behavioral science is building generalizable theories that explain and predict human behavior and cultural change, tested with accurate measurement from a diversity of real-world contexts, and that can then reliably inform policies and interventions. We’re not there yet. While we have developed useful insights that have guided a range of interventions27–30 and now understand the rules underlying human behavior and cultural change as a “theory of human behaviour” (or “theory of everyone,” as I refer to it in a recent book31), we have not yet created universally predictive models. Human behavior is immensely complicated, complex, and highly variable. But human behavior is not random. There are patterns and predictors of those behaviors. Each of us is a product of millions of years of genetic evolution, thousands of years of cultural evolution, and a short lifetime of personal experiences. But those experiences come from a society shaped by evolutionary constraints. Thus, just as we can’t predict the path of a single gas particle, we may struggle to predict the behavior of an individual human. But just as we have no problem describing the collective behavior of many particles with the gas laws, we may be able to predict the behavior of cultural clusters and specific societies and subpopulations7,32. Key to this is effective measurement.
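To make the gas-law analogy concrete before turning to measurement, here is a minimal sketch in Python (with invented parameters, not a model of any real population) of how individuals can be nearly unpredictable while the aggregate is highly predictable:

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical illustration: each "person" adopts a behavior with their own
# probability, drawn from a population-level distribution (here Beta(2, 5)).
n_people = 100_000
adoption_prob = rng.beta(2, 5, size=n_people)
adopted = rng.random(n_people) < adoption_prob

# Individually, behavior is close to unpredictable: our best guess for any
# one person is barely better than a weighted coin flip.
print(f"Did person 0 adopt the behavior? {adopted[0]}")

# In aggregate, behavior is highly predictable, like a gas law: the population
# rate concentrates around its expected value, E[Beta(2, 5)] = 2/7 ≈ 0.286,
# with noise shrinking roughly as 1/sqrt(n).
for n in (100, 1_000, 100_000):
    print(f"n = {n:>7}: adoption rate = {adopted[:n].mean():.3f}")
```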
But psychological and behavioral measurement typically requires costly, time-consuming methods that frequently fall short, especially when data collection relies on self-reports from narrow, usually WEIRD samples. Real-world interventions are even more expensive and time-consuming. LLMs allow us to confront these issues with a scale, precision, and context that behavioral science has been missing.
Imagine a tool capable of predicting how diverse groups of people might respond to a policy, product, intervention, or message—free from the biases and overgeneralization of the small samples typical of most surveys. Such a tool wouldn’t just simulate typical responses; it would allow researchers to investigate the behaviors of specific subpopulations that were previously unreachable or difficult to study—from diehard QAnon followers talking to each other in dark corners of the Internet to Gen Alpha, who were born into a fully digital world.
A report from McCrindle—who coined the term “Gen Alpha”—suggests that this generation spends more time on screens than any generation before. But other reports suggest that the majority of Gen Alpha go outside or reduce technology use to manage their mental health. Perhaps both trends are true, as Gen Alpha adapts to an increasingly digital world. At the moment, we don’t know, but their digital footprints reveal where they’re going.
LLMs trained on vast online interactions can now be used to create digital twins and synthetic populations, making it possible to quickly “poll” specific groups or simulate complex interactions that would otherwise be time-consuming, costly, and challenging to study. These models allow researchers to conduct high-throughput behavioral and cultural experiments, varying parameters and exploring outcomes with synthetic populations. Insights gained from these simulated experiments can then be tested in the more costly and intensive real world, when it truly matters. This approach holds the potential to revolutionize behavioral science.
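Here is a rough sketch of what “polling” a synthetic group could look like in code, assuming an OpenAI-style chat API; the personas, question, and model name below are illustrative assumptions, not Electric Twin’s actual pipeline:

```python
# A minimal sketch of "polling" a synthetic subpopulation.
from collections import Counter

from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

personas = [
    {"age": 19, "country": "India", "occupation": "engineering student"},
    {"age": 67, "country": "UK", "occupation": "retired nurse"},
    {"age": 34, "country": "Brazil", "occupation": "delivery driver"},
]

question = "Would you get a new flu vaccine this winter? Answer YES or NO."

def ask(persona: dict) -> str:
    system = (
        f"You are simulating a {persona['age']}-year-old "
        f"{persona['occupation']} from {persona['country']}. "
        "Answer survey questions in character and as briefly as possible."
    )
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": system},
            {"role": "user", "content": question},
        ],
        temperature=1.0,  # sampling variation, as a real population would show
    )
    return response.choices[0].message.content.strip().upper()

# Repeatedly sample each persona and tally the answers, like a quick poll.
tally = Counter(ask(p) for p in personas for _ in range(10))
print(tally)
```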
Here are some promising paths forward.
Theory: Expanding Theory-Building Through Emergent Data Patterns
Traditional theory-building in psychological and behavioral science often begins with a hypothesis drawn from a combination of existing literature, expert opinion, and intuition. Researchers then test this hypothesis in controlled lab settings, hoping that their findings will generalize beyond the sample. This approach, however, has its limitations—especially when models grounded in first principles and set within a cohesive theoretical framework are possible33. Even when theories are backed by replicated experiments, they often struggle to predict complex behaviors outside of narrow lab settings, constrained by the cultural and demographic specifics of the sample and the inherent limits of verbal theorizing.
LLMs offer a fundamentally different approach. They allow us to start with synthetic data or to rapidly iterate through the scientific cycle of theorizing, hypothesizing, testing, and refining the theory or measurement. There are challenges, of course: off-the-shelf LLMs primarily reflect the behaviors of individuals from WEIRD samples34,35. Yet in theory, these models also capture a diversity of behaviors from various subpopulations within WEIRD societies and other global contexts. For communities that spend little time online or languages that are underrepresented on the web, accurate simulation may be a bigger challenge, but current LLMs hold within their latent space the 60–70% of the world who are now online, ready for researchers to explore.
At Electric Twin and in my lab, we’ve made progress in simulating subpopulations and getting LLMs to behave like specific cultures, and as we continue to make progress, the scientific cycle of theory–hypothesis–testing–refinement becomes more fluid, faster, and more dynamic. Eventually, researchers will be able to test and refine concepts in real time with synthetic populations that reflect real-world diversity.
Measurement, Generalizability, and Intervention: Precision and Scale Through Synthetic Populations
Social scientists often rely on methods like surveys or focus groups, which are labor-intensive and prone to biases from the interviewer, dominant individuals, group composition, or even anchoring on what was said first. Moreover, people might not answer honestly about sensitive topics, and they may be influenced by the mere presence of an interviewer.
The synthetic populations that Electric Twin’s technologies make possible address these limitations by creating virtual respondents who mirror the diversity within WEIRD populations (and hopefully soon beyond!). These are commercial tools, but the science behind them will allow researchers to quickly gauge responses to questions, framings, or messaging across a wide range of synthetic yet lifelike participants. I’m particularly excited by some new experiments on interactions between different synthetic participants. Thanks to falling LLM costs, we can now simulate focus groups and larger, more diverse populations.
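Here is a toy sketch of such an interaction: a simulated focus group in which synthetic participants respond in turn and react to one another (again assuming an OpenAI-style chat API, with invented personas and topic):

```python
# A toy sketch of a simulated focus group: synthetic participants respond in
# turn and can react to what was said before them.
from openai import OpenAI

client = OpenAI()

personas = {
    "Priya": "a 29-year-old teacher from Mumbai who is price-sensitive",
    "Tom": "a 55-year-old farmer from Iowa who distrusts big brands",
    "Lena": "a 21-year-old design student from Berlin focused on sustainability",
}

topic = "a subscription service for repairable, modular phones"
transcript: list[str] = []

for _round in range(2):  # two rounds of discussion
    for name, persona in personas.items():
        recent = "\n".join(transcript[-6:]) or "(no one has spoken yet)"
        prompt = (
            f"You are {persona}, taking part in a focus group about {topic}.\n"
            f"Recent discussion:\n{recent}\n"
            "Give your reaction in one or two sentences, in character."
        )
        reply = client.chat.completions.create(
            model="gpt-4o-mini",
            messages=[{"role": "user", "content": prompt}],
        ).choices[0].message.content.strip()
        transcript.append(f"{name}: {reply}")

print("\n".join(transcript))
```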
There has been an explosion of papers replicating psychological and behavioral experiments with LLMs36–42, along with exciting perspectives and reviews mapping out the possibilities of human-machine behavior, culture, and cultural evolution43,44. Building on this work and overcoming the caveats expressed by the authors, LLMs can become predictive tools, simulating how various demographic groups might react under specific circumstances.
Imagine a public health campaign testing its messaging on key virtual subpopulations before rolling it out to the public. Each subpopulation, modeled as a synthetic population, could reveal which framing is most effective and which issues resonate most with different communities. There are also commercial applications for businesses seeking to understand customers, employees, shareholders, or new markets. For example, being able to test ahead of time could have helped Walmart avoid its multibillion-dollar failure in Germany, widely attributed to a cultural mismatch. Or Australian DIY giant Bunnings’ failure in the UK:
“If there are two golden rules for internationalising retailers, they are: 1. Develop as comprehensive an understanding of the new market as possible, by whatever means. 2. Be prepared to modify your domestic business model to meet the dictates of the new market, however drastically. In acquiring Homebase, Wesfarmers has spectacularly failed to heed either of these two golden rules… Bunnings is a highly successful business in its native Australia. The UK is not Australia.”
Beyond the obvious commercial applications, behavioral scientists can deploy these simulated populations to test behavioral interventions or public policies, using the models’ feedback to refine approaches before investing in real-world trials to confirm those predictions.
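As a sketch of that workflow, one might compare message framings across synthetic subpopulations before committing to a field trial; the framings, groups, rating scale, and model below are assumptions for illustration, not a validated design:

```python
# An illustrative sketch of testing message framings on synthetic
# subpopulations before a field trial.
from collections import defaultdict

from openai import OpenAI

client = OpenAI()

framings = {
    "gain": "Getting vaccinated protects you and the people you love.",
    "norm": "Most people in your community are already getting vaccinated.",
}
subpopulations = {
    "young urban renters": "a 24-year-old renter in a large city",
    "rural retirees": "a 70-year-old retiree in a small farming town",
}

scores = defaultdict(list)
for group, persona in subpopulations.items():
    for label, message in framings.items():
        for _ in range(20):  # repeated samples approximate a distribution
            prompt = (
                f"You are {persona}. You see this public health message:\n"
                f'"{message}"\n'
                "On a scale of 1 (not at all) to 7 (extremely), how "
                "persuasive is this message to you? Reply with one number."
            )
            reply = client.chat.completions.create(
                model="gpt-4o-mini",
                messages=[{"role": "user", "content": prompt}],
            ).choices[0].message.content
            # Naive parsing for the sketch; a real pipeline would validate.
            scores[(group, label)].append(int(reply.strip()[0]))

for (group, label), vals in sorted(scores.items()):
    print(f"{group} | {label}: mean rating = {sum(vals) / len(vals):.2f}")
```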
But the implications go deeper. The latent space modeled by LLMs is a frontier for behavioral scientists to explore and discover patterns too complex for traditional analysis. We can validate current psychological constructs to see if they make sense beyond what a handful of human raters might think and even discover new constructs. We can also begin to develop theories that account for the diversity of humanity across the globe, of different age groups, and within specific subpopulations, overcoming what we might call the “average person problem”.
I’m a big fan of data-driven decision-making, but it suffers from the “average person problem”: findings are typically about the median or mean of a population and therefore represent a generalized average individual. In the real world, the “average person” is ironically rare, or perhaps doesn’t exist; each of us falls somewhere along a broad spectrum of behaviors and preferences. We are continually moving points on a copula—the joint probabilities of a multivariate distribution. Or to put it another way, the average human being would be a Christian Indian or Chinese man named Muhammad with an annual salary of $15,000. You get the point – the median or mean person isn’t that meaningful.
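A small simulation makes the point; assuming ten standardized traits with an arbitrary, moderate correlation between them, the share of people who are close to average on every trait collapses as more traits are considered:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy illustration of the "average person problem": assume 10 standardized
# traits with moderate pairwise correlation (0.3); both numbers are arbitrary.
d = 10
corr = np.full((d, d), 0.3)
np.fill_diagonal(corr, 1.0)
people = rng.multivariate_normal(np.zeros(d), corr, size=100_000)

# Fraction of people within half a standard deviation of the average on the
# first k traits simultaneously.
for k in (1, 3, 10):
    near_avg = (np.abs(people[:, :k]) < 0.5).all(axis=1).mean()
    print(f"Near-average on {k:>2} trait(s): {near_avg:.3%}")
# Roughly 38% of people are near-average on any one trait, but almost no one
# (well under 1%) is near-average on all ten at once.
```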
AI allows for insights tailored to specific segments, not just a non-existent average person. This personalized, segmented approach could mean a healthcare system where advice isn’t generic but informed by what works for people with specific profiles and histories. It could mean better cross-national teams in the military. It could help finance firms anticipate how different sectors and markets will behave.
Learning Through Simulation: Lessons from Other Sciences
Simulations have long been key to scientific and technological advancement. The work of scientists and engineers studying the steam engine eventually led to thermodynamics, not the other way around. Even after the invention of the airplane, safely and efficiently training pilots required Ed Link to invent the flight simulator45. Training in the air was costly and time-consuming, not to mention dangerous. Ed Link’s “Link Trainer” mechanical flight simulator brought the cost down by almost two orders of magnitude. Flight training became faster, safer, and cheaper thanks to the ability to work out best practices and experiment with new ones without risk and at low cost.
Until now, behavioral scientists have lacked comparable tools for simulating human thought and social processes. But now that we can simulate human behavior, we should expect breakthroughs similar to those simulation offered other sciences and technologies—in our case, for the growing theory of everyone, and for policy and interventions.
Imagine being able to simulate the evolution of social norms in different cultural groups or predict how cultural values might shift in response to global events. As prices fall, these kinds of simulations become feasible, turning cultural evolution into an applied science with predictive power that was previously out of reach.
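The logic of such simulations can already be made concrete with classic cultural evolution models; here is a minimal conformist-transmission sketch (parameters invented, not fitted to any real society):

```python
import numpy as np

rng = np.random.default_rng(1)

# A minimal sketch of simulating norm change via conformist transmission:
# each agent observes a few random peers and adopts the majority variant
# disproportionately often. All parameters are invented for illustration.
n_agents, n_steps, sample_size, conformity = 1_000, 50, 5, 1.5

holds_norm = rng.random(n_agents) < 0.4  # 40% start with the new norm

for _ in range(n_steps):
    peers = holds_norm[rng.integers(0, n_agents, size=(n_agents, sample_size))]
    freq = peers.mean(axis=1)  # observed frequency of the new norm
    # Conformist bias: adoption probability is a steepened function of the
    # observed frequency, so majorities are amplified (conformity > 1).
    p_adopt = freq**conformity / (freq**conformity + (1 - freq) ** conformity)
    holds_norm = rng.random(n_agents) < p_adopt

print(f"Final prevalence of the new norm: {holds_norm.mean():.1%}")
# Starting below 50%, conformity usually drives the new norm to extinction;
# start it above 50% and it tends to fixate instead.
```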
For all of these reasons, I’m incredibly excited to join Electric Twin as a scientific advisor. It’s an exciting time to bring together my two careers, in software engineering and natural language processing and in psychological and behavioral science, to shape this new era at their intersection. And of course, all that we’ve learned about ethical behavioral science interventions also applies to this new frontier for behavioral science46–49. There are new possibilities, new challenges, and huge potential.
References
1. Schimmelpfennig, R. & Muthukrishna, M. Cultural Evolutionary Behavioural Science in Public Policy. Behavioural Public Policy 40 (2023).
2. Willroth, E. C. & Atherton, O. E. Best Laid Plans: A Guide to Reporting Preregistration Deviations. Advances in Methods and Practices in Psychological Science 7, 25152459231213802 (2024).
3. Nosek, B. A., Ebersole, C. R., DeHaven, A. C. & Mellor, D. T. The preregistration revolution. Proceedings of the National Academy of Sciences 115, 2600–2606 (2018).
4. Wagenmakers, E.-J. et al. Bayesian inference for psychology. Part I: Theoretical advantages and practical ramifications. Psychon Bull Rev 25, 35–57 (2018).
5. McElreath, R. Statistical Rethinking: A Bayesian Course with Examples in R and Stan. (Chapman and Hall/CRC, New York, 2016). doi:10.1201/9781315372495.
6. Henrich, J., Heine, S. J. & Norenzayan, A. The weirdest people in the world? Behavioral and Brain Sciences 33, 61–83 (2010).
7. Muthukrishna, M. et al. Beyond Western, Educated, Industrial, Rich, and Democratic (WEIRD) Psychology: Measuring and Mapping Scales of Cultural and Psychological Distance. Psychol Sci 31, 678–701 (2020).
8. Apicella, C. L., Norenzayan, A. & Henrich, J. Beyond WEIRD: A review of the last decade and a look ahead to the global laboratory of the future. Evolution and Human Behavior 41, 319–329 (2020).
9. Tinbergen, N. On aims and methods of ethology. Zeitschrift für tierpsychologie 20, 410–433 (1963).
10. Mayr, E. Cause and effect in biology. Science 134, 1501–1506 (1961).
11. Muthukrishna, M. & Henrich, J. A problem in theory. Nature Human Behaviour 3, 221–229 (2019) doi:10/gfvdx8.
12. Schwartz, B. The Paradox of Choice: Why More Is Less. (HarperCollins, 2003).
13. Iyengar, S. S. & Lepper, M. R. When choice is demotivating: Can one desire too much of a good thing? Journal of Personality and Social Psychology 79, 995–1006 (2000).
14. Scheibehenne, B., Greifeneder, R. & Todd, P. M. Can There Ever Be Too Many Options? A Meta‐Analytic Review of Choice Overload. Journal of Consumer Research 37, 409–425 (2010).
15. Chernev, A., Böckenholt, U. & Goodman, J. Choice overload: A conceptual review and meta-analysis. Journal of Consumer Psychology 25, 333–358 (2015).
16. Poincaré, H. Science and Hypothesis. (Courier Corporation, 1905).
17. Forscher, B. K. Chaos in the Brickyard. Science 142, 339–339 (1963).
18. Smaldino, P. E., Lukaszewski, A., von Rueden, C. & Gurven, M. Niche diversity can explain cross-cultural differences in personality structure. Nat Hum Behav (2019) doi:10/gf8db9.
19. Lukaszewski, A. W., Gurven, M., von Rueden, C. R. & Schmitt, D. P. What Explains Personality Covariation? A Test of the Socioecological Complexity Hypothesis. Social Psychological and Personality Science 8, 943–952 (2017).
20. Flake, J. K. & Fried, E. I. Measurement Schmeasurement: Questionable Measurement Practices and How to Avoid Them. Advances in Methods and Practices in Psychological Science 3, 456–465 (2020).
21. Corneille, O. & Gawronski, B. Self-reports are better measurement instruments than implicit measures. Nat Rev Psychol 1–12 (2024) doi:10.1038/s44159-024-00376-z.
22. Parry, D. A. et al. A systematic review and meta-analysis of discrepancies between logged and self-reported digital media use. Nat Hum Behav 5, 1535–1547 (2021).
23. Muthukrishna, M. Cultural evolutionary public policy. Nat Hum Behav 4, 12–13 (2020).
24. Muthukrishna, M., Henrich, J. & Slingerland, E. Psychology as a Historical Science. Annu. Rev. Psychol. 72, 717–749 (2021).
25. Yarkoni, T. The generalizability crisis. Behavioral and Brain Sciences 45, e1 (2022).
26. Falk, C. F. & Muthukrishna, M. Parsimony in model selection: Tools for assessing fit propensity. Psychological Methods (2021) doi:10/gm6bzc.
27. Hallsworth, M. A manifesto for applying behavioural science. Nat Hum Behav 7, 310–322 (2023).
28. Van Bavel, J. J. et al. Using social and behavioural science to support COVID-19 pandemic response. Nat Hum Behav 4, 460–471 (2020) doi:10/ggqt8q.
29. Vlasceanu, M. et al. Addressing climate change with behavioral science: A global intervention tournament in 63 countries. Science Advances 10, eadj5778 (2024).
30. Ruggeri, K. et al. A synthesis of evidence for policy from behavioural science during COVID-19. Nature (2023) doi:10.1038/s41586-023-06840-9.
31. Muthukrishna, M. A Theory of Everyone: The New Science of Who We Are, How We Got Here, and Where We’re Going. (MIT Press Books, Cambridge, MA, 2023).
32. Uchiyama, R., Spicer, R. & Muthukrishna, M. Cultural Evolution of Genetic Heritability. Behavioral and Brain Sciences 45, e152 (2022).
33. Muthukrishna, M. & Henrich, J. A problem in theory. Nat Hum Behav 3, 221–229 (2019).
34. Atari, M., Xue, M. J., Park, P. S., Blasi, D. & Henrich, J. Which Humans? Preprint at https://doi.org/10.31234/osf.io/5b26t (2023).
35. Tao, Y., Viberg, O., Baker, R. S. & Kizilcec, R. F. Cultural bias and cultural alignment of large language models. PNAS Nexus 3, pgae346 (2024).
36. Marjieh, R., Sucholutsky, I., van Rijn, P., Jacoby, N. & Griffiths, T. L. Large language models predict human sensory judgments across six modalities. Sci Rep 14, 21445 (2024).
37. Ke, L., Tong, S., Cheng, P. & Peng, K. Exploring the Frontiers of LLMs in Psychological Applications: A Comprehensive Review. Preprint at https://doi.org/10.48550/arXiv.2401.01519 (2024).
38. Abdurahman, S. et al. Perils and opportunities in using large language models in psychological research. PNAS Nexus 3, pgae245 (2024).
39. Webb, T., Holyoak, K. J. & Lu, H. Emergent analogical reasoning in large language models. Nat Hum Behav 7, 1526–1541 (2023).
40. Grossmann, I. et al. AI and the transformation of social science research. Science 380, 1108–1109 (2023).
41. Dillion, D., Tandon, N., Gu, Y. & Gray, K. Can AI language models replace human participants? Trends in Cognitive Sciences 27, 597–600 (2023).
42. Hewitt, L., Ashokkumar, A., Ghezae, I. & Willer, R. Predicting Results of Social Science Experiments Using Large Language Models. (2024).
43. Rahwan, I. et al. Machine behaviour. Nature 568, 477–486 (2019).
44. Brinkmann, L. et al. Machine culture. Nat Hum Behav 7, 1855–1868 (2023).
45. Madhavan, G. Wicked Problems: How to Engineer a Better World. (W. W. Norton & Co, 2024).
46. Sunstein, C. R. The Ethics of Influence: Government in the Age of Behavioral Science. (Cambridge University Press, New York, 2016).
47. Jachimowicz, J., Matz, S. & Polonski, V. The Behavioral Scientist’s Ethics Checklist. (2014).
48. Michie, S., van Stralen, M. M. & West, R. The behaviour change wheel: A new method for characterising and designing behaviour change interventions. Implementation Science 6, 42 (2011).
49. Lades, L. K. & Delaney, L. Nudge FORGOOD. Behavioural Public Policy 6, 75–94 (2020).