We should all be worried about synthetic data

Making up the world through made-up data

Synthetic data, the use of AI to create datasets that mimic real-world data, is rapidly becoming a much bigger part of our daily lives. But this form of data raises critical philosophical and ethical questions that will shape the future for all of us, write Mikkel Krenchel and Maria Cury.


There’s a data revolution happening and nobody is talking about it. It revolves around synthetic data. Unless you work in the field of artificial intelligence (AI), you may never have heard of it. But this rapidly growing form of data raises critical philosophical and ethical questions that will shape the future for all of us.

First, what is synthetic data? There are many types, but the basic premise is the use of AI to create datasets that mimic real-world data. These datasets can then be used to feed the insatiable need for data to train machine learning algorithms to make better predictions. Instead of training algorithms on messy, expensive real-world data riddled with privacy issues and bias, one can now supplement or supplant real-world data with “better,” “cheaper,” or “bigger” datasets constructed using AI. Put simply, synthetic data is artificial data feeding artificial intelligence. It’s similar to deepfakes, yet used for less nefarious purposes, and applied not only to videos and images but to any type of data under the sun, from insurance data and army intelligence to data for self-driving vehicles and even patient health care records. It is as awe-inspiring as it is terrifying.


Synthetic data is not a new concept, but what’s new is the surging demand for it and the AI capabilities to support it. Organizations across the world are investing massively in training new AI systems in hopes of changing how we learn, heal, trade, drive, buy, wage war and much more. To train these systems, they will need ever-expanding quantities of data. Yet good data is harder than ever to come by, as concerns and regulations around privacy, bias, and responsible AI are finally creating some constraints on data collection. As such, Gartner predicts that by 2024 no less than 60% of all data used for AI will be synthetic. Already, 96% of teams working on computer vision rely on synthetic data, and another analysis suggests that the number of companies focused on supplying synthetic data nearly doubled between 2019 and 2020 alone. It’s not hyperbole to think the role of ‘synthetic data engineer’ will one day be the most in-demand profession.

That means we are standing on the brink of a world where many of the technologies that surround us might not be built in response to reality, but to what a machine imagines that reality to be. This raises several questions: What happens if and when there are gaps between the real world the AI operates in, and the synthetic world it was trained in? How do we narrow those gaps, and what are the ethical and safety guardrails we need to put in place? If data is the new oil, as some argue, what happens if large-scale datasets become a cheap commodity that anyone with the right AI can build? What might that mean for the business models of big tech companies centered around their unique access to real-world data? And what will happen to empirical disciplines like the social sciences if we increasingly rely on data that isn’t collected in the real world?

Perhaps most critically, in a world where we are already struggling with a lack of data literacy in society, growing misinformation, and a contested relationship with ‘truth,’ synthetic data will require that we re-evaluate what we mean by terms like ‘data’ and ‘reality’ and embrace a worldview where the quality, context, and origin of the data matter perhaps more so than the quantity.

Cheaper, safer, fairer

Before we get there, let’s consider the immense upside of synthetic data. It holds tremendous promise to solve a variety of very practical problems, from lowering the cost of developing helpful AI systems, to providing better (though not perfect) privacy protections, to allowing developers to build all kinds of products with data that is closer to real life (compared to the cruder sample data many teams used to use).

Synthetic data is especially helpful when real datasets are difficult to come by. Take car manufacturers, for example. Through synthetic datasets, car makers can mimic driver behaviour in virtual car simulations to train and iterate their models across a vaster and richer set of situations to make driverless cars safer. They can also do so with a fraction of the time, cost and difficulty of acquiring actual data. The National Institutes of Health used synthetic data to replicate their database of more than 2.7 million COVID-19 patient records, creating a dataset with the same statistical properties but none of the identifying information, one that could be quickly shared and studied by researchers the world over. The aim was to help identify better treatments without infringing on the privacy of the people involved.
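To make "same statistical properties but none of the identifying information" a little more concrete, here is a deliberately simplified sketch in Python. It is not the method the National Institutes of Health used. The function name `synthesize_column` and the sample ages are hypothetical, invented for illustration, and the sketch assumes a single numeric column that is roughly normally distributed.

```python
import random
import statistics


def synthesize_column(real_values, n, seed=0):
    """Draw n synthetic values from a normal distribution fitted to real_values.

    Assumption: one numeric column, treated as roughly normal. Real
    generators model many columns and their relationships jointly.
    """
    rng = random.Random(seed)
    mu = statistics.mean(real_values)      # summary statistic 1: the mean
    sigma = statistics.stdev(real_values)  # summary statistic 2: the spread
    return [rng.gauss(mu, sigma) for _ in range(n)]


# Hypothetical patient ages, made up for this example.
real_ages = [34, 51, 47, 62, 29, 58, 45, 39]
synthetic_ages = synthesize_column(real_ages, n=10_000)

# The synthetic column tracks the real mean and spread without copying rows.
print(round(statistics.mean(real_ages), 1), round(statistics.mean(synthetic_ages), 1))
print(round(statistics.stdev(real_ages), 1), round(statistics.stdev(synthetic_ages), 1))
```

A sketch like this preserves only the two statistics it was told to preserve. It ignores relationships between columns and offers no formal privacy guarantee, which is part of why the privacy protections of synthetic data are better, though not perfect.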

In these ways, synthetic data is proving its value across industries and sectors. John Deere, for example, has created synthetic images of plants to train its tractors to think like human farmers. JP Morgan is experimenting with synthetic data to detect payment fraud and money laundering. And healthcare companies are employing it to test medical cases for which there is insufficient data. Done right, synthetic data will help us bring about important new technologies that change how we communicate, get around more safely, heal our bodies, and so much more.

Synthetic data also has the potential to correct some of the glaring inconsistencies and biases in our current datasets. According to Gartner, for example, some 85% of the algorithms currently in use are error-prone due largely to bias, often a product of the underrepresentation of women, people of colour, or other minority groups in the data sample. With synthetic data, engineers can artificially boost the number of underrepresented minorities within a dataset, simply by generating new synthetic records with characteristics that are representative of the minority group in question. Many therefore believe that synthetic data could go a long way in making data less biased and more fair, allowing us to build more accurate AI that reflects and manifests the world we want, rather than perpetuating the historical biases and inequalities of the one we have.
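One simple way to picture this kind of boosting is interpolation between existing minority-group records, loosely in the spirit of oversampling techniques such as SMOTE. The Python sketch below is a toy illustration under stated assumptions, not a description of how any particular company does it. The function name `oversample_minority` and the (age, income) records are hypothetical, and it assumes every record is a tuple of numbers.

```python
import random


def oversample_minority(records, target_count, seed=0):
    """Grow `records` to `target_count` rows by interpolating between random pairs.

    Assumption: each record is a tuple of numeric features. Every new
    point lies on the line segment between two real records.
    """
    rng = random.Random(seed)
    synthetic = []
    while len(records) + len(synthetic) < target_count:
        a, b = rng.sample(records, 2)  # pick two distinct real records
        t = rng.random()               # position between them, in [0, 1)
        synthetic.append(tuple(x + t * (y - x) for x, y in zip(a, b)))
    return records + synthetic


# Hypothetical (age, income) records for an underrepresented group.
minority = [(34.0, 52000.0), (41.0, 61000.0), (29.0, 48000.0)]
balanced = oversample_minority(minority, target_count=10)

print(len(balanced))  # the three real records plus seven synthetic ones
```

Note that every synthetic record here is a blend of records already in the sample. The group becomes larger in the dataset, but nothing is learned about its members that the original three records did not already contain.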


The reality gap

Even with its capacity to minimize known historical biases, it would be a mistake to think that synthetic data is bias-free. Data without bias is generally an illusion: people make decisions about what data to include or exclude and how to analyze it, and those choices are based on what’s deemed important or relevant, a judgment that is itself usually biased. This continues to be the case when it comes to making decisions around synthetic datasets. Engineers generate synthetic data based on a smaller sample of ‘real data’ that is labelled with all the aspects deemed relevant for the AI to train on, and a set of rules that seek to counteract any obvious, known biases in the original dataset. But the whole point of bias is that we all suffer from it and often can’t see it ourselves. And there is more complexity and nuance in reality than we’ll ever be able to systematically reflect and account for in synthetic datasets. So long as humans are the ones making decisions on which of these datasets should be built, which problems they should solve, and what real-world data should be their basis, we will never be able to fully remove bias. As such, synthetic data can reproduce patterns and biases from the data it is drawn from and even amplify them.

The world keeps changing and any data sample that forms the basis of a larger dataset will invariably be a portrait-in-time. Even the best synthetic data may quickly grow obsolete if the real world evolves in a different direction from what the algorithms expect, based on factors that the humans who designed the algorithms couldn’t account for or anticipate. In other words, synthetic data may help us represent — or amplify — what we already know and can foresee. But if that is all we rely on, we may miss the opportunity to discover something new about our constantly changing world.

In a worst-case scenario, we get an echo chamber effect, whereby AI feeds the AI and the models that develop and control key aspects of our world (the information we consume, the digital worlds we frequent, the medical advice and products we receive, or the price we pay for insurance and many other products) increasingly respond to an internal logic divorced from the reality we inhabit.


A dangerous default

Used responsibly and carefully, it is likely that engineers can minimize the reality gap, and avoid many of the direct pitfalls associated with synthetic data. But we shouldn’t just be concerned with how synthetic data should be used; we should be concerned with how it might be misused. What happens when engineers, scientists, and business leaders the world over can either turn to readily available and cheap synthetic data, or do the arduous work of collecting new, original real-world data? In particular, what happens if and when synthetic data builds a ‘reputation’ as a better alternative to real data? It does not take much to imagine that even the best-intentioned engineers, scientists, and business leaders might start defaulting towards using synthetic data even in situations when they really shouldn’t.

Already today, we see many companies making decisions based on whatever available dataset they can find and calling it a ‘data-driven decision,’ even when the datasets are clearly biased, incomplete, or obsolete. It’s better than nothing, goes the thinking, particularly in scenarios where collecting new raw data is prohibitively difficult or expensive. In this way, the growing availability of synthetic data might make firms or organizations disinclined to do original research and data collection. And that’s dangerous because even the best synthetic dataset will never be a representation of our constantly changing reality that can answer all questions and inform all decision-making. If that dataset isn’t grounded in (or perhaps made from) a rigorous understanding of the most recent underlying human phenomenon, such as the differences between what people say and do, or the unexpected influence of tangential variables in our lives on the actions we take, it risks simulating a social world that shortchanges reality in ways that could cause real harm to everyday people. And this is before we even begin to contemplate more nefarious uses of synthetic data, such as deepfakes or misinformation at massive scale. As a society, we are already struggling with data literacy and transparency, and with the growth of synthetic data it might be about to get a whole lot worse.

The case for thicker data

So how do we avoid the pitfalls of synthetic data and create the transparency and data literacy needed for all of us to make sense of this new world of data? This is where we believe the social and human sciences ought to get involved. The input most crucial to making sure the synthetic data revolution does not simulate low-quality reflections of the world we live in (or worse, create worlds we didn’t intend) is small, not big, data. In a synthetic data world, the quality of the initial, small dataset from which the synthetic data is derived is absolutely paramount. And so is a deeply contextualized understanding of that dataset itself: where it came from, what it can be used for, what it explains, and what it doesn’t. This is the kind of context that is difficult to obtain, make sense of, or relate to underlying structures and biases.

Anthropologists are trained in the collection of ‘thick data’ (or what Clifford Geertz referred to as “thick description”): the messy, raw, real-world data, usually with innumerable confounding variables, that you can only collect by going out into the world, observing the larger cultural meaning of what’s going on, and paying close attention to social norms, culture, and context. They are trained to understand the limitations of data in informing our decision-making, and how, if mishandled or misused, data can exacerbate hidden biases or have other unintended consequences. This could be exactly the type of input and expertise needed to guide the next generation of synthetic data-driven AI.

Social science in a synthetic data world

For anyone who is interested in the social sciences and making sense of humanity, AI that can generate high-quality synthetic data ought to inspire amazement, even awe. Because looking at real data of what people do or experience, and then deriving a set of predictions (or theories) about what other people (perhaps imagined, perhaps more generalized) would do, is, in our view, exactly what the best social scientists do. C. Wright Mills claimed that the most critical skill in the social sciences was what he called the sociological imagination: the ability to draw on historical, social, and psychological data to make sense of what we do, and extrapolate what we might do next, or what might be different under different circumstances. The best social scientists often rely on limited datasets to understand and imagine entire social worlds. It is thought-provoking that computers are now starting to make the same imaginative leaps, even if there are of course still significant limitations and issues with this approach.

However, how should we think about the inferences about people that machine learning algorithms are increasingly able to make? It is tempting to use these advances in machine learning to pit computers against humans and see which is ‘better’, or declare ‘the end of theory’ altogether as some have done. On the one hand, we know very little about how computers actually come up with the patterns they do, because of the opaque nature of the underlying neural networks. On the other, we know equally little about how our own minds function. So how can we meaningfully compare them? Is comparison even useful? There’s a reasonable chance that even if the outcomes seem comparable (e.g. worlds as imagined by people vs. worlds as imagined by machines) the way we get there is fundamentally different and will systematically produce different results over time. We simply don’t know. For the time being, perhaps the better heuristic is to think of human imagination and intuition and its machine counterpart as two fundamentally distinct and complementary approaches. This view suggests that the future of the social and human sciences might involve human researchers in a form of dialogue or “AI Dance” with machines to collectively build better models and explanations of the world, drawn from both real-world data and synthetic datasets.

In the future, synthetic data will be a much bigger part of our daily lives. It has the potential to restructure everything from the algorithms that shape our experience of the world, to our understanding of data and reality, to the role of the social sciences in society. The stakes are too high to leave these important decisions to data scientists alone — social scientists and philosophers (as well as policymakers) have a role to play. Otherwise, the effects of this data revolution could be disastrous.
