Synthetic data v the real thing

ITS and smart cities thrive on data: but does all the data need to be real? Steve Harris of Mindtech explains why the answer could lie in combining elements of the real world with the synthetic

By Steve Harris January 9, 2023 Read time: 6 mins

Data is at the heart of innovation, and real-world data sources have always been the driving force of technological advancements. However, the time for change has arrived. Smart cities have been a topic of prolonged discussion, yet challenges we hope to resolve remain pervasive in our society. Traffic, air pollution, accidents – all continue to pose problems. Reflecting on what is lacking from current artificial intelligence deployments suggests the need for better training data solutions.

Real-world training data has its place and there is no taking away from its integral contribution to training AI systems. However, it can also be flawed. When it comes to training computer vision systems for smart cities, real-world data has been the key focus but has notably lacked in addressing corner cases. If we wish to continue to enhance AI innovation and cover applicability to any edge case, then there is a solution: synthetic data.

What is synthetic data?

Training AI systems requires large volumes of carefully labelled datasets. Compiling enough compliant, annotated real-world data to support these use cases can be incredibly labour-intensive and time-consuming. It can be impractical to spend the time collecting and labelling datasets that can consist of thousands or even millions of objects. Additionally, some use cases may consist of variables for which gathering data is not feasible or challenging to create by traditional means. As a result, synthetic data has emerged as a resource for businesses to solve these problems.

Synthetic data is generated from computer systems with the intention of resembling real data both statistically and structurally. Privacy issues can be a major setback of real data. Synthetic data avoids these issues by delivering privacy and GDPR-compliant training data.

A common misunderstanding is that synthetic data is not accurate or reliable, but in actuality it can be closer to the real world than gathered real-world data itself. Datasets can be created for any situation. Issues with geographical location, camera capture and deployment systems are not an issue. The process is constantly evolving and, as more data is artificially generated, the potential for creating more refined data sets that better resemble real-world use cases gets better. This approach offers an alternative that provides the opportunity to collect greater volumes at a quicker rate, and more inexpensively.

This does not eliminate real data

Preconceptions about real-world data being more reliable or accurate are highly prevalent. In comparison, synthetic data is faster, more cost-effective and more flexible, with greater potential for scaling. It provides data scientists with the ability to innovate and break barriers in place with the use of real data alone.

However, this does not eliminate the need for real-world data. Whilst there are evident advantages of artificially generating information to train AI models, if the aim is to optimise and build a truly intelligent system, a combination of the two is necessary. Real data combined with synthetic data delivers robust AI models covering the key use cases while being thoroughly tested against edge cases. The two go hand in hand and it is important not to dismiss the relevance of both.

The future of smart cities

Smart cities are intended to be safer and on the whole, better. This means technology is implemented to optimise everyday processes and improve citizens’ quality of life. Improved road safety and travel experience are high on the digital transformation agenda. Before adopting these digital solutions, relevant pain points must be identified in order to offer a better user experience. Often, when trained on real data, systems can fail to deal with infrequent scenarios - the ones that typically characterise the need to build smarter and safer cities.

Real data combined with synthetic data delivers robust AI models covering the key use cases while being thoroughly tested against edge cases

Synthetic data can be generated for any edge case needed, meaning AI systems can be prepared for these situations. It is ultimately an essential tool in driving the adoption of AI for smart cities. It can supply local authorities with information and valuable data about their infrastructure, particularly their mobility and travel routes. For example, it can help computer vision systems to predict where certain accidents are more likely to happen within the city's current infrastructure, or where potential cyclists will be at risk of injury. A lack of data for a given class, in this case cyclists, can lead to failures in detection, which can result in fatalities. Targeting these class imbalances through generating synthetic data for use cases like this within autonomous vehicles can make roads safer for all users.

Driving simulators are a critical part of developing autonomous vehicles. While automotive companies have trained vehicles for a long time on real-world data, certain circumstances can be difficult to account for. The collection and labelling of the data used for training this software is also expensive and time consuming. Corner cases reflecting real-world conditions like extreme weather or accidents create a gap, increasing the potential for error. This is where synthetic data comes in to fill gaps in the dataset and also reduce incumbent bias. Expanding the data pool for edge cases stretches the capabilities of an AI model, allowing for the development of solutions in the form of automated tolling, automated vehicles or intelligent traffic lights which will be integral to smart cities. Regardless of how this technology is deployed, synthetic data is guaranteeing these systems are truly intelligent and safe.

Being able to easily find a parking spot or preventing constant traffic jams is all in the pursuit of cities that are better to live in. The deployment of smart technology should come with considerations of how it will impact the lives of those living in the city. A human focus should be maintained throughout the process.

Aligning citizens’ and officials’ needs

Given synthetic data is not subject to the same privacy issues that real-world data is, there is also a greater opportunity for transparency. The lack of need for tracking personally identifiable information allows for models to be developed for meaningful causes, with safety in mind, without breaching these concerns. Its machine-generated nature saves valuable time and energy, whilst also providing the opportunity to counteract bias in the base data. Offering a viable solution for preserving data privacy means that datasets can be more openly published and analysed without having to mask or omit information. This further pushes for service-building that prioritises citizen wellbeing.

City officials should have the needs of citizens at the top of their agendas and so, optimising city functions for the sake of elevating quality of life and wellbeing should also be in their best interests. However, decisions will always come down to economic implications too. Promoting economic growth is understandably an important driver of implementing smart technologies. Smart cities come at a cost, but with the use of synthetic data this can be a more affordable process. As mentioned, real-world data is expensive to collect, both in terms of resources and time. Synthetic data can be collected inexpensively and can be readily available for whatever edge case may be needed. Therefore, leveraging the use of synthetic data can bring the needs of citizens and city officials into alignment to develop smart cities that benefit everyone.

Steve Harris Ultimately, AI systems are essential in the move towards smarter cities all over the world and currently, there is a need to revise training methods to allow the technology to work as effectively as it can. Synthetic data may be the answer to addressing the shortcomings of the system, pushing us closer to making this a reality in the near future. It is more flexible, cheaper and more convenient to collect. However, it is important to acknowledge that a diversity of datasets will always be the more optimal approach to accurately training models. Therefore, jointly training with synthetic data will pave the way for developing globally smarter and safer cities.