Simplify for Success - Conversation with Alexandra Ebert
Alexandra is an ethical AI, synthetic data & privacy expert, serving as the Chief Trust Officer at Mostly AI. She spoke about the use of synthetic data and the benefits it offers.
Alexandra also discussed the different approaches to creating synthetic data and ways to eliminate biases.
Thank you to Fesliyan Studios for the background music.
*Views and opinions expressed by guests do not necessarily reflect the view of Meru Data.*
Hello everyone, welcome to our podcast around simplifying for success. Simplification requires discipline and clarity of thought. This is not often easy in today's rapid-paced work environment. We've invited a few colleagues in the data and information governance space to share their strategies and approaches for simplification.
Today, we've invited Alexandra Ebert to talk about synthetic data and how we can enable data privacy using synthetic data. Alexandra Ebert is an ethical AI, synthetic data and privacy expert and serves as the Chief Trust Officer at Mostly AI.
She regularly speaks at international conferences on AI, privacy and digital banking, and hosts the Data Democratization podcast. As a member of the company's senior leadership team, she is engaged in public policy issues in the emerging field of synthetic data and ethical AI and is responsible for engaging with the privacy community, regulators, the media and customers.
Before joining the company, she researched GDPR’s impact on the deployment of artificial intelligence in Europe and its economic, societal and technological consequences. Besides being an advocate for privacy protection, Alexandra is deeply passionate about ethical AI and ensuring the fair and responsible use of machine learning algorithms. She is the co-author of an ICLR paper and a popular blog series on fairness in AI and fair synthetic data, which was featured in Forbes and IEEE Spectrum.
Apart from her work at Mostly AI, she serves as the chair of the IEEE Synthetic Data IC expert group and was pleased to be invited to join the group of AI experts for the #humanAIze initiative, which aims to make AI more inclusive and accessible to everyone.
Hi, Alexandra. Welcome to the show.
Hi Priya, it's so nice to be here. Thanks for having me.
So tell me a little bit about synthetic data? What does it mean?
Sure, happy to do so. So synthetic data, the way we at Mostly AI do it, is a new, sophisticated form of data anonymization. And the reason synthetic data is needed nowadays is that regulatory pressure and the emphasis on privacy protection are increasing. We see it in the European Union with GDPR, but also in the United States with the California Consumer Privacy Act and all the other privacy laws. On the other hand, no matter which industry you're looking into, companies nowadays need more data: they need artificial intelligence and data analytics to be more efficient, to be more customer-centric and, in general, to improve their products and service offerings.
And most people understand that there's a little bit of a conflict between the principles of data protection and privacy, where the answer oftentimes is to lock away data and not let people touch or see privacy-sensitive personal data, and the AI, analytics and data innovation side of the story, which can't get enough data to build all these exciting new products. So synthetic data is kind of the puzzle piece in between that gives you access to anonymous data that's as good as your original data. It's an anonymization technique, but one that doesn't have the pitfalls of all the legacy anonymization techniques that we have out there.
So you mean to say that it's fake data? So how do you make synthetic data? Do you take the original data and create some data out of it, or is it something different? How do you create this data set? Or is it the same for everyone? Or is it something specific that you create for each user or each use case? So how does that work?
Sure, good question. So, there are different approaches to creating synthetic data, but the approach that's currently so hyped, because it solves this challenge between privacy protection and data utilization, is AI-generated data. This synthetic data is generated with powerful neural network learning algorithms, and the process looks like this: you have original data, for example, the consumer transactions in your e-commerce store or something like that. You have this original data set, but of course it's full of privacy-sensitive information. This whole data set is digested and looked at by the software, by the powerful AI algorithm, and it learns everything that's in there, so to say. It's simple. The algorithm learns how the customers of this specific company behave and interact. It learns the time dependencies, the correlations, all the statistical structures and distributions.
That's the first step: learning from an original data set. In the second step, once this training is completed, you can use the synthetic data generator to create an arbitrary number of new synthetic customers that reflect your original customer base but don't have any privacy-sensitive information in there. So what this means is you get synthetic data where, when you look at the statistics, when you look at the patterns, for example, the shopping patterns of your customers or who bought a black sweater or a red cheese or something like that, all of this information is preserved, but it's not attached to real individuals anymore. This is what you get with synthetic data generation.
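[Editor's sketch] The two steps described above (learn from the original data, then sample new records) can be illustrated with a deliberately tiny stand-in. This is not Mostly AI's method: instead of a neural network it fits simple frequency tables, the data and the names `fit` and `sample` are hypothetical, and a toy like this offers no privacy guarantee. It only shows the learn-then-generate shape of the process.

```python
import random
from collections import Counter, defaultdict

# Hypothetical toy "original" data: (age_group, product) per purchase.
original = [
    ("18-29", "sweater"), ("18-29", "sneakers"), ("18-29", "sneakers"),
    ("30-49", "sweater"), ("30-49", "sweater"), ("30-49", "jacket"),
    ("50+", "jacket"), ("50+", "jacket"), ("50+", "sweater"),
]

def fit(rows):
    """Step 1: learn P(age_group) and P(product | age_group) from the rows."""
    age_counts = Counter(age for age, _ in rows)
    product_counts = defaultdict(Counter)
    for age, product in rows:
        product_counts[age][product] += 1
    return age_counts, product_counts

def sample(model, n, seed=0):
    """Step 2: draw n new synthetic rows from the learned distributions."""
    age_counts, product_counts = model
    rng = random.Random(seed)
    ages = list(age_counts)
    rows = []
    for _ in range(n):
        age = rng.choices(ages, weights=[age_counts[a] for a in ages])[0]
        products = list(product_counts[age])
        weights = [product_counts[age][p] for p in products]
        rows.append((age, rng.choices(products, weights=weights)[0]))
    return rows

model = fit(original)             # learn once from the original data
synthetic = sample(model, 1000)   # then generate as many rows as needed
```

Because the generator is separate from the training data, the number of synthetic rows is arbitrary, which matches the "arbitrary number of new synthetic customers" point made in the conversation.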
So let me ask you, so why should I do this versus just anonymize my data and then use it? So what is the advantage of using synthetic data?
Excellent question, Priya. So legacy or traditional anonymization techniques like masking or obfuscation are all, on the one hand, destructive techniques. They work in a way that you take an original data set and, to simplify, take a black marker and go over this data set to delete the last name of a person or the Social Security number of a person, and many other attributes and fields that the person performing this anonymization procedure finds to be sensitive or potentially re-identifying. And of course, since it's destructive, you're losing lots of valuable information, which is one disadvantage.
The second disadvantage is that researchers in the past few years have demonstrated over and over again that no matter how much information you delete with these legacy anonymization techniques, in the era of big data it's still easy to re-identify people. To give you an example from the financial services industry, think of the transactions that you make with your credit card over the course of a year.
For most people, these are dozens, if not hundreds, of transactions, and scientists were able to show that just two out of these potentially 100 transactions per customer are enough, and not even the whole transaction but only the date of the transaction and the merchant. For example, on the 2nd of February you bought something at Walmart and on the 7th of March you bought something at Amazon. These tiny pieces of information were sufficient to uniquely re-identify over 80% of customers in this data set. So you can see, out of the hundreds of data points companies nowadays collect about their customers, they can't even preserve three or four or five pieces of information without running into this re-identification risk.
And then, of course, there is the reputational damage that can come with those privacy fines, and so on and so forth. And now to answer your initial question, why is synthetic data better? What's the benefit? Synthetic data doesn't have this re-identification risk and it preserves the information, so you don't destroy any elements of the data and you don't have to get rid of certain sensitive fields. You create a completely synthetic data set that has all the columns and all the rows of the original data set, but they are filled out synthetically. Therefore you have a rich information source that you can analyze and query and use to identify insights, but it's synthetic, artificial data of artificial customers that are not real people, and therefore you can't re-identify this data and don't run into the risk of privacy fines or the reputational damage that comes with privacy breaches.
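[Editor's sketch] The re-identification argument above can be made concrete with a small uniqueness check: pick a few (date, merchant) points from one customer's history and count how many customers match all of them. The data and the function name `fraction_unique` are hypothetical, and this is only a minimal illustration of the idea, not the procedure used in the research mentioned in the conversation.

```python
import random

# Hypothetical toy data: customer id -> set of (date, merchant) points.
traces = {
    "c1": {("02-02", "Walmart"), ("03-07", "Amazon"), ("03-09", "Target")},
    "c2": {("02-02", "Walmart"), ("03-08", "Amazon"), ("03-09", "Target")},
    "c3": {("02-02", "Walmart"), ("03-07", "Amazon"), ("04-01", "Costco")},
    "c4": {("02-03", "Walmart"), ("03-07", "Amazon"), ("03-09", "Target")},
}

def fraction_unique(traces, k, seed=0):
    """Share of customers pinned down by k random points of their own trace."""
    rng = random.Random(seed)
    unique = 0
    for customer, points in traces.items():
        # The k points an attacker is assumed to know about this customer.
        known = set(rng.sample(sorted(points), k))
        # Everyone whose trace contains all of the known points.
        matches = [c for c, other in traces.items() if known <= other]
        if matches == [customer]:
            unique += 1
    return unique / len(traces)

for k in (1, 2, 3):
    print(k, fraction_unique(traces, k))
```

On real transaction data the traces are much longer and sparser than in this toy, which is why very few known points can already single out most customers.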
I mean, are all synthetic data the same? How do I know that it's going to work well with my model or whatever I've built for data analytics, right? How do I evaluate good synthetic data versus not so good ones?
Good question. So, as I mentioned earlier, there are different approaches to generating data, and even in the category of deep learning, AI-generated data, there are different techniques, different models and tools that are used, and also different vendor solutions, open-source solutions or solutions that are built in-house by different companies. Therefore, you can't speak about all synthetic data in equal terms because, you're right, how it's generated matters.
We, at Mostly AI, are leading the synthetic data space, so we have quite sophisticated privacy metrics that are used during the generation process, but also privacy and accuracy evaluation metrics. Basically, a quality assurance report with these metrics is automatically generated whenever a customer uses our software, and this gives the person an overview of how accurate the data is and how private it is. But of course, the whole industry is in need of these kinds of metrics, and therefore we also initiated a standards initiative together with the IEEE Standards Organization, where we are tackling exactly this challenge of identifying industry-wide accepted standards on how to assess synthetic data privacy and accuracy, together with researchers and experts from the industry.
So once I start using synthetic data, I mean obviously there are some estimates that I saw that by 2024, 60% of the data used would probably be synthetic, but once I start using synthetic data, do I sort of immediately eliminate all of my issues like for example, how do I make sure the synthetic data does not have any biases? We talked a lot about privacy, you said that one of the metrics that you have at least from a synthetic data perspective, is to sort of look at privacy metrics to see if it has eliminated anything that would identify an individual, but let's also talk about biases. So, if let's say, my original data had some biases, how do I ensure that my synthetic data doesn't contain the same things?
Happy to do that, because fairness in AI, and responsible AI in general, is a topic that's very dear to my heart. So, in the original process of synthetic data generation, you wouldn't address biases. What you would do is create an artificial replica of an original data set, one that is of course privacy safe. The big benefit here is that for many companies, for the very first time, once they have synthetic data, they are in a position where not only a small group of privileged people, such as data analysts or other professionals within the organization, have permission to look at the data, but everybody can. This is one of the most important steps towards transparency: a diverse set of people can assess data sets and look into the data to see whether there are biases present or not.
But of course, you're right, in an ideal world you want to get rid of biases in data, so we are also doing quite some research on fair synthetic data. When you look at an original data set, the task with fair synthetic data is not to create an exact replica that gives you a synthetic, artificial data set of the world as it is, but to give you a data set that reflects the world as you would like it to be. So for example, we have a quite popular series on fairness in AI and fair synthetic data on our blog at Mostly AI, which was also featured by Forbes, IEEE Spectrum, AI experts and many others. There we used the US Census data set, where you can clearly see that the fraction of females in the low-earning column is much higher than that of males, who are predominantly in the higher-earning column. There we, for example, tweak the data set so that it's still super accurate and realistic but eliminates this gender pay gap that we still have in the United States, in Europe and in many other countries and continents in the world. So fair synthetic data is one possibility for how you can get rid of biases.
But of course, there are plenty of other tools out there that help you to eliminate biases. The first important step is that you can work with the data, see the data, and analyze and evaluate whether there are biases in there in the first place. And this is where synthetic data can help tremendously, because it gives you the transparency, in a privacy-safe form, to have more people from inside the organization, and even external experts, look at and evaluate your data.
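[Editor's sketch] Checking whether a data set carries the kind of income gap described above can start with a very simple measurement: the share of high earners per gender and the difference between the two. The rows, the column names and the `">50K"` label below are hypothetical toy values, and this only measures a gap; it does not show how a fair synthetic data set is produced.

```python
# Hypothetical toy rows; a real census extract would have many more columns.
rows = (
    [{"gender": "male", "income": ">50K"}] * 3
    + [{"gender": "male", "income": "<=50K"}] * 7
    + [{"gender": "female", "income": ">50K"}] * 1
    + [{"gender": "female", "income": "<=50K"}] * 9
)

def high_income_share(rows, gender):
    """Fraction of people of the given gender in the high-earning class."""
    group = [r for r in rows if r["gender"] == gender]
    return sum(r["income"] == ">50K" for r in group) / len(group)

def parity_gap(rows):
    """Difference in high-earner share between males and females."""
    return high_income_share(rows, "male") - high_income_share(rows, "female")

print(round(parity_gap(rows), 3))
```

Running the same function on an original data set and on a synthetic one is a quick way to see whether a gap was carried over or reduced.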
So, I keep hearing that generating synthetic data is not just about plugging in the AI tool, analyzing your datasets and there you go, you start getting synthetic data. It requires people with advanced knowledge of AI and specialized skills to evaluate this data, and some sophisticated frameworks to manage this process, right? So what are some of the biggest drawbacks, how do I educate myself to understand synthetic data better, and what are some places or techniques that you would recommend? We could initially talk about this at a higher level, and then maybe delve a little deeper into the metrics that you were talking about and how they evaluate the data.
Sure, so you said that it requires expert knowledge and a deep understanding of how the synthesization process works. I would tend to disagree. I mean, of course you want to have some knowledge about the algorithm, the inner workings and how to interpret the metrics that are in there. But from my experience in the industry, we work with some of the largest banks and insurance providers both in the United States and in Europe, and there is one quite apparent pattern: those organizations who are already quite mature in their data management practices, who have broken up their data silos, who have clean data, labeled data, structured data that is intact and coherent, are in a much better position to do exactly what you described, namely install our software or any other software for synthetic data, upload their data sets into the software and get synthetic data at the press of a few buttons. So the process is rather simple and straightforward.
The bigger challenge is for those organizations who are still at the beginning of their data maturity journey, who are still in the process of figuring out: which data sources do we have? Can we trust that our data is correct? Do we have some bugs in the data? And so on and so forth. So, I would say the bigger challenge comes before the synthetic data generation process. But of course, once you start using synthetic data, an initial understanding of what's going on in this model, how the data can be interpreted, assessed and analyzed, and whether everything worked as intended are also skills that companies want to build. This is why, over the course of the year, we also started to offer more initial training and customer success activities for corporate clients, to help them over the course of the first few weeks with synthetic data to really get the hang of it and learn everything there is to know about using and operating synthetic data.
So you're telling me that the harder part is not so much in generating the synthetic data, but in making sure that your data is good before you start using synthetic data or generating synthetic data, which is in itself a fairly big challenge for most companies because they have data silos and they don't have good labeled, high-quality data. Often that's a challenge that most people haven't solved.
Exactly, you're right. I mean, we had a few clients where synthetic data actually helped them to improve their data governance overall because, as I mentioned before with this transparency aspect of synthetic data, once they synthesized, to put it bluntly, the flawed data they had, it was apparent to everybody looking at the data that there were some issues in there. Very simple things, for example, like female names being saved as male customers or something like that, and some other errors that became quite apparent, as well as gaps in the data and so on and so forth.
And since it was finally possible for those organizations to give the data set they currently have to a broader group of people, awareness of these issues was raised, and it helped them to address them and eventually have the quality that they needed, not only to synthesize data, because you can synthesize whatever you want, but to do what you want to do with the data after synthesizing, namely using it for analytics or AI training, where of course it's paramount that the data is reliable, accurate and correct. So this is also one thing to maybe consider.
So let's talk a little bit more about metrics, right? You briefly touched on it, but can we talk a little bit about some of the minimum metrics somebody should have before they delve into synthetic data? And, of course, understand them, because obviously it's easy to come up with metrics, but if you don't understand why you're tracking something and what it means, it's not going to be helpful. So what are some of the basic metrics, or at least minimal metrics, that you recommend people use before they start using any type of synthetic data?
So, if somebody has generated synthetic data and wants to understand whether this data is trustworthy, both in regard to the degree of privacy protection and, of course, also in its accuracy and fidelity, the absolute minimum things I think should be evaluated on the accuracy side are comparisons of all the univariate, bivariate and overall correlation structures in the data. What I mean is that you basically have a program run over the synthetic data and the source data, the original data that it was generated from. That gives you a statistical and also visualized overview of whether you have the same fraction of females and males in, say, a particular age group, and so on and so forth. This was a super basic example, but then you go down and down and become much more granular, looking at whether you have the same fraction of 74- to 76-year-old pensioners who tend to travel to Malaysia twice per year and, I don't know, buy some ice cream there.
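[Editor's sketch] A minimal version of the univariate and bivariate comparison described above, assuming categorical columns only (numeric columns would need binning first). It compares the distribution of every single column and every pair of columns between the original and the synthetic data using total variation distance, where 0 means identical and 1 means completely different. The function names and toy rows are hypothetical, and this is not the report Mostly AI's software generates.

```python
from collections import Counter
from itertools import combinations

def distribution(rows, columns):
    """Relative frequency of each value combination in the given columns."""
    counts = Counter(tuple(row[c] for c in columns) for row in rows)
    total = sum(counts.values())
    return {key: n / total for key, n in counts.items()}

def total_variation(p, q):
    """Total variation distance between two discrete distributions."""
    keys = set(p) | set(q)
    return 0.5 * sum(abs(p.get(k, 0.0) - q.get(k, 0.0)) for k in keys)

def accuracy_report(original, synthetic, columns):
    """Distance per single column (univariate) and per column pair (bivariate)."""
    groups = [(c,) for c in columns] + list(combinations(columns, 2))
    return {
        g: total_variation(distribution(original, g), distribution(synthetic, g))
        for g in groups
    }

# Toy data: the marginals match, but the gender/age relationship does not.
original = [
    {"gender": "F", "age_group": "young"}, {"gender": "F", "age_group": "young"},
    {"gender": "M", "age_group": "old"}, {"gender": "M", "age_group": "old"},
]
synthetic = [
    {"gender": "F", "age_group": "young"}, {"gender": "F", "age_group": "old"},
    {"gender": "M", "age_group": "young"}, {"gender": "M", "age_group": "old"},
]
report = accuracy_report(original, synthetic, ["gender", "age_group"])
```

In this toy case the univariate distances are 0 while the bivariate distance is 0.5, which shows why looking at single columns alone is not enough.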
So, it really goes from very top levels like age distributions and gender distributions to behavior-based comparisons. This is something that you can automate, but to take it one step further and really build this trust that what you've generated in synthetic form is reliable and allows you to analyze it and trust the results as if you had retrieved them from the original data, there are actually two more steps that I would recommend. And this is also what our clients usually do when they first start out with synthetic data. The one thing is to assess the realism of the data: they have domain experts evaluate it. We also work a lot in the insurance sector, with healthcare data and patient data, and there they have, for example, a medical professional evaluate the synthetic patients to see whether the trajectory of a certain disease, which events happen in which order, which medication, which doses and so on were given to a patient, would be realistic, just by assessing a few individuals. The other part that needs to be done to really get to this internal state of trust in synthetic data is a very thorough comparison of the data, and here the most sophisticated comparison, the highest bar an organization can have to compare synthetic data, is actually to conduct some AI training or some form of advanced analysis, where you perform one and the same analysis on the original data and on the synthetic data and see whether the results are equally accurate and give you the same results.
Of course, training a machine learning model every time we synthesize the data wouldn't be feasible, but just doing this in the beginning or maybe from time to time when you start working with completely new types of datasets helps you to build trust in synthetic data because you see that you get the same results and therefore this trust increases.
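[Editor's sketch] The "same analysis on original and synthetic data" check described above can be sketched as follows: train one model on the original data and one on synthetic data, then score both on held-out original data and compare. Everything here is a hypothetical toy (one numeric feature, a per-label Gaussian as the "generator", a nearest-centroid classifier); it illustrates the comparison, not any vendor's pipeline.

```python
import random
from statistics import mean, stdev

def make_real(n, rng):
    """Hypothetical 'original' data: one numeric feature and a binary label."""
    rows = []
    for _ in range(n):
        label = rng.random() < 0.5
        rows.append((rng.gauss(3.0 if label else 0.0, 1.0), label))
    return rows

def synthesize(rows, n, rng):
    """Toy generator: fit a Gaussian per label, then sample new rows."""
    by_label = {lab: [x for x, l in rows if l == lab] for lab in (False, True)}
    share_true = len(by_label[True]) / len(rows)
    out = []
    for _ in range(n):
        lab = rng.random() < share_true
        xs = by_label[lab]
        out.append((rng.gauss(mean(xs), stdev(xs)), lab))
    return out

def train(rows):
    """Nearest-centroid classifier: the model is just the two class means."""
    return {lab: mean(x for x, l in rows if l == lab) for lab in (False, True)}

def accuracy(model, rows):
    """Share of rows whose label matches the closer class mean."""
    hits = sum(
        (abs(x - model[True]) < abs(x - model[False])) == lab for x, lab in rows
    )
    return hits / len(rows)

rng = random.Random(0)
real_train, real_holdout = make_real(2000, rng), make_real(1000, rng)
synthetic_train = synthesize(real_train, 2000, rng)

acc_real = accuracy(train(real_train), real_holdout)        # trained on original
acc_synth = accuracy(train(synthetic_train), real_holdout)  # trained on synthetic
print(acc_real, acc_synth)
```

If the two accuracies are close, that is one piece of evidence that the synthetic data preserved what the model needs; as noted in the conversation, this is a check to run at the start or for new kinds of data sets rather than on every synthesis.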
And maybe a nice anecdote from one of our clients, one of the largest Central European banks. When they first used synthetic data, they wanted to solve a product development issue where they needed granular customer data to really make a great customer-centric and personalized product, and for privacy reasons they didn't have access. So they started using our Mostly AI software to generate a synthetic version of the, I think, 14 million customers that they have, and when they analyzed and looked into this data and assessed it, they found spending behavior and income streams in the customer data where they thought, well, this can't be real. Nobody spends his money like that or earns money like that. And they came back to us and told us, we're pretty sure you have a bug in there. This was already four years back, when we started out with Mostly AI, but we were confident that there was no bug, and so we urged them to go back to the original data. They went through all the bureaucratic processes necessary to look at the original data and, sure enough, they found out that these spending behaviors and these income patterns are what their actual customers have. So they concluded that it was a journey to start trusting synthetic data, but now they really see it as an invaluable tool because they can get access to granular customer behavior and understand their customers much better than before, including, as in the example I just gave, spending behaviors that they never thought in the first place could be those of a real person.
So what are some of the limitations that somebody should watch out for when they're using synthetic data?
Sure, so I think there is one thing people should keep in mind when working with synthetic data, which is on the one hand a blessing but on the other hand also a limitation: it's the strongest form of anonymization that is currently technically out there. It can't be re-identified; it's impossible to go back from a synthetic customer to any of your original customers, which of course is the whole purpose of anonymizing data. But this also makes clear that synthetic data is not the right tool if you have data projects where it's essential that at some point you can go back to the real individuals.
So, for example, take certain fraud analytics where you want to figure out who the fraudster is and want to have a clear name at the end of the process. With synthetic data, you never get back to this clear name, but it can give you all the financial transactions, including fraudulent transactions, to build your fraud detection models in a privacy-preserving way. This is what synthetic data can do. But then applying those models in practice, having real transactions analyzed by them, is something that you can't replicate with synthetic data, because in the end you want to figure out who the fraudster is, not get some fictitious person who the system now claims to be a fraudster.
Any other closing thoughts that you want to share with the audience?
Maybe what I'm most happy to see in the world of synthetic data: when we first started out with Mostly AI, the topic was still rather new, but we knew that anonymization and synthesization would become increasingly important because it's one of the only ways to balance privacy protection and compliance with GDPR and CCPA with advanced analytics and AI training. And you mentioned earlier this Gartner prediction that in already two years' time, 60% of all data used for AI training is going to be synthetic. I'm super happy to see, on the one hand, validation points like this, but also all the European privacy protection authorities now advocating for synthetic data, including it in their guidelines and, as in the case of the European Data Protection Supervisor, naming it one of the most important technology trends to watch in the period of 2021 to 2022.
I think these are all very nice indicators to see, aside from, of course, all the amazing projects that our customers have already accomplished with synthetic data, from an economic point of view but also from a societal point of view. So, for example, synthetic data is now increasingly used to open up access to healthcare data so that researchers can collaborate globally on fighting diseases like cancer, COVID or, in the future, hopefully also Alzheimer's. And this is really something that drives me; it motivates me to further educate people about synthetic data and encourage the use of it.
Thank you so much for sharing information about synthetic data, thank you Alexandra. It's been a pleasure to talk to you.
Sure, it was a pleasure, and if any of your listeners have questions about it, they can just find me on LinkedIn and reach out, and I’ll be happy to answer any follow-up questions there as well.