Decoding the data dilemma: Strategies for effective data deletion in the age of AI
Businesses today have a tremendous opportunity to use data in new ways, but they must also look at what data they keep and how they use it to avoid potential legal issues. Even with the growth in generative AI, organizations are responsible for not only safeguarding their data, specifically personal data, but also strategically managing and deleting older information that comes with more risk than business value.
Forrester predicts a doubling of unstructured data in 2024, driven in part by AI. But the evolving data landscape and escalating cost of breaches and privacy violations call for a critical look at how to create an effective and robust data retention and deletion strategy.
Data explosion and escalating breach costs
While the volume of data is growing, so is the cost of data breaches and privacy violations. Ransomware criminals are taking over highly sensitive medical and government databases, with recent hacks hitting Australia’s courts, a Kentucky healthcare company, 23andMe and large enterprises like Infosys, Boeing and security provider Okta. These breaches are getting more expensive too: IBM found that the average total cost of a breach was $4.45M in 2023, a 15% jump over 2020.
To manage data effectively, organizations need to craft a policy to delete obsolete data. With gen AI, executives may ask if anything should ever be deleted given future opportunities. But the longer a company stores data, the more opportunities for a data breach or fines for violations of privacy law. The first step to minimize this risk is to take a comprehensive look at how a company is using its data, along with the nuanced considerations and tangible benefits of a data retention strategy.
Why remove obsolete data?
Organizations often find themselves compelled to delete obsolete data due to legal requirements that are core to data protection laws. Regulations mandate the retention of personal data only for as long as necessary, driving companies to establish retention policies with periods that vary across business areas. Along with reducing legal liability, deleting obsolete data can reduce storage costs.
Identifying obsolete data
The best way to identify which data can be considered obsolete, and which data will add ongoing business value, is to start with a data map that outlines the sources and types of incoming data, which fields are included and which systems or servers the data is stored on. A comprehensive data map ensures a company knows where personal data lives, types of personal data processed, which types of protected or special category data are processed, the intended data processing purposes and the geographic locations of processing and applicable systems.
A meaningful data inventory and classification is the foundation for a solid privacy program and helps provide the data lineage needed to understand how data flows through a company’s systems.
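As a rough illustration, one entry in such a data map could be represented as a small record type. This is a minimal sketch, not a standard schema: the class name `DataMapEntry`, its field names and the helper `systems_with_personal_data` are all hypothetical and simply mirror the attributes listed above.

```python
from dataclasses import dataclass, field


@dataclass
class DataMapEntry:
    """One row of a hypothetical data map (all field names are assumptions)."""
    source: str                      # where the data comes from
    system: str                      # system or server where it is stored
    fields: list[str]                # fields included in the data
    personal_data_types: list[str] = field(default_factory=list)
    special_category_types: list[str] = field(default_factory=list)
    purposes: list[str] = field(default_factory=list)
    processing_regions: list[str] = field(default_factory=list)

    def holds_personal_data(self) -> bool:
        """True if the entry lists any personal or special category data."""
        return bool(self.personal_data_types or self.special_category_types)


def systems_with_personal_data(entries: list[DataMapEntry]) -> list[str]:
    """Return the sorted, de-duplicated systems that store personal data."""
    return sorted({e.system for e in entries if e.holds_personal_data()})
```

Even a simple structure like this lets legal and technical teams answer the first question of any retention review: which systems actually hold personal data.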
Once a company has a map of its corpus of data, legal and technical teams can work with business stakeholders to determine how valuable specific data might be, what sort of regulatory restrictions apply to storing that data and the potential ramifications if that data is leaked, breached or retained longer than necessary.
Most business stakeholders will naturally be reluctant to delete anything, especially when technology is changing so quickly. The deletion and retention conversation needs to focus on what’s most useful for the business. As an example, imagine a data analytics team at a financial institution that wants to ensure lending eligibility models are trained on as much data as possible. Unfortunately, that approach is counter to the intention of data protection and privacy laws.
The reality is that given how much interest rates, lending practices and consumers’ individual circumstances have changed, data from 20 years ago may not provide an accurate assessment of today’s consumers. That company may be better off focusing on other sources of recent data like updated credit information to determine an accurate risk score.
The current commercial real estate market really brings this challenge to light. Many risk-prediction models were trained on pre-pandemic data, before the systemic shift to online shopping and remote work. To reduce the chance of inaccurate predictions, discuss with business stakeholders how data becomes stale and less valuable over time and which data is most reflective of today’s world.
Handling obsolete data: Determine, delete or de-identify
To help decide how long to keep data, start with affirmative legal obligations around maintaining financial records or sector-specific regulations around transactions that entail personal data. Look at legal statute of limitation periods to determine how long to keep data if it’s needed to defend against a potential lawsuit, and only keep personal data that’s needed for a potential litigation defense, such as transaction logs or evidence of user consent, rather than every piece of data on individual users.
When it’s time to clear out less valuable information, data can be deleted manually based on the retention period for each data type defined in the retention schedule. Automating the process via a purge policy improves reliability. It’s also possible to use a deidentification process to remove identifiable personal data, or to use fully anonymized data, but this adds new challenges.
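An automated purge policy might look something like the sketch below. It is only an illustration under stated assumptions: the data types and retention periods in `RETENTION_DAYS` are invented for the example and are not legal guidance, the record layout (`type`, `created_on`) is hypothetical, and a real purge job would run against actual storage systems rather than an in-memory list.

```python
from datetime import date, timedelta

# Hypothetical retention schedule: data type -> retention period in days.
# These values are placeholders, not recommendations.
RETENTION_DAYS = {
    "marketing_lead": 365,
    "transaction_log": 7 * 365,
}


def is_expired(data_type: str, created_on: date, today: date) -> bool:
    """Return True if a record has outlived its retention period.

    Types missing from the schedule are never treated as expired, so
    unclassified data is held for review instead of being deleted.
    """
    days = RETENTION_DAYS.get(data_type)
    if days is None:
        return False
    return today - created_on > timedelta(days=days)


def purge(records: list[dict], today: date) -> tuple[list[dict], list[dict]]:
    """Split records into (kept, purged) according to the schedule."""
    kept, purged = [], []
    for record in records:
        if is_expired(record["type"], record["created_on"], today):
            purged.append(record)
        else:
            kept.append(record)
    return kept, purged
```

Returning the purged records separately, rather than silently dropping them, makes it easier to log what was removed and to audit the policy before it runs against production data.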
Truly deidentified data generally falls under exemptions in data protection laws, but doing this correctly requires stripping out so much value that there’s not much left to use. Deidentifying requires stripping out unique and direct identifiers like an SSN and name, but also indirect identifiers, including information like customer IP addresses. For example, to meet the HIPAA standard for safe harbor protection, an organization must remove a list of 18 identifiers. An organization may want to try this approach to maintain the performance of an analytics or AI model. But it’s important to discuss the pros and cons with stakeholders first.
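To make the trade-off concrete, here is a minimal sketch of field-level deidentification. The field names (`name`, `ssn`, `ip_address`) follow the examples above, but the record layout and the `deidentify` helper are hypothetical, and the two sets are deliberately incomplete: they do not cover the 18 HIPAA safe harbor identifiers, and dropping fields alone does not guarantee that the remaining data cannot be re-identified.

```python
# Illustrative identifier lists only; a real policy needs a complete,
# legally reviewed list for the applicable regulation.
DIRECT_IDENTIFIERS = {"name", "ssn"}
INDIRECT_IDENTIFIERS = {"ip_address"}


def deidentify(record: dict) -> dict:
    """Return a copy of the record without the listed identifier fields.

    The input record is left unchanged.
    """
    blocked = DIRECT_IDENTIFIERS | INDIRECT_IDENTIFIERS
    return {key: value for key, value in record.items() if key not in blocked}
```

Running a sample record through a function like this shows stakeholders exactly which fields survive, which is a useful starting point for discussing whether what remains is still valuable to an analytics or AI model.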
Avoiding common pitfalls
The biggest mistake enterprises make in addressing obsolete data is rushing the process and skipping over those in-depth conversations. Project owners need to resist the urge to expedite and recognize that the right feedback from multiple groups is essential. Companies should work across legal, privacy and security teams, along with business leaders, to get feedback on what data is essential to keep, and to avoid a retention policy and schedule that inadvertently deletes something the company needs. It’s easier to shorten retention periods over time and retain less personal data, but once data is deleted, it’s gone, so measure twice and cut once.
As we’ve outlined above, there are several considerations in addressing obsolete data, including foundational data mapping and lineage, defining retention period criteria and working out how to implement these policies efficiently. Navigating the intricacies of data deletion requires a strategic and informed approach. By understanding the legal, cybersecurity and financial implications, organizations can develop a robust data retention strategy that not only complies with regulations but also effectively safeguards their digital assets.
Seth Batey is data protection officer and senior managing privacy counsel at Fivetran.