Will “Leaky” Machine Learning Usher in a New Wave of Lawsuits?

A computer science professor at Cornell University has a new twist on Marc Andreessen’s 2011 pronouncement that software is “eating the world.”  According to Vitaly Shmatikov, it is “machine learning [that] is eating the world” today.  His point is clear: machine learning and other applications of artificial intelligence are disrupting society at a rate that shows little sign of leveling off.  With increasing numbers of companies and individual developers producing customer-facing AI systems, it seems all but inevitable that some of those systems will have unintended and unforeseen consequences, including harm to individuals and society at large.  Researchers like Shmatikov and his colleagues are starting to reveal those consequences, including one, “leaky” machine learning models, that could have serious legal implications.

This post explores the causes of action that might be asserted against a developer who publishes a leaky machine learning model, either directly or via a machine learning as a service (MLaaS) cloud platform, along with possible defenses, using the lessons of cybersecurity litigation as a jumping-off point.

Over the last decade or more, the plaintiffs bar and the defendants bar have contributed to a body of case law now commonly referred to as cybersecurity law.  This was inevitable, given the estimated 8,000 data breaches involving 11 billion data records made public since 2005.  After some well-publicized breaches, lawsuits against companies that reported data thefts began appearing more frequently on court dockets across the country.  Law firms responded by marketing “cybersecurity” practice groups whose attorneys advised clients about managing risks associated with data security and the aftermath of data exfiltrations by cybercriminals.  Today, with an estimated 70 percent of all data being generated by individuals (often related to those individuals’ activities), and with organizations globally expected to lose over 146 billion more data records between 2018 and 2023 if current cybersecurity tools are not improved (Juniper Research), the number of cybersecurity lawsuits is not expected to level off anytime soon.

While data exfiltration lawsuits may be the most prevalent type of cybersecurity lawsuit today, the plaintiffs bar has begun targeting other cyber issues, such as ransomware attacks, especially those affecting healthcare facilities (in ransomware cases, malicious software freezes an organization’s computer systems until a ransom is paid; while frozen, a business may not be able to effectively deliver critical services to customers).  The same litigators who have expanded into ransomware may soon turn their attention to a new kind of cyber-like “breach”: the so-called leaky machine learning models built on thousands of personal data records.

In their research, sponsored in part by the National Science Foundation (NSF) and Google, Shmatikov and his colleagues in early 2017 “uncovered multiple privacy and integrity problems in today’s [machine learning] pipelines” that could be exploited by adversaries to infer whether a particular person’s data record was used to train machine learning models.  See R. Shokri et al., Membership Inference Attacks Against Machine Learning Models, Proceedings of the 38th IEEE Symposium on Security and Privacy (2017).  They describe a health care machine learning model that could reveal to an adversary whether or not a certain patient’s data record was part of the model’s training data.  In another example, a different model trained on location and other data, used to categorize mobile users based on their movement patterns, was found to reveal, when queried, whether a particular user’s location data was used.

These scenarios certainly raise alarms from a privacy perspective, and one can imagine other possible instances of machine learning models revealing the kind of personal information to an attacker that might cause harm to individuals.  While actual user data may not be revealed in these attacks, the mere inference that a person’s data record was included in a data set used to train a model, what Shmatikov and previous researchers refer to as “membership inference,” could cause that person (and the thousands of others whose data records were used) embarrassment and other consequences.

Assuming for the sake of argument that a membership inference disclosure of the kind described above becomes legally actionable, it is instructive to consider what businesses facing membership inference lawsuits might expect in terms of statutory and common law causes of action so they can take steps to mitigate problems and avoid contributing more cyber lawsuits to already busy court dockets (and of course avoid leaking confidential and private information).  These causes of action could include invasion of privacy, violation of consumer protection laws, unfair trade practices, negligence, negligent misrepresentation, innocent misrepresentation, negligent omission, breach of warranty, and emotional distress, among others.  See, e.g., Sony Gaming Networks & Cust. Data Sec. Breach Lit., 996 F. Supp. 2d 942 (S.D. Cal. 2014) (evaluating data exfiltration causes of action).

Negligence might be alleged, as it often is in cybersecurity cases, if plaintiff (or class action members) can establish evidence of the following four elements: the existence of a legal duty; breach of that duty; causation; and cognizable injury.  Liability might arise where defendant failed to properly safeguard and protect private personal information from unauthorized access, use, and disclosure, where such use and disclosure caused actual money or property loss or the loss of a legally-protected interest in the confidentiality and privacy of plaintiff’s/members’ personal information.

Misrepresentation might be alleged if plaintiff/members can establish evidence of a misrepresentation upon which they relied and a pecuniary loss resulting from that reliance.  Liability under such a claim could arise if, for example, plaintiff’s data record has monetary value and a company makes representations about its data security measures in user agreements, terms of service, and/or privacy policies that turn out to be in error (for example, the company’s measures lack robustness and do not prevent an attack on a model that is found to be leaky).  In some cases, actual reliance on statements or omissions may need to be alleged.

Violations of state consumer protection laws might also be alleged if plaintiff/members can establish (depending on which state law applies) deceptive misrepresentations or omissions regarding the standard, quality, or grade of a particular good or service that cause harm, such as those that mislead plaintiff/members into believing that their personal private information would be safe upon transmission to defendant when defendant knew of vulnerabilities in its data security systems.  Liability could arise where defendant was deceptive in omitting notice that its machine learning model could reveal to an attacker the fact that plaintiff’s/members’ data record was used to train the model.  In certain situations, plaintiff/members might have to allege with particularity the specific time, place, and content of the misrepresentation or omission if the allegations are based in fraud.

For their part, defendants in membership inference cases might challenge plaintiff’s/members’ lawsuit on a number of fronts.  As an initial tactic, defendants might challenge plaintiff’s/members’ standing on the basis of failing to establish an actual injury caused by the disclosure (inference) that a data record was used to train a machine learning model.  See In re Science App. Intern. Corp. Backup Tape Data, 45 F. Supp. 3d 14 (D.D.C. 2014) (considering “when, exactly, the loss or theft of something as abstract as data becomes a concrete injury”).

Defendants might also challenge plaintiff’s/members’ assertions that an injury is imminent or certainly impending.  As in data breach cases, defendants might rely on state court decisions that denied standing where the only alleged injury was a potential risk of future identity theft resulting from the loss of personal information, reasoning that might also apply in a membership inference case.

Defendants might also question whether permission and/or consent was given by plaintiff/members for the collection, storage, and use of personal data records.  This inquiry would likely involve plaintiff’s/members’ awareness and acceptance of membership inference risks when they allowed their data to be used to train a machine learning model.  Defendants would likely examine whether the permission/consent given extended to and was commensurate in scope with the uses of the data records by defendant or others.

Defendants might also consider applicable agreements related to a user’s data records that limit plaintiff’s/members’ choice of forum and determine which state laws apply, which could affect pleading and proof burdens.  Defendants might rely on language in terms of service and other agreements that provide notice of the possibility of external attacks and the risks of leaks and membership inference.  Many other challenges to plaintiff’s/members’ allegations could also be explored.

Apart from challenging causes of action on the merits, companies should also consider taking other measures like those used by companies in traditional data exfiltration cases.  These might include proactively testing their systems (in the case of machine learning models, testing for leakage) and implementing procedures to provide notice of a leaky model.  As Shmatikov and his colleagues suggest, machine learning model developers and MLaaS providers should take into account the risk that their models will leak information about their training data, warn customers about this risk, and “provide more visibility into the model and the methods that can be used to reduce this leakage.”  Machine learning companies should account for foreseeable risks and associated consequences and assess whether they are acceptable compared to the benefits received from their models.
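To make “testing for leakage” concrete, the following is a minimal Python sketch of one simple audit a developer might run before publishing a model.  It is not the shadow-model attack described by Shmatikov and his colleagues; it is a simpler confidence-threshold baseline, offered here only as an illustration.  It assumes the developer can collect the model’s confidence in its prediction for records known to be in the training set and for held-out records that were not.  The function name and the sample confidence values are hypothetical.

```python
from typing import Sequence, Tuple


def threshold_attack_accuracy(
    member_conf: Sequence[float],
    nonmember_conf: Sequence[float],
) -> Tuple[float, float]:
    """Simulate a simple membership inference attack (hypothetical helper).

    The simulated attacker guesses "was in the training set" whenever the
    model's confidence is at or above a threshold.  Every observed
    confidence value is tried as the threshold, and the one giving the
    highest balanced accuracy is returned as (threshold, accuracy).

    A balanced accuracy near 0.5 means this attacker does no better than
    a coin flip; values well above 0.5 suggest the model treats its
    training records differently from unseen records, i.e. it may leak.
    """
    if not member_conf or not nonmember_conf:
        raise ValueError("both confidence lists must be non-empty")

    best_threshold, best_accuracy = 0.0, 0.0
    for t in sorted(set(member_conf) | set(nonmember_conf)):
        # Fraction of training records correctly flagged as members.
        true_positive_rate = sum(c >= t for c in member_conf) / len(member_conf)
        # Fraction of held-out records correctly flagged as non-members.
        true_negative_rate = sum(c < t for c in nonmember_conf) / len(nonmember_conf)
        accuracy = (true_positive_rate + true_negative_rate) / 2
        if accuracy > best_accuracy:
            best_threshold, best_accuracy = t, accuracy
    return best_threshold, best_accuracy


if __name__ == "__main__":
    # Made-up confidences: the model is noticeably more confident on
    # records it was trained on than on records it has never seen.
    in_training = [0.99, 0.97, 0.95, 0.90]
    held_out = [0.92, 0.70, 0.65, 0.60]
    threshold, accuracy = threshold_attack_accuracy(in_training, held_out)
    print(f"best threshold: {threshold}, attacker accuracy: {accuracy}")
```

An audit like this gives only a lower bound on leakage, since a more capable attacker may do better than a single threshold.  Still, a result well above 0.5 would be a signal that the kind of notice and mitigation steps discussed above deserve attention before the model is released.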

If data exfiltration, ransomware, and related cybersecurity litigation are any indication, the plaintiffs bar may one day turn its attention to the leaky machine learning problem.  If machine learning model developers and MLaaS providers want to avoid such attention and the possibility of litigation, they should not delay taking reasonable steps to mitigate the leaky machine learning model problem.