Will “Leaky” Machine Learning Usher in a New Wave of Lawsuits?

A computer science professor at Cornell University has a new twist on Marc Andreessen’s 2011 pronouncement that software is “eating the world.”  According to Vitaly Shmatikov, it is “machine learning [that] is eating the world” today.  His personification is clear: machine learning and other applications of artificial intelligence are disrupting society at a rate that shows little sign of leveling off.  With increasing numbers of companies and individual developers producing customer-facing AI systems, it seems all but inevitable that some of those systems will create unintended and unforeseen consequences, including harm to individuals and society at large.  Researchers like Shmatikov and his colleagues are starting to reveal those consequences, including one–“leaky” machine learning models–that could have serious legal implications.

This post explores the causes of action that might be asserted against a developer who publishes a leaky machine learning model, either directly or via a machine learning as a service (MLaaS) cloud platform, along with possible defenses, using the lessons of cybersecurity litigation as a jumping-off point.

Over the last decade or more, the plaintiffs bar and the defendants bar have contributed to a body of case law now commonly referred to as cybersecurity law.  This was inevitable, given the estimated 8,000 data breaches involving 11 billion data records made public since 2005. After some well-publicized breaches, lawsuits against companies that reported data thefts began appearing more frequently on court dockets across the country.  Law firms responded by marketing “cybersecurity” practice groups whose attorneys advised clients about managing risks associated with data security and the aftermath of data exfiltrations by cybercriminals.  Today, with an estimated 70 percent of all data being generated by individuals (often related to those individuals’ activities), and with organizations globally expected to lose over 146 billion more data records between 2018 and 2023 if current cybersecurity tools are not improved (Juniper Research), the number of cybersecurity lawsuits is not expected to level off anytime soon.

While data exfiltration lawsuits may be the most prevalent type of cybersecurity lawsuit today, the plaintiffs bar has begun targeting other cyber issues, such as ransomware attacks, especially those affecting healthcare facilities (in ransomware cases, malicious software freezes an organization’s computer systems until a ransom is paid; while frozen, a business may not be able to effectively deliver critical services to customers).  The same litigators who have expanded into ransomware may soon turn their attention to a new kind of cyber-like “breach”: the so-called leaky machine learning models built on thousands of personal data records.

In their research, sponsored in part by the National Science Foundation (NSF) and Google, Shmatikov and his colleagues in early 2017 “uncovered multiple privacy and integrity problems in today’s [machine learning] pipelines” that could be exploited by adversaries to infer if a particular person’s data record was used to train machine learning models.  See R. Shokri, Membership Inference Attacks Against Machine Learning Models, Proceedings of the 38th IEEE Symposium on Security and Privacy (2017). They describe a health care machine learning model that could reveal to an adversary whether or not a certain patient’s data record was part of the model’s training data.  In another example, a different model trained on location and other data, used to categorize mobile users based on their movement patterns, was found to reveal by way of query whether a particular user’s location data was used.

These scenarios certainly raise alarms from a privacy perspective, and one can imagine other possible instances of machine learning models revealing the kind of personal information to an attacker that might cause harm to individuals.  While actual user data may not be revealed in these attacks, the mere inference that a person’s data record was included in a data set used to train a model, what Shmatikov and previous researchers refer to as “membership inference,” could cause that person (and the thousands of others whose data records were used) embarrassment and other consequences.
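To make the idea of membership inference concrete, the following is a minimal sketch, not the method used in the cited paper, which is considerably more involved. It assumes a hypothetical toy model (`OverfitModel`) that has memorized its training records and therefore answers with unusually high confidence when queried with one of them. The class, the function `infer_membership`, the sample records, and the 0.9 threshold are all illustrative assumptions.

```python
import math


class OverfitModel:
    """Hypothetical toy model that memorizes its training records."""

    def __init__(self, training_records):
        self.training_records = list(training_records)

    def confidence(self, record):
        # Confidence falls off with distance to the nearest memorized record,
        # so a record seen during training scores exactly 1.0.
        nearest = min(math.dist(record, seen) for seen in self.training_records)
        return 1.0 / (1.0 + nearest)


def infer_membership(model, record, threshold=0.9):
    """Guess that `record` was used in training if the model is unusually confident."""
    return model.confidence(record) >= threshold


# Illustrative two-feature "data records" (assumed values, not real data).
model = OverfitModel([(0.0, 0.0), (1.0, 2.0), (3.0, 1.0)])

in_training = infer_membership(model, (1.0, 2.0))      # record the model was trained on
not_in_training = infer_membership(model, (5.0, 5.0))  # record the model never saw
print(in_training, not_in_training)
```

Note that the attacker in this sketch only queries the model and observes its output; no training record is ever disclosed directly, which is what distinguishes this kind of leak from a conventional data exfiltration.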

Assuming for the sake of argument that a membership inference disclosure of the kind described above becomes legally actionable, it is instructive to consider what businesses facing membership inference lawsuits might expect in terms of statutory and common law causes of action so they can take steps to mitigate problems and avoid contributing more cyber lawsuits to already busy court dockets (and of course avoid leaking confidential and private information).  These causes of action could include invasion of privacy, consumer protection laws, unfair trade practices, negligence, negligent misrepresentation, innocent misrepresentation, negligent omission, breach of warranty, and emotional distress, among others.  See, e.g., Sony Gaming Networks & Cust. Data Sec. Breach Lit., 996 F. Supp. 2d 942 (S.D. Cal. 2014) (evaluating data exfiltration causes of action).

Negligence might be alleged, as it often is in cybersecurity cases, if plaintiff (or class action members) can establish evidence of the following four elements: the existence of a legal duty; breach of that duty; causation; and cognizable injury.  Liability might arise where defendant failed to properly safeguard and protect private personal information from unauthorized access, use, and disclosure, where such use and disclosure caused actual money or property loss or the loss of a legally-protected interest in the confidentiality and privacy of plaintiff’s/members’ personal information.

Misrepresentation might be alleged if plaintiff/members can establish evidence of a misrepresentation upon which they relied and a pecuniary loss resulting from reliance on the actionable misrepresentation. Liability under such a claim could arise if, for example, plaintiff’s data record has monetary value and a company makes representations about its use of security and data security measures in user agreements, terms of service, and/or privacy policies that turn out to be in error (for example, the company’s measures lack robustness and do not prevent an attack on a model that is found to be leaky).  In some cases, actual reliance on statements or omissions may need to be alleged.

Violations of state consumer protection laws might also be alleged if plaintiff/members can establish (depending on which state law applies) deceptive misrepresentations or omissions regarding the standard, quality, or grade of a particular good or service that cause harm, such as those that mislead plaintiff/members into believing that their personal private information would be safe upon transmission to defendant when defendant knew of vulnerabilities in its data security systems. Liability could arise where defendant was deceptive in omitting notice that its machine learning model could reveal to an attacker the fact that plaintiff’s/members’ data record was used to train the model. In certain situations, plaintiff/members might have to allege with particularity the specific time, place, and content of the misrepresentation or omission if the allegations are based in fraud.

For their part, defendants in membership inference cases might challenge plaintiff’s/members’ lawsuit on a number of fronts.  As an initial tactic, defendants might challenge plaintiff’s/members’ standing on the basis of failing to establish an actual injury caused by the disclosure (inference) of a data record used to train a machine learning model.  See In re Science App. Intern. Corp. Backup Tape Data, 45 F. Supp. 3d 14 (D.D.C. 2014) (considering “when, exactly, the loss or theft of something as abstract as data becomes a concrete injury”).

Defendants might also challenge plaintiff’s/members’ assertions that an injury is imminent or certainly impending.  In data breach cases, defendants might rely on state court decisions that denied standing where injury from a mere potential risk of future identity theft resulting from the loss of personal information was not recognized, which might also apply in a membership inference case.

Defendants might also question whether permission and/or consent was given by plaintiff/members for the collection, storage, and use of personal data records.  This query would likely involve plaintiff’s/members’ awareness and acceptance of membership risks when they allowed their data to be used to train a machine learning model.  Defendants would likely examine whether the permission/consent given extended to and was commensurate in scope with the uses of the data records by defendant or others.

Defendants might also consider applicable agreements related to a user’s data records that limited plaintiff’s/members’ choice of forum and which state laws apply, which could affect pleading and proof burdens.  Defendants might rely on language in terms of service and other agreements that provide notice of the possibility of external attacks and the risks of leaks and membership inference.  Many other challenges to a plaintiff’s/members’ allegations could also be explored.

Apart from challenging causes of action on the merits, companies should also consider taking other measures like those used by companies in traditional data exfiltration cases.  These might include proactively testing their systems (in the case of machine learning models, testing for leakage) and implementing procedures to provide notice of a leaky model.  As Shmatikov and his colleagues suggest, machine learning model developers and MLaaS providers should take into account the risk that their models will leak information about their training data, warn customers about this risk, and “provide more visibility into the model and the methods that can be used to reduce this leakage.”  Machine learning companies should account for foreseeable risks and associated consequences and assess whether they are acceptable compared to the benefits received from their models.
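One way a developer might approach the leakage testing mentioned above is to compare how a model responds to records it was trained on versus records it never saw. The sketch below is only an illustration under stated assumptions: the confidence scores are made-up values, and the function `attack_advantage`, the 0.9 threshold, and the `LEAK_TOLERANCE` cutoff are hypothetical names and numbers, not an established standard. It measures how much better than chance a simple threshold-based attacker would do.

```python
def attack_advantage(member_scores, non_member_scores, threshold):
    """True-positive rate minus false-positive rate of a threshold attack.

    A value near 0 means the attacker cannot tell members from non-members;
    a value near 1 means the model leaks membership almost perfectly.
    """
    tpr = sum(s >= threshold for s in member_scores) / len(member_scores)
    fpr = sum(s >= threshold for s in non_member_scores) / len(non_member_scores)
    return tpr - fpr


# Assumed model confidences on records used in training (members)
# and on held-out records (non-members). Illustrative values only.
member_scores = [0.99, 0.97, 0.95, 0.70]
non_member_scores = [0.96, 0.60, 0.55, 0.40]

LEAK_TOLERANCE = 0.1  # hypothetical internal cutoff for flagging a model

advantage = attack_advantage(member_scores, non_member_scores, threshold=0.9)
needs_review = advantage > LEAK_TOLERANCE
print(f"attack advantage: {advantage:.2f}, needs review: {needs_review}")
```

A result like this could feed the kind of notice procedure described above: a model whose measured advantage exceeds the chosen tolerance would be flagged before release, and the measurement itself documents that testing was performed.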

If data exfiltration, ransomware, and related cybersecurity litigation are any indication, the plaintiffs bar may one day turn its attention to the leaky machine learning problem.  If machine learning model developers and MLaaS providers want to avoid such attention and the possibility of litigation, they should not delay taking reasonable steps to mitigate the leaky machine learning model problem.

Trump Signs John S. McCain National Defense Authorization Act, Provides Funds for Artificial Intelligence Technologies

By signing into law the John S. McCain National Defense Authorization Act for Fiscal Year 2019 (H.R.5515; Public Law No: 115-232; Aug. 13, 2018), the Trump Administration has established a strategy for major new national defense and national security-related initiatives involving artificial intelligence (AI) technologies.  Some of the law’s $717 billion spending authorization for fiscal year 2019 includes proposed funding to assess the current state of AI and deploy AI across the Department of Defense (DOD).  The law also recognizes that fundamental AI research is still needed within the tech-heavy military services.  The law encourages coordination between DOD activities and private industry at a time when some Silicon Valley companies are being pressured by their employees to stop engaging with DOD and other government agencies in AI.

In Section 238 of the law, the Secretary of Defense is to lead “Joint Artificial Intelligence Research, Development, and Transition Activities” to include developing a set of activities within the DOD involving efforts to develop, mature, and transition AI technologies into operational use.  In Section 1051 of the law, an independent “National Security Commission on Artificial Intelligence” is to be established within the Executive Branch to review advances in AI and associated technologies, with a focus on machine learning (ML).

The Commission’s mandate is to review methods and means necessary to advance the development of AI and associated technologies by the US to comprehensively address US national security and defense needs.  The Commission is to review the competitiveness of the US in AI/ML and associated technologies.

“Artificial Intelligence” is defined broadly in Sec. 238 to include the following: (1) any artificial system that performs tasks under varying and unpredictable circumstances without significant human oversight, or that can learn from experience and improve performance when exposed to data sets; (2) an artificial system developed in computer software, physical hardware, or other context that solves tasks requiring human-like perception, cognition, planning, learning, communication, or physical action; (3) an artificial system designed to think or act like a human, including cognitive architectures and neural networks; (4) a set of techniques, including machine learning, that is designed to approximate a cognitive task; and (5) an artificial system designed to act rationally, including an intelligent software agent or embodied robot that achieves goals using perception, planning, reasoning, learning, communicating, decision making, and acting.  Section 1051 has a similar definition.

The law does not overlook the need for governance of AI development activities, and requires regular meetings of appropriate DOD officials to integrate the functional activities of organizations and elements with respect to AI; ensure there are efficient and effective AI capabilities throughout the DOD; and develop and continuously improve research, innovation, policy, joint processes, and procedures to facilitate the development, acquisition, integration, advancement, oversight, and sustainment of AI throughout the DOD.  The DOD is also tasked with studying AI to make recommendations for legislative action relating to the technology, including recommendations to more effectively fund and organize the DOD in areas of AI.

For further details, please see this earlier post.

Legislators, Stockholders, Civil Rights Groups, and a CEO Seek Limits on AI Face Recognition Technology

Following the tragic killings of journalists and staff inside the Capital Gazette offices in Annapolis, Maryland, in late June, local police acknowledged that the alleged shooter’s identity was determined using a facial recognition technology widely deployed by Maryland law enforcement personnel.  According to DataWorks Plus, the company contracted to support the Maryland Image Repository System (MIRS) used by Anne Arundel County Police in its investigation, its technology uses face templates derived from facial landmark points extracted from image face data to digitally compare faces to a large database of known faces.  More recent technology, relying on artificial intelligence models, has led to even better and faster image and video analysis used by federal and state law enforcement for facial recognition purposes.  AI-based models can process images and video captured by personal smartphones, laptops, home or business surveillance cameras, drones, and government surveillance cameras, including body-worn cameras used by law enforcement personnel, making it much easier to remotely identify and track objects and people in near-real time.

Recently, facial recognition use cases have led privacy and civil liberties groups to speak out about potential abuses, with a growing vocal backlash aimed at body-worn cameras and facial recognition technology used in law enforcement surveillance.  Much of the concern centers around the lack of transparency in the use of the technology, potential issues of bias, and the effectiveness of the technology itself.  This has spurred state legislators in several states to seek to impose oversight, transparency, accountability, and other limitations on the tech’s uses.  Some within the tech industry itself have even gone so far as to place self-imposed limits on uses of their software for face data collection and surveillance activities.

Maryland and California are examples of two states whose legislators have targeted law enforcement’s use of facial recognition in surveillance.  In California, state legislators took a recent step toward regulating the technology when SB-1186 was passed by its Senate on May 25, 2018.  In remarks accompanying the bill, legislators concluded that “decisions about whether to use ‘surveillance technology’ for data collection and how to use and store the information collected should not be made by the agencies that would operate the technology, but by the elected bodies that are directly accountable to the residents in their communities who should also have opportunities to review the decision of whether or not to use surveillance technologies.”

If enacted, the California law would require, beginning July 1, 2019, law enforcement to submit a proposed Surveillance Use Policy to an elected governing body, made available to the public, to obtain approval for the use of specific surveillance technologies and the information collected by those technologies.  “Surveillance technology” is defined in the bill to include any electronic device or system with the capacity to monitor and collect audio, visual, locational, thermal, or similar information on any individual or group. This includes drones with cameras or monitoring capabilities, automated license plate recognition systems, closed-circuit cameras/televisions, International Mobile Subscriber Identity (IMSI) trackers, global positioning system (GPS) technology, software designed to monitor social media services or forecast criminal activity or criminality, radio frequency identification (RFID) technology, body-worn cameras, biometric identification hardware or software, and facial recognition hardware or software.

The bill would prohibit a law enforcement agency from selling, sharing, or transferring information gathered by surveillance technology, except to another law enforcement agency. The bill would provide that any person could bring an action for injunctive relief to prevent a violation of the law and, if successful, could recover reasonable attorney’s fees and costs.  The bill would also establish procedures to ensure that the collection, use, maintenance, sharing, and dissemination of information or data collected with surveillance technology is consistent with respect for individual privacy and civil liberties, and that any approved policy be publicly available on the approved agency’s Internet web site.

With the relatively slow pace of legislative action, at least compared to the speed at which face recognition technology is advancing, some within the tech community have taken matters into their own hands.  Brian Brackeen, for example, CEO of Miami-based facial recognition software company Kairos, recently decided that his company’s AI software will not be made available to any government, “be it America or another nation’s.”  In a TechCrunch opinion published June 24, 2018, Brackeen said, “Whether or not you believe government surveillance is okay, using commercial facial recognition in law enforcement is irresponsible and dangerous” because it “opens the door for gross misconduct by the morally corrupt.”  His position is rooted in the knowledge of how advanced AI models like his are created: “[Facial recognition] software is only as smart as the information it’s fed; if that’s predominantly images of, for example, African Americans that are ‘suspect,’ it could quickly learn to simply classify the black man as a categorized threat.”

Kairos is not alone in calling for limits.  A coalition of organizations against facial recognition surveillance published a letter on May 22, 2018, to Amazon’s CEO, Jeff Bezos, in which the signatories demanded that “Amazon stop powering a government surveillance infrastructure that poses a grave threat to customers and communities across the country. Amazon should not be in the business of providing surveillance systems like Rekognition to the government.”  The organizations–civil liberties, academic, religious, and others–alleged that “Amazon Rekognition is primed for abuse in the hands of governments. This product poses a grave threat to communities,” they wrote, “including people of color and immigrants….”

Amazon’s Rekognition system, first announced in late 2016, is a cloud-based platform for performing image and video analysis without the user needing a background in machine learning, a type of AI.  Among its many uses today, Rekognition reportedly allows a user to conduct near real-time automated face recognition, analysis, and face comparisons (assessing the likelihood that faces in different images are the same person), using machine learning models.

A few weeks after the coalition letter dropped, another group, this one a collection of individual and organizational Amazon shareholders, issued a similar letter to Bezos.  In it, the shareholders alleged that “[w]hile Rekognition may be intended to enhance some law enforcement activities, we are deeply concerned it may ultimately violate civil and human rights.”  Several Microsoft employees took a similar stand against the use of Microsoft’s software by government agencies.

As long as questions surrounding transparency, accountability, and fairness in the use of face recognition technology in law enforcement continue to be raised, tech companies, legislators, and stakeholders will likely continue to react in ways that address immediate concerns.  This may prove effective in the short-term, but no one today can say what AI-based facial detection and recognition technologies will look like in the future or to what extent the technology will be used by law enforcement personnel.

Senate-Passed Defense Authorization Bill Funds Artificial Intelligence Programs

The Senate-passed national defense authorization bill (H.R.5515, as amended), to be known as the John S. McCain National Defense Authorization Act for Fiscal Year 2019, includes spending provisions for several artificial intelligence technology programs.

Passed by a vote of 85-10 on June 18, 2018, the bill would include appropriations for the Department of Defense “to coordinate the efforts of the Department to develop, mature, and transition artificial intelligence technologies into operational use.” A designated Coordinator will serve to oversee joint activities of the services in the development of a Strategic Plan for AI-related research and development.  The Coordinator will also facilitate the acceleration of development and fielding of AI technologies across the services.  Notably, the Coordinator is to develop appropriate ethical, legal, and other policies governing the development and use of AI-enabled systems in operational situations. Within one year of enactment, the Coordinator is to complete a study on the future of AI in the context of DOD missions, including recommendations for integrating “the strengths and reliability of artificial intelligence and machine learning with the inductive reasoning power of a human.”

In other provisions, the Director of the Defense Intelligence Agency (DIA; based in Washington, DC) is tasked with submitting a report to Congress within 90 days of enactment that directly compares the capabilities of the US in emerging technologies (including AI) and the capabilities of US adversaries in those technologies.

The bill would require the Under Secretary for R&D to pilot the use of machine-vision technologies to automate certain human weapons systems manufacturing tasks. Specifically, tests would be conducted to assess whether computer vision technology is effective and at a level of readiness to perform the function of determining the authenticity of microelectronic parts at the time of creation through final insertion into weapon systems.

The Senate version of the 2019 authorization bill replaces an earlier House version (passed 351-66 on May 24, 2018).

10 Things I Wish Every Legal Tech Pitch Would Include

Due in large part to the emergence of advanced artificial intelligence-based legal technologies, the US legal services industry today is in the midst of a tech shakeup.  Indeed, the number of advanced legal tech startups continues to increase. So, too, do the opportunities for law firms to receive product presentations from those vendors.

Over the last several months, I’ve participated in several pitches and demos from leading legal tech vendors.  Typically delivered by company founders, executives, technologists, and/or sales, these presentations have been delivered live, as audio-video conferences, audio by phone with a separate web demo, or pre-recorded audio-video demos (e.g., a slide deck video with voiceover).  Often, a vendor’s lawyer will discuss how his or her company’s software addresses various needs and issues arising in one or more law firm practice areas.  Most presentations will also include statements about advanced legal tech boosting law firm revenues, making lawyers more efficient, and improving client satisfaction (ostensibly, a reminder of what’s at stake for those who ignore this latest tech trend).

Based on these (admittedly small number of) presentations, here is my list of things I wish every legal tech presentation would provide:

1. Before a presentation, I wish vendors would provide an agenda and the bios of the company’s representatives who will be delivering their pitch. I want to know what’s being covered and who’s going to be giving the presentation.  Do they have a background in AI and the law, or are they tech generalists? This helps prepare for the meeting and frame questions during Q&A (and reduces the number of follow-up conference calls).  Ideally, presenters should know their own tech inside and out and an area of law so they can show how the software makes a difference in that area. I’ve seen pitches by business persons who are really good at selling, and programmers who are really good at talking about bag-of-words bootstrapping algorithms. It seems that the best person to pitch legal tech is someone who knows both the practice of law and how tech works in a typical law firm setting.

2. Presenters should know who they are talking to at a pitch and tailor accordingly.  I’m a champion for legal tech and want to know the details so I can tell my colleagues about your product.  Others just want to understand what adopting legal tech means for daily law practice. Find out who’s who and which practice group(s) or law firm function they represent and then address their specific needs.

3. The legal tech market is filling up with single-purpose offerings that perform one narrow function, so I want to understand all the ways your application might help replace or augment law firm tasks. Mention how your tech could be utilized in different practice areas where it’s best deployed (or where it could be deployed in the future in the case of features still in the development pipeline). The more capabilities an application has, the more attractive your prices begin to appear (and the fewer vendor roll-outs and training sessions I and my colleagues will have to sit through).

4. Don’t oversell capabilities. If you claim new features will be implemented soon, they shouldn’t take months to deploy. If your software is fast and easy, it had better be both, judged from an experienced attorney’s perspective. If your machine learning text classification models are not materially different than your competitors’, avoid saying they’re special or unique. On the other hand, if your application includes a demonstrable unique feature, highlight it and show how it makes a tangible difference compared to other available products in the market. Finally, if your product shouldn’t be used for high stakes work or has other limitations, I want to understand where that line should be drawn.

5. Speaking of over-selling, if I hear about an application’s performance characteristics, especially numerical values for things like accuracy, efficiency, and time saved, I want to see the benchmarks and protocols used to measure those characteristics.  While accuracy and other metrics are useful for distinguishing one product from another, they can be misleading. For example, a claim that a natural language processing model is 95% accurate at classifying text by topic should be backed up with comparisons to a benchmark and an explanation of the measurement protocol used.  A claim that a law firm was 40-60% more efficient using your legal tech, without providing details about how those figures were derived, isn’t all that compelling.

6. I want to know if your application has been adopted by top law firms, major in-house legal departments, courts, and attorneys general, but be prepared to provide data to back up claims.  Are those organizations paying a hefty annual subscription fee but only using the service a few times a month, or are your cloud servers overwhelmed by your user base? Monthly active users, API requests per domain, etc., can place usage figures in context.

7. I wish proof-of-concept testing was easier.  It’s hard enough to get law firm lawyers and paralegals interested in new legal tech, so provide a way to facilitate testing your product. For example, if you pitch an application for use in transactional due diligence, provide a set of common due diligence documents and walk through a realistic scenario. This may need to be done for different practice groups and functions at a firm, depending on the nature of the application.

8. I want to know how a legal tech vendor has addressed confidentiality, data security, and data assurance in instances where a vendor’s legal tech is a cloud-based service. If a machine learning model runs on a platform that is not behind the firm’s firewall and intrusion detection systems, that’s a potential problem in terms of safeguarding client confidential information. While vendors need to coordinate first with a firm’s CSO about data assurance/security, I also want to know the details.

9. I wish vendors would provide better information demonstrating how their applications helped others develop business. For example, tell me if using your application helped a law firm respond to a Request for Proposal (RFP) and win, or if a client gave more work to a firm that demonstrated advanced legal tech acumen.  While such information may merely be anecdotal, I can probably champion legal tech on the basis of business development even if a colleague isn’t persuaded by things like accuracy and efficiency.

10. Finally, a word about design.  I wish legal tech developers would place more emphasis on UI/UX. It seems some of the offerings of late appear ready for beta testing rather than a roll-out to prospective buyers. I’ve seen demos in which a vendor’s interface contained basic formatting errors, something any quality control process would have caught. Some UIs are bland and lack intuitiveness when they should be user-friendly and have a quality look and feel. Use a unique theme and graphics style, and adopt a brand that stands out. For legal tech to succeed in the market, technology and design both must meet expectations.
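To illustrate the point in item 5 about why an accuracy figure needs a benchmark, here is a small sketch with made-up data. It assumes a hypothetical test set of 20 documents in which 90% belong to one topic; the labels, the predictions, and the function names `accuracy` and `majority_baseline_accuracy` are illustrative only and are not drawn from any vendor’s product. A classifier that is “95% accurate” on such a set is only modestly better than one that always guesses the most common topic.

```python
from collections import Counter


def accuracy(predictions, labels):
    """Fraction of predictions that match the true labels."""
    return sum(p == y for p, y in zip(predictions, labels)) / len(labels)


def majority_baseline_accuracy(labels):
    """Accuracy of always predicting the most common label."""
    most_common_count = Counter(labels).most_common(1)[0][1]
    return most_common_count / len(labels)


# Hypothetical test set: 18 contract documents and 2 patent documents.
labels = ["contract"] * 18 + ["patent"] * 2
# Hypothetical model output: every contract is right, one patent is missed.
predictions = ["contract"] * 18 + ["patent", "contract"]

model_acc = accuracy(predictions, labels)
baseline_acc = majority_baseline_accuracy(labels)
print(f"model: {model_acc:.0%}, always-guess-majority baseline: {baseline_acc:.0%}")
```

In this assumed scenario the model finds only one of the two patent documents, a detail that the headline accuracy number hides and that a description of the measurement protocol would reveal.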

[The views and opinions expressed in this post are solely the author’s and do not necessarily represent or reflect the views or opinions of the author’s employer or colleagues.]

Congress Looking at Data Science for Ways to Improve Patent Operations

When Congress passed the sweeping Leahy-Smith America Invents Act (AIA) on September 16, 2011, legislators weren’t concerned about how data analytics might improve efficiencies at one of the Commerce Department’s most data-heavy institutions: the US Patent Office. Patent reformers at the time were instead focused on curtailing patent troll litigation and conforming aspects of US patent law to those of other countries. Consequently, the Patent Office’s trove of pre-classified, pre-labeled, and semi-structured patent application and invention data–information ripe for big data analytics–remained mostly untapped at the time.

Fast forward to 2018 and Congress has finally put patent data in its cross-hairs. Now, Congress wants to see how “advanced data science analytics” techniques, such as artificial intelligence, machine learning, and other methods, could be used to analyze patent data and make policy recommendations. If enacted, the “Building Innovation Growth through Data for Intellectual Property Act” or the “BIG Data for IP Act” of 2018 (S. 2601; sponsored by Sen. Coons and Sen. Hatch) would require an investigation into how data science could help the Patent Office understand its current capabilities and whether its information technology systems need modernizing.

Those objectives, however, may be too narrow.  Silicon Valley tech companies, legal tech entrepreneurs, and even students have already seized upon the opportunities big patent data and machine learning techniques present, and, as a result, have developed interesting and useful capabilities.

Take, for example, the group of Stanford University students who in late 2011 developed a machine learning technique to automatically classify US patent applications based on an application’s written invention description. The students, part of Stanford’s CS229 Machine Learning class, proposed their solution around the same time Senator Leahy, Representative Smith, and the rest of Congress were debating the AIA in the fall of 2011.  More recently, AI technologies used by companies like Cloem, AllPriorArt, AllPriorClaims, RoboReview, Specif.io, and others have shown how patent data and AI can augment traditional patent practitioners’ roles in the legal services industry.

Some of these AI tools may one day reduce much of the work patent practitioners have traditionally performed and could lead to fewer Examiners at the Patent Office whose jobs are to review patent applications for patentability. Indeed, some have imagined a world in which advanced machine learning models conceive inventions and prepare and file patent applications to protect those ideas without further human input.  In the future, advanced machine learning models, trained on the “prior art” patent data, could routinely examine patent applications for patentability, thus eliminating the need for costly and time-consuming inter partes reviews (trial-like proceedings that have created much uncertainty since enactment of the AIA).

So perhaps Congress’ BIG Data for IP Act should focus less on how advanced data analytics can be used to “improve consistency, detect common sources of error, and improve productivity,” as the bill is currently written, and focus more globally on how patent data, powering new AI models, will disrupt Patent Office operations, the very nature of innovation, and how patent applications are prepared, filed, and examined.

In Your Face Artificial Intelligence: Regulating the Collection and Use of Face Data (Part II)

The technologies behind “face data” collection, detection, recognition, and affect (emotion) analysis were previously summarized. Use cases for face data, and reported concerns about the proliferation of face data collection efforts and instances of face data misuse were also briefly discussed.

In this follow-on post, a proposed “face data” definition is explored from a governance perspective, with the purpose of providing more certainty as to when heightened requirements ought to be imposed on those involved in face data collection, storage, and use.  This proposal is motivated in part by the increased risk of identity theft and other instances of misuse from unauthorized disclosure of face data, but also recognizes that overregulation could subject persons and entities to onerous requirements.

Illinois’ decade-old Biometric Information Privacy Act (“BIPA”) (740 ILCS 14/1 (2008)), which has been widely cited by privacy hawks and asserted against social media and other companies in US federal and various state courts (primarily Illinois and California), provides a starting point for a uniform face data definition. The BIPA defines “biometric identifier” to include a scan of a person’s face geometry. The scope and meaning of the definition, however, remain ambiguous despite close scrutiny by several courts. In Monroy v. Shutterfly, Inc., for example, a federal district court found that mere possession of a digital photograph of a person and “extraction” of information from such photograph is excluded from the BIPA:

“It is clear that the data extracted from [a] photograph cannot constitute ‘biometric information’ within the meaning of the statute: photographs are expressly excluded from the [BIPA’s] definition of ‘biometric identifier,’ and the definition of ‘biometric information’ expressly excludes ‘information derived from items or procedures excluded under the definition of biometric identifiers.’”

Slip op. No. 16-cv-10984 (N.D. Ill. 2017). Despite that finding, the Monroy court concluded that a “scan of face geometry” under the statute’s definition includes a “scan” of a person’s face from a photograph (or a live scan of a person’s face geometry). Because the question was not at issue in Monroy, the court did not address whether the BIPA applies when a scan of any part of a person’s face geometry from an image is insufficient to identify the person in the image. That is, the Monroy holding arguably applies to any data made by a scan, even if that data by itself cannot lead to identifying anyone.

By way of comparison, the European Union’s General Data Protection Regulation (GDPR), which governs “personal data” (i.e., any information relating to an identified or identifiable natural person), will regulate biometric information when it goes into effect in late May 2018. Like the BIPA, the GDPR will place restrictions on “personal data resulting from specific technical processing relating to the physical, physiological or behavioural characteristics of a natural person, which allow or confirm the unique identification of that natural person, such as facial images or dactyloscopic data” (GDPR, Article 4) (emphasis added).  Depending on how courts in EU member states interpret the GDPR generally, and Article 4 specifically, a process that creates any biometric data that relates to a person, could lead to identifying a person, or allows one to identify a person or confirm a person’s identity, is potentially a covered process under the GDPR.

Thus, to enhance clarity for potentially regulated individuals and companies dealing with US citizens, “face data” could be defined, as set forth below, in a way that considers a minimum quantity or quality of data below which a regulated entity would not be within the scope of the definition (and thus not subject to regulation):

“Face data” means data in the possession or control of a regulated entity obtained from a scan of a person’s face geometry or face attribute, as well as any information and data derived from or based on the geometry or attribute data, if in the aggregate the data in the possession or control of the regulated entity is sufficient for determining an identity of the person or the person’s emotional (physiological) state.

The term “determining an identity of the person or the person’s emotional (physiological) state” relates to any known computational or manual technique for identifying a person or that person’s emotions.

The term “is sufficient” is interpretable; it would need to be defined explicitly (or, as is often the case in legislation, left for the courts to fully interpret). The intent of “sufficient” is to permit the anonymization or deletion of data following the processing of video signals or images of a person’s face to avoid being categorized as possessing regulated face data (to the extent probabilistic models and other techniques could not be used to later de-anonymize or reconstruct the missing data and identify a person or that person’s emotional state). The burden of establishing the quality and quantity of face data that is insufficient for identification purposes should rest with the regulated entity that possesses or controls face data.

Face data could include data from the face of a “live” person captured by a camera (e.g., surveillance) as well as data extracted from existing media (e.g., stored images). It is not necessary, however, for the definition to encompass the mere virtual depiction or display of a person in a live video or existing image or video. Thus, digital pictures of friends or family on a personal smartphone would not be face data, and the owner of the phone should not be a regulated entity subject to face data governance. An app on that smartphone, however, that uses face detection algorithms to process the pictures for facial recognition and sends that data to a remote app server for storage and use (e.g., for extraction of emotion information) would create face data.

By way of other examples, a process involving pixel-level data extracted from an image (a type of “scan”) by a regulated entity would create face data if that data, combined with any other data possessed or controlled by the entity, could be used in the aggregate to identify the person in the image or that person’s emotional state. Similarly, data and information reflecting changes in facial expressions by pixel-level comparisons of time-slice images from a video (also a type of scan) would be information derived from face data and thus would be regulated face data, assuming the derived data combined with other data owned or possessed could be used to identify the person in the image or the person’s emotional state.

Information about the relative positions of facial points based on facial action units could also be data derived from or based on the original scan and thus would be face data, assuming again that the data, combined with any other data possessed by a regulated entity, could be used to identify a person or that person’s emotional state. Classifications of a person’s emotional state (e.g., joy, surprise) based on extracted image data would also be information derived from or based on a person’s face data and thus would also be face data.

Features extracted using deep learning convolutions of an image of a person’s face could also be face data if the convolution information along with other data in the possession or control of a regulated entity could be used to identify a person or that person’s emotional state.

For banks and other institutions that use face recognition for authentication purposes, sufficient face data would obviously need to be in the bank’s possession at some point in time to positively identify a customer making a transaction. This could subject the institution to face data governance during that time period. In contrast, a social media platform that permits users to upload images of people but does not scan or otherwise process the images (such as by cross-referencing other existing data) would not create face data and thus would not subject the platform to face data governance, even if it also possessed tagged images of the same individuals in the uploaded images. Thus, the mere possession or control of images, even if the images could potentially contain identifying information, would not constitute face data. But if a platform were to scan (process) the uploaded images for identification purposes, or sell or provide the images uploaded by users to a third party that scans the images to extract face geometry or attributes data for purposes such as targeted advertising, doing so could subject the platform and the third party to face data governance.
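The proposed definition and the examples above can be read as a two-part test: the data must come from (or be derived from) a scan of a face, and the data held in the aggregate must be sufficient to determine identity or emotional state. The following is only an illustrative sketch of that logic, not a statement of how any statute or court would apply it. The `Holding` class, its field names, and the `is_face_data` function are hypothetical names introduced here for illustration, and the sketch assumes the "sufficiency" question has already been answered for each entity.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Holding:
    """Hypothetical summary of data a regulated entity possesses or controls."""
    from_face_scan: bool             # obtained from a scan of face geometry or a face attribute
    derived_from_scan: bool          # information derived from or based on such scan data
    aggregate_identifies: bool       # in the aggregate, sufficient to determine identity
    aggregate_reveals_emotion: bool  # in the aggregate, sufficient to determine emotional state


def is_face_data(h: Holding) -> bool:
    """Apply the two-part test sketched from the proposed definition."""
    scan_based = h.from_face_scan or h.derived_from_scan
    sufficient = h.aggregate_identifies or h.aggregate_reveals_emotion
    return scan_based and sufficient


# Photos stored on a personal phone: identifiable, but never scanned.
phone_photos = Holding(False, False, True, False)

# An app that scans faces and stores emotion classifications on a server.
emotion_app = Holding(True, True, True, True)

# Scan output anonymized so that it can no longer identify anyone.
anonymized_scan = Holding(True, False, False, False)

for name, holding in [("phone_photos", phone_photos),
                      ("emotion_app", emotion_app),
                      ("anonymized_scan", anonymized_scan)]:
    print(name, is_face_data(holding))
```

Under the proposal, the burden of showing that the aggregate data is insufficient would rest with the regulated entity, so in practice the two "aggregate" fields are where most disputes would likely arise.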

The proposed face data definition, which could be modified to include “body data” and “voice data,” is merely one example that US policymakers and stakeholders might consider in the course of assessing the scope of face data governance in the US.  The definition does not exclude the possibility that any number of exceptions, exclusions, and limitations could be implemented to avoid reaching actors and actions that should not be covered, while also maintaining consistency with existing laws and regulations. Also, the proposed definition is not intended to directly encompass specific artificial intelligence technologies used or created by a regulated entity to collect and use face data, including the underlying algorithms, models, networks, settings, hyper-parameters, processors, source code, etc.

In a follow-on post, possible civil penalties for harms caused by face data collection, storage, and use will be briefly considered, along with possible defenses a regulated person or entity may raise in litigation.

Patenting Artificial Intelligence Technology: 2018 Continues Upward Innovation Trend

If the number of patents issued in the first quarter of 2018 is any indication, artificial intelligence technology companies were busy a few years ago filing patent applications for machine learning inventions.

According to US Patent and Trademark Office records, the number of US “machine learning” patents issued to US applicants during the first quarter of 2018 rose 17% compared to the same time period in 2017. The number of US “machine learning” patents issued to any applicant (not just US applicants) rose nearly 19% during the same comparative time period. Mostly double-digit increases were also observed in the case of US origin and total US patents mentioning “neural network” or “artificial intelligence.” Topping the list of companies obtaining patents were IBM, Microsoft, Amazon, Google, and Intel.

The latest patent figures include any US issued patent in which “machine learning,” “artificial intelligence,” or “neural network” is mentioned in the patent’s invention description (to the extent those mentions were ancillary to the invention’s disclosed utility, the above figures are over-inclusive). Because patent applications may spend 1-3 years at the US Patent Office (or more, if claiming priority to earlier-filed patent applications), the Q1 2018 numbers are reflective of innovation activity possibly several years ago.

Republicans Propose Commission to Study Artificial Intelligence Impacts on National Security

Three Republican members of Congress are co-sponsoring a new bill (H.R. 5356) “To establish the National Security Commission on Artificial Intelligence.” Introduced by Rep. Stefanik (R-NY) on March 20, 2018, the bill would create a temporary 11-member Commission tasked with producing an initial report followed by comprehensive annual reports, each providing issue-specific recommendations about national security needs and related risks from advances in artificial intelligence, machine learning, and associated technologies.

Issues the Commission would review include AI competitiveness in the context of national and economic security, means to maintain a competitive advantage in AI (including machine learning and quantum computing), other country AI investment trends, workforce and education incentives to boost the number of AI workers, risks of advances in the military employment of AI by foreign countries, ethics, privacy, and data security, among others.

Unlike other Congressional bills of late (see H.R. 4625–FUTURE of AI Act; H.R. 4829–AI JOBS Act) that propose establishing committees under Executive Branch departments and constituted with both government employees and private citizens, H.R. 5356 would establish an independent Executive Branch commission made up exclusively of Federal employees appointed by the Department of Defense and various Armed Services Committee members, with no private citizen members (ostensibly because of national security and security clearance issues).

Congressional focus on AI technologies has generally been limited to highly autonomous vehicles and vehicle safety, with other areas, such as military impacts, receiving much less attention. By way of contrast, the UK’s Parliament seems far ahead. The UK Parliament Select Committee on AI has already met over a dozen times since mid-2017 and its members have convened numerous public meetings to hear from dozens of experts and stakeholders representing various disciplines and economic sectors.

Industry Focus: The Rise of Data-Driven Health Tech Innovation

Artificial intelligence-based healthcare technologies have contributed to improved drug discoveries, tumor identification, diagnosis, risk assessments, electronic health records (EHR), and mental health tools, among others. Thanks in large part to AI and the availability of health-related data, health tech is one of the fastest growing segments of healthcare and one of the reasons why the sector ranks highest on many lists.

According to a 2016 workforce study by Georgetown University, the healthcare industry experienced the largest employment growth among all industries since December 2007, netting 2.3 million jobs (about an 8% increase). Fourteen percent of all US workers work in healthcare, making it the country’s largest employment center. According to the latest government figures, the US spends more on healthcare per person ($10,348) than any other country. In fact, healthcare spending is nearly 18 percent of the US gross domestic product (GDP), a figure that is expected to increase. The healthcare IT segment is expected to grow at a CAGR greater than 10% through 2019. The number of US patents issued in 2017 for AI-infused healthcare-related inventions rose more than 40% compared to 2016.

Investment in health tech has led to the development of some impressive AI-based tools. Researchers at a major university medical center, for example, invented a way to use AI to identify from open source data the emergence of health-related events around the world. The machine learning system they created extracted useful information and classified it according to disease-specific taxonomies. When the system was developed ten years ago, its “supervised” and “unsupervised” natural language processing models were leaps ahead of what others were using and earned the inventors national recognition. More recently, medical researchers have created a myriad of new technologies from innovative uses of machine learning technologies.

What most of the above and other health tech innovations today have in common is what drives much of the health tech sector: lots of data. Big data sets, especially labeled data, are needed by AI technologists to train and test machine learning algorithms that produce models capable of “learning” what to look for in new data. And there is no better place to find big data sets than in the healthcare sector. According to an article last year in the New England Journal of Medicine, by 2012 as much as 30% of the world’s stored data was being generated in the healthcare industry.

Traditional healthcare companies are finding value in data-driven AI. Biopharmaceutical company Roche’s recent announcement that it is acquiring software firm Flatiron Health Inc. for $1.9 billion illustrates the value of being able to access health-related data. Flatiron, led by former Google employees, makes software for real-time acquisition and analysis of oncology-specific EHR data and other structured and unstructured hospital-generated data for diagnostic and research purposes. Roche plans to leverage Flatiron’s algorithms–and all of its data–to enhance Roche’s ability to personalize healthcare strategies by way of accelerating the development of new cancer treatments. In a world powered by AI, where data is key to building new products that attract new customers, Roche is now tapped into one of the largest sources of labeled data.

Companies not traditionally in healthcare are also seeing opportunities in health-related data. Google’s AI-focused research division, for example, recently reported in Nature a promising use of so-called deep learning algorithms (a computation network structured to mimic how neurons fire in the brain) to make cardiovascular risk predictions from retinal image data. After training their model, Google scientists said they were able to identify and quantify risk factors in retinal images and generate patient-specific risk predictions.

The growth of available healthcare data and the infusion of AI health tech in the healthcare industry will challenge companies to evolve. Health tech holds the promise of better and more efficient research, manufacturing, and distribution of healthcare products and services, though some have also raised concerns about who will benefit most from these advances, bias in data sets, anonymizing data for privacy reasons, and other legal issues that go beyond healthcare, issues that will need to be addressed.

To be successful, tomorrow’s healthcare leaders may be those who have access to data that drives innovation in the health tech segment. This may explain why, according to a recent survey, healthcare CIOs whose companies plan spending increases in 2018 indicated that their investments will likely be directed first toward AI and related technologies.