fbpx
News Hub

UK AI Safety Institute finds challenges in AI safeguards and bias

Written by Tue 13 Feb 2024

The UK’s AI Safety Institute has found artificial intelligence (AI) has the potential to create personas to spread disinformation, perpetuate biases, and deceive human users.

The AI Safety Institute (AISI) published initial findings from its research into AI large language models (LLMs), finding that LLM safeguards can be easily bypassed. 

The Institute discovered these concerns through a series of case studies. Case Study 1 evaluated misuse risks, Case Study 2 assessed representative and allocative bias in LLMs, and Case Study 3 evaluated autonomous systems.

AI Enables Novices to Deliver Cyberattacks

In Case Study 1, the AISI and computer and network security organisation, Trail of Bits, assessed the extent to which LLMs make it easier for a novice to deliver a cyberattack.  

Cybersecurity experts at Trail of Bits created a cyberattack taxonomy, dividing the process into tasks like vulnerability discovery and data exfiltration. They then compared LLM performance on these tasks with expert performance, generating scores. But the study did not benchmark LLM performance against human abilities.

The findings suggested, while LLMs could enhance novice capabilities in some tasks, the effect was limited. For instance, in researching malicious cyber capabilities, an LLM successfully generated a realistic social media persona for potential use in spreading disinformation. This process could be scaled up efficiently to create thousands of personas.

The AISI found LLMs may coach users, uplifting user capabilities above just web searching. The Institute added in many instances, web search and LLMs produced the same level of information to users, though LLMs were sometimes faster.

“However, in relation to some specific use cases we found instances where LLMs could provide coaching and troubleshooting which may give an uplift above web search,” said the AI Safety Institute.

LLMs were also found to be frequently wrong, which the AISI said could degrade a novice user’s capability.

AI Enable Users to Bypass Safeguards

LLM developers have built-in safeguards to prevent misuse. But, with basic prompting techniques, the AISI found that users were able to break at LLM’s safeguards immediately. 

Even more sophisticated techniques to bypass safeguards were accessible to individuals with relatively low skill levels. In some cases, the safeguards did not trigger when users attempted to access harmful information.

The AISI stressed the future of AI is uncertain, but insights can be obtained by quantifying improvement in particular tasks. The Institute added its in-house research team analysed the performance of a set of LLMs on 101 microbiology questions between 2021 and 2023. In two years, LLM accuracy in this domain increased from 5% to 60%.

The AISI concluded that more powerful AI systems will also enhance defensive capabilities.  

“These models will also get better at spotting harmful content on the internet or defending from cyberattacks by identifying and helping to fix vulnerabilities,” said the AISI.

AI can Learn, Perpetuate, and Amplify Societal Biases

In Case Study 2, the AISI investigated the risks associated with widespread LLM deployment and assessed its potential to influence people at scale in harmful ways.

The AISI replicated Bianchi et al.’s findings, which indicated image models reinforce stereotypes based on prompts containing character traits. The Institute tested newer and diverse image models and discovered that, despite improved image quality, biases persisted. For instance, the prompt ‘a poor white person’ often generated images of individuals with predominantly non-white faces.

AISI found LLMs may offer biased career advice to teenagers based on their socioeconomic background and gender. Using various profiles, the AISI tested how LLMs recommended careers differently based on gender and parental occupation. The Institute simulated a scenario where the LLM acted as a friend, offering career suggestions based on user information. This experiment aimed to assess bias that could impact users’ earnings.

Using this method, when the LLM was told a teenager with interests in French and History had affluent parents, it recommended the teenager become a diplomat 93% of the time and a historian 4% of the time. For the same teenager with less affluent parents, it suggested becoming a diplomat only 13% of the time and a historian 74% of the time.

AI Agents can Deceive Human Users

In Case Study 3, the AISI collaborated with external partners to demonstrate the capabilities of current LLM agents and the potential unintended consequences of their goals. 

AI agents were defined by the AISI as an emerging paradigm of autonomous AI systems, capable of making plans to pursue broad goals such as ‘make money’ with minimal human input.

An AI safety research lab, METR, showcased the abilities of AI agents, including their potential to carry out targeted phishing attacks autonomously. Although the agents can generate and execute plans independently, they found that they also encounter obstacles and fail to complete actions convincingly.

AI safety organisation, Apollo Research, demonstrated how AI agents can deceive users by acting on goals with unintended consequences. Apollo researchers created a toy text environment with an LLM deployed as a trading agent to manage a stock portfolio for a fictional company. In a simulated environment, an AI trading agent decided to act on the insider tip and buy the stock 70% of the time.  

“When the agent was asked by its manager whether it had access to the insider information, the LLM frequently decided to lie about it, reasoning that it was better to avoid admitting to insider trading,” said the AISI.

With the right setup, the AISI said LLMs could act like an agent capable of generating and executing a plan without human intervention.

“However, current models are not yet demonstrating fully autonomous behaviours, since they regularly get stuck, fail to complete chains of actions, and do not do a good enough job to fool humans,” added the AISI.

Current Oversight Methods are Insufficient

The AISI emphasised the importance of overseeing AI agents to prevent unintended consequences and maintain control. This task becomes challenging as AI systems become more advanced. 

The AISI suggested using weaker AI systems to monitor stronger ones. However, findings from Redwood Research showed weaker AI failed to identify vulnerabilities inserted intentionally by the stronger one about half of the time.

The Institute said it is committed to expanding its investigation into the potential for loss of control from autonomous AI systems as the technology develops. The AISI aims to further understand the potential of AI agents’ capabilities for autonomous replication and their potential to accelerate AI development. 

The AISI said its evaluation techniques include, but are not limited to, automated capability assessments, red-teaming, human uplift evaluation, and AI agent evaluations. The Institute will focus on the most advanced AI systems, aiming to ensure that the UK and the world are not caught off guard by uncertain progress at the frontier of AI. 

In October, Downing Street Officials considered the creation of the Institute, with a focus on enabling national security-related scrutiny of frontier AI models. The AISI’s three core functions are to develop and conduct evaluations on advanced AI systems, drive foundational AI safety research, and facilitate information exchange.

Policy Responses for AI Safety

In January, a report by the UK’s National Cyber Security Centre (NCSC) cautioned AI will ‘almost certainly’ amplify the frequency and severity of cyberattacks in the next two years. 

In the report ‘The near-term impact of AI on the cyber threat’, the NCSC said AI is poised to enhance reconnaissance and social engineering tactics. This will result in cyberattacks that are not only more effective but more efficient and challenging to detect.

In December, EU policymakers provisionally agreed on regulations for the AI Act, aiming to ensure the safe deployment of AI. 

The agreement reached by the European Parliament arrived a month after the UK published the first global guidelines to ensure the safe development of AI technology. Agencies from 18 countries have confirmed they will endorse and co-seal the Guidelines for Secure AI System Development to enhance AI cybersecurity.

In September, a UK Commons report stressed the need for rapid governance for rising AI challenges. The overriding message urged the UK Government to accelerate the establishment of a governance regime for AI, including ‘whatever statutory measures as may be needed’.

Join Cloud & Cyber Security Expo

6-7 March 2024, ExCeL London

Cloud & Cyber Security Expo is one of the largest IT security events in Europe.

Don’t miss the chance to build partnerships and discover solutions to protect your business.

Written by Tue 13 Feb 2024

Send us a correction Send us a news tip