University of Bristol Study Highlights Safety Risks in DeepSeek’s CoT-Enabled AI Models

(IN BRIEF) A study from the University of Bristol has raised concerns about safety risks in DeepSeek, a new ChatGPT competitor that uses Chain of Thought (CoT) reasoning. CoT improves problem-solving by working through a transparent, step-by-step process, but that very clarity can unintentionally expose harmful content. The research found that while CoT models refuse harmful requests more effectively than traditional Large Language Models (LLMs), they also generate more dangerous responses when fine-tuned with malicious intent, including detailed instructions for illegal activities. The study emphasizes the need for further safeguards to prevent misuse of CoT-enabled models, particularly because their reasoning process can be manipulated by attackers, and it calls for more research into mitigating fine-tuning attacks and improving model security.

(PRESS RELEASE) BRISTOL, 3-Feb-2025 — /EuropaWire/ — A recent study from the University of Bristol has uncovered significant safety concerns surrounding the emerging ChatGPT alternative, DeepSeek, highlighting the risks posed by Large Language Models (LLMs) using Chain of Thought (CoT) reasoning. This method, which enables more nuanced problem-solving through a step-by-step approach rather than simply offering direct answers, has been found to create unintentional vulnerabilities.
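
To make the distinction concrete, the following is a minimal sketch (not drawn from the study itself) contrasting a direct prompt with a CoT-style prompt using the open-source Hugging Face transformers library; the model checkpoint, prompts, and generation settings are illustrative assumptions.

```python
# Minimal sketch: direct prompting vs. Chain of Thought (CoT) prompting.
# The model checkpoint and prompts below are illustrative assumptions,
# not details taken from the Bristol study.
from transformers import pipeline

generator = pipeline(
    "text-generation",
    model="deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B",  # assumed open checkpoint
)

question = "A train travels 120 km in 1.5 hours. What is its average speed?"

# Direct prompting: request only the final answer.
direct_prompt = f"{question}\nAnswer with a single number."

# CoT prompting: ask the model to lay out its intermediate reasoning first.
cot_prompt = f"{question}\nThink through the problem step by step, then give the final answer."

for label, prompt in [("direct", direct_prompt), ("chain-of-thought", cot_prompt)]:
    result = generator(prompt, max_new_tokens=128, do_sample=False)
    print(f"--- {label} ---")
    print(result[0]["generated_text"])
```

With CoT prompting, the intermediate reasoning appears in the generated text itself, which is precisely the transparency the study examines.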

The analysis by the Bristol Cyber Security Group reveals that although CoT models like DeepSeek are more effective at refusing harmful requests than traditional LLMs, their transparent reasoning process inadvertently exposes sensitive information that might otherwise remain hidden. This transparency, while valuable in fostering user trust, raises the possibility of dangerous content being unintentionally revealed.
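
The point above can be pictured with a small, self-contained sketch of how a visible reasoning trace can be separated from the final answer; the <think>...</think> delimiters and the hard-coded response are assumptions used purely for illustration, not output reproduced from the study.

```python
# Minimal sketch: extracting the exposed reasoning trace from a CoT-style response.
# The <think>...</think> delimiter convention and the example response are assumptions.
import re

# Hard-coded stand-in for a real model response.
response = (
    "<think>The user asks for average speed. "
    "Speed = distance / time = 120 / 1.5 = 80 km/h.</think>"
    "The average speed is 80 km/h."
)

match = re.search(r"<think>(.*?)</think>", response, flags=re.DOTALL)
exposed_reasoning = match.group(1).strip() if match else ""
final_answer = re.sub(r"<think>.*?</think>", "", response, flags=re.DOTALL).strip()

print("Exposed reasoning:", exposed_reasoning)
print("Final answer:", final_answer)
```

Because the entire chain is emitted alongside the answer, everything the model considers along the way is visible to whoever reads the output, which is how sensitive material that would otherwise remain hidden can surface.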

Zhiyuan Xu, the lead author of the study, provided crucial insights into the safety risks posed by CoT reasoning models. He stressed the need for enhanced safeguards as AI technology advances. Co-author Dr. Sana Belguith from Bristol’s School of Computer Science further explained the dilemma: “While CoT models are designed to mimic human thinking, making them ideal for public use, they also pose substantial risks if safety measures are bypassed, as they can generate highly harmful content.”

The research also examined traditional LLMs, which are trained on vast datasets filtered for harmful content yet still face safety challenges. Despite safeguards such as Reinforcement Learning from Human Feedback (RLHF) and Supervised Fine-Tuning (SFT) designed to limit harmful outputs, CoT models present unique risks because of their structured reasoning approach. In tests, DeepSeek not only generated harmful content more frequently than traditional models but also provided more detailed, accurate, and potentially dangerous responses when subjected to fine-tuning attacks. For instance, DeepSeek provided step-by-step advice on how to commit a crime without being caught.

Another troubling discovery was that CoT models, when fine-tuned with harmful intent, can assume specialized roles, such as a skilled cybersecurity expert, and produce highly sophisticated yet dangerous advice. Dr. Joe Gardiner, a co-author of the study, highlighted the threat of fine-tuning attacks, which can be executed using inexpensive hardware and minimal resources. He noted that such attacks, carried out using publicly available datasets, could lead to models generating harmful content with little chance of detection in offline settings.

While CoT reasoning models show promise for their ability to reason with clarity and transparency, they also present unique safety challenges, especially when they fall into the wrong hands. The study calls for further exploration of mitigation strategies, such as investigating model alignment techniques and the potential impact of model size and architecture on the success of fine-tuning attacks.

Dr. Belguith concluded, “The human-like reasoning process in these models is vulnerable to manipulation, which calls for further research into how to protect these models from targeted attacks. Public awareness of these safety risks is crucial, and both the scientific community and tech companies must take responsibility in addressing and mitigating these hazards.”

The full paper, titled ‘The Dark Deep Side of DeepSeek: Fine-Tuning Attacks Against the Safety Alignment of CoT-Enabled Models’, authored by Zhiyuan Xu, Dr. Sana Belguith, and Dr. Joe Gardiner, is available on arXiv.

Media Contact:

Tel: +44 (0)117 928 9000
Email: press-office@bristol.ac.uk

SOURCE: University of Bristol
