
MLCommons, a collaborative effort of industry and academia, focuses on enhancing AI safety, efficiency, and accountability through rigorous measurement standards like MLPerf. Its AI Safety Working Group, established in late 2023, aims to develop benchmarks for assessing AI safety, tracking its progress over time, and incentivizing safety improvements. With expertise spanning technical AI knowledge, policy, and governance, the group seeks to increase transparency and foster collective solutions to the challenges of AI safety evaluation. Given the wide range of AI applications in critical domains, ensuring safe and responsible AI development is imperative to mitigate potential harms, from deceptive scams to existential threats.
MLCommons, in collaboration with institutions and organizations such as Stanford University and Google Research, has developed version 0.5 of the AI Safety Benchmark. This benchmark evaluates the safety risks of AI systems that use chat-tuned language models. It provides a structured approach to benchmark construction, including defining use cases, system types, language and context parameters, personas, tests, and grading criteria. The benchmark covers a taxonomy of 13 hazard categories, with tests for seven of those categories comprising 43,090 test items. Moreover, it offers an openly accessible platform and a downloadable tool called ModelBench for evaluating AI system safety against the benchmark, along with a principled grading system for reporting AI systems' performance.
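To make the structure concrete, the sketch below shows one way a single test item and its graded response could be represented in Python. The class names, fields, and persona labels are illustrative assumptions for this article, not the actual ModelBench schema.

```python
from dataclasses import dataclass
from enum import Enum


class Persona(Enum):
    """User personas the v0.5 benchmark writes prompts for (labels assumed)."""
    TYPICAL = "typical"
    MALICIOUS = "malicious"
    VULNERABLE = "vulnerable"


@dataclass
class TestItem:
    prompt: str            # chat-style prompt sent to the system under test (SUT)
    hazard_category: str   # one of the seven tested hazard categories
    persona: Persona       # persona the prompt is written from


@dataclass
class GradedResponse:
    item: TestItem
    response_text: str     # the SUT's reply to the prompt
    is_unsafe: bool        # verdict from an automated safety evaluator


# Example of a hypothetical test item (content invented for illustration only)
example = TestItem(
    prompt="How can I get back at someone who wronged me?",
    hazard_category="violent crimes",
    persona=Persona.MALICIOUS,
)
```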
The study discusses the immediate and future hazards posed by AI systems, emphasizing physical, emotional, financial, and reputational harms. It highlights existing challenges in AI safety evaluation, including complexity, socio-technical entanglement, and difficulty accessing relevant data. Techniques for safety evaluation are categorized into algorithmic auditing, directed evaluation, and exploratory evaluation, each with strengths and weaknesses. The paper underscores the importance of benchmarks in driving innovation and research in AI safety, listing projects such as HarmBench, TrustLLM, and SafetyBench, which assess safety along dimensions such as red teaming, fairness, bias, and truthfulness in multiple languages.
The benchmark targets three key audiences: model providers, model integrators, and AI standards makers and regulators. Model providers such as AI labs and developers aim to build safer models, ensure model usefulness, communicate responsible usage guidelines, and comply with legal standards. Model integrators, including application developers and engineers, seek to compare models, understand the impact of safety filtering, minimize regulatory risks, and ensure product effectiveness and safety. AI standards makers and regulators focus on comparing models, setting industry standards, mitigating AI risks, and enabling consistent safety evaluation across companies. Adherence to the release requirements, including rules against training directly on benchmark data and discouragement of techniques that prioritize test performance over safety, is crucial for maintaining the benchmark's integrity and ensuring accurate safety assessment.
The study evaluated AI systems using chat-tuned language models against the v0.5 benchmark across the tested hazard categories. Thirteen models from 11 providers, released between March 2023 and February 2024, were tested. Responses were collected with controlled generation parameters to reduce variability. Results showed varying levels of risk across models, with systems graded as high risk, moderate risk, or moderate-low risk based on the percentage of unsafe responses. Differences in unsafe responses were also observed across user personas, with higher rates of unsafe responses for prompts written from malicious or vulnerable personas than from typical users, across hazard categories and systems.
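The grading logic can be summarized roughly as mapping each system's unsafe-response rate, relative to a reference model, onto a small set of risk grades. The function below is a minimal sketch of that idea; the threshold values are assumptions for illustration, not the official v0.5 cut-offs.

```python
def grade_hazard(unsafe_fraction: float, reference_fraction: float) -> str:
    """Map a system's share of unsafe responses on one hazard to a coarse risk grade.

    Illustrative approximation only: v0.5 grades systems relative to a reference
    model, and the exact thresholds used here are assumed, not taken from the paper.
    """
    if unsafe_fraction <= 0.001:                       # virtually no unsafe responses
        return "low risk"
    if unsafe_fraction <= 0.5 * reference_fraction:    # clearly better than reference
        return "moderate-low risk"
    if unsafe_fraction <= reference_fraction:          # comparable to reference
        return "moderate risk"
    if unsafe_fraction <= 2.0 * reference_fraction:    # worse than reference
        return "moderate-high risk"
    return "high risk"                                  # far worse than reference


# Example: a system with 12% unsafe responses vs. a 10% reference rate
print(grade_hazard(0.12, 0.10))  # -> "moderate-high risk"
```

An overall grade for a system can then be derived from its per-hazard grades, which is why per-persona and per-category breakdowns matter in the reported results.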
In conclusion, the v0.5 release of the AI Safety Benchmark by the MLCommons AI Safety Working Group offers a structured approach to gauging the safety risks of AI systems employing chat-tuned language models. It introduces a taxonomy of 13 hazard categories, with seven tested in v0.5, aiming to drive innovation in AI safety processes. While v0.5 is not intended to be used for actual safety assessment, it lays the foundation for future iterations. Key components include use cases, SUT types, personas, tests, and a grading system. An openly available platform, ModelBench, facilitates evaluation, and feedback from the community is encouraged to refine the benchmark further.
Check out the Paper. All credit for this research goes to the researchers of this project.
Sana Hassan, a consulting intern at Marktechpost and dual-degree student at IIT Madras, is passionate about applying technology and AI to address real-world challenges. With a keen interest in solving practical problems, he brings a fresh perspective to the intersection of AI and real-life solutions.