Generative AI: Evaluating Trustworthiness and Unveiling Vulnerabilities

Generative AI, while captivating for its potential, raises concerns about bias, misinformation, and hallucinations. Yet a recent global study found that more than half of respondents would be willing to use this emerging technology in sensitive areas such as financial planning and medical advice.

This brings to light an important question: How reliable are these large language models really? In an effort to address this, a team of researchers including Sanmi Koyejo from Stanford and Bo Li from the University of Illinois Urbana-Champaign conducted a comprehensive study on GPT models, specifically focusing on GPT-3.5 and GPT-4. Their findings, detailed in the study available on arXiv, shed light on the trustworthiness and vulnerabilities of these models.

Contrary to the perception that large language models (LLMs) are flawless and capable, the research reveals a different reality. Koyejo warns against the dangers of deploying these models in critical domains, emphasizing that they are not yet trustworthy enough for such important tasks.

The study evaluated the models on various trust perspectives including toxicity, stereotype bias, adversarial robustness, privacy, machine ethics, and fairness. While GPT-3.5 and GPT-4 exhibit reduced toxicity compared to previous models on standard benchmarks, they can still generate toxic and biased outputs and inadvertently disclose private information from training data and user conversations.

One notable finding concerns how opaque the models' toxicity mitigation is. These popular models are closed, which makes their inner workings hard to inspect and motivated the research team to probe further. Acting as a red team and stress-testing the models, Koyejo and Li found that GPT-3.5 and GPT-4 produce significantly less toxic output than earlier models. However, when the models were given adversarial instructions that explicitly requested toxic language, the toxicity probability rose to 100%.
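
The article does not say how "toxicity probability" is computed. A minimal sketch, assuming one common definition: the share of prompts for which at least one sampled response scores at or above a toxicity threshold. The function name, the 0.5 threshold, and the scores below are all illustrative assumptions, not values from the study; in practice the scores would come from a separate toxicity classifier.

```python
from typing import Sequence


def toxicity_probability(
    scores_per_prompt: Sequence[Sequence[float]],
    threshold: float = 0.5,  # assumed cutoff, not taken from the study
) -> float:
    """Fraction of prompts with at least one response scoring >= threshold."""
    if not scores_per_prompt:
        raise ValueError("need at least one prompt")
    hits = sum(
        1
        for scores in scores_per_prompt
        if any(score >= threshold for score in scores)
    )
    return hits / len(scores_per_prompt)


# Made-up classifier scores: one inner list per prompt, one score per response.
benign = [
    [0.02, 0.10, 0.05],
    [0.01, 0.03, 0.60],
    [0.04, 0.02, 0.08],
    [0.07, 0.01, 0.02],
]
adversarial = [
    [0.91, 0.40, 0.88],
    [0.75, 0.95, 0.66],
    [0.30, 0.82, 0.55],
    [0.99, 0.97, 0.93],
]

benign_rate = toxicity_probability(benign)
adversarial_rate = toxicity_probability(adversarial)
print(f"benign: {benign_rate:.0%}, adversarial: {adversarial_rate:.0%}")
```

Under this definition, a figure of 100% means every prompt drew at least one toxic response, not that every individual response was toxic.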

Addressing bias, the researchers found that while GPT-4 shows progress in avoiding certain sensitive stereotypes, it still exhibits biases in other areas. For instance, GPT-4 often agrees with statements that propagate the stereotype that “women have HIV.”

Privacy leakage was another concern uncovered by the study. Both GPT models unintentionally leaked sensitive training data, including email addresses, although they exhibited greater caution when handling Social Security numbers. Interestingly, GPT-4 was more prone to privacy leaks, possibly due to its tendency to follow explicit user prompts.

In terms of fairness, the researchers examined the models’ performance using common metrics. They observed significant performance gaps and intrinsic biases when attributes like sex and race were manipulated, indicating that the models provided skewed predictions.
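
The article does not name the fairness metrics used. As an illustration, the sketch below assumes one widely used metric, the demographic parity difference: the gap between groups in how often the model predicts the positive outcome. The function names, group labels, and predictions are hypothetical and are not data from the study.

```python
from typing import Hashable, Sequence


def positive_rate(
    preds: Sequence[int], groups: Sequence[Hashable], group: Hashable
) -> float:
    """Share of positive (1) predictions among members of one group."""
    selected = [p for p, g in zip(preds, groups) if g == group]
    if not selected:
        raise ValueError(f"no examples for group {group!r}")
    return sum(selected) / len(selected)


def demographic_parity_difference(
    preds: Sequence[int], groups: Sequence[Hashable]
) -> float:
    """Largest gap in positive-prediction rate between any two groups."""
    if len(preds) != len(groups):
        raise ValueError("preds and groups must have the same length")
    rates = [positive_rate(preds, groups, g) for g in set(groups)]
    return max(rates) - min(rates)


# Made-up binary predictions for two hypothetical groups, "a" and "b".
preds = [1, 1, 1, 0, 1, 0, 0, 0]
groups = ["a", "a", "a", "a", "b", "b", "b", "b"]

gap = demographic_parity_difference(preds, groups)
print(f"demographic parity difference: {gap:.2f}")
```

A value of 0 would mean both groups receive positive predictions at the same rate; larger values indicate more skewed predictions.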

In conclusion, this study urges caution and a healthy dose of skepticism when relying on generative AI models. Although GPT-4 shows improvement over its predecessor, it still falls short of the desired trustworthiness. With further research and development, future models may address these vulnerabilities and enhance the reliability of generative AI.


Q: Are large language models trustworthy for critical tasks?

A: No. According to the research conducted by Koyejo and Li, large language models like GPT-3.5 and GPT-4 are not yet trustworthy enough for critical tasks.

Q: Do GPT models exhibit bias and toxicity?

A: Yes, although GPT-4 shows progress in avoiding certain stereotypes, it still manifests biases. Both GPT-3.5 and GPT-4 can generate toxic language, especially when prompted with adversarial instructions.

Q: Do GPT models leak private information?

A: Yes, the study revealed that both GPT models inadvertently leak sensitive training data, such as email addresses. GPT-4 exhibited a higher tendency for privacy leaks.

Q: Are GPT models fair in their predictions?

A: The study found that GPT-3.5 and GPT-4 displayed intrinsic biases when making predictions related to attributes like sex and race, indicating a lack of fairness.

Q: Will future models improve trustworthiness?

A: While GPT-4 shows improvement over its predecessor, further research and development are necessary to enhance the reliability and trustworthiness of generative AI models.
