Scale AI, a leading training data company based in Silicon Valley, is working to close a language gap that artificial intelligence (AI) developers face. While AI tools like ChatGPT excel in English and Spanish, they struggle in low-resource languages, which have limited representation on the internet. To tackle this issue, Scale AI is hiring for nearly 60 contract writer roles across a range of under-represented languages. The aim of these hires is to train generative AI models to become better writers in those languages.
A standout feature of Scale AI’s hiring spree is its focus on languages from regions that are traditionally under-resourced in the tech industry. The job listings include languages such as Hausa, Punjabi, Thai, Lithuanian, Persian, Xhosa, Catalan, and Zulu, among others. Additionally, there is a specific focus on hiring experts in South Asian languages like Kannada, Gujarati, Urdu, and Telugu.
Pay for this work varies widely by language: writers working in Western languages can be compensated up to 15 times more than those working in low-resource languages. The discrepancy highlights the need for attention and investment in diverse linguistic capabilities. Scale AI’s decision to employ human workers to improve low-resource language models is a significant departure from traditional data collection methods, according to Julian Posada, an assistant professor at Yale University and a member of the Information Society Project at Yale Law School.
One explanation for the poor performance of generative AI systems in low-resource languages is the scarcity of unsupervised data. Dylan Hadfield-Menell, an assistant professor of artificial intelligence and decision-making at MIT, suggests that models form an inadequate picture of the linguistic patterns of languages like Bengali because those languages have limited representation on the internet.
To combat this problem, one of the tasks assigned to Scale AI’s contract writers is the creation of short stories in low-resource languages. By acquiring data that isn’t reliant on existing internet domains, Scale AI can generate a new body of digitized texts. This approach aims to address limitations in data availability and make AI systems more proficient in low-resource languages.
Q: How is Scale AI addressing the language problem in AI development?
A: Scale AI is actively hiring contract writers proficient in under-represented languages to train generative AI models and improve their writing abilities.
Q: Which regions and languages are the focus of Scale AI’s hiring spree?
A: The job listings cover languages such as Hausa, Punjabi, Thai, Lithuanian, Persian, Xhosa, Catalan, and Zulu, with a specific focus on South Asian languages like Kannada, Gujarati, Urdu, and Telugu.
Q: Why do generative AI systems struggle with low-resource languages?
A: The scarcity of unsupervised data and limited online presence of low-resource languages contribute to the poor performance of generative AI systems in these languages.
Q: How is Scale AI acquiring data for low-resource languages?
A: Scale AI is assigning contract writers the task of writing short stories in these languages to create a new body of digitized texts that is independent of existing internet domains. This approach aims to overcome limitations in data availability.
Q: What is the purpose of Scale AI’s efforts to improve low-resource language models?
A: The goal is to make AI systems more proficient in low-resource languages, so that AI technologies can be applied more inclusively and accurately across diverse linguistic contexts.