With the advancements in Natural Language Processing and Natural Language Generation, Large Language Models (LLMs) are being used more and more in real-world applications. With their ability to mimic human language and their general-purpose nature, these models have found their way into nearly every field and domain.
Though these models have gained significant attention, they represent a constrained and skewed collection of human viewpoints and knowledge. The composition of the pretraining data is the root of this bias, as it has a major effect on the model's behavior.
Researchers have therefore been putting more effort into understanding and documenting the transformations applied to data before pretraining. Pretraining data curation is a multi-step process with multiple decision points that are frequently based on subjective judgments of text quality or on performance against benchmarks.
In a recent study, a team of researchers from the Allen Institute for AI, the University of California, Berkeley, Emory University, Carnegie Mellon University, and the University of Washington introduced a new dataset and framework called AboutMe. The study highlights the many unquestioned assumptions that exist in data curation workflows. With AboutMe, the team has attempted to document the effects of data filtering on text rooted in social and geographic contexts.
The lack of extensive, self-reported sociodemographic information attached to language data is one of the issues facing sociolinguistic analysis in Natural Language Processing. Text can be traced back to general sources such as Wikipedia, but at a more granular level, it is often unknown who created the data. In this study, the team identified websites, particularly 'about me' pages, by using pre-existing patterns in web data. This enables an unprecedented understanding of whose language is represented in web-scraped text.
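To illustrate the general idea, here is a minimal sketch of selecting self-description pages by URL-path patterns. The pattern list and the (url, text) record format are assumptions for this example, not the paper's actual heuristics.

```python
import re
from urllib.parse import urlparse

# Hypothetical URL-path patterns for self-description pages; the actual
# heuristics used for AboutMe may differ.
ABOUT_PATTERN = re.compile(r"/(about|about-?me|about-?us|bio|who-we-are)$", re.IGNORECASE)

def find_about_pages(records):
    """Yield (url, text) pairs whose URL path looks like an 'about' page.

    `records` is assumed to be an iterable of (url, text) tuples drawn
    from a web crawl.
    """
    for url, text in records:
        path = urlparse(url).path.rstrip("/")
        if ABOUT_PATTERN.search(path):
            yield url, text

# Toy usage:
crawl = [
    ("https://example.com/about", "We are a family-run bakery ..."),
    ("https://example.com/blog/post-1", "Today I baked sourdough ..."),
]
print([url for url, _ in find_about_pages(crawl)])   # ['https://example.com/about']
```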
Using data from the 'about me' sections of websites, the team performed sociolinguistic analyses to measure website authors' topical interests, their self-presentation as individuals or organizations, their self-identified social roles, and their associated geographic locations. Ten quality and English-language-identification filters from earlier research on LLM development were then applied to these web pages to examine which pages each filter keeps or removes.
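To make that setup concrete, the sketch below applies a set of keep/remove filters to pages and records the outcome per page. The two stand-in filters are placeholders for illustration only; the ten filters in the study rely on model-based quality classifiers and language-ID models rather than these crude heuristics.

```python
from typing import Callable, Dict, List, Tuple

# Stand-in filters; real pipelines use model-based quality classifiers and
# language-ID models rather than these simple heuristics.
def min_length_filter(text: str) -> bool:
    return len(text.split()) >= 50                # keep pages with >= 50 words

def mostly_ascii_filter(text: str) -> bool:
    ascii_chars = sum(ch.isascii() for ch in text)
    return ascii_chars / max(len(text), 1) > 0.9  # very rough English proxy

FILTERS: Dict[str, Callable[[str], bool]] = {
    "min_length": min_length_filter,
    "mostly_ascii": mostly_ascii_filter,
    # ... the study applies ten such filters
}

def filter_outcomes(pages: List[Tuple[str, str]]) -> Dict[str, Dict[str, bool]]:
    """For each filter, map page URL -> True if the page is kept."""
    return {
        name: {url: keep(text) for url, text in pages}
        for name, keep in FILTERS.items()
    }
```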
The team has shared that their primary goal was to find trends in filtering behavior related to website origin, both within and between filters. The results show that model-based quality filters exhibit implicit preferences for specific subject areas, which causes text associated with different professions and vocations to be removed at varying rates. Moreover, filtering techniques that presume pages are monolingual may unintentionally eliminate content from non-anglophone parts of the globe.
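One way such disparities can be quantified, assuming hypothetical per-page group labels (for example, a self-identified role or an inferred country), is to compare removal rates across groups for a given filter:

```python
from collections import defaultdict

def removal_rates(kept_by_url, group_by_url):
    """Fraction of pages removed per group for a single filter.

    `kept_by_url` maps URL -> True/False (kept or removed); `group_by_url`
    maps URL -> a hypothetical group label such as a profession or country.
    """
    totals, removed = defaultdict(int), defaultdict(int)
    for url, kept in kept_by_url.items():
        group = group_by_url.get(url, "unknown")
        totals[group] += 1
        removed[group] += int(not kept)
    return {group: removed[group] / totals[group] for group in totals}

# A filter that strips pages from some professions or regions far more often
# than others will show large gaps between the returned rates.
```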
In conclusion, this research highlights the intricacies involved in data filtering during LLM development and its consequences for the representation of diverse viewpoints in language models. The study's primary goal is to raise awareness of the intricate details that go into pretraining data curation procedures, particularly with respect to their social aspects. The team has stressed the need for more research on pretraining data curation procedures and their social implications.
Check out the Paper and GitHub. All credit for this research goes to the researchers of this project.
Tanya Malhotra is a final-year undergraduate at the University of Petroleum & Energy Studies, Dehradun, pursuing a BTech in Computer Science Engineering with a specialization in Artificial Intelligence and Machine Learning.
She is a Data Science enthusiast with good analytical and critical thinking skills, along with an ardent interest in acquiring new skills, leading groups, and managing work in an organized manner.