For the past couple of years, everyone from AI experts to the general public has been entranced by the often astonishing output of large language models (LLMs) like GPT-3 and of related generative models like DALL·E 2. Working from natural-language prompts, these models can produce everything from convincing artificial images to stories and poetry. However, the models have also largely been produced by large companies like Google (PaLM) or OpenAI (GPT-3), which routinely restrict access to their full models for a variety of business and ethical reasons. Now, the BigScience research workshop, a group of over 1,000 volunteers, is looking to change that status quo with a new LLM: BLOOM.
BigScience was born in 2021 out of discussions between researchers from Hugging Face, Inc. (a Brooklyn startup focused on democratizing AI and host of the popular Craiyon, née DALL·E Mini, image-generation tool) and representatives from GENCI and IDRIS, two French supercomputing organizations. Eventually, BigScience secured a grant for five million CPU-hours on the Jean Zay supercomputer, which weighed in at around 14 aggregate peak petaflops ahead of a planned upgrade this year.
The goal: to democratize AI through the introduction of “the world’s largest open multilingual language model.” That became “BigScience Large Open-science Open-access Multilingual language model,” or “BLOOM.”
“Large language models (LLMs) have made a significant impact on AI research,” BigScience’s announcement reads. “These powerful, general models can take on a wide variety of new language tasks from a user’s instructions. However, academia, nonprofits and smaller companies’ research labs find it difficult to create, study, or even use LLMs as only a few industrial labs with the necessary resources and exclusive rights can fully access them.”
BLOOM, they said, was being released “to change this status quo” and was “the result of the largest collaboration of AI researchers ever involved in a single research project.”
And, as a result, BLOOM is no slouch, even compared to the big guns. The LLM is able to generate text in 46 human languages and 13 programming languages, and it contains 176 billion parameters: not quite the 540 billion found in Google's PaLM model, but just ahead of the 175 billion parameters in GPT-3. On top of that, for "almost all" of the languages it supports, including major languages like Arabic, French and Spanish, BLOOM is the first language model with over 100 billion parameters.
Accomplishing this was no small feat: training BLOOM drew on that five-million-CPU-hour grant over a 117-day period.
BLOOM is now available for researchers to download, run and study under the terms of BigScience's Responsible AI License (RAIL). Ethics were a major concern for the group, as they have been for corporations and the public alike, given the often convincing results produced by LLMs, which lend themselves to unsavory applications like the production of realistic fraudulent media or text. Alongside the model itself, BigScience created data governance structures for LLMs and drafted the RAIL. The license prohibits unlawful and otherwise harmful uses, and more specifically bars use of BLOOM for controversial applications such as "fully automated decision making that adversely impacts an individual's legal rights or otherwise creates or modifies a binding, enforceable obligation" and "medical advice and medical results interpretation".
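In practice, downloading and running the model goes through the Hugging Face `transformers` library, where BigScience publishes its checkpoints. Below is a minimal sketch; it uses `bigscience/bloom-560m`, a small published sibling of the full model chosen here so the example fits on commodity hardware (the full `bigscience/bloom` checkpoint requires hundreds of gigabytes of memory).

```python
# Minimal sketch: load a BLOOM checkpoint and generate text with the
# Hugging Face `transformers` library. The 560M-parameter sibling
# checkpoint stands in for the full 176B-parameter model.
from transformers import AutoModelForCausalLM, AutoTokenizer

checkpoint = "bigscience/bloom-560m"  # use "bigscience/bloom" with sufficient hardware
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForCausalLM.from_pretrained(checkpoint)

# Encode a prompt, then greedily decode a short continuation.
inputs = tokenizer("BigScience released BLOOM to", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=20, do_sample=False)
text = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(text)
```

Researchers without the hardware to host even the smaller checkpoints can instead query BLOOM through Hugging Face's hosted inference services.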
BigScience intends to update BLOOM, as well. “This is only the beginning,” they wrote. “BLOOM’s capabilities will continue to improve as the workshop continues to experiment and tinker with the model.” Items on the agenda include easier instructability and compression. “BLOOM is the seed of a living family of models that we intend to grow, not just a one-and-done model, and we’re ready to support community efforts to expand it.”
While these efforts are unlikely to eclipse those of Google, Meta or OpenAI any time soon, one thing is for sure: the walls of the LLM garden are slowly but surely coming down. Only time will tell whether the benefits of open research on LLMs outweigh the costs of misuse.