At the same time as TensorFlow’s rise, foreshadowing what was yet to come in open source AI, enterprise software went through an open source licensing crisis. Mostly because of AWS, which had mastered the craft of taking open source infrastructure projects and building commercial services around them, many open source projects exchanged their permissive licenses for “Copyleft” or “ShareAlike” (SA) alternatives.
Not all open source is created equal. Permissive licenses (like Apache 2.0 or MIT) allow anyone to take an open source project and build a commercial service around it. “Copyleft” licenses (like GPL), much like Creative Commons’ “ShareAlike” terms, are one way to protect against this. They’re sometimes called a “poison pill”, because they require any derivative product to be licensed the same way. If AWS launched a service based on an open source project with a “Copyleft” license, the AWS service itself would have to be open sourced under the same license.
So, partly in response to competing cloud services, the corporate creators and maintainers of open source projects like MongoDB and Redis switched their licenses to less permissive alternatives. This led to a painful but entertaining back-and-forth between AWS and those companies on the principles and merits of open source, which has since calmed down a bit.
Note that this change in licensing had a deceptive effect on the open source ecosystem: there are still plenty of new open source projects being announced, but the licensing implications for what can and can’t be done with those projects are more complicated than most people realize.
At this point you should be asking yourself: if the corporate maintainers of open source infrastructure projects realized that others were reaping more of the commercial benefits than they were, shouldn’t the same be happening with AI? Isn’t this an even bigger deal for open source AI models, which hold the combined value of the compute and data that went into creating them? The answers are: yes and yes.
Although there appears to be a Robin Hood-esque movement around open source AI, the data is pointing in a different direction. Large corporations like Microsoft are changing the licensing of some of their most popular models from permissive to non-commercial (NC) licenses, and Meta has started to use non-commercial licenses for all of their recent open source projects (MMS, ImageBind, DINOv2 are all CC-BY-NC 4.0 and LLaMA is GPL 3.0). Even popular projects from universities, like Stanford’s Alpaca, are only licensed for non-commercial use (inherited from the non-permissive terms of the dataset they used). Entire companies change their business models in order to protect their IP and rid themselves of the obligation to open source as part of their mission. Remember when a small non-profit called OpenAI transformed itself into a capped-profit? Notice that GPT2 was open sourced, but GPT3.5 and GPT4 weren’t?
More generally, the trend towards less permissive licenses in AI, although opaque, is noticeable. Below is an analysis of model licenses on Hugging Face. The share of permissive licenses (like Apache, MIT, or BSD) has been in steady decline since mid 2022, while non-permissive licenses (like GPL) or restrictive licenses (like OpenRAIL) have become more common.
To make things worse, the recent frenzy around large language models (LLMs) has further muddied the waters. Hugging Face maintains an “Open LLM Leaderboard” which aims to highlight “the real progress that’s being made by the open-source community”. To be fair, all the models on the leaderboard are indeed open source. However, a closer look reveals that almost none are licensed for commercial use*.
*Between the writing of this post and its publication, the license for the Falcon models changed to the permissive Apache 2.0 license. The overall observation is still valid.
If anything, the Open LLM Leaderboard highlights that innovation from big tech (LLaMA was open sourced by Meta with a non-commercial license) dominates all other open source efforts. The bigger problem is that these derivative models are not as forthcoming about their licenses. Almost none declare their license explicitly, and you have to do your own research to find out that the models and data they’re based on don’t allow for commercial use.
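That kind of research can be roughed out in code. The Python sketch below walks a model's lineage and reports every artifact whose license is undeclared or not permissive. Everything in it is an assumption made for illustration: the `Artifact` class, the `classify` and `audit` helpers, the hypothetical model and dataset names, and the small license groupings, which follow the examples in this post and are not a legal classification.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple

# Illustrative groupings based on the examples in this post (not legal advice).
PERMISSIVE = {"apache-2.0", "mit", "bsd-3-clause"}
COPYLEFT = {"gpl-3.0"}
NON_COMMERCIAL = {"cc-by-nc-4.0"}


@dataclass
class Artifact:
    """A model or dataset, plus the artifacts it was derived from (hypothetical type)."""
    name: str
    license_id: Optional[str] = None  # None means no license was declared
    based_on: List["Artifact"] = field(default_factory=list)


def classify(license_id: Optional[str]) -> str:
    """Map a license identifier to a coarse category."""
    if license_id is None:
        return "undeclared"
    key = license_id.strip().lower()
    if key in PERMISSIVE:
        return "permissive"
    if key in COPYLEFT:
        return "copyleft"
    if key in NON_COMMERCIAL:
        return "non-commercial"
    return "unknown"


def audit(artifact: Artifact) -> List[Tuple[str, str]]:
    """Return (name, category) for the artifact and every ancestor that is not permissive."""
    findings = []
    category = classify(artifact.license_id)
    if category != "permissive":
        findings.append((artifact.name, category))
    for parent in artifact.based_on:
        findings.extend(audit(parent))
    return findings


# Hypothetical lineage: a fine-tuned model with no declared license,
# built on a non-commercial base model and a non-commercial dataset.
base = Artifact("hypothetical-base-model", "cc-by-nc-4.0")
data = Artifact("hypothetical-instruction-dataset", "cc-by-nc-4.0")
tuned = Artifact("hypothetical-chat-model", None, [base, data])

for name, category in audit(tuned):
    print(f"{name}: {category}")
```

An empty result only means nothing in the recorded lineage was flagged. The hard part in practice is reconstructing that lineage at all, since derivative models rarely document it.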
There’s a lot of virtue-signaling in the community, mostly by well-meaning entrepreneurs and VCs who hope for a future that isn’t dominated by OpenAI, Google, and a handful of others. It is not obvious why AI models should be open sourced: they represent hard-earned intellectual property that companies develop over years, spending billions on compute, data acquisition, and talent. Companies would be defrauding their shareholders if they just gave everything away for free.
The trend towards non-permissive licenses in open source AI seems clear. Yet the overwhelming volume of news fails to point out that the cumulative benefit of this work accrues almost entirely to academics and hobbyists. Investors and executives alike should be more aware of the implications and exercise more care. I have a strong feeling that most startups in the emerging LLM cottage industry are building on top of non-commercially licensed technology. If I could invest in an ETF for IP lawyers I would.
My prediction is that value capture for AI (specifically for the latest generation of large generative models) will look similar to other innovations that require significant capital investment and accumulation of specialized talent, like cloud computing platforms or operating systems. A few major players will emerge that provide the AI foundation to the rest of the ecosystem. There will still be ample room for a layer of startups on top of that foundation, but just as there are no open source projects dethroning AWS, I consider it highly unlikely that the open source community will produce a serious competitor to OpenAI’s GPT and whatever comes next.