At DebConf17, John Sullivan, the executive director of the FSF, gave a talk on the supposed decline of the use of copyleft licenses use free-software projects. In his presentation, Sullivan questioned the notion that permissive licenses, like the BSD or MIT licenses, are gaining ground at the expense of the traditionally dominant copyleft licenses from the FSF. While there does seem to be a rise in the use of permissive licenses, in general, there are several possible explanations for the phenomenon.

When the rumor mill starts

Sullivan gave a recent example of the claim of the decline of copyleft in an article on Opensource.com by Jono Bacon from February 2017 that showed a histogram of license usage between 2010 and 2017 (seen below).

[Black Duck
  histogram]

From that, Bacon elaborates possible reasons for the apparent decline of the GPL. The graphic used in the article was actually generated by Stephen O'Grady in a January article, The State Of Open Source Licensing, which said:

In Black Duck's sample, the most popular variant of the GPL – version 2 – is less than half as popular as it was (46% to 19%). Over the same span, the permissive MIT has gone from 8% share to 29%, while its permissive cousin the Apache License 2.0 jumped from 5% to 15%.

Sullivan, however, argued that the methodology used to create both articles was problematic. Neither contains original research: the graphs actually come from the Black Duck Software "KnowledgeBase" data, which was partly created from the old Ohloh web site now known as Open Hub.

To show one problem with the data, Sullivan mentioned two free-software projects, GNU Bash and GNU Emacs, that had been showcased on the front page of Ohloh.net in 2012. On the site, Bash was (and still is) listed as GPLv2+, whereas it changed to GPLv3 in 2011. He also claimed that "Emacs was listed as licensed under GPLv3-only, which is a license Emacs has never had in its history", although I wasn't able to verify that information from the Internet archive. Basically, according to Sullivan, "the two projects featured on the front page of a site that was using [the Black Duck] data set were wrong". This, in turn, seriously brings into question the quality of the data:

I reported this problem and we'll continue to do that but when someone is not sharing the data set that they're using for other people to evaluate it and we see glimpses of it which are incorrect, that should give us a lot of hesitation about accepting any conclusion that comes out of it.

Reproducible observations are necessary to the establishment of solid theories in science. Sullivan didn't try to contact Black Duck to get access to the database, because he assumed (rightly, as it turned out) that he would need to "pay for the data under terms that forbid you to share that information with anybody else". So I wrote Black Duck myself to confirm this information. In an email interview, Patrick Carey from Black Duck confirmed its data set is proprietary. He believes, however, that through a "combination of human and automated techniques", Black Duck is "highly confident at the accuracy and completeness of the data in the KnowledgeBase". He did point out, however, that "the way we track the data may not necessarily be optimal for answering the question on license use trend" as "that would entail examination of new open source projects coming into existence each year and the licenses used by them".

In other words, even according to Black Duck, its database may not be useful to establish the conclusions drawn by those articles. Carey did agree with those conclusions intuitively, however, saying that "there seems to be a shift toward Apache and MIT licenses in new projects, though I don't have data to back that up". He suggested that "an effective way to answer the trend question would be to analyze the new projects on GitHub over the last 5-10 years." Carey also suggested that "GitHub has become so dominant over the recent years that just looking at projects on GitHub would give you a reasonable sampling from which to draw conclusions".

[GitHub
  graph]

Indeed, GitHub published a report in 2015 that also seems to confirm MIT's popularity (45%), surpassing copyleft licenses (24%). The data is, however, not without its own limitations. For example, in the above graph going back to the inception of GitHub in 2008, we see a rather abnormal spike in 2013, which seems to correlate with the launch of the choosealicense.com site, described by GitHub as "our first pass at making open source licensing on GitHub easier".

In his talk, Sullivan was critical of the initial version of the site which he described as biased toward permissive licenses. Because the GitHub project creation page links to the site, Sullivan explained that the site's bias could have actually influenced GitHub users' license choices. Following a talk from Sullivan at FOSDEM 2016, GitHub addressed the problem later that year by rewording parts of the front page to be more accurate, but that any change in license choice obviously doesn't show in the report produced in 2015 and won't affect choices users have already made. Therefore, there can be reasonable doubts that GitHub's subset of software projects may not actually be that representative of the larger free-software community.

In search of solid evidence

So it seems we are missing good, reproducible results to confirm or dispel these claims. Sullivan explained that it is a difficult problem, if only in the way you select which projects to analyze: the impact of a MIT-licensed personal wiki will obviously be vastly different from, say, a GPL-licensed C compiler or kernel. We may want to distinguish between active and inactive projects. Then there is the problem of code duplication, both across publication platforms (a project may be published on GitHub and SourceForge for example) but also across projects (code may be copy-pasted between projects). We should think about how to evaluate the license of a given project: different files in the same code base regularly have different licenses—often none at all. This is why having a clear, documented and publicly available data set and methodology is critical. Without this, the assumptions made are not clear and it is unreasonable to draw certain conclusions from the results.

It turns out that some researchers did that kind of open research in 2016 in a paper called "The Debsources Dataset: Two Decades of Free and Open Source Software" [PDF] by Matthieu Caneill, Daniel M. Germán, and Stefano Zacchiroli. The Debsources data set is the complete Debian source code that covers a large history of the Debian project and therefore includes thousands of free-software projects of different origins. According to the paper:

The long history of Debian creates a perfect subject to evaluate how FOSS licenses use has evolved over time, and the popularity of licenses currently in use.

Sullivan argued that the Debsources data set is interesting because of its quality: every package in Debian has been reviewed by multiple humans, including the original packager, but also by the FTP masters to ensure that the distribution can legally redistribute the software. The existence of a package in Debian provides a minimal "proof of use": unmaintained packages get removed from Debian on a regular basis and the mere fact that a piece of software gets packaged in Debian means at least some users found it important enough to work on packaging it. Debian packagers make specific efforts to avoid code duplication between packages in order to ease security maintenance. The data set covers a period longer than Black Duck's or GitHub's, as it goes all the way back to the Hamm 2.0 release in 1998. The data and how to reproduce it are freely available under a CC BY-SA 4.0 license.

[Debsource
  graph]

Sullivan presented the above graph from the research paper that showed the evolution of software license use in the Debian archive. Whereas previous graphs showed statistics in percentages, this one showed actual absolute numbers, where we can't actually distinguish a decline in copyleft licenses. To quote the paper again:

The top license is, once again, GPL-2.0+, followed by: Artistic-1.0/GPL dual-licensing (the licensing choice of Perl and most Perl libraries), GPL-3.0+, and Apache-2.0.

Indeed, looking at the graph, at most do we see a rise of the Apache and MIT licenses and no decline of the GPL per se, although its adoption does seem to slow down in recent years. We should also mention the possibility that Debian's data set has the opposite bias: toward GPL software. The Debian project is culturally quite different from the GitHub community and even the larger free-software ecosystem, naturally, which could explain the disparity in the results. We can only hope a similar analysis can be performed on the much larger Software Heritage data set eventually, which may give more representative results. The paper acknowledges this problem:

Debian is likely representative of enterprise use of FOSS as a base operating system, where stable, long-term and seldomly updated software products are desirable. Conversely Debian is unlikely representative of more dynamic FOSS environments (e.g., modern Web-development with micro libraries) where users, who are usually developers themselves, expect to receive library updates on a daily basis.

The Debsources research also shares methodology limitations with Black Duck: while Debian packages are reviewed before uploading and we can rely on the copyright information provided by Debian maintainers, the research also relies on automated tools (specifically FOSSology) to retrieve license information.

Sullivan also warned against "ascribing reason to numbers": people may have different reasons for choosing a particular license. Developers may choose the MIT license because it has fewer words, for compatibility reasons, or simply because "their lawyers told them to". It may not imply an actual deliberate philosophical or ideological choice.

Finally, he brought up the theory that the rise of non-copyleft licenses isn't necessarily at the detriment of the GPL. He explained that, even if there is an actual decline, it may not be much of a problem if there is an overall growth of free software to the detriment of proprietary software. He reminded the audience that non-copyleft licenses are still free software, according to the FSF and the Debian Free Software Guidelines, so their rise is still a positive outcome. Even if the GPL is a better tool to accomplish the goal of a free-software world, we can all acknowledge that the conversion of proprietary software to more permissive—and certainly simpler—licenses is definitely heading in the right direction.

[I would like to thank the DebConf organizers for providing meals for me during the conference.]

Note: this article first appeared in the Linux Weekly News.

Created . Edited .