RedMonk at FOSDEM: Lies, Damned Lies, and Statistics

Imagine you switch on the TV and you're just in time to catch the start of the weather report. The reporter is smiling cheerfully into the camera, while saying:

"We haven't been able to measure weather statistics in the whole country. In fact, we don't really know exactly how large the country is. Nevertheless, that being said, based on the weather statistics that we did manage to get hold of, here is the state of the weather today."

That, honestly, is what happened at FOSDEM yesterday in a packed out (people standing in the aisles, sitting on the floor, lined up outside in the corridor) RedMonk session entitled "What a Long Strange Trip It's Been: The Past, Present and Future of Java". (Here's the abstract.)

Now, Steve O'Grady was perfectly frank about this in the session (and says it again in his blog article): "to be included in this analysis, a language must be observable within both GitHub and Stack Overflow." And that's really all that's been analyzed, i.e., GitHub and Stack Overflow. (As a corollary, you then end up in the endlessly hilarious and inherently unresolvable discussion about what it means that there are many questions on Stack Overflow about a language -- (a) that it is popular, (b) that it is complicated, (c) that it is badly documented, (d) combinations of the above, (e) something else.) OK, so then I think the title "The RedMonk Programming Language Rankings: January 2014" is really too heavy for the analysis that's been done and the content it provides, especially since by their own admission they haven't surveyed closed-source corporate software development, nor do they have any clue about how to do so. I'd have thought they'd have built up a database of software companies around the world, built relationships with these companies, and that thanks to those relationships they'd be able to regularly poll these companies on the software (and languages and frameworks and libraries and IDEs) they're using. However, this turns out to be quite clearly not the case at all.

And I don't think it's a credible defence to say, "hey, I told you that we didn't analyze anything other than GitHub and Stack Overflow". A lot of organizations take the analysis of RedMonk (and similar organizations, such as Forrester, where the difference is you have to pay for the analysis results and you get an impressively authenticated PDF, rather than a free blog entry) very seriously, in the same way that the weather reporter saying "the weather today has been sunny with blue skies" is taken seriously. No one reads the small print that says "hey, we really don't know how large the country is and we possibly haven't visited most of it". In the case of the weather report, such a disclaimer would be absurd, and a weather service that made it wouldn't have a particularly strong basis to exist.

Personally, I have no reason to make this point other than a concern that we really do need usable and independent research to be done on the ubiquity of programming languages (and frameworks and libraries and IDEs). After all, the conclusions reached by RedMonk favor Java, which is my personal favorite language, the one I have been supporting and promoting for the past 10 years.

In other words, I have nothing to gain by calling RedMonk's bluff. (I'd argue that even if you call your own bluff, it's still bluff.) They reached a conclusion I would have wanted them to reach: Java is not only alive, it is vibrant, self-sustaining, used everywhere by everyone in all manner of application development. However, in a year or so, RedMonk may do another scan (or analysis or whatever) of GitHub and Stack Overflow and then come to completely different conclusions. And they would be as invalid and as silly a basis for any kind of conclusion as the ones they have most recently come up with and presented at FOSDEM yesterday:

OK, thanks, Java is very much alive and vibrant and is used for things other than enterprise applications. (Later that day I met up with John Kostaras from NATO, where they work on a massive Java desktop system that manages European air defence, which of course is yet another example of corporate work being done in Java that falls outside of RedMonk's research. In fact, if RedMonk were to research the ubiquity of Java desktop development, they'd say the Java desktop is dead, but only because NATO and all the other massive enterprises in the defence, aerospace, and banking domains don't use GitHub and Stack Overflow.) The call to action at the end was that we should spread the word that Java is alive. Great. Everyone except maybe analysts, who may have been scanning (or analyzing or whatever the verb is) some other repository somewhere and come up with contradictory conclusions, already knew that.

I'm sure I wasn't the only one left with the question marks outlined above. Several of the discussions at the end of the session more or less asked the same thing, and the response was more or less the same, kind of along these lines: "Hey, we don't know how large the programming world is, we have no way of knowing that, we have to do the best we can with the data that we can get".

If that is really the case then, in the first place, I really appreciate the honesty. In the second place, however, I'd suggest that analysts start actually building relationships with companies and developers at companies, rather than with repositories and on-line discussion forums. In the third place, it's invalid to draw any broad conclusions from this analysis. The worst thing that one can do is to say, essentially, "yes, the weather statistics are incomplete and, what's worse, we have no clue how incomplete, nevertheless, that having been said, the skies are sunny and blue".

As a final point, the first photo above of the "packed out RedMonk session" (with people standing in the aisles, sitting on the floor, lined up outside in the corridor) might indicate that there was a massive interest in the state of Java and especially in what RedMonk had to say about that, which might further underline (a) that Java is very popular and (b) that RedMonk is taken very seriously. Both those things may, of course, be true. But the photo doesn't prove that at all because... pretty much all the rooms at FOSDEM were "packed out". It's a free conference with a massive turnout and... surprisingly small rooms.

Comments:

Appreciate you coming to the talk, Geertjan, and the feedback. While your post purports to be about my FOSDEM talk, however, it is more practically considered a critique of our programming language rankings, as that is the only datapoint it discusses. The language plot was indeed included in the deck, but only along with data from Hacker News, Indeed.com, LinkedIn, Ohloh as well as secondary Java-specific GitHub repo data. The latter, in fact, was what I considered the most important chart in the presentation, as mentioned.

So instead of talking about the FOSDEM talk then, let's focus just on the language rankings. Your argument against them, essentially, is that we use (by our own admission) an incomplete data set - one for which organizational usage is opaque. Specifically, you have the following concerns.

"I'd suggest that analysts start actually building relationships with companies and developers at companies, rather than with repositories and on-line discussion forums."

We do actually have relationships with companies and the developers who work for same. Lots of them, in fact. We talk to a wide variety of developers every day. But these conversations are by definition anecdotal and non-representative, so we prefer sources that are more quantitative in nature. Too many technology analyses historically, in fact, have consisted of opinions based on relationships with companies and developers. We value highly the conversations we have, but prefer hard data in terms of measuring behavior.

"It's invalid to draw any broad conclusions from this analysis. The worst thing that one can do is to say, essentially, 'yes, the weather statistics are incomplete and, what's worse, we have no clue how incomplete, nevertheless, that having been said, the skies are sunny and blue'."

How you view our rankings depends, ultimately, on two things. The value you place on the GitHub and Stack Overflow communities, one, and their predictive ability, two. You are clearly unpersuaded on their value, which is your prerogative, but we would argue that they represent collectively increasing centers of gravity from a developer perspective. It's difficult to argue the opposite, in fact, when you consider how quickly GitHub overtook Stack Overflow (http://redmonk.com/sogrady/2011/06/02/blackduck-webinar/). You seem convinced that they're volatile, mercurial manifestations of developer behavior - "in a year or so, RedMonk may do another scan (or analysis or whatever) of GitHub and Stack Overflow and then come to completely different conclusions" - but in actuality the reverse is true. Our rankings, which we've been doing for four years, are quite stable. So little changes, in fact, that we've stopped running them quarterly and have shifted to a bi-annual run.

Would it be interesting to have perfect insight into the actual code distribution across the world's enterprises and governments? Of course. But knowing, for example, that huge portions of the US Government's codebase - our tax systems in particular - still run on COBOL seems to be of more historical and academic interest than having any real predictive value. With GitHub and Stack Overflow, on the other hand, we can indirectly observe in near real time growing communities in which millions of developers publicly share their work and their questions. Is it a perfect system? Of course not. But neither do we argue it to be such. It's one datapoint, and one ranking system, among many. It's interesting to us, and we think reasonably predictive, but reasonable minds may of course disagree.

As for your last contention, that the full room should not be taken to mean that "RedMonk is taken very seriously," that of course is true. But I'd still rather have a full smaller room than an emptier one ;)

Posted by stephen o'grady on February 11, 2014 at 06:59 AM PST #

Hi Steve and thanks for these comments. You'll be unsurprised that I agree with all of them, in the same way you did with my blog entry. What mainly struck me from the session was how much was omitted from it -- like I stated above, I was sitting next to a guy from NATO whose work, just like everyone else's work in the military sector, and aerodefense sector, and banking sector and so many other sectors, will never make it into the kinds of analysis that you did.

Possibly the only critique I really have is of the title of your analysis: "The RedMonk Programming Language Rankings: January 2014".

Had it been "Analyzing Stack Overflow and GitHub and Drawing Some Tentative Conclusions", I would have had no problem with it at all.

But, let's change the discussion around and address the elephant in the room: "What can we do to get an accurate reflection of what the developer community is doing, across the board, in all sectors of development, in all verticals and horizontals?" We're maybe never going to have complete statistics but what would we need to do to get those?

Posted by Geertjan on February 11, 2014 at 07:18 AM PST #

By the way, I also see your point on the predictive nature of Stack Overflow and GitHub. What would also be cool is if a similar kind of analysis were done on http://www.forge.mil. Now that would really be interesting, to see whether those results correlate with Stack Overflow and GitHub. (Would still be inconclusive, since I wonder how much work is done at forge.mil versus closed source military work, but still, interesting.)

Posted by Geertjan on February 11, 2014 at 07:25 AM PST #

About

Geertjan Wielenga (@geertjanw) is a Principal Product Manager in the Oracle Developer Tools group living & working in Amsterdam. He is a Java technology enthusiast, evangelist, trainer, speaker, and writer. He blogs here daily.

The focus of this blog is mostly on NetBeans (a development tool primarily for Java programmers), with an occasional reference to NetBeans, and sometimes diverging to topics relating to NetBeans. And then there are days when NetBeans is mentioned, just for a change.
