An old and new std::locale vs multithreading issue
By Paolo Carlini on Dec 21, 2009
First of all, sorry about the long time since my last post. In this one I'd like to change topic and summarize a performance issue with std::locale that I have been working on lately. It is quite interesting in my opinion, because it clearly shows that some guidance from the community is needed to figure out the best approach going forward, in the framework of the new, forthcoming C++ Standard.
In a separate post, which I mean to write soon, hopefully before the end of the year, I will update you a bit about the status of the effort toward the new C++. For now you may want to have a look at the new Working Draft, N3000, in case you haven't noticed it yet,
and at the other papers made available as part of the 2009-11 Mailing. Some words of reassuring anticipation: progress is definitely being made on the thorniest issues, among them the one about throwing move constructors, which I briefly mentioned last time. Dave's (and Roni's and Doug's) paper is certainly worth reading.
Also very interesting is this one about simplifying std::pair,
which however is still being scrutinized for possible aliasing problems with the proposed piecewise scheme for constructing a std::pair in uninitialized memory: certainly in these times of very aggressively optimizing compilers we don't want to make mistakes in a new standard!
Coming finally to the main topic of this post, the issue with std::locale, or more precisely with its default constructor: it started with this PR, filed a long time ago, back in May, which I really hoped to have "somehow" fixed in time for gcc4.5.0:
In fact, the real problem isn't new at all: we have had this mutex since 2004, but apparently only now do people care about it, because this code is run more and more often by multithreaded programs. Indeed, it may seem really strange that just creating a default constructed std::stringstream may have performance issues, because apparently nothing is shared with other threads. The problem is caused by the default constructor of std::locale: std::streambuf, the base class of the std::stringbuf that every std::stringstream owns, holds a std::locale member, and so does the std::ios_base part of the stream itself. In our implementation, in order to avoid races with other threads changing the global locale at the same time, that default constructor has to take a global lock, in the general case.
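To make the hidden dependency concrete, here is a minimal sketch (the function name is made up for this post, it is not part of any library). Several threads do nothing but construct std::ostringstream objects, and each one checks that the stream it just built holds a copy of the global locale, which is exactly the copy that goes through the std::locale default constructor.

```cpp
#include <atomic>
#include <cassert>
#include <locale>
#include <sstream>
#include <thread>
#include <vector>

// Hypothetical helper: builds `iterations` streams in each of `n_threads`
// threads and counts how many of them hold a copy of the global locale.
int count_streams_using_global_locale(int n_threads, int iterations) {
    std::atomic<int> matches{0};
    std::vector<std::thread> workers;
    for (int t = 0; t < n_threads; ++t) {
        workers.emplace_back([&matches, iterations] {
            for (int i = 0; i < iterations; ++i) {
                // Nothing is visibly shared with the other threads here...
                std::ostringstream os;
                // ...yet the stream was initialized from the global locale,
                // so it compares equal to a default constructed std::locale.
                if (os.getloc() == std::locale())
                    ++matches;
            }
        });
    }
    for (auto& w : workers)
        w.join();
    return matches.load();
}
```

As long as nobody changes the global locale while the threads run, every stream matches, so a call like count_streams_using_global_locale(4, 1000) counts all 4000 streams. How much the threads actually contend depends on the library version and on which global locale is installed.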
As you can see in the audit trail, for gcc4.5.0 we are able, thanks to some help from Jimmy Guo, an external contributor, to provide a nice optimization for the common case of a "C" global locale. This is very important, because it means that users not actually interested in std::locale features pay much less when just using std::stringstream, consistently with one of the basic design tenets of the C++ programming language since its early days: performance-wise, pay only for the features which are explicitly used. Note that ideally we could do away with reference counting too in this special case of the "C" locale, because in our implementation the data is statically allocated and never actually destroyed, thus updating the reference count is a pure waste of time... but it turns out we can't do that just now, because of ABI compatibility. Details.
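The idea behind this kind of fast path can be sketched with a toy model. To be clear, this is not the libstdc++ code: Impl, classic_impl, global_impl, acquire_global, release, set_global and the slow_path_calls counter are all names invented for the illustration, and std::atomic is used only to keep the sketch well-defined. The point is that an object which is never destroyed can be referenced without the lock, while any other global object needs the lock to keep it alive between reading the pointer and bumping its count.

```cpp
#include <atomic>
#include <cassert>
#include <mutex>

// Toy stand-in for a reference-counted locale implementation object.
struct Impl {
    std::atomic<int> refs{1};
};

Impl classic_impl;                              // static storage, never deleted
std::atomic<Impl*> global_impl{&classic_impl};  // the current "global locale"
std::mutex global_mutex;                        // guards replacement of global_impl
std::atomic<int> slow_path_calls{0};            // instrumentation for this example only

// What a default constructor would do: take a reference to the global object.
Impl* acquire_global() {
    Impl* p = global_impl.load(std::memory_order_acquire);
    if (p == &classic_impl) {
        // Fast path: the classic object is never destroyed, so taking a
        // reference is safe even if another thread replaces the global now.
        p->refs.fetch_add(1, std::memory_order_relaxed);
        return p;
    }
    // Slow path: hold the lock so that the object cannot be released and
    // deleted between reading the pointer and incrementing its count.
    std::lock_guard<std::mutex> lock(global_mutex);
    ++slow_path_calls;
    p = global_impl.load(std::memory_order_relaxed);
    p->refs.fetch_add(1, std::memory_order_relaxed);
    return p;
}

// Drops one reference; only heap-allocated objects are ever deleted.
void release(Impl* p) {
    if (p->refs.fetch_sub(1, std::memory_order_acq_rel) == 1 && p != &classic_impl)
        delete p;
}

// Installs `next` as the global object; the global holds its own reference.
void set_global(Impl* next) {
    std::lock_guard<std::mutex> lock(global_mutex);
    next->refs.fetch_add(1, std::memory_order_relaxed);
    Impl* old = global_impl.exchange(next, std::memory_order_acq_rel);
    release(old);
}
```

Note that even the fast path still updates the reference count of the classic object, which is the part that cannot be dropped without an ABI change.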
The interesting technical point, however, which in my opinion matters well beyond the details of the performance issue we are facing right now in the implementation, is clarified by this piece of text in N3000, 22.3.1/9:
"Whether there is one global locale object for the entire program or one global locale object per thread is implementation-defined. Implementations are encouraged but not required to provide one global locale object per thread. If there is a single global locale object for the entire program, implementations are not required to avoid data races on it (22.214.171.124)."
Thus, the new standard, being finally aware of the existence of threads, both in the core part of the specification and in the library, and not just in the specific subsections about std::thread, also says something about the possible ways to deal with the global locale in a modern implementation. I must admit that I haven't closely followed the work in the ISO Committee which led to this piece of text, which of course looks quite sensible to me but at the same time a bit vague: certainly people interested in these topics have to give some serious thought to it before attempting any new implementation.
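To see what "one global locale object for the entire program" means in practice, consider this small sketch. CommaDecimal and format_after_other_thread_sets_global are names made up for the example; std::locale::global, std::numpunct and the locale-plus-facet constructor are the standard facilities. One thread installs a global locale with a comma as decimal point, and afterwards a stream created in a different thread is checked for the change.

```cpp
#include <cassert>
#include <locale>
#include <sstream>
#include <string>
#include <thread>

// Hypothetical facet for the example: uses ',' as the decimal point.
struct CommaDecimal : std::numpunct<char> {
protected:
    char do_decimal_point() const override { return ','; }
};

// Lets another thread change the global locale, then formats 1.5 in the
// calling thread with a freshly constructed stream.
std::string format_after_other_thread_sets_global() {
    std::locale previous;  // copy of the global locale before the change
    std::thread setter([] {
        std::locale::global(std::locale(std::locale::classic(), new CommaDecimal));
    });
    setter.join();  // join() orders the change before the code below: no race
    std::ostringstream os;  // initialized from whatever is global in this thread
    os << 1.5;
    std::locale::global(previous);  // restore, to keep the example side-effect free
    return os.str();
}
```

With a single program-wide global locale, as in our current implementation, the function returns "1,5": the change made by the other thread is visible here. An implementation that took the "one global locale object per thread" route would presumably return "1.5" instead.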
However, before recently getting help from Jimmy with the issue in our Bugzilla, I personally took for granted that one global locale per thread was the "obvious" way of implementing the new specification, probably even rather straightforward thanks to the availability of thread-local storage. I was only waiting a bit because of the usual ABI requirements. As part of the discussion thread, however, Jimmy pointed out that separate global locales can present a "usability issue". Actually, that makes sense: after all, the name itself, *global*, conceptually clashes a bit with the per-thread idea, doesn't it? I'm thinking that maybe the new standard could be "fixed" by adding a new entity, a really-global locale so to speak, required to be global for the whole running program... but now it's too late for these crazy ideas ;)
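The usability concern is easy to show with a sketch built on thread_local. Everything here is hypothetical: thread_global, CommaDecimal and the two format functions are invented for the illustration and say nothing about how a real implementation would be written. Each thread gets its own "global" locale, so no lock is needed to read it, but a setting made in one thread silently fails to reach the others.

```cpp
#include <cassert>
#include <locale>
#include <sstream>
#include <string>
#include <thread>

// Hypothetical facet for the example: uses ',' as the decimal point.
struct CommaDecimal : std::numpunct<char> {
protected:
    char do_decimal_point() const override { return ','; }
};

// Hypothetical per-thread "global" locale: every thread starts from "C".
std::locale& thread_global() {
    thread_local std::locale loc = std::locale::classic();
    return loc;
}

// Formats `value` using the calling thread's own "global" locale.
std::string format_in_this_thread(double value) {
    std::ostringstream os;
    os.imbue(thread_global());
    os << value;
    return os.str();
}

// Formats `value` in a brand new thread, which has its own fresh copy.
std::string format_in_new_thread(double value) {
    std::string out;
    std::thread worker([&out, value] { out = format_in_this_thread(value); });
    worker.join();
    return out;
}
```

After `thread_global() = std::locale(std::locale::classic(), new CommaDecimal);` in the main thread, format_in_this_thread(1.5) gives "1,5" while format_in_new_thread(1.5) still gives "1.5". A program that sets "the" global locale once at startup and then spawns workers would be surprised by exactly this.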
To summarize then, my current understanding is that we should not hurry to change the implementation to one global locale per thread. Instead, we should poll the users and all the interested parties much more before taking any action with long term implications in this area. In other words, given that we know for sure we can have good performance for the "C" locale, we should first seriously explore how much we can improve, in a new ABI, the performance of the current single-global scheme, by changing at the same time other parts of the implementation too, or whatever else is needed. Feedback welcome!