Monday Jan 28, 2008

Slight changes...

I hereby happily announce that starting this Friday, I'll be part of the great team at Novell. Being fully home-assigned and saving the ~2 hours commute every day, I can now hopefully push my hidden gsl agenda even more.

Wednesday Nov 28, 2007

Don't trust your tools, if they are software...

While hunting down broken images in the pdf import, I came across a very weird behaviour, that was seemingly caused by plain C stream IO: every once in a while, streams that were explicitely opnened in binary mode converted 0x0A into 0x0D. One-on-one substitution, not the usual 0x0A to 0x0A 0x0D lineend conversion, notably. I wasted about a day trying to track that down, suspecting everything from crazy filesystem flags, to symbol hijacking during link time (i.e. somebody sneaking in own fwrite/putc/whatever into the binary). Laboriously ruled out one after the other.

Finally, I did what I should have done in the first place: I used a different hex editor. And lo and behold, the trusty midnight commander showed all the files with correct 0x0A. Caught in the act and proven guilty: emacs' hexl-mode.

Well, kind of. It turned out to be a rather unlucky combination of things going wrong: emacs with its buffer encoding scheme, which tries to be smart and detect the actual encoding - if there's a 0x0A in the file before a 0x0D, emacs assumes raw-text-unix and leaves line ends alone. If there's a 0x0D in the file before any 0x0A, emancs assumes raw-text-mac, and converts all assumed lineends to 0x0D. And since hexl-mode seems to work on top of an existing buffer (and never loads the file directly from disk), this resulted in the described mess.

So, to save yourself some trouble: don't trust your tools. And make sure to have files you want to work on with hexl-mode match something in file-coding-system-alist's content, that enforces binary reads (elc, png, jpeg, ar, gz etc. for my setup).

Friday Nov 23, 2007

GSoC 2007 miscellanea

Being honoured to mentor Shane Matthews on his GSoC 2007 project Impress OpenGL rendered transitions, I today received the mentor gift - thank you Google for the tshirt!

On related news, the resulting CWS got recently merged to HEAD, so those of the adventurous kind can now build this as an optional feature: just issue "build ENABLE_OPENGL=TRUE" in the slideshow module to be blessed with an ogltrans shared lib. And thanks to Rene, after CWS configure22 is integrated, one can even give "--enable-opengl" on the configure line. After that, register the ogltrans lib into OOo via unopkg, and use this presentation file to see what others are missing.

This could even be shipped as a separate extension after 2.4 is out...

Monday Oct 15, 2007

A Dead Parrot Learns to Fly

After an impromptu dead parrot sketch at the OOoCon ("the beamer works" - "no, it's dead" - "bah, it's not dead, it just needs a reboot"... (you might look forward to the still-missing video)), I had a rather rough ride with the proverbial last 10 percent of the browser functionality - I'm still not 100% happy with the representation in OOo's Calc, and the xcs schema parser leaves something to desire. But surely the thing is now ready for config hackers to play with it - grab your copy here:

09e22fc6d8d9f9bfd3dc7cf838ce6590 ooconfig.oxt

(for those of you who don't know what I'm talking about: OoConfig is a Python extension, providing a tool quite similar in spirit to Mozilla's about:config page - a way of tweaking the complete (even hidden) set of configuration items. See also the wiki for further information)

Wednesday Aug 08, 2007

A hacker's way of app-bundling

Just found this. Pretty cool way of packaging an application to be completely relocatable (though without desktop integration, barring further hacks). Something for OOo Portable?

Tuesday Jul 24, 2007

Intel opensourced their Thread Building Blocks

Seems that Intel has set their TBB lib free - which solves a few of the basic issues outlined here (I've commented a bit in the context of c++).

Sunday Nov 12, 2006

OOo my Threading (3)

Okay, what does all of this really buy us, right now? Lets assume we want to speed up Calc's way of parsing and calculating cell values. Given the following dependency tree (cell 1 refers to the result of cell4, cell 6 to the result of cell2, and so forth):

    parse_cell_4                                parse_cell_5
         |                                           |
    parse_cell_1        parse_cell_2            parse_cell_3
         |                   |                       |
         -------------- parse_cell_6 -----------------

The partial ordering of cell evaluation equivalent to this tree is as follows:


Wanting to employ what we've talked about earlier, to parallelize this calculation, one would need a static expression (i.e. one fixed at compile time):


(with as_func denoting that the argument should be evaluated lazily, and parse being a Calc cell parsing unary function, expecting the cell number as its sole parameter).

Now, having the formula dependencies fixed in the code isn't of much value for a spreadsheet, thus we need to handle this a bit more dynamically. Futures and Actions in principle provide the functionality that we need here, only that there's one slight catch: the pool of threads processing the Actions or Futures might actually be too small to have all cells resolved, that in turn are referenced from other ones (with circular cell references being the extreme example. But see below). This is because forcing a Future or Action to evaluate will block the thread requesting that value, which will eventually lead to starvation, unless there's at least one more thread active to process cell values other Actions are waiting for.

Of course, one could remedy this by using N threads when dealing with N cells. But that would get out of hand pretty quickly. Alternatively, the cell parsing function can be split into two parts: the first generates a parse tree, thus extracting the referenced cells (depending on the heap allocation overhead, one could also throw away the parser outcome, except for the cell references. But OTOH, with a decent thread-local allocator, the full parsing could happen concurrently. YMMV). Given the list of references for each cell, one again gets the partial ordering over the value calculation:

vector< vector< int > >  intermediate;
parallel_transform( cells.begin(), cell.end(),
                    &parse_step_1 );

For each cell, this step yields a vector of preconditions (other cells, that need to be calculated before). Pushing the actual cell value calculation functions into a job queue, and handing it the dependency graph (represented by the individual cell's references) generated above, we arrive at a fully dynamic version of the concurrent cell parsing example:

int              i=0;
job_queue        queue;
vector< double > results(intermediate.size(),0.0);
transform( intermediate.begin(), 
           bind( &job_queue::add_job,
                 bind( &parse_step_2,
                       var(i)++ ),
                _1 ));;

This looks weird, but peeking at the prototypes of the involved functions might help clear things up:

/\*\* adds a job functor

    @param functor
    Functor to call when job is to be executed

    @param prerequisites
    Vector of indices into the job queue, that must be processed
    strictly before this job. Permitted to use not-yet-existing 
template job_queue::add_job( Func               functor,
                                            vector const& prerequisites );


/\*\* calculates cell value

    @param io_results
    In/out result vector. Receives resulting cell value, and is used
    to read referenced cell's input values.

    @param content
    Cell content

    @param cell_index
    Index of cell to process
parse_step_2( vector& io_results,
              string const&   content,
              int             cell_num );

This job queue can then decide, either globally or via policies, what to do in various situations:

    Circular dependencies: either refuse working on such a job, or use a default value for an arbitrary member of the cycle (to break it)

    Whether to execute jobs in parallel or not. Depending on the number of cores available (both physically and load-wise), the queue could decide to stay single-threaded, if the number of jobs is low, or multi-threaded for a larger number of jobs. Note that this decision might be highly influenced by the amount of work a single job involves, and therefore external hints to the queue might be necessary. Kudos to mhu for the hint, that it's wise to parallelize ten jobs that take one hour each, but not so for twenty jobs that only take a few microseconds to complete.

At any rate, fine-tuning this to various hardware, operating systems and deployment settings is much easier than for manual thread creations. Plus, given a few (falsifiable) attributes of the functions called, it's also much safer.

Saturday Nov 04, 2006

OOo my Threading (2)

I've already talked a bit about status quo of threading in OOo, and listed some largely useful stuff that others have been doing to remedy the problems shared-state multi-threading poses.

Why not going further? If there's a way to know that a function is pure, or that an object is thread-safe, then it would be nice to have automatic parallelization employed. The underlying issue that needs to be controlled here are races - prohibiting unserialized access to shared state. So, either a function is pure, i.e. does not depend on shared state at all, or the called subsystem takes care of itself against concurrent modification (this is most easily achieved by the environment concept of the UNO threading framework: the UNO runtime implicitely serializes external access to thread-unsafe appartements).

Although C++ provides no portable ways to express those concepts on a language level, for pure functions, there's a way of using a limited subset of lambda expressions, that inhibit access to shared state on the syntax level. And it's perfectly possible to mark objects (or even subsets of a class' methods) to be thread-safe. One straight-forward way to do this are specializations of UNO interface references, i.e. ones that denote thread-safe components.

Given all of this, we can form statements that contain:

    unsafe method calls

    thread-safe method calls

    pure function calls

    impure function calls

So, in fact, a runtime engine could reason about which subexpressions can be executed concurrently, and which must be serialized. If you treat method calls as what they are, e.g. implicitely carrying a this pointer argument, a possible data flow graph might look like this:

new object1                      new object2
 |                                |
 +->object1::methodA              +->object2::methodA
            |                               |

new object3

That is, the this pointer is carried along as a target for modifications, and as soon as two methods have access to the same object, they need to be called sequentially. This does not apply for UNO interface calls or objects that are tagged as thread-safe, of course. To be specific, a forest of data flow trees can be generated, which defines a partial ordering over the subexpressions. If neither exp1<exp2 nor exp1>exp2 can be deduced from this ordering, those two subexpressions can be executed in parallel. Really basic stuff, that compiler optimizers do, as well - only that plain C/C++ doesn't provide that many clues to safely parallelize. From the example above, it is obvious that object3::methodA can be executed concurrently to all other methods, that object1::methodC must be execute strictly after object2::methodA, and that object1::methodA and object2::methodA can also be executed concurrently.

Okay, this is largely crack-smoking. But there is something to be made of it. Stay tuned.

Wednesday Oct 25, 2006

OOo my Threading...

This starts where Stephan's OOoCon talk left off - by taking his problem statement and maybe a gist of the proposed concepts into OOo's core (which happens to be C++ instead of Haskell. Which some folks might bemoan, but still, ~90% of our code consists of that language).

To recap, the free lunch is over for us application software developers, i.e. either OOo gets seriously multi-threaded soon, or we stop adding features and only optimize for speed, or OOo will get significantly slower within the next few years. But as you, the gurus and yours faithfully have noticed by now, becoming concurrent with osl::MutexGuard, osl::Thread & friends is tedious, error-prone, and neither race nor dead-lock free for a project of OOo's size. With the expressive power of a general programming language, it is even theoretically impossible to verify against these properties (for a sufficiently large project. OOo probably satisfies this constraint).

So, Stephan already showed what's possible in other languages. To boot, the trick is to raise the level of abstraction above digging-in-the-dirt, hand-made, explicit, and ugly locking or thread fiddling. Whereever possible, parallelism should be detected either implicitely, or at least denoted declaratively (instead of manually telling the machine what to do). John Ousterhout has a nice writeup about that. It's basically another instance of the lack of transparency and ease of use that is so ingrained with day-to-day tasks in C++. For stuff like object lifetime management, serialization and so on, boost provides transparent solutions. For large-scale concurrency management, the UNO threading framework does. For local, statement-level parallelization, there ain't much - or is there?

In fact, there actually are several attempts to extend C++ in that direction. To name a few:

    OpenMP, a C/C++ language extension, which provides things like parallel loops, work sharing or master and synchronization constructs. Drawbacks: controlled via pragmas, expressiveness limited, C origins (e.g. there's no proper way to propagate exceptions out of OpenMP threads).

    Boost.Act, a template framework that, together with Boost.Threads, provides functionality on the same conceptual level as OpenMP (it actually builds upon OpenMP internally). Furthermore, Boost.Act has so-called active objects, which provide their results lazily - one can construct a whole tree of calculations using active objects, which internally enqueue those, run them asynchronously, and force a wait on the result only if a non-active object actually requests it.

    Concur, featuring active objects, messages, parallel STL algorithms and futures (with futures being values that are possibly unevaluated at the point of instantiation, and get their actual result only eventually - forcing futures evaluation might block the requester until the value becomes available). Probably served as an inspiration for Boost.Act.

    TBB, or Intel ThreadBuildingBlocks, another (closed-source) C++ template framework with similar scope & features as Concur and Boost.Act.

    Two proposals for extending the C++ standard: Multithreading API for C++0X (and related: Transporting Values and Exceptions between Threads)

    KDE's ThreadWeaver, a threaded job queue that offers some nice concepts, such as explicit job dependencies.

From a coarse inspection, Boost.Act seems appealing (TBB really isn't an option for OOo, no?) - boost is high-quality anyway, and I think that the Action concept is a richer one than Futures.

Basically, the advanced approaches listed above employ declarative semantics, i.e. one no longer needs to create threads by hand, assign a thread function, and manually serialize access to shared state. Instead, one simply states that a loop body shall be executed in parallel:

parallel_transform( src.begin(), src.end(), dest.begin(), bind(_1 \* 42) );

This multiplies each element in src with 42, and stores the result in dest. How many threads are created, and where they are taken from is completely up to the implementation - a typical threading runtime on a machine with N cores will hold a static pool of N threads, each of them receiving a chunk of the parallel_transform workload from above. In the old way of doing multi-threading, this behaviour would potentially be spread over the whole code base (making it a cross-cutting concern, just like explicit method guarding is today - with thousands of osl::MutexGuard(m_aMutex) invocations all over the place).

One of the crucial points of doing stuff like the above in a safe way is to avoid hidden access to shared state. Access to the container can be serialized by parallel_transform, but if the called functor modifies shared objects concurrently, all bets are off. Thus, the listed frameworks usually have some safe-guards in place, most of them using a way of explicitely annotating free-standing functions and classes to be thread-safe (note that this tagging of thread-safeness can easily be achieved by means of the UNO threading framework as well - Kay already has different uno::Reference types on his list, to distinguish between thread-safe and thread-unsafe components during compile-time). Another important concept noteworthy here are pure functions, i.e. functions that, given the same input parameters, always return the same result, no matter where or when called (related, but not exactly the same are referentially transparent functions, which could also be exploited by an optimizing parallel execution runtime). It's always safe to call pure functions on disjunct parameters concurrently with no extra synchronization effort. A safe (because syntactically limited) way of defining pure functions is via boost's bind or lambda library. There, expressions that modify shareable state could simply be banned from the available syntax (they aren't - but it's possible to do that).

Need an example? Good, take the standard one, that so naturally asks for parallelization, that it's usually been sped up via SIMD in the past (you might have noticed that those parallel_something algorithms principally apply the same instructions (the functor to run) in parallel to all container elements - that's SIMD), namely operating on image pixel:

// color-convert an image from RGB to CMYK color space

This is the archetype of inherent parallelism, having a huge amount of data (image pixel), that all need to have the same operation applied to. Disregarding memory bandwidth for the moment, this snippet scales linearly with the number of available cores.

Note the conversion_parms parameter to the bound function, which can pose problems if rgb2cmyk silently const_casts it to something mutable. This is the dark corner of C/C++'s non-enforcable constness, and therefore, a minimal amount of developer discipline is necessary: if objects or functions employ magic behind their public interfaces (on-demand creation of data, buffering, caching, copy-on-write), this still needs to be made thread-safe explicitely. Fortunately, for each of the aforementioned items, there's a thread-safe helper available in OOo (use rtl::Static for late-initialization of static data, boost::shared_ptr for lazy initialization of class members, caching and buffering, and cow_wrapper for copy-on-write).

To be continued.

Friday Sep 01, 2006

First impression: Draw/Impress on cairocanvas

After much prep work (kudos to you, Armin!), things start to look like it's working:

(if you inspect this carefully, you'll see the difference between the preview image on the left side, which is still rendered the old way, and Draw's edit pane content to the right, which now gets displayed via cairocanvas).

Gory details, live demo, and plans how to move on at this year's OOoCon. If you can't wait: OOoWiki has a brief hacker's guide.

Sunday Jul 16, 2006

The graphics brethren: basegfx and basebmp

Started in late 2002, OOo's module basegfx is now a pretty mature collection of data structures and algorithms for graphics processing. The set of exported headers sport abstract data types such as:
  • points, vectors, boxes, ranges, bezier curves
  • matrices
  • polygons and collections of polygons ("poly-polygons")
Those are (where sensible) provided in the flavors
  • 1D, 2D, 3D
  • integer and double-precision floating point

In addition, especially the polygon ADTs are shipped with a set of associated algorithms (provided as free functions, instead of member functions):

  • a poly-polygon clipper (comparable to what gpc is doing)
  • a poly-polygon triangulator (convert a polygon to triangles, which can be fed to a GPU)
  • a function to generate stroked paths from a given polygon
  • plus a lot of basic, but very useful stuff (like svg im/export, scan conversion, arc and circle generation, etc.)

What's missing, though, is everything related to bitmaps. That's where basebmp comes into play, started a few weeks ago and set to provide for pixel processing, what basegfx provides for the symbolic graphics primitives.

I've initially started this work by using AGG, but quickly found out that adding pixel formats other than integer true color ones is pretty hard (Because the code is largely undocumented. Because the assumption that pixel values can be modified by shift left and right operations, and are PODs (i.e. non-class types) is all over the place. And because basically, a way to somehow get a hold to the famous extra level of indirection is missing. Which, as usual, would have solved this software engineering problem). Instead of starting from scratch, we've based this on vigra, a framework for generic image processing, which provides an elegant way to add 'special processing' before an actual pixel gets assigned a value.

In order to really get to the point here, I first need to sketch the way generic bitmap processing works: it's the idea of separating data access and the actual bitmap manipulation algorithm, using C++ templates. The concept is borrowed from the STL - bitmap pixel can be accessed by (two-dimensional) iterators, which in a very basic sense behave like a pointer:

while( i!=end )
    \*i++ = rand(); 

If i is a bitmap iterator, the pixel will be set to random values with this algorithm. The type of i can be chosen freely, and so can the actual pixel format of the bitmap. If one wants to e.g. add palette lookup to this algorithm (or scale the random values into the appropriate color range) without changing the algorithm's code, either extremely ugly hacks involving proxy objects are necessary (because operator\*() is expected to return an lvalue, which gets assigned the new value directly), or one needs a direct way to 'see' both the iterator and the new value prior to assignment:

while( i!=end )
    acc.set(i++, rand());

The acc variable is a pixel accessor in vigra lingo, which takes the iterator and the new value, and accesses the corresponding pixel. Now, adding palette lookup, scaling, blending, mask operations and much more is as easy as providing a different accessor to the algorithms is. Plus, it's completely orthogonal to both the iterator implementation, and the algorithm used. That means, you define a set of iterators (for packed-pixel images, for r8g8b8 images, for r5g6r5 images), a set of accessors (for true color pixel, for palette lookup, for blitting through a binary mask), and a set of algorithms. And you combine all those freely, i.e. you have the cartesian product of variations to choose from.

Sunday Apr 23, 2006

Source code grokking, 3rd installment

Following my habit to write something about source code analysis every few weeks, here comes the next installment under that topic: Initially set off by KaiB's pain with OOo's build slowness, I remembered some rules from John Lakos' 'Large-Scale C++ Software Design' tome - and to first off assess the state of affairs regarding include dependencies, found this:
  • DEPS, a tool to extract dependencies (e.g. includes) from source code
  • cinclude2dot, a tool plotting a project's include dependencies as a GraphViz dot graph
Somewhat related (because they also facilitate getting an overview over large code bases) are
  • Bloof, an infrastructure for processing of version control data, and
  • StatCVS, which extracts statistical HTML reports from a CVS repository.
  • (thanks Pavel for pointing me to it).

Tuesday Feb 28, 2006

cairocanvas is integrated!

Thanks to Radek's tireless work, CWS cairocanvas is now finally integrated (into the 2.0.3 code line). Took about eight months to get there, and now provides a nicely anti-aliased slideshow experience, with even the alpha effects running at a decent speed.
The downside: you need a fairly recent X server and cairo 1.0.x on your system.

Tuesday Feb 14, 2006

More on source code grokking tools

After having played a bit with cpd, I went searching for more elaborate source code analysis tools. And there are some, even in the FLOSS realm:
  • pygccxml, a gcc plugin to generate XMLized output from the AST (plus a python framework to actually work with this representation).
  • gccxml, quite similar to pygccxml, focused on extracting declarations only.
  • synopsis, a standalone parser plus analysis framework - a bit lacking for c++, which is inherently hard to parse.
  • opencxx, same as above (actually a predecessor/inspiration for synopsis)
  • rtlcheck, a gcc plugin that extracts code on the RTL (register-transfer level). Currently, seems to be used only for c, extending that to c++ should be both easy & rewarding (as rtlcheck appears to be the most elaborate project of all listed).
  • Also quite interesting, though with different scope (being a runtime-analysis tool) is the Design Recovery Tool, roughly characterizable as a means to get from UI actions to the code involved processing them.
I do like the gcc patching/plugin approach the most, knowing how complex it is to get c++ parsing even half-way right. And as is such a beast, code-size wise, getting all the help we could from automatic tools appears wise to me.

Friday Feb 03, 2006

Rework of COW implementations

Quite coincidentally, after having rambled about everybody rolling their own copy-on-write implementations, I've found severe issues with some of the COW implementations in VCL. Most notably, they employ 16 bit refcounts - which is a bad idea indeed, when used for objects that occur ten thousands of times, e.g. in metafiles. brrr.
Thus, this needs some larger fixups, work for that is going to happen in CWS cowfixes01.



« April 2014