The Story of OpenGrok - the Wicked Fast Source Browser

This blog entry tells the story of OpenGrok, the wicked fast source browser that powers the OpenSolaris source browser on OpenSolaris.org, and explains the technology behind it in more detail. It is a living blog entry: I will keep updating it as and when there are updates. There is no other technical documentation; if you are fond of reading in PostScript or PDF formats, I'll try to make one available soon.

The Revenge of the Binaries

Being a security sentry for all Sun products, I keep a watch on reports of newly discovered security holes and then check whether any Sun software is affected by them. Typically I would use cscope to find any fragments of code that I suspected to be vulnerable. In Solaris land, most of the ON (OS/Net) gate has cscope indexes pre-built nightly. But Solaris is not just ON; it is a WOS (Wad Of Stuff), with code and binaries flowing in from a dozen or so gates, and there are many more software products for which I had no ready access to cscope indexes or sources.

I gathered Solaris install images, JES install images, Sun Cluster and N1 software, and many other good and great things under Sun. I wrote a Perl script that would recurse through files and directory contents, extracting as much textual information as it could. For example, it would convert stream packages to directories; uncompress and extract tar, zip, gz, and bzip files; run dis(1) on ELF(3) files; and run strings(1) on other binary files. All this textual information was stored in a separate directory. This was gigabytes of characters. The initial Perl script was named rob.pl (revenge of the binaries). It took more than a day to complete the process, but it worked.
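The original was Perl, but the core loop fits in a few lines of any language. Here is a hypothetical Java sketch of the same idea, handling only the strings(1) case (the paths and the output layout are made up; the real script handled many more formats):

    import java.io.IOException;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.nio.file.Paths;

    // Sketch of rob.pl's idea: walk a tree and dump the textual content of
    // every binary into a parallel "text" directory. Illustrative only.
    public class TextExtractor {
        public static void main(String[] args) throws IOException {
            Path src = Paths.get(args[0]);   // tree of binaries and packages
            Path out = Paths.get(args[1]);   // parallel tree of extracted text
            Files.walk(src)
                 .filter(Files::isRegularFile)
                 .forEach(f -> extract(src, out, f));
        }

        static void extract(Path src, Path out, Path file) {
            try {
                Path dest = out.resolve(src.relativize(file).toString() + ".txt");
                Files.createDirectories(dest.getParent());
                // strings(1) pulls printable text out of arbitrary binaries;
                // rob.pl also ran dis(1) on ELF files, untarred archives, etc.
                new ProcessBuilder("strings", file.toString())
                        .redirectOutput(dest.toFile())
                        .start()
                        .waitFor();
            } catch (IOException | InterruptedException e) {
                System.err.println("skipping " + file + ": " + e);
            }
        }
    }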

Then, to be able to search through all the text files rob.pl generated, I needed a really good text search engine. I evaluated a number of them; most were meant for searching generic plain text or HTML documents, and they either performed poorly or did not give accurate results. Lucene won the race. It is not a search engine as such; it is a library for creating an inverted index and searching it, which you can use to build your own search engine to suit your own domain of files. Initially I just used the example search engine that came with Lucene, and was amazed at its speed. Then I looked at the Lucene website and found that Doug Cutting was the author of Lucene. To any student of information retrieval (aka the study of search engines) that is a familiar name. I took it for granted that Lucene was the right software to use.
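A minimal index-and-search round trip with Lucene looks roughly like this. This is a sketch against the modern Lucene API (the 2005-era API differed), and the file path and code snippet are made up for illustration:

    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.document.StringField;
    import org.apache.lucene.document.TextField;
    import org.apache.lucene.index.DirectoryReader;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.index.IndexWriterConfig;
    import org.apache.lucene.index.Term;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.search.ScoreDoc;
    import org.apache.lucene.search.TermQuery;
    import org.apache.lucene.store.ByteBuffersDirectory;
    import org.apache.lucene.store.Directory;

    public class LuceneHello {
        public static void main(String[] args) throws Exception {
            Directory dir = new ByteBuffersDirectory();  // in-memory index

            // Index: one Document per file, a stored path plus analyzed text.
            try (IndexWriter w = new IndexWriter(dir,
                    new IndexWriterConfig(new StandardAnalyzer()))) {
                Document doc = new Document();
                doc.add(new StringField("path", "usr/src/lib/libz/gzio.c", Field.Store.YES));
                doc.add(new TextField("full", "int gzprintf(gzFile file, ...)", Field.Store.NO));
                w.addDocument(doc);
            }

            // Search: a term lookup against the inverted index.
            try (DirectoryReader r = DirectoryReader.open(dir)) {
                IndexSearcher s = new IndexSearcher(r);
                for (ScoreDoc hit : s.search(new TermQuery(new Term("full", "gzprintf")), 10).scoreDocs) {
                    System.out.println(s.doc(hit.doc).get("path"));
                }
            }
        }
    }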

The Search Engines

Search engine technology is no rocket science; it has been around since long before modern computers arrived. Take a book and turn to the last few pages. Most likely you will find an "Index" section: an alphabetically sorted list of special terms and their page numbers. If you are looking for a word in the whole book, you would (1) look up the page numbers in the index, then (2) turn to each page and (3) scan through it looking for your word. This is much faster than going page by page, reading each line, searching for your word.

All modern search engines do the same thing. For Google, Yahoo, etc., the internet is a big book and each webpage is a page. They generate an index that, for a given word, gives you the list of pages containing that word. It is called an inverted index because, instead of mapping each document to the set of words it contains, it maps each word to the set of documents that contain it.
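The data structure behind this is tiny. A toy inverted index in Java, purely for illustration:

    import java.util.*;

    // A toy inverted index: word -> sorted set of document ids.
    public class InvertedIndex {
        private final Map<String, SortedSet<Integer>> postings = new HashMap<>();

        public void add(int docId, String text) {
            for (String word : text.toLowerCase().split("\\W+")) {
                if (!word.isEmpty()) {
                    postings.computeIfAbsent(word, k -> new TreeSet<>()).add(docId);
                }
            }
        }

        // Lookup is a single hash probe, not a scan of every document.
        public SortedSet<Integer> lookup(String word) {
            return postings.getOrDefault(word.toLowerCase(), Collections.emptySortedSet());
        }

        public static void main(String[] args) {
            InvertedIndex idx = new InvertedIndex();
            idx.add(1, "make the common case fast");
            idx.add(2, "the common case is lookup");
            System.out.println(idx.lookup("common"));  // [1, 2]
        }
    }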

The Program

The good thing about Lucene is that it does not understand document content; you have to write analyzers for your own content. So you have the control and freedom to interpret different kinds of files the way you want, and Lucene does a good job of storing your interpretation and searching it. To interpret a variety of languages in common terms, I came up with a very simple idea which fits all programming languages and executables. Programs have "symbol definitions", "symbol references", "human readable text", "path" and maybe "revision history". Be it Java class files starting with 0xcafebabe or C programs, it is possible to extract definitions, symbols, text, path and history.
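In Lucene terms, each file becomes a Document with one field per stream. A hedged sketch against the modern Lucene API; the field names here are my guesses for illustration, not necessarily OpenGrok's actual schema:

    import java.io.IOException;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.document.StringField;
    import org.apache.lucene.document.TextField;
    import org.apache.lucene.index.IndexWriter;

    // One Lucene Document per file, one field per extracted stream.
    public class FileIndexer {
        static void index(IndexWriter writer, String path, String definitions,
                          String symbols, String fullText, String history) throws IOException {
            Document doc = new Document();
            doc.add(new StringField("path", path, Field.Store.YES));     // exact match, stored
            doc.add(new TextField("defs", definitions, Field.Store.NO)); // from ctags
            doc.add(new TextField("refs", symbols, Field.Store.NO));     // identifiers used
            doc.add(new TextField("full", fullText, Field.Store.NO));    // readable text
            doc.add(new TextField("hist", history, Field.Store.NO));     // revision comments
            writer.addDocument(doc);
        }
    }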

The initial version of OpenGrok was a Perl script named rob.pl that extracted the above 5 streams and piped them to a Lucene search engine. rob.pl had become more intelligent: it now ran each file through ctags to extract definitions, parsed out program identifiers, and ran dis(1) on ELF files to extract labels and the symbols in call statements.
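Extracting definitions can be as simple as shelling out to ctags and reading back its cross-reference output. A simplified sketch, assuming an Exuberant Ctags binary on the PATH (real ctags output has more columns and needs more careful parsing):

    import java.io.BufferedReader;
    import java.io.IOException;
    import java.io.InputStreamReader;

    // Run ctags in cross-reference mode (-x) on one file and print each
    // definition as "name kind line".
    public class Definitions {
        public static void main(String[] args) throws IOException {
            Process p = new ProcessBuilder("ctags", "-x", args[0]).start();
            try (BufferedReader r = new BufferedReader(
                    new InputStreamReader(p.getInputStream()))) {
                String line;
                while ((line = r.readLine()) != null) {
                    String[] cols = line.split("\\s+", 4);  // name, kind, line, rest
                    System.out.println(cols[0] + " " + cols[1] + " " + cols[2]);
                }
            }
        }
    }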

I called it the Universal Program Search Engine, and I used it on my machine for quite some time to confirm or deny the existence of several vulnerabilities. For example, I used it to confirm that no code in Solaris 7 was calling gzprintf(), the function behind CVE-2003-0107. Now I could pinpoint the affected areas of Solaris for each newly discovered security hole.

Perl to Java

I chose Perl because it was very easy and quick to code in, with efficient data structures; it was really quick to prototype a design and make sure it actually worked. But I realized that choosing Perl for a long-term solution was a mistake. Perl is great for write-once, throwaway applications. When I profiled the pipeline, the Java process was mostly waiting for Perl to parse the text. Processing the entire program tree, sources and binaries, took almost half a day. After profiling the Perl code and some optimizations, I could reduce the time to about 8-9 hours. Perl was consuming too many compute cycles, despite my script being only a couple of hundred lines.

Then I realized that I could write custom Analyzers in Lucene that analyzed each type of program, emitting token streams directly to the Lucene indexer. That was very efficient and much faster: it now takes about 20 minutes to decompose the ON gate sources into the 5 streams and index them.
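Here is a minimal sketch of such a custom analyzer, written against the modern Lucene API (the 2005 API was different). The tokenizer simply emits program identifiers; OpenGrok's real analyzers do far more:

    import java.io.IOException;
    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.analysis.Tokenizer;
    import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

    // Illustrative analyzer whose token stream is the program identifiers.
    public final class IdentifierAnalyzer extends Analyzer {
        @Override
        protected TokenStreamComponents createComponents(String fieldName) {
            return new TokenStreamComponents(new IdentifierTokenizer());
        }

        private static final class IdentifierTokenizer extends Tokenizer {
            private final CharTermAttribute term = addAttribute(CharTermAttribute.class);

            @Override
            public boolean incrementToken() throws IOException {
                clearAttributes();
                int c;
                // Skip anything that cannot start an identifier.
                while ((c = input.read()) != -1 && !Character.isJavaIdentifierStart(c)) { }
                if (c == -1) return false;
                term.append((char) c);
                while ((c = input.read()) != -1 && Character.isJavaIdentifierPart(c)) {
                    term.append((char) c);
                }
                return true;
            }
        }
    }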

The other reason for choosing Java was that, after the release of 1.5, it is much like Perl in terms of data structures (aka Collections) and ease of programming, but more efficient and faster.

OpenSolaris

Initially an LXR-based solution was in mind for hosting the OpenSolaris code. Due to many issues with LXR, a web front end for cscope was also prototyped. Internally we use cscope extensively, and it has a good feature that is missing in many other code searching tools like LXR: when search results are shown, LXR dumps line numbers and gives no clue as to what matched, whereas cscope can show the lines where a symbol or function is used. Ctags does not show symbol references; cscope does. Cscope works perfectly for a small project, but it wasn't scalable to something as big as OpenSolaris.

Cscope had a number of other deficiencies. Its full text search is too slow; in the book analogy above, it does something like linearly searching through each page looking for your words. It could not AND two search conditions, which was a severe limitation if you wanted to search only within one branch of the source tree. To work around this limitation, we had pre-built cscope indexes in each major branch (like the library or kernel code), which is very expensive and inefficient. Also, it understood only C and C-like languages, and could not, for example, search for definitions in Makefiles.

The table below summarizes the state of the art at that time in code lookup and version control web interfaces deployed for large open source projects (like http://lxr.mozilla.org/ or http://cvs.gnome.org/):

Feature                            LXR    ctags   cscope   CVSview/web
Full text search                    Y               #
Definition search                   #       Y       Y
Identifier search                   Y               Y
Path search                         Y               Y
History search
Shows matching lines                        Y       Y
Hierarchical search
Query syntax like AND, OR, field:
Incremental update
Syntax highlighting / xref          Y                           #
Interface for SCM                                               Y
Open source                                 Y       Y           Y
Usable URLs                         Y       -       -
Individual file download                    -       -           Y
Changes at directory level                  -       -           #
Multi-language support              #       Y       #           -

Legend:
  Y : the feature is present
  # : the feature may be partly present
  - : not applicable
Note: OpenGrok provides all of the above features.

The Lucene-based program search engine readily had all the search features missing above. It did full text search the right way. It could identify definitions in any language that ctags understood. It could limit searches to a branch of the source tree (aka hierarchical search). It showed the matching lines for a search instead of just line numbers. And it did incremental updates of its index.
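With Lucene, restricting a search to a subtree is just one more clause ANDed into the query, rather than a separate pre-built index per branch. A sketch, reusing the hypothetical field names from the earlier snippets:

    import org.apache.lucene.index.Term;
    import org.apache.lucene.search.BooleanClause.Occur;
    import org.apache.lucene.search.BooleanQuery;
    import org.apache.lucene.search.PrefixQuery;
    import org.apache.lucene.search.Query;
    import org.apache.lucene.search.TermQuery;

    public class HierarchicalSearch {
        // Full-text term AND a path prefix: "word anywhere under subtree".
        // The per-branch cscope indexes become a single query clause.
        static Query inSubtree(String word, String subtree) {
            return new BooleanQuery.Builder()
                    .add(new TermQuery(new Term("full", word)), Occur.MUST)
                    .add(new PrefixQuery(new Term("path", subtree)), Occur.MUST)
                    .build();
        }
    }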

At that time the only things missing in OpenGrok were good color-coding of source and the display of version control information, which were not very difficult to implement. I evaluated a number of lex-like code generators, and JFlex turned out to be the fastest. JFlex is now used for tokenizing and for generating the hypertext cross-references.

Early this year it was deployed to host the OpenSolaris source code on opensolaris.org, accessible only to OpenSolaris pilot community members. Many things improved with the comments and feedback received from community members. By June 14th this year it was ready to broadcast millions of lines of OpenSolaris source code to the world. Now some consider OpenGrok one of the 10 most important things about OpenSolaris.

"Make the common case faster and easier"

Before being considered for OpenSolaris it was just a program written for my own use. If it was going to be deployed for OpenSolaris, publicly available, there would be hundreds of people and robots hitting it all the time. It had to be secure, fast, and efficient, as well as usable. OpenGrok then aimed to be a complete, integrated, and usable solution for source code search and browsing.

The secret behind OpenGrok is the principle "make the common case faster and easier". Call it Chandan's law (since Google says I am the first one to say so!). It combines a fundamental principle of computer architecture, "make the common case fast" (also known as Amdahl's law), with a fundamental principle of human-computer interaction, "make the common case easier". To build your software, pick the best tools available, like Lucene and Java. The result is what you now see on http://cvs.opensolaris.org. Believe me, it takes a lot more effort to make things simple than to make things complicated.

There are many more small features, and attention to detail, that make a big difference in usability and speed.

Refactoring

The initial working prototype was available to OpenSolaris pilot community members. It also easily withstood the slashdotting on the day OpenSolaris was launched. Allen, who owns the web servers, said the CPU usage was quite low even when we had tens of thousands of hits that day. That was a sigh of relief for me; my efforts to make it fast, secure, and scalable did not go to waste.

Then I started refactoring the code, to keep it organized and to help others extend its functionality or easily add support for more languages and version control systems. I used NetBeans to do the refactoring. NetBeans has changed a lot since the very first time I used it, many years ago; it makes Java programming a more pleasant experience. I could easily move code around, safely delete unused methods and classes, rename things in a more logical way, and do much more. Refactoring took time, and the core analysis engine changed beyond recognition. Once finished, it was open sourced under the name OpenGrok.
It is available for download here. The logo is an opening flower bracket (curly brace), a common block-start symbol in languages like C and Java. It has no closing bracket, to signify open source.

OpenGrok Internals

--- This section is to be written. It will explain how OpenGrok groks the source code, how to write your own language analyzer, and how to extend it to support a new source code versioning system. ---

In Retrospect

If people cannot easily understand how a piece of software works from its source code, or if they cannot easily and accurately search through it, then it is as bad as any proprietary software or data format that is difficult for the general public to understand.

OpenGrok in that sense is a true enabler for source code that is open source. It lets people easily and quickly find the source code, look at it, and understand the history and changes made to it. It makes a developer's life easier. It has certainly made my life easier when I am looking for security holes in software.

Comments:

I'm afraid I have to dispute your assertion that perl is slow. I can scan every delta of every file in ON in less than 15 minutes using perl, and I'm doing something *far* more complex than simple tokenization.

Posted by Alan Burlison on December 07, 2005 at 08:08 AM PST #

I am not making a generic statement about the slowness of Perl. It's all in the implementation; if you implement it efficiently I am sure it could perform well. Here I am talking about a first prototype, and I might have done many things inefficiently, including running dis and parsing its output rather than reading an ELF file directly.
In the big picture, Perl reading a file, processing it, and constructing a second file, then Java reading that second file and processing it again was inefficient; there is too much IO and intermediate memory required. It perhaps would have been much faster if everything was Perl. I chose to do everything in Java. I did evaluate a Perl port of Lucene, but do not recollect why I chose not to use it.
IIRC, my analysis of this particular slowness (ignoring the excessive IO time I mentioned above) was that I was storing the tokens in lists, so that at the end I could join them to be piped to Lucene. That was the bottleneck, since I had at least four big lists per file (full text, definitions, symbols, SCCS history); that is where most of the compute cycles were spent. I had no way to effectively control the memory management; Perl does not give me the system-level insight that C or Java gives.
It is easy to write inefficient programs in any language, and that is what makes people complain about slowness. However, I found Java giving me more control in fine-tuning and optimizing the bottlenecks.

Posted by Chandan on December 07, 2005 at 09:05 AM PST #

I downloaded OpenGrok and saw that there is a run.bat for generating the index under Windows. I successfully generated the index, modified the web.xml, and deployed the war file. However, when I click on the cross-references or try to browse the source I get the dreaded: Error 404: File not found! The requested resource is not available. Any ideas about what I might not be grokking? (Sorry, I couldn't resist.)

Posted by Richard Terry on January 05, 2006 at 01:39 AM PST #

Uhhhh... Never mind. I figured it out. :-) In my web.xml, my paths look like this:
DATA_ROOT C:\\Software\\Tmp
SRC_ROOT  C:\\Projects\\coxnet\\apps
In my run.bat, my paths look like this:
set SRC_ROOT=C:\\Projects\\coxnet\\apps
set DATA_ROOT=C:\\Software\\Tmp
set EXUB_CTAGS=C:\\Software\\ctags\\ctags.exe
My problem was that I had forward slashes ("/") in the web.xml file.

Posted by Richard Terry on January 05, 2006 at 03:57 AM PST #

Any idea when svn support will be available?

Posted by Mads Toftum on January 15, 2006 at 01:37 AM PST #

Is there a way to add wiki support like wikipedia for editing the source code and discussion form. It will be the encyclopedia for java or any language library, of course after agreeing on the license.

Posted by Faisal Akeel on January 15, 2006 at 03:41 PM PST #

Thanks for posting this. It was interesting reading the history behind OpenGrok.

Posted by Mike Kupfer on March 24, 2006 at 08:19 AM PST #

This is very good, please notify me newsletters by mail!!!

Posted by Arvind Umraniya on August 04, 2007 at 03:35 AM PDT #

How can I make my OpenGrok search results return XML-formatted or JSON-formatted output? I don't want to develop it myself.

Posted by Chris Hamm on September 19, 2007 at 12:04 AM PDT #

How do I make OpenGrok index soft links? Some of my source code has soft links like asm/arm/xxx.h where the actual directory is asm/arm-1136/xxx.h.

Any help?

Posted by Suresh on December 02, 2007 at 03:01 PM PST #

I'm looking to add Perforce support to the history app, and this post came up first when I asked Google for some kind of howto... sadly, the part of the post that suggests a roadmap for SCM/language extension is the TBD section.

Any chance of that being fleshed out soon? If not, I'll try and write something up as I go.

Thanks for making this open, we're limping along with the latest version of lxr (with mysql behind it), but the indexing times are killing our server.

Posted by leaf nunes on March 18, 2008 at 03:12 PM PDT #

I have added a Perforce History Reader to OpenGrok: http://www.emilmont.net/doku.php?id=java:opengrok

Posted by Emilio Monti on May 16, 2008 at 12:37 AM PDT #

Why do you list LXR in the feature matrix as not being open source? According to http://sourceforge.net/projects/lxr/ it is licensed under the GPL, which most certainly is open source. Unless I'm misunderstanding something you should correct that mistake.

Posted by alanc on July 30, 2008 at 12:06 AM PDT #

@alanc - LXR itself is open source, but without Glimpse (which isn't exactly closed, but isn't exactly GPL'd, either; see http://webglimpse.net/) you lose the freetext searches.

On the other hand, Swish-e based lxr *is* fully GPL, but loses the powerful freetext searching that glimpse provides.

Posted by leaf on July 30, 2008 at 12:20 AM PDT #
