OpenGrok Internals

OpenGrok contains these main packages:

  • org.opensolaris.opengrok.analysis: Responsible for analyzing programs, source files, archives (ZIP, tar), documents (man pages, XML and HTML files), images, etc.
  • org.opensolaris.opengrok.index: Builds or updates the Lucene index, recursively going down the directory tree.
  • org.opensolaris.opengrok.search: Utilities that provide an interface to search results, matched context, etc.
  • org.opensolaris.opengrok.history: Abstraction for source code version control.
  • org.opensolaris.opengrok.web: Utility routines used by the webapp.

While the source is the ultimate reference guide for how it works, here is a brief illustration of certain mechanisms.

Analysis

Think of the package as a separate department having engineers (called Analyzers) who are subject matter experts in each type of program or file. CAnalyzer knows something about the C programming language. The ELF Analyzer knows how to read an ELF symbol and string table from an ELF (executable) file.

Central to analysis is the Analyzer Guru. He knows all the Analyzers by name. Most Analyzers sit in static offices (that is, they are created once and reused), because hiring a new Analyzer for each analysis job was a bit expensive.

When the Indexer thinks he needs to analyze a file to be indexed, he calls Analyzer Guru. Analyzer Guru knows exactly who to send the file to. He creates a blank Lucene Document and a FileInputStream and hands both to the appropriate Analyzer. He knows the Analyzers by name, because each of them initially tells him the file extensions and magic numbers of the file types they are experts in.
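The Guru's lookup can be pictured roughly like this. It is only a minimal sketch, not OpenGrok's actual code: the class `GuruSketch`, its methods and the analyzer names passed around as plain strings are all hypothetical, and trying the extension before the magic number is an assumption of the sketch.

```java
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;

/** Hypothetical registry that picks an analyzer by file extension or magic number. */
final class GuruSketch {
    // Extension (upper case, without the dot) -> analyzer name.
    private final Map<String, String> byExtension = new LinkedHashMap<>();
    // Leading bytes of the file (decoded one byte per char) -> analyzer name.
    private final Map<String, String> byMagic = new LinkedHashMap<>();
    private final String fallback;

    GuruSketch(String fallback) {
        this.fallback = fallback;
    }

    /** An analyzer introduces itself with an extension it understands. */
    void registerExtension(String extension, String analyzer) {
        byExtension.put(extension.toUpperCase(Locale.ROOT), analyzer);
    }

    /** An analyzer introduces itself with a magic number it understands. */
    void registerMagic(byte[] magic, String analyzer) {
        byMagic.put(new String(magic, StandardCharsets.ISO_8859_1), analyzer);
    }

    /** Assumption of this sketch: extension first, then magic, then the fallback. */
    String find(String fileName, byte[] head) {
        int dot = fileName.lastIndexOf('.');
        if (dot >= 0 && dot < fileName.length() - 1) {
            String ext = fileName.substring(dot + 1).toUpperCase(Locale.ROOT);
            String analyzer = byExtension.get(ext);
            if (analyzer != null) {
                return analyzer;
            }
        }
        String content = new String(head, StandardCharsets.ISO_8859_1);
        for (Map.Entry<String, String> entry : byMagic.entrySet()) {
            if (content.startsWith(entry.getKey())) {
                return entry.getValue();
            }
        }
        return fallback;
    }
}
```

With `registerExtension("c", "CAnalyzer")` and `registerMagic(new byte[] {0x7f, 'E', 'L', 'F'}, "ElfAnalyzer")`, a call to `find("main.c", ...)` picks the C expert, while a file named `a.out` whose first bytes are the ELF magic goes to the ELF expert.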

On getting a document, an Analyzer sends it to his boss (super) to get it filled with more Lucene fields. A boss sends it to his boss, and so on. Higher up the management chain, they do less (specialized) work. For example, while CAnalyzer knows a great deal about C keywords and comments and knows how to generate a hypertext cross-reference of a C file, his boss Plain Analyzer can only read plain text files. He has poor punctuation skills and ignores most of the punctuation marks that he thinks are unnecessary. He knows what websites and email addresses are, so when he is asked to cross-reference a file, he just hyperlinks the URLs he can recognize.

Plain Analyzer also does an important job. He outsources finding definitions in a program file to Exuberant Ctags. He sends the file out to Exuberant Ctags, which tells him exactly which symbols are considered definitions. Plain Analyzer adds that information to the Lucene Document.

His boss, File Analyzer, is the head of the department. Everyone reports to him. All he does is stamp the Lucene Document with the file date and name and send it back.
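The management chain is plain Java inheritance. Here is a minimal sketch of it, under loud assumptions: a `Map` stands in for the Lucene Document, the field names (`path`, `date`, `full`, `refs`) and the `*Sketch` class names are hypothetical, the ctags step is left out, and the C expert's "knowledge" is reduced to skipping a handful of keywords.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;
import java.util.StringJoiner;

/** Head of the department: only stamps the document with file name and date. */
class FileAnalyzerSketch {
    void analyze(Map<String, String> doc, String path, long lastModified, String text) {
        doc.put("path", path);
        doc.put("date", Long.toString(lastModified));
    }
}

/** Reads plain text; in this sketch he only stores the full text. */
class PlainAnalyzerSketch extends FileAnalyzerSketch {
    @Override
    void analyze(Map<String, String> doc, String path, long lastModified, String text) {
        super.analyze(doc, path, lastModified, text); // send it to the boss first
        doc.put("full", text);
    }
}

/** The C expert: adds the symbols that are not C keywords (tiny keyword list). */
class CAnalyzerSketch extends PlainAnalyzerSketch {
    private static final Set<String> KEYWORDS = new HashSet<>(Arrays.asList(
            "int", "char", "void", "return", "if", "else", "while", "for"));

    @Override
    void analyze(Map<String, String> doc, String path, long lastModified, String text) {
        super.analyze(doc, path, lastModified, text); // send it to the boss first
        StringJoiner refs = new StringJoiner(" ");
        for (String token : text.split("\\W+")) {
            if (!token.isEmpty() && !KEYWORDS.contains(token)) {
                refs.add(token);
            }
        }
        doc.put("refs", refs.toString());
    }
}
```

Calling `analyze` on a `CAnalyzerSketch` fills the document from the top of the chain downwards: File Analyzer's stamp lands first, then the plain text, then the C-specific field.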

This package is quite modular. It can be extended to analyze any program type. To add an Analyzer for a new programming language or file type, just extend or copy one of the suitable Analyzers and introduce his name to Analyzer Guru.

To get the version control history, Analyzer Guru calls his old friend, the good old History Guru. This guy is much older than Analyzer Guru, and his silky long beard almost touches the ground. He has several assistants called History Readers. They just read the version control log history for a given file or directory. Currently there are assistants who can read Subversion, CVS and SCCS logs. To support a new type of source code version control, just hire a new assistant reader who can read its logs.
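Hiring a new assistant amounts to one more entry in a registry. The sketch below is hypothetical throughout: `HistoryGuruSketch`, `hire` and `readerFor` are invented names, readers are represented by plain strings, and choosing a reader by the metadata directory found next to the file (`.svn`, `CVS`, `SCCS`) is an assumption of the sketch, not something the post describes.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Optional;
import java.util.Set;

/** Hypothetical History Guru: maps a version control marker directory to a reader. */
final class HistoryGuruSketch {
    // Marker directory name -> name of the reader who understands its logs.
    private final Map<String, String> readers = new LinkedHashMap<>();

    /** Hire a new assistant for one more kind of version control. */
    void hire(String markerDirectory, String readerName) {
        readers.put(markerDirectory, readerName);
    }

    /** Pick the first reader whose marker directory appears among the entries. */
    Optional<String> readerFor(Set<String> directoryEntries) {
        for (Map.Entry<String, String> entry : readers.entrySet()) {
            if (directoryEntries.contains(entry.getKey())) {
                return Optional.of(entry.getValue());
            }
        }
        return Optional.empty();
    }
}
```

A directory listing with no known marker yields an empty result, which a caller could treat as "no history available".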

Analyzer Guru, on getting his Document back filled with information such as definitions, full text, symbols and history, sends it to the Indexer.

Index Update

The index is the inverted index of all files in the source trees. For every unique word in the source tree it contains a list of files where the word can be found. While tools like cscope and ctags also build indexes, they can't update them incrementally; they just rebuild them from scratch. OpenGrok uses modern indexing methods and can update its index incrementally (i.e., process only the files changed, added or deleted since the last index build).

To know which files were changed, created or deleted, we keep a sorted list of the files in the index, each tagged with the file's last-modified date (converted into an alphanumeric string such that dateString(date1) < dateString(date2) if date1 < date2). We traverse the file tree depth first, sorting the child nodes at each stage, which gives us a sorted list of file paths.
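Here is one way the date string and the sorted traversal could look. It is a sketch under assumptions, not the actual OpenGrok code: `SortedWalkSketch`, `key` and `walk` are hypothetical names, the date is encoded as fixed-width hex (any fixed-width, zero-padded encoding would do), and the path separator is swapped for the NUL character so that plain string comparison of the keys agrees with the order of the traversal.

```java
import java.io.File;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

/** Hypothetical helper that lists files in sorted order, tagged with their dates. */
final class SortedWalkSketch {
    /** Fixed-width hex: for non-negative times, string order equals numeric order. */
    static String dateString(long millis) {
        return String.format("%016x", millis);
    }

    /**
     * Key for one file: the path with '/' replaced by NUL, then NUL and the date.
     * NUL sorts before every character in a file name, so "a/x.c" comes before "a.c",
     * just as directory "a" comes before file "a.c" when siblings are sorted by name.
     */
    static String key(String path, long millis) {
        return path.replace('/', '\0') + '\0' + dateString(millis);
    }

    /** Depth-first walk, sorting the children of every directory by name. */
    static void walk(File dir, String parentPath, List<String> out) {
        File[] children = dir.listFiles();
        if (children == null) {
            return; // not a directory, or not readable
        }
        Arrays.sort(children, Comparator.comparing(File::getName));
        for (File child : children) {
            String path = parentPath + "/" + child.getName();
            if (child.isDirectory()) {
                walk(child, path, out);
            } else {
                out.add(key(path, child.lastModified()));
            }
        }
    }
}
```

Calling `walk(root, "", keys)` on the source root collects the keys for everything below it. A file that was modified keeps its path but gets a later date, so its new key differs from the one stored in the index.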

It boils down to finding the differences between two sorted lists (the list of files on disk vs. the list of files in the index). The left-hand side is the file tree; the right-hand side is its index, which needs to be updated. To begin with, assume we had a correctly indexed tree yesterday.

Let us say that today we removed a makefile, added a new file foo.c and modified frotz.c. During the first pass we see that makefile-yesterday and frotz.c-yesterday on our index list do not match anything on the tree-traversal list. So we delete those lines (i.e., documents) from the index.

During the second pass we see that foo.c-today and frotz.c-today are new, and we add those documents to the index.

The Lucene inverted index can be opened either to add documents or to delete existing documents, but not both at the same time. To update a document you must delete it first, close the index and then add it again. To make updates faster, we first delete from the index all the files that were changed or deleted in the source tree. In the second pass we add all the documents not in the index.
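The comparison of the two sorted lists is an ordinary merge. The sketch below is hypothetical (`IndexDiffSketch` is an invented name) and, unlike the two passes described above, it computes both work lists in a single merge; the deletions would then be applied in the first pass and the additions in the second. It assumes both lists are sorted with the same string ordering. The entries reuse the name-day labels from the example, plus an unchanged bar.c that is made up for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical diff of two sorted key lists: files on disk vs. documents in the index. */
final class IndexDiffSketch {
    final List<String> toDelete = new ArrayList<>(); // first pass: stale documents
    final List<String> toAdd = new ArrayList<>();    // second pass: new or changed files

    static IndexDiffSketch diff(List<String> onDisk, List<String> inIndex) {
        IndexDiffSketch result = new IndexDiffSketch();
        int i = 0;
        int j = 0;
        while (i < onDisk.size() && j < inIndex.size()) {
            int cmp = onDisk.get(i).compareTo(inIndex.get(j));
            if (cmp == 0) {            // unchanged: same path, same date
                i++;
                j++;
            } else if (cmp < 0) {      // on disk only: needs to be added
                result.toAdd.add(onDisk.get(i++));
            } else {                   // in the index only: needs to be deleted
                result.toDelete.add(inIndex.get(j++));
            }
        }
        while (i < onDisk.size()) {
            result.toAdd.add(onDisk.get(i++));
        }
        while (j < inIndex.size()) {
            result.toDelete.add(inIndex.get(j++));
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> index = Arrays.asList(
                "bar.c-yesterday", "frotz.c-yesterday", "makefile-yesterday");
        List<String> disk = Arrays.asList(
                "bar.c-yesterday", "foo.c-today", "frotz.c-today");
        IndexDiffSketch d = diff(disk, index);
        System.out.println("pass 1, delete: " + d.toDelete);
        System.out.println("pass 2, add:    " + d.toAdd);
    }
}
```

For this input the delete list is frotz.c-yesterday and makefile-yesterday, and the add list is foo.c-today and frotz.c-today. The modified file shows up on both lists, which matches the rule that an update is a delete followed by an add.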

Comments:

Chandan: IIRC opengrok can't be used when the src files are links, right? -- prasad

Posted by Prasad on February 22, 2006 at 04:36 AM PST #

Yes, I made it ignore symlinks. If the symlinks are within the source tree, you don't lose anything. The problem is that it can't detect duplicate files. Maybe indexing symlinks can be optionally enabled.

Posted by Chandan on February 22, 2006 at 10:32 AM PST #

The reason I ask is, our SCM (Telelogic's CCM) uses symlinks and I believe Clearcase does it too. Do you have any plans on supporting these commercial SCM's? -- prasad

Posted by Prasad on February 23, 2006 at 09:50 PM PST #

If the symlinks are part of the revision history, then it is up to the corresponding History plugin to handle them. The ignoring of symlinks I mentioned applies only to source files.

Posted by Chandan on February 23, 2006 at 11:46 PM PST #

Recently I set up the source code search engine for the JES4 Installer (CVS repository) using OpenGrok at: http://arc.india.sun.com/jes4src/xref/other_build/startbuild.sh But when I click on History, it says "Error 404: File not found". Why is that? FYI, I observed this behavior with the CVS repository only, whereas with the SCCS (or TeamWare workshop) repository the History works fine. JES3 Installer (SCCS repository): http://arc.india.sun.com/jes3src/xref/jes3_apr11/external/other_build/startbuild.sh Also check out my blog post that makes use of SJS Web Server along with OpenGrok: http://blogs.sfbay/roller/page/nandak#setup_your_own_source_code

Posted by Nanda Kishor on May 09, 2006 at 11:23 PM PDT #

The CVSROOT must be available locally. Check the entries under .CVS directory.

Posted by chandan on May 10, 2006 at 12:39 AM PDT #

Hi. I am currently trying to extend OpenGrok's functionality to include VB. I understand the process of doing this but don't have any experience with Flex and really don't understand how to contribute to the existing source code. I have been reading up on lexical analyzers but was wondering how I write Java code to extend OpenGrok and how the Flex stuff works, and also what level of knowledge of Flex I need to write new classes for OpenGrok. I'm looking through the existing C and Java Analyzers, SymbolTokenizers and Xrefs that go along with the languages, and I don't understand how I can take the next step by adding the VB part. Thanks

Posted by Patrick Gallagher on February 15, 2007 at 11:00 PM PST #
