Thursday May 21, 2009

extensible indexes in OpenDS



Extensible index is a new index introduced in OpenDS 2.0. Since there is not much of documentation available, I will explain here in detail. Before you dive into the content here, I will advise you to have a look at the wiki for collation matching rules.  A collation matching rule helps you in performing an internationalized search. As explained on wiki, you can perform equality, ordering (less than, less than equal to , greater than and greater than equal to) and substring-based searches using the matching rules. Another good thing about them is that you can index the matching rules for faster searches. I will also like to clarify that an extensible index is much more than a collation index; however, we have just collation-based indexes at the moment. For the sake of clarify I will use extensible index when I mean only a subset of it - collation indexes.  Interestingly, a collation index can contain all kinds of indexes ( except presence) I have blogged before. So you can configure to use an equality, substring or various kinds of ordering-based indexes. Before you get confused let us see how it works. Let us create an equality, substring and less-than-equal-to index for attribute "cn".


>>>> Configure the properties of the Local DB Index


        Property                        Value(s)


        -----------------------------------------------------------------------


    1)  attribute                       cn


    2)  index-entry-limit               4000


    3)  index-extensible-matching-rule  No extensible matching rules will be indexed.


    4)  index-type                      equality, extensible, ordering,presence, substring




    ?)  help


    f)  finish - apply any changes to the Local DB Index


    c)  cancel


    q)  quit




Enter choice [f]: 3


>>>> Configuring the "index-extensible-matching-rule" property


    The extensible matching rule in an extensible index.


    An extensible matching rule must be specified using either LOCALE or OID of the matching rule.


    Syntax:  LOCALE | OID - A Locale or an OID.




Do you want to modify the "index-extensible-matching-rule" property?


    1)  Keep the default behavior: No extensible matching rules will be indexed.


    2)  Add one or more values




    ?)  help


    q)  quit




Enter choice [1]: 2




Enter a value for the "index-extensible-matching-rule" property [continue]: en.eq




Enter another value for the "index-extensible-matching-rule" property


[continue]: en.sub


Enter another value for the "index-extensible-matching-rule" property


[continue]: en.lte


Enter another value for the "index-extensible-matching-rule" property


[continue]: 




>>>> Configuring the "index-extensible-matching-rule" property (Continued)


The "index-extensible-matching-rule" property has the following values:


    \*)  en.eq


    \*)  en.lte


    \*)  en.sub




Do you want to modify the "index-extensible-matching-rule" property?


    1)  Use these values


    2)  Add one or more values


    3)  Remove one or more values


    4)  Reset to the default behavior: No extensible matching rules will be


        indexed.


    5)  Revert changes




    ?)  help


    q)  quit




Enter choice [1]: 


Press RETURN to continue 


>>>> Configure the properties of the Local DB Index




        Property                        Value(s)


        -----------------------------------------------------------------------


    1)  attribute                       cn


    2)  index-entry-limit               4000


    3)  index-extensible-matching-rule  en.eq, en.lte, en.sub


    4)  index-type                      equality, extensible, ordering,


                                        presence, substring




    ?)  help


    f)  finish - apply any changes to the Local DB Index


    c)  cancel


    q)  quit




Enter choice [f]: 




The Local DB Index was modified successfully




Press RETURN to continue 




Rebuild the indexes because we already have some entries there with cn.


sin > bin/rebuild-index -b dc=example,dc=com -i cn


[21/May/2009:12:49:57 -0500] category=BACKEND severity=INFORMATION msgID=9437595 msg=Local DB backend ds-cfg-backend-id=userRoot,cn=Backends,cn=config does not specify the number of lock tables: defaulting to 53


[21/May/2009:12:49:57 -0500] category=BACKEND severity=INFORMATION msgID=9437594 msg=Local DB backend ds-cfg-backend-id=userRoot,cn=Backends,cn=config does not specify the number of cleaner threads: defaulting to 2 threads


[21/May/2009:12:49:57 -0500] category=JEB severity=NOTICE msgID=8847510 msg=Due to changes in the configuration, index dc_example_dc_com_cn is currently operating in a degraded state and must be rebuilt before it can be used


[21/May/2009:12:49:57 -0500] category=JEB severity=NOTICE msgID=8847497 msg=Rebuild of index(es) cn started with 3 total records to process


[21/May/2009:12:49:57 -0500] category=JEB severity=NOTICE msgID=8847493 msg=Rebuild complete. Processed 3 records in 0 seconds (average rate 115.4/sec)


See the list of all the databases including the extensible ones:


sin > bin/dbtest list-database-containers -n userRoot -b "dc=example,dc=com"


Database Name    Database Type  JE Database Name                   Entry Count


------------------------------------------------------------------------------


dn2id            DN2ID          dc_example_dc_com_dn2id            3


id2entry         ID2Entry       dc_example_dc_com_id2entry         3


referral         DN2URI         dc_example_dc_com_referral         0


id2children      Index          dc_example_dc_com_id2children      1


id2subtree       Index          dc_example_dc_com_id2subtree       1


state            State          dc_example_dc_com_state            23


aci.presence     Index          dc_example_dc_com_aci.presence     0


cn.equality      Index          dc_example_dc_com_cn.equality      2


cn.presence      Index          dc_example_dc_com_cn.presence      1


cn.substring     Index          dc_example_dc_com_cn.substring     7


cn.ordering      Index          dc_example_dc_com_cn.ordering      2


cn.en.shared     Index          dc_example_dc_com_cn.en.shared     2     ----> extensible index


cn.en.substring  Index          dc_example_dc_com_cn.en.substring  7   ---> extensible index




An extensible (only collation-based) index database is named as "attribute.locale.index_type". You may want to note that an equality or ordering index will created a "shared" index database (cn.en.shared) because the content is same in both.  Let us dump the content of each.


Using dbtest


sin > bin/dbtest dump-database-container -n userRoot -b "dc=example,dc=com" -d cn.en.shared


Indexed Value (10 bytes): STU


Entry ID List (8 bytes): 2 




Indexed Value (12 bytes): hfXe


Entry ID List (8 bytes): 3 






Total Records: 2


Total / Average Key Size: 24 bytes / 12 bytes


Total / Average Data Size: 16 bytes / 8 bytes


sin > bin/dbtest dump-database-container -n userRoot -b "dc=example,dc=com" -d cn.en.substring


Indexed Value (6 bytes): STU


Entry ID List (8 bytes): 2 




Indexed Value (4 bytes): TU


Entry ID List (8 bytes): 2 




Indexed Value (2 bytes): U


Entry ID List (8 bytes): 2 




Indexed Value (4 bytes): Xe


Entry ID List (8 bytes): 3 




Indexed Value (2 bytes): e


Entry ID List (8 bytes): 3 




Indexed Value (6 bytes): fXe


Entry ID List (8 bytes): 3 




Indexed Value (8 bytes): hfXe


Entry ID List (8 bytes): 3 






Total Records: 7


Total / Average Key Size: 34 bytes / 4 bytes


Total / Average Data Size: 56 bytes / 8 bytes


 


Using dbdump


sin > java com.sleepycat.je.util.DbDump -h db/userRoot/ -p -s dc_example_dc_com_cn.en.shared


VERSION=3


format=print


type=btree


dupsort=0


HEADER=END


 \\00S\\00T\\00U\\00\\00\\00\\00


 \\00\\00\\00\\00\\00\\00\\00\\02


 \\00h\\00f\\00X\\00e\\00\\00\\00\\00


 \\00\\00\\00\\00\\00\\00\\00\\03


DATA=END


 


sin > java com.sleepycat.je.util.DbDump -h db/userRoot/ -p -s dc_example_dc_com_cn.en.substring


VERSION=3


format=print


type=btree


dupsort=0


HEADER=END


 \\00S\\00T\\00U


 \\00\\00\\00\\00\\00\\00\\00\\02


 \\00T\\00U


 \\00\\00\\00\\00\\00\\00\\00\\02


 \\00U


 \\00\\00\\00\\00\\00\\00\\00\\02


 \\00X\\00e


 \\00\\00\\00\\00\\00\\00\\00\\03


 \\00e


 \\00\\00\\00\\00\\00\\00\\00\\03


 \\00f\\00X\\00e


 \\00\\00\\00\\00\\00\\00\\00\\03


 \\00h\\00f\\00X\\00e


 \\00\\00\\00\\00\\00\\00\\00\\03


DATA=END


 


If you look carefully you would find that the contents of an equality index and extensible equality indexes are different for same value of cn( i.e. 'user" ). Like a normal equality index, the extensible equality index also contains a normalized value as the key. However, the value is normalized according to Unicode standards (NFKC) which requires using java.text.Normalizer.


 

substring index in OpenDS

A substring index has the largest footprint in terms of the database size. When a substring index is created on a particular value, a number of combinations are tried to generate the keys. For example, a value "user" will have the keys such as : "user", "ser", "er"  and  "r".  A substring filter (such as "u\*s\*e\*r" ) typically has 3 parts


subInitial  -- It is the value before the \* , i.e. "u"


subAny -- List of all the values between subInitial and subFinal separated by \*, i.e. ["s", "e"]


subFinal -- Final value after the \*,  i.e. "r"


You are free to select any combinations of the these components. For example, "u\*r", "\*r"  and "u\*" etc. However, you may want to use a filter intelligently so that it filters out most of the entries while building a list of EntryIDs. If you are familiar with index limit, you may be aware that if the size of the list crosses this mark, the indexing won't be used and it may be a costly search.  See how the database dump looks like for substring indexes:


using dbtest 


sin > bin/dbtest dump-database-container -n userRoot -b "dc=example,dc=com" -d cn.substring


Indexed Value (3 bytes): abc


Entry ID List (8 bytes): 2 




Indexed Value (2 bytes): bc


Entry ID List (8 bytes): 2 




Indexed Value (1 bytes): c


Entry ID List (8 bytes): 2 




Indexed Value (2 bytes): er


Entry ID List (8 bytes): 3 




Indexed Value (1 bytes): r


Entry ID List (8 bytes): 3 




Indexed Value (3 bytes): ser


Entry ID List (8 bytes): 3 




Indexed Value (4 bytes): user


Entry ID List (8 bytes): 3 


Total Records: 7


Total / Average Key Size: 17 bytes / 2 bytes


Total / Average Data Size: 56 bytes / 8 bytes


Using dbdump 


 sin > java com.sleepycat.je.util.DbDump -h db/userRoot/ -p -s dc_example_dc_com_cn.substring


VERSION=3


format=print


type=btree


dupsort=0


HEADER=END


 abc


 \\00\\00\\00\\00\\00\\00\\00\\02


 bc


 \\00\\00\\00\\00\\00\\00\\00\\02


 c


 \\00\\00\\00\\00\\00\\00\\00\\02


 er


 \\00\\00\\00\\00\\00\\00\\00\\03


 r


 \\00\\00\\00\\00\\00\\00\\00\\03


 ser


 \\00\\00\\00\\00\\00\\00\\00\\03


 user


 \\00\\00\\00\\00\\00\\00\\00\\03


DATA=END



presence index in OpenDS

A presence index is used when the search filter uses an asterisk to ascertain that there is a value at the back-end for that attribute. For example, to search an existing entry cn=user,dc=example,dc=com if I use a filter "cn=\*", a presence index will be used behind-the-scene. Note that if you change the filter with a prefix or suffix (i.e. "cn=u\*", "cn=\*r" and  "cn="u\*r" etc) , it will rather use the substring index. 


While creating our entry, server will create a presence index key-value pair for all the attributes which have a presence index configured. Interestingly a presence index has "+" as the key  and the list of EntryIDs as the value. It means that each attribute will have only one record with a list of EntryIDs corresponding to "+" key. 


When you search for this entry using "cn=\*" filter, it figures out a presence index needs to be used and it fetches all values from the index database for "+" key ( since there is only 1 record). Once it retrieves set of EntryIDs from the database, the corresponding entry is grabbed from id2entry database and matched against filter prior to being returned to the client.


 As you see above, maintaining a presence index is quite costly both in terms of writing and reading. A substring index may be a better choice depending on the keys.


Dumping presence index using dbtest 


sin > bin/dbtest dump-database-container -n userRoot -b "dc=example,dc=com" -d cn.presenceIndexed Value (1 bytes): +


Entry ID List (16 bytes): 2 3 


Total Records: 1


Total / Average Key Size: 1 bytes / 1 bytes


Total / Average Data Size: 16 bytes / 16 bytes


Dumping presence index using dbdump 


 sin > java com.sleepycat.je.util.DbDump -h db/userRoot/ -p -s dc_example_dc_com_cn.presence


VERSION=3


format=print


type=btree


dupsort=0


HEADER=END


 +


 \\00\\00\\00\\00\\00\\00\\00\\02\\00\\00\\00\\00\\00\\00\\00\\03


DATA=END

ordering index in OpenDS

An ordering index is used for searches involving ">=" or "<=" operators. An attribute value is normalized using attribute's ordering matching rule to generate the key. Since the normalization techniques is same for both equality and ordering matching rules, both the indexes end up having the same data. This is something which can be avoided by new design which is already in place for extensible indexes. Hopefully, OpenDS 3.0 will avoid using the separate ordering indexes when not needed. 

equality index in OpenDS

An equality index is used when the search filter uses an equality match. For example, to search an existing entry cn=user,dc=example,dc=com if I use a filter "cn=user", an equality index will be used behind-the-scene. Let us see how an equality index works from its creation to the search. Assuming the first entry for this back-end is "dc=example,dc=com", the corresponding EntryID will be 1. If we create our entry next to this base dn, our entry with a dn: cn=user,dc=example,dc=com will have an EntryID as 2.


While creating our entry server will create an equality index key-value pair for all the attributes which have an equality index configured. Since attribute "cn" has it by default, the server will find its default equality matching rule and normalize the value "user" using that rule. This normalized value is converted into bytes and is available as a key. As rightly guessed, the value is the entryID which we just saw above i.e. 2. EntryID is generated prior to inserting anything into the database.


When you search for this entry using "cn=user" filter, it figures out an equality index needs to be used and searches the equality index database for a key with value "user". Once it retrieves any EntryID from the database, the actual entry is grabbed from id2entry database.


What if we add another entry which has a cn with a value = "user"? Will it overwrite our key in the equality index? No.. it doesn't. I was incorrect while saying that the value in the key-value pair is just an EntryID; it is actually a list of EntryIDs. If there are multiple EntryIDs corresponding to a key, all of them are concatenated. To find the number of EntryIDs, you can divide the length of concatenated byte array with 8. See an example:


I have following 3 entries in my userRoot backend:


dn: dc=example,dc=com
dn: cn=abc,dc=example,dc=com
dn: cn=user,dc=example,dc=com


Get the list of all the databases in this backend


sin > bin/dbtest list-database-containers -n userRoot


Database Name Database Type JE Database Name Entry Count
--------------------------------------------------------------------------------------
Base DN: dc=example,dc=com
dn2id DN2ID dc_example_dc_com_dn2id 3
id2entry ID2Entry dc_example_dc_com_id2entry 3
referral DN2URI dc_example_dc_com_referral 0
id2children Index dc_example_dc_com_id2children 1
id2subtree Index dc_example_dc_com_id2subtree 1
state State dc_example_dc_com_state 20
aci.presence Index dc_example_dc_com_aci.presence 0
cn.equality Index dc_example_dc_com_cn.equality 2
Dump the content of the equality index database


sin > bin/dbtest dump-database-container -n userRoot -b "dc=example,dc=com" -d cn.equality
Indexed Value (3 bytes): abc
Entry ID List (8 bytes): 2
Indexed Value (4 bytes): user
Entry ID List (8 bytes): 3


Using dbdump utility to see non-opends representation of index database contents


sin > export CLASSPATH=/Users/sin/opends/b2.0/opends/build/package/OpenDS-2.0.0/lib/OpenDS.jar:/Users/sin/opends/b2.0/opends/build/package/OpenDS-2.0.0/lib/je.jar:$CLASSPATH


sin > java com.sleepycat.je.util.DbDump -h db/userRoot/ -p -s dc_example_dc_com_cn.equality


VERSION=3
format=print
type=btree
dupsort=0
HEADER=END
abc
\\00\\00\\00\\00\\00\\00\\00\\02
user
\\00\\00\\00\\00\\00\\00\\00\\03
DATA=END


You can see the byte values of entryID against each of the keys.



Indexing in OpenDS

OpenDS wiki provides a good introduction to the index databases and their purposes.  I'll try to dig further into how a particular index database looks like at the sleepycat end. For those who are not familiar with sleepycat or Berkley DB, it is the embedded DB used by openDS. I will try to explain one index in each of my posts. If you are coming from the relational database background, you'd probably be aware of the benefits of the indexing. Typically an index is a B+ Tree where each node stores a key and a value pair.  


Before we begin with the index, it is a prerequisite to know about the EntryID. An EntryID is a long value which is unique for each entry inside an openDS database. While all of the indexes may have different keys based on types of indexes, all of them have the same values : EntryID. The idea behind an index is to search using a key and grab a list of EntryIDs. Further down, this list of EntryIDs is used to retrieve all the entries from another database called id2entry and the filter is applied on each one of those entries to see if they match. The matched ones are returned to the client.  OpenDS provides following indexes:


children index -- system index ( not used by users directly )


subtree index -- system index  (not used by users directly)


equality index -- user configurable


substring index --user configuration


ordering index  -- user configurable


presence index -- user configurable


extensible index -- user configurable


System indexes are used anyway. OpenDS needs to maintain them to build information related to scope of the entries. Since these are not attribute-based, users don't have any control over them. Rest of the indexes are available per attribute and hence available for configuration. For more information on how to create/modify the indexes above have a look at the openDS wiki.  


When you create an entry under a back-end, all the configured indexes are updated with the new entry information. For example, if you add an entry cn=user,dc=example,dc=com , equality indexes will have normalized value(s) of attribute "cn" as the key(s) and its EntryID (say 2), as the value. 



About

This is the blog of a software engineer, specialized in identity management. Kunal Sinha works in Directory Services Engineering (OpenDS) team from Austin,Texas.

Search

Archives
« May 2009
SunMonTueWedThuFriSat
     
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
22
23
24
25
26
27
28
29
30
31
      
Today
Bookmarks