Subscribe

Share

Database, SQL and PL/SQL

Old Dog, New Tricks

Take advantage of SQL extensions for hierarchy processing.

By Connor McDonald

January/February 2018

Many blog posts and magazine articles are about the newest releases of a database and the newest software features, focusing on new facilities and functions that have not been seen in commercial usage yet. But in this article, I’ll take a look back at a feature that has been in Oracle Database since way back in Oracle 2: processing hierarchical data by using the CONNECT BY syntax in SQL. And I’ll focus on some of the additional functionality that has been delivered over the course of the major releases of Oracle Database.

Fundamentals

Most database developers are aware of the fundamentals of the hierarchical query syntax. It is one of the first things they learn after they understand rudimentary data manipulation language (DML) elements (SELECT, INSERT, UPDATE, DELETE).

Listing 1 shows an example of the most typical introduction to the hierarchical query syntax. The two elements that identify a SQL query as a hierarchical query are the CONNECT BY and the START WITH clauses. The syntax reads very much like the intended action of the query:

START WITH. The query will “start with” a nominated position (or set of rows) in the hierarchy. In this case, it’s the MANAGER being null, meaning an employee who has no reporting manager, in other words, the CEO or top-level employee in the company. In this case, KING is the only person in this company who has no manager.

CONNECT BY. The query will then “connect” the current row to a subsequent row, “by” a nominated relationship. Once employee KING, with EMPNO = 7389, has been identified, the syntax CONNECT BY PRIOR EMPNO = MGR instructs the query processing to search for a row in the employee (EMP) table where the MGR column equals that of the employee number EMPNO of the prior row, namely KING. This continues recursively down through the hierarchy until it has been fully traversed for the defined conditions.

Code Listing 1: Introduction to hierarchical query syntax

SQL> select empno, ename, mgr
  2  from   emp
  3  connect by prior empno = mgr
  4  start  with mgr is null;

     EMPNO ENAME             MGR
—————————— —————————— ——————————
      7839 KING
      7566 JONES            7839
      7788 SCOTT            7566
      7876 ADAMS            7788
      7902 FORD             7566
      7369 SMITH            7902
      7698 BLAKE            7839
      7499 ALLEN            7698
      7521 WARD             7698
      7654 MARTIN           7698
      7844 TURNER           7698
      7900 JAMES            7698
      7782 CLARK            7839
      7934 MILLER           7782

An easy way to demonstrate the traversal of the hierarchy is to use the pseudocolumn called LEVEL, which returns the “depth” within the hierarchy of the row that is currently being traversed. Combining LEVEL with RPAD to provide a simple indentation, Listing 2 demonstrates how you can more clearly visualize the depth of the hierarchy starting with KING and navigating down to employees at the deepest level of the hierarchy, such as ADAMS, WARD, MARTIN, TURNER, and JAMES. The hierarchy is often described as analogous to a tree structure, with the deepest levels of the hierarchy being “leaf” elements, and the other levels—that is, those that sit logically above the leaves—being “branches.”

Code Listing 2: Demonstrating the hierarchy depth with RPAD

SQL> select empno, rpad(' ',level*3)||ename ename, mgr
  2  from   emp
  3  connect by prior empno = mgr
  4  start  with mgr is null;

     EMPNO ENAME                           MGR
—————————— ———————————————————————— ——————————
      7839    KING
      7566       JONES                    7839
      7788          SCOTT                 7566
      7876             ADAMS              7788
      7902          FORD                  7566
      7369             SMITH              7902
      7698       BLAKE                    7839
      7499          ALLEN                 7698
      7521          WARD                  7698
      7654          MARTIN                7698
      7844          TURNER                7698
      7900          JAMES                 7698
      7782       CLARK                    7839
      7934          MILLER                7782

Another way to visualize the hierarchy is with the SYS_CONNECT_BY_PATH function, which concatenates the nominated expression values within the hierarchy as it is traversed. Listing 3 shows the same employee hierarchy using SYS_CONNECT_BY_PATH to display the relationships.

Code Listing 3: Demonstrating the hierarchy depth with SYS_CONNECT_BY_PATH

SQL> select sys_connect_by_path(ename,'-')
  2  from   emp
  3  start with mgr is null
  4  connect by prior empno = mgr;

SYS_CONNECT_BY_PATH(ENAME,'-')
———————————————————————————————————————————
-KING
-KING-JONES
-KING-JONES-SCOTT
-KING-JONES-SCOTT-ADAMS
-KING-JONES-FORD
-KING-JONES-FORD-SMITH
-KING-BLAKE
-KING-BLAKE-ALLEN
-KING-BLAKE-WARD
-KING-BLAKE-MARTIN
-KING-BLAKE-TURNER
-KING-BLAKE-JAMES
-KING-CLARK
-KING-CLARK-MILLER

Beyond the Basics

For many developers, navigating the hierarchy and displaying the depth is where their usage and understanding of hierarchical queries come to an end, but there are many more features available. One of the reasons for reticence about using hierarchical queries is that in explorations beyond the simple examples using the EMP table, real-world data problems can cause a hierarchical query to crash with an error and return no data at all.

For example, in Listing 4 the manager for employee KING is set to 7499. This introduces a logical corruption in the hierarchy definition for the employees in this company. Traversing through the hierarchy in one direction indicates that KING is the manager of ADAMS, but traversing in the opposite direction through the hierarchy indicates that ADAMS is the manager of KING. There is now a cyclic relationship in the hierarchy. To continuously follow the CONNECT BY syntax would infinitely traverse this hierarchy and never come to an end. To avoid an endless loop, the database will return an error when CONNECT BY encounters such a cyclic relationship.

Code Listing 4: Error stops endless loop in a hierarchical query

SQL> update emp set mgr = 7499 
  2  where  ename = 'KING';

1 row updated.

SQL> select sys_connect_by_path(ename,'-')
  2  from   emp
  3  start with ename = 'KING'
  4  connect by prior empno = mgr
  5  /

ERROR:
ORA-01436: CONNECT BY loop in user data

To avoid this loop and error, there is a NOCYCLE extension to the CONNECT BY syntax. With NOCYCLE, CONNECT BY will traverse the hierarchy as much as possible. Listing 5 shows NOCYCLE in action, preventing an infinite loop and not generating an error in the SQL execution.

Code Listing 5: NOCYCLE prevents infinite loop and error

SQL> select sys_connect_by_path(ename,'-')
  2  from   emp
  3  start with mgr is null
  4  connect by NOCYCLE prior empno = mgr
  5  /

SYS_CONNECT_BY_PATH(ENAME,'-')
———————————————————————————————————————————
-KING
-KING-JONES
-KING-JONES-SCOTT
-KING-JONES-SCOTT-ADAMS
-KING-JONES-FORD
-KING-JONES-FORD-SMITH
-KING-BLAKE
-KING-BLAKE-ALLEN
-KING-BLAKE-WARD
-KING-BLAKE-MARTIN
-KING-BLAKE-TURNER
-KING-BLAKE-JAMES
-KING-CLARK
-KING-CLARK-MILLER

Another extension for hierarchical SQL queries is the ORDER BY clause, so that rather than ordering the entire result set, you can nominate an ordering system within each branch of the tree. The ORDER SIBLINGS BY clause allows for ordering to be isolated within one branch at a time. Listing 6 shows the five leaf entries of ALLEN, JAMES, MARTIN, TURNER, and WARD ordered alphabetically under the KING-BLAKE branch, without having any impact on the entries of other branches.

Code Listing 6: ORDER SIBLINGS BY isolates ordering to one branch

SQL> select rpad(' ',level)||sys_connect_by_path(ename,'-') tree,
  2      empno, dname
  3  from  emp e, dept d
  4  where d.deptno = e.deptno
  5  start with mgr is null
  6  connect by prior empno = mgr
  7  ORDER SIBLINGS BY ENAME;

TREE                                  EMPNO DNAME
———————————————————————————————— —————————— ——————————————
 -KING                                 7839 ACCOUNTING
  -KING-BLAKE                          7698 SALES
   -KING-BLAKE-ALLEN                   7499 SALES
   -KING-BLAKE-JAMES                   7900 SALES
   -KING-BLAKE-MARTIN                  7654 SALES
   -KING-BLAKE-TURNER                  7844 SALES
   -KING-BLAKE-WARD                    7521 SALES
  -KING-CLARK                          7782 ACCOUNTING
   -KING-CLARK-MILLER                  7934 ACCOUNTING
  -KING-JONES                          7566 RESEARCH
   -KING-JONES-FORD                    7902 RESEARCH
    -KING-JONES-FORD-SMITH             7369 RESEARCH
   -KING-JONES-SCOTT                   7788 RESEARCH
    -KING-JONES-SCOTT-ADAMS            7876 RESEARCH

Note in the previous example that even if a table contains a hierarchical relationship, it can still be joined to other tables.

Filtering Versus Navigation

Navigating the hierarchy is quite distinct from the filtering that comes with WHERE clause predicates in the SQL statement. To demonstrate this, the 14 rows in the EMP table have been duplicated to be a set of rows for multiple geographic regions, as shown in Listing 7. There are now 42 rows in the table: the standard set of 14 rows in each of the NORTH, SOUTH, and EAST regions.

Code Listing 7: Duplicating EMP table rows for multiple regions

SQL> alter table emp add region varchar2(10);

Table altered.

SQL> update emp set region = 'NORTH';

14 rows updated.

SQL>
SQL> insert into emp
  2  select
  3    empno, ename, job, mgr, hiredate, sal, comm, deptno,
  4    decode(r,1,'SOUTH',2,'EAST') region
  5  from emp,
  6       ( select rownum r from dual connect by level <= 2 )
  7  where region = 'NORTH';

28 rows created.

SQL>
SQL> select region, count(*)
  2  from   emp
  3  group by region;

REGION       COUNT(*)
—————————— ——————————
NORTH              14
SOUTH              14
EAST               14

Let’s assume a requirement to obtain the hierarchy for the 14 rows for the NORTH region only. It would seem intuitive to use a query identical to Listing 2, but with an additional predicate to restrict the REGION to NORTH. Listing 8 shows the result.

Code Listing 8: Revisiting Listing 2 with WHERE clause

SQL> select empno, rpad(' ',level*3)||ename ename, mgr
  2  from   emp
  3  where  region = 'NORTH'
  4  connect by prior empno = mgr
  5  start  with mgr is null;

     EMPNO ENAME                                           MGR
—————————— ———————————————————————————————————————— ——————————
      7839    KING
      7566       JONES                                    7839
      7788          SCOTT                                 7566
      7876             ADAMS                              7788
      ...
      ...
      7782       CLARK                                    7839
      7934          MILLER                                7782
      7934          MILLER                                7782
      7934          MILLER                                7782

136 rows selected.

Note that the output returned is not 14 rows but 136 rows, even though a predicate to restrict the results to the NORTH region only was applied.

To understand the cause of this problem, consider the very first NORTH region row that is being processed. The START WITH clause is MGR IS NULL, so three King rows, one for each region, will be located, and then the CONNECT BY operation will commence. The CONNECT BY clause is PRIOR EMPNO = MGR, so for each KING record, three related employees, one for each region, will be connected to, which rapidly increases the number of rows returned as the query traverses through the hierarchy. Only after the hierarchical relationship is established is the WHERE clause predicate applied, so even though the result set contains only rows for REGION = NORTH, a large amount of redundant work was performed doing erroneous connections to the other regions because the CONNECT BY and START WITH syntax did not do the following:

  • Restrict itself to starting with only the NORTH region
  • Make sure that, when connecting to the next entry in the hierarchy, the row maps from the same region to the subsequent row’s region

Listing 9 shows the corrected syntax for this query. It starts with managers that are null (MGR IS NULL) but also only in the NORTH region. It then ensures that when connecting from one manager to a subsequent employee, the definition for connecting also includes a common region. This is analogous to the way the ANSI join syntax is processed, namely, that there is a distinct difference between the hierarchical connecting condition and the WHERE clause predicates that will be applied subsequently as filters.

This gives rise to a common but incorrect criticism of hierarchical queries: they can run too slowly or consume too many resources. Commonly the cause is that the CONNECT BY information is incomplete, processing many more rows than need to be processed.

Code Listing 9: Limiting hierarchical queries in CONNECT BY and START WITH

SQL> select empno, rpad(' ',level*3)||ename ename, mgr
  2  from   emp
  3  connect by prior empno = mgr
  4         and  region = 'NORTH'
  5  start  with mgr is null
  6         and  region = 'NORTH';

     EMPNO ENAME                                           MGR
—————————— ———————————————————————————————————————— ——————————
      7839    KING
      7566       JONES                                    7839
      7788          SCOTT                                 7566
      7876             ADAMS                              7788
      7902          FORD                                  7566
      7369             SMITH                              7902
      7698       BLAKE                                    7839
      7499          ALLEN                                 7698
      7521          WARD                                  7698
      7654          MARTIN                                7698
      7844          TURNER                                7698
      7900          JAMES                                 7698
      7782       CLARK                                    7839
      7934          MILLER                                7782

14 rows selected.

You might ask, “What, then, is the use of a WHERE clause in a hierarchical query?” The presence of a WHERE clause is still valid, not to define the hierarchy but to define filtering of the data. For example, once the NORTH region hierarchy has been instantiated, the WHERE clause requirement might be to include only those employees earning above a certain SALARY.

Hierarchy Attributes

When dealing with hierarchical structures, there are common questions that need to be answered about the data. For the sample data in this article, questions might be along the lines of

  • Is this employee a manager? That is, are there any employees who report to this employee?
  • Which employee sits at the top of the hierarchy when starting a query from a given leaf?

With the initial set of sample data in this article, KING is the sole employee who sits at the top of the hierarchy, but in Listing 10, some rows have been added to demonstrate that this does not have to be the case for all hierarchies. Employee DENNIS has been added who also has no manager; that is, he sits at the same top level as KING. So it is possible that employees at the very bottom of the hierarchy structure might ultimately traverse up to a different highest manager.

Code Listing 10: Adding employees—including one who has no manager

SQL> insert into emp ( empno, ename, mgr, deptno)
  2  values (1000,'DENNIS',null, 10);

1 row created.

SQL> insert into emp ( empno, ename, mgr, deptno)
  2  values (1001,'CHRIS',1000, 10);

1 row created.

SQL> insert into emp ( empno, ename, mgr, deptno)
  2  values (1002,'CRAIG',1001, 10);

1 row created.

There are additional pseudofunctions available to answer the “is this a manager?” and “which employee is at the top of the hierarchy?” questions as the hierarchy is traversed:

  • The CONNECT_BY_ROOT function, followed by a column name, returns the attribute of the root-level node up to which this current node ultimately traverses.
  • The CONNECT_BY_LEAF function returns a zero if a value pertains to a branch in the hierarchy, and it returns 1 if it is a leaf—that is, if there are no subordinate entries as defined by the CONNECT BY clause.
  • The CONNECT_BY_ISCYCLE function allows the identification of which entries in the hierarchy are currently involved within a cyclic relationship. It can be used when dealing with NOCYCLE in a CONNECT BY clause.

Listing 11 demonstrates these three pseudofunctions.

Code Listing 11: Three pseudofunctions that answer questions

SQL> select  connect_by_root ename root,
  2          connect_by_isleaf     leaf,
  3          connect_by_iscycle    cyc,
  4          sys_connect_by_path(ename,'-') full
  5  from  emp
  6  start with ename in ('KING','DENNIS')
  7  connect by NOCYCLE prior empno = mgr;

ROOT         LEAF    CYC FULL
—————————— —————— —————— ———————————————————————
KING            0      0 -KING
KING            0      0 -KING-JONES
KING            0      0 -KING-JONES-SCOTT
KING            1      0 -KING-JONES-SCOTT-ADAMS
KING            0      0 -KING-JONES-FORD
KING            1      0 -KING-JONES-FORD-SMITH
KING            0      0 -KING-BLAKE
KING            1      1 -KING-BLAKE-ALLEN
KING            1      0 -KING-BLAKE-WARD
KING            1      0 -KING-BLAKE-MARTIN
KING            1      0 -KING-BLAKE-TURNER
KING            1      0 -KING-BLAKE-JAMES
DENNIS          0      0 -DENNIS
DENNIS          0      0 -DENNIS-CHRIS
DENNIS          1      0 -DENNIS-CHRIS-CRAIG

Summary

There is much more functionality for the hierarchical query syntax in SQL over and above the basic START WITH and CONNECT BY clauses. These extensions allow a far more sophisticated analysis of hierarchical data in your database systems.

My next article will explore a new mechanism to navigate hierarchical data structures without using the CONNECT BY syntax at all.

Next Steps

LEARN more about hierarchical query processing.

DOWNLOAD Oracle Database 12c Release 2.

Photography by Getty Images/iStockphoto