By Connor McDonald
Many blog posts and magazine articles are about the newest releases of a database and the newest software features, focusing on new facilities and functions that have not been seen in commercial usage yet. But in this article, I’ll take a look back at a feature that has been in Oracle Database since way back in Oracle 2: processing hierarchical data by using the CONNECT BY syntax in SQL. And I’ll focus on some of the additional functionality that has been delivered over the course of the major releases of Oracle Database.
Most database developers are aware of the fundamentals of the hierarchical query syntax. It is one of the first things they learn after they understand rudimentary data manipulation language (DML) elements (SELECT, INSERT, UPDATE, DELETE).
Listing 1 shows an example of the most typical introduction to the hierarchical query syntax. The two elements that identify a SQL query as a hierarchical query are the CONNECT BY and the START WITH clauses. The syntax reads very much like the intended action of the query:
START WITH. The query will “start with” a nominated position (or set of rows) in the hierarchy. In this case, it’s the MANAGER being null, meaning an employee who has no reporting manager, in other words, the CEO or top-level employee in the company. In this case, KING is the only person in this company who has no manager.
CONNECT BY. The query will then “connect” the current row to a subsequent row, “by” a nominated relationship. Once employee KING, with EMPNO = 7389, has been identified, the syntax CONNECT BY PRIOR EMPNO = MGR instructs the query processing to search for a row in the employee (EMP) table where the MGR column equals that of the employee number EMPNO of the prior row, namely KING. This continues recursively down through the hierarchy until it has been fully traversed for the defined conditions.
Code Listing 1: Introduction to hierarchical query syntax
SQL> select empno, ename, mgr 2 from emp 3 connect by prior empno = mgr 4 start with mgr is null; EMPNO ENAME MGR —————————— —————————— —————————— 7839 KING 7566 JONES 7839 7788 SCOTT 7566 7876 ADAMS 7788 7902 FORD 7566 7369 SMITH 7902 7698 BLAKE 7839 7499 ALLEN 7698 7521 WARD 7698 7654 MARTIN 7698 7844 TURNER 7698 7900 JAMES 7698 7782 CLARK 7839 7934 MILLER 7782
An easy way to demonstrate the traversal of the hierarchy is to use the pseudocolumn called LEVEL, which returns the “depth” within the hierarchy of the row that is currently being traversed. Combining LEVEL with RPAD to provide a simple indentation, Listing 2 demonstrates how you can more clearly visualize the depth of the hierarchy starting with KING and navigating down to employees at the deepest level of the hierarchy, such as ADAMS, WARD, MARTIN, TURNER, and JAMES. The hierarchy is often described as analogous to a tree structure, with the deepest levels of the hierarchy being “leaf” elements, and the other levels—that is, those that sit logically above the leaves—being “branches.”
Code Listing 2: Demonstrating the hierarchy depth with RPAD
SQL> select empno, rpad(' ',level*3)||ename ename, mgr 2 from emp 3 connect by prior empno = mgr 4 start with mgr is null; EMPNO ENAME MGR —————————— ———————————————————————— —————————— 7839 KING 7566 JONES 7839 7788 SCOTT 7566 7876 ADAMS 7788 7902 FORD 7566 7369 SMITH 7902 7698 BLAKE 7839 7499 ALLEN 7698 7521 WARD 7698 7654 MARTIN 7698 7844 TURNER 7698 7900 JAMES 7698 7782 CLARK 7839 7934 MILLER 7782
Another way to visualize the hierarchy is with the SYS_CONNECT_BY_PATH function, which concatenates the nominated expression values within the hierarchy as it is traversed. Listing 3 shows the same employee hierarchy using SYS_CONNECT_BY_PATH to display the relationships.
Code Listing 3: Demonstrating the hierarchy depth with SYS_CONNECT_BY_PATH
SQL> select sys_connect_by_path(ename,'-') 2 from emp 3 start with mgr is null 4 connect by prior empno = mgr; SYS_CONNECT_BY_PATH(ENAME,'-') ——————————————————————————————————————————— -KING -KING-JONES -KING-JONES-SCOTT -KING-JONES-SCOTT-ADAMS -KING-JONES-FORD -KING-JONES-FORD-SMITH -KING-BLAKE -KING-BLAKE-ALLEN -KING-BLAKE-WARD -KING-BLAKE-MARTIN -KING-BLAKE-TURNER -KING-BLAKE-JAMES -KING-CLARK -KING-CLARK-MILLER
Beyond the Basics
For many developers, navigating the hierarchy and displaying the depth is where their usage and understanding of hierarchical queries come to an end, but there are many more features available. One of the reasons for reticence about using hierarchical queries is that in explorations beyond the simple examples using the EMP table, real-world data problems can cause a hierarchical query to crash with an error and return no data at all.
For example, in Listing 4 the manager for employee KING is set to 7499. This introduces a logical corruption in the hierarchy definition for the employees in this company. Traversing through the hierarchy in one direction indicates that KING is the manager of ADAMS, but traversing in the opposite direction through the hierarchy indicates that ADAMS is the manager of KING. There is now a cyclic relationship in the hierarchy. To continuously follow the CONNECT BY syntax would infinitely traverse this hierarchy and never come to an end. To avoid an endless loop, the database will return an error when CONNECT BY encounters such a cyclic relationship.
Code Listing 4: Error stops endless loop in a hierarchical query
SQL> update emp set mgr = 7499 2 where ename = 'KING'; 1 row updated. SQL> select sys_connect_by_path(ename,'-') 2 from emp 3 start with ename = 'KING' 4 connect by prior empno = mgr 5 / ERROR: ORA-01436: CONNECT BY loop in user data
To avoid this loop and error, there is a NOCYCLE extension to the CONNECT BY syntax. With NOCYCLE, CONNECT BY will traverse the hierarchy as much as possible. Listing 5 shows NOCYCLE in action, preventing an infinite loop and not generating an error in the SQL execution.
Code Listing 5: NOCYCLE prevents infinite loop and error
SQL> select sys_connect_by_path(ename,'-') 2 from emp 3 start with mgr is null 4 connect by NOCYCLE prior empno = mgr 5 / SYS_CONNECT_BY_PATH(ENAME,'-') ——————————————————————————————————————————— -KING -KING-JONES -KING-JONES-SCOTT -KING-JONES-SCOTT-ADAMS -KING-JONES-FORD -KING-JONES-FORD-SMITH -KING-BLAKE -KING-BLAKE-ALLEN -KING-BLAKE-WARD -KING-BLAKE-MARTIN -KING-BLAKE-TURNER -KING-BLAKE-JAMES -KING-CLARK -KING-CLARK-MILLER
Another extension for hierarchical SQL queries is the ORDER BY clause, so that rather than ordering the entire result set, you can nominate an ordering system within each branch of the tree. The ORDER SIBLINGS BY clause allows for ordering to be isolated within one branch at a time. Listing 6 shows the five leaf entries of ALLEN, JAMES, MARTIN, TURNER, and WARD ordered alphabetically under the KING-BLAKE branch, without having any impact on the entries of other branches.
Code Listing 6: ORDER SIBLINGS BY isolates ordering to one branch
SQL> select rpad(' ',level)||sys_connect_by_path(ename,'-') tree, 2 empno, dname 3 from emp e, dept d 4 where d.deptno = e.deptno 5 start with mgr is null 6 connect by prior empno = mgr 7 ORDER SIBLINGS BY ENAME; TREE EMPNO DNAME ———————————————————————————————— —————————— —————————————— -KING 7839 ACCOUNTING -KING-BLAKE 7698 SALES -KING-BLAKE-ALLEN 7499 SALES -KING-BLAKE-JAMES 7900 SALES -KING-BLAKE-MARTIN 7654 SALES -KING-BLAKE-TURNER 7844 SALES -KING-BLAKE-WARD 7521 SALES -KING-CLARK 7782 ACCOUNTING -KING-CLARK-MILLER 7934 ACCOUNTING -KING-JONES 7566 RESEARCH -KING-JONES-FORD 7902 RESEARCH -KING-JONES-FORD-SMITH 7369 RESEARCH -KING-JONES-SCOTT 7788 RESEARCH -KING-JONES-SCOTT-ADAMS 7876 RESEARCH
Note in the previous example that even if a table contains a hierarchical relationship, it can still be joined to other tables.
Filtering Versus Navigation
Navigating the hierarchy is quite distinct from the filtering that comes with WHERE clause predicates in the SQL statement. To demonstrate this, the 14 rows in the EMP table have been duplicated to be a set of rows for multiple geographic regions, as shown in Listing 7. There are now 42 rows in the table: the standard set of 14 rows in each of the NORTH, SOUTH, and EAST regions.
Code Listing 7: Duplicating EMP table rows for multiple regions
SQL> alter table emp add region varchar2(10); Table altered. SQL> update emp set region = 'NORTH'; 14 rows updated. SQL> SQL> insert into emp 2 select 3 empno, ename, job, mgr, hiredate, sal, comm, deptno, 4 decode(r,1,'SOUTH',2,'EAST') region 5 from emp, 6 ( select rownum r from dual connect by level <= 2 ) 7 where region = 'NORTH'; 28 rows created. SQL> SQL> select region, count(*) 2 from emp 3 group by region; REGION COUNT(*) —————————— —————————— NORTH 14 SOUTH 14 EAST 14
Let’s assume a requirement to obtain the hierarchy for the 14 rows for the NORTH region only. It would seem intuitive to use a query identical to Listing 2, but with an additional predicate to restrict the REGION to NORTH. Listing 8 shows the result.
Code Listing 8: Revisiting Listing 2 with WHERE clause
SQL> select empno, rpad(' ',level*3)||ename ename, mgr 2 from emp 3 where region = 'NORTH' 4 connect by prior empno = mgr 5 start with mgr is null; EMPNO ENAME MGR —————————— ———————————————————————————————————————— —————————— 7839 KING 7566 JONES 7839 7788 SCOTT 7566 7876 ADAMS 7788 ... ... 7782 CLARK 7839 7934 MILLER 7782 7934 MILLER 7782 7934 MILLER 7782 136 rows selected.
Note that the output returned is not 14 rows but 136 rows, even though a predicate to restrict the results to the NORTH region only was applied.
To understand the cause of this problem, consider the very first NORTH region row that is being processed. The START WITH clause is MGR IS NULL, so three King rows, one for each region, will be located, and then the CONNECT BY operation will commence. The CONNECT BY clause is PRIOR EMPNO = MGR, so for each KING record, three related employees, one for each region, will be connected to, which rapidly increases the number of rows returned as the query traverses through the hierarchy. Only after the hierarchical relationship is established is the WHERE clause predicate applied, so even though the result set contains only rows for REGION = NORTH, a large amount of redundant work was performed doing erroneous connections to the other regions because the CONNECT BY and START WITH syntax did not do the following:
Listing 9 shows the corrected syntax for this query. It starts with managers that are null (MGR IS NULL) but also only in the NORTH region. It then ensures that when connecting from one manager to a subsequent employee, the definition for connecting also includes a common region. This is analogous to the way the ANSI join syntax is processed, namely, that there is a distinct difference between the hierarchical connecting condition and the WHERE clause predicates that will be applied subsequently as filters.
This gives rise to a common but incorrect criticism of hierarchical queries: they can run too slowly or consume too many resources. Commonly the cause is that the CONNECT BY information is incomplete, processing many more rows than need to be processed.
Code Listing 9: Limiting hierarchical queries in CONNECT BY and START WITH
SQL> select empno, rpad(' ',level*3)||ename ename, mgr 2 from emp 3 connect by prior empno = mgr 4 and region = 'NORTH' 5 start with mgr is null 6 and region = 'NORTH'; EMPNO ENAME MGR —————————— ———————————————————————————————————————— —————————— 7839 KING 7566 JONES 7839 7788 SCOTT 7566 7876 ADAMS 7788 7902 FORD 7566 7369 SMITH 7902 7698 BLAKE 7839 7499 ALLEN 7698 7521 WARD 7698 7654 MARTIN 7698 7844 TURNER 7698 7900 JAMES 7698 7782 CLARK 7839 7934 MILLER 7782 14 rows selected.
You might ask, “What, then, is the use of a WHERE clause in a hierarchical query?” The presence of a WHERE clause is still valid, not to define the hierarchy but to define filtering of the data. For example, once the NORTH region hierarchy has been instantiated, the WHERE clause requirement might be to include only those employees earning above a certain SALARY.
When dealing with hierarchical structures, there are common questions that need to be answered about the data. For the sample data in this article, questions might be along the lines of
With the initial set of sample data in this article, KING is the sole employee who sits at the top of the hierarchy, but in Listing 10, some rows have been added to demonstrate that this does not have to be the case for all hierarchies. Employee DENNIS has been added who also has no manager; that is, he sits at the same top level as KING. So it is possible that employees at the very bottom of the hierarchy structure might ultimately traverse up to a different highest manager.
Code Listing 10: Adding employees—including one who has no manager
SQL> insert into emp ( empno, ename, mgr, deptno) 2 values (1000,'DENNIS',null, 10); 1 row created. SQL> insert into emp ( empno, ename, mgr, deptno) 2 values (1001,'CHRIS',1000, 10); 1 row created. SQL> insert into emp ( empno, ename, mgr, deptno) 2 values (1002,'CRAIG',1001, 10); 1 row created.
There are additional pseudofunctions available to answer the “is this a manager?” and “which employee is at the top of the hierarchy?” questions as the hierarchy is traversed:
Listing 11 demonstrates these three pseudofunctions.
Code Listing 11: Three pseudofunctions that answer questions
SQL> select connect_by_root ename root, 2 connect_by_isleaf leaf, 3 connect_by_iscycle cyc, 4 sys_connect_by_path(ename,'-') full 5 from emp 6 start with ename in ('KING','DENNIS') 7 connect by NOCYCLE prior empno = mgr; ROOT LEAF CYC FULL —————————— —————— —————— ——————————————————————— KING 0 0 -KING KING 0 0 -KING-JONES KING 0 0 -KING-JONES-SCOTT KING 1 0 -KING-JONES-SCOTT-ADAMS KING 0 0 -KING-JONES-FORD KING 1 0 -KING-JONES-FORD-SMITH KING 0 0 -KING-BLAKE KING 1 1 -KING-BLAKE-ALLEN KING 1 0 -KING-BLAKE-WARD KING 1 0 -KING-BLAKE-MARTIN KING 1 0 -KING-BLAKE-TURNER KING 1 0 -KING-BLAKE-JAMES DENNIS 0 0 -DENNIS DENNIS 0 0 -DENNIS-CHRIS DENNIS 1 0 -DENNIS-CHRIS-CRAIG
There is much more functionality for the hierarchical query syntax in SQL over and above the basic START WITH and CONNECT BY clauses. These extensions allow a far more sophisticated analysis of hierarchical data in your database systems.
My next article will explore a new mechanism to navigate hierarchical data structures without using the CONNECT BY syntax at all.
LEARN more about hierarchical query processing.
DOWNLOAD Oracle Database 12c Release 2.
Photography by Getty Images/iStockphoto