Generating cluster names from a document clustering model, Part 3

My previous post Generating cluster names from a document clustering model, Part 2 provided a deeper look at SQL constructs used to retrieve model details from a k-Means model. In this post, we'll look at using the LEAD function to construct the cluster names and the cursor ClusterLeafIds definition for selecting the leaf clusters.
 
The following SQL builds upon the id, term, and centroid_mean subquery to concatenate the top five terms for each cluster. In this example, we've replaced the c.cluster_id variable (from the original script) with the specific cluster id 4.

SELECT 'SESSION09_PRE92765_CL' model_name, cluster_name, 4 cluster_id
FROM (
      SELECT id, term || '-' ||
           LEAD(term, 1) OVER (ORDER BY id) || '-' ||
           LEAD(term, 2) OVER (ORDER BY id) || '-' ||
           LEAD(term, 3) OVER (ORDER BY id) || '-' ||
           LEAD(term, 4) OVER (ORDER BY id) cluster_name
      FROM (  
       SELECT id, text term, centroid_mean
       FROM (
        SELECT rownum id, a.*
        FROM (
             SELECT cd.attribute_subname term,
                    cd.mean              centroid_mean
             FROM (
               SELECT *
               FROM TABLE(dbms_data_mining.get_model_details_km('SESSION09_PRE92765_CL',  4, null, 1, 0, 0)) ) a,
               TABLE(a.centroid) cd
             order by cd.mean desc) a
        WHERE rownum < 6) x,
        DM4J$VSESSION09_710479489 y
       WHERE x.term=y.attribute_id
       ORDER BY centroid_mean
       )
    )
WHERE id=1;


Which produces the result:

SESSION09_PRE92765_CL  FINANCIALMANAGEMENT-PEOPLESOFT-PROJECT-FINANCIAL-SUPPLY 4

from the subquery result

ID        TERM          CENTROID_MEAN
5    SUPPLY               0.14341
4    FINANCIAL            0.16519
3    PROJECT              0.16965
2    PEOPLESOFT           0.18200
1    FINANCIALMANAGEMENT  0.18574


How does the LEAD function work? LEAD is one of the analytical functions with its companion LAG. As noted in the documentation, these are used when the relative position of rows can be known reliably. The LEAD function provides access to a row at a given offset after the current position, whereas LAG provides access to a row at a given offset prior to the current position.

In the example above, we specified LEAD(term, 1) OVER (ORDER BY id), where term refers to the value expression and 1 refers to the offset from the current row. The OVER clause allows us to specify that we are ordering the rows by the id column. We concatenate the first term and the 4 leading terms.

Note, we could have just as easily used the LAG function by ordering on the CENTROID_MEAN:

      SELECT id, term || '-' ||
           LAG(term, 1) OVER (ORDER BY centroid_mean) || '-' ||
           LAG(term, 2) OVER (ORDER BY centroid_mean) || '-' ||
           LAG(term, 3) OVER (ORDER BY centroid_mean) || '-' ||
           LAG(term, 4) OVER (ORDER BY centroid_mean) cluster_name


The last part of the original script we'll discuss is the ClusterLeafIds cursor, which is used to select only the leaf clusters for assigning cluster names. Since Oracle Data Mining clustering algorithms are hierarchical, the results from dbms_data_mining.get_model_details_km provides details for all clusters in the model.

  CURSOR ClusterLeafIds IS
    --Obtain leaf clusters
    SELECT CLUSTER_ID, RECORD_COUNT
    FROM (
      SELECT distinct clus.ID AS CLUSTER_ID,
             clus.RECORD_COUNT RECORD_COUNT,
             clus.DISPERSION DISPERSION,
             clus.PARENT PARENT_CLUSTER_ID,
             clus.TREE_LEVEL TREE_LEVEL,
             CASE WHEN chl.id IS NULL THEN 'YES'
                                      ELSE 'NO' END IS_LEAF
       FROM (SELECT *
             FROM TABLE(dbms_data_mining.get_model_details_km('SESSION09_PRE92765_CL'))) clus,
             TABLE(clus.child) chl
     )
    WHERE is_leaf='YES'

    ORDER BY cluster_id;

The inner subquery produces the following results. Of course, record_count, dispersion, parent_cluster_id, and tree_level are not essential to the problem at hand, but included as an example of content available from get_model_details_km.

CLUSTER_ID RECORD_COUNT  DISPERSION  PARENT_CLUSTER_ID TREE_LEVEL IS_LEAF
28             70         0.717292         18               6       YES
32
             62         0.650029         17               6       YES
31
             33         0.598149         16               6       YES
            231         0.712922          3               3        NO
12  
           76         0.668168          6               4        NO
           1115         0.815162          1            null        NO

The distinct clause removes duplicates caused by non-leaf nodes, which will have two children. Note the result without distinct:

SELECT  clus.ID AS CLUSTER_ID, clus.RECORD_COUNT RECORD_COUNT,
        chl.id
FROM (SELECT *
      FROM TABLE(dbms_data_mining.get_model_details_km('SESSION09_PRE92765_CL'))) clus,
     TABLE(clus.child) chl


CLUSTER_ID RECORD_COUNT /*Child*/ID
1              1115            2
1   
          1115            3
2   
           699            4
2   
           699            5
3   
           416            6
...
24   
           75           43
25   
           80         null
26   
           43         null
27   
           49         null

Finally, we select those cluster ids where is_leaf is 'YES' to produce the list of clusters that we ultimately want to generate cluster names for.

Comments:

Post a Comment:
  • HTML Syntax: NOT allowed
About

Everything about Oracle Data Mining, a component of the Oracle Advanced Analytics Option - News, Technical Information, Opinions, Tips & Tricks. All in One Place

Search

Categories
Archives
« April 2014
SunMonTueWedThuFriSat
  
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
   
       
Today