Generating cluster names from a document clustering model, Part 2

My previous post, Generating cluster names from a document clustering model, included a SQL script that involved advanced SQL constructs. In this post, we'll look at those constructs used to retrieve model details from a k-Means model. These model details provide the top terms for naming each cluster. Let's get started...
 
The following SQL allows us to retrieve the text terms in each centroid in decreasing order of their centroid value, i.e., the importance of each term in describing the given cluster. In the script, we used this as a basis to skim off the top 5 terms for constructing the cluster name. Recall that the query below was within a FOR loop and was executed for each of the relevant cluster ids (c.cluster_id).

SELECT cd.attribute_subname  term,
      
cd.mean               centroid_mean

FROM (

     
SELECT *

     
FROM TABLE(dbms_data_mining.get_model_details_km('SESSION09_PRE92765_CL',  c.cluster_id, null, 1, 0, 0)) ) a,

      TABLE(a.centroid) cd
ORDER BY cd.mean DESC


The table function dbms_data_mining.get_model_details_km returns a set of rows that provide k-Means clustering model details. This function allows us to specify several parameters:
 
  • model_name - the name of the clustering model
  • cluster_id - the id of the cluster we want details for -- an invalid cluster_id returns details for all clusters
  • attribute - the name of the attribute we want details for -- we specified null since we don't care about a specific attribute
  • centroid - we specify 1 since we want details about the centroids
  • histogram - we specify 0 since we don't want histogram details
  • rules - we specify 0 since we don't want rules
The return type of get_model_details_km is DM_CLUSTERS. This contains rows of type DM_CLUSTER, which have the following columns available:

 id                   NUMBER
cluster_id VARCHAR2(4000)
record_count NUMBER
parent NUMBER
tree_level NUMBER
dispersion NUMBER
split_predicate DM_PREDICATES
child DM_CHILDREN
centroid DM_CENTROIDS
histogram DM_HISTOGRAMS
rule DM_RULE
We first convert the result of get_model_details_km to a table using the TABLE operator and then extract the centroid column (a.centroid) which has type DM_CENTROIDS. The rows, which are of type DM_CENTROID, have the following columns available:

 attribute_name     VARCHAR2(4000)
attribute_subname VARCHAR2(4000)
mean NUMBER
mode_value VARCHAR2(4000)
variance NUMBER
Since we provided a single column in the build (training) dataset, we're not interested in the attribute_name column. However, for text, we use the attribute_subname column to obtain the text term itself. We also use the centroid mean column to order the text terms from most important to least.

At this stage, our results are fairly cryptic since the algorithm deals with numeric term ids, not the actual text:

TERM    CENTROID_MEAN
1447    0.18574048897975001
1471    0.182003440515976
1453    0.16965578421072799
1460    0.16519875743986001
1418    0.14341226478354499
1500    0.141826045289632
1429    0.139553071123386


To obtain the corresponding text terms, we join this result with the mapping data table obtained from the Text step of the Oracle Data Miner Activity. In our example, this was the auto-generated table name DM4J$VSESSION09_710479489.

SELECT id, text term, centroid_mean
FROM (SELECT rownum id, a.*
      FROM (SELECT cd.attribute_subname      term,
                   cd.mean                   centroid_mean
            FROM (SELECT *
                  FROM TABLE(dbms_data_mining.get_model_details_km('SESSION09_PRE92765_CL',4, null,1,0,0)) ) a,
                 TABLE(a.centroid) cd
            order by cd.mean desc) a
       WHERE rownum < 6) x,
       DM4J$VSESSION09_710479489 y
WHERE x.term=y.attribute_id
ORDER BY centroid_mean


A couple other features to note from this query include selecting the top five terms using the statement WHERE rownum < 6, and the use of rownum to generate an id. This id column will be used for the LEAD functions to construct the cluster name.

The result now looks like:

ID   TERM                 CENTROID_MEAN
5    SUPPLY               0.14341
4    FINANCIAL            0.16519
3    PROJECT              0.16965
2    PEOPLESOFT           0.18200
1    FINANCIALMANAGEMENT  0.18574


Next, we'll look at using the LEAD function to construct the cluster names and the cursor ClusterLeafIds definition for selecting the leaf clusters.
Comments:

Post a Comment:
  • HTML Syntax: NOT allowed
About

Everything about Oracle Data Mining, a component of the Oracle Advanced Analytics Option - News, Technical Information, Opinions, Tips & Tricks. All in One Place

Search

Categories
Archives
« April 2014
SunMonTueWedThuFriSat
  
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
   
       
Today