Generating cluster names from a document clustering model, Part 2
By Mark Hornick on Feb 17, 2010
The following SQL allows us to retrieve the text terms in each centroid in decreasing order of their centroid value, i.e., the importance of each term in describing the given cluster. In the script, we used this as a basis to skim off the top 5 terms for constructing the cluster name. Recall that the query below was within a FOR loop and was executed for each of the relevant cluster ids (c.cluster_id).
SELECT cd.attribute_subname term,
FROM TABLE(dbms_data_mining.get_model_details_km('SESSION09_PRE92765_CL', c.cluster_id, null, 1, 0, 0)) ) a,
ORDER BY cd.mean DESC
The table function dbms_data_mining.get_model_details_km returns a set of rows that provide k-Means clustering model details. This function allows us to specify several parameters:
- model_name - the name of the clustering model
- cluster_id - the id of the cluster we want details for -- an
invalid cluster_id returns details for all clusters
- attribute - the name of the attribute we want details for -- we specified null since we don't care about a specific attribute
- centroid - we specify 1 since we want details about the centroids
- histogram - we specify 0 since we don't want histogram details
- rules - we specify 0 since we don't want rules
id NUMBERWe first convert the result of get_model_details_km to a table using the TABLE operator and then extract the centroid column (a.centroid) which has type DM_CENTROIDS. The rows, which are of type DM_CENTROID, have the following columns available:
attribute_name VARCHAR2(4000)Since we provided a single column in the build (training) dataset, we're not interested in the attribute_name column. However, for text, we use the attribute_subname column to obtain the text term itself. We also use the centroid mean column to order the text terms from most important to least.
At this stage, our results are fairly cryptic since the algorithm deals with numeric term ids, not the actual text:
To obtain the corresponding text terms, we join this result with the mapping data table obtained from the Text step of the Oracle Data Miner Activity. In our example, this was the auto-generated table name DM4J$VSESSION09_710479489.
SELECT id, text term, centroid_mean
FROM (SELECT rownum id, a.*
FROM (SELECT cd.attribute_subname term,
FROM (SELECT *
FROM TABLE(dbms_data_mining.get_model_details_km('SESSION09_PRE92765_CL',4, null,1,0,0)) ) a,
order by cd.mean desc) a
WHERE rownum < 6) x,
ORDER BY centroid_mean
A couple other features to note from this query include selecting the top five terms using the statement WHERE rownum < 6, and the use of rownum to generate an id. This id column will be used for the LEAD functions to construct the cluster name.
The result now looks like:
ID TERM CENTROID_MEAN
5 SUPPLY 0.14341
4 FINANCIAL 0.16519
3 PROJECT 0.16965
2 PEOPLESOFT 0.18200
1 FINANCIALMANAGEMENT 0.18574
Next, we'll look at using the LEAD function to construct the cluster names and the cursor ClusterLeafIds definition for selecting the leaf clusters.