X

Learn about Oracle Machine Learning for Oracle Database and Big Data, on-premises and Oracle Cloud

Generating cluster names from a document clustering model

Mark Hornick
Senior Director, Data Science and Machine Learning

Text mining is a hot topic, especially for document clustering. Say you have a
potentially large set of documents that you'd like to sort into some number of
related groups. Sometimes it is enough to know which documents are in the same
group (or cluster) and be able to assign new documents to the existing set of
groups. However, you may also want a description of the clusters to help
understand what types of documents are in those clusters. Automatically generating cluster names would be much easier than
examining cluster centroids or reading a sample of documents in each cluster.

Oracle Data Mining supports this use case and below is a
script that generates cluster names from a clustering model.

To use this script, you first need a clustering model and a
text mapping table. These are easily produced using the Oracle Data Miner
graphical user interface to automatically transform the data and then build the
model. To get started, provide a data table with two columns: a numeric id column and a VARCHAR2 column containing the document text.

Here are a few key screen captures to guide you. I'm
using a dataset from Oracle Open World that includes all the session text
(title and abstract concatenated). By the way, this session document clustering was part of the process for
producing the Session Recommendation Engine for Oracle Open World 2008 and 2009.

In Oracle Data Miner, start a build activity for clustering
using k-Means. Then, select the dataset and the unique identifier, and click
Next. (Click images to enlarge.)

ClusterNames 001.jpg

Check the SESSION_TEXT attributes as "input" and change the "mining
type" to "text."

 
ClusterNames 002.jpg

 

Click advanced settings at the end of the wizard to reveal settings
you can tailor. Since we have a single TEXT column, click on the tabs for "Outlier
Treatment," "Missing Values," and "Normalize" and disable each step by clicking
the box in the upper left-hand corner. Whereas these are often necessary for
k-Means, our single text column and text transformation eliminate the need these.

Clicking the "Text" tab, you may specify various text-specific settings. For
example, you may have a custom stopword list or lexer that you want to use, as
shown below.

ClusterNames 003.jpg

Clicking the "Feature Extraction" sub-tab allows you to
specify maximum number of terms to represent each document and the maximum
number of terms to represent all documents.

ClusterNames 004.jpg

Click the "Build" tab to specify the number of clusters
(groups) you want to have. For text, we recommend the "cosine" distance
function. Depending on your needs, you may want to specify the split criterion
to "size" to have clusters of more equal size. For a better model, set maximum
interactions to 20.

ClusterNames 005.jpg

Oracle Data Miner now generates an activity that performs
the text transformation and model building.

ClusterNames 006.jpg

To obtain the model name from the Build step, copy the text
next to "Model Name." To obtain the mapping table, click the "Output Data" link
under the Text step. Click the "Mapping Data" link and copy the name of the
table at the top of the window.

ClusterNames 007.jpg

Now, you're nearly ready to invoke the following script to
generate the cluster names.

Create a table like CLUSTER_NAME_MAP below to store the
results. Then, replace the model name used below ('SESSION09_PRE92765_CL')
with your model name, and the mapping table name used below (DM4J$VSESSION09_710479489)
with your mapping table name.

create
table cluster_name_map (model_name   VARCHAR(40),

                               cluster_name
VARCHAR2(1999),

                               cluster_id   NUMBER,

                               record_count
NUMBER);

 

Run this script on your model and table. Look below to see
some sample output from the Open World session data. (Note that some columns are included in the script below, even though not required, to highlight data available in the model.)

DECLARE

  CURSOR ClusterLeafIds IS

    --Obtain leaf clusters

    SELECT CLUSTER_ID, RECORD_COUNT

    FROM (

      SELECT distinct clus.ID AS CLUSTER_ID,

             clus.RECORD_COUNT RECORD_COUNT,

             clus.DISPERSION DISPERSION,

             clus.PARENT PARENT_CLUSTER_ID,

             clus.TREE_LEVEL TREE_LEVEL,

             CASE WHEN chl.id IS NULL THEN 'YES'

                                      ELSE 'NO' END IS_LEAF

       FROM (SELECT *

             FROM
TABLE(dbms_data_mining.get_model_details_km('SESSION09_PRE92765_CL'))) clus,

                table(clus.child) chl

     )

    WHERE is_leaf='YES'

    ORDER BY cluster_id;

BEGIN

  FOR c IN ClusterLeafIds LOOP

    INSERT INTO cluster_name_map (model_name,
cluster_name,

                                  cluster_id, record_count)

    SELECT 'SESSION09_PRE92765_CL' model_name,
cluster_name,

            c.cluster_id cluster_id,
c.record_count record_count

    FROM (

      SELECT id, term || '-' ||

           LEAD(term, 1) OVER (ORDER BY id) ||
'-' ||

           LEAD(term, 2) OVER (ORDER BY id) ||
'-' ||

           LEAD(term, 3) OVER (ORDER BY id) ||
'-' ||

           LEAD(term, 4) OVER (ORDER BY id)
cluster_name

      FROM (  

       SELECT id, text term, centroid_mean

       FROM (

        SELECT rownum id, a.*

        FROM (

             SELECT cd.attribute_subname  term,

                    cd.mean               centroid_mean

             FROM (

               SELECT *

               FROM
TABLE(dbms_data_mining.get_model_details_km('SESSION09_PRE92765_CL',  c.cluster_id, null, 1, 0, 0, null)) ) a,

               TABLE(a.centroid) cd

             order by cd.mean desc) a

        WHERE rownum < 6) x,

        DM4J$VSESSION09_710479489 y

      WHERE x.term=y.attribute_id

      ORDER BY centroid_mean

      )

    )

    WHERE id=1;

  END LOOP;

END;

 

Each cluster name is the concatenation of the top 5 terms (words with the highest ranking centroid values) that represent the cluster. The the image below, the second column is the cluster id, and the third column is the count of documents assigned to that cluster.

ClusterNames 008.jpg

Cluster names can also be assigned to the model clusters directly in the model.

Assigning cluster names and the advanced SQL in the script will be covered in
future blog posts.

 


Be the first to comment

Comments ( 0 )
Please enter your name.Please provide a valid email address.Please enter a comment.CAPTCHA challenge response provided was incorrect. Please try again.