X

Best practices, news, tips and tricks - learn about Oracle's R Technologies for Oracle Database and Big Data

  • November 17, 2017

Explicit Semantic Analysis (ESA) for Text Analytics

Mark Hornick
Director, Advanced Analytics and Machine Learning

New in Oracle R Enterprise 1.5.1 with the Oracle Database 12.2 Oracle Advanced Analytics option is the text analytics algorithm Explicit Semantic Analysis or ESA. Compared to other techniques such as Latent Dirichlet Association (LDA) or Term Frequency-Inverse Document Frequency (TF-IDF), ESA offers some unique benefits. Most notably, it improves text document categorization by computing “semantic relatedness” between the documents and a set of topics that are explicitly defined and described by humans. An example of such a corpus of documents is the set of Wikipedia articles. Each article is equated with a topic - the article title. The ESA algorithm can discover topics related to a document from this set of Wikipedia topics. OAA provides a pre-built Wikipedia model that is based on a select subset of English Wikipedia articles. In a subsequent post (Text Analytics using a pre-built Wikipedia-based Topic Model), we'll explore using this pre-built model. Other encyclopedic sources can be used to improve text categorization in domain-specific contexts.

How does ESA compare with LDA?

ESA contrasts with LDA in that ESA uses such a knowledge base, making it possible to assign human-readable labels to concepts or topics. Topics discovered by LDA are latent, i.e., can often be difficult to interpret, since they are defined by their keywords, not abstract descriptions or labels. While LDA labels can be given meaning by extracting keywords, definitions solely based on keywords tend to be fuzzy with keywords from different topics overlapping and not yielding a convenient topic name. ESA, on the other hand, uses documents with clearly-labeled topics. Two common use cases for ESA include calculating the semantic similarity between text documents or between mixed data (text and structured data), and explicit topic modeling for a given document or text.

Further, with LDA, the topic set varies with changes to the training data, making comparison across different data sets difficult. Changing training data also changes the topic boundaries such that topics cannot be mapped to an existing knowledgebase.

The OAA implementation of ESA enables using text columns in combination with optional categorical and numerical columns. Users can also combine multiple knowledgebases, each with its own topic set, which may or may not overlap. Topic overlap in ESA does not affect ESA’s ability to detect relevant topics.

Consider the following simple example, where the objective is to extract topics from the titles provided. The ESA_TEXT variable is created as an ore.frame with an ID and the title text and serves as our training data or corpus.

title <- c('Aids in Africa: Planning for a long war',
           'Mars rover maneuvers for rim shot',
           'Mars express confirms presence of water at Mars south pole',
           'NASA announces major Mars rover finding',
           'Drug access, Asia threat in focus at AIDS summit',
           'NASA Mars Odyssey THEMIS image: typical crater',
           'Road blocks for Aids')
ESA_TEXT <- ore.push(data.frame(ID = seq(length(title)),
                                TITLE = title))

Next, we create a text policy, which requires the CTXSYS.CTX_DDL database privilege.

ore.exec("BEGIN ctx_ddl.create_policy('ESA_TXTPOL'); END;")

At this point we can build the model

esa.mod <- ore.odmESA(~., data = ESA_TEXT,
    odm.setting = list(CASE_ID_COLUMN_NAME = "DOC_ID",
                       ODMS_TEXT_POLICY_NAME = "ESA_TXTPOL",
                       ESAS_MIN_ITEMS = 1),
    ctx.setting = list(TITLE = c("MIN_DOCUMENTS:1", "MAX_FEATURES:3")))

Note that the argument odm.setting specifies a list object for the Oracle Data Mining (ODM) parameter settings. This argument is available when building a model in Database 12.2 or later. Each list element's name and value refer to the parameter setting as specified for ODM. Parameter CASE_ID_COLUMN_NAME specifies the column name containing the unique identifier associated with each case of the data. Parameter ODMS_TEXT_POLICY_NAME specifies the name of a valid Oracle Text policy used for text mining. The argument ctx.setting is a list to specify Oracle Text attribute-specific settings. Similarly, this argument is applicable to building a model in Database 12.2 or later.

In the following execution, we explore a few variants to model building and scoring using ESA models.

R> title <- c('Aids in Africa: Planning for a long war',
+            'Mars rover maneuvers for rim shot',
+            'Mars express confirms presence of water at Mars south pole',
+            'NASA announces major Mars rover finding',
+            'Drug access, Asia threat in focus at AIDS summit',
+            'NASA Mars Odyssey THEMIS image: typical crater',
+            'Road blocks for Aids')
R>
R> # Text contents in character column
R> df <- data.frame(ID = seq(length(title)), TITLE = title)
R> ESA_TEXT <- ore.push(df)
R>
R> # Convert TITLE column to CLOB data type
R> attr(df$TITLE, "ora.type") <- "clob"
R> ESA_TEXT_CLOB <- ore.push(df)
R>

As above, we create our 'title' ore.frame, but define it with the text first as a character vector, and then as a CLOB data type. This second option can be useful when the data is stored in a table already with the CLOB data type. We first use ore.frame ESA_TEXT_CLOB to build the model below, and then ESA_TEXT.

Also, notice the difference in specifying odm.settings. The resulting esa.mod object is of type 'ore.odmESA'.

R> ore.exec("BEGIN ctx_ddl.create_policy('ESA_TXTPOL'); END;")
R> 
R> esa.mod <- ore.odmESA(~., data = ESA_TEXT_CLOB,
+                       odm.settings = list(case_id_column_name = "ID",
+                                           ODMS_TEXT_POLICY_NAME = "ESA_TXTPOL",
+                                           ODMS_TEXT_MIN_DOCUMENTS = 1,
+                                           ODMS_TEXT_MAX_FEATURES = 3,
+                                           ESAS_MIN_ITEMS = 1,
+                                           ESAS_VALUE_THRESHOLD = 0.0001,
+                                           ESAS_TOPN_FEATURES = 3))
R> class(esa.mod)
[1] "ore.odmESA" "ore.model"
R> 
R> summary(esa.mod)

Call:
ore.odmESA(formula = ~., data = ESA_TEXT_CLOB, odm.settings = list(case_id_column_name = "ID",
    ODMS_TEXT_POLICY_NAME = "ESA_TXTPOL", ODMS_TEXT_MIN_DOCUMENTS = 1,
    ODMS_TEXT_MAX_FEATURES = 3, ESAS_MIN_ITEMS = 1, ESAS_VALUE_THRESHOLD = 1e-04,
    ESAS_TOPN_FEATURES = 3))

Settings:
                                               value
min.items                                          1
topn.features                                      3
value.threshold                                1e-04
odms.missing.value.treatment odms.missing.value.auto
odms.sampling                  odms.sampling.disable
odms.text.max.features                             3
odms.text.min.documents                            1
odms.text.policy.name                     ESA_TXTPOL
prep.auto                                         ON

Features:
   FEATURE_ID ATTRIBUTE_NAME ATTRIBUTE_VALUE COEFFICIENT
1           1     TITLE.AIDS               1.0000000
2           2     TITLE.MARS               0.4078615
3           2    TITLE.ROVER               0.9130438
4           3     TITLE.MARS               1.0000000
5           4     TITLE.NASA               0.6742695
6           4    TITLE.ROVER               0.6742695
7           5     TITLE.AIDS               1.0000000
8           6     TITLE.MARS               0.4078615
9           6     TITLE.NASA               0.9130438
10          7     TITLE.AIDS               1.0000000
Using the summary function, we see the call, settings, and features. Functions settings and features allows obtaining these results explicitly as well. The features output provides an ID, the corresponding attribute name(s) as multiple rows, an optional attribute value, depending on whether categorical data was provided to the model, and the resulting coefficient. Note that the ATTRIBUTE_NAME column has values that include the text column attribute name prefix. This allows different columns to provide text, yet still differentiate coefficients.
R> settings(esa.mod)
                   SETTING_NAME                 SETTING_VALUE SETTING_TYPE
1                     ALGO_NAME ALGO_EXPLICIT_SEMANTIC_ANALYS        INPUT
2                ESAS_MIN_ITEMS                             1        INPUT
3            ESAS_TOPN_FEATURES                             3        INPUT
4          ESAS_VALUE_THRESHOLD                         1e-04        INPUT
5  ODMS_MISSING_VALUE_TREATMENT       ODMS_MISSING_VALUE_AUTO      DEFAULT
6                 ODMS_SAMPLING         ODMS_SAMPLING_DISABLE      DEFAULT
7        ODMS_TEXT_MAX_FEATURES                             3        INPUT
8       ODMS_TEXT_MIN_DOCUMENTS                             1        INPUT
9         ODMS_TEXT_POLICY_NAME                    ESA_TXTPOL        INPUT
10                    PREP_AUTO                            ON        INPUT
R> 
R> features(esa.mod)
   FEATURE_ID ATTRIBUTE_NAME ATTRIBUTE_VALUE COEFFICIENT
1           1     TITLE.AIDS                 1.0000000
2           2     TITLE.MARS                 0.4078615
3           2    TITLE.ROVER                 0.9130438
4           3     TITLE.MARS                 1.0000000
5           4     TITLE.NASA                 0.6742695
6           4    TITLE.ROVER                 0.6742695
7           5     TITLE.AIDS                 1.0000000
8           6     TITLE.MARS                 0.4078615
9           6     TITLE.NASA                 0.9130438
10          7     TITLE.AIDS                 1.0000000
Using the overloaded predict function, we can score the same data, or interchangeably use data with character vector type, as found in ESA_TEXT, and depicted here. The feature identifier is predicted, supplemented by the ID column. The feature identifier can be joined with the features listed above to produce the desired text.
R> predict(esa.mod, ESA_TEXT, type = "class", supplemental.cols = "ID")
  ID FEATURE_ID
1  1          1
2  2          2
3  3          3
4  4          4
5  5          1
6  6          6
7  7          1
This next example, we illustrate using the ESA_TEXT (character column) to build the model, but then predict using the ESA_TEXT_CLOB data. In this example, we reintroduce the use of the ctx.settings argument. Note that the model summary is the same as before.
R> esa.mod2 <- ore.odmESA(~., data = ESA_TEXT,
+                        odm.settings = list(case_id_column_name = "ID", 
+                                            ESAS_MIN_ITEMS = 1),
+                        ctx.settings = list(TITLE =
+                                              "TEXT(POLICY_NAME:ESA_TXTPOL)(TOKEN_TYPE:STEM)(MIN_DOCUMENTS:1)(MAX_FEATURES:3)"))
R> 
R> summary(esa.mod2)

Call:
ore.odmESA(formula = ~., data = ESA_TEXT, odm.settings = list(case_id_column_name = "ID",
    ESAS_MIN_ITEMS = 1), ctx.settings = list(TITLE = "TEXT(POLICY_NAME:ESA_TXTPOL)(TOKEN_TYPE:STEM)(MIN_DOCUMENTS:1)(MAX_FEATURES:3)"))

Settings:
                                               value
min.items                                          1
topn.features                                   1000
value.threshold                            .00000001
odms.missing.value.treatment odms.missing.value.auto
odms.sampling                  odms.sampling.disable
odms.text.max.features                        300000
odms.text.min.documents                            3
prep.auto                                         ON

Features:
   FEATURE_ID ATTRIBUTE_NAME ATTRIBUTE_VALUE COEFFICIENT
1           1     TITLE.AIDS               1.0000000
2           2     TITLE.MARS               0.4078615
3           2    TITLE.ROVER               0.9130438
4           3     TITLE.MARS               1.0000000
5           4     TITLE.MARS               0.3011997
6           4     TITLE.NASA               0.6742695
7           4    TITLE.ROVER               0.6742695
8           5     TITLE.AIDS               1.0000000
9           6     TITLE.MARS               0.4078615
10          6     TITLE.NASA               0.9130438
11          7     TITLE.AIDS               1.0000000
R>
Here, we predict using the ESA_TEXT_CLOB, but could also used the ESA_TEXT as input. When argument type is set to 'class', the feature with the maximum probability is returned. If set to 'raw', the probability for each feature returned is provided (not shown).
R> predict(esa.mod2, ESA_TEXT_CLOB, type = "class", supplemental.cols = "ID")
  ID FEATURE_ID
1  1          1
2  2          2
3  3          3
4  4          4
5  5          1
6  6          6
7  7          1 
The function feature_compare returns an ore.frame containing a column that measures the relatedness between documents in the provided ore.frame, here ESA_TEXT_CLOB. The columns to compare are in compare.cols, in this case column TITLE. The supplemental column ID allows us to see the similarity between pairs of rows in ESA_TEXT_CLOB.
R> feature_compare(esa.mod2, ESA_TEXT_CLOB, compare.cols = "TITLE", supplemental.cols = "ID")
    ID_A ID_B SIMILARITY
1      1    1  1.0000000
1.1    1    2  0.0000000
1.2    1    3  0.0000000
1.3    1    4  0.0000000
1.4    1    5  1.0000000
1.5    1    6  0.0000000
1.6    1    7  1.0000000
2      2    3  0.6322308
2.1    2    2  1.0000000
2.2    2    1  0.0000000
2.3    2    4  0.8608680
2.4    2    7  0.0000000
2.5    2    6  0.5259416
2.6    2    5  0.0000000
3      3    1  0.0000000
3.1    3    2  0.6322308
3.2    3    3  1.0000000
3.3    3    6  0.6322308
3.4    3    5  0.0000000
...
R>
Finally, we can drop the text policy to clean up.
R> ore.exec("BEGIN ctx_ddl.drop_policy('ESA_TXTPOL'); END;")

In this blog, we introduced the ORE capability of using the new Explicit Semantic Analysis algorithm in the Oracle Advanced Analytics option, and illustrated a few ways in which models can be built, data scored, and similarity among text assessed using a custom-built model.

In a subsequent post (Text Analytics using a pre-built Wikipedia-based Topic Model), we'll look at using the pre-built Wikipedia model.

Be the first to comment

Comments ( 0 )
Please enter your name.Please provide a valid email address.Please enter a comment.CAPTCHA challenge response provided was incorrect. Please try again.Captcha
Oracle

Integrated Cloud Applications & Platform Services