
Text Analytics using a pre-built Wikipedia-based Topic Model

By: Mark Hornick | Director, Advanced Analytics and Machine Learning

In my previous post, Explicit Semantic Analysis (ESA) for Text Analytics, we explored the basics of the ESA algorithm and how to use it in Oracle R Enterprise to build a model from scratch and score new text with it. While creating your own domain-specific model may be necessary in some situations, many users can benefit from a pre-built model based on millions of Wikipedia articles reduced to 200,000 topics. This model is downloadable here, with details on how to install it here.

Installing the model

The installation link provided above describes other prerequisites, such as the required directory object and tablespaces. Once these are in place, loading the model using impdp should produce output like the following:

>impdp rquser/pswd@PDB01 dumpfile=wiki_model12.2.0.1.dmp directory=DATA_PUMP_DIR remap_schema=DMUSER:RQUSER remap_tablespace=TBS_1:ESA_MODELS_1 logfile=wiki_model_import_1.log

Import: Release 12.2.0.1.0 - Production on Tue Nov 28 11:11:25 2017

Copyright (c) 1982, 2017, Oracle and/or its affiliates.  All rights reserved.

Connected to: Oracle Database 12c Enterprise Edition Release 12.2.0.1.0 - 64bit Production
Master table "RQUSER"."SYS_IMPORT_FULL_01" successfully loaded/unloaded
import done in AL32UTF8 character set and AL16UTF16 NCHAR character set
export done in WE8DEC character set and AL16UTF16 NCHAR character set
Warning: possible data loss in character set conversions
Starting "RQUSER"."SYS_IMPORT_FULL_01":  rquser/********@PDB01 dumpfile=wiki_model12.2.0.1.dmp directory=DATA_PUMP_DIR remap_schema=DMUSER:RQUSER remap_tablespace=TBS_1:ESA_MODELS_1 logfile=wiki_model_import_1.log
Processing object type TABLE_EXPORT/TABLE/TABLE
Processing object type TABLE_EXPORT/TABLE/TABLE_DATA
. . imported "RQUSER"."DM$PYWIKI_MODEL"            485.9 MB 21346563 rows
. . imported "RQUSER"."DM$P5WIKI_MODEL"            251.7 MB  289705 rows
. . imported "RQUSER"."DM$PRWIKI_MODEL"            5.573 MB  200887 rows
. . imported "RQUSER"."DM$PZWIKI_MODEL"            5.300 MB  196904 rows
. . imported "RQUSER"."DM$PDWIKI_MODEL"            4.415 MB  200886 rows
. . imported "RQUSER"."DM$PPWIKI_MODEL"            6.765 KB       1 rows
. . imported "RQUSER"."DM$PMWIKI_MODEL"                0 KB       0 rows
Processing object type TABLE_EXPORT/TABLE/INDEX/INDEX
Processing object type TABLE_EXPORT/TABLE/INDEX/STATISTICS/INDEX_STATISTICS
Processing object type TABLE_EXPORT/TABLE/STATISTICS/TABLE_STATISTICS
Processing object type TABLE_EXPORT/TABLE/POST_INSTANCE/PROCDEPOBJ
Job "RQUSER"."SYS_IMPORT_FULL_01" successfully completed at Tue Nov 28 11:12:50 2017 elapsed 0 00:01:24
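
After the import completes, you can optionally confirm from R that the model is visible in your schema. Here is a minimal sketch, assuming an active ORE connection as the importing user; it exposes the standard USER_MINING_MODELS dictionary view as an ore.frame:

ore.sync(query = c(MODELS = "select model_name, algorithm from user_mining_models"))
ore.attach()
MODELS   # the listing should include WIKI_MODEL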

To create the needed Oracle Text lexer preference and policy objects, you then execute:

exec ctx_ddl.drop_policy('wiki_txtpol');
exec ctx_ddl.drop_preference( 'wiki_lexer');
exec ctx_ddl.create_preference('wiki_lexer', 'BASIC_LEXER'); 
exec ctx_ddl.set_attribute('wiki_lexer', 'INDEX_STEMS', 'ENGLISH');
exec ctx_ddl.create_policy(policy_name => 'wiki_txtpol', lexer => 'wiki_lexer'); 

Creating a proxy object

Once the model is installed in the database, we need an ORE object to serve as a proxy for the in-database ESA model. You may know that models created using ORE's ore.odm* functions have underlying first-class in-database objects that are accessed through ORE proxy objects. Since WIKI_MODEL is also an in-database model, it too needs to be accessed through a proxy object. To enable this, execute the following R function, which constructs a model object of class ore.odmESA.

ore.createESA.wiki_model <- function () {
  # Name of the imported in-database ESA model, owned by the current schema
  model.name <- "WIKI_MODEL"
  attr(model.name, "owner") <- OREbase:::.ore.schema()
  # The model was built on a single text column named TEXT
  formula <- "~ TEXT"
  formula <- as.formula(formula, env = parent.frame())
  env.lst <- list()
  # Retrieve the model's settings and attributes
  model.settings   <- OREdm:::.ore.getModelSettings(model.name, "esas_")
  model.attributes <- OREdm:::.ore.getModelAttributes(model.name)
  # Query a small sample of the model's features from the model detail view
  query <- paste("SELECT ",
                 "FEATURE_ID, ",
                 "nvl2(attribute_subname, ", "attribute_name", "||", "'.'", "||",
                 "attribute_subname, attribute_name) attribute_name, ",
                 "COEFFICIENT ",
                 "FROM ", "DM$VAWIKI_MODEL",
                 " where ROWNUM <= 10 order by 1,2,3",
                 sep = "")
  features.all <- OREdm:::.ore.frame4query(query, extRef = env.lst)
  # Construct and return the ore.odmESA proxy object
  esa.mod <- OREdm:::.create.ore.odmESA(model.name,
                                        model.settings,
                                        model.attributes,
                                        formula,
                                        env.lst,
                                        match.call(),
                                        features = features.all)
  esa.mod
}

WikiModel <- ore.createESA.wiki_model()
ore.save(WikiModel,name="WIKI_MODEL")

Like other ORE models, unless the proxy object is saved in an ORE datastore, the model object will be removed, and any underlying in-database model dropped, when the database connection terminates. To avoid having to reinstall WIKI_MODEL, we save the proxy object in a datastore, here also named WIKI_MODEL, although any datastore name could be used. The next time we connect to the database using ORE and want to use WIKI_MODEL, we simply load the proxy object from the datastore using ore.load.
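
For example, a later R session might restore the proxy object as follows. This is a minimal sketch: the host name is a placeholder, and the other connection details mirror the impdp example above.

library(ORE)
# Reconnect to the database holding the imported model (host is a placeholder)
ore.connect(user = "rquser", password = "pswd", host = "myhost",
            service_name = "PDB01", all = TRUE)
# Restore the WikiModel proxy object from the WIKI_MODEL datastore
ore.load(name = "WIKI_MODEL")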

Viewing model metadata

Now we're ready to use the model. First, let's explore the model metadata. Below are the results of the functions class, summary, settings, and features invoked on the proxy object. Note that the listed features start with the odd-looking feature ID '!!!'. It turns out this is the name of a dance-punk band with a Wikipedia article, and so is a valid topic.

> class(WikiModel)
[1] "ore.odmESA" "ore.model" 
>
> summary(WikiModel)

Call:
ore.createESA.wiki_model()

Settings: 
                                               value
min.items                                         25
topn.features                                   1000
value.threshold                            .00000001
odms.missing.value.treatment odms.missing.value.auto
odms.sampling                  odms.sampling.disable
odms.text.max.features                        200000
odms.text.min.documents                            5
odms.text.policy.name                    wiki_txtpol
prep.auto                                         ON

Features: 
   FEATURE_ID   ATTRIBUTE_NAME COEFFICIENT
1         !!!  TEXT.AGGRESSIVE  0.06781725
2         !!!       TEXT.ALLAN  0.07732721
3         !!! TEXT.BANDMEMBERS  0.14586779
4         !!!   TEXT.BASSLINES  0.13941295
5         !!!        TEXT.BIKE  0.08124084
6         !!!        TEXT.CAKE  0.08896293
7         !!!         TEXT.CHK  0.36046651
8         !!!       TEXT.CLICK  0.09164429
9         !!!     TEXT.COMBINE  0.03672403
10        !!!     TEXT.COMPOSE  0.04161426
>
> settings(WikiModel)
                   SETTING_NAME                 SETTING_VALUE SETTING_TYPE
1                     ALGO_NAME ALGO_EXPLICIT_SEMANTIC_ANALYS        INPUT
2                ESAS_MIN_ITEMS                            25        INPUT
3            ESAS_TOPN_FEATURES                          1000        INPUT
4          ESAS_VALUE_THRESHOLD                     .00000001      DEFAULT
5  ODMS_MISSING_VALUE_TREATMENT       ODMS_MISSING_VALUE_AUTO      DEFAULT
6                 ODMS_SAMPLING         ODMS_SAMPLING_DISABLE      DEFAULT
7        ODMS_TEXT_MAX_FEATURES                        200000        INPUT
8       ODMS_TEXT_MIN_DOCUMENTS                             5        INPUT
9         ODMS_TEXT_POLICY_NAME                   wiki_txtpol        INPUT
10                    PREP_AUTO                            ON      DEFAULT

> res <- features(WikiModel)
> dim(res)
[1] 21346563        4
> head(res)
  FEATURE_ID   ATTRIBUTE_NAME ATTRIBUTE_VALUE COEFFICIENT
1        !!!  TEXT.AGGRESSIVE              0.06781725
2        !!!       TEXT.ALLAN              0.07732721
3        !!! TEXT.BANDMEMBERS              0.14586779
4        !!!   TEXT.BASSLINES              0.13941295
5        !!!        TEXT.BIKE              0.08124084
6        !!!        TEXT.CAKE              0.08896293
> tail(res)
                   FEATURE_ID ATTRIBUTE_NAME ATTRIBUTE_VALUE COEFFICIENT
21346558    Đồng Nai Province    TEXT.TROPIC              0.05638469
21346559    Đồng Nai Province  TEXT.TROPICAL              0.08299051
21346560    Đồng Nai Province   TEXT.UPGRADE              0.05149154
21346561    Đồng Nai Province   TEXT.VIETNAM              0.05590181
21346562    Đồng Nai Province     TEXT.WIDEN              0.04419549
21346563    Đồng Nai Province        TEXT.XU              0.06124203

Assigning topics to documents using predict

Using the same test data as in the previous blog post, we can get topic assignments for these titles, i.e., score them, using the WikiModel object and the predict function. There is one caveat: in the previous example, the column containing the text was called "TITLE", but WIKI_MODEL requires the input column to be called "TEXT", because the model was built on input data with a column named "TEXT".

The error in the first predict call below occurs because the ESA_TEXT ore.frame does not have the required "TEXT" column. We can easily address this by copying ESA_TEXT to ESA_TEXT2 and renaming "TITLE" to "TEXT", after which we obtain the topics extracted using the pre-built model.

title <- c('Aids in Africa: Planning for a long war',
           'Mars rover maneuvers for rim shot',
           'Mars express confirms presence of water at Mars south pole',
           'NASA announces major Mars rover finding',
           'Drug access, Asia threat in focus at AIDS summit',
           'NASA Mars Odyssey THEMIS image: typical crater',
           'Road blocks for Aids')
ESA_TEXT <- ore.push(data.frame(ID = seq(length(title)),
                                TITLE = title))

> predict(WikiModel, ESA_TEXT, type = "class", 
             supplemental.cols = "TITLE")
Error in eval(expr, envir, enclos) : object 'TEXT' not found
>
> ESA_TEXT2 <- ESA_TEXT
> names(ESA_TEXT2) <- c("ID","TEXT")
> predict(WikiModel, ESA_TEXT2, type = "class", 
             supplemental.cols = "TEXT")
                                                          TEXT             FEATURE_ID
1                     Aids in Africa: Planning for a long war           South Africa
2                           Mars rover maneuvers for rim shot Mars Exploration Rover
3 Mars express confirms presence of water at Mars south pole                    Mars
4                    NASA announces major Mars rover finding      Curiosity (rover)
5           Drug access, Asia threat in focus at AIDS summit         World AIDS Day
6             NASA Mars Odyssey THEMIS image: typical crater      2001 Mars Odyssey
7                                        Road blocks for Aids         World AIDS Day

Comparing terms or documents

Next, we look at the feature_compare function, which returns an ore.frame containing the similarity measures resulting from the cross-product of the terms or documents in the column named by the compare.cols argument, in this case "TEXT". First, we create an ore.frame with the two terms we want to compare. If more than two terms or documents are provided in the TEXT column, the supplemental.cols argument is essential for identifying which terms are being compared; it can reference, e.g., an ID column or the actual terms.

> text <- c('street', 'avenue')
> df <- data.frame(ID = seq(length(text)), TEXT = text)
> TERMS <- ore.push(df)
> res <- feature_compare(WikiModel, TERMS, compare.cols = "TEXT", 
                         supplemental.cols = "ID")
> res
    ID_A ID_B SIMILARITY
1      1    1  1.0000000
1.1    1    2  0.2353364
2      2    1  0.2353364
2.1    2    2  1.0000000
> res[res[[1]] < res[[2]],]
  ID_A ID_B SIMILARITY
1    1    2  0.2353364
> 
> text <- c('street', 'farm')
> df <- data.frame(ID = seq(length(text)), TEXT = text)
> TERMS <- ore.push(df)
> res <- feature_compare(WikiModel, TERMS, compare.cols = "TEXT", 
                         supplemental.cols = "ID")
> res[res[[1]] < res[[2]],]
  ID_A ID_B  SIMILARITY
1    1    2 0.003524473

Where the values are relatively short, the actual text can be provided as the supplemental column to make the results easier to interpret. In this case, TERMS holds three terms: 'street', 'avenue', and 'road'.
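
The rebuild of TERMS is not shown above; a minimal sketch, following the same pattern as before, would be:

text <- c('street', 'avenue', 'road')
df <- data.frame(ID = seq(length(text)), TEXT = text)
TERMS <- ore.push(df)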

> res <- feature_compare(WikiModel, TERMS, compare.cols = "TEXT", supplemental.cols = "TEXT")
> res[res[[1]] < res[[2]],]
    TEXT_A TEXT_B SIMILARITY
2   avenue street 0.23533639
2.1 avenue   road 0.05528705
3     road street 0.04758687

In the following examples, we compare larger text documents. Notice that the two documents in the first result are more similar to each other than those in the second, highlighting the model's ability to identify similar documents.

> text <- c('Senior members of the Saudi royal family paid at least $560 
+           million to Osama bin Laden terror group and the Taliban for 
+           an agreement his forces would not attack targets in Saudi 
+           Arabia, according to court documents. The papers, filed in 
+           a $US3000 billion ($5500 billion) lawsuit in the US, allege 
+           the deal was made after two secret meetings between Saudi 
+           royals and leaders of al-Qa ida, including bin Laden. The 
+           money enabled al-Qa ida to fund training camps in Afghanistan 
+           later attended by the September 11 hijackers. The disclosures 
+           will increase tensions between the US and Saudi Arabia.',
+           'The Saudi Interior Ministry on Sunday confirmed it is holding 
+           a 21-year-old Saudi man the FBI is seeking for alleged links to 
+           the Sept. 11 hijackers. Authorities are interrogating Saud 
+           Abdulaziz Saud al-Rasheed "and if it is proven that he was 
+           connected to terrorism, he will be referred to the sharia 
+           (Islamic) court," the official Saudi Press Agency quoted an 
+           unidentified ministry official as saying.')
> df <- data.frame(ID = seq(length(text)), TEXT = text)
> DOCS <- ore.push(df)
> res <- feature_compare(WikiModel, DOCS, compare.cols = "TEXT", supplemental.cols = "ID")
> res[res[[1]] < res[[2]],]
  ID_A ID_B SIMILARITY
1    1    2    0.58258
> 
> text <- c('Senior members of the Saudi royal family paid at least $560 
+           million to Osama bin Laden terror group and the Taliban for 
+           an agreement his forces would not attack targets in Saudi 
+           Arabia, according to court documents. The papers, filed in 
+           a $US3000 billion ($5500 billion) lawsuit in the US, allege 
+           the deal was made after two secret meetings between Saudi 
+           royals and leaders of al-Qa ida, including bin Laden. The 
+           money enabled al-Qa ida to fund training camps in Afghanistan 
+           later attended by the September 11 hijackers. The disclosures 
+           will increase tensions between the US and Saudi Arabia.',
+           'Russia defended itself against U.S. criticism of its economic 
+           ties with countries like Iraq, saying attempts to mix business 
+           and ideology were misguided. "Mixing ideology with economic 
+           ties, which was characteristic of the Cold War that Russia 
+           and the United States worked to end, is a thing of the past," 
+           Russian Foreign Ministry spokesman Boris Malakhov said Saturday, 
+           reacting to U.S. Defense Secretary Donald Rumsfeld statement 
+           that Moscow economic relationships with such countries sends a 
+           negative signal.')
> 
> df <- data.frame(ID = seq(length(text)), TEXT = text)
> DOCS <- ore.push(df)
> res <- feature_compare(WikiModel, DOCS, compare.cols = "TEXT", supplemental.cols = "ID")
> res[res[[1]] < res[[2]],]
  ID_A ID_B SIMILARITY
1    1    2 0.09541735

In this post, we've explored several key aspects of using the pre-built ESA Wikipedia-based model:

  • Installing the model
  • Creating a proxy object for using the model in ORE
  • Viewing the model metadata
  • Assigning topics to documents using predict
  • Comparing terms or documents

Try it on your own documents!
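
As a starting point, here is a minimal sketch for scoring your own documents with the pre-built model. It assumes an existing ORE connection and that the proxy object was saved in the WIKI_MODEL datastore as above; the document text is a placeholder.

ore.load(name = "WIKI_MODEL")                # restore the WikiModel proxy object
mydocs <- data.frame(ID = 1:2,
                     TEXT = c("Your first document text ...",    # placeholder text
                              "Your second document text ..."))
MYDOCS <- ore.push(mydocs)                   # create an ore.frame in the database
predict(WikiModel, MYDOCS, type = "class",   # top Wikipedia topic per document
        supplemental.cols = "TEXT")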


 
