Best practices, news, tips and tricks - learn about Oracle's R Technologies for Oracle Database and Big Data

  • Tuesday, September 5, 2017

Introducing a dplyr interface to Oracle R Enterprise

By: Mark Hornick | Director, Advanced Analytics and Machine Learning

While Oracle R Enterprise already provides seamless access to Oracle Database tables using standard R syntax and functions, new interfaces arise that make it conceptually easier for users to manipulate tabular data. The R package dplyr is one such package in the tidyverse that has gained wide adoption. It provides a grammar for data manipulation while working with data.frame-like objects, both in memory and out of memory. The dplyr package is intended to interface to database management systems, operating on data.frame or numeric vector objects.

New with Oracle R Enterprise 1.5.1, OREdplyr provides much of the dplyr functionality extending the ORE transparency layer. OREdplyr accepts, e.g., ore.frame objects instead of data.frames for in-database execution of dplyr function calls. Like the ORE transparency layer in general, OREdplyr allows users to avoid costly movement of data while scaling to larger data volumes because operations are not constrained by R client memory, the latency of data movement, or single-threaded execution, but leverage Oracle Database as a high performance compute engine.

OREdplyr maps closely to dplyr for both functions and arguments, and operates on both ore.frame and ore.numeric objects. Like dplyr, functions support both the non-standard evaluation (NSE) and standard evaluation (SE) interface. Note the NSE functions are good for interactive use, while SE functions are convenient for programming. See this dplyr vignette for details.

So what does this interface look like with ORE? Here are just a few examples:


library(nycflights13)   # contains data sets

# Import data to Oracle Database

ore.drop("FLIGHTS") # remove database table, if exists
ore.create(as.data.frame(flights), table="FLIGHTS") # create table from data.frame

dim(FLIGHTS)    # get # rows and # columns
names(FLIGHTS)  # view names of columns
head(FLIGHTS)   # verify data.frame appears as expected

# Basic operations

select(FLIGHTS, year, month, day, dep_delay, arr_delay) 
       %>% head() # select columns
select(FLIGHTS, -year,-month, -day) 
       %>% head() # exclude columns

select(FLIGHTS, tail_num = tailnum) 
        %>% head() # rename columns, but drops others
rename(FLIGHTS, tail_num = tailnum) 
        %>% head() # rename columns

filter(FLIGHTS, month == 1, day == 1) 
        %>% head() # filter rows
filter(FLIGHTS, dep_delay > 240) %>% head()
filter(FLIGHTS, month == 1 | month == 2) %>% head()

arrange(FLIGHTS, year, month, day) 
        %>% head() # sort rows by specified columns
arrange(FLIGHTS, desc(arr_delay)) 
        %>% head() # sort in descending order

distinct(FLIGHTS, tailnum) 
         %>% head() # see distinct values
distinct(FLIGHTS, origin, dest) 
         %>% head() # see distinct pairs

mutate(FLIGHTS, speed = air_time / distance) 
       %>% head()  # compute and add new columns
mutate(FLIGHTS,    # keeps existing columns
       gain = arr_delay - dep_delay,
       speed = distance / air_time * 60) %>% head()

transmute(FLIGHTS, # only keeps new computed columns
  gain = arr_delay - dep_delay,
  gain_per_hour = (arr_delay - dep_delay) / (air_time / 60)) 
      %>% head()

summarise(FLIGHTS, # aggregates the specified column values
          mean_delay = mean(dep_time,na.rm=TRUE),  
          min_delay = min(dep_time,na.rm=TRUE),
          max_delay = max(dep_time,na.rm=TRUE),
          sd_delay = sd(dep_time,na.rm=TRUE)) 

Functions supported include:

  • Data manipulation: select, filter, arrange, rename, mutate, transmute, distinct, slice, desc, select_, filter_, arrange_, rename_, mutate_, transmute_ , distinct_, slice_, inner_join, left_join, right_join, full_join
  • Grouping: group_by, groups, ungroup, group_size, n_groups, group_by_
  • Aggregation: summarise, summarise_, tally, count, count_
  • Sampling: sample_n, sample_frac
  • Ranking: row_number, min_rank, dense_rank, percent_rank, cume_dist, ntile, nth, first, last, n_distinct, top_n

With OREdplyr, ORE expands into the tidyverse giving R users another powerful way to manipulate database data, while avoiding costly data movement. This enables R users and data scientists to work with larger data volumes and not be constrained by R client memory, the latency associated with data movement, or single threaded execution, but can take advantage of the powerful database technology present in Oracle Database.

Be the first to comment

Comments ( 0 )
Please enter your name.Please provide a valid email address.Please enter a comment.CAPTCHA challenge response provided was incorrect. Please try again.Captcha