While Oracle R Enterprise already provides seamless access to Oracle Database tables using standard R syntax and functions, new interfaces arise that make it conceptually easier for users to manipulate tabular data. The R package dplyr is one such package in the tidyverse that has gained wide adoption. It provides a grammar for data manipulation while working with data.frame-like objects, both in memory and out of memory. The dplyr package is intended to interface to database management systems, operating on data.frame or numeric vector objects.
New with Oracle R Enterprise 1.5.1, OREdplyr provides much of the dplyr functionality extending the ORE transparency layer. OREdplyr accepts, e.g., ore.frame objects instead of data.frames for in-database execution of dplyr function calls. Like the ORE transparency layer in general, OREdplyr allows users to avoid costly movement of data while scaling to larger data volumes because operations are not constrained by R client memory, the latency of data movement, or single-threaded execution, but leverage Oracle Database as a high performance compute engine.
OREdplyr maps closely to dplyr for both functions and arguments, and operates on both ore.frame and ore.numeric objects. Like dplyr, functions support both the non-standard evaluation (NSE) and standard evaluation (SE) interface. Note the NSE functions are good for interactive use, while SE functions are convenient for programming. See this dplyr vignette for details.
So what does this interface look like with ORE? Here are just a few examples:
library(OREdplyr) library(nycflights13) # contains data sets # Import data to Oracle Database ore.drop("FLIGHTS") # remove database table, if exists ore.create(as.data.frame(flights), table="FLIGHTS") # create table from data.frame dim(FLIGHTS) # get # rows and # columns names(FLIGHTS) # view names of columns head(FLIGHTS) # verify data.frame appears as expected # Basic operations select(FLIGHTS, year, month, day, dep_delay, arr_delay) %>% head() # select columns select(FLIGHTS, -year,-month, -day) %>% head() # exclude columns select(FLIGHTS, tail_num = tailnum) %>% head() # rename columns, but drops others rename(FLIGHTS, tail_num = tailnum) %>% head() # rename columns filter(FLIGHTS, month == 1, day == 1) %>% head() # filter rows filter(FLIGHTS, dep_delay > 240) %>% head() filter(FLIGHTS, month == 1 | month == 2) %>% head() arrange(FLIGHTS, year, month, day) %>% head() # sort rows by specified columns arrange(FLIGHTS, desc(arr_delay)) %>% head() # sort in descending order distinct(FLIGHTS, tailnum) %>% head() # see distinct values distinct(FLIGHTS, origin, dest) %>% head() # see distinct pairs mutate(FLIGHTS, speed = air_time / distance) %>% head() # compute and add new columns mutate(FLIGHTS, # keeps existing columns gain = arr_delay - dep_delay, speed = distance / air_time * 60) %>% head() transmute(FLIGHTS, # only keeps new computed columns gain = arr_delay - dep_delay, gain_per_hour = (arr_delay - dep_delay) / (air_time / 60)) %>% head() summarise(FLIGHTS, # aggregates the specified column values mean_delay = mean(dep_time,na.rm=TRUE), min_delay = min(dep_time,na.rm=TRUE), max_delay = max(dep_time,na.rm=TRUE), sd_delay = sd(dep_time,na.rm=TRUE))
Functions supported include:
With OREdplyr, ORE expands into the tidyverse giving R users another powerful way to manipulate database data, while avoiding costly data movement. This enables R users and data scientists to work with larger data volumes and not be constrained by R client memory, the latency associated with data movement, or single threaded execution, but can take advantage of the powerful database technology present in Oracle Database.