
Best practices, news, tips and tricks - learn about Oracle's R Technologies for Oracle Database and Big Data

Tuesday, June 6, 2017

Diabetes Data Analysis in R

By: Jie Liu | Data Scientist

Data collected from diabetes patients is widely investigated in data science applications. Popular data sets include the PIMA Indians Diabetes Data Set and the Diabetes 130-US hospitals for years 1999-2008 Data Set. Both data sets are aggregated, labeled, and relatively straightforward to use in further machine learning tasks. However, in the real world, diabetes data are often collected from healthcare instruments attached to patients, and the raw data can be sporadic and messy, so analyzing it requires more preprocessing. In this blog, we explore an interesting diabetes data set to demonstrate the powerful data manipulation capability of R with Oracle R Enterprise (ORE), a component of Oracle Advanced Analytics - an option to Oracle Database Enterprise Edition. Note that this analysis is a machine learning study only: we are not medical researchers or physicians in the diabetes domain, and our knowledge of the disease comes from the material included with the data set.

Data Overview

The data is from the UCI Machine Learning Repository. It was collected from electronic recording devices as well as paper records for 70 diabetes patients. For each patient, there is a file containing 3-4 months of glucose level measurements and insulin dosages, as well as other special events (exercise, meal ingestion, etc.). First, we need to construct a data frame from the 70 separate files. This can be readily accomplished in R as follows; however, if the data were provided as several database tables, the function rbind, overloaded by ORE to work on ore.frame objects, could be used to union these tables.

dd.list <- vector("list", 70)          # one data frame per patient file
for(i in 1:70) {
  fileName <- sprintf("data-%02d", i)
  dd <- read.csv(fileName, header=FALSE, sep='\t')
  datetime.vec <- paste(dd$V1, dd$V2)  # combine the date and time columns
  dd$datetime <- as.POSIXct(strptime(datetime.vec, "%m-%d-%Y %H:%M"))
  colnames(dd) <- c('DATE', 'TIME', 'CODE', 'VALUE', 'DATETIME')
  dd$CODE <- as.factor(dd$CODE)
  dd.list[[i]] <- data.frame(ID=i, dd)
}
dd.df <- do.call("rbind", dd.list)
dd.df <- subset(dd.df, !is.na(dd.df$DATETIME))
dd.df$NO <- row.names(dd.df)
head(dd.df)
ID       DATE  TIME CODE VALUE            DATETIME
1 04-21-1991  9:09   58   100 1991-04-21 09:09:00
1 04-21-1991  9:09   33     9 1991-04-21 09:09:00
1 04-21-1991  9:09   34    13 1991-04-21 09:09:00
1 04-21-1991 17:08   62   119 1991-04-21 17:08:00
1 04-21-1991 17:08   33     7 1991-04-21 17:08:00
1 04-21-1991 22:51   48   123 1991-04-21 22:51:00

We can store the data frame in Oracle Database using the ORE function ore.create.

library(ORE)
ore.connect(...)  # connect to Oracle Database
ore.drop(table="DD")
ore.create(dd.df, table="DD")
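
As an aside, if the per-patient data had instead been loaded as separate database tables, the ORE-overloaded rbind mentioned above could union them directly in the database. Here is a minimal sketch, assuming two hypothetical tables DD1 and DD2 with identical columns:

ore.sync(table = c("DD1", "DD2"))  # expose the tables as ore.frame proxies
dd.all <- rbind(DD1, DD2)          # union performed in the database; no data pulled
class(dd.all)                      # "ore.frame"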

The column ID represents the patient ID, and DATETIME is the timestamp when the event or measurement occurred. The column CODE identifies the particular type of measurement or event; the exact mapping can be found in the 'Data-Codes' file. Here is a sample of the codes.

33 = Regular insulin dose
34 = NPH insulin dose
35 = UltraLente insulin dose
48 = Unspecified blood glucose measurement
58 = Pre-breakfast blood glucose measurement
62 = Pre-supper blood glucose measurement
65 = Hypoglycemic symptoms
66 = Typical meal ingestion
69 = Typical exercise activity
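
For exploration, it can be handy to keep this mapping as a named lookup vector in R; a small sketch using a few of the codes above:

code.labels <- c('33' = 'Regular insulin dose',
                 '34' = 'NPH insulin dose',
                 '35' = 'UltraLente insulin dose',
                 '58' = 'Pre-breakfast blood glucose measurement',
                 '65' = 'Hypoglycemic symptoms')
head(code.labels[as.character(dd.df$CODE)])  # translate CODE values to labels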

What can we do with this type of data? In the raw data files, data points are recorded in a 'transaction' style, and the time intervals between them are irregular. Also, the data set does not have any clear label or indicator, which makes it difficult to work with directly. Therefore, we need to preprocess the data for machine learning tasks.
Next, we will show how to leverage the data to carry out several analyses.

Clustering Analysis

Since the patients may differ in the severity of their symptoms and in their treatment (such as insulin dose), we first conduct a clustering analysis to see if there are underlying groups.
For now, we ignore the timestamps and aggregate at the patient level. We calculate the average value for each code, and each average code value is then used as a feature. Note that for an event CODE the VALUE is always zero, since it only indicates that an event happened at that time; in that case, we instead calculate the average number of occurrences per day.
For each patient, we combine this information into a feature vector. Here, we need to transpose the data frame, converting each CODE into a separate column. This is a 'pivot table' operation, which can be done in R with the reshape2 package.
We can use reshape2 through ORE embedded R execution. See the code below.

aggregate.code <- function(DD, full_code_list, event_list){
  # Average measurement value per patient for each measurement CODE
  code_filtered <- subset(DD, CODE %in% full_code_list)
  code_agg <- aggregate(code_filtered$VALUE,
        by=list(ID=code_filtered$ID, CODE=code_filtered$CODE),
        FUN=mean,
        na.rm=TRUE)
  colnames(code_agg)[3] <- 'MEASURE'
  # Event codes carry no value, so count their occurrences instead
  activity_filtered <- subset(DD, CODE %in% event_list)
  activity_count <- aggregate(activity_filtered$VALUE,
        by=list(ID=activity_filtered$ID, CODE=activity_filtered$CODE),
        FUN=length)
  colnames(activity_count)[3] <- 'MEASURE'
  # Normalize event counts to an average number of occurrences per day
  total.date <- length(unique(DD$DATE))
  activity_count$MEASURE <- activity_count$MEASURE*1.0/total.date
  feature.agg <- rbind(code_agg, activity_count)
  feature.df <- aggregate(feature.agg$MEASURE,
        by=list(ID=feature.agg$ID, CODE=feature.agg$CODE),
        FUN=max)
  # Pivot: one row per patient, one column per CODE
  library(reshape2)
  cast.df <- dcast(feature.df, ID~CODE)
  for (col in colnames(cast.df)){
    cast.df[,col] <- ifelse(is.na(cast.df[,col]), 0, cast.df[,col])
  }
  ore.drop(table='PIVOTED')
  ore.create(cast.df, table='PIVOTED')
  TRUE
}
full_code_list <- c(33, 34, 35, 48, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72)
event_list <- c(65, 66, 67, 68, 69, 70, 71, 72)
ore.tableApply(DD,                 # an ore.frame referencing the prepared database table 
               aggregate.code,     # function defined above to execute on table 
               ore.connect = TRUE, # allows creating resulting table in function
               full_code_list = full_code_list, # data for function args
               event_list = event_list)
ore.sync(table = "PIVOTED")  # make new table accessible in client session

Now we have an ore.frame, PIVOTED, that contains the average value of each CODE for each patient ID. The entire operation is done at the database server, and the result is stored as a database table. We can have a look at the resulting data set below.

head(PIVOTED[,1:7])
ID        33        34       35       48     57      58        
1  6.593750 16.892086 0.000000 150.1538   0.00 169.7185  
2 10.060847 11.333333 0.000000 201.4022   0.00 207.8438  
3  2.433333  8.000000 8.452055   0.0000 120.50 117.6750  
4  2.304348  8.413793 8.444444   0.0000 142.75 141.5714  
5  2.388889  8.500000 0.000000   0.0000 183.40 147.4324  
6  6.084746 18.000000 0.000000 246.5556   0.00 213.5238  

For ease of presentation, we omitted the other CODE values; in total, there are 20 types of CODE. The next job is to cluster this data. Clustering on high-dimensional data is usually not ideal, since the distances between data points tend to be large. It is preferable to run PCA first and cluster on the principal components (PCs), which indicate the directions in feature space with the highest variation.
In ORE, the following code carries out the PCA. Note that the function looks the same as in open source R - it is overloaded by ORE.

dd.pca <- prcomp(PIVOTED[, -1], # use the ore.frame and exclude ID column
                 center = TRUE,
                 scale. = TRUE)
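
Before clustering, it is worth checking how much of the total variance the leading PCs capture. Assuming the object returned by the overloaded prcomp exposes the component standard deviations like its open source counterpart, a quick check:

pc.var <- dd.pca$sdev^2
round(pc.var / sum(pc.var), 3)  # proportion of variance explained per PC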

We can convert the original data frame to the space of principal components (PCs) using the code below.

dd.pca.ore <- ore.predict(dd.pca, PIVOTED[, -1])

In the principal component space, we use k-means to cluster on the first two PCs.

model.km <- ore.odmKMeans(~., dd.pca.ore[, c("PC1", "PC2")], num.centers=5)
km.res <- predict(model.km, dd.pca.ore, type="class", supplemental.cols=c("PC1","PC2"))

To generate the plot, we use ggplot2, which can also be used through ORE embedded R execution. For simplicity here, however, we pull the data from the database and call ggplot2 on the client.

km.res.df <- ore.pull(km.res)
ggplot(km.res.df, aes(PC1, PC2)) +
   geom_point(aes(color=factor(CLUSTER_ID)))
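
For reference, the same plot could also be produced server-side through embedded R execution, which renders the image back at the client; a sketch, assuming ggplot2 is installed in the R engine on the database server:

ore.tableApply(km.res, function(dat) {
  library(ggplot2)
  # build the plot inside the database-side R engine
  print(ggplot(dat, aes(PC1, PC2)) + geom_point(aes(color = factor(CLUSTER_ID))))
})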

The clusters are illustrated in the plot below.
[Figure: scatter plot of patients in the PC1-PC2 plane, colored by cluster]

A natural question is: what are the essential differences between the clusters? We plot boxplots of several glucose level and insulin dose features to get an idea of how the patients' characteristics are distributed across clusters.

pivoted <- ore.pull(PIVOTED)
library(ggplot2)
library(gridExtra)
# Prefix the CODE columns so they are valid R column names
colnames(pivoted) <- c('ID', paste("CODE", colnames(pivoted[,-1]), sep = '_'))
label.df <- cbind(pivoted, km.res.df$CLUSTER_ID)
colnames(label.df)[ncol(label.df)] <- 'CLUSTER_ID'
p1 <- ggplot(label.df, aes(factor(CLUSTER_ID), CODE_35)) + geom_boxplot() +
  labs(x = "CLUSTER", y = 'UltraLente insulin dose') +
  ggtitle("Boxplot of Selected Features across Patient Groups") +
  theme(plot.title = element_text(hjust = 0.5))
p2 <- ggplot(label.df, aes(factor(CLUSTER_ID), CODE_65)) + geom_boxplot() +
  labs(x = "CLUSTER", y = 'Hypoglycemic symptoms')
p3 <- ggplot(label.df, aes(factor(CLUSTER_ID), CODE_69)) + geom_boxplot() +
  labs(x = "CLUSTER", y = 'Typical exercise')
p4 <- ggplot(label.df, aes(factor(CLUSTER_ID), CODE_62)) + geom_boxplot() +
  labs(x = "CLUSTER", y = 'Pre-supper glucose')
grid.arrange(p1, p2, p3, p4, nrow = 4)

[Figure: boxplots of selected features (insulin dose, hypoglycemic symptoms, exercise, pre-supper glucose) across patient clusters]

From these boxplots, we can see that the differences between the patient clusters involve insulin dose, frequency of hypoglycemic symptoms, frequency of exercise, and pre-supper glucose level. One interesting observation is that the patients with more exercise (cluster 8) have relatively lower glucose levels than a group with a similar insulin dose (cluster 5), but more frequent hypoglycemic symptoms (which occur because of low glucose levels).
Although this clustering analysis sheds some light on differences among patients, it discards much of the information in the timestamped data. Next, we will see how to leverage the data without aggregation.

Regression Analysis

One important question for a diabetes patient is estimating the glucose level in order to control the insulin dose. This topic has been widely studied and is an important direction in systems biology. Previous work had access to timestamped data recorded at high frequency (every 15 minutes), but in this data set the time intervals between data points are large and irregular.
We can still do regression on the glucose level by considering all factors within a time window. For a glucose measurement at a particular time, we focus on all insulin doses and events in the 14 hours before that time. A regression model can then be built from the insulin doses of each type and the preceding events.
To do the regression, we need to create a data frame that contains the target glucose level and all features related to it. For that purpose, we go through each row of the original data frame, find all rows within the time window, and collect the features for that data point. This is done in the following function.

related.row <- function(row, dd.df){
  # All rows for the same patient within the 14 hours before this measurement
  time.diff <- as.numeric(difftime(row$DATETIME, dd.df$DATETIME, units = "secs"))
  range.df <- subset(dd.df, ID == row$ID & time.diff < 14*3600 & time.diff > 0)
  range.df$TIME_DIFF <- as.numeric(difftime(row$DATETIME, range.df$DATETIME, units = "hours"))
  # Most recent event in the window; event code 67 is mapped to 1, all others to -1
  events.df <- subset(range.df, CODE %in% c(66, 67, 68, 69, 70, 71))
  events.df <- events.df[which.max(events.df$DATETIME), ]
  events.df$LAST_EVENT <- ifelse(events.df$CODE %in% c(67), 1, -1)
  col.names <- c('C33', 'T33', 'C34', 'T34', 'C35', 'T35', 'LAST_EVENT', 'EVENT_TIME', 'LAST_GLU', 'LAST_TIME', 'TARGET')
  # Most recent glucose measurement in the window
  glu.df <- subset(range.df, CODE %in% c(48, 57, 58, 59, 60, 61, 62, 63, 64))
  last.glu <- glu.df[which.max(glu.df$DATETIME), ]
  # Most recent dose of each insulin type in the window
  C33 <- subset(range.df, CODE == 33)
  C33 <- C33[which.max(C33$DATETIME), c('VALUE', 'TIME_DIFF')]
  C34 <- subset(range.df, CODE == 34)
  C34 <- C34[which.max(C34$DATETIME), c('VALUE', 'TIME_DIFF')]
  C35 <- subset(range.df, CODE == 35)
  C35 <- C35[which.max(C35$DATETIME), c('VALUE', 'TIME_DIFF')]
  row.list <- list(C33$VALUE, C33$TIME_DIFF, C34$VALUE, C34$TIME_DIFF, C35$VALUE, C35$TIME_DIFF,
                   events.df$LAST_EVENT, events.df$TIME_DIFF, last.glu$VALUE, last.glu$TIME_DIFF, row$VALUE)
  # Replace missing entries (empty subsets) with 0
  row.result <- lapply(row.list, function(x) { ifelse(length(x)==0, 0, x)})
  new.df <- data.frame(row.result)
  colnames(new.df) <- col.names
  new.df[1,]
}

combine.row <- function(DD.glu, dd.df, related.row) {
  N.glu <- nrow(DD.glu)
  # Seed the result with the first row, then append the rest
  row <- DD.glu[1,]
  train.df <- related.row(row, dd.df)
  for(i in 2:N.glu){
    row <- DD.glu[i,]
    new.row <- related.row(row, dd.df)
    if(all(new.row[1,]==0) != TRUE)   # skip rows with no features in the window
      train.df <- rbind(train.df, new.row)
  }
  train.df <- train.df[-1,]
  ore.drop(table='TRAIN')
  ore.create(train.df, table='TRAIN')
  TRUE
}
DD.glu <- subset(DD, CODE %in% c(48, 57, 58, 60, 62))
row.names(DD.glu) <- DD.glu$NO
res <- ore.tableApply(DD.glu, 
                      combine.row, 
                      dd.df = DD, related.row = related.row,
                      ore.connect = TRUE)
ore.sync(table ='TRAIN')  

We can have a look at the data frame obtained:

C33    T33 C34    T34    C35     T35  LAST_EVENT EVENT_TIME LAST_GLU LAST_TIME TARGET
1 12.483333   0 0.000000   8 12.48333          1  12.483333      220 12.516667    118
4  9.233333   8 9.233333   0  0.00000          1   9.233333      272  9.283333    213
0  0.000000   0 0.000000   0  0.00000         -1   0.850000      222  7.883333     71
0  0.000000   0 0.000000   0  0.00000         -1   6.100000      222 13.133333    193
4  1.883333   4 1.883333   0  0.00000          1   1.733333       70  5.383333    134
4  5.300000   4 5.300000   0  0.00000          1   5.150000       70  8.800000    281

On an i5-based laptop with 16 GB of memory, this normally takes about 6 minutes. The data set itself has only 29,244 rows, so if this performance is not acceptable, we can further optimize the solution by using native SQL alongside our R code. For data scientists comfortable with SQL, the ability to leverage SQL in conjunction with R and ore.frame objects is a significant advantage of ORE.
Let us review this process. The operation needs to join each glucose measurement row back to the original data set ('DD') and filter on timestamps within a certain time window. Doing the entire process in R via iteration is expensive.
One solution is to use an Oracle SQL query to do the join and then perform the subsequent operations in parallel using groupApply in ORE embedded R execution. The relational database optimizes table join performance, so we can take advantage of that with the following SQL query.

ore.exec("CREATE TABLE DD_GLU_AGG AS
  SELECT ID, GLU_NO, GLUCOSE, CODE, VALUE,
  EXTRACT(HOUR FROM GLU_TIME - DATETIME) + EXTRACT(MINUTE FROM GLU_TIME - DATETIME)/60 AS TIME_DIFF
  FROM (
    SELECT ID, GLU_NO, GLUCOSE, GLU_TIME, CODE, VALUE, DATETIME,
      row_number() OVER ( PARTITION BY  ID, GLU_NO, GLUCOSE, GLU_TIME, CODE ORDER BY DATETIME NULLS LAST) AS RANK
    FROM (
     SELECT a.ID, GLU_NO, GLUCOSE, GLU_TIME,
       CASE WHEN b.CODE IN (48, 57, 58, 59, 60, 61, 62, 63, 64) THEN '0' ELSE
        (CASE WHEN b.CODE IN (66, 68, 69, 70, 71) THEN '-1' ELSE
          (CASE WHEN b.CODE = 67 THEN '1' ELSE b.CODE END) END) END AS CODE,
            b.VALUE, b.DATETIME
     FROM (
       SELECT ID, NO AS GLU_NO, VALUE AS GLUCOSE, DATETIME AS GLU_TIME
       FROM DD
       WHERE CODE IN (48, 57, 58, 60, 62)
     ) a
     JOIN DD b
     ON (GLU_TIME - b.DATETIME) < INTERVAL '14' HOUR
     AND (GLU_TIME - b.DATETIME) > INTERVAL '0' HOUR
     AND a.ID = b.ID) c
     ORDER BY ID, GLU_NO, GLUCOSE, GLU_TIME, CODE, VALUE, DATETIME) d
     WHERE RANK = 1")
ore.sync(table='DD_GLU_AGG')

This query generates a table that looks like this:

ID GLU_NO GLUCOSE CODE VALUE TIME_DIFF
1    105     282    0   183  13.13333
1    105     282   33    10  13.13333
1    105     282   34    14  13.13333
1    105     282   65     0   6.30000
1    107      91    0   282   9.40000
1    107      91   33     7  13.70000

For each (ID, GLU_NO) pair, the table keeps the latest occurrence of each related CODE, along with its value and its time difference from the glucose measurement. Next, we can use ore.groupApply to do the rest of the job in parallel.

form.row <- function(DD_GLU_AGG){
  C33 <- subset(DD_GLU_AGG, CODE == 33)
  C34 <- subset(DD_GLU_AGG, CODE == 34)
  C35 <- subset(DD_GLU_AGG, CODE == 35)
  events.df <- subset(DD_GLU_AGG, CODE %in% c(1,-1))
  events.df$VALUE <- ifelse(events.df$CODE == -1, -1, 1)
  last.glu <- subset(DD_GLU_AGG, CODE == 0)
  row.list <- list(C33$VALUE, C33$TIME_DIFF, C34$VALUE, C34$TIME_DIFF, C35$VALUE, C35$TIME_DIFF,
                   events.df$VALUE, events.df$TIME_DIFF, last.glu$VALUE, last.glu$TIME_DIFF, DD_GLU_AGG$GLUCOSE[1])
  row.result <- lapply(row.list, function(x) { ifelse(length(x)==0, 0, x)})
  
  new.df <- data.frame(row.result)
  # C33 is the regular insulin dose; T33 is the time between that injection
  # and the glucose measurement (similarly for C34/T34 and C35/T35)
  col.names <- c('C33', 'T33', 'C34', 'T34', 'C35', 'T35', 'LAST_EVENT', 'EVENT_TIME', 'LAST_GLU', 'LAST_TIME', 'TARGET')
  colnames(new.df) <- col.names
  new.df[1,]
}

Train <- ore.groupApply(DD_GLU_AGG,    # ore.frame proxy for database table
  DD_GLU_AGG[, c("ID", "GLU_NO")],     # columns for partitioning data
  form.row,
  FUN.VALUE = data.frame(C33 = integer(0),  # define resulting table structure
                         T33 = numeric(0),
                         C34 = integer(0),
                         T34 = numeric(0),
                         C35 = integer(0),
                         T35 = numeric(0),
                         LAST_EVENT = integer(0),
                         EVENT_TIME = numeric(0),
                         LAST_GLU = numeric(0),
                         LAST_TIME = numeric(0),
                         TARGET = numeric(0)),
  parallel = 4)
# materialize the ORE frame.
train.df <- ore.pull(Train)

Using this approach, the entire process takes around 1 minute, a 6x performance improvement. The ORE framework allows us to run the in-database queries and R analytics without moving the data off the database server.
After the data is prepared, we can go forward and build regression models. For simplicity, we treat all data points as homogeneous, meaning we assume all patients respond to insulin doses in the same way. A regression model is fit to this data using ORE's parallel implementation of lm: ore.lm.

model.formula <- formula(TARGET ~ C33 + T33 + C34 + T34 + C35 + T35 + LAST_EVENT + EVENT_TIME + LAST_GLU + LAST_TIME)
model.lm <- ore.lm(model.formula, TRAIN)
summary(model.lm)
Call:
ore.lm(formula = model.formula, data = TRAIN)
Residuals:
    Min      1Q  Median      3Q     Max
-488.69  -59.35   -7.54   46.96  370.25
Coefficients:
              Estimate Std. Error t value Pr(>|t|)   
(Intercept) 132.316609   2.770239  47.764  < 2e-16 ***
C33           1.699390   0.143044  11.880  < 2e-16 ***
T33           2.339259   0.244046   9.585  < 2e-16 ***
C34          -0.384308   0.068708  -5.593 2.29e-08 ***
T34          -0.171930   0.102842  -1.672  0.09460 . 
C35           0.278242   0.122206   2.277  0.02282 * 
T35          -0.285301   0.114634  -2.489  0.01283 * 
LAST_EVENT   -0.150663   2.852225  -0.053  0.95787   
EVENT_TIME   -0.206862   0.302349  -0.684  0.49388   
LAST_GLU      0.003885   0.009028   0.430  0.66696   
LAST_TIME     0.580310   0.182167   3.186  0.00145 **
---
 
Residual standard error: 77.27 on 9952 degrees of freedom
Multiple R-squared:  0.04094,   Adjusted R-squared:  0.03998
F-statistic: 42.48 on 10 and 9952 DF,  p-value: < 2.2e-16

We can see that most of the features have a significant effect on the response, but the R-squared value is low. This is because we use all patients' data, and the variance of the glucose level under the same conditions can be high across patients.
Moreover, we can fit a decision tree to see the effect of insulin dose and time of injection on the glucose level. For better visualization, we use the conditional inference tree implementation in the party package.

train.df <- ore.pull(TRAIN)
library(party)
model.ct <- ctree(model.formula,
                  data=train.df)
plot(model.ct, main="Conditional Inference Tree for GLU")

[Figure: conditional inference tree for glucose level, with boxplots of glucose in each partition]

The plot provides boxplots of the glucose level in the different partitions. It is easy to understand that for larger doses of UltraLente insulin (C35) the associated glucose level is lower, and similarly for NPH insulin (C34). However, the partition related to regular insulin (C33) indicates that patients with a higher dose of regular insulin tend to have higher glucose levels, which seems paradoxical. One explanation is that patients with severe symptoms tend to take more regular insulin. This reminds us that the data only provide evidence of correlation, not causality.

Statistical Test of Hypoglycemic Symptoms

One of the symptoms recorded in the data set is the hypoglycemic symptom, which is supposed to occur when the patient's glucose level is too low. The following plot illustrates the occurrences of this symptom: the blue dots are one patient's glucose levels and the red vertical lines mark the occurrences of hypoglycemic symptoms. We can see that most of the symptoms are associated with low glucose levels.

[Figure: one patient's glucose levels (blue dots) with hypoglycemic symptom occurrences (red vertical lines)]

To verify this statistically, we can run a t-test to check whether there is a significant difference in glucose level.
Here is the code for this analysis. It goes through each hypoglycemic event and finds the nearest glucose measurement (within 15 minutes), then compares the glucose measurements associated with a hypoglycemic event against those that are not.

rm(list=ls())
library(ORE)
options(ore.warn.order=FALSE)
ore.connect(...)
ore.ls()
DD.hypo <- subset(DD, CODE==65)
rownames(DD.hypo) <- DD.hypo$NO
rownames(DD) <- DD$NO
row.ahead <- function(row, dd.df){
  # Rows for the same patient within 15 minutes of the hypoglycemic event
  time.diff <- as.numeric(difftime(row$DATETIME, dd.df$DATETIME, units = "secs"))
  range.df <- subset(dd.df, ID == row$ID & time.diff < 0.25*3600 & time.diff > -0.25*3600)
  glu.df <- subset(range.df, CODE %in% c(48, 57, 58, 59, 60, 61, 62, 63, 64))
  glu.df$TIME_DIFF <- as.numeric(difftime(row$DATETIME, glu.df$DATETIME, units = "hours"))
  events.df <- subset(range.df, CODE %in% c(66, 67, 68, 69, 70, 71))
  events.df$TIME_DIFF <- as.numeric(difftime(row$DATETIME, events.df$DATETIME, units = "hours"))
  events.df <- events.df[which.max(events.df$DATETIME), ]
  events.df$LAST_EVENT <- ifelse(events.df$CODE %in% c(67), 1, -1)
  col.names <- c('LAST_EVENT', 'EVENT_TIME', 'LAST_NO', 'LAST_GLU', 'MEASURE_TIME', 'HYPO_TIME')
  if(nrow(glu.df) == 0){
    empty.df <- data.frame(as.list(rep(0, length(col.names))))
    colnames(empty.df) <- col.names
    return(empty.df)
  }
  # Glucose measurement closest in time to the hypoglycemic event
  last.glu <- glu.df[which.min(abs(glu.df$TIME_DIFF)), ]
  row.list <- list(ifelse(nrow(events.df)==0, 0, events.df$LAST_EVENT),
                   ifelse(nrow(events.df)==0, 0, events.df$TIME_DIFF),
                   ifelse(nrow(last.glu)==0, 0, last.glu$NO),
                   ifelse(nrow(last.glu)==0, 0, last.glu$VALUE),
                   ifelse(nrow(last.glu)==0, 0, last.glu$TIME_DIFF),
                   row$DATETIME)
  row.result <- lapply(row.list, function(x) { ifelse(is.na(x), 0, x)})
  new.df <- data.frame(row.result)
  colnames(new.df) <- col.names
  new.df[1,]
}
combine.rows <- function(DD.hypo, dd.df, row.ahead){  
  N.glu <- nrow(DD.hypo)
  row <- DD.hypo[1,]
  hypo.df <- row.ahead(row, dd.df)
  for(i in 2:N.glu){
    row <- DD.hypo[i,]
    new.row <- row.ahead(row,dd.df)
    if(all(new.row[1,]==0) != TRUE)
      hypo.df <- rbind(hypo.df, new.row)
  }  
  hypo.df <- hypo.df[-1,]
  ore.drop(table='HYPO')
  ore.create(hypo.df, table='HYPO')
  TRUE
}
res <- ore.tableApply(DD.hypo,
                      combine.rows,
                      dd.df = DD,
                      row.ahead = row.ahead,
                      ore.connect = TRUE)
ore.sync(table ='HYPO')
hypo.df <- ore.pull(HYPO)
DD.glu <- subset(DD, (CODE %in% c(48, 57, 58, 59, 60, 61, 62, 63, 64)))
dd.glu.df <- ore.pull(DD.glu)
boxplot(hypo.df$LAST_GLU)  # quick look at glucose levels near hypoglycemic events
dd.glu.df$HYPO = ifelse(dd.glu.df$NO %in% hypo.df$LAST_NO, 1, 0 )
library(ggplot2)
dd.glu.df$HYPO <- as.factor(dd.glu.df$HYPO)
p <- ggplot(dd.glu.df, aes(HYPO, VALUE))
p + geom_boxplot()

[Figure: boxplots of glucose measurements with (HYPO=1) and without (HYPO=0) an associated hypoglycemic symptom]

t.test(dd.glu.df[dd.glu.df$HYPO==1,]$VALUE,dd.glu.df[dd.glu.df$HYPO==0,]$VALUE)
Welch Two Sample t-test
data:  dd.glu.df[dd.glu.df$HYPO == 1, ]$VALUE and dd.glu.df[dd.glu.df$HYPO == 0, ]$VALUE
t = -35.265, df = 251.52, p-value < 2.2e-16
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -106.35820  -95.10693
sample estimates:
mean of x mean of y
 61.18304 161.91560

The p-value is far below 0.05, so we reject the null hypothesis that the two groups have the same mean glucose level. This provides statistical evidence of the relationship between glucose level and the hypoglycemic symptom.

Conclusion

In this blog, we demonstrated the data wrangling and analysis capabilities of R and ORE on a diabetes data set. A workable data set was successfully created from the raw data. Based on that data set, clustering and decision tree analyses, together with visualization, provided useful insights into the data, which can help in evaluating the effect of treatments for diabetes patients.
