While browsing the chapter on Circular Distributions in Zar’s Biostatistical Analysis, I came across an interesting visualization for circular data. Circular scale data is all around: the days of the week, months of the year, hours of the day, degrees on the compass. Defined technically, circular scale data is a special type of interval scale data where the scale has equal intervals, no true zero point, and there is no rational high or low values, or if there are, they are arbitrarily designated. Consider days of the week, each day essentially has the same length, and although we may depict a week starting on Sunday on a typical calendar there is no true zero point. Further, aside from ascribing to Wednesdays a sense of the “hump” of the week we need to “get over,” there are no high or low values.
Relating this to R and Oracle R Enterprise, you may want to visualize such data using a scatter diagram that’s shown on the circumference of a circle. I found a package in R that does this called CircStats. In this blog post, we’ll focus on one of the functions, circ.plot, to see if it can be used with an ore.frame, and then to see how it scales to bigger data. While it is possible for open source packages to work with ORE proxy objects, it is not necessarily the case. As we'll see, circ.plot does work with an ore.numeric proxy object, but this function encounters scalability issues, which we'll address. Let's start with a variation of the example provided with circ.plot. It should be noted that circ.plot is more than adequate for a wide range of uses, but the goal of this post is to explore issues around scalability.
library(CircStats) data.vm100 <- rvm(100, 0, 3) circ.plot(data.vm100)
After loading the library, we generate 100 observations from a von Mises distribution, with a mean of 0 and a concentration of 3. We then plot the data using a scatter diagram, which places each point on the circle, as depicted in the following graph.
However, if the data are too dense, you don’t get a good sense of where most of the data are, especially when points may overlap or be over-plotted. To address this problem, there’s an option to bin and stack the counts around the circle. This gives us a much clearer picture of where the data are concentrated. In the following invocation, we set the stack parameter to TRUE and the number of bins to 150.
circ.plot(data.vm100, stack=TRUE, bins=150)
So now, let’s see if this function will work on database data that could be part of an ore.frame (ORE's proxy object to a database table), in this case, an ore.numeric vector. The class ore.frame is a subclass of data.frame, where many of the functions are overloaded to have computations performed by Oracle Database. Class ore.numeric implements many of the functions available for class numeric.
We can take the same data, and push it to the database to create an ore.frame, i.e., a proxy object corresponding to a table in Oracle Database. In this particular case, without changing a line of code in the circ.plot function, we’re able to transparently supply an ore.numeric vector.
data.vm100.ore <- ore.push(data.vm100) circ.plot(data.vm100.ore) circ.plot(data.vm100.ore, stack=TRUE, bins=150) class(data.vm100.ore) [1] "ore.numeric" attr(,"package") [1] "OREbase"
The exact same plots as above were generated. There was one notable difference; the performance of the circ.plot with the ore.numeric vector took longer – taking about 1.4 seconds, compared with 0.03 seconds for R. Note both Oracle Database and R are executing locally on my 4 CPU laptop. This isn’t surprising since for such small data the communication to the database takes more time than the actual computation.
> system.time(circ.plot(data.vm100.ore, stack=TRUE, bins=150)) user system elapsed 1.07 0.02 1.43 > system.time(circ.plot(data.vm100, stack=TRUE, bins=150)) user system elapsed 0.01 0.01 0.03
Let’s try to scale this to 100K observations and see what happens. As depicted in the following graph, we don’t get much value from this visualization – without stacking. While there appear to be fewer data points around 180 degrees, we don’t get a picture of the actual distribution elsewhere, certainly no notion of concentration.
data.vm100k <- rvm(100000, 0, 3) circ.plot(data.vm100k,main="100K Points")
For concentration, we need to use the stacked graph, as shown in the following plot. Because of the number of points, we can infer concentration between 90 degrees through 0 to 270 degrees, but the data point go past the edge of the image. This took about 9 seconds plus some lag to actually display the data points. Notice that we used shrink = 2 to reduce the size of the inner circle.
system.time(circ.plot(data.vm100k, stack=TRUE, bins=150,main="100K Points",shrink=2)) user system elapsed 8.86 0.13 9.07
If we set the dotsep argument, which specifies the distance between stacked points (default is 40), to 1000, we begin to see the picture more clearly.
Repeating the experiment with the ore.numeric vector, we again get the same graph results (not shown). However, the performance was slower for the binned option. Why is this?
> system.time(circ.plot(data.vm100k.ore, main="100K Points")) user system elapsed 0.50 0.50 1.15 > system.time(circ.plot(data.vm100k.ore, stack=TRUE, bins=150, main="100K Points", shrink=2)) 7.73 0.06 11.73 > system.time(x.pull <- ore.pull(data.vm100k.ore)) user system elapsed 0.03 0.00 0.12
Let’s look at what is going on inside the circ.plot function. We’ll focus on two issues for scalability. The first (highlighted in red) is a for-loop that bins the data by testing if the value is within the arc segment and counting (that is, summing) the result. This sum produces a count for the bin. Notice that for each bin, the code makes a full scan of the data. The second scalability issue (highlighted in blue) involves a set of nested for-loops that draws a character for each value assigned to the bin. If a bin has a value of 20, then 20 characters are drawn, if it has 1 million, it draws 1 million characters!
circ.plot <- function (x, main = "", pch = 16, stack = FALSE, bins = 0, cex = 1, dotsep = 40, shrink = 1) { x <- x%%(2 * pi) if (require(MASS)) { eqscplot(x = cos(seq(0, 2 * pi, length = 1000)), y = sin(seq(0, 2 * pi, length = 1000)), axes = FALSE, xlab = "", ylab = "", main = main, type = "l", xlim = shrink * c(-1, 1), ylim = shrink * c(-1, 1), ratio = 1, tol = 0.04) lines(c(0, 0), c(0.9, 1)) text(0.005, 0.85, "90", cex = 1.5) lines(c(0, 0), c(-0.9, -1)) text(0.005, -0.825, "270", cex = 1.5) lines(c(-1, -0.9), c(0, 0)) text(-0.8, 0, "180", cex = 1.5) lines(c(0.9, 1), c(0, 0)) text(0.82, 0, "0", cex = 1.5) text(0, 0, "+", cex = 2) n <- length(x) z <- cos(x) y <- sin(x) if (stack == FALSE) points(z, y, cex = cex, pch = pch) else { bins.count <- c(1:bins) arc <- (2 * pi)/bins for (i in 1:bins) { bins.count[i] <- sum(x <= i * arc & x > (i - 1) * arc) } mids <- seq(arc/2, 2 * pi - pi/bins, length = bins) index <- cex/dotsep for (i in 1:bins) { if (bins.count[i] != 0) { for (j in 0:(bins.count[i] - 1)) { r <- 1 + j * index z <- r * cos(mids[i]) y <- r * sin(mids[i]) points(z, y, cex = cex, pch = pch) } } } } } else { stop("To use this function you have to install the package MASS (VR)\n") } }
When we profile the function above providing an ore.numeric vector to the function, we find that nearly all of the time is spent in the for-loop in red. Can we eliminate this for-loop and the corresponding full data scans since this greatly limits scalability?
To compute bins, we can take advantage of overloaded in-database ORE Transparency Layer functions to compute the needed bin boundaries: round and tabulate. Here’s a function that computes bins using standard R syntax and leveraging the ORE Transparency Layer.
countbins <- function(x, nbins, range_x = range(x, na.rm = TRUE)) { nbins <- as.integer(nbins)[1L] if (is.na(nbins) || nbins < 1L) stop("invalid 'nbins' argument") min_x <- range_x[1L] max_x <- range_x[2L] scale <- (max_x - min_x)/nbins x <- (0.5 + 1e-8) + (x - min_x)/scale nbinsp1 <- nbins + 1L counts <- ore.pull(tabulate(round(x), nbinsp1)) c(head(counts, -2), counts[nbins] + counts[nbinsp1]) }
For data stored in Oracle Database, we can produce the graph much faster by replacing the for-loop in red (and a couple of lines preceding it) with the following that uses the countbins function defined above:
bins.count <- countbins(x,bins,range_x=c(0,2*pi)) arc <- (2 * pi)/bins
This works for a straight R numeric vector as well as for an ore.numeric vector. However, the performance is dramatically faster and the computation scales in ORE.
Computing the binned counts now takes about a half second, but the overall execution time is still at 8.5 seconds, with the point generation taking 75% of the time.
system.time(test(data.vm100k.ore, stack=TRUE, bins=150, main="100K Points", shrink=2)) BIN.COUNT: user system elapsed 0.05 0.00 0.55 POINTS: user system elapsed 6.27 0.09 6.38 TOTAL: user system elapsed 7.90 0.09 8.51
So we’ll now turn our attention to the for-loops highlighted in blue. The inner loop is plotting a point for each observation in the data. While this works quite well for smaller data, as we get bigger data, R can become overwhelmed. What we’re really trying to do is graph a line from the circle for a specific length, based on the count associated with bin. We can reduce from plotting N points – 100,000 points in our example – to at most 150 lines. This is much faster and scales to an arbitrary number of underlying points, especially when used with Oracle R Enterprise in-database computations.
We’ll replace the blue lines with the following:
for (i in 1:bins) { if (bins.count[i] != 0) { z2 <- bins.count[i]*index*cos(mids[i]) y2 <- bins.count[i]*index*sin(mids[i]) z <- c(cos(mids[i]), z2) y <- c(sin(mids[i]), y2) segments(z[1],y[1],z[2],y[2]) } }
Now, performance is dramatically improved, completing in 2.1 seconds with the raw data never leaving the database. Recall that the original R function took ~10 seconds on the 100k observations.
system.time(test2(data.vm100k.ore, stack=TRUE, bins=150, main="100K Points", shrink=2)) BIN.COUNT: user system elapsed 0.03 0.02 0.53 POINTS: user system elapsed 0.01 0.00 0.04 TOTAL: user system elapsed 1.54 0.02 2.10
We might say that 10 seconds vs. 2 seconds isn’t really a significant difference. So let’s scale up to 1 million points.
> data.vm1M <- rvm(1000000, 0, 3) > system.time(circ.plot(data.vm1M, stack=TRUE, bins=150,main="1M Points",shrink=2, dotsep=1000)) user system elapsed 229.81 6.55 238.77
> data.vm1M.ore <- ore.push(data.vm1M) > system.time(circ.plot2b(data.vm1M.ore, stack=TRUE, bins=150, main="1M Points", shrink=2)) BIN.COUNT: user system elapsed 0.02 0.02 4.54 POINTS: user system elapsed 0 0 0 user system elapsed 10.94 0.02 15.50
Now the difference becomes more pronounce with 239 seconds compared to 16 seconds. Note that the 239 seconds does not reflect the actual drawing time of the plot, just the computation portion. Overall stopwatch time was 323 seconds to complete the plot drawing in RStudio. When performing interactive data analytics, this latency can impact data scientist productivity, if not just be annoying.
But, we still have a problem for visualizing the result. Because of the size of the counts and the default dotsep value of 40, many lines are beyond the graph boundary.
We can further shrink the circle, but the scale of our counts is such that we still don’t get a good sense of the full data distribution. In addition, by using the segments function, there is an odd behavior of lines inside the circle. Because of how the counts are scaled using index <- cex/dotsep, whenever bins.count[i]*index < 1 the line is drawn inside the circle. This could be annoying or interesting depending on your perspective. We’ll leave resolving that as an exercise for the reader.
In any case, we can increase dotsep to 200, or even 1000, but we’re losing information since so many of the lines are inside the circle.
We can better address this providing the option to take the log of the counts, or even normalize the counts between 0 and 1 and adjust the dotsep argument.
At this point, we have a better indication of the distribution. Moreover, we can scale this to millions or billions of points and still have the computation performed quickly. As shown below, the entire graph is produced in just over 3 seconds.
Here is a scalable circ.plot function, circ.plot2.
circ.plot2 <- function (x, main = "", pch = 16, stack = FALSE, bins = 0, cex = 1, dotsep = 40, shrink = 1, scale.method = c("none","log","normalize"),col="blue") { x <- x%%(2 * pi) if (require(MASS)) { eqscplot(x = cos(seq(0, 2 * pi, length = 1000)), y = sin(seq(0, 2 * pi, length = 1000)), axes = FALSE, xlab = "", ylab = "", main = main, type = "l", xlim = shrink * c(-1, 1), ylim = shrink * c(-1, 1), ratio = 1, tol = 0.04) text(0.005, 0.85, "90", cex = cex) lines(c(0, 0), c(-0.9, -1)) text(0.005, -0.825, "270", cex = cex) lines(c(-1, -0.9), c(0, 0)) text(-0.8, 0, "180", cex = cex) lines(c(0.9, 1), c(0, 0)) text(0.82, 0, "0", cex = cex) text(0, 0, "+", cex = cex) n <- length(x) z <- cos(x) y <- sin(x) if (stack == FALSE) points(z, y, cex = cex, pch = pch) else { bins.count <- countbins(x,bins,range_x=c(0,2*pi)) arc <- (2 * pi)/bins min_x <- min(bins.count) max_x <- max(bins.count) bins.count = switch(scale.method, none = bins.count, log = log(bins.count), normalize = (bins.count - min_x) / (max_x-min_x)) index <- 1/dotsep for (i in 1:bins) { if (bins.count[i] != 0) { z2 <- bins.count[i]*index*cos(mids[i]) y2 <- bins.count[i]*index*sin(mids[i]) z <- c(cos(mids[i]), z2) y <- c(sin(mids[i]), y2) segments(z[1],y[1],z[2],y[2],col=col) } } } } else { stop("To use this function you have to install the package MASS (VR)\n")
} }
To push the scalability further, the following code produces a bipolar graph constructed with 2 million points, depicted below.
data.vm2mBP <- c(rvm(1000000, 0, 15),rvm(1000000,3.14,12)) data.vm2mBP.ore <- ore.push(data.vm2mBP) system.time(circ.plot2(data.vm2mBP.ore, stack=TRUE, bins=100, shrink=5,dotsep=2,scale.method = "log",main="2M Points, log counts",col="red")) user system elapsed 0.13 0.00 8.74
To do the same with the original circ.plot function for 1 million points took about 91 seconds for the elapsed execution time, and another 135 seconds to complete rendering the plot. It’s worth noting that the improved circ.plot2 function – when invoked on a local data.frame such as data.vm2mBP – takes under 2 seconds in local execution time, so the algorithm change benefits R as well.
In summary, this example highlights the impact of algorithm design on scalability. Using an existing function, CircStats circ.plot, we were able to leverage in-database ORE computations via the Transparency Layer, which leaves data in the database. However, to enable scalability, the R code that performs both computations and visualizations needed to be revised. A demonstrated, such changes can dramatically impact performance.
This installment of the Data Science Maturity Model (DSMM) blog series contains a summary table of the dimensions and levels. Enterprises...
In this next installment of the Data Science Maturity Model (DSMM) dimension discussion, I focus on 'deployment': How easily can...
In this next installment of the Data Science Maturity Model (DSMM) dimension discussion, I focus on 'tools': What tools are used within the...