
Information, tips, tricks and sample code for Big Data Warehousing in an autonomous, cloud-driven world

Recent Posts

Use Big Data Appliance and Big Data Cloud Service High Availability, or You'll Blame Yourself Later

In this blog post, I'd like to briefly review the high availability functionality in Oracle Big Data Appliance and Big Data Cloud Service. The good news on all of this is that most of these features are always available out of the box on your systems, and that no extra steps are required from your end. That is one of the key value-adds of leveraging a hardened system from Oracle. A special shout-out to Sandra and Ravi from our team for helping with this blog post. For this post on HA, we'll subdivide the content into the following topics:
- High availability in the hardware components of the system
- High availability within a single node
- High availability of the Hadoop components

1. High Availability in Hardware Components

When we are talking about an on-premises solution, it is important to understand the fault tolerance and HA built into the actual hardware you have on the floor. Based on Oracle Exadata and the experience we have in managing mission-critical systems, a BDA is built out of components that handle hardware faults and simply stay up and running. Networking is redundant, power supplies in the racks are redundant, ILOM software tracks the health of the system, and ASR proactively logs SRs on hardware issues if needed. You can find a lot more information here.

2. High Availability Within a Single Node

Talking about high availability within a single node, I'd like to focus on disk failures. In large clusters, disk failures do occur, but they should, in general, not cause any issues for BDA and BDCS customers. First, let's have a look at the disk layout (minus the data directories) for the system:

[root@bdax72bur09node02 ~]# df -h|grep -v "/u"
Filesystem      Size  Used Avail Use% Mounted on
devtmpfs        126G     0  126G   0% /dev
tmpfs           126G  8.0K  126G   1% /dev/shm
tmpfs           126G   67M  126G   1% /run
tmpfs           126G     0  126G   0% /sys/fs/cgroup
/dev/md2        961G   39G  874G   5% /
/dev/md6        120G  717M  113G   1% /ssddisk
/dev/md0        454M  222M  205M  53% /boot
/dev/sda1       191M   16M  176M   9% /boot/efi
/dev/sdb1       191M     0  191M   0% /boot/rescue-efi
cm_processes    126G  309M  126G   1% /run/cloudera-scm-agent/process

Next, let's take a look at where critical services store their data.

- NameNode: arguably the most critical HDFS component.
It stores the fsimage and edit files on the local disks; let's check where:

[root@bdax72bur09node02 ~]# df -h /opt/hadoop/dfs/nn
Filesystem      Size  Used Avail Use% Mounted on
/dev/md2        961G   39G  874G   5% /

- JournalNode:

[root@bdax72bur09node02 ~]# df /opt/hadoop/dfs/jn
Filesystem     1K-blocks   Used Available Use% Mounted on
/dev/md6       124800444 733688 117704132   1% /ssddisk
[root@bdax72bur09node02 ~]# ls -l /opt/hadoop/dfs/jn
lrwxrwxrwx 1 root root 15 Jul 15 22:58 /opt/hadoop/dfs/jn -> /ssddisk/dfs/jn

- ZooKeeper:

[root@bdax72bur09node02 ~]# df /var/lib/zookeeper
Filesystem     1K-blocks   Used Available Use% Mounted on
/dev/md6       124800444 733688 117704132   1% /ssddisk

All of these services store their data on the RAID devices /dev/md2 and /dev/md6. Let's take a look at what they consist of:

[root@bdax72bur09node02 ~]# mdadm --detail /dev/md2
/dev/md2:
...
     Array Size : 1023867904 (976.44 GiB 1048.44 GB)
  Used Dev Size : 1023867904 (976.44 GiB 1048.44 GB)
   Raid Devices : 2
  Total Devices : 2
...
 Active Devices : 2
...
    Number   Major   Minor   RaidDevice State
       0       8        3        0      active sync   /dev/sda3
       1       8       19        1      active sync   /dev/sdb3

So md2 is a one-terabyte mirrored RAID; we are safe if one of the disks fails.

[root@bdax72bur09node02 ~]# mdadm --detail /dev/md6
/dev/md6:
...
     Array Size : 126924800 (121.04 GiB 129.97 GB)
  Used Dev Size : 126924800 (121.04 GiB 129.97 GB)
   Raid Devices : 2
  Total Devices : 2
...
 Active Devices : 2
...
    Number   Major   Minor   RaidDevice State
       0       8      195        0      active sync   /dev/sdm3
       1       8      211        1      active sync   /dev/sdn3

So md6 is a mirrored SSD RAID; again, we are safe if one of the disks fails.
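If you ever want to check the state of these arrays yourself, the standard Linux md tools are all you need. A minimal sketch (the device names are taken from the listings above; adjust them if your layout differs):

# One-line summary of all md arrays; [UU] means both mirror members are active.
cat /proc/mdstat

# Per-array detail: overall state plus active/failed member counts.
for md in /dev/md2 /dev/md6; do
    echo "=== $md ==="
    mdadm --detail "$md" | grep -E 'State :|Active Devices|Failed Devices'
done

A degraded array shows up with an [_U] pattern in /proc/mdstat and a non-zero Failed Devices count.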
Fine, let's move on to the Hadoop layer!

3. High Availability of Hadoop Components

3.1 Default service distribution on BDA/BDCS

We briefly took a look at the hardware layout of BDA/BDCS and how data is laid out on disk. In this section, let's look at the Hadoop software details. By default, when you deploy BDCS or configure and create a BDA cluster, you get the following service distribution:

Node01 | Node02 | Node03 | Node04 | Node05 to nn
Balancer | - | Cloudera Manager Server | - | -
Cloudera Manager Agent | Cloudera Manager Agent | Cloudera Manager Agent | Cloudera Manager Agent | Cloudera Manager Agent
DataNode | DataNode | DataNode | DataNode | DataNode
Failover Controller | Failover Controller | - | Oozie | -
JournalNode | JournalNode | JournalNode | - | -
- | MySQL Backup | MySQL Primary | - | -
NameNode | NameNode | Navigator Audit Server and Navigator Metadata Server | - | -
NodeManager (in clusters of eight nodes or less) | NodeManager (in clusters of eight nodes or less) | NodeManager | NodeManager | NodeManager
- | - | SparkHistoryServer | Oracle Data Integrator Agent | -
- | - | ResourceManager | ResourceManager | -
ZooKeeper | ZooKeeper | ZooKeeper | - | -
Big Data SQL (if enabled) | Big Data SQL (if enabled) | Big Data SQL (if enabled) | Big Data SQL (if enabled) | Big Data SQL (if enabled)
Kerberos KDC (if MIT Kerberos is enabled and on-BDA KDCs are being used) | Kerberos KDC (if MIT Kerberos is enabled and on-BDA KDCs are being used) | JobHistory | - | -
Sentry Server (if enabled) | Sentry Server (if enabled) | - | - | -
Hive Metastore | - | - | Hive Metastore | -
Active Navigator Key Trustee Server (if HDFS Transparent Encryption is enabled) | Passive Navigator Key Trustee Server (if HDFS Transparent Encryption is enabled) | - | - | -
- | HttpFS | - | - | -
Hue Server | - | - | Hue Server | -
Hue Load Balancer | - | - | Hue Load Balancer | -

Let me talk about the high availability implementation of some of these services. This configuration may change in the future; you can check for updates here.

3.2 Services with High Availability configured by default

As of today (November 2018) we support high availability for the following Hadoop components:
1) NameNode
2) YARN
3) Kerberos Key Distribution Center
4) Sentry
5) Hive Metastore Service
6) Hue

3.2.1 NameNode High Availability

As you may know, Oracle's solutions are based on the Cloudera Hadoop distribution. Here you can find a detailed explanation of how HDFS high availability is achieved, and the good news is that all of those configuration steps are done for you on BDA and BDCS, so you simply have it by default. Let me show a small demo of NameNode high availability. First, let's check the list of nodes that run this service:

[root@bdax72bur09node01 ~]# hdfs getconf -namenodes
bdax72bur09node01.us.oracle.com bdax72bur09node02.us.oracle.com

The easiest way to determine which node is active is to go to Cloudera Manager -> HDFS -> Instances; in my case the bdax72bur09node02 node is active. I'll run an hdfs list command in a loop, reboot the active NameNode, and take a look at how the system behaves:

[root@bdax72bur09node01 ~]# for i in {1..100}; do hadoop fs -ls hdfs://gcs-lab-bdax72-ns|tail -1; done;
drwxr-xr-x   - root root          0 2018-09-11 17:21 hdfs://gcs-lab-bdax72-ns/user/root/benchmarks
drwxr-xr-x   - root root          0 2018-09-11 17:21 hdfs://gcs-lab-bdax72-ns/user/root/benchmarks
drwxr-xr-x   - root root          0 2018-09-11 17:21 hdfs://gcs-lab-bdax72-ns/user/root/benchmarks
drwxr-xr-x   - root root          0 2018-09-11 17:21 hdfs://gcs-lab-bdax72-ns/user/root/benchmarks
18/11/01 19:53:53 INFO retry.RetryInvocationHandler: Exception while invoking getFileInfo of class ClientNamenodeProtocolTranslatorPB over bdax72bur09node02.us.oracle.com/192.168.8.171:8020 after 1 fail over attempts. Trying to fail over immediately.
...
18/11/01 19:54:16 INFO retry.RetryInvocationHandler: Exception while invoking getFileInfo of class ClientNamenodeProtocolTranslatorPB over bdax72bur09node02.us.oracle.com/192.168.8.171:8020 after 5 fail over attempts. Trying to fail over after sleeping for 11022ms. java.net.ConnectException: Call From bdax72bur09node01.us.oracle.com/192.168.8.170 to bdax72bur09node02.us.oracle.com:8020 failed on connection exception: java.net.ConnectException: Connection timed out; For more details see:  http://wiki.apache.org/hadoop/ConnectionRefused     at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)     at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)     at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)     at java.lang.reflect.Constructor.newInstance(Constructor.java:423)     at org.apache.hadoop.net.NetUtils.wrapWithMessage(NetUtils.java:791)     at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:731)     at org.apache.hadoop.ipc.Client.call(Client.java:1508)     at org.apache.hadoop.ipc.Client.call(Client.java:1441)     at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:230)     at com.sun.proxy.$Proxy10.getFileInfo(Unknown Source)     at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.getFileInfo(ClientNamenodeProtocolTranslatorPB.java:786)     at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)     at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)     at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)     at java.lang.reflect.Method.invoke(Method.java:498)     at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:258)     at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:104)     at com.sun.proxy.$Proxy11.getFileInfo(Unknown Source)     at org.apache.hadoop.hdfs.DFSClient.getFileInfo(DFSClient.java:2167)     at org.apache.hadoop.hdfs.DistributedFileSystem$20.doCall(DistributedFileSystem.java:1265)     at org.apache.hadoop.hdfs.DistributedFileSystem$20.doCall(DistributedFileSystem.java:1261)     at org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)     at org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:1261)     at org.apache.hadoop.fs.Globber.getFileStatus(Globber.java:64)     at org.apache.hadoop.fs.Globber.doGlob(Globber.java:272)     at org.apache.hadoop.fs.Globber.glob(Globber.java:151)     at org.apache.hadoop.fs.FileSystem.globStatus(FileSystem.java:1715)     at org.apache.hadoop.fs.shell.PathData.expandAsGlob(PathData.java:326)     at org.apache.hadoop.fs.shell.Command.expandArgument(Command.java:235)     at org.apache.hadoop.fs.shell.Command.expandArguments(Command.java:218)     at org.apache.hadoop.fs.shell.FsCommand.processRawArguments(FsCommand.java:102)     at org.apache.hadoop.fs.shell.Command.run(Command.java:165)     at org.apache.hadoop.fs.FsShell.run(FsShell.java:315)     at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)     at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:84)     at org.apache.hadoop.fs.FsShell.main(FsShell.java:372) Caused by: java.net.ConnectException: Connection timed out ...   
drwxr-xr-x   - root root          0 2018-09-11 17:21 hdfs://gcs-lab-bdax72-ns/user/root/benchmarks
drwxr-xr-x   - root root          0 2018-09-11 17:21 hdfs://gcs-lab-bdax72-ns/user/root/benchmarks

So, as we can see, due to the unavailability of one of the NameNodes, the second one took over its responsibility. Clients experience only a short pause while the failover happens. In Cloudera Manager we can see that the NameNode service on node02 is not available, but despite this, users can keep working with the cluster without an outage or any extra actions.

3.2.2 YARN High Availability

YARN is another key Hadoop component, and it is also highly available by default in the Oracle solution. Cloudera requires some configuration for this, but with BDA and BDCS all of these steps are done as part of the service deployment. Let's do the same test with the YARN ResourceManager. In Cloudera Manager we can see which nodes run the ResourceManager service; we'll reboot the active one to reproduce a hardware failure. I'll run some MapReduce code and restart the bdax72bur09node04 node (which hosts the active ResourceManager).

[root@bdax72bur09node01 hadoop-mapreduce]# hadoop jar hadoop-mapreduce-examples.jar pi 1 1
Number of Maps  = 1
Samples per Map = 1
Wrote input for Map #0
Starting Job
18/11/01 20:08:03 INFO client.ConfiguredRMFailoverProxyProvider: Failing over to rm16
18/11/01 20:08:03 INFO input.FileInputFormat: Total input paths to process : 1
18/11/01 20:08:04 INFO mapreduce.JobSubmitter: number of splits:1
18/11/01 20:08:04 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1541115989562_0002
18/11/01 20:08:04 INFO impl.YarnClientImpl: Submitted application application_1541115989562_0002
18/11/01 20:08:04 INFO mapreduce.Job: The url to track the job: http://bdax72bur09node04.us.oracle.com:8088/proxy/application_1541115989562_0002/
18/11/01 20:08:04 INFO mapreduce.Job: Running job: job_1541115989562_0002
18/11/01 20:08:07 INFO retry.RetryInvocationHandler: Exception while invoking getApplicationReport of class ApplicationClientProtocolPBClientImpl over rm16. Trying to fail over immediately.
java.io.EOFException: End of File Exception between local host is: "bdax72bur09node01.us.oracle.com/192.168.8.170"; destination host is: "bdax72bur09node04.us.oracle.com":8032; : java.io.EOFException; For more details see:  http://wiki.apache.org/hadoop/EOFException     at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)     at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)     at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)     at java.lang.reflect.Constructor.newInstance(Constructor.java:423)     at org.apache.hadoop.net.NetUtils.wrapWithMessage(NetUtils.java:791)     at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:764)     at org.apache.hadoop.ipc.Client.call(Client.java:1508)     at org.apache.hadoop.ipc.Client.call(Client.java:1441)     at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:230)     at com.sun.proxy.$Proxy13.getApplicationReport(Unknown Source)     at org.apache.hadoop.yarn.api.impl.pb.client.ApplicationClientProtocolPBClientImpl.getApplicationReport(ApplicationClientProtocolPBClientImpl.java:187)     at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)     at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)     at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)     at java.lang.reflect.Method.invoke(Method.java:498)     at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:258)     at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:104)     at com.sun.proxy.$Proxy14.getApplicationReport(Unknown Source)     at org.apache.hadoop.yarn.client.api.impl.YarnClientImpl.getApplicationReport(YarnClientImpl.java:408)     at org.apache.hadoop.mapred.ResourceMgrDelegate.getApplicationReport(ResourceMgrDelegate.java:302)     at org.apache.hadoop.mapred.ClientServiceDelegate.getProxy(ClientServiceDelegate.java:154)     at org.apache.hadoop.mapred.ClientServiceDelegate.invoke(ClientServiceDelegate.java:323)     at org.apache.hadoop.mapred.ClientServiceDelegate.getJobStatus(ClientServiceDelegate.java:423)     at org.apache.hadoop.mapred.YARNRunner.getJobStatus(YARNRunner.java:698)     at org.apache.hadoop.mapreduce.Job$1.run(Job.java:326)     at org.apache.hadoop.mapreduce.Job$1.run(Job.java:323)     at java.security.AccessController.doPrivileged(Native Method)     at javax.security.auth.Subject.doAs(Subject.java:422)     at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1920)     at org.apache.hadoop.mapreduce.Job.updateStatus(Job.java:323)     at org.apache.hadoop.mapreduce.Job.isComplete(Job.java:621)     at org.apache.hadoop.mapreduce.Job.monitorAndPrintJob(Job.java:1366)     at org.apache.hadoop.mapreduce.Job.waitForCompletion(Job.java:1328)     at org.apache.hadoop.examples.QuasiMonteCarlo.estimatePi(QuasiMonteCarlo.java:306)     at org.apache.hadoop.examples.QuasiMonteCarlo.run(QuasiMonteCarlo.java:354)     at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)     at org.apache.hadoop.examples.QuasiMonteCarlo.main(QuasiMonteCarlo.java:363)     at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)     at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)     at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)     at 
java.lang.reflect.Method.invoke(Method.java:498)     at org.apache.hadoop.util.ProgramDriver$ProgramDescription.invoke(ProgramDriver.java:71)     at org.apache.hadoop.util.ProgramDriver.run(ProgramDriver.java:144)     at org.apache.hadoop.examples.ExampleDriver.main(ExampleDriver.java:74)     at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)     at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)     at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)     at java.lang.reflect.Method.invoke(Method.java:498)     at org.apache.hadoop.util.RunJar.run(RunJar.java:221)     at org.apache.hadoop.util.RunJar.main(RunJar.java:136) Caused by: java.io.EOFException     at java.io.DataInputStream.readInt(DataInputStream.java:392)     at org.apache.hadoop.ipc.Client$Connection.receiveRpcResponse(Client.java:1113)     at org.apache.hadoop.ipc.Client$Connection.run(Client.java:1006)
18/11/01 20:08:07 INFO client.ConfiguredRMFailoverProxyProvider: Failing over to rm15
18/11/01 20:08:09 INFO mapreduce.Job: Job job_1541115989562_0002 running in uber mode : false
18/11/01 20:08:09 INFO mapreduce.Job:  map 0% reduce 0%
18/11/01 20:08:23 INFO mapreduce.Job:  map 100% reduce 0%
18/11/01 20:08:29 INFO mapreduce.Job:  map 100% reduce 100%
18/11/01 20:08:29 INFO mapreduce.Job: Job job_1541115989562_0002 completed successfully

Well, in the logs we can clearly see the client failing over to the second ResourceManager, and in Cloudera Manager we can see that node03 took over the active role. So even after losing an entire node that hosts a ResourceManager, users do not lose the ability to submit their jobs.

3.2.3 Kerberos Key Distribution Center (KDC)

In fact, the majority of production Hadoop clusters run in secure mode, which means Kerberized clusters, and the Kerberos Key Distribution Center is the key component for that. The good news: when you install Kerberos with BDA or BDCS, you automatically get a standby KDC on your BDA/BDCS.

3.2.4 Sentry High Availability

If Kerberos is the authentication method (it defines who you are), users quite frequently want an authorization tool to pair with it, and in the Cloudera stack the de facto default is Sentry. Since the BDA 4.12 software release, Sentry High Availability is supported out of the box. Cloudera has detailed documentation that explains how it works.

3.2.5 Hive Metastore Service High Availability

When we talk about Hive, it is very important to keep in mind that it consists of many components; this is easy to see in Cloudera Manager. Whenever you work with Hive tables, you go through several logical layers. To keep it simple, let's consider the case where we use beeline to query some Hive tables: we need HiveServer2, the Hive Metastore Service, and the metastore backend RDBMS to all be available.
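As a quick side note, one way to see how many metastore endpoints your clients actually know about is to look at hive.metastore.uris in the deployed client configuration; with Hive Metastore HA you should see more than one thrift URI listed. A minimal sketch (the /etc/hive/conf path is the usual CDH client-configuration location and is an assumption about your environment, not something from the demo below):

# List the metastore endpoints the Hive clients are configured to use.
# Two or more thrift://host:9083 entries indicate Hive Metastore HA.
grep -A 1 'hive.metastore.uris' /etc/hive/conf/hive-site.xml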
Let's connect and make sure that the data is available:

0: jdbc:hive2://bdax72bur09node04.us.oracle.c (closed)> !connect jdbc:hive2://bdax72bur09node04.us.oracle.com:10000/default;
Connecting to jdbc:hive2://bdax72bur09node04.us.oracle.com:10000/default;
Enter username for jdbc:hive2://bdax72bur09node04.us.oracle.com:10000/default;:
Enter password for jdbc:hive2://bdax72bur09node04.us.oracle.com:10000/default;:
Connected to: Apache Hive (version 1.1.0-cdh5.14.2)
Driver: Hive JDBC (version 1.1.0-cdh5.14.2)
Transaction isolation: TRANSACTION_REPEATABLE_READ
1: jdbc:hive2://bdax72bur09node04.us.oracle.c> show databases;
...
+----------------+--+
| database_name  |
+----------------+--+
| csv            |
| default        |
| parq           |
+----------------+--+

Now let's shut down HiveServer2 and confirm that we can no longer connect to the database:

1: jdbc:hive2://bdax72bur09node04.us.oracle.c> !connect jdbc:hive2://bdax72bur09node04.us.oracle.com:10000/default;
Connecting to jdbc:hive2://bdax72bur09node04.us.oracle.com:10000/default;
Enter username for jdbc:hive2://bdax72bur09node04.us.oracle.com:10000/default;:
Enter password for jdbc:hive2://bdax72bur09node04.us.oracle.com:10000/default;:
Could not open connection to the HS2 server. Please check the server URI and if the URI is correct, then ask the administrator to check the server status.
Error: Could not open client transport with JDBC Uri: jdbc:hive2://bdax72bur09node04.us.oracle.com:10000/default;: java.net.ConnectException: Connection refused (Connection refused) (state=08S01,code=0)
1: jdbc:hive2://bdax72bur09node04.us.oracle.c>

As expected, the connection fails. To fix this, we go to Cloudera Manager -> Hive -> Instances -> Add Role and add an extra HiveServer2 instance (on node05). After that we need to install a load balancer:

[root@bdax72bur09node06 ~]# yum -y install haproxy
Loaded plugins: langpacks
Resolving Dependencies
--> Running transaction check
---> Package haproxy.x86_64 0:1.5.18-7.el7 will be installed
--> Finished Dependency Resolution

Dependencies Resolved

============================================================================
 Package        Arch         Version             Repository          Size
============================================================================
Installing:
 haproxy        x86_64       1.5.18-7.el7        ol7_latest         833 k

Transaction Summary
============================================================================
Install  1 Package

Total download size: 833 k
Installed size: 2.6 M
Downloading packages:
haproxy-1.5.18-7.el7.x86_64.rpm                        | 833 kB  00:00:01
Running transaction check
Running transaction test
Transaction test succeeded
Running transaction
  Installing : haproxy-1.5.18-7.el7.x86_64                              1/1
  Verifying  : haproxy-1.5.18-7.el7.x86_64                              1/1

Installed:
  haproxy.x86_64 0:1.5.18-7.el7

Complete!
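Since haproxy on Oracle Linux 7 ships as a systemd service, it is worth enabling it at boot right away so the balancer comes back if this node is ever restarted. A small sketch (not from the original walkthrough):

# Enable haproxy so it starts automatically after a reboot of the balancer node.
systemctl enable haproxy
# Check the current status; it will not do anything useful until we configure it below.
systemctl status haproxy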
Now we need to configure haproxy. Open the configuration file:

[root@bdax72bur09node06 ~]# vi /etc/haproxy/haproxy.cfg

This is an example of my haproxy.cfg:

global
    log         127.0.0.1 local2
    chroot      /var/lib/haproxy
    pidfile     /var/run/haproxy.pid
    maxconn     4000
    user        haproxy
    group       haproxy
    daemon
    # turn on stats unix socket
    stats socket /var/lib/haproxy/stats

#---------------------------------------------------------------------
# common defaults that all the 'listen' and 'backend' sections will
# use if not designated in their block
#---------------------------------------------------------------------
defaults
    mode                    http
    log                     global
    option                  httplog
    option                  dontlognull
    option http-server-close
    option forwardfor       except 127.0.0.0/8
    option                  redispatch
    retries                 3
    timeout http-request    10s
    timeout queue           1m
    timeout connect         10s
    timeout client          1m
    timeout server          1m
    timeout http-keep-alive 10s
    timeout check           10s
    maxconn                 3000

#---------------------------------------------------------------------
# main frontend which proxys to the backends
#---------------------------------------------------------------------
frontend  main *:5000
    acl url_static       path_beg       -i /static /images /javascript /stylesheets
    acl url_static       path_end       -i .jpg .gif .png .css .js
    use_backend static          if url_static

#---------------------------------------------------------------------
# static backend for serving up images, stylesheets and such
#---------------------------------------------------------------------
backend static
    balance     roundrobin
    server      static 127.0.0.1:4331 check

#---------------------------------------------------------------------
# round robin balancing between the various backends
#---------------------------------------------------------------------
listen hiveserver2 :10005
    mode tcp
    option tcplog
    balance source
    server hiveserver2_1 bdax72bur09node04.us.oracle.com:10000 check
    server hiveserver2_2 bdax72bur09node05.us.oracle.com:10000 check

Then go to Cloudera Manager and set the HiveServer2 load balancer hostname/port to match the haproxy configuration from the previous step.
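Before pointing clients at the balancer, it does not hurt to validate the configuration file and restart the service so the new hiveserver2 listener is picked up. A quick sketch:

# Syntax-check the configuration we just wrote.
haproxy -c -f /etc/haproxy/haproxy.cfg
# Restart haproxy so it starts listening on port 10005.
systemctl restart haproxy
# Confirm the listener is up.
ss -ltnp | grep 10005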
After all these changes are done, try to connect again:

beeline> !connect jdbc:hive2://bdax72bur09node06.us.oracle.com:10005/default;
...
INFO  : OK
+----------------+--+
| database_name  |
+----------------+--+
| csv            |
| default        |
| parq           |
+----------------+--+
3 rows selected (2.08 seconds)

Great, it works! Now try shutting down one of the HiveServer2 instances:

0: jdbc:hive2://bdax72bur09node06.us.oracle.c> !connect jdbc:hive2://bdax72bur09node06.us.oracle.com:10005/default;
...
1: jdbc:hive2://bdax72bur09node06.us.oracle.c> show databases;
...
+----------------+--+
| database_name  |
+----------------+--+
| csv            |
| default        |
| parq           |
+----------------+--+

This still works! Now let's move on and look at what we have for Hive Metastore Service high availability. The really great news is that it is enabled by default on BDA and BDCS. To show this, I'll shut the metastore services down one at a time and check whether a beeline connection keeps working. Shut down the service on node01 and connect/query through beeline:

1: jdbc:hive2://bdax72bur09node06.us.oracle.c> !connect jdbc:hive2://bdax72bur09node06.us.oracle.com:10005/default;
...
INFO  : OK
+----------------+--+
| database_name  |
+----------------+--+
| csv            |
| default        |
| parq           |
+----------------+--+

It works. Now I'll start the service on node01 back up and shut down the one on node04:

1: jdbc:hive2://bdax72bur09node06.us.oracle.c> !connect jdbc:hive2://bdax72bur09node06.us.oracle.com:10005/default;
...
INFO  : OK
+----------------+--+
| database_name  |
+----------------+--+
| csv            |
| default        |
| parq           |
+----------------+--+

It works again! So we are safe with the Hive Metastore Service. BDA and BDCS use MySQL as the metastore's backend RDBMS. As of today there is no high availability configured for the MySQL database itself; instead we use master-slave replication (in the future we hope to have full HA for MySQL), which allows us to switch to the slave if the master fails. Today, if the node hosting the master fails (node03 by default), you need to perform a node migration, which I'll explain later in this post. To find out where the MySQL master is, run:

[root@bdax72bur09node01 tmp]#  json-select --jpx=MYSQL_NODE /opt/oracle/bda/install/state/config.json
bdax72bur09node03

To find the slave, run:

[root@bdax72bur09node01 tmp]# json-select --jpx=MYSQL_BACKUP_NODE /opt/oracle/bda/install/state/config.json
bdax72bur09node02

3.2.6 Hue High Availability

Hue is quite a popular tool for working with Hadoop data. It is possible to run Hue in HA mode, and Cloudera explains how here, but with BDA and BDCS you get it out of the box since the 4.12 software release. By default you have a Hue server and a Hue load balancer on node01 and on node04. If node01 or node04 becomes unavailable, users can keep using Hue without any extra actions, simply by switching to the load balancer URL on the other node.
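A trivial way to confirm that both Hue entry points are alive is to probe the two load balancer URLs from any client machine. This is only a sketch, and the port below is an assumption; check the Hue Load Balancer configuration in Cloudera Manager for the port your cluster actually uses:

# Probe both Hue load balancers; an HTTP status line back means that entry point is serving.
# Replace 8888 with the Hue Load Balancer port shown in Cloudera Manager.
for h in bdax72bur09node01.us.oracle.com bdax72bur09node04.us.oracle.com; do
    echo "== $h =="
    curl -skI "http://$h:8888/" | head -1
done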
3.3 Migrate Critical Nodes

One of the greatest features of Big Data Appliance is the capability to migrate all of the roles of a critical node. For example, some nodes host many critical services, like node03 (Cloudera Manager, ResourceManager, the MySQL store, and so on). Fortunately, BDA has a simple way to migrate all roles from a critical node to a non-critical node. You can find all the details in MOS (Node Migration on Oracle Big Data Appliance V4.0 OL6 Hadoop Cluster to Manage a Hardware Failure (Doc ID 1946210.1)). Let's consider the case where we lose one of the critical servers, node03, which hosts the active MySQL RDBMS and Cloudera Manager (because of a hardware failure, for example). To fix this we need to migrate all of this node's roles to some other server. To migrate all roles from node03, just run:

[root@bdax72bur09node01 ~]# bdacli admin_cluster migrate bdax72bur09node03

You can find all the details in the MOS note, but briefly:
1) There are two major types of migration: migration of critical nodes, and reprovisioning of non-critical nodes.
2) When you migrate a critical node, you cannot choose which non-critical node the services are migrated to (Mammoth does this for you; generally it is the first available non-critical node).
3) After the failed server comes back to the cluster (or a new one is added), you should reprovision it as a non-critical node.
4) You don't need to switch services back; just leave everything as it is after the migration is done. The new node takes over all roles from the failed one.

In my example, I migrated one of the critical nodes, the one hosting the active MySQL RDBMS and Cloudera Manager. To check where the active RDBMS now runs:

[root@bdax72bur09node01 tmp]#  json-select --jpx=MYSQL_NODE /opt/oracle/bda/install/state/config.json
bdax72bur09node05

Note: to find the slave RDBMS, run:

[root@bdax72bur09node01 tmp]# json-select --jpx=MYSQL_BACKUP_NODE /opt/oracle/bda/install/state/config.json
bdax72bur09node02

Cloudera Manager now runs on node05, and the ResourceManager was also migrated to node05. The migration process decommissions the failed node; after it comes back to the cluster, we need to reprovision it (deploy the non-critical services on it). In other words, we recommission the node.

3.4 Redundant Services

There are certain Hadoop services that are configured on BDA in a redundant way, so you shouldn't worry about high availability for them:
- DataNode. By default, HDFS keeps three copies of each block. If you lose one node, you still have two more copies.
- JournalNode. By default, you have three JournalNode instances configured. Missing one is not a big deal.
- ZooKeeper. By default, you have three ZooKeeper instances configured. Missing one is not a big deal.
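If you ever want to verify that this redundancy is actually in place on your cluster, a couple of standard HDFS commands will do it. A minimal sketch (run as a user with HDFS superuser rights, e.g. hdfs):

# Confirm the default block replication factor (3 on a stock BDA/BDCS cluster).
hdfs getconf -confKey dfs.replication
# Look for dead DataNodes and under-replicated blocks, for example after losing a node.
hdfs dfsadmin -report | grep -E 'Live datanodes|Dead datanodes|Under replicated'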
4. Services with no High Availability configured by default

There are certain services on BDA that do not have a high availability configuration by default:

- Oozie. If you need high availability for Oozie, check Cloudera's documentation.
- Cloudera Manager. It is also possible to configure Cloudera Manager for high availability, as explained here, but I'd recommend using node migration instead, as shown above.
- Impala. By default, neither BDA nor BDCS configures high availability for Impala (yet), but it is quite important. You can find all the details here, but briefly, to configure HA for Impala you need to:

a. Configure haproxy (I extended the existing haproxy configuration we created for HiveServer2) by adding:

listen impala :25003
    mode tcp
    option tcplog
    balance leastconn
    server symbolic_name_1 bdax72bur09node01.us.oracle.com:21000 check
    server symbolic_name_2 bdax72bur09node02.us.oracle.com:21000 check
    server symbolic_name_3 bdax72bur09node03.us.oracle.com:21000 check
    server symbolic_name_4 bdax72bur09node04.us.oracle.com:21000 check
    server symbolic_name_5 bdax72bur09node05.us.oracle.com:21000 check
    server symbolic_name_6 bdax72bur09node06.us.oracle.com:21000 check

b. Go to Cloudera Manager -> Impala -> Configuration, search for "Impala Daemons Load Balancer", and add the haproxy host there.

c. Log in to Impala using the haproxy host:port:

[root@bdax72bur09node01 bin]# impala-shell -i bdax72bur09node06:25003
...
Connected to bdax72bur09node06:25003
...
[bdax72bur09node06:25003] >

Talking about Impala, it is worth mentioning that there are two more services, the Impala StateStore and the Impala Catalog Server, and they are not mission-critical. From Cloudera's documentation: "The Impala component known as the statestore checks on the health of Impala daemons on all the DataNodes in a cluster, and continuously relays its findings to each of those daemons." and "The Impala component known as the catalog service relays the metadata changes from Impala SQL statements to all the Impala daemons in a cluster. ... Because the requests are passed through the statestore daemon, it makes sense to run the statestored and catalogd services on the same host." I'll make a quick test: I disabled the Impala daemon on node01 along with the StateStore and the Catalog Server, then connected to the load balancer and ran a query:

[root@bdax72bur09node01 ~]# impala-shell -i bdax72bur09node06:25003
....
[bdax72bur09node06:25003] > select count(1) from test_table;
...
+------------+
| count(1)   |
+------------+
| 6659433869 |
+------------+
Fetched 1 row(s) in 1.76s
[bdax72bur09node06:25003] >

So, as we can see, Impala can keep serving queries even when the StateStore and the Catalog Server are down.
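One follow-up note on the load balancer: the stanza above only balances port 21000, which is what impala-shell uses. If your users also connect over JDBC or ODBC, those clients talk to a different impalad port (21050 by default), so you would add a second listener for it. A hedged sketch of what that extra haproxy block could look like; the listener port 25004 and the server names are my own placeholders:

listen impala_jdbc :25004
    mode tcp
    option tcplog
    # same balancing policy as the impala-shell stanza above
    balance leastconn
    server impalad_jdbc_1 bdax72bur09node01.us.oracle.com:21050 check
    server impalad_jdbc_2 bdax72bur09node02.us.oracle.com:21050 check
    server impalad_jdbc_3 bdax72bur09node03.us.oracle.com:21050 check
    server impalad_jdbc_4 bdax72bur09node04.us.oracle.com:21050 check
    server impalad_jdbc_5 bdax72bur09node05.us.oracle.com:21050 check
    server impalad_jdbc_6 bdax72bur09node06.us.oracle.com:21050 check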


Big Data

Influence Product Roadmap: Introducing the new Big Data Idea Lab

You have ideas, you have feedback, and you want to be involved in the products and services you use. Of course you do, so here is the new Idea Lab for Big Data, where you can submit your ideas and vote on ideas submitted by others. Visit the Big Data Idea Lab now. What does the Idea Lab let you do, and how do we use your feedback? For all our products and services we (Product Management) define a set of features and functionality that will enhance the products and solve customer problems. We then set out to prioritize these features and functions, and a big driver of this is the impact said features have on you, our customers. Until now we largely used direct interaction with customers as the yardstick for that impact and that prioritization. That will change with the Idea Lab, where we will have direct, recorded, and scalable input on features and ideas. Of course we are also looking for input on new features and things we had not thought about. That is the other part of the Idea Lab: giving us new ideas, new functions and features, and anything that you think would help you use our products better in your company. As we progress in releasing new functionality, the Idea Lab will be a running tally of our progress, and we promise to keep you updated on where we are going in roadmap posts on this blog (see this example: Start Planning your Upgrade Strategy to Cloudera 6 on Oracle Big Data Now) and on the Idea Lab. So please use the Idea Lab, submit and vote, and visit often to see what is new and to keep us tracking towards better products. And thanks in advance for your efforts!


Autonomous

Thursday at OpenWorld 2018 - Your Must-See Sessions

Day three is a wrap so now is the perfect time to start planning your Day 4 session at OpenWorld 2018. Here’s your absolutely Must-See agenda for Thursday at OpenWorld 2018... My favorite session of the whole conference is today - Using Analytic Views for Self-Service Business Intelligence, which is at 9:00am in Room 3005, Moscone West. Multi-Dimensional models inside the database are very powerful and totally cool. AVs uniquely deliver sophisticated analytics from very simple SQL. If you only get to one session today then make it this one! Of course, today is your final chance to get some much-needed real hands-on time with Autonomous Data Warehouse in the ADW Hands-on Lab at 1:30pm - 2:30pm at the Marriott Marquis (Yerba Buena Level) - Salon 9B. The product management team will help you sign up for a free cloud account and then get you working on our completely free workshop. Don't miss it, just bring your own laptop! THURSDAY'S MUST-SEE GUIDE Don't worry if you are not able to join us in San Francisco for this year's conference because I will be providing a comprehensive review after the conference closes on Thursday. The review will include links to download the presentations for each of my Must-See sessions and links to any hands-on lab content as well. Have a great conference. If you are here in San Francisco then enjoy the conference - it's going to be an awesome conference this year. Technorati Tags: Analytics, Autonomous, Big Data, Cloud, Conference, Data Warehousing, OpenWorld, SQL Analytics


Autonomous

Wednesday at OpenWorld 2018 - Your Must-See Sessions

Here’s your absolutely Must-See agenda for Wednesday at OpenWorld 2018... Day two is a wrap so now is the perfect time to start planning your Day 3 sessions at OpenWorld 2018. The list is packed full of really excellent speakers from Oracle product management talking about Autonomous Database and the Cloud. These sessions are what Oracle OpenWorld is all about: the chance to learn from the real technical experts. The highlight of today is two additional chances to get some real hands-on time with Autonomous Data Warehouse in the ADW Hands-on Lab at 12:45pm - 1:45pm and then again at 3:45pm - 4:45pm, both at the Marriott Marquis (Yerba Buena Level) - Salon 9B. We will help you sign up for a free cloud account and then get you working on our completely free workshop. Don't miss it, just bring your own laptop! WEDNESDAY'S MUST-SEE GUIDE Don't worry if you are not able to join us in San Francisco for this year's conference because I will be providing a comprehensive review after the conference closes on Thursday. The review will include links to download the presentations for each of my Must-See sessions and links to any hands-on lab content as well. Have a great conference. If you are here in San Francisco then enjoy the conference - it's going to be an awesome conference this year.


Autonomous

Tuesday at OpenWorld 2018 - Your Must-See Sessions

  Here’s your absolutely Must-See agenda for Tuesday at OpenWorld 2018... Day one is a wrap so now is the perfect time to start planning your Day 2 session at  OpenWorld 2018. The list is packed full of really excellent speakers from Oracle product management talking about Autonomous Database and the Cloud. These sessions are what Oracle OpenWorld is all about: the chance to learn from the real technical experts. Highlight of today is the chance to get some real hands-on time with Autonomous Data Warehouse in the ADW Hands-on Lab at 3:45 PM - 4:45 PM at the Marriott Marquis (Yerba Buena Level) - Salon 9B. We will help you sign up for a free cloud account and then get you working on our completely free workshop. Don't miss it, just bring your own laptop!   TUESDAY'S MUST-SEE GUIDE    Don't worry if you are not able to join us in San Francisco for this year's conference because I will be providing a comprehensive review after the conference closes on Thursday. The review will include links to download the presentations for each of my Must-See sessions and links to any hands-on lab content as well. Have a great conference. If you are here in San Francisco then enjoy the conference - it's going to be an awesome conference this year.  


Autonomous

Managing Autonomous Data Warehouse Using oci-curl

Every now and then we get questions about how to create and manage an Autonomous Data Warehouse (ADW) instance using REST APIs. ADW is an Oracle Cloud Infrastructure (OCI) based service, this means you can use OCI REST APIs to manage your ADW instances as an alternative to using the OCI web interface. I want to provide a few examples to do this using the bash function oci-curl provided in the OCI documentation. This was the easiest method for me to use, you can also use the OCI command line interface, or the SDKs to do the same operations. oci-curl oci-curl is a bash function provided in the documentation that makes it easy to get started with the REST APIs. You will need to complete a few setup operations before you can start calling it. Start by copying the function code from the documentation into a shell script on your machine. I saved it into a file named oci-curl.sh, for example. You will see the following section at the top of the file. You need to replace these four values with your own. TODO: update these values to your own local tenancyId="ocid1.tenancy.oc1..aaaaaaaaba3pv6wkcr4jqae5f15p2b2m2yt2j6rx32uzr4h25vqstifsfdsq"; local authUserId="ocid1.user.oc1..aaaaaaaat5nvwcna5j6aqzjcaty5eqbb6qt2jvpkanghtgdaqedqw3rynjq"; local keyFingerprint="20:3b:97:13:55:1c:5b:0d:d3:37:d8:50:4e:c5:3a:34"; local privateKeyPath="/Users/someuser/.oci/oci_api_key.pem"; How to find or generate these values is explained in the documentation here, let's walk through those steps now. Tenancy ID The first one is the tenancy ID. You can find your tenancy ID at the bottom of any page in the OCI web interface as indicated in this screenshot. Copy and paste the tenancy ID into the tenancyID argument in your oci-curl shell script. Auth User ID This is the OCI ID of the user who will perform actions using oci-curl. This user needs to have the privileges to manage ADW instances in your OCI tenancy. You can find your user OCI ID by going to the users screen as shown in this screenshot. Click the Copy link in that screen which copies the OCI ID for that user into the clipboard. Paste it into the authUserId argument in your oci-curl shell script. Key Fingerprint The first step for getting the key fingerprint is to generate an API signing key. Follow the documentation to do that. I am running these commands on a Mac and for demo purposes, I am not using a passphrase, see the documentation for Windows commands and for using a passphrase to encrypt the key file. mkdir ~/.oci openssl genrsa -out ~/.oci/oci_api_key.pem 2048 chmod go-rwx ~/.oci/oci_api_key.pem openssl rsa -pubout -in ~/.oci/oci_api_key.pem -out ~/.oci/oci_api_key_public.pem For your API calls to authenticate against OCI you need to upload the public key file. Go to the user details screen for your user on the OCI web interface and select API keys on the left. Click the Add Public Key button and copy and paste the contents of the file oci_api_key_public.pem into the text field, click Add to finish the upload. After you upload your key you will see the fingerprint of it in the user details screen as shown below. Copy and paste the fingerprint text into the keyFingerprint argument in your oci-curl shell script. Private Key Path Lastly, change the privateKeyPath argument in your oci-curl shell script to the path for the key file you generated in the previous step. For example, I set it as below in my machine. 
local privateKeyPath="/Users/ybaskan/.oci/oci_api_key.pem"; At this point, I save my updated shell script as oci-curl.sh and I will be calling this function to manage my ADW instances. Create an ADW instance Let's start by creating an instance using the function. Here is my shell script for doing that, createdb.sh. #!/bin/bash . ./oci-curl.sh oci-curl database.us-phoenix-1.oraclecloud.com post ./request.json "/20160918/autonomousDataWarehouses" Note that I first source the file oci-curl.sh which contains my oci-curl function updated with my OCI tenancy information as explained previously. I am calling the CreateAutonomousDataWarehouse REST API to create a database. Note that I am running this against the Phoenix data center (indicated by the first argument, database.us-phoenix-1.oraclecloud.com), if you want to create your database in other data centers you need to use the relevant endpoint listed here. I am also referring to a file named request.json which is a file that contains my arguments for creating the database. Here is the content of that file. { "compartmentId" : "ocid1.tenancy.oc1..aaaaaaaaro2vctz2hianklgq77hguo6jzcs6ezyheouqfsald4x3nubpwr2a", "dbName" : "adwdb1", "displayName" : "adwdb1", "adminPassword" : "WelcomePMADWC18", "cpuCoreCount" : 1, "dataStorageSizeInTBs" : 1, "licenseModel" : "LICENSE_INCLUDED" } As seen in the file I am creating a database named adwdb1 with 1 CPU and 1TB storage. You can create your database in any of your compartments, to find the compartment ID which is required in this file, go to the compartments page on the OCI web interface, find the compartment you want to use and click the Copy link to copy the compartment ID into the clipboard. Paste it into the compartmentId argument in your request.json file. Let's run the script to create an ADW instance. ./createdb.sh { "compartmentId" : "ocid1.tenancy.oc1..aaaaaaaaro2vctz2hianklgq77hguo6jzcs6ezyheouqfsald4x3nubpwr2a", "connectionStrings" : null, "cpuCoreCount" : 1, "dataStorageSizeInTBs" : 1, "dbName" : "adwdb1", "definedTags" : { }, "displayName" : "adwdb1", "freeformTags" : { }, "id" : "ocid1.autonomousdwdatabase.oc1.phx.abyhqljszen442afj6ogitgs3dwk2iunhv7zdndllf6o6is6xg2ku5a7uf3a", "licenseModel" : "LICENSE_INCLUDED", "lifecycleDetails" : null, "lifecycleState" : "PROVISIONING", "serviceConsoleUrl" : null, "timeCreated" : "2018-09-06T19:56:48.077Z" As you see the lifecycle state is listed as provisioning which indicates the database is being provisioned. If you now go to the OCI web interface you will see the new database as being provisioned. Listing ADW instances Here is the script, listdb.sh, I use to list the ADW instances in my compartment. I use the ListAutonomousDataWarehouses REST API for this. #!/bin/bash . ./oci-curl.sh oci-curl database.us-phoenix-1.oraclecloud.com get "/20160918/autonomousDataWarehouses?compartmentId=ocid1.tenancy.oc1..aaaaaaaaro2vctz2hianklgq77hguo6jzcs6ezyheouqfsald4x3nubpwr2a" As you see it has one argument, compartmentId, which I set to the ID of my compartment I used in the previous example when creating a new ADW instance. When you run this script it gives you a list of databases and information about them in JSON which looks pretty ugly. 
./listdb.sh [{"compartmentId":"ocid1.tenancy.oc1..aaaaaaaaro2vctz2hianklgq77hguo6jzcs6ezyheouqfsald4x3nubpwr2a","connectionStrings":{"high":"adwc.uscom-west-1.oraclecloud.com:1522/nqpobuuiaedvlyf_adwdb1_high.adwc.oraclecloud.com","low":"adwc.uscom-west-1.oraclecloud.com:1522/nqpobuuiaedvlyf_adwdb1_low.adwc.oraclecloud.com","medium":"adwc.uscom-west-1.oraclecloud.com:1522/nqpobuuiaedvlyf_adwdb1_medium.adwc.oraclecloud.com"},"cpuCoreCount":1,"dataStorageSizeInTBs":1,"dbName":"adwdb1","definedTags":{},"displayName":"adwdb1","freeformTags":{},"id":"ocid1.autonomousdwdatabase.oc1.phx.abyhqljszen442afj6ogitgs3dwk2iunhv7zdndllf6o6is6xg2ku5a7uf3a","licenseModel":"LICENSE_INCLUDED","lifecycleDetails":null,"lifecycleState":"AVAILABLE","serviceConsoleUrl":"https://adwc.uscom-west-1.oraclecloud.com/console/index.html?tenant_name=OCID1.TENANCY.OC1..AAAAAAAARO2VCTZ2HIANKLGQ77HGUO6JZCS6EZYHEOUQFSALD4X3NUBPWR2A&database_name=ADWDB1&service_type=ADW","timeCreated":"2018-09-06T19:56:48.077Z"},{"compartmentId":"ocid1.tenancy.oc1..aaaaaaaaro2vctz2hianklgq77hguo6jzcs6ezyheouqfsald4x3nubpwr2a","connectionStrings":{"high":"adwc.uscom-west-1.oraclecloud.com:1522/nqpobuuiaedvlyf_testdw_high.adwc.oraclecloud.com","low":"adwc.uscom-west-1.oraclecloud.com:1522/nqpobuuiaedvlyf_testdw_low.adwc.oraclecloud.com","medium":"adwc.uscom-west-1.oraclecloud.com:1522/nqpobuuiaedvlyf_testdw_medium.adwc.oraclecloud.com"},"cpuCoreCount":1,"dataStorageSizeInTBs":1,"dbName":"testdw","definedTags":{},"displayName":"testdw","freeformTags":{},"id":"ocid1.autonomousdwdatabase.oc1.phx.abyhqljtcioe5c5sjteosafqfd37biwde66uqj2pqs773gueucq3dkedv3oq","licenseModel":"LICENSE_INCLUDED","lifecycleDetails":null,"lifecycleState":"AVAILABLE","serviceConsoleUrl":"https://adwc.uscom-west-1.oraclecloud.com/console/index.html?tenant_name=OCID1.TENANCY.OC1..AAAAAAAARO2VCTZ2HIANKLGQ77HGUO6JZCS6EZYHEOUQFSALD4X3NUBPWR2A&database_name=TESTDW&service_type=ADW","timeCreated":"2018-07-31T22:39:14.436Z"}] You can use a JSON beautifier to make it human-readable. For example, I use Python to view the same output in a more readable format. 
./listdb.sh | python -m json.tool [ { "compartmentId": "ocid1.tenancy.oc1..aaaaaaaaro2vctz2hianklgq77hguo6jzcs6ezyheouqfsald4x3nubpwr2a", "connectionStrings": { "high": "adwc.uscom-west-1.oraclecloud.com:1522/nqpobuuiaedvlyf_adwdb1_high.adwc.oraclecloud.com", "low": "adwc.uscom-west-1.oraclecloud.com:1522/nqpobuuiaedvlyf_adwdb1_low.adwc.oraclecloud.com", "medium": "adwc.uscom-west-1.oraclecloud.com:1522/nqpobuuiaedvlyf_adwdb1_medium.adwc.oraclecloud.com" }, "cpuCoreCount": 1, "dataStorageSizeInTBs": 1, "dbName": "adwdb1", "definedTags": {}, "displayName": "adwdb1", "freeformTags": {}, "id": "ocid1.autonomousdwdatabase.oc1.phx.abyhqljszen442afj6ogitgs3dwk2iunhv7zdndllf6o6is6xg2ku5a7uf3a", "licenseModel": "LICENSE_INCLUDED", "lifecycleDetails": null, "lifecycleState": "AVAILABLE", "serviceConsoleUrl": "https://adwc.uscom-west-1.oraclecloud.com/console/index.html?tenant_name=OCID1.TENANCY.OC1..AAAAAAAARO2VCTZ2HIANKLGQ77HGUO6JZCS6EZYHEOUQFSALD4X3NUBPWR2A&database_name=ADWDB1&service_type=ADW", "timeCreated": "2018-09-06T19:56:48.077Z" }, { "compartmentId": "ocid1.tenancy.oc1..aaaaaaaaro2vctz2hianklgq77hguo6jzcs6ezyheouqfsald4x3nubpwr2a", "connectionStrings": { "high": "adwc.uscom-west-1.oraclecloud.com:1522/nqpobuuiaedvlyf_testdw_high.adwc.oraclecloud.com", "low": "adwc.uscom-west-1.oraclecloud.com:1522/nqpobuuiaedvlyf_testdw_low.adwc.oraclecloud.com", "medium": "adwc.uscom-west-1.oraclecloud.com:1522/nqpobuuiaedvlyf_testdw_medium.adwc.oraclecloud.com" }, "cpuCoreCount": 1, "dataStorageSizeInTBs": 1, "dbName": "testdw", "definedTags": {}, "displayName": "testdw", "freeformTags": {}, "id": "ocid1.autonomousdwdatabase.oc1.phx.abyhqljtcioe5c5sjteosafqfd37biwde66uqj2pqs773gueucq3dkedv3oq", "licenseModel": "LICENSE_INCLUDED", "lifecycleDetails": null, "lifecycleState": "AVAILABLE", "serviceConsoleUrl": "https://adwc.uscom-west-1.oraclecloud.com/console/index.html?tenant_name=OCID1.TENANCY.OC1..AAAAAAAARO2VCTZ2HIANKLGQ77HGUO6JZCS6EZYHEOUQFSALD4X3NUBPWR2A&database_name=TESTDW&service_type=ADW", "timeCreated": "2018-07-31T22:39:14.436Z" } ] Scaling an ADW instance To scale an ADW instance you need to use the UpdateAutonomousDataWarehouse REST API with the relevant arguments. Here is my script, updatedb.sh, I use to do that. #!/bin/bash . ./oci-curl.sh oci-curl database.us-phoenix-1.oraclecloud.com put ./update.json "/20160918/autonomousDataWarehouses/$1" As you see it uses the file update.json as the request body and also uses the command line argument $1 as the database OCI ID. The file update.json has the following argument in it. { "cpuCoreCount" : 2 } I am only using cpuCoreCount as I want to change my CPU capacity, you can use other arguments listed in the documentation if you need to. To find the database OCI ID for your ADW instance you can either look at the output of the list databases API I mentioned above or you can go the ADW details page on the OCI web interface which will show you the OCI ID. Now, I call it with my database ID and the scale operation is submitted. 
./updatedb.sh ocid1.autonomousdwdatabase.oc1.phx.abyhqljszen442afj6ogitgs3dwk2iunhv7zdndllf6o6is6xg2ku5a7uf3a { "compartmentId" : "ocid1.tenancy.oc1..aaaaaaaaro2vctz2hianklgq77hguo6jzcs6ezyheouqfsald4x3nubpwr2a", "connectionStrings" : { "high" : "adwc.uscom-west-1.oraclecloud.com:1522/nqpobuuiaedvlyf_adwdb1_high.adwc.oraclecloud.com", "low" : "adwc.uscom-west-1.oraclecloud.com:1522/nqpobuuiaedvlyf_adwdb1_low.adwc.oraclecloud.com", "medium" : "adwc.uscom-west-1.oraclecloud.com:1522/nqpobuuiaedvlyf_adwdb1_medium.adwc.oraclecloud.com" }, "cpuCoreCount" : 1, "dataStorageSizeInTBs" : 1, "dbName" : "adwdb1", "definedTags" : { }, "displayName" : "adwdb1", "freeformTags" : { }, "id" : "ocid1.autonomousdwdatabase.oc1.phx.abyhqljszen442afj6ogitgs3dwk2iunhv7zdndllf6o6is6xg2ku5a7uf3a", "licenseModel" : "LICENSE_INCLUDED", "lifecycleDetails" : null, "lifecycleState" : "SCALE_IN_PROGRESS", "serviceConsoleUrl" : "https://adwc.uscom-west-1.oraclecloud.com/console/index.html?tenant_name=OCID1.TENANCY.OC1..AAAAAAAARO2VCTZ2HIANKLGQ77HGUO6JZCS6EZYHEOUQFSALD4X3NUBPWR2A&database_name=ADWDB1&service_type=ADW", "timeCreated" : "2018-09-06T19:56:48.077Z" } If you go to the OCI web interface again you will see that the status for that ADW instance is shown as Scaling in Progress. Stopping and Starting an ADW Instance To stop and start ADW instances you need to use the StopAutonomousDataWarehouse and the StartAutonomousDataWarehouse REST APIs. Here is my stop database script, stopdb.sh. #!/bin/bash . ./oci-curl.sh oci-curl database.us-phoenix-1.oraclecloud.com POST ./empty.json /20160918/autonomousDataWarehouses/$1/actions/stop As you see it takes one argument, $1, which is the database OCI ID as I used in the scale example before. It also refers to the file empty.json which is an empty JSON file with the below content. { } As you will see this requirement is not mentioned in the documentation, but the call will give an error if you do not provide the empty JSON file as input. Here is the script running with my database OCI ID. ./stopdb.sh ocid1.autonomousdwdatabase.oc1.phx.abyhqljszen442afj6ogitgs3dwk2iunhv7zdndllf6o6is6xg2ku5a7uf3a { "compartmentId" : "ocid1.tenancy.oc1..aaaaaaaaro2vctz2hianklgq77hguo6jzcs6ezyheouqfsald4x3nubpwr2a", "connectionStrings" : { "high" : "adwc.uscom-west-1.oraclecloud.com:1522/nqpobuuiaedvlyf_adwdb1_high.adwc.oraclecloud.com", "low" : "adwc.uscom-west-1.oraclecloud.com:1522/nqpobuuiaedvlyf_adwdb1_low.adwc.oraclecloud.com", "medium" : "adwc.uscom-west-1.oraclecloud.com:1522/nqpobuuiaedvlyf_adwdb1_medium.adwc.oraclecloud.com" }, "cpuCoreCount" : 1, "dataStorageSizeInTBs" : 1, "dbName" : "adwdb1", "definedTags" : { }, "displayName" : "adwdb1", "freeformTags" : { }, "id" : "ocid1.autonomousdwdatabase.oc1.phx.abyhqljszen442afj6ogitgs3dwk2iunhv7zdndllf6o6is6xg2ku5a7uf3a", "licenseModel" : "LICENSE_INCLUDED", "lifecycleDetails" : null, "lifecycleState" : "STOPPING", "serviceConsoleUrl" : "https://adwc.uscom-west-1.oraclecloud.com/console/index.html?tenant_name=OCID1.TENANCY.OC1..AAAAAAAARO2VCTZ2HIANKLGQ77HGUO6JZCS6EZYHEOUQFSALD4X3NUBPWR2A&database_name=ADWDB1&service_type=ADW", "timeCreated" : "2018-09-06T19:56:48.077Z" Likewise, you can start the database using a similar call. Here is my script, startdb.sh, that does that. #!/bin/bash . ./oci-curl.sh oci-curl database.us-phoenix-1.oraclecloud.com POST ./empty.json /20160918/autonomousDataWarehouses/$1/actions/start Here it is running for my database. 
./startdb.sh ocid1.autonomousdwdatabase.oc1.phx.abyhqljszen442afj6ogitgs3dwk2iunhv7zdndllf6o6is6xg2ku5a7uf3a { "compartmentId" : "ocid1.tenancy.oc1..aaaaaaaaro2vctz2hianklgq77hguo6jzcs6ezyheouqfsald4x3nubpwr2a", "connectionStrings" : { "high" : "adwc.uscom-west-1.oraclecloud.com:1522/nqpobuuiaedvlyf_adwdb1_high.adwc.oraclecloud.com", "low" : "adwc.uscom-west-1.oraclecloud.com:1522/nqpobuuiaedvlyf_adwdb1_low.adwc.oraclecloud.com", "medium" : "adwc.uscom-west-1.oraclecloud.com:1522/nqpobuuiaedvlyf_adwdb1_medium.adwc.oraclecloud.com" }, "cpuCoreCount" : 1, "dataStorageSizeInTBs" : 1, "dbName" : "adwdb1", "definedTags" : { }, "displayName" : "adwdb1", "freeformTags" : { }, "id" : "ocid1.autonomousdwdatabase.oc1.phx.abyhqljszen442afj6ogitgs3dwk2iunhv7zdndllf6o6is6xg2ku5a7uf3a", "licenseModel" : "LICENSE_INCLUDED", "lifecycleDetails" : null, "lifecycleState" : "STARTING", "serviceConsoleUrl" : "https://adwc.uscom-west-1.oraclecloud.com/console/index.html?tenant_name=OCID1.TENANCY.OC1..AAAAAAAARO2VCTZ2HIANKLGQ77HGUO6JZCS6EZYHEOUQFSALD4X3NUBPWR2A&database_name=ADWDB1&service_type=ADW", "timeCreated" : "2018-09-06T19:56:48.077Z" Other Operations on ADW Instances I gave some examples of common operations on an ADW instance, to use REST APIs for other operations you can use the same oci-curl function and the relevant API documentation. For demo purposes, as you saw I have hardcoded some stuff like OCIDs, you can further enhance and parameterize these scripts to use them generally for your ADW environment. Next, I will post some examples of managing ADW instances using the command line utility oci-cli.
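Since all of these calls are asynchronous (the responses only show transitional states such as SCALE_IN_PROGRESS, STOPPING or STARTING), a small polling helper can be convenient. The following is only a sketch under the same assumptions as the scripts above (oci-curl.sh sourced from the current directory, the us-phoenix-1 database endpoint, and the GET path for a single Autonomous Data Warehouse); the script name waitdb.sh and the 30-second interval are mine, not from the post.
#!/bin/bash
# waitdb.sh (hypothetical helper): poll an ADW instance until it reaches the requested lifecycle state.
# Usage: ./waitdb.sh <adw-ocid> [TARGET_STATE]   e.g. ./waitdb.sh ocid1.autonomousdwdatabase.oc1.phx.... AVAILABLE
. ./oci-curl.sh
TARGET=${2:-AVAILABLE}
while true; do
  # fetch the instance details and pull out lifecycleState from the JSON response
  STATE=$(oci-curl database.us-phoenix-1.oraclecloud.com get "/20160918/autonomousDataWarehouses/$1" \
          | python -c 'import json,sys; print(json.load(sys.stdin)["lifecycleState"])')
  echo "$(date +%T) lifecycleState=$STATE"
  [ "$STATE" = "$TARGET" ] && break
  sleep 30
done
You could call it right after updatedb.sh, stopdb.sh or startdb.sh to block until the operation finishes.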


See How Easily You Can Query Object Store with Big Data Cloud Service (BDCS)

What is Object Store? Object Store has become an increasingly popular storage type, especially in the cloud. It provides several benefits:
- Elasticity. Customers don't have to plan capacity ahead of time. Need some extra space? Simply load more data into Object Store.
- Scalability. It scales virtually without limit. At least theoretically :)
- Durability and Availability. Object Store is a first-class citizen in every cloud story, so all vendors do their best to maintain availability and durability. If a disk goes down, it shouldn't worry you. If a node running the Object Store software goes down, it shouldn't worry you. As a user, you simply put data in and read data back out.
- Cost. In the cloud, Object Store is the most cost-efficient storage option.
Nothing comes for free, and as downsides I would highlight:
- Performance, compared with HDFS or local block devices. Whenever you read data from Object Store, you read it over the network.
- Inconsistency of performance. You are not alone on the object store, and under the hood it uses physical disks with their own throughput limits. If many users read and write data to/from Object Store at the same time, you may get performance that differs from what you saw a day, a week, or a month ago.
- Security. Unlike file systems, Object Store has no fine-grained file permissions, so customers may need to reorganize and rebuild their security standards and policies.
Based on the points above, we can conclude that Object Store is well suited for sharing data across many systems and as a historical layer for certain information management systems. If we compare Object Store with HDFS (both are schema-on-read systems, which simply store data and apply a schema at runtime when a user runs a query), I would personally characterize HDFS as "write once, read many" and Object Store as "write once, read few". So it is more of a historical tier (cheaper and slower) than HDFS. In the context of information management, Object Store sits at the bottom of the storage pyramid.
How to copy data to Object Store
Let's imagine that we have Big Data Cloud Service (BDCS) and want to archive some data from HDFS to Object Store (for example, because we are running out of HDFS capacity). There are multiple ways to do this (I've written about them earlier here), but I'll pick ODCP - the Oracle-built tool for copying data between multiple sources, including HDFS and Object Store. You can find the full documentation here; below is a brief example of how I did it on my test cluster. First we need to define the Object Store credentials on the client node (in my case one of the BDCS nodes), where we will run the client:
[opc@node01 ~]$ export CM_ADMIN=admin
[opc@node01 ~]$ export CM_PASSWORD=Welcome1!
[opc@node01 ~]$ export CM_URL=https://cmhost.us2.oraclecloud.com:7183
[opc@node01 ~]$ bda-oss-admin add_swift_cred --swift-username "storage-a424392:alexey@oracle.com" --swift-password "MyPassword-" --swift-storageurl "https://storage-a422222.storage.oraclecloud.com/auth/v2.0/tokens" --swift-provider bdcstorage
After this we can check that the credential appears:
[opc@node01 ~]$ bda-oss-admin list_swift_creds -t
PROVIDER  USERNAME                                                    STORAGE URL
bdcstorage storage-a424392:alexey.filanovskiy@oracle.com               https://storage-a422222.storage.oraclecloud.com/auth/v2.0/tokens
Next we copy the data from HDFS to Object Store:
[opc@node01 ~]$ odcp hdfs:///user/hive/warehouse/parq.db/ swift://tpcds-parq.bdcstorage/parq.db
...
[opc@node01 ~]$ odcp hdfs:///user/hive/warehouse/csv.db/ swift://tpcds-parq.bdcstorage/csv.db
Now we have the data in Object Store:
[opc@node01 ~]$  hadoop fs -du -h  swift://tpcds-parq.bdcstorage/parq.db
...
74.2 K   74.2 K   swift://tpcds-parq.bdcstorage/parq.db/store
14.4 G   14.4 G   swift://tpcds-parq.bdcstorage/parq.db/store_returns
272.8 G  272.8 G  swift://tpcds-parq.bdcstorage/parq.db/store_sales
466.1 K  466.1 K  swift://tpcds-parq.bdcstorage/parq.db/time_dim
...
This is a good time to define the tables in the Hive Metastore. I'll show only one table here; the rest I created with a script (see the sketch after this section):
0: jdbc:hive2://node03:10000/default> CREATE EXTERNAL TABLE store_sales (
ss_sold_date_sk bigint, ss_sold_time_sk bigint, ss_item_sk bigint, ss_customer_sk bigint, ss_cdemo_sk bigint, ss_hdemo_sk bigint, ss_addr_sk bigint, ss_store_sk bigint, ss_promo_sk bigint, ss_ticket_number bigint, ss_quantity int, ss_wholesale_cost double, ss_list_price double, ss_sales_price double, ss_ext_discount_amt double, ss_ext_sales_price double, ss_ext_wholesale_cost double, ss_ext_list_price double, ss_ext_tax double, ss_coupon_amt double, ss_net_paid double, ss_net_paid_inc_tax double, ss_net_profit double
) STORED AS PARQUET LOCATION 'swift://tpcds-parq.bdcstorage/parq.db/store_sales'
Make sure that you have the required libraries in place for Hive and for Spark:
[opc@node01 ~]$ dcli -C cp /opt/oracle/bda/bdcs/bdcs-rest-api-app/current/lib-hadoop/hadoop-openstack-spoc-2.7.2.jar /opt/cloudera/parcels/CDH-5.13.1-1.cdh5.13.1.p0.2/bin/../lib/hadoop-mapreduce/
[opc@node01 ~]$ dcli -C cp /opt/oracle/bda/bdcs/bdcs-rest-api-app/current/lib-hadoop/hadoop-openstack-spoc-2.7.2.jar /opt/cloudera/parcels/SPARK2-2.2.0.cloudera2-1.cdh5.12.0.p0.232957/lib/spark2/jars/
Now we are ready to test!
Why should you use smart data formats? Predicate Push Down
In the Big Data world there is a class of file formats often called "smart" formats (for example, ORC and Parquet). They carry metadata inside the file, which can dramatically speed up some queries. The most powerful feature is Predicate Push Down, which filters data where it lives instead of moving it over the network. Each Parquet page stores minimum and maximum values, which allows entire pages to be skipped. The following SQL predicates can be used for this kind of filtering: < <= = != >= >. As always, it's better to see it once than to hear about it many times.
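One note before the comparison: the parq_swift and csv_swift tables queried below point at the Object Store copies created above, but the post only shows the Parquet DDL. The following is a hedged sketch of how the CSV flavour might be declared; the '|' field delimiter and the csv_swift database name are assumptions taken from the TPC-DS text format and the database listing shown later, so adjust them to your own data.
beeline -u jdbc:hive2://node03:10000/default <<'EOF'
-- sketch only: register the CSV copy that odcp placed under swift://tpcds-parq.bdcstorage/csv.db
CREATE DATABASE IF NOT EXISTS csv_swift;
CREATE EXTERNAL TABLE IF NOT EXISTS csv_swift.store_sales (
  ss_sold_date_sk bigint, ss_sold_time_sk bigint, ss_item_sk bigint,
  ss_customer_sk bigint, ss_cdemo_sk bigint, ss_hdemo_sk bigint,
  ss_addr_sk bigint, ss_store_sk bigint, ss_promo_sk bigint,
  ss_ticket_number bigint, ss_quantity int, ss_wholesale_cost double,
  ss_list_price double, ss_sales_price double, ss_ext_discount_amt double,
  ss_ext_sales_price double, ss_ext_wholesale_cost double, ss_ext_list_price double,
  ss_ext_tax double, ss_coupon_amt double, ss_net_paid double,
  ss_net_paid_inc_tax double, ss_net_profit double)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '|'   -- assumed delimiter, verify against your files
STORED AS TEXTFILE
LOCATION 'swift://tpcds-parq.bdcstorage/csv.db/store_sales';
EOF
The Parquet table shown above was registered the same way under the parq_swift database. Now for the comparison itself: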
0: jdbc:hive2://node03:10000/default> select count(1) from parq_swift.store_sales; ... +-------------+--+ |     _c0     | +-------------+--+ | 6385178703  | +-------------+--+ 1 row selected (339.221 seconds)   We could take a look on the resource utilization and we could note, that Network quite heavily utilized. Now, let's try to do the same with csv files: 0: jdbc:hive2://node03:10000/default> select count(1) from csv_swift.store_sales; +-------------+--+ |     _c0     | +-------------+--+ | 6385178703  | +-------------+--+ 1 row selected (762.38 seconds) As we can see all the same - high network utilization, but query takes even longer. It's because CSV is row row format and we could not do column pruning.   so, let's try to feel power of Predicate Push Down and let's use some equal predicate in the query: 0: jdbc:hive2://node03:10000/default> select count(1) from parq_swift.store_sales where ss_ticket_number=50940847; ... +------+--+ | _c0  | +------+--+ | 6    | +------+--+ 1 row selected (74.689 seconds)   Now we can see that in case of parquet files we almost don't utilize network. Let's see how it's gonna be in case of csv files. 0: jdbc:hive2://node03:10000/default> select count(1) from csv_swift.store_sales where ss_ticket_number=50940847; ... +------+--+ | _c0  | +------+--+ | 6    | +------+--+ 1 row selected (760.682 seconds) well, as assumed, csv files don't get any benefits out of WHERE predicate. But, not all functions could be offloaded. To illustrate this I run query with cast function over parquet files: 0: jdbc:hive2://node03:10000/default> select count(1) from parq_swift.store_sales where cast(ss_promo_sk as string) like '%3303%'; ... +---------+--+ |   _c0   | +---------+--+ | 959269  | +---------+--+ 1 row selected (133.829 seconds)   as we can see, we move part of data set to the BDCS instance and process it there.   Column projection another feature of Parquetfiles is column format, which means that then less columns we are using, then less data we bring back to the BDCS. Let me illustrate this by running same query with one column and with 24 columns (I'll use cast function, which is not pushed down). 0: jdbc:hive2://node03:10000/default> select ss_ticket_number from parq_swift.store_sales . . . . . . . . . . . . . . . . . . . . > where . . . . . . . . . . . . . . . . . . . . > cast(ss_ticket_number as string) like '%50940847%'; ... 127 rows selected (128.887 seconds) now I run the query over same data, but request 24 columns: 0: jdbc:hive2://node03:10000/default> select  . . . . . . . . . . . . . . . . . . . . > ss_sold_date_sk            . . . . . . . . . . . . . . . . . . . . > ,ss_sold_time_sk            . . . . . . . . . . . . . . . . . . . . > ,ss_item_sk                 . . . . . . . . . . . . . . . . . . . . > ,ss_customer_sk             . . . . . . . . . . . . . . . . . . . . > ,ss_cdemo_sk                . . . . . . . . . . . . . . . . . . . . > ,ss_hdemo_sk                . . . . . . . . . . . . . . . . . . . . > ,ss_addr_sk                 . . . . . . . . . . . . . . . . . . . . > ,ss_store_sk                . . . . . . . . . . . . . . . . . . . . > ,ss_promo_sk                . . . . . . . . . . . . . . . . . . . . > ,ss_ticket_number           . . . . . . . . . . . . . . . . . . . . > ,ss_quantity                . . . . . . . . . . . . . . . . . . . . > ,ss_wholesale_cost          . . . . . . . . . . . . . . . . . . . . > ,ss_list_price              . . . . . . . . . . . . . . . . . . . . > ,ss_sales_price             . . . . . . . . . . . . . . . . 
. . . . > ,ss_ext_discount_amt        . . . . . . . . . . . . . . . . . . . . > ,ss_ext_sales_price         . . . . . . . . . . . . . . . . . . . . > ,ss_ext_wholesale_cost      . . . . . . . . . . . . . . . . . . . . > ,ss_ext_list_price          . . . . . . . . . . . . . . . . . . . . > ,ss_ext_tax                 . . . . . . . . . . . . . . . . . . . . > ,ss_coupon_amt              . . . . . . . . . . . . . . . . . . . . > ,ss_net_paid                . . . . . . . . . . . . . . . . . . . . > ,ss_net_paid_inc_tax        . . . . . . . . . . . . . . . . . . . . > ,ss_net_profit              . . . . . . . . . . . . . . . . . . . . > from parq_swift.store_sales . . . . . . . . . . . . . . . . . . . . > where . . . . . . . . . . . . . . . . . . . . > cast(ss_ticket_number as string) like '%50940847%'; ... 127 rows selected (333.641 seconds)   ​I think after seen these numbers you will always put only columns that you need.   Object store vs HDFS performance Now, I'm going to show example of performance numbers for Object Store and for HDFS. It's not official benchmark, just numbers, which could give you idea how compete performance over Object store vs HDFS. Querying Object Store with Spark SQL as s bonus I'd like to show who to query object store with Spark SQL.   [opc@node01 ~]$ spark2-shell  .... scala> import org.apache.spark.sql.SparkSession import org.apache.spark.sql.SparkSession   scala> val warehouseLocation = "file:${system:user.dir}/spark-warehouse" warehouseLocation: String = file:${system:user.dir}/spark-warehouse   scala> val spark = SparkSession.builder().appName("SparkSessionZipsExample").config("spark.sql.warehouse.dir", warehouseLocation).enableHiveSupport().getOrCreate() 18/07/09 05:36:32 WARN sql.SparkSession$Builder: Using an existing SparkSession; some configuration may not take effect. 
spark: org.apache.spark.sql.SparkSession = org.apache.spark.sql.SparkSession@631c244c
scala> spark.catalog.listDatabases.show(false)
+----------+---------------------+----------------------------------------------------+
|name      |description          |locationUri                                         |
+----------+---------------------+----------------------------------------------------+
|csv       |null                 |hdfs://bdcstest-ns/user/hive/warehouse/csv.db       |
|csv_swift |null                 |hdfs://bdcstest-ns/user/hive/warehouse/csv_swift.db |
|default   |Default Hive database|hdfs://bdcstest-ns/user/hive/warehouse              |
|parq      |null                 |hdfs://bdcstest-ns/user/hive/warehouse/parq.db      |
|parq_swift|null                 |hdfs://bdcstest-ns/user/hive/warehouse/parq_swift.db|
+----------+---------------------+----------------------------------------------------+
scala> spark.catalog.listTables.show(false)
+--------------------+--------+-----------+---------+-----------+
|name                |database|description|tableType|isTemporary|
+--------------------+--------+-----------+---------+-----------+
|customer_demographic|default |null       |EXTERNAL |false      |
|iris_hive           |default |null       |MANAGED  |false      |
+--------------------+--------+-----------+---------+-----------+
scala> val resultsDF = spark.sql("select count(1) from parq_swift.store_sales where cast(ss_promo_sk as string) like '%3303%' ")
resultsDF: org.apache.spark.sql.DataFrame = [count(1): bigint]
scala> resultsDF.show()
[Stage 1:==>                                                  (104 + 58) / 2255]
In fact, for Spark SQL there is no difference between Swift and HDFS; all the performance considerations I mentioned above apply in the same way.
Parquet files. Warning!
After looking at these results you may want to convert everything to Parquet, but don't rush to do so. Parquet is schema-on-write, which means you perform ETL when you convert data into it. ETL brings optimization, but also the possibility of making a mistake during the transformation. Here is an example. I have a table with timestamps, which obviously cannot be less than 0:
hive> create table tweets_parq  ( username  string,    tweet     string,    TIMESTAMP smallint    )  STORED AS PARQUET;
hive> INSERT OVERWRITE TABLE tweets_parq select * from  tweets_flex;
We defined the timestamp as smallint, which is not big enough for some of the data:
hive> select TIMESTAMP from tweets_parq
...
------------
 1472648470
-6744
As a consequence we got an overflow and ended up with a negative timestamp. Converting to smart formats such as Parquet is a transformation, and during a transformation you can make mistakes; this is why it's better to also preserve the data in its original format.
Conclusion
1) Object Store is not a competitor to HDFS. HDFS is a schema-on-read system, which can give you good performance (though definitely lower than a schema-on-write system such as a database). Object Store gives you elasticity; it's a good option for historical data that you plan to use infrequently.
2) Object Store adds significant startup overhead, so it's not suitable for interactive queries.
3) If you put data on Object Store, consider using smart file formats such as Parquet. They give you the benefits of Predicate Push Down as well as column projection.
4) Converting to smart formats such as Parquet is a transformation, and during a transformation you can make mistakes; it's safer to also preserve the data in its original format.
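To make point 4 concrete, here is a minimal sketch of how the same conversion could avoid the overflow shown above, simply by declaring the timestamp column wide enough. The table name tweets_parq_fixed and the assumption that the source column in tweets_flex is called timestamp are mine; only the tweets_flex/tweets_parq example itself comes from the post.
beeline -u jdbc:hive2://node03:10000/default <<'EOF'
-- use bigint (or a proper timestamp type) instead of smallint so epoch values do not overflow
CREATE TABLE tweets_parq_fixed (
  username  string,
  tweet     string,
  ts        bigint)
STORED AS PARQUET;
-- `timestamp` as the source column name is an assumption based on the original table definition
INSERT OVERWRITE TABLE tweets_parq_fixed
SELECT username, tweet, CAST(`timestamp` AS bigint) FROM tweets_flex;
EOF
Even with a fix like this, keeping the original raw data around remains the safest option.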



Start Planning your Upgrade Strategy to Cloudera 6 on Oracle Big Data Now

Last week Cloudera announced the general availability of Cloudera CDH 6 (read more from Cloudera here). With that, many of the ecosystem components switched to a newer base version, which should provide significant benefits for customer applications. This post describes Oracle's strategy for supporting our customers in taking up C6 quickly and efficiently, with minimal disruption to their infrastructure.
The Basics
One of the key differences with C6 is the set of core component versions, summarized here for everyone's benefit:
Apache Hadoop 3.0
Apache Hive 2.1
Apache Parquet 1.9
Apache Spark 2.2
Apache Solr 7.0
Apache Kafka 1.0
Apache Sentry 2.0
Cloudera Manager 6.0
Cloudera Navigator 6.0
and much more... for full details, always check the Cloudera download bundles or Oracle's documentation. Now, what does all this mean for Oracle's Big Data platform (cloud and on-premises) customers?
Upgrading the Platform
This is where running Big Data Cloud Service, Big Data Appliance and Big Data Cloud at Customer makes a big difference. As with minor updates, where we move the entire stack (OS, JDK, MySQL, Cloudera CDH and everything else), we will also do this for your CDH 5.x to CDH 6.x move. What to expect:
Target Version: CDH 6.0.1, which at the time of writing this post has not been released
Target Dates: November 2018, with a dependency on the actual 6.0.1 release date
Automated Upgrade: Yes - as with minor releases, CDH and the entire stack (OS, MySQL, JDK) will be upgraded using the Mammoth Utility
As always, Oracle is building this all in house, and we are testing the migration across a number of scenarios for technical correctness.
Application Impact
The first thing to start planning for is what a version uptick like this means for your applications. Will everything work nicely as before? Well, that is where the hard work comes in: testing the actual applications on a C6 version. In general, we recommend configuring a small BDA/BDCS/BDCC cluster, loading some data (also note the paragraph below on Erasure Coding in that respect) and then doing the appropriate functional testing. Once that is all running satisfactorily and per your expectations, you would start to upgrade existing clusters.
What about Erasure Coding?
This is the big feature that will become available in the 6.1 timeframe. Just to be clear, Erasure Coding is not in the first versions supported by Cloudera. Therefore it will also not be supported on the Oracle platforms, which are based on 6.0.1 (note the 0 in the middle :-) ). As usual, once 6.1 is available, Oracle will offer that as a release to upgrade to, and we will at that time address the details around Erasure Coding, how to get there, and how to leverage it on the Oracle Big Data solutions. To give everyone a quick 10,000 foot guideline: keep using regular block encoding (the current HDFS structure) for best performance, and use Erasure Coding for storage savings, while understanding that the additional network traffic can impact raw performance.
Do I have to Move?
No. You do not have to move to CDH 6, nor do you need to switch to Erasure Coding. We do expect one more 5.x release, most likely 5.16, and will release this on our platforms as well. That is of course a fully supported release. It is then - generally speaking - up to your timelines to move to the C6 platform. As we move closer to the C6 release on BDA, BDCS and BDCC we will provide updates on specific versions to migrate from, dates, timelines, etc.
Should you have questions, contact us in the big data community. The preceding is intended to outline our general product direction. It is intended for information purposes only, and may not be incorporated into any contract. It is not a commitment to deliver any material, code, or functionality, and should not be relied upon in making purchasing decisions. The development, release, and timing of any features or functionality described for Oracle’s products remains at the sole discretion of Oracle.  



Roadmap Update: BDA 4.13 with Cloudera CDH 5.15.x

In our continued quest to keep all of you informed about the latest versions, releases and approximate timelines, here is the next update. BDA 4.13 will have the following features and versions:
- An update to CDH, of course. In this case we are going to take up CDH 5.15.0. However, the release date of 5.15.1 is pretty close to our planned date, so we may choose to pick up that version instead.
- Support for SHA-2 certificates with Kerberos
- Upgrade and Expand for Kafka clusters (create was introduced in BDA 4.12)
- A disk shredding utility that lets you easily "erase" data on the disks. We expect most customers to use this on cloud nodes.
- Support for Active Directory in Big Data Manager
- As usual, we will update the JDK and the Linux OS to the latest versions and apply the latest security updates; the same goes for MySQL.
Then there is of course the important question of timelines. Right now - subject to change and the safe harbor statement below - we are looking at mid August as the planned date, assuming we go with 5.15.0. If you are interested in discussing or checking up on the dates and features, or have other questions, see our new community, or visit it using the direct link to our community home. As always, feedback and comments are welcome. Please note the Safe Harbor Statement below: The preceding is intended to outline our general product direction. It is intended for information purposes only, and may not be incorporated into any contract. It is not a commitment to deliver any material, code, or functionality, and should not be relied upon in making purchasing decisions. The development, release, and timing of any features or functionality described for Oracle’s products remains at the sole discretion of Oracle.



Need Help with Oracle Big Data: Here is Where to Go!

We are very excited to announce our newly launched big data community on the Cloud Customer Connect Community. As of today we are live and ready to help, discuss and ensure your questions are answered and your comments are taken on board.
How do you find us? Easy: go to the community main page, then do the following to start asking your questions on big data. Once you are there, click on what you want from us; in this case I assume you want answers to some of your questions. So, click on Answers in the menu bar, and then on Platform (PaaS). From there, just look in the Data Management section and choose Big Data. All set... now you are ready - provided you are a member, of course - to start asking questions and, if you know some answers, helping others in the community.
What do we cover in this community? Great question. Since the navigation and the title allude to Cloud, you would expect us to cover our cloud service, and that is correct. But because we are Oracle, we have a wide portfolio, and you will have questions about an entire ecosystem of tools, utilities and solutions, as well as architecture questions and ideas. So, rather than limiting questions, ideas and thoughts, we decided to broaden the scope to what we think the community will be discussing. Here are some of the things we hope to cover:
Big Data Cloud Service (BDCS) - of course
The Cloudera stack included
Specific cloud features like: Bursting/shrinking, One-click Secure Clusters, Easy Upgrade, Networking / Port Management, and more...
Big Data Spatial and Graph, which is included in BDCS
Big Data Connectors and ODI, also included in BDCS
Big Data Manager and its notebook feature (Zeppelin based) and other cool features
Big Data SQL Cloud Service and of course the general software features in Big Data SQL
Big Data Best Practices
Architecture Patterns and Reference Architectures
Configuration and Tuning / Setup
When to use what tools or technologies
Service and Product roadmaps and announcements
And more
Hopefully that will trigger all of you (and us) to collaborate, discuss and make our community a fun and helpful one.
Who is on here from Oracle? Well, hopefully a lot of people will join us, both from Oracle and from customers, partners and universities/schools. But we, as the product development team, will be manning the front lines. So you have product management resources, some architects and some developers working in the community. And with that, see you all soon in the community!



Big Data SQL Quick Start. Kerberos - Part 26

In Hadoop world Kerberos is de facto standard of securing cluster and it's even not a question that Big Data SQL should support Kerberos. Oracle has good documentation about how to install  Big Data SQL over Kerberized cluster, but today, I'd like to show couple typical steps how to test and debug Kerberized installation. First of all, let me tell about test environment. it's a 4 nodes: 3 nodes for Hadoop cluster (vm0[1-3]) and one for Database (vm04). Kerberos tickets should be initiated from keytab file, which should be on the database side (in case of RAC on each database node) and on each Hadoop node. Let's check that on the database node we have valid Kerberos ticket: [oracle@vm04 ~]$ id uid=500(oracle) gid=500(oinstall) groups=500(oinstall),501(dba) [oracle@scaj0602bda09vm04 ~]$ klist  Ticket cache: FILE:/tmp/krb5cc_500 Default principal: oracle/martybda@MARTYBDA.ORACLE.COM   Valid starting     Expires            Service principal 07/23/18 01:15:58  07/24/18 01:15:58  krbtgt/MARTYBDA.ORACLE.COM@MARTYBDA.ORACLE.COM     renew until 07/30/18 01:15:01 let's check that we have access to HDFS from database host: [oracle@vm04 ~]$ cd $ORACLE_HOME/bigdatasql [oracle@vm04 bigdatasql]$ ls -l|grep hadoop*env -rw-r--r-- 1 oracle oinstall 2249 Jul 12 15:41 hadoop_martybda.env [oracle@vm04 bigdatasql]$ source hadoop_martybda.env  [oracle@vm04 bigdatasql]$ hadoop fs -ls ... Found 4 items drwx------   - oracle hadoop          0 2018-07-13 06:00 .Trash drwxr-xr-x   - oracle hadoop          0 2018-07-12 05:10 .sparkStaging drwx------   - oracle hadoop          0 2018-07-12 05:17 .staging drwxr-xr-x   - oracle hadoop          0 2018-07-12 05:14 oozie-oozi [oracle@vm04 bigdatasql]$  seems everything is ok. let's do the same from Hadoop node: [root@vm01 ~]# su - oracle [oracle@scaj0602bda09vm01 ~]$ id uid=1000(oracle) gid=1001(oinstall) groups=1001(oinstall),127(hive),1002(dba) [oracle@vm01 ~]$ klist  Ticket cache: FILE:/tmp/krb5cc_1000 Default principal: oracle/martybda@MARTYBDA.ORACLE.COM   Valid starting     Expires            Service principal 07/23/18 01:15:02  07/24/18 01:15:02  krbtgt/MARTYBDA.ORACLE.COM@MARTYBDA.ORACLE.COM     renew until 07/30/18 01:15:02 let's check that we have assess for the environment and also create test file on HDFS: [oracle@vm01 ~]$ echo "line1" >> test.txt [oracle@vm01 ~]$ echo "line2" >> test.txt [oracle@vm01 ~]$ hadoop fs -mkdir /tmp/test_bds [oracle@vm01 ~]$ hadoop fs -put test.txt /tmp/test_bds   now, let's jump to Database node and create external table for this file: [root@vm04 bin]# su - oracle [oracle@vm04 ~]$ . oraenv <<< orcl ORACLE_SID = [oracle] ? The Oracle base has been set to /u03/app/oracle [oracle@vm04 ~]$ sqlplus / as sysdba   SQL*Plus: Release 12.1.0.2.0 Production on Mon Jul 23 06:39:06 2018   Copyright (c) 1982, 2014, Oracle.  All rights reserved.     Connected to: Oracle Database 12c Enterprise Edition Release 12.1.0.2.0 - 64bit Production With the Partitioning, OLAP, Advanced Analytics and Real Application Testing options   SQL> alter session set container=PDBORCL;   Session altered.   SQL> CREATE TABLE bds_test (line VARCHAR2(4000))    ORGANIZATION EXTERNAL  (   TYPE ORACLE_HDFS       DEFAULT DIRECTORY       DEFAULT_DIR LOCATION ('/tmp/test_bds')   )    REJECT LIMIT UNLIMITED;      Table created.   
SQL>  and for sure this is our two row file which we created on the previous step: SQL> select * from bds_test;   LINE ------------------------------------ line1 line2 Now let's go through some typical cases with Kerberos and let's talk about how to catch it.   Kerberos ticket missed on the database side Let's simulate case when Kerberos ticket is missed on the database side. it's pretty easy and for doing this we will use kdestroy command: [oracle@vm04 ~]$ kdestroy  [oracle@vm04 ~]$ klist  klist: No credentials cache found (ticket cache FILE:/tmp/krb5cc_500) extproc cache Kerberos ticket, so to apply our changes, you will need to restart extproc. First, we will need to obtain name of the extproc: [oracle@vm04 admin]$ cd $ORACLE_HOME/hs/admin [oracle@vm04 admin]$ ls -l total 24 -rw-r--r-- 1 oracle oinstall 1170 Mar 27 01:04 extproc.ora -rw-r----- 1 oracle oinstall 3112 Jul 12 15:56 initagt.dat -rw-r--r-- 1 oracle oinstall  190 Jul 12 15:41 initbds_orcl_martybda.ora -rw-r--r-- 1 oracle oinstall  489 Mar 27 01:04 initdg4odbc.ora -rw-r--r-- 1 oracle oinstall  406 Jul 12 15:11 listener.ora.sample -rw-r--r-- 1 oracle oinstall  244 Jul 12 15:11 tnsnames.ora.sample name consist of database SID and Hadoop Cluster name. So, seems our extproc name is bds_orcl_martybda. let's stop and start it: [oracle@vm04 admin]$ mtactl stop bds_orcl_martybda   ORACLE_HOME = "/u03/app/oracle/12.1.0/dbhome_orcl" MTA init file = "/u03/app/oracle/12.1.0/dbhome_orcl/hs/admin/initbds_orcl_martybda.ora"   oracle 16776 1 0 Jul12 ? 00:49:25 extprocbds_orcl_martybda -mt Stopping MTA process "extprocbds_orcl_martybda -mt"...   MTA process "extprocbds_orcl_martybda -mt" stopped!   [oracle@vm04 admin]$ mtactl start bds_orcl_martybda   ORACLE_HOME = "/u03/app/oracle/12.1.0/dbhome_orcl" MTA init file = "/u03/app/oracle/12.1.0/dbhome_orcl/hs/admin/initbds_orcl_martybda.ora"   MTA process "extprocbds_orcl_martybda -mt" is not running!   Checking MTA init parameters...   [O]  INIT_LIBRARY=$ORACLE_HOME/lib/libkubsagt12.so [O]  INIT_FUNCTION=kubsagtMTAInit [O]  BDSQL_CLUSTER=martybda [O]  BDSQL_CONFIGDIR=/u03/app/oracle/12.1.0/dbhome_orcl/bigdatasql/databases/orcl/bigdata_config   MTA process "extprocbds_orcl_martybda -mt" started! oracle 19498 1 4 06:58 ? 00:00:00 extprocbds_orcl_martybda -mt now we reset Kerberos ticket cache. Let's try to query HDFS data: [oracle@vm04 admin]$ sqlplus / as sysdba   SQL*Plus: Release 12.1.0.2.0 Production on Mon Jul 23 07:00:26 2018   Copyright (c) 1982, 2014, Oracle.  All rights reserved.     Connected to: Oracle Database 12c Enterprise Edition Release 12.1.0.2.0 - 64bit Production With the Partitioning, OLAP, Advanced Analytics and Real Application Testing options   SQL> alter session set container=PDBORCL;   Session altered.   SQL> select * from bds_test; select * from bds_test * ERROR at line 1: ORA-29913: error in executing ODCIEXTTABLEOPEN callout ORA-29400: data cartridge error KUP-11504: error from external driver: java.lang.Exception: Error initializing JXADProvider: Failed on local exception: java.io.IOException: javax.security.sasl.SaslException: GSS initiate failed [Caused by GSSException: No valid credentials provided (Mechanism level: Failed to find any Kerberos tgt)]; Host Details : local host is: "m04.vm.oracle.com/192.168.254.5"; destination host is: "vm02.vm.oracle.com":8020; remember this error. If you see it it means that you don't have valid Kerberos ticket on the database side. Let's bring everything back and make sure that our environment again works properly. 
[oracle@vm04 admin]$ crontab -l
15 1,7,13,19 * * * /bin/su - oracle -c "/usr/bin/kinit oracle/martybda@MARTYBDA.ORACLE.COM -k -t /u03/app/oracle/12.1.0/dbhome_orcl/bigdatasql/clusters/martybda/oracle.keytab"
[oracle@vm04 admin]$ /usr/bin/kinit oracle/martybda@MARTYBDA.ORACLE.COM -k -t /u03/app/oracle/12.1.0/dbhome_orcl/bigdatasql/clusters/martybda/oracle.keytab
[oracle@vm04 admin]$ klist 
Ticket cache: FILE:/tmp/krb5cc_500
Default principal: oracle/martybda@MARTYBDA.ORACLE.COM
Valid starting     Expires            Service principal
07/23/18 07:03:46  07/24/18 07:03:46  krbtgt/MARTYBDA.ORACLE.COM@MARTYBDA.ORACLE.COM
    renew until 07/30/18 07:03:46
[oracle@vm04 admin]$ mtactl stop bds_orcl_martybda
...
[oracle@vm04 admin]$ mtactl start bds_orcl_martybda
...
[oracle@scaj0602bda09vm04 admin]$ sqlplus / as sysdba
...
SQL> alter session set container=PDBORCL;
Session altered.
SQL> select * from bds_test;
LINE
----------------------------------------
line1
line2
SQL>
Kerberos ticket missing on the Hadoop side
Another case is when the Kerberos ticket is missing on the Hadoop side (for the oracle user). Let's take a look at what happens in that situation. For this I will again use the kdestroy tool, on each Hadoop node:
[oracle@vm01 ~]$ id
uid=1000(oracle) gid=1001(oinstall) groups=1001(oinstall),127(hive),1002(dba)
[oracle@vm01 ~]$ kdestroy
After performing these steps, let's go to the database side and run the query again:
[oracle@vm04 bigdata_config]$ sqlplus / as sysdba
...
SQL> alter session set container=PDBORCL;
Session altered.
SQL> select * from bds_test;
LINE
----------------------------------------
line1
line2
SQL>
At first glance everything looks fine, but let's take a look at the execution statistics:
SQL> select n.name, s.value /* , s.inst_id, s.sid */ from v$statname n, gv$mystat s where n.name like '%XT%' and s.statistic# = n.statistic#;
NAME                                                             VALUE
---------------------------------------------------------------- ----------
cell XT granules requested for predicate offload                1
cell XT granule bytes requested for predicate offload           12
cell interconnect bytes returned by XT smart scan               8192
cell XT granule predicate offload retries                       3
cell XT granule IO bytes saved by storage index                 0
cell XT granule IO bytes saved by HDFS tbs extent map scan      0
We see that "cell XT granule predicate offload retries" is not equal to 0, which means that all of the real processing is happening on the database side. If you query a 10TB table on HDFS, you will bring all 10TB back over the wire and process it on the database side. Not good. So, if the Kerberos ticket is missing on the Hadoop side, the query will finish, but Smart Scan will not work.
Renewal of Kerberos tickets
One of the key Kerberos pillars is that tickets have an expiration time and the user has to renew them. During installation, Big Data SQL creates a crontab job that does this on the database side as well as on the Hadoop side.
If you lose it for some reason, you can use this one as an example:
[oracle@vm04 ~]$ crontab -l
15 1,7,13,19 * * * /bin/su - oracle -c "/usr/bin/kinit oracle/martybda@MARTYBDA.ORACLE.COM -k -t /u03/app/oracle/12.1.0/dbhome_orcl/bigdatasql/clusters/martybda/oracle.keytab"
One note: you will always use the oracle principal for Big Data SQL, but if you want fine-grained control over access to HDFS, you have to use the Multi-User Authorization feature, as explained here.
Conclusion
1) Big Data SQL works over Kerberized clusters.
2) You have to have valid Kerberos tickets on the database side as well as on the Hadoop side.
3) If the Kerberos ticket is missing on the database side, the query will fail.
4) If the Kerberos ticket is missing on the Hadoop side, the query will not fail, but it will run in fallback mode, moving all blocks over the wire to the database node and processing them there. You don't want to do that :)
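Because the two failure modes look so different (a hard error versus a silent fallback), it can be handy to check both from the database node in one go. The following is only a sketch built from the commands used in this post (klist, the bds_test table and the XT statistics query); the script name and the PDBORCL container name are simply the ones from this test environment.
#!/bin/bash
# bds_krb_check.sh (hypothetical helper): warn if the Kerberos ticket cache is empty,
# then run a test query and show the Smart Scan fallback counter.
if ! klist -s; then
  echo "WARNING: no valid Kerberos ticket in the cache - Big Data SQL queries will fail or fall back."
fi
sqlplus -s / as sysdba <<'EOF'
alter session set container=PDBORCL;
select count(*) from bds_test;
-- a non-zero value here means the query ran, but processing fell back to the database side
select n.name, s.value
  from v$statname n, gv$mystat s
 where n.name = 'cell XT granule predicate offload retries'
   and s.statistic# = n.statistic#;
EOF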


Secure Kafka Cluster

A while ago I wrote up Oracle best practices for building a secure Hadoop cluster; you can find the details here. In that blog I intentionally didn't mention Kafka's security, because the topic deserved a dedicated article. Now it's time to do that, and this blog is devoted to Kafka security only.
Kafka Security challenges
1) Encryption in motion. By default you communicate with a Kafka cluster over an unsecured network, and everyone who can listen to the network between your client and the Kafka cluster can read the message content. The way to avoid this is to use an on-wire encryption technology - SSL/TLS. Using SSL/TLS you encrypt data on the wire between your client and the Kafka cluster. Communication without SSL/TLS: SSL/TLS communication: After you enable SSL/TLS communication, the sequence of steps for writing/reading a message to/from the Kafka cluster looks like this:
2) Authentication. Now we encrypt traffic between client and server, but there is another challenge - the server doesn't know with whom it is communicating. In other words, you have to enable a mechanism that prevents UNKNOWN users from working with the cluster. The default authentication mechanism in the Hadoop world is the Kerberos protocol. Here is the workflow, which shows the sequence of steps to enable secure communication with Kafka: Kerberos is the trusted way to authenticate a user on the cluster and make sure that only known users can access it.
3) Authorization. Once you have authenticated the user on your cluster (and you know you are working as Bob or Alice), you may want to apply authorization rules, such as setting permissions for certain users or groups - in other words, defining what a user can and cannot do. Sentry can help you with this. Sentry's philosophy is that users belong to groups, groups have roles, and roles have permissions.
4) Encryption at rest. Another security aspect is encryption at rest, which is about protecting data stored on disk. Kafka is not intended for long-term data storage, but it can hold data for days or even weeks, and we have to make sure that data stored on the disks can't be stolen and then read without the encryption key.
Security implementation. Step 1 - SSL/TLS
There is no strict sequence of steps for the security implementation, but as a first step I recommend doing the SSL/TLS configuration. As a baseline I took Cloudera's documentation. To keep your security setup organized, create a directory on your Linux machine where you will put all the files (start with one machine; later you will need to do the same on the other Kafka servers):
$ sudo mkdir -p /opt/kafka/security
$ sudo chown -R kafka:kafka /opt/kafka/security
A Java KeyStore (JKS) is a repository of security certificates - either authorization certificates or public key certificates - plus corresponding private keys, used for instance in SSL encryption. We will need to generate a key pair (a public key and associated private key), wrap the public key into an X.509 self-signed certificate stored as a single-element certificate chain, and store this certificate chain and the private key in a new keystore entry identified by the alias selfsigned.
# keytool -genkeypair -keystore keystore.jks -keyalg RSA -alias selfsigned -dname "CN=localhost" -storepass 'welcome2' -keypass 'welcome3'
If you want to check the content of the keystore, you can run the following command:
# keytool -list -v -keystore keystore.jks
...
Alias name: selfsigned
Creation date: May 30, 2018
Entry type: PrivateKeyEntry
Certificate chain length: 1
Certificate[1]:
Owner: CN=localhost
Issuer: CN=localhost
Serial number: 2065847b
Valid from: Wed May 30 12:59:54 UTC 2018 until: Tue Aug 28 12:59:54 UTC 2018
...
As the next step we need to extract a copy of the certificate from the Java keystore that was just created:
# keytool -export -alias selfsigned -keystore keystore.jks -rfc -file server.cer
Enter keystore password: welcome2
Then create a trust store by making a copy of the default Java trust store. The main difference between a trustStore and a keyStore is that a trustStore (as the name suggests) stores certificates from trusted Certificate Authorities (CAs), which are used to verify the certificate presented by the server in an SSL connection, while a keyStore stores the private key and your own identity certificate, which the program presents to the other party (server or client) to verify its identity. You can find some more details here. In my case, on Big Data Cloud Service, I ran the following command:
# cp /usr/java/latest/jre/lib/security/cacerts /opt/kafka/security/truststore.jks
List the files we have so far:
# ls -lrt
-rw-r--r-- 1 root root 113367 May 30 12:46 truststore.jks
-rw-r--r-- 1 root root   2070 May 30 12:59 keystore.jks
-rw-r--r-- 1 root root   1039 May 30 13:01 server.cer
Put the certificate that was just extracted from the keystore into the trust store (note: "changeit" is the standard password):
# keytool -import -alias selfsigned -file server.cer -keystore truststore.jks -storepass changeit
Check the file size afterwards (the truststore is bigger, because it now includes the new certificate):
# ls -lrt
-rw-r--r-- 1 root root   2070 May 30 12:59 keystore.jks
-rw-r--r-- 1 root root   1039 May 30 13:01 server.cer
-rw-r--r-- 1 root root 114117 May 30 13:06 truststore.jks
It may seem complicated, so I decided to depict all these steps in one diagram. So far, all of these steps have been performed on a single (randomly chosen broker) machine.
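Since the same key material steps may need to be repeated, they can be collected into one small script. This is just a recap of the commands above, not an additional procedure; the passwords are the sample ones from this post, so replace them with your own.
#!/bin/bash
# Recap of the SSL/TLS key material steps above, collected into one re-runnable script.
set -e
cd /opt/kafka/security
# 1. key pair plus self-signed certificate in the broker keystore
keytool -genkeypair -keystore keystore.jks -keyalg RSA -alias selfsigned \
  -dname "CN=localhost" -storepass 'welcome2' -keypass 'welcome3'
# 2. export the certificate
keytool -export -alias selfsigned -keystore keystore.jks -rfc -file server.cer -storepass 'welcome2'
# 3. start the truststore from the default JDK trust store
cp /usr/java/latest/jre/lib/security/cacerts truststore.jks
# 4. import the self-signed certificate into the truststore ("changeit" is the default password)
keytool -import -noprompt -alias selfsigned -file server.cer -keystore truststore.jks -storepass changeit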
You will need the keystore and truststore files on every Kafka broker, so let's copy them (note: this syntax works on Big Data Appliance, Big Data Cloud Service and Big Data Cloud at Customer):
# dcli -C "mkdir -p /opt/kafka/security"
# dcli -C "chown kafka:kafka /opt/kafka/security"
# dcli -C -f /opt/kafka/security/keystore.jks -d /opt/kafka/security/keystore.jks
# dcli -C -f /opt/kafka/security/truststore.jks -d /opt/kafka/security/truststore.jks
After doing all these steps, you need to make some configuration changes in Cloudera Manager for each node (go to Cloudera Manager -> Kafka -> Configuration). In addition, on each node you have to change the listeners in the "Kafka Broker Advanced Configuration Snippet (Safety Valve) for kafka.properties" setting. Also, make sure that in Cloudera Manager security.inter.broker.protocol is set to SSL. After a node restart, when all brokers are up and running, let's test it:
# openssl s_client -debug -connect kafka1.us2.oraclecloud.com:9093 -tls1_2
...
Certificate chain
0 s:/CN=localhost
   i:/CN=localhost
---
Server certificate
-----BEGIN CERTIFICATE-----
MIICxzCCAa+gAwIBAgIEIGWEezANBgkqhkiG9w0BAQsFADAUMRIwEAYDVQQDEwls
b2NhbGhvc3QwHhcNMTgwNTMwMTI1OTU0WhcNMTgwODI4MTI1OTU0WjAUMRIwEAYD
VQQDEwlsb2NhbGhvc3QwggEiMA0GCSqGSIb3DQEBAQUAA4IBDwAwggEKAoIBAQCI
53T82eoDR2e9IId40UPTj3xg3khl1jdjNvMiuB/vcI7koK0XrZqFzMVo6zBzRHnf
zaFBKPAQisuXpQITURh6jrVgAs1V4hswRPrJRjM/jCIx7S5+1INBGoEXk8OG+OEf
m1uYXfULz0bX9fhfl+IdKzWZ7jiX8FY5dC60Rx2RTpATWThsD4mz3bfNd3DlADw2
LH5B5GAGhLqJjr23HFjiTuoQWQyMV5Esn6WhOTPCy1pAkOYqX86ad9qP500zK9lA
hynyEwNHWt6GoHuJ6Q8A9b6JDyNdkjUIjbH+d0LkzpDPg6R8Vp14igxqxXy0N1Sd
DKhsV90F1T0whlxGDTZTAgMBAAGjITAfMB0GA1UdDgQWBBR1Gl9a0KZAMnJEvxaD
oY0YagPKRTANBgkqhkiG9w0BAQsFAAOCAQEAaiNdHY+QVdvLSILdOlWWv653CrG1
2WY3cnK5Hpymrg0P7E3ea0h3vkGRaVqCRaM4J0MNdGEgu+xcKXb9s7VrwhecRY6E
qN0KibRZPb789zQVOS38Y6icJazTv/lSxCRjqHjNkXhhzsD3tjAgiYnicFd6K4XZ
rQ1WiwYq1254e8MsKCVENthQljnHD38ZDhXleNeHxxWtFIA2FXOc7U6iZEXnnaOM
Cl9sHx7EaGRc2adIoE2GXFNK7BY89Ip61a+WUAOn3asPebrU06OAjGGYGQnYbn6k
4VLvneMOjksuLdlrSyc5MToBGptk8eqJQ5tyWV6+AcuwHkTAnrztgozatg==
-----END CERTIFICATE-----
subject=/CN=localhost
issuer=/CN=localhost
---
No client certificate CA names sent
Server Temp Key: ECDH, secp521r1, 521 bits
---
SSL handshake has read 1267 bytes and written 441 bytes
---
New, TLSv1/SSLv3, Cipher is ECDHE-RSA-AES256-GCM-SHA384
Server public key is 2048 bit
Secure Renegotiation IS supported
Compression: NONE
Expansion: NONE
SSL-Session:
    Protocol  : TLSv1.2
    Cipher    : ECDHE-RSA-AES256-GCM-SHA384
    Session-ID: 5B0EAC6CA8FB4B6EA3D0B4A494A4660351A4BD5824A059802E399308C0B472A4
    Session-ID-ctx:
    Master-Key: 60AE24480E2923023012A464D16B13F954A390094167F54CECA1BDCC8485F1E776D01806A17FB332C51FD310730191FE
    Key-Arg   : None
    Krb5 Principal: None
    PSK identity: None
    PSK identity hint: None
    Start Time: 1527688300
    Timeout   : 7200 (sec)
    Verify return code: 18 (self signed certificate)
Well, it seems our SSL connection is up and running. Time to try putting some messages into the topic:
#  kafka-console-producer  --broker-list kafka1.us2.oraclecloud.com:9093  --topic foobar
...
18/05/30 13:56:28 WARN clients.NetworkClient: Connection to node -1 could not be established. Broker may not be available.
18/05/30 13:56:28 WARN clients.NetworkClient: Connection to node -1 could not be established. Broker may not be available.
The reason for this error is that we don't have a properly configured client. We will need to create and use client.properties and jaas.conf files.
# cat /opt/kafka/security/client.properties
security.protocol=SSL
ssl.truststore.location=/opt/kafka/security/truststore.jks
ssl.truststore.password=changeit
-bash-4.1# cat jaas.conf
KafkaClient {
      com.sun.security.auth.module.Krb5LoginModule required
      useTicketCache=true;
    };
# export KAFKA_OPTS="-Djava.security.auth.login.config=/opt/kafka/security/jaas.conf"
Now you can try to produce messages again:
# kafka-console-producer --broker-list kafka1.us2.oraclecloud.com:9093  --topic foobar --producer.config client.properties
...
Hello SSL world
No errors - good! Let's try to consume the message:
# kafka-console-consumer --bootstrap-server kafka1.us2.oraclecloud.com:9093 --topic foobar --from-beginning  --consumer.config /opt/kafka/security/client.properties
...
Hello SSL world
Bingo! So we set up secure communication between the Kafka cluster and the Kafka client and wrote a message there.
Security implementation. Step 2 - Kerberos
So far we have Kafka up and running on a Kerberized cluster, yet we wrote and read data from the cluster without a Kerberos ticket:
$ klist
klist: No credentials cache found (ticket cache FILE:/tmp/krb5cc_1001)
This is not how it's supposed to work. We assume that if we protect a cluster with Kerberos, it should be impossible to do anything without a ticket. Fortunately, it's relatively easy to configure communication with a Kerberized Kafka cluster. First, make sure that you have enabled Kerberos authentication in Cloudera Manager (Cloudera Manager -> Kafka -> Configuration). Second, go again to Cloudera Manager and change the value of "security.inter.broker.protocol" to SASL_SSL.
Note: Simple Authentication and Security Layer (SASL) is a framework for authentication and data security in Internet protocols. It decouples authentication mechanisms from application protocols, in theory allowing any authentication mechanism supported by SASL to be used in any application protocol that uses SASL. Very roughly - in this blog post you may think of SASL as equal to Kerberos.
After this change, you will need to modify the listeners protocol on each broker (to SASL_SSL) in the "Kafka Broker Advanced Configuration Snippet (Safety Valve) for kafka.properties" setting. Then you are ready to restart the Kafka cluster and write/read data from/to it. Before doing this, you will need to modify the Kafka client credentials:
$ cat /opt/kafka/security/client.properties
security.protocol=SASL_SSL
sasl.kerberos.service.name=kafka
ssl.truststore.location=/opt/kafka/security/truststore.jks
ssl.truststore.password=changeit
After this you can try to read data from the Kafka cluster:
$ kafka-console-consumer --bootstrap-server kafka1.us2.oraclecloud.com:9093 --topic foobar --from-beginning  --consumer.config /opt/kafka/security/client.properties
...
Caused by: org.apache.kafka.common.KafkaException: javax.security.auth.login.LoginException: Could not login: the client is being asked for a password, but the Kafka client code does not currently support obtaining a password from the user. not available to garner  authentication information from the user
...
The error may mislead you, but the real reason is the absence of a Kerberos ticket:
$ klist
klist: No credentials cache found (ticket cache FILE:/tmp/krb5cc_1001)
$ kinit oracle
Password for oracle@BDACLOUDSERVICE.ORACLE.COM:
$ kafka-console-consumer --bootstrap-server kafka1.us2.oraclecloud.com:9093 --topic foobar --from-beginning  --consumer.config /opt/kafka/security/client.properties
...
Hello SSL world
Great, it works! Now we would have to run kinit every time before reading/writing data from the Kafka cluster. Instead of this, for convenience, we can use a keytab. To do this, go to the KDC server and generate a keytab file there:
# kadmin.local
Authenticating as principal hdfs/admin@BDACLOUDSERVICE.ORACLE.COM with password.
kadmin.local: xst -norandkey -k testuser.keytab testuser
Entry for principal oracle with kvno 2, encryption type aes256-cts-hmac-sha1-96 added to keytab WRFILE:oracle.keytab.
Entry for principal oracle with kvno 2, encryption type aes128-cts-hmac-sha1-96 added to keytab WRFILE:oracle.keytab.
Entry for principal oracle with kvno 2, encryption type des3-cbc-sha1 added to keytab WRFILE:oracle.keytab.
Entry for principal oracle with kvno 2, encryption type arcfour-hmac added to keytab WRFILE:oracle.keytab.
Entry for principal oracle with kvno 2, encryption type des-hmac-sha1 added to keytab WRFILE:oracle.keytab.
Entry for principal oracle with kvno 2, encryption type des-cbc-md5 added to keytab WRFILE:oracle.keytab.
kadmin.local:  quit
# ls -l
...
-rw-------  1 root root    436 May 31 14:06 testuser.keytab
...
Now that we have the keytab file, we can copy it to the client machine and use it for Kerberos authentication. Don't forget to change the owner of the keytab file to the user who will run the script:
$ chown opc:opc /opt/kafka/security/testuser.keytab
Also, we will need to modify the jaas.conf file:
$ cat /opt/kafka/security/jaas.conf
KafkaClient {
      com.sun.security.auth.module.Krb5LoginModule required
      useKeyTab=true
      keyTab="/opt/kafka/security/testuser.keytab"
      principal="testuser@BDACLOUDSERVICE.ORACLE.COM";
    };
Now we are fully ready to consume messages from the topic. Even though we have oracle as the Kerberos principal on the OS, we connect to the cluster as testuser (according to jaas.conf):
$ kafka-console-consumer --bootstrap-server kafka1.us2.oraclecloud.com:9093 --topic foobar --from-beginning  --consumer.config /opt/kafka/security/client.properties
...
18/05/31 15:04:45 INFO authenticator.AbstractLogin: Successfully logged in.
18/05/31 15:04:45 INFO kerberos.KerberosLogin: [Principal=testuser@BDACLOUDSERVICE.ORACLE.COM]: TGT refresh thread started.
...
Hello SSL world Security Implementation Step 3 - Sentry In the previous step we configured authentication, which answers the question "who am I?". Now it is time to set up an authorization mechanism, which answers the question "what am I allowed to do?". Sentry has become a very popular engine in the Hadoop world and we will use it for Kafka's authorization. As I posted earlier, Sentry's philosophy is that users belong to groups, groups have their own roles, and roles have permissions. We will need to follow this with Kafka as well. But we will start with some service configuration first (Cloudera Manager -> Kafka -> Configuration). Also, it's very important to add the kafka user to "sentry.service.admin.group" in the Sentry config (Cloudera Manager -> Sentry -> Config). Well, once we know who connects to the cluster, we may restrict him or her from reading some particular topics (in other words, perform some authorization). Note: to perform administrative operations with Sentry, you have to work as the kafka user. $ id uid=1001(opc) gid=1005(opc) groups=1005(opc) $ sudo find /var -name kafka*keytab -printf "%T+\t%p\n" | sort|tail -1|cut -f 2 /var/run/cloudera-scm-agent/process/1171-kafka-KAFKA_BROKER/kafka.keytab $ sudo cp /var/run/cloudera-scm-agent/process/1171-kafka-KAFKA_BROKER/kafka.keytab /opt/kafka/security/kafka.keytab $ sudo chown opc:opc /opt/kafka/security/kafka.keytab Obtain a Kafka ticket: $ kinit -kt /opt/kafka/security/kafka.keytab kafka/`hostname` $ klist Ticket cache: FILE:/tmp/krb5cc_1001 Default principal: kafka/kafka1.us2.oraclecloud.com@BDACLOUDSERVICE.ORACLE.COM   Valid starting     Expires            Service principal 05/31/18 15:52:28  06/01/18 15:52:28  krbtgt/BDACLOUDSERVICE.ORACLE.COM@BDACLOUDSERVICE.ORACLE.COM     renew until 06/05/18 15:52:28 Before configuring and testing Sentry with Kafka, we will need to create an unprivileged user to whom we will give grants (the kafka user is privileged and bypasses Sentry).
There are a few simple steps. First, create a test user (unprivileged) on each Hadoop node (this syntax will work on Big Data Appliance, Big Data Cloud Service and Big Data Cloud at Customer): # dcli -C "useradd testsentry -u 1011" We should remember that Sentry relies heavily on groups, so we have to create a group and put the "testsentry" user in it: # dcli -C "groupadd testsentry_grp -g 1017" After the group has been created, we put the user in it: # dcli -C "usermod -g testsentry_grp testsentry" Check that everything is as we expect: # dcli -C "id testsentry" 10.196.64.44: uid=1011(testsentry) gid=1017(testsentry_grp) groups=1017(testsentry_grp) 10.196.64.60: uid=1011(testsentry) gid=1017(testsentry_grp) groups=1017(testsentry_grp) 10.196.64.64: uid=1011(testsentry) gid=1017(testsentry_grp) groups=1017(testsentry_grp) 10.196.64.65: uid=1011(testsentry) gid=1017(testsentry_grp) groups=1017(testsentry_grp) 10.196.64.61: uid=1011(testsentry) gid=1017(testsentry_grp) groups=1017(testsentry_grp) Note: you have to have the same user ID and group ID on each machine. Now verify that Hadoop can look up the group: # hdfs groups testsentry testsentry : testsentry_grp All of these steps have to be performed as root. Next, you should create a testsentry principal in the KDC (it's not mandatory, but it is better organized and easier to understand). Go to the KDC host and run the following commands: # kadmin.local  Authenticating as principal root/admin@BDACLOUDSERVICE.ORACLE.COM with password.  kadmin.local:  addprinc testsentry WARNING: no policy specified for testsentry@BDACLOUDSERVICE.ORACLE.COM; defaulting to no policy Enter password for principal "testsentry@BDACLOUDSERVICE.ORACLE.COM":  Re-enter password for principal "testsentry@BDACLOUDSERVICE.ORACLE.COM":  Principal "testsentry@BDACLOUDSERVICE.ORACLE.COM" created. kadmin.local:  xst -norandkey -k testsentry.keytab testsentry Entry for principal testsentry with kvno 1, encryption type aes256-cts-hmac-sha1-96 added to keytab WRFILE:testsentry.keytab. Entry for principal testsentry with kvno 1, encryption type aes128-cts-hmac-sha1-96 added to keytab WRFILE:testsentry.keytab. Entry for principal testsentry with kvno 1, encryption type des3-cbc-sha1 added to keytab WRFILE:testsentry.keytab. Entry for principal testsentry with kvno 1, encryption type arcfour-hmac added to keytab WRFILE:testsentry.keytab. Entry for principal testsentry with kvno 1, encryption type des-hmac-sha1 added to keytab WRFILE:testsentry.keytab. Entry for principal testsentry with kvno 1, encryption type des-cbc-md5 added to keytab WRFILE:testsentry.keytab. Now we have everything set up for the unprivileged user. Time to start configuring Sentry policies. Since kafka is a superuser, we may run admin commands as the kafka user; for managing Sentry settings we will need to use the kafka user. To obtain the kafka credentials we need to run: $ kinit -kt /opt/kafka/security/kafka.keytab kafka/`hostname` $ klist  Ticket cache: FILE:/tmp/krb5cc_1001 Default principal: kafka/kafka1.us2.oraclecloud.com@BDACLOUDSERVICE.ORACLE.COM   Valid starting     Expires            Service principal 06/15/18 01:37:53  06/16/18 01:37:53  krbtgt/BDACLOUDSERVICE.ORACLE.COM@BDACLOUDSERVICE.ORACLE.COM     renew until 06/20/18 01:37:53 First we need to create a role. Let's call it testsentry_role: $ kafka-sentry -cr -r testsentry_role Let's check that the role has been created: $ kafka-sentry -lr ...
admin_role testsentry_role [opc@cfclbv3872 ~]$ As soon as the role is created, we will need to give this role some permissions for a certain topic: $ kafka-sentry -gpr -r testsentry_role -p "Host=*->Topic=testTopic->action=write" and also describe: $ kafka-sentry -gpr -r testsentry_role -p "Host=*->Topic=testTopic->action=describe" (the consumer also needs read on the topic, granted the same way: $ kafka-sentry -gpr -r testsentry_role -p "Host=*->Topic=testTopic->action=read" - which is why read shows up in the permission listing below). Next, we have to allow a consumer group to read and describe from this topic: $ kafka-sentry -gpr -r testsentry_role -p "Host=*->Consumergroup=testconsumergroup->action=read" $ kafka-sentry -gpr -r testsentry_role -p "Host=*->Consumergroup=testconsumergroup->action=describe" The next step is linking the role and the group: we need to assign testsentry_role to testsentry_grp (the group automatically inherits all of the role's permissions): $ kafka-sentry -arg -r testsentry_role -g testsentry_grp After this, let's check that our mapping worked fine: $ kafka-sentry -lr -g testsentry_grp ... testsentry_role Now let's review the list of permissions that our role has: $ kafka-sentry -r testsentry_role -lp ... HOST=*->CONSUMERGROUP=testconsumergroup->action=read HOST=*->TOPIC=testTopic->action=write HOST=*->TOPIC=testTopic->action=describe HOST=*->TOPIC=testTopic->action=read It's also very important to have the consumer group in the client properties file: $ cat /opt/kafka/security/client.properties security.protocol=SASL_SSL sasl.kerberos.service.name=kafka ssl.truststore.location=/opt/kafka/security/truststore.jks ssl.truststore.password=changeit group.id=testconsumergroup After all is set, we will need to switch to the testsentry user for testing: $ kinit -kt /opt/kafka/security/testsentry.keytab testsentry $ klist  Ticket cache: FILE:/tmp/krb5cc_1001 Default principal: testsentry@BDACLOUDSERVICE.ORACLE.COM   Valid starting     Expires            Service principal 06/15/18 01:38:49  06/16/18 01:38:49  krbtgt/BDACLOUDSERVICE.ORACLE.COM@BDACLOUDSERVICE.ORACLE.COM     renew until 06/22/18 01:38:49 Test writes: $ kafka-console-producer --broker-list kafka1.us2.oraclecloud.com:9093 --topic testTopic --producer.config /opt/kafka/security/client.properties ... > testmessage1 > testmessage2 > Everything seems OK; now let's test a read: $ kafka-console-consumer --bootstrap-server kafka1.us2.oraclecloud.com:9093 --topic testTopic --from-beginning  --consumer.config /opt/kafka/security/client.properties ... testmessage1 testmessage2 Now, to show Sentry in action, I'll try to read messages from another topic, which is outside of the topics allowed for our test group: $ kafka-console-consumer --from-beginning --bootstrap-server kafka1.us2.oraclecloud.com:9093 --topic foobar --consumer.config /opt/kafka/security/client.properties ... 18/06/15 02:54:54 INFO internals.AbstractCoordinator: (Re-)joining group testconsumergroup 18/06/15 02:54:54 WARN clients.NetworkClient: Error while fetching metadata with correlation id 13 : {foobar=UNKNOWN_TOPIC_OR_PARTITION} 18/06/15 02:54:54 WARN clients.NetworkClient: Error while fetching metadata with correlation id 15 : {foobar=UNKNOWN_TOPIC_OR_PARTITION} 18/06/15 02:54:54 WARN clients.NetworkClient: Error while fetching metadata with correlation id 16 : {foobar=UNKNOWN_TOPIC_OR_PARTITION} 18/06/15 02:54:54 WARN clients.NetworkClient: Error while fetching metadata with correlation id 17 : {foobar=UNKNOWN_TOPIC_OR_PARTITION} So, as we can see, we cannot read from a topic that we are not authorized to read.
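For completeness, if we did want our test group to be able to read foobar as well, the fix would be more grants of the same shape used above, run as the kafka superuser. This is only a sketch reusing the existing role; the topic name foobar is simply the one from the failed read above:
$ kafka-sentry -gpr -r testsentry_role -p "Host=*->Topic=foobar->action=read"
$ kafka-sentry -gpr -r testsentry_role -p "Host=*->Topic=foobar->action=describe"
After that, the same kafka-console-consumer command run as testsentry should start returning messages from foobar too.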
To systemize all of this, I'd like to put the user-group-role-privileges flow in one picture. I'd also like to summarize the steps required for getting the list of privileges for a certain user (testsentry in my example): // Run as the superuser - kafka $ kinit -kt /opt/kafka/security/kafka.keytab kafka/`hostname` $ klist  Ticket cache: FILE:/tmp/krb5cc_1001 Default principal: kafka/cfclbv3872.us2.oraclecloud.com@BDACLOUDSERVICE.ORACLE.COM   Valid starting     Expires            Service principal 06/19/18 02:38:26  06/20/18 02:38:26  krbtgt/BDACLOUDSERVICE.ORACLE.COM@BDACLOUDSERVICE.ORACLE.COM     renew until 06/24/18 02:38:26 // Get the list of groups to which a certain user belongs $ hdfs groups testsentry testsentry : testsentry_grp // Get the list of roles for a certain group $ kafka-sentry -lr -g testsentry_grp ...   testsentry_role // Get the list of permissions for a certain role $ kafka-sentry -r testsentry_role -lp ... HOST=*->CONSUMERGROUP=testconsumergroup->action=read HOST=*->TOPIC=testTopic->action=describe HOST=*->TOPIC=testTopic->action=write HOST=*->TOPIC=testTopic->action=read HOST=*->CONSUMERGROUP=testconsumergroup->action=describe Based on what we saw above, our user testsentry can read and write to the topic testTopic. To read data, he has to belong to the consumer group "testconsumergroup". Security Implementation Step 4 - Encryption At Rest The last part of the security journey is encryption of the data that you store on disk. There are multiple ways to do this; one of the most common is Navigator Encrypt.


Big Data

Big Data SQL 3.2.1 is Now Available

Just wanted to give a quick update. I am pleased to announce that Oracle Big Data SQL 3.2.1 is now available. This release provides support for Oracle Database 12.2.0.1. Here are some key details: Existing customers using Big Data SQL 3.2 do not need to take this update; Oracle Database 12.2.0.1 support is the reason for the update. Big Data SQL 3.2.1 can be used for both Oracle Database 12.1.0.2 and 12.2.0.1 deployments. For Oracle Database 12.2.0.1, Big Data SQL 3.2.1 requires the April Release Update plus the Big Data SQL 3.2.1 one-off patch. The software is available on ARU; the Big Data SQL 3.2.1 installer will be available on edelivery soon: Big Data SQL 3.2.1 Installer (Patch 28071671). Note, this is the complete installer; it is not a patch. Oracle Database 12.2.0.1 April Release Update (Patch 27674384). Ensure your Grid Infrastructure is also on the 12.2.0.1 April Release Update (if you are using GI). Big Data SQL 3.2.1 one-off on top of the April RU (Patch 26170659). Ensure you pick the appropriate release on the download page. This patch must be applied to each database server and to Grid Infrastructure. Also, check out the new Big Data SQL Tutorial series on the Oracle Learning Library. The series includes numerous videos that help you understand Big Data SQL capabilities. It includes: Introducing the Oracle Big Data Lite Virtual Machine and Hadoop Introduction to Oracle Big Data SQL Hadoop and Big Data SQL Architectures Oracle Big Data SQL Performance Features Information Lifecycle Management


Event Hub Cloud Service. Hello world

A while ago, I wrote a blog about the Oracle Reference Architecture and the concepts of Schema on Read and Schema on Write. Schema on Read is well suited for a Data Lake, which may ingest any data as it is, without any transformation, and preserve it for a long period of time. At the same time you have two types of data - streaming data and batch data. Batch could be log files or RDBMS archives. Streaming data could be IoT, sensors, or Golden Gate replication logs. Apache Kafka is a very popular engine for acquiring streaming data. It has multiple advantages, like scalability, fault tolerance and high throughput. Unfortunately, Kafka is hard to manage. Fortunately, the Cloud simplifies many routine operations. Oracle has three options for deploying Kafka in the Cloud: 1) Use Big Data Cloud Service, where you get a full Cloudera cluster and can deploy Apache Kafka as part of CDH. 2) Event Hub Cloud Service Dedicated. Here you have to specify server shapes and some other parameters, but the rest is done by the Cloud automagically. 3) Event Hub Cloud Service. This service is fully managed by Oracle; you don't even need to specify any compute shapes. The only things to do are to say how long you need to store data in the topic and how many partitions you need (partitions = performance). Today, I'm going to tell you about the last option, which is the fully managed cloud service. It's really easy to provision: just log in to your Cloud account and choose the "Event Hub" Cloud service. After this, go and open the service console. Next, click on "Create service". Put in some parameters - the two key ones are Retention period and Number of partitions. The first defines how long you will store messages, the second defines performance for read and write operations. Click Next, confirm and wait a while (usually not more than a few minutes). After a short while, you will be able to see the provisioned service. Hello world flow. Today I want to show the "Hello world" flow: how to produce (write) and consume (read) a message with Event Hub Cloud Service. The flow is (step by step): 1) Obtain an OAuth token 2) Produce a message to a topic 3) Create a consumer group 4) Subscribe to the topic 5) Consume the message Now I'm going to show it in some detail. OAuth and Authentication token (Step 1) To deal with Event Hub Cloud Service you have to be familiar with the concepts of OAuth and OpenID. If you are not familiar, you could watch the short video or go through this step by step tutorial. In a couple of words, OAuth token authorization (it tells what I can access) is a method to restrict access to some resources. One of the main ideas is to decouple the User (a real human - the Resource Owner) and the Application (the Client). The real person knows the login and password, but the Client (Application) does not use them every time it needs to reach the Resource Server (which has some info or content). Instead, the Application obtains an authorization token once and uses it for working with the Resource Server. This is brief; here you may find a more detailed explanation of what OAuth is. Obtain a token for the Event Hub Cloud Service client. As you can understand, to get access to the Resource Server (read: Event Hub messages) you need to obtain an authorization token from the Authorization Server (read: IDCS). Here, I'd like to show the step by step flow for obtaining this token.
I will start from the end and show the command (REST call) which you have to run to get the token: #!/bin/bash curl -k -X POST -u "$CLIENT_ID:$CLIENT_SECRET" \ -d "grant_type=password&username=$THEUSERNAME&password=$THEPASSWORD&scope=$THESCOPE" \ "$IDCS_URL/oauth2/v1/token" \ -o access_token.json As you can see, there are many parameters required to obtain the OAuth token. Let's take a look at where you can get them. Go to the service and click on the topic which you want to work with; there you will find the IDCS Application, click on it. After clicking on it, you will be redirected to the IDCS Application page. Most of the credentials can be found here. Click on Configuration: on this page you will right away find the Client ID and Client Secret (think of them like a login and password). Look down and find the section called Resources: click on it and you will find another two variables which you need for the OAuth token - Scope and Primary Audience. One more required parameter - IDCS_URL - you may find in your browser. You now have almost everything you need, except the login and password. This means your Oracle Cloud login and password (what you use when logging in to http://myservices.us.oraclecloud.com). Now you have all the required credentials and you are ready to write a script which will automate all of this: #!/bin/bash export CLIENT_ID=7EA06D3A99D944A5ADCE6C64CCF5C2AC_APPID export CLIENT_SECRET=0380f967-98d4-45e9-8f9a-45100f4638b2 export THEUSERNAME=john.dunbar export THEPASSWORD=MyPassword export SCOPE=/idcs-1d6cc7dae45b40a1b9ef42c7608b9afe-oehtest export PRIMARY_AUDIENCE=https://7EA06D3A99D944A5ADCE6C64CCF5C2AC.uscom-central-1.oraclecloud.com:443 export THESCOPE=$PRIMARY_AUDIENCE$SCOPE export IDCS_URL=https://idcs-1d6cc7dae45b40a1b9ef42c7608b9afe.identity.oraclecloud.com curl -k -X POST -u "$CLIENT_ID:$CLIENT_SECRET" \ -d "grant_type=password&username=$THEUSERNAME&password=$THEPASSWORD&scope=$THESCOPE" \ "$IDCS_URL/oauth2/v1/token" \ -o access_token.json After running this script, you will have a new file called access_token.json.
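A small security note, as a hedged aside: the script above hard-codes the client secret and the cloud password. One simple alternative (just a sketch, with a hypothetical file name ~/.oehcs_env) is to keep those exports in a separate file that only you can read and source it from the script:
# hypothetical credentials file; the variable names match the script above
$ cat ~/.oehcs_env
export CLIENT_ID=...
export CLIENT_SECRET=...
export THEUSERNAME=...
export THEPASSWORD=...
$ chmod 600 ~/.oehcs_env
# then, in the token script, replace the four export lines for the secrets with:
. ~/.oehcs_env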
The access_token field is what you need: $ cat access_token.json {"access_token":"eyJ4NXQjUzI1NiI6InVUMy1YczRNZVZUZFhGbXFQX19GMFJsYmtoQjdCbXJBc3FtV2V4U2NQM3MiLCJ4NXQiOiJhQ25HQUpFSFdZdU9tQWhUMWR1dmFBVmpmd0UiLCJraWQiOiJTSUdOSU5HX0tFWSIsImFsZyI6IlJTMjU2In0.eyJ1c2VyX3R6IjoiQW1lcmljYVwvQ2hpY2FnbyIsInN1YiI6ImpvaG4uZHVuYmFyIiwidXNlcl9sb2NhbGUiOiJlbiIsInVzZXJfZGlzcGxheW5hbWUiOiJKb2huIER1bmJhciIsInVzZXIudGVuYW50Lm5hbWUiOiJpZGNzLTFkNmNjN2RhZTQ1YjQwYTFiOWVmNDJjNzYwOGI5YWZlIiwic3ViX21hcHBpbmdhdHRyIjoidXNlck5hbWUiLCJpc3MiOiJodHRwczpcL1wvaWRlbnRpdHkub3JhY2xlY2xvdWQuY29tXC8iLCJ0b2tfdHlwZSI6IkFUIiwidXNlcl90ZW5hbnRuYW1lIjoiaWRjcy0xZDZjYzdkYWU0NWI0MGExYjllZjQyYzc2MDhiOWFmZSIsImNsaWVudF9pZCI6IjdFQTA2RDNBOTlEOTQ0QTVBRENFNkM2NENDRjVDMkFDX0FQUElEIiwiYXVkIjpbInVybjpvcGM6bGJhYXM6bG9naWNhbGd1aWQ9N0VBMDZEM0E5OUQ5NDRBNUFEQ0U2QzY0Q0NGNUMyQUMiLCJodHRwczpcL1wvN0VBMDZEM0E5OUQ5NDRBNUFEQ0U2QzY0Q0NGNUMyQUMudXNjb20tY2VudHJhbC0xLm9yYWNsZWNsb3VkLmNvbTo0NDMiXSwidXNlcl9pZCI6IjM1Yzk2YWUyNTZjOTRhNTQ5ZWU0NWUyMDJjZThlY2IxIiwic3ViX3R5cGUiOiJ1c2VyIiwic2NvcGUiOiJcL2lkY3MtMWQ2Y2M3ZGFlNDViNDBhMWI5ZWY0MmM3NjA4YjlhZmUtb2VodGVzdCIsImNsaWVudF90ZW5hbnRuYW1lIjoiaWRjcy0xZDZjYzdkYWU0NWI0MGExYjllZjQyYzc2MDhiOWFmZSIsInVzZXJfbGFuZyI6ImVuIiwiZXhwIjoxNTI3Mjk5NjUyLCJpYXQiOjE1MjY2OTQ4NTIsImNsaWVudF9ndWlkIjoiZGVjN2E4ZGRhM2I4NDA1MDgzMjE4NWQ1MzZkNDdjYTAiLCJjbGllbnRfbmFtZSI6Ik9FSENTX29laHRlc3QiLCJ0ZW5hbnQiOiJpZGNzLTFkNmNjN2RhZTQ1YjQwYTFiOWVmNDJjNzYwOGI5YWZlIiwianRpIjoiMDkwYWI4ZGYtNjA0NC00OWRlLWFjMTEtOGE5ODIzYTEyNjI5In0.aNDRIM5Gv_fx8EZ54u4AXVNG9B_F8MuyXjQR-vdyHDyRFxTefwlR3gRsnpf0GwHPSJfZb56wEwOVLraRXz1vPHc7Gzk97tdYZ-Mrv7NjoLoxqQj-uGxwAvU3m8_T3ilHthvQ4t9tXPB5o7xPII-BoWa-CF4QC8480ThrBwbl1emTDtEpR9-4z4mm1Ps-rJ9L3BItGXWzNZ6PiNdVbuxCQaboWMQXJM9bSgTmWbAYURwqoyeD9gMw2JkwgNMSmljRnJ_yGRv5KAsaRguqyV-x-lyE9PyW9SiG4rM47t-lY-okMxzchDm8nco84J5XlpKp98kMcg65Ql5Y3TVYGNhTEg","token_type":"Bearer","expires_in":604800} Create a Linux variable for it: #!/bin/bash export TOKEN=`cat access_token.json |jq .access_token|sed 's/\"//g'` Well, now we have an authorization token and may work with our Resource Server (Event Hub Cloud Service). Note: you may also check the documentation about how to obtain an OAuth token. Produce Messages (Write data) to Kafka (Step 2) The first thing that we may want to do is produce messages (write data to a Kafka cluster). To make scripting easier, it's also better to use some environment variables for common resources. For this example, I'd recommend parametrizing the topic's endpoint, topic name, type of content to be accepted and content type. The content type is completely up to the developer, but you have to consume (read) the same format as you produce (write). The key parameter to define is the REST endpoint.
Go to PSM, click on the topic name and copy everything up to "restproxy": Also, you will need the topic name, which you can take from the same window. Now we can write a simple script to produce one message to Kafka: #!/bin/bash export OEHCS_ENDPOINT=https://oehtest-gse00014957.uscom-central-1.oraclecloud.com:443/restproxy export TOPIC_NAME=idcs-1d6cc7dae45b40a1b9ef42c7608b9afe-oehtest export CONTENT_TYPE=application/vnd.kafka.json.v2+json curl -X POST \ -H "Authorization: Bearer $TOKEN" \ -H "Content-Type: $CONTENT_TYPE" \ --data '{"records":[{"value":{"foo":"bar"}}]}' \ $OEHCS_ENDPOINT/topics/$TOPIC_NAME If everything is fine, the Linux console will return something like: {"offsets":[{"partition":1,"offset":8,"error_code":null,"error":null}],"key_schema_id":null,"value_schema_id":null} Create Consumer Group (Step 3) The first step to read data from OEHCS is to create a consumer group. We will reuse environment variables from the previous step, but just in case I'll include them in this script: #!/bin/bash export OEHCS_ENDPOINT=https://oehtest-gse00014957.uscom-central-1.oraclecloud.com:443/restproxy export CONTENT_TYPE=application/vnd.kafka.json.v2+json export TOPIC_NAME=idcs-1d6cc7dae45b40a1b9ef42c7608b9afe-oehtest curl -X POST \ -H "Authorization: Bearer $TOKEN" \ -H "Content-Type: $CONTENT_TYPE" \ --data '{"format": "json", "auto.offset.reset": "earliest"}' \ $OEHCS_ENDPOINT/consumers/oehcs-consumer-group \ -o consumer_group.json This script will generate an output file which contains the variables that we will need to consume messages. Subscribe to a topic (Step 4) Now you are ready to subscribe to this topic (export the environment variables if you didn't do this before): #!/bin/bash export BASE_URI=`cat consumer_group.json |jq .base_uri|sed 's/\"//g'` export TOPIC_NAME=idcs-1d6cc7dae45b40a1b9ef42c7608b9afe-oehtest curl -X POST \ -H "Authorization: Bearer $TOKEN" \ -H "Content-Type: $CONTENT_TYPE" \ -d "{\"topics\": [\"$TOPIC_NAME\"]}" \ $BASE_URI/subscription If everything is fine, this request will not return anything. Consume (Read) messages (Step 5) Finally, we approach the last step - consuming messages. And again, it's quite a simple curl request: #!/bin/bash export BASE_URI=`cat consumer_group.json |jq .base_uri|sed 's/\"//g'` export H_ACCEPT=application/vnd.kafka.json.v2+json curl -X GET \ -H "Authorization: Bearer $TOKEN" \ -H "Accept: $H_ACCEPT" \ $BASE_URI/records If everything works like it is supposed to, you will have output like: [{"topic":"idcs-1d6cc7dae45b40a1b9ef42c7608b9afe-oehtest","key":null,"value":{"foo":"bar"},"partition":1,"offset":17}] Conclusion Today we saw how easy it is to create a fully managed Kafka topic in Event Hub Cloud Service, and we also made first steps with it - writing and reading a message. Kafka is a really popular message bus engine, but it's hard to manage. The Cloud simplifies this and allows customers to concentrate on the development of their applications. Here I also want to give some useful links: 1) If you are not familiar with REST APIs, I'd recommend you go through this blog 2) There is an online tool which helps to validate your curl requests 3) Here you can find some useful examples of producing and consuming messages 4) If you are not familiar with OAuth, here is a nice tutorial which shows an end to end example
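One housekeeping step not covered above, as a hedged aside: the consumer instance created in Step 3 keeps state on the REST proxy side. Assuming the service follows the standard Kafka REST Proxy v2 semantics (which the content types above suggest, though I have not verified this against the OEHCS documentation), you should be able to remove the instance when you are done with a DELETE request against the same base_uri:
#!/bin/bash
# hedged sketch: delete the consumer instance created in Step 3
export BASE_URI=`cat consumer_group.json |jq .base_uri|sed 's/\"//g'`
curl -X DELETE \
-H "Authorization: Bearer $TOKEN" \
$BASE_URI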


Data Warehousing

Autonomous Data Warehouse is LIVE!

That’s right: Autonomous Data Warehouse Cloud is LIVE and available in the Oracle Cloud. ADWC Launch Event at Oracle Conference Center We had a major launch event on Thursday last week at the Oracle Conference Center in Redwood Shores which got a huge amount of press coverage. Larry Ellison delivered the main keynote covering how our next-generation cloud service is built on the self-driving Oracle Autonomous Database technology which leverages machine learning to deliver unprecedented performance, reliability and ease of deployment for data warehouses. As an autonomous cloud service, it eliminates error-prone manual management tasks and, most importantly for a lot of readers of this blog, frees up DBA resources, which can now be applied to implementing more strategic business projects. The key highlights of our Oracle Autonomous Data Warehouse Cloud include: Ease of Use: Unlike traditional cloud services with complex, manual configurations that require a database expert to specify data distribution keys and sort keys, build indexes, reorganize data or adjust compression, Oracle Autonomous Data Warehouse Cloud is a simple "load and go" service. Users specify tables, load data and then run their workloads in a matter of seconds - no manual tuning is needed. Industry-Leading Performance: Unlike traditional cloud services, which use generic compute shapes for database cloud services, Oracle Autonomous Data Warehouse Cloud is built on the high-performance Oracle Exadata platform. Performance is further enhanced by fully-integrated machine learning algorithms which drive automatic caching, adaptive indexing and advanced compression. Instant Elasticity: Oracle Autonomous Data Warehouse Cloud allocates new data warehouses of any size in seconds and scales compute and storage resources independently of one another with no downtime. Elasticity enables customers to pay for exactly the resources that the database workloads require as they grow and shrink. To highlight these three unique aspects of Autonomous Data Warehouse Cloud the launch included a live, on-stage demo of ADWC and Oracle Analytics Cloud. If you have never seen a new data warehouse delivered in seconds rather than days then pay careful attention to the demo video below where George Lumpkin creates a new fully autonomous data warehouse with a few mouse clicks and then starts to query one of the sample schemas, shipped with ADWC, using OAC. Probably the most important section was the panel discussion with a handful of our early adopter customers which was hosted by Steve Daheb, Senior Vice President, Oracle Cloud. As always, it’s great to hear customers talk about how the simplicity and speed of ADWC are bringing about significant changes to the way our customers think about their data. If you missed all the excitement, the keynote, demos and discussions, then here is some great news… we recorded everything for you so you can watch it from the comfort of your desk. Below are the links to the three main parts of the launch: Video: Larry Ellison, CTO and Executive Chairman, Oracle, introduces Oracle Autonomous Database Cloud. Oracle Autonomous Database Cloud eliminates complexity and human error, helping to ensure higher reliability, security, and efficiency at the lowest cost.
Video: Steve Daheb, Senior Vice President, Oracle Cloud, discusses the benefits of Oracle Autonomous Cloud Platform with Oracle customers: - Paul Daugherty, Accenture - Benjamin Arnulf, Hertz - Michael Morales, QMP Health - Al Cordoba, QLX   Video: George Lumpkin, Vice President of Product Management, Oracle, demonstrates the self-driving, self-securing, and self-repairing capabilities of Oracle Autonomous Data Warehouse Cloud.   So what's next? So you are all fired up and you want to learn more about Autonomous Data Warehouse Cloud! Where do you go? The first place to visit is the ADWC home page on cloud.oracle.com: https://cloud.oracle.com/datawarehouse Can I Try It? Yes you can! We have a great program that lets you get started with Oracle Cloud for free with $300 in free credits. Using your credits (which will probably last you around 30 days depending on how you configure your ADWC) you will be able to get valuable hands-on time to try loading some of your own workloads and testing integration with our other cloud services such as Analytics Cloud and Data Integration Cloud. Are there any tutorials to help me get started? Yes there are! We have quick start tutorials covering both Autonomous Data Warehouse Cloud and our bundled SQL notebook application called Oracle Machine Learning, just click here: Provisioning Autonomous Data Warehouse Cloud Connecting SQL Developer and Creating Tables Loading Your Data Running a Query on Sample Data Creating Projects and Workspaces in OML Creating and Running Notebooks Collaborating in OML Creating a SQL Script Running SQL Statements Is the documentation available? Yes it is! The documentation set for ADWC is right here and the documentation set for Oracle Machine Learning is right here. Anything else I need to know? Yes there is! Over the next few weeks I will be posting links to more videos where our ADWC customers will talk about their experiences of using ADWC during the last couple of months. There will be information about some deep-dive online tutorials that you can use as part of your free $300 trial along with lots of other topics that are too numerous to list. If you have a burning question about Oracle Autonomous Data Warehouse Cloud then feel free to reach out to me via email: keith.laker@oracle.com


Object Store Service operations. Part 1 - Loading data

One of the most common and clear trends in the IT market is Cloud, and one of the most common and clear trends in the Cloud is Object Store. Some introductory information can be found here. Many Cloud providers, including Oracle, assume that the data lifecycle starts from the Object Store: you land data there and then either read it or load it with different services, such as ADWC or BDCS, for example. Oracle has two flavors of Object Store Services (OSS): OSS on OCI (Oracle Cloud Infrastructure) and OSS on OCI-C (Oracle Cloud Infrastructure Classic). In this post, I'm going to focus on OSS on OCI-C, mostly because OSS on OCI was perfectly explained by Hermann Baer here and by Rachna Thusoo here. Upload/Download files. As in Hermann's blog, I'll focus on the most frequent operations: upload and download. There are multiple ways to do so, for example: - Oracle Cloud WebUI - REST API - FTM CLI tool - Third-party tools such as CloudBerry - Big Data Manager (via ODCP) - Hadoop client with Swift API - Oracle Storage Software Appliance Let's start with the easiest one - the web interface. Upload/Download files. WebUI. First, you have to log in to cloud services. Then, you have to go to the Object Store Service. After this, drill down into the Service Console and you will be able to see the list of containers within your OSS. To create a new container (bucket in OCI terminology), simply click on "Create Container" and give it a name. After it has been created, click on it and go to the "Upload object" button. Click and click again and here we are, the file is in the container. Let's try to upload a bigger file... oops, we got an error. So, it seems we have a 5GB limitation. Fortunately, there is "Large object upload", which allows us to upload files bigger than 5GB. And what about downloading? It's easy: simply click download and land the file on the local file system. Upload/Download files. REST API. The WebUI may be a good way to upload data when a human operates it, but it's not too convenient for scripting. If you want to automate your file uploads, you may use the REST API. You may find all the details regarding the REST API here; alternatively, you may use the script I'm publishing below, which can hint at some basic commands: #!/bin/bash shopt -s expand_aliases alias echo="echo -e" USER="alexey.filanovskiy@oracle.com" PASS="MySecurePassword" OSS_USER="storage-a424392:${USER}" OSS_PASS="${PASS}" OSS_URL="https://storage-a424392.storage.oraclecloud.com/auth/v1.0" echo "curl -k -sS -H \"X-Storage-User: ${OSS_USER}\" -H \"X-Storage-Pass:${OSS_PASS}\" -i \"${OSS_URL}\"" out=`curl -k -sS -H "X-Storage-User: ${OSS_USER}" -H "X-Storage-Pass:${OSS_PASS}" -i "${OSS_URL}"` while [ $?
-ne 0 ]; do echo "Retrying to get token\n" sleep 1; out=`curl -k -sS -H "X-Storage-User: ${OSS_USER}" -H "X-Storage-Pass:${OSS_PASS}" -i "${OSS_URL}"` done AUTH_TOKEN=`echo "${out}" | grep "X-Auth-Token" | sed 's/X-Auth-Token: //;s/\r//'` STORAGE_TOKEN=`echo "${out}" | grep "X-Storage-Token" | sed 's/X-Storage-Token: //;s/\r//'` STORAGE_URL=`echo "${out}" | grep "X-Storage-Url" | sed 's/X-Storage-Url: //;s/\r//'` echo "Token and storage URL:" echo "\tOSS url: ${OSS_URL}" echo "\tauth token: ${AUTH_TOKEN}" echo "\tstorage token: ${STORAGE_TOKEN}" echo "\tstorage url: ${STORAGE_URL}" echo "\nContainers:" for CONTAINER in `curl -k -sS -u "${USER}:${PASS}" "${STORAGE_URL}"`; do echo "\t${CONTAINER}" done FILE_SIZE=$((1024*1024*1)) CONTAINER="example_container" FILE="file.txt" LOCAL_FILE="./${FILE}" FILE_AT_DIR="/path/file.txt" LOCAL_FILE_AT_DIR=".${FILE_AT_DIR}" REMOTE_FILE="${CONTAINER}/${FILE}" REMOTE_FILE_AT_DIR="${CONTAINER}${FILE_AT_DIR}" for f in "${LOCAL_FILE}" "${LOCAL_FILE_AT_DIR}"; do if [ ! -e "${f}" ]; then echo "\nInfo: File "${f}" does not exist. Creating ${f}" d=`dirname "${f}"` mkdir -p "${d}"; tr -dc A-Za-z0-9 </dev/urandom | head -c "${FILE_SIZE}" > "${f}" #dd if="/dev/random" of="${f}" bs=1 count=0 seek=${FILE_SIZE} &> /dev/null fi; done; echo "\nActions:" echo "\tListing containers:\t\t\t\tcurl -k -vX GET -u \"${USER}:${PASS}\" \"${STORAGE_URL}/\"" echo "\tCreate container \"oss://${CONTAINER}\":\t\tcurl -k -vX PUT -u \"${USER}:${PASS}\" \"${STORAGE_URL}/${CONTAINER}\"" echo "\tListing objects at container \"oss://${CONTAINER}\":\tcurl -k -vX GET -u \"${USER}:${PASS}\" \"${STORAGE_URL}/${CONTAINER}/\"" echo "\n\tUpload \"${LOCAL_FILE}\" to \"oss://${REMOTE_FILE}\":\tcurl -k -vX PUT -T \"${LOCAL_FILE}\" -u \"${USER}:${PASS}\" \"${STORAGE_URL}/${CONTAINER}/\"" echo "\tDownload \"oss://${REMOTE_FILE}\" to \"${LOCAL_FILE}\":\tcurl -k -vX GET -u \"${USER}:${PASS}\" \"${STORAGE_URL}/${REMOTE_FILE}\" > \"${LOCAL_FILE}\"" echo "\n\tDelete \"oss://${REMOTE_FILE}\":\tcurl -k -vX DELETE -u \"${USER}:${PASS}\" \"${STORAGE_URL}/${REMOTE_FILE}\"" echo "\ndone" I put the content of this script into the file oss_operations.sh, gave it execute permission and ran it: $ chmod +x oss_operations.sh $ ./oss_operations.sh The output will look like: curl -k -sS -H "X-Storage-User: storage-a424392:alexey.filanovskiy@oracle.com" -H "X-Storage-Pass:MySecurePass" -i "https://storage-a424392.storage.oraclecloud.com/auth/v1.0" Token and storage URL: OSS url: https://storage-a424392.storage.oraclecloud.com/auth/v1.0 auth token: AUTH_tk45d49d9bcd65753f81bad0eae0aeb3db storage token: AUTH_tk45d49d9bcd65753f81bad0eae0aeb3db storage url: https://storage.us2.oraclecloud.com/v1/storage-a424392 Containers: 123_OOW17 1475233258815 1475233258815-segments Container ...
Actions: Listing containers: curl -k -vX GET -u "alexey.filanovskiy@oracle.com:MySecurePassword" "https://storage.us2.oraclecloud.com/v1/storage-a424392/" Create container "oss://example_container": curl -k -vX PUT -u "alexey.filanovskiy@oracle.com:MySecurePassword" "https://storage.us2.oraclecloud.com/v1/storage-a424392/example_container" Listing objects at container "oss://example_container": curl -k -vX GET -u "alexey.filanovskiy@oracle.com:MySecurePassword" "https://storage.us2.oraclecloud.com/v1/storage-a424392/example_container/" Upload "./file.txt" to "oss://example_container/file.txt": curl -k -vX PUT -T "./file.txt" -u "alexey.filanovskiy@oracle.com:MySecurePassword" "https://storage.us2.oraclecloud.com/v1/storage-a424392/example_container/" Download "oss://example_container/file.txt" to "./file.txt": curl -k -vX GET -u "alexey.filanovskiy@oracle.com:MySecurePassword" "https://storage.us2.oraclecloud.com/v1/storage-a424392/example_container/file.txt" > "./file.txt" Delete "oss://example_container/file.txt": curl -k -vX DELETE -u "alexey.filanovskiy@oracle.com:MySecurePassword" "https://storage.us2.oraclecloud.com/v1/storage-a424392/example_container/file.txt" Upload/Download files. FTM CLI. The REST API may seem a bit cumbersome and quite hard to use. The good news is that there is an intermediate solution, a command line interface - FTM CLI. Again, the full documentation is available here, but I'd like to briefly explain what you can do with FTM CLI. You can download it here, and after unpacking it's ready to use! $ unzip ftmcli-v2.4.2.zip ... $ cd ftmcli-v2.4.2 $ ls -lrt total 120032 -rwxr-xr-x 1 opc opc 1272 Jan 29 08:42 README.txt -rw-r--r-- 1 opc opc 15130743 Mar 7 12:59 ftmcli.jar -rw-rw-r-- 1 opc opc 107373568 Mar 22 13:37 file.txt -rw-rw-r-- 1 opc opc 641 Mar 23 10:34 ftmcliKeystore -rw-rw-r-- 1 opc opc 315 Mar 23 10:34 ftmcli.properties -rw-rw-r-- 1 opc opc 373817 Mar 23 15:24 ftmcli.log You may note that there is a file ftmcli.properties; it may simplify your life if you configure it once. The documentation is here, and my example of this config is: $ cat ftmcli.properties #saving authkey #Fri Mar 30 21:15:25 UTC 2018 rest-endpoint=https\://storage-a424392.storage.oraclecloud.com/v1/storage-a424392 retries=5 user=alexey.filanovskiy@oracle.com segments-container=all_segments max-threads=15 storage-class=Standard segment-size=100 Now we have all the connection details and we can use the CLI as simply as possible. There are a few basic commands available with FTM CLI, but as a first step I'd suggest authenticating the user (enter the password once): $ java -jar ftmcli.jar list --save-auth-key Enter your password: If you use "--save-auth-key" it will save your password and next time will not ask you for a password: $ java -jar ftmcli.jar list 123_OOW17 1475233258815 ... You may refer to the documentation for the full list of commands or simply run ftmcli without any arguments: $ java -jar ftmcli.jar ... Commands: upload Upload a file or a directory to a container. download Download an object or a virtual directory from a container. create-container Create a container. restore Restore an object from an Archive container. list List containers in the account or objects in a container. delete Delete a container in the account or an object in a container. describe Describes the attributes of a container in the account or an object in a container. set Set the metadata attribute(s) of a container in the account or an object in a container.
set-crp Set a replication policy for a container. copy Copy an object to a destination container. Let's try to run through the standard flow for OSS - create a container, upload a file there, list the objects in the container, describe the container properties and delete it. # Create container $ java -jar ftmcli.jar create-container container_for_blog Name: container_for_blog Object Count: 0 Bytes Used: 0 Storage Class: Standard Creation Date: Fri Mar 30 21:50:15 UTC 2018 Last Modified: Fri Mar 30 21:50:14 UTC 2018 Metadata --------------- x-container-write: a424392.storage.Storage_ReadWriteGroup x-container-read: a424392.storage.Storage_ReadOnlyGroup,a424392.storage.Storage_ReadWriteGroup content-type: text/plain;charset=utf-8 accept-ranges: bytes Custom Metadata --------------- x-container-meta-policy-georeplication: container # Upload file to container $ java -jar ftmcli.jar upload container_for_blog file.txt Uploading file: file.txt to container: container_for_blog File successfully uploaded: file.txt Estimated Transfer Rate: 16484KB/s # List files into Container $ java -jar ftmcli.jar list container_for_blog file.txt # Get Container Metadata $ java -jar ftmcli.jar describe container_for_blog Name: container_for_blog Object Count: 1 Bytes Used: 434 Storage Class: Standard Creation Date: Fri Mar 30 21:50:15 UTC 2018 Last Modified: Fri Mar 30 21:50:14 UTC 2018 Metadata --------------- x-container-write: a424392.storage.Storage_ReadWriteGroup x-container-read: a424392.storage.Storage_ReadOnlyGroup,a424392.storage.Storage_ReadWriteGroup content-type: text/plain;charset=utf-8 accept-ranges: bytes Custom Metadata --------------- x-container-meta-policy-georeplication: container # Delete container $ java -jar ftmcli.jar delete container_for_blog ERROR:Delete failed. Container is not empty. # Delete with force option $ java -jar ftmcli.jar delete -f container_for_blog Container successfully deleted: container_for_blog Another great thing about FTM CLI is that it allows you to easily manage upload performance out of the box. In ftmcli.properties there is a property called "max-threads". It may vary between 1 and 100. Here is a test case illustrating this: -- Generate 10GB file $ dd if=/dev/zero of=file.txt count=10240 bs=1048576 -- Upload file in one thread (around an 18MB/sec rate) $ java -jar ftmcli.jar upload container_for_blog /home/opc/file.txt Uploading file: /home/opc/file.txt to container: container_for_blog File successfully uploaded: /home/opc/file.txt Estimated Transfer Rate: 18381KB/s -- Change number of threads from 1 to 99 in config file $ sed -i -e 's/max-threads=1/max-threads=99/g' ftmcli.properties -- Upload file in 99 threads (around a 68MB/sec rate) $ java -jar ftmcli.jar upload container_for_blog /home/opc/file.txt Uploading file: /home/opc/file.txt to container: container_for_blog File successfully uploaded: /home/opc/file.txt Estimated Transfer Rate: 68449KB/s So, it's a very simple and at the same time powerful tool for operations with the Object Store, and it may help you with scripting your operations. Upload/Download files. CloudBerry. Another way to interact with OSS is to use an application; for example, you may use CloudBerry Explorer for OpenStack Storage. There is a great blog post which explains how to configure CloudBerry for Oracle Object Store Service Classic, and I will start from the point where I have already configured it.
Whenever you log in it looks like this:     You may easily create a container in CloudBerry: And for sure you may easily copy data from your local machine to OSS: There's nothing much to add here; CloudBerry is a convenient tool for browsing Object Stores and doing small copies between a local machine and OSS. For me personally, it looks like Total Commander for OSS. Upload/Download files. Big Data Manager and ODCP. Big Data Cloud Service (BDCS) has a great component called Big Data Manager. This is a tool developed by Oracle which allows you to manage and monitor a Hadoop cluster. Among other features, Big Data Manager (BDM) allows you to register an Object Store in the Stores browser and easily drag and drop data between OSS and other sources (Database, HDFS...). When you copy data to/from HDFS you use an optimized version of the Hadoop Distcp tool, ODCP. This is a very fast way to copy data back and forth. Fortunately, JP already wrote about this feature and I can simply give a link. If you want to see concrete performance numbers, you can go here to the a-team blog page. Without Big Data Manager, you can manually register OSS on a Linux machine and invoke the copy command from bash. The documentation shows all the details, and I will show just one example: # add account: $ export CM_ADMIN=admin $ export CM_PASSWORD=SuperSecurePasswordCloderaManager $ export CM_URL=https://cfclbv8493.us2.oraclecloud.com:7183 $ bda-oss-admin add_swift_cred --swift-username "storage-a424392:alexey.filanovskiy@oracle.com" --swift-password "SecurePasswordForSwift" --swift-storageurl "https://storage-a424392.storage.oraclecloud.com/auth/v2.0/tokens" --swift-provider bdcstorage # list of credentials: $ bda-oss-admin list_swift_creds Provider: bdcstorage     Username: storage-a424392:alexey.filanovskiy@oracle.com     Storage URL: https://storage-a424392.storage.oraclecloud.com/auth/v2.0/tokens # check files on OSS swift://[container name].[Provider created step before]/: $ hadoop fs -ls swift://alextest.bdcstorage/ 18/03/31 01:01:13 WARN http.RestClientBindings: Property fs.swift.bdcstorage.property.loader.chain is not set Found 3 items -rw-rw-rw- 1 279153664 2018-03-07 00:08 swift://alextest.bdcstorage/bigdata.file.copy drwxrwxrwx - 0 2018-03-07 00:31 swift://alextest.bdcstorage/customer drwxrwxrwx - 0 2018-03-07 00:30 swift://alextest.bdcstorage/customer_address Now you have OSS configured and ready to use. You may copy data with ODCP; here you may find the entire list of sources and destinations. For example, if you want to copy data from HDFS to OSS, you have to run: $ odcp hdfs:///tmp/file.txt swift://alextest.bdcstorage/ ODCP is a very efficient way to move data from HDFS to the Object Store and back. If you are from the Hadoop world and are used to the Hadoop fs API, you may use it with the Object Store as well (after configuring it). For example, to load data into OSS, you need to run: $ hadoop fs -put /home/opc/file.txt swift://alextest.bdcstorage/file1.txt Upload/Download files. Oracle Storage Cloud Software Appliance. Object Store is a fairly new concept, and for sure there is a way to smooth the migration. Years ago, when HDFS was new and undiscovered, many people didn't know how to work with it, and a few technologies such as NFS Gateway and HDFS-fuse appeared. Both of these technologies allowed you to mount HDFS on a Linux filesystem and work with it as with a normal filesystem. The Oracle Cloud Infrastructure Storage Software Appliance allows you to do something similar. All documentation can be found here, a brief video here, and you can download the software here.
In my blog I just show one example of its usage. This picture will help me explain how the Storage Cloud Software Appliance works: you can see that the customer needs to install an on-premise docker container which has the entire required stack. I'll skip all the details, which you may find in the documentation above, and will just show the concept. # Check oscsa status [on-prem client] $ oscsa info Management Console: https://docker.oracleworld.com:32769 If you have already configured an OSCSA FileSystem via the Management Console, you can access the NFS share using the following port. NFS Port: 32770 Example: mount -t nfs -o vers=4,port=32770 docker.oracleworld.com:/<OSCSA FileSystem name> /local_mount_point # Run oscsa [on-prem client] $ oscsa up There (on the docker image, which you deploy on some on-premise machine) you will find a WebUI where you can configure the Storage Appliance. After login, you will see the list of configured Object Stores. In this console you can connect a linked container with this on-premise host; after it has been connected, you will see the option "disconnect". After you connect a device, you have to mount it: [on-prem client] $ sudo mount -t nfs -o vers=4,port=32770 localhost:/devoos /oscsa/mnt [on-prem client] $ df -h|grep oscsa localhost:/devoos 100T 1.0M 100T 1% /oscsa/mnt Now you can upload a file into the Object Store: [on-prem client] $ echo "Hello Oracle World" > blog.file [on-prem client] $ cp blog.file /oscsa/mnt/ This is an asynchronous copy to the Object Store, so after a while you will be able to find the file there. The only restriction I wasn't able to overcome is that the filename changes during the copy. Conclusion. Object Store is here and it will become more and more popular. This means there is no way to escape it and you have to get familiar with it. The blog post above showed that there are multiple ways to deal with it, starting from user-friendly tools (like CloudBerry) and ending with the low-level REST API.


Data Warehousing

Loading Data to the Object Store for Autonomous Data Warehouse Cloud

So you got your first service instance of your autonomous data warehouse set up, you experienced the performance of the environment using the sample data, went through all tutorials and videos and are getting ready to rock-n-roll. But the one thing you’re not sure about is this Object Store. Yes, you used it successfully as described in the tutorial, but what’s next? And what else is there to know about the Object Store? First and foremost, if you are interested in understanding a bit more about what this Object Store is, you should read the following blog post from Rachna, the Product Manager for the Object Store among other things. It introduces the Object Store, how to set it up and manage files with the UI, plus a couple of simple command line examples (don’t get confused by the term ‘BMC’, that’s the old name of Oracle’s Cloud Infrastructure; that’s true for the command line utility as well, which is now called oci). You should read that blog post to get familiar with the basic concepts of the Object Store and a cloud account (tenant). The documentation and blog posts are great, but now you actually want to use it to load data into ADWC.  This means loading more (and larger) files, more need for automation, and more flexibility.  This post will focus on exactly that: how to become productive with command line utilities without being a developer, how to leverage the power of the Oracle Object Store to upload more files in one go, and even how to upload larger files in parallel without any major effort. The blog post will cover both: The Oracle oci command line interface for managing files The Swift REST interface for managing files   Using the oci command line interface The Oracle oci command line interface (CLI) is a tool that enables you to work with Oracle Cloud Infrastructure objects and services. It’s a thin layer on top of the oci APIs (typically REST) and one of Oracle’s open source projects (the source code is on GitHub). Let’s quickly step through what you have to do to use this CLI. If you do not want to install anything, that is fine, too. In that case feel free to jump to the REST section in this post right away, but you’re going to miss out on some cool stuff that the CLI provides you out of the box. Getting going with the utility is really simple, as simple as one-two-three: Install oci cli following the installation instructions on github. I just did this on an Oracle Linux 7.4 VM instance that I created in the Oracle Cloud and had the utility up and running in no time.   Configure your oci cli installation. You need a user created in the Oracle Cloud account that you want to use, and that user must have the appropriate privileges to work with the object store. A keypair is used for signing API requests, with the public key uploaded to Oracle. Only the user calling the API should possess the private key. All this is described in the configuration section of the CLI.  That is probably the part of the setup that takes the most time. You have to ensure you have UI console access when doing this since you have to upload the public key for your user.   Use oci cli. After successful setup you can use the command line interface to manage your buckets for storing all your files in the Cloud, among other things.   First steps with oci cli The focus of the command line interface is on ease-of-use and to make its usage as self-explaining as possible, with a comprehensive built-in help system in the utility.
Whenever you want to know something without looking around, use the --help, -h, or -? syntax for a command, irrespective of how many parameters you have already entered. So you can start with oci -h and let the utility guide you. For the purpose of file management the important category is the object store category, with the main tasks of: Creating, managing, and deleting buckets This task is probably done by an administrator for you, but we will cover it briefly nevertheless   Uploading, managing, and downloading objects (files) That’s your main job in the context of the Autonomous Data Warehouse Cloud That’s what we are going to do now.   Creating a bucket Buckets are containers that store objects (files). Like other resources, buckets belong to a compartment, a collection of resources in the Cloud that can be used as an entity for privilege management. To create a bucket you have to know the compartment id. That is the only time we have to deal with these cloud-specific unique identifiers. All other object (file) operations use names. So let’s create a bucket. The following creates a bucket named myFiles in my account ADWCACCT in a compartment given to me by the Cloud administrator. $ oci os bucket create --compartment-id ocid1.tenancy.oc1..aaaaaaaanwcasjdhfsbw64mt74efh5hneavfwxko7d5distizgrtb3gzj5vq --namespace-name adwcaact --name myFiles {   "data": {     "compartment-id": "ocid1.tenancy.oc1..aaaaaaaanwcasjdhfsbw64mt74efh5hneavfwxko7d5distizgrtb3gzj5vq",     "created-by": "ocid1.user.oc1..aaaaaaaaomoqtk3z7y43543cdvexq3y733pb5qsuefcbmj2n5c6ftoi7zygq",     "etag": "c6119bd6-98b6-4520-a05b-26d5472ea444",     "metadata": {},     "name": "myFiles",     "namespace": "adwcaact",     "public-access-type": "NoPublicAccess",     "storage-tier": "Standard",     "time-created": "2018-02-26T22:16:30.362000+00:00"   },   "etag": "c6119bd6-98b6-4520-a05b-26d5472ea733" } The operation returns with the metadata of the bucket after successful creation. We’re ready to upload and manage files in the object store.   Upload your first file with oci cli You can upload a single file very easily with the oci command line interface. And, as promised before, you do not even have to remember any ocid in this case … . $ oci os object put --namespace adwcacct --bucket-name myFiles --file /stage/supplier.tbl Uploading object  [####################################]  100% {   "etag": "662649262F5BC72CE053C210C10A4D1D",   "last-modified": "Mon, 26 Feb 2018 22:50:46 GMT",   "opc-content-md5": "8irNoabnPldUt72FAl1nvw==" } After a successful upload you can check the md5 sum of the file; that’s basically the fingerprint proving that the data on the other side (in the cloud) is not corrupt and is the same as the local data (on the machine where the data is coming from). The only “gotcha” is that OCI is using base64 encoding, so you cannot just do a simple md5. The following command solves this for me on my Mac: $ openssl dgst -md5 -binary supplier.tbl |openssl enc -base64 8irNoabnPldUt72FAl1nvw== Now that’s a good start. I can use this command in any shell program, like the following which loads all files in a folder sequentially to the object store:  for i in `ls *.tbl` do   oci os object put --namespace adwcacct --bucket-name myFiles --file $i done You can write it to load multiple files in parallel, load only files that match a specific name pattern, etc. You get the idea. Whatever you can do with a shell you can do.
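For example, here is a minimal sketch of a parallel variant that fans out four concurrent oci calls with xargs. It reuses the adwcacct namespace and myFiles bucket from above, so treat those names as placeholders for your own environment:
# upload all *.tbl files in the current directory, four at a time
ls *.tbl | xargs -P 4 -I{} \
  oci os object put --namespace adwcacct --bucket-name myFiles --file {}
The same pattern works for name filtering: replace the ls *.tbl part with whatever listing produces the files you actually want to push.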
Alternatively, if it's just about loading all the files in a directory, you can achieve the same with the oci cli as well by using its bulk upload capabilities. The following briefly shows this: oci os object bulk-upload -ns adwcacct -bn myFiles --src-dir /MyStagedFiles {   "skipped-objects": [],   "upload-failures": {},   "uploaded-objects": {     "chan_v3.dat": {       "etag": "674EFB90B1A3CECAE053C210D10AC9D9",       "last-modified": "Tue, 13 Mar 2018 17:43:28 GMT",       "opc-content-md5": "/t4LbeOiCz61+Onzi/h+8w=="     },     "coun_v3.dat": {       "etag": "674FB97D50C34E48E053C230C10A1DF8",       "last-modified": "Tue, 13 Mar 2018 17:43:28 GMT",       "opc-content-md5": "sftu7G5+bgXW8NEYjFNCnQ=="     },     "cust1v3.dat": {       "etag": "674FB97D52274E48E053C210C10A1DF8",       "last-modified": "Tue, 13 Mar 2018 17:44:06 GMT",       "opc-content-md5": "Zv76q9e+NTJiyXU52FLYMA=="     },     "sale1v3.dat": {       "etag": "674FBF063F8C50ABE053C250C10AE3D3",       "last-modified": "Tue, 13 Mar 2018 17:44:52 GMT",       "opc-content-md5": "CNUtk7DJ5sETqV73Ag4Aeg=="     }   } } Uploading a single large file in parallel  Ok, now we can load one or many files to the object store. But what do you do if you have a single large file that you want to get uploaded? The oci command line offers built-in multi-part loading where you do not need to split the file beforehand. The command line provides you built-in capabilities to (A) transparently split the file into sized parts and (B) control the parallelism of the upload. $ oci os object put -ns adwcacct -bn myFiles --file lo_aa.tbl --part-size 100 --parallel-upload-count 4 While the load is ongoing you can list all in-progress uploads, but unfortunately without any progress bar; the progress bar is reserved for the initiating session:  $ oci os multipart list -ns adwcacct -bn myFiles {   "data":    [         {       "bucket": "myFiles",       "namespace": "adwcacct",       "object": "lo_aa.tbl",       "time-created": "2018-02-27T01:19:47.439000+00:00",       "upload-id": "4f04f65d-324b-4b13-7e60-84596d0ef47f"     }   ] }   While a serial process for a single file gave me somewhere around 35 MB/sec upload on average, the parallel load sped things up quite a bit, so it’s definitely cool functionality (note that your mileage will vary and is probably mostly dependent on your Internet/proxy connectivity and bandwidth). If you’re interested in more details about how that works, here is a link from Rachna who explains the inner workings of this functionality.   Using the Swift REST interface Now, after having covered the oci utility, let’s briefly look into what we can do out of the box, without the need to install anything. Yes, without installing anything you can leverage the REST endpoints of the object storage service. All you need to know is your username/SWIFT password and your environment details, e.g. which region you're uploading to, the account (tenant) and the target bucket.  This is where the real fun starts, and this is where it can become geeky, so we will focus only on the two most important aspects of dealing with files and the object store: uploading and downloading files.   Understanding how to use Openstack Swift REST File management with REST is just as simple as with the oci cli command. Similar to the setup of the oci cli, you have to know the basic information about your Cloud account, namely:  a user in the cloud account that has the appropriate privileges to work with a bucket in your tenancy.
This user also has to be configured with a SWIFT password (see here for how that is done), and you need a bucket in one of the object stores in a region (we are not going to discuss how to use REST to do this). The bucket/region defines the REST endpoint; for example, if you are using the object store in Ashburn, VA, the endpoint is https://swiftobjectstorage.us-ashburn-1.oraclecloud.com. The URI for accessing your bucket is built as follows:

<object store rest endpoint>/v1/<tenant name>/<bucket name>

In my case, for the simple example, it would be https://swiftobjectstorage.us-ashburn-1.oraclecloud.com/v1/adwcacct/myFiles. If you have all this information you are set to upload and download files.

Uploading an object (file) with REST
Uploading a file is putting a file into the Cloud, so the REST command is a PUT. You also have to specify the file you want to upload and how the file should be named in the object store. With this information you can write a simple little shell script like the following, which takes both the file and bucket name as input:

# usage: upload_oss.sh <file> <bucket>
file=$1
bucket=$2

curl -v -X PUT \
 -u 'jane.doe@acme.com:)#sdswrRYsi-In1-MhM.!' \
 --upload-file ${file} \
 https://swiftobjectstorage.us-ashburn-1.oraclecloud.com/v1/adwcacct/${bucket}/${file}

So if you want to upload multiple files in a directory, similar to what we showed for the oci cli command, you just save this little script, say as upload_oss.sh, and call it just like you called the oci cli:

for i in `ls *.tbl`
do
  upload_oss.sh $i myFiles
done

Downloading an object (file) with REST
While we expect you to upload data to the object store way more often than downloading information, let’s quickly cover that, too. So you want to get a file from the object store? Well, the REST command GET will do this for you. It is just as intuitive as uploading, and you might be able to guess the complete syntax already. Yes, it is:

curl -v -X GET \
-u 'jane.doe@acme.com:)#sdswrRYsi-In1-MhM.!' \
https://swiftobjectstorage.us-ashburn-1.oraclecloud.com/v1/adwcacct/myFiles/myFileName \
--output myLocalFileName

That’s about all you need to get started uploading all your files to the Oracle Object Store so that you can then consume them from within the Autonomous Data Warehouse Cloud. Happy uploading!
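For convenience, here is a hedged sketch of the matching download wrapper; the credentials and endpoint are placeholders, and the final openssl line simply prints the base64-encoded md5 so you can compare it by eye against the opc-content-md5 value recorded at upload time.

#!/bin/bash
# Sketch: download_oss.sh - fetch an object and print its base64-encoded md5.
# usage: download_oss.sh <bucket> <objectName> <localFile>
bucket=$1
object=$2
localfile=$3

# Placeholder user/SWIFT password and the Ashburn endpoint used in the examples above.
curl -s -X GET \
  -u 'jane.doe@acme.com:yourSwiftPassword' \
  https://swiftobjectstorage.us-ashburn-1.oraclecloud.com/v1/adwcacct/${bucket}/${object} \
  --output ${localfile}

# Same base64-encoded md5 trick as for uploads; compare the output with the
# opc-content-md5 value returned when the object was uploaded.
openssl dgst -md5 -binary ${localfile} | openssl enc -base64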


The Data Warehouse Insider

Roadmap Update: What you need to know about Big Data Appliance 4.12

As part of our continuous efforts to ensure transparency in release planning and availability to our customers for our big data stack, below is an update to the original roadmap post.

Current Release
As discussed, the 4.11 release delivered the following:
Updated CDH, now delivering 5.13.1
Updates to the Operating System (OL6), including security updates
Java updates
The release consciously pushed back some features to ensure the Oracle environments pick up the latest CDH releases within our (roughly) 4-week goal.

Next up is BDA 4.12
With the longer development time we carved out for 4.12, we are able to schedule a set of very interesting components into this release. At a high level, the following are planned to be in 4.12:
Configure a Kafka cluster on dedicated nodes on the BDA
Set up (and include) Big Data Manager on BDA. For more information on Big Data Manager, see these videos (or click the one further down) on what cool things you can do with the Zeppelin Notebooks, ODCP and drag-and-drop copying of data
Full BDA clusters on OL7. After we enabled the edge nodes for OL7 to support Cloudera Data Science Workbench, we are now delivering full clusters on OL7. Note that we have not yet delivered an in-place upgrade path to migrate from an OL6-based cluster to an OL7 cluster
High Availability for more services in CDH, by leveraging and pre-configuring best practices. These new HA setups are updated regularly and are fully supported as part of the system going forward: Hive Service, Sentry Service, Hue Service
On BDA X7-2 hardware, 2 SSDs are included. When running on X7, the Journal Node metadata and ZooKeeper data are put onto these SSDs instead of the regular OS disks. This ensures better performance for highly loaded master nodes.
Of course the software will have undergone testing, and we do run infrastructure security scans on the system. We include any Linux updates that are available when we freeze the image and ship those. Any violation that crops up after the release can, no, should be addressed by using the official OL repo to update the OS. Lastly, we are looking to release early April and are finalizing the actual Cloudera CDH release. We may use 5.14.1, but there is a chance that we switch and jump to 5.15.0 depending on timing.

And One More Thing
Because Big Data Appliance is an engineered system, customers expect robust movement between versions. Upgrading the entire system, which is where BDA differs from just a Cloudera cluster, is an important part of the value proposition but is also fairly complex. With 4.12 we place additional emphasis on addressing previously seen upgrade issues, and we will keep doing this as an ongoing priority on all BDA software releases. So expect even more robust upgrades going forward.

Lastly, please note the Safe Harbor Statement below: The preceding is intended to outline our general product direction. It is intended for information purposes only, and may not be incorporated into any contract. It is not a commitment to deliver any material, code, or functionality, and should not be relied upon in making purchasing decisions. The development, release, and timing of any features or functionality described for Oracle’s products remains at the sole discretion of Oracle.


Big Data SQL

Big Data SQL Quick Start. Multi-user Authorization - Part 25

One of the major Big Data SQL benefits is security. You deal with the data that you store in HDFS or other sources through Oracle Database, which means that you can apply many Database features, such as Data Redaction, VPD or Database Vault. These features, in conjunction with the database schema/grant privilege model, allow you to protect data from the database side (when an intruder tries to reach the data through the database). But it's also important to keep in mind that data stored on HDFS may be required for other purposes (Spark, Solr, Impala...), and those need some other mechanism for protection. In the Hadoop world, Kerberos is the most popular way to protect data (an authentication method). Kerberos in conjunction with HDFS ACLs gives you the opportunity to protect data at the file system level. HDFS as a file system has the concept of user and group, and the files that you store on HDFS have different privileges for the owner, the group and all others. Conclusion: for working with Kerberized clusters, Big Data SQL needs a valid Kerberos ticket to work with HDFS files. Fortunately, all this setup has been automated and is available within the standard Oracle Big Data SQL installer. For more details please check here.

Big Data SQL and Kerberos. Well, usually customers have a Kerberized cluster, and to work with it we need to have a valid Kerberos ticket. But this raises the question - which principal do you need to have with Big Data SQL? The answer is easy - oracle. In prior Big Data SQL releases, all Big Data SQL ran on the Hadoop cluster as the same user: oracle. This has the following consequences:
- It is not possible to authorize access to data based on the user that is running a query
- Hadoop cluster audits show that all data queried through Big Data SQL is accessed by oracle
What if I already have some data, used by other applications, with different privileges (belonging to different users and groups)? This is why Big Data SQL 3.2 introduced a new feature - Multi-User Authorization.

Hadoop impersonation. At the foundation of Multi-User Authorization lies a Hadoop feature called impersonation. I took this description from here: "A superuser with username ‘super’ wants to submit job and access hdfs on behalf of a user joe. The superuser has Kerberos credentials but user joe doesn’t have any. The tasks are required to run as user joe and any file accesses on namenode are required to be done as user joe. It is required that user joe can connect to the namenode or job tracker on a connection authenticated with super’s Kerberos credentials. In other words super is impersonating the user joe." In the same manner, "oracle" is the superuser and the other users are impersonated.

Multi-User Authorization key concepts.
1) Big Data SQL will identify the trusted user that is accessing data on the cluster. By executing the query as the trusted user:
- Authorization rules specified in Hadoop will be respected
- Authorization rules specified in Hadoop do not need to be replicated in the database
- Hadoop cluster audits identify the actual Big Data SQL query user
2) Consider the Oracle Database as the entity that is providing the trusted user to Hadoop
3) You must map the database user that is running a query in Oracle Database to a Hadoop user
4) You must identify the actual user that is querying the Oracle table and pass that identity to Hadoop:
- This may be an Oracle Database user (i.e. a schema)
- A lightweight user may come from session-based contexts (see SYS_CONTEXT)
- The user/group map must be available through an OS lookup in Hadoop

Demonstration. You can find the full documentation for this feature here; now I'm going to show a few of the most popular cases with code examples. To work with the relevant objects, you need to grant the following permissions to the user who will manage the mapping table:

SQL> grant select on BDSQL_USER_MAP to bikes;
SQL> grant execute on DBMS_BDSQL to bikes;
SQL> grant BDSQL_ADMIN to bikes;

In my case, this is the user "bikes". Just in case, clean up any existing mapping for user BIKES:

SQL> begin
DBMS_BDSQL.REMOVE_USER_MAP (current_database_user =>'BIKES');
end;
/

check that the mapping table is empty:

SQL> select * from SYS.BDSQL_USER_MAP;

and after this run a query:

SQL> select /*+ MONITOR */ * from bikes.weather_ext;

This is the default mode, without any mapping, so I assume that I'll contact HDFS as the oracle user. To double-check this, I review the audit files:

$ cd /var/log/hadoop-hdfs
$ tail -f hdfs-audit.log |grep central_park
2018-03-01 17:42:10,938 INFO ... ugi=oracle ... ip=/10.0.0.10 cmd=open ... src=/data/weather/central_park_weather.csv..

Here it is clear that the oracle user reads the file (ugi=oracle). Let's check the permissions for the given file (which backs this external table):

$ hadoop fs -ls /data/weather/central_park_weather.csv
-rw-r--r-- 3 oracle oinstall 26103 2017-10-24 13:03 /data/weather/central_park_weather.csv

so, everybody may read it. Remember this and let's try to create the first mapping.

SQL> begin
DBMS_BDSQL.ADD_USER_MAP(
current_database_user =>'BIKES',
syscontext_namespace => null,
syscontext_parm_hadoop_user => 'user1'
);
end;
/

This mapping says that the database user BIKES will always be mapped to the OS user user1. Run the query again and check the user who reads this file:

SQL> select /*+ MONITOR */ * from bikes.weather_ext;

$ cd /var/log/hadoop-hdfs
$ tail -f hdfs-audit.log |grep central_park
2018-03-01 17:42:10,938 INFO ... ugi=user1... ip=/10.0.0.10 cmd=open ... src=/data/weather/central_park_weather.csv..

It's interesting that user1 doesn't exist in the Hadoop OS:

# id user1
id: user1: No such user

If the user does not exist (as in the user1 case), it can only read world-readable (777) files. Let me revoke read permission from everyone and run the query again:

$ sudo -u hdfs hadoop fs -chmod 640 /data/weather/central_park_weather.csv
$ hadoop fs -ls /data/weather/central_park_weather.csv
-rw-r----- 3 oracle oinstall 26103 2017-10-24 13:03 /data/weather/central_park_weather.csv

Now it fails. To make it work, I can create the "user1" account on each Hadoop node and add it to the oinstall group.

$ useradd user1
$ usermod -a -G oinstall user1

Run the query again and check the user who reads this file:

SQL> select /*+ MONITOR */ * from bikes.weather_ext;

$ cd /var/log/hadoop-hdfs
$ tail -f hdfs-audit.log |grep central_park
2018-03-01 17:42:10,938 INFO ... ugi=user1... ip=/10.0.0.10 cmd=open ... src=/data/weather/central_park_weather.csv..

Here we are! We could read the file because of the group permissions. What if I want to map this schema to hdfs or some other powerful user? Let's try:

SQL> begin
DBMS_BDSQL.REMOVE_USER_MAP (current_database_user =>'BIKES');
DBMS_BDSQL.ADD_USER_MAP(
current_database_user =>'BIKES',
syscontext_namespace => null,
syscontext_parm_hadoop_user => 'hdfs'
);
end;
/

This raises an exception, and the reason is that the hdfs user is on the blacklist for impersonation.

$ cat $ORACLE_HOME/bigdatasql/databases/orcl/bigdata_config/bigdata.properties| grep impersonation
....
# Impersonation properties
impersonation.enabled=true
impersonation.blacklist='hue','yarn','oozie','smon','mapred','hdfs','hive','httpfs','flume','HTTP','bigdatamgr','oracle'
...

The second scenario is authorization with a thin client or with CLIENT_IDENTIFIER. In the case of a multi-tier architecture (when we have an application tier and a database tier), it may be a challenge to differentiate multiple users within the same application, who all use the same schema. Below is an example which illustrates this: we have an application which connects to the database as the HR_APP user, but many people may use this application and this database login. To differentiate these human users we may use the dbms_session.set_IDENTIFIER procedure (you can find more details here). So, the Big Data SQL multi-user authorization feature allows using a SYS_CONTEXT value for authorization on Hadoop. Below is a test case which illustrates this.

-- Remove the previous rule, related to the BIKES user --
SQL> begin
DBMS_BDSQL.REMOVE_USER_MAP (current_database_user =>'BIKES');
end;
/
-- Add a new rule, which says that if the database user is BIKES, the Hadoop user has to be taken from USERENV as CLIENT_IDENTIFIER --
SQL> begin
DBMS_BDSQL.ADD_USER_MAP(
current_database_user =>'BIKES',
syscontext_namespace => 'USERENV',
syscontext_parm_hadoop_user => 'CLIENT_IDENTIFIER'
);
end;
-- Check the current database user (schema) --
SQL> select user from dual;
BIKES
-- Check CLIENT_IDENTIFIER from USERENV --
SQL> select SYS_CONTEXT('USERENV', 'CLIENT_IDENTIFIER') from dual;
NULL
-- Run any query against Hadoop --
SQL> select /*+ MONITOR */ * from bikes.weather_ext;
-- Check the Hadoop audit logs --
-bash-4.1$ tail -f hdfs-audit.log |grep central_park
2018-03-01 18:14:40 ... ugi=oracle ... src=/data/weather/central_park_weather.csv
-- Set CLIENT_IDENTIFIER --
SQL> begin
dbms_session.set_IDENTIFIER('Alexey');
end;
/
-- Check CLIENT_IDENTIFIER for the current session --
SQL> select SYS_CONTEXT('USERENV', 'CLIENT_IDENTIFIER') from dual;
Alexey
-- Run a query again over the HDFS data --
SQL> select /*+ MONITOR */ * from bikes.weather_ext;
-- Check the Hadoop audit logs: --
-bash-4.1$ tail -f hdfs-audit.log |grep central_park
2018-03-01 18:17:43 ... ugi=Alexey ... src=/data/weather/central_park_weather.csv

The third way to handle authentication is to use the authenticated identity: users connecting to the database (via Kerberos, a DB user, etc...) have their authenticated identity passed to Hadoop. To make it work, simply run:

SQL> begin
DBMS_BDSQL.ADD_USER_MAP(
current_database_user => '*' ,
syscontext_namespace => 'USERENV',
syscontext_parm_hadoop_user => 'AUTHENTICATED_IDENTITY');
end;
/

and after this your user on HDFS will be the one returned by:

SQL> select SYS_CONTEXT('USERENV', 'AUTHENTICATED_IDENTITY') from dual;
BIKES

For example, if I logged on to the database as BIKES (as a database user), on HDFS I'll be authenticated as the bikes user:

-bash-4.1 $ tail -f hdfs-audit.log |grep central_park
2018-03-01 18:23:23 ... ugi=bikes... src=/data/weather/central_park_weather.csv

To check all the rules you have for multi-user authorization, you may run the following query:

SQL> select * from SYS.BDSQL_USER_MAP;

I hope this feature helps you create a robust security bastion around your data in HDFS.
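As a wrap-up, here is a hedged sketch of how the "create the mapped OS user on every Hadoop node" step from the example above could be scripted with dcli on a BDA cluster; the user and group names are the ones used in this post, and the audit-log check simply repeats the grep shown earlier.

#!/bin/bash
# Sketch: create the OS identity that a Big Data SQL mapping points to on every
# node of the cluster, then verify access in the HDFS audit log.
MAPPED_USER=user1          # the Hadoop user configured via DBMS_BDSQL.ADD_USER_MAP
HADOOP_GROUP=oinstall      # group that owns the HDFS files in the example

# dcli -C runs the command on all cluster nodes (as used elsewhere on BDA)
dcli -C "useradd ${MAPPED_USER}"
dcli -C "usermod -a -G ${HADOOP_GROUP} ${MAPPED_USER}"

# After re-running the query from the database, the audit log should show
# ugi=<mapped user> instead of ugi=oracle
tail -n 200 /var/log/hadoop-hdfs/hdfs-audit.log | grep "ugi=${MAPPED_USER}"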


Data Warehousing

Demoing OAC Querying ADWC Massive Data Volumes

If you missed all the major announcements from OpenWorld 2017 about Oracle Autonomous Data Warehouse Cloud then take some time to review this blog post: Review of Big Data Warehousing at OpenWorld 2017 - Now Available. This great new video below is an excerpt from Thomas Kurian's keynote session at Oracle CloudWorld earlier this month (February). In this video, George Lumpkin, VP of Product Management, shows how easy it is to create a new data warehouse inside the Oracle Autonomous Data Warehouse Cloud, how fast and elastic it is, and how quick and easy it is to integrate with other parts of Oracle's Cloud portfolio - in this case Oracle Analytics Cloud. This demo will take you from zero-to-dashboard in around 10 minutes! For the full-length video of Thomas Kurian's keynote at Oracle CloudWorld, New York, go here: Oracle Cloud and the Future of Your Business. Once you have seen the video you will understand that Autonomous Data Warehouse Cloud is amazingly easy: it is simple to set up and requires no tuning. You will know that Autonomous Data Warehouse is amazingly elastic in that it scales online without downtime, and it helps you save money because you only have to pay for exactly what you need. The way Autonomous Data Warehouse Cloud integrates with Oracle Analytics Cloud (and other Oracle cloud services too) means you can now deliver complete end-to-end analytics solutions using Oracle Cloud. Autonomous Data Warehouse Cloud truly represents a new step in data management for the cloud! For more information about Autonomous Data Warehouse Cloud please follow these links: Autonomous Data Warehouse Cloud on OTN, Autonomous Data Warehouse Cloud on cloud.oracle.com


Big Data

Advanced Data Protection using Big Data SQL and Database Vault - Introduction

According to the latest analyst reports and data breach statistics, Data Protection is shaping up to be the most important IT issue of the coming years! Due to increasing threats and cyber-attacks, new privacy regulations such as the European Union (EU) General Data Protection Regulation (GDPR) are being implemented and enforced, and the increasing adoption of Public Cloud further legitimizes these new Cyber Security requirements. Data Lake/Hub environments can be a treasure trove of sensitive data, so data protection must be considered in almost all Big Data SQL implementations. Fortunately, Big Data SQL is able to propagate several of the data protection capabilities of the Oracle Multi-Model Database, such as Virtual Private Database (aka Row Level Security) or Data Redaction, described in a previous post (see Big Data SQL Quick Start. Security - Part 4). But now is the time to speak about one of the most powerful ones: Database Vault. Clearly, databases are a common target, and 81% of 2017 hacking-related breaches leveraged either stolen and/or weak passwords. So, once legitimate internal credentials are acquired (preferably those for system accounts), accessing interesting data is just a matter of time. Hence, while Alexey described all the security capabilities you could put in place to Secure your Hadoop Cluster: ...once hackers get legitimate database credentials, it's done... unless you add another Cyber Security layer to manage fine-grained access. And here comes Database Vault1. This introductory post is the first of a series where we'll illustrate the security capabilities that can be combined with Big Data SQL in order to propagate these protections to Oracle and non-Oracle data stores: NoSQL clusters (Oracle NoSQL DB, HBase, Apache Cassandra, MongoDB...), Hadoop (Hortonworks and Cloudera), Kafka (Confluent and Apache, with the 3.2 release of Big Data SQL). In essence, Database Vault allows separation of duties between the operators (DBAs) and application users. As a result, data is protected from users with system privileges (SYSTEM - which should never be used and should be locked - named DBA accounts...), while those users can still continue to do their job: Moreover, Database Vault has the ability to add fine-grained security layers to control precisely who accesses which objects (tables, views, PL/SQL code...), from where (e.g. edge nodes only), and when (e.g. only during the application maintenance window): As explained in the previous figure, Database Vault introduces the concepts of Realms and Command Rules. From the documentation: A realm is a grouping of database schemas, database objects, and/or database roles that must be secured for a given application. Think of a realm as a zone of protection for your database objects. A schema is a logical collection of database objects such as tables (including external tables, hence allowing it to work with Big Data SQL), views, and packages, and a role is a collection of privileges. By arranging schemas and roles into functional groups, you can control the ability of users to use system privileges against these groups and prevent unauthorized data access by the database administrator or other powerful users with system privileges. Oracle Database Vault does not replace the discretionary access control model in the existing Oracle database. It functions as a layer on top of this model for both realms and command rules. Oracle Database Vault provides two types of realms: regular and mandatory.
A regular realm protects an entire database object (such as a schema). This type of realm restricts all users except users who have direct object privilege grants. With regular realms, users with direct object grants can perform DML operations but not DDL operations. A mandatory realm restricts user access to objects within a realm. Mandatory realms block both object privilege-based and system privilege-based access. In other words, even an object owner cannot access his or her own objects without proper realm authorization if the objects are protected by mandatory realms. After you create a realm, you can register a set of schema objects or roles (secured objects) for realm protection and authorize a set of users or roles to access the secured objects. For example, you can create a realm to protect all existing database schemas that are used in an accounting department. The realm prohibits any user who is not authorized to the realm to use system privileges to access the secured accounting data. And also: A command rule protects Oracle Database SQL statements (SELECT, ALTER SYSTEM), database definition language (DDL), and data manipulation language (DML) statements. To customize and enforce the command rule, you associate it with a rule set, which is a collection of one or more rules. The command rule is enforced at run time. Command rules affect anyone who tries to use the SQL statements it protects, regardless of the realm in which the object exists. One important point to emphasize is that Database Vault will audit any access violation to protected objects ensuring governance and compliance over time. To summarize:   In the next parts of this series, I'll present 3 use cases as following in order to demonstrate some of Database Vault capabilities in a context of Big Data SQL: Protect data from users with system privileges (DBA…) Access data only if super manager is connected too Prevent users from creating EXTERNAL tables for Big Data SQL And in the meantime, you shall discover practical information by reading one of our partner white-papers. 1: Database Vault is a database option and has to be licensed accordingly on the Oracle Database Enterprise Edition only. Notice that Database Cloud Service High Performance and Extreme Performance as well as Exadata Cloud Service and Exadata Cloud at Customer have this capability included into the cloud subscription.   Thanks to Alan, Alexey and Martin for their helpful reviews!
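To make the realm concept a bit more concrete in a Big Data SQL context, here is a minimal sketch, assuming Database Vault is already enabled and you are connected as a Database Vault owner; the schema, realm and grantee names are illustrative, and you should verify the exact DBMS_MACADM parameters against the documentation for your release.

#!/bin/bash
# Minimal sketch, illustrative names only: protect the schema that holds the
# Big Data SQL external tables with a realm and authorize only the application user.
sqlplus -s dbv_owner/dbv_password@mydb <<'SQL'
BEGIN
  -- Zone of protection around the schema holding the external tables
  DBMS_MACADM.CREATE_REALM(
    realm_name    => 'BDS Realm',
    description   => 'Protects Big Data SQL external tables in the BDS schema',
    enabled       => DBMS_MACUTL.G_YES,
    audit_options => DBMS_MACUTL.G_REALM_AUDIT_FAIL);

  -- Register all objects of the BDS schema in the realm
  DBMS_MACADM.ADD_OBJECT_TO_REALM(
    realm_name   => 'BDS Realm',
    object_owner => 'BDS',
    object_name  => '%',
    object_type  => '%');

  -- Authorize only the application user; DBAs keep their system privileges
  -- but can no longer read the protected data
  DBMS_MACADM.ADD_AUTH_TO_REALM(
    realm_name    => 'BDS Realm',
    grantee       => 'BIKES',
    rule_set_name => NULL,
    auth_options  => DBMS_MACUTL.G_REALM_AUTH_OWNER);
END;
/
SQL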


Learn more about using Big Data Manager - importing data, notebooks and other useful things

In one of the previous posts on this blog (see How Easily You Can Copy Data Between Object Store and HDFS) we discussed some functionality enabled by a tool called Big Data Manager, based upon the distributed (Spark-based) copy utility. Since then a lot of useful features have been added to Big Data Manager, and to share them with the world these are now recorded and published on YouTube. The library consists of a number of videos with the following topics (the video library is here): Working with Archives, File Imports, Working with Remote Data, and Importing Notebooks from GitHub. For some background, Big Data Manager is a utility that is included with Big Data Cloud Service, Big Data Cloud at Customer and soon with Big Data Appliance. Its primary goal is to enable users to quickly achieve tasks like copying files and publishing data via a Notebook interface. In this case, the interface is based on / leverages Zeppelin notebooks. The notebooks run on a node within the cluster and have direct access to the local data elements. As is shown in some of the videos, Big Data Manager enables easy file transport between Object Stores (incl. Oracle's and Amazon's) and HDFS. This transfer is based on ODCP, which leverages Apache Spark in the cluster to enable high-volume and high-performance file transfers. You can see more here: Free new tutorial: Quickly uploading files with Big Data Manager in Big Data Cloud Service


Big Data

Oracle Big Data Lite 4.11 is Available

The latest release of Oracle Big Data Lite is now available for download on OTN.  Version 4.11 has the following products installed and configured: Oracle Enterprise Linux 6.9 Oracle Database 12c Release 1 Enterprise Edition (12.1.0.2) - including Oracle Big Data SQL-enabled external tables, Oracle Multitenant, Oracle Advanced Analytics, Oracle OLAP, Oracle Partitioning, Oracle Spatial and Graph, and more. Cloudera Distribution including Apache Hadoop (CDH5.13.1) Cloudera Manager (5.13.1) Oracle Big Data Spatial and Graph 2.4 Oracle Big Data Connectors 4.11 Oracle SQL Connector for HDFS 3.8.1 Oracle Loader for Hadoop 3.9.1 Oracle Data Integrator 12c (12.2.1.3.0) Oracle R Advanced Analytics for Hadoop 2.7.1 Oracle XQuery for Hadoop 4.9.1 Oracle Data Source for Apache Hadoop 1.2.1 Oracle Shell for Hadoop Loaders 1.3.1 Oracle NoSQL Database Enterprise Edition 12cR1 (4.5.12) Oracle JDeveloper 12c (12.2.1.2.0) Oracle SQL Developer and Data Modeler 17.3.1 with Oracle REST Data Services 3.0.7 Oracle Data Integrator 12cR1 (12.2.1.3.0) Oracle GoldenGate 12c (12.3.0.1.2) Oracle R Distribution 3.3.0 Oracle Perfect Balance 2.10.0 Check out the download page for the latest samples and useful links to help you get started with Oracle's Big Data platform. Enjoy!


Big Data

Free new tutorial: Quickly uploading files with Big Data Manager in Big Data Cloud Service

Sometimes the simplest tasks make life (too) hard. Consider simple things like uploading some new data sets into your Hadoop cluster in the cloud and then getting to work on the thing you really need to do: analyzing that data. This new free tutorial shows you how to easily and quickly do the grunt work with Big Data Manager in Big Data Cloud Service (learn more here), enabling you to worry about analytics, not moving files. The approach taken here is to take a file that resides on your desktop, and drag and drop that into HDFS on Oracle Big Data Cloud Service... as easy as that, and you are now off doing analytics by right-clicking and adding the data into a Zeppelin Notebook. Within the notebook, you get to see how Big Data Manager enables you to quickly generate a Hive schema definition from the data set and then start to do some analytics. Mechanics made easy! You can, and always should, look at leveraging Object Storage as your entry point for data, as discussed in this other Big Data Manager how-to article: See How Easily You Can Copy Data Between Object Store and HDFS. For more advanced analytics, have a look at Oracle's wide-ranging set of cloud services or open source tools like R, and the high-performance version of R: Oracle R Advanced Analytics for Hadoop.


Big Data

New Release: BDA 4.11 is now Generally Available

As promised, this update to Oracle Big Data Appliance came fast. We just uploaded the bits and are in the process of uploading both the documentation and the configurator. You can find the latest software on MyOracleSupport. So what is new: BDA Software 4.11.0 contains only a few new things, but is intended to keep our software releases close to the Cloudera releases, as discussed in this roadmap post. This latest version uptakes: Cloudera CDH 5.13.1 and Cloudera Manager 5.13.1. Parcels for Kafka 3.0, Spark 2.2 and Key Trustee Server 5.13 are included in the BDA Software Bundle. Kudu is now included in the CDH parcel. The team did a number of small but significant updates: Cloudera Manager cluster hosts are now configured with TLS Level 3 - this includes encrypted communication with certificate verification of both Cloudera Manager Server and Agents to verify identity and prevent spoofing by untrusted Agents running on hosts. Update to ODI Agent 12.2.1.3.0. Updates to Oracle Linux 6, JDK 8u151 and MySQL 5.7.20. It is important to remember that with 4.11.0 we no longer support upgrading OL5-based clusters. Review New Release: BDA 4.10 is now Generally Available for some details on this. Links: Documentation: http://www.oracle.com/technetwork/database/bigdata-appliance/documentation/index.html Configurator: http://www.oracle.com/technetwork/database/bigdata-appliance/downloads/index.html That's all folks, more new releases, features and good stuff to come in 2018.


Data Warehousing

SQL Pattern Matching Deep Dive - the book

Those of you with long memories might just be able to recall a whole series of posts I did on SQL pattern matching, which were taken from a deep dive presentation that I prepared for the BIWA User Group Conference. The title of each blog post started with SQL Pattern Matching Deep Dive... and they covered a set of 6 posts: Part 1 - Overview; Part 2 - Using MATCH_NUMBER() and CLASSIFIER(); Part 3 - Greedy vs. reluctant quantifiers; Part 4 - Empty matches and unmatched rows?; Part 5 - SKIP TO where exactly?; Part 6 - State machines. There are a lot of related posts derived from that core set of 6 posts, along with other presentations and code samples. One of the challenges, even when searching via Google, was tracking down all the relevant content. Therefore, I have spent the last 6-9 months converting all my deep dive content into a book - an Apple iBook. I have added a lot of new content based on discussions I have had at user conferences, questions posted on the developer SQL forum, discussions with my development team and some new presentations developed for the OracleCode series of events. To make life easier for everyone I have split the content into two volumes, and just in time for Thanksgiving Volume 1 is now available in the iBook Store - it's free to download! This first volume covers the following topics:
Chapter 1: Introduction. Background to the book and an explanation of how some of the features within the book are expected to work.
Chapter 2: Industry-specific use cases. In this section we will review a series of use cases and provide conceptual, simplified SQL to solve these business requirements using the new SQL pattern matching functionality.
Chapter 3: Syntax for MATCH_RECOGNIZE. The easiest way to explore the syntax of 12c's new MATCH_RECOGNIZE clause is to look at a simple example...
Chapter 4: How to use built-in measures for debugging. In this section I am going to review the two built-in measures that we have provided to help you understand how your data set is mapped to your pattern.
Chapter 5: Patterns and Predicates. This chapter looks at how predicates affect the results returned by MATCH_RECOGNIZE.
Chapter 6: Next Steps. This final section provides links to additional information relating to SQL pattern matching.
Chapter 7: Credits
My objective is that by the end of this two-part series you will have a good, solid understanding of how MATCH_RECOGNIZE works, how it can be used to simplify your application code and how to test your code to make sure it is working correctly. In a couple of weeks I will publish information about the contents of Volume 2 and when I hope to have it finished! As usual, if you have any comments about the contents of the book then please email me directly at keith.laker@oracle.com
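As a small taster of the syntax covered in Chapter 3, here is the classic "V-shape" example that is commonly used to introduce MATCH_RECOGNIZE; it assumes the usual sample ticker table (symbol, tstamp, price) and placeholder connection details.

#!/bin/bash
# Illustrative only: find price patterns that fall and then rise within each symbol.
sqlplus -s scott/tiger@mydb <<'SQL'
SELECT *
FROM   ticker MATCH_RECOGNIZE (
         PARTITION BY symbol
         ORDER BY tstamp
         MEASURES MATCH_NUMBER() AS match_num,   -- which match this row belongs to
                  CLASSIFIER()   AS var_match    -- which pattern variable matched the row
         ALL ROWS PER MATCH
         PATTERN (strt down+ up+)                -- a starting row, a fall, then a rise
         DEFINE
           down AS price < PREV(price),
           up   AS price > PREV(price)
       )
ORDER BY symbol, tstamp;
SQL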


Big Data SQL

Using Materialized Views with Big Data SQL to Accelerate Performance

One of Big Data SQL’s key benefits is that it leverages the great performance capabilities of Oracle Database 12c.  I thought it would be interesting to illustrate an example – and in this case we’ll review a performance optimization that has been around for quite a while and is used at thousands of customers:  Materialized Views (MVs). For those of you who are unfamiliar with MVs – an MV is a precomputed summary table.  There is a defining query that describes that summary.  Queries that are executed against the detail tables comprising the summary will be automatically rewritten to the MV when appropriate: In the diagram above, we have a 1B row fact table stored in HDFS that is being accessed thru a Big Data SQL table called STORE_SALES.  Because we know that users want to query the data using a product hierarchy (by Item), a geography hierarchy (by Region) and a mix (by Class & QTR) – we created three summary tables that are aggregated to the appropriate levels. For example, the “by Item” MV has the following defining query: CREATE MATERIALIZED VIEW mv_store_sales_item ON PREBUILT TABLE ENABLE QUERY REWRITE AS (   select ss_item_sk,          sum(ss_quantity) as ss_quantity,          sum(ss_ext_wholesale_cost) as ss_ext_wholesale_cost,          sum(ss_net_paid) as ss_net_paid,          sum(ss_net_profit) as ss_net_profit   from bds.store_sales   group by ss_item_sk ); Queries executed against the large STORE_SALES that can be satisfied by the MV will now be automatically rewritten: SELECT i_category,        SUM(ss_quantity) FROM bds.store_sales, bds.item_orcl WHERE ss_item_sk = i_item_sk   AND i_size in ('small', 'petite')   AND i_wholesale_cost > 80 GROUP BY i_category; Taking a look at the query’s explain plan, you can see that even though store_sales is the table being queried – the table that satisfied the query is actually the MV called mv_store_sales_item.  The query was automatically rewritten by the optimizer. Explain plan with the MV: Explain plan without the MV: Even though Big Data SQL optimized the join and pushed the predicates and filtering down to the Hadoop nodes – the MV dramatically improved query performance: With MV:  0.27s Without MV:  19s This is to be expected as we’re querying a significantly smaller and partially aggregated data.  What’s nice is that query did not need to change; simply the introduction of the MV sped up the processing. What is interesting here is that the query selected data at the Category level – yet the MV is defined at the Item level.  How did the optimizer know that there was a product hierarchy?  And that Category level data could be computed from Item level data?  The answer is metadata.  A dimension object was created that defined the relationship between the columns: CREATE DIMENSION BDS.ITEM_DIM LEVEL ITEM IS (ITEM_ORCL.I_ITEM_SK) LEVEL CLASS IS (ITEM_ORCL.I_CLASS) LEVEL CATEGORY IS (ITEM_ORCL.I_CATEGORY) HIERARCHY PROD_ROLLUP ( ITEM CHILD OF CLASS CHILD OF   CATEGORY  )  ATTRIBUTE ITEM DETERMINES ( ITEM_ORCL.I_SIZE, ITEM_ORCL.I_COLOR, ITEM_ORCL.I_UNITS, ITEM_ORCL.I_CURRENT_PRICE,I_WHOLESALE_COST ); Here, you can see that Items roll up into Class, and Classes roll up into Category.  The optimizer used this information to allow the query to be redirected to the Item level MV. A good practice is to compute these summaries and store them in Oracle Database tables.  However, there are alternatives.  For example, you may have already computed summary tables and stored them in HDFS.  
You can leverage these summaries by creating an MV over a pre-built Big Data SQL table.  Consider the following example where a summary table was defined in Hive and called csv.mv_store_sales_qtr_class.  There are two steps required to leverage this summary: create a Big Data SQL table over the Hive source, then create an MV over the prebuilt Big Data SQL table.  Let’s look at the details.  First, create the Big Data SQL table over the Hive source (and don’t forget to gather statistics!):

CREATE TABLE MV_STORE_SALES_QTR_CLASS
    (
      I_CLASS VARCHAR2(100)
    , SS_QUANTITY NUMBER
    , SS_WHOLESALE_COST NUMBER
    , SS_EXT_DISCOUNT_AMT NUMBER
    , SS_EXT_TAX NUMBER
    , SS_COUPON_AMT NUMBER
    , D_QUARTER_NAME VARCHAR2(30)
    )
    ORGANIZATION EXTERNAL
    (
      TYPE ORACLE_HIVE
      DEFAULT DIRECTORY DEFAULT_DIR
      ACCESS PARAMETERS
      (
        com.oracle.bigdata.tablename: csv.mv_store_sales_qtr_class
      )
    )
    REJECT LIMIT UNLIMITED;

-- Gather statistics
exec DBMS_STATS.GATHER_TABLE_STATS ( ownname => '"BDS"', tabname => '"MV_STORE_SALES_QTR_CLASS"', estimate_percent => dbms_stats.auto_sample_size, degree => 32 );

Next, create the MV over the Big Data SQL table:

CREATE MATERIALIZED VIEW mv_store_sales_qtr_class
ON PREBUILT TABLE WITH REDUCED PRECISION
ENABLE QUERY REWRITE AS
(
    select i.I_CLASS,
           sum(s.ss_quantity) as ss_quantity,
           sum(s.ss_wholesale_cost) as ss_wholesale_cost,
           sum(s.ss_ext_discount_amt) as ss_ext_discount_amt,
           sum(s.ss_ext_tax) as ss_ext_tax,
           sum(s.ss_coupon_amt) as ss_coupon_amt,
           d.D_QUARTER_NAME
    from DATE_DIM_ORCL d, ITEM_ORCL i, STORE_SALES s
    where s.ss_item_sk = i.i_item_sk
      and s.ss_sold_date_sk = d.d_date_sk
    group by d.D_QUARTER_NAME,
             i.I_CLASS
);

Queries against STORE_SALES that can be satisfied by the MV will be rewritten.  Here, the following query - what is the quarterly performance by category with yearly totals? - used the MV:

select i.i_category,
       d.d_year,
       d.d_quarter_name,
       sum(s.ss_quantity) quantity
from bds.DATE_DIM_ORCL d, bds.ITEM_ORCL i, bds.STORE_SALES s
where s.ss_item_sk = i.i_item_sk
  and s.ss_sold_date_sk = d.d_date_sk
  and d.d_quarter_name in ('2005Q1', '2005Q2', '2005Q3', '2005Q4')
group by rollup (i.i_category, d.d_year, d.D_QUARTER_NAME)

And the query returned in a little more than a second.  Looking at the explain plan, you can see that the query is executed against the MV – and the EXTERNAL TABLE ACCESS (STORAGE FULL) indicates that Big Data SQL Smart Scan kicked in on the Hadoop cluster. MVs within the database can be automatically updated by using change tracking.  However, in the case of Big Data SQL tables, the data is not resident in the database – so the database does not know when the summaries have changed.  Your ETL processing will need to ensure that the MVs are kept up to date – and you will need to set query_rewrite_integrity=stale_tolerated. MVs are an old friend.  They have been used for years to accelerate performance for traditional database deployments.  They are a great tool to use for your big data deployments as well!
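To illustrate the stale_tolerated remark, here is a minimal sketch, with placeholder connection details, that relaxes rewrite integrity for the session and then uses EXPLAIN PLAN to confirm whether a query against the detail table is rewritten to one of the MVs.

#!/bin/bash
# Minimal sketch: allow rewrite against summaries the database cannot track for
# staleness, then check whether the rewrite actually happens.
sqlplus -s bds/bds_password@mydb <<'SQL'
ALTER SESSION SET query_rewrite_enabled   = TRUE;
ALTER SESSION SET query_rewrite_integrity = stale_tolerated;

EXPLAIN PLAN FOR
  SELECT i.i_class, SUM(s.ss_quantity)
  FROM   bds.store_sales s, bds.item_orcl i
  WHERE  s.ss_item_sk = i.i_item_sk
  GROUP  BY i.i_class;

-- If the rewrite kicked in, the plan references MV_STORE_SALES_QTR_CLASS (or
-- another matching MV) instead of the detail STORE_SALES table.
SELECT * FROM TABLE(DBMS_XPLAN.DISPLAY);
SQL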


Big Data SQL

Big Data SQL Quick Start. Correlate real-time data with historical benchmarks – Part 24

In Big Data SQL 3.2 we have introduced a new capability - Kafka as a data source. I've posted some details about how it works, with some simple examples, over here. But now I want to talk about why you would want to run queries over Kafka. Here is the Oracle concept picture of a Data Warehouse: you have a stream (real-time data), a data lake where you land raw information, and cleaned Enterprise data. This is just a concept, which could be implemented in many different ways; one of them is depicted here: Kafka is the hub for streaming events, where you accumulate data from multiple real-time producers and provide this data to many consumers (it could be real-time processing, such as Spark Streaming, or you could load data in batch mode to the next Data Warehouse tier, such as Hadoop). In this architecture, Kafka contains the stream data and is able to answer the question "what is going on right now", whereas in the Database you store operational data and in Hadoop historical data, and those two sources are able to answer the question "how it used to be". Big Data SQL allows you to run SQL over those three sources and correlate real-time events with historical ones.

Example of using Big Data SQL over Kafka and other sources. So, above I've explained the concept of why you may need to query Kafka with Big Data SQL; now let me give a concrete example. Input for the demo example:
- We have a company, called MoviePlex, which sells video content all around the world
- There are two stream datasets - network data, which contains information about network errors, the condition of routing devices and so on, and a second data source that records the movie sales facts
- Both stream into Kafka in real time
- Also, we have historical network data, which we store in HDFS (because of the cost of this data), historical sales data (which we store in the database) and multiple dimension tables, stored in the RDBMS as well
Based on this we have a business case - monitor the revenue flow, correlate current traffic with the historical benchmark (depending on the day of the week and hour of the day) and try to find the reason in case of failures (network errors, for example). Using Oracle Data Visualization Desktop, we've created a dashboard which shows how real-time traffic correlates with the statistical benchmark and also shows the number of network errors by country: the blue line is the historical benchmark. Over time we see that some errors appear in some countries (left dashboard), but current revenue is more or less the same as it used to be. After a while revenue starts going down, and this trend keeps going. There are a lot of network errors in France. Let's drill down into the itemized traffic: indeed, we caught that overall revenue goes down because of France, and the cause is network errors. Conclusion:
1) Kafka stores real-time data and answers the question "what is going on right now"
2) The Database and Hadoop store historical data and answer the question "how it used to be"
3) Big Data SQL can query the data from Kafka, Hadoop and the Database within a single query (joining the datasets)
4) This allows us to correlate historical benchmarks with real-time data within a SQL interface and use this with any SQL-compatible BI tool
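To make the idea concrete, here is a hedged sketch of what such a correlation query could look like; the table and column names (sales_kafka, sales_benchmark, country, hour_of_day, amount) are invented for illustration, the JSON dot notation follows the pattern shown in Part 23, and the connection details are placeholders.

#!/bin/bash
# Sketch: compare the live Kafka feed with a historical benchmark table.
sqlplus -s bds/bds_password@mydb <<'SQL'
SELECT cur.country,
       cur.hour_of_day,
       cur.revenue       AS revenue_now,
       bench.avg_revenue AS revenue_benchmark
FROM  (SELECT t.value.country                AS country,      -- JSON dot notation over the Kafka value
              t.value.hour_of_day            AS hour_of_day,
              SUM(TO_NUMBER(t.value.amount)) AS revenue
       FROM   sales_kafka t                                   -- ORACLE_HIVE table over the Kafka topic
       GROUP  BY t.value.country, t.value.hour_of_day) cur
JOIN   sales_benchmark bench                                  -- historical averages stored in the database
ON     bench.country     = cur.country
AND    bench.hour_of_day = cur.hour_of_day
ORDER  BY cur.country, cur.hour_of_day;
SQL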


Big Data SQL

Big Data SQL Quick Start. Big Data SQL over Kafka – Part 23

The Big Data SQL 3.2 release brings a few interesting features. Among those features, one of the most interesting is the ability to read Kafka. Before drilling down into the details, I'd like to explain in a nutshell what Kafka is.

What is Kafka? You may find the full scope of information about Kafka here, but in a nutshell, it's a distributed, fault-tolerant messaging system. It allows you to connect many systems in an organized fashion. Instead of connecting each system peer to peer, you may land all your messages company-wide on one system and consume them from there, like this: Kafka is a kind of data hub system, where you land the messages and serve them afterwards.

More technical details. I'd like to introduce a few key Kafka terms.
1) Kafka Broker. This is the Kafka service, which you run on each server and which handles all read and write requests
2) Kafka Producer. The process which writes data to Kafka
3) Kafka Consumer. The process which reads data from Kafka
4) Message. The name describes itself; I'll just add that messages have a key and a value. In comparison to NoSQL databases, Kafka's key is not indexed. It has application purposes (you may put some application logic in the key) and administrative purposes (each message with the same key goes to the same partition)
5) Topic. Messages are organized into sets called topics. Database folks would compare a topic to a table
6) Partition. It's good practice to divide a topic into partitions for performance and maintenance purposes. Messages with the same key go to the same partition; if a key is absent, messages are distributed in round-robin fashion
7) Offset. The offset is the position of each message in the topic. The offset is indexed, which allows you to quickly access a particular message.

When do you delete data? One of the basic Kafka concepts is retention - Kafka does not keep data forever, nor does it wait for all consumers to read a message before deleting it. Instead, the Kafka administrator configures a retention period for each topic - either an amount of time for which to store messages before deleting them, or how much data to store; older messages are purged. Two parameters control this: log.retention.ms and log.retention.bytes, the latter being the amount of data to retain in the log for each topic partition. This is a limit per partition: multiply it by the number of partitions to get the total data retained for the topic.

How do you query Kafka data with Big Data SQL? To query the Kafka data you need to create a Hive table first. Let me show an end-to-end example.
I do have a JSON file: $ cat web_clicks.json { click_date: "38041", click_time: "67786", date: "2004-02-26", am_pm: "PM", shift: "second", sub_shift: "evening", item_sk: "396439", web_page: "646"} { click_date: "38041", click_time: "41831", date: "2004-02-26", am_pm: "AM", shift: "first", sub_shift: "morning", item_sk: "90714", web_page: "804"} { click_date: "38041", click_time: "60334", date: "2004-02-26", am_pm: "PM", shift: "second", sub_shift: "afternoon", item_sk: "151944", web_page: "867"} { click_date: "38041", click_time: "53225", date: "2004-02-26", am_pm: "PM", shift: "first", sub_shift: "afternoon", item_sk: "175796", web_page: "563"} { click_date: "38041", click_time: "47515", date: "2004-02-26", am_pm: "PM", shift: "first", sub_shift: "afternoon", item_sk: "186943", web_page: "777"} { click_date: "38041", click_time: "73633", date: "2004-02-26", am_pm: "PM", shift: "second", sub_shift: "evening", item_sk: "118004", web_page: "647"} { click_date: "38041", click_time: "43133", date: "2004-02-26", am_pm: "AM", shift: "first", sub_shift: "morning", item_sk: "148210", web_page: "930"} { click_date: "38041", click_time: "80675", date: "2004-02-26", am_pm: "PM", shift: "second", sub_shift: "evening", item_sk: "380306", web_page: "484"} { click_date: "38041", click_time: "21847", date: "2004-02-26", am_pm: "AM", shift: "third", sub_shift: "morning", item_sk: "55425", web_page: "95"} { click_date: "38041", click_time: "35131", date: "2004-02-26", am_pm: "AM", shift: "first", sub_shift: "morning", item_sk: "185071", web_page: "118"} and I'm going to load it into Kafka with standard Kafka tool "kafka-console-producer": $ cat web_clicks.json|kafka-console-producer --broker-list bds2:9092,bds3:9092,bds4:9092,bds5:9092,bds6:9092 --topic json_clickstream for a check that messages have appeared in the topic you may use the following command: $ kafka-console-consumer --zookeeper bds1:2181,bds2:2181,bds3:2181 --topic json_clickstream --from-beginning after I've loaded this file into Kafka topic, I create a table in Hive. Make sure that you have oracle-kafka.jar and kafka-clients*.jar in your hive.aux.jars.path: and here: after this you may run follow DDL in the hive: hive> CREATE EXTERNAL TABLE json_web_clicks_kafka row format serde 'oracle.hadoop.kafka.hive.KafkaSerDe' stored by 'oracle.hadoop.kafka.hive.KafkaStorageHandler' tblproperties( 'oracle.kafka.table.key.type'='long', 'oracle.kafka.table.value.type'='string', 'oracle.kafka.bootstrap.servers'='bds2:9092,bds3:9092,bds4:9092,bds5:9092,bds6:9092', 'oracle.kafka.table.topics'='json_clickstream' ); hive> describe json_web_clicks_kafka; hive> select * from json_web_clicks_kafka limit 1; and as soon as hive table been created I create ORACLE_HIVE table in Oracle: SQL> CREATE TABLE json_web_clicks_kafka ( topic varchar2(50), partitionid integer, VALUE varchar2(4000), offset integer, timestamp timestamp, timestamptype integer ) ORGANIZATION EXTERNAL (TYPE ORACLE_HIVE DEFAULT DIRECTORY DEFAULT_DIR ACCESS PARAMETERS ( com.oracle.bigdata.cluster=CLUSTER com.oracle.bigdata.tablename=default.json_web_clicks_kafka ) ) PARALLEL REJECT LIMIT UNLIMITED; here you also have to keep in mind that you need to add oracle -kafka.jar and  kafka -clients*.jar in your bigdata.properties file on the database and on the Hadoop side. I have dedicated the blog about how to do this here. 
Now we are ready to query: SQL> SELECT * FROM json_web_clicks_kafka WHERE ROWNUM<3; json_clickstream 209 { click_date: "38041", click_time: "43213"..."} 0 26-JUL-17 05.55.51.762000 PM 1 json_clickstream 209 { click_date: "38041", click_time: "74669"... } 1 26-JUL-17 05.55.51.762000 PM 1 Oracle 12c provides powerful capabilities for working with JSON, such as dot API. It allows us to easily query the JSON data as a structure:  SELECT t.value.click_date, t.value.click_time FROM json_web_clicks_kafka t WHERE ROWNUM < 3; 38041 40629 38041 48699 Working with AVRO messages. In many cases, customers are using AVRO as flexible self-described format and for exchanging messages through the Kafka. For sure we do support it and doing this in very easy and flexible way. I do have a topic, which contains AVRO messages and I define Hive table over it: CREATE EXTERNAL TABLE web_sales_kafka row format serde 'oracle.hadoop.kafka.hive.KafkaSerDe' stored by 'oracle.hadoop.kafka.hive.KafkaStorageHandler' tblproperties( 'oracle.kafka.table.key.type'='long', 'oracle.kafka.table.value.type'='avro', 'oracle.kafka.table.value.schema'='{"type":"record","name":"avro_table","namespace":"default","fields": [{"name":"ws_sold_date_sk","type":["null","long"],"default":null}, {"name":"ws_sold_time_sk","type":["null","long"],"default":null}, {"name":"ws_ship_date_sk","type":["null","long"],"default":null}, {"name":"ws_item_sk","type":["null","long"],"default":null}, {"name":"ws_bill_customer_sk","type":["null","long"],"default":null}, {"name":"ws_bill_cdemo_sk","type":["null","long"],"default":null}, {"name":"ws_bill_hdemo_sk","type":["null","long"],"default":null}, {"name":"ws_bill_addr_sk","type":["null","long"],"default":null}, {"name":"ws_ship_customer_sk","type":["null","long"],"default":null} ]}', 'oracle.kafka.bootstrap.servers'='bds2:9092', 'oracle.kafka.table.topics'='web_sales_avro' ); describe web_sales_kafka; select * from web_sales_kafka limit 1; Here I define 'oracle.kafka.table.value.type'='avro' and also I have to specify 'oracle.kafka.table.value.schema'. After this we have structure. In a similar way I define a table in Oracle RDBMS: SQL> CREATE TABLE WEB_SALES_KAFKA_AVRO ( "WS_SOLD_DATE_SK" NUMBER, "WS_SOLD_TIME_SK" NUMBER, "WS_SHIP_DATE_SK" NUMBER, "WS_ITEM_SK" NUMBER, "WS_BILL_CUSTOMER_SK" NUMBER, "WS_BILL_CDEMO_SK" NUMBER, "WS_BILL_HDEMO_SK" NUMBER, "WS_BILL_ADDR_SK" NUMBER, "WS_SHIP_CUSTOMER_SK" NUMBER topic varchar2(50), partitionid integer, KEY NUMBER, offset integer, timestamp timestamp, timestamptype INTEGER ) ORGANIZATION EXTERNAL ( TYPE ORACLE_HIVE DEFAULT DIRECTORY "DEFAULT_DIR" ACCESS PARAMETERS ( com.oracle.bigdata.tablename: web_sales_kafka ) ) REJECT LIMIT UNLIMITED ; And we good to query the data! Performance considerations. 1) Number of Partitions. This is the most important thing to keep in mind there is a nice article about how to choose a right number of partitions. For Big Data SQL purposes I'd recommend using a number of partitions a bit more than you have CPU cores on your Big Data SQL cluster. 2) Query fewer columns Use column pruning feature. In other words list only necessary columns in your SELECT and WHERE statements. Here is the example. I've created void PL/SQL function, which does nothing. 
But PL/SQL couldn't be offloaded to the cell side and we will move all the data towards the database side: SQL> create or replace function fnull(input number) return number is Result number; begin Result:=input; return(Result); end fnull; after this I ran the query, which requires one column and checked how much data have been returned to the DB side: SQL> SELECT MIN(fnull(WS_SOLD_DATE_SK)) FROM WEB_SALES_KAFKA_AVRO; "cell interconnect bytes returned by XT smart scan"  5741.81 MB after this I repeat the same test case with 10 columns: SQL> SELECT MIN(fnull(WS_SOLD_DATE_SK)), MIN(fnull(WS_SOLD_TIME_SK)), MIN(fnull(WS_SHIP_DATE_SK)), MIN(fnull(WS_ITEM_SK)), MIN(fnull(WS_BILL_CUSTOMER_SK)), MIN(fnull(WS_BILL_CDEMO_SK)), MIN(fnull(WS_BILL_HDEMO_SK)), MIN(fnull(WS_BILL_ADDR_SK)), MIN(fnull(WS_SHIP_CUSTOMER_SK)), MIN(fnull(WS_SHIP_CDEMO_SK)) FROM WEB_SALES_KAFKA_AVRO; "cell interconnect bytes returned by XT smart scan"  32193.98 MB so, hopefully, this test case clearly shows that you have to use only useful columns 3) Indexes There is no Indexes rather than Offset columns. The fact that you have key column doesn't have to mislead you - it's not indexed. The only offset allows you have a quick random access 4) Warm up your data If you want to read data faster many times, you have to warm it up, by running "select *" type of the queries. Kafka relies on Linux filesystem cache, so for reading the same dataset faster many times, you have to read it the first time. Here is the example - I clean up the Linux filesystem cache dcli -C "sync; echo 3 > /proc/sys/vm/drop_caches" - I tun the first query: SELECT COUNT(1) FROM WEB_RETURNs_JSON_KAFKA t it took 278 seconds. - Second and third time took 92 seconds only. 5) Use bigger Replication Factor Use bigger replication factor. Here is the example. I do have two tables one is created over the Kafka topic with Replication Factor  = 1, second is created over Kafka topic with ith Replication Factor  = 3. SELECT COUNT(1) FROM JSON_KAFKA_RF1 t this query took 278 seconds for the first run and 92 seconds for the next runs SELECT COUNT(1) FROM JSON_KAFKA_RF3 t This query took 279 seconds for the first run, but 34 seconds for the next runs. 6) Compression considerations Kafka supports different type of compressions. If you store the data in JSON or XML format compression rate could be significant. Here is the examples of the numbers, that could be: Data format and compression type Size of the data, GB JSON on HDFS, uncompressed 273.1 JSON in Kafka, uncompressed 286.191 JSON in Kafka, Snappy 180.706 JSON in Kafka, GZIP 52.2649 AVRO in Kafka, uncompressed 252.975 AVRO in Kafka, Snappy 158.117 AVRO in Kafka, GZIP 54.49 This feature may save some space on the disks, but taking into account, that Kafka primarily used for the temporal store (like one week or one month), I'm not sure that it makes any sense. Also, you will pay some performance penalty, querying this data (and burn more CPU).  I've run a query like: SQL> select count(1) from ... and had followed results: Type of compression Elapsed time, sec uncompressed 76 snappy 80 gzip 92 so, uncompressed is the leader. Gzip and Snappy slower (not significantly, but slow). taking into account this as well as fact, that Kafka is a temporal store, I wouldn't recommend using compression without any exeptional need.  7) Use parallelize your processing. If for some reasons you are using a small number of partitions, you could use Hive metadata parameter "oracle.kafka.partition.chunk.size" for increase parallelism. 
This parameter defines the size of an input split. So if you set this parameter to 1MB and your topic holds 4MB in total, it will be processed with 4 parallel splits. Here is the test case:
- Drop the Kafka topic:
$ kafka-topics --delete --zookeeper cfclbv3870:2181,cfclbv3871:2181,cfclbv3872:2181 --topic store_sales
- Create it again with only one partition:
$ kafka-topics --create --zookeeper cfclbv3870:2181,cfclbv3871:2181,cfclbv3872:2181 --replication-factor 3 --partitions 1 --topic store_sales
- Check it:
$ kafka-topics --describe --zookeeper cfclbv3870:2181,cfclbv3871:2181,cfclbv3872:2181 --topic store_sales
...
Topic:store_sales PartitionCount:1 ReplicationFactor:3 Configs:
Topic: store_sales Partition: 0 Leader: 79 Replicas: 79,76,77 Isr: 79,76,77
...
- Check the size of the input file:
$ du -h store_sales.dat
19G store_sales.dat
- Load the data into the Kafka topic:
$ cat store_sales.dat|kafka-console-producer --broker-list cfclbv3870.us2.oraclecloud.com:9092,cfclbv3871.us2.oraclecloud.com:9092,cfclbv3872.us2.oraclecloud.com:9092,cfclbv3873.us2.oraclecloud.com:9092,cfclbv3874.us2.oraclecloud.com:9092 --topic store_sales --request-timeout-ms 30000 --batch-size 1000000
- Create the Hive external table:
hive> CREATE EXTERNAL TABLE store_sales_kafka
row format serde 'oracle.hadoop.kafka.hive.KafkaSerDe'
stored by 'oracle.hadoop.kafka.hive.KafkaStorageHandler'
tblproperties(
 'oracle.kafka.table.key.type'='long',
 'oracle.kafka.table.value.type'='string',
 'oracle.kafka.bootstrap.servers'='cfclbv3870:9092,cfclbv3871:9092,cfclbv3872:9092,cfclbv3873:9092,cfclbv3874:9092',
 'oracle.kafka.table.topics'='store_sales'
);
- Create the Oracle external table:
SQL> CREATE TABLE STORE_SALES_KAFKA (
TOPIC VARCHAR2(50),
PARTITIONID NUMBER,
VALUE VARCHAR2(4000),
OFFSET NUMBER,
TIMESTAMP TIMESTAMP,
TIMESTAMPTYPE NUMBER
)
ORGANIZATION EXTERNAL
( TYPE ORACLE_HIVE
 DEFAULT DIRECTORY DEFAULT_DIR
 ACCESS PARAMETERS
 ( com.oracle.bigdata.tablename=default.store_sales_kafka )
)
REJECT LIMIT UNLIMITED
PARALLEL;
- Run the test query:
SQL> SELECT COUNT(1) FROM store_sales_kafka;
It took 142 seconds.
- Re-create the Hive external table with the 'oracle.kafka.partition.chunk.size' parameter set to 1MB:
hive> CREATE EXTERNAL TABLE store_sales_kafka
row format serde 'oracle.hadoop.kafka.hive.KafkaSerDe'
stored by 'oracle.hadoop.kafka.hive.KafkaStorageHandler'
tblproperties(
 'oracle.kafka.chop.partition'='true',
 'oracle.kafka.partition.chunk.size'='1048576',
 'oracle.kafka.table.key.type'='long',
 'oracle.kafka.table.value.type'='string',
 'oracle.kafka.bootstrap.servers'='cfclbv3870:9092,cfclbv3871:9092,cfclbv3872:9092,cfclbv3873:9092,cfclbv3874:9092',
 'oracle.kafka.table.topics'='store_sales'
);
- Run the query again:
SQL> SELECT COUNT(1) FROM store_sales_kafka;
Now it took only 7 seconds. A 1MB split is quite small; for big topics we recommend using 256MB.

8) Querying small topics. Sometimes you need to query really small topics (a few hundred messages, for example) very frequently. In this case, it makes sense to create the topic with fewer partitions.
Here is the test case example:
- Create a topic with 1000 partitions:
$ kafka-topics --create --zookeeper cfclbv3870:2181,cfclbv3871:2181,cfclbv3872:2181 --replication-factor 3 --partitions 1000 --topic small_topic
- Load only one message into it:
$ echo "test"|kafka-console-producer --broker-list cfclbv3870.us2.oraclecloud.com:9092,cfclbv3871.us2.oraclecloud.com:9092,cfclbv3872.us2.oraclecloud.com:9092,cfclbv3873.us2.oraclecloud.com:9092,cfclbv3874.us2.oraclecloud.com:9092 --topic small_topic
- Create the Hive external table:
hive> CREATE EXTERNAL TABLE small_topic_kafka
row format serde 'oracle.hadoop.kafka.hive.KafkaSerDe'
stored by 'oracle.hadoop.kafka.hive.KafkaStorageHandler'
tblproperties(
 'oracle.kafka.table.key.type'='long',
 'oracle.kafka.table.value.type'='string',
 'oracle.kafka.bootstrap.servers'='cfclbv3870:9092,cfclbv3871:9092,cfclbv3872:9092,cfclbv3873:9092,cfclbv3874:9092',
 'oracle.kafka.table.topics'='small_topic'
);
- Create the Oracle external table:
SQL> CREATE TABLE small_topic_kafka (
topic varchar2(50),
partitionid integer,
VALUE varchar2(4000),
offset integer,
timestamp timestamp,
timestamptype integer
)
ORGANIZATION EXTERNAL
(TYPE ORACLE_HIVE
 DEFAULT DIRECTORY DEFAULT_DIR
 ACCESS PARAMETERS
 ( com.oracle.bigdata.tablename=default.small_topic_kafka )
)
PARALLEL REJECT LIMIT UNLIMITED;
- Query all rows from it:
SQL> SELECT * FROM small_topic_kafka
It took 6 seconds.
- Create a topic with only one partition, put only one message there, and run the same SQL query over it:
$ kafka-topics --create --zookeeper cfclbv3870:2181,cfclbv3871:2181,cfclbv3872:2181 --replication-factor 3 --partitions 1 --topic small_topic
$ echo "test"|kafka-console-producer --broker-list cfclbv3870.us2.oraclecloud.com:9092,cfclbv3871.us2.oraclecloud.com:9092,cfclbv3872.us2.oraclecloud.com:9092,cfclbv3873.us2.oraclecloud.com:9092,cfclbv3874.us2.oraclecloud.com:9092 --topic small_topic
SQL> SELECT * FROM small_topic_kafka
Now it takes only 0.5 seconds.

9) Type of data in Kafka messages. You have a few options for storing data in Kafka messages, and of course you want push-down processing. Big Data SQL supports push-down operations only for JSON. This means that everything you can expose through JSON will be pushed down to the cell side and processed there. Example:
- A query that can be pushed down to the cell side (JSON):
SQL> SELECT COUNT(1) FROM WEB_RETURN_JSON_KAFKA t WHERE t.VALUE.after.WR_ORDER_NUMBER=233183247;
- A query that cannot be pushed down to the cell side (XML):
SQL> SELECT COUNT(1) FROM WEB_RETURNS_XML_KAFKA t WHERE XMLTYPE(t.value).EXTRACT('/operation/col[@name="WR_ORDER_NUMBER"]/after/text()').getNumberVal() = 233183247;
If the amount of data is not significant, you can use Big Data SQL for the processing. If we are talking about big data volumes, you can process the data once and convert it into a different file format on HDFS with a Hive query:
hive> select xpath_int(value,'/operation/col[@name="WR_ORDER_NUMBER"]/after/text()') from WEB_RETURNS_XML_KAFKA limit 1;

10) JSON vs AVRO format in Kafka topics. Continuing from the previous point, you may be wondering which semi-structured format to use. The answer is easy: use whatever your data source produces; there is no significant performance difference between Avro and JSON. For example, a query like:
SQL> SELECT COUNT(1) FROM WEB_RETURNS_avro_kafka t WHERE t.WR_ORDER_NUMBER=233183247;
completes in 112 seconds for JSON and 105 seconds for Avro, and the JSON topic takes 286.33 GB while the Avro topic takes 202.568 GB.
There is some difference, but not enough to be worth converting away from the original format.

How to bring data from OLTP databases into Kafka? Use GoldenGate! Oracle GoldenGate is the well-known product for capturing commit logs on the database side and delivering the changes to a target system. The good news is that Kafka can play the role of the target system. I'll skip the detailed explanation of this feature, because it's already explained in great detail here.

Known issue: running the Kafka broker on a wildcard address. By default, Kafka doesn't use the wildcard address (0.0.0.0) for brokers; it picks a specific IP address. This can be a problem on a multi-network Kafka cluster, where one network is used for the interconnect and a second for external connections. Luckily, there is an easy way to solve this and start the Kafka broker on the wildcard address:
1) Go to: Kafka > Instances (Select Instance) > Configuration > Kafka Broker > Advanced > Kafka Broker Advanced Configuration Snippet (Safety Valve) for kafka.properties
2) and add:
listeners=PLAINTEXT://0.0.0.0:9092
advertised.listeners=PLAINTEXT://server.example.com:9092
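After restarting the broker role with this safety valve in place, it is worth confirming that the broker really is bound to the wildcard address and is still reachable through its advertised name. Here is a minimal sanity check; it assumes the default plaintext port 9092 and reuses the store_sales topic from the examples above, so adjust both for your cluster:

- On the broker host, confirm the listener is bound to the wildcard address (the local address column should show *:9092 or 0.0.0.0:9092 rather than a single interface IP):
$ ss -ltn | grep 9092
- From a client on the external network, check that metadata and messages are reachable through the advertised listener name:
$ kafka-console-consumer --bootstrap-server server.example.com:9092 --topic store_sales --from-beginning --max-messages 1

If the first check still shows a specific IP address, the broker role has not picked up the new listeners setting and needs another restart.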


Oracle Big Data SQL 3.2 is Now Available

Big Data SQL 3.2 has been released and is now available for download on edelivery. This new release has many exciting new features, with a focus on simpler install and configuration, support for new data sources, enhanced security and improved performance. Big Data SQL has expanded its data source support to include querying data streams, specifically Kafka topics. This enables streaming data to be joined with dimensions and facts in Oracle Database or HDFS. It's never been easier to combine data from streams, Hadoop and Oracle Database. New security capabilities enable Big Data SQL to automatically leverage underlying authorization rules on source data (i.e. ACLs on HDFS data) and then augment them with Oracle's advanced security policies. In addition, to prevent impersonation, Oracle Database servers now authenticate against Big Data SQL Server cells. Finally, secure Big Data SQL installations have become much easier to set up; Kerberos ticket renewals are now automatically configured. There have been significant performance improvements as well. Oracle now provides its own optimized Parquet driver, which delivers a significant performance boost, both in terms of speed and in the ability to query many columns. Support for CLOBs is also now available, which facilitates efficient processing of large JSON and XML documents. Finally, there have been significant enhancements to the out-of-box experience: the installation process has been simplified, streamlined and made much more robust.


Roadmap Update for Big Data Appliance Releases (4.11 and beyond)

With the release of BDA version 4.10 we added a number of interesting features, but for various reasons we slipped behind our targets in taking up the Cloudera updates within a reasonable time. To understand what we do before we ship the latest CDH on BDA, and why we think that time is well spent, review this post. That said, we have decided to rejigger the releases and do the following:
Focus BDA 4.11 solely on taking up the latest CDH 5.13.1 and the related OS and Java updates, thus catching up with the CDH release timeline
Move all features that were planned for 4.11 to the next release, which will then be on track to take up CDH 5.14 on our regular schedule
So what does this mean in terms of release timeframes, and what does it mean for what we talked about at OpenWorld for BDA (shown in the image below; review the full slide deck, including our cloud updates, at the OpenWorld site)?
BDA version 4.11.0 will have the following updates:
Uptake of CDH 5.13.1 - as usual, because we will be very close to the first update to 5.13, we will include that and time our BDA release as close to it as possible. This would get us to BDA 4.11.0 around mid-December, assuming the CDH update keeps its dates
Update to the latest OS versions, kernel, etc., bringing Oracle Linux 6 up to date and including all security patches
Update MySQL and Java, and again ensure all security patches are included
BDA version 4.12.0 will have the following updates:
Uptake of CDH 5.14.x - we are still evaluating the dates and timing for this CDH release and whether we go with the .0 or .1 version. The goal is to deliver this release 4 weeks or so after CDH drops. Expect early calendar 2018, with more precise updates coming to this forum as we know more.
Include roadmap features as follows:
Dedicated Kafka cluster on BDA nodes
Full cluster on OL7 (aligning with the OL7 edge nodes)
Big Data Manager available on BDA
Non-stop Hadoop, continuing to make more and more components HA out of the box
Fully managed BDA edge nodes
The usual OS, Java and MySQL updates per the normal release cadence
Updates to related components like Big Data Connectors etc.
All of this means that we pulled in the 4.11.0 version to the mid-December time frame, while we pushed out the 4.12.0 version by no more than a week or so... So, this looks like a win-win on all fronts. Please note the Safe Harbor Statement below: The preceding is intended to outline our general product direction. It is intended for information purposes only, and may not be incorporated into any contract. It is not a commitment to deliver any material, code, or functionality, and should not be relied upon in making purchasing decisions. The development, release, and timing of any features or functionality described for Oracle's products remains at the sole discretion of Oracle.


Review of Big Data Warehousing at OpenWorld 2017 - Now Available

Did you miss OpenWorld 2017? Then my latest book is definitely something you will want to download! If you went to OpenWorld, this book is also for you, because it covers the most important big data warehousing messages and sessions from the five days of OpenWorld. Following on from OpenWorld 2017, I have put together a comprehensive review of all the big data warehousing content from the conference, including all the key sessions and announcements from this year's Oracle OpenWorld. This review guide contains the following information:
Chapter 1 Welcome - an overview of the contents.
Chapter 2 Let's Go Autonomous - all you need to know about Oracle's new, fully-managed Autonomous Data Warehouse Cloud. This was the biggest announcement at OpenWorld, so this chapter contains videos, presentations and podcasts to get you up to speed on this completely new data warehouse cloud service.
Chapter 3 Keynotes - relive OpenWorld 2017 by watching the most important highlights from this year's conference with our on-demand video service, which covers all the major keynote sessions.
Chapter 4 Key Presenters - a list of the most important speakers by product area, such as database, cloud, analytics, developer and big data. Each biography includes all relevant social media sites and pages.
Chapter 5 Key Sessions - a list of all the most important sessions, with links to download the related presentations.
Chapter 6 Staying Connected - details of all the links you need to keep up to date on Oracle's strategy and products for Data Warehousing and Big Data. This covers all our websites, blogs and social media pages.
This review is available in three formats: 1) For highly evolved users, i.e. Apple users, who understand the power of Apple's iBook format, your multi-media enabled iBook version is available here. 2) For Windows users who are forced to endure a 19th-Century style technological experience, your PDF version is available here. 3) For Linux users, Oracle DBAs and other IT dinosaurs, all of whom are allergic to all graphical user interfaces, the basic version of this comprehensive review is available here.
I hope you enjoy this review and look forward to seeing you next year at OpenWorld 2018, October 28 to November 1. If you'd like to be notified when registration opens for next year's Oracle OpenWorld, register your email address here.


New Release: BDA 4.10 is now Generally Available

As of today, BDA version 4.10 is Generally Available. As always, please refer to "If You Struggle With Keeping your BDAs up to date, Then Read This" to learn about the innovative release process we use for BDA software. This new release includes a number of features and updates:
Support for Migration From Oracle Linux 5 to Oracle Linux 6 - Clusters on Oracle Linux 5 must first be upgraded to v4.10.0 on Oracle Linux 5 and can then be migrated to Oracle Linux 6. This process must be done one server at a time; HDFS data and Cloudera Manager roles are retained. Please review the documentation for the entire process carefully before starting. BDA v4.10 is the last release built for Oracle Linux 5, and no further upgrades for Oracle Linux 5 will be released.
Updates to NoSQL DB, Big Data Connectors, Big Data Spatial & Graph: Oracle NoSQL Database 4.5.12, Oracle Big Data Connectors 4.10.0, Oracle Big Data Spatial & Graph 2.4.0
Support for Oracle Big Data Appliance X7 systems - Oracle Big Data Appliance X7 is based on the X7-2L server. The major enhancements in Big Data Appliance X7-2 hardware are:
CPU update: 2 x 24-core Intel Xeon processors
Updated disk drives: 12 x 10TB 7,200 RPM SAS drives
2 x M.2 150GB SATA SSD drives (replacing the internal USB drive)
Vail Disk Controller (HBA)
Cisco 93108TC-EX-1G Ethernet switch (replacing the Catalyst 4948E)
Spark 2 Deployed by Default - Spark 2 is now deployed by default on new clusters and also during upgrade of clusters where it is not already installed.
Oracle Linux 7 can be Installed on Edge Nodes - Oracle Linux 7 is now supported for installation on Oracle Big Data Appliance edge nodes running on X7-2L, X6-2L or X5-2L servers. Support for Oracle Linux 7 in this release is limited to edge nodes.
Support for Cloudera Data Science Workbench - Support for Oracle Linux 7 on edge nodes provides a way for customers to host Cloudera Data Science Workbench (CDSW) on Oracle Big Data Appliance. CDSW is a web application that enables access from a browser to R, Python, and Scala on a secured cluster. Oracle Big Data Appliance does not include licensing or official support for CDSW; contact Cloudera for licensing requirements.
Scripts for Download & Configuration of Apache Zeppelin, Jupyter Notebook, and RStudio - This release includes scripts to assist in the download and configuration of these commonly used tools. The scripts are provided as a convenience to users. Oracle Big Data Appliance does not include official support for the installation and use of Apache Zeppelin, Jupyter Notebook, or RStudio.
Improved Configuration of Oracle's R Distribution and ORAAH - For these tools, much of the environment configuration that was previously done by the customer is now automated.
Node Migration Optimization - Node migration time has been improved by eliminating some steps.
Support for Extending Secure NoSQL DB clusters
This release is based on Cloudera Enterprise (CDH 5.12.1 & Cloudera Manager 5.12.1) as well as Oracle NoSQL Database (4.5.12). Cloudera Enterprise 5 includes CDH (Core Hadoop), Cloudera Manager, Apache Spark, Apache HBase, Impala, Cloudera Search and Cloudera Navigator. The BDA continues to support all security options for CDH Hadoop clusters: Kerberos authentication (MIT or Microsoft Active Directory), Sentry authorization, HTTPS/network encryption, transparent HDFS disk encryption, and secure configuration for Impala, HBase, Cloudera Search and all Hadoop services, configured out of the box.
Parcels for Kafka 2.2, Spark 2.2, Kudu 1.4 and Key Trustee Server 5.12 are included in the BDA Software Bundle  


Announcing: Big Data Appliance X7-2 - More Power, More Capacity

Big Data Appliance X7-2 is the 6th hardware generation of Oracle's leading Big Data platform, continuing the platform's evolution from Hadoop workloads to Big Data, SQL, Analytics and Machine Learning workloads. Big Data Appliance combines dense IO with dense compute in a single server form factor. The single form factor enables our customers to build a single data lake, rather than replicating data across more specialized lakes.
What is New? The current X7-2 generation is based on the latest Oracle Sun X7-2L servers, and leverages that infrastructure to deliver enterprise-class hardware for big data workloads. The latest generation sports more cores, more disk space and the same amount of memory per server. Big Data Appliance retains its InfiniBand internal network, supported by a multi-homed Cloudera CDH cluster setup. The details can be found in the updated data sheet.
Why a Single Form Factor? Many customers are embarking on a data unification effort, and the main data management concept used in that effort is the data lake. Within this data lake, we see and recommend a set of workloads to be run, as shown in this logical architecture: in essence, the data lake will host the Innovation or Discovery Lab workloads as well as the Execution or production workloads on the same systems. This means we need an infrastructure that can both handle large data volumes in a cost-effective manner and handle high compute volumes on a regular basis. Leveraging the hardware footprint of BDA enables us to run both of these workloads. The servers come with 2 x 24 cores AND 12 x 10TB drives, enabling very large data volumes and CPUs spread across a number of workloads. So rather than dealing with various form factors and copying data from the main data lake to a sideshow Discovery Lab, BDA X7-2 consolidates these workloads. The other increasingly important data set in the data lake is streaming into the organization, typically via Apache Kafka. Both the CPU counts and the memory footprint can support a great Kafka cluster, connected over InfiniBand to the main HDFS data stores. Again, while these nodes are very IO-dense for Kafka, the simplicity of using the same nodes for any of the workloads makes Big Data Appliance a great Big Data platform choice.
What is in the Box? Apart from the hardware specs, the software included in Big Data Appliance enables the data lake creation in a single software and hardware combination. Big Data Appliance comes with the full Cloudera stack, enabling the data lake as drawn above, with Kafka, HDFS and Spark all included in the cost of the system. The specific licensing for Big Data Appliance makes the implementation cost effective, and added to the simplicity of a single form factor, it makes Big Data Appliance an ideal platform to implement and grow the data lake into a successful venture.


OpenWorld 2017 - Must-See Sessions for Day 1

It all starts today - OpenWorld 2017. Each day I will provide you with a list of must-see sessions and hands-on labs. This is going to be one of the most exciting OpenWorlds ever! Today is Day 1, so here is my definitive list of must-see sessions for the opening day. The list is packed full of really excellent speakers such as Franck Pachot, Ami Aharonovich, Galo Balda and Rich Niemiec. These sessions are what Oracle OpenWorld is all about: the chance to learn from the real technical experts. Of course you need to end your first day in Moscone North Hall D for Larry Ellison's welcome keynote - it's going to be a great one!
SUNDAY'S MUST-SEE GUIDE
Don't worry if you are not able to join us in San Francisco for this year's conference, because I will be providing a comprehensive review after the conference closes on Thursday. The review will include links to download the presentations for each of my must-see sessions and links to any hands-on lab content as well. If you are here in San Francisco then enjoy the conference - it's going to be an awesome one this year. Don't forget to make use of our Big DW #oow17 smartphone app, which you can access by pointing your phone at this QR code:


UPDATED: Big Data Warehousing Must See Guide for Oracle OpenWorld 2017

** NEW ** Chapter 5   *** UPDATED *** Must-See Guide now available as PDF and via the Apple iBooks Store. This updated version now contains details of all the most important hands-on labs AND a day-by-day calendar. This means that our comprehensive guide now covers absolutely everything you need to know about this year's Oracle OpenWorld conference, so when you arrive at the Moscone Conference Center you are ready to get the absolute most out of this amazing conference. The updated, and still completely free, big data warehousing Must-See Guide for OpenWorld 2017 is now available for download from the Apple iBooks Store - click here, and in PDF format - click here. Just so you know, this guide contains the following information:
Page 8 - On-Demand Videos
Page 11 - Justify Your Trip
Page 18 - Key Presenters
Page 39 - Must-See Sessions
Page 83 - Must-See Day-by-Day
Page 150 - Useful Maps
Chapter 1 - Introduction to the must-see guide.
Chapter 2 - A guide to the key highlights from last year's conference so you can relive the experience or see what you missed. Catch the most important highlights from last year's OpenWorld with our on-demand video service, which covers all the major keynote sessions. Sit back and enjoy the highlights. The second section explains why you need to attend this year's conference and how to justify it to your company.
Chapter 3 - Full list of Oracle Product Management and Development presenters who will be at this year's OpenWorld. Links to all their social media sites are included alongside each profile. Read on to find out about the key people who can help you and your teams build the FUTURE using Oracle's Data Warehouse and Big Data technologies.
Chapter 4 - List of the "must-see" sessions and hands-on labs at this year's OpenWorld by category. It includes all the sessions and hands-on labs by the Oracle Product Management and Development teams along with key customer sessions. Read on for the list of the best, most innovative sessions at Oracle OpenWorld 2017.
Chapter 5 - Day-by-day "must-see" guide. It includes all the sessions and hands-on labs by the Oracle Product Management and Development teams along with key customer sessions. Read on for the list of the best, most innovative sessions at Oracle OpenWorld 2017.
Chapter 6 - Details of all the links you need to keep up to date on Oracle's strategy and products for Data Warehousing and Big Data. This covers all our websites, blogs and social media pages.
Chapter 7 - Details of our exclusive web application for smartphones and tablets, which provides you with a complete guide to everything related to data warehousing and big data at OpenWorld 2017.
Chapter 8 - Information to help you find your way around the area surrounding the Moscone Conference Center; this section includes some helpful maps.
Let me know if you have any comments. Enjoy, and see you in San Francisco.


Secure your Hadoop Cluster

Security is a very important aspect of many projects and you must not underestimate it. Hadoop security is complex and consists of many components, and it's best to enable the security features one by one. Before explaining the different security options, I'll share some materials that will help you get familiar with the algorithms and technologies that underpin many security features in Hadoop.

Before you begin
First of all, I recommend that you watch this excellent video series, which explains how asymmetric keys and the RSA algorithm work (this is the basis for SSL/TLS). Then, read this blog about TLS/SSL principles. Also, if you mix up terms such as Active Directory, LDAP and OpenLDAP, it will be useful to check this page. After you are familiar with the concepts, you can concentrate on the implementation scenarios for Hadoop.

Security building blocks
There are a few building blocks of a secure system:
- Authentication. Answers the question "who are you?". It's like a passport check: if someone claims to be Bill Smith, he has to prove it (pass authentication) with a document such as a passport.
- Authorization. After passing authentication the user can be trusted (we checked his passport and made sure he really is Bill Smith), and the next question is what this user is allowed to do on the cluster. A cluster is usually shared between multiple users and groups, and not every dataset should be available to everyone. This is authorization.
- Encryption in motion. Hadoop is a distributed system, so data definitely moves over the network, and that traffic can be intercepted. To prevent this we have to encrypt it. This is called encryption in motion.
- Encryption at rest. Data is stored on hard disks, and a privileged user (like root) may have access to those disks and be able to read any directory, including those that store Hadoop data. Disks can also be physically stolen and mounted on another machine for later access. To protect your data from this kind of vulnerability, you have to encrypt it on disk. This is called encryption at rest.
- Audit. Sometimes a data breach does happen, and the only thing you can do is determine the channel of the breach. For this you need audit tools.

Step 1. Authentication in Hadoop. Motivation
By design, Hadoop doesn't have any security. If you spin up a cluster, it is insecure by default. It assumes a level of trust, and assumes that only trusted users have access to the cluster. In HDFS, files and folders have permissions - similar to Linux - and users access files based on access control lists, or ACLs. See the diagram below that highlights the different access paths into the cluster: shell/CLI, JDBC, and tools like Hue. As you can see below, each file/folder has access privileges assigned. The oracle user is the owner of the items, and other users can access the items based on the access control definition (e.g. -rw-r--r-- means that oracle can read/write the file, users in the oinstall group can read the file, and the rest of the world can also read the file).
$ hadoop fs -ls
Found 57 items
drwxr-xr-x   - oracle hadoop          0 2017-05-25 15:20 binTest
drwxr-xr-x   - oracle hadoop          0 2017-05-30 14:04 clustJobJune
drwxr-xr-x   - oracle hadoop          0 2017-05-26 11:47 exercise0
drwxr-xr-x   - oracle hadoop          0 2017-05-22 16:07 hierarchyIndex
drwxr-xr-x   - oracle hadoop          0 2017-05-22 16:15 hierarchyIndexWithCities
drwxr-xr-x   - oracle hadoop          0 2017-05-22 16:46 hive_examples
...

At this very "relaxed" security level it is very easy to subvert these ACLs. Because there is no authentication, a user can impersonate someone else; the identity is simply the current login identity on the client machine. So a malicious user can define an "hdfs" user (a power user in Hadoop) on their local machine, access the cluster, and then delete important financial and healthcare data (a minimal sketch of this impersonation appears just below, before the Kerberos setup details). Additionally, access to data through tools like Hive is also completely open: the user passed as part of the JDBC connection is used for data authorization, and you can specify any(!) user you want - there is no authentication - so all data in Hive is open for query. If you care about the data, this is a real problem.

Step 1. Authentication in Hadoop. Edge node
Instead of enabling connectivity from any client, an edge node (think of it as a client node) is created that users log into and that has access to the cluster. Access to the Hadoop cluster is prohibited from servers other than this edge node. The edge node is used to:
- Run jobs and interact with the Hadoop cluster
- Run all gateway services of the Hadoop components
- Establish user identity (trusted authentication happens when the user logs into this node)

Because a user logged into this edge node does not have the ability to alter his or her identity, the identity can be trusted. This means that HDFS ACLs now have some meaning:
- User identity is established on the edge node
- Clients connect only through known access paths and hosts

Note: HDFS also has an extended ACL feature, which allows extended security lists so that you can grant permissions outside of the owning group.

$ hadoop fs -mkdir /user/oracle/test_dir
$ hdfs dfs -getfacl /user/oracle/test_dir
# file: /user/oracle/test_dir
# owner: oracle
# group: hadoop
user::rwx
group::r-x
other::r-x

$ hdfs dfs -setfacl -m user:ben:rw- /user/oracle/test_dir
$ hdfs dfs -getfacl /user/oracle/test_dir
# file: /user/oracle/test_dir
# owner: oracle
# group: hadoop
user::rwx
user:ben:rw-
group::r-x
mask::rwx
other::r-x

Challenge: JDBC is still insecure - the user identified in the JDBC connect string is not authenticated. Here is an example of how I can use the beeline tool from the CLI to work on behalf of a "superuser" who may do whatever he wants on the cluster:

$ beeline
...
beeline> !connect jdbc:hive2://localhost:10000/default;
...
Enter username for jdbc:hive2://localhost:10000/default;: superuser
Enter password for jdbc:hive2://localhost:10000/default;: *
...
0: jdbc:hive2://localhost:10000/default> select current_user();
...
+------------+--+
|    _c0     |
+------------+--+
| superuser  |
+------------+--+

To ensure that identities are trusted, we need to introduce a capability that you probably use all the time without even knowing it: Kerberos.

Step 1. Authentication in Hadoop. Kerberos
Kerberos ensures that both users and services are authenticated.
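To make the impersonation risk above concrete, here is a minimal sketch of what an unauthenticated "identity" amounts to. It assumes a cluster that is not yet Kerberized, where Hadoop clients fall back to simple authentication and honor the standard HADOOP_USER_NAME environment variable; the directory names are purely illustrative:

$ whoami
intruder
// on a simple-auth (non-Kerberos) cluster the client-supplied identity is all that counts,
// so overriding it is enough to act as the hdfs superuser
$ export HADOOP_USER_NAME=hdfs
$ hadoop fs -rm -r /data/finance
$ hadoop fs -chown intruder /data/healthcare

With Kerberos enabled, the same commands fail until the caller presents a valid ticket for a real principal, which is exactly what the rest of this step sets up.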
Kerberos is *the* authentication mechanism for Hadoop deployments: before interacting with the cluster, a user has to obtain a Kerberos ticket (think of it as a passport). The two most common ways to use Kerberos with Oracle Big Data Appliance are: a) Active Directory Kerberos, b) MIT Kerberos. On the Oracle support site you will find step-by-step instructions for enabling both configurations. Check "BDA V4.2 and Higher Active Directory Kerberos Install and Upgrade Frequently Asked Questions (FAQ) (Doc ID 2013585.1)" for the Active Directory implementation and "Instructions to Enable Kerberos on Oracle Big Data Appliance with Mammoth V3.1/V4.* Release (Doc ID 1919445.1)" for MIT Kerberos. Oracle recommends using local MIT Kerberos for system services like HDFS and YARN, and AD Kerberos for human users (like the user John Smith). Also, if you want to set up a trust relationship between them (so that AD users can work with the cluster), follow support note "How to Set up a Cross-Realm Trust to Configure a BDA MIT Kerberos Enabled Cluster with Active Directory on BDA V4.5 and Higher (Doc ID 2198152.1)".

Note: Big Data Appliance greatly simplifies the deployment of highly available Kerberos on a Hadoop cluster. You do not need to (and should not) set up Kerberos manually; use the tools provided by the BDA.

Using MOS 1919445.1 I enabled MIT Kerberos on my BDA cluster. So, what has changed in my daily life with the Hadoop cluster? First of all, I try to list files in HDFS:

# id
uid=0(root) gid=0(root) groups=0(root) context=unconfined_u:unconfined_r:unconfined_t:s0-s0:c0.c1023
# hadoop fs -ls /
...
Caused by: java.io.IOException: javax.security.sasl.SaslException: GSS initiate failed [Caused by GSSException: No valid credentials provided (Mechanism level: Failed to find any Kerberos tgt)]
...

Oops... something seems to be missing. I cannot access the data in HDFS - "No valid credentials provided". In order to gain access to HDFS, I must first obtain a Kerberos ticket:

# kinit oracle/scajvm1bda01.vm.oracle.com
kinit: Client not found in Kerberos database while getting initial credentials

Still not able to access HDFS! That's because the user principal must first be added to the Key Distribution Center, or KDC. As the Kerberos admin, add the principal:

# kadmin.local -q "addprinc oracle/scajvm1bda01.vm.oracle.com"

Now I can successfully obtain the Kerberos ticket:

# kinit oracle/scajvm1bda01.vm.oracle.com
Password for oracle@ORACLE.TEST:

Practical Tip: Use keytab files
Here we go! I'm ready to work with my Hadoop cluster. But I don't want to enter the password every single time I obtain a ticket (and this matters for services as well). For this, I need to create a keytab file.
# kadmin.local
kadmin.local: xst -norandkey -k oracle.scajvm1bda01.vm.oracle.com.keytab oracle/scajvm1bda01.vm.oracle.com

and now I can obtain a new Kerberos ticket without the password:

# kinit -kt oracle.scajvm1bda01.vm.oracle.com.keytab oracle/scajvm1bda01.vm.oracle.com
# klist
Ticket cache: FILE:/tmp/krb5cc_0
Default principal: oracle/scajvm1bda01.vm.oracle.com@ORALCE.TEST
Valid starting     Expires            Service principal
07/28/17 12:59:28  07/29/17 12:59:28  krbtgt/ORALCE.TEST@ORALCE.TEST
        renew until 08/04/17 12:59:28

Now I can work with the Hadoop cluster on behalf of the oracle user:

# hadoop fs -ls /user/
Found 7 items
drwx------   - hdfs   supergroup          0 2017-07-27 18:21 /user/hdfs
drwxrwxrwx   - mapred hadoop              0 2017-07-27 00:32 /user/history
drwxr-xr-x   - hive   supergroup          0 2017-07-27 00:33 /user/hive
drwxrwxr-x   - hue    hue                 0 2017-07-27 00:32 /user/hue
drwxr-xr-x   - oozie  hadoop              0 2017-07-27 00:34 /user/oozie
drwxr-xr-x   - oracle hadoop              0 2017-07-27 18:57 /user/oracle
drwxr-x--x   - spark  spark               0 2017-07-27 00:34 /user/spark
# hadoop fs -ls /user/oracle
Found 4 items
drwx------   - oracle hadoop          0 2017-07-27 19:00 /user/oracle/.Trash
drwxr-xr-x   - oracle hadoop          0 2017-07-27 18:54 /user/oracle/.sparkStaging
drwx------   - oracle hadoop          0 2017-07-27 18:57 /user/oracle/.staging
drwxr-xr-x   - oracle hadoop          0 2017-07-27 18:57 /user/oracle/oozie-oozi
# hadoop fs -ls /user/spark
ls: Permission denied: user=oracle, access=READ_EXECUTE, inode="/user/spark":spark:spark:drwxr-x--x

WARNING: keep keytab files in a safe directory and set their permissions carefully, because anyone who can read a keytab can impersonate its principal.

It's interesting to note that if you work on the Hadoop servers themselves, many keytab files already exist there; for example, if you want an HDFS ticket you can obtain one easily. To list the principals in the most recently created hdfs keytab file, run:

# klist -ket `find / -name hdfs.keytab -printf "%T+\t%p\n" | sort|tail -1|cut -f 2`

and to obtain the ticket, run:

# kinit -kt `find / -name hdfs.keytab -printf "%T+\t%p\n" | sort|tail -1|cut -f 2` hdfs/`hostname`

Practical Tip: Debug kinit
To debug kinit, export the Linux environment variable KRB5_TRACE=/dev/stdout.
Here is an example:

$ kinit -kt /opt/kafka/security/testuser.keytab testuser@BDACLOUDSERVICE.ORACLE.COM
$ export KRB5_TRACE=/dev/stdout
$ kinit -kt /opt/kafka/security/testuser.keytab testuser@BDACLOUDSERVICE.ORACLE.COM
[88092] 1529001733.290407: Getting initial credentials for testuser@BDACLOUDSERVICE.ORACLE.COM
[88092] 1529001733.292692: Looked up etypes in keytab: aes256-cts, aes128-cts, des3-cbc-sha1, rc4-hmac, des-hmac-sha1, des, des-cbc-crc
[88092] 1529001733.292734: Sending request (230 bytes) to BDACLOUDSERVICE.ORACLE.COM
[88092] 1529001733.292863: Resolving hostname cfclbv3870.us2.oraclecloud.com
[88092] 1529001733.293130: Sending initial UDP request to dgram 10.196.64.44:88
[88092] 1529001733.293587: Received answer from dgram 10.196.64.44:88
[88092] 1529001733.293663: Response was not from master KDC
[88092] 1529001733.293732: Processing preauth types: 19
[88092] 1529001733.293773: Selected etype info: etype aes256-cts, salt "(null)", params ""
[88092] 1529001733.293802: Produced preauth for next request: (empty)
[88092] 1529001733.293830: Salt derived from principal: BDACLOUDSERVICE.ORACLE.COMtestuser
[88092] 1529001733.293857: Getting AS key, salt "BDACLOUDSERVICE.ORACLE.COMtestuser", params ""
[88092] 1529001733.293958: Retrieving testuser@BDACLOUDSERVICE.ORACLE.COM from FILE:/opt/kafka/security/testuser.keytab (vno 0, enctype aes256-cts) with result: 0/Success
[88092] 1529001733.294013: AS key obtained from gak_fct: aes256-cts/7606
[88092] 1529001733.294095: Decrypted AS reply; session key is: aes256-cts/D28B
[88092] 1529001733.294214: FAST negotiation: available
[88092] 1529001733.294264: Initializing FILE:/tmp/krb5cc_1001 with default princ testuser@BDACLOUDSERVICE.ORACLE.COM
[88092] 1529001733.294408: Removing testuser@BDACLOUDSERVICE.ORACLE.COM -> krbtgt/BDACLOUDSERVICE.ORACLE.COM@BDACLOUDSERVICE.ORACLE.COM from FILE:/tmp/krb5cc_1001
[88092] 1529001733.294450: Storing testuser@BDACLOUDSERVICE.ORACLE.COM -> krbtgt/BDACLOUDSERVICE.ORACLE.COM@BDACLOUDSERVICE.ORACLE.COM in FILE:/tmp/krb5cc_1001
[88092] 1529001733.294534: Storing config in FILE:/tmp/krb5cc_1001 for krbtgt/BDACLOUDSERVICE.ORACLE.COM@BDACLOUDSERVICE.ORACLE.COM: fast_avail: yes
[88092] 1529001733.294584: Removing testuser@BDACLOUDSERVICE.ORACLE.COM -> krb5_ccache_conf_data/fast_avail/krbtgt\/BDACLOUDSERVICE.ORACLE.COM\@BDACLOUDSERVICE.ORACLE.COM@X-CACHECONF: from FILE:/tmp/krb5cc_1001
[88092] 1529001733.294618: Storing testuser@BDACLOUDSERVICE.ORACLE.COM -> krb5_ccache_conf_data/fast_avail/krbtgt\/BDACLOUDSERVICE.ORACLE.COM\@BDACLOUDSERVICE.ORACLE.COM@X-CACHECONF: in FILE:/tm/krb5cc_1001

Practical Tip: Obtaining a Kerberos ticket without access to the KDC
In my experience, there can be cases when a client machine cannot access the KDC directly but still needs to work with Kerberos-protected resources. Here is a workaround for it.
First, go to a machine that has access to the KDC and generate a ticket cache:

$ cp /etc/krb5.conf /tmp/TMP_TICKET_CACHE/krb5.conf
$ export KRB5_CONFIG=/tmp/TMP_TICKET_CACHE/krb5.conf
$ export KRB5CCNAME=DIR:/tmp/TMP_TICKET_CACHE/
$ kinit oracle
Password for oracle@BDACLOUDSERVICE.ORACLE.COM:

Copy it to the client machine:

afilanov:ssh afilanov$ scp -i id_rsa_new.dat opc@cfclbv3872.us2.oraclecloud.com:/tmp/TMP_TICKET_CACHE/* /tmp/
Enter passphrase for key 'id_rsa_new.dat':
krb5.conf                                100%  795    12.3KB/s   00:00
primary                                  100%   10     0.2KB/s   00:00
tkt0kVvY6                                100%  874    13.6KB/s   00:00

Point the client at the copied ticket cache file and check that the current user has it:

afilanov:ssh afilanov$ export KRB5_CONFIG=/tmp/krb5.conf
afilanov:ssh afilanov$ export KRB5CCNAME=/tmp/tkt0kVvY6
afilanov:ssh afilanov$ klist
Credentials cache: FILE:/tmp/tkt0kVvY6
        Principal: oracle@BDACLOUDSERVICE.ORACLE.COM
    Issued                Expires               Principal
Jun 18 09:33:10 2018  Jun 19 09:33:10 2018  krbtgt/BDACLOUDSERVICE.ORACLE.COM@BDACLOUDSERVICE.ORACLE.COM

Practical Tip: Access to the WebUI of a Kerberized cluster from a Windows machine
If you want (and you certainly do) to access the WebUIs of your cluster (Resource Manager, JobHistory) from your Windows browser, you will need to obtain a Kerberos ticket on the Windows machine. Here you can find great step-by-step instructions.

Step 1. Authentication in Hadoop. Integrating MIT Kerberos with Active Directory
It is quite common for companies to use an Active Directory server to manage users and groups and to want to give them access to the secure Hadoop cluster according to their roles and permissions. For example, I have my corporate login afilanov and I want to work with the Hadoop cluster as afilanov. To do this you have to build a trust relationship between Active Directory and the MIT KDC on the BDA. You can find all the details in MOS "How to Set up a Cross-Realm Trust to Configure a BDA MIT Kerberos Enabled Cluster with Active Directory on BDA V4.5 and Higher (Doc ID 2198152.1)", but here I'll show a quick example of how it works. First, I log in to the Active Directory server and configure the trust relationship with my BDA KDC:

C:\Users\Administrator> netdom trust ORALCE.TEST /Domain:BDA.COM /add /realm /passwordt:welcome1
C:\Users\Administrator> ksetup /SetEncTypeAttr ORALCE.TEST AES256-CTS-HMAC-SHA1-96

After this I create a user in AD. I skip the detailed explanations here because you can find them in MOS; I just want to show that the user is created on the AD side (not on the Hadoop side). On the KDC side you have to create one more principal as well:

# kadmin.local
Authenticating as principal hdfs/admin@ORALCE.TEST with password.
kadmin.local: addprinc -e "aes256-cts:normal" krbtgt/ORALCE.TEST@BDA.COM # direction from MIT to AD

After this we are ready to use our corporate login/password to work with Hadoop on behalf of this user:

# kinit afilanov@BDA.COM
Password for afilanov@BDA.COM:
# hadoop fs -put file.txt /tmp/
# hadoop fs -ls /tmp/file.txt
-rw-r--r--   3 afilanov supergroup          5 2017-08-08 18:00 /tmp/file.txt

Step 1. Authentication in Hadoop. SSSD integration
Now we can obtain a Kerberos ticket and work with Hadoop as a certain user. It's important to note that on the OS we can be logged in as any user (e.g. we could be logged in as root while working with the Hadoop cluster as a user from AD). Here is an example:

# kinit afilanov@BDA.COM
Password for afilanov@BDA.COM:
# klist
Ticket cache: FILE:/tmp/krb5cc_0
Default principal: afilanov@BDA.COM

Valid starting     Expires            Service principal
09/04/17 17:29:24  09/05/17 03:29:16  krbtgt/BDA.COM@BDA.COM
        renew until 09/11/17 17:29:24
# id
uid=0(root) gid=0(root) groups=0(root) context=unconfined_u:unconfined_r:unconfined_t:s0-s0:c0.c1023

For Hadoop, it's important to have users (and their groups) available through the OS (the user has to exist on each node in the cluster). Services like HDFS perform lookups at the OS level to determine which groups a user belongs to, and then use that information to authorize access to files and folders. But what if I want OS users to come from Active Directory? This is where SSSD steps in. It is a PAM module that forwards user lookups to Active Directory. This means that you don't need to replicate user/group information at the OS level; it simply leverages the information found in Active Directory. See the Oracle Support site for the MOS notes on how to set it up (well-written, detailed instructions): "Guidelines for Active Directory Organizational Unit Setup Required for BDA 4.9 and Higher SSSD Setup (Doc ID 2289768.1)" and "How to Set up an SSSD on BDA V4.9 and Higher (Doc ID 2298831.1)". After you complete all the steps listed there, you can use an AD user/password to log in to the Linux servers of your Hadoop cluster. Here is an example:

# id
uid=0(root) gid=0(root) groups=0(root) context=unconfined_u:unconfined_r:unconfined_t:s0-s0:c0.c1023
# ssh afilanov@$(hostname)
afilanov@scajvm1bda01.vm.oracle.com's password:
Last login: Fri Sep  1 20:16:37 2017 from scajvm1bda01.vm.oracle.com
$ id
uid=825201368(afilanov) gid=825200513(domain users) groups=825200513(domain users) context=unconfined_u:unconfined_r:unconfined_t:s0-s0:c0.c1023

Step 2. Authorization in Hadoop. Sentry
Another powerful security capability of Hadoop is role-based access for Hive queries. In the Cloudera distribution it is managed by Sentry. Kerberos is required for a Sentry installation. As with many things on Big Data Appliance, the installation of Sentry is automated and can be done with one command:

# bdacli enable sentry

followed by the tool's prompts. You can find more information in MOS "How to Add or Remove Sentry on Oracle Big Data Appliance v4.2 or Higher with bdacli (Doc ID 2052733.1)". After you enable Sentry, you can create and enable Sentry policies. I want to mention that Sentry has a strict hierarchy: users always belong to groups, groups have roles, and roles have privileges. You have to follow this hierarchy; you can't assign privileges directly to a user or a group. I will show how to set up these policies with Hue.
For this test case I'm going to create test data and load it into HDFS:

# echo "1, Finance, Bob,100000">> emp.file
# echo "2, Marketing, John,70000">> emp.file
# hadoop fs -mkdir /tmp/emp_table/; hadoop fs -put emp.file /tmp/emp_table/

After creating the file, I log in to Hive and create an external table (for power users) and a view with a restricted set of columns (for limited users):

hive> create external table emp(id int,division string,  name string, salary int) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' location "/tmp/emp_table";
hive> create view emp_limited as select division, name from emp;

Now that I have data in the cluster, I create test users across all nodes (or, if I'm using AD, I create the users/groups there):

# dcli -C "useradd limited_user"
# dcli -C "groupadd limited_grp"
# dcli -C "usermod -g limited_grp limited_user"
# dcli -C "useradd power_user"
# dcli -C "groupadd power_grp"
# dcli -C "usermod -g power_grp power_user"

Now, let's use the user-friendly Hue graphical interface. First, go to the Security tab: we have two objects and no security rules for them yet. Next, I clicked the "Add policy" link and created a policy for the power user, which allows it to read the "emp" table. Then I did the same for limited_user, but allowed it to read only the emp_limited view. Here are my roles with their policies. Now I log in as "limited_user" and list the tables: only emp_limited is available. Let's query it - perfect. The "emp" table is not in my list of tables, but let's imagine that I know the name and try to query it anyway: my attempt fails because of a lack of privileges. This highlights table-level granularity, but Sentry also allows you to restrict access to specific columns. I reconfigure power_role to allow it to select all columns except "salary". After this I run test queries with and without the salary column: if I list "salary" in the select statement, the query fails because of a lack of privileges.

Practical Tip: Useful Sentry commands
Alternatively, you can use the beeline CLI to view and manage Sentry roles. Below I'll show how to manage privileges with beeline. First, I obtain a Kerberos ticket for the hive user ("hive" is the admin) and log in with beeline; then I drop the role and create it again, assign it some privileges, and link the role to a group. After that I log in as limited_user, who belongs to limited_grp, and check the permissions:

# klist
Ticket cache: FILE:/tmp/krb5cc_0
Default principal: hive/scajvm1bda04.vm.oracle.com@ORALCE.TEST
# beeline
beeline> !connect jdbc:hive2://scajvm1bda04.vm.oracle.com:10000/default;ssl=true;sslTrustStore=/opt/cloudera/security/jks/scajvm.truststore;trustStorePassword=Hh1SDy8wMyKpoz25vgIB7fMwjJkkaLtA6FR4SzW1bULVs9dKhgVNQwvaoRy1GVbN;principal=hive/scajvm1bda04.vm.oracle.com@ORALCE.TEST;
...
// Check which roles Sentry has
0: jdbc:hive2://scajvm1bda04.vm.oracle.com:10> SHOW ROLES;
...
+---------------+--+
|     role      |
+---------------+--+
| limited_role  |
| admin         |
| power_role    |
| admin_role    |
+---------------+--+
4 rows selected (0.146 seconds)
// Check which role the current session has
0: jdbc:hive2://scajvm1bda04.vm.oracle.com:10> SHOW CURRENT ROLES;
...
+-------------+--+
|    role     |
+-------------+--+
| admin_role  |
+-------------+--+
// drop the role
0: jdbc:hive2://scajvm1bda04.vm.oracle.com:10> drop role limited_role;
...
// create the role again
0: jdbc:hive2://scajvm1bda04.vm.oracle.com:10> create role limited_role;
...
// grant a privilege to the newly created role
0: jdbc:hive2://scajvm1bda04.vm.oracle.com:10> grant select on emp_limited to role limited_role;
...
// link the role and the group
0: jdbc:hive2://scajvm1bda04.vm.oracle.com:10> grant role limited_role to group limited_grp;
...
0: jdbc:hive2://scajvm1bda04.vm.oracle.com:10> !exit

Go back to Linux to get a ticket for another user:

// obtain a ticket for limited_user
# kinit limited_user
Password for limited_user@ORALCE.TEST:
# klist
Ticket cache: FILE:/tmp/krb5cc_0
Default principal: limited_user@ORALCE.TEST

// check the groups for this user
# hdfs groups limited_user
limited_user : limited_grp
# beeline
...
beeline> !connect jdbc:hive2://scajvm1bda04.vm.oracle.com:10000/default;ssl=true;sslTrustStore=/opt/cloudera/security/jks/scajvm.truststore;trustStorePassword=Hh1SDy8wMyKpoz25vgIB7fMwjJkkaLtA6FR4SzW1bULVs9dKhgVNQwvaoRy1GVbN;principal=hive/scajvm1bda04.vm.oracle.com@ORALCE.TEST;
...
0: jdbc:hive2://scajvm1bda04.vm.oracle.com:10> show current roles;
...
+---------------+--+
|     role      |
+---------------+--+
| limited_role  |
+---------------+--+

// check my privileges
0: jdbc:hive2://scajvm1bda04.vm.oracle.com:10> show grant role limited_role;
+-----------+--------------+------------+---------+-----------------+-----------------+------------+---------------+-------------------+----------+--+
| database  |    table     | partition  | column  | principal_name  | principal_type  | privilege  | grant_option  |    grant_time     | grantor  |
+-----------+--------------+------------+---------+-----------------+-----------------+------------+---------------+-------------------+----------+--+
| default   | emp_limited  |            |         | limited_role    | ROLE            | select     | false         | 1502250173482000  | --       |
+-----------+--------------+------------+---------+-----------------+-----------------+------------+---------------+-------------------+----------+--+

// try to select a table I'm not permitted to query
0: jdbc:hive2://scajvm1bda04.vm.oracle.com:10> select * from emp;
Error: Error while compiling statement: FAILED: SemanticException No valid privileges
 User limited_user does not have privileges for QUERY
 The required privileges: Server=server1->Db=default->Table=emp->Column=division->action=select; (state=42000,code=40000)

// select from the view we are allowed to query
0: jdbc:hive2://scajvm1bda04.vm.oracle.com:10> select * from emp_limited;
...
+-----------------------+-------------------+--+
| emp_limited.division  | emp_limited.name  |
+-----------------------+-------------------+--+
|  Finance              |  Bob              |
|  Marketing            |  John             |
+-----------------------+-------------------+--+

Step 3. Encryption in Motion
Another important aspect of security is network encryption. Even if you protect access to the servers, somebody may listen to the network between the cluster and the client and intercept network packets for later analysis. Here is an example of how it could be hacked (note: my cluster is already Kerberized).
Intruder server:
# tcpdump -XX -i eth3 > tcpdump.file
------------------------------------
Client machine:
# echo "encrypt your data" > test.file
# hadoop fs -put test.file /tmp/
------------------------------------
Intruder server:
# less tcpdump.file |grep -B 1 data
        0x0060:  0034 ff48 2a65 6e63 7279 7074 2079 6f75  .4.H*encrypt.you
        0x0070:  7220 6461 7461 0a                        r.data.

Now we are hacked. Fortunately, Hadoop has the capability to protect the network traffic between clients and the cluster. It will cost you some performance, but the impact should not stop you from enabling network encryption between the clients and the cluster. Before enabling encryption I ran a simple performance test:

# hadoop jar hadoop-mapreduce-examples.jar teragen 100000000 /teraInput
...
# hadoop jar hadoop-mapreduce-examples.jar terasort /teraInput /teraOutput
...

Both jobs took 3.7 minutes. Remember this number for now. Fortunately, Oracle Big Data Appliance provides an easy way to enable network encryption with bdacli. You can set it up with one command:

# bdacli enable hdfs_encrypted_data_transport
...

You will need to answer some questions about your specific cluster configuration, such as the Cloudera Manager admin password and OS passwords. After the command finished, I ran the performance test again:

# hadoop jar hadoop-mapreduce-examples.jar teragen 100000000 /teraInput
...
# hadoop jar hadoop-mapreduce-examples.jar terasort /teraInput /teraOutput
...

Now the jobs took 4.5 and 4.2 minutes respectively. They run a bit more slowly, but it is worth it. For better performance when transferring encrypted data, we can take advantage of Intel's embedded instructions by changing the encryption algorithm (go to Cloudera Manager -> HDFS -> Configuration -> dfs.encrypt.data.transfer.algorithm -> AES/CTR/NoPadding).

Another vulnerability is network interception during the shuffle (the step between the Map and Reduce operations) and other communication between nodes and clients. To prevent this, you have to encrypt the shuffle traffic as well. The BDA again has an easy solution: run the bdacli enable hadoop_network_encryption command, which also encrypts the intermediate files generated after the shuffle step:

# bdacli enable hadoop_network_encryption
...

As in the previous example, simply answer a few questions and the encryption will be enabled. Let's check the performance numbers again:

# hadoop jar hadoop-mapreduce-examples.jar teragen 100000000 /teraInput
...
# hadoop jar hadoop-mapreduce-examples.jar terasort /teraInput /teraOutput
...

Now we get 5.1 and 4.4 minutes respectively. A bit slower again, but that is the price of keeping the data safe. Conclusion:
- Network encryption prevents intruders from capturing data in flight
- Network encryption degrades your overall performance; however, this degradation shouldn't stop you from enabling it, because it's very important from a security perspective

Step 4. Encryption at Rest. HDFS transparent encryption
All right: we have now protected the cluster from external unauthorized access (by enabling Kerberos) and encrypted the network communication between the cluster and clients as well as the intermediate files, but we still have vulnerabilities. If a user gets access to one of the cluster's servers, he or she can read the data (despite the ACLs). Let me give you an example.
An ordinary user puts sensitive information in a file and puts it on HDFS:
# echo "sensitive information here" > test.file
# hadoop fs -put test.file /tmp/test.file
The intruder knows the file name and wants to get its content (the sensitive information):
# hdfs fsck /tmp/test.file -locations -files -blocks
...
0. BP-421546782-192.168.254.66-1501133071977:blk_1073747034_6210 len=27 Live_repl=3
...
# find / -name "blk_1073747034*"
/u02/hadoop/dfs/current/BP-421546782-192.168.254.66-1501133071977/current/finalized/subdir0/subdir20/blk_1073747034_6210.meta
/u02/hadoop/dfs/current/BP-421546782-192.168.254.66-1501133071977/current/finalized/subdir0/subdir20/blk_1073747034
# cat /u02/hadoop/dfs/current/BP-421546782-192.168.254.66-1501133071977/current/finalized/subdir0/subdir20/blk_1073747034
sensitive information here
The hacker found the blocks that store the data and then read the physical files at the OS level. It was that easy to do. What can you do to prevent this? The answer is to use HDFS encryption. Again, BDA has a single command to enable HDFS transparent encryption: bdacli enable hdfs_transparent_encryption. You may find more details in MOS "How to Enable/Disable HDFS Transparent Encryption on Oracle Big Data Appliance V4.4 with bdacli (Doc ID 2111343.1)". I'd like to note that Cloudera has a great blog post about HDFS transparent encryption and I recommend reading it. So, after encryption has been enabled, I'll repeat my previous test case. Prior to running the test, we will create an encryption zone and copy files into that zone.
// obtain an HDFS Kerberos ticket to work with the cluster on behalf of the hdfs user
# kinit -kt `find / -name hdfs.keytab -printf "%T+\t%p\n" | sort|tail -1|cut -f 2` hdfs/`hostname`
// create the directory which will become our encryption zone
# hadoop fs -mkdir /tmp/EZ/
// create a key in the Key Trustee Server
# hadoop key create myKey
// create the encryption zone, using the key created earlier
# hdfs crypto -createZone -keyName myKey -path /tmp/EZ
// make oracle the owner of this directory
# hadoop fs -chown oracle:oracle /tmp/EZ
// Switch user to oracle
# kinit -kt oracle.scajvm1bda01.vm.oracle.com.keytab oracle/scajvm1bda01.vm.oracle.com
// Load a file into the encryption zone
# hadoop fs -put test.file /tmp/EZ/
// determine the physical location of the file on the Linux FS
# hadoop fsck /tmp/EZ/test.file -blocks -files -locations
...
0. BP-421546782-192.168.254.66-1501133071977:blk_1073747527_6703
...
[root@scajvm1bda01 ~]# find / -name blk_1073747527*
/u02/hadoop/dfs/current/BP-421546782-192.168.254.66-1501133071977/current/finalized/subdir0/subdir22/blk_1073747527_6703.meta
/u02/hadoop/dfs/current/BP-421546782-192.168.254.66-1501133071977/current/finalized/subdir0/subdir22/blk_1073747527
// trying to read the file
[root@scajvm1bda01 ~]# cat /u02/hadoop/dfs/current/BP-421546782-192.168.254.66-1501133071977/current/finalized/subdir0/subdir22/blk_1073747527
▒i#▒▒C▒x▒1▒U▒l▒▒▒[
Bingo! The file is encrypted and the person who attempted to access the data can only see a series of nonsensical bytes. Now let me say a few words about how encryption works. There are a few types of keys (the screenshots are taken from Cloudera's blog):
1) Encryption Zone key. You may encrypt the files in a certain directory using a unique key. This directory is called an Encryption Zone (EZ) and the key is called the EZ key. This approach may be quite useful when you share a Hadoop cluster among different divisions within the same company. This key is stored in the KMS (Key Management Server). 
The KMS handles generating encryption keys (EZ keys and DEKs); it also communicates with the key server and decrypts EDEKs.
2) The Encrypted Data Encryption Key (EDEK) is an attribute of a file, which is stored in the NameNode.
3) The DEK is not persistent; it is computed on the fly from the EDEK and the EZ key.
Here is the flow of how a client writes data to encrypted HDFS. I took this explanation from the Hadoop Security book:
1) The HDFS client calls create() to write to the new file.
2) The NameNode requests the KMS to create a new EDEK using the EZK-id/version.
3) The KMS generates a new DEK.
4) The KMS retrieves the EZK from the key server.
5) The KMS encrypts the DEK, resulting in the EDEK.
6) The KMS provides the EDEK to the NameNode.
7) The NameNode persists the EDEK as an extended attribute for the file metadata.
8) The NameNode provides the EDEK to the HDFS client.
9) The HDFS client provides the EDEK to the KMS, requesting the DEK.
10) The KMS requests the EZK from the key server.
11) The KMS decrypts the EDEK using the EZK.
12) The KMS provides the DEK to the HDFS client.
13) The HDFS client encrypts data using the DEK.
14) The HDFS client writes the encrypted data blocks to HDFS.
For reading data, the flow is:
1) The HDFS client calls open() to read a file.
2) The NameNode provides the EDEK to the client.
3) The HDFS client passes the EDEK and EZK-id/version to the KMS.
4) The KMS requests the EZK from the key server.
5) The KMS decrypts the EDEK using the EZK.
6) The KMS provides the DEK to the HDFS client.
7) The HDFS client reads the encrypted data blocks, decrypting them with the DEK.
I'd like to highlight again that all these steps are completely transparent and the end user doesn't notice any difference while working with HDFS.
Tip: Key Trustee Server and Key Trustee KMS. For those of you who are just starting to work with HDFS data encryption, the terms Key Trustee Server and KMS may be a bit confusing. Which component do you need to use, and for what purpose? From Cloudera's documentation:
Key Trustee Server is an enterprise-grade virtual safe-deposit box that stores and manages cryptographic keys. With Key Trustee Server, encryption keys are separated from the encrypted data, ensuring that sensitive data is protected in the event that unauthorized users gain access to the storage media.
Key Trustee KMS - for HDFS Transparent Encryption, Cloudera provides Key Trustee KMS, a customized Key Management Server. The KMS service is a proxy that interfaces with a backing key store on behalf of HDFS daemons and clients. Both the backing key store and the KMS implement the Hadoop KeyProvider client API. Encryption and decryption of EDEKs happen entirely on the KMS. More importantly, the client requesting creation or decryption of an EDEK never handles the EDEK's encryption key (that is, the encryption zone key).
This picture (again from Cloudera's documentation) shows that the KMS is an intermediate service between the NameNode and the Key Trustee Server:
Practical Tip: HDFS transparent encryption Linux operations
// To get the list of keys, use:
# hadoop key list
Listing keys for KeyProvider: org.apache.hadoop.crypto.key.kms.LoadBalancingKMSClientProvider@5386659f
myKey
// To get the list of encryption zones, use:
# hdfs crypto -listZones
/tmp/EZ  myKey
// To encrypt existing data, you need to copy it into the zone:
# hadoop distcp /user/dir /encryption_zone
Step5. Audit. Cloudera Navigator
Auditing tracks who does what on the cluster - making it easy to identify improper attempts to access data. 
Fortunately, Cloudera provides an easy and efficient way to do so; it's called Cloudera Navigator. Cloudera Navigator is included with BDA and Big Data Cloud Service. It is accessible through Cloudera Manager. You can use the "admin" password from Cloudera Manager; after logging on, choose the "Audit" section, where you can create different audit reports, such as "which files user afilanov created on HDFS in the last hour":
Step6. User management and tools in Hadoop. Group Mapping
HDFS is a filesystem and, as we discussed earlier, it has ACLs for managing file permissions. As you know, those three magic numbers define access rules for owner-group-others. But how do you find out which groups your user belongs to? There are two types of user group lookup - LDAP based and UnixShell based. In Cloudera Manager it is defined through the hadoop.security.group.mapping parameter. To check the list of groups for a certain user from the Linux console, just run:
$ hdfs groups hive
hive : hive oinstall hadoop
Step6. User management and tools in Hadoop. Connect to Hive from the bash console
The two most common ways of connecting to Hive from the shell are 1) the Hive CLI and 2) beeline. The first is deprecated, as it bypasses the security in HiveServer2; it communicates directly with the metastore. Therefore, beeline is the recommended tool; it communicates with HiveServer2, which allows the authorization rules to engage. The Hive CLI tool is a big security back door and it's highly recommended to disable it. To accomplish this, you need to configure Hive properly. You can use the hadoop.proxyuser.hive.groups parameter to allow only users belonging to the groups specified in the proxy list to connect to the metastore (the application components); as a consequence, a user that does not belong to these groups and runs the Hive CLI will not be able to connect to the metastore. Go to Cloudera Manager -> Hive -> Configuration, type "hadoop.proxyuser.hive.groups" in the search bar and add the hive, Impala and hue users. Restart the Hive server. You will now be able to connect to the Hive CLI only as a privileged user (one belonging to the hive, hue or Impala groups).
# kinit -kt hive.keytab hive/scajvm1bda04.vm.oracle.com
# klist
Ticket cache: FILE:/tmp/krb5cc_0
Default principal: hive/scajvm1bda04.vm.oracle.com@ORALCE.TEST
...
# hive
...
hive> show tables;
OK
emp
emp_limited
Time taken: 1.045 seconds, Fetched: 2 row(s)
hive> exit;
# kinit afilanov@BDA.COM
Password for afilanov@BDA.COM:
# klist
Ticket cache: FILE:/tmp/krb5cc_0
Default principal: afilanov@BDA.COM
...
# hive
...
hive> show tables;
FAILED: SemanticException org.apache.hadoop.hive.ql.metadata.HiveException: org.apache.hadoop.hive.ql.metadata.HiveException: org.apache.thrift.transport.TTransportException: java.net.SocketException: Connection reset
hive>
Great, now we have locked down the old Hive CLI, so it's a good time to use the modern beeline console. To run beeline, invoke it from the command line and use the following connection string:
beeline> !connect jdbc:hive2://<FQDN to HS2>:10000/default;principal=hive/<FQDN to HS2>@<YOUR REALM>;
For example:
# beeline
Java HotSpot(TM) 64-Bit Server VM warning: ignoring option MaxPermSize=512M; support was removed in 8.0
Beeline version 1.1.0-cdh5.11.1 by Apache Hive
beeline> !connect jdbc:hive2://scajvm1bda04.vm.oracle.com:10000/default;principal=hive/scajvm1bda04.vm.oracle.com@ORALCE.TEST;
Note: before connecting with beeline you must obtain the Kerberos ticket that is used to confirm your identity. 
# kinit afilanov@BDA.COM
Password for afilanov@BDA.COM:
# klist
Ticket cache: FILE:/tmp/krb5cc_0
Default principal: afilanov@BDA.COM
Valid starting     Expires            Service principal
08/14/17 04:53:27  08/14/17 14:53:17  krbtgt/BDA.COM@BDA.COM
        renew until 08/21/17 04:53:27
Note that you may have SSL/TLS encryption enabled (Oracle recommends it). You can check by running:
$ openssl s_client -debug -connect node03.us2.oraclecloud.com:10000
If you have enabled Hive TLS/SSL encryption (to ensure the integrity and confidentiality of the connection between your client and the server), you need to use a slightly trickier authentication with beeline: you have to supply an SSL trust store file and a trust store password. When a client connects to a server and that server sends a public certificate across to the client to begin the encrypted connection, the client must determine whether it 'trusts' the server's certificate. To do this, it checks the server's certificate against a list of things it has been configured to trust, called a trust store. You can find it with the Cloudera Manager REST API:
# curl -X GET -u "admin:admin1" -k -i https://<Cloudera Manager Host>:7183/api/v15/cm/config
...
 "name" : "TRUSTSTORE_PASSWORD",
    "value" : "Hh1SDy8wMyKpoz25vgIB7fMwjJkkaLtA6FR4SzW1bULVs9dKhgVNQwvaoRy1GVbN"
  },{
    "name" : "TRUSTSTORE_PATH",
    "value" : "/opt/cloudera/security/jks/scajvm.truststore"
  }
Alternatively, you may use the bdacli tool on BDA:
# bdacli getinfo cluster_https_truststore_password
# bdacli getinfo cluster_https_truststore_path
Now you know the truststore password and the truststore path. If you doubt that it matches, you can take a look at the truststore file content:
afilanov-mac:~ afilanov$ keytool -list -keystore testbdcs.truststore
Enter keystore password:
Keystore type: JKS
Keystore provider: SUN
Your keystore contains 5 entries
cfclbv3874.us2.oraclecloud.com, Apr 13, 2018, trustedCertEntry,
 Certificate fingerprint (SHA1): F0:5D:28:36:99:67:FB:C0:B1:D5:B3:75:DF:D6:51:9B:DF:EB:3E:3A
cfclbv3871.us2.oraclecloud.com, Apr 13, 2018, trustedCertEntry,
 Certificate fingerprint (SHA1): AF:3A:20:90:04:0A:27:B5:BD:DF:83:32:C7:4A:AF:AF:C4:97:E1:30
cfclbv3873.us2.oraclecloud.com, Apr 13, 2018, trustedCertEntry,
 Certificate fingerprint (SHA1): 30:09:B9:A8:79:D7:F4:02:3F:72:8C:05:F1:A4:BF:04:9B:8B:78:CA
cfclbv3870.us2.oraclecloud.com, Apr 13, 2018, trustedCertEntry,
 Certificate fingerprint (SHA1): EA:F0:38:1E:BB:89:E2:05:38:CA:F2:FB:4D:41:82:75:BE:5D:F7:88
cfclbv3872.us2.oraclecloud.com, Apr 13, 2018, trustedCertEntry,
 Certificate fingerprint (SHA1): C5:7D:F2:FA:96:8C:AB:4A:D2:03:02:DA:D3:F5:0C:7C:45:8E:26:E7
After this, connect to the database. For example, in my case I used:
# beeline
Java HotSpot(TM) 64-Bit Server VM warning: ignoring option MaxPermSize=512M; support was removed in 8.0
Beeline version 1.1.0-cdh5.11.1 by Apache Hive
beeline> !connect jdbc:hive2://scajvm1bda04.vm.oracle.com:10000/default;ssl=true;sslTrustStore=/opt/cloudera/security/jks/scajvm.truststore;trustStorePassword=Hh1SDy8wMyKpoz25vgIB7fMwjJkkaLtA6FR4SzW1bULVs9dKhgVNQwvaoRy1GVbN;principal=hive/scajvm1bda04.vm.oracle.com@ORALCE.TEST;
Please note that the truststore contains public keys and is not a big secret; it generally is not a problem to use it in scripts and share it. 
Alternatively, if you don't want to use such a long connection string all the time, you can just put the truststore credentials in the Linux environment:
# export HADOOP_OPTS="-Djavax.net.ssl.trustStore=/opt/cloudera/security/jks/scajvm.truststore -Djavax.net.ssl.trustStorePassword=Hh1SDy8wMyKpoz25vgIB7fMwjJkkaLtA6FR4SzW1bULVs9dKhgVNQwvaoRy1GVbN"
# beeline
...
beeline> !connect jdbc:hive2://scajvm1bda04.vm.oracle.com:10000/default;ssl=true;principal=hive/scajvm1bda04.vm.oracle.com@ORALCE.TEST;
You can also automate all the steps above by scripting them:
$ cat beeline.login
#!/bin/bash
export CMUSR=admin
export CMPWD=admin_password
tspass=`bdacli getinfo cluster_https_truststore_password`
tspath=`bdacli getinfo cluster_https_truststore_path`
domain=`bdacli getinfo cluster_domain_name`
realm=`bdacli getinfo cluster_kerberos_realm`
hivenode=`json-select --jpx=HIVE_NODE /opt/oracle/bda/install/state/config.json`
set +o histexpand
echo "!connect jdbc:hive2://$hivenode:10000/default;ssl=true;sslTrustStore=$tspath;trustStorePassword=$tspass;principal=hive/$hivenode@$realm"
beeline -u "jdbc:hive2://$hivenode:10000/default;ssl=true;sslTrustStore=$tspath;trustStorePassword=$tspass;principal=hive/$hivenode@$realm"
$ ./beeline.login
Step6. User management and tools in Hadoop. HUE and LDAP authentication
One more interesting thing which you can do with your Active Directory (or any other LDAP implementation) is integrate HUE with LDAP and use LDAP passwords to authenticate your users in HUE. Before doing this you have to enable TLSv1 in your Java settings on the HUE server. Here is the detailed MOS note on how to do this - search for: Disables TLSv1 by Default For Cloudera Manager/Hue/And in System-Wide Java Configurations (Doc ID 2250841.1). After this, you may want to watch these YouTube videos to understand how easy this integration is: Authenticate Hue with LDAP and Search Bind or Authenticate Hue with LDAP and Direct Bind. It's really not too hard. You may need to define your base_dn, and here is a good article about how to do this. Next, you may need a bind user to make the first connection and import all the other users (here is the explanation from Cloudera Manager: "Distinguished name of the user to bind as. This is used to connect to LDAP/AD for searching user and group information. This may be left blank if the LDAP server supports anonymous binds."). For this purpose I used my AD account afilanov. After this I log in to HUE using the afilanov login/password, then click on the user name and choose "Manage Users". Add/Sync LDAP users; optionally, you may put in a Username pattern and click Sync. Here we go! Now we have the list of LDAP users imported into HUE, and we can use any of these accounts to log in to HUE. A rough idea of the underlying settings is sketched below.
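For reference, here is a minimal sketch of what the equivalent LDAP settings look like in hue.ini when using search bind. The property names are Hue's standard LDAP options, but the URL, base DN and bind account below are made-up placeholders, and on BDA/BDCS you would normally set these through the Cloudera Manager Hue configuration pages rather than editing the file by hand:
[desktop]
  [[ldap]]
    # LDAP/AD server and how to bind to it (values are placeholders)
    ldap_url=ldaps://ad.example.com
    search_bind_authentication=true
    base_dn="DC=example,DC=com"
    bind_dn="CN=afilanov,OU=Users,DC=example,DC=com"
    bind_password=changeme
    [[[users]]]
      # which entries are treated as users and which attribute becomes the Hue username
      user_filter="objectclass=user"
      user_name_attr=sAMAccountName
    [[[groups]]]
      group_filter="objectclass=group"
      group_name_attr=cn
The Add/Sync LDAP users step in the HUE UI then simply imports the accounts that match these filters.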

Security is a very important aspect of many projects and you must not underestimate it. Hadoop security is very complex and consists of many components; it's better to enable them one by one...

Big Data

How Enabling CDSW Will Help You Make Better Use of Your Big Data Appliance

No one has to elaborate on the interest and importance of Data Science, so we won't go into why you should be looking at frameworks and tools to enable AI/ML and more fun things on your Hadoop infrastructure. One way to do this on Oracle Big Data Appliance is to use Cloudera Data Science Workbench (CDSW). See the end of this post for some information on CDSW and its benefits. How does it work? Assuming you want to go with CDSW for your data science needs, here is what is being enabled with Big Data Appliance and what we did to enable support for CDSW. CDSW will run on (a set of) edge nodes on the cluster. These nodes must adhere to specific OS versions, and so we released a new BDA base image for edge nodes that provides Oracle Linux 7.x with UEK 4. CDSW supports Oracle Linux 7 as of CDSW 1.1 (more version information here). With the OS version squared away, we are set to support CDSW, and on a BDA (schematic shown below) with 8 nodes, you would re-image the two edges to the BDA OL7 base image, configure the network and integrate the nodes as edges into the cluster. After this you apply the CDSW install as documented by Cloudera. As you can see in the image, the two edge nodes are running OL7, but they form an integral part of the BDA cluster. They are also covered under the embedded Cloudera Enterprise Data Hub license. The remainder of the cluster nodes, as would be done in almost all instances, remains on your regular OL6 OS, with the Hadoop stack installed. Cloudera Manager is available for you to administer the cluster (no changes there, of course). And that really is it. Detailed steps for Oracle customers are tested as well as published via My Oracle Support. What is Cloudera Data Science Workbench? [From Cloudera - Neither I nor Oracle take credit for the below]  The Cloudera Data Science Workbench (CDSW) is a self-service environment for data science on Cloudera Enterprise. Based on Cloudera’s acquisition of data science startup Sense.io, CDSW allows data scientists to use their favorite open source languages -- including R, Python, and Scala -- and libraries on a secure enterprise platform with native Apache Spark and Apache Hadoop integration, to accelerate analytics projects from exploration to production. CDSW delivers the following benefits: For data scientists: Use R, Python, or Scala with their favorite libraries and frameworks, directly from a web browser. Directly access data in secure Hadoop clusters with Spark and Impala. Share insights with their entire team for reproducible, collaborative research. For IT professionals: Give your data science team the freedom to work how they want, when they want. Stay compliant with out-of-the-box support for full Hadoop security, especially Kerberos. Run on Private Cloud, Cloud at Customer, or Public Cloud. Read more on CDSW here. [End Cloudera bit] If you are reading this you must be interested in Analytics, AI/ML on Hadoop. This post is very cool and uses the freely downloadable Big Data Lite VM. Check it out...


Big Data

If You Struggle With Keeping your BDAs up to date, Then Read This

[Updated on October 15th to reflect the release of BDA 4.10, with CDH 5.12.1] One of the interesting aspects of keeping your Oracle Big Data infrastructure up to date (Hadoop, but also the OS and the JDK) is trying to get hold of the latest information so that everyone can plan their upgrades and see what is coming. The following is a list of versions released over the past quarters and a look ahead to what is coming.
What is the Schedule?
The intention is to release a software bundle - often referred to as a Mammoth bundle (the install utility is called Mammoth) - for our systems roughly 4-8 weeks after Cloudera's release. We are getting into the habit of actually releasing the BDA versions with the .1 update to the Cloudera version. As an example:
CDH 5.11.0 was released on April 19, 2017
CDH 5.11.1 was released on June 13, 2017
BDA 4.9.0 was released on June 18, 2017, picking up 5.11.1 for both CDH and CM
So, what is going on in the time between a CDH release and a BDA release? We do a few things (all on the same hardware our customers run):
We pre-test with pre-GA drops and try to uncover major issues early
We do the full OS, MySQL Database and Java upgrades (a Mammoth bundle ups the infra, not just Hadoop and Spark), test those and then do the below. As part of this, we also run security scans to ensure we pick up the latest security fixes - which is of course one of the reasons we update the OS with every release
We fully test deploying secure and non-secure clusters with the latest version as soon as the final bits drop (this is not weeks in advance, but when the SW is released), and we run smoke tests
We fully test upgrading clusters (secure and non-secure) from a variety of versions to the new version on BDA hardware
We fully test Node Migration, which is our automated way of dealing with node failures. E.g. in the unlikely event that node 2 fails, you run a single command to migrate node 2 and you are back in full HA mode...
We update the Big Data Connectors and other related components and run smoke tests
We update relevant parameters to comply with best practices and with the BDA hardware profiles to optimize the clusters before we ship them
We add BDA-specific features (for example, we just enabled OL5 to OL6 migration of clusters in BDA 4.9)
We make sure we do all of this quickly again when we pick up the .1 rather than the .0
Plus the stuff I tend to forget about...
What is Past and what is Next
The table below captures - and will be updated going forward - where we are now and what is coming next in terms of estimated timing for releases. All SUBJECT TO CHANGE WITHOUT NOTICE (also see the note below on Safe Harbor):
Date          BDA Version   CDH Version   Comments
Apr 11, 2017  4.8.0         5.10.1        MySQL version uptick
Jun 18, 2017  4.9.0         5.11.1        OL5 --> OL6 migration
Oct 15, 2017  4.10.0        5.12.1        Final OL5 release | Faster Node migration
Futures:
Dec 2017      4.11.0        5.13.x
As we move forward, we will attempt to keep this up to date, so folks can look ahead a little. For Cloudera's release sequence and times, please refer to Cloudera's communications. To configure and set up BDA systems, ensure you download the configurator utility (here) and review the documentation on Mammoth and BDA in general (here). Learn more about BDA and Mammoth on OTN. Please Note the Safe Harbor Statement below: The preceding is intended to outline our general product direction. It is intended for information purposes only, and may not be incorporated into any contract. 
It is not a commitment to deliver any material, code, or functionality, and should not be relied upon in making purchasing decisions. The development, release, and timing of any features or functionality described for Oracle’s products remains at the sole discretion of Oracle.


Data Warehousing

MATCH_RECOGNIZE and predicates - everything you need to know

  MATCH_RECOGNIZE and predicates At a recent user conference I had a question about when and how predicates are applied when using MATCH_RECOGNIZE, so that’s the purpose of this blog post. Will this post cover everything you will ever need to know for this topic? Probably! Where to start… the first thing to remember is that the table listed in the FROM clause of your SELECT statement acts as the input into the MATCH_RECOGNIZE pattern matching process, and this raises the question of how and where predicates are actually applied. I briefly touched on this topic in part 1 of my deep dive series on MATCH_RECOGNIZE: SQL Pattern Matching Deep Dive - Part 1. In that first post I looked at the position of predicates within the explain plan and their impact on sorting. In this post I am going to use the built-in measures (MATCH_NUMBER and CLASSIFIER) to show the impact of applying predicates to the results that are returned. First, if you need a quick refresher course in how to use the MATCH_RECOGNIZE built-in measures then see part 2 of the deep dive series: SQL Pattern Matching Deep Dive - Part 2, using MATCH_NUMBER() and CLASSIFIER(). As per usual I am going to use my normal stock ticker schema to illustrate the specific points. You can find this schema listed in most of the pattern matching examples on livesql.oracle.com. There are three key areas within the MATCH_RECOGNIZE clause that affect how predicates behave: the PARTITION BY column, the ORDER BY column, and all other columns.
1. Predicates on the PARTITION BY column
Let’s start with a simple query:
select * from ticker
MATCH_RECOGNIZE(
 PARTITION BY symbol
 ORDER BY tstamp
 MEASURES match_number() as mn
 ALL ROWS PER MATCH
 PATTERN (strt)
 DEFINE strt as 1=1
);
Note that we are using an always-true pattern STRT which is defined as 1=1 to ensure that we process all rows, and the pattern has no quantifier so it will be matched once and then reset to find the next match. As our ticker table contains 60 rows, the output also contains 60 rows. Check out the column headed mn, which contains our match_number() measure. This shows that within the first partition, for ACME, we matched the always-true event 20 times, i.e. all rows were matched. If we check the explain plan for this query we can see that all 60 rows (3 symbols, and 20 rows for each symbol) were processed. If we now apply a predicate on the PARTITION BY column, SYMBOL, then we can see that the first “block” of our output looks exactly the same; however, the explain plan shows that we have processed fewer rows - only 20 rows. Let’s modify and rerun our simple query:
select * from ticker
MATCH_RECOGNIZE(
 PARTITION BY symbol
 ORDER BY tstamp
 MEASURES match_number() as mn
 ALL ROWS PER MATCH
 PATTERN (strt)
 DEFINE strt as 1=1
)
WHERE symbol = 'ACME';
The results look similar, but note that the output summary returned by SQL Developer indicates that only 20 rows were fetched. Notice that the match_number() column (mn) is showing 1 - 20 as the values returned from the pattern matching process. If we look at the explain plan, this also shows that we processed 20 rows - so partition elimination filtered out the other 40 rows before pattern matching started. Therefore, if you apply predicates on the PARTITION BY column then MATCH_RECOGNIZE is smart enough to perform partition elimination to reduce the number of rows that need to be processed. Conclusion - predicates on the PARTITION BY column: predicates on the PARTITION BY column reduce the amount of data being passed into MATCH_RECOGNIZE. 
Built-in measures such as MATCH_NUMBER work as expected in that a contiguous sequence is returned.
2. Predicates on the ORDER BY column
What happens if we apply a predicate to the ORDER BY column? Let’s amend the query and add a filter on the tstamp column:
select * from ticker
MATCH_RECOGNIZE(
 PARTITION BY symbol
 ORDER BY tstamp
 MEASURES match_number() as mn
 ALL ROWS PER MATCH
 PATTERN (strt)
 DEFINE strt as 1=1
)
WHERE symbol='ACME'
AND tstamp BETWEEN '01-APR-11' AND '10-APR-11';
This returns a smaller resultset of only 10 rows and match_number is correctly sequenced from 1-10, as expected. However, the explain plan shows that we processed all the rows within the partition (20). This becomes a little clearer if we remove the predicate on the SYMBOL column:
select * from ticker
MATCH_RECOGNIZE(
 PARTITION BY symbol
 ORDER BY tstamp
 MEASURES match_number() as mn
 ALL ROWS PER MATCH
 PATTERN (strt)
 DEFINE strt as 1=1
)
WHERE tstamp BETWEEN '01-APR-11' AND '10-APR-11';
Now we see that 30 rows are returned, but all 60 rows have actually been processed! Conclusion: filters applied to non-PARTITION BY columns are applied after the pattern matching process has completed: rows are passed in to MATCH_RECOGNIZE, the pattern is matched and then the predicates on the ORDER BY/other columns are applied. Is there a way to prove that this is actually what is happening?
3. Using other columns
Let's add another column to our ticker table that shows the day name for each trade. Now let’s rerun the query with the predicate on the SYMBOL column:
select * from ticker
MATCH_RECOGNIZE(
 PARTITION BY symbol
 ORDER BY tstamp
 MEASURES match_number() as mn
 ALL ROWS PER MATCH
 PATTERN (strt)
 DEFINE strt as 1=1
)
WHERE symbol = 'ACME';
The column to note is MN, which contains a contiguous sequence of numbers from 1 to 20. What happens if we filter on the day_name column and only keep the working-week days (Mon-Fri)?
select * from ticker
MATCH_RECOGNIZE(
 PARTITION BY symbol
 ORDER BY tstamp
 MEASURES match_number() as mn
 ALL ROWS PER MATCH
 PATTERN (strt)
 DEFINE strt as 1=1
)
WHERE symbol = 'ACME'
AND day_name in ('MONDAY', 'TUESDAY', 'WEDNESDAY', 'THURSDAY', 'FRIDAY');
Now if we look at the match_number column, mn, we can see that the sequence is no longer contiguous: the value in row 2 is now 4 and not 2, and in row 7 the value of mn is 11 even though the previous row's value was 8. It is still possible to “access” the rows that have been removed. Consider the following query with the measure PREV(day_name):
select * from ticker
MATCH_RECOGNIZE(
 PARTITION BY symbol
 ORDER BY tstamp
 MEASURES match_number() as mn, prev(day_name) as prev_day
 ALL ROWS PER MATCH
 PATTERN (strt)
 DEFINE strt as 1=1
)
WHERE symbol='ACME'
AND day_name in ('MONDAY', 'WEDNESDAY', 'FRIDAY');
This returns a resultset where you can see that on row 2 the value SUNDAY has been returned even though, logically looking at the results, the previous day should be FRIDAY. This has important implications for numerical calculations such as running totals, final totals, averages, counts, min and max etc., because these will take into account all the matched rows (depending on how your pattern is defined) prior to the final set of predicates (i.e. on non-PARTITION BY columns) being applied. If you need the predicates to be applied before pattern matching, see the sketch below.   
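If you do want predicates on the ORDER BY column or other columns to be applied before the pattern matching process, one option is to pre-filter the rows in an inline view so that MATCH_RECOGNIZE only ever sees the filtered row source. This is my own sketch using the same ticker table and day_name column as above, not one of the original examples, but it is standard SQL:
select *
from (select *
      from ticker
      where symbol = 'ACME'
        and day_name in ('MONDAY', 'WEDNESDAY', 'FRIDAY')
     )
MATCH_RECOGNIZE(
 PARTITION BY symbol
 ORDER BY tstamp
 MEASURES match_number() as mn, prev(day_name) as prev_day
 ALL ROWS PER MATCH
 PATTERN (strt)
 DEFINE strt as 1=1
);
With this form the match_number() sequence is contiguous again and PREV(day_name) only ever sees the filtered rows, because the WHERE clause in the inline view is evaluated before the rows are passed into MATCH_RECOGNIZE.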
One last example. Let’s now change the always-true pattern to search for as many rows as possible (turn it into a greedy quantifier):
select symbol, tstamp, mn, price, day_name, prev_day, total_rows from ticker
MATCH_RECOGNIZE(
 PARTITION BY symbol
 ORDER BY tstamp
 MEASURES match_number() as mn, prev(day_name) as prev_day, count(*) as total_rows
 ALL ROWS PER MATCH
 PATTERN (strt+)
 DEFINE strt as 1=1
)
WHERE symbol='ACME'
AND day_name in ('MONDAY', 'WEDNESDAY', 'FRIDAY');
Now compare the results from the following two queries:
Query 1:
select symbol, tstamp, mn, price, day_name, prev_day, total_rows, avg_price, max_price from ticker
MATCH_RECOGNIZE(
 PARTITION BY symbol
 ORDER BY tstamp
 MEASURES match_number() as mn, prev(day_name) as prev_day, count(*) as total_rows, trunc(avg(price),2) as avg_price, max(price) as max_price
 ALL ROWS PER MATCH
 PATTERN (strt+)
 DEFINE strt as 1=1
)
WHERE symbol='ACME';
Query 2:
select symbol, tstamp, mn, price, day_name, prev_day, total_rows, avg_price, max_price from ticker
MATCH_RECOGNIZE(
 PARTITION BY symbol
 ORDER BY tstamp
 MEASURES match_number() as mn, prev(day_name) as prev_day, count(*) as total_rows, trunc(avg(price),2) as avg_price, max(price) as max_price
 ALL ROWS PER MATCH
 PATTERN (strt+)
 DEFINE strt as 1=1
)
WHERE symbol='ACME'
AND day_name in ('MONDAY', 'WEDNESDAY', 'FRIDAY');
The number of rows returned is different, but the values for the calculated columns (previous day, count, average and max) are exactly the same: Resultset 1: Resultset 2:
Conclusion
When I briefly touched on this topic in part 1 of my deep dive series on MATCH_RECOGNIZE, SQL Pattern Matching Deep Dive - Part 1, the focus was on the impact predicates had on sorting - would additional sorting take place if predicates were used. In this post I have looked at the impact on the data returned. Obviously, by removing rows at the end of processing there can be a huge impact on calculated measures such as match_number, counts and averages. Hope this has been helpful. If you have any questions then feel free to send me an email: keith.laker@oracle.com. Main image courtesy of wikipedia    


Big Data

See How Easily You Can Copy Data Between Object Store and HDFS

Object Stores tend to be a place where people put their data in the cloud (see also The New Data Lake - You Need More Than HDFS). Add data here and then share it, load it or use it across various other services. Here we won't discuss the architecture and whether or not the data lake now is the object store (hint: not yet...), but instead focus on how to easily move data back and forth between object stores and your Big Data Cloud Service (BDCS) cluster(s). ODCP The underlying foundation for the coming screen shots and for Big Data Manager - a free component included with Big Data Cloud Service - is Oracle Distributed CoPy. The utility is based loosely on DistCP but makes data movement leveraging Object Stores scalable and simple. For a good overview and some performance numbers on ODCP and a comparison with a host of other ways of loading data into BDCS I would recommend reviewing this post from the A-team at Oracle. For production workloads I would expect everyone to go command line, as it enables scripting of jobs or embedding this in your favorite ETL tool for execution in a more comprehensive flow (a minimal example is sketched at the end of this post). The command line reference manual is published here. Big Data Manager For those looking to get going, the command line may be a bit intimidating. Big Data Manager resolves that by providing an elegant way of: creating reusable storage providers, and managing access to these providers; providing an intuitive file browser and drag and drop capabilities between providers; and providing a simple GUI to choose between scheduled (and repeated) and immediate execution of jobs. Creating Data Providers (Storages) The cluster pages for BDCS have a link to Big Data Manager. The tool requires a specific login once you are working in the cluster. After you log in you will end up on the main page: Selecting the Administration Tab in the tool enables the creation and editing of the Storages, as they are called. You can create these for an ever-expanding set of providers - for example, Oracle Storage Cloud, Amazon S3, BDCS HDFS, etc. Check back for new ones frequently or simply keep an eye on your updated Big Data Manager. Tip: when creating a Storage for Oracle Object Store, the Tenant starts with "storage-" and you add your identity domain after that. Once you have your storages created, you are in business and dragging and dropping can start. In my example here, I am going from Oracle Storage Cloud Service to the HDFS in my BDCS cluster, and so I am loading data into my BDCS system: Now simply drag and drop from left to right (or of course the other way) and you will be asked whether or not to do the move from Object Store to HDFS now, or schedule it and repeat it on a specified frequency. Clicking Create will spawn an Apache Spark job on the BDCS cluster, open a connection to Object Store and run a data transfer in parallel based on the settings you can tweak in the advanced tab. Switching the "Run Immediately" toggle to "Repeated Execution" gives you the scheduling information: Once done, the job runs, and can be monitored in Big Data Manager: SDK Last but not least, there is both a Python and a Java SDK for Big Data Manager. Feel free to give all this a whirl in your BDCS instance and let us know how things go. 
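To give a flavour of the command line route mentioned above, here is a minimal sketch of an odcp copy. The container, provider and paths are made-up placeholders, and the exact URI schemes and options supported depend on your version, so check the command line reference manual before relying on this:
# copy a file from an Object Store container into HDFS (illustrative names only)
odcp swift://myContainer.myProvider/sales.csv hdfs:///user/oracle/data/
# copy a directory from HDFS back out to the Object Store
odcp hdfs:///user/oracle/data/export/ swift://myContainer.myProvider/export/
Because odcp runs as a distributed Spark job on the cluster, the same command scales from a single file to large directory trees, which is what makes it attractive for scripted, production data movement.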


Big Data

Big Data Warehousing Must See Guide for Oracle OpenWorld 2017

  It’s here - at last! I have just pushed my usual must-see guide to the Apple iBooks Store. The free big data warehousing Must-See guide for OpenWorld 2017 is now available for download from the Apple iBooks Store - click here, and yes it’s completely free. This comprehensive guide covers everything you need to know about this year’s Oracle OpenWorld conference so that when you arrive at Moscone Conference Center you are ready to get the most out of this amazing conference. The guide contains the following information:
Page 8 - On-Demand Videos
Page 17 - Justify Your Trip
Page 19 - Key Presenters
Page 41 - Must See Sessions
Page 90 - Useful Maps
Chapter 1 - Introduction to the must-see guide.
Chapter 2 - A guide to the key highlights from last year’s conference so you can relive the experience or see what you missed. Catch the most important highlights from last year's OpenWorld conference with our on-demand video service which covers all the major keynote sessions. Sit back and enjoy the highlights. The second section explains why you need to attend this year’s conference and how to justify it to your company.
Chapter 3 - Full list of Oracle Product Management and Development presenters who will be at this year’s OpenWorld. Links to all their social media sites are included alongside each profile. Read on to find out about the key people who can help you and your teams build the FUTURE using Oracle’s Data Warehouse and Big Data technologies.
Chapter 4 - List of the “must-see” sessions at this year’s OpenWorld by category. It includes all the sessions and hands-on labs by the Oracle Product Management and Development teams along with key customer sessions. Read on for the list of the best, most innovative sessions at Oracle OpenWorld 2017.
Chapter 5 - Details of all the links you need to keep up to date on Oracle’s strategy and products for Data Warehousing and Big Data. This covers all our websites, blogs and social media pages.
Chapter 6 - Details of our exclusive web application for smartphones and tablets, which provides you with a complete guide to everything related to data warehousing and big data at OpenWorld 2017.
Chapter 7 - Information to help you find your way around the area surrounding the Moscone Conference Center; this section includes some helpful maps.
What’s missing? At the moment there is no information about hands-on labs or the demogrounds, but as soon as that information is available I will update the contents and push it to the iBooks Store. Stay tuned for update notifications posted on Twitter, Facebook, Google+ and LinkedIn. Let me know if you have any comments. Enjoy.  


Big Data SQL

Big Data SQL Quick Start. Binary Images and Big Data SQL – Part 22

Big Data SQL Quick Start. Binary Images and Big Data SQL – Part 22 Many thanks to Dario Vega, who is the actual author of this content; I'm just publishing it in the Big Data SQL blog. The goal: create a Hive table with a binary field and cast it to a BLOB type in the RDBMS when using Big Data SQL. For text files, Hive stores binary fields in a base64 representation. Normally there is no problem with newline characters and no extra work is needed inside the Oracle database; the conversion is done by Big Data SQL. Using JSON files, the native JsonSerDe throws org.apache.hadoop.hive.serde2.SerDeException: java.io.IOException: JsonSerDe does not support BINARY type, and org.openx.data.jsonserde.JsonSerDe generates null values, so you need to write the field in base64 and do the transformation in the database. TIP: the standard ECMA-404 “The JSON Data Interchange Format” suggests that “JSON is not indicated for applications requiring binary data”, but it is used by many people, including our cloud services (of course using base64). Using ORC, Parquet and Avro it works well. When using avro-tools the JSON file is generated using base32, but each format stores the data using its own representation:
[oracle@tvpbdaacn13 dvega]$ /usr/bin/avro-tools tojson avro.file.dvega | more
{"zipcode":{"string":"00720"}, "lastname":{"string":"ALBERT"}, "firstname":{"string":"JOSE"}, "ssn":{"long":253181087}, "gender":{"string":"male"}, "license":{"bytes":"S11641384"} }
[oracle@tvpbdaacn13 dvega]$ /usr/bin/parquet-tools head parquet.file.dvega
zipcode = 00566
lastname = ALEXANDER
firstname = PETER
ssn = 637221663
gender = male
license = UzY4NTkyNTc4
Simulating using Linux tools
On Hive:
create table image_new_test (img binary);
On Oracle:
SQL> CREATE TABLE image_new_test ( IMG BLOB ) ORGANIZATION EXTERNAL ( TYPE ORACLE_HIVE DEFAULT DIRECTORY DEFAULT_DIR ACCESS PARAMETERS ( com.oracle.bigdata.cluster= tvpbdaacluster3 com.oracle.bigdata.tablename: pmt.image_new_test ) );
On Linux:
base64 --w 10000000 YourImage.PNG > YourImage.BASE64
# Be sure to have only one line before copying to Hadoop; if not, fix it. Check with: wc -l YourImage.BASE64
# you can concatenate many images in the same BASE64 file - one image per line
hadoop fs -put Capture.BASE64 hdfs://tvpbdaacluster3-ns/user/hive/warehouse/pmt.db/image_new_test
or use the Hive LOAD commands.
Validate using SQL Developer: Compare to the original one:
Copying images stored in the database to Hadoop
Original tables:
SQL> create table image ( id number, img BLOB);
insert an image using SQL Developer
REM create an external table to copy the dmp files to hadoop
CREATE TABLE image_dmp ORGANIZATION EXTERNAL ( TYPE oracle_datapump DEFAULT DIRECTORY DEFAULT_DIR LOCATION ('filename1.dmp') ) AS SELECT * FROM image;
Hive Tables:
# copy files to hadoop eg. 
on /user/dvega/images/filename1.dmp CREATE EXTERNAL TABLE image_hive_dmp ROW FORMAT SERDE 'oracle.hadoop.hive.datapump.DPSerDe' STORED AS INPUTFORMAT 'oracle.hadoop.hive.datapump.DPInputFormat' OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat' LOCATION '/user/oracle/dvega/images/'; create table image_hive_text as select * from image_hive_dmp ; Big Data SQL tables: CREATE TABLE IMAGE_HIVE_DMP ( ID NUMBER , IMG BLOB ) ORGANIZATION EXTERNAL ( TYPE ORACLE_HIVE DEFAULT DIRECTORY DEFAULT_DIR ACCESS PARAMETERS ( com.oracle.bigdata.cluster= tvpbdaacluster3 com.oracle.bigdata.tablename: pmt.image_hive_dmp ) ); CREATE TABLE IMAGE_HIVE_TEXT ( ID NUMBER , IMG BLOB ) ORGANIZATION EXTERNAL ( TYPE ORACLE_HIVE DEFAULT DIRECTORY DEFAULT_DIR ACCESS PARAMETERS ( com.oracle.bigdata.cluster= tvpbdaacluster3 com.oracle.bigdata.tablename: pmt.image_hive_text ) );
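As a quick sanity check (my own addition, not part of Dario's original write-up), once the external tables above are in place you can verify that the BLOB content really came across by comparing lengths on the Oracle side; DBMS_LOB.GETLENGTH works on the BLOB column exposed by Big Data SQL just as it does on a regular table:
-- compare the size of the original image with what Big Data SQL reads back from Hive
SELECT id, DBMS_LOB.GETLENGTH(img) AS img_bytes FROM image;
SELECT id, DBMS_LOB.GETLENGTH(img) AS img_bytes FROM image_hive_text;
Matching byte counts for the same id is a cheap way to confirm that the base64/Data Pump round trip did not corrupt the binary payload.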


Big Data SQL

Big Data SQL Quick Start. Complex Data Types – Part 21

Big Data SQL Quick Start. Complex Data Types – Part 21 Many thanks to Dario Vega, who is the actual author of this content. I'm just publishing it on this blog. A common, potentially mistaken approach that people take regarding the integration of NoSQL, Hive and ultimately Big Data SQL is to use only an RDBMS perspective and not an integration point of view. People generally think about all the features and data types they're already familiar with from their experience using one of these products, rather than realizing that the actual data is stored in the Hive (or NoSQL) database rather than the RDBMS, or without understanding that the data will be queried from the RDBMS. When using Big Data SQL with complex types, we tend to use JSON/SQL without taking care of the differences between how Oracle Database and Hive use complex types. Why? Because the complex types are mapped to VARCHAR2 in JSON format, so we are reading the data in JSON style instead of that of the original system. The best example of this: from a JSON perspective (JSON ECMA-404), the Map type does not exist. Programming languages vary widely on whether they support objects, and if so, what characteristics and constraints the objects offer. The models of object systems can be wildly divergent and are continuing to evolve. JSON instead provides a simple notation for expressing collections of name/value pairs. Most programming languages will have some feature for representing such collections, which can go by names like record, struct, dict, map, hash, or object. The following built-in collection functions are supported in Hive:
int size(Map<K.V>) - Returns the number of elements in the map type.
array<K> map_keys(Map<K.V>) - Returns an unordered array containing the keys of the input map.
array<V> map_values(Map<K.V>) - Returns an unordered array containing the values of the input map.
Are they supported in the RDBMS? The answer is NO, but it may be YES if using APEX, PL/SQL or Java programs. In the same way, there is also a difference between Impala and Hive: lateral views. In CDH 5.5 / Impala 2.3 and higher, Impala supports queries on complex types (STRUCT, ARRAY, or MAP), using join notation rather than the EXPLODE() keyword. See Complex Types (CDH 5.5 or higher only) for details about Impala support for complex types. The Impala complex type support produces result sets with all scalar values, and the scalar components of complex types can be used with all SQL clauses, such as GROUP BY, ORDER BY, all kinds of joins, subqueries, and inline views. The ability to process complex type data entirely in SQL reduces the need to write application-specific code in Java or other programming languages to deconstruct the underlying data structures. Best practices We would advise taking a conservative approach. This is because the mappings between the NoSQL data model, the Hive data model, and the Oracle RDBMS data model are not 1-to-1. For example, the NoSQL data model is quite rich and there are many things one can do with nested classes in NoSQL that have no counterpart in either Hive or Oracle Database (or both). As a result, integration of the three technologies had to take a 'least-common-denominator' approach, employing mechanisms common to all three. 
Edit But let me show a sample Impala code `phoneinfo` map<string,string> impala> SELECT ZIPCODE ,LASTNAME ,FIRSTNAME ,SSN ,GENDER ,PHONEINFO.* FROM rmvtable_hive_parquet, rmvtable_hive_parquet.PHONEINFO AS PHONEINFO WHERE zipcode = '02610' AND lastname = 'ACEVEDO' AND firstname = 'TAMMY' AND ssn = 576228946 ; +---------+----------+-----------+-----------+--------+------+--------------+ | zipcode | lastname | firstname | ssn | gender | KEY | VALUE | +---------+----------+-----------+-----------+--------+------+--------------+ | 02610 | ACEVEDO | TAMMY | 576228946 | female | WORK | 617-656-9208 | | 02610 | ACEVEDO | TAMMY | 576228946 | female | cell | 408-656-2016 | | 02610 | ACEVEDO | TAMMY | 576228946 | female | home | 213-879-2134 | +---------+----------+-----------+-----------+--------+------+--------------+ Oracle code: `phoneinfo` IS JSON SQL> SELECT /*+ MONITOR */ a.json_column.zipcode ,a.json_column.lastname ,a.json_column.firstname ,a.json_column.ssn ,a.json_column.gender ,a.json_column.phoneinfo FROM pmt_rmvtable_hive_json_api a WHERE a.json_column.zipcode = '02610' AND a.json_column.lastname = 'ACEVEDO' AND a.json_column.firstname = 'TAMMY' AND a.json_column.ssn = 576228946 ; ZIPCODE : 02610 LASTNAME : ACEVEDO FIRSTNAME : TAMMY SSN : 576228946 GENDER : female PHONEINFO :{"work":"617-656-9208","cell":"408-656-2016","home":"213-879-2134"} QUESTION : How to transform this JSON - PHONEINFO in two “arrays” keys, values- Map behavior expected. Unfortunately, the nested path JSON_TABLE operator is only available for JSON ARRAYS. In the other side, when using JSON, we can access to each field as columns. SQL> SELECT /*+ MONITOR */ ZIPCODE ,LASTNAME ,FIRSTNAME ,SSN ,GENDER ,LICENSE ,a.PHONEINFO.work ,a.PHONEINFO.home ,a.PHONEINFO.cell FROM pmt_rmvtable_hive_orc a WHERE zipcode = '02610' AND lastname = 'ACEVEDO' AND firstname = 'TAMMY' AND ssn = 576228946; ZIPCODE LASTNAME FIRSTNAME SSN GENDER LICENSE WORK HOME CELL -------------------- -------------------- -------------------- ---------- -------------------- ------------------ --------------- --------------- --------------- 02610 ACEVEDO TAMMY 576228946 female 533933353734363933 617-656-9208 213-879-2134 408-656-2016 and what about using map columns on the where clause Looking for a specific phone number Impala code `phoneinfo` map<string,string> SELECT ZIPCODE ,LASTNAME ,FIRSTNAME ,SSN ,GENDER ,PHONEINFO.* FROM rmvtable_hive_parquet, rmvtable_hive_parquet.PHONEINFO AS PHONEINFO WHERE PHONEINFO.key = 'work' AND PHONEINFO.value = '617-656-9208' ; +---------+------------+-----------+-----------+--------+------+--------------+ | zipcode | lastname | firstname | ssn | gender | KEY | VALUE | +---------+------------+-----------+-----------+--------+------+--------------+ | 89878 | ANDREWS | JEREMY | 848834686 | male | WORK | 617-656-9208 | | 00183 | GRIFFIN | JUSTIN | 976396720 | male | WORK | 617-656-9208 | | 02979 | MORGAN | BONNIE | 904775071 | female | WORK | 617-656-9208 | | 14462 | MCLAUGHLIN | BRIAN | 253990562 | male | WORK | 617-656-9208 | | 83193 | BUSH | JANICE | 843046328 | female | WORK | 617-656-9208 | | 57300 | PAUL | JASON | 655837757 | male | WORK | 617-656-9208 | | 92762 | NOLAN | LINDA | 270271902 | female | WORK | 617-656-9208 | | 14057 | GIBSON | GREGORY | 345334831 | male | WORK | 617-656-9208 | | 04336 | SAUNDERS | MATTHEW | 180588967 | male | WORK | 617-656-9208 | ... 
| 23993 | VEGA | JEREMY | 123967808 | male | WORK | 617-656-9208 | +---------+------------+-----------+-----------+--------+------+--------------+ Fetched 852 ROW(s) IN 99.80s But let me continue showing the same code on Oracle (querying on work phone). Oracle code `phoneinfo` IS JSON SELECT /*+ MONITOR */ ZIPCODE ,LASTNAME ,FIRSTNAME ,SSN ,GENDER ,PHONEINFO FROM pmt_rmvtable_hive_parquet a WHERE JSON_QUERY("A"."PHONEINFO" FORMAT JSON , '$.work' RETURNING VARCHAR2(4000) ASIS WITHOUT ARRAY WRAPPER NULL ON ERROR)='617-656-9208' ; 35330 SIMS DOUGLAS 295204437 male {"work":"617-656-9208","cell":"901-656-9237","home":"303-804-7540"} 43466 KIM GLORIA 358875034 female {"work":"617-656-9208","cell":"978-804-8373","home":"415-234-2176"} 67056 REEVES PAUL 538254872 male {"work":"617-656-9208","cell":"603-234-2730","home":"617-804-1330"} 07492 GLOVER ALBERT 919913658 male {"work":"617-656-9208","cell":"901-656-2562","home":"303-804-9784"} 20815 ERICKSON REBECCA 912769190 female {"work":"617-656-9208","cell":"978-656-0517","home":"978-541-0065"} 48250 KNOWLES NANCY 325157978 female {"work":"617-656-9208","cell":"901-351-7476","home":"213-234-8287"} 48250 VELEZ RUSSELL 408064553 male {"work":"617-656-9208","cell":"978-227-2172","home":"901-630-7787"} 43595 HALL BRANDON 658275487 male {"work":"617-656-9208","cell":"901-351-6168","home":"213-227-4413"} 77100 STEPHENSON ALBERT 865468261 male {"work":"617-656-9208","cell":"408-227-4167","home":"408-879-1270"} 852 ROWS selected. Elapsed: 00:05:29.56 In this case, we can also use the dot-notation A.PHONEINFO.work = '617-656-9208' Note: for make familiar with Database JSON API you may use follow blog series: https://blogs.oracle.com/jsondb
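As a possible workaround for the map_keys/map_values question raised above (this is my own sketch, not one of Dario's original examples), you can at least project the known keys of the phoneinfo JSON object into relational columns with JSON_TABLE, reusing the pmt_rmvtable_hive_orc table from the earlier examples:
SELECT t.zipcode, t.lastname, t.firstname,
       jt.work_phone, jt.home_phone, jt.cell_phone
FROM pmt_rmvtable_hive_orc t,
     JSON_TABLE(t.phoneinfo, '$'
       COLUMNS (work_phone VARCHAR2(20) PATH '$.work',
                home_phone VARCHAR2(20) PATH '$.home',
                cell_phone VARCHAR2(20) PATH '$.cell')) jt
WHERE t.zipcode = '02610'
  AND t.lastname = 'ACEVEDO';
Note that this still requires you to know the key names in advance, so it does not give you the generic map semantics that Hive and Impala offer - which is exactly the least-common-denominator caveat described above.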


Big Data SQL

Big Data SQL Quick Start. Custom SerDe – Part 20

Big Data SQL Quick Start. Custom SerDe – Part 20 Many thanks to Bilal Ibdah, who is the actual author of this content; I'm just publishing it in the Big Data SQL blog. A modernized data warehouse is a data warehouse augmented with insights and data from a Big Data environment, typically Hadoop. Now, rather than moving and pushing the Hadoop data to a database, companies tend to expose this data through a unified layer that allows access to all data storage platforms - Hadoop, Oracle DB and NoSQL, to be more specific. The problem arises when the data that we want to expose is stored in its native format and at the lowest granularity possible, for example packet data, which can be in a binary format (PCAP). A typical use of packet data is in the telecommunications industry, where this data is generated from a packet core and can contain raw data records, known in the telecom industry as XDRs. Here is an example of a traditional architecture, where source data is loaded into mediation, after which TEXT (CSV) files are parsed by some ETL engine, which then loads the data into the database: And here is an alternative architecture, where you load the data directly into HDFS (which is part of your logical data warehouse) and then parse it on the fly while the SQL runs: In this blog we’re going to use Oracle Big Data SQL to expose and access raw data stored in PCAP format living in Hadoop. The first step is to store the PCAP files in HDFS using the “copyFromLocal” command. This is what the pcap file looks like in HDFS: In order to expose this file using Big Data SQL, we need to register it in the Hadoop metastore. Once it's registered in the metastore, Big Data SQL can access the metadata, create an external table, and run pure Oracle SQL queries on the file, but registering this file requires unlocking the content using a custom SerDe - more details here. Start by downloading the PCAP project from GitHub here; the project contains two components: the hadoop-pcap-lib, which can be used in MapReduce jobs, and the hadoop-pcap-serde, which can be used to query PCAPs in Hive. For this blog, we will only use the serde component. If the serde project hasn’t been compiled, compile it in an IDE or in a cmd window using the command “mvn package -e -X”. Copy the output jar named “hadoop-pcap-serde-1.1-SNAPSHOT-jar-with-dependencies.jar” found in the target folder to each node in your Hadoop cluster. Then add the pcap serde to the HIVE environment variables through Cloudera Manager, save the changes and restart Hive (you might also need to redeploy the configuration and restart the stale services). Now let’s create a Hive table and test the serde; copy the below to create a Hive table:
DROP table pcap;
ADD JAR hadoop-pcap-serde-0.1-jar-with-dependencies.jar;
SET net.ripe.hadoop.pcap.io.reader.class=net.ripe.hadoop.pcap.DnsPcapReader;
CREATE EXTERNAL TABLE pcap (ts bigint, ts_usec string, protocol string, src string, src_port int, dst string, dst_port int, len int, ttl int, dns_queryid int, dns_flags string, dns_opcode string, dns_rcode string, dns_question string, dns_answer array<string>, dns_authority array<string>, dns_additional array<string>)
ROW FORMAT SERDE 'net.ripe.hadoop.pcap.serde.PcapDeserializer'
STORED AS INPUTFORMAT 'net.ripe.hadoop.pcap.io.PcapInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION 'hdfs:///user/oracle/pcap/';
Now it’s time to test the serde on Hive; let’s run the below query:
select * from pcap limit 5;
The query ran successfully. Another illustrative query is sketched below. 
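Since the pcap table is now queryable, here is one more illustrative HiveQL query against the columns defined in the DDL above. It is a sketch of my own (assuming the DnsPcapReader reports the protocol value as 'UDP'), not part of Bilal's original steps:
-- top source addresses sending DNS traffic, using the columns defined above
SELECT src, count(*) AS packets
FROM pcap
WHERE protocol = 'UDP' AND dst_port = 53
GROUP BY src
ORDER BY packets DESC
LIMIT 10;
The point is simply that once the SerDe unlocks the binary PCAP content, ordinary SQL aggregations work on it like on any other Hive table.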
Next we will create an Oracle external table that points to the pcap file using Big Data SQL. For this purpose we need to add the PCAP serde file to the Big Data SQL environment variables (this must be done on each node in your Hadoop cluster):
Create a directory on each server in the Oracle Big Data Appliance, such as “/home/oracle/pcapserde/”.
Copy the serde jar to each node in your Big Data Appliance.
Browse to /opt/oracle/bigdatasql/bdcell-12.1.
Add the pcap jar file to the environment variables list in the configuration file “bigdata.properties”.
The class also needs to be updated in the bigdata.properties file on the database nodes. First we need to copy the jar to the database nodes: copy the jar to the DB side, add the jar to the class path, create the DB external table and run a query, and restart the “bdsql” service in Cloudera Manager. After this we are good to define the external table in the Oracle RDBMS and query it! Just in case, I will highlight that in this last query we query (read: parse and query) binary data on the fly. A sketch of what that external table could look like follows.
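For completeness, here is a sketch of what such an external table could look like, following the same ORACLE_HIVE pattern used in the other Big Data SQL posts on this blog. The cluster name and Hive schema are placeholders, and I've only mapped a subset of the pcap columns, so treat this as an illustration rather than the exact DDL from the original post:
CREATE TABLE pcap_bds (
  ts         NUMBER,
  protocol   VARCHAR2(10),
  src        VARCHAR2(64),
  src_port   NUMBER,
  dst        VARCHAR2(64),
  dst_port   NUMBER,
  len        NUMBER,
  ttl        NUMBER
)
ORGANIZATION EXTERNAL (
  TYPE ORACLE_HIVE
  DEFAULT DIRECTORY DEFAULT_DIR
  ACCESS PARAMETERS (
    com.oracle.bigdata.cluster=yourcluster
    com.oracle.bigdata.tablename=default.pcap
  )
);
Once this is in place, a plain SELECT against pcap_bds makes Big Data SQL parse the binary PCAP blocks on the fly through the Hive SerDe and return ordinary relational rows.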


Data Warehousing

Connecting Apache Zeppelin to your Oracle Data Warehouse

In my last post I provided an overview of the Apache Zeppelin open source project, which is a new style of application called a “notebook”. These notebook applications typically run within your browser, so as an end user there is no desktop software to download and install. Interestingly, I had a very quick response to that article asking how to set up a connection within Zeppelin to an Oracle Database. Therefore, in this post I am going to look at how you can install the Zeppelin server and create a connection to your Oracle data warehouse. The aim of this post is to walk you through the following topics:

Installing Zeppelin
Configuring Zeppelin
What is an interpreter
Finding and installing the Oracle JDBC drivers
Setting up a connection to an Oracle PDB

Firstly, a quick warning! There are a couple of different versions of Zeppelin available for download. At the moment I am primarily using version 0.6.2, which works really well. Currently, for some reason, I am seeing performance problems with the latest iterations around version 0.7.x. I have discussed this with a few people here at Oracle and we are all seeing the same behaviour - queries will run, they just take 2-3 minutes longer for some unknown reason compared with earlier, pre-0.7.x versions of Zeppelin. In the interests of completeness, in this post I will cover setting up a 0.6.2 instance of Zeppelin as well as a 0.7.1 instance.

Installing Zeppelin

The first thing you need to decide is where to install the Zeppelin software. You can run it on your own PC, on a separate server, or on the same server that is running your Oracle Database. I run all my Linux-based database environments within VirtualBox images, so I always install onto the same virtual machine as my Oracle database - it makes life easier for moving demos around when I am heading off to a user conference. Step two is to download the software. The download page is here: https://zeppelin.apache.org/download.html. Simply pick the version you want to run and download the corresponding compressed file - my recommendation, based on my experience, is to stick with version 0.6.2, which was released on Oct 15, 2016. I always download the full application - “Binary package with all interpreters” - just to make life easy; it also gives me access to the full range of connection options which, as you will discover in my next post, is extremely useful.

Installing Zeppelin - Version 0.6.2

After downloading the zeppelin-0.6.2-bin-all.tgz file onto my Linux VirtualBox machine I simply expand the file to create a “zeppelin-0.6.2-bin-all” directory. The resulting directory structure looks like this: Of course you can rename the folder to something more meaningful, such as “my-zeppelin”, if you wish... obviously, the underlying folder structure remains the same!

Installing Zeppelin - Version 0.7.x

The good news is that if you want to install one of the later versions of Zeppelin then the download and unzip process is exactly the same. At this point in time there are two versions of 0.7; however, both 0.7.0 and 0.7.1 seem to suffer from poor query performance when using the JDBC driver (I have only tested the JDBC driver against Oracle Database, but I presume the same performance issues affect other types of JDBC-related connections). As with the previous version of Zeppelin you can, if required, change the default directory name to something more suitable. Now we have our notebook software unpacked and ready to go!
Configuring Zeppelin (0.6.2 and 0.7.x)

This next step is optional. If you have installed the Zeppelin software on the same server or virtual environment that runs your Oracle Database then you will need to tweak the default configuration settings to ensure there are no clashes with the various Oracle Database services. By default, you access the Zeppelin notebook home page via port 8080. Depending on your database environment this may or may not cause problems. In my case, this port was already being used by APEX, therefore it was necessary to change the default port...

Configuring the Zeppelin HTTP port

If you look inside the “conf” directory there will be a file named “zeppelin-site.xml.template”; rename this to “zeppelin-site.xml”. Find the following block of tags:

<property>
<name>zeppelin.server.port</name>
<value>8080</value>
<description>Server port.</description>
</property>

The default port setting in the conf file will probably clash with the APEX environment in your Oracle Database. Therefore, you will need to change the port setting to another value, such as:

<property>
<name>zeppelin.server.port</name>
<value>7081</value>
<description>Server port.</description>
</property>

Save the file and we are ready to go! It is worth spending some time reviewing the other settings within the conf file that let you use cloud storage services, such as the Oracle Bare Metal Cloud Object Storage service. For my purposes I was happy to accept the default storage locations for managing my notebooks, and I have not tried to configure the use of an SSL service to manage client authentication. Obviously, there is a lot more work that I need to do around the basic setup and configuration procedures, which hopefully I will be able to explore at some point - watch this space! OK, now we have everything in place: software, check... port configuration, check. It’s time to start your engine!

Starting Zeppelin

This is the easy part. Within the bin directory there is a shell script to run the Zeppelin daemon:

. ../my-zeppelin/bin/zeppelin-daemon.sh start

There is a long list of command line environment settings that you can use, see here: https://zeppelin.apache.org/docs/0.6.2/install/install.html. In my VirtualBox environment I found it useful to configure the following settings:

ZEPPELIN_MEM: the amount of memory available to Zeppelin. The default setting is -Xmx1024m -XX:MaxPermSize=512m
ZEPPELIN_INTP_MEM: the amount of memory available to the Zeppelin interpreter (connection) engine; the default setting is derived from the setting of ZEPPELIN_MEM
ZEPPELIN_JAVA_OPTS: simply lists any additional JVM options

Therefore, my startup script looks like this:

set ZEPPELIN_MEM=-Xms1024m -Xmx4096m -XX:MaxPermSize=2048m
set ZEPPELIN_INTP_MEM=-Xms1024m -Xmx4096m -XX:MaxPermSize=2048m
set ZEPPELIN_JAVA_OPTS="-Dspark.executor.memory=8g -Dspark.cores.max=16"
. ../my-zeppelin/bin/zeppelin-daemon.sh start

Fingers crossed, once Zeppelin has started the following message should appear on your command line:

Zeppelin start                                             [  OK  ]

Connecting to Zeppelin

Everything should now be in place to test whether your Zeppelin environment is up and running. Open a browser and type the IP address/host name and port reference, which in my case is http://localhost:7081/#/, and the home page should appear: The landing page interface is nice and simple. In the top right-hand corner you will see a green light, which tells you that the Zeppelin service is up and running.
“anonymous” is my user id because I have not enabled client-side authentication. In the main section of the welcome screen you will see links to the help system and the community pages, which is where you can log any issues that you find. The Notebook section is where all the work is done and this is where I am going to spend the next post exploring in some detail. If you are used to using a normal BI tool then Zeppelin (along with most other notebook applications) will take some getting used to, because creating reports follows more of a scripting-style process rather than the wizard-driven click-click process you get with products like Oracle Business Intelligence. Anyway, more on this later.

What is an Interpreter?

To build notebooks in Zeppelin you need to make connections to your data sources. This is done using something called an “interpreter”. This is a plug-in which enables Zeppelin to use not only a specific query language but also provides access to backend data-processing capabilities. For example, it is possible to include shell scripting code within a Zeppelin notebook by using the %sh interpreter. To access an Oracle Database we use the JDBC interpreter. Obviously, you might want to have lots of different JDBC-based connections - maybe you have an Oracle 11g instance, a 12c R1 instance and a 12c R2 instance. Zeppelin allows you to create new interpreters and define their connection characteristics. It’s at this point that version 0.6.2 and versions 0.7.x diverge. Each has its own setup and configuration process for interpreters, so I will explain the process for each version separately. Firstly, we need to track down some JDBC files...

Configuring your JDBC files

Finally, we have reached the point of this post - connecting Zeppelin to your Oracle data warehouse. But before we dive into setting up connections we need to track down some Oracle-specific JDBC files. You will need to locate one of the following files to use with Zeppelin: ojdbc7.jar (Database 12c Release 1) or ojdbc8.jar (Database 12c Release 2). You can either copy the relevant file to your Zeppelin server or simply point the Zeppelin interpreter to the relevant directory. My preference is to keep everything contained within the Zeppelin folder structure, so I have taken my Oracle JDBC files and moved them to my Zeppelin server. If you want to find the JDBC files that come with your database version then you need to find the jdbc folder within your version-specific folder. In my 12c Release 2 environment this was located in the folder shown below: Alternatively, I could have copied the files from my local SQL Developer installation: take the JDBC file(s) and copy them to the /interpreter/jdbc directory within your Zeppelin installation directory, as shown below:

Creating an Interpreter for Oracle Database

At last we are finally ready to create a connection to our Oracle Database! Make a note of the directory containing the Oracle JDBC file because you will need that information during the configuration process. There is a difference between the versions of Zeppelin in terms of creating a connection to an Oracle database/PDB. Personally, I think the process in version 0.7.x makes more sense, but the performance of JDBC is truly dreadful for some reason. There has obviously been a major change of approach in terms of how connections are managed within Zeppelin and this seems to be causing a few issues.
Digging around in the documentation it would appear that the 0.8.x version will be available shortly, so I am hoping the version 0.7.x connection issues will be resolved!

Process for creating a connection using version 0.6.2

Starting from the home page (just click on the word “Zeppelin” in the top left corner of your browser, or open a new window and connect to http://localhost:7081/#/), click on the username “anonymous” which will reveal a pulldown menu. Select “Interpreter” as shown below: this will take you to the home page for managing your connections, or interpreters. Each query language and data processing language has its own interpreter and these are all listed in alphabetical order. Scroll down until you find the entry for jdbc: here you will see that the jdbc interpreter is already configured for two separate connections: postgres and hive. By clicking on the “edit” button on the right-hand side we can add new connection attributes; in this case I have removed the hive and postgres attributes and added the new attributes:

osql.driver
osql.password
osql.url
osql.user

The significance of the “osql.” prefix will become obvious when we start to build our notebooks - essentially this will be our reference to these specific connection details. I have added a dependency by including an artifact that points to the location of my JDBC file. In the screenshot below you will see that I am connecting to the example sales history schema owned by user sh, password sh, which I have installed in my pluggable database dw2pdb2. The listener port for my JDBC connection is 1521. If you have access to SQL Developer then an easy way to test your connection details is to set up a new connection and run the test connection routine. If SQL Developer connects to your database/PDB using your JDBC connection string then Zeppelin should also be able to connect successfully. FYI... error messages in Zeppelin are usually messy, long listings of a Java stack trace, and it is not easy to work out where the problem actually originates. Therefore, the more you can test outside of Zeppelin the easier life will be - at least that is what I have found! Below is my enhanced configuration for the jdbc interpreter: The default.driver is simply the entry point into the Oracle JDBC driver, which is oracle.jdbc.driver.OracleDriver. The last task is to add an artifact that points to the location of the Oracle JDBC file. In this case I have pointed to the 12c Release 1 driver stored in the ../zeppelin/interpreter/jdbc folder.
Process for creating a connection using version 0.7.x

As before, starting from the home page (just click on the word “Zeppelin” in the top left corner of your browser, or open a new window and connect to http://localhost:7081/#/), click on the username “anonymous” which will reveal a pulldown menu, shown below: with versions 0.7.0 and 0.7.1 we need to actually create a new interpreter, therefore just click on the “+Create” button: this will bring up the “Create new interpreter” form that allows you to define the attributes for the new interpreter. I will name my new interpreter “osql” and assign it to the JDBC group: this will pre-populate the form with the default attributes needed to define a JDBC-type connection, such as:

default.driver: driver entry point into the Oracle JDBC driver
default.password: Oracle user password
default.url: JDBC connection string to access the Oracle database/PDB
default.user: Oracle username

The initial form will look like this: and in my case I need to connect to a PDB called dw2pdb2 on the same server, accessed via the listener port 1521; the username is sh and the password is sh. The only non-obvious entry is the default.driver, which is oracle.jdbc.driver.OracleDriver. As before, the last task is to add an artifact that points to the location of the Oracle JDBC file. In this case I have pointed to the 12c Release 2 driver stored in the ../zeppelin/interpreter/jdbc folder. Once you have entered the configuration settings, hit Save and your form should look like this:

Testing your new interpreter

To test that your interpreter will successfully connect to your database/PDB and run a SQL statement we need to create a new notebook. Go back to the home page and click on the “Create new note” link in the list on the left side of the screen. Enter a name for your new note, which will bring you to the notebook screen where you write your scripts - in this case SQL statements. This is similar in layout and approach to many worksheet-based tools (SQL Developer, APEX SQL Worksheet, etc.). If you are using version 0.6.x of Zeppelin then you can bypass the following... If you are using version 0.7.x then we have to bind our SQL interpreter (osql) to this new note, which will allow us to run SQL commands against the sh schema. To add the osql interpreter simply click on the gear icon in the top right-hand side of the screen: this will then show you the list of interpreters which are available to this new note. You can switch interpreters on and off by clicking on them, and for this example I have reduced the list to just the following: markup (md), shell scripting (sh), file management (file), our Oracle SH PDB connection (osql) and jdbc connections (jdbc). Once you are done, click on the “Save” button to return to the note. I will explain the layout of the note interface in my next post. For the purposes of testing the connection to my PDB I need to use the “osql” interpreter and give it a SQL statement to run. This is two lines of code, as shown here. On the right side of the screen there is a triangle icon which will execute, or “Run”, my SQL statement:

SELECT sysdate FROM dual

Note that I have not included a semi-colon (;) at the end of the SQL statement! In version 0.6.2, if you include the semi-colon (;) you will get a Java error. Version 0.7.x is a little more tolerant and does not object to having or not having a semi-colon (;).
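If you want something a little more interesting than sysdate to prove the connection is working, a hedged sketch of a notebook paragraph against the sample sales history schema might look like the one below. The sales and products tables and their columns come from the standard SH sample schema mentioned earlier; adjust the names if your schema differs.

%osql
-- no trailing semi-colon, as noted above for version 0.6.2
SELECT p.prod_category, SUM(s.amount_sold) AS total_sales
FROM   sales s, products p
WHERE  s.prod_id = p.prod_id
GROUP  BY p.prod_category
ORDER  BY total_sales DESC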
Using my VirtualBox environment, the first time I make a connection to execute a SQL statement the query takes 2-3 minutes to establish the connection to my PDB and then run the query. This is true even for simple queries such as SELECT * FROM dual. Once the first query has completed, all subsequent queries run in the normal expected timeframe (i.e. around the same time as executing the query from within SQL Developer). Eventually, the result will be displayed. By default, output is shown in a tabular layout (as you can see from the list of available icons, graph-based layouts are also available)... and we have now established that the connection to our SH schema is working.

Summary

In this post we have covered the following topics:

How to install Zeppelin
How to configure and start Zeppelin
Finding and installing the correct Oracle JDBC drivers
Setting up a connection to an Oracle PDB and testing the connection

As we have seen during this post, there are some key differences between the 0.6.x and 0.7.x versions of Zeppelin in terms of the way interpreters (connections) are defined. We now have a fully working environment: Zeppelin connected to my Oracle 12c Release 2 PDB, which includes the sales history sample schema. Therefore, in my next post I am going to look at how you can use the powerful notebook interface to access remote data files, load data into a schema, create both tabular and graph-based reports, build briefing books and even design simple dashboards. Stay tuned for more information about how to use Zeppelin with Oracle Database. If you are already using Zeppelin against your Oracle Database and would like to share your experiences that would be great - please use the comments feature below or feel free to send me an email: keith.laker@oracle.com. (Image at top of post is courtesy of Wikipedia.)


New Look Blog - Site-Under-Construction

Welcome to our new look blog. We are currently in the process of moving all of our blog posts from the old blogging platform to our completely new Oracle blogging platform. As the title of this post suggests, we are having some teething issues... what this all means is that we are currently in "site-under-construction" mode. This new platform offers significant improvements over the old blogging software: 1) posts will display correctly on any size of screen, so now you can read our blogs on your smartphone as well as your desktop browser, 2) simpler interaction with social media pages means it's now much less painful for us to push content to you via all the usual social media channels, and 3) improved RSS feed capabilities. While we get used to this new software, which has a lot of great features for us as writers, please cut us some slack over any layout, content and formatting issues. Note that at the moment we are still working through the migration process, which means that a lot of our posts look quite ugly. We are working hard to go through our old posts and hope to get the formatting issues affecting them resolved as soon as possible. Enjoy our new blogging platform and please let us know what you think about the new-look blog. All feedback gratefully received.


Functionality

The latest in Oracle Partitioning - Part 3: Auto List Partitioning

This is the third blog about new partitioning functionality in Oracle Database 12c Release 2. It's about the new and exciting auto-list partitioning, an extension to list partitioning. And yes, it works for both "old" single column list and the new multi column list partitioned tables ... As the name already suggests, something is done automatically: the database creates a new partition whenever it sees a partition key value that is not covered by an existing partition. That's it. So conceptually auto-list partitioning is a similar extension to list partitioning as interval partitioning is to range partitioning. But there are a couple of subtle differences we will discuss later. Stay on the tour. So let's start and create our first simple auto-list partitioned table and see what happens:

CREATE TABLE alp (col1 NUMBER, col2 NUMBER)
PARTITION BY LIST (col1) AUTOMATIC
( PARTITION p1 VALUES (1,2,3,4),
  PARTITION p2 VALUES (5));

The first thing you'll see is that the syntax does not look really different. There is only one new keyword, named 'AUTOMATIC'. That's it. As you can also see with this first little example, the fact that we use the new automatic functionality to create new partitions for every new partition key value does not preclude us from having multiple partition key values stored in a single partition. That's somewhat similar to interval partitioning, where you can have different ranges for partitions in the range section (in contrast to the equi-width partitions in the interval section). Let's now quickly look at the metadata of this newly created table before we see the automatic partition creation in action.

SQL> SELECT table_name, partitioning_type, autolist, partition_count FROM user_part_tables WHERE table_name='ALP';

TABLE_NAME                     PARTITIONING_TYPE              AUTOLIST   PARTITION_COUNT
------------------------------ ------------------------------ ---------- ---------------
ALP                            LIST                           YES                      2

As you can see, we have a new metadata flag for this extension of list partitioning, but it is still a list partitioned table. There is also one subtle difference to interval partitioning: you will note that the partition count is not "a million", unlike what you would see for an interval partitioned table. You see the absolute number of partitions created at that point in time, and not the theoretical maximum of 1048575 (1024^2 - 1), roughly "a million". Hmm, so what's the difference here? Well, an interval partitioned table is defined by having a fixed interval definition for all range-based partitions in the future. So by knowing an absolute partition bound, the number of existing partitions, and the interval, all future partitions are mathematically pre-defined. For example, with a numeric interval of 1 and one existing partition with a 'values less than (2)' boundary (partition number one), we know that a partition key value of 99 would end up in partition number 99; we would also know that "one million" is the highest number we can enter as a partition key value. With auto-list, on the other hand, we do not know (or care) what partition key values will be added to the table; there's nothing pre-defined, and Oracle takes this 'one partition at a time' until we reach the maximum of "one million". Ok, enough theory.
Let's see the new functionality in action by adding a record with a new partition key value:

INSERT INTO alp VALUES (999,999);
COMMIT;

You can see that the kernel created a new partition for the new partition key value of 999:

SQL> SELECT partition_name, high_value FROM user_tab_partitions WHERE table_name='ALP' ORDER BY partition_position;

PARTITION_NAME                 HIGH_VALUE
------------------------------ ------------------------------
P1                             1, 2, 3, 4
P2                             5
SYS_P2581                      999

Now back to another subtle difference between interval partitioning and auto-list partitioning. With interval partitioning you cannot ADD a new partition, because all possible partitions are mathematically pre-defined by the interval definition, as discussed earlier. With auto-list there is no pre-creation per se and no pre-defined partition key values. So just like the kernel will add new partitions for every new partition key value, you can do so, too. The use of the new auto-list partitioning does not limit any partition maintenance operation for such a table. Let's do this now.

ALTER TABLE alp ADD PARTITION pnew VALUES (10,11,12);

Voila, we can manually add partitions, as we can see in the data dictionary (as a side note: the partition position is derived from the order of creation, not the partition key values):

SQL> SELECT partition_name, high_value FROM user_tab_partitions WHERE table_name='ALP' ORDER BY partition_position;

PARTITION_NAME                 HIGH_VALUE
------------------------------ ------------------------------
P1                             1, 2, 3, 4
P2                             5
SYS_P2581                      999
PNEW                           10, 11, 12

So an auto-list partitioned table is almost identical to a manually list partitioned table. There is ultimately only ONE difference from a partition setup and maintenance perspective: an auto-list partitioned table must not have a DEFAULT partition. Conceptually, the functionality of auto-list and having a DEFAULT partition are contradictory and mutually exclusive:

A DEFAULT partition acts as a catch-all partition. Any new partition key value that is not explicitly covered by an existing partition will be stored in this catch-all partition.
An auto-list partitioned table will create a new partition for any new partition key value that is not explicitly covered by an existing partition. Any new partition key value will get its own partition.

One important capability that auto-list partitioning has in common with interval partitioning is that tables can evolve. List partitioned tables, single and multi-column ones, can be evolved into auto-list partitioned tables as long as they do not have a DEFAULT partition, as discussed previously. Let's do this quickly:

CREATE TABLE mc (col1 NUMBER, col2 NUMBER)
PARTITION BY LIST (col1,col2)
( PARTITION p1 VALUES (1,1),
  PARTITION p2 VALUES (2,2),
  PARTITION p3 VALUES (3,3));

As you can guess, we cannot insert any record with a partition key value combination that is not covered by the existing partitions. We get an error message:

SQL> INSERT INTO mc VALUES (1234,5678);

Error starting at line : 1 in command -
INSERT INTO mc VALUES (1234,5678)
Error report -
ORA-14400: inserted partition key does not map to any partition

We now evolve table MC into an auto-list partitioned table and try it again:

SQL> ALTER TABLE mc SET AUTOMATIC;

Table MC altered.

SQL> INSERT INTO mc VALUES (1234,5678);

1 row inserted.

You can see that the new auto-list functionality has kicked in and has created a new partition for our inserted record, using the inserted data as the new partition key values.
SQL> SELECT partition_name, high_value FROM user_tab_partitions WHERE table_name='MC' ORDER BY partition_position;

PARTITION_NAME                 HIGH_VALUE
------------------------------ ------------------------------
P1                             ( 1, 1 )
P2                             ( 2, 2 )
P3                             ( 3, 3 )
SYS_P2586                      ( 1234, 5678 )

That's about it for now with auto-list partitioning. Feel free to take these little example tables and play around with them a bit more ... for example, try other partition maintenance operations that I did not explicitly cover. I am almost sure that I have forgotten some little details here and there, so please send me any comments, corrections, or questions that you have around Partitioning. They don't have to be specific to the functionality available in the Oracle Database Exadata Express Cloud Service - it can be any topic you want to know more about or want to see covered in a future blog post. You can always reach me at hermann.baer@oracle.com. Another one down, more to come.
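Before moving on, here is a quick illustration of the DEFAULT-partition restriction discussed above. This is a minimal hedged sketch only: the table name is made up and the exact error output is omitted, but a list partitioned table with a DEFAULT partition cannot be evolved with SET AUTOMATIC until that DEFAULT partition is gone.

CREATE TABLE alp_def (col1 NUMBER, col2 NUMBER)
PARTITION BY LIST (col1)
( PARTITION p1   VALUES (1,2,3,4),
  PARTITION pdef VALUES (DEFAULT));

-- expected to fail: auto-list and a DEFAULT partition are mutually exclusive
ALTER TABLE alp_def SET AUTOMATIC;

-- after removing the DEFAULT partition the evolution should succeed
ALTER TABLE alp_def DROP PARTITION pdef;
ALTER TABLE alp_def SET AUTOMATIC;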


Data Warehousing

Using Zeppelin Notebooks with your Oracle Data Warehouse - Part 1

Over the past couple of weeks I have been looking at one of the Apache open source projects called Zeppelin. It’s a new style of application called a “notebook” which typically runs within your browser. The idea behind notebook-style applications like Zeppelin is to deliver an ad hoc data-discovery tool - at least that is how I see it being used. Like most notebook-style applications, Zeppelin provides a number of useful data-discovery features such as:

a simple way to ingest data
access to languages that help with data discovery and data analytics
some basic data visualization tools
a set of collaboration services for sharing notebooks (collections of reports)

Zeppelin is essentially a scripting environment for running ordinary SQL statements along with a lot of other languages such as Spark, Python, Hive, R, etc. These are controlled by a feature called “interpreters” and there is a list of the latest interpreters available here. A good example of a notebook-type application is R Studio, which many of you will be familiar with because we typically use it when demonstrating the R capabilities within Oracle Advanced Analytics. However, R Studio is primarily aimed at data scientists, whilst Apache Zeppelin is aimed at other types of report developers and business users, although it does have a lot of features that data scientists will find useful.

Use Cases

What’s a good use case for Zeppelin? Well, what I like about Zeppelin is that you can quickly and easily create a notebook, or workflow, that downloads a log file from a URL, reformats the data in the file and then displays the resulting data set as a graph/table. Nothing really earth-shattering in that type of workflow, except that Zeppelin is easy to install, it’s easy to set up (once you understand its architecture), and it seems to be easy to share your results. Here’s the really simple workflow described above, which I built to load data from a file, create an external table over the data file and then run a report: This little example shows how notebooks differ from traditional BI tools. Each of the headings in the above image (Download data from web url, Create directory to data file location, Drop existing staging table, etc.) is a separate paragraph within the “Data Access Tutorial” notebook. The real power is that each paragraph can use a different language, such as SQL, Java, shell scripting or Python. In the workbook shown above I start by running a shell script that pulls a data file from a remote server. Then, using a SQL paragraph, I create a directory object to access the data file. The next SQL paragraph drops my existing staging table and the subsequent SQL paragraph creates the external table over the data file. The final SQL paragraph looks like this:

%osql
select * from ext_bank_data

where %osql tells me the language, or interpreter, I am using, which in this case is SQL connecting to a specific schema in my database.

Building Dashboards

You can even build relatively simple briefing books containing data from different data sets and even different data sources (Zeppelin supports an ever-growing number of data sources) - in this case I connected Zeppelin to two different schemas in two different PDBs: What’s really nice is that I can even view these notebooks on my smartphone (iPhone) as you can see below.
The same notebook shown above appears on my iPhone screen in a vertical layout style to make best use of the screen real estate: I am really liking Apache Zeppelin because it’s so simple to set up (I have various versions running on Mac OS X and Oracle Linux) and start. It has just enough features to be very useful without being overwhelming. I like the fact that I can create notebooks, or reports, using a range of different languages and show data from a range of different schemas/PDBs/databases alongside each other. It is also relatively easy to share those results. And I can open my notebooks (reports) on my iPhone.

Visualizations

There is a limited set of available visualizations within the notebook (report) editor when you are using a SQL-based interpreter (connector). Essentially you have a basic, scrollable table and five types of graph to choose from for viewing your data. You can interactively change the layout of the graph by clicking on the “settings” link, but there are no formatting controls to alter the x or y labels - if you look carefully at the right-hand area graph in the first screenshot you will probably spot that the time value labels on the x-axis overlap each other.

Quick Summary

Now, this may be obvious, but I would not call Zeppelin a data integration tool nor a BI tool, for reasons that will become clear during the next set of blog posts. Having said that, overall, Zeppelin is a very exciting and clever product. It is relatively easy to set up connections to your Oracle Database, the scripting framework is very powerful and the visualization features are good enough. It's a new type of application that is just about flexible enough for data scientists, power users and report writers.

What’s next? In my next series of blog posts, which I am aiming to write over the next couple of weeks, I will explain how to download and install Apache Zeppelin, how to set up connections to an Oracle Database and how to use some of the scripting features to build reports similar to the ones above. If you are comfortable with writing your own shell scripts, SQL scripts and markup for formatting text then Zeppelin is a very flexible tool. If you are already using Zeppelin against your Oracle Database and would like to share your experiences that would be great - please use the comments feature below or feel free to send me an email: keith.laker@oracle.com. (Image at top of post is courtesy of Wikipedia.)
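Referring back to the notebook workflow described earlier in this post (create a directory object, drop the staging table, create the external table over the downloaded file), here is a minimal hedged sketch of what those SQL paragraphs might look like. Only the ext_bank_data table name comes from the post; the directory path, file name and column definitions are purely hypothetical placeholders.

-- create a directory object pointing at the downloaded file (path is hypothetical)
CREATE OR REPLACE DIRECTORY stage_dir AS '/home/oracle/stage';

-- drop the existing staging table
DROP TABLE ext_bank_data;

-- recreate it as an external table over the data file (columns are hypothetical)
CREATE TABLE ext_bank_data (
  customer_id  NUMBER,
  region       VARCHAR2(50),
  balance      NUMBER
)
ORGANIZATION EXTERNAL (
  TYPE ORACLE_LOADER
  DEFAULT DIRECTORY stage_dir
  ACCESS PARAMETERS (
    RECORDS DELIMITED BY NEWLINE
    FIELDS TERMINATED BY ','
  )
  LOCATION ('bank_data.csv')
)
REJECT LIMIT UNLIMITED;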


Data Warehousing

The latest in Oracle Partitioning - Part 2: Multi Column List Partitioning

This is the second blog about new partitioning functionality in Oracle Database 12c Release 2, available on-premises for Linux x86-64, Solaris Sparc64, and Solaris x86-64, and for everybody else in the Oracle Cloud. This one will talk about multi column list partitioning, a new partitioning methodology in the family of list partitioning. There will be more for this method coming in a future blog post (how about that for a teaser?). Just like read only partitions, this functionality is rather self-explanatory. Unlike in earlier releases, we can now specify more than one column as partition key column for list partitioned tables, enabling you to model even more business use cases natively with Oracle Partitioning. So let's start off with a very simple example:

CREATE TABLE mc
PARTITION BY LIST (col1, col2)
(PARTITION p1 VALUES ((1,2),(3,4)),
 PARTITION p2 VALUES ((4,5)),
 PARTITION p3 VALUES (DEFAULT))
AS SELECT rownum col1, rownum+1 col2 FROM DUAL CONNECT BY LEVEL <= 10;

Yes, you can have a partitioned table with ten records, although I highly recommend NOT taking this as a best practice for real world environments. Just because you can create partitions - and many of them - you should always bear in mind that partitions come with a "cost" in terms of additional metadata in the data dictionary (and row cache, library cache), and additional work for parsing statements and so forth. Ten records per partition don't cut it. You should always aim for a reasonable amount of data per partition, but that's a topic for another day. When we now look at the metadata of this newly created table you will see the partition value pairs listed as HIGH_VALUE in the partitioning metadata:

SQL> SELECT partition_name, high_value FROM user_tab_partitions WHERE table_name='MC';

PARTITION_NAME                 HIGH_VALUE
------------------------------ ------------------------------
P1                             ( 1, 2 ), ( 3, 4 )
P2                             ( 4, 4 )
P3                             DEFAULT

Now, while I talked about a "new partitioning strategy" a bit earlier, from a metadata perspective it isn't one. For the database metadata it is "only" a functional enhancement for list partitioning: the number of partition key columns is greater than one:

SQL> SELECT table_name, partitioning_type, partitioning_key_count FROM user_part_tables WHERE table_name='MC';

TABLE_NAME                     PARTITION PARTITIONING_KEY_COUNT
------------------------------ --------- ----------------------
MC                             LIST                           2

Let's now look into the data placement using the partition extended syntax and query our newly created table. Using the extended partition syntax is equivalent to specifying a filter predicate that exactly matches the partitioning criteria, and it is an easy way to save some typing. Note that both variants of the partition extended syntax - specifying a partition by name or by pointing to a specific record within a partition - can be used for any partition maintenance operation and also in conjunction with DML.
SQL> SELECT * FROM mc PARTITION (p1);

      COL1       COL2
---------- ----------
         1          2
         3          4

I get exactly the same result when I use the other variant of the partition extended syntax:

SQL> SELECT * FROM mc PARTITION FOR (1,2);

      COL1       COL2
---------- ----------
         1          2
         3          4

Having built a simple multi column list partitioned table with some data, let's do one basic partition maintenance operation, namely a split of the partition P1 that we just looked at. You might remember that this partition has two sets of key pairs as its partition key definition, namely (1,2) and (3,4). We use the new functionality of doing this split in online mode:

SQL> ALTER TABLE mc SPLIT PARTITION p1 INTO (PARTITION p1a VALUES (1,2), PARTITION p1b) ONLINE;

Table MC altered.

Unlike offline partition maintenance operations (PMOPs), which take an exclusive DML lock on the partitions the database is working on (prohibiting any DML change while the PMOP is in flight), an online PMOP does not take any exclusive locks and allows not only queries (like offline operations) but also continuous DML operations while the operation is ongoing. Now that we have done this split, let's check the data containment in our newly created partition P1A:

SQL> SELECT * FROM mc PARTITION (p1a);

      COL1       COL2
---------- ----------
         1          2

That's about it for now for multi column list partitioned tables. I am sure I have forgotten some little details here and there, and this short blog post is probably not answering all the questions you may have. So please stay tuned, and if you have any comments about this specific post or suggestions for future blog posts, then please let me know. You can always reach me at hermann.baer@oracle.com. Another one down, many more to go.
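The post mentions that both variants of the partition extended syntax also work with DML, so here is a small hedged sketch of that usage against the table in its state after the split. The statements are illustrative only and simply touch the rows in partition P1A, then roll the changes back.

-- update only the rows stored in the partition that holds the key pair (1,2)
UPDATE mc PARTITION FOR (1,2) SET col2 = col2 + 10;

-- or reference the same partition by name, here to remove its rows
DELETE FROM mc PARTITION (p1a);

ROLLBACK;  -- undo the illustration so the example data stays intact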


Functionality

The latest in Oracle Partitioning - Part 1: Read Only Partitions

Now that Oracle Database 12c Release 2 is available on-premises for Linux x86-64, Solaris Sparc64, and Solaris x86-64, and on the Oracle Cloud for everybody else - the product we had the pleasure of developing and playing with for quite some time now - it's also time to introduce some of the new functionality in more detail to the broader audience. This blog post will be the first of hopefully many over the course of the next months (time permitting) to specifically highlight individual features of Oracle Partitioning and why we actually implemented them. A big thanks at this point to the user community out there: a lot of the features would not be possible without the continuous feedback from you. Yes, we have a lot of ideas - some better than others - but we could never come up with all these features by ourselves. Keep on going and send us ideas and suggestions! For Partitioning that would be me, hermann.baer@oracle.com, but I am digressing ...

So let's talk about the first feature to highlight: read only partitions. As the name suggests, it is rather self-explanatory functionality: with the new release we can now set an object to read only at the granularity of a partition or subpartition. Prior to this release this was an all-or-nothing attribute at table level. And that's it. Just like other attributes, e.g. compression or tablespace location, you specify READ ONLY (or READ WRITE, the default) at table, partition, or subpartition level and voila, you're done. The simplest example is probably:

CREATE TABLE toto1
PARTITION BY LIST (col1)
(PARTITION p0 VALUES (0) READ ONLY,
 PARTITION p1 VALUES (1) READ WRITE,
 PARTITION p2 VALUES (2))
AS SELECT mod(rownum,2) col1, mod(rownum,10) col2 FROM DUAL CONNECT BY LEVEL < 100;

Voila. We now have a table with 99 rows and three partitions, one of them being read only. Let's now create a second table as a composite partitioned table:

CREATE TABLE toto2
PARTITION BY LIST (col1)
SUBPARTITION BY LIST (col2)
SUBPARTITION TEMPLATE
(SUBPARTITION sp1 VALUES (0) READ ONLY,
 SUBPARTITION sp2 VALUES (DEFAULT))
(PARTITION p0 VALUES (0) READ ONLY,
 PARTITION p1 VALUES (1) READ WRITE,
 PARTITION p2 VALUES (2))
AS SELECT mod(rownum,2) col1, mod(rownum,10) col2 FROM DUAL CONNECT BY LEVEL < 100;

This statement creates a table with three partitions, one of them in read only mode; each partition has two subpartitions, one of them explicitly set to READ ONLY via the subpartition template. Let's now look a little bit closer at how inheritance works here. Note that we specified READ WRITE only for partition p1 and not for partition p2 in both examples. While at first glance this seems to be completely arbitrary from a SQL syntax perspective, there is a subtle difference when it comes to composite partitioned tables (and that difference is true for other attributes as well). We'll come to that in a second. The data dictionary reflects this new attribute in all relevant views: at table level for the chosen default (*_PART_TABLES.DEF_READ_ONLY), at partition level the attribute setting of a partition or the default for its subpartitions (*_TAB_PARTITIONS.READ_ONLY), and at subpartition level the attribute setting of the subpartition (*_TAB_SUBPARTITIONS.READ_ONLY).
So in our case it looks as follows:

SQL> SELECT table_name, def_read_only FROM user_part_tables WHERE table_name IN ('TOTO1','TOTO2');

TABLE_NAME                     DEF_READ_ONLY
------------------------------ ------------------------------
TOTO1                          NO
TOTO2                          NO

SQL> SELECT table_name, partition_name, read_only FROM user_tab_partitions WHERE table_name IN ('TOTO1','TOTO2') ORDER BY 1,2;

TABLE_NAME                     PARTITION_NAME                 READ_ONLY
------------------------------ ------------------------------ ------------------------------
TOTO1                          P0                             YES
TOTO1                          P1                             NO
TOTO1                          P2                             NO
TOTO2                          P0                             YES
TOTO2                          P1                             NO
TOTO2                          P2                             NONE

As mentioned before, the READ ONLY attribute at partition level is overloaded and serves a purpose for both simple partitioned and composite partitioned tables. For partitioned tables it reflects the actual attribute of the partition segment, so you will see that partitions p1 and p2 of table toto1 are set to READ WRITE and partition p0 is set to READ ONLY. However, looking at table toto2 you will see that only partition p1 is set to READ WRITE, while partition p2 shows undefined (NONE). That's because in the case of a composite partitioned table, an attribute at partition level is only set when explicitly specified, in which case it acts as the default for that partition. If a default is set at partition level then it will be used for any new subpartition created underneath that partition, unless an attribute is specified explicitly. If none is set then the table level default will be used. At subpartition level there are no surprises: subpartitions *_sp1 - with an explicit setting of READ ONLY - are set to read only, and subpartitions *_sp2 inherit the attribute from either the partition or the table level.

SQL> SELECT table_name, partition_name, subpartition_name, read_only FROM user_tab_subpartitions WHERE table_name = 'TOTO2' ORDER BY 1,2,3;

TABLE_NAME                     PARTITION_NAME       SUBPARTITION_NAME    READ_ONLY
------------------------------ -------------------- -------------------- --------------------
TOTO2                          P0                   P0_SP1               YES
TOTO2                          P0                   P0_SP2               YES
TOTO2                          P1                   P1_SP1               NO
TOTO2                          P1                   P1_SP2               NO
TOTO2                          P2                   P2_SP1               YES
TOTO2                          P2                   P2_SP2               NO

OK, so far I should not have told you anything new here ... I just wanted to show that the standard inheritance of attributes for partitioned tables works exactly the same way for the new READ ONLY attribute as it does for other ones. So let's have a quick look at what it means to have a partition set to read only. We will use table toto1 only from now on. The first thing is probably the most intuitive behavior: you cannot update data in a read only partition. So when you try to update partition p0 and partition p1 you will see that DML works on the read write partition but throws an error for the read only partition.
SQL> UPDATE toto1 PARTITION (p0) SET col2=col2+1;

Error starting at line : 1 in command -
update toto1 partition (p0) set col2=col2+1
Error report -
ORA-14466: Data in a read-only partition or subpartition cannot be modified.

SQL> UPDATE toto1 PARTITION (p1) SET col2=col2+1;

50 rows updated.

You can change the attribute for a partition, so if you wanted to update partition p0 in table toto1, you would have to set it to read write:

SQL> ALTER TABLE toto1 MODIFY PARTITION p0 READ WRITE;

Table toto1 altered.

SQL> UPDATE toto1 PARTITION (p0) SET col2=col2+1;

49 rows updated.

But let's ignore this one and skip that step, so partition p0 is still read only. The question now is: what operations are allowed? The conceptual rule for read only partitions is that Oracle must not allow any change to the data as it existed at the time a partition was set to read only. Oracle provides immutability for the data as it existed when the partition was set to read only. Or, in more SQL-like words, the result of SELECT <column_list> FROM <table> [SUB]PARTITION <read_only_[sub]partition> must not change under any circumstances. Let's set our partition back to read only again:

SQL> ALTER TABLE toto1 MODIFY PARTITION p0 READ ONLY;

Table toto1 altered.

The first logical consequence is that you cannot drop a read only partition. Conceptually a drop partition removes a subset of data from a table, equivalent to a DELETE FROM <table> WHERE <data_happens_to_be_in_partition>, so this violates the above-mentioned 'must-not-change-data' rule:

SQL> ALTER TABLE toto1 DROP PARTITION p0;

Error starting at line : 1 in command -
alter table toto1 drop partition p0
Error report -
ORA-14466: Data in a read-only partition or subpartition cannot be modified.

Note, however, that you can drop a read only table, as well as a partitioned table with one or more read only partitions. Oops, a bug? Nope, works as designed. Removing an object is different from removing data. The read only attribute is not meant to protect an object; that's what privileges are good for or, if you really want to protect a dedicated object, the locking capabilities of a table: you can disable the table lock. Second, you cannot exchange a read only partition. That changes data in the table as well.

SQL> CREATE TABLE xtoto FOR EXCHANGE WITH TABLE toto1;

Table xtoto created.

SQL> ALTER TABLE toto1 EXCHANGE PARTITION p0 WITH TABLE xtoto;

Error starting at line : 1 in command -
alter table toto1 exchange partition p0 with table xtoto
Error report -
ORA-14466: Data in a read-only partition or subpartition cannot be modified.

You also cannot drop a column of a table with read only partitions, or set columns to unused. Kind of obvious, isn't it? That removes data partially. However, you can set a column to invisible. No data is removed; it is just no longer part of the * notation of the table. After knowing the basics of what you can't do, what is possible? In short, anything that does not violate the above-mentioned rule. So you can split a read only partition, you can merge partitions where one or more of them are read only, and you can move partitions. Pretty much everything else is doable from a partition maintenance perspective (there are some subtleties, like not allowing online partition maintenance operations when read only partitions are involved, or not allowing filtered partition maintenance operations - another new and exciting piece of functionality in 12.2.0.1 and the topic of a later blog - but that's beyond this blog post).
A simple example with our table toto1:

SQL> ALTER TABLE toto1 MERGE PARTITIONS p0, p1 INTO PARTITION p0_1 READ ONLY;

Table toto1 altered.

The most important thing is that you can add columns to a table with read only partitions, irrespective of whether they have default values or not, whether they're nullable or not, or whether the addition of a column might require touching the blocks on disk. All of this is allowed and does not violate the above-mentioned rule of data immutability: the output of the select list as it was at the time the partition was set to read only does not change.

SQL> ALTER TABLE toto1 ADD col_new varchar2(100) DEFAULT 'i am not empty';

Table toto1 altered.

Why do we allow this? The answer is simple. Partitioned tables are 'living beasts' and fundamental components of tens of thousands of applications. These tables never go away (unless an application goes away) and they keep growing. Another given is that applications change over time. If we had decided to disallow the most basic schema evolution for a partitioned table - adding columns - and defined data immutability blindly as "select * must not change", then we would have made this new functionality way too limited to benefit our broad customer base. Any application change involving partitioned tables with read only partitions would have been ruled out unless all read only partitions were first set to read write, the column added, and the formerly read only partitions then set back to their original state ... all while ensuring that the data in these partitions did not change throughout the process. Way too complicated and convoluted. Our decision is based on our customers' interest and the feedback we got throughout the early stages of our beta program. That's about it for now for read only partitions. I am sure I have forgotten some little details here and there, but there are always so many things to talk (write) about that you will never catch it all. But stay tuned. There are more blog posts to come. And please, if you have any comments about this specific one or suggestions for future ones, then please let me know. You can always reach me at hermann.baer@oracle.com. Cheers, over and out.
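As a postscript to the point above about protecting the object itself rather than its data, here is a minimal hedged sketch of the table-lock approach the post mentions. The exact error output is omitted and the sketch is illustrative only.

-- read only protects the data; disabling the table lock protects the object
ALTER TABLE toto1 DISABLE TABLE LOCK;

-- with the table lock disabled, DDL against the table, such as DROP TABLE,
-- is expected to fail
DROP TABLE toto1;

-- re-enable the table lock when the protection is no longer needed
ALTER TABLE toto1 ENABLE TABLE LOCK;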


Data Warehousing

Database 12c Release 2 available for download

Database 12c Release 2 available for download

Yes, it’s the moment the world has been waiting for: the latest generation of the world’s most popular database, Oracle Database 12c Release 2 (12.2), is now available everywhere - in the Cloud and on-premises. You can download this latest version from the database home page on OTN - click on the Downloads tab.

So What’s New in 12.2 for Data Warehousing?

This latest release provides some incredible new features for data warehousing and big data. If you attended last year’s OpenWorld event in San Francisco then you probably already know all about the new features that we have added to 12.2 - check out my blog post from last year for a comprehensive review of #oow16: Blog: The complete review of data warehousing and big data content from Oracle OpenWorld 2016. If you missed OpenWorld, and if you are a data warehouse architect, developer or DBA, then here are the main feature highlights of 12.2, with links to additional content from OpenWorld and my personal data warehouse blog:

General Database Enhancements

1) Partitioning

Partitioning: External Tables. Partitioned external tables provide both the functionality to map partitioned Hive tables into the Oracle Database ecosystem and declarative partitioning on top of any Hadoop Distributed File System (HDFS) based data store.

Partitioning: Auto-List Partitioning. The database automatically creates a separate (new) partition for every distinct partition key value of the table. Auto-list partitioning removes the management burden from DBAs of manually maintaining the partition list for tables with a large number of distinct key values that require individual partitions. It also automatically copes with unplanned partition key values without the need for a DEFAULT partition.

Partitioning: Read-Only Partitions. Partitions and sub-partitions can be individually set to a read-only state. This disables DML operations on these read-only partitions and sub-partitions, and is an extension to the existing read-only table functionality. Read-only partitions and subpartitions enable fine-grained control over DML activity.

Partitioning: Multi-Column List Partitioning. List partitioning functionality is expanded to enable multiple partition key columns. Using multiple columns to define the partitioning criteria for list partitioned tables enables new classes of applications to benefit from partitioning.

For more information about partitioning see: #OOW16 - Oracle Partitioning: Hidden Old Gems and Great New Tricks, by Hermann Baer, Senior Director Product Management; Partitioning home page on OTN.

2) Parallel Execution

Parallel Query Services on RAC Read-Only Nodes. Oracle parallel query services on Oracle RAC read-only nodes represent a scalable parallel data processing architecture. The architecture allows for the distribution of a high number of processing engines dedicated to parallel execution of queries.

For more information about parallel execution see: #OOW16 - The Best Way to Tune Your Parallel Statements: Real-Time SQL Monitoring by Yasin Baskan, Senior Principal Product Manager; Parallel Execution home page on OTN.

Schema Enhancements

Dimensional In-Database Analysis with Analytic Views. Analytic views provide a business intelligence layer over a star schema, making it easy to extend the data set with hierarchies, levels, aggregate data, and calculated measures. Analytic views promote consistency across applications.
By defining aggregation and calculation rules centrally in the database, the risk of inconsistent results in different reporting tools is reduced or eliminated. The analytic view feature includes new DDL statements, such as CREATE ATTRIBUTE DIMENSION, CREATE HIERARCHY and CREATE ANALYTIC VIEW, new calculated measure expression syntax, and new data dictionary views. Analytic views allow data warehouse and BI developers to extend the star schema with time series and other calculations, eliminating the need to define calculations within the application. Calculations can be defined in the analytic view and selected simply by including the measure name in the SQL select list.

For more information about Analytic Views see: #OOW16 - Analytic Views: A New Type of Database View for Simple, Powerful Analytics by Bud Endress, Director, Product Management.

SQL Enhancements

Cursor-Duration Temporary Tables Cached in Memory. Complex queries often process the same SQL fragment (query block) multiple times to answer a question. The results of these query blocks are stored internally, as cursor-duration temporary tables, to avoid processing the same query fragment multiple times. With this new functionality, these temporary tables can reside completely in memory, avoiding the need to write them to disk. Performance gains are the result of the reduction in I/O resource consumption.

Enhancing the CAST Function With Error Handling. The existing CAST function is enhanced to return a user-specified value in the case of a conversion error instead of raising an error. This new functionality provides more robust and simplified code development.

New SQL and PL/SQL Function VALIDATE_CONVERSION. The new function, VALIDATE_CONVERSION, determines whether a given input value can be converted to the requested data type. The VALIDATE_CONVERSION function provides more robust and simplified code development.

Enhanced LISTAGG Functionality. LISTAGG aggregates the values of a column by concatenating them into a single string. New functionality is added for managing situations where the length of the concatenated string is too long. Developers can now control the process for managing overflowing LISTAGG aggregates. This increases the productivity and flexibility of this aggregation function.

Approximate Query Processing. This release extends approximate query processing by adding approximate percentile aggregation. With this feature, the processing of large volumes of data is significantly faster than exact aggregation, especially for data sets that have a large number of distinct values with a negligible deviation from the exact result. Approximate query aggregation is a common requirement in today's data analysis. It optimizes processing time and resource consumption by orders of magnitude while providing almost exact results, and can be used to speed up existing processing.

Parallel Recursive WITH Enhancements. Oracle Database supports recursive queries through the use of a proprietary CONNECT BY clause and an ANSI compliant recursive WITH clause. The parallel recursive WITH clause enables this type of query to run in parallel mode. These types of queries are typical for graph data found in social graphs, such as Twitter graphs or call records, and are commonly used in transportation networks (for example, for flight paths, roadways, and so on).
Recursive WITH ensures the efficient computation of the shortest path from a single source node to single or multiple destination nodes in a graph. Bi-directional searching is used to ensure the efficient computation of the shortest path from a single source node to single or multiple destination nodes in a graph. A bi-directional search starts from both source and destination nodes, and then advancing the search in both directions. For more information about the new data warehouse SQL enhancements see: #OOW16 - Oracle Database 12c Release 2: Top 10 Data Warehouse Features for Developers and DBAs by Keith Laker, Senior Principal Product Manager, SQL for Analysis, Reporting and Modeling home page on OTN. Blog: Is an approximate answer just plain wrong? Blog: Are approximate answers the best way to analyze big data Blog: How to intelligently aggregate approximations Blog: Dealing with very very long string lists using Database 12.2 Blog: Simplifying your data validation code with Database 12.2 Blog: My query just got faster - new in 12.2: in-memory temp tables (coming soon!) on oracle-big-data.blogspot.co.uk In addition to the above features, we have made a lot of enhancements and added new features to the Optimizer and there is a comprehensive review by Nigel Bayliss, senior principle product manager, available on the optimizer blog. Obviously, the above is my take on what you need to know about for 12.2 and it’s not meant to be an exhaustive list of all the data warehouse and big data features. For the complete list of all the new features in 12.2 please refer to the New Features Guide in the database documentation set. I would really like to thank my amazing development team for all their hard work on the above list of data warehouse features and the all the time they have spent proof-reading and fact-checking my blog posts on these new features.  Enjoy using this great new release and checkout all the 12.2 tutorials and scripts on livesql!
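To make the auto-list and read-only partition behaviour described in the partitioning section above a little more concrete, here is a minimal, hypothetical DDL sketch (the table, column and partition names are mine, not from this post); it is only meant to illustrate the 12.2 syntax, so check the documentation before relying on it:

-- Auto-list partitioning: the database creates a new partition
-- automatically for every distinct REGION value that arrives
CREATE TABLE sales_by_region (
  order_id NUMBER,
  region   VARCHAR2(30),
  amount   NUMBER
)
PARTITION BY LIST (region) AUTOMATIC
(PARTITION p_emea VALUES ('EMEA'));

-- Read-only partitions: disable DML on one partition only
ALTER TABLE sales_by_region MODIFY PARTITION p_emea READ ONLY;

Inserting a row with a region value that has no partition yet (for example 'APAC') should create that partition automatically, while DML against p_emea would now raise an error.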


Functionality

The first really hidden gem in Oracle Database 12c Release 2: runtime modification of external table parameters

We missed documenting some functionality! With the next milestone for Oracle Database 12c Release 2 just taking place - the availability on premises for Linux x86-64, Solaris Sparc64, and Solaris x86-64, in addition to the Oracle Cloud - I used this as an excuse to play around with it for a bit ... and found that we somehow missed documenting new functionality. Bummer. But still better than the other way around ... ;-)

We missed documenting the capability to override some parameters of an external table at runtime. So I decided to quickly blog about this, not only to fill the gap in the documentation (a doc bug is filed already) but also to ruthlessly hijack the momentum and start highlighting new functionality (there are more blogs to come, specifically around my pet peeve Partitioning, but that's for later).

So what does it mean to override some parameters of an external table at runtime? It simply means that you can use one external table definition stub as a proxy for external data access of different files, with different reject limits, at different points in time - without the need to run DDL to modify the external table definition.

The usage is pretty simple and straightforward, so let me quickly demonstrate this with a not-so-business-relevant sample table. The prerequisite SQL for this one to run is at the end of this blog and might make its way onto github as well; I have not managed that yet and just wanted to get this blog post out.

Here is my rather trivial external table definition. It has worked for me since version 9, so why not use it with 12.2 as well.

CREATE TABLE et1 (col1 NUMBER, col2 NUMBER, col3 NUMBER)
ORGANIZATION EXTERNAL
( TYPE ORACLE_LOADER
  DEFAULT DIRECTORY d1
  ACCESS PARAMETERS
  ( RECORDS DELIMITED BY NEWLINE CHARACTERSET US7ASCII
    NOBADFILE NOLOGFILE
    FIELDS TERMINATED BY "," )
  LOCATION ('file1.txt') )
REJECT LIMIT UNLIMITED;

Pretty straightforward vanilla external table. Let's now see how many rows this external table returns (the simple "data generation" is at the end of this blog):

SQL> SELECT count(*) FROM et1;

  COUNT(*)
----------
        99

So far, so good. And now the new functionality. We will now access the exact same external table but tell the database to do a runtime modification of the file (location) we are accessing:

SQL> SELECT count(*) FROM et1 EXTERNAL MODIFY (LOCATION ('file2.txt'));

  COUNT(*)
----------
         9

As you can see, the row count changes without my having made any change to the external table definition, such as an ALTER TABLE. You will also see that nothing has changed in the external table definition:

SQL> SELECT table_name, location FROM user_external_locations WHERE table_name='ET1';

TABLE_NAME                     LOCATION
------------------------------ ------------------------------
ET1                            file1.txt

And there's one more thing. You might be asking yourself right now, right this moment ... why do I have to specify a location for the initial external table creation at all? The answer is simple: you do not have to anymore. Here is my external table without a location:

CREATE TABLE et2 (col1 NUMBER, col2 NUMBER, col3 NUMBER)
ORGANIZATION EXTERNAL
( TYPE ORACLE_LOADER
  DEFAULT DIRECTORY d1
  ACCESS PARAMETERS
  ( RECORDS DELIMITED BY NEWLINE CHARACTERSET US7ASCII
    NOBADFILE NOLOGFILE
    FIELDS TERMINATED BY "," ) )
REJECT LIMIT UNLIMITED;

When I now select from it, guess what: you won't get any rows back. The location is NULL.

SQL> SELECT * FROM et2;

no rows selected

Using this stub table in the same way as before gives me access to my data.

SQL> SELECT count(*) FROM et2 EXTERNAL MODIFY (LOCATION ('file2.txt'));

  COUNT(*)
----------
         9

You get the idea. Pretty cool stuff.

Aaah, and to complete the short functional introduction: the following clauses can be overridden: DEFAULT DIRECTORY, LOCATION, ACCESS PARAMETERS (BADFILE, LOGFILE, DISCARDFILE) and REJECT LIMIT. (A combined example is sketched at the very end of this post, after the data generation script.)

That's about it for now for online modification capabilities for external tables. I am sure I have forgotten some little details here and there, but there are always so many things to talk (write) about that you will never catch it all. And hopefully the documentation will cover it all sooner rather than later. Stay tuned - there are more blog posts about 12.2 to come. And if you have any comments about this specific one or suggestions for future ones, then please let me know. You can always reach me at hermann.baer@oracle.com. Cheers, over and out.

And here's the very simple "data generation" I used for the examples above to get "something" into my files. Have fun playing.

rem my directory
rem create or replace directory d1 as '/tmp';
rem create some dummy data in /tmp
set line 300 pagesize 5000
spool /tmp/file1.txt
select rownum ||','|| 1 ||','|| 1 ||','  from dual connect by level < 100;
spool off
spool /tmp/file2.txt
select rownum ||','|| 22 ||','|| 22 ||','  from dual connect by level < 10;
spool off
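As promised above, here is a small, hedged sketch combining several of the override clauses in one statement. The directory name d2, the file name file3.txt and the reject limit are my own placeholders (not from this post), and I have not verified this exact combination, so treat it purely as an illustration of the idea:

-- Override the directory, location and reject limit for this one query only
SELECT count(*)
FROM   et2 EXTERNAL MODIFY
       ( DEFAULT DIRECTORY d2
         LOCATION ('file3.txt')
         REJECT LIMIT 10 );

The principle is the same as before: the dictionary definition of ET2 stays untouched; only this one query reads from the alternative directory and file with a different reject limit.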


Big Data

Data loading into HDFS - Part 3. Streaming data loading

In my previous blogs, I have already talked about data loading into HDFS. In the first blog, I covered data loading from generic servers to HDFS. The second blog was devoted to offloading data from Oracle RDBMS. Here I want to explain how to load streaming data into Hadoop. Before anything else, I want to note that I will not cover Oracle GoldenGate for Big Data, simply because there are already many blog posts about it. Today I'm going to talk about Flume and Kafka.

What is Kafka?

Kafka is a distributed service bus. Ok, but what is a service bus? Let's imagine that you have a few data systems, and each one needs data from the others. You could link them directly, like this: but it becomes very hard to manage. Instead, you could have one centralized system that accumulates data from all sources and acts as a single point of contact for all systems. Like this:

What is Flume?

"Apache Flume is a distributed, reliable, and available system for efficiently collecting, aggregating and moving large amounts of log data from many different sources to a centralized data store." - this definition from the documentation explains pretty well what Flume is. Flume was historically developed for loading data into HDFS. But why couldn't I just use the Hadoop client?

Challenge 1. Small files. Hadoop was designed for storing large files, and despite the many NameNode optimizations made over the last few years, it is still recommended to store only big files. If your source has a lot of small files, Flume can collect them and flush the collection in batch mode, like a single big file. I always use the analogy of a glass and drops: you can collect one million drops in one glass, and after this you have one glass of water instead of one million drops.

Challenge 2. Lots of data sources. Let's imagine that I have an application (or even two on two different servers) that produces files which I want to load into HDFS. Life is good - if the files are large enough it's not going to be a problem. But now let's imagine that I have 1000 application servers and each one wants to write data into HDFS. Even if the files are large, this workload will collapse your Hadoop cluster. If you don't believe me - just try it (but not on a production cluster!). So we have to have something between HDFS and our data sources.

Now it is time for Flume. You could build a two-tier architecture: the first tier collects data from the different sources, and the second one aggregates it and loads it into HDFS. In my example I depict 1000 sources, handled by 100 Flume servers on the first tier, which load data onto the second tier; the second tier connects directly to HDFS, and in my example that is only two connections - which is affordable. You can find more details here; I just want to add that the general practice is to use one aggregation agent for every 4-16 client agents. I also want to note that it's a good practice to use an AVRO sink when you move data from one tier to the next. Here is an example of the Flume config file:

agent.sinks = avroSink
agent.sinks.avroSink.type = avro
agent.sinks.avroSink.channel = memory
agent.sinks.avroSink.hostname = avrosrchost.example.com
agent.sinks.avroSink.port = 4353

Kafka Architecture. You can find a deep technical presentation about Kafka here and here (actually, I took a few screenshots from there). A very interesting technical video can be found here. In this article I will just recap the key terms and concepts.

Producer - a process that writes data into the Kafka cluster. It could be part of an application, or edge nodes could play this role.
Consumer - a process that reads data from the Kafka cluster.
Broker - a member of the Kafka cluster. A set of brokers is a Kafka cluster.

Flume Architecture. You can find a lot of useful information about Flume in this book; here I will just highlight the key concepts. Flume has 3 major components:
1) Source - where I get the data from.
2) Channel - where I buffer it. It could be memory or disk, for example.
3) Sink - where I load my data. For example, it could be another tier of Flume agents, HDFS or HBase.
Between source and channel there are two minor components: the Interceptor and the Selector. With an Interceptor you can do simple processing; with a Selector you can choose a channel depending on the message header.

Flume and Kafka similarities and differences. It's a frequent question: "what is the difference between Flume and Kafka?" The answer could be very long, but let me briefly explain the key points.
1) Pull and Push. Flume accumulates data up to some condition (number of events, size of the buffer or a timeout) and then pushes it to the destination. Kafka accumulates data until a client initiates a read, so the client pulls data whenever it wants.
2) Data processing. Flume can do simple transformations with interceptors. Kafka doesn't do any data processing; it just stores the data.
3) Clustering. Flume is usually a set of single instances. Kafka is a cluster, which means that it has benefits such as High Availability and scalability out of the box, without extra effort.
4) Message size. Flume doesn't have any obvious restrictions on the size of a message. Kafka was designed for messages of a few KB.
5) Coding vs Configuring. Flume is usually a configurable tool (users usually don't write code; instead they use its configuration capabilities). With Kafka, you have to write code to load and unload the data.

Flafka. Many customers are thinking about choosing the right technology - either Flume or Kafka - for handling their data streaming. Stop choosing: use both. It's quite a common use case, and it is named Flafka. A good explanation and nice pictures can be found here (actually, I borrowed a few screenshots from there). First of all, Flafka is not a dedicated project. It's just a bunch of Java classes for integrating Flume and Kafka. Kafka can be either a source for Flume, via the Flume config:

flume1.sources.kafka-source-1.type = org.apache.flume.source.kafka.KafkaSource

or a channel, with the following directive:

flume1.channels.kafka-channel-1.type = org.apache.flume.channel.kafka.KafkaChannel

Use Case 1. Kafka as a source or channel. If you have Kafka as an enterprise service bus (see my example above), you may want to load data from your service bus into HDFS. You could do this by writing a Java program, but if you don't like that, you may use Kafka as a Flume source. In this case, Kafka is also useful for smoothing out peak load, and Flume provides flexible routing. You could also use Kafka as a Flume channel for high availability purposes (it is distributed by design).

Use Case 2. Kafka as a sink. If you use Kafka as an enterprise service bus, you may want to load data into it. The native way for Kafka is a Java program, but if you feel that it would be more convenient with Flume (just a few config files), you have that option. The only thing you need is to configure Kafka as a sink.

Use Case 3. Flume as a tool to enrich data. As I already said before, Kafka cannot do any data processing. It just stores data without any transformation. You can use Flume as a way to add some extra information to your Kafka messages. To do this you need to define Kafka as a source, implement an interceptor which adds some information to your messages, and write them back to Kafka in a different topic.

Conclusion. There are two major tools for loading streaming data - Flume and Kafka. There is no single right answer on which to use, because each tool has its own advantages and disadvantages. That is generally why Flafka was created - it is simply a combination of those two tools. (A minimal Flafka-style configuration is sketched below.)
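To make the Flafka pattern a little more tangible, here is a minimal, hypothetical Flume agent configuration that reads from a Kafka topic and writes to HDFS. The host name, topic name and HDFS path are my own placeholders, and the exact Kafka source property names vary between Flume releases, so treat this purely as a sketch:

# One agent: Kafka source -> memory channel -> HDFS sink
flume1.sources  = kafka-source-1
flume1.channels = mem-channel-1
flume1.sinks    = hdfs-sink-1

# Kafka source (older releases use zookeeperConnect/topic, newer ones
# use kafka.bootstrap.servers/kafka.topics - check your Flume version)
flume1.sources.kafka-source-1.type = org.apache.flume.source.kafka.KafkaSource
flume1.sources.kafka-source-1.zookeeperConnect = zkhost.example.com:2181
flume1.sources.kafka-source-1.topic = clickstream
flume1.sources.kafka-source-1.channels = mem-channel-1

# Buffer events in memory
flume1.channels.mem-channel-1.type = memory
flume1.channels.mem-channel-1.capacity = 10000

# Write reasonably large files to HDFS to avoid the small-files problem
flume1.sinks.hdfs-sink-1.type = hdfs
flume1.sinks.hdfs-sink-1.channel = mem-channel-1
flume1.sinks.hdfs-sink-1.hdfs.path = /user/flume/clickstream/%Y-%m-%d
flume1.sinks.hdfs-sink-1.hdfs.useLocalTimeStamp = true
flume1.sinks.hdfs-sink-1.hdfs.rollSize = 134217728
flume1.sinks.hdfs-sink-1.hdfs.rollCount = 0

Setting rollSize high and rollCount to 0 makes the sink roll files on size rather than event count, which plays nicely with the "glass of drops" batching idea described above.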


Big Data SQL

Big Data SQL Quick Start. Machine Learning and Big Data SQL – Part 19

Very frequently, when somebody talks about Big Data, he or she also wants to know how to apply Machine Learning algorithms to these data sets. Oracle Big Data SQL provides an easy and seamless way to do this. The secret lies in the Oracle Advanced Analytics (OAA) option, which has existed for many years. It is a set of in-database algorithms that, together with the SQL Developer Data Miner interface, lets you easily create advanced models in a drag-and-drop style. OAA works over Oracle Database tables, and Big Data SQL gives us access to Hadoop data through external tables. Easy! It's better to see something once than to hear about it many times, so let me give an example.

Let's imagine we have a store, and of course we have customers who make purchases there. Roughly, customers can be divided into 3 categories: those who spend a little money, those who spend a lot of money, and the average customer. Now, having the sales statistics and some personal information about the customers, we want to understand the profile of a Big Spender (a person who spends a lot of money in our store). I simplified my example and only have two tables:
- a fact table with the sales
- a dimension table with the customers' info
The fact table stores its data in JSON format on HDFS. The dimension table is stored in Oracle RDBMS. As you may remember from my previous post, Big Data SQL allows us to represent semi-structured (or even structured) data as a table in Oracle Database. Now we have two tables, which are related by a customer key (the primary key of the dimension table, a foreign key of the fact table). After this I build the model in SQL Developer. Data Miner allows us to write SQL inside the model; let's have a look.

This query defines the top 5% of customers by spend as "big spenders" and the bottom 5% as "low spenders"; everyone else is an average customer (a hypothetical sketch of such a labeling query appears at the end of this post). I materialize this aggregate and join it with the customers table, and after this I run the Machine Learning algorithms (they already exist; I just use them).

Decision Trees. Let's have a look at the results. First is the decision tree: it's hard to see much at this scale, so let's zoom in on two nodes which show us the Big Spenders (our target). From the first node we can conclude that there is a pretty high probability that the Big Spender profile is:
- Female
- With a high annual income (more than 84,500)
- Who lives in California, Maryland or New Jersey
- Whose marital status is Divorced, Married, Separated or Widowed
The second node shows us that there is also a pretty high probability that the Big Spender profile is:
- Female
- With a high annual income (more than 84,500)
- Who is younger than 43.5
- Whose marital status is Single

Naive Bayes. I also used a Naive Bayes classifier in my example. It shows that most probably a Big Spender is:
- Female
- Younger than 34.5
- Single
- With a high annual income (more than 118,712)

Conclusion. The combination of Oracle Advanced Analytics and Big Data SQL allows us to:
- Use HDFS as cheap storage
- Use HDFS as schema-on-read storage, which allows us to define the data schema during the read (parsing semi-structured data)
- Use the Oracle Advanced Analytics drag-and-drop interface for building classification models over Big Data (data stored in Hadoop)
All the code and the Advanced Analytics models used in my example are available here.
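Since the screenshot with the actual model SQL is not reproduced here, below is a minimal, hypothetical sketch of the kind of labeling query described above. The table and column names (sales_fact_ext, cust_id, amount) are my own assumptions, not the post's; the idea is simply to tag the top and bottom 5% of customers by total spend:

SELECT cust_id,
       CASE
         WHEN spend_rank >= 0.95 THEN 'BIG_SPENDER'
         WHEN spend_rank <= 0.05 THEN 'LOW_SPENDER'
         ELSE 'AVERAGE'
       END AS spender_class
FROM (
  SELECT cust_id,
         -- cumulative distribution of total spend per customer
         CUME_DIST() OVER (ORDER BY SUM(amount)) AS spend_rank
  FROM   sales_fact_ext      -- Big Data SQL external table over the JSON data in HDFS
  GROUP  BY cust_id
);

The result can then be joined to the customer dimension in Oracle Database and fed to the OAA classification algorithms, exactly as described above.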


Big Data SQL

Big Data SQL Quick Start. Oracle Text Integration – Part 18

Today, we'll focus on the integration of Oracle Text - the full-text indexing capability of the Oracle database - with documents residing on HDFS.

Oracle Text has been available for years and has evolved to address today's needs regarding indexing:
- 150+ document formats supported (PDF, DOCX, XLSX, PPTX, XML, JSON…)
- Dozens of languages
- Files can be stored inside the database (SecureFiles), outside on the file system, or accessed through a URL
- Advanced search functions: Approximate hit, Wild-Card, Fuzzy, Stemming, Boolean operators, Entity extraction…
- Inherits the database capabilities: security, high availability, performance…
- One API: SQL, which allows joining with other types of data (clients, products…) residing inside or outside of the database (i.e. NoSQL and/or HDFS using Big Data SQL)
- Machine Learning: Classification and Segmentation (supervised or not)
- …

As numerous requests have been made to allow indexing of external documents on HDFS, this is the time to demonstrate one way to do it. For the demonstration, I'll use the Oracle Big Data Lite VM, which contains all the required components.

Oracle Text and WebHDFS Integration

The "magic" resides in the right integration of WebHDFS with the Oracle Text URL_DATASTORE capability. As WebHDFS is already configured on the Big Data Lite VM, the only point to take care of is providing the right document URL to Oracle Text. Let's start by copying some documents into HDFS:

[oracle@bigdatalite ~]$ cd /usr/share/doc/search-1.0.0+cdh5.8.0+0/examples/test-documents
[oracle@bigdatalite test-documents]$ ls -l *.doc* *.xls* *.pdf *.ppt*
-rw-r--r--. 1 root root 4355 Jul 12 2016 NullHeader.docx
-rw-r--r--. 1 root root 13824 Jul 12 2016 testEXCEL.xls
-rw-r--r--. 1 root root 9453 Jul 12 2016 testEXCEL.xlsx
-rw-r--r--. 1 root root 34824 Jul 12 2016 testPDF.pdf
-rw-r--r--. 1 root root 164352 Jul 12 2016 testPPT_various.ppt
-rw-r--r--. 1 root root 56659 Jul 12 2016 testPPT_various.pptx
-rw-r--r--.
1 root root 35328 Jul 12 2016 testWORD_various.doc [oracle@bigdatalite test-documents]$ hdfs dfs -mkdir /user/oracle/documents [oracle@bigdatalite test-documents]$ hdfs dfs -put NullHeader.docx /user/oracle/documents/ [oracle@bigdatalite test-documents]$ hdfs dfs -put testEXCEL.xls /user/oracle/documents/ [oracle@bigdatalite test-documents]$ hdfs dfs -put testEXCEL.xlsx /user/oracle/documents/ [oracle@bigdatalite test-documents]$ hdfs dfs -put testPDF.pdf /user/oracle/documents/ [oracle@bigdatalite test-documents]$ hdfs dfs -put testPPT_various.ppt /user/oracle/documents/ [oracle@bigdatalite test-documents]$ hdfs dfs -put testPPT_various.pptx /user/oracle/documents/ [oracle@bigdatalite test-documents]$ hdfs dfs -put testWORD_various.doc /user/oracle/documents/ [oracle@bigdatalite test-documents]$ hdfs dfs -ls /user/oracle/documents Found 7 items -rw-r--r-- 1 oracle oracle 4355 2016-12-16 11:44 /user/oracle/documents/NullHeader.docx -rw-r--r-- 1 oracle oracle 13824 2016-12-16 11:44 /user/oracle/documents/testEXCEL.xls -rw-r--r-- 1 oracle oracle 9453 2016-12-16 11:44 /user/oracle/documents/testEXCEL.xlsx -rw-r--r-- 1 oracle oracle 34824 2016-12-16 11:44 /user/oracle/documents/testPDF.pdf -rw-r--r-- 1 oracle oracle 164352 2016-12-16 11:44 /user/oracle/documents/testPPT_various.ppt -rw-r--r-- 1 oracle oracle 56659 2016-12-16 11:44 /user/oracle/documents/testPPT_various.pptx -rw-r--r-- 1 oracle oracle 35328 2016-12-16 11:44 /user/oracle/documents/testWORD_various.doc [oracle@bigdatalite test-documents]$         Now, the WebHDFS URL to access the NullHeader.docx document would be: http://bigdatalite.localdomain:50070/webhdfs/v1/user/oracle/documents/NullHeader.docx?user.name=oracle&op=OPEN Using the aforementioned URL, WebHDFS provides us the content of the document. However, you can remark that the combo box says “from http://bigdatalite.localdomain:50075” which means some redirection is being done behind the curtain! This point has to be managed properly or nothing will be indexed. On the Oracle database side, we have to prepare some preferences to index the documents properly according to the type of search we want to perform:   The command hereunder needs to be run once by SYS to give the rights to create a Full-Text Index to all users in the ORCL pluggable database: [oracle@bigdatalite test-documents]$ sql sys/welcome1@//localhost:1521/cdb as sysdba SQLcl: Release 4.2.0 Production on Fri Feb 10 15:33:53 2017 Copyright (c) 1982, 2017, Oracle. All rights reserved.   New version: 4.1.0 available to download log4j:WARN No appenders could be found for logger (org.apache.http.client.protocol.RequestAddCookies). log4j:WARN Please initialize the log4j system properly. log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for more info. Connected to: Oracle Database 12c Enterprise Edition Release 12.1.0.2.0 - 64bit Production With the Partitioning, OLAP, Advanced Analytics and Real Application Testing options SQL> alter session set container = ORCL; Session altered. SQL> exec ctxsys.ctx_adm.set_parameter('FILE_ACCESS_ROLE', 'PUBLIC'); PL/SQL procedure successfully completed. SQL> SELECT par_value FROM ctxsys.ctx_parameters WHERE par_name = 'FILE_ACCESS_ROLE'; PAR_VALUE ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- PUBLIC SQL> Now we’ll create the user IDX with the appropriate rights. 
The CTXAPP role will allow IDX to create the Full-Text Index. SQL> create user idx identified by idx 2 default tablespace users 3 temporary tablespace temp 4 quota unlimited on users; User IDX created. SQL> SQL> grant create session, create table, ctxapp, create sequence, create procedure to idx; Grant succeeded. SQL> SQL> grant execute on UTL_HTTP to idx; Grant succeeded. SQL> begin 2 DBMS_NETWORK_ACL_ADMIN.append_host_ace ( 3 host => 'bigdatalite.localdomain', 4 lower_port => 50000, 5 upper_port => 50100, 6 ace => xs$ace_type(privilege_list => xs$name_list('http'), 7 principal_name => 'idx', 8 principal_type => xs_acl.ptype_db)); 9 end; 10 11 / PL/SQL procedure successfully completed. SQL> col host format a30 SQL> col acl format a30 SQL> col acl_owner format a30 SQL> SELECT HOST, 2 LOWER_PORT, 3 UPPER_PORT, 4 ACL, 5 ACLID, 6 ACL_OWNER 7 FROM dba_host_acls 8 ORDER BY host; HOST LOWER_PORT UPPER_PORT ACL ACLID ACL_OWNER ------------------------------ ---------- ---------- ------------------------------ ---------------- ------------------------------ * NETWORK_ACL_DD7C57F0D3BE0871E0 0000000080002710 SYS 4325AAE80A17A8 bigdatalite.localdomain 50000 50100 NETWORK_ACL_44A2890856044C91E0 0000000080002760 SYS 530100007F5A1A localhost /sys/acls/oracle-sysman-ocm-Re 0000000080002738 SYS solve-Access.xml SQL> The grants to use the PL/SQL package UTL_HTTP will be useful to manage the URL Follow Redirection that WebHDFS is doing. Remark that the DBMS_NETWORK_ACL_ADMIN package is used for security purpose to allow accessing the URL resources from PL/SQL code. Also the view DBA_NETWORK_ACLS can be queried to check for the URL access rights. Full Text Index Creation   Now it is time to focus on the index part. We need to connect with user IDX and execute the following steps:     Create a table that will store the URLs of the documents in HDFS to index Create preferences to create the index Create the index Insert the proper URLs managing WebHDFS redirections SQL> connect idx/idx@ORCL Connected. SQL> create table hdfs_docs ( 2 id number GENERATED ALWAYS AS IDENTITY not null, 3 doc_url varchar2(4000) not null 4 ); Table HDFS_DOCS created. SQL> begin 2 -- must be run by SYS: ctxsys.ctx_adm.set_parameter('FILE_ACCESS_ROLE','PUBLIC'); 3 -- reset 4 -- ctx_ddl.drop_preference('hdfs_url_pref'); 5 -- ctx_ddl.drop_preference('hdfs_stem_fuzzy_pref'); 6 7 ctx_ddl.create_preference('hdfs_url_pref','URL_DATASTORE'); 8 9 -- ctx_ddl.set_attribute('hdfs_url_pref','HTTP_PROXY','www-proxy.us.example.com'); 10 -- ctx_ddl.set_attribute('hdfs_url_pref','NO_PROXY','us.example.com'); 11 ctx_ddl.set_attribute('hdfs_url_pref','Timeout','300') ; 12 13 ctx_ddl.create_preference('hdfs_stem_fuzzy_pref', 'BASIC_WORDLIST'); 14 ctx_ddl.set_attribute('hdfs_stem_fuzzy_pref','FUZZY_MATCH', 'ENGLISH'); 15 ctx_ddl.set_attribute('hdfs_stem_fuzzy_pref','FUZZY_SCORE', '70'); 16 ctx_ddl.set_attribute('hdfs_stem_fuzzy_pref','FUZZY_NUMRESULTS', '200'); 17 ctx_ddl.set_attribute('hdfs_stem_fuzzy_pref','SUBSTRING_INDEX', 'TRUE'); 18 ctx_ddl.set_attribute('hdfs_stem_fuzzy_pref','PREFIX_INDEX', 'TRUE'); 19 ctx_ddl.set_attribute('hdfs_stem_fuzzy_pref','PREFIX_MIN_LENGTH', 3); 20 ctx_ddl.set_attribute('hdfs_stem_fuzzy_pref','PREFIX_MAX_LENGTH', 4); 21 22 ctx_ddl.set_attribute('hdfs_stem_fuzzy_pref','STEMMER','ENGLISH'); 23 24 end; 25 26 / PL/SQL procedure successfully completed. 
SQL> SQL> create index hdfs_full_text_index on hdfs_docs( doc_url ) 2 indextype is ctxsys.context 3 parameters ('datastore hdfs_url_pref Wordlist hdfs_stem_fuzzy_pref filter ctxsys.auto_filter sync (on commit)') 4 ; Index HDFS_FULL_TEXT_INDEX created. SQL>   We then insert the simple URLs surrounded by a call to a PL/SQL function to store the right URL to let the indexing process work accordingly: SQL> create or replace function get_real_url( url in varchar2 ) return varchar2 is 2 l_http_request UTL_HTTP.req; 3 l_http_response UTL_HTTP.resp; 4 5 value VARCHAR2(1024); 6 7 begin 8 value := url; 9 l_http_request := UTL_HTTP.begin_request(url); 10 UTL_HTTP.SET_HEADER(l_http_request, 'User-Agent', 'Mozilla/4.0'); 11 UTL_HTTP.SET_FOLLOW_REDIRECT(l_http_request, 0); 12 l_http_response := UTL_HTTP.get_response(l_http_request); 13 IF(l_http_response.STATUS_CODE in (UTL_HTTP.HTTP_MOVED_PERMANENTLY,UTL_HTTP.HTTP_TEMPORARY_REDIRECT, UTL_HTTP.HTTP_FOUND)) 14 THEN 15 UTL_HTTP.GET_HEADER_BY_NAME(l_http_response, 'Location', value); 16 END IF; 17 UTL_HTTP.end_response(l_http_response); 18 19 return value; 20 21 end; 22 23 / Function GET_REAL_URL compiled SQL> set define off SQL> insert into hdfs_docs (doc_url) values (get_real_url('http://bigdatalite.localdomain:50070/webhdfs/v1/user/oracle/documents/NullHeader.docx?user.name=oracle&op=OPEN')); 1 row inserted. SQL> insert into hdfs_docs (doc_url) values (get_real_url('http://bigdatalite.localdomain:50070/webhdfs/v1/user/oracle/documents/testEXCEL.xls?user.name=oracle&op=OPEN')); 1 row inserted. SQL> insert into hdfs_docs (doc_url) values (get_real_url('http://bigdatalite.localdomain:50070/webhdfs/v1/user/oracle/documents/testEXCEL.xlsx?user.name=oracle&op=OPEN')); 1 row inserted. SQL> insert into hdfs_docs (doc_url) values (get_real_url('http://bigdatalite.localdomain:50070/webhdfs/v1/user/oracle/documents/testPDF.pdf?user.name=oracle&op=OPEN')); 1 row inserted. SQL> insert into hdfs_docs (doc_url) values (get_real_url('http://bigdatalite.localdomain:50070/webhdfs/v1/user/oracle/documents/testPPT_various.ppt?user.name=oracle&op=OPEN')); 1 row inserted. SQL> insert into hdfs_docs (doc_url) values (get_real_url('http://bigdatalite.localdomain:50070/webhdfs/v1/user/oracle/documents/testPPT_various.pptx?user.name=oracle&op=OPEN')); 1 row inserted. SQL> insert into hdfs_docs (doc_url) values (get_real_url('http://bigdatalite.localdomain:50070/webhdfs/v1/user/oracle/documents/testWORD_various.doc?user.name=oracle&op=OPEN')); 1 row inserted. SQL> set timing on SQL> commit; Commit complete. Elapsed: 00:00:01.083 SQL> Notice it took one second to index these documents. Search examples We are now ready to proceed with some full text queries: Related to the document excerpt  We have the following output: SQL> select score(1), doc_url 2 from hdfs_docs 3 where contains(doc_url, 'signaled', 1) > 0 4 order by score(1) desc; no rows selected Elapsed: 00:00:00.639 SQL> select score(1), doc_url 2 from hdfs_docs 3 where contains(doc_url, 'fuzzy(signaled, 70, 200, w)', 1) > 0 4 order by score(1) desc; SCORE(1) DOC_URL ---------- --------------------------------------------------------------------------------------------------------------------------------------- 47 http://bigdatalite.localdomain:50075/webhdfs/v1/user/oracle/documents/NullHeader.docx?op=OPEN&user.name=oracle&namenoderpcaddress=bigdatalite.localdomain:8020&offset=0 Elapsed: 00:00:00.619 SQL>   Only the second query using Fuzzy function is able to detect approximate syntax. 
Did you notice that the word "signalled" was misspelled in the query? We can even search using the Soundex function, which performs a phonetic search. For instance, the query hereunder will find the documents having [parts of] words which sound like "econo".

SQL> select score(1), doc_url
  2  from hdfs_docs
  3  where contains(doc_url, '!econo', 1) > 0
  4  order by score(1) desc;

SCORE(1) DOC_URL
---------- ---------------------------------------------------------------------------------------------------------------------------------------
11 http://bigdatalite.localdomain:50075/webhdfs/v1/user/oracle/documents/NullHeader.docx?op=OPEN&user.name=oracle&namenoderpcaddress=bigdatalite.localdomain:8020&offset=0

Elapsed: 00:00:00.448
SQL>

Finally, we can have a look at boolean operators such as:

SQL> select score(1), doc_url
  2  from hdfs_docs
  3  where contains(doc_url, 'Yemen or Gothic', 1) > 0
  4  order by score(1) desc;

SCORE(1) DOC_URL
---------- ---------------------------------------------------------------------------------------------------------------------------------------
44 http://bigdatalite.localdomain:50075/webhdfs/v1/user/oracle/documents/NullHeader.docx?op=OPEN&user.name=oracle&namenoderpcaddress=bigdatalite.localdomain:8020&offset=0
4 http://bigdatalite.localdomain:50075/webhdfs/v1/user/oracle/documents/testPPT_various.pptx?op=OPEN&user.name=oracle&namenoderpcaddress=bigdatalite.localdomain:8020&offset=0
4 http://bigdatalite.localdomain:50075/webhdfs/v1/user/oracle/documents/testWORD_various.doc?op=OPEN&user.name=oracle&namenoderpcaddress=bigdatalite.localdomain:8020&offset=0
4 http://bigdatalite.localdomain:50075/webhdfs/v1/user/oracle/documents/testPPT_various.ppt?op=OPEN&user.name=oracle&namenoderpcaddress=bigdatalite.localdomain:8020&offset=0

Elapsed: 00:00:00.113
SQL>

We can see that the score for the first document (44) is higher because "Yemen" is mentioned 9 times, versus once per document for "Gothic". You can now enrich your existing queries targeting data in HDFS, NoSQL or RDBMS using Big Data SQL, now adding full-text search capabilities.

HTTP Redirect in PL/SQL

As stated previously, to get the final (right) URL for retrieving the content of a document from WebHDFS, as any browser does, we need to find out the Location header for redirection.
The following PL/SQL function will give us this information (now that we have the appropriate ACLs): SQL> set define off SQL> set serveroutput on size 1000000 SQL> create or replace function get_real_url( url in varchar2 ) return varchar2 is 2 l_http_request UTL_HTTP.req; 3 l_http_response UTL_HTTP.resp; 4 5 value VARCHAR2(1024); 6 7 begin 8 value := url; 9 l_http_request := UTL_HTTP.begin_request(url); 10 UTL_HTTP.SET_HEADER(l_http_request, 'User-Agent', 'Mozilla/4.0'); 11 UTL_HTTP.SET_FOLLOW_REDIRECT(l_http_request, 0); 12 l_http_response := UTL_HTTP.get_response(l_http_request); 13 IF(l_http_response.STATUS_CODE in (UTL_HTTP.HTTP_MOVED_PERMANENTLY,UTL_HTTP.HTTP_TEMPORARY_REDIRECT, UTL_HTTP.HTTP_FOUND)) 14 THEN 15 UTL_HTTP.GET_HEADER_BY_NAME(l_http_response, 'Location', value); 16 END IF; 17 UTL_HTTP.end_response(l_http_response); 18 19 return value; 20 21 end; 22 23 / Function GET_REAL_URL compiled SQL> begin 2 dbms_output.put_line(get_real_url('http://bigdatalite.localdomain:50070/webhdfs/v1/user/oracle/documents/NullHeader.docx?user.name=oracle&op=OPEN')); 3 end; 4 5 / http://bigdatalite.localdomain:50075/webhdfs/v1/user/oracle/documents/NullHeader.docx?op=OPEN&user.name=oracle&namenoderpcaddress=bigdatalite.localdomain:8020&offset=0 PL/SQL procedure successfully completed. SQL> As you can see the redirected URL is now: http://bigdatalite.localdomain:50075/webhdfs/v1/user/oracle/documents/NullHeader.docx?op=OPEN&user.name=oracle&namenoderpcaddress=bigdatalite.localdomain:8020&offset=0   And Oracle Text can now locate the content and index it!
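As an aside, the BASIC_WORDLIST preference created earlier also enabled an ENGLISH stemmer, so stem searches should work against this index too. The example below is only a hedged sketch - the hit list obviously depends on what the sample documents actually contain:

-- The $ operator expands the search term to its stem forms
-- (e.g. sign, signs, signed, signing)
SQL> select score(1), doc_url
  2  from hdfs_docs
  3  where contains(doc_url, '$sign', 1) > 0
  4  order by score(1) desc;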


Functionality

How to intelligently aggregate approximations

The growth of low-cost storage platforms has allowed many companies to actively seek out new external data sets and combine them with internal historical data that goes back over a very long time frame. Therefore, as both the variety and the volume of data continue to grow, the challenge for many businesses is how to process this ever-expanding pool of data and, at the same time, make timely decisions based on all the available data.

In previous posts I have discussed whether an approximate answer is just plain wrong and whether approximate answers really are the best way to analyze big data. As with the vast majority of data analysis, at some point there is going to be a need to aggregate a data set to get a higher level view across various dimensions. When working with results from approximate queries, dealing with aggregations can get a little complicated, because it is not possible to "reuse" an aggregated approximate result as a basis for aggregating the results to an even higher level across the various dimensions of the original query. Obtaining a valid approximate result set requires a query to rescan the source data and compute the required analysis for the given combination of levels. Just because I have a result set that contains a count of the number of unique products sold this week at the county level does not mean that I can simply reuse that result set to determine the number of distinct products sold this week at the state level. In many cases you cannot just roll up aggregations of aggregations. Until now…

With Database 12c Release 2 we have introduced a series of new functions to deal with this specific issue - the need to create reusable aggregated results that can be "rolled up" to higher aggregate levels. So at long last you can now intelligently aggregate approximations! Here is how we do it. Essentially there are three parts that provide the "intelligence":

APPROX_xxxxxx_DETAIL
APPROX_xxxxxx_AGG
TO_APPROX_xxxxxx

Here is a quick overview of each function:

APPROX_xxxxxx_DETAIL
This function takes a numeric expression and builds a summary result set containing results for all dimensions in the GROUP BY clause. The output from this function is a column containing BLOB data. As with other approximate functions, the results can be deterministic or non-deterministic depending on your requirements.

APPROX_xxxxxx_AGG
This function builds a higher level summary based on results from the _DETAIL function. This means that it is not necessary to re-query the base fact table in order to derive new aggregate results. As with the _DETAIL function, the results are returned as a BLOB.

TO_APPROX_xxxxxx
Returns results from the _AGG and _DETAIL functions in a user-readable format.

A worked example

Let's build a working example using the sample SH schema. Our product marketing team wants to know, within each year, the approximate number of unique customers within each product category. Thinking ahead, we know that once they have this result set we expect them to do further analysis, such as drilling on the time and product dimension levels to get deeper insight. The best solution is to build a reusable aggregate approximate result set using the new functions in 12.2.
SELECT
  t.calendar_year,
  t.calendar_quarter_number AS qtr,
  p.prod_category_desc AS category_desc,
  p.prod_subcategory_desc AS subcategory_desc,
  APPROX_COUNT_DISTINCT_DETAIL(s.cust_id) AS acd_agg
FROM sales s, products p, times t
WHERE s.time_id = t.time_id
AND p.prod_id = s.prod_id
GROUP BY t.calendar_year, t.calendar_quarter_number, p.prod_category_desc, p.prod_subcategory_desc
ORDER BY t.calendar_year, t.calendar_quarter_number, p.prod_category_desc, p.prod_subcategory_desc;

This returns my result set as a BLOB, as shown below, and this BLOB contains the various tuple combinations from my GROUP BY clause. As a result I can reuse this result set to answer new questions based around higher levels of aggregation. The explain plan for this query shows the new sort keywords (GROUP BY APPROX), which tell us that approximate processing has been used as part of this query.

If we want to convert the BLOB data into a readable format we can transform it by using the TO_APPROX_xxx function as follows:

SELECT
  t.calendar_year,
  t.calendar_quarter_number AS qtr,
  p.prod_category_desc AS category_desc,
  p.prod_subcategory_desc AS subcategory_desc,
  TO_APPROX_COUNT_DISTINCT(APPROX_COUNT_DISTINCT_DETAIL(s.cust_id)) AS acd_agg
FROM sales s, products p, times t
WHERE s.time_id = t.time_id
AND p.prod_id = s.prod_id
GROUP BY t.calendar_year, t.calendar_quarter_number, p.prod_category_desc, p.prod_subcategory_desc
ORDER BY t.calendar_year, t.calendar_quarter_number, p.prod_category_desc, p.prod_subcategory_desc;

This creates the following results. Alternatively, we could create a table using the above query and then simply pass the BLOB column directly into the TO_APPROX function as follows:

CREATE TABLE agg_cd AS
SELECT
  t.calendar_year,
  t.calendar_quarter_number AS qtr,
  p.prod_category_desc AS category_desc,
  p.prod_subcategory_desc AS subcategory_desc,
  APPROX_COUNT_DISTINCT_DETAIL(s.cust_id) AS acd_agg
FROM sales s, products p, times t
WHERE s.time_id = t.time_id
AND p.prod_id = s.prod_id
GROUP BY t.calendar_year, t.calendar_quarter_number, p.prod_category_desc, p.prod_subcategory_desc
ORDER BY t.calendar_year, t.calendar_quarter_number, p.prod_category_desc, p.prod_subcategory_desc;

Using this table we can simplify our query to return the approximate number of distinct customers directly from the above table:

SELECT
  calendar_year,
  qtr,
  category_desc,
  subcategory_desc,
  TO_APPROX_COUNT_DISTINCT(acd_agg)
FROM agg_cd
ORDER BY calendar_year, qtr, category_desc, subcategory_desc;

which returns the same results as before - as you would expect! Using the aggregated table as our source, we can now change the levels that we wish to calculate without having to go back to the original source table and again scan all the rows.
However, to extract the new aggregations we need to introduce the third function, APPROX_COUNT_DISTINCT_AGG, to our query and wrap it within the TO_APPROX_COUNT_DISTINCT function to see the results:

SELECT
  calendar_year,
  subcategory_desc,
  TO_APPROX_COUNT_DISTINCT(APPROX_COUNT_DISTINCT_AGG(acd_agg))
FROM agg_cd
GROUP BY calendar_year, subcategory_desc
ORDER BY calendar_year, subcategory_desc;

This will return the following results based only on the new combination of levels included in the GROUP BY clause.

Summary

This post has reviewed the three new functions that we have introduced in Database 12c Release 2 that allow you to reuse aggregated approximate result sets:

APPROX_xxxxxx_DETAIL
APPROX_xxxxxx_AGG
TO_APPROX_xxxxxx

Database 12c Release 2 makes it possible to intelligently aggregate approximations. In the next post I will explore how you can combine approximate processing with existing query rewrite functionality so you can have intelligent approximate query rewrite.
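As one more small illustration of reuse - a sketch that uses only the functions and the agg_cd table already shown above - the stored detail can even be rolled all the way up to a single approximate grand total without touching the SALES fact table again:

-- Approximate number of distinct customers across all years,
-- quarters and product categories, derived from the stored detail
SELECT TO_APPROX_COUNT_DISTINCT(APPROX_COUNT_DISTINCT_AGG(acd_agg)) AS approx_customers
FROM agg_cd;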


Functionality

Dealing with very very long string lists using Database 12.2

Oracle RDBMS 11gR2 introduced the LISTAGG function for working with string values. It can be used to aggregate values from groups of rows and return a concatenated string where the values are typically separated by a comma or semi-colon - you can determine this yourself within the code by supplying your own separator symbol. Based on the number of posts across various forums and blogs, it is widely used by developers. However, there is one key issue that has been highlighted by many people: when using LISTAGG on data sets that contain very large strings it is possible to create a list that is too long. This causes the following overflow error to be generated:

ORA-01489: result of string concatenation is too long.

Rather annoyingly for developers and DBAs, it is very difficult to determine ahead of time if the concatenation of the values within the specified LISTAGG measure_expr will cause an ORA-01489 error. Many people have posted workarounds to resolve this problem - including myself. Probably the most elegant and simple solution has been to use the 12c MATCH_RECOGNIZE feature; however, this required use of 12c Release 1, which was not always available to all DBAs and/or developers. If you want to replicate the problem and you have access to the sample SH schema then try executing this query:

SELECT g.country_region,
       LISTAGG(c.cust_first_name||' '||c.cust_last_name, ',') WITHIN GROUP (ORDER BY c.country_id) AS Customer
FROM customers c, countries g
WHERE g.country_id = c.country_id
GROUP BY country_region
ORDER BY country_region;

All the samples in this post use our sample SH schema. Once we release the on-premises version of 12.2 you will be able to download the Examples file for your platform from the database home page on OTN. I have published a tutorial on LiveSQL (it's a completely free service, all you need to do is register for an account) so you can play with all the new keywords covered in this post.

What have we changed in 12.2?

One way of resolving ORA-01489 errors is to simply increase the size of VARCHAR2 objects.

Larger object sizes

The size limit for VARCHAR2 objects is determined by the database parameter MAX_STRING_SIZE. You can check the setting in your database using the following command:

show parameter MAX_STRING_SIZE

In my demo environment this returns the following:

NAME            TYPE   VALUE
--------------- ------ --------
max_string_size string STANDARD

Prior to Oracle RDBMS 12.1.0.2 the upper limit for VARCHAR2 was 4K. With Oracle RDBMS 12.1.0.2 this limit has been raised to 32K. This increase may solve a lot of issues but it does require a change to the database parameter MAX_STRING_SIZE. Setting MAX_STRING_SIZE = EXTENDED enables the new 32767 byte limit:

ALTER SYSTEM SET max_string_size=extended SCOPE=SPFILE;

However, with the increasing interest in big data sources it is clear that there is still considerable potential for ORA-01489 errors as you use the LISTAGG feature within queries against extremely large data sets. What is needed is a richer syntax within the LISTAGG function, and this has now been implemented as part of Database 12c Release 2.

Better list management

With 12.2 we have made it easier to manage lists that are likely to generate an error because they are too long. There is a whole series of new keywords that can be used:

ON OVERFLOW ERROR
ON OVERFLOW TRUNCATE
WITH COUNT vs. WITHOUT COUNT

Let's look a little closer at each of these features…

1. Keeping pre-12.2 functionality

If you want your existing code to continue to return an error if the string is too long then the great news is that this is the default behaviour. When the length of the LISTAGG string exceeds the VARCHAR2 limit then the standard error will be returned:

ERROR at line xxx:
ORA-01489: result of string concatenation is too long

However, where possible I would recommend adding "ON OVERFLOW ERROR" to your LISTAGG code to make it completely clear that you are expecting an error when an overflow happens:

SELECT g.country_region,
       LISTAGG(c.cust_first_name||' '||c.cust_last_name, ',' ON OVERFLOW ERROR) WITHIN GROUP (ORDER BY c.country_id) AS Customer
FROM customers c, countries g
WHERE g.country_id = c.country_id
GROUP BY country_region
ORDER BY country_region;

So it's important to note that by default the truncation features are disabled, and you will need to change any existing code if you don't want an error to be raised.

2. New ON OVERFLOW TRUNCATE… keywords

If you want to truncate the list of values at the 4K or 32K boundary then you need to use the newly added keywords ON OVERFLOW TRUNCATE as shown here:

SELECT g.country_region,
       LISTAGG(c.cust_first_name||' '||c.cust_last_name, ',' ON OVERFLOW TRUNCATE) WITHIN GROUP (ORDER BY c.country_id) AS Customer
FROM customers c, countries g
WHERE g.country_id = c.country_id
GROUP BY country_region
ORDER BY country_region;

When truncation occurs we truncate back to the next full value, at which point you can control how you tell the user that the list has been truncated. By default we append three dots '…' to the string as an indicator that truncation has occurred, but you can override this as follows:

SELECT g.country_region,
       LISTAGG(c.cust_first_name||' '||c.cust_last_name, ',' ON OVERFLOW TRUNCATE '***') WITHIN GROUP (ORDER BY c.country_id) AS Customer
FROM customers c, countries g
WHERE g.country_id = c.country_id
GROUP BY country_region
ORDER BY country_region;

If you want to keep the existing pre-12.2 behaviour, where we return an error if the string is too long, then you can either rely on the default behaviour or explicitly state that an error should be returned (always a good idea to avoid relying on default behaviour, in my opinion) by using the ON OVERFLOW ERROR keywords shown above. Either way, the normal error message is generated - i.e. the pre-12.2 behaviour is replicated:

ORA-01489: result of string concatenation is too long
01489. 00000 - "result of string concatenation is too long"
*Cause: String concatenation result is more than the maximum size.
*Action: Make sure that the result is less than the maximum size.

3. How many values are missing?

If you need to know how many values were removed from the list to make it fit into the available space then you can use the keywords 'WITH COUNT' - this is the default behaviour. Alternatively, if you don't want a count at the end of the truncated string, you can use the keywords 'WITHOUT COUNT' (a short WITHOUT COUNT sketch appears at the end of this post).

SELECT g.country_region,
       LISTAGG(c.cust_first_name||' '||c.cust_last_name, ',' ON OVERFLOW TRUNCATE '***' WITH COUNT) WITHIN GROUP (ORDER BY c.country_id) AS Customer
FROM customers c, countries g
WHERE g.country_id = c.country_id
GROUP BY country_region
ORDER BY country_region;

4. Do we split values when truncation occurs?

No. When determining where to force the truncation we take into account the full length of each value. Therefore, if you consider the example that we have been using, which creates a list of customer names within each country, we will always include the customer's full name "Keith Laker" (i.e. first name + last name). There has to be enough space to add the complete string (first + last name) to the list; otherwise the whole string "Keith Laker" is removed and the truncation indicator is inserted. It is not possible for the last value in the string to be only the first name with the last name truncated/removed.

5. How do we calculate the overall length of the string values?

The characters indicating that an overflow has occurred are appended at the end of the list of values, which in this case is the default value of three dots "...". The overflow functionality traverses backwards from the maximum possible length to the end of the last complete value in the LISTAGG clause, then it adds the user-defined separator followed by the user-defined overflow indicator, followed by output from the 'WITH COUNT' clause, which adds a counter at the end of a truncated string to indicate the number of values that have been removed/truncated from the list.

Summary

With Database 12c Release 2 we have tackled the ORA-01489 error in two ways: 1) increased the size of VARCHAR2 objects to 32K and 2) extended the functionality of LISTAGG to allow greater control over the management of extremely long lists. Specifically there are several new keywords:

ON OVERFLOW TRUNCATE
ON OVERFLOW ERROR (default behaviour)
WITH COUNT (default behaviour when truncating)
WITHOUT COUNT

Hopefully this new functionality will mean that all those wonderful workarounds for dealing with "ORA-01489: result of string concatenation is too long" errors that have been created over the years can now be replaced by standard SQL functionality. Don't forget to try the tutorial on LiveSQL - it's a completely free service, all you need to do is register for an account.
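For completeness, here is the small WITHOUT COUNT sketch referenced above. It is the same SH-schema query used throughout this post, with the '***' indicator from the earlier override example:

SELECT g.country_region,
       LISTAGG(c.cust_first_name||' '||c.cust_last_name, ',' ON OVERFLOW TRUNCATE '***' WITHOUT COUNT)
         WITHIN GROUP (ORDER BY c.country_id) AS Customer
FROM customers c, countries g
WHERE g.country_id = c.country_id
GROUP BY country_region
ORDER BY country_region;

With this variant, a truncated list simply ends with the overflow indicator; no count of the removed values is appended.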


Big Data

Oracle Big Data SQL: Simplifying Information Lifecycle Management

For many years, Oracle Database has provided rich support for Information Lifecycle Management (ILM). Numerous capabilities are available for data tiering - or storing data in different media based on access requirements and storage cost considerations. These tiers may scale from in-memory for real-time data analysis - to Database Flash for frequently accessed data - to operational data captured in Database Storage and Exadata Cells. Hadoop offers yet another storage layer - the Hadoop Distributed File System (HDFS) - which offers a cost-effective alternative for storing massive volumes of data. Oracle Big Data SQL makes access to this data seamless from Oracle Database 12c; Big Data SQL is a data virtualization technology that allows users and applications to use Oracle's rich SQL language across data stored in Oracle Database, Hadoop and NoSQL stores. One query can combine data from all these sources.

What this means is that ILM can now be extended to use Hadoop to store raw and archived data. This is especially important since retaining many years of historical information in data warehouses is increasingly a requirement for both analytics and regulatory compliance. Oracle Big Data SQL offers two approaches to implement this strategy - the first approach has been available since Big Data SQL's initial release:
- Store data in any format in Hadoop. This data may already exist in Hadoop - or you can use Oracle Copy to Hadoop or open source tools to transfer data from Oracle Database onto the Hadoop cluster.
- Create a Big Data SQL-enabled external table over that data.
- Applications query the data stored in Hadoop through that external table as they would any other data in the Oracle Database - using Oracle SQL. Big Data SQL Server Cells on the Hadoop cluster will utilize features such as Smart Scan, Storage Indexes, predicate pushdown, partition pruning and bloom filters to optimize query performance (see this series of Big Data SQL Quick Start blog posts for more details).

Note: In this scenario, the HDFS data accessed by Big Data SQL is not dedicated to Oracle Database processing. The HDFS data can be used by any Hadoop process - including Spark, Oracle Big Data Discovery, Oracle Big Data Spatial and Graph - the list goes on and on. Big Data SQL is just another process accessing that data.

Big Data SQL Smart Scan on Oracle Tablespaces in HDFS

Some Background

Oracle Big Data SQL 3.1 offers an innovative new approach to extending ILM to Hadoop - and it is going to look very familiar to those who understand existing Oracle Database features - specifically tablespaces, partitioning and Exadata/Big Data SQL Smart Scan. In summary, you will now be able to store Oracle Database tablespaces in HDFS - and then offload query processing to the Hadoop cluster using Big Data SQL Server Cells. Queries will benefit from all of the Big Data SQL performance features, including Storage Indexes, Bloom Filters and Smart Scan. This capability is available on both Engineered Systems (Exadata & BDA) and commodity Oracle Databases and DIY clusters. Let's examine this more closely with a use case.

Partitioning is an enabling technology for ILM; it provides powerful functionality to logically divide tables into smaller, more manageable pieces. This yields significant query performance benefits; operations only take place on relevant partitions - including scans, join operations, and more.
Based on data importance (where importance may be the age of data or other criteria) – you can store partitions in tablespaces deployed to different storage tiers: In our scenario, every three months we will offload the oldest quarterly operational data into HDFS - providing desirable cost savings.  Because we have monthly table partitions, moving this data will be easy; storage is an attribute of a partition.   It is important to note that HDFS is a write-once file system; files cannot be arbitrarily updated.  As a result, the Oracle tablespaces that contain the data files targeted for HDFS will need to be marked read-only.  This restriction is fine for our scenario because it’s archive data.  Of course, the table’s data that is not stored in HDFS can be updated as normal. Archive Oracle Database Tablespace to HDFS  The following steps are required to set up the HDFS archive (the end of this post includes the script required to perform these actions): Create a tablespace – we’ll call it cold_hdfs_2010q1.  This tablespace will be used to store the archive data.  Every three months we will create a new tablespace to store the next oldest quarterly data. Move the older, historical partitions to cold_hdfs_2010q1 (using ALTER TABLE … MOVE PARTITION).   Run the BDS Copy Tablespace to HDFS script.  At a high level, this script will 1) mark the tablespace as read-only, 2) copy its data files to HDFS, 3) set the tablespace data file attribute to point to the new HDFS data location.  That’s all it takes!  There is no need to create external tables over data in HDFS; you simply manage a single partitioned table.   We focused here on a tablespace containing a single table’s partitions; of course, an Oracle tablespace may contain many objects.  Instead of copying a single table’s data to HDFS, you can copy archive data from as many tables as you desire. Queries Use Optimized Oracle Storage Format and Smart Scan Queries over the data in Oracle tablespaces will leverage all of the Oracle Database performance features.  The data may be indexed, use Hybrid Columnar Compression – you can even use Oracle Database In-Memory. Exploitation of the optimized Oracle storage formats are then combined with Big Data SQL’s Smart Scan capabilities; queries achieve a compound performance benefit.  Big Data SQL Smart Scan utilizes the massively parallel processing power of the Hadoop cluster to filter data at its source – greatly reducing data movement and network traffic between the cluster and the database. Big Data SQL Smart Scan applies the following processing on the Hadoop cluster: Smart Scan data filter Storage Index data skipping Data mining scoring offload Storage offload for LOBs and CLOBs Offloads scans on encrypted data  In summary, Big Data SQL 3.1 delivers a new, cost effective, easy-to-implement and seamless approach to ILM.  Applications and queries remain unchanged; Big Data SQL virtualization capabilities allow deployments to take advantage of Hadoop’s scalability while leveraging the security, analytic capabilities and performance Oracle Database is known for. Steps for Copying a Tablespace/Partitions to HDFS The following describes the tasks required for archiving data to HDFS. Big Data SQL includes a utility bds-copy-tbs-to-hdfs.sh that automates many of the steps required to archive data to HDFS.  Listed below are the key steps.  We’ll highlight both the automated and manual steps.  Mount HDFS  The first step is to set up fuse-dfs so that the Oracle Database can access HDFS thru the file system.  
Big Data SQL’s bds-copy-tbs-to-hdfs.sh utility will set this up for you (note:  NFS Gateway is also supported – but it is not setup up by the utility).  Simply run the following command from the directory containing Big Data SQL utilities: ./bds-copy-tbs-to-hdfs.sh --install The –install flag will install fuse-dfs, create the fuse-<cluster name>-hdfs service, set up a mount point and create links in the Oracle home to that mount point (more on this later).  If the Hadoop cluster is named “MyCluster”, then you will be able to access HDFS using the following linux command: ls /mnt/fuse-MyCluster-hdfs/ Move Partitions to the Archive Tablespace Now that the basics are installed, we’ll move the partitions that must be archived to a “cold” tablespace.  In our example, we’ll move Q1-2010 partitions (January to March 2010) to a new tablespace:  -- Create cold-tablespace on hdfs for this quarter CREATE TABLESPACE cold_hdfs_2010q1 DATAFILE 'cold_hdfs_2010q1.dbf' SIZE 100M reuse AUTOEXTEND ON nologging; -- Add partitions to the tablespace  ALTER TABLE movie_fact  MOVE PARTITION PART_2010_01 TABLESPACE cold_hdfs_2010q1 ONLINE; ALTER TABLE movie_fact  MOVE PARTITION PART_2010_02 TABLESPACE cold_hdfs_2010q1 ONLINE; ALTER TABLE movie_fact  MOVE PARTITION PART_2010_03 TABLESPACE cold_hdfs_2010q1 ONLINE; Copy the Partitions to HDFS  We’ll now use the bds-copy-tbs-to-hdfs.sh utility to copy the data to HDFS: bds-copy-tbs-to-hdfs.sh --tablespace=COLD_HDFS_2010_Q1 The data has now been archived.  Queries that are run against January thru March 2010 will be sourced from data stored in HDFS! You are not limited to using the script.  If you want more control as to how the data is archived, you can use manual procedures.  Behind the scenes, that single copy to hdfs command is actually doing a few things: 1. Sets the tablespace to be read-only -- Make the tablespace readonly ALTER TABLESPACE cold_hdfs_2010q1 READ ONLY; 2. Takes the tablespace offline -- Take the tablespace offline ALTER TABLESPACE cold_hdfs_2010q1 OFFLINE; 3. Copies the tablespace’s data files to HDFS -- Copy the data files to hdfs cp /u03/app/oracle/product/12.1.0/dbhome_1/dbs/ cold_hdfs_2010q1 /u03/app/oracle/product/12.1.0/dbhome_1/dbs/hdfs:/user/oracle/orcl/cold_hdfs_2010q1 4. Rename the data files used by the tablespace -- Rename the datafiles use alter tablespace with rename datafile clause ALTER TABLESPACE cold_hdfs_2010q1   RENAME DATAFILE '/u03/app/oracle/product/12.1.0/dbhome_1/dbs/cold_hdfs_2010q1' TO '/u03/app/oracle/product/12.1.0/dbhome_1/dbs/hdfs:/user/oracle/orcl/cold_hdfs_2010q1'; Notice that the file’s path includes hdfs:.  This will tell Oracle Database that the data is stored in HDFS and that Big Data SQL Smart Scan should be used when processing the data.  The Oracle Database Home has been setup with the appropriate directories to support the special path; symbolic links will point to the actual HDFS mount point. 5.  Bring the tablespace online -- Bring tablespace online ALTER TABLESPACE cold_hdfs_2010q1 ONLINE; Start Querying the Archive Data You can now seamlessly query data stored in HDFS – and combine that data with data stored in the Oracle Database.  
Below, we’re looking at movie purchase activity (stored in both HDFS and Oracle Database) in January 2010 for high grossing comedy movies (movie and genre data in Oracle Database) : select  m.title, m.year, m.gross, count(*)  from movie_fact mf, movie m, genre g where time_id <= to_date('31-JAN-2010', 'DD-MON-YYYY') and m.movie_id = mf.movie_id and g.genre_id = mf.genre_id and g.name = 'Comedy' group by g.name, m.title, m.year, m.gross order by gross desc; Result:   TITLE  YEAR  GROSS  COUNT(*)  Pirates of the Caribbean: At World's End  2007 963420425   9  Finding Nemo  2003  921743261  120  Shrek 2  2004  919838758  31  Shrek the Third  2007  798958162  5  Up  2009 731342744   63  Forrest Gump  1994  677387716  34  Kung Fu Panda  2008  631744560  54  ...       When you review the explain plan, you’ll notice that - thanks to partition pruning - only two out of 36 partitions have been accessed: --------------------------------------------------------------------------------------------------------- | Id  | Operation                  | Name       | Rows  | Bytes | Cost (%CPU)| Time     | Pstart| Pstop | --------------------------------------------------------------------------------------------------------- |   0 | SELECT STATEMENT           |            |  4246 |   244K|   230   (1)| 00:00:01 |       |       | |   1 |  SORT GROUP BY             |            |  4246 |   244K|   230   (1)| 00:00:01 |       |       | |*  2 |   HASH JOIN                |            |  4246 |   244K|   229   (1)| 00:00:01 |       |       | |   3 |    MERGE JOIN CARTESIAN    |            |  5776 |   236K|   139   (0)| 00:00:01 |       |       | |*  4 |     TABLE ACCESS FULL      | GENRE      |     1 |    11 |     3   (0)| 00:00:01 |       |       | |   5 |     BUFFER SORT            |            |  5076 |   153K|   136   (0)| 00:00:01 |       |       | |   6 |      TABLE ACCESS FULL     | MOVIE      |  5076 |   153K|   136   (0)| 00:00:01 |       |       | |   7 |    PARTITION RANGE ITERATOR|            | 97014 |  1610K|    90   (2)| 00:00:01 |     1 |     2 | |*  8 |     TABLE ACCESS FULL      | MOVIE_FACT | 97014 |  1610K|    90   (2)| 00:00:01 |     1 |     2 | ---------------------------------------------------------------------------------------------------------   Predicate Information (identified by operation id): ---------------------------------------------------      2 - access("M"."MOVIE_ID"="MF"."MOVIE_ID" AND "G"."GENRE_ID"="MF"."GENRE_ID")    4 - filter("G"."NAME"='Comedy')    8 - filter("TIME_ID"<=TO_DATE(' 2010-01-31 00:00:00', 'syyyy-mm-dd hh24:mi:ss'))   And, you can see the impact of Smart Scan in filtering data on the cluster: select ms.sid, sn.name, ms.value from v$mystat ms, v$statname sn where ms.statistic# = sn.statistic#   and sn.name like '%XT%'; cell XT granule bytes requested for predicate offload 314597376 cell interconnect bytes returned by XT smart scan    6048336 Smart Scan filtered 98% of the available partitioned data stored in HDFS – greatly reducing the amount of data processed by the database tier.  Applications and users that querying the data do not need to know where the data resides; they simply issue the same queries that they always used.  Oracle Database + Big Data SQL transparently access and combine all of the data.  
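As a quick sanity check on that 98% figure, the two XT statistics above can be plugged into a simple calculation (a small sketch; the numbers are the ones reported by the query against v$mystat):

-- Bytes returned by Smart Scan as a fraction of bytes requested for offload
SELECT ROUND(100 * (1 - 6048336 / 314597376), 1) AS pct_filtered_on_cluster
FROM   dual;
-- Returns ~98.1, i.e. only about 2% of the requested HDFS data was shipped
-- back to the database tier.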


Functionality

Are approximate answers the best way to analyze big data

Image courtesy of pixabay.com In my previous post I reviewed some reasons why people seem reluctant to accept approximate results as being correct and useful. The general consensus is that approximate results are wrong, which is very strange when you consider how often we interact with approximations as part of our everyday life. Most of the use cases in my first post on this topic covered situations where distinct counts were the primary goal - how many click-throughs did an advert generate, how many unique sessions were recorded for a web site, etc. The use cases that I outlined provided some very good reasons for using approximations of distinct counts. As we move forward into the era of Analytics-of-Things, the use of approximations in queries will expand and this approach to processing data will become an accepted part of our analytical workflows. To support Analytics-of-Things, Database 12c Release 2 (12.2) includes even more approximate functions. In this release we have added approximations for median and percentile computations and support for aggregating approximate results (counts, median and percentiles).

What is a median and percentile?

A quick refresher course… according to Wikipedia a percentile is: a measure used in statistics indicating the value below which a given percentage of observations in a group of observations fall. For example, the 20th percentile is the value (or score) below which 20 percent of the observations may be found. Percentiles are perfect for locating outliers in your data set. In the vast majority of cases you can start with the assumption that a data set exhibits a normal distribution. Therefore, if you take the data below the 0.13th and above the 99.87th percentiles (i.e. outside 3 standard deviations from the mean) then you get the anomalies. Percentiles are great for allowing you to quickly eyeball the distribution of a data set so that you can check for skew or bimodalities etc. Probably the most common use case is around monitoring service levels, where these anomalies are the values of most interest.

On the other hand, a median is: the number separating the higher half of a data sample, a population, or a probability distribution, from the lower half. Why would you use the median rather than the mean? In other words, what are the use cases that require the median? The median is great at removing the impact of outliers because the data is sorted and then the middle value is extracted; the average is susceptible to being skewed by outliers. A great use case for the median is in resource planning. If you want to know how many staff you should assign to manage your web-store application you might create a metric based on the number of sessions during the year. With a web-store the number of sessions will peak around key dates such as July 4th and Thanksgiving. Calculating the average number of sessions over the year will be skewed by these two dates and you will probably end up with too many staff looking after your application. Using the median removes these two spikes and will return a more realistic figure for the number of sessions per day during the year. But before you start to consider where, when, how or even if you want to use approximate calculations, you need to step back for a moment and think about the accuracy of your existing calculations, which I am guessing you think are 100% accurate!
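Before moving on to the question of data accuracy, the resource-planning point above is easy to demonstrate with a small sketch. The web_sessions_daily table and its columns are hypothetical; the query simply contrasts the built-in AVG and MEDIAN aggregates to show how a handful of peak days pull the mean upwards while the median stays representative:

-- Hypothetical table of daily session counts for a web-store
SELECT ROUND(AVG(session_count))    AS avg_sessions_per_day,
       ROUND(MEDIAN(session_count)) AS median_sessions_per_day
FROM   web_sessions_daily
WHERE  activity_date BETWEEN DATE '2016-01-01' AND DATE '2016-12-31';
-- A few spikes (July 4th, Thanksgiving) inflate the average, while the
-- median reflects a "typical" day and gives a more realistic staffing figure.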
Is your data accurate anyway?

Most business users work on the assumption that the data set they are using is actually 100% accurate, and for the vast majority of operational sources flowing into the data warehouse this is probably true, although there will always be parentless dimension values and in some cases "Other" bucket dimension members to create some semblance of accuracy. As we start to explore big data related sources pulled from untrusted external sources and IoT sensor streams, which typically are inherently "noisy", the level of "accuracy" within the data warehouse starts to become a range rather than a single specific value. Let's quickly explore the three key ways that noise gets incorporated into data sets:

1) Human errors

Human input errors: probably the most obvious. They affect both internal and external sources that rely on human input or interpretation of manually prepared data. Free-format fields on forms create all sorts of problems because the answers need to be interpreted. Good examples are insurance claim forms, satisfaction surveys, crime reports, sales returns forms, etc.

2) Coding errors

ETL errors: just about every data source feeding a data warehouse goes through some sort of ETL process. Whilst this is sort of linked to the first group of errors, it does fall into this group simply because of the number of steps involved in most ETL jobs. There are so many places where errors can be introduced.

Rounding and conversion errors: when an ETL job takes source data, converts it and then aggregates it before pushing it into the warehouse, it will always be difficult to trace the aggregated numbers back to the source data because of inherent rounding errors. When dealing with currency exchange rates it can be a little difficult to tie back source data in one currency to the aggregated data in the common currency due to tiny rounding errors.

3) Data errors

Missing data points: data always gets lost in translation somewhere down the line or is simply out of date. In many cases this is the biggest source of errors. For example, one bank recently put together a marketing campaign to stop customer churn. Before they launched the campaign, one of their data scientists did some deeper analysis and discovered that the training data for the model included customers who were getting divorced and this was being flagged as a lost customer. Including this group ended up skewing the results. The data about changes to marital status was not being pushed through to the data warehouse fast enough.

Meaningless or distracting data points: with the growth in interest in the area of IoT it is likely that this type of "noise" will become more prevalent in data sets. Sensor data is rarely 100% accurate, mainly because in many cases it does not need to deliver that level of accuracy. The volume of data being sent from the sensor will allow you to easily remove or flag meaningless or distracting data. With weblogs it is relatively easy to ignore click-events where a user clicks on an incorrect link and immediately clicks the back-button.

In other words, in many situations getting precise answers is nothing but an illusion: even when you process your entire data set, the answer is still an approximate one. So why not use approximation to your computational advantage, and in a way where the trade-off between accuracy and efficiency is controlled by you?
Use cases for these new features

There are a lot of really good use cases for these types of approximations but here are my two personal favorites:

Hypothesis testing – a good example of this is A/B testing, which is most commonly used in conjunction with website design and ad design to select the page design or ad that generates the best response. With this type of analysis it is not vital that you have accurate, precise values. What is needed is the ability to reliably compare results, and approximations are normally good enough.

Ranking – how does your ISP calculate your monthly usage so they can bill you fairly? They use a percentile calculation where they remove the top 2% - 5% of your bandwidth peaks and then use that information to calculate your bill. By using data below the 95th-98th percentile they can ignore the infrequent peaks when, say, you are downloading the latest update to your Android or iOS device. Again, having precise numbers for this percentile cut-off is not really necessary. A good enough approximation of the 95th percentile is usually going to be sufficient because it implies that approximately 95% of the time your usage is below the data volume identified around that percentile. And conversely, the remaining 5% of the time your usage creeps above that amount (a sketch of this calculation follows the list below).

Of course all the use cases that we considered for distinct counts in the first post are also valid:

Discovery analytics: data analysts often slice and dice their dataset in their quest for interesting trends, correlations or outliers. If your application falls into this type of explorative analytics, getting an approximate answer within a second is much better than waiting twenty minutes for an exact answer. In fact, research on human-computer interaction has shown that, to keep business users engaged and productive, the response times for queries must be below 10 seconds. In particular, if the user has to wait for the answer to their query for more than a couple of seconds then their level of analytical thinking can be seriously impaired.

Market testing: the most common use case for market testing is around serving ads on websites. This is where two variants of a specific ad (each with a group of slightly different attributes such as animations or colour schemes) are served up to visitors during a session. The objective is to measure which version generates a higher conversion rate (i.e. more click-throughs). The analytics requires counting the number of clicks per ad with respect to the number of times each ad was displayed. Using an approximation of the number of click-throughs is perfectly acceptable. This is similar to the crowd-counting problem where it is not really necessary to report exactly how many people joined a rally or turned up to an event.

Root cause analysis: contrary to perceived wisdom, this can in fact be accomplished using approximations. Typically RCA follows a workflow model where results from one query trigger another query, which in turn triggers another related query. Approximations are used to speed up the decision as to whether or not to continue with a specific line of analysis. Of course you need to incorporate the likelihood of edge cases within your thinking process because there is the danger that the edge values will get lost within the general hashing process.
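Here is a small sketch of that ISP-style percentile billing calculation. The bandwidth_samples table and its columns are hypothetical; the APPROX_PERCENTILE syntax is the 12.2 syntax introduced in the next section:

-- Hypothetical 5-minute bandwidth samples per customer for the current month
SELECT customer_id,
       APPROX_PERCENTILE(0.95) WITHIN GROUP (ORDER BY mbits_per_sec ASC) AS p95_usage
FROM   bandwidth_samples
WHERE  sample_time >= TRUNC(SYSDATE, 'MM')
GROUP  BY customer_id;
-- Billing on (roughly) the 95th percentile ignores the occasional download
-- spike; an approximate value of the cut-off is normally good enough.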
However, in these examples we usually end up merging or blending the first two use cases with the three above to gain a deeper level of insight, so now let's look at the new approximate statistical functions introduced in Database 12.2.

Approximate median and percentile

With Database 12c Release 2 we have added two new approximate functions:

APPROX_PERCENTILE(%_number [DETERMINISTIC], [ERROR_RATE|CONFIDENCE]) WITHIN GROUP (ORDER BY expr [ DESC | ASC ])

This function takes three input arguments. The first argument is a numeric value ranging from 0% to 100%. The second argument is optional: if the 'DETERMINISTIC' argument is provided, the user requires deterministic results; if it is not provided, deterministic results are not mandatory. The input expression for the function is derived from the expr in the ORDER BY clause.

The approx_median function has the following syntax:

APPROX_MEDIAN(expr [DETERMINISTIC], [ERROR_RATE|CONFIDENCE])

We can use these functions separately or together as shown here using the SH schema:

SELECT calendar_year,
  APPROX_PERCENTILE(0.25) WITHIN GROUP (ORDER BY amount_sold ASC) AS "p-0.25",
  TRUNC(APPROX_PERCENTILE(0.25, 'ERROR_RATE') WITHIN GROUP (ORDER BY amount_sold ASC),2) AS "p-0.25-er",
  TRUNC(APPROX_PERCENTILE(0.25, 'CONFIDENCE') WITHIN GROUP (ORDER BY amount_sold ASC),2) AS "p-0.25-ci",
  APPROX_MEDIAN(amount_sold deterministic) AS "p-0.50",
  TRUNC(APPROX_MEDIAN(amount_sold deterministic, 'ERROR_RATE'),2) AS "p-0.50-er",
  TRUNC(APPROX_MEDIAN(amount_sold deterministic, 'CONFIDENCE'),2) AS "p-0.50-ci",
  APPROX_PERCENTILE(0.75 deterministic) WITHIN GROUP (ORDER BY amount_sold ASC) AS "p-0.75",
  TRUNC(APPROX_PERCENTILE(0.75, 'ERROR_RATE') WITHIN GROUP (ORDER BY amount_sold ASC),2) AS "p-0.75-er",
  TRUNC(APPROX_PERCENTILE(0.75, 'CONFIDENCE') WITHIN GROUP (ORDER BY amount_sold ASC),2) AS "p-0.75-ci"
FROM sales s, times t
WHERE s.time_id = t.time_id
GROUP BY calendar_year
ORDER BY calendar_year;

The results from the above query are shown below. Note that for the APPROX_MEDIAN function I have included the keyword "DETERMINISTIC". What does this actually mean? Due to the nature of computing approximate percentiles and medians it is not possible to provide a specific and constant value for the error rate or the confidence interval. However, when we used a large-scale, real-world customer data set (a manufacturing use case) we saw an error range of around 0.1 - 1.0%. Therefore, in broad general terms, accuracy will not be a major concern.

Error rates and confidence intervals

How closely an approximate answer matches the precise answer is gauged by two important statistics: the margin of error and the confidence level. These two pieces of information tell us how well the approximation represents the precise value. For example, a result may have a margin of error of plus or minus 3 percent at a 95 percent level of confidence. These terms simply mean that if the analysis were conducted 100 times, the data would be within a certain number of percentage points above or below the percentage reported in 95 of the 100 runs. In other words, Company X surveys customers and finds that 50 percent of the respondents say its customer service is "very good." The confidence level is cited as 95 percent plus or minus 3 percent.
This information means that if the survey were conducted 100 times, the percentage who say service is "very good" will range between 47% and 53% most (95%) of the time (for more information see here: https://www.isixsigma.com/tools-templates/sampling-data/margin-error-and-confidence-levels-made-simple/). Please note that if you search for more information about error rates and confidence levels then a lot of the results will talk about sample size and working back from typical or expected error rates and confidence levels to determine the sample size needed. With approximate query processing we do not sample the source data. We always read all the source values; there is no sampling!

Performance - how much faster is an approximate result?

As a test against a real-world schema we took a simple query from the customer that computed a number of different median calculations:

SELECT count(*) FROM (SELECT /*+ NO_GBY_PUSHDOWN */ b15, median(b4000), median(b776), median(e), median(f), median(n), median(z) FROM mdv GROUP BY b15);

As you can see from the real-time monitoring page, the query accessed 105 million rows and the calculations generated 11GB of temp. That's a lot of data for one query to spill to disk! Now if we convert the above query to use the approx_median function (a sketch of the rewritten statement is shown at the end of this post) and rerun it, we can see below that we get a very different level of resource usage. Looking closely at the resource usage you can see that the query is 13x faster and uses considerably less memory (830Kb vs 1GB), but most importantly there is no usage of temp.

Summary

One of the most important take-aways from this post relates to the fact that we always read all the source data. The approximate functions in Database 12c Release 2 do not use sampling as a way to increase performance. These new features are significantly faster and use fewer resources, which means more resources are available for other queries - allowing you to do more with the same level of resources.
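As mentioned above, the post does not show the rewritten statement, but presumably it looked something like the sketch below: the hint, table and column names are taken from the serial query, and each MEDIAN call is simply swapped for APPROX_MEDIAN.

SELECT count(*)
FROM  (SELECT /*+ NO_GBY_PUSHDOWN */ b15,
              approx_median(b4000), approx_median(b776),
              approx_median(e), approx_median(f),
              approx_median(n), approx_median(z)
       FROM   mdv
       GROUP  BY b15);
-- Same shape as the original query; the approximate aggregates avoid sorting
-- the full detail set, which is why nothing spills to temp.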


Functionality

SQL Pattern Matching Deep Dive - Part 6, state machines

The obvious way to start this particular post is to pose a couple of simple questions: what is a state machine and why should you care? In general I would say that you don't need to know about or care about state machines. That's the beauty of using SQL for pattern matching. The MATCH_RECOGNIZE clause encapsulates all the deep technical modelling and processing that has to be performed to run pattern matching on a data set. However, there are times when it is useful, probably vital, that you understand what is going on behind the scenes, and one of the most obvious situations is when backtracking happens. Therefore, the content covered in this post is going to be a gentle lead-in to my next post where I am going to discuss the concept of "backtracking" and the dreaded ORA-30009 error.

Let's start our voyage of discovery… when you attempt to run a SQL statement containing a MATCH_RECOGNIZE clause, during the compilation phase we generate a finite state machine based on the PATTERN and DEFINE clauses in your statement.

What is a Finite State Machine?

According to Wikipedia: a finite-state machine (FSM)… is a mathematical model of computation... it is conceived as an abstract machine that can be in one of a finite number of states. The machine is in only one state at a time… changes from one state to another when initiated by a triggering event or condition; this is called a transition. A particular FSM is defined by a list of its states, and the triggering condition for each transition. Reference from Wikipedia - https://en.wikipedia.org/wiki/Finite-state_machine

A state machine, which is the PATTERN and DEFINE elements of your MATCH_RECOGNIZE clause, can be represented by a directed graph called a state diagram. This diagram shows each of the possible states for the "machine" and the conditions that force the machine to either remain in its current state or move to the next state. Below is a simple example of a state machine. On the above diagram each state is represented by a node (ellipse), which in this case are marked as "State 1" to "State 4". The arrows, known as edges, show the transition(s) from one state to another. If you look at states 2 and 4 you will notice that they have two edges, although these edges are shown in different vertical positions on the diagram. When drawing a proper state diagram each edge is labeled with the event (condition) that triggers the transition. Events (conditions) that don't cause a change of state are represented by a circular arrow returning to the original state, and these can be seen on states 2 and 4. The precedence for reading the information is to read from the top down. What this means is that when in State 2 the FSM will test to see if State 3 can be achieved and if it can't it will then test to see if State 2 can be maintained. The reverse is true for State 4, where the FSM will test to see if State 4 can be maintained and if it can't it will then, in this example, either end having determined that a match has completed or start backtracking to try and complete a match. I am sure you can now see how this is going to link into my next blog post. State machines are not limited to just pattern matching. They have all sorts of other uses.
If you want a gentle diversion to look at state machines in a little more detail then try this article by Enrique Ortiz from the OTN magazine in August 2004: Managing the MIDlet Life-Cycle with a Finite State Machine. All of this flows directly into keywords that appear (or don't appear) in the explain plans, which was covered in the post MATCH_RECOGNIZE and the Optimizer from January 2015. As a quick refresher… essentially there are four new keywords that you need to be aware of:

MATCH RECOGNIZE
SORT
BUFFER
DETERMINISTIC FINITE AUTO

The first three bullet points are reasonably obvious. The last keyword is linked to the use of a "state machine". Its appearance, or lack of appearance, affects the way our pattern is applied to our data set, but that is all explained in the blog post. Most of my MATCH_RECOGNIZE examples are based on the stock ticker data set. Let's assume that we are searching for V-shaped patterns in our data set (https://docs.oracle.com/database/121/DWHSG/pattern.htm#CACHHJJG):

SELECT *
FROM Ticker MATCH_RECOGNIZE (
  PARTITION BY symbol
  ORDER BY tstamp
  MEASURES STRT.tstamp AS start_tstamp,
           LAST(DOWN.tstamp) AS bottom_tstamp,
           LAST(UP.tstamp) AS end_tstamp
  ONE ROW PER MATCH
  AFTER MATCH SKIP TO LAST UP
  PATTERN (STRT DOWN+ UP+)
  DEFINE
    DOWN AS price < PREV(down.price),
    UP AS price > PREV(up.price)
) MR
ORDER BY MR.symbol, MR.start_tstamp;

and this is what the state diagram would look like: These diagrams can be really helpful when you have more complex patterns and you need to consider the impact of backtracking. This post is all about laying the building blocks for my next post on backtracking and the dreaded ORA-30009 error. If you have managed to read this far then you are guaranteed to be ready for an in-depth look at what happens inside MATCH_RECOGNIZE when we move from right to left through our state diagram in an attempt to find a complete match. Now you should know everything you need to know about state machines and I am going to carry over the "why care" part to the next post… If you want a recap of where we are in this series of pattern matching deep dive posts here is the full list:

MATCH_RECOGNIZE and the Optimizer
SQL Pattern Matching Deep Dive - Part 1, The basics
SQL Pattern Matching Deep Dive - Part 2, using MATCH_NUMBER() and CLASSIFIER()
SQL Pattern Matching Deep Dive - Part 3, greedy vs. reluctant quantifiers
SQL Pattern Matching Deep Dive - Part 4, Empty matches and unmatched rows?
SQL Pattern Matching Deep Dive - Part 5, SKIP TO where exactly?

As per usual, if you have any comments or questions then feel free to contact me directly via email: keith.laker@oracle.com Technorati Tags: Analytics, Data Warehousing, Database 12c, SQL, SQL Analytics
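To make the link between the PATTERN clause and the state machine a little more concrete, here is one informal way to read PATTERN (STRT DOWN+ UP+) as states and transitions. This is my own simplified sketch of the regular-expression-to-FSM mapping, not a reproduction of the exact diagram referenced above:

--  State STRT : any row can start a candidate match
--       |  next row satisfies DOWN (price < PREV(price))
--       v
--  State DOWN : loops on itself while prices keep falling (DOWN+)
--       |  next row satisfies UP (price > PREV(price))
--       v
--  State UP   : loops on itself while prices keep rising (UP+);
--               when no further row qualifies, the match is complete
--               (or the engine backtracks to try to complete one)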


Functionality

Exploring the interfaces for User Defined Aggregates

image courtesy of wikipedia Whilst I was working on the functional specification for the LISTAGG extensions that we implemented in 12c Release 2, I came across Tom Kyte’s stragg function which uses the User Defined Aggregate API introduced in database 9i. Tom’s comprehensive answer covers the two important areas that need to considered when using the data cartridge API: 1) a UDA can run as serial process and 2) a UDA can run as a parallel process. Therefore, you need to code for both these eventualities. Dealing with both scenarios can be a little challenging - as I have discovered over the last few weeks. having looked at a number of posts there is a common theme for explaining how the various interfaces for user defined aggregate actually work. One the clearest examples is on Tim Hall’s blog: String Aggregation Techniques. This got me thinking…would it be possible to take the new  extensions we made to LISTAGG and incorporate them into custom string manipulation function built using the UDA interfaces? Essentially providing a pre-12.2 solution to prevent the text-string overflow error “ORA-01489: result of string concatenation is too long”? After some initial coding I managed to get a solution that worked perfectly as long as I did not try and run the query using parallel execution. Eventually I managed to get the parallel execution process coded but it was not deterministic and the results differed from the results from the serial query. After revisiting Tom Kyte’s  stragg function solution I think I have finally created a fully working solution and here is what I have learned along the way… Overview Before the going into the code, let’s explore the data cartridge interface for creating user defined aggregates. As far as I can tell this is not a very well known or well used feature in that I have not seen any presentations at user conferences that explain why and how to build these functions. It is a very interesting feature because User-defined aggregate functions can be used in SQL DML statements just like Oracle’s built-in aggregates. Most importantly they allow you to work with and manipulate complex data types such as multimedia data stored using object types, opaque types, and LOBs. Each user-defined aggregate function is made up of three mandatory and three optional ODCIAggregate interfaces, or steps, to define internal operations that any aggregate function performs. The four mandatory interfaces are: initialization, iteration, merging, and termination. Initialization is accomplished by the ODCIAggregateInitialize() routine, which is invoked by Oracle to initialize the computation of the user-defined aggregate. The initialized aggregation context is passed back to Oracle as an object type instance. Iteration is performed through the ODCIAggregateIterate() routine, which is repeatedly invoked by Oracle. On each invocation, a new value or a set of new values and the current aggregation context are passed in. The routine processes the new values and returns the updated aggregation context. This routine is invoked for every non-NULL value in the underlying group. NULL values are ignored during aggregation and are not passed to the routine. Merging is an optional step and is performed by ODCIAggregateMerge(), a routine invoked by Oracle to combine two aggregation contexts. This routine takes the two contexts as inputs, combines them, and returns a single aggregation context. 
Termination takes place when the ODCIAggregateTerminate() routine is invoked by Oracle as the final step of aggregation. The routine takes the aggregation context as input and returns the resulting aggregate value. This is how these four main functions fit together… The most important observations I would make are: You need to think very carefully about where and how you want to process your data.  There are essentially two options: 1) during the Iterate stage and/or 2) during the Terminate stage. Of course you need to remember that code used in the Iterate stage needs to replicated at the Merge stage. It’s tempting but don’t ignore the Merge stagebut if you do then when the function is run in parallel you won’t see any results!  What’s missing…. In the code sample below you will notice that I am missing two interfaces: ODCIAggregateDelete() and ODCIAggregateWrapContext(). Both functions are optional and for the purposes of creating a replacement for the LISTAGG function, these two functions were not needed. But for the sake of completeness below is a brief description of each function:   ODCIAggregateDelete() removes an input value from the current group. The routine is invoked by Oracle by passing in the aggregation context and the value of the input to be removed. It processes the input value, updates the aggregation context, and returns the context. This is an optional routine and is implemented as a member method. ODCIAggregateWrapContext() integrates all external pieces of the current aggregation context to make the context self-contained. Invoked by Oracle if the user-defined aggregate has been declared to have external context and is transmitting partial aggregates from slave processes. This is an optional routine and is implemented as a member method. I have searched the internet for examples of when and how to use these two optional functions and there is not much out there. One example was posted by Gary Myers http://blog.sydoracle.com/2005/09/analytics-with-order-by-and-distinct.html which is another derivation of the STRAGG function. In  Gary’s example the objective is to return rows until 5 distinct values of a specific column had been returned. I think it is possible to do this without resorting to implementing this requirement within the  ODCIAggregateTerminate function but I will leave you to under that one! Schema - Sales History For this simple LISTAGG alternative I am using the sample sales history schema and I am going to create a concatenated list of the first name and last name of each customer within each sales region. To make the code examples a little easier to read I have created a view over the CUSTOMERS and COUNTRIES table: CREATE OR REPLACE VIEW CG_LIST ASSELECT g.country_region_id, c.cust_first_name, c.cust_last_nameFROM countries g, customers cWHERE g.country_id = c.country_idORDER BY g.country_region_id;  If we try to run the following SQL query we will get the usual LISTAGG overflow error: SELECT country_region_id, LISTAGG(cust_first_name||' '||cust_last_name) WITHIN GROUP (ORDER BY country_region_id) AS "customers"FROM MY_LISTGROUP BY country_region_id ORDER BY country_region_id; ORA-01489: result of string concatenation is too long01489. 00000 - "result of string concatenation is too long"*Cause: String concatenation result is more than the maximum size.*Action: Make sure that the result is less than the maximum size. 
Now let’s use the User Defined Aggregates framework to resolve this issue… Stage 1 - Building the basic framework First we need a storage object to hold the results generated during the Iterate stage. Following Tom Kyte’s example I am using an array/table as my storage object as this will ensure that I never hit the limits of the VARCHAR2 object when I am building my list of string values. Here I have checked the maximum size of the concatenated first name and last name combinations which is 21 characters. This means I can set the limit for the varchar2 column at 25 characters (…giving myself a little bit of headroom, just in case…). CREATE OR REPLACE TYPE string_varray AS TABLE OF VARCHAR2(25); Note that if we did not use an array we would be forced to use a single VARCHAR2 variable to hold the string values being pieced together. This would mean testing the length of the VARCHAR2 object before adding the next string value within both the Iterate and Merge functions, which causes all sorts of problems! As with most things in life, when working with the user defined aggregate interfaces it pays to keep things as simple as possible and put the processing in the most appropriate place. Therefore, in this case the best solution is to use an array because it makes the code simple and processing more efficient!  Here is the definition of for the interfaces: CREATE OR REPLACE TYPE t_string_agg AS OBJECT( a_string_data string_varray, STATIC FUNCTION ODCIAggregateInitialize(sctx IN OUT t_string_agg) RETURN NUMBER, MEMBER FUNCTION ODCIAggregateIterate(self IN OUT t_string_agg, value IN VARCHAR2 ) RETURN NUMBER, MEMBER FUNCTION ODCIAggregateTerminate(self IN t_string_agg, returnValue OUT VARCHAR2, flags IN NUMBER) RETURN NUMBER, MEMBER FUNCTION ODCIAggregateMerge(self IN OUT t_string_agg, ctx2 IN t_string_agg) RETURN NUMBER);/SHOW ERRORS Initially I included some additional variables at the start of the OBJECT definition for managing various counters and the maximum string length. However, it’s not possible to pass parameters into this function except for the string value to be processed. Therefore, I removed the counter definitions from the header and just instantiated my array. This actually makes perfect sense because the counters and maximum string length variables are only needed at one specific point within the process flow - more on this later -  plus this keeps the definition stage nice and simple.  Now we have the definition for our interfaces which means we can move on to the actual coding associated with each interface. Stage 2 - Coding the Initialization phase The code for this interface is relatively simple because all we need to do is instantiate the array that will hold our list of string values being passed into the aggregation process. We do this using a call to the object named in the previous section: sctx := t_string_agg(string_varray() ); This is the start of our processing flow… STATIC FUNCTION ODCIAggregateInitialize(sctx IN OUT t_string_agg) RETURN NUMBER IS BEGIN sctx := t_string_agg(string_varray() ); RETURN ODCIConst.Success; END; Stage 3 - Coding the Iterate phase The iterate phase will get called multiple times as we process the string values that need to be aggregated. At this point all we need to do is to collect the string values being passed in and insert them into the array object that was created at the start of stage 1. We use the extend function to add a new row into the array and assign the string value. 
The important point to note here is that the Iterate process will also form part of the parallel execution framework, therefore,  we need to keep the processing as simple as possible.  MEMBER FUNCTION ODCIAggregateIterate(self IN OUT t_string_agg, value IN VARCHAR2) RETURN NUMBER IS BEGIN a_string_data.extend; a_string_data(a_string_data.count) := value; RETURN ODCIConst.Success; END; Given that we want to try and ensure that we don’t exceed the limits of the VARCHAR2 object and generate an  “ORA-01489“ it’s tempting to code this logic within the Iterate function. The issue with placing the processing logic within this function is that all the logic will need to be replicated within the ODCIAggregateMerge function to cope parallel execution where multiple Iterate functions are executed and the results merged to generate a final, single resultset that can be passed to the ODCIAggregateTerminate function.  Stage 4 - Coding the Merge phase This function gets called during parallel execution and is used to merge results from multiple Iterate processes to generate a final, single resultset that can be passed to the ODCIAggregateTerminate function. The basic logic needs to replicate the logic from the ODCIAggregateIterate function but needs to take account of multiple values coming into the function via the CTX2 instantiation of our string function rather than the single string value being passed into the Iterate function. In essence the CTX2 provides the data from the various parallel execution processes and the self object simply accumulates the results being passed in from the various processes. MEMBER FUNCTION ODCIAggregateMerge(self IN OUT t_string_agg, ctx2 IN t_string_agg) RETURN NUMBER IS BEGIN FOR i IN 1 .. ctx2.a_string_data.count LOOP a_string_data.extend; a_string_data(a_string_data.count) := ctx2.a_string_data(i); END LOOP; RETURN ODCIConst.Success; END; If we tried to enforce the maximum string length functionality within Iterate function then the same processing would need to be enforced within this function as well and having tried to do it I can tell you that the code very quickly gets very complicated. This is not the right place to do complex processing that is better done in the final, terminate, phase. If you do find yourself putting a lot of code within either or both the Iterate and/or Merge functions then I would recommend taking a very hard look at whether the code would be better and simpler if it was placed within the Terminate stage. Stage 5 - Coding the Terminate phase  Given that we want to avoid blowing the limits of the VARCHAR2 object this is the most obvious place to code our processing logic. By placing the code here, we can ensure that our string processing function will work when the query is run in both serial and parallel.  
Here is the code: MEMBER FUNCTION ODCIAggregateTerminate(self IN t_string_agg, returnValue OUT VARCHAR2, flags IN NUMBER) RETURN NUMBER IS l_data varchar2(32000); ctx_len NUMBER; string_max NUMBER; BEGIN ctx_len := 0; string_max := 100; FOR x IN (SELECT column_value FROM TABLE(a_string_data) order by 1) LOOP IF LENGTH(l_data || ',' || x.column_value) <= string_max THEN l_data := l_data || ',' || x.column_value; ELSE ctx_len := ctx_len + 1; END IF; END LOOP; IF ctx_len > 1 THEN l_data := l_data || '...(' || ctx_len||')'; END IF; returnValue := LTRIM(l_data, ','); RETURN ODCIConst.Success; END; Note that the function returns the list of string values from the array sorted in alphabetical order: (SELECT column_value FROM TABLE(a_string_data) order by 1). If you don’t need a sorted list then you can remove the ORDER BY clause. Cycling through the ordered list of strings allows us to add a comma to separate each value from the next but first we need to check the length of the list of strings to see if we have reached the maximum string length set by the variable string_max. As soon as we reach this maximum value we start counting the number of values that are excluded from the final list: IF LENGTH(l_data || ',' || x.column_value) <= string_max THEN l_data := l_data || ',' || x.column_value; ELSE ctx_len := ctx_len + 1; END IF; For this code sample the maximum string length is set as 100 characters. Once we reach that value then we simply continue looping through the array but now we increment our counter (ctx_len).  Given that we have sorted the list of string values, the next question is: can we remove duplicate values? The answer is yes! It is a relatively simple change: FOR x IN (SELECT DISTINCT column_value FROM TABLE(a_string_data) order by 1) There is an obvious cost to pay here because the use of the DISTINCT keyword will require additional processing which will have a performance impact so think very carefully about whether this is actually needed.  Full code sample Here is the complete definition of the code: CREATE OR REPLACE TYPE BODY t_string_agg IS STATIC FUNCTION ODCIAggregateInitialize(sctx IN OUT t_string_agg) RETURN NUMBER IS BEGIN sctx := t_string_agg(string_varray() ); RETURN ODCIConst.Success; END; MEMBER FUNCTION ODCIAggregateIterate(self IN OUT t_string_agg, value IN VARCHAR2) RETURN NUMBER IS BEGIN a_string_data.extend; a_string_data(a_string_data.count) := value; RETURN ODCIConst.Success; END; MEMBER FUNCTION ODCIAggregateTerminate(self IN t_string_agg, returnValue OUT VARCHAR2, flags IN NUMBER) RETURN NUMBER IS l_data varchar2(32000); ctx_len NUMBER; string_max NUMBER; BEGIN ctx_len := 0; string_max := 100; FOR x IN (SELECT DISTINCT column_value FROM TABLE(a_string_data) order by 1) LOOP IF LENGTH(l_data || ',' || x.column_value) <= string_max THEN l_data := l_data || ',' || x.column_value; ELSE ctx_len := ctx_len + 1; END IF; END LOOP; IF ctx_len > 1 THEN l_data := l_data || '...(' || ctx_len||')'; END IF; returnValue := LTRIM(l_data, ','); RETURN ODCIConst.Success; END; MEMBER FUNCTION ODCIAggregateMerge(self IN OUT t_string_agg, ctx2 IN t_string_agg) RETURN NUMBER IS BEGIN FOR i IN 1 .. 
ctx2.a_string_data.count LOOP a_string_data.EXTEND; a_string_data(a_string_data.COUNT) := ctx2.a_string_data(i); END LOOP; RETURN ODCIConst.Success; END;END;/SHOW ERRORS Stage 6 - Coding the actual string function  The last step is to create a function that calls our string processing object and takes a string (VARCHAR2) object as it’s input: CREATE OR REPLACE FUNCTION string_agg (p_input VARCHAR2)RETURN VARCHAR2PARALLEL_ENABLE AGGREGATE USING t_string_agg; Bringing it all together… Now let’s run a query using our replacement for LISTAGG().. SELECT country_region_id,string_agg(cust_first_name||' '||cust_last_name) AS s_str_listFROM MY_LISTGROUP BY country_region_id ORDER BY country_region_id;   and here is the output: the length of the concatenated string for each region is limited to a maximum of 100 characters and we have a count of the number of values excluded from the list - shown in brackets at the end of the list. Just so you can check that the truncation is occurring correctly I have included an additional column in the resultset below which shows the length of the final string (including the three dots + the numbers showing the count of truncated values). Conclusion First a huge vote of thanks to Tom Kyte for providing the original code and ideas behind this particular blog post.   If you don’t have access to Database 12c Release 2 and the enhancements that we made to LISTAGG to control  “ORA-01489“ errors then here is a “usable” alternative that provides a lot more control over string concatenation. Obviously if you want string concatenation with a count of missing values and without a count then you will need to create two separate functions. If you need a distinct list of values for some queries and not others then you will need to create separate functions to handle the different processing. This example also provides a great introduction to the area of user defined aggregates, which were introduced in Oracle 9i. A quick Google search returned a lot of similar examples but no “real-world” use cases so I assume that UDAs are a little used gem within the long list of Oracle Database features. UDAs are a very powerful and flexible feature that you should definitely add to your toolbox of skills.
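For comparison only: if you are on 12.2, the LISTAGG overflow clause mentioned in the conclusion handles the ORA-01489 case directly, although it truncates at the VARCHAR2 limit rather than at a user-defined length such as the 100-character cap used in the UDA above. A minimal sketch, using the same view queried earlier:

SELECT country_region_id,
       LISTAGG(cust_first_name||' '||cust_last_name, ',' ON OVERFLOW TRUNCATE '...' WITH COUNT)
         WITHIN GROUP (ORDER BY cust_last_name) AS s_str_list
FROM   my_list
GROUP  BY country_region_id
ORDER  BY country_region_id;
-- Instead of raising ORA-01489, the result ends with '...' plus a count of
-- the values that did not fit.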


Functionality

Simplifying your data validation code with Database 12.2

Image courtesy of pixabay.com Doesn’t matter who much testing you do (well, it actually does but that’s a whole different issue) you can almost guarantee that at some point your beautiful data validation code, that parses data input from a web form or loads data from some external file, will pop up with the error: SQL Error: ORA-01722: invalid number 01722. 00000 - "invalid number"*Cause: The specified number was invalid.*Action: Specify a valid number. Of course, what’s is really annoying at this point is that you don’t know which column value of the record failed (assuming that you have more than one numeric column) Managing conversion errors during data loads What’s to do? Of course the sensible thing is to add lots of data validation checks into your code to try and catch the situations where the wrong type of data  arrives from your data source. It’s likely that all the additional validation checks will slow down the process of inserting data, which is not a great result. If your data is arriving via an external file then you can use the BADFILE clause to capture records that cannot be loaded because of data type errors. But what if the data source for your insert statement is a staging table that was populated by an ETL job or a series of values from a web form?  How to manage conversion errors during INSERTs Panic over - Database 12c Release 2 contains important changes to the CAST and TO_xxx functions to manage the most common data conversion errors. The CAST function now has the ability to return a user-specified value if there is a conversion error. For example, let’s build a simple staging table in the schema: CREATE TABLE STAGING_EMP ( "EMPNO" VARCHAR2(6),   "ENAME" VARCHAR2(10),   "JOB" VARCHAR2(9),   "MGR" VARCHAR2(4),   "HIREDATE" VARCHAR2(10),   "SAL" VARCHAR2(7),   "COMM" VARCHAR2(9),   "DEPTNO" VARCHAR2(6)); and let’s insert some data, which includes values that will cause data conversion errors when we try to add the values into our target table: -- INSERTING DATA INTO STAGING_EMPInsert into STAGING_EMP (EMPNO,ENAME,JOB,MGR,HIREDATE,SAL,COMM,DEPTNO)             values ('GB9369','SMITH','CLERK','7902','17-DEC-80','800',null,'20');-- INVALID DATEInsert into STAGING_EMP (EMPNO,ENAME,JOB,MGR,HIREDATE,SAL,COMM,DEPTNO) values ('9499','ALLEN','SALESMAN','7698','31-FEB-81','1600','300','30');-- INVALID NUMBER FOR DEPTNOInsert into STAGING_EMP (EMPNO,ENAME,JOB,MGR,HIREDATE,SAL,COMM,DEPTNO) values ('9521','WARD','SALESMAN','7698','22-FEB-81','1250','500','SALES');-- INVALID NUMBER FOR EMPNO KEYInsert into STAGING_EMP (EMPNO,ENAME,JOB,MGR,HIREDATE,SAL,COMM,DEPTNO) values ('US9566','JONES','MANAGER','7839','02-APR-81','2975',null,'20');Insert into STAGING_EMP (EMPNO,ENAME,JOB,MGR,HIREDATE,SAL,COMM,DEPTNO) values ('9782','CLARK','MANAGER','7839','09-JUN-81','2450',null,'10');-- INVALID NUMBER FOR EMPNO KEYInsert into STAGING_EMP (EMPNO,ENAME,JOB,MGR,HIREDATE,SAL,COMM,DEPTNO) values ('FR9788','SCOTT','ANALYST','7566','19-APR-87','3000',null,'20');-- INVALID NUMBER FOR MGR KEYInsert into STAGING_EMP (EMPNO,ENAME,JOB,MGR,HIREDATE,SAL,COMM,DEPTNO) values ('9839','KING','PRESIDENT','null','17-NOV-81','5000',null,'10');-- INVALID NUMBER FOR EMPNO KEYInsert into STAGING_EMP (EMPNO,ENAME,JOB,MGR,HIREDATE,SAL,COMM,DEPTNO) values ('DE9844','TURNER','SALESMAN','7698','08-SEP-81','1500',0,'30');Insert into STAGING_EMP (EMPNO,ENAME,JOB,MGR,HIREDATE,SAL,COMM,DEPTNO) values ('9876','ADAMS','CLERK','7788','23-MAY-87','1100',null,'20'); Now let’s try inserting the data from our staging table into the 
EMP table and see what happens: INSERT INTO scott.emp SELECT * FROM staging_emp; … and not surprisingly I get the following error: Error starting at line : 52 in command -INSERT INTO emp SELECT * FROM staging_empError report -SQL Error: ORA-01722: invalid number01722. 00000 - "invalid number"*Cause: The specified number was invalid.*Action: Specify a valid number. I can deal with this situation in a couple of different ways. Firstly let’s try and discover which rows and columns in my staging table contain values that are likely to cause data conversion errors. To do this I am going to use the new VALIDATE_CONVERSION() function which identifies problem data that cannot be converted to the required data type. It returns 1 if a given expression can be converted to the specified data type, else it returns 0. SELECT  VALIDATE_CONVERSION(empno AS NUMBER) AS is_empno,  VALIDATE_CONVERSION(mgr AS NUMBER) AS is_mgr,  VALIDATE_CONVERSION(hiredate AS DATE) AS is_hiredate,  VALIDATE_CONVERSION(sal AS NUMBER) AS is_sal,  VALIDATE_CONVERSION(comm AS NUMBER) AS is_comm,  VALIDATE_CONVERSION(deptno AS NUMBER) AS is_deptno FROM staging_emp; this produces a table where I can easily pick out the rows where the data conversion is going to succeed (column value is 1) and fail (column value is 0): I could use this information to filter the data in my staging table as I insert it into my EMP table or I could use the enhanced CAST and TO_xxx functions within the INSERT INTO ….. SELECT statements. The CAST function (along with TO_NUMBER, TO_BINARY_FLOAT, TO_BINARY_DOUBLE, TO_DATE, TO_TIMESTAMP, TO_TIMESTAMP_TZ, TO_DSINTERVAL, and TO_YMINTERVAL functions) can now return a user-specified value, instead of an error, when data type conversion errors occur. This reduces failures during an data transformation and data loading processes. Therefore, my new 12.2 self-validating SELECT statement looks like this: INSERT INTO emp SELECT   empno,   ename,   job,   CAST(mgr AS NUMBER DEFAULT 9999 ON CONVERSION ERROR),   CAST(hiredate AS DATE DEFAULT sysdate ON CONVERSION ERROR),   CAST(sal AS NUMBER DEFAULT 0 ON CONVERSION ERROR),   CAST(comm AS NUMBER DEFAULT null ON CONVERSION ERROR),   CAST(deptno AS NUMBER DEFAULT 99 ON CONVERSION ERROR) FROM staging_emp WHERE VALIDATE_CONVERSION(empno AS NUMBER) = 1; which results in five rows being inserted into my EMP table - obviously this means that 4 rows were rejected during the insert process (rows 1, 4, 6 and 8) because they contain errors converting the contents to a number for the empno key. Here is the data that was loaded: we can see that on row 1 the HIERDATE was invalid so it was replaced by the value from sys date (07-JUL-16). Row 2 the value of DEPTNO is the conversion default of 99 and on row 4 the value for MGR is the conversion default of 9999. Conclusion The enhanced  CAST function (along with TO_NUMBER, TO_BINARY_FLOAT, TO_BINARY_DOUBLE, TO_DATE, TO_TIMESTAMP, TO_TIMESTAMP_TZ, TO_DSINTERVAL, and TO_YMINTERVAL functions) can help you deal with data conversion errors without having to resort to complicated PL/SQL code or writing data validation routines within your application code. The new VALIDATE_CONVERSION() function can be used to help you identify column values that cannot be converted to the required data type. Hope these two features are useful. Enjoy! Don’t forget that LiveSQL is now running Database 12c Release so check out all the new tutorials and code samples that have recently been posted. 
I have just published a tutorial covering the features discussed above and it is available here: https://livesql.oracle.com/apex/livesql/file/tutorial_EDVE861IMHO1W3Q654ES9EQQW.html
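If you just want to try the conversion-error handling in isolation (for example in LiveSQL), the same DEFAULT ... ON CONVERSION ERROR clause also works in simple scalar expressions. A couple of standalone examples; the literal values are just illustrations:

-- 'SALES' cannot be converted to a number, so the default is returned
SELECT TO_NUMBER('SALES' DEFAULT 99 ON CONVERSION ERROR) AS deptno FROM dual;

-- An invalid date falls back to the supplied default value
SELECT TO_DATE('31-FEB-81' DEFAULT '01-JAN-00' ON CONVERSION ERROR, 'DD-MON-RR') AS hiredate FROM dual;

-- VALIDATE_CONVERSION returns 1 or 0 instead of raising an error
SELECT VALIDATE_CONVERSION('31-FEB-81' AS DATE, 'DD-MON-RR') AS is_date FROM dual;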


Big Data SQL

Big Data SQL Quick Start. Add SerDe in Big Data SQL classpath. - Part17.

today I'm going to write about how to add custom SerDe in Big Data SQL. SerDe is one of the most powerful features of Hadoop and Big Data SQL in particular. It allows you to read any type of data as structured, you just need to explain how to do parse it.  Let's imagine, that we have JSON file: {"wr_returned_date_sk":38352,"wr_returned_time_sk":46728,"wr_item_sk":561506,"wr_refunded_customer_sk":1131210} {"wr_returned_date_sk":38380,"wr_returned_time_sk":78937,"wr_item_sk":10003,"wr_refunded_customer_sk":1131211} and want to proceed it with custom SerDe, for example, org.openx.data.jsonserde.JsonSerDe. Based on the guide I'm trying to create the external table: hive> CREATE EXTERNAL TABLE json_openx( wr_returned_date_sk bigint, wr_returned_time_sk bigint, wr_item_sk bigint, wr_refunded_customer_sk bigint) ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe' LOCATION 'hdfs://scaj43-ns/user/hive/warehouse/json_string'; FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask. Cannot validate serde: org.openx.data.jsonserde.JsonSerDe in the end, I got the error, which tells me that I don't have this jar. Fair enough. I have to add it. I can add this file with the Hive API and create table again: hive> add jar hdfs://scaj43-ns/tmp/json/json-serde-1.3.6-SNAPSHOT-jar-with-dependencies.jar; converting to local hdfs://scaj43-ns/tmp/json/json-serde-1.3.6-SNAPSHOT-jar-with-dependencies.jar Added [/tmp/f0317b31-2df6-4a24-ab8d-66136f9c26e6_resources/json-serde-1.3.6-SNAPSHOT-jar-with-dependencies.jar] to class path Added resources: [hdfs://scaj43-ns/tmp/json/json-serde-1.3.6-SNAPSHOT-jar-with-dependencies.jar] hive> CREATE EXTERNAL TABLE json_openx( wr_returned_date_sk bigint, wr_returned_time_sk bigint, wr_item_sk bigint, wr_refunded_customer_sk bigint) ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe' LOCATION 'hdfs://scaj43-ns/user/hive/warehouse/json_string'; OK Time taken: 0.047 seconds hive> select * from json_openx limit 1; OK 38352 46728 561506 1131210 Now everything seems good, but if I log off and log in again, the query will not work. hive> select * from json_openx limit 1; FAILED: RuntimeException MetaException(message:java.lang.ClassNotFoundException Class org.openx.data.jsonserde.JsonSerDe not found) well, I have to add this JAR in Hive config and most convenient way to do this is Cloudera Manager. before this, I have to copy it on each machine on the cluster. 
We also need to copy this jar to the Hive Auxiliary JARs directory on every node:

[Linux]# dcli -C "mkdir /home/oracle/serde/"
[Linux]# dcli -C -f /root/json/json-serde-1.3.6-SNAPSHOT-jar-with-dependencies.jar -d /home/oracle/serde/
[Linux]# dcli -C -f /root/json/json-serde-1.3.6-SNAPSHOT-jar-with-dependencies.jar -d /opt/oracle/bigdatasql/bdcell-12.1/jlib
[Linux]# dcli -C "ls /home/oracle/serde"
192.168.42.92: json-serde-1.3.6-SNAPSHOT-jar-with-dependencies.jar
192.168.42.93: json-serde-1.3.6-SNAPSHOT-jar-with-dependencies.jar
192.168.42.94: json-serde-1.3.6-SNAPSHOT-jar-with-dependencies.jar

After the files have been propagated to each node, click on the Hive service in Cloudera Manager, then Configuration, type "hive-env" in the search bar and add the path to the jar. After this, restart the Hive service and deploy the client configuration. Check that Hive works:

hive> select * from json_openx limit 1;
OK
38352 46728 561506 1131210

Now we are ready to create an external table in Oracle (I'm using the DBMS_HADOOP PL/SQL package for that):

SQL> DECLARE
 DDLout VARCHAR2(4000);
BEGIN
 dbms_hadoop.create_extddl_for_hive(
  CLUSTER_ID => 'scaj43',
  DB_NAME => 'default',
  HIVE_TABLE_NAME => 'json_openx',
  HIVE_PARTITION => FALSE,
  TABLE_NAME => 'json_openx',
  PERFORM_DDL => FALSE,
  TEXT_OF_DDL => DDLout);
 dbms_output.put_line(DDLout);
END;
/
SQL> CREATE TABLE BDS.json_openx
 (wr_returned_date_sk NUMBER,
  wr_returned_time_sk NUMBER,
  wr_item_sk NUMBER,
  wr_refunded_customer_sk NUMBER)
ORGANIZATION EXTERNAL
 (TYPE ORACLE_HIVE DEFAULT DIRECTORY DEFAULT_DIR
  ACCESS PARAMETERS (
   com.oracle.bigdata.cluster=scaj43
   com.oracle.bigdata.tablename=default.json_openx)
 )
PARALLEL 2 REJECT LIMIT UNLIMITED;
SQL> SELECT * FROM BDS.json_openx;
SELECT * FROM BDS.json_openx
ERROR at line 1:
ORA-29913: error in executing ODCIEXTTABLEOPEN callout
ORA-29400: data cartridge error
KUP-11504: error from external driver: oracle.hadoop.sql.JXADException: error parsing "com.oracle.bigdata.colmap" field name "wr_returned_date_sk" not found

We got an error because this jar doesn't exist in the Big Data SQL classpath, so let's add it. On the cell side, in the bigdata.properties file (in my environment it's in /opt/oracle/bigdatasql/bdcell-12.1/bigdata.properties), add this jar to the java.classpath.hadoop variable:

java.classpath.hadoop=/opt/oracle/bigdatasql/bdcell-hadoopconf/*:/opt/cloudera/parcels/CDH/lib/hadoop/lib/*:/opt/cloudera/parcels/CDH/lib/hadoop/*:/opt/cloudera/parcels/CDH/lib/hadoop-hdfs/lib/*:/opt/cloudera/parcels/CDH/lib/hadoop-hdfs/*:/opt/cloudera/parcels/CDH/lib/hadoop-yarn/lib/*:/opt/cloudera/parcels/CDH/lib/hadoop-yarn/*:/opt/cloudera/parcels/CDH/lib/hadoop-mapreduce/lib/*:/opt/cloudera/parcels/CDH/lib/hadoop-mapreduce/*:/home/oracle/serde/json-serde-1.3.6-SNAPSHOT-jar-with-dependencies.jar

On the database side, add the same path to the same java.classpath.hadoop variable in the $ORACLE_HOME/bigdatasql/bigdata_config/bigdata.properties file. After this you need to restart Big Data SQL on the cell side and restart the extproc on the database side:

[Linux]# $GRID_HOME/bin/crsctl stop resource bds_DBINSTANCE_HADOOPCLUSTER
[Linux]# $GRID_HOME/bin/crsctl start resource bds_DBINSTANCE_HADOOPCLUSTER

Finally, check that everything works properly:

SQL> SELECT * FROM BDS.json_openx;
38352 46728 561506 1131210
38380 78937 10003 1131211
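Editing bigdata.properties by hand on every cell is error-prone, so here is a hedged sketch of how the classpath change could be pushed out with dcli instead (the sed expression is my own helper, not an Oracle-supplied tool, and it assumes dcli -C targets the same set of nodes as the copy commands above):

# Append the SerDe jar to java.classpath.hadoop on every cell node
[Linux]# dcli -C "sed -i 's|^java.classpath.hadoop=.*|&:/home/oracle/serde/json-serde-1.3.6-SNAPSHOT-jar-with-dependencies.jar|' /opt/oracle/bigdatasql/bdcell-12.1/bigdata.properties"

# Confirm the edit landed on each node (expect a count of 1 per node)
[Linux]# dcli -C "grep -c json-serde /opt/oracle/bigdatasql/bdcell-12.1/bigdata.properties"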


Big Data SQL

Big Data SQL Quick Start. Big Data SQL over complex data types in Oracle NoSQL. - Part16.

Today I'm going to publish a blog post written by Javier De La Torre Medina. Thanks to him for the great research! Everything below is his article.

Oracle Big Data SQL over complex data types in Oracle NoSQL

When working with Oracle NoSQL databases, we have the flexibility to choose complex data types like arrays, records and maps. In this example we are going to show you how to use Oracle Big Data SQL over these complex data types. Let's use an Oracle NoSQL table included with the Oracle Big Data Lite virtual machine. This table is called movie. It includes simple data types like string, integer, etc., but the last column is an array data type. On the Oracle NoSQL database, we can see the description of the table. We are going to focus only on the array:

kv-> show table -name movie
…….
{ "name" : "genres", "type" : "ARRAY", "collection" : { "name" : "RECORD_gen", "type" : "RECORD", "fields" : [ { "name" : "cid", "type" : "STRING", "nullable" : true, "default" : null }, { "name" : "id", "type" : "INTEGER", "nullable" : true, "default" : null }, { "name" : "name", "type" : "STRING", "nullable" : true, "default" : null } ] },

Let's have a look at the data:

kv-> get table -name movie

We can see the array as the last column. Let's create a Hive table on top of the Oracle NoSQL table. Now we can run a simple query to see if it works, and we can also query the array directly. Next we can create the Oracle Big Data SQL table on top. The following documentation link shows how the mapping between the different data types is done: http://docs.oracle.com/cd/NOSQL/html/examples/hadoop/hive/table/package-summary.html#ondb_hive_ora_data_model_mapping_table We will have to create the table with the VARCHAR2 data type and define that the GENRES column is an array. Here is the code to create the table. Now let's run some queries. First a simple one: query two columns to see what the data looks like. Then let's query over the GENRES column. In this case we will use the JSON_QUERY operator. This operator always returns JSON, such as an object or an array. Oracle Database 12c can work natively with JSON, so we will be able to query and select the field we want. As a final example, let's query over the name field (see the illustrative sketch below).
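Since the screenshots with the actual statements are not reproduced here, the following is only an illustrative sketch of the kind of JSON_QUERY calls described above; the Oracle-side table name movie_ext is an assumption, and it presumes the GENRES column was created as a VARCHAR2 holding the JSON array:

-- Return the whole genres array, and just the name field of each entry
SELECT JSON_QUERY(m.genres, '$') AS genres_json,
       JSON_QUERY(m.genres, '$[*].name' WITH WRAPPER) AS genre_names
FROM   movie_ext m
WHERE  ROWNUM <= 5;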


Big Data SQL

Big Data SQL Quick Start. Big Data SQL over nested tables in Oracle NoSQL. - Part15.

Today I'm going to publish a blog post written by Javier De La Torre Medina. Thanks to him for the great research! Everything below is his article.

Big Data SQL over nested tables in Oracle NoSQL

In the Oracle NoSQL database, customers can take advantage of the table model. The table model simplifies application data modeling by leveraging existing customer skills: SQL. The table model is built on top of the distributed key-value structure, inheriting all its advantages, and uses AVRO schemas, which compress very well, using less CPU and storage than JSON. Here we have an example of how to create a table in Oracle NoSQL. Oracle NoSQL Database tables can be organized in a parent/child hierarchy: we can create tables inside other tables. Here we have another example: we create the parent table, myInventory, and then we can create the child, or nested, table, itemDetails. When we create the child table, it inherits the parent table's primary key. Therefore, the itemDetails table has two primary keys: itemCategory and itemSKU. Here you have a visual representation of the nested tables.

Working with Big Data SQL. To get a better understanding of how Oracle NoSQL works with Big Data SQL, you can start by reading this blog post from Alexey: https://blogs.oracle.com/datawarehousing/entry/big_data_sql_quick_start8 With Oracle Big Data SQL, you can take advantage of predicate pushdown. You can send the query to the Oracle NoSQL database, and you will get the results very fast thanks to the key-value structure. We are going to do a demo of how this works. First of all, let's have a look at the data. Here we have a few documents for fleet:

{"vin":"023X43EKB0ON212J84F6","make":"FORD","model":"F150","year":2010,"fuelType":"G","vehicleType":"TRUCK"}
{"vin":"10W0251I02U4ILS32K25","make":"FORD","model":"F150","year":2010,"fuelType":"G","vehicleType":"TRUCK"}
{"vin":"5I4P0L132Q3518XOFVV3","make":"NISSAN","model":"PATHFINDER","year":2013,"fuelType":"G","vehicleType":"SUV"}

We also have mileage data for each car. Here you can have a look:

{"vin":"023X43EKB0ON212J84F6","currentTime":10001,"driverID":"X6712184","longitude":58.0,"latitude":75.0,"odometer":7,"fuelUsed":0.37562913,"speed":70}
{"vin":"02O1J1O1O3545Z6682NB","currentTime":10001,"driverID":"F1605891","longitude":175.0,"latitude":30.0,"odometer":8,"fuelUsed":0.3891571,"speed":80}
{"vin":"04E0ZT2V10P25D71W703","currentTime":10001,"driverID":"P3433939","longitude":105.0,"latitude":140.0,"odometer":6,"fuelUsed":0.1807928,"speed":60}

As you can see, they have the "vin" column in common. This will be used for the nested tables. Let's create the first table for fleet, defining the "vin" column as the primary key. Next, let's create the nested mileage table, and finally let's insert the data shown before. Now let's run some queries over the tables. In the first query we query the fleet table over the primary key: we get just one result. Now let's see what happens when we run the same query over the mileage table: here we can see the 1-to-N hierarchy between the nested tables. Finally, if we query the mileage table with the two primary keys, this is the result. Once we have the Oracle NoSQL tables and the data inserted and tested, let's create the Hive tables on top of the NoSQL tables. We will create the fleet Hive table over the fleet table in NoSQL, and then the mileage Hive table over the child table (see the illustrative sketch below).
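Since the Hive DDL screenshots are not reproduced above, here is a rough sketch of what a Hive table over the NoSQL fleet table can look like, using the Oracle NoSQL table storage handler; the store name (kvstore) and host:port are assumptions for a Big Data Lite style environment, so adjust them for your cluster:

hive> CREATE EXTERNAL TABLE fleet_hive (
  vin STRING,
  make STRING,
  model STRING,
  year INT,
  fuelType STRING,
  vehicleType STRING)
STORED BY 'oracle.kv.hadoop.hive.table.TableStorageHandler'
TBLPROPERTIES (
  "oracle.kv.kvstore" = "kvstore",
  "oracle.kv.hosts" = "bigdatalite.localdomain:5000",
  "oracle.kv.tableName" = "fleet");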
Once we have the Hive tables, we can create the tables for Oracle Database access: we will create Oracle Database external tables over the Hive tables. Once the tables have been created, we can query the data efficiently through Oracle Big Data SQL, as it takes advantage of the Oracle NoSQL nested tables (an illustrative sketch of such an external table follows below).
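For reference, a hedged sketch of such an external table over the fleet Hive table is shown here; it mirrors the json_openx DDL used earlier in this series, and the schema name, cluster name and column sizes are placeholders rather than the exact DDL from the article:

SQL> CREATE TABLE BDS.fleet_ext
 (vin VARCHAR2(30),
  make VARCHAR2(30),
  model VARCHAR2(30),
  year NUMBER,
  fuelType VARCHAR2(10),
  vehicleType VARCHAR2(20))
ORGANIZATION EXTERNAL
 (TYPE ORACLE_HIVE DEFAULT DIRECTORY DEFAULT_DIR
  ACCESS PARAMETERS (
   com.oracle.bigdata.cluster=bigdatalite
   com.oracle.bigdata.tablename=default.fleet_hive)
 )
PARALLEL 2 REJECT LIMIT UNLIMITED;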


Data Warehousing

SQL Pattern Matching Deep Dive - Part 5, SKIP TO where exactly?

So far in this series we have looked at how to ensure query consistency, how to use predicates correctly, how to manage sorting, how to use the built-in measures to help optimise your code, and the impact of different types of quantifiers:
SQL Pattern Matching deep dive - Part 1
SQL Pattern Matching Deep Dive - Part 2, using MATCH_NUMBER() and CLASSIFIER()
SQL Pattern Matching Deep Dive - Part 3, greedy vs. reluctant quantifiers
SQL Pattern Matching Deep Dive - Part 4, Empty matches and unmatched rows?
In this post I am going to review what MATCH_RECOGNIZE does after a match has been found, i.e. where the search begins for the next match. It might seem obvious - you start at the next record - but MATCH_RECOGNIZE provides a lot of flexibility in this specific area (as you would expect).
Basic Syntax
We use the AFTER MATCH SKIP clause to determine the precise point to resume row pattern matching after a non-empty match is found. If you don't supply an AFTER MATCH SKIP clause then the default is AFTER MATCH SKIP PAST LAST ROW. There are quite a few options available:
AFTER MATCH SKIP TO NEXT ROW - resume pattern matching at the row after the first row of the current match.
AFTER MATCH SKIP PAST LAST ROW - resume pattern matching at the next row after the last row of the current match.
AFTER MATCH SKIP TO FIRST pattern_variable - resume pattern matching at the first row that is mapped to the pattern variable.
AFTER MATCH SKIP TO LAST pattern_variable - resume pattern matching at the last row that is mapped to the pattern variable.
AFTER MATCH SKIP TO pattern_variable - the same as AFTER MATCH SKIP TO LAST pattern_variable.
Using Pattern Variables and ORA-62514
Note that you can link the restart point to a specific pattern variable, which allows you to work with overlapping patterns - i.e. where you are searching for "shapes" within your data set such as "W" shaped patterns within our ticker data stream. But what happens if the pattern variable within the SKIP TO clause is not matched? Let's look at the following example:

SELECT *
FROM Ticker MATCH_RECOGNIZE (
 PARTITION BY symbol
 ORDER BY tstamp
 MEASURES STRT.tstamp AS start_tstamp,
          LAST(UP.tstamp) AS end_tstamp,
          MATCH_NUMBER() AS match_num,
          CLASSIFIER() AS var_match
 ALL ROWS PER MATCH
 AFTER MATCH SKIP TO DOWN
 PATTERN (STRT DOWN* UP)
 DEFINE
       DOWN AS DOWN.price < PREV(DOWN.price),
       UP AS UP.price > PREV(UP.price)
) MR
WHERE symbol='ACME'
ORDER BY MR.symbol, MR.tstamp;

Here we are stating that zero or more matches of the variable DOWN must occur and that, once a match has been found, we will resume the search for the next pattern at the DOWN event. With this pattern it is possible that DOWN will never be matched, so the AFTER MATCH SKIP TO DOWN cannot happen even though a complete match for the pattern is found. Therefore, the compiler throws an error to let you know that this code will not work:

ORA-62514: AFTER MATCH SKIP TO variable is not bounded in the match found.
62514. 00000 - "AFTER MATCH SKIP TO variable is not bounded in the match found."
*Cause: AFTER MATCH SKIP TO variable was not bound in the match found due to pattern operators such as |, *, ?, and so on.
*Action: Modify the query and retry the operation

Therefore, you need to change the pattern to search for one or more instances of DOWN rather than zero or more, as this allows the DOWN event to be matched at least once and therefore makes it available for AFTER MATCH SKIP TO processing, as shown in the corrected statement below.
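For completeness, here is the corrected statement - the only change is the quantifier on DOWN (one-or-more instead of zero-or-more), which guarantees that DOWN is bound in every match and can therefore be used as the skip target:

SELECT *
FROM Ticker MATCH_RECOGNIZE (
 PARTITION BY symbol
 ORDER BY tstamp
 MEASURES STRT.tstamp AS start_tstamp,
          LAST(UP.tstamp) AS end_tstamp,
          MATCH_NUMBER() AS match_num,
          CLASSIFIER() AS var_match
 ALL ROWS PER MATCH
 AFTER MATCH SKIP TO DOWN
 PATTERN (STRT DOWN+ UP)
 DEFINE
       DOWN AS DOWN.price < PREV(DOWN.price),
       UP AS UP.price > PREV(UP.price)
) MR
WHERE symbol='ACME'
ORDER BY MR.symbol, MR.tstamp;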
Skipping PAST LAST ROW [DEFAULT]
This is the default behaviour and in many circumstances it is the most obvious choice: it makes sense to resume the search for the next pattern at the row after the last row of the current match, since going back over previous rows would only result in more rows being processed than necessary. For example, let's look at the sessionization example: http://oracle-big-data.blogspot.co.uk/2014/02/sessionization-with-12c-sql-pattern.html and if you want to try the code see the tutorial on the LiveSQL site. Looking at the source data for the sessionization example it's clear that, as we walk through the entries in the log file to check whether an entry is part of the current session or not, there is no point in stepping backwards to begin searching again once a match has been found. You can run the code for this sessionization example on LiveSQL.
Looking for shapes and controlling skipping
As I previously stated, you might think the obvious position to start searching for the next occurrence of a pattern is the next record after the last row of the current match. But what if there are overlapping patterns where the middle of an earlier match overlaps with the start of the next match? For example, if we are looking for a W-shaped pattern within our ticker data set then it is quite possible to have overlapping W-shapes where the next "W" starts within the second down phase of the previous "W". Fortunately MATCH_RECOGNIZE provides great flexibility in terms of being able to specify the restart point. If we look at the source data for the ACME symbol within our ticker data set then we can see that there are overlapping W-shapes (assuming we allow for the flat-top in the middle of the 2nd W-shape by using the <= and >= tests for each pattern variable!). Let's use this example to explore the various AFTER MATCH SKIP TO options, starting with the default behaviour:

SELECT *
FROM Ticker MATCH_RECOGNIZE (
 PARTITION BY symbol ORDER BY tstamp
 MEASURES STRT.tstamp AS start_w,
          LAST(z.tstamp) AS end_w
 ONE ROW PER MATCH
 AFTER MATCH SKIP PAST LAST ROW
 PATTERN (STRT x+ y+ w+ z+)
 DEFINE
  x AS x.price <= PREV(x.price),
  y AS y.price >= PREV(y.price),
  w AS w.price <= PREV(w.price),
  z AS z.price >= PREV(z.price)
) MR
WHERE symbol='ACME'
ORDER BY symbol, MR.start_w;

This returns only one match within the ACME data set, and if we expand the output using ALL ROWS PER MATCH, so we can see how the pattern was matched, we can see that it starts on 05-Apr-11 with pattern variable STRT and ends on 14-Apr-11 with pattern variable Z. Now let's change the above code sample so that after the first pattern has been found we resume the search at the last row mapped to the Y variable - i.e. row 6, 10-Apr-11.

SELECT *
FROM Ticker MATCH_RECOGNIZE (
 PARTITION BY symbol ORDER BY tstamp
 MEASURES STRT.tstamp AS start_w,
          LAST(z.tstamp) AS end_w
 ONE ROW PER MATCH
 AFTER MATCH SKIP TO LAST Y
 PATTERN (STRT x+ y+ w+ z+)
 DEFINE
     x AS x.price <= PREV(x.price),
     y AS y.price >= PREV(y.price),
     w AS w.price <= PREV(w.price),
     z AS z.price >= PREV(z.price)
) MR
WHERE symbol='ACME'
ORDER BY symbol, start_w;

This now finds two W-shapes, with the second W starting on 10-Apr-11 and ending on 18-Apr-11. But what is going on under the covers?
SELECT *
FROM Ticker MATCH_RECOGNIZE (
 PARTITION BY symbol ORDER BY tstamp
 MEASURES STRT.tstamp AS start_w,
          LAST(z.tstamp) AS end_w,
          classifier() AS pv,
          match_number() AS mn,
          count(*) AS row_count
 ALL ROWS PER MATCH
 AFTER MATCH SKIP TO LAST Y
 PATTERN (STRT x+ y+ w+ z+)
 DEFINE
      x AS x.price <= PREV(x.price),
      y AS y.price >= PREV(y.price),
      w AS w.price <= PREV(w.price),
      z AS z.price >= PREV(z.price)
) MR
WHERE symbol='ACME'
ORDER BY symbol, mn, tstamp;

This now shows us that the records for 10-Apr-11 to 14-Apr-11 were actually processed twice.
Skip to next row?
What about using the SKIP TO NEXT ROW syntax? How does that affect our results? It is important to remember that this forces MATCH_RECOGNIZE to resume pattern matching at the row after the first row of the current match. Using our ticker data we can see that this would actually increase the number of W-shapes to three! In match 2 we have two occurrences of pattern variable X; therefore, once the second W-shape has been matched, the search process restarts on row 12, i.e. the row after the first row of the current match (row 11, which is mapped to STRT).

SELECT *
FROM Ticker MATCH_RECOGNIZE (
 PARTITION BY symbol ORDER BY tstamp
 MEASURES STRT.tstamp AS start_w,
          LAST(z.tstamp) AS end_w
 ONE ROW PER MATCH
 AFTER MATCH SKIP TO NEXT ROW
 PATTERN (STRT x+ y+ w+ z+)
 DEFINE
      x AS x.price <= PREV(x.price),
      y AS y.price >= PREV(y.price),
      w AS w.price <= PREV(w.price),
      z AS z.price >= PREV(z.price)
 ) MR
WHERE symbol='ACME'
ORDER BY symbol, mr.start_w;

This creates the following output, and if we change our code to return the more detailed report we can see how the pattern is being matched:

SELECT *
FROM Ticker MATCH_RECOGNIZE (
 PARTITION BY symbol ORDER BY tstamp
 MEASURES STRT.tstamp AS start_w,
          LAST(z.tstamp) AS end_w,
          classifier() AS pv,
          match_number() AS mn,
          count(*) AS row_count
 ALL ROWS PER MATCH
 AFTER MATCH SKIP TO NEXT ROW
 PATTERN (STRT x+ y+ w+ z+)
 DEFINE
      x AS x.price <= PREV(x.price),
      y AS y.price >= PREV(y.price),
      w AS w.price <= PREV(w.price),
      z AS z.price >= PREV(z.price)
) MR
WHERE symbol='ACME'
ORDER BY symbol, mn, tstamp;

which produces the following output. Note that match two, the 2nd W-shape, starts on line 11 but we began the search for this second match on row 2, i.e. the row after the first STRT variable. Similarly, the search for the third W-shape began on row 12, the row after the second STRT variable. Given that our original data set for ACME only contains 20 rows, you can see from this example how it is possible to do a lot more processing when you start to fully exploit the power of the AFTER MATCH SKIP syntax.
Just accept the default?
The AFTER MATCH SKIP clause determines the point at which we will resume searching for the next match after a non-empty match has been found. The default for the clause is AFTER MATCH SKIP PAST LAST ROW: resume pattern matching at the next row after the last row of the current match. In most examples of using MATCH_RECOGNIZE you will notice that the AFTER MATCH clause is not present and the developer blindly assumes that the AFTER MATCH SKIP PAST LAST ROW clause is applied. This obviously does not help the next developer who has to amend the code to fit new business requirements. Therefore, my recommendation is that you should always clearly state where you want the matching process to start searching for the next match.
Never assume that the default behaviour will be good enough!
Summary
We are getting near the end of this series of deep dive posts. Hopefully this post has explained the ways in which you can use the AFTER MATCH SKIP… clause to ensure that you capture all of the required patterns/shapes within your data set. It's always a good idea to explicitly include this clause because it is very important - if you don't want to allow for overlapping matches then clearly state this in your code by using the AFTER MATCH SKIP PAST LAST ROW clause. Don't assume the default will kick in and that the next developer will have time to read all your detailed documentation when making the next round of changes to the code. Don't forget to try our pattern matching tutorials and scripts on LiveSQL; all of the above code examples are available via the "Skip to where?" tutorial on livesql.oracle.com.
What's next?
In the next post in this series I am going to review the keywords that control the output from MATCH_RECOGNIZE: ALL ROWS vs. ONE ROW. Feel free to contact me if you have an interesting use case for SQL pattern matching or if you just want some more information. Always happy to help. My email address is keith.laker@oracle.com.


Parameter Changes for Parallel Execution in Oracle Database 12c Release 2

As our new database release, Oracle Database 12c Release 2, is now available on the Exadata Express Cloud Service, the Exadata Cloud Service, and the Database Cloud Service, we can start talking about the new features and changes it brings. With regard to Parallel Execution, let me start with the initialization parameter changes in this new release.
Obsoleted and desupported parameters
The following parameters were deprecated a long time ago but were still present prior to Oracle Database 12.2. We have now obsoleted and removed these parameters: parallel_server, parallel_server_instances, parallel_io_cap_enabled, parallel_automatic_tuning.
Deprecated parameters
In Oracle Database 12.2 we are deprecating the Adaptive Parallelism feature, which is controlled by the parameter parallel_adaptive_multi_user. This feature adjusts statement DOPs based on the system load when the statement is submitted. If Oracle thinks the system load is high, the statement will be executed with a lower DOP than requested. In the worst case, it will even run in serial. This results in unpredictable performance for users, as the response time of a statement depends on whether it is downgraded or not. Prior to Oracle Database 12.2, the default value of this parameter was true, which meant the feature was enabled by default. Now, the default value of this parameter is false and the feature is disabled by default. To control system load and utilization we recommend using Parallel Statement Queuing and Database Resource Manager. Classifying users with different performance requirements into resource manager consumer groups and allocating parallel resources to those consumer groups based on performance requirements is a much better way of controlling system utilization and ensuring predictable performance for users. Here are my slides from Open World 2015 that talk about how Parallel Statement Queuing and Database Resource Manager work and how you can configure them. Please also check the documentation for all parameter changes in Oracle Database 12c Release 2. In the coming days, I will be posting more about Parallel Execution changes and features in the new release.
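(As a quick footnote to the queuing recommendation above, here is a hedged illustration of how you might review the parallel parameters on a 12.2 system and switch on automatic DOP, which is what activates Parallel Statement Queuing; the resource manager consumer group setup itself is a separate exercise.)

-- List the parallel parameters and see which ones are flagged as deprecated
SELECT name, value, isdeprecated
FROM   v$parameter
WHERE  name LIKE 'parallel%'
ORDER  BY name;

-- Parallel Statement Queuing kicks in when automatic DOP is enabled
ALTER SYSTEM SET parallel_degree_policy = AUTO;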


Data Warehousing in the Cloud - Part 3

In my last post I looked at Oracle's Cloud Services for data warehousing and described how they are based around engineered systems running the industry's #1 database for data warehousing, fully optimised for data warehousing workloads and providing 100% compatibility with existing workloads. Most importantly, Oracle customers can run their data warehouse services on-premise, in the Cloud or using hybrid Cloud using the same management and business tools. I also looked at how Oracle's Cloud Services for data warehousing are designed to simplify the process of integrating a data warehouse with cutting edge business processes around big data. Oracle Cloud offers a complete range of big data services to speed up the monetisation of data sets: Oracle Big Data Cloud Service, Big Data Preparation Cloud, Big Data Discovery, IoT Cloud. In this post, the last in this series, I am going to discuss Oracle's cloud architecture for supporting data warehousing projects.
Complete Architecture of Oracle's Cloud for Data Warehousing
Oracle's Cloud Services for data warehousing extend beyond the data management scenarios outlined in the previous sections. They cover a wide range of cloud services that together form a broad and complete set of enterprise-grade solutions for data warehousing in the cloud.
Oracle Storage Cloud Service
This provides an environment for managing staging files coming out of enterprise operational systems or ETL files that have been processed and need loading into the data warehouse. All files within the cloud are replicated across multiple storage nodes, which guarantees protection against hardware failure and data corruption. Enterprise-grade data protection and privacy policies are enforced to ensure that staging files remain secure at all times.
Oracle Database Backup Cloud Service
Oracle Database Backup Cloud Service is a secure, scalable, on-demand storage solution for backing up Oracle data warehouses, both on-premise and cloud-based, to the public cloud. The cloud infrastructure provides enterprise-grade performance, redundancy, and security to make sure that the data warehouse does not lose availability.
Oracle Data Preparation and Integration Services
This service provides a highly intuitive and interactive way for analysts and data scientists to prepare unstructured, semi-structured and structured data for downstream processing and analysis within the data warehouse. It aims to reduce the challenges of data processing and preparation of new data sets such as those linked to IoT and social media by providing a large set of data repair, transformation, and enrichment options that require zero coding or scripting. It significantly reduces the burden of repairing, classifying, and publishing new data sets into the data warehouse, which can be on-premise or in the Oracle Cloud.
Oracle Business Intelligence Cloud Service
A best-in-class business intelligence cloud offering that provides the full array of intuitive BI tools. It includes interactive interfaces with built-in guidance, easy search, dynamic navigation, contextual analytics and tutorials to increase productivity. It offers tight integration with the sophisticated analytical capabilities of the Oracle Database and, using Big Data SQL, is able to integrate and join structured and unstructured data sets.
Summary
When moving to the cloud it is important to remember that all the technical requirements that have driven data warehousing over the last 20+ years still apply.
A data warehouse is expected to deliver extreme query performance, scalable and robust data management, handle concurrent workloads etc. All these requirements still apply in the cloud. Oracle provides the only true enterprise-grade cloud based around a fully optimized infrastructure for data warehousing. Oracle's Cloud Services for data warehousing are based around engineered systems running the industry's #1 database for data warehousing, fully optimized for data warehousing workloads and providing 100% compatibility with existing workloads. The unique aspect of Oracle's Cloud service is the "same experience" guarantee. Customers running data warehouse services on-premise, in the Cloud or using hybrid Cloud will use the same management and business tools. Many companies recognize that their most important use cases for moving to cloud rely on integration with big data and access to sophisticated analytics: data mining, statistics, spatial and graph, pattern matching etc. These analytical features are key to consolidation projects (moving current disconnected on-premise systems to the cloud) and to delivering new projects that aim to monetize new data streams by treating them as profit centers. Oracle's Cloud Services for data warehousing are designed to simplify the process of integrating a data warehouse with cutting edge business processes around big data. A complete range of big data services are available to speed up the monetization of data sets: Oracle Big Data Cloud Service, Big Data Preparation Cloud, Big Data Discovery, IoT Cloud. Overall this provides a unique, complete cloud offering: an end-to-end solution for data warehousing covering data integration, big data, database, analytics and business intelligence. Of course, all of Oracle's cloud services can be delivered as services on-premise and in the Oracle Public Cloud. The choice is yours. For more information about Oracle's Cloud Services visit cloud.oracle.com. Feel free to contact me (keith.laker@oracle.com) if you have any questions about Oracle's Cloud Services for data warehousing.


Data Warehousing in the Cloud - Part 2

In the last blog post (Data Warehousing in the Cloud - Part 1) I examined why you need to start thinking about and planning your move to the cloud: looking forward, data warehousing in the cloud is seen as having the greatest potential for driving significant business impact through increased agility, better cost control and faster data integration via co-location. In the last section I outlined the top 3 key benefits of moving your data warehouse to the Oracle cloud: it provides an opportunity to consolidate and rationalise your data warehouse environment, it opens up new opportunities to monetise the content within your warehouse, and new data security requirements mean that IT teams need to start implementing robust data security systems alongside comprehensive audit reporting. In this post I am going to review Oracle's cloud solutions for data warehousing, how Oracle's key technologies enable Data Warehousing in the cloud and why Oracle's Cloud runs Oracle better than any other cloud environment.
Oracle Database enabling technologies supporting the cloud
Many of the leading analysts have recognized that more and more organizations are moving to the cloud as a fast and efficient way of deploying data warehouse environments. However, they all point out that, although clouds in general are very appealing in terms of flexibility and agility of deployment and pricing, clouds must deliver support for hybrid on-premises-and-cloud solutions. Oracle's leadership in data warehousing and vision for the cloud is unique in being able to support this must-have hybrid model. Oracle's dominant position in the data warehouse market, through its delivery of engineered hardware to support data warehousing and end-to-end data integration services, is recognized by leading analysts as giving it a significant competitive advantage. Each release of the Oracle Database continues to add innovative new solutions for data warehousing. Oracle has taken these industry-leading data warehouse innovations and made them available in the Cloud. Key technology areas include:
Multitenant - New architecture for DW consolidation and moving to cloud
Oracle's multitenant feature is the cornerstone for both consolidation and the transition to the cloud. Multitenant makes it quick and easy to consolidate multiple data marts into a single system using its pluggable database capabilities. This unique feature also enables seamless movement of a database, data mart and data warehouse from an on-premise environment to the cloud and, if needed, even back again to on-premise (a minimal sketch of this unplug/plug flow is shown at the end of this section).
In-Memory - Immediate answers to any question with real-time analytics
The Oracle In-Memory option stores data in a highly optimised columnar format that is able to support the types of analytical queries that characterize data warehouse workloads. Oracle's Cloud Services offer configurations that maximize the analytical potential of this feature, making it easier for business teams to gain access to significantly faster analytical processing.
Analytical SQL - New SQL innovations for data warehousing and big data
Oracle, as a foundation for data warehousing, has always innovated in the area of SQL, both as an analytical language and in ways to broaden the scope of SQL to manage more types of data, such as JSON documents. These innovations allow SQL to do more types of operations, such as pattern matching and delivering approximate results, that directly support the new types of projects being launched in the cloud.
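(To make the Multitenant point above concrete, here is a minimal sketch of the unplug/plug flow; the PDB name and file paths are placeholders, and when moving between environments the datafiles also need to be made available to the target container database.)

-- In the source container database: close and unplug the data mart PDB
ALTER PLUGGABLE DATABASE dwmart CLOSE IMMEDIATE;
ALTER PLUGGABLE DATABASE dwmart UNPLUG INTO '/u01/app/oracle/dwmart.xml';

-- In the target container database: plug it in and open it
CREATE PLUGGABLE DATABASE dwmart USING '/u01/app/oracle/dwmart.xml' NOCOPY TEMPFILE REUSE;
ALTER PLUGGABLE DATABASE dwmart OPEN;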
Why Oracle Runs Oracle Better in the Cloud
There are several key things that Oracle is doing to ensure that the Oracle Database runs better in the Oracle Cloud. Firstly, Oracle is providing integrated and optimized hardware, from disk-to-flash-to-memory, as part of its cloud infrastructure. Customers can now get the same extreme performance capabilities, along with the fully scalable storage and compute resources of the Exadata platform, combined with the ease of use, scaling up and down in a few clicks, and cost effectiveness of Oracle's cloud infrastructure. No other cloud vendor offers this level of optimization and integration between hardware and software. Secondly, running the Oracle Database in the Oracle Cloud is exactly the same experience as running on-premise, so existing on-premise workloads are completely compatible with Oracle Cloud. Oracle provides easy-to-use tools and services to move data into the Cloud and out of the Cloud back to on-premise systems. Lastly, no cloud solution should, or even can, act as a data silo - enterprise systems most definitely cannot function as data silos. Every organization has different applications. From an operational perspective this covers ERP, CRM and OLTP systems. From a data warehouse perspective it includes data preparation, data integration and business intelligence. Oracle provides all these solutions within its Cloud Services. This means it is possible to put the data warehouse in the cloud alongside the source systems that push data into the warehouse and the tools that analyze and visualize that data.
Oracle Cloud Solutions
This section provides an overview of the various Oracle Cloud Services that support data warehousing and the use cases for each service. In general terms, Oracle Cloud provides a broad spectrum of data management offerings that can be offered both on-premise and in the Oracle Public Cloud:
LiveSQL - Free Service
The free cloud service is a great place to quickly, easily and safely road-test new database features. Entry into this service is via Oracle Live SQL, which provides a simple way to test and share SQL and PL/SQL application development concepts. LiveSQL offers:
Browser based SQL worksheet access to an Oracle database schema
Ability to save and share SQL scripts
Schema browser to view and extend database objects
Interactive educational tutorials
Customized data access examples for PL/SQL, Java, PHP, C
All that is needed to access this service is an Oracle Technology Network account - which itself is a free online service.
Exadata Express Cloud Service
Most organizations focus 80% of their valuable resources on creating on-premise development and test environments. This is because setting up both software and hardware, even for these types of systems, is time consuming. There are all sorts of procedures that need to be followed to buy licenses, configure servers, connect up networks, configure DB tools etc. However, it's very rare that these systems match the eventual production environment of the data warehouse and this creates significant challenges around testing and QA processes. In general, the majority of IT teams want to develop and test in a cloud environment where scaling up for testing and then scaling back once the tests and QA procedures are complete is simply a few clicks.
Use Cases for Exadata Express Cloud Service
Exadata Express Cloud Service is ideally suited to supporting development, test, small-scale marts and data discovery sandboxes up to 50GB in size.
From a development and test perspective it provides the option, once a project is production-ready, to move the data warehouse schema back on-premise if regulatory requirements mandate the relocation of data within a specific geographic boundary. Using Oracle Multitenant it is a simple process to unplug from the Cloud and then plug in and run on-premise.
Database-as-a-Service
This delivers all the same functionality and power of Oracle Database 12c running on-premise. Oracle's Database-as-a-Service configurations provide access to large numbers of both CPUs and memory to match workload requirements and take full advantage of advanced database features such as Oracle In-Memory and in-database advanced analytical capabilities (advanced data mining, spatial graph and multidimensional analytics). All the typical cloud attributes are part of this service: the ability to quickly and easily create new databases, manage databases from a single cloud console, and tools to migrate existing data from on-premise into Database 12c running in the cloud.
Use Cases for Database-as-a-Service
Oracle Database Cloud Service is designed to be a production-ready cloud environment that supports medium-sized deployments beyond the size limits of the Exadata Express Cloud Service. This makes it ideal for managing larger-scale enterprise-level development systems as well as departmental marts and warehouses. It supports situations where data scientists need large-scale sandboxes to support data mining projects that require access to large amounts of historical data integrated with raw data streams from IoT devices and other big data related subject areas. The plug-and-play features of Multitenant combined with Big Data SQL make it possible to minimize data movement when deploying these types of sandboxes. The scale-up and scale-down capabilities of the cloud make it very easy to deploy production-realistic quality assurance environments for final testing and end-user acceptance operations.
Exadata Cloud Service
This is Oracle's flagship solution for data warehousing. It offers the highest levels of performance and scalability and is optimized for data warehouse workloads. It delivers fully integrated and preconfigured software and infrastructure providing extreme performance for all types of data warehouse workloads. The Exadata Cloud Service bundles the comprehensive and rich set of data management and analytics features of Oracle Database 12c as standard, such as: Partitioning, Multitenant, advanced compression features, advanced security and the complete range of Enterprise Manager packs; In-Memory, Advanced Analytics (Data Mining and R Enterprise), Spatial and Graph, and OLAP.
Use Cases for Exadata Cloud Service
Oracle's comprehensive Exadata Cloud Service is designed to support enterprise-level, multi-petabyte data warehouse deployments where these environments (typically based around warehouses linked to multiple marts and large-scale plug-and-play sandboxes) have high levels of concurrency along with a wide variety of workloads. The extreme performance characteristics of Exadata Cloud Service make it the perfect consolidation environment for running multiple departmental marts, warehouses and data discovery sandboxes all within a single robust cloud environment.
The Exadata Cloud Service is the foundation for a complete data warehouse solution due to its tight integration with Oracle's other cloud services such as Compute Cloud Service (for managing 3rd party tools), Big Data Preparation Services, Big Data Cloud Service and Backup and Recovery Cloud Service.
Big Data Cloud Service
Just like the Exadata Cloud Service, the Big Data Cloud Service is based on Oracle's engineered system for Big Data, which delivers fully integrated and preconfigured software and infrastructure providing extreme performance and scalability. Today's enterprise data warehouse extends beyond the management of just structured, relational data to encompass new data streams from areas such as the Internet of Things and concepts such as data reservoirs based on unstructured and semi-structured data sets. In many cases the data reservoir is hosted on a Hadoop or big data platform. Oracle's Big Data Cloud Service is the ideal service to support these complementary services that are being built around the enterprise data warehouse. The co-location of big data reservoirs in the cloud alongside the Exadata Cloud Service opens up the ability to run SQL over both structured and unstructured data by using Oracle Big Data SQL, minimizing data movement and reducing the time taken to monetize new data streams.
Use Cases for Big Data Cloud Service
Oracle Big Data Cloud Service is the platform of choice for data-reservoir projects and IoT projects because the service leverages Cloudera's industry-leading distribution of Apache Hadoop. Many customers are using their data reservoirs as part of a wider information lifecycle management framework where historical, "cold" data is pushed from the enterprise data warehouse to the data reservoir, which then sits alongside other operational information stored using NoSQL technologies, such as Oracle NoSQL Database. Using Big Data SQL, data scientists and business users can easily incorporate data from all these different data management engines into their data models and analysis. This opens up the opportunity to extend existing analysis by incorporating broader data sets and to explore new areas of opportunity using newly acquired data sets. For data scientists, the Big Data Cloud Service combined with the Exadata Cloud Service offers fast deployment and teardown of sandbox environments. These data discovery sandboxes provide access to sophisticated analytical tools such as Oracle R Enterprise and Oracle Big Data Spatial and Graph analytics. These tools include extensive libraries of built-in functions that speed up the process of discovering relationships and making recommendations using big data.
Compute Cloud Service
On-premise data warehouse systems always rely on supporting systems to deliver data into the warehouse, visualize data via executive dashboards or analyze data using specialized processing engines. An enterprise cloud environment needs to incorporate these solutions as well. Oracle supports this key requirement using its Compute Cloud Service, which allows customers to install and manage any non-Oracle software tools and components. As with Oracle's various data warehouse cloud services, this means that 3rd party products can benefit from the advantages of co-location, such as lower latency, by running alongside the data warehouse in the same data center.
Summary
Oracle's Cloud Services for data warehousing are based around engineered systems running the industry's #1 database for data warehousing, fully optimized for data warehousing workloads and providing 100% compatibility with existing workloads. The unique aspect of Oracle's Cloud service is the "same experience" guarantee. Customers running data warehouse services on-premise, in the Cloud or using hybrid Cloud will use the same management and business tools. Oracle's Cloud Services for data warehousing are designed to simplify the process of integrating a data warehouse with cutting edge business processes around big data. A complete range of big data services are available to speed up the monetization of data sets: Oracle Big Data Cloud Service, Big Data Preparation Cloud, Big Data Discovery, IoT Cloud. In the next post I will discuss Oracle's cloud architecture for supporting data warehousing projects. Feel free to contact me (keith.laker@oracle.com) if you have any questions about Oracle's Cloud Services for data warehousing.


Data Warehousing in the Cloud - Part 1

Why is cloud so important? Data warehouses are currently going through two very significant transformations that have the potential to drive significant levels of business innovation. The first area of transformation is the drive to increase overall agility. The vast majority of IT teams are experiencing a rapid increase in demand for data. Business teams want access to more and more historical data whilst at the same time, data scientists and business analysts are exploring ways to introduce new data streams into the warehouse to enrich existing analysis as well as drive new areas of analysis. This rapid expansion in data volumes and sources means that IT teams need to invest more time and effort ensuring that query performance remains consistent and they need to provision more and more environments (data sandboxes) for individual teams so that they can validate the business value of new data sets. The second area of transformation is around the need to improve the control of costs. There is a growing need to do more with fewer and fewer resources whilst ensuring that all sensitive and strategic data is fully secured, throughout the whole lifecycle, in the most cost efficient manner. Cloud is proving to be the key enabler. It allows organizations to actively meet the challenges presented by the two key transformations of expanding data volumes and increased focus on cost control. In this series of blog posts I hope to explain why and how moving your data warehouse to the cloud can support and drive these two key transformations as well as explaining the benefits that it brings for DBAs, data scientists and business users. Hopefully, this information will be useful for enterprise architects, project managers, consultants and DBAs. Over the coming weeks I will cover the following topics:
Why is cloud so important for data warehousing
Top 3 use cases for moving your data warehouse to the cloud
Oracle's cloud solutions for data warehousing
Why Oracle Cloud? Why Oracle's Cloud runs Oracle better
A review of Oracle's complete architecture for supporting data warehousing in the cloud
In this first post I will cover points 1 and 2 and subsequent posts will then cover the other topics. So here we go with part 1…
Why move to the cloud now?
A recent report by KPMG (Insights & Predictions On Disruptive Tech From KPMG's 2015 Global Innovation Survey: http://softwarestrategiesblog.com/tag/idc-saas-forecasts/) looked at all the various technologies that are likely to have the greatest impact on business transformation over the coming years. KPMG talked to over 800 C-level business leaders around the world from a very broad range of businesses including tech industry startups, mid to large-scale organizations, angel investors and venture capital firms. One of the key objectives of the survey was to identify disruptive technologies and innovation opportunities. Looking forward, the top 3 technologies that will have the biggest impact on business transformation are: cloud, data and analytics, and the Internet of Things. All three of these technologies are key parts of today's data warehouse ecosystem. Therefore, it is possible to draw the conclusion that technology leaders view data warehousing in the cloud as having the greatest potential for driving significant business impact. The importance of cloud for data warehousing to Oracle customers is directly linked to three key drivers: increased agility, better cost control and co-location.
Improving agility
Many data warehouses are now embarking on a refresh phase.
With much of the groundwork for working with big data now in place, businesses are looking to leverage new data streams and new, richer types of analytics to support and drive new project areas such as: Customer-360, predictive analytics, fraud detection, IoT analytics and establishing data as a profit center. Many of these projects require provisioning of new hardware environments and deployment of software. It is faster, easier and more efficient to kick-start these new data-centric projects using Oracle's comprehensive Cloud Services.
Delivering better cost control
Many IT teams are looking for ways to consolidate existing Oracle marts, each running on dedicated hardware, and legacy non-Oracle marts, running on proprietary hardware, into a single integrated environment. The delivery of Oracle's enterprise-grade cloud services provides the perfect opportunity to start these types of projects and Oracle has cloud-enabled migration tools to support these projects. Compliance cannot be seen as an optional extra when planning a move to the cloud. Data assets need to be secured across their whole lifecycle. Oracle's enterprise-grade cloud services make compliance easier to manage and more cost efficient because all security features can be enabled by default and transparently upgraded and enhanced.
Co-Location for faster loading
The vast majority of Oracle E-Business customers have already begun the process of moving their applications to Oracle's Cloud Services. Most data warehouses source data directly from these key applications such as order entry, sales, finance and manufacturing etc. Therefore, it makes perfect sense to co-locate the data warehouse alongside source systems that are already running in the cloud. Co-location offers faster data loading, which means that business users get more timely access to their data.
Key benefits of moving to the Oracle Cloud
There are typically three main benefits for moving the data warehouse to Oracle Cloud and these are directly linked to the three key drivers listed in the previous section:
Easier CONSOLIDATION and RATIONALIZATION
Faster MONETIZATION of data in the Cloud
Cloud offers better PROTECTION
Let's explore each of these use cases in turn:
1) Easier consolidation and rationalization
"All enterprise data will be stored virtually in the cloud. More data is in the cloud now than in traditional storage systems" - Oracle CEO Mark Hurd
For a whole variety of reasons many data warehouse environments are made up of a number of databases that cover corporate data sets, departmental data sets, data discovery sandboxes, spread-marts etc. Each one of these usually runs on dedicated hardware that requires on-going investment and dedicated DBA services. This means that a significant proportion of IT costs is allocated to just keeping the lights on rather than helping drive new levels of innovation. Oracle customers are seeing the opportunities provided by the Oracle Multitenant feature of Database 12c and the availability of Oracle's Cloud Services for data warehousing as an ideal opportunity to consolidate systems into a single cloud environment. At the same time these systems benefit from co-location: being next to their key source systems, which are already running in the cloud. By moving to the Oracle Cloud it is possible to both improve ETL performance and reduce IT costs. With growing interest in big data there is an on-going need to rapidly provision new sandboxes for data discovery projects.
The provisioning process is much simpler and faster using Oracle's Cloud Services, allowing business teams to start their discovery work a lot quicker. As new data warehouse projects are formulated, the deployment of development, test and training environments within the cloud provides the ability to free up costly on-premise resources and make them available to other projects. The cloud provides an excellent opportunity to convert and transform outdated legacy marts by leveraging the sophisticated analytical features of Oracle's industry-leading database. Oracle has a complete set of cloud-ready migration tools to convert and move schemas and data from all major database vendors to the Oracle Cloud.
2) Faster Monetization
IoT - the next wave of data explosion. By 2020 there will be 32 billion connected devices, generating 4.4ZB (cars, infrastructure, appliances, wearable technology).
There is a growing need to monetize data, i.e. treat it as a valuable asset and convert it into a profit center. Moving to the cloud opens up a wide range of opportunities by providing an agile and simple way of implementing new-style hybrid transactional/analytical requirements. It is a quick, efficient way to onboard new data streams such as IoT, external data sets from social media sources, 3rd party data sources etc. to enrich existing data sets, making it possible to explore new business opportunities. The fast deployment and tear-down capabilities of the cloud offer an effective way to keep both deployment and development costs down to support the new style of "fail-fast" data discovery projects. The success of these data monetization projects largely depends on being able to integrate both existing traditional data sources, such as those from transactional operations, and the newer big data/IoT data streams. This integration is not only about matching specific data keys but also the ability to use a single, common, industry-standards-driven query language such as SQL over all the data. Oracle's Data Warehouse Cloud Services uniquely offer tight integration between relational and big data via features such as Big Data SQL.
3) Better Protection
"Oracle's enterprise cloud will be the most secure IT environment. We are fully patched, fully secured, fully encrypted—that's our cloud. . ." - Oracle CEO Mark Hurd
Treating data as a profit center naturally requires IT teams to consider the concept of data protection and data availability. Outside of the cloud, security profiles have to be replicated and operating system patch levels need to be kept in sync along with database patchsets. Trying to enforce common security and compliance rules across multiple standalone systems that share data is a time-consuming and costly process. The process of consolidating data sets by leveraging the multitenant features of Oracle Database 12c and moving schemas to Oracle Cloud Services gives DBAs a simple and efficient way to secure data across the whole lifecycle. Within Oracle's Cloud Services all data is encrypted by default: in transit, in the Cloud and at rest. Backups with encryption are enforced, which ensures that cloud data remains secure at all times. Overall security is managed using Oracle Key Vault, secure SSH access, federated identity and an isolated cloud management network. Compliance and reporting is managed through the use of comprehensive audit trails. Below the database level the Oracle Cloud Service is fully managed by Oracle, which means it is always fully patched and therefore fully secured.
This allows DBAs and business teams to focus on business innovation without having to invest considerable time and resources securing, encrypting and patching environments to ensure they are in line with compliance and regulatory requirements.
Summary
In this blog post I have examined why you need to start thinking about and planning your move to the cloud: looking forward, data warehousing in the cloud is seen as having the greatest potential for driving significant business impact through increased agility, better cost control and faster data integration via co-location. I have outlined the top 3 key benefits of moving your data warehouse to the Oracle cloud: it provides an opportunity to consolidate and rationalise your data warehouse environment, it opens up new opportunities to monetise the content within your warehouse, and new data security requirements mean that IT teams need to start implementing robust data security systems alongside comprehensive audit reporting. In the next post I will discuss Oracle's cloud solutions for data warehousing, how Oracle's key technologies enable Data Warehousing in the cloud and why Oracle's Cloud runs Oracle better than any other cloud environment. Feel free to contact me (keith.laker@oracle.com) if you have any questions about Oracle's Cloud Services for data warehousing.


Data Warehousing

SQL Pattern Matching Deep Dive - Part 4, Empty matches and unmatched rows?

I have been asked a number of times, during and after presenting on this topic (SQL Pattern Matching Deep Dive), what the difference is between the various matching options such as EMPTY MATCHES and UNMATCHED ROWS. This is the area that I am going to cover in this particular blog post, which is No 4 in this deep dive series. When determining the type of output you want MATCH_RECOGNIZE to return, most developers will opt for one of the following: ONE ROW PER MATCH - each match produces one summary row. This is the default. ALL ROWS PER MATCH - a match spanning multiple rows will produce one output row for each row in the match. The default behaviour for MATCH_RECOGNIZE is to return one summary row for each match. In the majority of use cases this is probably the ideal solution. However, there are also many use cases that require more detailed information to be returned. If you are debugging your MATCH_RECOGNIZE statement then a little more information can help show how the pattern is being matched to your data set. In some cases you may find it useful, or even necessary, to use the extended syntax of the ALL ROWS PER MATCH keywords. There are three sub-options: ALL ROWS PER MATCH SHOW EMPTY MATCHES <- note that this is the default ALL ROWS PER MATCH OMIT EMPTY MATCHES ALL ROWS PER MATCH WITH UNMATCHED ROWS Let’s look at these sub-options in more detail, but first a quick point of reference: all the examples shown below use the default AFTER MATCH SKIP PAST LAST ROW syntax. More on this later…  TICKER DATA Here is part of the ticker data set that we are going to use in this example - if you want to take a look at the full data set then see the example on the LiveSQL site: Empty matches An obvious first question is: what’s the difference between an “empty match” and an “unmatched row”? This is largely determined by the type of quantifier used as part of the pattern definition. By changing the quantifier it is possible to produce similar results using both sets of keywords. To help explore the subtleties of these keywords I have simplified the pattern to just look for price decreases; note that we are using the * quantifier to indicate that we are looking for zero or more matches of the DOWN pattern. Therefore, if we run the following code:

SELECT symbol, tstamp, price, start_tstamp, end_tstamp, match_num, classifier
FROM ticker MATCH_RECOGNIZE (
  PARTITION BY symbol ORDER BY tstamp
  MEASURES FIRST(down.tstamp) AS start_tstamp,
           LAST(down.tstamp)  AS end_tstamp,
           match_number()     AS match_num,
           classifier()       AS classifier
  ALL ROWS PER MATCH SHOW EMPTY MATCHES
  PATTERN (DOWN*)
  DEFINE DOWN AS (price <= PREV(price)))
WHERE symbol = 'GLOBEX';

We get the following output: You can see that the result set contains all 20 rows that make up the data for my symbol “GLOBEX". Rows 1-3, 9, and 13-15 are identified as empty matches - the classifier returns null. These rows appear because we have defined the search requirements for pattern DOWN as zero or more occurrences. Based on this we can state that an empty match is a row that does not map explicitly to a pattern variable (in this case DOWN). However, it is worth noting that an empty match does in fact have a starting row, and it is assigned a sequential match number based on the ordinal position of its starting row. The above situation is largely the result of the specific quantifier that we are using: * (asterisk). 
Given that the DOWN variable can be matched zero or more times, there is the opportunity for an empty match to occur. As the complexity of the PATTERN increases, adding more variables and using different combinations of quantifiers, the probability of getting empty matches decreases, but it is something that you need to consider. Why? Because the MATCH_NUMBER() function counts the empty matches and assigns a number to them - as you can see above. Therefore, if we omit the empty matches from the results, the MATCH_NUMBER() column no longer contains a contiguous set of numbers. So if we run the following code, where we have specified “OMIT EMPTY MATCHES”:

SELECT symbol, tstamp, price, start_tstamp, end_tstamp, match_num, classifier
FROM ticker MATCH_RECOGNIZE (
  PARTITION BY symbol ORDER BY tstamp
  MEASURES FIRST(down.tstamp) AS start_tstamp,
           LAST(down.tstamp)  AS end_tstamp,
           match_number()     AS match_num,
           classifier()       AS classifier
  ALL ROWS PER MATCH OMIT EMPTY MATCHES
  PATTERN (DOWN*)
  DEFINE DOWN AS (price <= PREV(price)))
WHERE symbol = 'GLOBEX';

We get the following output: as you can see, the MATCH_NUMBER() column starts with match number 4, followed by match 6, followed by match 10. Therefore, you need to be very careful if you decide to test for a specific match number within the MATCH_RECOGNIZE section and/or the result set, because you might get caught out if you are expecting a contiguous series of numbers. Summary of EMPTY MATCHES Some patterns permit empty matches, such as those using the asterisk quantifier, as shown above. Three main points to remember when your pattern permits this type of matching: The value of MATCH_NUMBER() is the sequential match number of the empty match. Any COUNT is 0. Any other aggregate, row pattern navigation operation, or ordinary row pattern column reference is null. The default is always to return empty matches; therefore, it is always a good idea to determine from the start whether your pattern is capable of returning an empty match and how you want to manage those rows: include them (SHOW EMPTY MATCHES) or exclude them (OMIT EMPTY MATCHES). Be careful if you are using MATCH_NUMBER() within the DEFINE section as part of a formula, because empty matches increment the MATCH_NUMBER() counter. Reporting unmatched rows? It is always useful to view the complete result set - at least when you are running your code against test data sets. Getting all the input rows into your output is relatively easy because you just need to include the phrase ALL ROWS PER MATCH WITH UNMATCHED ROWS (a short sketch of this variant appears at the end of this post). Other than for testing purposes I can’t think of a good use case for using this in production, so make sure you check your code before you submit your production-ready code to your DBA. What about skipping? Note that if ALL ROWS PER MATCH WITH UNMATCHED ROWS is used with the default skipping behaviour (AFTER MATCH SKIP PAST LAST ROW), then there is exactly one row in the output for every row in the input. This statement leads us nicely into the next topic in this deep dive series, where I will explore SKIPPING. Taking a quick peek into that next topic… obviously there are many different types of skipping behaviours that are permitted when using WITH UNMATCHED ROWS. It does, in fact, become possible for a row to be mapped by more than one match and appear in the row pattern output table multiple times. Unmatched rows will appear in the output only once. Can a query contain all three types of match? 
Now the big question: can I have a query where it is possible to have both UNMATCHED ROWS and EMPTY MATCHES? Short answer: yes. When the PATTERN clause allows empty matches, nothing in the DEFINE clause can stop the empty matches from happening. However, there are special PATTERN symbols called anchors. Anchors work in terms of positions rather than rows. They match a position either at the start or the end of a partition or, if used together, across the whole partition: ^ matches the position before the first row in the partition $ matches the position after the last row in the partition Therefore, using these symbols it is possible to create a PATTERN where the keywords SHOW EMPTY MATCHES, OMIT EMPTY MATCHES, and WITH UNMATCHED ROWS all produce different results from the same result set. For example, let’s start with the following:

SELECT symbol, tstamp, price, mnm, nmr, cls
FROM ticker MATCH_RECOGNIZE(
  PARTITION BY symbol ORDER BY tstamp
  MEASURES match_number() AS mnm,
           count(*)       AS nmr,
           classifier()   AS cls
  ALL ROWS PER MATCH SHOW EMPTY MATCHES
  PATTERN ((^A*)|A+)
  DEFINE A AS price > 11)
WHERE symbol = 'GLOBEX'
ORDER BY 1, 2;

This returns the following 5 rows: it shows row 1 as an empty match for the pattern A*, because we are matching from the start of the partition. This sets the MATCH_NUMBER() counter to 1. After the empty match the state moves to the pattern A+ for the remainder of the rows. The first match for this pattern starts at row 2 and completes at row 4. The final match in our data set is found at the row containing 15-APR-11. Therefore, if we omit the empty match at row 1 we only get 4 rows returned, as shown here:

SELECT symbol, tstamp, price, mnm, nmr, cls
FROM ticker MATCH_RECOGNIZE(
  PARTITION BY symbol ORDER BY tstamp
  MEASURES match_number() AS mnm,
           count(*)       AS nmr,
           classifier()   AS cls
  ALL ROWS PER MATCH OMIT EMPTY MATCHES
  PATTERN ((^A*)|A+)
  DEFINE A AS price > 11)
WHERE symbol = 'GLOBEX'
ORDER BY 1, 2;

This returns the following 4 rows. Now, if we use the last variation of this example, the MATCH_RECOGNIZE statement returns all the rows from the input data. The actual “unmatched rows” are identified as having a NULL match number and NULL classifier. The “empty matches” are identified as having a NULL classifier, and in this example the COUNT(*) function returns zero.

SELECT symbol, tstamp, price, mnm, nmr, cls
FROM ticker MATCH_RECOGNIZE(
  PARTITION BY symbol ORDER BY tstamp
  MEASURES match_number() AS mnm,
           count(*)       AS nmr,
           classifier()   AS cls
  ALL ROWS PER MATCH WITH UNMATCHED ROWS
  PATTERN ((^A*)|A+)
  DEFINE A AS price > 11)
WHERE symbol = 'GLOBEX'
ORDER BY 1, 2;

This returns all 20 rows from our data set. LiveSQL I have taken all the code and the associated explanations and created a tutorial on LiveSQL so you can try out the code for yourself: https://livesql.oracle.com/apex/livesql/file/tutorial_DZO3CVNYA7IYFU1V8H0PWHPYN.html. Summary I hope this helps to explain how the various output keywords that are part of the ALL ROWS PER MATCH syntax can affect the results you get back. You should now understand why your results contain match_number values that are not contiguous and why classifier, along with certain aggregate functions, can return a NULL value. I expect the hardest concept to understand is the idea of empty matches. As I stated earlier, it is always a good idea to determine from the start whether your pattern is capable of returning an empty match: are you using an asterisk * within the PATTERN clause? 
Then you can determine how you want to manage those rows: include the empty matches (SHOW EMPTY MATCHES) or exclude them (OMIT EMPTY MATCHES). Be careful if you are using MATCH_NUMBER() within the DEFINE section as part of a formula, because empty matches increment the MATCH_NUMBER() counter. What should be immediately obvious is that in all the examples I have used the default skip behaviour: AFTER MATCH SKIP PAST LAST ROW. In the next post I will explore the various skip keywords and how they can impact the results returned by your MATCH_RECOGNIZE statement. What’s next? In the next post in this series I am going to review the keywords that control where we restart searching once a pattern has been found: the SKIP TO keywords. Feel free to contact me if you have an interesting use case for SQL pattern matching or if you just want some more information. Always happy to help. My email address is keith.laker@oracle.com. Looking for more information? Use the tag search to see more information about pattern matching, SQL Analytics or Database 12c.
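As a footnote, and to tie back to the earlier discussion of ALL ROWS PER MATCH WITH UNMATCHED ROWS: the sketch below is my own, not an example from the post above. If you change the quantifier on the simplified price-decrease pattern from * to + then empty matches become impossible, so any NULL-classifier rows returned by WITH UNMATCHED ROWS really are unmatched rows (assuming the same ticker table used throughout this post):

SQL> SELECT symbol, tstamp, price, match_num, classifier
     FROM ticker MATCH_RECOGNIZE (
       PARTITION BY symbol ORDER BY tstamp
       MEASURES match_number() AS match_num,
                classifier()   AS classifier
       ALL ROWS PER MATCH WITH UNMATCHED ROWS
       PATTERN (DOWN+)
       DEFINE DOWN AS (price <= PREV(price)))
     WHERE symbol = 'GLOBEX';

With the default AFTER MATCH SKIP PAST LAST ROW behaviour every input row appears exactly once: rows that map to DOWN carry a classifier and match number, while the remaining rows come back with both columns set to NULL.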


Big Data SQL

Big Data SQL Quick Start. Big Data SQL and YARN on the same cluster. - Part14.

Today I'm going to explain how to run Big Data SQL and YARN as tenants of the same cluster. I think it's a quite common scenario: you may want to store historical data and query it with Big Data SQL, while also running ETL jobs on the same cluster. If so, resource management becomes one of the main requirements, because you have to guarantee a certain level of performance regardless of the other jobs. For example: 1) You may need to finish your ETL as fast as possible; in this case MapReduce, which runs on YARN, gets the higher priority. 2) You build critical reports with Big Data SQL; in this case Big Data SQL has to have a higher priority than YARN. Life without a resource manager. Let's start from the beginning. I have MapReduce (YARN) jobs and Big Data SQL queries running on the same cluster. This works perfectly well unless you exceed your CPU or IO limits. Let me give you an example. I picked a small data set to query (my goal was not to exceed the CPU limit): 1) I ran the MapReduce job (a Hive query) and it finished in 165 seconds. 2) I ran the Big Data SQL query and it finished in 30 seconds. 3) I ran Big Data SQL together with Hive: BDS finished in 31 seconds, Hive in 170 seconds. Almost the same results!  But as soon as you run queries that hit the CPU limit, your two engines (Big Data SQL and YARN) start to compete for CPU. A resource manager will not increase your CPU capacity, but it lets you define how resources are shared between the two. How to enable resource sharing between YARN and Big Data SQL. Cloudera has a very powerful mechanism to share resources - the "Static Service Pool". Under the hood it uses Linux cgroups and defines the proportion of CPU and IO resources between processes. The easiest way to enable it is to use Cloudera Manager: 1) Go to "Cluster -> Static Service Pool". 2) Go to the configuration. 3) Enable Cgroup Management and use: Cgroup CPU Shares and Cgroup IO Weight. It's interesting that the Linux CPU share may vary between 2 and 262144 while the IO weight varies between 100 and 1000. I recommend changing these two settings synchronously (in other words, use values between 100 and 1000 for both). After restarting the corresponding processes you will have resource management enabled. Trust, but verify. That's all theory, and every theory has to be proven with concrete examples, so I played a bit with Static Service Pools in the context of running Big Data SQL and a Hive query (read: YARN) on the same cluster. For benchmarking I picked the simplest query, which uses neither Storage Indexes nor Predicate Push Down and returns exactly 0 rows: SQL> SELECT * FROM store_sales_csv WHERE MOD(ss_ticket_number,10)=20; In the case of Hive the query is very similar: hive> SELECT * FROM csv.store_sales WHERE PMOD(ss_ticket_number,10)=20; First of all, I ran the Big Data SQL query and the Hive query sequentially. Below you can find the CPU and IO profile for Big Data SQL and Hive: the Hive query was done in 890.628 seconds, BDS in 391 seconds. I started my test by running the two queries simultaneously without any resource management: Big Data SQL (BDS) took 731 seconds and Hive finished in 1434.75 seconds. After this, I enabled cgroup resource management (via the Static Service Pool) and ran the Hive and Big Data SQL queries simultaneously. In my tests I only played with the CPU shares, which indirectly controls IO as well. 
The results are summarised in the table below:

CPU shares configuration (BDS/Hive)   Big Data SQL, elapsed time (seconds)   Hive, elapsed time (seconds)
Standalone                            391                                    890.628
No control                            731                                    1434.75
2/262144                              1231                                   1022.083
100/1000                              1217                                   1184.115
200/800                               1166                                   1244.993
500/500                               749                                    1269.115
800/200                               513                                    1277.694
1000/100                              465                                    1288.094
262144/2                              407                                    1284.804

This table shows that: 1) the Static Service Pool works; 2) it is a coarse control - in other words, you cannot expect exact proportions from it.


Big Data SQL

Big Data SQL Quick Start. My query is running too slow or how to tune Big Data SQL. - Part13.

In my previous posts, I was talking about different features of Big Data SQL. Everything is clear (I hope), but when you start to run real queries you may have doubts - am I getting the maximum performance this cluster can deliver? In this article I would like to explain the steps required for performance tuning of Big Data SQL. SQL Monitoring. First of all, Big Data SQL is Oracle SQL, so you can start debugging Big Data SQL performance (and other issues) with SQL Monitor, just as you would for any other Oracle SQL. To start working with it you can install OEM or use the lightweight EM Database Express. If you don't want to use GUI tools you can use it from SQL*Plus, as shown here. Some performance problems may be unrelated to Hadoop and may be general Oracle Database issues, such as heavy use of the TEMP tablespace. Many of the wait events are standard Oracle Database events; there are only a couple that are specific to Big Data SQL: 1) "cell external table smart scan" - the typical event for Big Data SQL, which tells us that something (a scan) is happening on the Hadoop side. 2) "External Procedure call" - this event is also natural for Big Data SQL: through the extproc the database fetches the metadata and determines the block locations on HDFS for planning. However, if you observe a lot of "External Procedure call" wait events it could be a bad sign; usually it means that you are fetching HDFS blocks on the database side and parsing/processing them there (without offloading). Quarantine. If your query has failed a few times it may be placed in quarantine. This works as it does in Exadata: SQL statements that are in quarantine will not be processed on the cell side; instead the data is shipped to the database and processed there (the "External Procedure Call" wait event will tell you about this). To check which queries are in quarantine, run: [Linux] $ dcli -C bdscli -e "list quarantine" and to drop the quarantine: [Linux] $ dcli -C bdscli -e "drop quarantine all" Storage Indexes. Storage Indexes (SI) are a very powerful performance feature; I explained how they work here. I don't recommend disabling them. In most cases SI brings a great performance boost, but it has one downside - the first few runs are slower than without SI. Again, I don't recommend disabling it. If you want consistent performance with SI, I advise warming it up by running, a few times, a query that returns exactly 0 rows. This can be done by using a WHERE predicate that is never TRUE, for example: SQL> select * from customers WHERE age= -1 and passport_id = 0; The first run will be slow, but after a few runs the query will finish within a couple of seconds. Data types. Let's imagine that you have made sure that everything that can run on the cell side runs there (in other words, you don't have a lot of "External Procedure Call" wait events), you don't have any Oracle Database related problems and the Storage Indexes are warmed up, but you still think the query could run faster. The next thing to check is the data type definitions in the Oracle Database and Hive. In a nutshell, the wrong data type definitions can make your queries several times slower. Ideally you just pass the data from the Hadoop level to the database layer without any transformation; otherwise you burn a lot of CPU resources on the cell side. I put all the details here, so be very careful with your Oracle DDLs. File Formats. Big Data SQL has a lot of optimizations for working with text files (like CSV). 
It processes them in a C engine. You may also get some benefit from columnar file formats like Parquet or ORC; the main optimization there is Predicate Push Down. Another big optimization you can make with columnar file formats is to list fewer columns. Avoid queries like: SQL> select * from customers and instead list the minimum number of columns: SQL> select col1, col2 from customers If you are creating parquet files it may also be useful to reduce the page size in order to reduce Big Data SQL memory consumption. For example, you can do this with Hive by creating a new table: hive> CREATE TABLE new_tab STORED AS PARQUET tblproperties ("parquet.page.size"="65536") AS SELECT * FROM old_tab; What is your bottleneck? It's very important to understand where your bottleneck is. Big Data SQL is a complex product which involves two sides - Database and Hadoop - and each side has a few components which could limit your performance. For the database side I recommend using OEM; Hadoop is easier to debug with Cloudera Manager (it has plenty of pre-collected and predefined charts, which you can find under the Charts tab). What is the whole picture? Many thanks to Marty Gubar for this picture, which shows the overall flow of Big Data SQL processing: whenever you run a query, the Oracle Database first obtains the list of Hive partitions. This is the first Big Data SQL optimization - you read only the data you need. After this the database obtains the list of blocks and plans the scan in a way that evenly distributes the workload. After column pruning, the database runs the scan on the Hadoop tier. If Storage Indexes exist they are applied as a first step. After this (in the case of Parquet or ORC files) Big Data SQL applies Predicate Push Down and starts to fetch the data. The data is stored in a Hadoop format and needs to be converted to Oracle types. Finally, Big Data SQL runs the Smart Scan (filter) over the rest of the data (whatever was not pruned out by Storage Indexes or Predicate Push Down).
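Two quick SQL-level checks can complement the OEM and Cloudera Manager views described above. The first query uses the same v$mystat/v$statname technique shown elsewhere in this series to compare how many bytes were requested for offload with how many actually came back over the interconnect (a large gap usually means the cells are doing the filtering for you). The second is a sketch of pulling a SQL Monitor report straight from SQL*Plus via DBMS_SQLTUNE; the sql_id shown is a placeholder, so substitute the one for your own statement.

SQL> -- offload effectiveness for the current session
SQL> SELECT n.name, s.value
     FROM   v$mystat s, v$statname n
     WHERE  s.statistic# = n.statistic#
     AND    n.name IN ('cell XT granule bytes requested for predicate offload',
                       'cell interconnect bytes returned by XT smart scan');

SQL> -- text SQL Monitor report for one statement (replace the placeholder sql_id)
SQL> SELECT DBMS_SQLTUNE.REPORT_SQL_MONITOR(sql_id => 'placeholder_sqlid', type => 'TEXT')
     FROM   dual;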


The complete review of data warehousing and big data content from Oracle OpenWorld 2016

The COMPLETE REVIEW of OpenWorld covers all the most important sessions and related content from this year's conference, including Oracle's key data warehouse and big data technologies: Oracle Database 12c Release 2, Oracle Cloud, engineered systems, partitioning, parallel execution, Oracle Optimizer, analytic SQL, analytic views, in-memory, spatial, graph, data mining, multitenant, Big Data SQL, NoSQL Database and industry data models. The COMPLETE review covers the following areas: On-demand videos of the most important keynotes Overviews of key data warehouse and big data sessions and links to download each presentation List of data warehouse and big data presenters who were at #oow16 Overview of Oracle Cloud services for data warehousing and big data Details of OpenWorld 2017 and details of how to justify your trip to San Francisco Links to the data warehouse and big data product web pages, blogs, social media sites This review is available in Apple iBooks for people who are living in the 21st Century, and for those of you stuck in the early 1900's there is the fall-back option of a PDF version. Of course the iBook version offers a complete, exciting and immersive multi-media experience whilst the PDF version is, quite literally, just a PDF. The PDF version can be downloaded by clicking here. The Apple iBook version can be downloaded from the iBook Store by clicking here. Hope this review is useful. Let me know if you have any questions and/or comments. Enjoy!


Big Data SQL

Big Data SQL Quick Start. Semi-structured data. - Part12.

In my previous blogpost, I was talking about Schema on Read and Schema on Write advantages and disadvantages. As a conclusion, we found that HDFS could be quite suitable for data in the original format. Very often customers have data in a semi-structure format like XML or JSON. In this post, I will show how to work with it. Use case for storing semi-structure data. One of the most common use case for storing semi-structure data in the HDFS could be desire to store all original data and move only part of it in the relational database. This may be due to the fact that part of the data may be needed on daily basis, but other parts of the data will be accessed very rarely (but they still may be required for some deep analytics). For example, we have XML follow format: <XML> <NAME> ...</NAME> <AGE> ...</AGE> <INCOME> ...</INCOME> <Color_of_eyes> ... </Color_of_eyes> <Place_of_birth> ... </Place_of_birth> </XML>  and on daily basis, we need in the relational database only the name and age of a person. Like this: Name Age ---- ---- .... .... Others fields will be accessed very rarely. At this case, HDFS seems like a good solution to store data in the original format and Big Data SQL seems like a good tool for access it from the database. Let me show to you couple examples how you can do this. Big Data SQL and XML. For start querying the XML data with Big Data SQL you have to define Hive metadata over it using Oracle XQuery for Hadoop. After this, you have to define an external table in the Oracle Database, which will link you to the Hive table and you are ready to run your queries. Here is the abstract picture: Now, let me give you an example of data with DDLs. Like an example of the data I took some machine data (it could be a smart meter or so): <row><CUSTOMER_KEY>8170837</CUSTOMER_KEY><End_Datetime>4/04/2013 12:29</End_Datetime><General_Supply_KWH>0.197</General_Supply_KWH><Off_Peak_KWH>0</Off_Peak_KWH><Gross_Generation_KWH>0</Gross_Generation_KWH><Net_Generation_KWH>0</Net_Generation_KWH></row> <row><CUSTOMER_KEY>8170837</CUSTOMER_KEY><End_Datetime>4/04/2013 12:59</End_Datetime><General_Supply_KWH>0.296</General_Supply_KWH><Off_Peak_KWH>0</Off_Peak_KWH><Gross_Generation_KWH>0</Gross_Generation_KWH><Net_Generation_KWH>0</Net_Generation_KWH></row> <row><CUSTOMER_KEY>8170837</CUSTOMER_KEY><End_Datetime>4/04/2013 13:29</End_Datetime><General_Supply_KWH>0.24</General_Supply_KWH><Off_Peak_KWH>0</Off_Peak_KWH><Gross_Generation_KWH>0</Gross_Generation_KWH><Net_Generation_KWH>0</Net_Generation_KWH></row> <row><CUSTOMER_KEY>8170837</CUSTOMER_KEY><End_Datetime>4/04/2013 13:59</End_Datetime><General_Supply_KWH>0.253</General_Supply_KWH><Off_Peak_KWH>0</Off_Peak_KWH><Gross_Generation_KWH>0</Gross_Generation_KWH><Net_Generation_KWH>0</Net_Generation_KWH></row> <row><CUSTOMER_KEY>8170837</CUSTOMER_KEY><End_Datetime>4/04/2013 14:29</End_Datetime><General_Supply_KWH>0.24</General_Supply_KWH><Off_Peak_KWH>0</Off_Peak_KWH><Gross_Generation_KWH>0</Gross_Generation_KWH><Net_Generation_KWH>0</Net_Generation_KWH></row> I put this data on the HDFS: [Linux] $ hadoop fs -put source.xml hdfs://cluster-ns/user/hive/warehouse/xmldata/ Like a second step i have to define Hive metadata: hive> CREATE EXTERNAL TABLE meter_counts( customer_key string, end_datetime string, general_supply_kwh float, off_peak_kwh int, gross_generation_kwh int, net_generation_kwh int) ROW FORMAT SERDE 'oracle.hadoop.xquery.hive.OXMLSerDe' STORED AS INPUTFORMAT 'oracle.hadoop.xquery.hive.OXMLInputFormat' OUTPUTFORMAT 
'oracle.hadoop.xquery.hive.OXMLOutputFormat' LOCATION 'hdfs://cluster-ns/user/hive/warehouse/xmldata/' TBLPROPERTIES ( 'oxh-column.CUSTOMER_KEY'='./CUSTOMER_KEY', 'oxh-column.End_Datetime'='./End_Datetime', 'oxh-column.General_Supply_KWH'='./General_Supply_KWH', 'oxh-column.Gross_Generation_KWH'='./Gross_Generation_KWH', 'oxh-column.Net_Generation_KWH'='./Net_Generation_KWH', 'oxh-column.Off_Peak_KWH'='./Off_Peak_KWH', 'oxh-elements'='row'); For more information on creating XML tables, see Oracle XQuery for Hadoop here. As a second step we have to define External table in the Oracle Database, which linked to the hive table: SQL> CREATE TABLE OXH_EXAMPLE ( CUSTOMER_KEY VARCHAR2(4000), END_DATETIME VARCHAR2(4000), GENERAL_SUPPLY_KWH BINARY_FLOAT, OFF_PEAK_KWH NUMBER, GROSS_GENERATION_KWH NUMBER, NET_GENERATION_KWH NUMBER ) ORGANIZATION EXTERNAL ( TYPE ORACLE_HIVE DEFAULT DIRECTORY DEFAULT_DIR ACCESS PARAMETERS (com.oracle.bigdata.tablename=default.meter_counts) ) REJECT LIMIT UNLIMITED PARALLEL; here we are. Now we are ready to query XML data from the  Oracle DB: SQL> SELECT * FROM oxh_example WHERE ROWNUM <= 3; ...... 8170837 4/04/2013 12:29 0.196999997 0 0 0 8170837 4/04/2013 12:59 0.296000004 0 0 0 8170837 4/04/2013 13:29 0.239999995 0 0 0 Great, we expose XML data as structure in the Database.  Another one great thing about Big Data SQL, that parsing and part of the processing is pushed down to the Hadoop side. For example, if we run query like: SQL> SELECT COUNT(1) FROM oxh_example WHERE customer_key='8170837'; will be pushed to the Hadoop nodes and will not utilize the database. In the OEM we could see only "cell external table smart scan" event: In Cloudera Manager we see that 3 Hadoop nodes are utilized and at the same point of time database node is idle.   and session stat could show us, that only 8KB out of 100GB returned back to the Database (all other were filtered on the cell side): SQL> SELECT n.name, VALUE FROM v$mystat s, v$statname n WHERE s.statistic# = n.statistic# AND n.name LIKE '%XT%'; ... cell interconnect bytes returned by XT smart scan 8192 Bytes cell XT granule bytes requested for predicate offload 115035953517 Bytes Big Data SQL and JSON. All right, now we know how to work with XML data with Big Data SQL. But there is another one popular semi-structure data format - JSON. Here Oracle Database has prepared a pleasant surprise. Since 12c version, we have very convenient and flexible API for working with JSON in the database as well as out of the database (external table). Let me show this.  
An example of the input data: {wr_returned_date_sk:37890,wr_returned_time_sk:8001,wr_item_sk:107856,wr_refunded_customer_sk:5528377,wr_refunded_cdemo_sk:172813,wr_refunded_hdemo_sk:3391,wr_refunded_addr_sk:2919542,wr_returning_customer_sk:5528377,wr_returning_cdemo_sk:172813,wr_returning_hdemo_sk:3391,wr_returning_addr_sk:2919542,wr_web_page_sk:1165,wr_reason_sk:489,wr_order_number:338223251,wr_return_quantity:4,wr_return_amt:157.88,wr_return_tax:11.05,wr_return_amt_inc_tax:168.93,wr_fee:11.67,wr_return_ship_cost:335.88,wr_refunded_cash:63.15,wr_reversed_charge:87.15,wr_account_credit:7.58,wr_net_loss:357.98} {wr_returned_date_sk:37650,wr_returned_time_sk:63404,wr_item_sk:1229906,wr_refunded_customer_sk:5528377,wr_refunded_cdemo_sk:172813,wr_refunded_hdemo_sk:3391,wr_refunded_addr_sk:2919542,wr_returning_customer_sk:5528377,wr_returning_cdemo_sk:172813,wr_returning_hdemo_sk:3391,wr_returning_addr_sk:2919542,wr_web_page_sk:1052,wr_reason_sk:118,wr_order_number:338223251,wr_return_quantity:19,wr_return_amt:3804.37,wr_return_tax:266.31,wr_return_amt_inc_tax:4070.68,wr_fee:47.27,wr_return_ship_cost:3921.98,wr_refunded_cash:1521.75,wr_reversed_charge:2100.01,wr_account_credit:182.61,wr_net_loss:4454.6} Put it in the HDFS from the Linux filesystem and create Hive table with single column over it. After this create Oracle DB external table with the single column: [Linux] $ hadoop fs -put source.json hdfs://cluster-ns/user/hive/warehouse/jsondata/ hive> CREATE TABLE json_string( json_str string) ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe' STORED AS INPUTFORMAT 'org.apache.hadoop.mapred.TextInputFormat' OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat' LOCATION 'hdfs://cluster-ns/user/hive/warehouse/jsondata/' SQL> CREATE TABLE WEB_RETURNS_JSON_STRING ( JSON_STR VARCHAR2(4000) ) ORGANIZATION EXTERNAL ( TYPE ORACLE_HIVE DEFAULT DIRECTORY "DEFAULT_DIR" ACCESS PARAMETERS (com.oracle.bigdata.tablename=json.json_string) ) REJECT LIMIT UNLIMITED PARALLEL; Ok, it the beggining it seems senceless. Why do we need the table with the single column? But in the Oracle 12c you have very wide JSON capabilities, which is automatically available for Big Data SQL (you do remember that Big Data SQL is Oracle SQL, aren't you?). If you are not familiar with it, I advise to check out this blogpost (thank you, Gerald). It's extremely easy to parse your JSONs with the Oracle SQL - just put a dot after a name of the column and write the name of the field. SQL> SELECT j.json_str.wr_returned_date_sk, j.json_str.wr_returned_time_sk FROM web_returns_json_string j WHERE j.json_str.wr_returned_time_sk = 8645 AND ROWNUM <= 5; ... 38195 8645 38301 8645 37820 8645 38985 8645 37976 8645 If we check the system stat, we could find that we filter out a lot of data on the Hadoop Side: SQL> SELECT n.name, VALUE FROM v$mystat s, v$statname n WHERE s.statistic# = n.statistic# AND n.name LIKE '%XT%'; ... cell interconnect bytes returned by XT smart scan 507904 Bytes cell XT granule bytes requested for predicate offload 16922334453 Bytes Note: parsing and filtering happen on the Hadoop side! Big Data SQL and JSON.  Restrictions and workaround. So, everything works well until you have JSON strings more than 4000 characters and you are able to define the table with VARCHAR2(4000) column. But, what I suppose to do if I do have JSON strings longer than 4000 characters? Define it like a CLOB, but (!) at this case all parsing and filtering will happen on the Database side. 
Test Case: 1) Using table definition from previous example(VARCHAR2), I ran  the query: SQL> SELECT COUNT(1) FROM web_returns_json_string j WHERE j.json_str.wr_returned_time_sk = 8645; it was finished in 75 seconds. OEM shows that most of the event were "cell external table smart scan", which means that we offload scans on the Storage side. After this I defined External table over the same hive table, but define column like a CLOB: SQL> CREATE TABLE WEB_RETURNS_JSON_STRING ( JSON_STR CLOB ) ORGANIZATION EXTERNAL ( TYPE ORACLE_HIVE DEFAULT DIRECTORY "DEFAULT_DIR" ACCESS PARAMETERS (com.oracle.bigdata.tablename=json.json_string) ) REJECT LIMIT UNLIMITED PARALLEL ; And run the same query: SQL> SELECT COUNT(1) FROM web_returns_json_string j WHERE j.json_str.wr_returned_time_sk = 8645; it was finished in 90 Minutes (!!!) in 3600 times slower. OEM shows that most of the event on the Database CPUs. Which means that we couldn't offload because it's a CLOB column.  Cloudera Manager also shows us the difference between two queries. First one utilizes the cell side (3 Hadoop nodes), second one utilize only the database.   Well, now we understand the problem (low performance in case if JSON longer than 4000 characters), but how to work around it? It's easy, like in the XML example define structure in the Hive metastore and map hive table to Oracle Table one to one. hive> CREATE EXTERNAL TABLE j1_openx( wr_returned_date_sk bigint, wr_returned_time_sk bigint, wr_item_sk bigint, wr_refunded_customer_sk bigint, wr_refunded_cdemo_sk bigint, wr_refunded_hdemo_sk bigint, wr_refunded_addr_sk bigint, wr_returning_customer_sk bigint, wr_returning_cdemo_sk bigint, wr_returning_hdemo_sk bigint, wr_returning_addr_sk bigint, wr_web_page_sk bigint, wr_reason_sk bigint, wr_order_number bigint, wr_return_quantity int, wr_return_amt double, wr_return_tax double, wr_return_amt_inc_tax double, wr_fee double, wr_return_ship_cost double, wr_refunded_cash double, wr_reversed_charge double, wr_account_credit double, wr_net_loss double) ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe' STORED AS INPUTFORMAT 'org.apache.hadoop.mapred.TextInputFormat' OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat' LOCATION 'hdfs://cluster-ns/user/hive/warehouse/jsondata/' there is plenty of different SerDe for JSON, but from the performance perspective, I personally would recommend org.openx.data.jsonserde.JsonSerDe After this we just need to define Oracle external table over it: SQL> CREATE TABLE WEB_RETURNS_JSON_SD_OPENX ( WR_RETURNED_DATE_SK NUMBER(10,0), WR_RETURNED_TIME_SK NUMBER(10,0), WR_ITEM_SK NUMBER(10,0), WR_REFUNDED_CUSTOMER_SK NUMBER(10,0), WR_REFUNDED_CDEMO_SK NUMBER(10,0), WR_REFUNDED_HDEMO_SK NUMBER(10,0), WR_REFUNDED_ADDR_SK NUMBER(10,0), WR_RETURNING_CUSTOMER_SK NUMBER(10,0), WR_RETURNING_CDEMO_SK NUMBER(10,0), WR_RETURNING_HDEMO_SK NUMBER(10,0), WR_RETURNING_ADDR_SK NUMBER(10,0), WR_WEB_PAGE_SK NUMBER(10,0), WR_REASON_SK NUMBER(10,0), WR_ORDER_NUMBER NUMBER(10,0), WR_RETURN_QUANTITY NUMBER(10,0), WR_RETURN_AMT NUMBER, WR_RETURN_TAX NUMBER, WR_RETURN_AMT_INC_TAX NUMBER, WR_FEE NUMBER, WR_RETURN_SHIP_COST NUMBER, WR_REFUNDED_CASH NUMBER, WR_REVERSED_CHARGE NUMBER, WR_ACCOUNT_CREDIT NUMBER, WR_NET_LOSS NUMBER ) ORGANIZATION EXTERNAL ( TYPE ORACLE_HIVE DEFAULT DIRECTORY DEFAULT_DIR ACCESS PARAMETERS ( com.oracle.bigdata.cluster:bds30 com.oracle.bigdata.tablename:json.j1_openx)) REJECT LIMIT UNLIMITED PARALLEL ; and query it: SQL> SELECT COUNT(1) FROM WEB_RETURNS_JSON_SD_OPENX j WHERE 
j.wr_returned_time_sk = 8645; Now it took 141 seconds with offloading to the Hadoop side. That is roughly two times slower than the native VARCHAR2 processing, but orders of magnitude faster than processing the same data as a CLOB. Conclusion. 1) HDFS is well suited to storing data in its original format. 2) Big Data SQL offers wide capabilities for working with semi-structured data. 3) For the JSON file format it has a convenient API out of the box.
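As a small aside of my own (not from the post above): the simplified dot notation used earlier maps onto the 12c JSON functions, so the same filter can also be written with JSON_VALUE, which lets you state the expected data type explicitly. A minimal sketch against the same WEB_RETURNS_JSON_STRING external table:

SQL> SELECT COUNT(1)
     FROM   web_returns_json_string j
     WHERE  JSON_VALUE(j.json_str, '$.wr_returned_time_sk' RETURNING NUMBER) = 8645;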


Your Chance To Meet the Analytic SQL Development Team at OpenWorld

Wow, it’s only just under two weeks to go until this year’s OpenWorld kicks off on September 18th at Moscone Center in San Francisco. The analytic SQL development team will be available on the demo booth (id ref 1635) in the database area, Moscone South, to help with any technical questions and provide general guidance and using analytic features of 12c Release 2. Obviously we would love to meet you all and we will keep you up-to-date with the latest data warehouse news coming out of #oow16 by posting updates during week on our social media sites. such as @BigRedDW on twitter. On the demo grounds we are trying something a little different this year. To give you the best experience of all the new features that we have added to Database 12c Release 2 we are going to showcase different features during the morning and afternoon sessions. Here is the schedule: Date Morning Topic Afternoon Topic Monday 10:15am - 1:00pm Analytic Views 1:00pm - 5:30pm Analytic SQL Tuesday 10:15am - 1:00pm Analytic Views 1:00pm - 5:15pm Analytic SQL Wednesday 10:15am - 1:00pm Analytic Views 1:00pm - 4:15pm Analytic SQL This split will allow you to ask our brilliant developers about all the key new SQL features they’ve added and the enhancements they’ve made for 12c Release 2. I’ll be at the booth during both morning and afternoon sessions during the week. Here are the key sessions linked to our demo booths: Oracle Database 12c Release 2: Top 10 Data Warehouse Features for Developers and DBAs:  Monday at 5:30pm to 6:15pm in Moscone South-303 Analytic Views: A New Type of Database View for Simple, Powerful Analytics : Wednesday at 12:15pm to 1:00pm in Moscone South-102 And don’t forget Marty Gubar’s excellent hands-on lab “Use Oracle Big Data SQL to Analyze Data Across Oracle Database, Hadoop and NoSQL“, running in Bay View (25th Floor) at the Nikko Hotel, which shows you how to use analytic functions and SQL pattern matching on big data! Also, check out the panel session Optimizing SQL for Performance and Maintainability which is on Thursday 22nd, 1:15pm - 2:00 pm in Moscone South—103. I will be hosting this will be a great session which will include our AskTom and SQL developer advocates team of Chris and Connor, Mr Optimizer (Nigel Bayliss, John Clarke from the Real World Performance team, Christian Antognini (Senior Principal Consultant, Trivadis AG) and Timothy Hall (DBA, Developer, Author, Trainer, Haygrays Limited). This will be an exciting and lively session so  don’t miss it! Don't forget that my fellow data warehousing product managers (Hermann Baer, Yasin Baskan, Nigel Bayliss, George Lumpkin, Jean-Pierre Dijcks) will also be presenting a bunch of sessions so check them out in the full searchable OOW catalog. If you search using keywords like partitioning, warehousing, warehouse, parallel, optimizer and big data you’ll find them! Alternatively download my Complete Guide to Data Warehousing and Big Data at OpenWorldwhich is available in both Apple iBooks and PDF formats. The PDF format will work on any smartphone, tablet and/or computer. The iBooks format will open on iPads, iPhones and Mac computers via the relevant iBooks App. Please refer to the Apple Apps Store for more information. PDF version can be downloaded by clicking here Apple iBooks version can be downloaded by clicking here Look forward to seeing you all in the beautiful city of San Francisco.


Big Data SQL

Big Data SQL Quick Start. Schema on Read and Schema on Write - Part11.

Schema on Read vs Schema on Write So, when we talk about data loading, we usually do it with a system that belongs to one of two types. The first is schema on write. With this approach we have to define columns, data formats and so on up front. During reading, every user observes the same data set. Because we have already performed the ETL (transformed the data into the format that is most convenient for that particular system), reading is pretty fast and overall system performance is pretty good. But you should keep in mind that we paid a penalty for this when we loaded the data. A relational database such as Oracle or MySQL is a classic example of a schema on write system. Schema on Write The other approach is schema on read. In this case we load data as-is, without any changes or transformations. With this approach we skip the ETL step (we don’t transform the data) and we don’t have any headaches with data formats and data structures: we just put the file on a file system, like copying photos from a flash card or external storage to your laptop’s disk. How to interpret the data is decided at read time. Interestingly, the same data (the same files) can be read in different ways. For instance, if you have some binary data you have to define a serialization/deserialization framework and use it within your select; then you will have structured data, otherwise you just get a set of bytes. As another example, even with the simplest CSV files you can read the same column as a number or as a string, which leads to different results for sorting and comparison operations (see the short sketch at the end of this post). Schema on Read The Hadoop Distributed File System is the classical example of a schema on read system. More details about the schema on read and schema on write approaches can be found here. Is schema on write always a good thing? Many of you have probably heard about the Parquet and ORC file formats in Hadoop. These are examples of the schema on write approach: we convert the source format into a form that is convenient for the processing engine (like Hive, Impala or Big Data SQL). Big Data SQL has very powerful features like predicate push down and column pruning, which let you significantly improve performance. I hope my previous blog post convinced you that you can get a dramatic Big Data SQL performance improvement with parquet files - but should you immediately delete the source files after conversion? I don't think so, and let me give you an example why. Transforming source data. As a data source, I've chosen AVRO files. 
{ "type" : "record", "name" : "twitter_schema", "namespace" : "com.miguno.avro", "fields" : [ { "name" : "username", "type" : "string", "doc" : "Name of the user account on Twitter.com" }, { "name" : "tweet", "type" : "string", "doc" : "The content of the user's Twitter message" }, { "name" : "timestamp", "type" : "long", "doc" : "Unix epoch time in seconds" } using this schema we generate  AVRO file, which has 3 records: [LINUX]$ java -jar /usr/lib/avro/avro-tools.jar random --schema-file /tmp/twitter.avsc --count 3 example.avro on the next step we put this file into the hdfs directory: [LINUX]$ hadoop fs -mkdir /tmp/avro_test/ [LINUX]$ hadoop fs -mkdir /tmp/avro_test/flex_format [LINUX]$ hadoop fs -put example.avro /tmp/avro_test/flex_format   now is good time to explain this data with metadata (create hive external table):    hive> CREATE EXTERNAL TABLE tweets_flex ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.avro.AvroSerDe' STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat' OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat' LOCATION '/tmp/avro_test/flex_format' TBLPROPERTIES ('avro.schema.literal'='{ "namespace": "testing.hive.avro.serde", "name": "tweets", "type": "record", "fields": [ {"name" : "username", "type" : "string", "default" : "NULL"}, {"name" : "tweet","type" : "string", "default" : "NULL"}, {"name" : "timestamp", "type" : "long", "default" : "NULL"} ] }' ); to get access to this data from Oracle we need to create an external table which will be linked with hive table, created in the previous step. SQL> CREATE TABLE tweets_avro_ext ( username VARCHAR2(4000), tweet VARCHAR2(4000), TIMESTAMP NUMBER ) ORGANIZATION EXTERNAL ( TYPE ORACLE_HIVE DEFAULT DIRECTORY "DEFAULT_DIR" ACCESS PARAMETERS ( com.oracle.bigdata.tablename=DEFAULT.tweets_flex) ) REJECT LIMIT UNLIMITED PARALLEL; Now I want to convert my data to a format which has some optimizations for  Big Data SQL, parquet for example: hive> create table tweets_parq ( username string, tweet string, TIMESTAMP smallint ) STORED AS PARQUET; hive> INSERT OVERWRITE TABLE tweets_parq select * from tweets_flex; as a second step of the metadata definition, I created Oracle external table, which is linked to the parquet files: SQL> CREATE TABLE tweets_parq_ext ( username VARCHAR2(4000), tweet VARCHAR2(4000), TIMESTAMP NUMBER ) ORGANIZATION EXTERNAL ( TYPE ORACLE_HIVE DEFAULT DIRECTORY "DEFAULT_DIR" ACCESS PARAMETERS ( com.oracle.bigdata.cluster=bds30 com.oracle.bigdata.tablename=DEFAULT.tweets_parq) ) REJECT LIMIT UNLIMITED PARALLEL; Now everything seems fine, and let's query the tables which have to have identic data (because parquet table was produced in Create as Select style from AVRO). SQL> select TIMESTAMP from tweets_avro_ext WHERE username='vic' AND tweet='hello' UNION ALL select TIMESTAMP from tweets_parq_ext WHERE username='vic' AND tweet='hello' ------------ 1472648470 -6744 Uuups...  it's not what we expect.. data have to be identical, but something went wrong. smallint datatype is not enough for the timestamp and this is the reason that we got wrong results. Let's try to recreate parquet table in hive and run SQL in Oracle again. hive> drop table tweets_parq; hive> create table tweets_parq ( username string, tweet string, TIMESTAMP bigint ) STORED AS PARQUET; hive> INSERT OVERWRITE TABLE tweets_parq select * from tweets_flex; after reloading data we don't need to do something in Oracle database (as soon as the table name in hive remains the same. 
SQL> select TIMESTAMP from tweets_avro_ext WHERE username='vic' AND tweet='hello' UNION ALL select TIMESTAMP from tweets_parq_ext WHERE username='vic' AND tweet='hello' ------------ 1472648470 1472648470 Bingo! The results are the same. Conclusion. Whether schema on read or schema on write is better is a philosophical question. The first gives you flexibility and protects you from human mistakes; the second is able to provide better performance. Generally, it's a good idea to keep the data in its source format (just in case) and additionally optimize it into another format that is convenient for the engine that scans your data. Your ETL could contain a wrong transformation, and keeping the source data format allows you to jump back to the source and reparse the data in the proper way.
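Returning to the point made at the start of this post about reading the same CSV column as a number or as a string: here is a minimal, self-contained illustration (my own example, not tied to any table used above) of how that choice changes the sort order.

SQL> -- read the values as strings: '10' and '100' sort before '9'
SQL> WITH src AS (SELECT '9' AS col FROM dual UNION ALL
                  SELECT '10'       FROM dual UNION ALL
                  SELECT '100'      FROM dual)
     SELECT col FROM src ORDER BY col;

SQL> -- read the same values as numbers: 9, 10, 100
SQL> WITH src AS (SELECT '9' AS col FROM dual UNION ALL
                  SELECT '10'       FROM dual UNION ALL
                  SELECT '100'      FROM dual)
     SELECT TO_NUMBER(col) AS col FROM src ORDER BY 1;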


The Complete Guide To Data Warehousing and Big Data at Oracle OpenWorld 2016

The COMPLETE guide for OpenWorld provides a comprehensive day-by-day list of all the most important sessions and hands-on labs for Oracle's data warehouse and big data technologies: Oracle Database 12c Release 2, Oracle Cloud, engineered systems, partitioning, parallel execution, Oracle Optimizer, analytic SQL, analytic views, in-memory, spatial, graph, data mining, multitenant, Big Data SQL, NoSQL Database and industry data models. The COMPLETE guide covers the following areas: Key highlights from last year's conference List of Data Warehouse and Big Data presenters Day-by-Day schedule for must-see sessions and labs Details of this year's Appreciation Event Links to DW and Big Data product web pages, blogs, social media sites Information about our #oow16 smartphone app Maps to help you find your way around Moscone The Complete Guide to Data Warehousing and Big Data at OpenWorld is available in both Apple iBooks and PDF formats. The PDF format will work on any smartphone, tablet and/or computer. The iBooks format will open on iPads, iPhones and Mac computers via the relevant iBooks App. Please refer to the Apple App Store for more information. The PDF version can be downloaded by clicking here. The Apple iBooks version can be downloaded by clicking here. Look forward to seeing you all in the beautiful city of San Francisco.


Your Essential Online Session and Hands-on Lab Calendars for #oow16

It’s almost time for OpenWorld. Only three weeks to go! With so much to see and learn at Oracle OpenWorld we are doing our best to make sure that everyone get the most from this year’s conference. Therefore, to help you get prepared and organized we have created a series of online calendars which list all the must-see data warehousing and big data sessions, labs and key events. Just look at the agenda below - we have packed this year’s schedule with the very best must-see sessions and must-attend labs by Oracle product managers and key customers. The above agenda is built using Google Calendar and is available for use with other calendar applications that allow you to subscribe to online calendars. To make the process as easy as possible we have created a range of calendars mapped to specific areas of interest such as: cloud, data warehousing, analytics, big data and unstructured data.  The following links can be used to access our OpenWorld calendars via your own calendar applications using the following links: Complete calendar - covers all the most important data warehousing and big sessions and labs (HTML version is here) If you just want to cherry-pick particular topics then select from the following links to focus on your particular areas of interest: Cloud calendar  - covers all the most important cloud sessions and labs (HTML version is here) Data warehouse calendar -  covers all the most important data warehousing sessions and labs (HTML version is here) SQL and Analytics calendar-  covers all the most important sessions and labs for analytics: data mining, SQL, analytic SQL, spatial and graph (HTML version is here) Unstructured data calendar -  covers all the most important sessions and labs for unstructured data and application development: text, XML, REST, JSON (HTML version is here) Big Data calendar - covers all the most important sessions and labs for big data (HTML version is here) Hope these links are useful. Looking forward to seeing you at OpenWorld in September. Have a great conference.


Parallel PL/SQL Functions and Global Temporary Tables... and Wrong Results

Recently I got a question from a customer about a parallel query which produces wrong results. The query involves a global temporary table and a parallel enabled PL/SQL function. Before talking about this specific query I want to briefly show the effect of using PL/SQL functions in a parallel query. PL/SQL functions in parallel queries When you use a PL/SQL function, as a predicate for example, in a parallel query the function is executed by the query coordinator (QC). This can cause some parts of the query or the whole query to be serialized which means significantly worse performance. Here is an example. create table s as select rownum id,rpad('X',1000) pad from dual connect by level<=10000; create or replace function f_wait(id in number) return number is begin dbms_lock.sleep(0.01); return(id); end; / I have a table with 10K rows and a PL/SQL function that takes an input, waits 0.01 seconds and returns the input back. Let's compare a serial and a parallel query using this function. SQL> set timing on SQL> select count(*) from s where id=f_wait(id); COUNT(*) ---------- 10000 Elapsed: 00:01:40.24 SQL> select /*+ parallel(4) */ count(*) from s where id=f_wait(id); COUNT(*) ---------- 10000 Elapsed: 00:01:40.28 Both the serial and the parallel query ran for around 100 seconds, this is because we called the function by using the column ID as input and the function was executed for every row of table S (10K rows) and we waited 0.01 seconds for each row. We can understand why the parallel query did not improve the response time by looking at the execution plan. Even though there is a parallel hint in the query the execution plan is a serial plan. This is because the PL/SQL function can only be executed by the QC, this makes the whole query go serial. Parallel enabled PL/SQL functions How can we change this behavior? How can we make sure the function is executed in parallel so that the query runs faster? The way to tell Oracle that a function can be executed in parallel is to use the keyword PARALLEL_ENABLE in the function definition. This keyword tells Oracle that this function is safe to be executed by an individual PX server. Here is what happens when we add that keyword. create or replace function f_wait(id in number) return number parallel_enable is begin dbms_lock.sleep(0.01); return(id); end; / SQL> select /*+ parallel(4) */ count(*) from s where id=f_wait(id); COUNT(*) ---------- 10000 Elapsed: 00:00:25.81 The elapsed time dropped to a quarter of what it was before, from 100 seconds to 25 seconds. This is because we had 4 PX servers running the function concurrently. Here is the plan this time. Now the plan is fully parallel and operation #6, which is the filter operation running the function, is executed in parallel. If you have to use a PL/SQL function make sure to set it as PARALLEL_ENABLE if you know that it is safe to be executed by PX servers. This is required to prevent serialization points in the execution plan. Now, to the customer question I mentioned before. Parallel enabled PL/SQL functions and global temporary tables As you may already know, the data in a global temporary table is private to a session. You can only see the data populated in your own session, you cannot see the data inserted by other sessions. So, what happens if you populate the temporary table in your session and then run a parallel query on it? As parallel queries use multiple PX servers and multiple sessions, can PX servers see the data in the temporary table? 
create global temporary table ttemp (col1 number) on commit delete rows; insert into ttemp select rownum from dual connect by level<=10000; select /*+ parallel(2) */ count(*) from ttemp; COUNT(*) ---------- 10000 In this case we had two PX servers scanning the temporary table and the reported count is correct. This indicates individual PX servers were able to see the data populated before by the user session. Parallel queries are different in the sense that parallel sessions working on a temporary table can see the data populated by the QC before. When running a query against the temporary table, the QC is aware of the temporary table and sends the segment information to the PX servers so that they can read the data. PL/SQL functions querying temporary tables change this behavior. Here is a simplified version of the customer problem. create table t1 (id number); insert into t1 values (1000); commit; create global temporary table tempstage (col1 number) on commit preserve rows; create or replace function f_test return number parallel_enable is v_var number; begin select col1 into v_var from tempstage; return v_var; end; / Here we have a regular table T1, a temporary table TEMPSTAGE, and a parallel enabled PL/SQL function F_TEST that queries the temporary table. Let's populate the temporary table, and compare the results of a parallel and a serial query using the function as a predicate. SQL> insert into tempstage values (100); 1 row created. SQL> commit; Commit complete. SQL> select /*+ parallel(2) */ * from t1 where id>f_test; no rows selected SQL> select * from t1 where id>f_test; ID ---------- 1000 Things do not look good for the parallel query here, it returned wrong results to the user. This is because the function is declared as safe to be executed by individual PX servers. Each PX server uses its own session and as a result they cannot see the data populated by the user session. This is different than the previous example where the query was running against the temporary table, the QC in that case knew a temporary table was involved, here it only sees a function call which is parallel enabled. So, be careful when declaring functions as parallel enabled, be aware that the function will be executed by PX servers which can cause some unintended behavior. Think about how the function can behave when executed by multiple sessions and processes. Only declare it as parallel enabled when you are sure it is safe.
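One possible way to avoid the trap shown above - this is my own sketch, not part of the original question or answer - is to keep the reference to the temporary table in the query itself, where the query coordinator knows about it and can pass the segment information to the PX servers, and only hand plain values to any parallel-enabled functions:

SQL> -- the temporary table is referenced directly in the SQL statement,
SQL> -- so the QC makes its contents visible to the PX servers
SQL> select /*+ parallel(2) */ *
     from   t1
     where  id > (select col1 from tempstage);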


Big Data SQL

Big Data SQL Quick Start. Storage Indexes - Part10.

Today I'm going to explain a very powerful Big Data SQL feature - Storage Indexes. First of all, I want to note that the name "Storage Index" can be misleading. In fact, it is a dynamic structure that is automatically built over your data after you scan it. There is no specific command the user has to run, and there is nothing to maintain or rebuild (unlike a B-tree index). You just run your workload and, after a while, you may notice better performance. Storage Indexes are not something completely new for Big Data SQL: Oracle Exadata also has this feature, and Big Data SQL simply re-uses it.

How it works. The main idea is that we can keep some metadata about each unit of the scan (a block or multiple blocks). For example, we scan HDFS blocks with a given query that has a predicate in the where clause (like WHERE id=123). If a block doesn't return any rows, we build statistics for this column, such as the minimum and maximum value in that block. The next scan can use these statistics to skip the block entirely. This is a very powerful feature for columns with many distinct values. The fine-grained unit for a Storage Index in the case of Hadoop is the HDFS block. As you may know, an HDFS block is pretty big (the Big Data Appliance default is 256MB), so being able to skip a full scan of one brings significant performance benefits. A query initially scans a granule, and if the scan doesn't return any rows, a Storage Index is built for it (if at least one row is found in the block, no Storage Index is created over that particular block). In HDFS, data is usually stored in 3 copies. To maximize performance and get the benefit of Storage Indexes as quickly as possible, Big Data SQL (since version 3.1) uses a deterministic order of hosts: if you scan the table once and create Storage Indexes over the first replica, the next scan is performed over the same copy and can use the Storage Index starting with the second scan.

To illustrate, I want to show you a couple of good and bad examples for Storage Indexes. I have a table with one nearly unique column:

SQL> SELECT num_distinct FROM user_tab_col_statistics
     WHERE table_name = 'STORE_SALES_CSV' AND column_name = 'SS_TICKET_NUMBER';

num_distinct
------------
   849805312

The table is quite big:

SQL> select count(1) from STORE_SALES_CSV;

-------------
6 385 178 703

which means that on average each value appears in the dataset 7-8 times, so the predicate is quite selective (this is a 900.1 GB dataset). To show Storage Indexes in action, I run a query with a predicate that returns 2 rows:

SQL> select count(1) from STORE_SALES_CSV where SS_TICKET_NUMBER=187378862;

The first scan consumes a lot of IO and CPU and finishes in 10.6 minutes. The second and subsequent scans finish extremely fast, in 3 seconds, because the Storage Index tells us which blocks cannot contain rows matching the predicate, so they are skipped.
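To put the block skipping into perspective, here is a quick back-of-the-envelope calculation of my own (assuming the 256 MB default block size mentioned above and ignoring the last partial block) for the number of HDFS-block granules in this 900.1 GB table:

SQL> select round(900.1 * 1024 / 256) as approx_hdfs_blocks from dual;

APPROX_HDFS_BLOCKS
------------------
              3600

So the Storage Index lets the second scan skip nearly all of those roughly 3,600 granules, which is why the elapsed time drops from minutes to seconds.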
To check Storage Index efficiency, I query the session statistics view:

SQL> SELECT n.name,
            CASE name
              WHEN 'cell XT granule predicate offload retries' THEN value
              WHEN 'cell XT granules requested for predicate offload' THEN value
              ELSE round(value / 1024 / 1024 / 1024, 2)
            END val,
            CASE name
              WHEN 'cell XT granule predicate offload retries' THEN 'Granules'
              WHEN 'cell XT granules requested for predicate offload' THEN 'Granules'
              ELSE 'GBytes'
            END metric
       FROM v$mystat s, v$statname n
      WHERE s.statistic# = n.statistic#
        AND n.name IN ('cell XT granule IO bytes saved by storage index',
                       'cell XT granule bytes requested for predicate offload')
      ORDER BY metric;

cell XT granule IO bytes saved by storage index            899.86  GBytes
cell XT granule bytes requested for predicate offload      900.11  GBytes

Based on these statistics we can conclude that only (cell XT granule bytes requested for predicate offload - cell XT granule IO bytes saved by storage index) = 256 MBytes were read, which is one HDFS block.

First scan. I don't recommend disabling Storage Indexes in a real production environment, but they do have one side effect: the first scan with Storage Indexes enabled takes longer than without them. In my previous example the first scan took 10.6 minutes, but all subsequent scans finished in seconds. If I disable Storage Indexes, the first, second and all following scans take the same time - about 5.1 minutes. I can summarize this in a table:

                           Elapsed time with Storage Indexes   Elapsed time without Storage Indexes
first scan of the table    10.3 minutes                        5.1 minutes
second scan of the table   3 seconds                           5.1 minutes

Query with an unselective predicate. Given that in the previous example the Storage Index caused a performance degradation for the first query, it is interesting to check the behavior of a query that uses an unselective predicate. The same table has a column SS_QUANTITY which has only 100 unique values:

SQL> SELECT num_distinct FROM user_tab_col_statistics
     WHERE table_name = 'STORE_SALES_CSV' AND column_name = 'SS_QUANTITY';

num_distinct
------------
         100

with values between 0 and 100:

SQL> SELECT min(SS_QUANTITY), max(SS_QUANTITY) FROM STORE_SALES_CSV;

---- ------
   0    100

With Storage Indexes enabled I ran the following query 3 times:

SQL> select count(1) from STORE_SALES_CSV where SS_QUANTITY=82;

and all 3 times it finished in 5.3 minutes. This query returns a lot of rows. But when I queried with a value that does not exist (a negative number), which returns 0 rows:

SQL> select count(1) from STORE_SALES_CSV where SS_QUANTITY=-1;

the behavior was like in the previous example: the first scan took 10.5 minutes, the second and subsequent scans 3 seconds. At the end of the test I disabled Storage Indexes and ran the query multiple times, getting 5.3 minutes each time.
Let me summarize the results in a table:

                                         Elapsed time with Storage Index   Elapsed time without Storage Index
First run, returns many rows             5.3 minutes                       5.3 minutes
Second and next runs, return many rows   5.3 minutes                       5.3 minutes
First run, returns 0 rows                10.5 minutes                      5.3 minutes
Second and next runs, return 0 rows      3 seconds                         5.3 minutes

Based on these experiments we can see:
1) A Storage Index is built only for blocks that don't return any rows.
2) If a block returns at least one row, no Storage Index is built for it.
3) So, for a predicate that returns many rows, the first scan has no performance degradation, but the second scan also has no index to accelerate it.

Order by. In the last example we saw that a query with an unselective predicate is a bad candidate for Storage Indexes, unless you sort your source data. What does that mean? I create a new dataset from the original one with a CTAS Hive statement:

hive> create table csv.store_sales_quantity_sort stored as textfile as
      select * from csv.store_sales order by SS_QUANTITY;

After this I run my query twice again in Oracle RDBMS (with the table STORE_SALES_CSV now mapped to the Hive table store_sales_quantity_sort), using the predicate that returns many rows:

SQL> select count(1) from STORE_SALES_CSV where SS_QUANTITY=82;

As you can see, the second run is way faster, even though the query returns many rows, and now the Storage Indexes kick in. I can prove this with the SI statistics:

cell XT granule IO bytes saved by storage index            601.72  GBytes
cell XT granule bytes requested for predicate offload      876.47  GBytes

To analyze how columns are sorted and to reorder them, you may use this tool.

Bucketing. Another trick that can dramatically improve performance is bucketing. The only caveat is that it works only if you know the exact number of distinct values, or at least the maximum number of distinct values. If you know ahead of time that you will filter on some predicate column (like SS_QUANTITY in my examples above), you can prepare the data in an optimal way for it. The statement below creates 100 buckets, and in the ideal case each file contains exactly one value (otherwise you get a hash-like distribution). After creating the Oracle external table and querying the data twice, we can check the benefit the Storage Indexes bring:

hive> CREATE TABLE csv.store_sales_quantity_bucketed(
        ss_sold_date_sk bigint, ss_sold_time_sk bigint, ss_item_sk bigint,
        ss_customer_sk bigint, ss_cdemo_sk bigint, ss_hdemo_sk bigint,
        ss_addr_sk bigint, ss_store_sk bigint, ss_promo_sk bigint,
        ss_ticket_number bigint, ss_quantity int, ss_wholesale_cost double,
        ss_list_price double, ss_sales_price double, ss_ext_discount_amt double,
        ss_ext_sales_price double, ss_ext_wholesale_cost double,
        ss_ext_list_price double, ss_ext_tax double, ss_coupon_amt double,
        ss_net_paid double, ss_net_paid_inc_tax double, ss_net_profit double)
      CLUSTERED BY (SS_QUANTITY) INTO 100 BUCKETS
      ROW FORMAT DELIMITED FIELDS TERMINATED BY '|'
      STORED AS textfile;

hive> set hive.enforce.bucketing = true;

hive> insert overwrite table csv.store_sales_quantity_bucketed select * from csv.store_sales;

In Oracle RDBMS run:

SQL> select count(1) from STORE_SALES_CSV_QUANTITY_BUCK where SS_QUANTITY=82;
....
elapsed time: 822 sec

SQL> select count(1) from STORE_SALES_CSV_QUANTITY_BUCK where SS_QUANTITY=82;
....
elapsed time: 8 sec

SQL> SELECT * FROM xt_stat;

cell XT granule IO bytes saved by storage index            867.53  GBytes
cell XT granule bytes requested for predicate offload      876.47  GBytes

The fact that we read only 8.94 GB of data in the second run explains why the elapsed time was reduced so significantly. Bucketing together with Storage Indexes can bring significant performance benefits as long as you use the bucketed column in the where predicate.

Joins and Storage Indexes. Storage Indexes can also be very powerful for joins, together with Bloom filters. To show this, I took the table from the previous example - STORE_SALES_CSV_QUANTITY_BUCK - and joined it with a small table that contains only 2 rows:

SQL> CREATE TABLE test_couple_rows AS
     SELECT 3 q FROM dual
     UNION ALL
     SELECT 4 q FROM dual;

Now I join it with the big fact table, which is bucketed by the SS_QUANTITY column, using SS_QUANTITY as the join predicate so that Bloom filters can be applied:

SQL> SELECT /*+ use_hash(tt ss) */ COUNT(1)
     FROM test_couple_rows tt, STORE_SALES_CSV_QUANTITY_BUCK ss
     WHERE ss.ss_quantity = tt.q
       AND tt.q > 0;

Let's check the plan to make sure we are using Bloom filters. The query finished in 12 seconds, and we saved a lot of IO thanks to the Storage Indexes:

cell XT granule IO bytes saved by storage index            859.91  GBytes
cell XT granule bytes requested for predicate offload      876.47  GBytes

Please feel free to ask any questions in the comments!
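As an aside, the examples in this post select from a helper view called xt_stat without showing its definition. A minimal sketch of such a view, assuming it simply wraps the v$mystat / v$statname join used earlier in this post, could look like this:

SQL> create or replace view xt_stat as
     select n.name,
            round(s.value / 1024 / 1024 / 1024, 2) as gbytes
       from v$mystat s, v$statname n
      where s.statistic# = n.statistic#
        and n.name in ('cell XT granule IO bytes saved by storage index',
                       'cell XT granule bytes requested for predicate offload');

With something like this in place, a simple SELECT * FROM xt_stat after each run shows how much IO the Storage Indexes saved for the current session.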


Adaptive Distribution Methods in Oracle Database 12c

In my post about common distribution methods in Parallel Execution I talked about a few problematic execution plans that can be generated when the optimizer statistics are stale or non-existent. Oracle Database 12c brings some adaptive execution features that can fix some of those issues at runtime by looking at the actual data rather than the statistics. In this post we will look at one of these features, which adapts the distribution method on the fly during statement execution.

Adaptive Distribution Methods

One of the problems I mentioned in the earlier post was hash distribution with low cardinality. In that case there were only a few rows in the table, but the optimizer statistics indicated many rows because they were stale. Because of this stale information we picked hash distribution, and as a result only some of the consumer PX servers received rows. This made the statement slower because not all PX servers were doing work. This is one of the problems adaptive distribution methods in 12c try to fix.

To show what an adaptive distribution method is and how it works, I will use the same example from the older post and see how it behaves in 12c. You can go back and look at the post I linked, but as a reminder here are the tables we used.

create table c as
with t as (select rownum r from dual connect by level<=10000)
select rownum-1 id, rpad('X',100) pad from t,t where rownum<=10;

create table s as
with t as (select rownum r from dual connect by level<=10000)
select mod(rownum,10) id, rpad('X',100) pad from t,t where rownum<=10000000;

exec dbms_stats.set_table_stats(user,'C',numrows=>100000);
exec dbms_stats.gather_table_stats(user,'S');

Just like in the 11g example, I modified the optimizer statistics for table C to make them stale. Here is the same SQL statement I used before, this time without optimizer_features_enable set.

select /*+ parallel(8) leading(c) use_hash(s) */ count(*)
from c, s
where c.id=s.id;

Here is the SQL Monitor report for this query in 12.1. Rather than picking broadcast distribution for table C based on the optimizer statistics like in 11g, here the plan shows another distribution method, PX SEND HYBRID HASH, in lines 7 and 12. We also see a new plan step called STATISTICS COLLECTOR. These are used to adapt the distribution method at runtime based on the number of rows coming from table C. At runtime the query coordinator (QC) looks at the number of rows coming from table C: if the total is less than or equal to DOP*2, it decides to use broadcast distribution, as the cost of broadcasting a small number of rows is not high. If the number of rows from table C is greater than DOP*2, the QC decides to use hash distribution for table C. The distribution method for table S is determined by this decision: if table C is distributed by hash, so is table S; if table C is distributed by broadcast, table S is distributed by round-robin.

The QC finds out the number of rows from table C at runtime using the statistics collector. Each PX server scanning table C counts its rows in the statistics collector until it reaches a threshold; once it reaches the threshold it stops counting and the statistics collector is bypassed. The PX servers return their individual counts to the QC, and the QC makes the distribution decision for both tables. In this example table C is distributed by broadcast and table S by round-robin, as the number of rows from table C is 10 and the DOP is 8.
You can see this by looking at the number of rows from table C (line ID 10), which is 10, and the number of rows distributed at line ID 7, which is 80. 10 rows were scanned and 80 rows were distributed, because the DOP was 8 and all 10 rows were broadcast to 8 PX servers. For an easier way to find out the actual distribution method used at runtime, please see an earlier post that shows how to do it in SQL Monitor. If we look at the Parallel tab, we now see that all consumer PX servers perform a similar amount of work, as opposed to some of them staying idle in 11g. Another problem I mentioned before was using hash distribution when the data is skewed. We will look at how Oracle Database 12c solves that problem in a later post.
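If SQL Monitor is not handy, the same row counts can also be verified from the session that ran the statement using V$PQ_TQSTAT, a standard view that is populated per session after a parallel query completes:

SQL> select dfo_number, tq_id, server_type, process, num_rows
       from v$pq_tqstat
      order by dfo_number, tq_id, server_type, process;

For the example above you would expect the producers on the table queue for table C to report 10 rows in total and the consumers to report 80, which is exactly the broadcast behavior described here.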


Big Data SQL

Big Data SQL Quick Start. NoSQL databases - Part9.

It's not a secret that lately IT people have been talking a lot about NoSQL. Some even use it. NoSQL databases can have some advantages over an RDBMS (like scalability), but many NoSQL databases lack features that are quite common in an RDBMS (like transaction support and mature backup and recovery tools). Also, many NoSQL databases are schema-less, which can be an advantage (in terms of application development agility) but also a disadvantage (in terms of supportability). It's a long discussion that stays out of the scope of this blog, and you can easily find many posts and opinions about it on the web (I assume the reader is familiar with those). My personal opinion is that a NoSQL database can be useful for particular cases, but it rarely stands alone, which is why seamless integration with relational databases may be needed. Big Data SQL can provide it. Let's look at an example.

Say I have a huge company with tens of millions of customers. I store customer profiles in an HBase database, because:
- The number of customers is large (tens of millions)
- People and applications need low latency for read operations
- Profile tags (metrics) vary from case to case (the data is pretty sparse)
- Application developers want more flexibility and want to easily add new columns

I also have an Oracle Database as an analytical platform, and an existing BI application that doesn't know what HBase is. I just need to use the data from those profiles (stored in HBase) in my BI reporting. The overall picture looks like this:

Now let me show how this challenge can be resolved with Big Data SQL.

Create HBase structures and insert the data. Here is a simple example of creating an HBase table (with the HBase shell) and adding a couple of values:

# hbase shell
...
hbase(main):001:0> create 'emp', 'personal data', 'professional data'
=> Hbase::Table - emp
hbase(main):002:0> put 'emp','6179995','personal data:name','raj'
hbase(main):003:0> put 'emp','6179995','personal data:city','palo alto'
hbase(main):004:0> put 'emp','6179995','professional data:position','developer'
hbase(main):005:0> put 'emp','6179995','professional data:salary','50000'
hbase(main):007:0> put 'emp','6274234','personal data:name','alex'
hbase(main):008:0> put 'emp','6274234','personal data:city','belmont'
hbase(main):009:0> put 'emp','6274234','professional data:position','pm'
hbase(main):010:0> put 'emp','6274234','professional data:salary','60000'

After this we have the following data in the HBase table (here is a screenshot from the HUE HBase browser):

It looks like a table from a relational database, but HBase is more flexible. If a developer wants to add a new field to the table (the analog of a new column in a relational database), he just needs to use the same simple API:

hbase(main):001:0> put 'emp','6274234','professional data:skills','Hadoop'

and the new field immediately appears in the table. No DDL or metadata change is needed, just another put operation. Very flexible!

Create Hive and Oracle external tables. Now we have real data in HBase, and Hive gives us the opportunity to represent it as a table. Hive has a very powerful integration mechanism - StorageHandlers.
Using the HBase StorageHandler we define (as metadata in Hive) how to interpret the NoSQL data:

hive> CREATE EXTERNAL TABLE IF NOT EXISTS emp_hbase (
        rowkey STRING, ename STRING, city STRING, position STRING, salary STRING)
      STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
      WITH SERDEPROPERTIES ('hbase.columns.mapping' =
        ':key#binary, personal data:name, personal data:city, professional data:position, professional data:salary')
      TBLPROPERTIES('hbase.table.name' = 'emp');

Now we can use HQL (Hive Query Language) to access the HBase table. But our final goal is to use Oracle RDBMS SQL for this. That's not a problem; the only thing we need is to connect to the Oracle RDBMS and create an external table linked to the Hive table, like this:

SQL> CREATE TABLE emp_hbase (
       rowkey number,
       ename VARCHAR2(4000),
       city VARCHAR2(4000),
       position VARCHAR2(4000),
       salary number )
     ORGANIZATION EXTERNAL (
       TYPE ORACLE_HIVE
       DEFAULT DIRECTORY "DEFAULT_DIR"
       ACCESS PARAMETERS (
         com.oracle.bigdata.cluster=bds30
         com.oracle.bigdata.tablename=emp_hbase) )
     REJECT LIMIT UNLIMITED PARALLEL;

This explanation may seem a bit tangled, but I hope this diagram resolves any misunderstanding: the data is stored in HBase, and the metadata is stored in Hive and in the Oracle RDBMS.

Build a hybrid report. Let's imagine that one day a business user comes to you and asks: "I need a report of all sales by year for each position (job role)". To answer this question I need information from the STORE_SALES table (sales), EMP_HBASE (position) and DATE_DIM (year). These are three different tables - two dimension tables (date_dim and emp_hbase) and one fact table (store_sales) - and they can be joined by the following keys:

In my infrastructure I store STORE_SALES on HDFS in ORC format, the customer profiles (EMP_HBASE) in HBase, and the date dimension table (date_dim) in the Oracle RDBMS as a permanent table. The overall picture looks like this:

Thanks to Big Data SQL I can query all the data within a single query, like this:

SQL> SELECT e.position, d.d_year, SUM(s.ss_ext_wholesale_cost)
     FROM store_sales_orc s, emp_hbase e, date_dim d
     WHERE e.rowkey = s.ss_customer_sk
       AND s.ss_sold_date_sk = d.d_date_sk
       AND e.rowkey > 0
     GROUP BY e.position, d.d_year;

The query plan looks like a usual Oracle RDBMS plan. Bingo! One query covers three data sources.

Big Data SQL and NoSQL patterns and anti-patterns. After the previous example you may be so excited that you decide to always use a NoSQL database, for any use case. That would be the wrong conclusion. Many NoSQL databases (including HBase and Oracle NoSQL DB) work well when you access data by key - for example, fetching a value by key or scanning a small range of keys. On the opposite side are queries that don't use the key (and instead filter on one of the fields of the value). Let me demonstrate this. In my test setup I co-located HBase and Big Data SQL on a 3 node Hadoop cluster and used one server for the database.
I create an HBase table and load data into it from an HDFS Parquet file:

hive> CREATE TABLE IF NOT EXISTS fil.store_sales_hbase_bin (
        ss_sold_date_sk BIGINT, ss_sold_time_sk BIGINT, ss_item_sk BIGINT,
        ss_customer_sk BIGINT, ss_cdemo_sk BIGINT, ss_hdemo_sk BIGINT,
        ss_addr_sk BIGINT, ss_store_sk BIGINT, ss_promo_sk BIGINT,
        ss_ticket_number BIGINT, ss_quantity INT, ss_wholesale_cost DOUBLE,
        ss_list_price DOUBLE, ss_sales_price DOUBLE, ss_ext_discount_amt DOUBLE,
        ss_ext_sales_price DOUBLE, ss_ext_wholesale_cost DOUBLE,
        ss_ext_list_price DOUBLE, ss_ext_tax DOUBLE, ss_coupon_amt DOUBLE,
        ss_net_paid DOUBLE, ss_net_paid_inc_tax DOUBLE, ss_net_profit DOUBLE)
      STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
      WITH SERDEPROPERTIES ('hbase.columns.mapping' =
        'data:SS_SOLD_DATE_SK, data:SS_SOLD_TIME_SK,data:SS_ITEM_SK, data:SS_CUSTOMER_SK,data:SS_CDEMO_SK, data:SS_HDEMO_SK,data:SS_ADDR_SK,data:SS_STORE_SK, data:SS_PROMO_SK,:key#binary,data:SS_QUANTITY, data:SS_WHOLESALE_COST,data:SS_LIST_PRICE,data:SS_SALES_PRICE, data:SS_EXT_DISCOUNT_AMT,data:SS_EXT_SALES_PRICE, data:SS_EXT_WHOLESALE_COST,data:SS_EXT_LIST_PRICE, data:SS_EXT_TAX,data:SS_COUPON_AMT,data:SS_NET_PAID, data:SS_NET_PAID_INC_TAX,data:SS_NET_PROFIT');

hive> set hive.hbase.bulk=true;

hive> insert into fil.store_sales_hbase_bin select * from parq.store_sales;

After this I compare the performance of basic operations. HBase is co-located with the Hadoop servers (Big Data SQL is also installed on those servers), so it's a fair comparison because the number of disks and CPUs is the same (please note this is not an official benchmark, just an example). In total I have around 189 million rows in both tables.

HBase is really strong when you query data by key. In my DDL the column SS_TICKET_NUMBER is the HBase key:

SQL> SELECT COUNT(1) FROM store_sales_hbase_small WHERE SS_TICKET_NUMBER=187378869;

Because of this, the query took less than one second. You can also run a lot of simple queries over HBase concurrently (but make sure you have the key in the where predicate). The same query over the Parquet file:

SQL> SELECT COUNT(1) FROM store_sales_parq_small WHERE SS_TICKET_NUMBER=187378869;

takes about 8 seconds. But if we try to query HBase using non-key columns, we get a full scan and, as a consequence, very low performance; Parquet files work much faster in that case. I also converted the same table to CSV format and ran a few tests. You can find the conclusions for a few types of queries in the table below.

Table. Example of performance numbers:

Concurrency. As you may have noticed, the combination of NoSQL + Big Data SQL does pretty well when it reads data by the NoSQL key, but how scalable is it? To answer this question I ran a simple test: I fired the same query with different concurrency (number of simultaneous queries) and got the following results:

Number of simultaneous queries   Average elapsed time, sec
10                               1.2
15                               1.3
20                               1.9
25                               2.2
30                               3
45                               4.3
60                               6.2

To conclude, let me summarize the findings regarding NoSQL databases and Big Data SQL:
1) You can query NoSQL data with Oracle SQL through Big Data SQL.
2) You can run many concurrent queries over NoSQL databases.
3) You get good performance as long as you filter on the NoSQL key column (the column mapped to the NoSQL key).
4) You can query data using non-key columns, but performance will be low.
5) A full scan of a NoSQL database is slower than a full scan of a text file or other Hadoop file formats, because of the extra overhead of the NoSQL DB API.
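The post describes what happens when the where clause does not use the HBase key, but does not show such a statement. For completeness, a minimal example against the same external table (SS_QUANTITY is one of the value columns in the DDL above, not the row key; the actual timing will of course depend on your cluster):

SQL> SELECT COUNT(1) FROM store_sales_hbase_small WHERE SS_QUANTITY=82;

Because HBase cannot use the row key here, every row has to be fetched through the storage handler and filtered afterwards, so the elapsed time is dominated by the full scan rather than by a key lookup, which is exactly conclusion 4 above.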


Big Data SQL

Big Data SQL Quick Start. Data types - Part8.

Today I'm going to share one of the easiest ways to improve overall Big Data SQL performance. Big Data SQL is a complex system that contains two main pieces - the Database and Hadoop. Each side has its own datatypes - Oracle RDBMS types and Java types. Every time you query Hadoop data through the Oracle RDBMS you convert data types, and data conversion is a very CPU-expensive operation.

AVRO, RCFile, ORC and Parquet files. Let's zoom into the Hadoop server piece: there are a few components there - the Hadoop part (Data Node) and the Oracle part (Smart Scan). We also have the "External Table Service" (part of the Big Data SQL software); this is where datatype transformation happens (read: this is where we spend a lot of CPU). But there is good news: if you already did the ETL once and transformed the source data into Parquet or ORC files, you can reap the benefit of that transformation. With a proper mapping of Hive datatypes to Oracle datatypes you simply pass the data through without any transformation. The compatibility matrix can be found here. But it's always better to see once than to hear a hundred times, so let me give you an example. I have a table in Hive (ORC file):

hive> show create table store_sales;
OK
CREATE TABLE store_sales(
  ss_sold_date_sk bigint, ss_sold_time_sk bigint, ss_item_sk bigint,
  ss_customer_sk bigint, ss_cdemo_sk bigint, ss_hdemo_sk bigint,
  ss_addr_sk bigint, ss_store_sk bigint, ss_promo_sk bigint,
  ss_ticket_number bigint, ss_quantity int, ss_wholesale_cost double,
  ss_list_price double, ss_sales_price double, ss_ext_discount_amt double,
  ss_ext_sales_price double, ss_ext_wholesale_cost double,
  ss_ext_list_price double, ss_ext_tax double, ss_coupon_amt double,
  ss_net_paid double, ss_net_paid_inc_tax double, ss_net_profit double)
ROW FORMAT SERDE 'org.apache.hadoop.hive.ql.io.orc.OrcSerde'
STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.orc.OrcInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat'
LOCATION 'hdfs://bds30-ns/user/hive/warehouse/orc.db/store_sales'

and I create two external tables in the Oracle RDBMS. The first table maps the Hive double columns to Oracle NUMBER(7,2):

SQL> CREATE TABLE STORE_SALES_ORC_NUM (
       SS_SOLD_DATE_SK NUMBER(10,0), SS_SOLD_TIME_SK NUMBER(10,0), SS_ITEM_SK NUMBER(10,0),
       SS_CUSTOMER_SK NUMBER(10,0), SS_CDEMO_SK NUMBER(10,0), SS_HDEMO_SK NUMBER(10,0),
       SS_ADDR_SK NUMBER(10,0), SS_STORE_SK NUMBER(10,0), SS_PROMO_SK NUMBER(10,0),
       SS_TICKET_NUMBER NUMBER(10,0), SS_QUANTITY NUMBER(10,0), SS_WHOLESALE_COST NUMBER(7,2),
       SS_LIST_PRICE NUMBER(7,2), SS_SALES_PRICE NUMBER(7,2), SS_EXT_DISCOUNT_AMT NUMBER(7,2),
       SS_EXT_SALES_PRICE NUMBER(7,2), SS_EXT_WHOLESALE_COST NUMBER(7,2),
       SS_EXT_LIST_PRICE NUMBER(7,2), SS_EXT_TAX NUMBER(7,2), SS_COUPON_AMT NUMBER(7,2),
       SS_NET_PAID NUMBER(7,2), SS_NET_PAID_INC_TAX NUMBER(7,2), SS_NET_PROFIT NUMBER(7,2) )
     ORGANIZATION EXTERNAL (
       TYPE ORACLE_HIVE
       DEFAULT DIRECTORY DEFAULT_DIR
       ACCESS PARAMETERS (
         com.oracle.bigdata.cluster=bds30
         com.oracle.bigdata.tablename=orc.store_sales) )
     REJECT LIMIT UNLIMITED PARALLEL;

For the second table I define the datatypes according to the mapping matrix, using Oracle BINARY_DOUBLE instead of NUMBER(7,2) for the Hive DOUBLE columns:

SQL> CREATE TABLE STORE_SALES_ORC (
       SS_SOLD_DATE_SK NUMBER(10,0), SS_SOLD_TIME_SK NUMBER(10,0), SS_ITEM_SK NUMBER(10,0),
       SS_CUSTOMER_SK NUMBER(10,0), SS_CDEMO_SK NUMBER(10,0), SS_HDEMO_SK NUMBER(10,0),
       SS_ADDR_SK NUMBER(10,0), SS_STORE_SK NUMBER(10,0), SS_PROMO_SK NUMBER(10,0),
       SS_TICKET_NUMBER NUMBER(10,0), SS_QUANTITY NUMBER(10,0), SS_WHOLESALE_COST BINARY_DOUBLE,
       SS_LIST_PRICE BINARY_DOUBLE, SS_SALES_PRICE BINARY_DOUBLE, SS_EXT_DISCOUNT_AMT BINARY_DOUBLE,
       SS_EXT_SALES_PRICE BINARY_DOUBLE, SS_EXT_WHOLESALE_COST BINARY_DOUBLE,
       SS_EXT_LIST_PRICE BINARY_DOUBLE, SS_EXT_TAX BINARY_DOUBLE, SS_COUPON_AMT BINARY_DOUBLE,
       SS_NET_PAID BINARY_DOUBLE, SS_NET_PAID_INC_TAX BINARY_DOUBLE, SS_NET_PROFIT BINARY_DOUBLE )
     ORGANIZATION EXTERNAL (
       TYPE ORACLE_HIVE
       DEFAULT DIRECTORY DEFAULT_DIR
       ACCESS PARAMETERS (
         com.oracle.bigdata.cluster=bds30
         com.oracle.bigdata.tablename=orc.store_sales) )
     REJECT LIMIT UNLIMITED PARALLEL;

For the performance test I ran 10 concurrent queries, each filtering on a column that is mapped to NUMBER in one table and to BINARY_DOUBLE in the other. Like this:

SQL> SELECT COUNT(1) FROM STORE_SALES_ORC WHERE ss_net_paid_inc_tax=:bind

and this:

SQL> SELECT COUNT(1) FROM STORE_SALES_ORC_NUM WHERE ss_net_paid_inc_tax=:bind

All 10 queries finished at almost the same time.

Mapping type                              Is it proper?   Elapsed time
double (hive) -> number (oracle)          No              16.1 min
double (hive) -> binary_double (oracle)   Yes             10.8 min

So I get better performance with the proper datatype mapping, but what's going on under the hood? Let's check the graphs. CPU consumption is very high in both cases. But since we are CPU bound, IO throughput becomes the more interesting metric: in the first case we perform a complex transformation and spend more CPU time on the cell side, and because of that we cannot read faster (we are stuck on CPU). The second query doesn't transform the data; it just passes it to Oracle Smart Scan as is.

TextFiles and Sequence Files. With AVRO, RCFile, ORC and Parquet it matters what you actually store on HDFS, but Textfile and SequenceFile work in a completely different way. The Hadoop InputFormat for a CSV file reads a byte stream, finds the text rows (normally terminated by a newline) and then parses off the columns. Step by step it works like this:
1. The Java part of "External Table Services" reads HDFS blocks and passes a byte buffer up to C.
2. The C part of "External Table Services" parses the buffer for newlines to find a row.
3. The C part of "External Table Services" parses the row for "|" to find each column value, which is always a string, like "-11.52".
4. The C part of "External Table Services" then converts that string, "-11.52", to an Oracle NUMBER.

The difference here is that the conversion from string to Oracle NUMBER (step 4) is much more efficient than the conversion from string to an IEEE 754 binary floating point value (Oracle BINARY_DOUBLE). And of course I ran a test (over CSV files) to prove this.
As in the example above, I created two Oracle RDBMS tables: one using NUMBER, the second using BINARY_DOUBLE.

SQL> CREATE TABLE STORE_SALES_CSV_NUM (
       SS_SOLD_DATE_SK NUMBER(10,0), SS_SOLD_TIME_SK NUMBER(10,0), SS_ITEM_SK NUMBER(10,0),
       SS_CUSTOMER_SK NUMBER(10,0), SS_CDEMO_SK NUMBER(10,0), SS_HDEMO_SK NUMBER(10,0),
       SS_ADDR_SK NUMBER(10,0), SS_STORE_SK NUMBER(10,0), SS_PROMO_SK NUMBER(10,0),
       SS_TICKET_NUMBER NUMBER(10,0), SS_QUANTITY NUMBER(10,0), SS_WHOLESALE_COST NUMBER(7,2),
       SS_LIST_PRICE NUMBER(7,2), SS_SALES_PRICE NUMBER(7,2), SS_EXT_DISCOUNT_AMT NUMBER(7,2),
       SS_EXT_SALES_PRICE NUMBER(7,2), SS_EXT_WHOLESALE_COST NUMBER(7,2),
       SS_EXT_LIST_PRICE NUMBER(7,2), SS_EXT_TAX NUMBER(7,2), SS_COUPON_AMT NUMBER(7,2),
       SS_NET_PAID NUMBER(7,2), SS_NET_PAID_INC_TAX NUMBER(7,2), SS_NET_PROFIT NUMBER(7,2) )
     ORGANIZATION EXTERNAL (
       TYPE ORACLE_HIVE
       DEFAULT DIRECTORY DEFAULT_DIR
       ACCESS PARAMETERS (
         com.oracle.bigdata.cluster=bds30
         com.oracle.bigdata.tablename=csv.store_sales) )
     REJECT LIMIT UNLIMITED PARALLEL;

For the second table I defined Oracle BINARY_DOUBLE instead of NUMBER(7,2):

SQL> CREATE TABLE STORE_SALES_CSV (
       SS_SOLD_DATE_SK NUMBER(10,0), SS_SOLD_TIME_SK NUMBER(10,0), SS_ITEM_SK NUMBER(10,0),
       SS_CUSTOMER_SK NUMBER(10,0), SS_CDEMO_SK NUMBER(10,0), SS_HDEMO_SK NUMBER(10,0),
       SS_ADDR_SK NUMBER(10,0), SS_STORE_SK NUMBER(10,0), SS_PROMO_SK NUMBER(10,0),
       SS_TICKET_NUMBER NUMBER(10,0), SS_QUANTITY NUMBER(10,0), SS_WHOLESALE_COST BINARY_DOUBLE,
       SS_LIST_PRICE BINARY_DOUBLE, SS_SALES_PRICE BINARY_DOUBLE, SS_EXT_DISCOUNT_AMT BINARY_DOUBLE,
       SS_EXT_SALES_PRICE BINARY_DOUBLE, SS_EXT_WHOLESALE_COST BINARY_DOUBLE,
       SS_EXT_LIST_PRICE BINARY_DOUBLE, SS_EXT_TAX BINARY_DOUBLE, SS_COUPON_AMT BINARY_DOUBLE,
       SS_NET_PAID BINARY_DOUBLE, SS_NET_PAID_INC_TAX BINARY_DOUBLE, SS_NET_PROFIT BINARY_DOUBLE )
     ORGANIZATION EXTERNAL (
       TYPE ORACLE_HIVE
       DEFAULT DIRECTORY DEFAULT_DIR
       ACCESS PARAMETERS (
         com.oracle.bigdata.cluster=bds30
         com.oracle.bigdata.tablename=csv.store_sales) )
     REJECT LIMIT UNLIMITED PARALLEL;

For the performance test I used a full table scan query (in fact, the query that gathers statistics):

SQL> SELECT to_char(COUNT(SS_SOLD_DATE_SK)),
            substrb(dump(MIN(SS_SOLD_DATE_SK),16,0,64),1,240),
            substrb(dump(MAX(SS_SOLD_DATE_SK),16,0,64),1,240),
            ...
            to_char(COUNT(SS_NET_PROFIT)),
            substrb(dump(MIN(SS_NET_PROFIT),16,0,64),1,240),
            substrb(dump(MAX(SS_NET_PROFIT),16,0,64),1,240)
     FROM STORE_SALES_CSV

and the same query against the second table:

SQL> SELECT to_char(COUNT(SS_SOLD_DATE_SK)),
            substrb(dump(MIN(SS_SOLD_DATE_SK),16,0,64),1,240),
            substrb(dump(MAX(SS_SOLD_DATE_SK),16,0,64),1,240),
            ...
            to_char(COUNT(SS_NET_PROFIT)),
            substrb(dump(MIN(SS_NET_PROFIT),16,0,64),1,240),
            substrb(dump(MAX(SS_NET_PROFIT),16,0,64),1,240)
     FROM STORE_SALES_CSV_NUM

The performance numbers are very different:

Datatype transformation                 Elapsed time   Comments
hive(string) -> Oracle(NUMBER)          18 mins        String -> Oracle NUMBER transformation
hive(string) -> Oracle(BINARY_DOUBLE)   64 mins        String -> Oracle BINARY_DOUBLE transformation, which is very expensive

It's also obvious from the graphs that we spend far more CPU in the BINARY_DOUBLE case.

Important: if you define BINARY_DOUBLE in the Oracle RDBMS and the data has a double type in Parquet, ORC, RC or AVRO, you don't do any conversion - you just pass the data directly to Oracle Smart Scan.
With Textfile or Sequencefile you always perform a data transformation (because the text input format is treated as a string), so there you have to choose the cheapest conversion, which is NUMBER(7,2) rather than BINARY_DOUBLE.
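One way to audit these mappings without reading every DDL is the Hive metadata dictionary views that ship with Big Data SQL. A sketch of such a check (I'm assuming the usual ALL_HIVE_COLUMNS column names here, so treat the exact projection as illustrative):

SQL> SELECT column_name, hive_column_type, oracle_column_type
     FROM   all_hive_columns
     WHERE  database_name = 'orc'
       AND  table_name = 'store_sales';

Comparing that output with the external table definitions makes it easy to spot columns where an avoidable conversion (for example, a Hive double mapped to NUMBER) is being paid on every scan.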


Single PX Server Execution

I recently helped a customer tune a few queries in their Oracle Database In-Memory POC. I want to talk about a simplified version of one of their queries, as it is a nice tuning example and also a good opportunity to talk about a new parallel execution feature introduced in Oracle Database 12c. Let me start with what the feature is and then look at the performance problem this particular customer was having.

Single PX server execution

SQL constructs like rownum can introduce serialization points in the execution plan of a SQL statement. Until 12c these serial steps were executed by the query coordinator (QC). This can cause several issues:

1. The query coordinator (QC) will be doing processing for the serial steps. This can impact performance negatively as the QC will be busy doing processing rather than coordination.
2. The statement can use multiple parallelizers, which introduce their own implications.
3. The serialization point can stop the parallelization of the subsequent plan steps, again leaving all further processing to the QC.

Here is an example showing some of these issues, from an 11.2.0.4 database. I create a table with 5M rows. The query selects the top 10 rows from this table based on column ID, then joins that result set to the same table.

create table t as
with ttemp as (select rownum r from dual connect by level<=10000)
select rownum id, rpad('X',100) pad from ttemp,ttemp where rownum<=5000000;

select /*+ parallel(2) */ count(*)
from ( select * from (select id from t order by id desc) where rownum<=10 ) t1, t
where t1.id=t.id;

The rownum predicate introduces a serialization point in line ID 12. The data is sent to the QC at that step and all operations in line IDs 8-11 are executed by the QC. In line ID 11 we see that an additional parallelizer has been used.

12c introduces the concept of a single PX server executing parts of a plan instead of the QC. A DFO can be executed by a single PX server to free the QC so that it can do its coordination job. This also prevents the statement from using extra parallelizers. Here is the plan for the same statement in 12c. There is a new distribution method, PX SEND 1 SLAVE, in line ID 12. This is the same serialization point caused by the rownum predicate just like in 11g, but this time the data is sent to a single PX server rather than the QC. Operations in line IDs 7-11 are executed by this single PX server, which takes the load off the QC. Also note that there is no extra parallelizer caused by the serialization point.

Tuning exercise

Let's now look at the issue this customer was having with single PX server execution. I start with the same table with 5M rows, this time declared as inmemory to match the customer's table definition.

create table t inmemory no duplicate distribute by rowid range as
with ttemp as (select rownum r from dual connect by level<=10000)
select rownum id, rpad('X',100) pad from ttemp,ttemp where rownum<=5000000;

Again, the query given below takes the top 10 rows based on column ID from table T and joins that result set to the same table. In the customer's case there were a lot more joins; I simplified the case as the other parts of the query were irrelevant to this tuning exercise. This is on a 2-node RAC database.

select /*+ parallel(2) */ count(*)
from ( select * from (select id, rownum from t order by id desc) where rownum<=10 ) t1, t
where t1.id=t.id;

Here is the SQL Monitor report for this query. Line IDs 13 and 14 are interesting: they account for 50% of the activity, sending and receiving 5M rows.
The distribution method, PX SEND 1 SLAVE, is a new distribution method in 12c which sends all rows to a single PX server. This means 5M rows were sent to one PX server in those plan steps. It also means all the steps in that specific DFO (line IDs 7-13) were executed by that single PX server, even though SQL Monitor marks those steps as parallel with the red people icon. If you look at the Parallel tab in SQL Monitor, you see that one PX server (p001) in parallel set 2 does all the work in that set.

The query is actually asking for the top 10 rows from table T in the inner query, so why does it send all 5M rows to a single PX server rather than sending only the top 10 rows from each producer PX server? This is because the select list contains rownum in addition to the table columns; the evaluation of that select list is done serially, so all the rows are sent to a single process. The rownum expression in the select list is not used anywhere in the outer query, which is only asking for the count of rows. So, in this case it is safe to remove rownum from the select list, which gives us the following SQL Monitor report.

select /*+ parallel(2) */ count(*)
from ( select * from (select id from t order by id desc) where rownum<=10 ) t1, t
where t1.id=t.id;

This time we see that only 19 rows were sent to a single PX server in line ID 12. This is because the rownum predicate is pushed down to the PX servers doing the scan of the table (line ID 17), and only the top 10 rows from each PX server are sent further. Distributing these few rows, as opposed to 5M rows before, improves the elapsed time dramatically. In addition, the single PX server executing line IDs 7-11 works on only a few rows, which also takes less time than before. The elapsed time is now 3 seconds compared to 11 seconds before.
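If SQL Monitor is not available, the same single-PX-server steps can be spotted in a plain cursor plan. A quick check from the session that just ran the query (DBMS_XPLAN.DISPLAY_CURSOR with the PARALLEL format option is standard; the exact format string below is just one reasonable choice):

SQL> select * from table(dbms_xplan.display_cursor(null, null, 'BASIC +PARALLEL'));

Any operation named PX SEND 1 SLAVE marks a DFO that will be executed by a single PX server, so the row counts feeding into it are a good first place to look when a parallel plan is slower than expected.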


Data Warehousing

New pattern matching tutorial on LiveSQL

If you have always wanted to try our new SQL pattern matching feature, MATCH_RECOGNIZE, but never had access to a Database 12c instance, then you really need to check out our great new LiveSQL playground environment. LiveSQL is a great place to learn about all the new features of Database 12c along with all the existing features from earlier releases. The new tutorial is called “Log file sessionization analysis with MATCH_RECOGNIZE” and you can view it by clicking here. This tutorial is designed to show how you can run sessionization analysis on application logs, web logs, etc. Using a simplified table of click data, you create a sessionization data set which tracks each session, the duration of the session and the number of clicks/events. It’s all fully annotated and there are links to supporting topics where you can get more information. The objective is to introduce you to some of the important keywords and concepts that are part of the MATCH_RECOGNIZE clause. I have put together a quick video that shows how to access my new pattern matching tutorial, so simply click on the image below to access the video:

There are lots of code samples and tutorials for analytical SQL already loaded and available for you to run. Ever wanted to do a mortgage calculation using the SQL Model clause? Carsten Czarski, Oracle Database Developer guru, has created a great script that you can run to see how it works; just follow this link - https://livesql.oracle.com/apex/livesql/file/content_CA67VTHEVZZPYG94E5HWG6RWC.html. The AskTom team (Connor and Chris) have uploaded scripts that they created in response to questions posted on the AskTom forum. For example, here is the AskTom answer for finding the min/max rows within price change groups: https://livesql.oracle.com/apex/livesql/file/content_CK3GKD93H9AE0MJ3QSAQUP97N.html. I hope this is useful, and don’t forget to share your own scripts and tutorials: this site is community driven, so the more you upload the more we all learn.

Looking for more information? Use the tag search to see more information about pattern matching, SQL Analytics or Database 12c.
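To give a flavour of what the tutorial builds up to, here is a minimal sessionization sketch of my own (the clicks table, its user_id and click_time columns, and the 10-minute inactivity rule are all hypothetical, not the tutorial's actual data set):

SELECT user_id, session_id, start_time, end_time, num_events
FROM   clicks
MATCH_RECOGNIZE (
  PARTITION BY user_id
  ORDER BY click_time
  MEASURES MATCH_NUMBER()    AS session_id,
           FIRST(click_time) AS start_time,
           LAST(click_time)  AS end_time,
           COUNT(*)          AS num_events
  ONE ROW PER MATCH
  PATTERN (strt same_session*)
  DEFINE same_session AS click_time <= PREV(click_time) + INTERVAL '10' MINUTE
);

Each match is one session: rows keep extending the current session for as long as the gap between consecutive clicks stays under ten minutes, and the MEASURES clause returns the session boundaries and click count, which is essentially the output the tutorial walks you through.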


Big Data SQL

Big Data SQL Quick Start. Partition Pruning - Part7.

Partitioning is a very common technique in data warehousing and in all kinds of databases. I assume the reader knows what partitioning is, so I will not explain the theoretical part; if you want, you can look at the Oracle RDBMS example. I'll start directly with the practical part.

Hive partitioning. Hive was originally created as an easy way to write MapReduce jobs over HDFS. HDFS is a file system with a Linux-like structure, so it's natural that a partition, in this case, is just a sub-directory. I used the Intel BigBench dataset for creating a partitioned Hive table. I took two tables - the big fact table store_sales and the small dimension table date_dim. They have the following relationship: the fact table (store_sales) doesn't have a clear time identifier; it is related to the dimension (dictionary) table date_dim, which has columns for explicit date definitions (d_dom - day of month, d_moy - month of year, d_year - year). Now I'm going to create a partitioned store_sales table:

hive> CREATE TABLE store_sales_part(
        ss_sold_date_sk bigint,
        ...
        ss_net_profit double)
      partitioned by (year INT, month INT, day INT)
      stored as ORC;

The statement above creates a partitioned table with 3 virtual columns - year, month, day. Now I will insert data into this Hive table (I added a few parameters which are mandatory for dynamic partitioning):

hive> SET hive.exec.dynamic.partition=true;
hive> SET hive.exec.dynamic.partition.mode=nonstrict;
hive> SET hive.exec.max.dynamic.partitions=10000;
hive> INSERT INTO TABLE store_sales_part PARTITION (year, month, day)
      SELECT store_sales.*, dt.d_year, dt.d_moy, dt.d_dom
      FROM store_sales, date_dim dt
      WHERE dt.d_date_sk = store_sales.ss_sold_date_sk;

After this insert, I want to check the file distribution on HDFS:

$ hadoop fs -du -h /user/hive/warehouse/orc.db/store_sales_part/*/*/|tail -2
168.5 M 505.5 M /user/hive/warehouse/orc.db/store_sales_part/year=2005/month=9/day=8
168.7 M 506.0 M /user/hive/warehouse/orc.db/store_sales_part/year=2005/month=9/day=9

So the new table store_sales_part has three virtual columns that are not actually stored on disk (and don't occupy space), but can be used to avoid unnecessary IO. Those columns can also be queried from the Hive console:

hive> select ss_sold_date_sk, year, month, day from store_sales_part limit 2;
OK
36890 2001 1 1
36890 2001 1 1

Great! Now let's turn to the Oracle RDBMS and create a table there that is linked to this Hive table:

SQL> CREATE TABLE STORE_SALES_ORC_PART (
       SS_SOLD_DATE_SK NUMBER(10,0),
       ....
       SS_NET_PROFIT BINARY_DOUBLE,
       YEAR NUMBER,
       MONTH NUMBER,
       DAY NUMBER)
     ORGANIZATION EXTERNAL (
       TYPE ORACLE_HIVE
       DEFAULT DIRECTORY DEFAULT_DIR
       ACCESS PARAMETERS (
         com.oracle.bigdata.cluster=bds30
         com.oracle.bigdata.tablename=orc.store_sales_part) )
     REJECT LIMIT UNLIMITED PARALLEL;

Now in Oracle we have an external table with columns that can be used to prune unnecessary partitions. Let's verify this! Query the table without a partition predicate:

SQL> SELECT COUNT(1) FROM STORE_SALES_ORC_PART

and after the query has finished, check the statistics:

SQL> SELECT n.name, round(s.value / 1024 / 1024 / 102