
Information, tips, tricks and sample code for Big Data Warehousing in an autonomous, cloud-driven world

Recent Posts

Big Data

Oracle Big Data SQL 4.0 - Great New Performance Feature

Big Data SQL 4.0 introduces a data processing enhancement that can have a dramatic impact on query performance: distributed aggregation using in-memory capabilities.

Big Data SQL has always done a great job of filtering data on the Hadoop cluster. It does this using the following optimizations: 1) column projection, 2) partition pruning, 3) storage indexes and 4) predicate pushdown.

Column projection is the first optimization. If your table has 200 columns and you are only selecting one, then only a single column's data will be transferred from the Big Data SQL Cell on the Hadoop cluster to the Oracle Database. This optimization is applied to all file types: CSV, Parquet, ORC, Avro, etc.

The image below shows the other parts of the data elimination steps. Let's say you are querying a 100TB data set.

Partition Pruning: Hive partitions data by a table's column(s). If you have two years of data and your table is partitioned by day, and the query is only selecting two months, then in this example 90% of the data will be "pruned", or not scanned.

Storage Index: SIs are a fine-grained data elimination technique. Statistics are collected for each file's data blocks based on query usage patterns, and these statistics are used to determine whether or not it's possible that data for the given query is contained within that block. If the data does not exist in that block, then the block is not scanned (remember, a block can represent a significant amount of data - oftentimes 128MB). This information is automatically maintained and stored in a lightweight, in-memory structure.

Predicate Pushdown: Certain file types, like Parquet and ORC, are really database files. Big Data SQL is able to push predicates into those files and only retrieve the data that meets the query criteria.

Once those scan elimination techniques are applied, Big Data SQL Cells will process and filter the remaining data, returning the results to the database.

In-Memory Aggregation

In-memory aggregation has the potential to dramatically speed up queries. Prior to Big Data SQL 4.0, Oracle Database performed the aggregation over the filtered data sets that were returned by Big Data SQL Cells. With in-memory aggregation, summary computations are run across the Hadoop cluster data nodes; the massive compute power of the cluster is used to perform aggregations.

Below, detailed activity is captured at the customer location level; the query is asking for a summary of activity by region and month. When the query is executed, processing is distributed to each data node on the Hadoop cluster. Data elimination techniques and filtering are applied, and then each node will aggregate the data up to region/month. This aggregated data is then returned to the database tier from each cell, and the database then completes the aggregation and applies other functions.

Big Data SQL is using an extension to the in-memory aggregation functionality offered by Oracle Database. Check out the documentation for details on the capabilities and where you can expect a good performance gain. The results can be rather dramatic, as illustrated by the chart found below.

This test compares running the same queries with aggregation offload disabled and then enabled. It shows 1) a simple, single-table "count(*)" query, 2) a query against a single table that performs a group by, and 3) a query that joins a dimension table to a fact table. The second and third examples also show increasing the number of columns accessed by the query.
In this simple test, performance improved by between 13x and 36x :-). Lots of great new capabilities in Big Data SQL 4.0. This one may be my favorite :-).
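To make the idea concrete, the kind of statement that benefits from aggregation offload is a group-by over an external table that maps to data stored in Hadoop. The sketch below is only illustrative: the table and column names are hypothetical, and whether the aggregation is actually offloaded depends on the criteria described in the in-memory aggregation documentation.

-- Hypothetical external table over Hive/Hadoop data holding detailed customer activity.
-- With aggregation offload enabled, the GROUP BY work can run on the Hadoop data nodes,
-- so each cell returns one row per region/month instead of the filtered detail rows.
SELECT region,
       TRUNC(activity_date, 'MM') AS activity_month,
       SUM(activity_amount)       AS total_activity
FROM   customer_activity_ext
WHERE  activity_date >= DATE '2019-01-01'
GROUP  BY region, TRUNC(activity_date, 'MM');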


Autonomous

SQL Developer Web comes to Autonomous Data Warehouse - oh YES!

If you log in to your cloud console and create a new autonomous data warehouse, or if you have an existing data warehouse instance, then there is great news - you can now launch SQL Developer Web directly from the service console. There is no need to download and install the full desktop version of SQL Developer anymore. If you want a quick overview of this feature then there is a great video by Jeff Smith (Oracle Product Manager for SQL Developer) on YouTube: https://www.youtube.com/watch?v=asHlUW-Laxk. In the video Jeff gives an overview and a short demonstration of this new UI.

ADMIN-only Access

First off, straight out of the box, only the ADMIN user can access SQL Developer Web - which makes perfect sense when you think about it! Therefore, the ADMIN user is always going to be the first person to connect to SQL Dev Web, and then they enable access for other users/schemas as required. A typical autonomous workflow will look something like this:

Create a new ADW instance
Open Service Console
Connect to SQL Dev Web as ADMIN user
Enable each schema/user via the ords_admin.enable_schema package
Send schema-specific URL to each developer

Connecting as the ADMIN user

From the Administration tab on the service console you will see that we added two new buttons - one to access APEX (more information here) and one to access SQL Developer Web. As this is on the Administration tab, the link for SQL Developer Web, not surprisingly, provides a special admin-only URL which, once you are logged in as the ADMIN user, brings you to the home screen. The ADMIN user also has some additional features enabled for monitoring their autonomous data warehouse via the hamburger menu in the top left corner.

The Dashboard view displays general status information about the data warehouse:

Database Status: Displays the overall status of the database.
Alerts: Displays the number of Error alerts in the alert log.
Database Storage: Displays how much storage is being used by the database.
Sessions: Displays the status of open sessions in the database.
Physical IO Panel: Displays the rates of physical reads and writes of database data.
Waits: Displays how many wait events are occurring in the database for various reasons.
Quick Links: Provides buttons to open the Worksheet and Data Modeler. It also provides a button to open the Oracle Application Express sign-in page for the current database.

Home Page View

This page has some cool features - there is a timeline that tracks when objects got added to the database, and there is an associated quick-glance view that shows the status of those objects, so that if it's a table you know whether it's been automatically analyzed and the stats are up to date.

Enabling the users/schemas

To allow a developer to access their schema and log in, the ADMIN user has to run a small PL/SQL script to enable the schema; that process is outlined here: https://docs.oracle.com/en/database/oracle/sql-developer-web/19.1/sdweb/about-sdw.html#GUID-A79032C3-86DC-4547-8D39-85674334B4FE (a minimal sketch of this call is included a little further down). Once that's done the ADMIN user can provide the developer with their personal URL to access SQL Developer Web. Essentially, this developer URL is the same as the URL the ADMIN user gets from the service console, but with the /admin/ segment of the URL replaced by the /schema-alias/ specified during the "enable-user-access" step. The doc lays this out very nicely.

Guided Demo

Overall, adding SQL Dev Web to Autonomous Data Warehouse is going to make life so much easier for DBAs and developers.
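As a reference for the enable step mentioned above, the ords_admin.enable_schema call looks roughly like the following. This is a minimal sketch: the schema name SALES_DEV and the URL alias salesdev are hypothetical, so check the linked documentation for the exact parameters in your version.

-- Run as the ADMIN user; names below are illustrative only.
BEGIN
  ords_admin.enable_schema(
    p_enabled             => TRUE,
    p_schema              => 'SALES_DEV',   -- hypothetical schema to enable
    p_url_mapping_type    => 'BASE_PATH',
    p_url_mapping_pattern => 'salesdev',    -- becomes the /salesdev/ URL segment
    p_auto_rest_auth      => TRUE
  );
  COMMIT;
END;
/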
For most in-database tasks, SQL Developer Web can now be the go-to interface, which means you don't have to download and install a desktop tool (which in most corporate environments creates all sorts of problems due to locked-down Windows and Mac desktops).

Where to get more information

When it comes to SQL Developer there is only one URL you need and it belongs to Jeff Smith, who is the product manager for SQL Developer: https://www.thatjeffsmith.com/. Jeff's site contains everything you could ever want to know about using SQL Developer Desktop and SQL Developer Web. There are overview videos, tutorial videos, feature videos, tips & tricks and more. Have fun with SQL Developer Web and Autonomous Data Warehouse!


Autonomous

APEX comes to Autonomous Data Warehouse - oh YES!

A big "Autonomous Welcome" to all our APEX developers because your favorite low-code development environment is now built in Autonomous Data Warehouse. And before you ask - YES, if you have existing autonomous data warehouse instances you will find an APEX launch button got added to the Admin tab on your service console (see the screen capture below).  APEX comes to ADW. YES! APEX is now bundled with Autonomous Data Warehouse (even existing data warehouse instances have been updated). What does this mean? It means that you now have free access to  Oracle’s premiere low-code development platform: Application Express (APEX). This provides a low-code development environment that enables customers and partners to build stunning, scalable, secure apps with world-class features fully supported by Autonomous Database.   As an application developer you can now benefit from a simple but very powerful development platform powered by an autonomous database. It’s the perfect combination of low-code development meets zero management database. You can focus on building rich, sophisticated applications with APEX and the database will take care of itself. There are plenty of great use cases for APEX combined with Autonomous Database and from a data warehouse perspective 2 key ones stand out: 1) A replacement for a data mart built around spreadsheets We all done it at some point in our careers - used spreadsheets to build business critical applications and reporting systems. We all know this approach is simply a disaster waiting to happen! Yet almost every organization utilizes spreadsheets to share and report on data. Why? Because spreadsheets are so easy to create - anyone can put together a spreadsheet once they have the data. Once created they often send it out to colleagues who then tweak the data and pass it on to other colleagues, and so forth. This inevitably leads to numerous copies with different data and very a totally flawed business processes. A far better solution is to have a single source of truth stored in a fully secured database with a browser-based app that everyone can use to maintain the data. Fortunately, you now have one! Using Autonomous Data Warehouse and APEX any user can go from a spreadsheet to web app in a few clicks.  APEX provides a very powerful but easy to use wizard that in just a few clicks can transform your spreadsheet into a fully-populated table in Oracle Autonomous Data Warehouse, complete with a fully functioning app with a report and form for maintaining the data. One of the key benefits of switching to APEX is that your data becomes completely secure. The Autonomous Data Warehouse automatically encrypts data at rest and in transit, you can apply data masking profiles on any sensitive data that you share with others and Oracle takes care of making sure you have all the very latest security patches applied. Lastly, all your data is automatically backed. 2) Sharing external data with partners and customers. Many data warehouses make it almost impossible to share data with partners. This can make it very hard to improve your business processes. Providing an app to enable your customers to interact with you and see the same data sets can greatly improve customer satisfaction and lead to repeat business. However, you don't want to expose your internal systems on the Internet, and you have concerns about security, denial of service attacks, and web site uptime. By combining Autonomous Data Warehouse with APEX you can now safely develop public facing apps. 
Getting Started with APEX!

Getting started with APEX is really easy. Below you will see that I have put together a quick animation which guides you through the process of logging in to your APEX workspace from Autonomous Data Warehouse. What you see above is the process of logging in to APEX for the first time. In this situation you connect as the ADMIN user to the reserved workspace called "INTERNAL". Once you log in you will be required to create a new workspace and assign a user to that workspace to get things set up. In the above screenshots a new workspace called GKL is created for the user GKL. From that point everything becomes fully focused on APEX and your Autonomous Data Warehouse just fades into the background, taking care of itself. It could not be simpler!

Learn More about APEX

If you are completely new to APEX then I would recommend jumping over to the dedicated Application Express website - apex.oracle.com. On this site you will find the APEX PM team has put together a great 4-step process to get you up and running with APEX: https://apex.oracle.com/en/learn/getting-started/ - quick note: obviously, you can skip step 1, which covers how to request an environment on our public APEX service, because you have your dedicated environment within your very own Autonomous Data Warehouse. Enjoy your new, autonomous APEX-enabled environment!

A big "Autonomous Welcome" to all our APEX developers because your favorite low-code development environment is now built in Autonomous Data Warehouse. And before you ask - YES, if you have existing...

Autonomous

There's a minor tweak to our UI - DEDICATED

You may have spotted from all the recent online news headlines and social media activity that we launched a new service for transactional workloads - ATP Dedicated. It allows an organization to rethink how they deliver Database IT, enabling a customizable private database cloud in the public cloud. Obviously this does not affect you if you are using Autonomous Data Warehouse, but it does have a subtle impact because our UI has had to change slightly.

You will notice in the top left corner of the main console page we now have three types of services: Autonomous Database, Autonomous Container Database and Autonomous Exadata Infrastructure. From a data warehouse perspective you are only interested in the first one in that list: Autonomous Database. In the main table that lists all your instances you can see there is a new column headed "Dedicated Infrastructure". For ADW, this will always show "No", as you can see below.

If you create a new ADW you will notice that the pop-up form has now been replaced by a full-width page to make it easier to focus on the fields you need to complete. The new auto-scaling feature is still below the CPU Core Count box (for more information about auto scaling with ADW see this blog post).

…and that's about it for this modest little tweak to our UI. So nothing major, just a subtle change visible when you click on the "Transaction Processing" box. Moving on...


Autonomous

How to Create a Database Link from an Autonomous Data Warehouse to a Database Cloud Service Instance

Autonomous Data Warehouse (ADW) now supports outgoing database links to any database that is accessible from an ADW instance, including Database Cloud Service (DBCS) and other ADW/ATP instances. To use database links with ADW, the target database must be configured to use TCP/IP with SSL (TCPS) authentication. Since both ADW and ATP use TCPS authentication by default, setting up a database link between these services is pretty easy and takes only a few steps. We covered the ADB-to-ADB linking process in the first of this two-part series of blog posts about using database links: see Making Database Links from ADW to other Databases. That post explained the simplest use case to configure and use. On the other hand, enabling TCPS authentication in a database that doesn't have it configured (e.g. in DBCS) requires some additional steps that need to be followed carefully. In this blog post, I will demonstrate how to create a database link from an ADW instance to a DBCS instance, including the steps to enable TCPS authentication. Here is an outline of the steps that we are going to follow:

Enable TCPS Authentication in DBCS
Connect to DBCS Instance from Client via TCPS
Create a DB Link from ADW to DBCS

Enable TCPS Authentication in DBCS

A DBCS instance uses the TCP/IP protocol by default. Configuring TCPS in DBCS involves several steps that need to be performed manually. Since we are going to modify the default listener to use TCPS and it's configured under the grid user, we will be using both the oracle and grid users. Here are the steps needed to enable TCPS in DBCS:

Create wallets with self signed certificates for server and client
Exchange certificates between server and client wallets (Export/import certificates)
Add wallet location in the server and the client network files
Add TCPS endpoint to the database listener

Create wallets with self signed certificates for server and client

As part of enabling TCPS authentication, we need to create individual wallets for the server and the client. Each of these wallets has to have its own certificate, which the server and client will exchange with one another. For the sake of this example, I will be using a self signed certificate. The client wallet and certificate can be created on the client side; however, I'll be creating my client wallet and certificate on the server and moving them to my local system later on. See Configuring Secure Sockets Layer Authentication for more information. Let's start...

Set up wallet directories with the root user

[root@dbcs0604 u01]$ mkdir -p /u01/server/wallet
[root@dbcs0604 u01]$ mkdir -p /u01/client/wallet
[root@dbcs0604 u01]$ mkdir /u01/certificate
[root@dbcs0604 /]# chown -R oracle:oinstall /u01/server
[root@dbcs0604 /]# chown -R oracle:oinstall /u01/client
[root@dbcs0604 /]# chown -R oracle:oinstall /u01/certificate

Create a server wallet with the oracle user

[oracle@dbcs0604 ~]$ cd /u01/server/wallet/
[oracle@dbcs0604 wallet]$ orapki wallet create -wallet ./ -pwd Oracle123456 -auto_login
Oracle PKI Tool Release 18.0.0.0.0 - Production
Version 18.1.0.0.0
Copyright (c) 2004, 2017, Oracle and/or its affiliates. All rights reserved.
Operation is successfully completed.

Create a server certificate with the oracle user

[oracle@dbcs0604 wallet]$ orapki wallet add -wallet ./ -pwd Oracle123456 -dn "CN=dbcs" -keysize 1024 -self_signed -validity 3650 -sign_alg sha256
Oracle PKI Tool Release 18.0.0.0.0 - Production
Version 18.1.0.0.0
Copyright (c) 2004, 2017, Oracle and/or its affiliates. All rights reserved.
Operation is successfully completed.

Create a client wallet with the oracle user

[oracle@dbcs0604 wallet]$ cd /u01/client/wallet/
[oracle@dbcs0604 wallet]$ orapki wallet create -wallet ./ -pwd Oracle123456 -auto_login
Oracle PKI Tool Release 18.0.0.0.0 - Production
Version 18.1.0.0.0
Copyright (c) 2004, 2017, Oracle and/or its affiliates. All rights reserved.
Operation is successfully completed.

Create a client certificate with the oracle user

[oracle@dbcs0604 wallet]$ orapki wallet add -wallet ./ -pwd Oracle123456 -dn "CN=ctuzla-mac" -keysize 1024 -self_signed -validity 3650 -sign_alg sha256
Oracle PKI Tool Release 18.0.0.0.0 - Production
Version 18.1.0.0.0
Copyright (c) 2004, 2017, Oracle and/or its affiliates. All rights reserved.
Operation is successfully completed.

Exchange certificates between server and client wallets (Export/import certificates)

Export the server certificate with the oracle user

[oracle@dbcs0604 wallet]$ cd /u01/server/wallet/
[oracle@dbcs0604 wallet]$ orapki wallet export -wallet ./ -pwd Oracle123456 -dn "CN=dbcs" -cert /tmp/server.crt
Oracle PKI Tool Release 18.0.0.0.0 - Production
Version 18.1.0.0.0
Copyright (c) 2004, 2017, Oracle and/or its affiliates. All rights reserved.
Operation is successfully completed.

Export the client certificate with the oracle user

[oracle@dbcs0604 wallet]$ cd /u01/client/wallet/
[oracle@dbcs0604 wallet]$ orapki wallet export -wallet ./ -pwd Oracle123456 -dn "CN=ctuzla-mac" -cert /tmp/client.crt
Oracle PKI Tool Release 18.0.0.0.0 - Production
Version 18.1.0.0.0
Copyright (c) 2004, 2017, Oracle and/or its affiliates. All rights reserved.
Operation is successfully completed.

Import the client certificate into the server wallet with the oracle user

[oracle@dbcs0604 wallet]$ cd /u01/server/wallet/
[oracle@dbcs0604 wallet]$ orapki wallet add -wallet ./ -pwd Oracle123456 -trusted_cert -cert /tmp/client.crt
Oracle PKI Tool Release 18.0.0.0.0 - Production
Version 18.1.0.0.0
Copyright (c) 2004, 2017, Oracle and/or its affiliates. All rights reserved.
Operation is successfully completed.

Import the server certificate into the client wallet with the oracle user

[oracle@dbcs0604 wallet]$ cd /u01/client/wallet/
[oracle@dbcs0604 wallet]$ orapki wallet add -wallet ./ -pwd Oracle123456 -trusted_cert -cert /tmp/server.crt
Oracle PKI Tool Release 18.0.0.0.0 - Production
Version 18.1.0.0.0
Copyright (c) 2004, 2017, Oracle and/or its affiliates. All rights reserved.
Operation is successfully completed.

Change permissions for the server wallet with the oracle user

We need to set the permissions for the server wallet so that it can be accessed when we restart the listener after enabling TCPS endpoint.

[oracle@dbcs0604 wallet]$ cd /u01/server/wallet
[oracle@dbcs0604 wallet]$ chmod 640 cwallet.sso

Add wallet location in the server and the client network files

Creating server and client wallets with self signed certificates and exchanging certificates were the initial steps towards the TCPS configuration. We now need to modify both the server and client network files so that they point to their corresponding wallet location and they are ready to use the TCPS protocol. Here's how those files look in my case:

Server-side $ORACLE_HOME/network/admin/sqlnet.ora under the grid user

# sqlnet.ora Network Configuration File: /u01/app/18.0.0.0/grid/network/admin/sqlnet.ora
# Generated by Oracle configuration tools.
NAMES.DIRECTORY_PATH= (TNSNAMES, EZCONNECT)
wallet_location = (SOURCE= (METHOD=File) (METHOD_DATA= (DIRECTORY=/u01/server/wallet)))
SSL_SERVER_DN_MATCH=(ON)

Server-side $ORACLE_HOME/network/admin/listener.ora under the grid user

wallet_location = (SOURCE= (METHOD=File) (METHOD_DATA= (DIRECTORY=/u01/server/wallet)))
LISTENER=(DESCRIPTION=(ADDRESS_LIST=(ADDRESS=(PROTOCOL=IPC)(KEY=LISTENER))))                  # line added by Agent
ASMNET1LSNR_ASM=(DESCRIPTION=(ADDRESS_LIST=(ADDRESS=(PROTOCOL=IPC)(KEY=ASMNET1LSNR_ASM))))    # line added by Agent
ENABLE_GLOBAL_DYNAMIC_ENDPOINT_ASMNET1LSNR_ASM=ON          # line added by Agent
VALID_NODE_CHECKING_REGISTRATION_ASMNET1LSNR_ASM=SUBNET    # line added by Agent
ENABLE_GLOBAL_DYNAMIC_ENDPOINT_LISTENER=ON                 # line added by Agent
VALID_NODE_CHECKING_REGISTRATION_LISTENER=SUBNET           # line added by Agent

Server-side $ORACLE_HOME/network/admin/tnsnames.ora under the oracle user

# tnsnames.ora Network Configuration File: /u01/app/oracle/product/18.0.0.0/dbhome_1/network/admin/tnsnames.ora
# Generated by Oracle configuration tools.

LISTENER_CDB1 =
  (ADDRESS = (PROTOCOL = TCPS)(HOST = dbcs0604)(PORT = 1521))

CDB1_IAD1W9 =
  (DESCRIPTION =
    (ADDRESS = (PROTOCOL = TCPS)(HOST = dbcs0604)(PORT = 1521))
    (CONNECT_DATA =
      (SERVER = DEDICATED)
      (SERVICE_NAME = cdb1_iad1w9.sub05282047220.vcnctuzla.oraclevcn.com)
    )
    (SECURITY=
      (SSL_SERVER_CERT_DN="CN=dbcs"))
  )

PDB1 =
  (DESCRIPTION =
    (ADDRESS = (PROTOCOL = TCPS)(HOST = dbcs0604)(PORT = 1521))
    (CONNECT_DATA =
      (SERVER = DEDICATED)
      (SERVICE_NAME = pdb1.sub05282047220.vcnctuzla.oraclevcn.com)
    )
    (SECURITY=
      (SSL_SERVER_CERT_DN="CN=dbcs"))
  )

Add TCPS endpoint to the database listener

Now that we are done with configuring our wallets and network files, we can move on to the next step, which is configuring the TCPS endpoint for the database listener. Since our listener is configured under grid, we will be using the srvctl command to modify and restart it. Here are the steps:

[grid@dbcs0604 ~]$ srvctl modify listener -p "TCPS:1521"
[grid@dbcs0604 ~]$ srvctl stop listener
[grid@dbcs0604 ~]$ srvctl start listener
[grid@dbcs0604 ~]$ srvctl stop database -database cdb1_iad1w9
[grid@dbcs0604 ~]$ srvctl start database -database cdb1_iad1w9
[grid@dbcs0604 ~]$ lsnrctl status

LSNRCTL for Linux: Version 18.0.0.0.0 - Production on 05-JUN-2019 16:07:24
Copyright (c) 1991, 2018, Oracle. All rights reserved.
Connecting to (DESCRIPTION=(ADDRESS=(PROTOCOL=IPC)(KEY=LISTENER)))
STATUS of the LISTENER
------------------------
Alias                     LISTENER
Version                   TNSLSNR for Linux: Version 18.0.0.0.0 - Production
Start Date                05-JUN-2019 16:05:50
Uptime                    0 days 0 hr. 1 min. 34 sec
Trace Level               off
Security                  ON: Local OS Authentication
SNMP                      OFF
Listener Parameter File   /u01/app/18.0.0.0/grid/network/admin/listener.ora
Listener Log File         /u01/app/grid/diag/tnslsnr/dbcs0604/listener/alert/log.xml
Listening Endpoints Summary...
  (DESCRIPTION=(ADDRESS=(PROTOCOL=ipc)(KEY=LISTENER)))
  (DESCRIPTION=(ADDRESS=(PROTOCOL=tcps)(HOST=10.0.0.4)(PORT=1521)))
Services Summary...
Service "867e3020a52702dee053050011acf8c0.sub05282047220.vcnctuzla.oraclevcn.com" has 1 instance(s).
  Instance "cdb1", status READY, has 2 handler(s) for this service...
Service "8a8e0ea41ac27e2de0530400000a486a.sub05282047220.vcnctuzla.oraclevcn.com" has 1 instance(s).
  Instance "cdb1", status READY, has 2 handler(s) for this service...
Service "cdb1XDB.sub05282047220.vcnctuzla.oraclevcn.com" has 1 instance(s).
  Instance "cdb1", status READY, has 1 handler(s) for this service...
Service "cdb1_iad1w9.sub05282047220.vcnctuzla.oraclevcn.com" has 1 instance(s).
Instance "cdb1", status READY, has 2 handler(s) for this service... Service "pdb1.sub05282047220.vcnctuzla.oraclevcn.com" has 1 instance(s). Instance "cdb1", status READY, has 2 handler(s) for this service... The command completed successfully Please note that in the first step we added the TCPS endpoint to the port 1521 of the default listener. It's also possible to keep the port 1521 as is and add TCPS endpoint to a different port (e.g. 1523). Connect to DBCS Instance from Client via TCPS We should have TCPS authentication configured now. Before we move onto testing, let's take a look at the client-side network files (Please note the public IP address of the DBCS instance in tnsnames.ora): Client-side tnsnames.ora CDB1 = (DESCRIPTION = (ADDRESS = (PROTOCOL = TCPS)(HOST = 132.145.151.208)(PORT = 1521)) (CONNECT_DATA = (SERVER = DEDICATED) (SERVICE_NAME = cdb1_iad1w9.sub05282047220.vcnctuzla.oraclevcn.com) ) (SECURITY= (SSL_SERVER_CERT_DN="CN=dbcs")) ) PDB1 = (DESCRIPTION = (ADDRESS = (PROTOCOL = TCPS)(HOST = 132.145.151.208)(PORT = 1521)) (CONNECT_DATA = (SERVER = DEDICATED) (SERVICE_NAME = pdb1.sub05282047220.vcnctuzla.oraclevcn.com) ) (SECURITY= (SSL_SERVER_CERT_DN="CN=dbcs")) ) Client-side sqlnet.ora WALLET_LOCATION = (SOURCE = (METHOD = FILE) (METHOD_DATA = (DIRECTORY = /Users/cantuzla/Desktop/wallet) ) ) SSL_SERVER_DN_MATCH=(ON) In order to connect to the DBCS instance from the client, you need to add an ingress rule for the port that you want to use (e.g. 1521) in the security list of your virtual cloud network (VCN) in OCI as shown below: We can now try to establish a client connection to PDB1 in our DBCS instance (CDB1): ctuzla-mac:~ cantuzla$ cd Desktop/InstantClient/instantclient_18_1/ ctuzla-mac:instantclient_18_1 cantuzla$ ./sqlplus /nolog SQL*Plus: Release 18.0.0.0.0 Production on Wed Jun 5 09:39:56 2019 Version 18.1.0.0.0 Copyright (c) 1982, 2018, Oracle. All rights reserved. SQL> connect c##dbcs/DBcs123_#@PDB1 Connected. SQL> select * from dual; D - X Create a DB Link from ADW to DBCS We now have a working TCPS authentication in our DBCS instance. Here are the steps from the documentation that we will follow to create a database link from ADW to DBCS: Copy your target database wallet (the client wallet cwallet.sso that we created in /u01/client/wallet) for the target database to Object Store. Create credentials to access your Object Store where you store the cwallet.sso. See CREATE_CREDENTIAL Procedure for details. Upload the target database wallet to the data_pump_dir directory on ADW using DBMS_CLOUD.GET_OBJECT: SQL> BEGIN DBMS_CLOUD.GET_OBJECT( credential_name => 'OBJ_STORE_CRED', object_uri => 'https://objectstorage.us-phoenix-1.oraclecloud.com/n/adwctraining8/b/target-wallet/o/cwallet.sso', directory_name => 'DATA_PUMP_DIR'); END; / PL/SQL procedure successfully completed. On ADW create credentials to access the target database. The username and password you specify with DBMS_CLOUD.CREATE_CREDENTIAL are the credentials for the target database that you use to create the database link. Make sure the username consists of all uppercase letters. For this example, I will be using the C##DBCS common user that I created in my DBCS instance: SQL> BEGIN DBMS_CLOUD.CREATE_CREDENTIAL( credential_name => 'DBCS_LINK_CRED', username => 'C##DBCS', password => 'DBcs123_#'); END; / PL/SQL procedure successfully completed. 
Create the database link to the target database using DBMS_CLOUD_ADMIN.CREATE_DATABASE_LINK:

SQL> BEGIN
       DBMS_CLOUD_ADMIN.CREATE_DATABASE_LINK(
         db_link_name => 'DBCSLINK',
         hostname => '132.145.151.208',
         port => '1521',
         service_name => 'pdb1.sub05282047220.vcnctuzla.oraclevcn.com',
         ssl_server_cert_dn => 'CN=dbcs',
         credential_name => 'DBCS_LINK_CRED');
     END;
     /

PL/SQL procedure successfully completed.

Use the database link you created to access data on the target database:

SQL> select * from dual@DBCSLINK;

D
-
X

That's it! In this blog post, we covered how to enable TCPS authentication in DBCS and create an outgoing database link from ADW to our DBCS instance. Even though we focused on the DBCS configuration, these steps can be applied when setting up a database link between ADW and any other Oracle database.
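If you later want to check or tidy up the link on the ADW side, a couple of housekeeping statements along the following lines can help. This is just a sketch using the DBCSLINK name created above; verify the DBMS_CLOUD_ADMIN details against the current documentation for your instance.

-- Confirm the link exists in your schema
SQL> SELECT db_link, host FROM user_db_links;

-- Drop the link once it is no longer needed
SQL> BEGIN
       DBMS_CLOUD_ADMIN.DROP_DATABASE_LINK(
         db_link_name => 'DBCSLINK');
     END;
     /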


Autonomous

Making Database Links from ADW to other Databases

Autonomous Database now fully supports database links. What does this mean? It means that from within your Autonomous Data Warehouse you can make a connection to any other database (on-premise or in the cloud), including other Autonomous Data Warehouse instances and/or Autonomous Transaction Processing instances. Before I dive into an example, let's take a small step backwards and get a basic understanding of what a database link is.

What Are Database Links?

A database link is a pointer that defines a one-way communication path from, in this case, an Autonomous Data Warehouse instance to another database. The link is one-way in the sense that a client connected to Autonomous Data Warehouse A can use a link stored in Autonomous Data Warehouse A to access information (schema objects such as tables, views etc.) in remote database B; however, users connected to database B cannot use the same link to access data in Autonomous Data Warehouse A. If local users on database B want to access data on Autonomous Data Warehouse A, then they must define their own link to Autonomous Data Warehouse A. There is more information about database links in the Administrator's Guide.

Why Are Database Links Useful?

In a lot of situations it can be really useful to have access to the very latest data without having to wait for the next run of the ETL processing. Being able to reach directly into other databases using a DBLINK can be the fastest way to get an up-to-the-minute view of what's happening with sales orders, or expense claims, or trading positions etc. Another use case is to make use of dblinks within the ETL processing itself, by pulling data from remote databases into staging tables for further processing. This means the ETL process imposes a minimal processing overhead on the remote databases, since all that is typically being executed is a basic SQL SELECT statement.

There are additional security benefits as well. For example, consider a scenario where employees submit expense reports to an Accounts Payable (A/P) application and that information needs to be viewed within a financial data mart. The data mart users should be able to connect to the A/P database and run queries to retrieve the desired information. The mart users do not need to be A/P application users to do their analysis or run their ETL jobs; they should only be able to access A/P information in a controlled, secured way.

Setting Up A Database Link in ADW

There are not many steps involved in creating a new database link since all the hard work happens under the covers. The first step is to check that you can actually access the target database - i.e. you have a username and password along with all the connection information. To use database links with Autonomous Data Warehouse the target database must be configured to use TCP/IP with SSL (TCPS) authentication. Fortunately, if you want to connect to another Autonomous Data Warehouse or Autonomous Transaction Processing instance then everything is already in place, because ADBs use TCP/IP with SSL (TCPS) authentication by default. For other cloud and on-premise databases you will most likely have to configure them to use TCP/IP with SSL (TCPS) authentication. I will try and cover this topic in a separate blog post. Word of caution here…don't forget to check your Network ACL settings if you are connecting to another ATP or ADW instance since your attempt to connect might get blocked! There is more information about setting up Network ACLs here.
Scenario 1 - Connecting an Autonomous Data Warehouse to your Autonomous Transaction Processing instance

Let's assume that I have an ATP instance running a web store application that contains information about sales orders, distribution channels, customers, products etc. I want to access some of that data in real time from within my sales data mart. The first step is to get hold of the secure connection information for my ATP instance - essentially I need the cwallet.sso file that is part of the client credential file. If I click on the "APDEMO" link above I can access the information about that autonomous database, and in the list of "management" buttons is the facility to download the client credentials file. This gets me a zip file containing a series of files, two of which are needed to create a database link: cwallet.sso contains all the security credentials and tnsnames.ora contains all the connection information that I am going to need.

Uploading the wallet file…

Next I go to my Object Storage page and create a new bucket to store my wallet file. In this case I have just called it "wallet". In reality you will probably name your buckets to identify the target database, such as "atpdemo_wallet", simply because every wallet for each database will have exactly the same name - cwallet.sso - so you will need a way to identify the target database each wallet is associated with and avoid over-writing each wallet. Within my bucket I click on the blue "Upload" button to find the cwallet.sso file and move it to my Object Storage bucket. Once my wallet file is in my bucket I then need to set up my autonomous data warehouse to use that file when it makes a connection to my ATP instance.

This is where we step out of the cloud GUI and switch to a client tool like SQL Developer. I have already defined my SQL Developer connection to my Autonomous Data Warehouse, which means I can start building my new database link.

Step 1 - Moving the wallet file

To allow Autonomous Data Warehouse to access the wallet file for my ATP target database I need to put it in a special location - the data_pump_dir directory. This is done by using DBMS_CLOUD.GET_OBJECT as follows:

BEGIN
  DBMS_CLOUD.GET_OBJECT(
    credential_name => 'DEF_CRED_NAME',
    object_uri => 'https://objectstorage.us-phoenix-1.oraclecloud.com/n/adwc/b/adwc_user/o/cwallet.sso',
    directory_name => 'DATA_PUMP_DIR');
END;
/

If you execute the above command all you will get back in the console is a message something like this: "PL/SQL procedure successfully completed". So to find out if the file actually got moved you can use the following query against the data_pump_dir directory:

SELECT *
FROM table(dbms_cloud.list_files('DATA_PUMP_DIR'))
WHERE object_name LIKE '%.sso'

which hopefully returns the following result within SQL Developer, confirming my wallet file is now available to my Autonomous Data Warehouse:

Step 2 - Setting up authentication

When my database link process connects to my target ATP instance it obviously needs a valid username and password on that target ATP instance. I could use an account in my Autonomous Data Warehouse if it matches the account in my ATP instance, but chances are you will want to use a specific account on the target database, so a credential is required.
This can be set up relatively quickly using the following command:

BEGIN
  DBMS_CLOUD.CREATE_CREDENTIAL(
    credential_name => 'ATP_DB_LINK_CRED',
    username => 'scott',
    password => 'tiger');
END;
/

Step 3 - Defining the new database link

For this step I am going to need access to the tnsnames.ora file to extract specific pieces of information about my ATP instance. Don't forget that for each autonomous instance there is a range of connections that are identified by resource group ids such as "low", "medium", "high", "tp_urgent" etc. When defining your database link make sure you select the correct information from your tnsnames file. You will need to find the following identifiers:

hostname
port
service name
ssl_server_cert_dn

In the example below I am using the "low" resource group connection:

BEGIN
  DBMS_CLOUD_ADMIN.CREATE_DATABASE_LINK(
    db_link_name => 'SHLINK',
    hostname => 'adb.us-phoenix-1.oraclecloud.com',
    port => '1522',
    service_name => 'example_low.adwc.example.oraclecloud.com',
    ssl_server_cert_dn => 'CN=adwc.example.oraclecloud.com,OU=Oracle BMCS PHOENIX,O=Oracle Corporation,L=Redwood City,ST=California,C=US',
    credential_name => 'ATP_DB_LINK_CRED');
END;
/

I could also have configured the database link to authenticate using the current user within my Autonomous Data Warehouse (assuming that I had a corresponding account in my Autonomous Transaction Processing instance).

That's all there is to it! Everything is now in place, which means I can directly query my transactional data from my data warehouse. For example, if I want to see the table of distribution channels for my tp_app_orders then I can simply query the channels table as follows:

SELECT
  channel_id,
  channel_desc,
  channel_class,
  channel_class_id,
  channel_total,
  channel_total_id
FROM channels@SHLINK;

This will now return the following result, and if I query my tp_app_orders table I can see the live data in my Autonomous Transaction Processing instance.

All Done!

That's it. It's now possible to connect your Autonomous Data Warehouse to any other database running on-premise or in the cloud, including other Autonomous Database instances. This makes it even quicker and easier to pull data from existing systems into your staging tables (see the sketch below) or even just query data directly from your source applications to get the most up-to-date view. In this post you will have noticed that I have created a new database link between an Autonomous Data Warehouse and an Autonomous Transaction Processing instance. Whilst this is a great use case, I suspect that many of you will want to connect your Autonomous Data Warehouse to an on-premise database. Well, as I mentioned at the start of this post there are some specific requirements related to using database links with Autonomous Data Warehouse where the target instance is not an autonomous database and we will deal with those in the next post: How to Create a Database Link from an Autonomous Data Warehouse to a Database Cloud Service Instance. For more information about using dblinks with ADW click here.
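As a quick illustration of the staging-table use case mentioned above, here is a sketch of pulling channel data over the SHLINK link into a local staging table. The staging table name is hypothetical; the link and source columns are the ones used in the example above.

-- Hypothetical staging table populated over the SHLINK database link.
CREATE TABLE stage_channels AS
SELECT channel_id, channel_desc, channel_class
FROM   channels@SHLINK;

-- On subsequent ETL runs, refresh the staging table from the remote source.
DELETE FROM stage_channels;
INSERT INTO stage_channels (channel_id, channel_desc, channel_class)
SELECT channel_id, channel_desc, channel_class
FROM   channels@SHLINK;
COMMIT;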


Autonomous

Autonomous Data Warehouse - Now with Spatial Intelligence

We are pleased to announce that Oracle Autonomous Data Warehouse now comes with spatial intelligence! If you are completely new to Oracle Autonomous Data Warehouse (where have you been for the last 18 months?) then here is a quick recap of the key features:

What is Oracle Autonomous Data Warehouse?

Oracle Autonomous Data Warehouse provides a self-driving, self-securing, self-repairing cloud service that eliminates the overhead and human errors associated with traditional database administration. Oracle Autonomous Data Warehouse takes care of configuration, tuning, backup, patching, encryption, scaling, and more. Additional information can be found at https://www.oracle.com/database/autonomous-database.html.

Special Thanks...

This post has been prepared by David Lapp, who is part of the Oracle Spatial and Graph product management team. He is extremely well known within our spatial and graph community. If you want to follow David's posts then use this link; the Spatial and Graph blog is here.

Spatial Features

The core set of Spatial features has been enabled on Oracle Autonomous Data Warehouse. Highlights of the enabled features are: native storage and indexing of point/line/polygon geometries; spatial analysis and processing, such as proximity, containment, combining geometries, and distance/area calculations; geofencing to monitor objects entering and exiting areas of interest; and linear referencing to analyze events and activities located along linear networks such as roads and utilities. For details on enabled Spatial features, please see the Oracle Autonomous Data Warehouse documentation.

Loading Your Spatial Data into ADW

In Oracle Autonomous Data Warehouse, data loading is typically performed using either Oracle Data Pump or Oracle/3rd party data integration tools. There are a few different ways to load and configure your spatial data sets:

Load existing spatial data
Load GeoJSON, WKT, or WKB and convert to Spatial using SQL
Load coordinates and convert to Spatial using SQL

Obviously the files containing your spatial data sets can be located in your on-premise data center or maybe your desktop computer, but for the fastest data loading performance Oracle Autonomous Data Warehouse also supports loading from files stored in Oracle Cloud Infrastructure Object Storage and other cloud file stores. Details can be found here: https://docs.oracle.com/en/cloud/paas/autonomous-data-warehouse-cloud/user/load-data.html.

Configuring Your Spatial Data

Routine Spatial data configuration is performed using Oracle SQL Developer GUIs or SQL commands for (see the short SQL sketch below):

Insertion of Spatial metadata
Creation of Spatial index
Validation of Spatial data

Example Use Case

The Spatial features enabled for Oracle Autonomous Data Warehouse support the most common use cases in data warehouse contexts. Organizations such as insurance, finance, and public safety require data warehouses to perform a wide variety of analytics. These data warehouses provide the clues to answer questions such as: What are the major risk factors for a potential new insurance policy? What are the patterns associated with fraudulent bank transactions? What are the predictors of various types of crimes? In all of these data warehouse scenarios, location is an important factor, and the Spatial features of Oracle Autonomous Data Warehouse enable building and analyzing the dimensions of geographic data.
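Before the proximity queries in the insurance example below will work, the geometry columns need the metadata and spatial index configuration mentioned above. Here is a minimal sketch, assuming a POLICIES table with a GEOMETRY column holding longitude/latitude (SRID 4326) data; adjust the names, bounds and tolerance for your own data.

-- Register the geometry column in the Spatial metadata view (illustrative values).
INSERT INTO user_sdo_geom_metadata (table_name, column_name, diminfo, srid)
VALUES (
  'POLICIES',
  'GEOMETRY',
  SDO_DIM_ARRAY(
    SDO_DIM_ELEMENT('X', -180, 180, 0.05),
    SDO_DIM_ELEMENT('Y',  -90,  90, 0.05)),
  4326);
COMMIT;

-- Create the spatial index used by operators such as SDO_WITHIN_DISTANCE and SDO_NN.
CREATE INDEX policies_sidx ON policies (geometry)
  INDEXTYPE IS MDSYS.SPATIAL_INDEX;

-- Optionally validate the loaded geometries.
SELECT policy_id
FROM   policies
WHERE  SDO_GEOM.VALIDATE_GEOMETRY_WITH_CONTEXT(geometry, 0.05) <> 'TRUE';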
Using the insurance scenario as an example, the major steps for location analysis are:

Load historical geocoded policy data including outcomes such as claims and fraud
Load geospatial reference data for proximity such as businesses and transportation features
Use Spatial to calculate location-based metrics

For example, let's find the number of restaurants within 5 miles, and the distance to the nearest restaurant:

-- Count within distance
-- Use a SQL statement with SDO_WITHIN_DISTANCE
-- and DML to build the result data
SELECT policy_id, count(*) as no_restaurant_5_mi
FROM policies, businesses
WHERE businesses.type = 'RESTAURANT'
AND SDO_WITHIN_DISTANCE(
        businesses.geometry,
        policies.geometry,
        'distance=5 UNIT=mile') = 'TRUE'
GROUP BY policy_id;

POLICY_ID  NO_RESTAURANT_5_MI
81902842   5
86469385   1
36378345   3
36323540   3
36225484   2
40830185   5
40692826   1
...

Now we can expand the above query to use the SDO_NN function to do further analysis and find the closest restaurant within the group of restaurants that are within a mile radius of a specific location. Something like the following:

-- Distance to nearest
-- The SDO_NN function does not perform an implicit join
-- so use PL/SQL with DML to build the result data
DECLARE
  distance_mi NUMBER;
BEGIN
  FOR item IN (SELECT * FROM policies)
  LOOP
    execute immediate
      'SELECT sdo_nn_distance(1) FROM businesses '||
      'WHERE businesses.type = ''RESTAURANT'' '||
      'AND SDO_NN(businesses.geometry,:1,'||
      '''sdo_batch_size=10 unit=mile'', 1) = ''TRUE'' '||
      'AND ROWNUM=1'
    INTO distance_mi USING item.geometry;
    DBMS_OUTPUT.PUT_LINE(item.policy_id||' '||distance_mi);
  END LOOP;
END;

POLICY_ID  RESTAURANT_MI
81902842   4.100
86469385   1.839
36378345   4.674
36323540   3.092
36225484   1.376
40830185   2.237
40692826   4.272
44904642   2.216
...

Generate the desired spectrum of location-based metrics by stepping through combinations of proximity targets (i.e., restaurants, convenience stores, schools, hospitals, police stations ...) and distances (i.e., 0.25 mi, 0.5 mi, 1 mi, 3 mi, 5 mi ...). Combine these location-based metrics with traditional metrics (i.e., value of property, age of policy holder, household income ...) for analytics to identify predictors of outcomes.

To enable geographic aggregation, start with a geographic hierarchy with geometry at the most detailed level. For example, a geographic hierarchy where ZONE rolls up to SUB_REGION which rolls up to REGION:

DESCRIBE geo_hierarchy
Name          Type
-----------   --------------
ZONE          VARCHAR2(30)
GEOMETRY      SDO_GEOMETRY
SUB_REGION    VARCHAR2(30)
REGION        VARCHAR2(30)

Use Spatial to calculate containment (things found within a region) within the detailed level, which by extension associates the location with all levels of the geo-hierarchy for aggregations:

-- Calculate containment
--
-- The SDO_ANYINTERACT function performs an implicit join
-- so, use a SQL statement with DML to build the result data
--
SELECT policy_id, zone
FROM policies, geo_hierarchy
WHERE SDO_ANYINTERACT(policies.geometry, geo_hierarchy.geometry) = 'TRUE';

POLICY_ID  ZONE
81902842   A23
86469385   A21
36378345   A23
36323540   A23
36225484   B22
40830185   C05
40692826   C10
44904642   B16
...

With these and similar operations, analytics may be performed, including the calculation of additional location-based metrics and aggregation by geography.
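To round out the aggregation step described above, here is a sketch of rolling the containment results up the geographic hierarchy. It assumes the same POLICIES and GEO_HIERARCHY tables used in the examples above.

-- Aggregate policies by geography using the containment relationship.
SELECT g.region,
       g.sub_region,
       COUNT(*) AS policy_count
FROM   policies p,
       geo_hierarchy g
WHERE  SDO_ANYINTERACT(p.geometry, g.geometry) = 'TRUE'
GROUP  BY g.region, g.sub_region
ORDER  BY g.region, g.sub_region;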
Summary

For important best practices and further details on the use of these and many other Spatial operations, please refer to the Oracle Autonomous Data Warehouse documentation: https://www.oracle.com/database/autonomous-database.html.


Autonomous

Using Oracle Management Cloud to Monitor Autonomous Databases

How To Monitor All Your Autonomous Database Instances

The latest release of Autonomous Database (which, as you should already know, covers both Autonomous Data Warehouse and Autonomous Transaction Processing) has brought integration with Oracle Management Cloud (OMC). This is great news for cloud DBAs and cloud Fleet Managers since it means you can now monitor all your Autonomous Database instances from within a single, integrated console.

So What Is Oracle Management Cloud?

Oracle Management Cloud is a suite of integrated monitoring and management services that can bring together information about all your autonomous database instances so you can monitor and manage everything from a single console. In a much broader context, where you need to manage a complete application ecosystem or data warehouse ecosystem, Oracle Management Cloud can help you eliminate multiple silos of management/system information and infrastructure data, resolve issues faster across your complete cloud ecosystem, and run IT like a business.

What About My Service Console?

Each Autonomous Database instance has its own service console for managing and monitoring that specific instance (application database, data mart, data warehouse, sandbox etc). It has everything you, as a DBA or business user, need to understand how your database is performing and using resources. To date this has been the console that everyone has used for monitoring. But, as you can see, this service console only shows you what's happening within a specific instance. If you have not looked at the Service Console before then check out Section 9 in the documentation (in this case for Autonomous Data Warehouse, but the same applies to ATP): Managing and Monitoring Performance of Autonomous Data Warehouse.

As more and more business teams deploy more and more Autonomous Database instances for their various projects, the challenge has been for anyone tasked with monitoring all these instances: how do you get a high-level overview of what's been deployed and in use across a whole organization? That's where Oracle Management Cloud (OMC) comes in…and the great news is that OMC monitoring of Autonomous Databases is completely free!

OMC Takes Monitoring To The Next Level

The purpose of this post is to look at the newly released integration between Autonomous Database and the Oracle Database Management part of Oracle Management Cloud. Let's look at how to set up Oracle Database Management to discover your Autonomous Databases, how to monitor your Autonomous Database instances and check for predefined alerts, and how to use the Performance Hub page. This is what we are aiming to set up in this blog post…the Oracle Database Fleet Home page which, as you can see, is telling me that I have 8 autonomous database instances - 6 ADW and 2 ATP instances - and that 7 of those instances are up and running and one is currently either starting up or shutting down (identified as yellow).

Getting Started…

Before you get started with OMC it's worth taking a step back and thinking about how you want to manage your OMC instance. My view is that it makes sense to create a completely new, separate cloud account which will own your OMC instance (or instances if you want to have more than one). It's not a requirement but in my opinion it keeps things nice and simple, and your DBAs and fleet managers then typically won't need access to the individual cloud accounts being used by each team for their autonomous database projects.
So the first step is probably going to be registering a new cloud account and setting up users to access your OMC instance. Once you have a cloud account set up, there is some initial user configuration that needs to be completed before you can start work with your OMC instance. The setup steps are outlined in the documentation - see here. To help you, I have also covered these steps in the PowerPoint presentation which is provided at the end of this blog post.

Creating an OMC Instance

Starting from your My Services home page, click on the big "Create Instance" button and find the "Management Cloud" service in the list of all available services…this will allow you to provide the details for your new OMC instance. Oracle Cloud will send you an email notification as soon as your new OMC instance is ready, but it only takes a few minutes and then you can navigate to the "Welcome" page for your new instance, which looks like this:

The next step is to set up "Discovery Profiles" for each of your cloud accounts. This will require a lot of the IDs and information that were part of the user account setup process, so you may want to look back over that stage of this process for a quick refresh. As you can see below, a discovery profile can be configured to look for only autonomous data warehouse instances or only autonomous transaction processing instances, or it can search for both within a specific cloud account. Of course, if you have multiple cloud accounts (maybe each department or project team has their own cloud account) then you will need to create discovery profiles for each cloud account. This gives you the ultimate flexibility in terms of setting up OMC in a way that best suits how you and your team want to work. There is more detailed information available in the OMC documentation, see here.

The discovery process starts as soon as you click on the "Start" button in the top right-hand corner of the page. It doesn't take long, but the console provides feedback on the status of the discovery job. Once the job or jobs have completed you can navigate to the "Oracle Database" option in the left-hand menu, which will bring you to this page - Oracle Database Fleet Home.

You can customise the graphs and information displayed on this page. For example, the heat map in the middle of the page can display metrics for: DB Time, execution rate, network I/O, space used or transaction rate. You can switch between different types of views, listing the data as a table rather than a graph, and because there is so much available data there is a Filter menu that allows you to focus on instances that are up or down, ADW instances vs. ATP instances, database version, and you can even focus in on a specific data center or group of data centers. Once you have set up your filters you can bookmark that view by saving the filters…

In the section headed "Fleet Members", clicking on one of the instances listed in the name column will drill into the performance analysis for that instance. This takes all the information from the Fleet Home page and brings the focus down to that specific instance. For example, selecting my demo instance, which is the last row in the table above, brings me to this page…

You will notice that this contains a mixture of information from the OCI console page and service console page for my demo instance, so it provides me with a great overview of how many CPUs are allocated, the database version, the amount of storage allocated, database activity and a list of any SQL statements.
It is a sort of mash-up of my OCI console page and service console page. If I then go to the Performance Hub page, I can start to investigate what's currently happening within my demo instance…as with the previous screens I can customize the metrics displayed on the graph, although this time there is a huge library of metrics to choose from…and OMC allows me to drill in to my currently running SQL statement (highlighted in blue in the above screenshot) to look at the resource usage…and I can get right down to the SQL execution plan…

Take Monitoring To The Next Level With OMC

As you can see, Oracle Management Cloud takes monitoring of your autonomous database instances to a whole new level. Now you can get a single, integrated view for managing all your autonomous database instances, across all your different cloud accounts and across all your data centres. For cloud DBAs and cloud Fleet Managers this is definitely the way to go and, more importantly, OMC is free for Autonomous Database customers. If you are already using OMC to monitor other parts of your Oracle Cloud deployments (mobile apps, GoldenGate, data integration tools, IaaS, SaaS) then monitoring Autonomous Database instances can now be included in your day-to-day use of OMC, which means you can use a single console to manage a complete application ecosystem and/or a complete data warehouse ecosystem. For cloud DBAs and cloud Fleet Managers life just got a whole lot easier! Happy monitoring with Oracle Management Cloud.

Where To Get More Information:

Step-by-step setup guide in PDF format is here.
Autonomous Data Warehouse documentation is here.
Autonomous Transaction Processing documentation is here.
OMC documentation for Autonomous Database is here.


Autonomous

Loading data into Autonomous Data Warehouse using Datapump

Oracle introduced Autonomous Data Warehouse over a year ago, and one of the most common questions that customers ask me is how they can move their data/schemas to ADW (Autonomous Data Warehouse) with minimal effort. My answer to that is to use Data Pump, also known as expdp/impdp. ADW doesn't support traditional import and export, so you have to use Data Pump. Oracle suggests using the schema and parallel parameters with Data Pump; set the degree of parallelism based on the number of OCPUs that you have for your ADW instance. Oracle also suggests excluding index, cluster, indextype, materialized_view, materialized_view_log, materialized_zonemap and db_link. This is done in order to save space and speed up the data load process. At present you can only use data_pump_dir as the directory; this is the default directory created in ADW. Also, you don't have to worry about the performance of the database, since ADW uses technologies like storage indexes, machine learning, etc. to achieve optimal performance. You can use a file stored on Oracle Object Storage, Amazon S3 or Azure Blob Storage as your dumpfile location. I will be using Oracle Object Storage in this article.

We will be using the steps below to load data:
1) Export the schema of your current database using expdp
2) Upload the .dmp file to object storage
3) Create an Authentication Token
4) Log in to ADW using SQL*Plus
5) Create credentials in Autonomous Data Warehouse to connect to Object Storage
6) Run the import using Data Pump
7) Verify the import
Instead of writing more, let me show you how easy it is to do.

Step 1: Export the schema of your current database using expdp
Run the export with expdp on your current database. Copy that dump file to a location from where you can upload it to object storage.

Step 2: Upload the .dmp file to object storage
In order to upload the .dmp file to object storage, log in to your cloud console and click Object Storage. Once in Object Storage, select the compartment that you want to use and create a bucket. I am going to use the compartment "Liftandshift" and create the bucket "LiftandShiftADW". Next click on the bucket and click Upload to upload the .dmp file. At this point you can use either the CLI (Command Line Interface) or the GUI (Graphical User Interface) to upload the .dmp file. If your .dmp file is larger than 2 GiB then you have to use the CLI. I am going to use the GUI since I have a small schema for demonstration purposes. Select the .dmp file that you want to upload to object storage and then click Upload Object. Once you're done, your .dmp file will show up under Objects in your Bucket Details section.

Step 3: Create an Authentication Token
The Authentication Token will help us access Object Storage from the Autonomous Database. Under the Governance and Administration section, click on the Identity tab and go to Users. Click on the authorized user ID and then click on Auth Tokens under Resources on the left side to generate the Auth Token. Click Generate Token, give it a description, and then click Generate Token again and it will create the token for you. You can click on the copy button and copy the token to a notepad. Remember to save the token, because once it has been created you will not be able to retrieve it again. Once done, you can hit the Close button on the screen.

Step 4: Log in to ADW using SQL*Plus
Go to the ADW homepage and click on the ADW database you have created.
Once on the database page, click on DB Connection, then click on the Download button to download the wallet. Once the zip file is downloaded, hit the Close button. Download the latest version of Instant Client from the Oracle website: https://www.oracle.com/technetwork/database/database-technologies/instant-client/downloads/index.html and unzip all the files to one location. I used the location "C:\instantclient\instantclient_18_3" on my system. Once unzipped you will be able to use sqlplus.exe and impdp.exe at that location. Also move the compressed wallet file to that location and unzip it. Next, update the entries in the sqlnet.ora file and point it to the location of your wallet. I have changed mine to "C:\instantclient\instantclient_18_3" as shown below. Test the connectivity using sqlplus.exe and make sure you are able to connect using the user ID admin.

Step 5: Create credentials in Autonomous Data Warehouse to connect to Object Storage
Use the script below to create credentials in ADW, and use the Authentication Token created earlier as the password.
BEGIN
  DBMS_CLOUD.CREATE_CREDENTIAL(
    credential_name => 'DEF_CRED_NAME',
    username => 'oracleidentitycloudservice/ankur.saini@oracle.com',
    password => '<password>'   <------------ (use the Authentication Token value here instead of the password)
  );
END;
/

Step 6: Run the import using Data Pump
Since my ADW instance is built using 1 OCPU, I won't be using the parallel option. I used the command below to run the import:
./impdp.exe admin/<Password>@liftandshift_high directory=data_pump_dir credential=def_cred_name dumpfile= https://swiftobjectstorage.us-ashburn-1.oraclecloud.com/v1/orasenatdoracledigital05/AnkurObject/hrapps.dmp exclude=index, cluster, indextype, materialized_view, materialized_view_log, materialized_zonemap, db_link

Step 7: Verify the import
Log in to the database using SQL*Plus or SQL Developer and verify the import. You can see how easy it is to move data to ADW, and that there is not a huge learning curve. Now you can be more productive and focus on your business.

Reference: https://docs.oracle.com/en/cloud/paas/autonomous-data-warehouse-cloud/user/load-data.html#GUID-297FE3E6-A823-4F98-AD50-959ED96E6969
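As a footnote to the steps above, here is a minimal sketch of what Step 1 (the export on your current, source database) and Step 7 (verifying the import) might look like. The schema name HRAPPS and the use of the SYSTEM account are assumptions for illustration only; substitute your own schema, credentials and file names.

# Step 1 - on the source database server: export a single schema with Data Pump
expdp system/<password> schemas=HRAPPS directory=DATA_PUMP_DIR dumpfile=hrapps.dmp logfile=hrapps_exp.log

# Step 7 - against ADW: a quick sanity check that the schema objects arrived
sqlplus admin/<Password>@liftandshift_high <<EOF
SELECT object_type, COUNT(*) FROM dba_objects WHERE owner = 'HRAPPS' GROUP BY object_type;
EOF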


Autonomous

So you have your CSV, TSV and JSON data lying in your Oracle Cloud Object Store. How do you get it over into your Autonomous Database?

You have finally gotten your ducks in a row to future-proof your data storage, and uploaded all your necessary production data into Oracle Cloud's object store. Splendid! Now, how do you get this data into your Autonomous Database? Here I provide some practical examples of how to copy over your data from the OCI object store to your Autonomous Data Warehouse (ADW). You may use a similar method to copy data into your Autonomous Transaction Processing (ATP) instance too. We will dive into the meanings of some of the widely used parameters, which will help you and your teams derive quick business value by creating your Data Warehouse in a jiffy!

An extremely useful feature of the fully managed ADW service is the ability to copy data lying in your external object store quickly and easily. The DBMS_CLOUD.COPY_DATA API procedure enables this behavior of copying (or loading) your data into your database from data files lying in your object store, enabling your ADW instance to run queries and analyses on said data.

A few prerequisites to get us running these analyses:

Make sure you have a running ADW instance with a little storage space, a credentials wallet and a working connection to your instance. If you haven’t done this already, you can follow this simple Lab 1 tutorial.

Use this link to download the data files for the following examples. You will need to unzip and upload these files to your Object Store. Once again, if you don’t know how to do this, follow Lab 3 Step 4 in this tutorial, which uploads files to a bucket in the Oracle Cloud Object Store, the most streamlined option. You may also use AWS or Azure object stores if required; refer to the documentation for more information on this.

You will provide the URLs of the files lying in your object store to the API. If you already created your object store bucket’s URL in the lab you may use that; otherwise, to create this, use the URL below and replace the placeholders <region_name>, <tenancy_name> and <bucket_name> with your object store bucket’s region, tenancy and bucket names. The easiest way to find this information is to look at your object’s details in the object store, by opening the right-hand menu and clicking “Object details” (see screenshot below). https://objectstorage.<region_name>.oraclecloud.com/n/<tenancy_name>/b/<bucket_name>/o/ Note: You may also use a SWIFT URL for your file here if you have one.

Have the latest version of SQL Developer installed (ADW requires at least v18.3).

Comma Separated Value (CSV) Files

CSV files are one of the most common file formats out there. We will begin by using a plain and simple CSV format file for Charlotte’s (NC) Weather History dataset, which we will use as the data for our first ADW table. Open this Weather History ‘.csv’ file in a text editor to have a look at the data. Notice each field is separated by a comma, and each row ends by going to the next line (i.e. with a newline ‘\n’ character). Also note that the first line is not data, but metadata (column names).

Let us now write a script to create a table, with the appropriate column names, in our ADW instance, and copy this data file lying in our object store into it. We will specify the format of the file as CSV. The format parameter in the DBMS_CLOUD.COPY_DATA procedure takes a JSON object, which can be provided in two possible formats.
format => '{"format_option" : “format_value” }' format => json_object('format_option' value 'format_value')) The second format option has been used in the script below. set define on define base_URL = <paste Object Store or SWIFT URL created above here> create table WEATHER_REPORT_CSV (REPORT_DATE VARCHAR2(20),     ACTUAL_MEAN_TEMP NUMBER,     ACTUAL_MIN_TEMP NUMBER,     ACTUAL_MAX_TEMP NUMBER,     AVERAGE_MIN_TEMP NUMBER,     AVERAGE_MAX_TEMP NUMBER,     AVERAGE_PRECIPITATION NUMBER(5,2)); begin    DBMS_CLOUD.COPY_DATA(   table_name =>'WEATHER_REPORT_CSV',   credential_name =>'OBJ_STORE_CRED',     file_uri_list =>   '&base_URL/Charlotte_NC_Weather_History.csv',   format =>     json_object('type' value 'csv',      'skipheaders' value '1',     'dateformat' value 'mm/dd/yy')); end; / Let us breakdown and understand this script. We are first creating the WEATHER_REPORT_CSV table with the appropriate named columns for our destination table. We are then invoking the “COPY_DATA” procedure in the DBMS_CLOUD API  package and providing it the table name we created in our Data Warehouse, our user credentials (we created this in the pre-requisites), the object store file list that contains our data, and a format JSON object that describes the format of our file to the API.  The format parameter is a constructed JSON object with format options ‘type’ and ‘skipheaders’. The type specifies the file format as CSV, while skipheaders tells the API how many rows are metadata headers which should be skipped. In our file, that is 1 row of headers. The 'dateformat' parameter specifies the format of the date column in the file we are reading from; We will look at this parameter in more detailed examples below. Great! If this was successful, we have our first data warehouse table containing data from an object store file. If you do see errors during this copy_data process, follow Lab 3 Step 12 to troubleshoot them with the help of the necessary log file. If required, you can also drop this table with the “DROP TABLE” command. On running this copy data without errors, you now have a working data warehouse table. You may now query and join the WEATHER_REPORT_CSV with other tables in your Data Warehouse instance with the regular SQL or PL/SQL you know and love. As an example, let us find the days in our dataset during which it was pleasant in Charlotte. SELECT * FROM WEATHER_REPORT_CSV where actual_mean_temp > 69 and        actual_mean_temp < 74;   Tab Separated Value (TSV) Files   Another popular file format involves tab delimiters or TSV files. In the files you downloaded look for the Charlotte Weather History ‘.gz’ file. Unzip, open and have look at the ".tsv" file in it in a text editor as before. You will notice each row in this file is ended by a pipe ‘|’ character instead of a newline character, and the fields are separated by tabspaces. Oftentimes applications you might work with will output data in less intelligible formats such as this one, and so below is a slightly more advanced example of how to pass such data into DBMS_CLOUD. 
Let’s run the following script:

create table WEATHER_REPORT_TSV (REPORT_DATE VARCHAR2(20),
    ACTUAL_MEAN_TEMP NUMBER,
    ACTUAL_MIN_TEMP NUMBER,
    ACTUAL_MAX_TEMP NUMBER,
    AVERAGE_MIN_TEMP NUMBER,
    AVERAGE_MAX_TEMP NUMBER,
    AVERAGE_PRECIPITATION NUMBER(5,2));

begin
  DBMS_CLOUD.COPY_DATA(
    table_name =>'WEATHER_REPORT_TSV',
    credential_name =>'OBJ_STORE_CRED',
    file_uri_list =>'&base_URL/Charlotte_NC_Weather_History.gz',
    format => json_object(
                          'ignoremissingcolumns' value 'true',
                          'removequotes' value 'true',
                          'dateformat' value 'mm/dd/yy',
                          'delimiter' value '\t',
                          'recorddelimiter' value '''|''',
                          'skipheaders' value '1',
                          'rejectlimit' value '1',
                          'compression' value 'gzip'
                          )
 );
end;
/

SELECT * FROM WEATHER_REPORT_TSV where actual_mean_temp > 69 and
       actual_mean_temp < 74;

Let us understand the new parameters here:

'ignoremissingcolumns' value 'true': Notice there is no data for the last column “AVERAGE_PRECIPITATION”. This parameter allows the copy data script to skip over columns from the column list that have no data in the data file.

'removequotes' value 'true': The first column ‘date’ has data surrounded by double quotes. For this data to be converted to an Oracle date type, these quotes need to be removed. Note that when using the type parameter for CSV files, as we did in the first example, this removequotes option is true by default.

'dateformat' value 'mm/dd/yy': If we expect a date column to be converted and stored into an Oracle date column (after removing the double quotes of course), we should provide the date column’s format. If we don’t provide a format, the date column will fall back to the database's default date format. You can see the dateformat documentation here.

'delimiter' value '\t': Fields in this file are tab delimited, so the delimiter we specify is the special tab character ‘\t’.

'recorddelimiter' value '''|''': Each record or row in our file is delimited by a pipe ‘|’ symbol, and so we specify this parameter, which separates out each row. Note that unlike the delimiter parameter, the recorddelimiter must be enclosed in single quotes as shown here. A nuance here is that the last row in your dataset doesn’t need the record delimiter when it is the default newline character; however, it does for other character record delimiters, to indicate the end of that row. Also note that since ADW is Linux/UNIX based, source data files created on Windows that use newline as the record delimiter must use “\r\n” as the format option. Both these nuances will likely have updated functionality in future releases.

'rejectlimit' value '1': We need this parameter here to fix an interesting problem. Unlike with the newline character, if we don’t specify a pipe record delimiter at the very end of the file, we get an error because the API doesn’t recognize where the last row’s last column ends. If we do specify the pipe record delimiter, however, the API expects a new line because the record has been delimited, and we get a null error for the last non-existent row. To fix situations like this, where we know we might have one or more problem rows, we use the reject limit parameter to allow some number of rows to be rejected. If we use ‘unlimited’ as our reject limit, then any number of rows may be rejected. The default reject limit is 0.

'compression' value 'gzip': Notice the .tsv file is zipped into a gzip “.gz” file, which we have used in the URL.
We use this parameter so the file will be unzipped appropriately before the table is created. As before, once this is successful, the table structure has been created, after which the data is loaded into the table from the data file in the object store. We can then proceed to query the table in our Data Warehouse.

Field Lists - For More Granular Parsing Options:

A more advanced feature of DBMS_CLOUD.COPY_DATA is the field_list parameter, which borrows its feature set from the field_list parameter of the Oracle Loader access driver. This parameter allows you to specify more granular information about the fields being loaded. For example, let’s use “Charlotte_NC_Weather_History_Double_Dates.csv” from the list of files in our object store. This file is similar to our first CSV example, except it has a copy of the date column in a different date format. Now, if we were to specify a date format in the format parameter, it would apply universally to all date columns. With the field_list parameter, we can specify two different date formats for the two date columns. We do need to list all the columns and their types when including the field_list; not mentioning any type parameters simply uses the default VARCHAR2 values.

create table WEATHER_REPORT_DOUBLE_DATE (REPORT_DATE VARCHAR2(20),
    REPORT_DATE_COPY DATE,
    ACTUAL_MEAN_TEMP NUMBER,
    ACTUAL_MIN_TEMP NUMBER,
    ACTUAL_MAX_TEMP NUMBER,
    AVERAGE_MIN_TEMP NUMBER,
    AVERAGE_MAX_TEMP NUMBER,
    AVERAGE_PRECIPITATION NUMBER(5,2));

begin
 DBMS_CLOUD.COPY_DATA(
    table_name =>'WEATHER_REPORT_DOUBLE_DATE',
    credential_name =>'OBJ_STORE_CRED',
    file_uri_list =>'&base_URL/Charlotte_NC_Weather_History_Double_Dates.csv',
    format => json_object('type' value 'csv',  'skipheaders' value '1'),
    field_list => 'REPORT_DATE DATE ''mm/dd/yy'',
                   REPORT_DATE_COPY DATE ''yyyy-mm-dd'',
                   ACTUAL_MEAN_TEMP,
                   ACTUAL_MIN_TEMP,
                   ACTUAL_MAX_TEMP,
                   AVERAGE_MIN_TEMP,
                   AVERAGE_MAX_TEMP,
                   AVERAGE_PRECIPITATION'
 );
end;
/

SELECT * FROM WEATHER_REPORT_DOUBLE_DATE where actual_mean_temp > 69 and actual_mean_temp < 74;

It's important to recognize that the date format parameters are there to provide the API with the information to read the data file. The output format from your query will be your database default (based on your NLS parameters). This can also be formatted in your query using TO_CHAR.

JSON Files

You may be familiar with JSON files for unstructured and semi-structured data. The "PurchaseOrders.txt" file contains JSON Purchase Order data, which when parsed and formatted looks like the following. Using JSON data in an ADW instance can be as simple as putting each JSON document into a table row as a BLOB, and using the powerful, native JSON features that the Oracle Database provides to parse and query it. You can also view the JSON documentation for additional features here. Let’s try this!
Copy and run the script below:

CREATE TABLE JSON_DUMP_FILE_CONTENTS (json_document blob);

begin
 DBMS_CLOUD.COPY_DATA(
   table_name =>'JSON_DUMP_FILE_CONTENTS',
   credential_name =>'OBJ_STORE_CRED',
   file_uri_list =>'&base_URL/PurchaseOrders.dmp',
   field_list => 'json_document CHAR(5000)'
);
end;
/

COLUMN Requestor FORMAT A30
COLUMN ShippingInstructions FORMAT A30

SELECT JSON_VALUE(json_document,'$.Requestor') as Requestor,
       JSON_VALUE(json_document,'$.ShippingInstructions.Address.city') as ShippingInstructions
FROM JSON_DUMP_FILE_CONTENTS where rownum < 50;

The query above lists all the PO requestors and the city where their shipment is to be delivered. Here, we have simply created one column ‘json_document’ in the table ‘JSON_DUMP_FILE_CONTENTS’. We do not incur the time it takes to validate these JSON documents up front, and are instead directly querying the table using the database’s JSON_VALUE feature. This means the check for well-formed JSON data happens on the fly, and a query would fail unless you properly skip over the failing data. Here, COPY_DATA does not check for valid JSON data; it simply checks that the data is of the correct native datatype (less than 5000 characters long), which is the datatype of the table’s column.

For better performance on large JSON data files, using this ADW table we can also make use of the database’s JSON features to parse and insert the JSON data into a new table ‘j_purchaseorder’ ahead of time, as below. Note that this insert statement actually brings the data into your ADW instance. You benefit from doing this because it checks that your JSON data is well-formed and valid ahead of time, and you therefore incur less of a performance impact when you query this JSON data from your ADW instance.

CREATE TABLE j_purchaseorder
 (id          VARCHAR2 (32) NOT NULL,
  date_loaded TIMESTAMP (6) WITH TIME ZONE,
  po_document BLOB
  CONSTRAINT ensure_json CHECK (po_document IS JSON));

INSERT INTO j_purchaseorder (id, date_loaded, po_document)
SELECT SYS_GUID(), SYSTIMESTAMP, json_document
FROM json_dump_file_contents
   WHERE json_document IS JSON;

We can now query down JSON paths using the JSON simplified syntax, as with the following query:

SELECT po.po_document.Requestor,
       po.po_document.ShippingInstructions.Address.city
       FROM j_purchaseorder po;

Beyond Copying Data into your Autonomous Data Warehouse

Here, we've gone through simple examples of how to copy your Oracle object store data into your Autonomous Data Warehouse instance. In following posts, we will walk through more ways you might use to load your data, from on-premises or cloud-based storage, as well as more detail on how you might troubleshoot any data loading errors you may encounter. See you in the next one!
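P.S. If you prefer a relational projection of your JSON documents over the dot notation, the database's JSON_TABLE function can flatten them in a single query. A minimal sketch over the j_purchaseorder table created above, using only the two fields already referenced in this post (the column sizes are arbitrary):

SELECT jt.requestor, jt.ship_to_city
FROM   j_purchaseorder po,
       JSON_TABLE(po.po_document, '$'
         COLUMNS (requestor    VARCHAR2(50) PATH '$.Requestor',
                  ship_to_city VARCHAR2(50) PATH '$.ShippingInstructions.Address.city')) jt;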


Autonomous

Oracle Autonomous Databases - Accessing Apache Avro Files

Apache Avro is a common data format in big data solutions.  Now, these types of files are easily accessible to Oracle Autonomous Databases.  One of Avro's key benefits is that it enables efficient data exchange between applications and services.  Data storage is compact and efficient – and the file format itself supports schema evolution.  It does this by including the schema within each file – with an explanation of the characteristics of each field. In a previous post about Autonomous Data Warehouse and access parquet, we talked about using a utility called parquet-tools to review parquet files.  A similar tool – avro-tools – is available for avro files.  Using avro-tools, you can create avro files, extract the schema from a file, convert an avro file to json, and much more (check out the Apache Avro home for details).  A schema file is used to create the avro files.  This schema file describes the fields, data types and default values.  The schema becomes part of the generated avro file – which allows applications to read the file and understand its contents.  Autonomous Database uses this schema to automate table creation.  Similar to parquet sources, Autonomous Database will read the schema to create the columns with the appropriate Oracle Database data types.  Avro files may include complex types – like arrays, structs, maps and more; Autonomous Database supports Avro files that contain Oracle data types. Let’s take a look at an example.  Below, we have a file - movie.avro - that contains information about movies (thanks to Wikipedia for providing info about the movies).  We’ll use the avro-tools utility to extract the schema: $ avro-tools getschema movie.avro {   "type" : "record",   "name" : "Movie",   "namespace" : "oracle.avro",   "fields" : [ {     "name" : "movie_id",     "type" : "int",     "default" : 0   }, {     "name" : "title",     "type" : "string",     "default" : ""   }, {     "name" : "year",     "type" : "int",     "default" : 0   }, {     "name" : "budget",     "type" : "int",     "default" : 0   }, {     "name" : "gross",     "type" : "double",     "default" : 0   }, {     "name" : "plot_summary",     "type" : "string",     "default" : ""   } ] } The schema is in an easy to read JSON format.  Here, we have movie_id, , title, year, budget, gross and plot_summary columns. The data has been loaded into an Oracle Object Store bucket called movies.  The process for making this data available to ADW is identical to the steps for parquet – so check out that post for details.  At a high level, you will: Create a credential that is used to authorize access to the object store bucket Create an external table using dbms_cloud.create_external_table. Query the data! 1.  Create the credential begin   DBMS_CLOUD.create_credential (     credential_name => 'OBJ_STORE_CRED',     username => '<user>',     password => '<password>'   ) ; end; / 2.  Create the table begin     dbms_cloud.create_external_table (     table_name =>'movies_ext',     credential_name =>'OBJ_STORE_CRED',     file_uri_list =>'https://objectstorage.ca-toronto-1.oraclecloud.com/n/<tenancy>/b/<bucket>/o/*',     format =>  '{"type":"avro",  "schema": "first"}'     ); end; / Things got a little easier when specifying the URI list.  
Instead of transforming the URI into a specific format, you can use the same path that is found in the OCI object browser: Object Details -> URL Path (accessed from the OCI Console: Oracle Cloud -> Object Storage – “movies” bucket). This URL Path was specified in the file_uri_list parameter – although a wildcard was used instead of the specific file name. Now that the table is created, we can look at its description:
SQL> desc movies_ext
Name         Null? Type
------------ ----- --------------
MOVIE_ID           NUMBER(10)
TITLE              VARCHAR2(4000)
YEAR               NUMBER(10)
BUDGET             NUMBER(10)
GROSS              BINARY_DOUBLE
PLOT_SUMMARY       VARCHAR2(4000)
And run queries against that table:
SQL> select title, year, budget, plot_summary from movies_ext where title like 'High%';
All of this is very similar to Parquet – especially from a usability standpoint. With both file formats, the metadata in the file is used to automate the creation of tables. However, there is a significant difference when it comes to processing the data. Parquet is a columnar format that has been optimized for queries; column projection and predicate pushdown are used to enhance performance by minimizing the amount of data that is scanned and subsequently transferred from the object store to the database. The same is not true for Avro: the entire file needs to be scanned and processed. So, if you will be querying this data frequently, consider alternative storage options and use tools like dbms_cloud.copy_data to easily load the data into Autonomous Database.
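As a simple illustration of that advice, once the movies_ext external table from the example above exists, one way to materialize its contents inside the database is a plain CREATE TABLE AS SELECT; this is just a sketch (the table name movies is an arbitrary choice), and dbms_cloud.copy_data into a pre-created table is an equally valid route:

create table movies as
  select * from movies_ext;

From then on, frequent queries can run against the internal movies table, and only periodic refreshes need to touch the Avro files in the object store.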


Autonomous

How To Update The License Type For Your Autonomous Database

As of today (Wednesday April 17) you can now quickly and easily change the type of licensing for your Autonomous Database from BYOL to a new cloud subscription, or vice versa. It’s as easy as 1-2-3. So assuming you have already created an autonomous database instance, how do you go about changing your licensing? Let me show you!

Step 1 - Accessing the management console
The first stage involves signing in to your cloud account using your tenancy name and cloud account. Then you can navigate to either your “Autonomous Data Warehouse” or “Autonomous Transaction Processing” landing pad as shown below. Let’s now change the type of license for the autonomous database instance “pp1atpsep2”. If we click on the blue text of the instance name, which is in the first column of the table, this will take us to the instance management console page as shown below. Notice in the above image that, on the right-hand side, the console shows the current license type as set to "Bring Your Own License", which is often referred to as a BYOL license.

Step 2 - Selecting “Update License Type” from the Actions menu
Now click on the “Actions” button in the row of menu buttons as shown below:

Step 3 - Change the “License Type”
The pop-up form shows the current type of license associated with our autonomous database instance “pp1atpsep2”, which in this case is set to BYOL. If you want more information about what is and is not covered within a BYOL license then visit the BYOL FAQ page which is here. In this case we are going to flip to using a new cloud subscription, as shown below. That's it! All that's left to do is click on the blue Update button and the new licensing model will be applied to our autonomous database instance. At this point your autonomous database will switch into “Updating” mode, as shown below. However, the database is still up and accessible. There is no downtime. When the update is complete the status will return to “Available” and the console will show that the license type has changed to "License Included" as shown below.

Summary
Congratulations, you have successfully swapped your BYOL license for a new cloud subscription license for your autonomous database with absolutely no downtime or impact on your users. In this post I have shown you how to quickly and easily change the type of license associated with your autonomous database. An animation of the complete end-to-end process is shown below:

Featured Neon "Change" image courtesy of Wikipedia


Hadoop Best Practices

Big Data. See How Easily You Can Do Disaster Recovery

Earlier I've written about various aspects of Big Data High Availability, and I intentionally avoided the Disaster Recovery topic. High Availability answers the question of how the system should keep running if one component (like a Name Node or KDC) fails within one system (like one Hadoop cluster); Disaster Recovery answers the question of what to do if the entire system fails (the Hadoop cluster or even the whole data center goes down). In this blog I would also like to talk about backup and how to deal with human mistakes (these are not strictly DR topics, but quite close). Also, I'd like to introduce a few terms. From Wikipedia:

Business Continuity: Involves keeping all essential aspects of a business functioning despite significant disruptive events.
Disaster Recovery (DR): A set of policies and procedures to enable the recovery or continuation of vital technology infrastructure and systems following a disaster.

Step 1. Protect the system from human errors. HDFS snapshots.
HDFS snapshot functionality has been in the Hadoop portfolio for a while. It is a great way to protect the system from human mistakes. There are a few simple steps to enable it (you can find the full snapshot documentation here):
- go to Cloudera Manager and drill down into the HDFS service:
- then go to the "File Browser" and navigate to the directory which you would like to protect with snapshots
- click on the "Enable Snapshots" button: as soon as the command finishes, you have a directory protected by snapshots!
You may take snapshots on demand, or you may create a snapshot policy which will be repeated periodically (recommended). In order to make that work you have to go to Cloudera Manager -> Backup -> Snapshot Policies:
- Click on "Create Policy" (Note: you have to enable snapshots for a directory before creating a policy)
- and fill in the form: easy, but very powerful.
It's a good time for a demo. Let's imagine that we have a directory with critical datasets on HDFS:
[root@destination_cluster15 ~]# hadoop fs -ls /tmp/snapshot_demo/
Found 2 items
drwxr-xr-x   - bdruser supergroup          0 2019-02-13 14:12 /tmp/snapshot_demo/dir1
drwxr-xr-x   - bdruser supergroup          0 2019-02-13 14:12 /tmp/snapshot_demo/dir2
then a user accidentally deletes one of the directories:
[root@destination_cluster15 ~]# hadoop fs -rm -r -skipTrash /tmp/snapshot_demo/dir1
Deleted /tmp/snapshot_demo/dir1
[root@destination_cluster15 ~]# hadoop fs -ls /tmp/snapshot_demo/
Found 1 items
drwxr-xr-x   - bdruser supergroup          0 2019-02-13 14:12 /tmp/snapshot_demo/dir2
fortunately, it's quite easy to restore the state of this directory using snapshots:
- go to Cloudera Manager -> HDFS -> File Browser
- choose the option "Restore from snapshot":
- choose the appropriate snapshot and click "Restore":
- check what you have:
[root@destination_cluster15 ~]# hadoop fs -ls /tmp/snapshot_demo/
Found 2 items
drwxr-xr-x   - hdfs supergroup          0 2019-02-13 14:32 /tmp/snapshot_demo/dir1
drwxr-xr-x   - hdfs supergroup          0 2019-02-13 14:32 /tmp/snapshot_demo/dir2
Note: a snapshot restore returns you to the state at the point when the snapshot was taken.
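As an aside, everything the Cloudera Manager wizard does here can also be driven with plain HDFS commands. A minimal sketch, assuming the same /tmp/snapshot_demo directory and an HDFS superuser session (the snapshot name demo-snap1 is just an illustration):

# allow snapshots on the directory (the CLI equivalent of "Enable Snapshots")
hdfs dfsadmin -allowSnapshot /tmp/snapshot_demo
# take a snapshot on demand
hdfs dfs -createSnapshot /tmp/snapshot_demo demo-snap1
# list the snapshots that exist for this directory
hdfs dfs -ls /tmp/snapshot_demo/.snapshot
# bring a deleted subdirectory back by copying it out of the snapshot
hdfs dfs -cp /tmp/snapshot_demo/.snapshot/demo-snap1/dir1 /tmp/snapshot_demo/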
To illustrate that note: if you add a directory after taking a snapshot and then restore from that snapshot, the newly created directory will not be there:
[root@destination_cluster15 ~]# hadoop fs -mkdir /tmp/snapshot_demo/dir3
[root@destination_cluster15 ~]# hadoop fs -ls /tmp/snapshot_demo/
Found 3 items
drwxr-xr-x   - hdfs    supergroup          0 2019-02-13 14:32 /tmp/snapshot_demo/dir1
drwxr-xr-x   - hdfs    supergroup          0 2019-02-13 14:32 /tmp/snapshot_demo/dir2
drwxr-xr-x   - bdruser supergroup          0 2019-02-13 14:35 /tmp/snapshot_demo/dir3
and restore from the earlier snapshot; after the recovery is done:
[root@destination_cluster15 ~]# hadoop fs -ls /tmp/snapshot_demo/
Found 2 items
drwxr-xr-x   - hdfs supergroup          0 2019-02-13 14:36 /tmp/snapshot_demo/dir1
drwxr-xr-x   - hdfs supergroup          0 2019-02-13 14:36 /tmp/snapshot_demo/dir2
I have only two directories.
Another common case is when a user changes file permissions or the file owner by accident and wants to revert it:
[root@destination_cluster15 ~]# hadoop fs -ls /tmp/snapshot_demo/
Found 2 items
drwxr-xr-x   - hdfs supergroup          0 2019-02-13 14:36 /tmp/snapshot_demo/dir1
drwxr-xr-x   - hdfs supergroup          0 2019-02-13 14:36 /tmp/snapshot_demo/dir2
[root@destination_cluster15 ~]# hadoop fs -chown yarn:yarn /tmp/snapshot_demo/*
[root@destination_cluster15 ~]# hadoop fs -ls /tmp/snapshot_demo/
Found 2 items
drwxr-xr-x   - yarn yarn          0 2019-02-13 14:36 /tmp/snapshot_demo/dir1
drwxr-xr-x   - yarn yarn          0 2019-02-13 14:36 /tmp/snapshot_demo/dir2
restore from the snapshot and the previous file owner is back:
[root@destination_cluster15 ~]# hadoop fs -ls /tmp/snapshot_demo/
Found 2 items
drwxr-xr-x   - hdfs supergroup          0 2019-02-13 14:38 /tmp/snapshot_demo/dir1
drwxr-xr-x   - hdfs supergroup          0 2019-02-13 14:38 /tmp/snapshot_demo/dir2
Conclusion: snapshots are a very powerful tool for protecting your file system from human mistakes. A snapshot stores only the delta (changes), which means it will not consume much space if you don't delete data frequently.

Step 2.1. Backup data. On-premise backup. NFS Storage.
Backups in the Hadoop world are a ticklish topic. The reason is time to recovery: how long will it take to bring the data back into the production system? Big Data systems tend to be not so expensive and hold massive datasets, so it may be easier to have a second cluster (details on how to do this come later in this blog). But if you have reasons to do backups, you may consider either NFS storage (if you want to keep the backup on-premise in your own data center) or Object Storage in Oracle Cloud Infrastructure (OCI) (if you want to keep the backup outside of your data center) as options. In the case of NFS storage (like Oracle ZFS), you have to mount the NFS storage at the same directory on every Hadoop node, like this:
Run on each BDA node:
[root]#  mount nfs_storage_ip:/stage/files /tmp/src_srv
Now you have shared storage on every server, which means that every single Linux server has the same directory. This allows you to run the distcp command (which was originally developed for copying large amounts of data between HDFS filesystems). To start a parallel copy, just run:
$ hadoop distcp -m 50 -atomic hdfs://nnode:8020/tmp/test_load/* file:///tmp/src_srv/files/;
This creates a MapReduce job that copies the data from HDFS to the shared NFS location with 50 mappers.

Step 2.2. Backup data. Cloud. Object Storage.
Object Storage is a key element for every cloud provider, and Oracle is no exception.
You can find the documentation for Oracle Object Storage here. Object Storage provides some benefits, such as:
- Elasticity. Customers don't have to plan ahead how much space they need. Need some extra space? Simply load data into Object Storage. There is no difference in the process whether you copy 1 GB or 1 PB of data.
- Scalability. It scales infinitely. At least theoretically :)
- Durability and Availability. Object Storage is a first-class citizen in every cloud story, so all vendors do their best to maintain 100% availability and durability. If some disk goes down, it shouldn't worry you. If some node running the object storage software goes down, it shouldn't worry you. As a user you just put data in and read data from the Object Store.
- Cost. In a cloud, Object Storage is the most cost-efficient solution.
Nothing comes for free, and as downsides I may highlight:
- Performance in comparison with HDFS or local block devices. Whenever you read data from the Object Store, you read it over the network.
- Inconsistency of performance. You are not alone on the object store, and under the hood it uses physical disks which have their own throughput. If many users start to read and write data to/from the Object Store, you may get performance that differs from what you used to get a day, week or month ago.
- Security. Unlike filesystems, Object Storage has no fine-grained permission policies, and customers will need to reorganize and rebuild their security standards and policies.
Before running a backup, you will need to configure OCI Object Storage in your Hadoop system. After you configure your object storage, you can check the bucket that you intend to copy to:
[root@source_clusternode01 ~]# hadoop fs -ls oci://BDAx7Backup@oraclebigdatadb/
Now you can trigger the actual copy by running either distcp:
[root@source_clusternode01 ~]# hadoop distcp -Dmapred.job.queue.name=root.oci -Dmapreduce.task.timeout=6000000 -m 240 -skipcrccheck -update -bandwidth 10240 -numListstatusThreads 40 /user/hive/warehouse/parq.db/store_returns oci://BDAx7Backup@oraclebigdatadb/
or ODCP, the Oracle-built tool (you can find more info about ODCP here):
[root@source_clusternode01 ~]# odcp --executor-cores 3 --executor-memory 9 --num-executors 100  hdfs:///user/hive/warehouse/parq.db/store_sales oci://BDAx7Backup@oraclebigdatadb/
and after the copy is done, you will be able to see all your data in the OCI Object Storage bucket.

Step 2.3. Backup data. Big Data Appliance metadata.
This is the easiest section for me, because Oracle support engineers have made a huge effort writing a support note which tells customers how to take backups of the metadata. For more details please refer to: How to Backup Critical Metadata on Oracle Big Data Appliance Prior to Upgrade V2.3.1 and Higher Releases (Doc ID 1623304.1).

Step 2.4. Backup data. MySQL
Separately, I'd like to mention that a MySQL backup is very important, and you can get familiar with it here: How to Redirect a MySQL Database Backup on BDA Node 3 to a Different Node on the BDA or on a Client/Edge Server (Doc ID 1926159.1).

Step 3. HDFS Disaster Recovery. Recommended Architecture
Here I'd like to share the Oracle recommended architecture for a Disaster Recovery setup. We recommend having the same hardware and software environment for the Production and DR environments. If you want to have less powerful nodes on the DR side, you should ask yourself: what are you going to do in case of a disaster? What is going to happen when you switch all production applications to the DR side?
Will it be capable of handling that workload? Also, one very straightforward recommendation from Oracle is to have a small BDA (3-6 nodes) in order to perform tests on it. Here is a rough separation of duties for these three clusters:
Production (Prod):
- Running the production workload
Disaster Recovery (DR):
- Same (or almost the same) BDA hardware configuration
- Run non-critical ad-hoc queries
- Switch over in case of unplanned (disaster) or planned (upgrade) outages of prod
Test: Use a small BDA cluster (3-6 nodes) to test different things, such as:
- Upgrades
- Changing settings (HDFS, YARN)
- Testing new engines (add and test HBase, Flume…)
- Testing integration with other systems (AD, Database)
- Testing Dynamic Resource Pools
…
Note: for the test environment you may also consider the Oracle Cloud offering.

Step 3.1 HDFS Disaster Recovery. Big Data Disaster Recovery (BDR).
Now we are approaching the most interesting part of the blog: disaster recovery. Cloudera offers a tool out of the box called Big Data Disaster Recovery (BDR), which allows Hadoop administrators to easily create replication policies and schedule data replication using a web interface. Let me show an example of how to do this replication with BDR. I have two BDA clusters, source_cluster and destination_cluster. Under the hood BDR uses a special version of distcp, which has many performance and functional optimizations.

Step 3.1.1 HDFS Disaster Recovery. Big Data Disaster Recovery (BDR). Network throughput
It's crucial to understand the network throughput between the clusters. To measure network throughput you may use any tool convenient for you; I personally prefer iperf. Note: iperf has two modes - UDP and TCP. Use TCP to make measurements between servers in the context of BDR, because BDR uses TCP connections. After installation it's quite easy to run. As a trial, make one machine (let's say on the destination cluster) the server and run iperf in listening mode:
[root@destination_cluster15 ~]# iperf -s
on the source machine run the client command, which will send TCP traffic to the server machine for 10 minutes at maximum bandwidth:
[root@source_clusternode01 ~]# iperf -c destination_cluster15 -b 10000m -t 600
After getting these numbers you will understand what you can count on when you run the copy command.

Step 3.1.2 HDFS Disaster Recovery. Big Data Disaster Recovery (BDR). Ports.
It's quite typical for different Hadoop clusters to be in different data centers, separated by firewalls. Before you start running BDR jobs, please make sure that all necessary ports are open from both sides.

Step 3.1.3 HDFS Disaster Recovery. Big Data Disaster Recovery (BDR). Kerberos Naming Recommendations.
We (as well as Cloudera) generally recommend using different KDC realms, with trust between them, and different realm names for each cluster. All user principals obtain their credentials from AD, while the MIT KDC stores the service principals. You can find more details on BDA security good practices here. I'll assume that both clusters are kerberized (Kerberos is almost the default nowadays), so we will need to do some configuration around this. You can find detailed steps on how to set up Kerberos for two clusters which use different KDCs here, and if you want to know how to set up trusted relationships between clusters you can refer here. I just briefly want to highlight the most important steps.
1) On the source cluster go to "Administration -> Settings", search for "KDC Server Host" and set the hostname for the source KDC. Do the same for "KDC Admin Server Host".
It's important because when destination cluster comes to the source and ask for KDC it does not read the /etc/krb5.conf as you may think. It read KDC address from this property. 2) Both clusters are in the same domain. It's quite probable and quite often case. You may conclude this by seen follow error message: "Peer cluster has domain us.oracle.com and realm ORACLE.TEST but a mapping already exists for this domain us.oracle.com with realm US.ORACLE.COM. Please use hostname(s) instead of domain(s) for realms US.ORACLE.COM and ORACLE.TEST, so there are no conflicting domain to realm mapping." it's easy to fix by adding in /etc/krb5.conf exact names of the hosts under "domain_realm" section: [domain_realm] source_clusternode01.us.oracle.com = ORACLE.TEST source_clusternode02.us.oracle.com = ORACLE.TEST source_clusternode03.us.oracle.com = ORACLE.TEST source_clusternode04.us.oracle.com = ORACLE.TEST source_clusternode05.us.oracle.com = ORACLE.TEST source_clusternode06.us.oracle.com = ORACLE.TEST destination_cluster13.us.oracle.com = US.ORACLE.COM destination_cluster14.us.oracle.com = US.ORACLE.COM destination_cluster13.us.oracle.com = US.ORACLE.COM .us.oracle.com = US.ORACLE.COM us.oracle.com = US.ORACLE.COM   Note: here you do Host-Realm mapping, because I have two different REALMs and two different KDCs, but only one domain. In case if I'll use any host outside of the given list, I need to specify default realm for the domain (last two rows)   3) at a destination cluster add REALM for the source cluster in /etc/krb5.conf: [root@destination_cluster13 ~]# cat /etc/krb5.conf ... [realms]  US.ORACLE.COM = {   kdc = destination_cluster13.us.oracle.com:88   kdc = destination_cluster14.us.oracle.com:88   admin_server = destination_cluster13.us.oracle.com:749   default_domain = us.oracle.com  } ORACLE.TEST = { kdc = source_clusternode01.us.oracle.com admin_server = source_clusternode01.us.oracle.com default_domain = us.oracle.com } ... try to obtain credentials and explore source Cluster HDFS: [root@destination_cluster13 ~]# kinit oracle@ORACLE.TEST Password for oracle@ORACLE.TEST:  [root@destination_cluster13 ~]# klist  Ticket cache: FILE:/tmp/krb5cc_0 Default principal: oracle@ORACLE.TEST   Valid starting     Expires            Service principal 02/04/19 22:47:42  02/05/19 22:47:42  krbtgt/ORACLE.TEST@ORACLE.TEST     renew until 02/11/19 22:47:42 [root@destination_cluster13 ~]# hadoop fs -ls hdfs://source_clusternode01:8020 19/02/04 22:47:54 WARN security.UserGroupInformation: PriviledgedActionException as:oracle@ORACLE.TEST (auth:KERBEROS) cause:javax.security.sasl.SaslException: GSS initiate failed [Caused by GSSException: No valid credentials provided (Mechanism level: Fail to create credential. (63) - No service creds)]   it fails, but it's not a big surprise - Cluster don't have trusted relationships. let's fix this.  Note: Big Data Disaster Recovery doesn't require trusted kerberos relationships between clusters (distcp does), but in order to make it easier to debug and to some other operation activities, I'd recommend to add it on. On the destination cluster: [root@destination_cluster13 ~]# kadmin.local  kadmin.local:  addprinc krbtgt/ORACLE.TEST@US.ORACLE.COM WARNING: no policy specified for krbtgt/ORACLE.TEST@US.ORACLE.COM; defaulting to no policy Enter password for principal "krbtgt/ORACLE.TEST@US.ORACLE.COM":  Re-enter password for principal "krbtgt/ORACLE.TEST@US.ORACLE.COM":  Principal "krbtgt/ORACLE.TEST@US.ORACLE.COM" created. 
on the source Cluster: [root@source_clusternode01 ~]# kadmin.local  kadmin.local:  addprinc krbtgt/ORACLE.TEST@US.ORACLE.COM WARNING: no policy specified for krbtgt/ORACLE.TEST@US.ORACLE.COM; defaulting to no policy Enter password for principal "krbtgt/ORACLE.TEST@US.ORACLE.COM":  Re-enter password for principal "krbtgt/ORACLE.TEST@US.ORACLE.COM":  Principal "krbtgt/ORACLE.TEST@US.ORACLE.COM" created.   make sure that you create same user within the same passwords on both KDCs. try to explore destination's HDFS: [root@destination_cluster13 ~]# hadoop fs -ls hdfs://source_clusternode01:8020 Found 4 items drwx------   - hbase hbase               0 2019-02-04 22:34 /hbase drwxr-xr-x   - hdfs  supergroup          0 2018-03-14 06:46 /sfmta drwxrwxrwx   - hdfs  supergroup          0 2018-10-31 15:41 /tmp drwxr-xr-x   - hdfs  supergroup          0 2019-01-07 09:30 /user Bingo! it works. Now we have to do the same on both clusters to allow reverse direction: 19/02/05 02:02:02 INFO util.KerberosName: No auth_to_local rules applied to oracle@US.ORACLE.COM 19/02/05 02:02:03 WARN security.UserGroupInformation: PriviledgedActionException as:oracle@US.ORACLE.COM (auth:KERBEROS) cause:javax.security.sasl.SaslException: GSS initiate failed [Caused by GSSException: No valid credentials provided (Mechanism level: Fail to create credential. (63) - No service creds)] same error and same fix for it. I just simply automate this by running follow commands on the both KDCs: delprinc -force krbtgt/US.ORACLE.COM@ORACLE.TEST delprinc -force krbtgt/ORACLE.TEST@US.ORACLE.COM addprinc -pw "welcome1" krbtgt/US.ORACLE.COM@ORACLE.TEST addprinc -pw "welcome1" krbtgt/ORACLE.TEST@US.ORACLE.COM   4) make sure that in /var/kerberos/krb5kdc/kdc.conf you have: default_principal_flags = +renewable, +forwardable   Step 3.1.4 HDFS Disaster Recovery. Big Data Disaster Recovery (BDR). SSL The next assumption is that your's Cloudera manager is working over encrypted channel and if you will try to do add source peer, most probably, you'll get an exception:   in order to fix this: a. 
Check certificate for Cloudera Manager (run this command on the destination cluster): [root@destination_cluster13 ~]# openssl s_client -connect source_clusternode05.us.oracle.com:7183 CONNECTED(00000003) depth=0 C = , ST = , L = , O = , OU = , CN = source_clusternode05.us.oracle.com verify error:num=18:self signed certificate verify return:1 depth=0 C = , ST = , L = , O = , OU = , CN = source_clusternode05.us.oracle.com verify return:1 --- Certificate chain  0 s:/C=/ST=/L=/O=/OU=/CN=source_clusternode05.us.oracle.com    i:/C=/ST=/L=/O=/OU=/CN=source_clusternode05.us.oracle.com --- Server certificate -----BEGIN CERTIFICATE----- MIIDYTCCAkmgAwIBAgIEP5N+XDANBgkqhkiG9w0BAQsFADBhMQkwBwYDVQQGEwAx CTAHBgNVBAgTADEJMAcGA1UEBxMAMQkwBwYDVQQKEwAxCTAHBgNVBAsTADEoMCYG A1UEAxMfYmRheDcyYnVyMDlub2RlMDUudXMub3JhY2xlLmNvbTAeFw0xODA3MTYw MzEwNDVaFw0zODA0MDIwMzEwNDVaMGExCTAHBgNVBAYTADEJMAcGA1UECBMAMQkw BwYDVQQHEwAxCTAHBgNVBAoTADEJMAcGA1UECxMAMSgwJgYDVQQDEx9iZGF4NzJi dXIwOW5vZGUwNS51cy5vcmFjbGUuY29tMIIBIjANBgkqhkiG9w0BAQEFAAOCAQ8A MIIBCgKCAQEAkLwi9lAsbiWPVUpQNAjtGE5Z3pJOExtJMSuvnj02FC6tq6I09iJ0 MsTu6+Keowv5CUlhfxTy1FD19ZhX3G7OEynhlnnhJ+yjprYzwRDhMHUg1LtqWib/ osHR1QfcDfLsByBKO0WsLBxCz/+OVm8ZR+KV/AeZ5UcIsvzIRZB4V5tWP9jziha4 3upQ7BpSvQhd++eFb4wgtiBsI8X70099ZI8ctFpmPjxtYHQSGRGdoZZJnHtPY4IL Vp0088p+HeLMcanxW7CSkBZFn9nHgC5Qa7kmLN4EHhjwVfPCD+luR/k8itH2JFw0 Ub+lCOjSSMpERlLL8fCnETBc2nWCHNQqzwIDAQABoyEwHzAdBgNVHQ4EFgQUkhJo 0ejCveCcbdoW4+nNX8DjdX8wDQYJKoZIhvcNAQELBQADggEBAHPBse45lW7TwSTq Lj05YwrRsKROFGcybpmIlUssFMxoojys2a6sLYrPJIZ1ucTrVNDspUZDm3WL6eHC HF7AOiX4/4bQZv4bCbKqj4rkSDmt39BV+QnuXzRDzqAxad+Me51tisaVuJhRiZkt AkOQfAo1WYvPpD6fnsNU24Tt9OZ7HMCspMZtYYV/aw9YdX614dI+mj2yniYRNR0q zsOmQNJTu4b+vO+0vgzoqtMqNVV8Jc26M5h/ggXVzQ/nf3fmP4f8I018TgYJ5rXx Kurb5CL4cg5DuZnQ4zFiTtPn3q5+3NTWx4A58GJKcJMHe/UhdcNvKLA1aPFZfkIO /RCqvkY= -----END CERTIFICATE----- b. Go to the source cluster and find a file which has this certificate (run this command on the source cluster): [root@source_clusternode05 ~]# grep -iRl "MIIDYTCCAkmgAwIBAgIEP5N+XDANBgkqhkiG9w0BAQsFADBhMQkwBwYDVQQGEwAx" /opt/cloudera/security/|grep "cert" /opt/cloudera/security/x509/node.cert /opt/cloudera/security/x509/ssl.cacerts.pem   c. Make sure that each node has different certificate by calculating the hash (run this command on the source cluster): [root@source_clusternode05 ~]# dcli -C "md5sum /opt/cloudera/security/x509/node.cert" 192.168.8.170: cc68d7f5375e3346d312961684d728c0  /opt/cloudera/security/x509/node.cert 192.168.8.171: 9259bb0102a1775b164ce56cf438ed0e  /opt/cloudera/security/x509/node.cert 192.168.8.172: 496fd4e12bdbfc7c6aab35d970429a72  /opt/cloudera/security/x509/node.cert 192.168.8.173: 8637b8cfb5db843059c7a0aeb53071ec  /opt/cloudera/security/x509/node.cert 192.168.8.174: 4aabab50c256e3ed2f96f22a81bf13ca  /opt/cloudera/security/x509/node.cert 192.168.8.175: b50c2e40d04a026fad89da42bb2b7c6a  /opt/cloudera/security/x509/node.cert [root@source_clusternode05 ~]#    d. rename this certificates (run this command on the source cluster):   [root@source_clusternode05 ~]# dcli -C cp /opt/cloudera/security/x509/node.cert /opt/cloudera/security/x509/node_'`hostname`'.cert e. 
Check the new names (run this command on the source cluster): [root@source_clusternode05 ~]# dcli -C "ls /opt/cloudera/security/x509/node_*.cert" 192.168.8.170: /opt/cloudera/security/x509/node_source_clusternode01.us.oracle.com.cert 192.168.8.171: /opt/cloudera/security/x509/node_source_clusternode02.us.oracle.com.cert 192.168.8.172: /opt/cloudera/security/x509/node_source_clusternode03.us.oracle.com.cert 192.168.8.173: /opt/cloudera/security/x509/node_source_clusternode04.us.oracle.com.cert 192.168.8.174: /opt/cloudera/security/x509/node_source_clusternode05.us.oracle.com.cert 192.168.8.175: /opt/cloudera/security/x509/node_source_clusternode06.us.oracle.com.cert f. Pull those certificates from source cluster to one node of the destination cluster (run this command on the destination cluster): [root@destination_cluster13 ~]# for i in {1..6}; do export NODE_NAME=source_clusternode0$i.us.oracle.com; scp root@$NODE_NAME:/opt/cloudera/security/x509/node_$NODE_NAME.cert /opt/cloudera/security/jks/node_$NODE_NAME.cert; done; g. propagate it on the all nodes of the destination cluster (run this command on the destination cluster): [root@destination_cluster13 ~]# for i in {4..5}; do scp /opt/cloudera/security/jks/node_bda*.cert root@destination_cluster1$i:/opt/cloudera/security/jks; done;   h. on the destination host option truster password and trustore location (run this command on the destination cluster): [root@destination_cluster13 ~]# bdacli getinfo cluster_https_truststore_path Enter the admin user for CM (press enter for admin):  Enter the admin password for CM:  /opt/cloudera/security/jks/cdhs49.truststore   [root@destination_cluster13 ~]# bdacli getinfo cluster_https_truststore_password Enter the admin user for CM (press enter for admin):  Enter the admin password for CM:  dl126jfwt1XOGUlNz1jsAzmrn1ojSnymjn8WaA7emPlo5BnXuSCMtWmLdFZrLwJN i. and add them environment variables on all hosts of the destination cluster (run this command on the destination cluster): [root@destination_cluster13 ~]# export TRUSTORE_PASSWORD=dl126jfwt1XOGUlNz1jsAzmrn1ojSnymjn8WaA7emPlo5BnXuSCMtWmLdFZrLwJN [root@destination_cluster13 ~]# export TRUSTORE_FILE=/opt/cloudera/security/jks/cdhs49.truststore j. now we are ready to copy add certificates to the destination clusters trustore (run this command on the destination cluster) do this on all hosts of the destination cluster: [root@destination_cluster13 ~]# for i in {1..6}; do export NODE_NAME=export NODE_NAME=source_clusternode0$i.us.oracle.com; keytool -import -noprompt  -alias $NODE_NAME -file /opt/cloudera/security/jks/node_$NODE_NAME.cert -keystore $TRUSTORE_FILE -storepass $TRUSTORE_PASSWORD; done; Certificate was added to keystore Certificate was added to keystore Certificate was added to keystore Certificate was added to keystore Certificate was added to keystore   k. 
to validate that we add it, run (run this command on the destination cluster): [root@destination_cluster13 ~]# keytool -list -keystore $TRUSTORE_FILE -storepass $TRUSTORE_PASSWORD Keystore type: jks Keystore provider: SUN   Your keystore contains 9 entries   destination_cluster14.us.oracle.com, May 30, 2018, trustedCertEntry,  Certificate fingerprint (SHA1): B3:F9:70:30:77:DE:92:E0:A3:20:6E:B3:96:91:74:8E:A9:DC:DF:52 source_clusternode02.us.oracle.com, Feb 1, 2019, trustedCertEntry,  Certificate fingerprint (SHA1): 3F:6E:B9:34:E8:F9:0B:FF:CF:9A:4A:77:09:61:E9:07:BF:17:A0:F1 source_clusternode05.us.oracle.com, Feb 1, 2019, trustedCertEntry,  Certificate fingerprint (SHA1): C5:F0:DB:93:84:FA:7D:9C:B4:C9:24:19:6F:B3:08:13:DF:B9:D4:E6 destination_cluster15.us.oracle.com, May 30, 2018, trustedCertEntry,  Certificate fingerprint (SHA1): EC:42:B8:B0:3B:25:70:EF:EF:15:DD:E6:AA:5C:81:DF:FD:A2:EB:6C source_clusternode03.us.oracle.com, Feb 1, 2019, trustedCertEntry,  Certificate fingerprint (SHA1): 35:E1:07:F0:ED:D5:42:51:48:CB:91:D3:4B:9B:B0:EF:97:99:87:4F source_clusternode06.us.oracle.com, Feb 1, 2019, trustedCertEntry,  Certificate fingerprint (SHA1): 16:8E:DF:71:76:C8:F0:D3:E3:DF:DA:B2:EC:D5:66:83:83:F0:7D:97 destination_cluster13.us.oracle.com, May 30, 2018, trustedCertEntry,  Certificate fingerprint (SHA1): 76:C4:8E:82:3C:16:2D:7E:C9:39:64:F4:FC:B8:24:40:CD:08:F8:A9 source_clusternode01.us.oracle.com, Feb 1, 2019, trustedCertEntry,  Certificate fingerprint (SHA1): 26:89:C2:2B:E3:B8:8D:46:41:C6:C0:B6:52:D2:C4:B8:51:23:57:D2 source_clusternode04.us.oracle.com, Feb 1, 2019, trustedCertEntry,  Certificate fingerprint (SHA1): CB:98:23:1F:C0:65:7E:06:40:C4:0C:5E:C3:A9:78:F3:9D:E8:02:9E [root@destination_cluster13 ~]#  l. now do the same on the others node of destination cluster   Step 3.1.5 HDFS Disaster Recovery. Big Data Disaster Recovery (BDR). Create replication user On the destination cluster you will need to configure replication peer. [root@destination_cluster15 ~]# dcli -C "useradd bdruser -u 2000" [root@destination_cluster15 ~]# dcli -C "groupadd supergroup -g 2000" [root@destination_cluster15 ~]# dcli -C "usermod -g supergroup bdruser" and after this verify that this user belongs to the supergroup: [root@destination_cluster15 ~]# hdfs groups bdruser bdruser : supergroup Step 3.1.6 HDFS Disaster Recovery. Big Data Disaster Recovery (BDR). Create separate job for encrypted zones It's possible to copy data from encrypted zone, but there is the trick with it. If you will try to do this, you will find the error in the BDR logs: java.io.IOException: Checksum mismatch between hdfs://distcpSourceNS/tmp/EZ/parq.db/customer/000001_0 and hdfs://cdhs49-ns/tmp/EZ/parq.db/customer/.distcp.tmp.4101922333172283041 Fortunately, this problem could easily be solved. You just need to skip calculating checksums for Encrypted Zones: This is a good practice to create separate Job to copy data from encrypted zones and exclude directories with Encryption from general backup job. Example. You have some directory, which you want to exclude (/tmp/excltest/bad) from common copy job. For do this, you need go to "Advanced" settings and add "Path Exclusion": In my example you need to put .*\/tmp\/excltest\/bad+.* you may this regexp it by creating follow directory structure and add Path Exclusion. 
Example: you have a directory (/tmp/excltest/bad) that you want to exclude from the common copy job. To do this, go to the "Advanced" settings and add a "Path Exclusion". In my example you need to put .*\/tmp\/excltest\/bad+.* and you may test this regexp by creating the following directory structure and adding the Path Exclusion:
[root@source_clusternode05 ~]# hadoop fs -mkdir /tmp/excltest/
[root@source_clusternode05 ~]# hadoop fs -mkdir /tmp/excltest/good1
[root@source_clusternode05 ~]# hadoop fs -mkdir /tmp/excltest/good2
[root@source_clusternode05 ~]# hadoop fs -mkdir /tmp/excltest/bad
Note: it may be quite hard to create and validate the regular expression (this is Java), so you may want to use an on-line resource for this purpose.
Step 3.1.7 HDFS Disaster Recovery. Big Data Disaster Recovery (BDR). Enable Snapshots on the Source Cluster
Replication without snapshots may fail. When snapshots are enabled, distcp automatically creates one before copying. Some replications, especially those that require a long time to finish, can fail because source files are modified during the replication process. You can prevent such failures by using snapshots in conjunction with replication. This use of snapshots is automatic with CDH versions 5.0 and higher. To take advantage of this, you must enable snapshots for the relevant directories (also called making the directory snapshottable). When the replication job runs, it checks to see whether the specified source directory is snapshottable. Before replicating any files, the replication job creates point-in-time snapshots of these directories and uses them as the source for file copies.
What happens when you copy data without snapshots? Test case:
1) Start copying a decent amount of data
2) In the middle of the copy process, delete some of the source files
3) You get an error:
ERROR distcp.DistCp: Job failed to copy 443 files/dirs. Please check Copy Status.csv file or Error Status.csv file for error messages
INFO distcp.DistCp: Used diff: false
WARN distcp.SnapshotMgr: No snapshottable directories have found. Reason: either run-as-user does not have permissions to get snapshottable directories or source path is not snapshottable.
ERROR org.apache.hadoop.ipc.RemoteException(java.io.FileNotFoundException): File does not exist: /bdax7_4/store_sales/000003_0
To overcome this:
1) On the source, go to CM -> HDFS -> File Browser, pick the right directory and enable snapshots for it.
2) After this, when you run the job, it automatically takes a snapshot and copies from it.
3) Even if you delete data, your copy job will finish from the snapshot that it took. Note that you can't delete the entire directory, but you can delete all files from it:
[root@destination_cluster15 ~]# hadoop fs -rm -r -skipTrash /bdax7_4
rm: The directory /bdax7_4 cannot be deleted since /bdax7_4 is snapshottable and already has snapshots
[root@destination_cluster15 ~]# hadoop fs -rm -r -skipTrash /bdax7_4/*
Deleted /bdax7_4/new1.file
Deleted /bdax7_4/store_sales
Deleted /bdax7_4/test.file
[root@destination_cluster15 ~]# hadoop fs -ls /bdax7_4
[root@destination_cluster15 ~]#
4) The copy completes successfully!
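If you prefer the command line to the CM File Browser, the same directories can be made snapshottable with the standard HDFS snapshot commands; a small sketch, reusing the /bdax7_4 directory from the test case above and running as the hdfs superuser:
[root@source_clusternode05 ~]# sudo -u hdfs hdfs dfsadmin -allowSnapshot /bdax7_4
# optionally take a snapshot by hand and list the snapshottable directories
[root@source_clusternode05 ~]# sudo -u hdfs hdfs dfs -createSnapshot /bdax7_4 before_bdr_run
[root@source_clusternode05 ~]# sudo -u hdfs hdfs lsSnapshottableDir
Once the directory is snapshottable, the BDR job creates and uses its own snapshots automatically, as described above.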
Step 3.1.8 HDFS Disaster Recovery. Big Data Disaster Recovery (BDR). Rebalancing data
HDFS tries to keep data evenly distributed across all nodes in a cluster, but after intensive writes it may be useful to run a rebalance. The default rebalancing threshold is 10%, which is a bit high; it makes sense to change the "Rebalancing Threshold" from 10 to 2 (Cloudera Manager -> HDFS -> Instances -> Balancer -> Configuration). Also, in order to speed up rebalancing, we can increase the value of "dfs.datanode.balance.max.concurrent.moves" from 10 to 1000 (the number of block moves to permit in parallel). After making these changes, save them and run the rebalance.
On clusters that are heavily used for reading/writing/deleting HDFS data we may also hit an imbalance within one node (when data is unevenly distributed across the multiple disks of a single node). Here is the Cloudera blog about it. In short, we have to go to "Cloudera Manager -> HDFS -> Configuration -> HDFS Service Advanced Configuration Snippet (Safety Valve) for hdfs-site.xml" and add dfs.disk.balancer.enabled as the name and true as the value. Sometimes you may have a real data skew problem, which can easily be fixed by running a rebalance. Note: if you want to visualize data distribution, you could check this tool developed at CERN.
Step 3.1.9 HDFS Disaster Recovery. Big Data Disaster Recovery (BDR). Hive
There is also an option for copying Hive metadata as well as the actual data. Note: if you would like to copy a particular database schema you need to specify it in the copy wizard, as well as a regular expression for which tables you want to copy ([\w].+ for all tables). For example, this replication policy will copy all tables from the database "parq". If you leave it blank, you will not copy anything. More examples of regular expressions can be found here.
Step 3.1.10 HDFS Disaster Recovery. Big Data Disaster Recovery (BDR). Sentry
Sentry is the default authorization mechanism in Cloudera, and you may want to replicate authorization rules to the second cluster. Unfortunately, there is no mechanism for this embedded into BDR, so you have to come up with your own solution. Note: you have to configure Sentry in a certain way to make it work with BDR; please refer here for more details.
Step 3.1.11 HDFS Disaster Recovery. Big Data Disaster Recovery (BDR). Advantages and Disadvantages of BDR
Generally speaking, I can recommend using BDR as long as it meets your needs and requirements. Here is a brief summary of the advantages and disadvantages of BDR.
Advantages (+):
- It's available out of the box, no need to install anything
- It's free of charge
- It's relatively easy to configure and start working with basic examples
Disadvantages (-):
- It's not a real-time tool. Users have to schedule batch jobs, which run on a fixed schedule
- There is no transparent way to fail over. If the primary side fails, users have to manually switch their applications over to the new cluster
- BDR (distcp under the cover) is a MapReduce job, which takes significant resources
- Because of the previous point and the nature of MapReduce, copying one big file is not parallelized (it is copied in a single thread)
- Hive changes are not fully replicated (dropped tables have to be backed up manually)
- Replicating a large number of files (or a Hive table with a large number of partitions) takes a long time to finish. I can say that it's near to impossible to replicate a directory if it has around 1 million objects (files/directories)
- It only supports Cloudera to Cloudera or Cloudera to Object Store copies. There is no way to copy to Hortonworks (but after the merger of these companies it's not a huge problem anymore)
Step 3.2 HDFS Disaster Recovery. Wandisco
If you hit one of the challenges that I've explained above, it makes sense to take a look at the alternative solution called Wandisco.
Step 3.2.1 HDFS Disaster Recovery. Wandisco. Edge (proxy) node
In case of Wandisco you will need to prepare some proxy nodes on the source and destination side.
We recommend using one of the Big Data Appliance nodes for this; you may refer to the MOS note that guides you on how to free up one of the nodes to serve as a proxy node: How to Remove Non-Critical Nodes from a BDA Cluster (Doc ID 2244458.1)
Step 3.2.2 HDFS Disaster Recovery. Wandisco. Installation
WANdisco Fusion is enterprise class software. It requires careful gathering of environment requirements for the installation, especially with multi-homed networking as in Oracle BDA. Once the environment is fully understood, care must be taken in completing the installation screens by following the documentation closely. Note: if you have clusters that you still want to handle with BDR and they don't have the WANdisco Fusion software, you have to install the Fusion client on them.
Step 3.2.3 HDFS Disaster Recovery. Wandisco. Architecture
For my tests I've used two Big Data Appliances (Starter Rack - 6 nodes). Wandisco requires their software to be installed on edge nodes, so I've converted Node06 into the edge node for Wandisco purposes. The final architecture looks like this:
Step 3.2.4 HDFS Disaster Recovery. Wandisco. Replication by example
Here I'd like to show how to set up replication between two clusters. You need to install the Wandisco Fusion software on both clusters. As soon as you install Fusion on the second (DR) cluster you need to do Induction (peering) with the first (Prod) cluster. As a result of the installation, you have a WebUI for Wandisco Fusion (it's recommended to install it on the edge node); you have to go there and set up the replication rules.
Go to the replication tab and click on the "create" button; after this specify the path which you would like to replicate and choose the source of truth; after this click on the "make consistent" button to kick off replication. You can monitor the list of files and permissions which are not replicated yet, and you can monitor the performance of the replication in real time. On the destination cluster you may see files which have not been replicated yet (metadata only) with the suffix "_REPAIR_":
[root@dstclusterNode01 ~]#  hadoop fs -ls /tmp/test_parq_fusion/store_returns/
19/03/26 22:48:38 INFO client.FusionUriUtils: fs.fusion.check.underlyingFs: [true], URI: [hdfs://gcs-lab-bdax72orl-ns], useFusionForURI: [true]
19/03/26 22:48:38 INFO client.FusionCommonFactory: Initialized FusionHdfs with URI: hdfs://gcs-lab-bdax72orl-ns, FileSystem: DFS[DFSClient[clientName=DFSClient_NONMAPREDUCE_142575995_1, ugi=hdfs/dstclusterNode01.us.oracle.com@US.ORACLE (auth:KERBEROS)]], instance: 1429351083, version: 2.12.4.3
Found 26 items
drwxrwxrwx   - hdfs supergroup          0 2019-03-26 18:29 /tmp/test_parq_fusion/store_returns/.fusion
-rw-r--r--   3 hdfs supergroup          0 2019-03-26 22:48 /tmp/test_parq_fusion/store_returns/000000_0._REPAIR_
-rw-r--r--   3 hdfs supergroup          0 2019-03-26 22:48 /tmp/test_parq_fusion/store_returns/000001_0._REPAIR_
-rw-r--r--   3 hdfs supergroup          0 2019-03-26 22:48 /tmp/test_parq_fusion/store_returns/000002_0._REPAIR_
-rw-r--r--   3 hdfs supergroup          0 2019-03-26 22:48 /tmp/test_parq_fusion/store_returns/000003_0._REPAIR_
If you put a file on one side, it will appear automatically on the other side (no action required).
Step 3.2.5 HDFS Disaster Recovery. Wandisco. Hive
The Fusion Plugin for Live Hive enables WANdisco Fusion to replicate Hive's metastore, allowing WANdisco Fusion to maintain a replicated instance of Hive's metadata and, in future, support Hive deployments that are distributed between data centers.
The Fusion Plugin for Live Hive extends WANdisco Fusion by replicating Apache Hive metadata. With it, WANdisco Fusion maintains a Live Data environment including Hive content, so that applications can access, use, and modify a consistent view of data everywhere, spanning platforms and locations, even at petabyte scale. WANdisco Fusion ensures the availability and accessibility of critical data everywhere. Here you can find more details.
Step 3.2.6 HDFS Disaster Recovery. Wandisco. Sentry
Use the Fusion Plugin for Live Sentry to extend the WANdisco Fusion server with the ability to replicate policies among Apache Sentry Policy Provider instances. It coordinates activities that modify Sentry policy definitions among multiple instances of the Sentry Policy Provider across separate clusters, to maintain common policy enforcement in each cluster. The Fusion Plugin for Live Sentry uses WANdisco Fusion for coordination and replication. Here you can find more details.
Step 3.2.7 HDFS Disaster Recovery. Wandisco. Advantages and Disadvantages
Talking about Wandisco disadvantages, I have to say that it was very hard to install. Wandisco folks promised that this will be improved in the future, but time will tell.
Advantages (+):
- It's real-time. You just load data into one cluster and the other cluster immediately picks up the changes
- It's active-active replication. You can load data into both clusters and data synchronization is done automatically
- Sentry policy replication
- It uses fewer resources than BDR
- Replication policies are easy to manage through the WebUI
- Wandisco supports replication across Hadoop distributions
- Wandisco supports multiple endpoint (or multi-target) replication. A replication rule isn't limited to just source and target (e.g. Prod, DR, Object Store)
Disadvantages (-):
- A common trade-off for additional features can often be additional complexity during installation. This is the case with WANdisco Fusion
- It costs extra money (BDR is free)
- It requires a special Hadoop client. As a consequence, if you want to replicate data with BDR to some remote clusters, you need to install the WANdisco Fusion Hadoop client on them
Step 3.3 HDFS Disaster Recovery. Conclusion
I'll leave it to the customer to decide which replication approach is better; I'd just say that it's a good approach to start with Big Data Disaster Recovery (because it's free and ready to use out of the box) and, if you run into some of the challenges described above, to try the Wandisco software.
Step 4.1 HBase Disaster Recovery
In this blog post I've focused on HDFS and Hive data replication. If you want to replicate HBase to a remote cluster, you can find all the details on how to do this here.
Step 5.1 Kafka Disaster Recovery
Kafka is another place where users may store data and want it replicated. Cloudera recommends using MirrorMaker for this (a minimal invocation sketch is shown at the end of this section).
Step 6.1 Kudu Disaster Recovery
There is another option available for customers to store their data - Kudu. As of today (03/01/2019), Kudu doesn't have a solution for replicating data to a Disaster Recovery site.
Step 7.1 Solr Disaster Recovery
Solr, or Cloudera Search, is another engine for storing data. You can get familiar with DR best practices by reading this blog from Cloudera.
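To make the Kafka point a little more concrete, here is a minimal MirrorMaker sketch. The topic pattern and the two .properties files are hypothetical and only illustrate the shape of the command: the consumer file points at the source cluster brokers (bootstrap.servers plus a group.id), the producer file points at the destination brokers:
[root@destination_cluster13 ~]# kafka-mirror-maker --consumer.config source-cluster.properties --producer.config destination-cluster.properties --whitelist "events.*" --num.streams 4
With plain Apache Kafka the script is called kafka-mirror-maker.sh; on CDH, Cloudera Manager can also run MirrorMaker as a role of the Kafka service, which is usually easier to operate than a hand-started process.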


Autonomous

Querying external CSV, TSV and JSON files from your Autonomous Data Warehouse

I would like to provide here some practical examples and best practices of how to make use of the powerful data loading and querying features of the Autonomous Data Warehouse (ADW) in the Oracle Cloud. We will dive into the meaning of the more widely used parameters, which will help you and your teams derive business value out of your Data Warehouse in a jiffy! An extremely useful feature of this fully managed service is the ability to directly query data lying in your external object store, without incurring the time and cost of physically loading your data from an object store into your Data Warehouse instance. The DBMS_CLOUD.CREATE_EXTERNAL_TABLE package & procedure enables this behavior, creating a table structure over your external object store data, and allowing your ADW instance to directly run queries and analyses on it. A few pre-cursor requirements to get us running these analyses: Make sure you have a running ADW instance, a credentials wallet and a working connection to your instance. If you haven’t done this already follow Lab 1 in this tutorial. Use this link to download the data files for the following examples. You will need to unzip and upload these files to your Object Store. Once again, if you don’t know how to do this, follow Lab 3 Step 4 in this tutorial, which uploads files to a bucket in the Oracle Cloud Object Store, the most streamlined option. You may also use AWS or Azure object stores if required, you may refer to the documentation for more information on this. If you are using the Oracle Cloud Object Store as in Lab 3 above, you will need Swift URLs for the files lying in your object store. If you already created your object store bucket’s URL in the lab you may use that, else to create this use the URL below and replace the placeholders <region_name>, <tenancy_name> and <bucket_name> with your object store bucket’s region, tenancy and bucket names. The easiest way to find this information is to look at your object’s details in the object store, by opening the right-hand menu and clicking “Object details” (see screenshot below).   https://swiftobjectstorage.<region_name>.oraclecloud.com/v1/<tenancy_name>/<bucket_name> Note: In coming updates you will be able to use this object store URL directly in the DBMS_CLOUD API calls, instead of a SWIFT URL. Have the latest version of SQL Developer preferably (ADW requires v18.3 and above).   Comma Separated Value (CSV) Files   CSV files are one of the most common file formats out there. We will begin by using a plain and simple CSV format file for Charlotte’s (NC) Weather History dataset, which we will use as the data for our first external table. Open this Weather History ‘.csv’ file in a text editor to have a look at the data. Notice each field is separated by a comma, and each row ends by going to the next line. (ie. Which implies a newline ‘\n’ character). Also note that the first line is not data, but metadata (column names).   Let us now write a script to create an external table in our ADW over  this data file lying in our object store. We will specify all the column names, and the format of the file as CSV. The format parameter in the DBMS_CLOUD.CREATE_EXTERNAL_TABLE procedure takes a JSON object, which can be provided in two possible formats. format => '{"format_option" : “format_value” }' format => json_object('format_option' value 'format_value')) The second format option has been used in the script below. 
set define on define base_URL = <paste SWIFT URL created above here> BEGIN DBMS_CLOUD.CREATE_EXTERNAL_TABLE( table_name =>'WEATHER_REPORT_CSV', credential_name =>'OBJ_STORE_CRED', file_uri_list =>'&base_URL/Charlotte_NC_Weather_History.csv', format => json_object('type' value 'csv', 'skipheaders' value '1',   'dateformat' value 'mm/dd/yy'), column_list => 'REPORT_DATE DATE, ACTUAL_MEAN_TEMP NUMBER, ACTUAL_MIN_TEMP NUMBER, ACTUAL_MAX_TEMP NUMBER, AVERAGE_MIN_TEMP NUMBER, AVERAGE_MAX_TEMP NUMBER, AVERAGE_PRECIPITATION NUMBER' ); END; / Let us breakdown and understand this script. We are invoking the “CREATE_EXTERNAL_TABLE” procedure in the DBMS_CLOUD API  package. We are then providing the table name we want in our Data Warehouse, our user credentials (we created this in the pre-requisites), the object store file list that contains our data, a format JSON object that describes the format of our file to the API, and a list of named columns for our destination table. The format parameter is a constructed JSON object with format options ‘type’ and ‘skipheaders’. The type specifies the file format as CSV, while skipheaders tells the API how many rows are metadata headers which should be skipped. In our file, that is 1 row of headers. The 'dateformat' parameter specifies the format of the date column in the file we are reading from; We will look at this parameter in more detailed examples below. Great! If this was successful, we have our first external table. Once you have created an external table, it’s a good idea to validate that this external table structure works with your underlying data, before directly querying the table and possibly hitting a runtime error. Validating the table creates logs of any errors in case your external table was created incorrectly, which helps debug and fix any issues. Use the rowcount option in VALIDATE_EXTERNAL_TABLE if your data is large, to limit the validation to the specified number of rows. BEGIN  DBMS_CLOUD.VALIDATE_EXTERNAL_TABLE (       table_name => 'WEATHER_REPORT_CSV'  ); END; / If you do see errors during validation, follow Lab 3 Step 12 to troubleshoot them with the help of the log file. If required, you can also drop this table like you would any other table with the “DROP TABLE” command. On running this validation without errors, you now have a working external table which sits on top of the data in your object store. You may now query and join the WEATHER_REPORT_CSV table as though it is any other table in your Data Warehouse! Let us find the days in our dataset during which it was pleasant in Charlotte. SELECT * FROM WEATHER_REPORT_CSV where actual_mean_temp > 69 and        actual_mean_temp < 74;   Tab Separated Value (TSV) Files   Another popular file format involves tab delimiters or TSV files. In the files you downloaded look for the Charlotte Weather History ‘.gz’ file. Unzip, open and have look at the ".tsv" file in it in a text editor as before. You will notice each row in this file is ended by a pipe ‘|’ character instead of a newline character, and the fields are separated by tabspaces. Oftentimes applications you might work with will output data in less intelligible formats such as this one, and so below is a slightly more advanced example of how to pass such data into DBMS_CLOUD. 
Let’s run the following script: BEGIN   DBMS_CLOUD.CREATE_EXTERNAL_TABLE (     table_name =>'WEATHER_REPORT_TSV',     credential_name =>'OBJ_STORE_CRED',     file_uri_list =>'&base_URL/Charlotte_NC_Weather_History.gz',     format => json_object('removequotes' value 'true',                           'dateformat' value 'mm/dd/yy',                           'delimiter' value '\t',                           'recorddelimiter' value '''|''',                           'skipheaders' value '1'),     column_list => 'REPORT_DATE DATE,     ACTUAL_MEAN_TEMP NUMBER,     ACTUAL_MIN_TEMP NUMBER,     ACTUAL_MAX_TEMP NUMBER,     AVERAGE_MIN_TEMP NUMBER,     AVERAGE_MAX_TEMP NUMBER,     AVERAGE_PRECIPITATION NUMBER'  ); END; / SELECT * FROM WEATHER_REPORT_TSV where actual_mean_temp > 69 and        actual_mean_temp < 74; Whoops! You just hit a runtime error. An important lesson here is that we ran a query directly, without validating the external table like in the previous example. Thus we ran into an error even though the “CREATE_EXTERNAL_TABLE” went through without errors. This is because the “CREATE_EXTERNAL_TABLE” procedure simply creates a table structure (or metadata) over the data, but will not actually check to see whether the data itself is valid; That occurs at validation or runtime. Without validation, our only option would be to visually decipher the problem with the code. Here’s the real working script this time: DROP TABLE WEATHER_REPORT_TSV; BEGIN  DBMS_CLOUD.CREATE_EXTERNAL_TABLE (     table_name =>'WEATHER_REPORT_TSV',     credential_name =>'OBJ_STORE_CRED',     file_uri_list =>'&base_URL/Charlotte_NC_Weather_History.gz',     format => json_object('ignoremissingcolumns' value 'true',                           'removequotes' value 'true',                           'dateformat' value 'mm/dd/yy',                           'delimiter' value '\t',                           'recorddelimiter' value '''|''',                           'skipheaders' value '1',                           'rejectlimit' value '1',                           'compression' value 'gzip'),     column_list => 'REPORT_DATE DATE,     ACTUAL_MEAN_TEMP NUMBER,     ACTUAL_MIN_TEMP NUMBER,     ACTUAL_MAX_TEMP NUMBER,     AVERAGE_MIN_TEMP NUMBER,     AVERAGE_MAX_TEMP NUMBER,     AVERAGE_PRECIPITATION NUMBER' ); END; / SELECT * FROM WEATHER_REPORT_TSV where actual_mean_temp > 69 and        actual_mean_temp < 74; Let us understand the new parameters here, and why our previous script failed: 'ignoremissingcolumns' value 'true': Notice there is no data for the last column “AVERAGE_PRECIPITATION”. This parameter allows the create external table script to skip over columns from the column list, that have no data in the data file. 'removequotes' value 'true': The first column ‘date’ has data surrounded by double quotes. For this data to be converted to an Oracle date type, these quotes need to be removed. Note that when using the type parameter for CSV files as we did in the first example, this removequotes option is true by default. 'dateformat' value 'mm/dd/yy': If we expect a date column to be converted and stored into an Oracle date column (after removing the double quotes of course), we should provide the date column’s format. If we don’t provide a format, the date column will look for the database's default date format. You can see the dateformat documentation here. 'delimiter' value '\t': Fields in this file are tab delimited, so the delimiter we specify is the special character. 
'recorddelimiter' value '''|''': Each record or row in our file is delimited by a pipe '|' symbol, and so we specify this parameter which separates out each row. Note that unlike the delimiter parameter, the recorddelimiter must be enclosed in single quotes as shown here. A nuance here is that the last row in your dataset doesn't need the record delimiter when it is the default newline character, however it does for other character record delimiters to indicate the end of that row. Also note that since ADW is Linux/UNIX based, source data files that use newline as the record delimiter but were created on Windows must use "\r\n" as the format option. Both these nuances will likely have updated functionality in future releases. 'rejectlimit' value '1': We need this parameter here to fix an interesting problem. Unlike with the newline character, if we don't specify a pipe record delimiter at the very end of the file, we get an error because the API doesn't recognize where the last row's last column ends. If we do specify the pipe record delimiter, however, the API expects a new line because the record has been delimited, and we get a null error for the last non-existent row. To fix situations like this, where we know we might have one or more problem rows, we use the reject limit parameter to allow some number of rows to be rejected. If we use 'unlimited' as our reject limit, then any number of rows may be rejected. The default reject limit is 0. 'compression' value 'gzip': Notice the .tsv file is zipped into a gzip ".gz" file, which we have used in the URL. We use this parameter so the file will be unzipped appropriately before the table is created. As before, once this is successful, the external table structure has been created on top of the data in the object store. It may be validated using the VALIDATE_EXTERNAL_TABLE procedure. In the script above we have already queried it as a table in your Data Warehouse.   Field Lists - For more granular parsing options:   A more advanced feature of DBMS_CLOUD.CREATE_EXTERNAL_TABLE is the field_list parameter, which borrows its feature set from the field_list parameter of the Oracle Loader access driver. This parameter allows you to specify more granular information about the fields being loaded. For example, let's use "Charlotte_NC_Weather_History_Double_Dates.csv" from the list of files in our object store. This file is similar to our first CSV example, except it has a copy of the date column in a different date format. Now, if we were to specify a date format in the format parameter, it would apply universally to all date columns. With the field_list parameter, we can specify two different date formats for the two date columns. We do need to list all the columns and their types when including the field_list; not mentioning any type parameters simply uses default Varchar2 values.
BEGIN
  DBMS_CLOUD.CREATE_EXTERNAL_TABLE (
   table_name =>'WEATHER_REPORT_DOUBLE_DATE',
   credential_name =>'OBJ_STORE_CRED',
   file_uri_list =>'&base_URL/Charlotte_NC_Weather_History_Double_Dates.csv',
   format => json_object('type' value 'csv', 'skipheaders' value '1'),
   field_list => 'REPORT_DATE DATE ''mm/dd/yy'',
                  REPORT_DATE_COPY DATE ''yyyy-mm-dd'',
                  ACTUAL_MEAN_TEMP,
                  ACTUAL_MIN_TEMP,
                  ACTUAL_MAX_TEMP,
                  AVERAGE_MIN_TEMP,
                  AVERAGE_MAX_TEMP,
                  AVERAGE_PRECIPITATION',
   column_list => 'REPORT_DATE DATE,
                   REPORT_DATE_COPY DATE,
                   ACTUAL_MEAN_TEMP NUMBER,
                   ACTUAL_MIN_TEMP NUMBER,
                   ACTUAL_MAX_TEMP NUMBER,
                   AVERAGE_MIN_TEMP NUMBER,
                   AVERAGE_MAX_TEMP NUMBER,
                   AVERAGE_PRECIPITATION NUMBER'
 );
END;
/
SELECT * FROM WEATHER_REPORT_DOUBLE_DATE where actual_mean_temp > 69 and actual_mean_temp < 74;
It's important to recognize that the date format parameters are there to provide the API with the information it needs to read the data file. The output format from your query will be your database default (based on your NLS parameters). This can also be formatted in your query using TO_CHAR.
JSON Files
You may be familiar with JSON files for unstructured and semi-structured data. The "PurchaseOrders.txt" file contains JSON Purchase Order data, which when parsed and formatted looks like the following. Using JSON data in an ADW instance can be as simple as putting each JSON document into a table row as a BLOB, and using the powerful, native JSON features that the Oracle Database provides to parse and query it. You can also view the JSON documentation for additional features here. Let's try this! Copy and run the script below:
BEGIN
  DBMS_CLOUD.CREATE_EXTERNAL_TABLE (
   table_name =>'JSON_FILE_CONTENTS',
   credential_name =>'OBJ_STORE_CRED',
   file_uri_list =>'&base_URL/PurchaseOrders.txt',
   column_list => 'json_document blob',
   field_list => 'json_document CHAR(5000)'
);
END;
/
COLUMN Requestor FORMAT A30
COLUMN ShippingInstructions FORMAT A30
SELECT JSON_VALUE(json_document,'$.Requestor') as Requestor,
       JSON_VALUE(json_document,'$.ShippingInstructions.Address.city') as ShippingInstructions
FROM JSON_FILE_CONTENTS where rownum < 50;
The query above lists all the PO requestors and the city where their shipment is to be delivered. Here, we have simply created one column 'json_document' in the external table 'JSON_FILE_CONTENTS'. We do not incur the time it takes to validate these JSON documents, and are instead directly querying the external table using the Database's JSON_VALUE feature. This means the check for well-formed JSON data happens on the fly, and a query would fail unless you properly skip over the failed data. Here, 'VALIDATE_EXTERNAL_TABLE' will not check for valid JSON data, but will simply check that the data is of the correct native datatype (less than 5000 characters long), that is, the datatype of the table's column. For better performance on large JSON data files, using this external table we can also make use of the Database's JSON features to parse and insert the JSON data into a new table 'j_purchaseorder' ahead of time, as below. Note that this insert statement actually brings the data into your ADW instance.
You benefit from doing this because it checks that your JSON data is well-formed and valid ahead of time, and you therefore incur less of a performance impact when you query this JSON data from your ADW instance.
CREATE TABLE j_purchaseorder
 (id          VARCHAR2 (32) NOT NULL,
  date_loaded TIMESTAMP (6) WITH TIME ZONE,
  po_document BLOB
  CONSTRAINT ensure_json CHECK (po_document IS JSON));

INSERT INTO j_purchaseorder (id, date_loaded, po_document)
SELECT SYS_GUID(), SYSTIMESTAMP, json_document FROM json_file_contents
   WHERE json_document IS JSON;
We can now query down JSON paths using the JSON simplified syntax as with the following query:
SELECT po.po_document.Requestor,
       po.po_document.ShippingInstructions.Address.city
FROM j_purchaseorder po;
Copying Data into your Autonomous Data Warehouse
Here, we've gone through examples of accessing your object store data via external tables in your ADW. In following posts, I will walk you through examples of how to use the DBMS_CLOUD.COPY_DATA API to copy that data from your files directly into your Data Warehouse (a small preview sketch follows below), as well as how to diagnose issues while loading your ADW with data using the bad and log files. See you in the next one!
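As a small preview of COPY_DATA, and assuming the same credential and weather CSV file as the first example above, a minimal sketch would look like the following; the target table WEATHER_REPORT is hypothetical and has to exist before the call:
CREATE TABLE WEATHER_REPORT (
  REPORT_DATE DATE,
  ACTUAL_MEAN_TEMP NUMBER,
  ACTUAL_MIN_TEMP NUMBER,
  ACTUAL_MAX_TEMP NUMBER,
  AVERAGE_MIN_TEMP NUMBER,
  AVERAGE_MAX_TEMP NUMBER,
  AVERAGE_PRECIPITATION NUMBER);
BEGIN
  DBMS_CLOUD.COPY_DATA(
    table_name      => 'WEATHER_REPORT',
    credential_name => 'OBJ_STORE_CRED',
    file_uri_list   => '&base_URL/Charlotte_NC_Weather_History.csv',
    format          => json_object('type' value 'csv', 'skipheaders' value '1', 'dateformat' value 'mm/dd/yy')
  );
END;
/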


Autonomous

Oracle Autonomous Data Warehouse - Access Parquet Files in Object Stores

Parquet is a file format that is commonly used by the Hadoop ecosystem. Unlike CSV, which may be easy to generate but not necessarily efficient to process, parquet is really a “database” file type. Data is stored in compressed, columnar format and has been designed for efficient data access. It provides predicate pushdown (i.e. extract data based on a filter expression), column pruning and other optimizations.
Autonomous Database now supports querying and loading data from parquet files stored in object stores and takes advantage of these query optimizations. Let’s take a look at how to create a table over a parquet source and then show an example of a data access optimization – column pruning.
We’ll start with a parquet file that was generated from the ADW sample data used for tutorials (download here). This file was created using Hive on Oracle Big Data Cloud Service. To make it a little more interesting, a few other fields from the customer file were added (denormalizing data is fairly common with Hadoop and parquet).
Review the Parquet File
A CSV file can be read by any tool (including the human eye) – whereas you need a little help with parquet. To see the structure of the file, you can use a tool to parse its contents. Here, we’ll use parquet-tools (I installed it on a Mac using brew – but it can also be installed from github):
$ parquet-tools schema sales_extended.parquet
message hive_schema {
  optional int32 prod_id;
  optional int32 cust_id;
  optional binary time_id (UTF8);
  optional int32 channel_id;
  optional int32 promo_id;
  optional int32 quantity_sold;
  optional fixed_len_byte_array(5) amount_sold (DECIMAL(10,2));
  optional binary gender (UTF8);
  optional binary city (UTF8);
  optional binary state_province (UTF8);
  optional binary income_level (UTF8);
}
You can see the parquet file’s columns and data types, including prod_id, cust_id, income_level and more. To view the actual contents of the file, we’ll use another option to the parquet-tools utility:
$ parquet-tools head sales_extended.parquet
prod_id = 13
cust_id = 987
time_id = 1998-01-10
channel_id = 3
promo_id = 999
quantity_sold = 1
amount_sold = 1232.16
gender = M
city = Adelaide
state_province = South Australia
income_level = K: 250,000 - 299,999
prod_id = 13
cust_id = 1660
time_id = 1998-01-10
channel_id = 3
promo_id = 999
quantity_sold = 1
amount_sold = 1232.16
gender = M
city = Dolores
state_province = CO
income_level = L: 300,000 and above
prod_id = 13
cust_id = 1762
time_id = 1998-01-10
channel_id = 3
promo_id = 999
quantity_sold = 1
amount_sold = 1232.16
gender = M
city = Cayuga
state_province = ND
income_level = F: 110,000 - 129,999
The output is truncated – but you can get a sense for the data contained in the file.
Create an ADW Table
We want to make this data available to our data warehouse. ADW makes it really easy to access parquet data stored in object stores using external tables. You don’t need to know the structure of the data (ADW will figure that out by examining the file) – only the location of the data and an auth token that provides access to the source. In this example, the data is stored in an Oracle Cloud Infrastructure Object Store bucket called “tutorial_load_adw”. Using the DBMS_CLOUD package, we will first create a credential using an auth token that has access to the data:
begin
  DBMS_CLOUD.create_credential (
    credential_name => 'OBJ_STORE_CRED',
    username => 'user@oracle.com',
    password => 'the-password'
  );
end;
/
Next, create the external table.
Notice, you don’t need to know anything about the structure of the data.  Simply point to the file, and ADW will examine its properties and automatically derive the schema: begin     dbms_cloud.create_external_table (     table_name =>'sales_extended_ext',     credential_name =>'OBJ_STORE_CRED',     file_uri_list =>'https://swiftobjectstorage.<datacenter>.oraclecloud.com/v1/<obj-store-namespace>/<bucket>/sales_extended.parquet',     format =>  '{"type":"parquet",  "schema": "first"}'     ); end; / A couple of things to be aware of.  First, the URI for the file needs to follow a specific format – and this is well documented here (https://docs.oracle.com/en/cloud/paas/autonomous-data-warehouse-cloud/user/dbmscloud-reference.html#GUID-5D3E1614-ADF2-4DB5-B2B2-D5613F10E4FA ).  Here, we’re pointing to a particular file.  But, you can also use wildcards (“*” and “?”) or simply list the files using comma separated values. Second, notice the format parameter.  Specify the type of file is “parquet”.  Then, you can instruct ADW how to derive the schema (columns and their data types):  1) analyze the schema of the first parquet file that ADW finds in the file_uri_list or 2) analyze all the schemas for all the parquet files found in the file_uri_list.  Because these are simply files captured in an object store – there is no guarantee that each file’s metadata is exactly the same.  “File1” may contain a field called “address” – while “File2” may be missing that field.  Examining each file to derive the columns is a bit more expensive (but it is only run one time) – but may be required if the first file does not contain all the required fields. The data is now available for query: desc sales_extended_ext; Name           Null? Type            -------------- ----- --------------  PROD_ID              NUMBER(10)      CUST_ID              NUMBER(10)      TIME_ID              VARCHAR2(4000)  CHANNEL_ID           NUMBER(10)      PROMO_ID             NUMBER(10)      QUANTITY_SOLD        NUMBER(10)      AMOUNT_SOLD          NUMBER(10,2)    GENDER               VARCHAR2(4000)  CITY                 VARCHAR2(4000)  STATE_PROVINCE       VARCHAR2(4000)  INCOME_LEVEL         VARCHAR2(4000) select prod_id, quantity_sold, gender, city, income_level from sales_extended_ext where rownum < 10; Query Optimizations with Parquet Files As mentioned at the beginning of this post, parquet files support column pruning and predicate pushdown.  This can drastically reduce the amount of data that is scanned and returned by a query and improve query performance.  Let’s take a look at an example of column pruning.  This file has 11 columns – but imagine there were 911 columns instead and you were interested in querying only one.  Instead of scanning and returning all 911 columns in the file – column pruning will only process the single column that was selected by the query. Here, we’ll query similar data – one file is delimited text while the other is parquet (interestingly, the parquet file is a superset of the text file – yet is one-fourth the size due to compression).  We will vary the number of columns used for each query:  Query a single parquet column Query all parquet columns Query a single text column Query all the text columns The above table was captured from the ADW Monitored SQL Activity page.  Notice that the I/O bytes for text remains unchanged – regardless of the number of columns processed.  
The parquet queries on the other hand process the columnar source efficiently – only retrieving the columns that were requested by the query. As a result, the parquet query eliminated nearly 80% of the data stored in the file. Predicate pushdown can have similar results with large data sets – filtering the data returned by the query.
We know that people will want to query this data frequently and will require optimized access. After examining the data, we now know it looks good and will load it into a table using another DBMS_CLOUD procedure – COPY_DATA. First, create the table and load it from the source:
CREATE TABLE SALES_EXTENDED (
  PROD_ID NUMBER,
  CUST_ID NUMBER,
  TIME_ID VARCHAR2(30),
  CHANNEL_ID NUMBER,
  PROMO_ID NUMBER,
  QUANTITY_SOLD NUMBER(10,0),
  AMOUNT_SOLD NUMBER(10,2),
  GENDER VARCHAR2(1),
  CITY VARCHAR2(30),
  STATE_PROVINCE VARCHAR2(40),
  INCOME_LEVEL VARCHAR2(30)
);
-- Load data
begin
  dbms_cloud.copy_data(
    table_name => 'SALES_EXTENDED',
    credential_name =>'OBJ_STORE_CRED',
    file_uri_list =>'https://swiftobjectstorage.<datacenter>.oraclecloud.com/v1/<obj-store-namespace>/<bucket>/sales_extended.parquet',
    format => '{"type":"parquet", "schema": "first"}'
  );
end;
/
The data has now been loaded. There is no mapping between source and target columns required; the procedure will do a column name match. If a match is not found, the column will be ignored. That’s it! ADW can now access the data directly from the object store – providing people the ability to access data as soon as it lands – and then, for optimized access, load it into the database.
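One variation worth sketching: as mentioned earlier in this post, file_uri_list also accepts wildcards, and the schema format option can be set to "all" so that every matching parquet file is examined when the column list is derived. The table name and file pattern below are hypothetical:
begin
  dbms_cloud.create_external_table (
    table_name =>'sales_all_ext',
    credential_name =>'OBJ_STORE_CRED',
    file_uri_list =>'https://swiftobjectstorage.<datacenter>.oraclecloud.com/v1/<obj-store-namespace>/<bucket>/sales_*.parquet',
    format => '{"type":"parquet", "schema": "all"}'
  );
end;
/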


Hadoop Best Practices

Big Data Resource management Looks Hard, But it isn't

Hadoop is an ecosystem consisting of multiple different components, in which each component (or engine) consumes certain resources. There are a few resource management techniques that allow administrators to define how finite resources are divided across multiple engines. In this post, I'm going to talk about these different techniques in detail. Talking about resource management, I'll divide the topic into these sub-topics:
1. Dealing with low latency engines (realtime/near-realtime)
2. Division of resources across multiple engines within a single cluster
3. Division of resources between different users within a single technology
1. Low latency engines
These engines assume low latency response times. Examples are NoSQL databases (HBase, MongoDB, Cassandra...) or message based systems like Kafka. These systems should be placed on a dedicated cluster if you have real and specific low latency Service Level Agreements (SLAs). Yes, for a highly utilized HBase or a highly utilized Kafka with strict SLAs we do recommend a dedicated cluster; otherwise the SLAs can't be met.
2. Division of resources across multiple engines in a single cluster
It's quite common to put multiple processing engines (such as YARN, Impala, etc.) in a single cluster. As soon as this happens, administrators face the challenge of how to divide resources among these engines. The short answer for Cloudera clusters is "Static Service Pools". In Cloudera Manager you can find the "Static Service Pool" configuration option here: You use this functionality to divide resources between different processing engines, such as:
YARN
Impala
Big Data SQL
Etc.
When applied, under the hood these engines use Linux cgroups to partition resources. I've already explained how to set up Static Service Pools in the context of Big Data SQL; please review that post for more details on configuring Static Service Pools.
3. Division of resources between different users within a single technology
3.1 Resource division within YARN. Dynamic service pools
To work with resource allocation in YARN, Cloudera Manager offers "Dynamic Service Pools". Their purpose is to divide resources between different user groups inside YARN (a single engine). Because you work on YARN, many different engines are impacted - i.e. those engines that run inside the YARN framework, for example Spark and Hive (MapReduce). The following steps are planned to be automated for Big Data Appliance and Big Data Cloud Service in an upcoming release, but if you want to apply them beforehand, or on your own Cloudera clusters, here are the high level steps:
a) Enable Fair Scheduler Preemption
b) Configure example pools: for Low Priority, Medium Priority and High Priority jobs
c) Set up placement rules
The following sections dive deeper into the details.
3.1.1. Enable Fair Scheduler Preemption
To enable fair scheduler preemption go to Cloudera Manager -> YARN -> Configuration. Then set:
- yarn.scheduler.fair.preemption = true
- yarn.scheduler.fair.preemption.cluster-utilization-threshold = 0.7
This is the utilization threshold after which preemption kicks in. The utilization is computed as the maximum ratio of usage to capacity among all resources.
Next we must configure:
- yarn.scheduler.fair.allow-undeclared-pools = false
When set to true, pools specified in applications but not explicitly configured are created at runtime with default settings. When set to false, applications specifying pools not explicitly configured run in a pool named default.
This setting applies when an application explicitly specifies a pool and when the application runs in a pool named with the username associated with the application.
- yarn.scheduler.fair.user-as-default-queue = false
When set to true, the Fair Scheduler uses the username as the default pool name, in the event that a pool name is not specified. When set to false, all applications are run in a shared pool, called default.
Note: the parameter "Enable ResourceManager ACLs" should be set to true by default, but it's worth checking it, just in case. "yarn.admin.acl" shouldn't be equal to '*'; set it equal to "yarn".
After modifying these configuration settings you will need to restart the YARN cluster to activate them. The next step is the configuration of the Dynamic Service Pools.
3.1.2. Configure example pools
Go to Cloudera Manager -> Dynamic Resource Pool Configuration. Here we recommend (in future BDA/BDCS versions we will create these by default for you) creating three pools:
low
medium
high
We also recommend that you remove the root.default pool as shown below. Different pools will use different resources (CPU and memory). To illustrate this I'll run 3 jobs and put them into the different pools: high, medium, low:
[root@bdax72bur09node01]# hadoop jar hadoop-mapreduce-examples.jar pi -Dmapred.job.queue.name=root.low 1000 1000000000000
[root@bdax72bur09node01]# hadoop jar hadoop-mapreduce-examples.jar pi -Dmapred.job.queue.name=root.medium 1000 1000000000000
[root@bdax72bur09node01]# hadoop jar hadoop-mapreduce-examples.jar pi -Dmapred.job.queue.name=root.high 1000 1000000000000
After they have been running for a while, navigate to Cloudera Manager -> YARN -> Resource Pools and take a look at "Fair Share VCores" (or memory). In this diagram we can see that vCores are allocated according to our configured proportion: 220/147/73, roughly the same as 15/10/5.
The second important configuration is the limit on maximum pool usage: we recommend putting a cap on each resource pool so small applications can jump into the cluster even if a long-running job has been launched. Here is a small test case:
- Run a long-running job in the root.low pool (one that takes days to complete):
[root@bdax72bur09node01]# hadoop jar hadoop-mapreduce-examples.jar pi -Dmapred.job.queue.name=root.low 1000 1000000000000
Check resource usage: this graph shows that we have some unused CPU, as we wanted. Also, notice below that we have some pending containers, which shows that our application wants to run more tasks, but as expected YARN disallows this. So, despite having spare resources in the cluster, YARN refuses to use them because of the cap on maximum resource usage for certain pools.
- Now run a small job which belongs to a different pool:
hive> set mapred.job.queue.name=root.medium;
hive> select count(1) from date_dim;
...
Now, jump to the resource usage page (Cloudera Manager -> YARN -> Resource Pools). Here we can see that the number of pending containers for the first job hasn't changed. This is the reserve we allocated for newly arriving small jobs. It enables short start-up times for small jobs, so no preemption is needed and end users will not feel like their jobs hang.
The third key piece of the Resource Management configuration is the preemption configuration. We recommend configuring different preemption levels for each pool (double check that you've enabled preemption earlier at the cluster level). There are two configuration settings to change:
Fair Share Preemption Threshold - This is a value between 0 and 1. If set to x and the fair share of the resource pool is F, we start preempting resources from other resource pools if the allocation is under (x * F). In other words, it defines how starved for resources a pool must be before preemption starts.
Fair Share Preemption Timeout - The number of seconds a resource pool is under its fair share before it will try to preempt containers to take resources from other resource pools. In other words, this setting defines when YARN will start to preempt.
To configure these, go to Cloudera Manager -> Dynamic Service Pools -> Edit for a certain pool -> Preemption. We suggest the following settings:
For high: immediately start preemption if the job didn't get all requested resources, which is implemented as below:
For medium: wait 60 seconds before starting preemption if the job didn't get at least 80% of the requested resources:
And for low: wait 180 seconds before starting preemption if the job didn't get at least 50% of its requested resources:
These parameters define how aggressively containers will be preempted (how quickly a job will get its required resources). Here is my test case - I've run a long-running job in the root.low pool and then run the same query in parallel, assigning it to the low, medium and high pools respectively:
hive> set mapred.job.queue.name=root.low;
hive> select count(1) from store_sales;
...
hive> set mapred.job.queue.name=root.medium;
hive> select count(1) from store_sales;
...
hive> set mapred.job.queue.name=root.high;
hive> select count(1) from store_sales;
...
As a measure of the result we can consider elapsed time (which consists of waiting time, which will differ according to our configuration, plus execution time, which will also differ because of resource usage).
This table shows the result:
Pool name    Elapsed time, min
Low          14.5
Medium       8.8
High         8
In the graph below you can see how preemption was accomplished. There is another aspect of preemption, which is whether a pool can be preempted from or not. To set this up, go to Cloudera Manager -> Dynamic Resource Pool -> root.high pool and click on "Edit". After this click on "Preemption" and disable preemption from the root.high pool. That means that nobody can preempt tasks from this pool. Note: this setting makes the root.high pool incredibly strong and you may have to consider enabling preemption again.
3.1.3. Set up Placement rules
Another key component of Dynamic Resource management is Placement rules. Placement rules define which pool a job will be assigned to. By default, we suggest keeping it simple. To configure, go to Cloudera Manager -> Dynamic Service Pools -> Placement Rules. With this configuration your user may belong to one of the secondary groups, which we named low/medium/high. If not, you can define the pool assigned to the job at runtime. If you don't do that, by default the job will be allocated resources in the low pool. So, if the administrator knows what to do, she will put the user in a certain pool; if users know what to do, they will specify a certain pool (low, medium or high). We recommend that administrators define this for the system. (A consolidated sketch of the equivalent Fair Scheduler allocation file is shown at the end of section 3.1.4 below.) For example, I have a user alex, who belongs to the secondary group "medium":
[root@bdax72bur09node01]# id alex
uid=1002(alex) gid=1006(alex) groups=1006(alex),1009(medium)
So, if I try to specify a different pool at runtime (consider it as a user trying to cheat the system settings), it will not override the medium group:
[root@bdax72bur09node01]# sudo -u alex hadoop jar hadoop-mapreduce-examples.jar pi -Dmapred.job.queue.name=root.high 1000 1000000000000
3.1.4. How to specify a resource pool at runtime?
While there are a few engines, let's focus on MapReduce (Hive) and Spark. Earlier in this blog I've shown how to specify a pool for a MapReduce job with the mapred.job.queue.name parameter. You can specify it with the -D parameter when you launch the job from the console:
[root@bdax72bur09node01]# hadoop jar hadoop-mapreduce-examples.jar pi -Dmapred.job.queue.name=root.low 1000 1000000000000
Or in the case of Hive you can set it as a parameter:
hive> set mapred.job.queue.name=root.low;
Another engine is Spark, and here you can simply add the "queue" parameter:
[root@bdax72bur09node01]# spark-submit --class org.apache.spark.examples.SparkPi --queue root.high spark-examples.jar
In the spark2-shell console you need to specify the same parameter:
[root@bdax72bur09node01 lib]# spark2-shell --queue root.high
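For readers who manage the Fair Scheduler outside of Cloudera Manager, here is a rough, hand-written sketch of what an allocation file equivalent to sections 3.1.2 and 3.1.3 could look like. Cloudera Manager's Dynamic Resource Pools UI generates and owns this file for you, so treat the snippet purely as an illustration; the maxResources caps are hypothetical numbers:
<?xml version="1.0"?>
<allocations>
  <!-- weights mirror the 15/10/5 proportion configured above -->
  <queue name="high">
    <weight>15</weight>
    <maxResources>400000 mb, 200 vcores</maxResources>  <!-- hypothetical cap -->
    <fairSharePreemptionThreshold>1.0</fairSharePreemptionThreshold>
    <fairSharePreemptionTimeout>1</fairSharePreemptionTimeout>
    <allowPreemptionFrom>false</allowPreemptionFrom>  <!-- "not preemptable"; needs a Hadoop release that supports this element -->
  </queue>
  <queue name="medium">
    <weight>10</weight>
    <maxResources>300000 mb, 150 vcores</maxResources>  <!-- hypothetical cap -->
    <fairSharePreemptionThreshold>0.8</fairSharePreemptionThreshold>
    <fairSharePreemptionTimeout>60</fairSharePreemptionTimeout>
  </queue>
  <queue name="low">
    <weight>5</weight>
    <maxResources>200000 mb, 100 vcores</maxResources>  <!-- hypothetical cap -->
    <fairSharePreemptionThreshold>0.5</fairSharePreemptionThreshold>
    <fairSharePreemptionTimeout>180</fairSharePreemptionTimeout>
  </queue>
  <!-- secondary group first, then a pool named at runtime, then fall back to low -->
  <queuePlacementPolicy>
    <rule name="secondaryGroupExistingQueue"/>
    <rule name="specified" create="false"/>
    <rule name="default" queue="low"/>
  </queuePlacementPolicy>
</allocations>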
3.1.5. Group Mapping
The first thing the Resource Manager checks is the user's secondary group. How do you define this? I've covered it earlier in my security blog post, but in a nutshell it is defined either with LDAP mapping or a Unix shell mapping, configured under the "hadoop.security.group.mapping" parameter in Cloudera Manager (HDFS -> Configuration). Below is a list of useful commands which can be used for managing users and groups on BDA/BDCS/BDCC (note that all users and groups have to exist on each cluster node and have the same id):
// Add new user
# dcli -C "useradd -u 1102 user1"
// Add new group
# dcli -C "groupadd -g 1111 high"
// Add user to the group
# dcli -C "usermod -a -G high user1"
// Remove user from the group
# dcli -C "gpasswd -d user1 high"
// Delete user
# dcli -C "userdel user1"
// Delete group
# dcli -C "groupdel high"
Oracle and Cloudera recommend using "org.apache.hadoop.security.ShellBasedUnixGroupsMapping" plus SSSD to replicate users from Active Directory. From Cloudera's documentation: "The LdapGroupsMapping library may not be as robust a solution needed for large organizations in terms of scalability and manageability, especially for organizations managing identity across multiple systems and not exclusively for Hadoop clusters. Support for the LdapGroupsMapping library is not consistent across all operating systems."
3.1.6. Control user access to a certain pool (ACL)
You may want to restrict the group of users that has access to certain pools (especially to high and medium). You accomplish this with ACLs in Dynamic Resource Pools. First, just a reminder that you have to set "yarn.admin.acl" to something different than '*'; set it equal to "yarn". Before setting restrictions for certain pools, you need to set up a restriction for the root pool: due to a Cloudera Manager bug, you can't leave the ACLs for submission and administration empty (otherwise everyone will be allowed to submit jobs in any pool), so set them to the desired groups or, as a workaround, set them to ",". After this you are ready to create rules for the other pools. Let's start with low. Since it plays the role of the default pool in our configuration, we should allow everyone to submit jobs there. Next, let's move to medium. There we will configure access for users who belong to the groups medium and etl. And finally, we are ready to configure the high pool. Here we will allow job submission for users who belong to the groups managers or high.
Let's do a quick test. Take a user who does not belong to any of the privileged groups (medium, high, etl, managers):
[root@bdax72bur09node01 hadoop-mapreduce]# hdfs groups user2
user2 : user2
and run some job:
[root@bdax72bur09node01 hadoop-mapreduce]# sudo -u user2 hadoop jar hadoop-mapreduce-examples.jar pi 1 10
...
18/11/19 21:22:31 INFO mapreduce.Job: Job job_1542663460911_0026 running in uber mode : false
18/11/19 21:22:31 INFO mapreduce.Job:  map 0% reduce 0%
18/11/19 21:22:36 INFO mapreduce.Job:  map 100% reduce 0%
After this, check that this user was placed in the root.low queue (everyone is allowed to run jobs there). So far so good.
Now, let's try to submit a job from the same user to the high-priority pool:

[root@bdax72bur09node01 hadoop-mapreduce]# sudo -u user2 hadoop jar hadoop-mapreduce-examples.jar pi -Dmapred.job.queue.name=root.high 1 10
...
18/11/19 21:27:33 WARN security.UserGroupInformation: PriviledgedActionException as:user2 (auth:SIMPLE) cause:java.io.IOException: org.apache.hadoop.yarn.exceptions.YarnException: Failed to submit application_1542663460911_0027 to YARN : User user2 cannot submit applications to queue root.high

To place this job in the pool root.high, we need to add the user to one of the groups listed in the ACL for root.high; let's use managers (and create it first):

[root@bdax72bur09node01 hadoop-mapreduce]# dcli -C "groupadd -g 1112 managers"
[root@bdax72bur09node01 hadoop-mapreduce]# dcli -C "usermod -a -G managers user2"

Second try and validation:

[root@bdax72bur09node01 hadoop-mapreduce]# hdfs groups user2
user2 : user2 managers
[root@bdax72bur09node01 hadoop-mapreduce]# sudo -u user2 hadoop jar hadoop-mapreduce-examples.jar pi -Dmapred.job.queue.name=root.high 1 10
...
18/11/19 21:34:00 INFO mapreduce.Job: Job job_1542663460911_0029 running in uber mode : false
18/11/19 21:34:00 INFO mapreduce.Job:  map 0% reduce 0%
18/11/19 21:34:05 INFO mapreduce.Job:  map 100% reduce 0%

Wonderful! Everything works as expected. Once again, it is important to set the root pool ACLs to some value. If you don't want to put the list of allowed groups in the root pool yet, or you want to have a pool like root.low where everyone can submit their jobs, simply use the workaround with the "," character.

3.1.7. Analyzing resource usage

The tricky thing with pools is that many people or divisions often share the same pool, and it can be hard to determine who used which portion. The script below queries the YARN ResourceManager REST API for applications that finished in the last day and aggregates CPU and memory usage per user:

[root@bdax72bur09node01 hadoop-mapreduce]# cat getresource_usage.sh
#!/bin/bash

STARTDATE=`date -d " -1 day " +%s%N | cut -b1-13`
ENDDATE=`date +%s%N | cut -b1-13`
result=`curl -s "http://bdax72bur09node04:8088/ws/v1/cluster/apps?finishedTimeBegin=$STARTDATE&finishedTimeEnd=$ENDDATE"`
if [[ $result =~ "standby RM" ]]; then
result=`curl -s "http://bdax72bur09node05:8088/ws/v1/cluster/apps?finishedTimeBegin=$STARTDATE&finishedTimeEnd=$ENDDATE"`
fi
#echo $result
echo $result | python -m json.tool | sed 's/["|,]//g' | grep -E "user|coreSeconds" | awk ' /user/ { user = $2 } /vcoreSeconds/ { arr[user]+=$2 ; } END { for (x in arr) {print "yarn." x ".cpums="arr[x]} } '
echo $result | python -m json.tool | sed 's/["|,]//g' | grep -E "user|memorySeconds" | awk ' /user/ { user = $2 } /memorySeconds/ { arr1[user]+=$2 ; } END { for (y in arr1) {print "yarn." y ".memorySeconds="arr1[y]} } '

3.2.1. Impala Admission Control

Another popular engine in the Hadoop world is Impala. Impala has its own mechanism to control resources, called admission control. Many MPP systems recommend queueing queries under high concurrency instead of running them all in parallel, and Impala is no exception. To configure this, go to Cloudera Manager -> Dynamic Resource Pools -> Impala Admission Control. Admission control has a few key parameters for configuring the queue:

Max Running Queries - the maximum number of concurrently running queries in this pool
Max Queued Queries - the maximum number of queries that can be queued in this pool
Queue Timeout - the maximum time a query can wait in this pool's queue before timing out

So, up to Max Running Queries queries will run concurrently; after that, up to Max Queued Queries queries will be queued. Queued queries wait at most Queue Timeout, after which they are cancelled.

Example: I configured Max Running Queries = 3 (allow three simultaneous SQL statements), Max Queued Queries = 2 (allow two queued queries), and left Queue Timeout at the default of 60 seconds. I then ran 6 queries and waited for a minute. Three queries were executed successfully; the other three failed for different reasons. One query was rejected right away because there was no place for it in the queue (3 running, 2 queued, the next one rejected). The other two sat in the queue for 60 seconds, but since the running queries did not finish within this timeout, they failed as well.
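If you want to reproduce this admission-control behaviour yourself, a rough sketch is to launch several long-running statements concurrently against the same pool and watch the Impala queries page in Cloudera Manager. This is not from the original post; the host, port and table name below are placeholders:

// Fire six concurrent heavy queries; with Max Running Queries = 3 and Max Queued Queries = 2,
// expect three to run, two to queue and one to be rejected immediately
# for i in $(seq 1 6); do impala-shell -i bdax72bur09node01:21000 -q "select count(*) from test_table t1 cross join test_table t2" & done; wait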


Hadoop Best Practices

Use Big Data Appliance and Big Data Cloud Service High Availability, or You'll Blame Yourself Later

In this blog post, I'd like to briefly review the high availability functionality in Oracle Big Data Appliance and Big Data Cloud Service. The good news is that most of these features are available out of the box on your systems and no extra steps are required from your end - one of the key value-adds of leveraging a hardened system from Oracle. A special shout-out to Sandra and Ravi from our team for helping with this blog post. For this post on HA, we'll subdivide the content into the following topics:

High Availability in the hardware components of the system
High Availability within a single node
Hadoop components High Availability

1. High Availability in Hardware Components

When we are talking about an on-premise solution, it is important to understand the fault tolerance and HA built into the actual hardware you have on the floor. Based on Oracle Exadata and the experience we have in managing mission-critical systems, a BDA is built out of components designed to handle hardware faults and simply stay up and running. Networking is redundant, power supplies in the racks are redundant, ILOM software tracks the health of the system, and ASR proactively logs SRs on hardware issues if needed. You can find a lot more information here.

2. High Availability within a Single Node

Talking about high availability within a single node, I'd like to focus on disk failures. In large clusters, disk failures do occur, but they should - in general - not cause any issues for BDA and BDCS customers. First, let's have a look at the disk layout (minus data directories) of the Oracle system:

[root@bdax72bur09node02 ~]# df -h|grep -v "/u"
Filesystem      Size  Used Avail Use% Mounted on
devtmpfs        126G     0  126G   0% /dev
tmpfs           126G  8.0K  126G   1% /dev/shm
tmpfs           126G   67M  126G   1% /run
tmpfs           126G     0  126G   0% /sys/fs/cgroup
/dev/md2        961G   39G  874G   5% /
/dev/md6        120G  717M  113G   1% /ssddisk
/dev/md0        454M  222M  205M  53% /boot
/dev/sda1       191M   16M  176M   9% /boot/efi
/dev/sdb1       191M     0  191M   0% /boot/rescue-efi
cm_processes    126G  309M  126G   1% /run/cloudera-scm-agent/process

Next, let's take a look at where the critical services store their data.

- Name Node: arguably the most critical HDFS component.
It stores the fsimage and edit logs on the hard disks; let's check where:

[root@bdax72bur09node02 ~]# df -h /opt/hadoop/dfs/nn
Filesystem      Size  Used Avail Use% Mounted on
/dev/md2        961G   39G  874G   5% /

- Journal Node:

[root@bdax72bur09node02 ~]# df /opt/hadoop/dfs/jn
Filesystem     1K-blocks   Used Available Use% Mounted on
/dev/md6       124800444 733688 117704132   1% /ssddisk
[root@bdax72bur09node02 ~]# ls -l /opt/hadoop/dfs/jn
lrwxrwxrwx 1 root root 15 Jul 15 22:58 /opt/hadoop/dfs/jn -> /ssddisk/dfs/jn

- Zookeeper:

[root@bdax72bur09node02 ~]# df /var/lib/zookeeper
Filesystem     1K-blocks   Used Available Use% Mounted on
/dev/md6       124800444 733688 117704132   1% /ssddisk

All these services store their data on the RAID devices /dev/md2 and /dev/md6. Let's take a look at what they consist of:

[root@bdax72bur09node02 ~]# mdadm --detail /dev/md2
/dev/md2:
...
     Array Size : 1023867904 (976.44 GiB 1048.44 GB)
  Used Dev Size : 1023867904 (976.44 GiB 1048.44 GB)
   Raid Devices : 2
  Total Devices : 2
...
 Active Devices : 2
...
    Number   Major   Minor   RaidDevice State
       0       8        3        0      active sync   /dev/sda3
       1       8       19        1      active sync   /dev/sdb3

So, md2 is a one-terabyte mirrored RAID. We are safe if one of the disks fails.

[root@bdax72bur09node02 ~]# mdadm --detail /dev/md6
/dev/md6:
...
     Array Size : 126924800 (121.04 GiB 129.97 GB)
  Used Dev Size : 126924800 (121.04 GiB 129.97 GB)
   Raid Devices : 2
  Total Devices : 2
...
 Active Devices : 2
...
    Number   Major   Minor   RaidDevice State
       0       8      195        0      active sync   /dev/sdm3
       1       8      211        1      active sync   /dev/sdn3

So, md6 is a mirrored SSD RAID. Again, we are safe if one of the disks fails. Fine, let's go on!
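As a side note (not from the original post), the state of these software RAID devices is easy to monitor from the OS, which can be handy when you suspect a disk problem:

// Quick overview of all md devices and their resync status
# cat /proc/mdstat
// Check a specific array for failed members
# mdadm --detail /dev/md2 | grep -E "State :|Failed Devices"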
3. High Availability of Hadoop Components

3.1 Default service distribution on BDA/BDCS

We briefly took a look at the hardware layout of BDA/BDCS and how data is laid out on disk. In this section, let's look at the Hadoop software details. By default, when you deploy BDCS or configure and create a BDA cluster, you get the following service distribution:

Node01: Balancer, Cloudera Manager Agent, DataNode, Failover Controller, JournalNode, NameNode, NodeManager (in clusters of eight nodes or less), ZooKeeper, Big Data SQL (if enabled), Kerberos KDC (if MIT Kerberos is enabled and on-BDA KDCs are being used), Sentry Server (if enabled), Hive Metastore, Active Navigator Key Trustee Server (if HDFS Transparent Encryption is enabled), Hue Server, Hue Load Balancer

Node02: Cloudera Manager Agent, DataNode, Failover Controller, JournalNode, MySQL Backup, NameNode, NodeManager (in clusters of eight nodes or less), ZooKeeper, Big Data SQL (if enabled), Kerberos KDC (if MIT Kerberos is enabled and on-BDA KDCs are being used), Sentry Server (if enabled), Passive Navigator Key Trustee Server (if HDFS Transparent Encryption is enabled), HttpFS

Node03: Cloudera Manager Server, Cloudera Manager Agent, DataNode, JournalNode, MySQL Primary, Navigator Audit Server and Navigator Metadata Server, NodeManager, SparkHistoryServer, ResourceManager, ZooKeeper, Big Data SQL (if enabled), JobHistory

Node04: Cloudera Manager Agent, DataNode, Oozie, NodeManager, Oracle Data Integrator Agent, ResourceManager, Big Data SQL (if enabled), Hive Metastore, Hue Server, Hue Load Balancer

Node05 to NodeNN (all remaining nodes): Cloudera Manager Agent, DataNode, NodeManager, Big Data SQL (if enabled)

Let me talk about the high availability implementation of some of these services. This layout may change in the future; you can check for updates here.

3.2 Services with High Availability configured by default

As of today (November 2018) we support high availability for the following Hadoop components: 1) Name Node, 2) YARN, 3) Kerberos Key Distribution Center, 4) Sentry, 5) Hive Metastore Service, 6) Hue.

3.2.1 Name Node High Availability

As you may know, Oracle's solutions are based on the Cloudera Hadoop distribution. Here you can find a detailed explanation of how HDFS high availability is achieved; the good news is that all those configuration steps are already done for you on BDA and BDCS, so you simply have it by default. Let me show a small demo of NameNode high availability. First, let's check the list of nodes that run this service:

[root@bdax72bur09node01 ~]# hdfs getconf -namenodes
bdax72bur09node01.us.oracle.com bdax72bur09node02.us.oracle.com

The easiest way to determine which node is active is to go to Cloudera Manager -> HDFS -> Instances; in my case the bdax72bur09node02 node is active. I'll run an HDFS list command in a loop, reboot the active NameNode and watch how the system behaves:

[root@bdax72bur09node01 ~]# for i in {1..100}; do hadoop fs -ls hdfs://gcs-lab-bdax72-ns|tail -1; done;
drwxr-xr-x   - root root          0 2018-09-11 17:21 hdfs://gcs-lab-bdax72-ns/user/root/benchmarks
drwxr-xr-x   - root root          0 2018-09-11 17:21 hdfs://gcs-lab-bdax72-ns/user/root/benchmarks
drwxr-xr-x   - root root          0 2018-09-11 17:21 hdfs://gcs-lab-bdax72-ns/user/root/benchmarks
drwxr-xr-x   - root root          0 2018-09-11 17:21 hdfs://gcs-lab-bdax72-ns/user/root/benchmarks
18/11/01 19:53:53 INFO retry.RetryInvocationHandler: Exception while invoking getFileInfo of class ClientNamenodeProtocolTranslatorPB over bdax72bur09node02.us.oracle.com/192.168.8.171:8020 after 1 fail over attempts. Trying to fail over immediately.
...
18/11/01 19:54:16 INFO retry.RetryInvocationHandler: Exception while invoking getFileInfo of class ClientNamenodeProtocolTranslatorPB over bdax72bur09node02.us.oracle.com/192.168.8.171:8020 after 5 fail over attempts. Trying to fail over after sleeping for 11022ms. java.net.ConnectException: Call From bdax72bur09node01.us.oracle.com/192.168.8.170 to bdax72bur09node02.us.oracle.com:8020 failed on connection exception: java.net.ConnectException: Connection timed out; For more details see:  http://wiki.apache.org/hadoop/ConnectionRefused     at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)     at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)     at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)     at java.lang.reflect.Constructor.newInstance(Constructor.java:423)     at org.apache.hadoop.net.NetUtils.wrapWithMessage(NetUtils.java:791)     at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:731)     at org.apache.hadoop.ipc.Client.call(Client.java:1508)     at org.apache.hadoop.ipc.Client.call(Client.java:1441)     at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:230)     at com.sun.proxy.$Proxy10.getFileInfo(Unknown Source)     at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.getFileInfo(ClientNamenodeProtocolTranslatorPB.java:786)     at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)     at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)     at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)     at java.lang.reflect.Method.invoke(Method.java:498)     at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:258)     at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:104)     at com.sun.proxy.$Proxy11.getFileInfo(Unknown Source)     at org.apache.hadoop.hdfs.DFSClient.getFileInfo(DFSClient.java:2167)     at org.apache.hadoop.hdfs.DistributedFileSystem$20.doCall(DistributedFileSystem.java:1265)     at org.apache.hadoop.hdfs.DistributedFileSystem$20.doCall(DistributedFileSystem.java:1261)     at org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)     at org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:1261)     at org.apache.hadoop.fs.Globber.getFileStatus(Globber.java:64)     at org.apache.hadoop.fs.Globber.doGlob(Globber.java:272)     at org.apache.hadoop.fs.Globber.glob(Globber.java:151)     at org.apache.hadoop.fs.FileSystem.globStatus(FileSystem.java:1715)     at org.apache.hadoop.fs.shell.PathData.expandAsGlob(PathData.java:326)     at org.apache.hadoop.fs.shell.Command.expandArgument(Command.java:235)     at org.apache.hadoop.fs.shell.Command.expandArguments(Command.java:218)     at org.apache.hadoop.fs.shell.FsCommand.processRawArguments(FsCommand.java:102)     at org.apache.hadoop.fs.shell.Command.run(Command.java:165)     at org.apache.hadoop.fs.FsShell.run(FsShell.java:315)     at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)     at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:84)     at org.apache.hadoop.fs.FsShell.main(FsShell.java:372) Caused by: java.net.ConnectException: Connection timed out ...   
drwxr-xr-x   - root root          0 2018-09-11 17:21 hdfs://gcs-lab-bdax72-ns/user/root/benchmarks
drwxr-xr-x   - root root          0 2018-09-11 17:21 hdfs://gcs-lab-bdax72-ns/user/root/benchmarks

So, as we can see, when one of the NameNodes became unavailable, the second one took over its responsibility. The client experiences only a short pause while the failover happens. In Cloudera Manager we can see that the NameNode service on node02 is not available, but despite this, users can keep working with the cluster without outages or any extra actions.

3.2.2 YARN High Availability

YARN is another key Hadoop component, and it is also highly available by default in the Oracle solution. Cloudera requires some configuration for this, but with BDA and BDCS all these steps are done as part of the service deployment. Let's do the same test with the YARN ResourceManager. In Cloudera Manager we find the nodes that run the ResourceManager service and reboot the active one (to simulate a hardware failure). I'll run some MapReduce code and restart the bdax72bur09node04 node (which hosts the active ResourceManager):

[root@bdax72bur09node01 hadoop-mapreduce]# hadoop jar hadoop-mapreduce-examples.jar pi 1 1
Number of Maps  = 1
Samples per Map = 1
Wrote input for Map #0
Starting Job
18/11/01 20:08:03 INFO client.ConfiguredRMFailoverProxyProvider: Failing over to rm16
18/11/01 20:08:03 INFO input.FileInputFormat: Total input paths to process : 1
18/11/01 20:08:04 INFO mapreduce.JobSubmitter: number of splits:1
18/11/01 20:08:04 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1541115989562_0002
18/11/01 20:08:04 INFO impl.YarnClientImpl: Submitted application application_1541115989562_0002
18/11/01 20:08:04 INFO mapreduce.Job: The url to track the job: http://bdax72bur09node04.us.oracle.com:8088/proxy/application_1541115989562_0002/
18/11/01 20:08:04 INFO mapreduce.Job: Running job: job_1541115989562_0002
18/11/01 20:08:07 INFO retry.RetryInvocationHandler: Exception while invoking getApplicationReport of class ApplicationClientProtocolPBClientImpl over rm16. Trying to fail over immediately.
java.io.EOFException: End of File Exception between local host is: "bdax72bur09node01.us.oracle.com/192.168.8.170"; destination host is: "bdax72bur09node04.us.oracle.com":8032; : java.io.EOFException; For more details see:  http://wiki.apache.org/hadoop/EOFException     at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)     at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)     at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)     at java.lang.reflect.Constructor.newInstance(Constructor.java:423)     at org.apache.hadoop.net.NetUtils.wrapWithMessage(NetUtils.java:791)     at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:764)     at org.apache.hadoop.ipc.Client.call(Client.java:1508)     at org.apache.hadoop.ipc.Client.call(Client.java:1441)     at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:230)     at com.sun.proxy.$Proxy13.getApplicationReport(Unknown Source)     at org.apache.hadoop.yarn.api.impl.pb.client.ApplicationClientProtocolPBClientImpl.getApplicationReport(ApplicationClientProtocolPBClientImpl.java:187)     at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)     at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)     at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)     at java.lang.reflect.Method.invoke(Method.java:498)     at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:258)     at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:104)     at com.sun.proxy.$Proxy14.getApplicationReport(Unknown Source)     at org.apache.hadoop.yarn.client.api.impl.YarnClientImpl.getApplicationReport(YarnClientImpl.java:408)     at org.apache.hadoop.mapred.ResourceMgrDelegate.getApplicationReport(ResourceMgrDelegate.java:302)     at org.apache.hadoop.mapred.ClientServiceDelegate.getProxy(ClientServiceDelegate.java:154)     at org.apache.hadoop.mapred.ClientServiceDelegate.invoke(ClientServiceDelegate.java:323)     at org.apache.hadoop.mapred.ClientServiceDelegate.getJobStatus(ClientServiceDelegate.java:423)     at org.apache.hadoop.mapred.YARNRunner.getJobStatus(YARNRunner.java:698)     at org.apache.hadoop.mapreduce.Job$1.run(Job.java:326)     at org.apache.hadoop.mapreduce.Job$1.run(Job.java:323)     at java.security.AccessController.doPrivileged(Native Method)     at javax.security.auth.Subject.doAs(Subject.java:422)     at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1920)     at org.apache.hadoop.mapreduce.Job.updateStatus(Job.java:323)     at org.apache.hadoop.mapreduce.Job.isComplete(Job.java:621)     at org.apache.hadoop.mapreduce.Job.monitorAndPrintJob(Job.java:1366)     at org.apache.hadoop.mapreduce.Job.waitForCompletion(Job.java:1328)     at org.apache.hadoop.examples.QuasiMonteCarlo.estimatePi(QuasiMonteCarlo.java:306)     at org.apache.hadoop.examples.QuasiMonteCarlo.run(QuasiMonteCarlo.java:354)     at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)     at org.apache.hadoop.examples.QuasiMonteCarlo.main(QuasiMonteCarlo.java:363)     at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)     at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)     at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)     at 
java.lang.reflect.Method.invoke(Method.java:498)
    at org.apache.hadoop.util.ProgramDriver$ProgramDescription.invoke(ProgramDriver.java:71)
    at org.apache.hadoop.util.ProgramDriver.run(ProgramDriver.java:144)
    at org.apache.hadoop.examples.ExampleDriver.main(ExampleDriver.java:74)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at org.apache.hadoop.util.RunJar.run(RunJar.java:221)
    at org.apache.hadoop.util.RunJar.main(RunJar.java:136)
Caused by: java.io.EOFException
    at java.io.DataInputStream.readInt(DataInputStream.java:392)
    at org.apache.hadoop.ipc.Client$Connection.receiveRpcResponse(Client.java:1113)
    at org.apache.hadoop.ipc.Client$Connection.run(Client.java:1006)
18/11/01 20:08:07 INFO client.ConfiguredRMFailoverProxyProvider: Failing over to rm15
18/11/01 20:08:09 INFO mapreduce.Job: Job job_1541115989562_0002 running in uber mode : false
18/11/01 20:08:09 INFO mapreduce.Job:  map 0% reduce 0%
18/11/01 20:08:23 INFO mapreduce.Job:  map 100% reduce 0%
18/11/01 20:08:29 INFO mapreduce.Job:  map 100% reduce 100%
18/11/01 20:08:29 INFO mapreduce.Job: Job job_1541115989562_0002 completed successfully

Well, in the logs we can clearly see that the client failed over to the second ResourceManager, and in Cloudera Manager we can see that node03 took over the active role. So, even after losing the entire node that hosted the active ResourceManager, users do not lose the ability to submit their jobs.

3.2.3 Kerberos Key Distribution Center (KDC)

The majority of production Hadoop clusters run in secure mode, which means Kerberized clusters, and the Kerberos Key Distribution Center is the key component for this. The good news is that when you install Kerberos with BDA or BDCS, you automatically get a standby KDC on your BDA/BDCS.

3.2.4 Sentry High Availability

If Kerberos is the authentication method (defining who you are), users quite frequently want to pair it with an authorization tool. With Cloudera, the de facto default tool is Sentry. Since the BDA 4.12 software release we support Sentry High Availability out of the box. Cloudera has detailed documentation that explains how it works.

3.2.5 Hive Metastore Service High Availability

When we talk about Hive, it's important to keep in mind that it consists of many components; this is easy to see in Cloudera Manager. Whenever you work with Hive tables, you go through several logical layers. To keep it simple, let's consider the case where we use beeline to query some Hive tables. We then need HiveServer2, the Hive Metastore Service and the Metastore backend RDBMS to be available. Let's connect and make sure that the data is available:

0: jdbc:hive2://bdax72bur09node04.us.oracle.c (closed)> !connect jdbc:hive2://bdax72bur09node04.us.oracle.com:10000/default;
Connecting to jdbc:hive2://bdax72bur09node04.us.oracle.com:10000/default;
Enter username for jdbc:hive2://bdax72bur09node04.us.oracle.com:10000/default;:
Enter password for jdbc:hive2://bdax72bur09node04.us.oracle.com:10000/default;:
Connected to: Apache Hive (version 1.1.0-cdh5.14.2)
Driver: Hive JDBC (version 1.1.0-cdh5.14.2)
Transaction isolation: TRANSACTION_REPEATABLE_READ
1: jdbc:hive2://bdax72bur09node04.us.oracle.c> show databases;
...
+----------------+--+
| database_name  |
+----------------+--+
| csv            |
| default        |
| parq           |
+----------------+--+

Now, let's shut down HiveServer2 and confirm that we can no longer connect to the database:

1: jdbc:hive2://bdax72bur09node04.us.oracle.c> !connect jdbc:hive2://bdax72bur09node04.us.oracle.com:10000/default;
Connecting to jdbc:hive2://bdax72bur09node04.us.oracle.com:10000/default;
Enter username for jdbc:hive2://bdax72bur09node04.us.oracle.com:10000/default;:
Enter password for jdbc:hive2://bdax72bur09node04.us.oracle.com:10000/default;:
Could not open connection to the HS2 server. Please check the server URI and if the URI is correct, then ask the administrator to check the server status.
Error: Could not open client transport with JDBC Uri: jdbc:hive2://bdax72bur09node04.us.oracle.com:10000/default;: java.net.ConnectException: Connection refused (Connection refused) (state=08S01,code=0)
1: jdbc:hive2://bdax72bur09node04.us.oracle.c>

As expected, the connection fails. To make HiveServer2 highly available, go to Cloudera Manager -> Hive -> Instances -> Add Role and add an extra HiveServer2 instance (we add it to node05). After this we need to install a load balancer:

[root@bdax72bur09node06 ~]# yum -y install haproxy
Loaded plugins: langpacks
Resolving Dependencies
--> Running transaction check
---> Package haproxy.x86_64 0:1.5.18-7.el7 will be installed
--> Finished Dependency Resolution

Dependencies Resolved

================================================================================
 Package         Arch          Version               Repository           Size
================================================================================
Installing:
 haproxy         x86_64        1.5.18-7.el7          ol7_latest          833 k

Transaction Summary
================================================================================
Install  1 Package

Total download size: 833 k
Installed size: 2.6 M
Downloading packages:
haproxy-1.5.18-7.el7.x86_64.rpm                            | 833 kB  00:00:01
Running transaction check
Running transaction test
Transaction test succeeded
Running transaction
  Installing : haproxy-1.5.18-7.el7.x86_64                                  1/1
  Verifying  : haproxy-1.5.18-7.el7.x86_64                                  1/1

Installed:
  haproxy.x86_64 0:1.5.18-7.el7

Complete!
Now we need to configure haproxy. Open the configuration file:

[root@bdax72bur09node06 ~]# vi /etc/haproxy/haproxy.cfg

This is an example of my haproxy.cfg:

global
    log         127.0.0.1 local2
    chroot      /var/lib/haproxy
    pidfile     /var/run/haproxy.pid
    maxconn     4000
    user        haproxy
    group       haproxy
    daemon

    # turn on stats unix socket
    stats socket /var/lib/haproxy/stats

#---------------------------------------------------------------------
# common defaults that all the 'listen' and 'backend' sections will
# use if not designated in their block
#---------------------------------------------------------------------
defaults
    mode                    http
    log                     global
    option                  httplog
    option                  dontlognull
    option http-server-close
    option forwardfor       except 127.0.0.0/8
    option                  redispatch
    retries                 3
    timeout http-request    10s
    timeout queue           1m
    timeout connect         10s
    timeout client          1m
    timeout server          1m
    timeout http-keep-alive 10s
    timeout check           10s
    maxconn                 3000

#---------------------------------------------------------------------
# main frontend which proxys to the backends
#---------------------------------------------------------------------
frontend  main *:5000
    acl url_static       path_beg       -i /static /images /javascript /stylesheets
    acl url_static       path_end       -i .jpg .gif .png .css .js
    use_backend static          if url_static

#---------------------------------------------------------------------
# static backend for serving up images, stylesheets and such
#---------------------------------------------------------------------
backend static
    balance     roundrobin
    server      static 127.0.0.1:4331 check

#---------------------------------------------------------------------
# round robin balancing between the various backends
#---------------------------------------------------------------------
listen hiveserver2 :10005
    mode tcp
    option tcplog
    balance source
    server hiveserver2_1 bdax72bur09node04.us.oracle.com:10000 check
    server hiveserver2_2 bdax72bur09node05.us.oracle.com:10000 check

Then go to Cloudera Manager and set the HiveServer2 load balancer hostname/port according to how we configured it in the previous step.
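Before pointing clients at the balancer, it is worth validating the configuration and making sure the service is enabled and running. A small sketch (not from the original post), assuming the stock haproxy systemd unit on Oracle Linux 7:

// Check the configuration file for syntax errors
# haproxy -c -f /etc/haproxy/haproxy.cfg
// Enable the service at boot and (re)start it to pick up the new configuration
# systemctl enable haproxy
# systemctl restart haproxy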
After all these changes are done, try to connect again:

beeline> !connect jdbc:hive2://bdax72bur09node06.us.oracle.com:10005/default;
...
INFO  : OK
+----------------+--+
| database_name  |
+----------------+--+
| csv            |
| default        |
| parq           |
+----------------+--+
3 rows selected (2.08 seconds)

Great, it works! Now try to shut down one of the HiveServer2 instances:

0: jdbc:hive2://bdax72bur09node06.us.oracle.c> !connect jdbc:hive2://bdax72bur09node06.us.oracle.com:10005/default;
...
1: jdbc:hive2://bdax72bur09node06.us.oracle.c> show databases;
...
+----------------+--+
| database_name  |
+----------------+--+
| csv            |
| default        |
| parq           |
+----------------+--+

It still works! Now let's move on and have a look at what we have for Hive Metastore Service high availability. The really great news is that it is enabled by default with BDA and BDCS. To show this, I'll shut the two Metastore services down one at a time and check that connecting through beeline still works. Shut down the service on node01 and try to connect/query through beeline:

1: jdbc:hive2://bdax72bur09node06.us.oracle.c> !connect jdbc:hive2://bdax72bur09node06.us.oracle.com:10005/default;
...
INFO  : OK
+----------------+--+
| database_name  |
+----------------+--+
| csv            |
| default        |
| parq           |
+----------------+--+

That works; now I'll start the service on node01 again and shut it down on node04:

1: jdbc:hive2://bdax72bur09node06.us.oracle.c> !connect jdbc:hive2://bdax72bur09node06.us.oracle.com:10005/default;
...
INFO  : OK
+----------------+--+
| database_name  |
+----------------+--+
| csv            |
| default        |
| parq           |
+----------------+--+

It works again, so we are safe with the Hive Metastore Service. BDA and BDCS use MySQL as the backend database layer. As of today there is no high availability configuration for the MySQL database itself, so we use Master-Slave replication (in the future we hope to have HA for MySQL), which allows us to switch to the Slave if the Master fails. Today you would need to perform a node migration if the master node (node03 by default) fails; I'll explain this later in this blog. To find out where the MySQL Master is, run:

[root@bdax72bur09node01 tmp]# json-select --jpx=MYSQL_NODE /opt/oracle/bda/install/state/config.json
bdax72bur09node03

To find out the Slave, run:

[root@bdax72bur09node01 tmp]# json-select --jpx=MYSQL_BACKUP_NODE /opt/oracle/bda/install/state/config.json
bdax72bur09node02
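If you want to verify that this Master-Slave replication is healthy, a quick sketch (not from the original post, and assuming you can authenticate to MySQL on the backup node) is:

// On the MySQL backup (slave) node, both replication threads should report Yes
# mysql -e "SHOW SLAVE STATUS\G" | grep -E "Slave_IO_Running|Slave_SQL_Running"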
3.2.6 Hue High Availability

Hue is quite a popular tool for working with Hadoop data. It's also possible to run Hue in HA mode; Cloudera explains it here, but with BDA and BDCS you get it out of the box since the 4.12 software version. By default you have a Hue Server and a Hue Load Balancer on node01 and node04. If node01 or node04 becomes unavailable, users can simply keep using Hue by switching to the other load balancer URL; no other actions are needed.

3.3 Migrate Critical Nodes

One of the greatest features of Big Data Appliance is the capability to migrate all roles of critical services. For example, some nodes host many critical services, like node03 (Cloudera Manager, ResourceManager, MySQL store...). Fortunately, BDA has a simple way to migrate all roles from a critical node to a non-critical one. You can find all the details in MOS (Node Migration on Oracle Big Data Appliance V4.0 OL6 Hadoop Cluster to Manage a Hardware Failure (Doc ID 1946210.1)). Let's consider the case where we lose one of the critical servers (because of a hardware failure, for example) - node03, which hosts the active MySQL RDBMS and Cloudera Manager. To fix this we need to migrate all roles of this node to another server. To migrate all roles from node03, just run:

[root@bdax72bur09node01 ~]# bdacli admin_cluster migrate bdax72bur09node03

All the details are in the MOS note, but briefly:
1) There are two major types of operations: migration of critical nodes, and reprovisioning of non-critical nodes.
2) When you migrate a critical node, you cannot choose the non-critical node that the services are migrated to (mammoth does this for you; generally it will be the first available non-critical node).
3) After the failed server is returned to the cluster (or a new one is added), you should reprovision it as a non-critical node.
4) You don't need to switch the services back; just leave things as they are after the migration - the new node takes over all roles of the failed one.

In my example, I migrated one of the critical nodes, which hosted the active MySQL RDBMS and Cloudera Manager. To check where the active RDBMS now is, you can run:

[root@bdax72bur09node01 tmp]# json-select --jpx=MYSQL_NODE /opt/oracle/bda/install/state/config.json
bdax72bur09node05

Note: to find the slave RDBMS, run:

[root@bdax72bur09node01 tmp]# json-select --jpx=MYSQL_BACKUP_NODE /opt/oracle/bda/install/state/config.json
bdax72bur09node02

Cloudera Manager now runs on node05, and the ResourceManager was also migrated to node05. The migration process decommissions the node. After the failed node comes back to the cluster, we need to reprovision it (deploy the non-critical services on it); in other words, we recommission the node.

3.4 Redundant Services

There are certain Hadoop services which are configured on BDA in a redundant way, so you shouldn't worry about high availability for them:
- DataNode. By default, HDFS stores three copies of each block. If you lose one node, you still have two more copies (a quick way to check this is sketched after this list).
- JournalNode. By default, three JournalNode instances are configured, so missing one is not a big deal.
- ZooKeeper. By default, three ZooKeeper instances are configured, so missing one is not a big deal.
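To confirm that HDFS redundancy is intact after losing a node or migrating roles, you can ask HDFS directly; a quick sketch (not from the original post):

// Capacity plus live/dead DataNode counts
# hdfs dfsadmin -report | head -20
// Namespace-wide check for blocks that have lost replicas
# hdfs fsck / | grep -E "Under-replicated blocks|Corrupt blocks"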
4. Services without High Availability configured by default

There are certain services on BDA which do not have a high availability configuration by default:
- Oozie. If you need high availability for Oozie, check Cloudera's documentation.
- Cloudera Manager. It's also possible to configure Cloudera Manager for high availability, as explained here, but I'd recommend using node migration instead, as shown above.
- Impala. By default, neither BDA nor BDCS has Impala HA configured (yet), but it's quite important. You can find all the detailed information here; briefly, to configure HA for Impala you need to:

a. Configure haproxy (I extended the existing haproxy config created for HiveServer2) by adding:

listen impala :25003
    mode tcp
    option tcplog
    balance leastconn
    server symbolic_name_1 bdax72bur09node01.us.oracle.com:21000 check
    server symbolic_name_2 bdax72bur09node02.us.oracle.com:21000 check
    server symbolic_name_3 bdax72bur09node03.us.oracle.com:21000 check
    server symbolic_name_4 bdax72bur09node04.us.oracle.com:21000 check
    server symbolic_name_5 bdax72bur09node05.us.oracle.com:21000 check
    server symbolic_name_6 bdax72bur09node06.us.oracle.com:21000 check

b. Go to Cloudera Manager -> Impala -> Configuration, search for "Impala Daemons Load Balancer" and add the haproxy host there.

c. Log in to Impala using the haproxy host:port:

[root@bdax72bur09node01 bin]# impala-shell -i bdax72bur09node06:25003
...
Connected to bdax72bur09node06:25003
...
[bdax72bur09node06:25003] >

Talking about Impala, it's worth mentioning that there are two more services - the Impala Catalog Service and the Impala StateStore. These are not mission-critical services. From Cloudera's documentation: "The Impala component known as the statestore checks on the health of Impala daemons on all the DataNodes in a cluster, and continuously relays its findings to each of those daemons." and "The Impala component known as the catalog service relays the metadata changes from Impala SQL statements to all the Impala daemons in a cluster. ... Because the requests are passed through the statestore daemon, it makes sense to run the statestored and catalogd services on the same host." I'll make a quick test: I disabled the Impala daemon on node01, disabled the StateStore and the Catalog Server, connected to the load balancer and ran a query:

[root@bdax72bur09node01 ~]# impala-shell -i bdax72bur09node06:25003
....
[bdax72bur09node06:25003] > select count(1) from test_table;
...
+------------+
| count(1)   |
+------------+
| 6659433869 |
+------------+
Fetched 1 row(s) in 1.76s
[bdax72bur09node06:25003] >

So, as we can see, Impala can keep serving queries even without the StateStore and the Catalog Service.
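To see how the balancer spreads Impala connections across the daemons, you can query the haproxy stats socket that was enabled in the configuration above. A sketch, not from the original post, assuming the socat utility is installed on the balancer node:

// Dump per-backend counters for the impala listener from the stats socket
# echo "show stat" | socat stdio /var/lib/haproxy/stats | grep impala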
Appendix A

Even though BDA has multiple high availability features, it is always useful to make a backup before any significant operation, such as an upgrade. For detailed information, please follow My Oracle Support (MOS) note: How to Backup Critical Metadata on Oracle Big Data Appliance Prior to Upgrade V2.3.1 and Higher Releases (Doc ID 1623304.1).


Big Data

Influence Product Roadmap: Introducing the new Big Data Idea Lab

You have ideas, you have feedback, you want to be involved in the products and services you use. Of course you do, so here is the new Idea Lab for Big Data, where you can submit your ideas and vote on ideas submitted by others. Visit the Big Data Idea Lab now. What does the Idea Lab let you do, and how do we use your feedback? For all our products and services we (Product Management) define a set of features and functionality that will enhance the products and solve customer problems. We then set out to prioritize these features and functions, and a big driver of this is the impact said features have on you, our customers. Until now we really used interaction with customers as the yardstick for that impact, or that bit of prioritization. That will change with the Idea Lab, where we will have direct, recorded and scalable input available on features and ideas. Of course we are also looking for input into new features and things we had not thought about. That is the other part of the Idea Lab: giving us new ideas, new functions and features and anything that you think would help you use our products better in your company. As we progress in releasing new functionality, the Idea Lab will be a running tally of our progress, and we promise to keep you updated on where we are going in roadmap posts on this blog (see this example: Start Planning your Upgrade Strategy to Cloudera 6 on Oracle Big Data Now), and on the Idea Lab. So, please use this Idea Lab, submit and vote, and visit often to see what is new and keep us tracking towards better products. And thanks in advance for your efforts!


Autonomous

Thursday at OpenWorld 2018 - Your Must-See Sessions

Day three is a wrap so now is the perfect time to start planning your Day 4 sessions at OpenWorld 2018. Here’s your absolutely Must-See agenda for Thursday at OpenWorld 2018... My favorite session of the whole conference is today - Using Analytic Views for Self-Service Business Intelligence, which is at 9:00am in Room 3005, Moscone West. Multi-dimensional models inside the database are very powerful and totally cool. AVs uniquely deliver sophisticated analytics from very simple SQL. If you only get to one session today then make it this one! Of course, today is your final chance to get some much-needed real hands-on time with Autonomous Data Warehouse in the ADW Hands-on Lab at 1:30pm - 2:30pm at the Marriott Marquis (Yerba Buena Level) - Salon 9B. The product management team will help you sign up for a free cloud account and then get you working on our completely free workshop. Don't miss it, just bring your own laptop! THURSDAY'S MUST-SEE GUIDE Don't worry if you are not able to join us in San Francisco for this year's conference because I will be providing a comprehensive review after the conference closes on Thursday. The review will include links to download the presentations for each of my Must-See sessions and links to any hands-on lab content as well. If you are here in San Francisco then enjoy the conference - it's going to be an awesome conference this year. Technorati Tags: Analytics, Autonomous, Big Data, Cloud, Conference, Data Warehousing, OpenWorld, SQL Analytics


Autonomous

Wednesday at OpenWorld 2018 - Your Must-See Sessions

Here’s your absolutely Must-See agenda for Wednesday at OpenWorld 2018... Day two is a wrap so now is the perfect time to start planning your Day 3 sessions at OpenWorld 2018. The list is packed full of really excellent speakers from Oracle product management talking about Autonomous Database and the Cloud. These sessions are what Oracle OpenWorld is all about: the chance to learn from the real technical experts. The highlight of today is two additional chances to get some real hands-on time with Autonomous Data Warehouse in the ADW Hands-on Lab at 12:45pm - 1:45pm and then again at 3:45pm - 4:45pm, both at the Marriott Marquis (Yerba Buena Level) - Salon 9B. We will help you sign up for a free cloud account and then get you working on our completely free workshop. Don't miss it, just bring your own laptop! WEDNESDAY'S MUST-SEE GUIDE Don't worry if you are not able to join us in San Francisco for this year's conference because I will be providing a comprehensive review after the conference closes on Thursday. The review will include links to download the presentations for each of my Must-See sessions and links to any hands-on lab content as well. If you are here in San Francisco then enjoy the conference - it's going to be an awesome conference this year.


Autonomous

Tuesday at OpenWorld 2018 - Your Must-See Sessions

  Here’s your absolutely Must-See agenda for Tuesday at OpenWorld 2018... Day one is a wrap so now is the perfect time to start planning your Day 2 session at  OpenWorld 2018. The list is packed full of really excellent speakers from Oracle product management talking about Autonomous Database and the Cloud. These sessions are what Oracle OpenWorld is all about: the chance to learn from the real technical experts. Highlight of today is the chance to get some real hands-on time with Autonomous Data Warehouse in the ADW Hands-on Lab at 3:45 PM - 4:45 PM at the Marriott Marquis (Yerba Buena Level) - Salon 9B. We will help you sign up for a free cloud account and then get you working on our completely free workshop. Don't miss it, just bring your own laptop!   TUESDAY'S MUST-SEE GUIDE    Don't worry if you are not able to join us in San Francisco for this year's conference because I will be providing a comprehensive review after the conference closes on Thursday. The review will include links to download the presentations for each of my Must-See sessions and links to any hands-on lab content as well. Have a great conference. If you are here in San Francisco then enjoy the conference - it's going to be an awesome conference this year.  


Autonomous

Managing Autonomous Data Warehouse Using oci-curl

Every now and then we get questions about how to create and manage an Autonomous Data Warehouse (ADW) instance using REST APIs. ADW is an Oracle Cloud Infrastructure (OCI) based service, which means you can use the OCI REST APIs to manage your ADW instances as an alternative to using the OCI web interface. I want to provide a few examples of doing this using the bash function oci-curl provided in the OCI documentation. This was the easiest method for me to use; you can also use the OCI command line interface or the SDKs to do the same operations.

oci-curl

oci-curl is a bash function provided in the documentation that makes it easy to get started with the REST APIs. You will need to complete a few setup operations before you can start calling it. Start by copying the function code from the documentation into a shell script on your machine. I saved it into a file named oci-curl.sh, for example. You will see the following section at the top of the file; you need to replace these four values with your own.

# TODO: update these values to your own
local tenancyId="ocid1.tenancy.oc1..aaaaaaaaba3pv6wkcr4jqae5f15p2b2m2yt2j6rx32uzr4h25vqstifsfdsq";
local authUserId="ocid1.user.oc1..aaaaaaaat5nvwcna5j6aqzjcaty5eqbb6qt2jvpkanghtgdaqedqw3rynjq";
local keyFingerprint="20:3b:97:13:55:1c:5b:0d:d3:37:d8:50:4e:c5:3a:34";
local privateKeyPath="/Users/someuser/.oci/oci_api_key.pem";

How to find or generate these values is explained in the documentation here; let's walk through those steps now.

Tenancy ID

The first one is the tenancy ID. You can find your tenancy ID at the bottom of any page in the OCI web interface as indicated in this screenshot. Copy and paste the tenancy ID into the tenancyId argument in your oci-curl shell script.

Auth User ID

This is the OCI ID of the user who will perform actions using oci-curl. This user needs to have the privileges to manage ADW instances in your OCI tenancy. You can find your user OCI ID by going to the users screen as shown in this screenshot. Click the Copy link in that screen, which copies the OCI ID for that user into the clipboard. Paste it into the authUserId argument in your oci-curl shell script.

Key Fingerprint

The first step for getting the key fingerprint is to generate an API signing key; follow the documentation to do that. I am running these commands on a Mac and, for demo purposes, I am not using a passphrase; see the documentation for the Windows commands and for using a passphrase to encrypt the key file.

mkdir ~/.oci
openssl genrsa -out ~/.oci/oci_api_key.pem 2048
chmod go-rwx ~/.oci/oci_api_key.pem
openssl rsa -pubout -in ~/.oci/oci_api_key.pem -out ~/.oci/oci_api_key_public.pem

For your API calls to authenticate against OCI you need to upload the public key file. Go to the user details screen for your user on the OCI web interface and select API Keys on the left. Click the Add Public Key button, copy and paste the contents of the file oci_api_key_public.pem into the text field, and click Add to finish the upload. After you upload your key you will see its fingerprint in the user details screen as shown below. Copy and paste the fingerprint text into the keyFingerprint argument in your oci-curl shell script.

Private Key Path

Lastly, change the privateKeyPath argument in your oci-curl shell script to the path of the key file you generated in the previous step. For example, I set it as below on my machine.
local privateKeyPath="/Users/ybaskan/.oci/oci_api_key.pem"; At this point, I save my updated shell script as oci-curl.sh and I will be calling this function to manage my ADW instances. Create an ADW instance Let's start by creating an instance using the function. Here is my shell script for doing that, createdb.sh. #!/bin/bash . ./oci-curl.sh oci-curl database.us-phoenix-1.oraclecloud.com post ./request.json "/20160918/autonomousDataWarehouses" Note that I first source the file oci-curl.sh which contains my oci-curl function updated with my OCI tenancy information as explained previously. I am calling the CreateAutonomousDataWarehouse REST API to create a database. Note that I am running this against the Phoenix data center (indicated by the first argument, database.us-phoenix-1.oraclecloud.com), if you want to create your database in other data centers you need to use the relevant endpoint listed here. I am also referring to a file named request.json which is a file that contains my arguments for creating the database. Here is the content of that file. { "compartmentId" : "ocid1.tenancy.oc1..aaaaaaaaro2vctz2hianklgq77hguo6jzcs6ezyheouqfsald4x3nubpwr2a", "dbName" : "adwdb1", "displayName" : "adwdb1", "adminPassword" : "WelcomePMADWC18", "cpuCoreCount" : 1, "dataStorageSizeInTBs" : 1, "licenseModel" : "LICENSE_INCLUDED" } As seen in the file I am creating a database named adwdb1 with 1 CPU and 1TB storage. You can create your database in any of your compartments, to find the compartment ID which is required in this file, go to the compartments page on the OCI web interface, find the compartment you want to use and click the Copy link to copy the compartment ID into the clipboard. Paste it into the compartmentId argument in your request.json file. Let's run the script to create an ADW instance. ./createdb.sh { "compartmentId" : "ocid1.tenancy.oc1..aaaaaaaaro2vctz2hianklgq77hguo6jzcs6ezyheouqfsald4x3nubpwr2a", "connectionStrings" : null, "cpuCoreCount" : 1, "dataStorageSizeInTBs" : 1, "dbName" : "adwdb1", "definedTags" : { }, "displayName" : "adwdb1", "freeformTags" : { }, "id" : "ocid1.autonomousdwdatabase.oc1.phx.abyhqljszen442afj6ogitgs3dwk2iunhv7zdndllf6o6is6xg2ku5a7uf3a", "licenseModel" : "LICENSE_INCLUDED", "lifecycleDetails" : null, "lifecycleState" : "PROVISIONING", "serviceConsoleUrl" : null, "timeCreated" : "2018-09-06T19:56:48.077Z" As you see the lifecycle state is listed as provisioning which indicates the database is being provisioned. If you now go to the OCI web interface you will see the new database as being provisioned. Listing ADW instances Here is the script, listdb.sh, I use to list the ADW instances in my compartment. I use the ListAutonomousDataWarehouses REST API for this. #!/bin/bash . ./oci-curl.sh oci-curl database.us-phoenix-1.oraclecloud.com get "/20160918/autonomousDataWarehouses?compartmentId=ocid1.tenancy.oc1..aaaaaaaaro2vctz2hianklgq77hguo6jzcs6ezyheouqfsald4x3nubpwr2a" As you see it has one argument, compartmentId, which I set to the ID of my compartment I used in the previous example when creating a new ADW instance. When you run this script it gives you a list of databases and information about them in JSON which looks pretty ugly. 
./listdb.sh [{"compartmentId":"ocid1.tenancy.oc1..aaaaaaaaro2vctz2hianklgq77hguo6jzcs6ezyheouqfsald4x3nubpwr2a","connectionStrings":{"high":"adwc.uscom-west-1.oraclecloud.com:1522/nqpobuuiaedvlyf_adwdb1_high.adwc.oraclecloud.com","low":"adwc.uscom-west-1.oraclecloud.com:1522/nqpobuuiaedvlyf_adwdb1_low.adwc.oraclecloud.com","medium":"adwc.uscom-west-1.oraclecloud.com:1522/nqpobuuiaedvlyf_adwdb1_medium.adwc.oraclecloud.com"},"cpuCoreCount":1,"dataStorageSizeInTBs":1,"dbName":"adwdb1","definedTags":{},"displayName":"adwdb1","freeformTags":{},"id":"ocid1.autonomousdwdatabase.oc1.phx.abyhqljszen442afj6ogitgs3dwk2iunhv7zdndllf6o6is6xg2ku5a7uf3a","licenseModel":"LICENSE_INCLUDED","lifecycleDetails":null,"lifecycleState":"AVAILABLE","serviceConsoleUrl":"https://adwc.uscom-west-1.oraclecloud.com/console/index.html?tenant_name=OCID1.TENANCY.OC1..AAAAAAAARO2VCTZ2HIANKLGQ77HGUO6JZCS6EZYHEOUQFSALD4X3NUBPWR2A&database_name=ADWDB1&service_type=ADW","timeCreated":"2018-09-06T19:56:48.077Z"},{"compartmentId":"ocid1.tenancy.oc1..aaaaaaaaro2vctz2hianklgq77hguo6jzcs6ezyheouqfsald4x3nubpwr2a","connectionStrings":{"high":"adwc.uscom-west-1.oraclecloud.com:1522/nqpobuuiaedvlyf_testdw_high.adwc.oraclecloud.com","low":"adwc.uscom-west-1.oraclecloud.com:1522/nqpobuuiaedvlyf_testdw_low.adwc.oraclecloud.com","medium":"adwc.uscom-west-1.oraclecloud.com:1522/nqpobuuiaedvlyf_testdw_medium.adwc.oraclecloud.com"},"cpuCoreCount":1,"dataStorageSizeInTBs":1,"dbName":"testdw","definedTags":{},"displayName":"testdw","freeformTags":{},"id":"ocid1.autonomousdwdatabase.oc1.phx.abyhqljtcioe5c5sjteosafqfd37biwde66uqj2pqs773gueucq3dkedv3oq","licenseModel":"LICENSE_INCLUDED","lifecycleDetails":null,"lifecycleState":"AVAILABLE","serviceConsoleUrl":"https://adwc.uscom-west-1.oraclecloud.com/console/index.html?tenant_name=OCID1.TENANCY.OC1..AAAAAAAARO2VCTZ2HIANKLGQ77HGUO6JZCS6EZYHEOUQFSALD4X3NUBPWR2A&database_name=TESTDW&service_type=ADW","timeCreated":"2018-07-31T22:39:14.436Z"}] You can use a JSON beautifier to make it human-readable. For example, I use Python to view the same output in a more readable format. 
./listdb.sh | python -m json.tool [ { "compartmentId": "ocid1.tenancy.oc1..aaaaaaaaro2vctz2hianklgq77hguo6jzcs6ezyheouqfsald4x3nubpwr2a", "connectionStrings": { "high": "adwc.uscom-west-1.oraclecloud.com:1522/nqpobuuiaedvlyf_adwdb1_high.adwc.oraclecloud.com", "low": "adwc.uscom-west-1.oraclecloud.com:1522/nqpobuuiaedvlyf_adwdb1_low.adwc.oraclecloud.com", "medium": "adwc.uscom-west-1.oraclecloud.com:1522/nqpobuuiaedvlyf_adwdb1_medium.adwc.oraclecloud.com" }, "cpuCoreCount": 1, "dataStorageSizeInTBs": 1, "dbName": "adwdb1", "definedTags": {}, "displayName": "adwdb1", "freeformTags": {}, "id": "ocid1.autonomousdwdatabase.oc1.phx.abyhqljszen442afj6ogitgs3dwk2iunhv7zdndllf6o6is6xg2ku5a7uf3a", "licenseModel": "LICENSE_INCLUDED", "lifecycleDetails": null, "lifecycleState": "AVAILABLE", "serviceConsoleUrl": "https://adwc.uscom-west-1.oraclecloud.com/console/index.html?tenant_name=OCID1.TENANCY.OC1..AAAAAAAARO2VCTZ2HIANKLGQ77HGUO6JZCS6EZYHEOUQFSALD4X3NUBPWR2A&database_name=ADWDB1&service_type=ADW", "timeCreated": "2018-09-06T19:56:48.077Z" }, { "compartmentId": "ocid1.tenancy.oc1..aaaaaaaaro2vctz2hianklgq77hguo6jzcs6ezyheouqfsald4x3nubpwr2a", "connectionStrings": { "high": "adwc.uscom-west-1.oraclecloud.com:1522/nqpobuuiaedvlyf_testdw_high.adwc.oraclecloud.com", "low": "adwc.uscom-west-1.oraclecloud.com:1522/nqpobuuiaedvlyf_testdw_low.adwc.oraclecloud.com", "medium": "adwc.uscom-west-1.oraclecloud.com:1522/nqpobuuiaedvlyf_testdw_medium.adwc.oraclecloud.com" }, "cpuCoreCount": 1, "dataStorageSizeInTBs": 1, "dbName": "testdw", "definedTags": {}, "displayName": "testdw", "freeformTags": {}, "id": "ocid1.autonomousdwdatabase.oc1.phx.abyhqljtcioe5c5sjteosafqfd37biwde66uqj2pqs773gueucq3dkedv3oq", "licenseModel": "LICENSE_INCLUDED", "lifecycleDetails": null, "lifecycleState": "AVAILABLE", "serviceConsoleUrl": "https://adwc.uscom-west-1.oraclecloud.com/console/index.html?tenant_name=OCID1.TENANCY.OC1..AAAAAAAARO2VCTZ2HIANKLGQ77HGUO6JZCS6EZYHEOUQFSALD4X3NUBPWR2A&database_name=TESTDW&service_type=ADW", "timeCreated": "2018-07-31T22:39:14.436Z" } ] Scaling an ADW instance To scale an ADW instance you need to use the UpdateAutonomousDataWarehouse REST API with the relevant arguments. Here is my script, updatedb.sh, I use to do that. #!/bin/bash . ./oci-curl.sh oci-curl database.us-phoenix-1.oraclecloud.com put ./update.json "/20160918/autonomousDataWarehouses/$1" As you see it uses the file update.json as the request body and also uses the command line argument $1 as the database OCI ID. The file update.json has the following argument in it. { "cpuCoreCount" : 2 } I am only using cpuCoreCount as I want to change my CPU capacity, you can use other arguments listed in the documentation if you need to. To find the database OCI ID for your ADW instance you can either look at the output of the list databases API I mentioned above or you can go the ADW details page on the OCI web interface which will show you the OCI ID. Now, I call it with my database ID and the scale operation is submitted. 
./updatedb.sh ocid1.autonomousdwdatabase.oc1.phx.abyhqljszen442afj6ogitgs3dwk2iunhv7zdndllf6o6is6xg2ku5a7uf3a { "compartmentId" : "ocid1.tenancy.oc1..aaaaaaaaro2vctz2hianklgq77hguo6jzcs6ezyheouqfsald4x3nubpwr2a", "connectionStrings" : { "high" : "adwc.uscom-west-1.oraclecloud.com:1522/nqpobuuiaedvlyf_adwdb1_high.adwc.oraclecloud.com", "low" : "adwc.uscom-west-1.oraclecloud.com:1522/nqpobuuiaedvlyf_adwdb1_low.adwc.oraclecloud.com", "medium" : "adwc.uscom-west-1.oraclecloud.com:1522/nqpobuuiaedvlyf_adwdb1_medium.adwc.oraclecloud.com" }, "cpuCoreCount" : 1, "dataStorageSizeInTBs" : 1, "dbName" : "adwdb1", "definedTags" : { }, "displayName" : "adwdb1", "freeformTags" : { }, "id" : "ocid1.autonomousdwdatabase.oc1.phx.abyhqljszen442afj6ogitgs3dwk2iunhv7zdndllf6o6is6xg2ku5a7uf3a", "licenseModel" : "LICENSE_INCLUDED", "lifecycleDetails" : null, "lifecycleState" : "SCALE_IN_PROGRESS", "serviceConsoleUrl" : "https://adwc.uscom-west-1.oraclecloud.com/console/index.html?tenant_name=OCID1.TENANCY.OC1..AAAAAAAARO2VCTZ2HIANKLGQ77HGUO6JZCS6EZYHEOUQFSALD4X3NUBPWR2A&database_name=ADWDB1&service_type=ADW", "timeCreated" : "2018-09-06T19:56:48.077Z" } If you go to the OCI web interface again you will see that the status for that ADW instance is shown as Scaling in Progress. Stopping and Starting an ADW Instance To stop and start ADW instances you need to use the StopAutonomousDataWarehouse and the StartAutonomousDataWarehouse REST APIs. Here is my stop database script, stopdb.sh. #!/bin/bash . ./oci-curl.sh oci-curl database.us-phoenix-1.oraclecloud.com POST ./empty.json /20160918/autonomousDataWarehouses/$1/actions/stop As you see it takes one argument, $1, which is the database OCI ID as I used in the scale example before. It also refers to the file empty.json which is an empty JSON file with the below content. { } As you will see this requirement is not mentioned in the documentation, but the call will give an error if you do not provide the empty JSON file as input. Here is the script running with my database OCI ID. ./stopdb.sh ocid1.autonomousdwdatabase.oc1.phx.abyhqljszen442afj6ogitgs3dwk2iunhv7zdndllf6o6is6xg2ku5a7uf3a { "compartmentId" : "ocid1.tenancy.oc1..aaaaaaaaro2vctz2hianklgq77hguo6jzcs6ezyheouqfsald4x3nubpwr2a", "connectionStrings" : { "high" : "adwc.uscom-west-1.oraclecloud.com:1522/nqpobuuiaedvlyf_adwdb1_high.adwc.oraclecloud.com", "low" : "adwc.uscom-west-1.oraclecloud.com:1522/nqpobuuiaedvlyf_adwdb1_low.adwc.oraclecloud.com", "medium" : "adwc.uscom-west-1.oraclecloud.com:1522/nqpobuuiaedvlyf_adwdb1_medium.adwc.oraclecloud.com" }, "cpuCoreCount" : 1, "dataStorageSizeInTBs" : 1, "dbName" : "adwdb1", "definedTags" : { }, "displayName" : "adwdb1", "freeformTags" : { }, "id" : "ocid1.autonomousdwdatabase.oc1.phx.abyhqljszen442afj6ogitgs3dwk2iunhv7zdndllf6o6is6xg2ku5a7uf3a", "licenseModel" : "LICENSE_INCLUDED", "lifecycleDetails" : null, "lifecycleState" : "STOPPING", "serviceConsoleUrl" : "https://adwc.uscom-west-1.oraclecloud.com/console/index.html?tenant_name=OCID1.TENANCY.OC1..AAAAAAAARO2VCTZ2HIANKLGQ77HGUO6JZCS6EZYHEOUQFSALD4X3NUBPWR2A&database_name=ADWDB1&service_type=ADW", "timeCreated" : "2018-09-06T19:56:48.077Z" Likewise, you can start the database using a similar call. Here is my script, startdb.sh, that does that. #!/bin/bash . ./oci-curl.sh oci-curl database.us-phoenix-1.oraclecloud.com POST ./empty.json /20160918/autonomousDataWarehouses/$1/actions/start Here it is running for my database. 
./startdb.sh ocid1.autonomousdwdatabase.oc1.phx.abyhqljszen442afj6ogitgs3dwk2iunhv7zdndllf6o6is6xg2ku5a7uf3a { "compartmentId" : "ocid1.tenancy.oc1..aaaaaaaaro2vctz2hianklgq77hguo6jzcs6ezyheouqfsald4x3nubpwr2a", "connectionStrings" : { "high" : "adwc.uscom-west-1.oraclecloud.com:1522/nqpobuuiaedvlyf_adwdb1_high.adwc.oraclecloud.com", "low" : "adwc.uscom-west-1.oraclecloud.com:1522/nqpobuuiaedvlyf_adwdb1_low.adwc.oraclecloud.com", "medium" : "adwc.uscom-west-1.oraclecloud.com:1522/nqpobuuiaedvlyf_adwdb1_medium.adwc.oraclecloud.com" }, "cpuCoreCount" : 1, "dataStorageSizeInTBs" : 1, "dbName" : "adwdb1", "definedTags" : { }, "displayName" : "adwdb1", "freeformTags" : { }, "id" : "ocid1.autonomousdwdatabase.oc1.phx.abyhqljszen442afj6ogitgs3dwk2iunhv7zdndllf6o6is6xg2ku5a7uf3a", "licenseModel" : "LICENSE_INCLUDED", "lifecycleDetails" : null, "lifecycleState" : "STARTING", "serviceConsoleUrl" : "https://adwc.uscom-west-1.oraclecloud.com/console/index.html?tenant_name=OCID1.TENANCY.OC1..AAAAAAAARO2VCTZ2HIANKLGQ77HGUO6JZCS6EZYHEOUQFSALD4X3NUBPWR2A&database_name=ADWDB1&service_type=ADW", "timeCreated" : "2018-09-06T19:56:48.077Z"
Other Operations on ADW Instances
These were examples of the most common operations on an ADW instance. To use the REST APIs for other operations, you can use the same oci-curl function together with the relevant API documentation. For demo purposes, as you saw, I hardcoded some values such as OCIDs; you can further enhance and parameterize these scripts to use them generally in your ADW environment. Next, I will post some examples of managing ADW instances using the command line utility oci-cli.
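As an illustration of those "other operations", here is a minimal sketch of a getdb.sh script (my own addition, not from the post above). It assumes the GetAutonomousDataWarehouse API follows the same /20160918/autonomousDataWarehouses/{id} path pattern used by the list and update calls shown earlier, and it polls the instance until the lifecycle state reaches AVAILABLE; adapt the endpoint and OCID handling to your environment.
#!/bin/bash
# getdb.sh - hedged sketch: fetch one ADW instance with oci-curl and poll its state.
# Assumes oci-curl.sh is configured as described above; the GET path pattern is an
# assumption based on the list/update calls shown in this post.
. ./oci-curl.sh
DB_OCID="$1"   # pass the database OCID as the first argument
while true; do
  STATE=$(oci-curl database.us-phoenix-1.oraclecloud.com get \
    "/20160918/autonomousDataWarehouses/${DB_OCID}" \
    | python -c 'import json,sys; print(json.load(sys.stdin)["lifecycleState"])')
  echo "lifecycleState: ${STATE}"
  [ "${STATE}" = "AVAILABLE" ] && break
  sleep 30
done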


See How Easily You Can Query Object Store with Big Data Cloud Service (BDCS)

What is Object Store? Object Store has become an increasingly popular storage type, especially in the Cloud. It provides several benefits:
- Elasticity. Customers don't have to plan ahead for how much space they need. Need some extra space? Simply load more data into Object Store.
- Scalability. It scales infinitely. At least theoretically :)
- Durability and Availability. Object Store is a first-class citizen in every cloud story, so all vendors do their best to maintain 100% availability and durability. If a disk goes down, it shouldn't worry you. If a node running the object store software goes down, it shouldn't worry you. As a user, you just put data in and read data back from Object Store.
- Cost. In the Cloud, Object Store is the most cost-efficient solution.
Nothing comes for free, and on the downside I would highlight:
- Performance in comparison with HDFS or local block devices. Whenever you read data from Object Store, you read it over the network.
- Inconsistency of performance. You are not alone on the object store, and under the hood it uses physical disks with their own throughput limits. If many users start to read and write data to/from Object Store, you may see performance that differs from what you had a day, week or month ago.
- Security. Unlike filesystems, Object Store has no fine-grained permission policies, so customers will need to reorganize and rebuild their security standards and policies.
Based on the points above, we can conclude that Object Store is well suited as a way to share data across many systems, as well as a historical layer for certain information management systems. If we compare Object Store with HDFS (both are schema-on-read systems, which simply store data and define the schema at runtime, when the user runs a query), I would personally differentiate them like this: HDFS is "write once, read many", Object Store is "write once, read few". So it is a more historical (cheaper and slower) tier than HDFS. In the context of information data management, Object Store sits at the bottom of the pyramid:
How to copy data to Object Store
Well, let's imagine that we have Big Data Cloud Service (BDCS) and want to archive some data from HDFS to Object Store (for example, because we are running out of capacity on HDFS). There are multiple ways to do this (I've written about this earlier here), but I'll pick ODCP - the Oracle-built tool for copying data between multiple sources, including HDFS and Object Store. You can find the full documentation here; below is a brief example of how I did it on my test cluster. First, we need to define the Object Store credentials on the client node (in my case one of the BDCS nodes), where we will run the client:
[opc@node01 ~]$ export CM_ADMIN=admin
[opc@node01 ~]$ export CM_PASSWORD=Welcome1!
[opc@node01 ~]$ export CM_URL=https://cmhost.us2.oraclecloud.com:7183
[opc@node01 ~]$ bda-oss-admin add_swift_cred --swift-username "storage-a424392:alexey@oracle.com" --swift-password "MyPassword-" --swift-storageurl "https://storage-a422222.storage.oraclecloud.com/auth/v2.0/tokens" --swift-provider bdcstorage
After this we can check that the credential appears:
[opc@node01 ~]$ bda-oss-admin list_swift_creds -t
PROVIDER   USERNAME                                       STORAGE URL
bdcstorage storage-a424392:alexey.filanovskiy@oracle.com  https://storage-a422222.storage.oraclecloud.com/auth/v2.0/tokens
Next, we copy the data from HDFS to Object Store:
[opc@node01 ~]$ odcp hdfs:///user/hive/warehouse/parq.db/ swift://tpcds-parq.bdcstorage/parq.db
...
[opc@node01 ~]$ odcp hdfs:///user/hive/warehouse/csv.db/ swift://tpcds-parq.bdcstorage/csv.db
Now we have the data in Object Store:
[opc@node01 ~]$ hadoop fs -du -h swift://tpcds-parq.bdcstorage/parq.db
...
74.2 K   74.2 K   swift://tpcds-parq.bdcstorage/parq.db/store
14.4 G   14.4 G   swift://tpcds-parq.bdcstorage/parq.db/store_returns
272.8 G  272.8 G  swift://tpcds-parq.bdcstorage/parq.db/store_sales
466.1 K  466.1 K  swift://tpcds-parq.bdcstorage/parq.db/time_dim
...
This is a good time to define the tables in the Hive Metastore. I'll show the example for only one table; the rest I created with a script:
0: jdbc:hive2://node03:10000/default> CREATE EXTERNAL TABLE store_sales ( ss_sold_date_sk bigint , ss_sold_time_sk bigint , ss_item_sk bigint , ss_customer_sk bigint , ss_cdemo_sk bigint , ss_hdemo_sk bigint , ss_addr_sk bigint , ss_store_sk bigint , ss_promo_sk bigint , ss_ticket_number bigint , ss_quantity int , ss_wholesale_cost double , ss_list_price double , ss_sales_price double , ss_ext_discount_amt double , ss_ext_sales_price double , ss_ext_wholesale_cost double , ss_ext_list_price double , ss_ext_tax double , ss_coupon_amt double , ss_net_paid double , ss_net_paid_inc_tax double , ss_net_profit double ) STORED AS PARQUET LOCATION 'swift://tpcds-parq.bdcstorage/parq.db/store_sales'
Make sure that you have the required libraries in place for Hive and for Spark:
[opc@node01 ~]$ dcli -C cp /opt/oracle/bda/bdcs/bdcs-rest-api-app/current/lib-hadoop/hadoop-openstack-spoc-2.7.2.jar /opt/cloudera/parcels/CDH-5.13.1-1.cdh5.13.1.p0.2/bin/../lib/hadoop-mapreduce/
[opc@node01 ~]$ dcli -C cp /opt/oracle/bda/bdcs/bdcs-rest-api-app/current/lib-hadoop/hadoop-openstack-spoc-2.7.2.jar /opt/cloudera/parcels/SPARK2-2.2.0.cloudera2-1.cdh5.12.0.p0.232957/lib/spark2/jars/
Now we are ready for the test!
Why should you use smart data formats? Predicate Push Down
In the Big Data world there is a class of file formats called smart formats (for example, ORC and Parquet). They carry metadata inside the file, which can dramatically speed up certain queries. The most powerful feature is Predicate Push Down, which filters data in place, where it actually lives, without moving it over the network. Each Parquet page stores minimum and maximum values, which allows us to skip entire pages. The following SQL predicates can be used for filtering data: < <= = != >= >
So, better to see it once than to hear about it many times.
0: jdbc:hive2://node03:10000/default> select count(1) from parq_swift.store_sales;
...
+-------------+--+
|     _c0     |
+-------------+--+
| 6385178703  |
+-------------+--+
1 row selected (339.221 seconds)
We can take a look at the resource utilization and note that the network is quite heavily utilized. Now, let's try the same with CSV files:
0: jdbc:hive2://node03:10000/default> select count(1) from csv_swift.store_sales;
+-------------+--+
|     _c0     |
+-------------+--+
| 6385178703  |
+-------------+--+
1 row selected (762.38 seconds)
As we can see, the picture is the same - high network utilization - but the query takes even longer. That's because CSV is a row format, so we cannot do column pruning.
So, let's feel the power of Predicate Push Down and use an equality predicate in the query:
0: jdbc:hive2://node03:10000/default> select count(1) from parq_swift.store_sales where ss_ticket_number=50940847;
...
+------+--+
| _c0  |
+------+--+
| 6    |
+------+--+
1 row selected (74.689 seconds)
Now we can see that with Parquet files we barely utilize the network. Let's see how it goes with CSV files:
0: jdbc:hive2://node03:10000/default> select count(1) from csv_swift.store_sales where ss_ticket_number=50940847;
...
+------+--+
| _c0  |
+------+--+
| 6    |
+------+--+
1 row selected (760.682 seconds)
Well, as expected, CSV files get no benefit from the WHERE predicate. But not every function can be offloaded. To illustrate this, I ran a query with a cast function over the Parquet files:
0: jdbc:hive2://node03:10000/default> select count(1) from parq_swift.store_sales where cast(ss_promo_sk as string) like '%3303%';
...
+---------+--+
|   _c0   |
+---------+--+
| 959269  |
+---------+--+
1 row selected (133.829 seconds)
As we can see, part of the data set is moved to the BDCS instance and processed there.
Column projection
Another feature of Parquet is its columnar format, which means the fewer columns we select, the less data we bring back to BDCS. Let me illustrate this by running the same query with one column and then with 24 columns (I'll use the cast function, which is not pushed down).
0: jdbc:hive2://node03:10000/default> select ss_ticket_number from parq_swift.store_sales
. . . . . . . . . . . . . . . . . . . . > where
. . . . . . . . . . . . . . . . . . . . > cast(ss_ticket_number as string) like '%50940847%';
...
127 rows selected (128.887 seconds)
Now I run the query over the same data, but request 24 columns:
0: jdbc:hive2://node03:10000/default> select
. . . . . . . . . . . . . . . . . . . . > ss_sold_date_sk
. . . . . . . . . . . . . . . . . . . . > ,ss_sold_time_sk
. . . . . . . . . . . . . . . . . . . . > ,ss_item_sk
. . . . . . . . . . . . . . . . . . . . > ,ss_customer_sk
. . . . . . . . . . . . . . . . . . . . > ,ss_cdemo_sk
. . . . . . . . . . . . . . . . . . . . > ,ss_hdemo_sk
. . . . . . . . . . . . . . . . . . . . > ,ss_addr_sk
. . . . . . . . . . . . . . . . . . . . > ,ss_store_sk
. . . . . . . . . . . . . . . . . . . . > ,ss_promo_sk
. . . . . . . . . . . . . . . . . . . . > ,ss_ticket_number
. . . . . . . . . . . . . . . . . . . . > ,ss_quantity
. . . . . . . . . . . . . . . . . . . . > ,ss_wholesale_cost
. . . . . . . . . . . . . . . . . . . . > ,ss_list_price
. . . . . . . . . . . . . . . . . . . . > ,ss_sales_price
. . . . . . . . . . . . . . . . . . . . > ,ss_ext_discount_amt
. . . . . . . . . . . . . . . . . . . . > ,ss_ext_sales_price
. . . . . . . . . . . . . . . . . . . . > ,ss_ext_wholesale_cost
. . . . . . . . . . . . . . . . . . . . > ,ss_ext_list_price
. . . . . . . . . . . . . . . . . . . . > ,ss_ext_tax
. . . . . . . . . . . . . . . . . . . . > ,ss_coupon_amt
. . . . . . . . . . . . . . . . . . . . > ,ss_net_paid
. . . . . . . . . . . . . . . . . . . . > ,ss_net_paid_inc_tax
. . . . . . . . . . . . . . . . . . . . > ,ss_net_profit
. . . . . . . . . . . . . . . . . . . . > from parq_swift.store_sales
. . . . . . . . . . . . . . . . . . . . > where
. . . . . . . . . . . . . . . . . . . . > cast(ss_ticket_number as string) like '%50940847%';
...
127 rows selected (333.641 seconds)
I think that after seeing these numbers you will always select only the columns you need.
Object Store vs HDFS performance
Now I'm going to show some performance numbers for Object Store and for HDFS. This is not an official benchmark, just numbers that should give you an idea of how Object Store performance compares with HDFS.
Querying Object Store with Spark SQL
As a bonus, I'd like to show how to query Object Store with Spark SQL.
[opc@node01 ~]$ spark2-shell
....
scala> import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.SparkSession
scala> val warehouseLocation = "file:${system:user.dir}/spark-warehouse"
warehouseLocation: String = file:${system:user.dir}/spark-warehouse
scala> val spark = SparkSession.builder().appName("SparkSessionZipsExample").config("spark.sql.warehouse.dir", warehouseLocation).enableHiveSupport().getOrCreate()
18/07/09 05:36:32 WARN sql.SparkSession$Builder: Using an existing SparkSession; some configuration may not take effect.
spark: org.apache.spark.sql.SparkSession = org.apache.spark.sql.SparkSession@631c244c
scala> spark.catalog.listDatabases.show(false)
+----------+---------------------+----------------------------------------------------+
|name      |description          |locationUri                                         |
+----------+---------------------+----------------------------------------------------+
|csv       |null                 |hdfs://bdcstest-ns/user/hive/warehouse/csv.db       |
|csv_swift |null                 |hdfs://bdcstest-ns/user/hive/warehouse/csv_swift.db |
|default   |Default Hive database|hdfs://bdcstest-ns/user/hive/warehouse              |
|parq      |null                 |hdfs://bdcstest-ns/user/hive/warehouse/parq.db      |
|parq_swift|null                 |hdfs://bdcstest-ns/user/hive/warehouse/parq_swift.db|
+----------+---------------------+----------------------------------------------------+
scala> spark.catalog.listTables.show(false)
+--------------------+--------+-----------+---------+-----------+
|name                |database|description|tableType|isTemporary|
+--------------------+--------+-----------+---------+-----------+
|customer_demographic|default |null       |EXTERNAL |false      |
|iris_hive           |default |null       |MANAGED  |false      |
+--------------------+--------+-----------+---------+-----------+
scala> val resultsDF = spark.sql("select count(1) from parq_swift.store_sales where cast(ss_promo_sk as string) like '%3303%' ")
resultsDF: org.apache.spark.sql.DataFrame = [count(1): bigint]
scala> resultsDF.show()
[Stage 1:==>                                                  (104 + 58) / 2255]
In fact, for Spark SQL there is no difference between Swift and HDFS; all the performance considerations I mentioned above still apply.
Parquet files. Warning!
After looking at these results you may want to convert everything to Parquet files, but don't rush to do so. Parquet is schema-on-write, which means you perform ETL when you convert data into it. ETL brings optimization, but also the possibility of making a mistake during the transformation. Here is an example. I have a table with timestamps, which obviously cannot be less than 0:
hive> create table tweets_parq  ( username  string,    tweet     string,    TIMESTAMP smallint    )  STORED AS PARQUET;
hive> INSERT OVERWRITE TABLE tweets_parq select * from  tweets_flex;
We defined the timestamp as smallint, which is not big enough for some of the data:
hive> select TIMESTAMP from tweets_parq
...
------------
 1472648470
-6744
As a consequence we got an overflow and a negative timestamp. Converting to smart files such as Parquet is a transformation, and during a transformation you can make mistakes. That's why it's better to also preserve the data in its original format.
Conclusion
1) Object Store is not a competitor to HDFS. HDFS is a schema-on-read system that can give you good performance (although definitely lower than a schema-on-write system such as a database), while Object Store gives you elasticity. It's a good option for historical data that you plan to use infrequently.
2) Object Store adds significant startup overhead, so it's not suitable for interactive queries.
3) If you put data on Object Store, consider using smart file formats such as Parquet. They give you the benefits of Predicate Push Down as well as column projection.
4) Converting to smart files such as Parquet is a transformation, and during that transformation you can make mistakes, which is why it's better to also preserve the data in its original format.
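To make the fix for the overflow example concrete, here is a minimal sketch (my own addition, not from the post) that recreates the table with the timestamp declared as bigint instead of smallint. It assumes the hive CLI and the tweets_flex source table shown above; adjust names for your environment.
hive -e '
DROP TABLE IF EXISTS tweets_parq;
CREATE TABLE tweets_parq (
  username    string,
  tweet       string,
  `timestamp` bigint
) STORED AS PARQUET;
INSERT OVERWRITE TABLE tweets_parq SELECT * FROM tweets_flex;
'
With bigint, epoch-second values like 1472648470 fit without overflowing, so the negative timestamps from the original example no longer appear.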


Big Data

Start Planning your Upgrade Strategy to Cloudera 6 on Oracle Big Data Now

Last week Cloudera announced the general availability of Cloudera CDH 6 (read more from Cloudera here). With that, many of the ecosystem components switched to a newer base version, which should provide significant benefits for customer applications. This post describes Oracle's strategy to support our customers in taking up C6 quickly and efficiently, with minimal disruption to their infrastructure.
The Basics
One of the key differences with C6 is its set of core component versions, which are summarized here for everyone's benefit:
Apache Hadoop 3.0
Apache Hive 2.1
Apache Parquet 1.9
Apache Spark 2.2
Apache Solr 7.0
Apache Kafka 1.0
Apache Sentry 2.0
Cloudera Manager 6.0
Cloudera Navigator 6.0
and much more... for full details, always check the Cloudera download bundles or Oracle's documentation. Now what does this all mean for Oracle's Big Data platform (cloud and on-premises) customers?
Upgrading the Platform
This is the part where running Big Data Cloud Service, Big Data Appliance and Big Data Cloud at Customer makes a big difference. As with minor updates, where we move the entire stack (OS, JDK, MySQL, Cloudera CDH and everything else), we will also do this for your CDH 5.x to CDH 6.x move. What to expect:
Target Version: CDH 6.0.1, which at the time of writing this post has not been released
Target Dates: November 2018, with a dependency on the actual 6.0.1 release date
Automated Upgrade: Yes - as with minor releases, CDH and the entire stack (OS, MySQL, JDK) will be upgraded using the Mammoth Utility
As always, Oracle is building this all in house, and we are testing the migration across a number of scenarios for technical correctness.
Application Impact
The first thing to start planning for is what a version uptick like this means for your applications. Will everything work nicely as before? Well, that is where the hard work comes in: testing the actual applications on a C6 version. In general, we would recommend configuring a small BDA/BDCS/BDCC cluster and loading some data (also note the paragraph below on Erasure Coding in that respect) and then doing the appropriate functional testing. Once that is all running satisfactorily and per your expectations, you would start to upgrade existing clusters.
What about Erasure Coding?
This is the big feature that will become available in the 6.1 timeframe. Just to be clear, Erasure Coding is not in the first versions supported by Cloudera. Therefore it will also not be supported on the Oracle platforms, which are based on 6.0.1 (note the 0 in the middle :-) ). As usual, once 6.1 is available, Oracle will offer that as a release to upgrade to, and we will at that time address the details around Erasure Coding, how to get there, and how to leverage it on the Oracle Big Data solutions. To give everyone a quick 10,000-foot guideline: keep using regular block encoding (the current HDFS structure) for best performance, and use Erasure Coding for storage savings, while understanding that the additional network traffic can impact raw performance.
Do I have to Move?
No. You do not have to move to CDH 6, nor do you need to switch to Erasure Coding. We do expect one more 5.x release, most likely 5.16, and will release this on our platforms as well. That is of course a fully supported release. It is then - generally speaking - up to your timelines to move to the C6 platform. As we move closer to the C6 release on BDA, BDCS and BDCC we will provide updates on specific versions to migrate from, dates, timelines etc.
Should you have questions, contact us in the big data community. The preceding is intended to outline our general product direction. It is intended for information purposes only, and may not be incorporated into any contract. It is not a commitment to deliver any material, code, or functionality, and should not be relied upon in making purchasing decisions. The development, release, and timing of any features or functionality described for Oracle’s products remains at the sole discretion of Oracle.  


Big Data

Roadmap Update: BDA 4.13 with Cloudera CDH 5.15.x

In our continued quest to keep all of you informed about the latest versions, releases and approximate timelines, here is the next update. BDA 4.13 will have the following features and versions:
Of course an update to CDH. In this case we are going to uptake CDH 5.15.0. However, the release date of 5.15.1 is pretty close to our planned date, and so we may choose to pick up that version.
We are adding the following features:
Support for SHA-2 certificates with Kerberos
Upgrade and Expand for Kafka clusters (create was introduced in BDA 4.12)
A disk shredding utility, with which you can easily "erase" data on the disks. We expect most customers to use this on cloud nodes
Support for Active Directory in Big Data Manager
We will obviously update the JDK and the Linux OS to the latest versions, as well as apply the latest security updates. Same for MySQL.
Then there is of course the important question of timelines. Right now - subject to change and the below mentioned safe harbor statement - we are looking at mid August as the planned date, assuming we go with 5.15.0. If you are interested in discussing or checking up on the dates or features, or have other questions, see our new community, or visit it using the direct link to our community home. As always, feedback and comments are welcome.
Please note the Safe Harbor Statement below: The preceding is intended to outline our general product direction. It is intended for information purposes only, and may not be incorporated into any contract. It is not a commitment to deliver any material, code, or functionality, and should not be relied upon in making purchasing decisions. The development, release, and timing of any features or functionality described for Oracle’s products remains at the sole discretion of Oracle.


Big Data

Need Help with Oracle Big Data: Here is Where to Go!

We are very excited to announce our newly launched big data community on the Cloud Customer Connect Community. As of today we are live and ready to help, discuss and ensure your questions are answered and your comments are taken on board.
How do you find us? Easy: go to the community main page, then do the following to start asking your questions on big data. Once you are there, click on what you want from us; in this case I assume you want answers to some of your questions. So click on Answers in the menu bar, and then on Platform (PaaS). From there, just look in the Data Management section and choose Big Data. All set... now you are ready - provided you are a member of course - both to ask questions and, if you know some answers, to help others in the community.
What do we cover in this community? Great question. Since the navigation and the title allude to Cloud, you would expect us to cover our cloud service. And that is correct. But because we are Oracle, we have a wide portfolio, and you will have questions about an entire ecosystem of tools, utilities and solutions, as well as architecture questions and ideas. So, rather than limiting questions, ideas and thoughts, we decided to broaden the scope to what we think the community will be discussing. And so here are some of the things we hope we can cover:
Big Data Cloud Service (BDCS) - of course
The Cloudera stack included
Specific cloud features like: Bursting/shrinking, One-click Secure Clusters, Easy Upgrade, Networking / Port Management, and more...
Big Data Spatial and Graph, which is included in BDCS
Big Data Connectors and ODI, also included in BDCS
Big Data Manager and its notebook feature (Zeppelin based) and other cool features
Big Data SQL Cloud Service and of course the general software features in Big Data SQL
Big Data Best Practices
Architecture Patterns and Reference Architectures
Configuration and Tuning / Setup
When to use what tools or technologies
Service and Product roadmaps and announcements
And more
Hopefully that will trigger all of you (and us) to collaborate, discuss and make our community a fun and helpful one.
Who is on here from Oracle? Well, hopefully a lot of people will join us, both from Oracle and from customers, partners and universities/schools. But we, as the product development team, will be manning the front lines. So you will have product management, some architects and some developers working in the community. And with that, see you all soon in the community!


Big Data SQL

Big Data SQL Quick Start. Kerberos - Part 26

In the Hadoop world, Kerberos is the de facto standard for securing a cluster, so it goes without saying that Big Data SQL supports Kerberos. Oracle has good documentation on how to install Big Data SQL over a Kerberized cluster, but today I'd like to show a couple of typical steps for testing and debugging a Kerberized installation. First of all, a word about the test environment: it has 4 nodes, 3 for the Hadoop cluster (vm0[1-3]) and one for the Database (vm04). Kerberos tickets should be initiated from a keytab file, which should exist on the database side (in case of RAC, on each database node) and on each Hadoop node. Let's check that we have a valid Kerberos ticket on the database node:
[oracle@vm04 ~]$ id
uid=500(oracle) gid=500(oinstall) groups=500(oinstall),501(dba)
[oracle@scaj0602bda09vm04 ~]$ klist
Ticket cache: FILE:/tmp/krb5cc_500
Default principal: oracle/martybda@MARTYBDA.ORACLE.COM
Valid starting     Expires            Service principal
07/23/18 01:15:58  07/24/18 01:15:58  krbtgt/MARTYBDA.ORACLE.COM@MARTYBDA.ORACLE.COM
    renew until 07/30/18 01:15:01
Let's check that we have access to HDFS from the database host:
[oracle@vm04 ~]$ cd $ORACLE_HOME/bigdatasql
[oracle@vm04 bigdatasql]$ ls -l|grep hadoop*env
-rw-r--r-- 1 oracle oinstall 2249 Jul 12 15:41 hadoop_martybda.env
[oracle@vm04 bigdatasql]$ source hadoop_martybda.env
[oracle@vm04 bigdatasql]$ hadoop fs -ls
...
Found 4 items
drwx------   - oracle hadoop          0 2018-07-13 06:00 .Trash
drwxr-xr-x   - oracle hadoop          0 2018-07-12 05:10 .sparkStaging
drwx------   - oracle hadoop          0 2018-07-12 05:17 .staging
drwxr-xr-x   - oracle hadoop          0 2018-07-12 05:14 oozie-oozi
[oracle@vm04 bigdatasql]$
Everything seems fine, so let's do the same from a Hadoop node:
[root@vm01 ~]# su - oracle
[oracle@scaj0602bda09vm01 ~]$ id
uid=1000(oracle) gid=1001(oinstall) groups=1001(oinstall),127(hive),1002(dba)
[oracle@vm01 ~]$ klist
Ticket cache: FILE:/tmp/krb5cc_1000
Default principal: oracle/martybda@MARTYBDA.ORACLE.COM
Valid starting     Expires            Service principal
07/23/18 01:15:02  07/24/18 01:15:02  krbtgt/MARTYBDA.ORACLE.COM@MARTYBDA.ORACLE.COM
    renew until 07/30/18 01:15:02
Let's check that we have access to the environment and also create a test file on HDFS:
[oracle@vm01 ~]$ echo "line1" >> test.txt
[oracle@vm01 ~]$ echo "line2" >> test.txt
[oracle@vm01 ~]$ hadoop fs -mkdir /tmp/test_bds
[oracle@vm01 ~]$ hadoop fs -put test.txt /tmp/test_bds
Now let's jump to the database node and create an external table over this file:
[root@vm04 bin]# su - oracle
[oracle@vm04 ~]$ . oraenv <<< orcl
ORACLE_SID = [oracle] ? The Oracle base has been set to /u03/app/oracle
[oracle@vm04 ~]$ sqlplus / as sysdba
SQL*Plus: Release 12.1.0.2.0 Production on Mon Jul 23 06:39:06 2018
Copyright (c) 1982, 2014, Oracle.  All rights reserved.
Connected to:
Oracle Database 12c Enterprise Edition Release 12.1.0.2.0 - 64bit Production
With the Partitioning, OLAP, Advanced Analytics and Real Application Testing options
SQL> alter session set container=PDBORCL;
Session altered.
SQL> CREATE TABLE bds_test (line VARCHAR2(4000))
  ORGANIZATION EXTERNAL
  ( TYPE ORACLE_HDFS
    DEFAULT DIRECTORY DEFAULT_DIR
    LOCATION ('/tmp/test_bds')
  )
  REJECT LIMIT UNLIMITED;
Table created.
SQL>
And sure enough, this is the two-row file we created in the previous step:
SQL> select * from bds_test;
LINE
------------------------------------
line1
line2
Now let's go through some typical Kerberos issues and how to catch them.
Kerberos ticket missing on the database side
Let's simulate the case where the Kerberos ticket is missing on the database side. It's pretty easy; we just use the kdestroy command:
[oracle@vm04 ~]$ kdestroy
[oracle@vm04 ~]$ klist
klist: No credentials cache found (ticket cache FILE:/tmp/krb5cc_500)
extproc caches the Kerberos ticket, so to apply our changes you will need to restart extproc. First, we need to find the extproc name:
[oracle@vm04 admin]$ cd $ORACLE_HOME/hs/admin
[oracle@vm04 admin]$ ls -l
total 24
-rw-r--r-- 1 oracle oinstall 1170 Mar 27 01:04 extproc.ora
-rw-r----- 1 oracle oinstall 3112 Jul 12 15:56 initagt.dat
-rw-r--r-- 1 oracle oinstall  190 Jul 12 15:41 initbds_orcl_martybda.ora
-rw-r--r-- 1 oracle oinstall  489 Mar 27 01:04 initdg4odbc.ora
-rw-r--r-- 1 oracle oinstall  406 Jul 12 15:11 listener.ora.sample
-rw-r--r-- 1 oracle oinstall  244 Jul 12 15:11 tnsnames.ora.sample
The name consists of the database SID and the Hadoop cluster name, so our extproc name is bds_orcl_martybda. Let's stop and start it:
[oracle@vm04 admin]$ mtactl stop bds_orcl_martybda
ORACLE_HOME = "/u03/app/oracle/12.1.0/dbhome_orcl"
MTA init file = "/u03/app/oracle/12.1.0/dbhome_orcl/hs/admin/initbds_orcl_martybda.ora"
oracle 16776 1 0 Jul12 ? 00:49:25 extprocbds_orcl_martybda -mt
Stopping MTA process "extprocbds_orcl_martybda -mt"...
MTA process "extprocbds_orcl_martybda -mt" stopped!
[oracle@vm04 admin]$ mtactl start bds_orcl_martybda
ORACLE_HOME = "/u03/app/oracle/12.1.0/dbhome_orcl"
MTA init file = "/u03/app/oracle/12.1.0/dbhome_orcl/hs/admin/initbds_orcl_martybda.ora"
MTA process "extprocbds_orcl_martybda -mt" is not running!
Checking MTA init parameters...
[O]  INIT_LIBRARY=$ORACLE_HOME/lib/libkubsagt12.so
[O]  INIT_FUNCTION=kubsagtMTAInit
[O]  BDSQL_CLUSTER=martybda
[O]  BDSQL_CONFIGDIR=/u03/app/oracle/12.1.0/dbhome_orcl/bigdatasql/databases/orcl/bigdata_config
MTA process "extprocbds_orcl_martybda -mt" started!
oracle 19498 1 4 06:58 ? 00:00:00 extprocbds_orcl_martybda -mt
Now that we have reset the Kerberos ticket cache, let's try to query the HDFS data:
[oracle@vm04 admin]$ sqlplus / as sysdba
SQL*Plus: Release 12.1.0.2.0 Production on Mon Jul 23 07:00:26 2018
Copyright (c) 1982, 2014, Oracle.  All rights reserved.
Connected to:
Oracle Database 12c Enterprise Edition Release 12.1.0.2.0 - 64bit Production
With the Partitioning, OLAP, Advanced Analytics and Real Application Testing options
SQL> alter session set container=PDBORCL;
Session altered.
SQL> select * from bds_test;
select * from bds_test
*
ERROR at line 1:
ORA-29913: error in executing ODCIEXTTABLEOPEN callout
ORA-29400: data cartridge error
KUP-11504: error from external driver: java.lang.Exception: Error initializing JXADProvider: Failed on local exception: java.io.IOException: javax.security.sasl.SaslException: GSS initiate failed [Caused by GSSException: No valid credentials provided (Mechanism level: Failed to find any Kerberos tgt)]; Host Details : local host is: "m04.vm.oracle.com/192.168.254.5"; destination host is: "vm02.vm.oracle.com":8020;
Remember this error: if you see it, it means that you don't have a valid Kerberos ticket on the database side. Let's bring everything back and make sure that our environment works properly again.
[oracle@vm04 admin]$ crontab -l
15 1,7,13,19 * * * /bin/su - oracle -c "/usr/bin/kinit oracle/martybda@MARTYBDA.ORACLE.COM -k -t /u03/app/oracle/12.1.0/dbhome_orcl/bigdatasql/clusters/martybda/oracle.keytab"
[oracle@vm04 admin]$ /usr/bin/kinit oracle/martybda@MARTYBDA.ORACLE.COM -k -t /u03/app/oracle/12.1.0/dbhome_orcl/bigdatasql/clusters/martybda/oracle.keytab
[oracle@vm04 admin]$ klist
Ticket cache: FILE:/tmp/krb5cc_500
Default principal: oracle/martybda@MARTYBDA.ORACLE.COM
Valid starting     Expires            Service principal
07/23/18 07:03:46  07/24/18 07:03:46  krbtgt/MARTYBDA.ORACLE.COM@MARTYBDA.ORACLE.COM
    renew until 07/30/18 07:03:46
[oracle@vm04 admin]$ mtactl stop bds_orcl_martybda
...
[oracle@vm04 admin]$ mtactl start bds_orcl_martybda
...
[oracle@scaj0602bda09vm04 admin]$ sqlplus / as sysdba
...
SQL> alter session set container=PDBORCL;
Session altered.
SQL> select * from bds_test;
LINE
----------------------------------------
line1
line2
SQL>
Kerberos ticket missing on the Hadoop side
Another case is when the Kerberos ticket is missing on the Hadoop side (for the oracle user). Let's take a look at what happens in that case. For this I again use the kdestroy command on each Hadoop node:
[oracle@vm01 ~]$ id
uid=1000(oracle) gid=1001(oinstall) groups=1001(oinstall),127(hive),1002(dba)
[oracle@vm01 ~]$ kdestroy
After performing these steps, let's go to the database side and run the query again:
[oracle@vm04 bigdata_config]$ sqlplus / as sysdba
...
SQL> alter session set container=PDBORCL;
Session altered.
SQL> select * from bds_test;
LINE
----------------------------------------
line1
line2
SQL>
At first glance everything looks OK, but let's take a look at the execution statistics:
SQL> select n.name, s.value /* , s.inst_id, s.sid */ from v$statname n, gv$mystat s where n.name like '%XT%' and s.statistic# = n.statistic#;
NAME                                                             VALUE
---------------------------------------------------------------- ----------
cell XT granules requested for predicate offload                 1
cell XT granule bytes requested for predicate offload            12
cell interconnect bytes returned by XT smart scan                8192
cell XT granule predicate offload retries                        3
cell XT granule IO bytes saved by storage index                  0
cell XT granule IO bytes saved by HDFS tbs extent map scan       0
We see that "cell XT granule predicate offload retries" is not equal to 0, which means that all the real processing happens on the database side. If you query a 10TB table on HDFS, you will bring all 10TB back and process it on the database side. Not good. So, if the Kerberos ticket is missing on the Hadoop side, the query will finish, but Smart Scan will not work.
Renewal of Kerberos tickets
One of the key Kerberos principles is that tickets have an expiration time and the user has to renew them. During installation, Big Data SQL creates a crontab job that does this on the database side as well as on the Hadoop side.
If it is missing for some reason, you can use this one as an example:
[oracle@vm04 ~]$ crontab -l
15 1,7,13,19 * * * /bin/su - oracle -c "/usr/bin/kinit oracle/martybda@MARTYBDA.ORACLE.COM -k -t /u03/app/oracle/12.1.0/dbhome_orcl/bigdatasql/clusters/martybda/oracle.keytab"
One note: you will always use the oracle principal for Big Data SQL, but if you want fine-grained control over access to HDFS, you have to use the Multi-user Authorization feature, as explained here.
Conclusion
1) Big Data SQL works over Kerberized clusters.
2) You have to have Kerberos tickets on the database side as well as on the Hadoop side.
3) If the Kerberos ticket is missing on the database side, the query will fail.
4) If the Kerberos ticket is missing on the Hadoop side, the query will not fail, but it will run in fallback mode, moving all blocks over the wire to the database node and processing them there. You don't want that :)
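If you want to script the checks behind points 2-4, here is a minimal bash sketch (my own addition, not part of the Big Data SQL tooling). It reuses the principal and keytab path from the crontab example above, which will differ on your system, and the dcli call only applies when run from a node that can reach the Hadoop cluster with dcli.
#!/bin/bash
# check_tickets.sh - hedged sketch: verify Kerberos tickets before running Big Data SQL queries.
# Principal and keytab path are copied from the crontab example above; adjust for your cluster.
PRINCIPAL="oracle/martybda@MARTYBDA.ORACLE.COM"
KEYTAB="/u03/app/oracle/12.1.0/dbhome_orcl/bigdatasql/clusters/martybda/oracle.keytab"
# Database side: klist -s exits non-zero when there is no valid ticket.
if ! klist -s; then
  echo "No valid ticket on this node, renewing from keytab..."
  /usr/bin/kinit "$PRINCIPAL" -k -t "$KEYTAB"
fi
klist
# Hadoop side (run where dcli is available): report any node without a valid oracle ticket.
dcli -C "su - oracle -c 'klist -s' && echo ticket OK || echo ticket MISSING"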


Hadoop Best Practices

Secure Kafka Cluster

A while ago I wrote up Oracle best practices for building a secure Hadoop cluster; you can find the details here. In that blog I intentionally didn't mention Kafka security, because the topic deserves a dedicated article. Now it's time to do that, and this blog is devoted to Kafka security only.
Kafka Security challenges
1) Encryption in motion. By default you communicate with a Kafka cluster over an unsecured network, and anyone who can listen to the network between your client and the Kafka cluster can read the message content. The way to avoid this is to use an on-wire encryption technology - SSL/TLS. With SSL/TLS you encrypt data on the wire between your client and the Kafka cluster. Communication without SSL/TLS: SSL/TLS communication: After you enable SSL/TLS communication, writing and reading a message to/from the Kafka cluster follows this sequence of steps:
2) Authentication. Now we encrypt traffic between client and server, but there is another challenge - the server doesn't know whom it is communicating with. In other words, you have to enable a mechanism that prevents UNKNOWN users from working with the cluster. The default authentication mechanism in the Hadoop world is the Kerberos protocol. Here is the workflow, which shows the sequence of steps to enable secure communication with Kafka: Kerberos is the trusted way to authenticate a user on the cluster and make sure that only known users can access it.
3) Authorization. Once you have authenticated a user on your cluster (and you know you are working with Bob or Alice), you may want to apply authorization rules, such as setting up permissions for certain users or groups - in other words, defining what a user can and cannot do. Sentry can help you with this. In Sentry's model, users belong to groups, groups have roles, and roles have permissions.
4) Encryption at rest. Another security aspect is encryption at rest, which protects data stored on disk. Kafka is not intended for long-term data storage, but it can hold data for days or even weeks. We have to make sure that data stored on the disks cannot be stolen and then read without the encryption key.
Security implementation. Step 1 - SSL/TLS
There is no strict sequence of steps for the security implementation, but as a first step I recommend doing the SSL/TLS configuration. As a baseline I took Cloudera's documentation. To keep your security setup organized, create a directory on your Linux machine where you will put all the files (start with one machine; later you will need to do the same on the other Kafka servers):
$ sudo mkdir -p /opt/kafka/security
$ sudo chown -R kafka:kafka /opt/kafka/security
A Java KeyStore (JKS) is a repository of security certificates – either authorization certificates or public key certificates – plus corresponding private keys, used for instance in SSL encryption. We need to generate a key pair (a public key and associated private key); keytool wraps the public key into an X.509 self-signed certificate, which is stored as a single-element certificate chain. This certificate chain and the private key are stored in a new keystore entry identified by the alias selfsigned.
# keytool -genkeypair -keystore keystore.jks -keyalg RSA -alias selfsigned -dname "CN=localhost" -storepass 'welcome2' -keypass 'welcome3'
If you want to check the content of the keystore, you can run the following command:
# keytool -list -v -keystore keystore.jks
...
Alias name: selfsigned
Creation date: May 30, 2018
Entry type: PrivateKeyEntry
Certificate chain length: 1
Certificate[1]:
Owner: CN=localhost
Issuer: CN=localhost
Serial number: 2065847b
Valid from: Wed May 30 12:59:54 UTC 2018 until: Tue Aug 28 12:59:54 UTC 2018
...
As the next step, we need to extract a copy of the certificate from the Java keystore that was just created:
# keytool -export -alias selfsigned -keystore keystore.jks -rfc -file server.cer
Enter keystore password: welcome2
Then create a trust store by making a copy of the default Java trust store. The main difference between a trustStore and a keyStore is that a trustStore (as the name suggests) stores certificates from trusted Certificate Authorities (CAs), which are used to verify the certificate presented by the server in an SSL connection, while a keyStore stores the private key and the program's own identity certificate, which it presents to the other party (server or client) to verify its identity. You can find more details here. In my case, on Big Data Cloud Service, I ran the following command:
# cp /usr/java/latest/jre/lib/security/cacerts /opt/kafka/security/truststore.jks
Check the files created so far:
# ls -lrt
-rw-r--r-- 1 root root 113367 May 30 12:46 truststore.jks
-rw-r--r-- 1 root root   2070 May 30 12:59 keystore.jks
-rw-r--r-- 1 root root   1039 May 30 13:01 server.cer
Now put the certificate that was just extracted from the keystore into the trust store (note: "changeit" is the standard password):
# keytool -import -alias selfsigned -file server.cer -keystore truststore.jks -storepass changeit
Check the file size afterwards (it's bigger, because it now includes the new certificate):
# ls -lrt
-rw-r--r-- 1 root root   2070 May 30 12:59 keystore.jks
-rw-r--r-- 1 root root   1039 May 30 13:01 server.cer
-rw-r--r-- 1 root root 114117 May 30 13:06 truststore.jks
It may seem complicated, so I depicted all of these steps in one diagram. So far, all of these steps have been performed on a single (randomly chosen broker) machine.
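For reference, here is a minimal sketch of the kind of per-broker SSL settings that the Cloudera Manager changes described below end up configuring. The property names are standard Kafka broker settings, while the host, port, passwords and file name are placeholders taken from this example and are assumptions to adapt to your brokers.
# Hedged sketch: per-broker SSL settings referenced by the safety-valve step below.
cat <<'EOF' > /opt/kafka/security/ssl-broker-snippet.properties
listeners=SSL://kafka1.us2.oraclecloud.com:9093
ssl.keystore.location=/opt/kafka/security/keystore.jks
ssl.keystore.password=welcome2
ssl.key.password=welcome3
ssl.truststore.location=/opt/kafka/security/truststore.jks
ssl.truststore.password=changeit
EOF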
You will need the keystore and truststore files on each Kafka broker, so let's copy them (note: the following syntax works on Big Data Appliance, Big Data Cloud Service and Big Data Cloud at Customer):
# dcli -C "mkdir -p /opt/kafka/security"
# dcli -C "chown kafka:kafka /opt/kafka/security"
# dcli -C -f /opt/kafka/security/keystore.jks -d /opt/kafka/security/keystore.jks
# dcli -C -f /opt/kafka/security/truststore.jks -d /opt/kafka/security/truststore.jks
After all these steps, you need to make some configuration changes in Cloudera Manager for each node (go to Cloudera Manager -> Kafka -> Configuration). In addition, on each node you have to change the listeners in the "Kafka Broker Advanced Configuration Snippet (Safety Valve) for kafka.properties". Also, make sure that in Cloudera Manager security.inter.broker.protocol is set to SSL. After a node restart, when all brokers are up and running, let's test it:
# openssl s_client -debug -connect kafka1.us2.oraclecloud.com:9093 -tls1_2
...
Certificate chain
0 s:/CN=localhost
   i:/CN=localhost
---
Server certificate
-----BEGIN CERTIFICATE-----
MIICxzCCAa+gAwIBAgIEIGWEezANBgkqhkiG9w0BAQsFADAUMRIwEAYDVQQDEwls
b2NhbGhvc3QwHhcNMTgwNTMwMTI1OTU0WhcNMTgwODI4MTI1OTU0WjAUMRIwEAYD
VQQDEwlsb2NhbGhvc3QwggEiMA0GCSqGSIb3DQEBAQUAA4IBDwAwggEKAoIBAQCI
53T82eoDR2e9IId40UPTj3xg3khl1jdjNvMiuB/vcI7koK0XrZqFzMVo6zBzRHnf
zaFBKPAQisuXpQITURh6jrVgAs1V4hswRPrJRjM/jCIx7S5+1INBGoEXk8OG+OEf
m1uYXfULz0bX9fhfl+IdKzWZ7jiX8FY5dC60Rx2RTpATWThsD4mz3bfNd3DlADw2
LH5B5GAGhLqJjr23HFjiTuoQWQyMV5Esn6WhOTPCy1pAkOYqX86ad9qP500zK9lA
hynyEwNHWt6GoHuJ6Q8A9b6JDyNdkjUIjbH+d0LkzpDPg6R8Vp14igxqxXy0N1Sd
DKhsV90F1T0whlxGDTZTAgMBAAGjITAfMB0GA1UdDgQWBBR1Gl9a0KZAMnJEvxaD
oY0YagPKRTANBgkqhkiG9w0BAQsFAAOCAQEAaiNdHY+QVdvLSILdOlWWv653CrG1
2WY3cnK5Hpymrg0P7E3ea0h3vkGRaVqCRaM4J0MNdGEgu+xcKXb9s7VrwhecRY6E
qN0KibRZPb789zQVOS38Y6icJazTv/lSxCRjqHjNkXhhzsD3tjAgiYnicFd6K4XZ
rQ1WiwYq1254e8MsKCVENthQljnHD38ZDhXleNeHxxWtFIA2FXOc7U6iZEXnnaOM
Cl9sHx7EaGRc2adIoE2GXFNK7BY89Ip61a+WUAOn3asPebrU06OAjGGYGQnYbn6k
4VLvneMOjksuLdlrSyc5MToBGptk8eqJQ5tyWV6+AcuwHkTAnrztgozatg==
-----END CERTIFICATE-----
subject=/CN=localhost
issuer=/CN=localhost
---
No client certificate CA names sent
Server Temp Key: ECDH, secp521r1, 521 bits
---
SSL handshake has read 1267 bytes and written 441 bytes
---
New, TLSv1/SSLv3, Cipher is ECDHE-RSA-AES256-GCM-SHA384
Server public key is 2048 bit
Secure Renegotiation IS supported
Compression: NONE
Expansion: NONE
SSL-Session:
    Protocol  : TLSv1.2
    Cipher    : ECDHE-RSA-AES256-GCM-SHA384
    Session-ID: 5B0EAC6CA8FB4B6EA3D0B4A494A4660351A4BD5824A059802E399308C0B472A4
    Session-ID-ctx:
    Master-Key: 60AE24480E2923023012A464D16B13F954A390094167F54CECA1BDCC8485F1E776D01806A17FB332C51FD310730191FE
    Key-Arg   : None
    Krb5 Principal: None
    PSK identity: None
    PSK identity hint: None
    Start Time: 1527688300
    Timeout   : 7200 (sec)
    Verify return code: 18 (self signed certificate)
Well, it seems our SSL connection is up and running. Time to try to put some messages into the topic:
#  kafka-console-producer  --broker-list kafka1.us2.oraclecloud.com:9093  --topic foobar
...
18/05/30 13:56:28 WARN clients.NetworkClient: Connection to node -1 could not be established. Broker may not be available.
18/05/30 13:56:28 WARN clients.NetworkClient: Connection to node -1 could not be established. Broker may not be available.
The reason for this error is that the clients are not configured properly yet. We need to create and use client.properties and jaas.conf files.
# cat /opt/kafka/security/client.properties
security.protocol=SSL
ssl.truststore.location=/opt/kafka/security/truststore.jks
ssl.truststore.password=changeit
-bash-4.1# cat jaas.conf
KafkaClient {
      com.sun.security.auth.module.Krb5LoginModule required
      useTicketCache=true;
    };
# export KAFKA_OPTS="-Djava.security.auth.login.config=/opt/kafka/security/jaas.conf"
Now you can try again to produce messages:
# kafka-console-producer --broker-list kafka1.us2.oraclecloud.com:9093  --topic foobar --producer.config client.properties
...
Hello SSL world
No errors - good! Let's try to consume the message:
# kafka-console-consumer --bootstrap-server kafka1.us2.oraclecloud.com:9093 --topic foobar --from-beginning  --consumer.config /opt/kafka/security/client.properties
...
Hello SSL world
Bingo! So, we created secure communication between the Kafka cluster and the Kafka client and wrote a message there.
Security implementation. Step 2 - Kerberos
So far we have Kafka up and running on a Kerberized cluster, and yet we wrote and read data from the cluster without a Kerberos ticket:
$ klist
klist: No credentials cache found (ticket cache FILE:/tmp/krb5cc_1001)
This is not how it's supposed to work. We expect that if we protect the cluster with Kerberos, it should be impossible to do anything without a ticket. Fortunately, it's relatively easy to configure communication with a Kerberized Kafka cluster. First, make sure that you have enabled Kerberos authentication in Cloudera Manager (Cloudera Manager -> Kafka -> Configuration). Second, go again to Cloudera Manager and change the value of "security.inter.broker.protocol" to SASL_SSL. Note: Simple Authentication and Security Layer (SASL) is a framework for authentication and data security in Internet protocols. It decouples authentication mechanisms from application protocols, in theory allowing any authentication mechanism supported by SASL to be used in any application protocol that uses SASL. Very roughly, in this blog post you may think of SASL as equal to Kerberos. After this change, you need to modify the listeners protocol on each broker (to SASL_SSL) in the "Kafka Broker Advanced Configuration Snippet (Safety Valve) for kafka.properties" setting. Now you are ready to restart the Kafka cluster and write/read data from/to it. Before doing this, you will need to modify the Kafka client credentials:
$ cat /opt/kafka/security/client.properties
security.protocol=SASL_SSL
sasl.kerberos.service.name=kafka
ssl.truststore.location=/opt/kafka/security/truststore.jks
ssl.truststore.password=changeit
After this you can try to read data from the Kafka cluster:
$ kafka-console-consumer --bootstrap-server kafka1.us2.oraclecloud.com:9093 --topic foobar --from-beginning  --consumer.config /opt/kafka/security/client.properties
...
Caused by: org.apache.kafka.common.KafkaException: javax.security.auth.login.LoginException: Could not login: the client is being asked for a password, but the Kafka client code does not currently support obtaining a password from the user. not available to garner authentication information from the user
...

The error may mislead you, but the real reason is the absence of a Kerberos ticket:

$ klist
klist: No credentials cache found (ticket cache FILE:/tmp/krb5cc_1001)
$ kinit oracle
Password for oracle@BDACLOUDSERVICE.ORACLE.COM:
$ kafka-console-consumer --bootstrap-server kafka1.us2.oraclecloud.com:9093 --topic foobar --from-beginning --consumer.config /opt/kafka/security/client.properties
...
Hello SSL world

Great, it works! But now we have to run kinit every time before reading or writing data from the Kafka cluster. Instead, for convenience, we may use a keytab. To do this, go to the KDC server and generate a keytab file there:

# kadmin.local
Authenticating as principal hdfs/admin@BDACLOUDSERVICE.ORACLE.COM with password.
kadmin.local: xst -norandkey -k testuser.keytab testuser
Entry for principal oracle with kvno 2, encryption type aes256-cts-hmac-sha1-96 added to keytab WRFILE:oracle.keytab.
Entry for principal oracle with kvno 2, encryption type aes128-cts-hmac-sha1-96 added to keytab WRFILE:oracle.keytab.
Entry for principal oracle with kvno 2, encryption type des3-cbc-sha1 added to keytab WRFILE:oracle.keytab.
Entry for principal oracle with kvno 2, encryption type arcfour-hmac added to keytab WRFILE:oracle.keytab.
Entry for principal oracle with kvno 2, encryption type des-hmac-sha1 added to keytab WRFILE:oracle.keytab.
Entry for principal oracle with kvno 2, encryption type des-cbc-md5 added to keytab WRFILE:oracle.keytab.
kadmin.local:  quit
# ls -l
...
-rw-------  1 root root    436 May 31 14:06 testuser.keytab
...

Now that we have the keytab file, we can copy it to the client machine and use it for Kerberos authentication. Don't forget to change the owner of the keytab file to the user who will run the script:

$ chown opc:opc /opt/kafka/security/testuser.keytab

We will also need to modify the jaas.conf file:

$ cat /opt/kafka/security/jaas.conf
KafkaClient {
      com.sun.security.auth.module.Krb5LoginModule required
      useKeyTab=true
      keyTab="/opt/kafka/security/testuser.keytab"
      principal="testuser@BDACLOUDSERVICE.ORACLE.COM";
    };

It seems we are fully ready to consume messages from the topic. Even though we have oracle as the Kerberos principal on the OS, we connect to the cluster as testuser (according to jaas.conf):

$ kafka-console-consumer --bootstrap-server kafka1.us2.oraclecloud.com:9093 --topic foobar --from-beginning --consumer.config /opt/kafka/security/client.properties
...
18/05/31 15:04:45 INFO authenticator.AbstractLogin: Successfully logged in.
18/05/31 15:04:45 INFO kerberos.KerberosLogin: [Principal=testuser@BDACLOUDSERVICE.ORACLE.COM]: TGT refresh thread started.
...
Hello SSL world

Security Implementation Step 3 - Sentry

In the previous step we configured Authentication, which answers the question "who am I?". Now it is time to set up an Authorization mechanism, which answers the question "what am I allowed to do?". Sentry has become a very popular engine in the Hadoop world, and we will use it for Kafka's authorization. As I posted earlier, Sentry's philosophy is that users belong to groups, groups have their own roles, and roles have permissions: And we will need to follow this with Kafka as well. But we will start with some service configuration first (Cloudera Manager -> Kafka -> Configuration): Also, it's very important to add the kafka user to "sentry.service.admin.group" in the Sentry config (Cloudera Manager -> Sentry -> Config). Well, once we know who connects to the cluster, we can restrict him or her from reading some particular topics (in other words, perform some authorization). Note: to perform administrative operations with Sentry, you have to work as the kafka user.

$ id
uid=1001(opc) gid=1005(opc) groups=1005(opc)
$ sudo find /var -name kafka*keytab -printf "%T+\t%p\n" | sort|tail -1|cut -f 2
/var/run/cloudera-scm-agent/process/1171-kafka-KAFKA_BROKER/kafka.keytab
$ sudo cp /var/run/cloudera-scm-agent/process/1171-kafka-KAFKA_BROKER/kafka.keytab /opt/kafka/security/kafka.keytab
$ sudo chown opc:opc /opt/kafka/security/kafka.keytab

Obtain a Kafka ticket:

$ kinit -kt /opt/kafka/security/kafka.keytab kafka/`hostname`
$ klist
Ticket cache: FILE:/tmp/krb5cc_1001
Default principal: kafka/kafka1.us2.oraclecloud.com@BDACLOUDSERVICE.ORACLE.COM

Valid starting     Expires            Service principal
05/31/18 15:52:28  06/01/18 15:52:28  krbtgt/BDACLOUDSERVICE.ORACLE.COM@BDACLOUDSERVICE.ORACLE.COM
    renew until 06/05/18 15:52:28

Before configuring and testing Sentry with Kafka, we will need to create an unprivileged user to whom we will give grants (the kafka user is privileged and bypasses Sentry).
There are a few simple steps. Create a test (unprivileged) user on each Hadoop node (this syntax will work on Big Data Appliance, Big Data Cloud Service and Big Data Cloud at Customer):

# dcli -C "useradd testsentry -u 1011"

We should remember that Sentry relies heavily on groups, so we have to create one and put the "testsentry" user into it:

# dcli -C "groupadd testsentry_grp -g 1017"

After the group has been created, we put the user into it:

# dcli -C "usermod -g testsentry_grp testsentry"

Check that everything is as we expect:

# dcli -C "id testsentry"
10.196.64.44: uid=1011(testsentry) gid=1017(testsentry_grp) groups=1017(testsentry_grp)
10.196.64.60: uid=1011(testsentry) gid=1017(testsentry_grp) groups=1017(testsentry_grp)
10.196.64.64: uid=1011(testsentry) gid=1017(testsentry_grp) groups=1017(testsentry_grp)
10.196.64.65: uid=1011(testsentry) gid=1017(testsentry_grp) groups=1017(testsentry_grp)
10.196.64.61: uid=1011(testsentry) gid=1017(testsentry_grp) groups=1017(testsentry_grp)

Note: you have to have the same user ID and group ID on each machine. Now verify that Hadoop can look up the group:

# hdfs groups testsentry
testsentry : testsentry_grp

All these steps have to be performed as root. Next you should create a testsentry principal in the KDC (it's not mandatory, but it keeps things organized and easier to understand). Go to the KDC host and run the following commands:

# kadmin.local
Authenticating as principal root/admin@BDACLOUDSERVICE.ORACLE.COM with password.
kadmin.local:  addprinc testsentry
WARNING: no policy specified for testsentry@BDACLOUDSERVICE.ORACLE.COM; defaulting to no policy
Enter password for principal "testsentry@BDACLOUDSERVICE.ORACLE.COM":
Re-enter password for principal "testsentry@BDACLOUDSERVICE.ORACLE.COM":
Principal "testsentry@BDACLOUDSERVICE.ORACLE.COM" created.
kadmin.local:  xst -norandkey -k testsentry.keytab testsentry
Entry for principal testsentry with kvno 1, encryption type aes256-cts-hmac-sha1-96 added to keytab WRFILE:testsentry.keytab.
Entry for principal testsentry with kvno 1, encryption type aes128-cts-hmac-sha1-96 added to keytab WRFILE:testsentry.keytab.
Entry for principal testsentry with kvno 1, encryption type des3-cbc-sha1 added to keytab WRFILE:testsentry.keytab.
Entry for principal testsentry with kvno 1, encryption type arcfour-hmac added to keytab WRFILE:testsentry.keytab.
Entry for principal testsentry with kvno 1, encryption type des-hmac-sha1 added to keytab WRFILE:testsentry.keytab.
Entry for principal testsentry with kvno 1, encryption type des-cbc-md5 added to keytab WRFILE:testsentry.keytab.

Now we have everything set up for the unprivileged user. Time to start configuring Sentry policies. Since kafka is a superuser, we may run admin commands as the kafka user; for managing Sentry settings we will need to use it. To obtain Kafka credentials we run:

$ kinit -kt /opt/kafka/security/kafka.keytab kafka/`hostname`
$ klist
Ticket cache: FILE:/tmp/krb5cc_1001
Default principal: kafka/kafka1.us2.oraclecloud.com@BDACLOUDSERVICE.ORACLE.COM

Valid starting     Expires            Service principal
06/15/18 01:37:53  06/16/18 01:37:53  krbtgt/BDACLOUDSERVICE.ORACLE.COM@BDACLOUDSERVICE.ORACLE.COM
    renew until 06/20/18 01:37:53

First we need to create a role. Let's call it testsentry_role:

$ kafka-sentry -cr -r testsentry_role

Let's check that the role has been created by listing the existing roles:

$ kafka-sentry -lr
...
admin_role
testsentry_role
[opc@cfclbv3872 ~]$

As soon as the role is created, we need to give it some permissions on a certain topic:

$ kafka-sentry -gpr -r testsentry_role -p "Host=*->Topic=testTopic->action=write"

and also describe:

$ kafka-sentry -gpr -r testsentry_role -p "Host=*->Topic=testTopic->action=describe"

Next, we have to allow a consumer group to read and describe from this topic:

$ kafka-sentry -gpr -r testsentry_role -p "Host=*->Consumergroup=testconsumergroup->action=read"
$ kafka-sentry -gpr -r testsentry_role -p "Host=*->Consumergroup=testconsumergroup->action=describe"

The next step is linking the role and the group: we assign testsentry_role to testsentry_grp (the group automatically inherits all of the role's permissions):

$ kafka-sentry -arg -r testsentry_role -g testsentry_grp

After this, let's check that our mapping worked fine:

$ kafka-sentry -lr -g testsentry_grp
...
testsentry_role

Now let's review the list of permissions that our role has:

$ kafka-sentry -r testsentry_role -lp
...
HOST=*->CONSUMERGROUP=testconsumergroup->action=read
HOST=*->TOPIC=testTopic->action=write
HOST=*->TOPIC=testTopic->action=describe
HOST=*->TOPIC=testTopic->action=read

It's also very important to have the consumer group in the client properties file:

$ cat /opt/kafka/security/client.properties
security.protocol=SASL_SSL
sasl.kerberos.service.name=kafka
ssl.truststore.location=/opt/kafka/security/truststore.jks
ssl.truststore.password=changeit
group.id=testconsumergroup

After all this is set, we need to switch to the testsentry user for testing:

$ kinit -kt /opt/kafka/security/testsentry.keytab testsentry
$ klist
Ticket cache: FILE:/tmp/krb5cc_1001
Default principal: testsentry@BDACLOUDSERVICE.ORACLE.COM

Valid starting     Expires            Service principal
06/15/18 01:38:49  06/16/18 01:38:49  krbtgt/BDACLOUDSERVICE.ORACLE.COM@BDACLOUDSERVICE.ORACLE.COM
    renew until 06/22/18 01:38:49

Test writes:

$ kafka-console-producer --broker-list kafka1.us2.oraclecloud.com:9093 --topic testTopic --producer.config /opt/kafka/security/client.properties
...
> testmessage1
> testmessage2
>

Everything seems fine; now let's test a read:

$ kafka-console-consumer --bootstrap-server kafka1.us2.oraclecloud.com:9093 --topic testTopic --from-beginning --consumer.config /opt/kafka/security/client.properties
...
testmessage1
testmessage2

Now, to show Sentry in action, I'll try to read messages from another topic, which is outside the topics allowed for our test group:

$ kafka-console-consumer --from-beginning --bootstrap-server kafka1.us2.oraclecloud.com:9093 --topic foobar --consumer.config /opt/kafka/security/client.properties
...
18/06/15 02:54:54 INFO internals.AbstractCoordinator: (Re-)joining group testconsumergroup
18/06/15 02:54:54 WARN clients.NetworkClient: Error while fetching metadata with correlation id 13 : {foobar=UNKNOWN_TOPIC_OR_PARTITION}
18/06/15 02:54:54 WARN clients.NetworkClient: Error while fetching metadata with correlation id 15 : {foobar=UNKNOWN_TOPIC_OR_PARTITION}
18/06/15 02:54:54 WARN clients.NetworkClient: Error while fetching metadata with correlation id 16 : {foobar=UNKNOWN_TOPIC_OR_PARTITION}
18/06/15 02:54:54 WARN clients.NetworkClient: Error while fetching metadata with correlation id 17 : {foobar=UNKNOWN_TOPIC_OR_PARTITION}

So, as we can see, we could not read from a topic that we are not authorized to read.
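Since the same grant pattern repeats for every topic you want to open up, it can be handy to wrap it in a small script. Below is a minimal sketch built only from the kafka-sentry commands used above; the script name and its parameters are hypothetical placeholders for your own role, topic and consumer group names:

#!/bin/bash
# grant_topic.sh - sketch: grant producer/consumer privileges on one topic to a Sentry role
# usage: ./grant_topic.sh <role> <topic> <consumer_group>
ROLE=$1
TOPIC=$2
CGROUP=$3

# producer side: write + describe on the topic
kafka-sentry -gpr -r "$ROLE" -p "Host=*->Topic=$TOPIC->action=write"
kafka-sentry -gpr -r "$ROLE" -p "Host=*->Topic=$TOPIC->action=describe"

# consumer side: read + describe on the consumer group
kafka-sentry -gpr -r "$ROLE" -p "Host=*->Consumergroup=$CGROUP->action=read"
kafka-sentry -gpr -r "$ROLE" -p "Host=*->Consumergroup=$CGROUP->action=describe"

# show what the role ends up with
kafka-sentry -r "$ROLE" -lp

Remember to run it while holding the kafka Kerberos ticket, since Sentry administration is performed as the kafka superuser.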
Systematizing all this, I'd like to put the user-group-role-privileges flow in one picture: And I'd also like to summarize the steps required to get the list of privileges for a certain user (testsentry in my example):

// Run as the superuser - kafka
$ kinit -kt /opt/kafka/security/kafka.keytab kafka/`hostname`
$ klist
Ticket cache: FILE:/tmp/krb5cc_1001
Default principal: kafka/cfclbv3872.us2.oraclecloud.com@BDACLOUDSERVICE.ORACLE.COM

Valid starting     Expires            Service principal
06/19/18 02:38:26  06/20/18 02:38:26  krbtgt/BDACLOUDSERVICE.ORACLE.COM@BDACLOUDSERVICE.ORACLE.COM
    renew until 06/24/18 02:38:26

// Get the list of groups a certain user belongs to
$ hdfs groups testsentry
testsentry : testsentry_grp

// Get the list of roles for a certain group
$ kafka-sentry -lr -g testsentry_grp
...
  testsentry_role

// Get the list of permissions for a certain role
$ kafka-sentry -r testsentry_role -lp
...
HOST=*->CONSUMERGROUP=testconsumergroup->action=read
HOST=*->TOPIC=testTopic->action=describe
HOST=*->TOPIC=testTopic->action=write
HOST=*->TOPIC=testTopic->action=read
HOST=*->CONSUMERGROUP=testconsumergroup->action=describe

Based on what we saw above, our user testsentry can read and write to the topic testTopic. For reading data he has to belong to the consumer group "testconsumergroup".

Security Implementation Step 4 - Encryption At Rest

The last part of the security journey is encryption of the data that you store on disk. There are multiple ways to do this; one of the most common is Navigator Encrypt.


Big Data

Big Data SQL 3.2.1 is Now Available

Just wanted to give a quick update. I am pleased to announce that Oracle Big Data SQL 3.2.1 is now available. This release provides support for Oracle Database 12.2.0.1. Here are some key details:
Existing customers using Big Data SQL 3.2 do not need to take this update; Oracle Database 12.2.0.1 support is the reason for the update.
Big Data SQL 3.2.1 can be used for both Oracle Database 12.1.0.2 and 12.2.0.1 deployments.
For Oracle Database 12.2.0.1, Big Data SQL 3.2.1 requires the April Release Update plus the Big Data SQL 3.2.1 one-off patch.
The software is available on ARU; the Big Data SQL 3.2.1 installer will be available on edelivery soon.
Big Data SQL 3.2.1 Installer (Patch 28071671). Note, this is the complete installer; it is not a patch.
Oracle Database 12.2.0.1 April Release Update (Patch 27674384). Ensure your Grid Infrastructure is also on the 12.2.0.1 April Release Update (if you are using GI).
Big Data SQL 3.2.1 one-off on top of the April RU (Patch 26170659). Ensure you pick the appropriate release on the download page. This patch must be applied to each database server and to Grid Infrastructure.
Also, check out the new Big Data SQL Tutorial series on Oracle Learning Library. The series includes numerous videos that help you understand Big Data SQL capabilities. It includes: Introducing the Oracle Big Data Lite Virtual Machine and Hadoop; Introduction to Oracle Big Data SQL; Hadoop and Big Data SQL Architectures; Oracle Big Data SQL Performance Features; Information Lifecycle Management.


Event Hub Cloud Service. Hello world

A while back, I wrote a blog about the Oracle Reference Architecture and the concepts of Schema on Read and Schema on Write. Schema on Read is well suited for a Data Lake, which may ingest any data as it is, without any transformation, and preserve it for a long period of time. At the same time you have two types of data - streaming data and batch. Batch could be log files or RDBMS archives. Streaming data could be IoT, sensors, or Golden Gate replication logs. Apache Kafka is a very popular engine for acquiring streaming data. It has multiple advantages, like scalability, fault tolerance and high throughput. Unfortunately, Kafka is hard to manage. Fortunately, the Cloud simplifies many routine operations. Oracle has three options for deploying Kafka in the Cloud:
1) Use Big Data Cloud Service, where you get a full Cloudera cluster and can deploy Apache Kafka there as part of CDH.
2) Event Hub Cloud Service Dedicated. Here you have to specify server shapes and some other parameters, but the rest is done by the Cloud automagically.
3) Event Hub Cloud Service. This service is fully managed by Oracle; you don't even need to specify any compute shapes. The only things to decide are how long you need to store data in the topic and how many partitions you need (partitions = performance).
Today, I'm going to tell you about the last option, which is the fully managed cloud service. It's really easy to provision: just log in to your Cloud account and choose the "Event Hub" Cloud service. After this, open the service console: Next, click on "Create service": Put in some parameters - the two key ones are Retention period and Number of partitions. The first defines how long you will store messages, the second defines performance for read and write operations. Click next: Confirm and wait a while (usually not more than a few minutes): After a short while, you will be able to see the provisioned service.
Hello world flow. Today I want to show a "Hello world" flow: how to produce (write) and consume (read) a message from Event Hub Cloud Service. The flow is (step by step):
1) Obtain OAuth token
2) Produce message to a topic
3) Create consumer group
4) Subscribe to topic
5) Consume message
Now I'm going to show it in some detail.
OAuth and Authentication token (Step 1)
To deal with Event Hub Cloud Service you have to be familiar with the concepts of OAuth and OpenID. If you are not, you could watch the short video or go through this step by step tutorial. In a couple of words, OAuth token authorization (it tells what I am allowed to access) is a method to restrict access to some resources. One of the main ideas is to decouple the User (real human - Resource Owner) and the Application (Client). The human knows the login and password, but the Client (Application) will not use them every time it needs to reach the Resource Server (which has some info or content). Instead, the Application obtains an Authorization token once and uses it for working with the Resource Server. This is brief; here you may find a more detailed explanation of what OAuth is.
Obtain Token for Event Hub Cloud Service client. As you might expect, to get access to the Resource Server (read: Event Hub messages) you need to obtain an authorization token from the Authorization Server (read: IDCS). Here, I'd like to show the step by step flow of how to obtain this token.
I will start from the end and show the command (REST call) that you have to run to get the token:

#!/bin/bash
curl -k -X POST -u "$CLIENT_ID:$CLIENT_SECRET" \
-d "grant_type=password&username=$THEUSERNAME&password=$THEPASSWORD&scope=$THESCOPE" \
"$IDCS_URL/oauth2/v1/token" \
-o access_token.json

As you can see, there are many parameters required to obtain the OAuth token. Let's take a look at where you can get them. Go to the service and click on the topic you want to work with; there you will find the IDCS Application, click on it: After clicking on it, you will be redirected to the IDCS Application page. Most of the credentials can be found here. Click on Configuration: On this page you will right away find the Client ID and Client Secret (think of them like login and password): Look further down and find the section called Resources: Click on it and you will find another two values you need for the OAuth token - Scope and Primary Audience. One more required parameter - IDCS_URL - you can find in your browser: You now have almost everything you need, except the login and password. This means your Oracle Cloud login and password (what you use when logging in to http://myservices.us.oraclecloud.com): Now you have all the required credentials and you are ready to write a script that automates all this:

#!/bin/bash
export CLIENT_ID=7EA06D3A99D944A5ADCE6C64CCF5C2AC_APPID
export CLIENT_SECRET=0380f967-98d4-45e9-8f9a-45100f4638b2
export THEUSERNAME=john.dunbar
export THEPASSWORD=MyPassword
export SCOPE=/idcs-1d6cc7dae45b40a1b9ef42c7608b9afe-oehtest
export PRIMARY_AUDIENCE=https://7EA06D3A99D944A5ADCE6C64CCF5C2AC.uscom-central-1.oraclecloud.com:443
export THESCOPE=$PRIMARY_AUDIENCE$SCOPE
export IDCS_URL=https://idcs-1d6cc7dae45b40a1b9ef42c7608b9afe.identity.oraclecloud.com

curl -k -X POST -u "$CLIENT_ID:$CLIENT_SECRET" \
-d "grant_type=password&username=$THEUSERNAME&password=$THEPASSWORD&scope=$THESCOPE" \
"$IDCS_URL/oauth2/v1/token" \
-o access_token.json

After running this script, you will have a new file called access_token.json.
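Before moving on, it is worth checking that the call actually returned a token rather than an error payload. A minimal sanity check, and a hypothetical helper that is not part of the original flow, assuming the jq utility (which is used later in this post anyway):

#!/bin/bash
# sketch: verify that access_token.json really contains a token
if jq -e 'has("access_token")' access_token.json > /dev/null; then
  echo "token received, expires in $(jq -r '.expires_in' access_token.json) seconds"
else
  echo "no access_token in response:"
  cat access_token.json
fi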
The access_token field is what you need:

$ cat access_token.json
{"access_token":"eyJ4NXQjUzI1NiI6InVUMy1YczRNZVZUZFhGbXFQX19GMFJsYmtoQjdCbXJBc3FtV2V4U2NQM3MiLCJ4NXQiOiJhQ25HQUpFSFdZdU9tQWhUMWR1dmFBVmpmd0UiLCJraWQiOiJTSUdOSU5HX0tFWSIsImFsZyI6IlJTMjU2In0.eyJ1c2VyX3R6IjoiQW1lcmljYVwvQ2hpY2FnbyIsInN1YiI6ImpvaG4uZHVuYmFyIiwidXNlcl9sb2NhbGUiOiJlbiIsInVzZXJfZGlzcGxheW5hbWUiOiJKb2huIER1bmJhciIsInVzZXIudGVuYW50Lm5hbWUiOiJpZGNzLTFkNmNjN2RhZTQ1YjQwYTFiOWVmNDJjNzYwOGI5YWZlIiwic3ViX21hcHBpbmdhdHRyIjoidXNlck5hbWUiLCJpc3MiOiJodHRwczpcL1wvaWRlbnRpdHkub3JhY2xlY2xvdWQuY29tXC8iLCJ0b2tfdHlwZSI6IkFUIiwidXNlcl90ZW5hbnRuYW1lIjoiaWRjcy0xZDZjYzdkYWU0NWI0MGExYjllZjQyYzc2MDhiOWFmZSIsImNsaWVudF9pZCI6IjdFQTA2RDNBOTlEOTQ0QTVBRENFNkM2NENDRjVDMkFDX0FQUElEIiwiYXVkIjpbInVybjpvcGM6bGJhYXM6bG9naWNhbGd1aWQ9N0VBMDZEM0E5OUQ5NDRBNUFEQ0U2QzY0Q0NGNUMyQUMiLCJodHRwczpcL1wvN0VBMDZEM0E5OUQ5NDRBNUFEQ0U2QzY0Q0NGNUMyQUMudXNjb20tY2VudHJhbC0xLm9yYWNsZWNsb3VkLmNvbTo0NDMiXSwidXNlcl9pZCI6IjM1Yzk2YWUyNTZjOTRhNTQ5ZWU0NWUyMDJjZThlY2IxIiwic3ViX3R5cGUiOiJ1c2VyIiwic2NvcGUiOiJcL2lkY3MtMWQ2Y2M3ZGFlNDViNDBhMWI5ZWY0MmM3NjA4YjlhZmUtb2VodGVzdCIsImNsaWVudF90ZW5hbnRuYW1lIjoiaWRjcy0xZDZjYzdkYWU0NWI0MGExYjllZjQyYzc2MDhiOWFmZSIsInVzZXJfbGFuZyI6ImVuIiwiZXhwIjoxNTI3Mjk5NjUyLCJpYXQiOjE1MjY2OTQ4NTIsImNsaWVudF9ndWlkIjoiZGVjN2E4ZGRhM2I4NDA1MDgzMjE4NWQ1MzZkNDdjYTAiLCJjbGllbnRfbmFtZSI6Ik9FSENTX29laHRlc3QiLCJ0ZW5hbnQiOiJpZGNzLTFkNmNjN2RhZTQ1YjQwYTFiOWVmNDJjNzYwOGI5YWZlIiwianRpIjoiMDkwYWI4ZGYtNjA0NC00OWRlLWFjMTEtOGE5ODIzYTEyNjI5In0.aNDRIM5Gv_fx8EZ54u4AXVNG9B_F8MuyXjQR-vdyHDyRFxTefwlR3gRsnpf0GwHPSJfZb56wEwOVLraRXz1vPHc7Gzk97tdYZ-Mrv7NjoLoxqQj-uGxwAvU3m8_T3ilHthvQ4t9tXPB5o7xPII-BoWa-CF4QC8480ThrBwbl1emTDtEpR9-4z4mm1Ps-rJ9L3BItGXWzNZ6PiNdVbuxCQaboWMQXJM9bSgTmWbAYURwqoyeD9gMw2JkwgNMSmljRnJ_yGRv5KAsaRguqyV-x-lyE9PyW9SiG4rM47t-lY-okMxzchDm8nco84J5XlpKp98kMcg65Ql5Y3TVYGNhTEg","token_type":"Bearer","expires_in":604800}

Create a Linux variable for it:

#!/bin/bash
export TOKEN=`cat access_token.json |jq .access_token|sed 's/\"//g'`

Well, now we have the authorization token and may work with our Resource Server (Event Hub Cloud Service). Note: you may also check the documentation on how to obtain an OAuth token.

Produce Messages (Write data) to Kafka (Step 2)

The first thing that we may want to do is produce messages (write data to the Kafka cluster). To make scripting easier, it's also better to use some environment variables for common resources. For this example, I'd recommend parametrizing the topic's endpoint, the topic name, the type of content to be accepted and the content type. The content type is completely up to the developer, but you have to consume (read) the same format as you produce (write). The key parameter to define is the REST endpoint.
Go to PSM, click on the topic name and copy everything up to "restproxy": Also, you will need the topic name, which you can take from the same window: Now we can write a simple script to produce one message to Kafka:

#!/bin/bash
export OEHCS_ENDPOINT=https://oehtest-gse00014957.uscom-central-1.oraclecloud.com:443/restproxy
export TOPIC_NAME=idcs-1d6cc7dae45b40a1b9ef42c7608b9afe-oehtest
export CONTENT_TYPE=application/vnd.kafka.json.v2+json

curl -X POST \
-H "Authorization: Bearer $TOKEN" \
-H "Content-Type: $CONTENT_TYPE" \
--data '{"records":[{"value":{"foo":"bar"}}]}' \
$OEHCS_ENDPOINT/topics/$TOPIC_NAME

If everything is fine, the Linux console will return something like:

{"offsets":[{"partition":1,"offset":8,"error_code":null,"error":null}],"key_schema_id":null,"value_schema_id":null}

Create Consumer Group (Step 3)

The first step to read data from OEHCS is to create a consumer group. We will reuse the environment variables from the previous step, but just in case I'll include them in this script:

#!/bin/bash
export OEHCS_ENDPOINT=https://oehtest-gse00014957.uscom-central-1.oraclecloud.com:443/restproxy
export CONTENT_TYPE=application/vnd.kafka.json.v2+json
export TOPIC_NAME=idcs-1d6cc7dae45b40a1b9ef42c7608b9afe-oehtest

curl -X POST \
-H "Authorization: Bearer $TOKEN" \
-H "Content-Type: $CONTENT_TYPE" \
--data '{"format": "json", "auto.offset.reset": "earliest"}' \
$OEHCS_ENDPOINT/consumers/oehcs-consumer-group \
-o consumer_group.json

This script generates an output file, which contains variables that we will need to consume messages.

Subscribe to a topic (Step 4)

Now you are ready to subscribe to this topic (export the environment variables if you didn't do so before):

#!/bin/bash
export BASE_URI=`cat consumer_group.json |jq .base_uri|sed 's/\"//g'`
export TOPIC_NAME=idcs-1d6cc7dae45b40a1b9ef42c7608b9afe-oehtest

curl -X POST \
-H "Authorization: Bearer $TOKEN" \
-H "Content-Type: $CONTENT_TYPE" \
-d "{\"topics\": [\"$TOPIC_NAME\"]}" \
$BASE_URI/subscription

If everything is fine, this request will not return anything.

Consume (Read) messages (Step 5)

Finally, we approach the last step - consuming messages. Again, it's quite a simple curl request:

#!/bin/bash
export BASE_URI=`cat consumer_group.json |jq .base_uri|sed 's/\"//g'`
export H_ACCEPT=application/vnd.kafka.json.v2+json

curl -X GET \
-H "Authorization: Bearer $TOKEN" \
-H "Accept: $H_ACCEPT" \
$BASE_URI/records

If everything works like it is supposed to, you will have output like:

[{"topic":"idcs-1d6cc7dae45b40a1b9ef42c7608b9afe-oehtest","key":null,"value":{"foo":"bar"},"partition":1,"offset":17}]

Conclusion

Today we saw how easy it is to create a fully managed Kafka topic in Event Hub Cloud Service, and we also took the first steps with it - writing and reading a message. Kafka is a really popular message bus engine, but it's hard to manage. The Cloud simplifies this and allows customers to concentrate on the development of their applications. Here I also want to give some useful links:
1) If you are not familiar with the REST API, I'd recommend you go through this blog
2) There is an online tool which helps to validate your curl requests
3) Here you can find some useful examples of producing and consuming messages
4) If you are not familiar with OAuth, here is a nice tutorial which shows an end to end example
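Putting steps 1 through 5 together, here is a minimal end-to-end sketch that simply chains the calls shown above. It assumes a valid access_token.json from step 1 and reuses the same example endpoint, topic and consumer group names used in this post:

#!/bin/bash
# End-to-end recap: token -> produce -> consumer group -> subscribe -> consume
export OEHCS_ENDPOINT=https://oehtest-gse00014957.uscom-central-1.oraclecloud.com:443/restproxy
export TOPIC_NAME=idcs-1d6cc7dae45b40a1b9ef42c7608b9afe-oehtest
export CONTENT_TYPE=application/vnd.kafka.json.v2+json
export TOKEN=`cat access_token.json |jq .access_token|sed 's/\"//g'`

# 2) produce one message
curl -X POST -H "Authorization: Bearer $TOKEN" -H "Content-Type: $CONTENT_TYPE" \
  --data '{"records":[{"value":{"foo":"bar"}}]}' \
  $OEHCS_ENDPOINT/topics/$TOPIC_NAME

# 3) create a consumer group
curl -X POST -H "Authorization: Bearer $TOKEN" -H "Content-Type: $CONTENT_TYPE" \
  --data '{"format": "json", "auto.offset.reset": "earliest"}' \
  $OEHCS_ENDPOINT/consumers/oehcs-consumer-group -o consumer_group.json
export BASE_URI=`cat consumer_group.json |jq .base_uri|sed 's/\"//g'`

# 4) subscribe to the topic
curl -X POST -H "Authorization: Bearer $TOKEN" -H "Content-Type: $CONTENT_TYPE" \
  -d "{\"topics\": [\"$TOPIC_NAME\"]}" $BASE_URI/subscription

# 5) read the records back
curl -X GET -H "Authorization: Bearer $TOKEN" -H "Accept: $CONTENT_TYPE" $BASE_URI/records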


Data Warehousing

Autonomous Data Warehouse is LIVE!

That’s right: Autonomous Data Warehouse Cloud is LIVE and available in the Oracle Cloud. ADWC Launch Event at Oracle Conference Center We had a major launch event on Thursday last week at the Oracle Conference Center in Redwood Shores which got a huge amount of press coverage. Larry Ellison delivered the main keynote covering how our next-generation cloud service is built on the self-driving Oracle Autonomous Database technology which leverages machine learning to deliver unprecedented performance, reliability and ease of deployment for data warehouses. As an autonomous cloud service, it eliminates error-prone manual management tasks and, most importantly for a lot of readers of this blog, frees up DBA resources, which can now be applied to implementing more strategic business projects. The key highlights of our Oracle Autonomous Data Warehouse Cloud include:
Ease of Use: Unlike traditional cloud services with complex, manual configurations that require a database expert to specify data distribution keys and sort keys, build indexes, reorganize data or adjust compression, Oracle Autonomous Data Warehouse Cloud is a simple "load and go" service. Users specify tables, load data and then run their workloads in a matter of seconds, with no manual tuning needed.
Industry-Leading Performance: Unlike traditional cloud services, which use generic compute shapes for database cloud services, Oracle Autonomous Data Warehouse Cloud is built on the high-performance Oracle Exadata platform. Performance is further enhanced by fully-integrated machine learning algorithms which drive automatic caching, adaptive indexing and advanced compression.
Instant Elasticity: Oracle Autonomous Data Warehouse Cloud allocates new data warehouses of any size in seconds and scales compute and storage resources independently of one another with no downtime. Elasticity enables customers to pay for exactly the resources that the database workloads require as they grow and shrink.
To highlight these three unique aspects of Autonomous Data Warehouse Cloud the launch included a live, on-stage demo of ADWC and Oracle Analytics Cloud. If you have never seen a new data warehouse delivered in seconds rather than days then pay careful attention to the demo video below where George Lumpkin creates a new fully autonomous data warehouse with a few mouse clicks and then starts to query one of the sample schemas, shipped with ADWC, using OAC. Probably the most important section was the panel discussion with a handful of our early adopter customers which was hosted by Steve Daheb, Senior Vice President, Oracle Cloud. As always, it’s great to hear customers talk about how the simplicity and speed of ADWC are bringing about significant changes to the way our customers think about their data. If you missed all the excitement, the keynote, demos and discussions, then here is some great news: we recorded everything for you so you can watch it from the comfort of your desk. Below are the links to the three main parts of the launch:
Video: Larry Ellison, CTO and Executive Chairman, Oracle, introduces Oracle Autonomous Database Cloud. Oracle Autonomous Database Cloud eliminates complexity and human error, helping to ensure higher reliability, security, and efficiency at the lowest cost.
Video: Steve Daheb, Senior Vice President, Oracle Cloud, discusses the benefits of Oracle Autonomous Cloud Platform with Oracle customers: Paul Daugherty, Accenture; Benjamin Arnulf, Hertz; Michael Morales, QMP Health; Al Cordoba, QLX.
Video: George Lumpkin, Vice President of Product Management, Oracle, demonstrates the self-driving, self-securing, and self-repairing capabilities of Oracle Autonomous Data Warehouse Cloud.
So what's next? So you are all fired up and you want to learn more about Autonomous Data Warehouse Cloud! Where do you go? The first place to visit is the ADWC home page on cloud.oracle.com: https://cloud.oracle.com/datawarehouse
Can I Try It? Yes you can! We have a great program that lets you get started with Oracle Cloud for free with $300 in free credits. Using your credits (which will probably last you around 30 days depending on how you configure your ADWC) you will be able to get valuable hands-on time to try loading some of your own workloads and to test integration with our other cloud services such as Analytics Cloud and Data Integration Cloud.
Are there any tutorials to help me get started? Yes there are! We have quick start tutorials covering both Autonomous Data Warehouse Cloud and our bundled SQL notebook application called Oracle Machine Learning, just click here: Provisioning Autonomous Data Warehouse Cloud; Connecting SQL Developer and Creating Tables; Loading Your Data; Running a Query on Sample Data; Creating Projects and Workspaces in OML; Creating and Running Notebooks; Collaborating in OML; Creating a SQL Script; Running SQL Statements.
Is the documentation available? Yes it is! The documentation set for ADWC is right here and the documentation set for Oracle Machine Learning is right here.
Anything else I need to know? Yes there is! Over the next few weeks I will be posting links to more videos where our ADWC customers will talk about their experiences of using ADWC during the last couple of months. There will be information about some deep-dive online tutorials that you can use as part of your free $300 trial, along with lots of other topics that are too numerous to list. If you have a burning question about Oracle Autonomous Data Warehouse Cloud then feel free to reach out to me via email: keith.laker@oracle.com


Object Store Service operations. Part 1 - Loading data

One of the most common and clear trends in the IT market is Cloud, and one of the most common and clear trends in the Cloud is the Object Store. Some introductory information you may find here. Many Cloud providers, including Oracle, assume that the data lifecycle starts from the Object Store: you land data there and then either read or load it with different services, such as ADWC or BDCS, for example. Oracle has two flavors of Object Store Services (OSS): OSS on OCI (Oracle Cloud Infrastructure) and OSS on OCI-C (Oracle Cloud Infrastructure Classic). In this post, I'm going to focus on OSS on OCI-C, mostly because OSS on OCI was perfectly explained by Hermann Baer here and by Rachna Thusoo here.

Upload/Download files. As in Hermann's blog, I'll focus on the most frequent operations: Upload and Download. There are multiple ways to do so. For example:
- Oracle Cloud WebUI
- REST API
- FTM CLI tool
- Third-party tools such as CloudBerry
- Big Data Manager (via ODCP)
- Hadoop client with Swift API
- Oracle Storage Software Appliance
Let's start with the easiest one - the Web Interface.

Upload/Download files. WebUI. For sure you have to start by logging in to cloud services: then, you have to go to the Object Store Service: after this drill down into the Service Console and you will be able to see the list of the containers within your OSS: To create a new container (bucket in OCI terminology), simply click on "Create Container" and give it a name: After it has been created, click on it and go to the "Upload object" button: Click and click again, and here we are, the file is in the container: Let's try to upload a bigger file... oops, we got an error: So it seems we have a 5GB limitation. Fortunately, there is "Large object upload", which allows us to upload files bigger than 5GB: And what about downloading? It's easy: simply click download and land the file on the local file system.

Upload/Download files. REST API. The WebUI may be a good way to upload data when a human operates it, but it's not too convenient for scripting. If you want to automate your file uploads, you may use the REST API. You may find all the details regarding the REST API here, but alternatively you may use the script I'm publishing below, which can hint at some basic commands:

#!/bin/bash
shopt -s expand_aliases
alias echo="echo -e"
USER="alexey.filanovskiy@oracle.com"
PASS="MySecurePassword"
OSS_USER="storage-a424392:${USER}"
OSS_PASS="${PASS}"
OSS_URL="https://storage-a424392.storage.oraclecloud.com/auth/v1.0"
echo "curl -k -sS -H \"X-Storage-User: ${OSS_USER}\" -H \"X-Storage-Pass:${OSS_PASS}\" -i \"${OSS_URL}\""
out=`curl -k -sS -H "X-Storage-User: ${OSS_USER}" -H "X-Storage-Pass:${OSS_PASS}" -i "${OSS_URL}"`
while [ $? -ne 0 ]; do
  echo "Retrying to get token\n"
  sleep 1;
  out=`curl -k -sS -H "X-Storage-User: ${OSS_USER}" -H "X-Storage-Pass:${OSS_PASS}" -i "${OSS_URL}"`
done
AUTH_TOKEN=`echo "${out}" | grep "X-Auth-Token" | sed 's/X-Auth-Token: //;s/\r//'`
STORAGE_TOKEN=`echo "${out}" | grep "X-Storage-Token" | sed 's/X-Storage-Token: //;s/\r//'`
STORAGE_URL=`echo "${out}" | grep "X-Storage-Url" | sed 's/X-Storage-Url: //;s/\r//'`
echo "Token and storage URL:"
echo "\tOSS url: ${OSS_URL}"
echo "\tauth token: ${AUTH_TOKEN}"
echo "\tstorage token: ${STORAGE_TOKEN}"
echo "\tstorage url: ${STORAGE_URL}"
echo "\nContainers:"
for CONTAINER in `curl -k -sS -u "${USER}:${PASS}" "${STORAGE_URL}"`; do
  echo "\t${CONTAINER}"
done
FILE_SIZE=$((1024*1024*1))
CONTAINER="example_container"
FILE="file.txt"
LOCAL_FILE="./${FILE}"
FILE_AT_DIR="/path/file.txt"
LOCAL_FILE_AT_DIR=".${FILE_AT_DIR}"
REMOTE_FILE="${CONTAINER}/${FILE}"
REMOTE_FILE_AT_DIR="${CONTAINER}${FILE_AT_DIR}"
for f in "${LOCAL_FILE}" "${LOCAL_FILE_AT_DIR}"; do
  if [ ! -e "${f}" ]; then
    echo "\nInfo: File "${f}" does not exist. Creating ${f}"
    d=`dirname "${f}"`
    mkdir -p "${d}";
    tr -dc A-Za-z0-9 </dev/urandom | head -c "${FILE_SIZE}" > "${f}"
    #dd if="/dev/random" of="${f}" bs=1 count=0 seek=${FILE_SIZE} &> /dev/null
  fi;
done;
echo "\nActions:"
echo "\tListing containers:\t\t\t\tcurl -k -vX GET -u \"${USER}:${PASS}\" \"${STORAGE_URL}/\""
echo "\tCreate container \"oss://${CONTAINER}\":\t\tcurl -k -vX PUT -u \"${USER}:${PASS}\" \"${STORAGE_URL}/${CONTAINER}\""
echo "\tListing objects at container \"oss://${CONTAINER}\":\tcurl -k -vX GET -u \"${USER}:${PASS}\" \"${STORAGE_URL}/${CONTAINER}/\""
echo "\n\tUpload \"${LOCAL_FILE}\" to \"oss://${REMOTE_FILE}\":\tcurl -k -vX PUT -T \"${LOCAL_FILE}\" -u \"${USER}:${PASS}\" \"${STORAGE_URL}/${CONTAINER}/\""
echo "\tDownload \"oss://${REMOTE_FILE}\" to \"${LOCAL_FILE}\":\tcurl -k -vX GET -u \"${USER}:${PASS}\" \"${STORAGE_URL}/${REMOTE_FILE}\" > \"${LOCAL_FILE}\""
echo "\n\tDelete \"oss://${REMOTE_FILE}\":\tcurl -k -vX DELETE -u \"${USER}:${PASS}\" \"${STORAGE_URL}/${REMOTE_FILE}\""
echo "\ndone"

I put the content of this script into a file called oss_operations.sh, gave it execute permission and ran it:

$ chmod +x oss_operations.sh
$ ./oss_operations.sh

The output will look like:

curl -k -sS -H "X-Storage-User: storage-a424392:alexey.filanovskiy@oracle.com" -H "X-Storage-Pass:MySecurePass" -i "https://storage-a424392.storage.oraclecloud.com/auth/v1.0"
Token and storage URL:
    OSS url: https://storage-a424392.storage.oraclecloud.com/auth/v1.0
    auth token: AUTH_tk45d49d9bcd65753f81bad0eae0aeb3db
    storage token: AUTH_tk45d49d9bcd65753f81bad0eae0aeb3db
    storage url: https://storage.us2.oraclecloud.com/v1/storage-a424392
Containers:
    123_OOW17
    1475233258815
    1475233258815-segments
    Container
...
Actions:
Listing containers: curl -k -vX GET -u "alexey.filanovskiy@oracle.com:MySecurePassword" "https://storage.us2.oraclecloud.com/v1/storage-a424392/"
Create container "oss://example_container": curl -k -vX PUT -u "alexey.filanovskiy@oracle.com:MySecurePassword" "https://storage.us2.oraclecloud.com/v1/storage-a424392/example_container"
Listing objects at container "oss://example_container": curl -k -vX GET -u "alexey.filanovskiy@oracle.com:MySecurePassword" "https://storage.us2.oraclecloud.com/v1/storage-a424392/example_container/"
Upload "./file.txt" to "oss://example_container/file.txt": curl -k -vX PUT -T "./file.txt" -u "alexey.filanovskiy@oracle.com:MySecurePassword" "https://storage.us2.oraclecloud.com/v1/storage-a424392/example_container/"
Download "oss://example_container/file.txt" to "./file.txt": curl -k -vX GET -u "alexey.filanovskiy@oracle.com:MySecurePassword" "https://storage.us2.oraclecloud.com/v1/storage-a424392/example_container/file.txt" > "./file.txt"
Delete "oss://example_container/file.txt": curl -k -vX DELETE -u "alexey.filanovskiy@oracle.com:MySecurePassword" "https://storage.us2.oraclecloud.com/v1/storage-a424392/example_container/file.txt"

Upload/Download files. FTM CLI. The REST API may seem a bit cumbersome and quite hard to use. The good news is that there is a kind of intermediate solution: a Command Line Interface, FTM CLI. Again, the full documentation is available here, but I'd like to briefly explain what you can do with FTM CLI. You can download it here, and after unpacking it's ready to use:

$ unzip ftmcli-v2.4.2.zip
...
$ cd ftmcli-v2.4.2
$ ls -lrt
total 120032
-rwxr-xr-x 1 opc opc 1272 Jan 29 08:42 README.txt
-rw-r--r-- 1 opc opc 15130743 Mar 7 12:59 ftmcli.jar
-rw-rw-r-- 1 opc opc 107373568 Mar 22 13:37 file.txt
-rw-rw-r-- 1 opc opc 641 Mar 23 10:34 ftmcliKeystore
-rw-rw-r-- 1 opc opc 315 Mar 23 10:34 ftmcli.properties
-rw-rw-r-- 1 opc opc 373817 Mar 23 15:24 ftmcli.log

You may note that there is a file ftmcli.properties; it can simplify your life if you configure it once. You may find the documentation here; my example of this config is:

$ cat ftmcli.properties
#saving authkey
#Fri Mar 30 21:15:25 UTC 2018
rest-endpoint=https\://storage-a424392.storage.oraclecloud.com/v1/storage-a424392
retries=5
user=alexey.filanovskiy@oracle.com
segments-container=all_segments
max-threads=15
storage-class=Standard
segment-size=100

Now we have all the connection details and we may use the CLI in the simplest possible way. There are a few basic commands available with FTM CLI, but as a first step I'd suggest authenticating the user (enter the password once):

$ java -jar ftmcli.jar list --save-auth-key
Enter your password:

If you use "--save-auth-key" it will save your password and next time it will not ask for it:

$ java -jar ftmcli.jar list
123_OOW17
1475233258815
...

You may refer to the documentation for the full list of commands, or simply run ftmcli without any arguments:

$ java -jar ftmcli.jar
...
Commands:
upload Upload a file or a directory to a container.
download Download an object or a virtual directory from a container.
create-container Create a container.
restore Restore an object from an Archive container.
list List containers in the account or objects in a container.
delete Delete a container in the account or an object in a container.
describe Describes the attributes of a container in the account or an object in a container.
set Set the metadata attribute(s) of a container in the account or an object in a container.
set-crp Set a replication policy for a container.
copy Copy an object to a destination container.

Let's try to accomplish the standard flow for OSS: create a container, upload a file there, list the objects in the container, describe the container properties and delete it.

# Create container
$ java -jar ftmcli.jar create-container container_for_blog
Name: container_for_blog
Object Count: 0
Bytes Used: 0
Storage Class: Standard
Creation Date: Fri Mar 30 21:50:15 UTC 2018
Last Modified: Fri Mar 30 21:50:14 UTC 2018
Metadata
---------------
x-container-write: a424392.storage.Storage_ReadWriteGroup
x-container-read: a424392.storage.Storage_ReadOnlyGroup,a424392.storage.Storage_ReadWriteGroup
content-type: text/plain;charset=utf-8
accept-ranges: bytes
Custom Metadata
---------------
x-container-meta-policy-georeplication: container

# Upload file to container
$ java -jar ftmcli.jar upload container_for_blog file.txt
Uploading file: file.txt to container: container_for_blog
File successfully uploaded: file.txt
Estimated Transfer Rate: 16484KB/s

# List files in the container
$ java -jar ftmcli.jar list container_for_blog
file.txt

# Get container metadata
$ java -jar ftmcli.jar describe container_for_blog
Name: container_for_blog
Object Count: 1
Bytes Used: 434
Storage Class: Standard
Creation Date: Fri Mar 30 21:50:15 UTC 2018
Last Modified: Fri Mar 30 21:50:14 UTC 2018
Metadata
---------------
x-container-write: a424392.storage.Storage_ReadWriteGroup
x-container-read: a424392.storage.Storage_ReadOnlyGroup,a424392.storage.Storage_ReadWriteGroup
content-type: text/plain;charset=utf-8
accept-ranges: bytes
Custom Metadata
---------------
x-container-meta-policy-georeplication: container

# Delete container
$ java -jar ftmcli.jar delete container_for_blog
ERROR:Delete failed. Container is not empty.

# Delete with force option
$ java -jar ftmcli.jar delete -f container_for_blog
Container successfully deleted: container_for_blog

Another great thing about FTM CLI is that it allows you to easily manage upload performance out of the box. In ftmcli.properties there is a property called "max-threads". It may vary between 1 and 100. Here is a test case illustrating this:

-- Generate a 10GB file
$ dd if=/dev/zero of=file.txt count=10240 bs=1048576

-- Upload the file in one thread (around an 18MB/sec rate)
$ java -jar ftmcli.jar upload container_for_blog /home/opc/file.txt
Uploading file: /home/opc/file.txt to container: container_for_blog
File successfully uploaded: /home/opc/file.txt
Estimated Transfer Rate: 18381KB/s

-- Change the number of threads from 1 to 99 in the config file
$ sed -i -e 's/max-threads=1/max-threads=99/g' ftmcli.properties

-- Upload the file in 99 threads (around a 68MB/sec rate)
$ java -jar ftmcli.jar upload container_for_blog /home/opc/file.txt
Uploading file: /home/opc/file.txt to container: container_for_blog
File successfully uploaded: /home/opc/file.txt
Estimated Transfer Rate: 68449KB/s

So, it's a very simple and at the same time powerful tool for operations with the Object Store, and it may help you with scripting those operations.

Upload/Download files. CloudBerry. Another way to interact with OSS is to use an application; for example, you may use CloudBerry Explorer for OpenStack Storage. There is a great blog post which explains how to configure CloudBerry for Oracle Object Store Service Classic, and I will start from the point where it is already configured.
Whenever you log in it looks like this: You may easily create a container in CloudBerry: And for sure you may easily copy data from your local machine to OSS: There is nothing more to add here; CloudBerry is a convenient tool for browsing Object Stores and doing small copies between a local machine and OSS. For me personally, it looks like TotalCommander for OSS.

Upload/Download files. Big Data Manager and ODCP. Big Data Cloud Service (BDCS) has a great component called Big Data Manager. This is a tool developed by Oracle which allows you to manage and monitor a Hadoop cluster. Among other features, Big Data Manager (BDM) allows you to register an Object Store in the Stores browser and easily drag and drop data between OSS and other sources (Database, HDFS...). When you copy data to/from HDFS you use ODCP, an optimized version of the Hadoop distcp tool. This is a very fast way to copy data back and forth. Fortunately, JP already wrote about this feature, so I can simply give a link. If you want to see concrete performance numbers, you can go here to the A-Team blog page. Without Big Data Manager, you can manually register OSS on a Linux machine and invoke the copy command from bash. The documentation shows all the details; I will show just one example:

# add account:
$ export CM_ADMIN=admin
$ export CM_PASSWORD=SuperSecurePasswordCloderaManager
$ export CM_URL=https://cfclbv8493.us2.oraclecloud.com:7183
$ bda-oss-admin add_swift_cred --swift-username "storage-a424392:alexey.filanovskiy@oracle.com" --swift-password "SecurePasswordForSwift" --swift-storageurl "https://storage-a424392.storage.oraclecloud.com/auth/v2.0/tokens" --swift-provider bdcstorage

# list of credentials:
$ bda-oss-admin list_swift_creds
Provider: bdcstorage
    Username: storage-a424392:alexey.filanovskiy@oracle.com
    Storage URL: https://storage-a424392.storage.oraclecloud.com/auth/v2.0/tokens

# check files on OSS swift://[container name].[provider created in the step before]/:
$ hadoop fs -ls swift://alextest.bdcstorage/
18/03/31 01:01:13 WARN http.RestClientBindings: Property fs.swift.bdcstorage.property.loader.chain is not set
Found 3 items
-rw-rw-rw- 1 279153664 2018-03-07 00:08 swift://alextest.bdcstorage/bigdata.file.copy
drwxrwxrwx - 0 2018-03-07 00:31 swift://alextest.bdcstorage/customer
drwxrwxrwx - 0 2018-03-07 00:30 swift://alextest.bdcstorage/customer_address

Now you have OSS configured and ready to use. You may copy data with ODCP; here you can find the entire list of sources and destinations. For example, if you want to copy data from HDFS to OSS, you have to run:

$ odcp hdfs:///tmp/file.txt swift://alextest.bdcstorage/

ODCP is a very efficient way to move data from HDFS to the Object Store and back. If you come from the Hadoop world and are used to the Hadoop fs API, you may use it as well with the Object Store (after configuring it); for example, to load data into OSS you need to run:

$ hadoop fs -put /home/opc/file.txt swift://alextest.bdcstorage/file1.txt

Upload/Download files. Oracle Storage Cloud Software Appliance. The Object Store is a fairly new concept, and for sure there is a way to smooth the migration. Years ago, when HDFS was new and undiscovered, many people didn't know how to work with it, and a few technologies such as NFS Gateway and HDFS-fuse appeared. Both of these technologies allowed mounting HDFS on a Linux file system and working with it as with a normal file system. Oracle Cloud Infrastructure Storage Software Appliance allows something similar. All documentation you can find here, a brief video here, and the software download here.
In my blog I just show one example of its usage. This picture will help me to explain how the Storage Cloud Software Appliance works: you can see that the customer needs to install an on-premise docker container, which contains all the required stack. I'll skip the details, which you may find in the documentation above, and will just show the concept.

# Check oscsa status
[on-prem client] $ oscsa info
Management Console: https://docker.oracleworld.com:32769
If you have already configured an OSCSA FileSystem via the Management Console, you can access the NFS share using the following port.
NFS Port: 32770
Example: mount -t nfs -o vers=4,port=32770 docker.oracleworld.com:/<OSCSA FileSystem name> /local_mount_point

# Run oscsa
[on-prem client] $ oscsa up

There (on the docker image, which you deploy on some on-premise machine) you will find a WebUI where you can configure the Storage Appliance: after login, you will see a list of configured Object Stores: In this console you can connect the linked container with this on-premise host: after it has been connected, you will see the option "disconnect". After you connect a device, you have to mount it:

[on-prem client] $ sudo mount -t nfs -o vers=4,port=32770 localhost:/devoos /oscsa/mnt
[on-prem client] $ df -h|grep oscsa
localhost:/devoos 100T 1.0M 100T 1% /oscsa/mnt

Now you can upload a file into the Object Store:

[on-prem client] $ echo "Hello Oracle World" > blog.file
[on-prem client] $ cp blog.file /oscsa/mnt/

This is an asynchronous copy to the Object Store, so after a while you will be able to find the file there. The only restriction I wasn't able to overcome is that the filename changes during the copy.

Conclusion. The Object Store is here and it will become more and more popular. It means there is no way to escape it, and you have to get familiar with it. The blog post above showed that there are multiple ways to deal with it, starting from user friendly (like CloudBerry) and ending with the low-level REST API.
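As a quick recap of that low-level path, here is a minimal sketch that strings together the REST calls printed by the oss_operations.sh script earlier in this post; it reuses the same example account, storage URL, container and file names, which you would replace with your own:

#!/bin/bash
# sketch: create a container, upload a file, list the container, download the file back
USER="alexey.filanovskiy@oracle.com"
PASS="MySecurePassword"
STORAGE_URL="https://storage.us2.oraclecloud.com/v1/storage-a424392"
CONTAINER="example_container"
FILE="file.txt"

curl -k -X PUT -u "${USER}:${PASS}" "${STORAGE_URL}/${CONTAINER}"
curl -k -X PUT -T "${FILE}" -u "${USER}:${PASS}" "${STORAGE_URL}/${CONTAINER}/"
curl -k -X GET -u "${USER}:${PASS}" "${STORAGE_URL}/${CONTAINER}/"
curl -k -X GET -u "${USER}:${PASS}" "${STORAGE_URL}/${CONTAINER}/${FILE}" > "./${FILE}"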


Data Warehousing

Loading Data to the Object Store for Autonomous Data Warehouse Cloud

So you got your first service instance of your autonomous data warehouse set up, you experienced the performance of the environment using the sample data, went through all tutorials and videos and are getting ready to rock-n-roll. But the one thing you’re not sure about is this Object Store. Yes, you used it successfully as described in the tutorial, but what’s next? And what else is there to know about the Object Store? First and foremost, if you are interested in understanding a bit more about what this Object Store is, you should read the following blog post from Rachna, the Product Manager for the Object Store among other things. It introduces the Object Store, how to set it up and manage files with the UI, plus a couple of simple command line examples (don’t get confused by the term ‘BMC’, that’s the old name of Oracle’s Cloud Infrastructure; that’s true for the command line utility as well, which is now called oci). You should read that blog post to get familiar with the basic concepts of the Object Store and a cloud account (tenant). The documentation and blog posts are great, but now you actually want to use it to load data into ADWC. This means loading more (and larger) files, more need for automation, and more flexibility. This post will focus on exactly that: becoming productive with command line utilities without being a developer, and leveraging the power of the Oracle Object Store to upload more files in one go and even upload larger files in parallel without any major effort. The blog post will cover both: the Oracle oci command line interface for managing files, and the Swift REST interface for managing files.

Using the oci command line interface

The Oracle oci command line interface (CLI) is a tool that enables you to work with Oracle Cloud Infrastructure objects and services. It’s a thin layer on top of the oci APIs (typically REST) and one of Oracle’s open source projects (the source code is on GitHub). Let’s quickly step through what you have to do to use this CLI. If you do not want to install anything, that is fine, too. In that case feel free to jump to the REST section in this post right away, but you’re going to miss out on some cool stuff that the CLI provides you out of the box. Getting going with the utility is really simple, as simple as one-two-three:
1. Install the oci cli following the installation instructions on GitHub. I just did this on an Oracle Linux 7.4 VM instance that I created in the Oracle Cloud and had the utility up and running in no time.
2. Configure your oci cli installation. You need a user created in the Oracle Cloud account that you want to use, and that user must have the appropriate privileges to work with the object store. A keypair is used for signing API requests, with the public key uploaded to Oracle. Only the user calling the API should possess the private key. All this is described in the configuration section of the CLI. That is probably the part that takes you the most time of the setup. You have to ensure you have UI console access when doing this since you have to upload the public key for your user.
3. Use the oci cli. After successful setup you can use the command line interface to manage your buckets for storing all your files in the Cloud, among other things.

First steps with oci cli

The focus of the command line interface is on ease of use and on making its usage as self-explanatory as possible, with a comprehensive built-in help system in the utility.
Whenever you want to know something without looking around, use the --help, -h, or -? syntax for a command, irrespective of how many parameters you have already entered. So you can start with oci -h and let the utility guide you. For the purpose of file management the important category is the object store category, with the main tasks of:
Creating, managing, and deleting buckets. This task is probably done by an administrator for you, but we will cover it briefly nevertheless.
Uploading, managing, and downloading objects (files). That’s your main job in the context of the Autonomous Data Warehouse Cloud, and that’s what we are going to do now.

Creating a bucket

Buckets are containers that store objects (files). Like other resources, buckets belong to a compartment, a collection of resources in the Cloud that can be used as an entity for privilege management. To create a bucket you have to know the compartment id. That is the only time we have to deal with these cloud-specific unique identifiers; all other object (file) operations use names. So let’s create a bucket. The following creates a bucket named myFiles in my account ADWCACCT in a compartment given to me by the Cloud administrator.

$ oci os bucket create --compartment-id ocid1.tenancy.oc1..aaaaaaaanwcasjdhfsbw64mt74efh5hneavfwxko7d5distizgrtb3gzj5vq --namespace-name adwcaact --name myFiles
{
  "data": {
    "compartment-id": "ocid1.tenancy.oc1..aaaaaaaanwcasjdhfsbw64mt74efh5hneavfwxko7d5distizgrtb3gzj5vq",
    "created-by": "ocid1.user.oc1..aaaaaaaaomoqtk3z7y43543cdvexq3y733pb5qsuefcbmj2n5c6ftoi7zygq",
    "etag": "c6119bd6-98b6-4520-a05b-26d5472ea444",
    "metadata": {},
    "name": "myFiles",
    "namespace": "adwcaact",
    "public-access-type": "NoPublicAccess",
    "storage-tier": "Standard",
    "time-created": "2018-02-26T22:16:30.362000+00:00"
  },
  "etag": "c6119bd6-98b6-4520-a05b-26d5472ea733"
}

The operation returns with the metadata of the bucket after successful creation. We’re ready to upload and manage files in the object store.

Upload your first file with oci cli

You can upload a single file very easily with the oci command line interface. And, as promised before, you do not even have to remember any ocid in this case.

$ oci os object put --namespace adwcacct --bucket-name myFiles --file /stage/supplier.tbl
Uploading object  [####################################]  100%
{
  "etag": "662649262F5BC72CE053C210C10A4D1D",
  "last-modified": "Mon, 26 Feb 2018 22:50:46 GMT",
  "opc-content-md5": "8irNoabnPldUt72FAl1nvw=="
}

After a successful upload you can check the md5 sum of the file; that’s basically the fingerprint that the data on the other side (in the cloud) is not corrupt and is the same as the local data (on the machine where the data is coming from). The only “gotcha” is that OCI is using base64 encoding, so you cannot just do a simple md5. The following command solves this for me on my Mac:

$ openssl dgst -md5 -binary supplier.tbl |openssl enc -base64
8irNoabnPldUt72FAl1nvw==

Now that’s a good start. I can use this command in any shell program, like the following, which loads all files in a folder sequentially to the object store:

for i in `ls *.tbl`
do
  oci os object put --namespace adwcacct --bucket-name myFiles --file $i
done

You can write it to load multiple files in parallel, load only files that match a specific name pattern, etc. You get the idea. Whatever you can do with a shell you can do.
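For example, a slightly extended version of that loop might upload only files matching a name pattern and run a few uploads concurrently. A minimal sketch under the same namespace and bucket names used above; the pattern and the degree of parallelism are arbitrary choices here, not anything the oci cli prescribes:

#!/bin/bash
# upload all *.tbl files, at most 4 at a time, into the myFiles bucket
for i in `ls *.tbl`
do
  oci os object put --namespace adwcacct --bucket-name myFiles --file $i &
  # throttle: wait whenever 4 uploads are already running
  while [ `jobs -r | wc -l` -ge 4 ]; do sleep 1; done
done
wait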
Alternatively, if it's just about loading all the files in a directory, you can achieve the same with the oci cli by using its bulk upload capabilities. The following shows this briefly:

$ oci os object bulk-upload -ns adwcacct -bn myFiles --src-dir /MyStagedFiles
{
  "skipped-objects": [],
  "upload-failures": {},
  "uploaded-objects": {
    "chan_v3.dat": {
      "etag": "674EFB90B1A3CECAE053C210D10AC9D9",
      "last-modified": "Tue, 13 Mar 2018 17:43:28 GMT",
      "opc-content-md5": "/t4LbeOiCz61+Onzi/h+8w=="
    },
    "coun_v3.dat": {
      "etag": "674FB97D50C34E48E053C230C10A1DF8",
      "last-modified": "Tue, 13 Mar 2018 17:43:28 GMT",
      "opc-content-md5": "sftu7G5+bgXW8NEYjFNCnQ=="
    },
    "cust1v3.dat": {
      "etag": "674FB97D52274E48E053C210C10A1DF8",
      "last-modified": "Tue, 13 Mar 2018 17:44:06 GMT",
      "opc-content-md5": "Zv76q9e+NTJiyXU52FLYMA=="
    },
    "sale1v3.dat": {
      "etag": "674FBF063F8C50ABE053C250C10AE3D3",
      "last-modified": "Tue, 13 Mar 2018 17:44:52 GMT",
      "opc-content-md5": "CNUtk7DJ5sETqV73Ag4Aeg=="
    }
  }
}

Uploading a single large file in parallel

Ok, now we can load one or many files to the object store. But what do you do if you have a single large file that you want to get uploaded? The oci command line offers built-in multi-part loading, so you do not need to split the file beforehand. The command line provides built-in capabilities to (A) transparently split the file into parts of a given size and (B) control the parallelism of the upload.

$ oci os object put -ns adwcacct -bn myFiles --file lo_aa.tbl --part-size 100 --parallel-upload-count 4

While the load is ongoing you can list all in-progress uploads, unfortunately without any progress bar; the progress bar is reserved for the initiating session:

$ oci os multipart list -ns adwcacct -bn myFiles
{
  "data":
  [
    {
      "bucket": "myFiles",
      "namespace": "adwcacct",
      "object": "lo_aa.tbl",
      "time-created": "2018-02-27T01:19:47.439000+00:00",
      "upload-id": "4f04f65d-324b-4b13-7e60-84596d0ef47f"
    }
  ]
}

While a serial process for a single file gave me somewhere around 35 MB/sec on average, the parallel load sped things up quite a bit, so it's definitely cool functionality (note that your mileage will vary and is probably mostly dependent on your Internet/proxy connectivity and bandwidth). If you're interested in more details about how that works, here is a link from Rachna explaining the inner workings of this functionality.

Using the Swift REST interface

Now that we have covered the oci utility, let's briefly look into what we can do out of the box, without the need to install anything. Yes, without installing anything you can leverage the REST endpoints of the object storage service. All you need to know is your username/SWIFT password and your environment details, e.g. which region you're uploading to, the account (tenant) and the target bucket. This is where the real fun starts, and this is where it can become geeky, so we will focus only on the two most important aspects of dealing with files and the object store: uploading and downloading files.

Understanding how to use OpenStack Swift REST

File management with REST is just as simple as it is with the oci cli. Similar to the setup of the oci cli, you have to know the basic information about your Cloud account, namely:
a user in the cloud account that has the appropriate privileges to work with a bucket in your tenancy.
This user also has to be configured to have a SWIFT password (see here how that is done).
a bucket in one of the object stores in a region (we are not going to discuss how to use REST to do this). The bucket/region defines the REST endpoint; for example, if you are using the object store in Ashburn, VA, the endpoint is https://swiftobjectstorage.us-ashburn-1.oraclecloud.com

The URI for accessing your bucket is built as follows:

<object store rest endpoint>/v1/<tenant name>/<bucket name>

In my case, for the simple example, it would be https://swiftobjectstorage.us-ashburn-1.oraclecloud.com/v1/adwcacct/myFiles

If you have all this information you are set to upload and download files.

Uploading an object (file) with REST

Uploading a file is putting a file into the Cloud, so the REST command is a PUT. You also have to specify the file you want to upload and how the file should be named in the object store. With this information you can write a simple little shell script like the following that takes both the file and bucket name as input:

# usage: upload_oss.sh <file> <bucket>
file=$1
bucket=$2

curl -v -X PUT \
 -u 'jane.doe@acme.com:)#sdswrRYsi-In1-MhM.!' \
 --upload-file ${file} \
 https://swiftobjectstorage.us-ashburn-1.oraclecloud.com/v1/adwcacct/${bucket}/${file}

So if you want to upload multiple files in a directory, similar to what we showed for the oci cli, you just save this little script, say as upload_oss.sh, and call it just like you called oci cli (note the argument order: file first, then bucket):

for i in `ls *.tbl`
do
  upload_oss.sh $i myFiles
done

Downloading an object (file) with REST

While we expect you to upload data to the object store way more often than to download it, let's quickly cover that, too. So you want to get a file from the object store? Well, the REST command GET will do this for you. It is similarly intuitive to uploading, and you might be able to guess the complete syntax already. Yes, it is:

curl -v -X GET \
 -u 'jane.doe@acme.com:)#sdswrRYsi-In1-MhM.!' \
 https://swiftobjectstorage.us-ashburn-1.oraclecloud.com/v1/adwcacct/myFiles/myFileName \
 --output myLocalFileName

That's about all you need to get started uploading all your files to the Oracle Object Store so that you can then consume them from within the Autonomous Data Warehouse Cloud. Happy uploading!
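One practical footnote (a sketch reusing the example account and bucket above, not from the original post): if you want to double-check what actually landed in the bucket without installing anything, the Swift API also lets you list a bucket's contents with a plain GET against the bucket URI itself; the exact output format may vary by environment, but it is typically one object name per line, which pipes nicely into further shell scripting:

curl -X GET \
 -u 'jane.doe@acme.com:)#sdswrRYsi-In1-MhM.!' \
 https://swiftobjectstorage.us-ashburn-1.oraclecloud.com/v1/adwcacct/myFiles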



Roadmap Update: What you need to know about Big Data Appliance 4.12

As part of our continuous efforts to ensure transparency in release planning and availability to our customers for our big data stack, below is an update to the original roadmap post.

Current Release

As discussed, the 4.11 release delivered the following:
Released with an updated CDH, now delivering 5.13.1
Updates to the Operating System (OL6) with security updates to said OS
Java updates
The release consciously pushed back some of the features to ensure the Oracle environments pick up the latest CDH releases within our (roughly) 4 week goal.

Next up is BDA 4.12

Thanks to the longer development time we carved out for 4.12, we are able to schedule a set of very interesting components into this release. At a high level, the following are planned to be in 4.12:
Configure a Kafka cluster on dedicated nodes on the BDA
Set up (and include) Big Data Manager on BDA. For more information on Big Data Manager, see these videos (or click the one further down) on what cool things you can do with the Zeppelin Notebooks, ODCP and drag-and-drop copying of data
Full BDA clusters on OL7. After we enabled the edge nodes for OL7 to support Cloudera Data Science Workbench, we are now delivering full clusters on OL7. Note that we have not yet delivered an in-place upgrade path to migrate from an OL6 based cluster to an OL7 cluster
High Availability for more services in CDH, by leveraging and pre-configuring best practices. These new HA setup steps are updated regularly and are fully supported as part of the system going forward: Hive Service, Sentry Service, Hue Service
On BDA X7-2 hardware 2 SSDs are included. When running on X7, the Journal Node metadata and Zookeeper data are put onto these SSDs instead of the regular OS disks. This ensures better performance for highly loaded master nodes.

Of course the software will have undergone testing and we do run infrastructure security scans on the system. We include any Linux updates that are available when we freeze the image and ship those. Any vulnerability that crops up after the release can, no, should be addressed by using the official OL repo to update the OS. Lastly, we are looking to release early April, and are finalizing the actual Cloudera CDH release. We may use 5.14.1, but there is a chance that we switch and jump to 5.15.0 depending on timing.

And one more Thing

Because Big Data Appliance is an engineered system, customers expect robust movement between versions. Upgrading the entire system, which is where BDA differs from just a Cloudera cluster, is an important part of the value proposition but is also fairly complex. With 4.12 we place additional emphasis on addressing previously seen upgrade issues, and we will keep this as an ongoing priority for all BDA software releases. So expect even more robust upgrades going forward.

Lastly

Please note the Safe Harbor Statement below: The preceding is intended to outline our general product direction. It is intended for information purposes only, and may not be incorporated into any contract. It is not a commitment to deliver any material, code, or functionality, and should not be relied upon in making purchasing decisions. The development, release, and timing of any features or functionality described for Oracle's products remains at the sole discretion of Oracle.


Big Data SQL

Big Data SQL Quick Start. Multi-user Authorization - Part 25

One of the major Big Data SQL benefits is security. You work with the data that you store in HDFS or other sources through Oracle Database, which means that you can apply many Database features, such as Data Redaction, VPD or Database Vault. These features, in conjunction with the database schema/grant privilege model, allow you to protect data from the database side (when an intruder tries to reach data through the database). But it's also important to keep in mind that data stored on HDFS may be required for other purposes (Spark, Solr, Impala...), and those need some other protection mechanism. In the Hadoop world, Kerberos is the most popular way to protect data (an authentication method). Kerberos in conjunction with HDFS ACLs gives you the opportunity to protect data at the file system level. HDFS as a file system has the concept of user and group, and the files that you store on HDFS have different privileges for the owner, the group and all others.

Conclusion: for working with Kerberized clusters, Big Data SQL needs a valid Kerberos ticket to work with HDFS files. Fortunately, all this setup has been automated and is available within the standard Oracle Big Data SQL installer. For more details please check here.

Big Data SQL and Kerberos.

Usually customers have a Kerberized cluster, and to work with it we need a valid Kerberos ticket. But this raises the question: which principal do you need to have with Big Data SQL? The answer is easy: oracle. In prior Big Data SQL releases, all Big Data SQL ran on the Hadoop cluster as the same user: oracle. This has the following consequences:
- Unable to authorize access to data based on the user that is running a query
- Hadoop cluster audits show that all data queried through Big Data SQL is accessed by oracle
What if I already have some data, used by other applications, with different privileges (belonging to different users and groups)? For this, in Big Data SQL 3.2 we introduced a new feature: Multi-User Authorization.

Hadoop impersonation.

At the foundation of Multi-User Authorization lies a Hadoop feature called impersonation. I took this description from here: "A superuser with username 'super' wants to submit job and access hdfs on behalf of a user joe. The superuser has Kerberos credentials but user joe doesn't have any. The tasks are required to run as user joe and any file accesses on namenode are required to be done as user joe. It is required that user joe can connect to the namenode or job tracker on a connection authenticated with super's Kerberos credentials. In other words super is impersonating the user joe." In the same manner, "oracle" is the superuser and other users are impersonated.

Multi-User Authorization key concepts.

1) Big Data SQL will identify the trusted user that is accessing data on the cluster. By executing the query as the trusted user:
- Authorization rules specified in Hadoop will be respected
- Authorization rules specified in Hadoop do not need to be replicated in the database
- Hadoop cluster audits identify the actual Big Data SQL query user
2) Consider the Oracle Database as the entity that is providing the trusted user to Hadoop
3) Must map the database user that is running a query in Oracle Database to a Hadoop user
4) Must identify the actual user that is querying the Oracle table and pass that identity to Hadoop
- This may be an Oracle Database user (i.e. a schema)
- A lightweight user may come from session-based contexts (see SYS_CONTEXT)
- The user/group map must be available through OS lookup in Hadoop

Demonstration.

The full documentation for this feature you may find here; now I'm going to show a few of the most popular cases with code examples. To work with the relevant objects, you need to grant the following permissions to the user who will manage the mapping table:

SQL> grant select on BDSQL_USER_MAP to bikes;
SQL> grant execute on DBMS_BDSQL to bikes;
SQL> grant BDSQL_ADMIN to bikes;

In my case, this is the user "bikes". Just in case, clean up the mapping for user BIKES:

SQL> begin
DBMS_BDSQL.REMOVE_USER_MAP (current_database_user =>'BIKES');
end;
/

check that the mapping table is empty:

SQL> select * from SYS.BDSQL_USER_MAP;

and after this run a query:

SQL> select /*+ MONITOR */ * from bikes.weather_ext;

This is the default mode, without any mapping, so I assume that I'll contact HDFS as the oracle user. To double-check this, I review the audit files:

$ cd /var/log/hadoop-hdfs
$ tail -f hdfs-audit.log |grep central_park
2018-03-01 17:42:10,938 INFO ... ugi=oracle ... ip=/10.0.0.10 cmd=open ... src=/data/weather/central_park_weather.csv..

Here it is clear that the oracle user reads the file (ugi=oracle). Let's check the permissions for the given file (which backs this external table):

$ hadoop fs -ls /data/weather/central_park_weather.csv
-rw-r--r-- 3 oracle oinstall 26103 2017-10-24 13:03 /data/weather/central_park_weather.csv

So, everybody may read it. Remember this and let's try to create the first mapping.

SQL> begin
DBMS_BDSQL.ADD_USER_MAP(
current_database_user =>'BIKES',
syscontext_namespace => null,
syscontext_parm_hadoop_user => 'user1'
);
end;
/

This mapping says that the database user BIKES will always be mapped to user1 in the OS. Run the query again and check the user who reads the file:

SQL> select /*+ MONITOR */ * from bikes.weather_ext;

$ cd /var/log/hadoop-hdfs
$ tail -f hdfs-audit.log |grep central_park
2018-03-01 17:42:10,938 INFO ... ugi=user1... ip=/10.0.0.10 cmd=open ... src=/data/weather/central_park_weather.csv..

It's interesting that user1 doesn't exist in the Hadoop OS:

# id user1
id: user1: No such user

If the user does not exist (the user1 case), it can only read world-readable (777-style) files. Let me revoke read permission from everyone and run the query again:

$ sudo -u hdfs hadoop fs -chmod 640 /data/weather/central_park_weather.csv
$ hadoop fs -ls /data/weather/central_park_weather.csv
-rw-r----- 3 oracle oinstall 26103 2017-10-24 13:03 /data/weather/central_park_weather.csv

Now it fails. To make it work I may create a "user1" account on each Hadoop node and add it to the oinstall group:

$ useradd user1
$ usermod -a -G oinstall user1

Run the query again and check the user who reads the file:

SQL> select /*+ MONITOR */ * from bikes.weather_ext;

$ cd /var/log/hadoop-hdfs
$ tail -f hdfs-audit.log |grep central_park
2018-03-01 17:42:10,938 INFO ... ugi=user1... ip=/10.0.0.10 cmd=open ... src=/data/weather/central_park_weather.csv..

Here we are! We could read the file because of the group permissions. What if I want to map this schema to hdfs or some other powerful user? Let's try:

SQL> begin
DBMS_BDSQL.REMOVE_USER_MAP (current_database_user =>'BIKES');
DBMS_BDSQL.ADD_USER_MAP(
current_database_user =>'BIKES',
syscontext_namespace => null,
syscontext_parm_hadoop_user => 'hdfs'
);
end;
/

This call raises an exception; the reason is that the hdfs user is on the blacklist for impersonation:
$ cat $ORACLE_HOME/bigdatasql/databases/orcl/bigdata_config/bigdata.properties| grep impersonation
....
# Impersonation properties
impersonation.enabled=true
impersonation.blacklist='hue','yarn','oozie','smon','mapred','hdfs','hive','httpfs','flume','HTTP','bigdatamgr','oracle'
...

The second scenario is authorization with a thin client, using CLIENT_IDENTIFIER. In a multi-tier architecture (where we have an application tier and a database tier), it may be a challenge to differentiate multiple users within the same application who share the same schema. Here is an example which illustrates this: we have an application which connects to the database as the HR_APP user, but many people may use this application and this database login. To differentiate these human users we may use the dbms_session.set_identifier procedure (more details you can find here). The Big Data SQL multi-user authorization feature allows using a SYS_CONTEXT value for authorization on Hadoop. Below is a test case which illustrates this.

-- Remove the previous rule related to the BIKES user --
SQL> begin
DBMS_BDSQL.REMOVE_USER_MAP (current_database_user =>'BIKES');
end;
/

-- Add a new rule, which says that if the database user is BIKES, the Hadoop user is taken from USERENV as CLIENT_IDENTIFIER --
SQL> begin
DBMS_BDSQL.ADD_USER_MAP(
current_database_user =>'BIKES',
syscontext_namespace => 'USERENV',
syscontext_parm_hadoop_user => 'CLIENT_IDENTIFIER'
);
end;

-- Check the current database user (schema) --
SQL> select user from dual;
BIKES

-- Check CLIENT_IDENTIFIER from USERENV --
SQL> select SYS_CONTEXT('USERENV', 'CLIENT_IDENTIFIER') from dual;
NULL

-- Run any query against Hadoop --
SQL> select /*+ MONITOR */ * from bikes.weather_ext;

-- Check the Hadoop audit logs --
-bash-4.1$ tail -f hdfs-audit.log |grep central_park
2018-03-01 18:14:40 ... ugi=oracle ... src=/data/weather/central_park_weather.csv

-- Set CLIENT_IDENTIFIER --
SQL> begin
dbms_session.set_IDENTIFIER('Alexey');
end;
/

-- Check CLIENT_IDENTIFIER for the current session --
SQL> select SYS_CONTEXT('USERENV', 'CLIENT_IDENTIFIER') from dual;
Alexey

-- Run the query again over the HDFS data --
SQL> select /*+ MONITOR */ * from bikes.weather_ext;

-- Check the Hadoop audit logs: --
-bash-4.1$ tail -f hdfs-audit.log |grep central_park
2018-03-01 18:17:43 ... ugi=Alexey ... src=/data/weather/central_park_weather.csv

The third way to handle authentication is the user's authenticated identity. Users connecting to the database (via Kerberos, DB user, etc...) have their authenticated identity passed to Hadoop. To make it work, simply run:

SQL> begin
DBMS_BDSQL.ADD_USER_MAP(
current_database_user => '*' ,
syscontext_namespace => 'USERENV',
syscontext_parm_hadoop_user => 'AUTHENTICATED_IDENTITY');
end;
/

and after this your user on HDFS will be the one returned by:

SQL> select SYS_CONTEXT('USERENV', 'AUTHENTICATED_IDENTITY') from dual;
BIKES

For example, if I log on to the database as BIKES (as a database user), on HDFS I'll be authenticated as the bikes user:

-bash-4.1 $ tail -f hdfs-audit.log |grep central_park
2018-03-01 18:23:23 ... ugi=bikes... src=/data/weather/central_park_weather.csv

To check all the rules that you have for multi-user authorization you may run the following query:

SQL> select * from SYS.BDSQL_USER_MAP;

Hope that this feature allows you to create a robust security bastion around your data in HDFS.
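One small cleanup note (a sketch reusing the same package calls shown above, not from the original post): if you later want to drop the catch-all AUTHENTICATED_IDENTITY rule again, the same REMOVE_USER_MAP call applies, with '*' as the database user:

SQL> begin
  -- remove the wildcard mapping created earlier
  DBMS_BDSQL.REMOVE_USER_MAP (current_database_user => '*');
end;
/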


Big Data

Advanced Data Protection using Big Data SQL and Database Vault - Introduction

According to the latest analyst reports and data breach statistics, Data Protection is rising up to be the most important IT issue in the coming years! Due to increasing threats and cyber-attacks, new privacy regulations such as the European Union (EU) General Data Protection Regulation (GDPR) are being implemented and enforced, and the increasing adoption of Public Cloud also legitimizes these new Cyber Security requirements. Data Lake/Hub environments can be a treasure trove of sensitive data, so data protection must be considered in almost all Big Data SQL implementations. Fortunately, Big Data SQL is able to propagate several of the data protection capabilities of the Oracle Multi-Model Database such as Virtual Private Database (aka Row Level Security) or Data Redaction described in a previous post (see Big Data SQL Quick Start. Security - Part4.). But now is the time to speak about one of the most powerful ones: Database Vault.

Clearly, databases are a common target, and 81% of 2017 hacking-related breaches leveraged stolen and/or weak passwords. So, once legitimate internal credentials are acquired (and preferably, for the attacker, those of system accounts), accessing interesting data is just a matter of time. Hence, while Alexey described all the security capabilities you could put in place to Secure your Hadoop Cluster, once hackers get legitimate database credentials, it's done... unless you add another Cyber Security layer to manage fine-grained access. And here comes Database Vault (1). This introductory post is the first of a series where we'll illustrate the security capabilities that can be combined with Big Data SQL in order to propagate these protections to Oracle and non-Oracle data stores: NoSQL clusters (Oracle NoSQL DB, HBase, Apache Cassandra, MongoDB...), Hadoop (Hortonworks and Cloudera), Kafka (Confluent and Apache, with the 3.2 release of Big Data SQL).

In essence, Database Vault enables separation of duties between the operators (DBAs) and application users. As a result, data is protected from users with system privileges (SYSTEM, which should never be used and should be locked, named DBA accounts...) - but those users can still continue to do their job. Moreover, Database Vault has the ability to add fine-grained security layers to control precisely who accesses which objects (tables, views, PL/SQL code...), from where (e.g. edge nodes only), and when (e.g. only during the application maintenance window).

As explained in the previous figure, Database Vault introduces the concepts of Realms and Command Rules. From the documentation: A realm is a grouping of database schemas, database objects, and/or database roles that must be secured for a given application. Think of a realm as a zone of protection for your database objects. A schema is a logical collection of database objects such as tables (including external tables, hence allowing to work with Big Data SQL), views, and packages, and a role is a collection of privileges. By arranging schemas and roles into functional groups, you can control the ability of users to use system privileges against these groups and prevent unauthorized data access by the database administrator or other powerful users with system privileges. Oracle Database Vault does not replace the discretionary access control model in the existing Oracle database. It functions as a layer on top of this model for both realms and command rules. Oracle Database Vault provides two types of realms: regular and mandatory.
A regular realm protects an entire database object (such as a schema). This type of realm restricts all users except users who have direct object privilege grants. With regular realms, users with direct object grants can perform DML operations but not DDL operations. A mandatory realm restricts user access to objects within a realm. Mandatory realms block both object privilege-based and system privilege-based access. In other words, even an object owner cannot access his or her own objects without proper realm authorization if the objects are protected by mandatory realms. After you create a realm, you can register a set of schema objects or roles (secured objects) for realm protection and authorize a set of users or roles to access the secured objects. For example, you can create a realm to protect all existing database schemas that are used in an accounting department. The realm prohibits any user who is not authorized to the realm from using system privileges to access the secured accounting data.

And also: A command rule protects Oracle Database SQL statements (SELECT, ALTER SYSTEM), data definition language (DDL), and data manipulation language (DML) statements. To customize and enforce the command rule, you associate it with a rule set, which is a collection of one or more rules. The command rule is enforced at run time. Command rules affect anyone who tries to use the SQL statements it protects, regardless of the realm in which the object exists.

One important point to emphasize is that Database Vault will audit any access violation to protected objects, ensuring governance and compliance over time.

To summarize:

In the next parts of this series, I'll present the following 3 use cases in order to demonstrate some of Database Vault's capabilities in the context of Big Data SQL:
Protect data from users with system privileges (DBA…)
Access data only if a super manager is connected too
Prevent users from creating EXTERNAL tables for Big Data SQL
And in the meantime, you can discover practical information by reading one of our partner white papers.

(1) Database Vault is a database option, available on Oracle Database Enterprise Edition only, and has to be licensed accordingly. Notice that Database Cloud Service High Performance and Extreme Performance as well as Exadata Cloud Service and Exadata Cloud at Customer have this capability included in the cloud subscription.

Thanks to Alan, Alexey and Martin for their helpful reviews!
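To make the realm concept a bit more tangible before those use-case posts arrive, here is a rough sketch (not from this post) of what protecting a schema that contains Big Data SQL external tables could look like with the DBMS_MACADM package. The realm, schema and role names are purely illustrative, and you should consult the Database Vault documentation for the full parameter lists and defaults of your release:

SQL> begin
  -- zone of protection around the BDS schema (external tables included)
  DBMS_MACADM.CREATE_REALM(
    realm_name    => 'Big Data Protection Realm',
    description   => 'Protects the BDS schema used by Big Data SQL',
    enabled       => DBMS_MACUTL.G_YES,
    audit_options => DBMS_MACUTL.G_REALM_AUDIT_FAIL);
  -- register every object of the schema in the realm
  DBMS_MACADM.ADD_OBJECT_TO_REALM(
    realm_name   => 'Big Data Protection Realm',
    object_owner => 'BDS',
    object_name  => '%',
    object_type  => '%');
  -- authorize only the application role, not the DBAs
  DBMS_MACADM.ADD_AUTH_TO_REALM(
    realm_name => 'Big Data Protection Realm',
    grantee    => 'BDS_APP_ROLE');
end;
/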


Learn more about using Big Data Manager - importing data, notebooks and other useful things

In one of the previous posts on this blog (see How Easily You Can Copy Data Between Object Store and HDFS) we discussed some functionality enabled by a tool called Big Data Manager, based upon the distributed (Spark based) copy utility. Since then a lot of useful features have been added to Big Data Manager, and to share them with the world, these are now recorded and published on YouTube. The library consists of a number of videos on the following topics (the video library is here):
Working with Archives
File Imports
Working with Remote Data
Importing Notebooks from GitHub
For some background, Big Data Manager is a utility that is included with Big Data Cloud Service, Big Data Cloud at Customer and soon with Big Data Appliance. Its primary goal is to enable users to quickly achieve tasks like copying files and publishing data via a Notebook interface. In this case, the interface is based on / leverages Zeppelin notebooks. The notebooks run on a node within the cluster and have direct access to the local data elements. As is shown in some of the videos, Big Data Manager enables easy file transport between Object Stores (incl. Oracle's and Amazon's) and HDFS. This transfer is based on ODCP, which leverages Apache Spark in the cluster to enable high volume and high performance file transfers. You can see more here: Free new tutorial: Quickly uploading files with Big Data Manager in Big Data Cloud Service


Big Data

Oracle Big Data Lite 4.11 is Available

The latest release of Oracle Big Data Lite is now available for download on OTN.  Version 4.11 has the following products installed and configured: Oracle Enterprise Linux 6.9 Oracle Database 12c Release 1 Enterprise Edition (12.1.0.2) - including Oracle Big Data SQL-enabled external tables, Oracle Multitenant, Oracle Advanced Analytics, Oracle OLAP, Oracle Partitioning, Oracle Spatial and Graph, and more. Cloudera Distribution including Apache Hadoop (CDH5.13.1) Cloudera Manager (5.13.1) Oracle Big Data Spatial and Graph 2.4 Oracle Big Data Connectors 4.11 Oracle SQL Connector for HDFS 3.8.1 Oracle Loader for Hadoop 3.9.1 Oracle Data Integrator 12c (12.2.1.3.0) Oracle R Advanced Analytics for Hadoop 2.7.1 Oracle XQuery for Hadoop 4.9.1 Oracle Data Source for Apache Hadoop 1.2.1 Oracle Shell for Hadoop Loaders 1.3.1 Oracle NoSQL Database Enterprise Edition 12cR1 (4.5.12) Oracle JDeveloper 12c (12.2.1.2.0) Oracle SQL Developer and Data Modeler 17.3.1 with Oracle REST Data Services 3.0.7 Oracle Data Integrator 12cR1 (12.2.1.3.0) Oracle GoldenGate 12c (12.3.0.1.2) Oracle R Distribution 3.3.0 Oracle Perfect Balance 2.10.0 Check out the download page for the latest samples and useful links to help you get started with Oracle's Big Data platform. Enjoy!


Big Data

Free new tutorial: Quickly uploading files with Big Data Manager in Big Data Cloud Service

Sometimes the simplest tasks make life (too) hard. Consider simple things like uploading some new data sets into your Hadoop cluster in the cloud and then getting to work on the thing you really need to do: analyzing that data. This new free tutorial shows you how to easily and quickly do the grunt work with Big Data Manager in Big Data Cloud Service (learn more here), enabling you to worry about analytics, not moving files. The approach taken here is to take a file that resides on your desktop, and drag and drop that into HDFS on Oracle Big Data Cloud Service... as easy as that, and you are now off doing analytics by right-clicking and adding the data into a Zeppelin Notebook. Within the notebook, you get to see how Big Data Manager enables you to quickly generate a Hive schema definition from the data set and then start to do some analytics. Mechanics made easy! You can, and always should, look at leveraging Object Storage as your entry point for data, as discussed in this other Big Data Manager how-to article: See How Easily You Can Copy Data Between Object Store and HDFS. For more advanced analytics, have a look at Oracle's wide-ranging set of cloud services or open source tools like R, and the high performance version of R: Oracle R Advanced Analytics for Hadoop.


Big Data

New Release: BDA 4.11 is now Generally Available

As promised, this update to Oracle Big Data Appliance came fast. We just uploaded the bits and are in the process of uploading both documentation and configurator. You can find the latest software on MyOracleSupport. So what is new: BDA Software 4.11.0 contains a few new things, but is mainly intended to keep our software releases close to the Cloudera releases, as discussed in this roadmap post. This latest version uptakes:
Cloudera CDH 5.13.1 and Cloudera Manager 5.13.1. Parcels for Kafka 3.0, Spark 2.2 and Key Trustee Server 5.13 are included in the BDA Software Bundle. Kudu is now included in the CDH parcel.
The team also did a number of small but significant updates:
Cloudera Manager cluster hosts are now configured with TLS Level 3 - this includes encrypted communication with certificate verification of both Cloudera Manager Server and Agents to verify identity and prevent spoofing by untrusted Agents running on hosts.
Update to ODI Agent 12.2.1.3.0
Updates to Oracle Linux 6, JDK 8u151 and MySQL 5.7.20
It is important to remember that with 4.11.0 we no longer support upgrading OL5 based clusters. Review New Release: BDA 4.10 is now Generally Available for some details on this.
Links:
Documentation: http://www.oracle.com/technetwork/database/bigdata-appliance/documentation/index.html
Configurator: http://www.oracle.com/technetwork/database/bigdata-appliance/downloads/index.html
That's all folks, more new releases, features and good stuff to come in 2018.


Data Warehousing

SQL Pattern Matching Deep Dive - the book

Those of you with long memories might just be able to recall a whole series of posts I did on SQL pattern matching which were taken from a deep dive presentation that I prepared for the BIWA User Group Conference. The title of each blog post started with SQL Pattern Matching Deep Dive... and covered a set of 6 posts:
Part 1 - Overview
Part 2 - Using MATCH_NUMBER() and CLASSIFIER()
Part 3 - Greedy vs. reluctant quantifiers
Part 4 - Empty matches and unmatched rows?
Part 5 - SKIP TO where exactly?
Part 6 - State machines
There are a lot of related posts derived from that core set of 6 posts along with other presentations and code samples. One of the challenges, even when searching via Google, was tracking down all the relevant content. Therefore, I have spent the last 6-9 months converting all my deep dive content into a book - an Apple iBook. I have added a lot of new content based on discussions I have had at user conferences, questions posted on the developer SQL forum, discussions with my development team and some new presentations developed for the OracleCode series of events. To make life easier for everyone I have split the content into two volumes, and just in time for Thanksgiving Volume 1 is now available in the iBook Store - it's free to download! This first volume covers the following topics:
Chapter 1: Introduction
Background to the book and explanation of how some of the features within the book are expected to work
Chapter 2: Industry specific use cases
In this section we will review a series of use cases and provide conceptual, simplified SQL to solve these business requirements using the new SQL pattern matching functionality.
Chapter 3: Syntax for MATCH_RECOGNIZE
The easiest way to explore the syntax of 12c's new MATCH_RECOGNIZE clause is to look at a simple example...
Chapter 4: How to use built-in measures for debugging
In this section I am going to review the two built-in measures that we have provided to help you understand how your data set is mapped to your pattern.
Chapter 5: Patterns and Predicates
This chapter looks at how predicates affect the results returned by MATCH_RECOGNIZE.
Chapter 6: Next Steps
This final section provides links to additional information relating to SQL pattern matching.
Chapter 7: Credits
My objective is that by the end of this two-part series you will have a good, solid understanding of how MATCH_RECOGNIZE works, how it can be used to simplify your application code and how to test your code to make sure it is working correctly. In a couple of weeks I will publish information about the contents of Volume 2 and when I hope to have it finished! As usual, if you have any comments about the contents of the book then please email me directly at keith.laker@oracle.com
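For readers who have never seen MATCH_RECOGNIZE, here is a small, self-contained taste of the kind of example the book walks through - the classic "V-shape" pattern over a hypothetical ticker(symbol, tstamp, price) table (table, column and pattern variable names here are illustrative, not taken from the book):

SQL> SELECT *
FROM ticker
MATCH_RECOGNIZE (
  PARTITION BY symbol
  ORDER BY tstamp
  MEASURES MATCH_NUMBER()    AS match_num,
           STRT.tstamp       AS start_tstamp,
           LAST(DOWN.tstamp) AS bottom_tstamp,
           LAST(UP.tstamp)   AS end_tstamp
  ONE ROW PER MATCH
  AFTER MATCH SKIP TO LAST UP
  PATTERN (STRT DOWN+ UP+)
  DEFINE
    DOWN AS DOWN.price < PREV(DOWN.price),
    UP   AS UP.price   > PREV(UP.price)
);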


Big Data SQL

Using Materialized Views with Big Data SQL to Accelerate Performance

One of Big Data SQL’s key benefits is that it leverages the great performance capabilities of Oracle Database 12c.  I thought it would be interesting to illustrate an example – and in this case we’ll review a performance optimization that has been around for quite a while and is used at thousands of customers:  Materialized Views (MVs). For those of you who are unfamiliar with MVs – an MV is a precomputed summary table.  There is a defining query that describes that summary.  Queries that are executed against the detail tables comprising the summary will be automatically rewritten to the MV when appropriate: In the diagram above, we have a 1B row fact table stored in HDFS that is being accessed thru a Big Data SQL table called STORE_SALES.  Because we know that users want to query the data using a product hierarchy (by Item), a geography hierarchy (by Region) and a mix (by Class & QTR) – we created three summary tables that are aggregated to the appropriate levels. For example, the “by Item” MV has the following defining query: CREATE MATERIALIZED VIEW mv_store_sales_item ON PREBUILT TABLE ENABLE QUERY REWRITE AS (   select ss_item_sk,          sum(ss_quantity) as ss_quantity,          sum(ss_ext_wholesale_cost) as ss_ext_wholesale_cost,          sum(ss_net_paid) as ss_net_paid,          sum(ss_net_profit) as ss_net_profit   from bds.store_sales   group by ss_item_sk ); Queries executed against the large STORE_SALES that can be satisfied by the MV will now be automatically rewritten: SELECT i_category,        SUM(ss_quantity) FROM bds.store_sales, bds.item_orcl WHERE ss_item_sk = i_item_sk   AND i_size in ('small', 'petite')   AND i_wholesale_cost > 80 GROUP BY i_category; Taking a look at the query’s explain plan, you can see that even though store_sales is the table being queried – the table that satisfied the query is actually the MV called mv_store_sales_item.  The query was automatically rewritten by the optimizer. Explain plan with the MV: Explain plan without the MV: Even though Big Data SQL optimized the join and pushed the predicates and filtering down to the Hadoop nodes – the MV dramatically improved query performance: With MV:  0.27s Without MV:  19s This is to be expected as we’re querying a significantly smaller and partially aggregated data.  What’s nice is that query did not need to change; simply the introduction of the MV sped up the processing. What is interesting here is that the query selected data at the Category level – yet the MV is defined at the Item level.  How did the optimizer know that there was a product hierarchy?  And that Category level data could be computed from Item level data?  The answer is metadata.  A dimension object was created that defined the relationship between the columns: CREATE DIMENSION BDS.ITEM_DIM LEVEL ITEM IS (ITEM_ORCL.I_ITEM_SK) LEVEL CLASS IS (ITEM_ORCL.I_CLASS) LEVEL CATEGORY IS (ITEM_ORCL.I_CATEGORY) HIERARCHY PROD_ROLLUP ( ITEM CHILD OF CLASS CHILD OF   CATEGORY  )  ATTRIBUTE ITEM DETERMINES ( ITEM_ORCL.I_SIZE, ITEM_ORCL.I_COLOR, ITEM_ORCL.I_UNITS, ITEM_ORCL.I_CURRENT_PRICE,I_WHOLESALE_COST ); Here, you can see that Items roll up into Class, and Classes roll up into Category.  The optimizer used this information to allow the query to be redirected to the Item level MV. A good practice is to compute these summaries and store them in Oracle Database tables.  However, there are alternatives.  For example, you may have already computed summary tables and stored them in HDFS.  
You can leverage these summaries by creating an MV over a pre-built Big Data SQL table.  Consider the following example where a summary table was defined in Hive and called csv.mv_store_sales_qtr_class.  There are two steps required to leverage this summary: Create a Big Data SQL table over the hive source Create an MV over the prebuilt Big Data SQL table Let’s look at the details.  First, create the Big Data SQL table over the Hive source (and don’t forget to gather statistics!):   CREATE TABLE MV_STORE_SALES_QTR_CLASS     (       I_CLASS VARCHAR2(100)     , SS_QUANTITY NUMBER     , SS_WHOLESALE_COST NUMBER     , SS_EXT_DISCOUNT_AMT NUMBER     , SS_EXT_TAX NUMBER     , SS_COUPON_AMT NUMBER     , D_QUARTER_NAME VARCHAR2(30)     )     ORGANIZATION EXTERNAL     (       TYPE ORACLE_HIVE       DEFAULT DIRECTORY DEFAULT_DIR       ACCESS PARAMETERS       (         com.oracle.bigdata.tablename: csv.mv_store_sales_qtr_class       )     )     REJECT LIMIT UNLIMITED; -- Gather statistics exec  DBMS_STATS.GATHER_TABLE_STATS ( ownname => '"BDS"', tabname => '"MV_STORE_SALES_QTR_CLASS"', estimate_percent => dbms_stats.auto_sample_size, degree => 32 ); Next, create the MV over the Big Data SQL table: CREATE MATERIALIZED VIEW mv_store_sales_qtr_class ON PREBUILT TABLE WITH REDUCED PRECISION ENABLE QUERY REWRITE AS (     select i.I_CLASS,     sum(s.ss_quantity) as ss_quantity,        sum(s.ss_wholesale_cost) as ss_wholesale_cost, sum(s.ss_ext_discount_amt) as ss_ext_discount_amt,        sum(s.ss_ext_tax) as ss_ext_tax,        sum(s.ss_coupon_amt) as ss_coupon_amt,        d.D_QUARTER_NAME     from DATE_DIM_ORCL d, ITEM_ORCL i, STORE_SALES s     where s.ss_item_sk = i.i_item_sk       and s.ss_sold_date_sk = date_dim_orcl.d_date_sk     group by d.D_QUARTER_NAME,            i.I_CLASS     ); Queries against STORE_SALES that can be satisfied by the MV will be rewritten: Here, the following query used the MV: - What is the quarterly performance by category with yearly totals? select          i.i_category,        d.d_year,        d.d_quarter_name,        sum(s.ss_quantity) quantity from bds.DATE_DIM_ORCL d, bds.ITEM_ORCL i, bds.STORE_SALES s where s.ss_item_sk = i.i_item_sk   and s.ss_sold_date_sk = d.d_date_sk   and d.d_quarter_name in ('2005Q1', '2005Q2', '2005Q3', '2005Q4') group by rollup (i.i_category, d.d_year, d.D_QUARTER_NAME) And, the query returned in a little more than a second: Looking at the explain plan, you can see that the query is executed against the MV – and the EXTERNAL TABLE ACCESS (STORAGE FULL) indicates that Big Data SQL Smart Scan kicked in on the Hadoop cluster. MVs within the database can be automatically updated by using change tracking.  However, in the case of Big Data SQL tables, the data is not resident in the database – so the database does not know that the summaries are changed.  Your ETL processing will need to ensure that the MVs are kept up to date – and you will need to set query_rewrite_integrity=stale_tolerated. MVs are an old friend.  They have been used for years to accelerate performance for traditional database deployments.  They are a great tool to use for your big data deployments as well!  
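One practical footnote (a sketch using standard database features, not from the original post): because the data underneath a prebuilt Big Data SQL table changes outside the database's knowledge, it can be useful to confirm both the session rewrite settings and whether a particular query is actually eligible for rewrite. DBMS_MVIEW.EXPLAIN_REWRITE writes its findings to the REWRITE_TABLE created by the utlxrw.sql script; the query text below is just an example:

SQL> ALTER SESSION SET query_rewrite_enabled = TRUE;
SQL> ALTER SESSION SET query_rewrite_integrity = stale_tolerated;

SQL> @?/rdbms/admin/utlxrw.sql   -- creates REWRITE_TABLE
SQL> begin
  DBMS_MVIEW.EXPLAIN_REWRITE(
    query => 'select i_category, sum(ss_quantity) from bds.store_sales, bds.item_orcl where ss_item_sk = i_item_sk group by i_category',
    mv    => 'MV_STORE_SALES_ITEM');
end;
/
SQL> select message from rewrite_table;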


Big Data SQL

Big Data SQL Quick Start. Correlate real-time data with historical benchmarks – Part 24

In Big Data SQL 3.2 we have introduced a new capability - Kafka as a data source. Some details about how it works, with some simple examples, I've posted over here. But now I want to talk about why you would want to run queries over Kafka. Here is Oracle's concept picture of a data warehouse: you have a stream (real-time data), a data lake where you land raw information, and cleaned enterprise data. This is just a concept, which could be implemented in many different ways; one of them is depicted here: Kafka is the hub for streaming events, where you accumulate data from multiple real-time producers and provide this data to many consumers (it could be real-time processing, such as Spark Streaming, or you could load data in batch mode to the next data warehouse tier, such as Hadoop). In this architecture, Kafka contains the stream data and is able to answer the question "what is going on right now", whereas in the database you store operational data and in Hadoop historical data, and those two sources are able to answer the question "how it used to be". Big Data SQL allows you to run SQL over those three sources and correlate real-time events with historical ones.

Example of using Big Data SQL over Kafka and other sources.

So, above I've explained the concept of why you may need to query Kafka with Big Data SQL; now let me give a concrete example. Input for the demo example:
- We have a company, called MoviePlex, which sells video content all around the world
- There are two stream datasets - network data, which contains information about network errors, the condition of routing devices and so on, and a second data source with the facts of movie sales
- Both stream in real-time into Kafka
- Also, we have historical network data, which we store in HDFS (because of the cost of this data), historical sales data (which we store in the database) and multiple dimension tables, stored in the RDBMS as well
Based on this we have a business case: monitor the revenue flow, correlate current traffic with the historical benchmark (depending on day of the week and hour of the day) and try to find the reason in case of failures (network errors, for example). Using Oracle Data Visualization Desktop, we've created a dashboard which shows how real-time traffic correlates with the statistical benchmark and also shows the number of network errors by country. The blue line is the historical benchmark. Over time we see that some errors appear in some countries (left dashboard), but current revenue is more or less the same as it used to be. After a while revenue starts going down, and this trend keeps going, with a lot of network errors in France. Let's drill down into itemized traffic: indeed, we see that overall revenue goes down because of France, and the cause is network errors.

Conclusion:
1) Kafka stores real-time data and answers the question "what is going on right now"
2) The database and Hadoop store historical data and answer the question "how it used to be"
3) Big Data SQL can query the data from Kafka, Hadoop and the database within a single query (joining the datasets)
4) This allows us to correlate historical benchmarks with real-time data within a SQL interface and use this with any SQL-compatible BI tool
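To make point 3 of the conclusion concrete, here is a sketch of what such a cross-source correlation query could look like. All table and column names are hypothetical (they are not the demo's actual objects): a Kafka-backed external table, an HDFS-backed external table holding the benchmark, and a regular dimension table in the database:

SQL> SELECT c.country_name,
       k.revenue_last_hour,
       h.typical_hourly_revenue
FROM (SELECT country_id, SUM(price) AS revenue_last_hour
        FROM sales_kafka                      -- ORACLE_HIVE table over the Kafka topic
       GROUP BY country_id) k
JOIN (SELECT country_id, AVG(hourly_revenue) AS typical_hourly_revenue
        FROM sales_history_hdfs               -- external table over historical HDFS data
       WHERE day_of_week = TO_CHAR(SYSDATE, 'DY')
         AND hour_of_day = TO_CHAR(SYSDATE, 'HH24')
       GROUP BY country_id) h
  ON h.country_id = k.country_id
JOIN countries c                              -- dimension table stored in the RDBMS
  ON c.country_id = k.country_id;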


Big Data SQL

Big Data SQL Quick Start. Big Data SQL over Kafka – Part 23

Big Data SQL 3.2 brings a few interesting features. Among those features, one of the most interesting is the ability to read Kafka. Before drilling down into details, I'd like to explain in a nutshell what Kafka is.

What is Kafka?

The full scope of information about Kafka you may find here, but in a nutshell, it's a distributed, fault-tolerant messaging system. It allows you to connect many systems in an organized fashion. Instead of connecting each system peer to peer, you may land all your messages company-wide on one system and consume them from there. Kafka is a kind of data hub system, where you land the messages and serve them afterwards.

More technical details.

I'd like to introduce a few key Kafka terms.
1) Kafka Broker. This is the Kafka service, which you run on each server and which serves all read and write requests
2) Kafka Producer. The process which writes data into Kafka
3) Kafka Consumer. The process which reads data from Kafka
4) Message. The name describes itself; I just want to add that messages have a key and a value. In contrast to NoSQL databases, Kafka's key is not indexed. It has application purposes (you may put some application logic in the key) and administrative purposes (each message with the same key goes to the same partition)
5) Topic. Messages are organized into topics. Database folks would compare a topic to a table
6) Partition. It's a good practice to divide a topic into partitions for performance and maintenance purposes. Messages with the same key go to the same partition. If a key is absent, messages are distributed in round-robin fashion
7) Offset. The offset is the position of each message in the topic. The offset is indexed and allows you to quickly access a particular message

When do you delete data?

One of the basic Kafka concepts is retention - Kafka does not keep data forever, nor does it wait for all consumers to read a message before deleting it. Instead, the Kafka administrator configures a retention period for each topic - either the amount of time for which to store messages before deleting them, or how much data to store before older messages are purged. Two parameters control this: log.retention.ms and log.retention.bytes. The latter is the amount of data to retain in the log for each topic-partition; this is a limit per partition, so multiply it by the number of partitions to get the total data retained for the topic.

How to query Kafka data with Big Data SQL?

To query the Kafka data you need to create a Hive table first. Let me show an end-to-end example.
I do have a JSON file: $ cat web_clicks.json { click_date: "38041", click_time: "67786", date: "2004-02-26", am_pm: "PM", shift: "second", sub_shift: "evening", item_sk: "396439", web_page: "646"} { click_date: "38041", click_time: "41831", date: "2004-02-26", am_pm: "AM", shift: "first", sub_shift: "morning", item_sk: "90714", web_page: "804"} { click_date: "38041", click_time: "60334", date: "2004-02-26", am_pm: "PM", shift: "second", sub_shift: "afternoon", item_sk: "151944", web_page: "867"} { click_date: "38041", click_time: "53225", date: "2004-02-26", am_pm: "PM", shift: "first", sub_shift: "afternoon", item_sk: "175796", web_page: "563"} { click_date: "38041", click_time: "47515", date: "2004-02-26", am_pm: "PM", shift: "first", sub_shift: "afternoon", item_sk: "186943", web_page: "777"} { click_date: "38041", click_time: "73633", date: "2004-02-26", am_pm: "PM", shift: "second", sub_shift: "evening", item_sk: "118004", web_page: "647"} { click_date: "38041", click_time: "43133", date: "2004-02-26", am_pm: "AM", shift: "first", sub_shift: "morning", item_sk: "148210", web_page: "930"} { click_date: "38041", click_time: "80675", date: "2004-02-26", am_pm: "PM", shift: "second", sub_shift: "evening", item_sk: "380306", web_page: "484"} { click_date: "38041", click_time: "21847", date: "2004-02-26", am_pm: "AM", shift: "third", sub_shift: "morning", item_sk: "55425", web_page: "95"} { click_date: "38041", click_time: "35131", date: "2004-02-26", am_pm: "AM", shift: "first", sub_shift: "morning", item_sk: "185071", web_page: "118"} and I'm going to load it into Kafka with standard Kafka tool "kafka-console-producer": $ cat web_clicks.json|kafka-console-producer --broker-list bds2:9092,bds3:9092,bds4:9092,bds5:9092,bds6:9092 --topic json_clickstream for a check that messages have appeared in the topic you may use the following command: $ kafka-console-consumer --zookeeper bds1:2181,bds2:2181,bds3:2181 --topic json_clickstream --from-beginning after I've loaded this file into Kafka topic, I create a table in Hive. Make sure that you have oracle-kafka.jar and kafka-clients*.jar in your hive.aux.jars.path: and here: after this you may run follow DDL in the hive: hive> CREATE EXTERNAL TABLE json_web_clicks_kafka row format serde 'oracle.hadoop.kafka.hive.KafkaSerDe' stored by 'oracle.hadoop.kafka.hive.KafkaStorageHandler' tblproperties( 'oracle.kafka.table.key.type'='long', 'oracle.kafka.table.value.type'='string', 'oracle.kafka.bootstrap.servers'='bds2:9092,bds3:9092,bds4:9092,bds5:9092,bds6:9092', 'oracle.kafka.table.topics'='json_clickstream' ); hive> describe json_web_clicks_kafka; hive> select * from json_web_clicks_kafka limit 1; and as soon as hive table been created I create ORACLE_HIVE table in Oracle: SQL> CREATE TABLE json_web_clicks_kafka ( topic varchar2(50), partitionid integer, VALUE varchar2(4000), offset integer, timestamp timestamp, timestamptype integer ) ORGANIZATION EXTERNAL (TYPE ORACLE_HIVE DEFAULT DIRECTORY DEFAULT_DIR ACCESS PARAMETERS ( com.oracle.bigdata.cluster=CLUSTER com.oracle.bigdata.tablename=default.json_web_clicks_kafka ) ) PARALLEL REJECT LIMIT UNLIMITED; here you also have to keep in mind that you need to add oracle -kafka.jar and  kafka -clients*.jar in your bigdata.properties file on the database and on the Hadoop side. I have dedicated the blog about how to do this here. 
Now we are ready to query:

SQL> SELECT * FROM json_web_clicks_kafka WHERE ROWNUM<3;
json_clickstream 209 { click_date: "38041", click_time: "43213"..."} 0 26-JUL-17 05.55.51.762000 PM 1
json_clickstream 209 { click_date: "38041", click_time: "74669"... } 1 26-JUL-17 05.55.51.762000 PM 1

Oracle 12c provides powerful capabilities for working with JSON, such as the dot notation API. It allows us to easily query the JSON data as a structure:

SELECT t.value.click_date, t.value.click_time FROM json_web_clicks_kafka t WHERE ROWNUM < 3;
38041 40629
38041 48699

Working with AVRO messages.

In many cases customers use AVRO as a flexible, self-describing format for exchanging messages through Kafka. For sure we support it, and in a very easy and flexible way. I have a topic which contains AVRO messages, and I define a Hive table over it:

CREATE EXTERNAL TABLE web_sales_kafka
row format serde 'oracle.hadoop.kafka.hive.KafkaSerDe'
stored by 'oracle.hadoop.kafka.hive.KafkaStorageHandler'
tblproperties(
'oracle.kafka.table.key.type'='long',
'oracle.kafka.table.value.type'='avro',
'oracle.kafka.table.value.schema'='{"type":"record","name":"avro_table","namespace":"default","fields":
[{"name":"ws_sold_date_sk","type":["null","long"],"default":null},
{"name":"ws_sold_time_sk","type":["null","long"],"default":null},
{"name":"ws_ship_date_sk","type":["null","long"],"default":null},
{"name":"ws_item_sk","type":["null","long"],"default":null},
{"name":"ws_bill_customer_sk","type":["null","long"],"default":null},
{"name":"ws_bill_cdemo_sk","type":["null","long"],"default":null},
{"name":"ws_bill_hdemo_sk","type":["null","long"],"default":null},
{"name":"ws_bill_addr_sk","type":["null","long"],"default":null},
{"name":"ws_ship_customer_sk","type":["null","long"],"default":null}
]}',
'oracle.kafka.bootstrap.servers'='bds2:9092',
'oracle.kafka.table.topics'='web_sales_avro'
);
describe web_sales_kafka;
select * from web_sales_kafka limit 1;

Here I define 'oracle.kafka.table.value.type'='avro' and also have to specify 'oracle.kafka.table.value.schema'. After this we have structure. In a similar way I define a table in the Oracle RDBMS:

SQL> CREATE TABLE WEB_SALES_KAFKA_AVRO (
"WS_SOLD_DATE_SK" NUMBER,
"WS_SOLD_TIME_SK" NUMBER,
"WS_SHIP_DATE_SK" NUMBER,
"WS_ITEM_SK" NUMBER,
"WS_BILL_CUSTOMER_SK" NUMBER,
"WS_BILL_CDEMO_SK" NUMBER,
"WS_BILL_HDEMO_SK" NUMBER,
"WS_BILL_ADDR_SK" NUMBER,
"WS_SHIP_CUSTOMER_SK" NUMBER,
topic varchar2(50),
partitionid integer,
KEY NUMBER,
offset integer,
timestamp timestamp,
timestamptype INTEGER
)
ORGANIZATION EXTERNAL
(
TYPE ORACLE_HIVE
DEFAULT DIRECTORY "DEFAULT_DIR"
ACCESS PARAMETERS
(
com.oracle.bigdata.tablename: web_sales_kafka
)
)
REJECT LIMIT UNLIMITED
;

And we are good to query the data!

Performance considerations.

1) Number of Partitions. This is the most important thing to keep in mind; there is a nice article about how to choose the right number of partitions. For Big Data SQL purposes I'd recommend using a number of partitions a bit higher than the number of CPU cores on your Big Data SQL cluster.
2) Query fewer columns. Use the column pruning feature; in other words, list only the necessary columns in your SELECT and WHERE statements. Here is an example. I've created a void PL/SQL function, which does nothing.
But PL/SQL can't be offloaded to the cell side, so we will move all the data to the database side:

SQL> create or replace function fnull(input number) return number is
Result number;
begin
Result:=input;
return(Result);
end fnull;

After this I ran a query which requires one column and checked how much data was returned to the DB side:

SQL> SELECT MIN(fnull(WS_SOLD_DATE_SK)) FROM WEB_SALES_KAFKA_AVRO;
"cell interconnect bytes returned by XT smart scan"  5741.81 MB

Then I repeated the same test case with 10 columns:

SQL> SELECT
MIN(fnull(WS_SOLD_DATE_SK)),
MIN(fnull(WS_SOLD_TIME_SK)),
MIN(fnull(WS_SHIP_DATE_SK)),
MIN(fnull(WS_ITEM_SK)),
MIN(fnull(WS_BILL_CUSTOMER_SK)),
MIN(fnull(WS_BILL_CDEMO_SK)),
MIN(fnull(WS_BILL_HDEMO_SK)),
MIN(fnull(WS_BILL_ADDR_SK)),
MIN(fnull(WS_SHIP_CUSTOMER_SK)),
MIN(fnull(WS_SHIP_CDEMO_SK))
FROM WEB_SALES_KAFKA_AVRO;
"cell interconnect bytes returned by XT smart scan"  32193.98 MB

Hopefully this test case clearly shows that you should select only the columns you actually need.

3) Indexes. There are no indexes other than the offset column. The fact that there is a key column should not mislead you - it's not indexed. Only the offset allows quick random access.

4) Warm up your data. If you want to read data faster many times, you have to warm it up by running "select *" type queries. Kafka relies on the Linux filesystem cache, so to read the same dataset faster many times, you have to read it a first time. Here is the example:
- I clean up the Linux filesystem cache:
dcli -C "sync; echo 3 > /proc/sys/vm/drop_caches"
- I run the first query:
SELECT COUNT(1) FROM WEB_RETURNs_JSON_KAFKA t
It took 278 seconds.
- The second and third runs took only 92 seconds.

5) Use a bigger replication factor. Here is an example. I have two tables: one created over a Kafka topic with replication factor = 1, the second over a Kafka topic with replication factor = 3.
SELECT COUNT(1) FROM JSON_KAFKA_RF1 t
This query took 278 seconds for the first run and 92 seconds for the next runs.
SELECT COUNT(1) FROM JSON_KAFKA_RF3 t
This query took 279 seconds for the first run, but only 34 seconds for the next runs.

6) Compression considerations. Kafka supports different types of compression. If you store the data in JSON or XML format, the compression rate can be significant. Here is an example of the numbers you might see:

Data format and compression type | Size of the data, GB
JSON on HDFS, uncompressed | 273.1
JSON in Kafka, uncompressed | 286.191
JSON in Kafka, Snappy | 180.706
JSON in Kafka, GZIP | 52.2649
AVRO in Kafka, uncompressed | 252.975
AVRO in Kafka, Snappy | 158.117
AVRO in Kafka, GZIP | 54.49

This feature may save some space on the disks, but taking into account that Kafka is primarily used as a temporal store (for, say, one week or one month), I'm not sure that it makes much sense. Also, you will pay some performance penalty when querying this data (and burn more CPU). I ran a query like:
SQL> select count(1) from ...
and had the following results:

Type of compression | Elapsed time, sec
uncompressed | 76
snappy | 80
gzip | 92

So, uncompressed is the leader; GZIP and Snappy are slower (not dramatically, but slower). Taking this into account, as well as the fact that Kafka is a temporal store, I wouldn't recommend using compression without an exceptional need.

7) Parallelize your processing. If for some reason you are using a small number of partitions, you can use the Hive metadata parameter "oracle.kafka.partition.chunk.size" to increase parallelism.
This parameter defines the size of the input split. So if you set this parameter to 1MB and your topic holds 4MB in total, it will be processed with 4 parallel threads. Here is the test case:

- Drop the Kafka topic:
$ kafka-topics --delete --zookeeper cfclbv3870:2181,cfclbv3871:2181,cfclbv3872:2181 --topic store_sales

- Create it again with only one partition:
$ kafka-topics --create --zookeeper cfclbv3870:2181,cfclbv3871:2181,cfclbv3872:2181 --replication-factor 3 --partitions 1 --topic store_sales

- Check it:
$ kafka-topics --describe --zookeeper cfclbv3870:2181,cfclbv3871:2181,cfclbv3872:2181 --topic store_sales
...
Topic:store_sales PartitionCount:1 ReplicationFactor:3 Configs:
Topic: store_sales Partition: 0 Leader: 79 Replicas: 79,76,77 Isr: 79,76,77
...

- Check the size of the input file:
$ du -h store_sales.dat
19G store_sales.dat

- Load the data into the Kafka topic:
$ cat store_sales.dat|kafka-console-producer --broker-list cfclbv3870.us2.oraclecloud.com:9092,cfclbv3871.us2.oraclecloud.com:9092,cfclbv3872.us2.oraclecloud.com:9092,cfclbv3873.us2.oraclecloud.com:9092,cfclbv3874.us2.oraclecloud.com:9092 --topic store_sales --request-timeout-ms 30000 --batch-size 1000000

- Create the Hive external table:
hive> CREATE EXTERNAL TABLE store_sales_kafka
row format serde 'oracle.hadoop.kafka.hive.KafkaSerDe'
stored by 'oracle.hadoop.kafka.hive.KafkaStorageHandler'
tblproperties(
 'oracle.kafka.table.key.type'='long',
 'oracle.kafka.table.value.type'='string',
 'oracle.kafka.bootstrap.servers'='cfclbv3870:9092,cfclbv3871:9092,cfclbv3872:9092,cfclbv3873:9092,cfclbv3874:9092',
 'oracle.kafka.table.topics'='store_sales'
);

- Create the Oracle external table:
SQL> CREATE TABLE STORE_SALES_KAFKA (
 TOPIC VARCHAR2(50),
 PARTITIONID NUMBER,
 VALUE VARCHAR2(4000),
 OFFSET NUMBER,
 TIMESTAMP TIMESTAMP,
 TIMESTAMPTYPE NUMBER
)
ORGANIZATION EXTERNAL (
 TYPE ORACLE_HIVE
 DEFAULT DIRECTORY DEFAULT_DIR
 ACCESS PARAMETERS (
  com.oracle.bigdata.tablename=default.store_sales_kafka
 )
)
REJECT LIMIT UNLIMITED
PARALLEL;

- Run the test query:
SQL> SELECT COUNT(1) FROM store_sales_kafka;
It took 142 seconds.

- Re-create the Hive external table with the 'oracle.kafka.partition.chunk.size' parameter set to 1MB:
hive> CREATE EXTERNAL TABLE store_sales_kafka
row format serde 'oracle.hadoop.kafka.hive.KafkaSerDe'
stored by 'oracle.hadoop.kafka.hive.KafkaStorageHandler'
tblproperties(
 'oracle.kafka.chop.partition'='true',
 'oracle.kafka.partition.chunk.size'='1048576',
 'oracle.kafka.table.key.type'='long',
 'oracle.kafka.table.value.type'='string',
 'oracle.kafka.bootstrap.servers'='cfclbv3870:9092,cfclbv3871:9092,cfclbv3872:9092,cfclbv3873:9092,cfclbv3874:9092',
 'oracle.kafka.table.topics'='store_sales'
);

- Run the query again:
SQL> SELECT COUNT(1) FROM store_sales_kafka;
Now it took only 7 seconds.

A 1MB split is quite small; for big topics we recommend using 256MB.

8) Querying small topics. Sometimes you need to query really small topics (a few hundred messages, for example), but very frequently. In this case it makes sense to create the topic with fewer partitions.
Here is the test case example:

- Create a topic with 1000 partitions:
$ kafka-topics --create --zookeeper cfclbv3870:2181,cfclbv3871:2181,cfclbv3872:2181 --replication-factor 3 --partitions 1000 --topic small_topic

- Load only one message into it:
$ echo "test"|kafka-console-producer --broker-list cfclbv3870.us2.oraclecloud.com:9092,cfclbv3871.us2.oraclecloud.com:9092,cfclbv3872.us2.oraclecloud.com:9092,cfclbv3873.us2.oraclecloud.com:9092,cfclbv3874.us2.oraclecloud.com:9092 --topic small_topic

- Create the Hive external table:
hive> CREATE EXTERNAL TABLE small_topic_kafka
row format serde 'oracle.hadoop.kafka.hive.KafkaSerDe'
stored by 'oracle.hadoop.kafka.hive.KafkaStorageHandler'
tblproperties(
 'oracle.kafka.table.key.type'='long',
 'oracle.kafka.table.value.type'='string',
 'oracle.kafka.bootstrap.servers'='cfclbv3870:9092,cfclbv3871:9092,cfclbv3872:9092,cfclbv3873:9092,cfclbv3874:9092',
 'oracle.kafka.table.topics'='small_topic'
);

- Create the Oracle external table:
SQL> CREATE TABLE small_topic_kafka (
 topic varchar2(50),
 partitionid integer,
 VALUE varchar2(4000),
 offset integer,
 timestamp timestamp,
 timestamptype integer
)
ORGANIZATION EXTERNAL (
 TYPE ORACLE_HIVE
 DEFAULT DIRECTORY DEFAULT_DIR
 ACCESS PARAMETERS (
  com.oracle.bigdata.tablename=default.small_topic_kafka
 )
)
PARALLEL
REJECT LIMIT UNLIMITED;

- Query all rows from it:
SQL> SELECT * FROM small_topic_kafka
It took 6 seconds.

- Create a topic with only one partition, put a single message into it, and run the same SQL query over it:
$ kafka-topics --create --zookeeper cfclbv3870:2181,cfclbv3871:2181,cfclbv3872:2181 --replication-factor 3 --partitions 1 --topic small_topic
$ echo "test"|kafka-console-producer --broker-list cfclbv3870.us2.oraclecloud.com:9092,cfclbv3871.us2.oraclecloud.com:9092,cfclbv3872.us2.oraclecloud.com:9092,cfclbv3873.us2.oraclecloud.com:9092,cfclbv3874.us2.oraclecloud.com:9092 --topic small_topic
SQL> SELECT * FROM small_topic_kafka
Now it takes only 0.5 seconds.

9) Type of data in Kafka messages. You have a few options for storing data in Kafka messages, and you certainly want pushdown processing. Big Data SQL supports pushdown operations only for JSON. This means that everything you can expose through JSON will be pushed down to the cell side and processed there. For example:

- A query that can be pushed down to the cell side (JSON):
SQL> SELECT COUNT(1) FROM WEB_RETURN_JSON_KAFKA t WHERE t.VALUE.after.WR_ORDER_NUMBER=233183247;

- A query that cannot be pushed down to the cell side (XML):
SQL> SELECT COUNT(1) FROM WEB_RETURNS_XML_KAFKA t WHERE XMLTYPE(t.value).EXTRACT('/operation/col[@name="WR_ORDER_NUMBER"]/after/text()').getNumberVal() = 233183247;

If the amount of data is not significant, you can still use Big Data SQL to process it. If we are talking about big data volumes, you can process it once and convert it into a different file format on HDFS with a Hive query:

hive> select xpath_int(value,'/operation/col[@name="WR_ORDER_NUMBER"]/after/text()') from WEB_RETURNS_XML_KAFKA limit 1;

10) JSON vs Avro format in Kafka topics. Continuing from the previous point, you may be wondering which semi-structured format to use. The answer is easy: use whatever your data source produces; there is no significant performance difference between Avro and JSON. For example, a query like:

SQL> SELECT COUNT(1) FROM WEB_RETURNS_avro_kafka t WHERE t.WR_ORDER_NUMBER=233183247;

completes in 112 seconds for JSON and 105 seconds for Avro, and the JSON topic takes 286.33 GB while the Avro topic takes 202.568 GB.
There is some difference, but not enough to be worth converting away from the original format.

How do you bring data from OLTP databases into Kafka? Use GoldenGate! Oracle GoldenGate is the well-known product for capturing commit logs on the database side and delivering the changes to a target system. The good news is that Kafka can play the role of the target system. I'll skip the detailed explanation of this feature because it is already explained in great detail here.

Known issue: running the Kafka broker on a wildcard address. By default, Kafka doesn't use the wildcard address (0.0.0.0) for brokers and picks a specific IP address instead. This may be a problem in the case of a multi-network Kafka cluster, where one network is used for the interconnect and a second for external connections. Luckily, there is an easy way to solve this and start the Kafka broker on the wildcard address:

1) go to: Kafka > Instances (Select Instance) > Configuration > Kafka Broker > Advanced > Kafka Broker Advanced Configuration Snippet (Safety Valve) for kafka.properties

2) and add:
listeners=PLAINTEXT://0.0.0.0:9092
advertised.listeners=PLAINTEXT://server.example.com:9092
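After restarting the broker, it is worth confirming that it really is listening on the wildcard address. A minimal check from the broker host, assuming the default plaintext port 9092 (this is a generic Linux check, not a Kafka tool) - the local address column should show 0.0.0.0:9092 rather than a single interface IP:

$ netstat -tlnp | grep 9092

From a remote client you can additionally verify that the advertised name resolves and the port is reachable, for example with "nc -vz server.example.com 9092".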


Oracle Big Data SQL 3.2 is Now Available

Big Data SQL 3.2 has been released and is now available for download on edelivery.  This new release has many exciting new features – with a focus on simpler install and configuration, support for new data sources, enhanced security and improved performance.

Big Data SQL has expanded its data source support to now include querying data streams – specifically Kafka topics. This enables streaming data to be joined with dimensions and facts in Oracle Database or HDFS.  It's never been easier to combine data from streams, Hadoop and Oracle Database.

New security capabilities enable Big Data SQL to automatically leverage underlying authorization rules on source data (i.e. ACLs on HDFS data) and then augment that with Oracle's advanced security policies.  In addition, to prevent impersonation, Oracle Database servers now authenticate against Big Data SQL Server cells. Finally, secure Big Data SQL installations have become much easier to set up; Kerberos ticket renewals are now automatically configured.

There have been significant performance improvements as well.  Oracle now provides its own optimized Parquet driver which delivers a significant performance boost – both in terms of speed and the ability to query many columns.  Support for CLOBs is also now available – which facilitates efficient processing of large JSON and XML data documents.

Finally, there have been significant enhancements to the out-of-box experience.  The installation process has been simplified, streamlined and made much more robust.
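To make the stream-plus-dimension idea concrete, here is a minimal sketch of the kind of query this enables, reusing the json_web_clicks_kafka external table from the Kafka walkthrough elsewhere on this blog and joining it to a hypothetical date_dim dimension table in Oracle Database (the dimension table and its columns are illustrative, not part of the release):

SELECT d.d_month_seq,
       COUNT(*) AS clicks
FROM   json_web_clicks_kafka t,
       date_dim d
WHERE  TO_NUMBER(t.value.click_date) = d.d_date_sk   -- dot notation over the JSON payload
GROUP  BY d.d_month_seq;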


Big Data

Roadmap Update for Big Data Appliance Releases (4.11 and beyond)

With the release of BDA version 4.10 we added a number of interesting features, but for various reasons we slipped behind our targets for taking up the Cloudera updates within a reasonable time. To understand what we do before we ship the latest CDH on BDA, and why we think that time is well spent, review this post. That said, we have decided to rejigger the releases and do the following:

- Focus BDA 4.11 solely on taking up the latest CDH 5.13.1 and the related OS and Java updates, thus catching up with the CDH release timeline
- Move all features that were planned for 4.11 to the next release, which will then be on track to take up CDH 5.14 on our regular schedule

So what does this mean in terms of release timeframes, and what does it mean for what we talked about at OpenWorld for BDA (shown in the image below; review the full slide deck, including our cloud updates, at the OpenWorld site)?

BDA version 4.11.0 will have the following updates:

- Uptake of CDH 5.13.1 - as usual, because we will be very close to the first update to 5.13, we will include that and time our BDA release as close to it as possible. This would get us to BDA 4.11.0 around mid December, assuming the CDH update keeps its dates
- The latest OS versions, kernel etc., updating to the state of the art on Oracle Linux 6 and including all security patches
- MySQL and Java updates, again ensuring all security patches are included

BDA version 4.12.0 will have the following updates:

- Uptake of CDH 5.14.x - we are still evaluating the dates and timing for this CDH release and whether we go with the .0 or .1 version. The goal is to deliver this release 4 weeks or so after CDH drops. Expect early calendar 2018, with more precise updates coming to this forum as we know more.
- Roadmap features as follows: a dedicated Kafka cluster on BDA nodes; full cluster on OL7 (aligning with the OL7 edge nodes); Big Data Manager available on BDA; non-stop Hadoop, proceeding to make more and more components HA out of the box; fully managed BDA edge nodes
- The usual OS, Java and MySQL updates per the normal release cadence
- Updates to related components like Big Data Connectors etc.

All of this means that we pulled in the 4.11.0 version to the mid-December time frame, while we pushed out the 4.12.0 version by no more than maybe a week or so... so this looks like a win-win on all fronts.

Please note the Safe Harbor Statement below: The preceding is intended to outline our general product direction. It is intended for information purposes only, and may not be incorporated into any contract. It is not a commitment to deliver any material, code, or functionality, and should not be relied upon in making purchasing decisions. The development, release, and timing of any features or functionality described for Oracle's products remains at the sole discretion of Oracle.


Big Data

Review of Big Data Warehousing at OpenWorld 2017 - Now Available

Did you miss OpenWorld 2017? Then my latest book is definitely something you will want to download! If you went to OpenWorld this book is also for you, because it covers all the most important big data warehousing messages and sessions during the five days of OpenWorld. Following on from OpenWorld 2017 I have put together a comprehensive review of all the big data warehousing content from OpenWorld 2017. This includes all the key sessions and announcements from this year's Oracle OpenWorld conference. This review guide contains the following information:

Chapter 1 Welcome - an overview of the contents.
Chapter 2 Let's Go Autonomous - containing all you need to know about Oracle's new, fully-managed Autonomous Data Warehouse Cloud. This was the biggest announcement at OpenWorld, so this chapter contains videos, presentations and podcasts to get you up to speed on this completely new data warehouse cloud service.
Chapter 3 Keynotes - relive OpenWorld 2017 by watching the most important highlights from this year's OpenWorld conference with our on-demand video service, which covers all the major keynote sessions.
Chapter 4 Key Presenters - a list of the most important speakers by product area such as database, cloud, analytics, developer and big data. Each biography includes all relevant social media sites and pages.
Chapter 5 Key Sessions - a list of all the most important sessions, with links to download the related presentations.
Chapter 6 Staying Connected - details of all the links you need to keep up to date on Oracle's strategy and products for Data Warehousing and Big Data. This covers all our websites, blogs and social media pages.

This review is available in three formats: 1) For highly evolved users, i.e. Apple users, who understand the power of Apple's iBook format, your multi-media enabled iBook version is available here. 2) For Windows users who are forced to endure a 19th-Century style technological experience, your PDF version is available here. 3) For Linux users, Oracle DBAs and other IT dinosaurs, all of whom are allergic to all graphical user interfaces, the basic version of this comprehensive review is available here.

I hope you enjoy this review and look forward to seeing you next year at OpenWorld 2018, October 28 to November 1. If you'd like to be notified when registration opens for next year's Oracle OpenWorld then register your email address here.


New Release: BDA 4.10 is now Generally Available

As of today, BDA version 4.10 is Generally Available. As always, please refer to If You Struggle With Keeping your BDAs up to date, Then Read This to learn about the innovative release process we use for BDA software. This new release includes a number of features and updates:

- Support for Migration From Oracle Linux 5 to Oracle Linux 6 - Clusters on Oracle Linux 5 must first be upgraded to v4.10.0 on Oracle Linux 5 and can then be migrated to Oracle Linux 6. This process must be done one server at a time. HDFS data and Cloudera Manager roles are retained. Please review the documentation for the entire process carefully before starting. BDA v4.10 is the last release built for Oracle Linux 5 and no further upgrades for Oracle Linux 5 will be released.
- Updates to NoSQL DB, Big Data Connectors, Big Data Spatial & Graph: Oracle NoSQL Database 4.5.12, Oracle Big Data Connectors 4.10.0, Oracle Big Data Spatial & Graph 2.4.0
- Support for Oracle Big Data Appliance X7 systems - Oracle Big Data Appliance X7 is based on the X7-2L server. The major enhancements in Big Data Appliance X7-2 hardware are: CPU update (2 x 24-core Intel Xeon processors), updated disk drives (12 x 10TB 7,200 RPM SAS drives), 2 x M.2 150GB SATA SSD drives (replacing the internal USB drive), Vail disk controller (HBA), and the Cisco 93108TC-EX 1G Ethernet switch (replacing the Catalyst 4948E).
- Spark 2 Deployed by Default - Spark 2 is now deployed by default on new clusters and also during upgrade of clusters where it is not already installed.
- Oracle Linux 7 can be Installed on Edge Nodes - Oracle Linux 7 is now supported for installation on Oracle Big Data Appliance edge nodes running on X7-2L, X6-2L or X5-2L servers. Support for Oracle Linux 7 in this release is limited to edge nodes.
- Support for Cloudera Data Science Workbench - Support for Oracle Linux 7 on edge nodes provides a way for customers to host Cloudera Data Science Workbench (CDSW) on Oracle Big Data Appliance. CDSW is a web application that enables access from a browser to R, Python, and Scala on a secured cluster. Oracle Big Data Appliance does not include licensing or official support for CDSW. Contact Cloudera for licensing requirements.
- Scripts for Download & Configuration of Apache Zeppelin, Jupyter Notebook, and RStudio - This release includes scripts to assist in the download and configuration of these commonly used tools. The scripts are provided as a convenience to users. Oracle Big Data Appliance does not include official support for the installation and use of Apache Zeppelin, Jupyter Notebook, or RStudio.
- Improved Configuration of Oracle's R Distribution and ORAAH - For these tools, much of the environment configuration that was previously done by the customer is now automated.
- Node Migration Optimization - Node migration time has been improved by eliminating some steps.
- Support for Extending Secure NoSQL DB clusters

This release is based on Cloudera Enterprise (CDH 5.12.1 & Cloudera Manager 5.12.1) as well as Oracle NoSQL Database (4.5.12). Cloudera 5 Enterprise includes CDH (Core Hadoop), Cloudera Manager, Apache Spark, Apache HBase, Impala, Cloudera Search and Cloudera Navigator. The BDA continues to support all security options for CDH Hadoop clusters: Kerberos authentication (MIT or Microsoft Active Directory), Sentry authorization, HTTPS/network encryption, transparent HDFS disk encryption, and secure configuration for Impala, HBase, Cloudera Search and all Hadoop services configured out of the box. Parcels for Kafka 2.2, Spark 2.2, Kudu 1.4 and Key Trustee Server 5.12 are included in the BDA Software Bundle.


Announcing: Big Data Appliance X7-2 - More Power, More Capacity

Big Data Appliance X7-2 is the 6th hardware generation of Oracle's leading Big Data platform, continuing the platform evolution from Hadoop workloads to Big Data, SQL, Analytics and Machine Learning workloads. Big Data Appliance combines dense IO with dense compute in a single server form factor. The single form factor enables our customers to build a single data lake, rather than replicating data across more specialized lakes.

What is New? The current X7-2 generation is based on the latest Oracle Sun X7-2L servers, and leverages that infrastructure to deliver enterprise-class hardware for big data workloads. The latest generation sports more cores, more disk space and the same amount of memory per server. Big Data Appliance retains its InfiniBand internal network, supported by a multi-homed Cloudera CDH cluster setup. The details can be found in the updated data sheet.

Why a Single Form Factor? Many customers are embarking on a data unification effort, and the main data management concept used in that effort is the data lake. Within this data lake, we see and recommend a set of workloads to be run, as shown in this logical architecture. In essence, what we are saying is that the data lake will host the Innovation or Discovery Lab workloads as well as the Execution or production workloads on the same systems. This means that we need an infrastructure that can both handle large data volumes in a cost-effective manner and handle high compute volumes on a regular basis. Leveraging the hardware footprint in BDA enables us to run both of these workloads. The servers come with 2 x 24 cores AND 12 x 10TB drives, enabling very large volumes of data and CPUs spread across a number of workloads. So rather than dealing with various form factors and copying data from the main data lake to a sideshow Discovery Lab, BDA X7-2 consolidates these workloads. The other increasingly important data set in the data lake is streaming into the organization, typically via Apache Kafka. Both the CPU counts and the memory footprint can support a great Kafka cluster, connecting it over InfiniBand to the main HDFS data stores. Again, while these nodes are very IO dense for Kafka, the simplicity of using the same nodes for any of the workloads makes Big Data Appliance a great Big Data platform choice.

What is in the Box? Apart from the hardware specs, the software that is included in Big Data Appliance enables the data lake creation in a single software & hardware combination. Big Data Appliance comes with the full Cloudera stack, enabling the data lake as drawn above, with Kafka, HDFS and Spark all included in the cost of the system. The specific licensing for Big Data Appliance makes the implementation cost effective and, added to the simplicity of a single form factor, makes Big Data Appliance an ideal platform to implement and grow the data lake into a successful venture.


Hadoop Best Practices

Secure your Hadoop Cluster

Security is a very important aspect of many projects and you must not underestimate it. Hadoop security is complex and consists of many components; it's better to enable the security features one by one. Before starting the explanation of the different security options, I'll share some materials that will help you get familiar with the algorithms and technologies that underpin many security features in Hadoop.

Before you begin. First of all, I recommend that you watch this excellent video series, which explains how asymmetric keys and the RSA algorithm work (this is the basis for SSL/TLS). Then read this blog about TLS/SSL principles. Also, if you mix up terms such as Active Directory, LDAP, OpenLDAP and so on, it will be useful to check this page. After you get familiar with the concepts, you can concentrate on the implementation scenarios with Hadoop.

Security building blocks. There are a few building blocks of a secure system:
- Authentication. Answers the question of who you are - like passport validation. For example, if someone says that he is Bill Smith, he has to prove it (pass authentication) with a certain document (like a passport).
- Authorization. After passing authentication a user can be trusted (we checked his passport and made sure he is Bill Smith), and the next question is what this user is allowed to do on the cluster (usually a cluster is shared between multiple users and groups, and not every dataset should be available to everyone). This is authorization.
- Encryption in motion. Hadoop is a distributed system, so data definitely moves over the network, and this traffic can be intercepted. To prevent this we have to encrypt it. This is called encryption in motion.
- Encryption at rest. Data is stored on hard disks, and a privileged user (like root) may have access to those disks and read any directories, including the ones that store Hadoop data. Disks can also be physically stolen and mounted on another machine for later access. To protect your data from this kind of vulnerability, you have to encrypt it on disk. This is called encryption at rest.
- Audit. Sometimes a data breach happens and the only thing you can do is determine the channel of the breach. For this you need audit tools.

Step 1. Authentication in Hadoop. Motivation. By design, Hadoop doesn't have any security. So if you spin up a cluster, by default it's insecure. It assumes a level of trust, and assumes that only trusted users have access to the cluster. In HDFS, files and folders have permissions - similar to Linux - and users access files based on access control lists, or ACLs. See the diagram below that highlights different access paths into the cluster: shell/CLI, JDBC, and tools like Hue. As you can see below, each file/folder has access privileges assigned. The oracle user is the owner of the items, and other users can access the items based on the access control definition (e.g. -rw-r--r-- means that oracle can read/write the file, users in the oinstall group can read the file, and the rest of the world can also read the file).
$ hadoop fs -ls
Found 57 items
drwxr-xr-x   - oracle hadoop          0 2017-05-25 15:20 binTest
drwxr-xr-x   - oracle hadoop          0 2017-05-30 14:04 clustJobJune
drwxr-xr-x   - oracle hadoop          0 2017-05-26 11:47 exercise0
drwxr-xr-x   - oracle hadoop          0 2017-05-22 16:07 hierarchyIndex
drwxr-xr-x   - oracle hadoop          0 2017-05-22 16:15 hierarchyIndexWithCities
drwxr-xr-x   - oracle hadoop          0 2017-05-22 16:46 hive_examples
...

At this very "relaxed" security level it is very easy to subvert these ACLs. Because there is no authentication, a user can impersonate someone else; the identity is determined by the current login identity on the client machine. So, as shown below, a malicious user can define the "hdfs" user (a power user in Hadoop) on their local machine, access the cluster, and then delete important financial and healthcare data. Additionally, accessing data through tools like Hive is also completely open. The user that is passed as part of the JDBC connection is used for data authorization. Note that you can specify any(!) user that you want - there is no authentication! So all data in Hive is open for query. If you care about the data, then this is a real problem.

Step 1. Authentication in Hadoop. Edge node. Instead of enabling connectivity from any client, an edge node (you may think of it as a client node) is created; users log into it and it has access to the cluster. Access to the Hadoop cluster from servers other than this edge node is prohibited. The edge node is used to:
- Run jobs and interact with the Hadoop cluster
- Run all gateway services of the Hadoop components
- Establish user identity (trusted authentication happens during login to this node)

Because a user logs into this edge node and does not have the ability to alter his or her identity, the identity can be trusted. This means that HDFS ACLs now have some meaning:
- User identity is established on the edge node
- Clients connect only through known access paths and hosts

Note: HDFS also has an extended ACL feature, which allows extended access lists, so you can grant permissions outside of the owning group.

$ hadoop fs -mkdir /user/oracle/test_dir
$ hdfs dfs -getfacl /user/oracle/test_dir
# file: /user/oracle/test_dir
# owner: oracle
# group: hadoop
user::rwx
group::r-x
other::r-x

$ hdfs dfs -setfacl -m user:ben:rw- /user/oracle/test_dir
$ hdfs dfs -getfacl /user/oracle/test_dir
# file: /user/oracle/test_dir
# owner: oracle
# group: hadoop
user::rwx
user:ben:rw-
group::r-x
mask::rwx
other::r-x

Challenge: JDBC is still insecure - the user identified in the JDBC connect string is not authenticated. Here is an example of how I can use the beeline tool from the CLI to work on behalf of a "superuser" who may do whatever he wants on the cluster:

$ beeline
...
beeline> !connect jdbc:hive2://localhost:10000/default;
...
Enter username for jdbc:hive2://localhost:10000/default;: superuser
Enter password for jdbc:hive2://localhost:10000/default;: *
...
0: jdbc:hive2://localhost:10000/default> select current_user();
...
+------------+--+
|    _c0     |
+------------+--+
| superuser  |
+------------+--+

To ensure that identities are trusted, we need to introduce a capability that you probably use all the time without even knowing it: Kerberos.

Step 1. Authentication in Hadoop. Kerberos. Kerberos ensures that both users and services are authenticated.
Kerberos is *the* authentication mechanism for Hadoop deployments: before interacting with the cluster, a user has to obtain a Kerberos ticket (think of it like a passport). The two most common ways to use Kerberos with Oracle Big Data Appliance are: a) Active Directory Kerberos, or b) MIT Kerberos. On the Oracle support site you will find step-by-step instructions for enabling both configurations. Check “BDA V4.2 and Higher Active Directory Kerberos Install and Upgrade Frequently Asked Questions (FAQ) (Doc ID 2013585.1)” for the Active Directory implementation and “Instructions to Enable Kerberos on Oracle Big Data Appliance with Mammoth V3.1/V4.* Release (Doc ID 1919445.1)” for MIT Kerberos. Oracle recommends using local MIT Kerberos for system services like HDFS and YARN, and AD Kerberos for human users (like user John Smith). Also, if you want to set up a trust relationship between them (to make it possible for AD users to work with the cluster), follow the support note “How to Set up a Cross-Realm Trust to Configure a BDA MIT Kerberos Enabled Cluster with Active Directory on BDA V4.5 and Higher (Doc ID 2198152.1)”.

Note: Big Data Appliance greatly simplifies the implementation of a highly available Kerberos deployment on a Hadoop cluster. You do not need to (and should not) set up the Kerberos settings manually; use the tools provided by BDA.

Using MOS 1919445.1 I enabled MIT Kerberos on my BDA cluster. So, what has changed in my daily life with the Hadoop cluster? First of all, I try to list files in HDFS:

# id
uid=0(root) gid=0(root) groups=0(root) context=unconfined_u:unconfined_r:unconfined_t:s0-s0:c0.c1023
# hadoop fs -ls /
...
Caused by: java.io.IOException: javax.security.sasl.SaslException: GSS initiate failed [Caused by GSSException: No valid credentials provided (Mechanism level: Failed to find any Kerberos tgt)]
...

Oops... something is missing. I cannot access the data in HDFS - "No valid credentials provided". In order to gain access to HDFS, I must first obtain a Kerberos ticket:

# kinit oracle/scajvm1bda01.vm.oracle.com
kinit: Client not found in Kerberos database while getting initial credentials

Still not able to access HDFS! That's because the user principal must first be added to the Key Distribution Center, or KDC. As the Kerberos admin, add the principal:

# kadmin.local -q "addprinc oracle/scajvm1bda01.vm.oracle.com"

Now I can successfully obtain the Kerberos ticket:

# kinit oracle/scajvm1bda01.vm.oracle.com
Password for oracle@ORACLE.TEST:

Practical Tip: Use keytab files. Here we go! I'm ready to work with my Hadoop cluster, but I don't want to enter the password every single time I obtain a ticket (this is important for services as well). For this, I need to create a keytab file.
# kadmin.local
kadmin.local: xst -norandkey -k oracle.scajvm1bda01.vm.oracle.com.keytab oracle/scajvm1bda01.vm.oracle.com

and I can obtain a new Kerberos ticket without the password:

# kinit -kt oracle.scajvm1bda01.vm.oracle.com.keytab oracle/scajvm1bda01.vm.oracle.com
# klist
Ticket cache: FILE:/tmp/krb5cc_0
Default principal: oracle/scajvm1bda01.vm.oracle.com@ORALCE.TEST
Valid starting     Expires            Service principal
07/28/17 12:59:28  07/29/17 12:59:28  krbtgt/ORALCE.TEST@ORALCE.TEST
        renew until 08/04/17 12:59:28

Now I can work with the Hadoop cluster on behalf of the oracle user:

# hadoop fs -ls /user/
Found 7 items
drwx------   - hdfs   supergroup          0 2017-07-27 18:21 /user/hdfs
drwxrwxrwx   - mapred hadoop              0 2017-07-27 00:32 /user/history
drwxr-xr-x   - hive   supergroup          0 2017-07-27 00:33 /user/hive
drwxrwxr-x   - hue    hue                 0 2017-07-27 00:32 /user/hue
drwxr-xr-x   - oozie  hadoop              0 2017-07-27 00:34 /user/oozie
drwxr-xr-x   - oracle hadoop              0 2017-07-27 18:57 /user/oracle
drwxr-x--x   - spark  spark               0 2017-07-27 00:34 /user/spark
# hadoop fs -ls /user/oracle
Found 4 items
drwx------   - oracle hadoop          0 2017-07-27 19:00 /user/oracle/.Trash
drwxr-xr-x   - oracle hadoop          0 2017-07-27 18:54 /user/oracle/.sparkStaging
drwx------   - oracle hadoop          0 2017-07-27 18:57 /user/oracle/.staging
drwxr-xr-x   - oracle hadoop          0 2017-07-27 18:57 /user/oracle/oozie-oozi
# hadoop fs -ls /user/spark
ls: Permission denied: user=oracle, access=READ_EXECUTE, inode="/user/spark":spark:spark:drwxr-x--x

WARNING: please keep in mind that you have to store keytab files in a safe directory and set their permissions carefully (anyone who can read a keytab can impersonate its principal).

It's interesting to note that if you work on the Hadoop servers themselves, many keytab files already exist there; for example, you can easily obtain an HDFS ticket. To get the list of principals for a given keytab file, just run:

# klist -ket `find / -name hdfs.keytab -printf "%T+\t%p\n" | sort|tail -1|cut -f 2`

and to obtain the ticket, run:

# kinit -kt `find / -name hdfs.keytab -printf "%T+\t%p\n" | sort|tail -1|cut -f 2` hdfs/`hostname`

Practical Tip: Debug kinit. To debug kinit, export the Linux environment variable KRB5_TRACE=/dev/stdout.
Here is an example:

$ kinit -kt /opt/kafka/security/testuser.keytab testuser@BDACLOUDSERVICE.ORACLE.COM
$ export KRB5_TRACE=/dev/stdout
$ kinit -kt /opt/kafka/security/testuser.keytab testuser@BDACLOUDSERVICE.ORACLE.COM
[88092] 1529001733.290407: Getting initial credentials for testuser@BDACLOUDSERVICE.ORACLE.COM
[88092] 1529001733.292692: Looked up etypes in keytab: aes256-cts, aes128-cts, des3-cbc-sha1, rc4-hmac, des-hmac-sha1, des, des-cbc-crc
[88092] 1529001733.292734: Sending request (230 bytes) to BDACLOUDSERVICE.ORACLE.COM
[88092] 1529001733.292863: Resolving hostname cfclbv3870.us2.oraclecloud.com
[88092] 1529001733.293130: Sending initial UDP request to dgram 10.196.64.44:88
[88092] 1529001733.293587: Received answer from dgram 10.196.64.44:88
[88092] 1529001733.293663: Response was not from master KDC
[88092] 1529001733.293732: Processing preauth types: 19
[88092] 1529001733.293773: Selected etype info: etype aes256-cts, salt "(null)", params ""
[88092] 1529001733.293802: Produced preauth for next request: (empty)
[88092] 1529001733.293830: Salt derived from principal: BDACLOUDSERVICE.ORACLE.COMtestuser
[88092] 1529001733.293857: Getting AS key, salt "BDACLOUDSERVICE.ORACLE.COMtestuser", params ""
[88092] 1529001733.293958: Retrieving testuser@BDACLOUDSERVICE.ORACLE.COM from FILE:/opt/kafka/security/testuser.keytab (vno 0, enctype aes256-cts) with result: 0/Success
[88092] 1529001733.294013: AS key obtained from gak_fct: aes256-cts/7606
[88092] 1529001733.294095: Decrypted AS reply; session key is: aes256-cts/D28B
[88092] 1529001733.294214: FAST negotiation: available
[88092] 1529001733.294264: Initializing FILE:/tmp/krb5cc_1001 with default princ testuser@BDACLOUDSERVICE.ORACLE.COM
[88092] 1529001733.294408: Removing testuser@BDACLOUDSERVICE.ORACLE.COM -> krbtgt/BDACLOUDSERVICE.ORACLE.COM@BDACLOUDSERVICE.ORACLE.COM from FILE:/tmp/krb5cc_1001
[88092] 1529001733.294450: Storing testuser@BDACLOUDSERVICE.ORACLE.COM -> krbtgt/BDACLOUDSERVICE.ORACLE.COM@BDACLOUDSERVICE.ORACLE.COM in FILE:/tmp/krb5cc_1001
[88092] 1529001733.294534: Storing config in FILE:/tmp/krb5cc_1001 for krbtgt/BDACLOUDSERVICE.ORACLE.COM@BDACLOUDSERVICE.ORACLE.COM: fast_avail: yes
[88092] 1529001733.294584: Removing testuser@BDACLOUDSERVICE.ORACLE.COM -> krb5_ccache_conf_data/fast_avail/krbtgt\/BDACLOUDSERVICE.ORACLE.COM\@BDACLOUDSERVICE.ORACLE.COM@X-CACHECONF: from FILE:/tmp/krb5cc_1001
[88092] 1529001733.294618: Storing testuser@BDACLOUDSERVICE.ORACLE.COM -> krb5_ccache_conf_data/fast_avail/krbtgt\/BDACLOUDSERVICE.ORACLE.COM\@BDACLOUDSERVICE.ORACLE.COM@X-CACHECONF: in FILE:/tm/krb5cc_1001

Practical Tip: Obtaining a Kerberos ticket without access to the KDC. In my experience there can be cases when a client machine cannot access the KDC directly, but still needs to work with Kerberos-protected resources. Here is a workaround.
First, go to a machine that has access to the KDC and generate a ticket cache:

$ cp /etc/krb5.conf /tmp/TMP_TICKET_CACHE/krb5.conf
$ export KRB5_CONFIG=/tmp/TMP_TICKET_CACHE/krb5.conf
$ export KRB5CCNAME=DIR:/tmp/TMP_TICKET_CACHE/
$ kinit oracle
Password for oracle@BDACLOUDSERVICE.ORACLE.COM:

Copy it to the client machine:

afilanov:ssh afilanov$ scp -i id_rsa_new.dat opc@cfclbv3872.us2.oraclecloud.com:/tmp/TMP_TICKET_CACHE/* /tmp/
Enter passphrase for key 'id_rsa_new.dat':
krb5.conf                100%  795    12.3KB/s   00:00
primary                  100%   10     0.2KB/s   00:00
tkt0kVvY6                100%  874    13.6KB/s   00:00

Point the environment at the copied ticket cache file and check that the current user has it:

afilanov:ssh afilanov$ export KRB5_CONFIG=/tmp/krb5.conf
afilanov:ssh afilanov$ export KRB5CCNAME=/tmp/tkt0kVvY6
afilanov:ssh afilanov$ klist
Credentials cache: FILE:/tmp/tkt0kVvY6
        Principal: oracle@BDACLOUDSERVICE.ORACLE.COM
    Issued                Expires               Principal
Jun 18 09:33:10 2018  Jun 19 09:33:10 2018  krbtgt/BDACLOUDSERVICE.ORACLE.COM@BDACLOUDSERVICE.ORACLE.COM

Practical Tip: Access to the WebUI of a Kerberized cluster from a Windows machine. If you want (and you certainly do) to access the WebUIs of your cluster (Resource Manager, JobHistory) from your Windows browser, you will need to obtain a Kerberos ticket on your Windows machine. Here you can find great step-by-step instructions.

Step 1. Authentication in Hadoop. Integrating MIT Kerberos with Active Directory. It is quite common for companies to use an Active Directory server to manage users and groups and to want to give those users access to the secure Hadoop cluster according to their roles and permissions. For example, I have my corporate login afilanov and I want to work with the Hadoop cluster as afilanov. To do this you have to build a trust relationship between Active Directory and the MIT KDC on BDA. You can find all the details in MOS: "How to Set up a Cross-Realm Trust to Configure a BDA MIT Kerberos Enabled Cluster with Active Directory on BDA V4.5 and Higher (Doc ID 2198152.1)", but here I'll show a quick example of how it works. First, I log in to the Active Directory server and configure the trust relationship with my BDA KDC:

C:\Users\Administrator> netdom trust ORALCE.TEST /Domain:BDA.COM /add /realm /passwordt:welcome1
C:\Users\Administrator> ksetup /SetEncTypeAttr ORALCE.TEST AES256-CTS-HMAC-SHA1-96

After this I create a user in AD. I've skipped the explanations here because you can find all the details in MOS; I just wanted to show that the user is created on the AD side (not on the Hadoop side). On the KDC side you have to create one more principal as well:

# kadmin.local
Authenticating as principal hdfs/admin@ORALCE.TEST with password.
kadmin.local:  addprinc -e "aes256-cts:normal" krbtgt/ORALCE.TEST@BDA.COM # direction from MIT to AD

After this we are ready to use our corporate login/password to work with Hadoop on behalf of this user:

# kinit afilanov@BDA.COM
Password for afilanov@BDA.COM:
# hadoop fs -put file.txt /tmp/
# hadoop fs -ls /tmp/file.txt
-rw-r--r--   3 afilanov supergroup          5 2017-08-08 18:00 /tmp/file.txt

Step 1. Authentication in Hadoop. SSSD integration. Now we can obtain a Kerberos ticket and work with Hadoop as a certain user. It's important to note that on the OS we can be logged in as any user (e.g. we could be logged in as root and work with the Hadoop cluster as a user from AD). Here is an example:

# kinit afilanov@BDA.COM
Password for afilanov@BDA.COM:
# klist
Ticket cache: FILE:/tmp/krb5cc_0
Default principal: afilanov@BDA.COM

Valid starting     Expires            Service principal
09/04/17 17:29:24  09/05/17 03:29:16  krbtgt/BDA.COM@BDA.COM
        renew until 09/11/17 17:29:24
# id
uid=0(root) gid=0(root) groups=0(root) context=unconfined_u:unconfined_r:unconfined_t:s0-s0:c0.c1023

For Hadoop, it's important to have users (and their groups) available through the OS (the user has to exist on each node in the cluster). Services like HDFS perform lookups at the OS level to determine which groups a user belongs to, and then use that information to authorize access to files and folders. But what if I want OS users to come from Active Directory? This is where SSSD steps in. It is a PAM module that forwards user lookups to Active Directory. This means that you don't need to replicate user/group information at the OS level; it simply leverages the information found in Active Directory. See the Oracle Support site for MOS notes about how to set it up (well-written, detailed instructions): Guidelines for Active Directory Organizational Unit Setup Required for BDA 4.9 and Higher SSSD Setup (Doc ID 2289768.1) and How to Set up an SSSD on BDA V4.9 and Higher (Doc ID 2298831.1). After you complete all the steps listed there, you can use an AD user/password to log in to the Linux servers of your Hadoop cluster. Here is an example:

# id
uid=0(root) gid=0(root) groups=0(root) context=unconfined_u:unconfined_r:unconfined_t:s0-s0:c0.c1023
# ssh afilanov@$(hostname)
afilanov@scajvm1bda01.vm.oracle.com's password:
Last login: Fri Sep  1 20:16:37 2017 from scajvm1bda01.vm.oracle.com
$ id
uid=825201368(afilanov) gid=825200513(domain users) groups=825200513(domain users) context=unconfined_u:unconfined_r:unconfined_t:s0-s0:c0.c1023

Step 2. Authorization in Hadoop. Sentry. Another powerful security capability of Hadoop is role-based access for Hive queries. In the Cloudera distribution it is managed by Sentry, and Kerberos is required for the Sentry installation. As with many things on Big Data Appliance, the installation of Sentry is automated and can be done with one command:

# bdacli enable sentry

and then follow the tool's prompts. You can find more information in MOS "How to Add or Remove Sentry on Oracle Big Data Appliance v4.2 or Higher with bdacli (Doc ID 2052733.1)". After you enable Sentry, you can create and enable Sentry policies. I want to mention that Sentry has a strict hierarchy: users belong to groups, groups are assigned roles, and roles have privileges. You have to follow this hierarchy - you can't assign privileges directly to a user or a group. I will show how to set up these policies with HUE.
For this test case I'm going to create test data and load it into HDFS:

# echo "1, Finance, Bob,100000">> emp.file
# echo "2, Marketing, John,70000">> emp.file
# hadoop fs -mkdir /tmp/emp_table/; hadoop fs -put emp.file /tmp/emp_table/

After creating the file, I log in to Hive and create an external table (for power users) and a view with a restricted set of columns (for limited users):

hive> create external table emp(id int,division string,  name string, salary int) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' location "/tmp/emp_table";
hive> create view emp_limited as select division, name from emp;

Once the data is in my cluster, I create test users across all nodes in the cluster (or, if I'm using AD, create the users/groups there):

# dcli -C "useradd limited_user"
# dcli -C "groupadd limited_grp"
# dcli -C "usermod -g limited_grp limited_user"
# dcli -C "useradd power_user"
# dcli -C "groupadd power_grp"
# dcli -C "usermod -g power_grp power_user"

Now let's use the user-friendly HUE graphical interface. First, go to the Security section; you can see that we have two objects and no security rules yet. Then I click the "Add policy" link and create a policy for the power user, which allows it to read the "emp" table. After this I do the same for limited_user, but allow it to read only the emp_limited view. Here are my roles with their policies. Now I log in as "limited_user" and ask for the list of tables: only emp_limited is available. Let's query it: perfect. The table "emp" is not in my list of tables, but let's imagine I know its name and try to query it anyway. My attempt fails because of lack of privileges. This demonstrates table-level granularity, but Sentry also lets you restrict access to certain columns. I'm going to reconfigure power_role and allow it to select all columns except "salary". After this I run test queries with and without the salary column. Here we go: if I list "salary" in the SELECT statement, my query fails because of lack of privileges.

Practical Tip: Useful Sentry commands. Alternatively, you can use the beeline CLI to view and manage Sentry roles. Below I'll show how to manage privileges with beeline. First, I obtain a Kerberos ticket for the hive user ("hive" is the admin) and connect with beeline; then I drop a role and create it again, assign some privileges to it, and link the role to a group. After that I log in as limited_user, who belongs to limited_grp, and check the permissions:

# klist
Ticket cache: FILE:/tmp/krb5cc_0
Default principal: hive/scajvm1bda04.vm.oracle.com@ORALCE.TEST
# beeline
beeline> !connect jdbc:hive2://scajvm1bda04.vm.oracle.com:10000/default;ssl=true;sslTrustStore=/opt/cloudera/security/jks/scajvm.truststore;trustStorePassword=Hh1SDy8wMyKpoz25vgIB7fMwjJkkaLtA6FR4SzW1bULVs9dKhgVNQwvaoRy1GVbN;principal=hive/scajvm1bda04.vm.oracle.com@ORALCE.TEST;
...
// Check which roles Sentry has
0: jdbc:hive2://scajvm1bda04.vm.oracle.com:10> SHOW ROLES;
...
+---------------+--+
|     role      |
+---------------+--+
| limited_role  |
| admin         |
| power_role    |
| admin_role    |
+---------------+--+
4 rows selected (0.146 seconds)
// Check which role the current session has
0: jdbc:hive2://scajvm1bda04.vm.oracle.com:10> SHOW CURRENT ROLES;
...
+-------------+--+
|    role     |
+-------------+--+
| admin_role  |
+-------------+--+
// drop the role
0: jdbc:hive2://scajvm1bda04.vm.oracle.com:10> drop role limited_role;
...
// create the role again
0: jdbc:hive2://scajvm1bda04.vm.oracle.com:10> create role limited_role;
...
// grant a privilege to the newly created role
0: jdbc:hive2://scajvm1bda04.vm.oracle.com:10> grant select on emp_limited to role limited_role;
...
// link the role and the group
0: jdbc:hive2://scajvm1bda04.vm.oracle.com:10> grant role limited_role to group limited_grp;
...
0: jdbc:hive2://scajvm1bda04.vm.oracle.com:10> !exit

Go back to Linux to get a ticket for another user:

// obtain a ticket for limited_user
# kinit limited_user
Password for limited_user@ORALCE.TEST:
# klist
Ticket cache: FILE:/tmp/krb5cc_0
Default principal: limited_user@ORALCE.TEST

// check the groups for this user
# hdfs groups limited_user
limited_user : limited_grp
# beeline
...
beeline> !connect jdbc:hive2://scajvm1bda04.vm.oracle.com:10000/default;ssl=true;sslTrustStore=/opt/cloudera/security/jks/scajvm.truststore;trustStorePassword=Hh1SDy8wMyKpoz25vgIB7fMwjJkkaLtA6FR4SzW1bULVs9dKhgVNQwvaoRy1GVbN;principal=hive/scajvm1bda04.vm.oracle.com@ORALCE.TEST;
...
0: jdbc:hive2://scajvm1bda04.vm.oracle.com:10> show current roles;
...
+---------------+--+
|     role      |
+---------------+--+
| limited_role  |
+---------------+--+

// Check my privileges
0: jdbc:hive2://scajvm1bda04.vm.oracle.com:10>  show grant role limited_role;
+-----------+--------------+------------+---------+-----------------+-----------------+------------+---------------+-------------------+----------+--+
| database  |    table     | partition  | column  | principal_name  | principal_type  | privilege  | grant_option  |    grant_time     | grantor  |
+-----------+--------------+------------+---------+-----------------+-----------------+------------+---------------+-------------------+----------+--+
| default   | emp_limited  |            |         | limited_role    | ROLE            | select     | false         | 1502250173482000  | --       |
+-----------+--------------+------------+---------+-----------------+-----------------+------------+---------------+-------------------+----------+--+

// try to select a table I'm not permitted to query
0: jdbc:hive2://scajvm1bda04.vm.oracle.com:10> select * from emp;
Error: Error while compiling statement: FAILED: SemanticException No valid privileges
 User limited_user does not have privileges for QUERY
 The required privileges: Server=server1->Db=default->Table=emp->Column=division->action=select; (state=42000,code=40000)

// select from the view we are allowed to query
0: jdbc:hive2://scajvm1bda04.vm.oracle.com:10> select * from emp_limited;
...
+-----------------------+-------------------+--+
| emp_limited.division  | emp_limited.name  |
+-----------------------+-------------------+--+
|  Finance              |  Bob              |
|  Marketing            |  John             |
+-----------------------+-------------------+--+

Step 3. Encryption in Motion. Another important aspect of security is network encryption. Even if you protect access to the servers, somebody may listen to the network between the cluster and a client and intercept network packets for later analysis. Here is an example of how it can be exploited (note: my cluster is already Kerberized).
Intruder server:
# tcpdump -XX -i eth3 > tcpdump.file
------------------------------------
Client machine:
# echo "encrypt your data" > test.file
# hadoop fs -put test.file /tmp/
------------------------------------
Intruder server:
# less tcpdump.file |grep -B 1 data
        0x0060:  0034 ff48 2a65 6e63 7279 7074 2079 6f75  .4.H*encrypt.you
        0x0070:  7220 6461 7461 0a                        r.data.

Now we are hacked. Fortunately, Hadoop has the capability to protect the network between clients and the cluster. It will cost you some performance, but the impact should not prevent you from enabling network encryption between clients and the cluster. Before enabling encryption I ran a simple performance test:

# hadoop jar hadoop-mapreduce-examples.jar teragen 100000000 /teraInput
...
# hadoop jar hadoop-mapreduce-examples.jar terasort /teraInput /teraOutput
...

Both jobs took 3.7 minutes; remember that for later. Fortunately, Oracle Big Data Appliance provides an easy way to enable network encryption with bdacli. You can set it up with one command:

# bdacli enable hdfs_encrypted_data_transport
...

You will need to answer some questions about your specific cluster configuration, such as the Cloudera Manager admin password and OS passwords. After the command finished, I ran the performance test again:

# hadoop jar hadoop-mapreduce-examples.jar teragen 100000000 /teraInput
...
# hadoop jar hadoop-mapreduce-examples.jar terasort /teraInput /teraOutput
...

and now it took 4.5 and 4.2 minutes respectively. The jobs run a bit more slowly, but it is worth it. For better performance when transferring encrypted data, we can take advantage of Intel's embedded instructions and change the encryption algorithm (go to Cloudera Manager -> HDFS -> Configuration -> dfs.encrypt.data.transfer.algorithm -> AES/CTR/NoPadding).

Another vulnerability is network interception during the shuffle (the step between the Map and Reduce operations) and communication between clients. To prevent this, you have to encrypt the shuffle traffic. BDA again has an easy solution: run the bdacli enable hadoop_network_encryption command. The intermediate files generated after the shuffle step will now also be encrypted:

# bdacli enable hadoop_network_encryption
...

As in the previous example, simply answer a few questions and the encryption will be enabled. Let's check the performance numbers again:

# hadoop jar hadoop-mapreduce-examples.jar teragen 100000000 /teraInput
...
# hadoop jar hadoop-mapreduce-examples.jar terasort /teraInput /teraOutput
...

Now we get 5.1 and 4.4 minutes respectively. A bit slower, but it's important in order to keep the data safe. Conclusion:
- Network encryption prevents intruders from capturing data in flight
- Network encryption degrades your overall performance; however, this degradation shouldn't stop you from enabling it, because it's very important from a security perspective

Step 4. Encryption at Rest. HDFS transparent encryption. All right - we have now protected the cluster from external unauthorized access (by enabling Kerberos), encrypted network communication between the cluster and clients, and encrypted intermediate files, but we still have vulnerabilities. If a user gets access to one of the cluster's servers, he or she can read the data (despite the ACLs). Let me give you an example.
An ordinary user puts sensitive information in a file and loads it into HDFS:

# echo "sensetive information here" > test.file
# hadoop fs -put test.file /tmp/test.file

An intruder knows the file name and wants to get its content (the sensitive information):

# hdfs fsck /tmp/test.file -locations -files -blocks
...
0. BP-421546782-192.168.254.66-1501133071977:blk_1073747034_6210 len=27 Live_repl=3
...
# find / -name "blk_1073747034*"
/u02/hadoop/dfs/current/BP-421546782-192.168.254.66-1501133071977/current/finalized/subdir0/subdir20/blk_1073747034_6210.meta
/u02/hadoop/dfs/current/BP-421546782-192.168.254.66-1501133071977/current/finalized/subdir0/subdir20/blk_1073747034
# cat /u02/hadoop/dfs/current/BP-421546782-192.168.254.66-1501133071977/current/finalized/subdir0/subdir20/blk_1073747034
sensetive information here

The attacker found the blocks that store the data and then read the physical files at the OS level. It was that easy. What can you do to prevent this? The answer is HDFS encryption. Again, BDA has a single command to enable HDFS transparent encryption: bdacli enable hdfs_transparent_encryption. You can find more details in MOS "How to Enable/Disable HDFS Transparent Encryption on Oracle Big Data Appliance V4.4 with bdacli (Doc ID 2111343.1)". I'd also like to note that Cloudera has a great blog post about HDFS transparent encryption and I recommend reading it. So, after encryption has been enabled, I'll repeat my previous test case. Prior to running the test, we create an encryption zone and copy files into that zone.

// obtain an hdfs Kerberos ticket to work with the cluster on behalf of the hdfs user
# kinit -kt `find / -name hdfs.keytab -printf "%T+\t%p\n" | sort|tail -1|cut -f 2` hdfs/`hostname`
// create the directory that will become our encryption zone
# hadoop fs -mkdir /tmp/EZ/
// create a key in the Key Trustee Server
# hadoop key create myKey
// create the encryption zone, using the key created earlier
# hdfs crypto -createZone -keyName myKey -path /tmp/EZ
// make oracle the owner of this directory
# hadoop fs -chown oracle:oracle /tmp/EZ
// switch to the oracle user
# kinit -kt oracle.scajvm1bda01.vm.oracle.com.keytab oracle/scajvm1bda01.vm.oracle.com
// load a file into the encryption zone
# hadoop fs -put test.file /tmp/EZ/
// find the physical location of the file on the Linux FS
# hadoop fsck /tmp/EZ/test.file -blocks -files -locations
...
0. BP-421546782-192.168.254.66-1501133071977:blk_1073747527_6703
...
[root@scajvm1bda01 ~]# find / -name blk_1073747527*
/u02/hadoop/dfs/current/BP-421546782-192.168.254.66-1501133071977/current/finalized/subdir0/subdir22/blk_1073747527_6703.meta
/u02/hadoop/dfs/current/BP-421546782-192.168.254.66-1501133071977/current/finalized/subdir0/subdir22/blk_1073747527
// try to read the file
[root@scajvm1bda01 ~]# cat /u02/hadoop/dfs/current/BP-421546782-192.168.254.66-1501133071977/current/finalized/subdir0/subdir22/blk_1073747527
▒i#▒▒C▒x▒1▒U▒l▒▒▒[

Bingo! The file is encrypted and the person who attempted to access the data only sees a series of nonsensical bytes. Now let me say a few words about how the encryption works. There are a few types of keys (the screenshots are taken from Cloudera's blog):

1) Encryption Zone key (EZ key). You can encrypt the files in a certain directory using a unique key. This directory is called an Encryption Zone (EZ) and the key is called the EZ key. This approach can be quite useful when you share a Hadoop cluster among different divisions of the same company. This key is stored in the KMS (Key Management Server).
The KMS handles generating the encryption keys (EZ keys and DEKs), communicates with the key server, and decrypts EDEKs. 2) The Encrypted Data Encryption Key (EDEK) is an attribute of the file and is stored in the NameNode. 3) The DEK is not persisted; it is computed on the fly from the EDEK and the EZ key. Here is the flow of how a client writes data to encrypted HDFS (I took this explanation from the Hadoop Security book):
1) The HDFS client calls create() to write to the new file.
2) The NameNode requests the KMS to create a new EDEK using the EZK-id/version.
3) The KMS generates a new DEK.
4) The KMS retrieves the EZK from the key server.
5) The KMS encrypts the DEK, resulting in the EDEK.
6) The KMS provides the EDEK to the NameNode.
7) The NameNode persists the EDEK as an extended attribute for the file metadata.
8) The NameNode provides the EDEK to the HDFS client.
9) The HDFS client provides the EDEK to the KMS, requesting the DEK.
10) The KMS requests the EZK from the key server.
11) The KMS decrypts the EDEK using the EZK.
12) The KMS provides the DEK to the HDFS client.
13) The HDFS client encrypts data using the DEK.
14) The HDFS client writes the encrypted data blocks to HDFS.
For reading data, the flow is as follows:
1) The HDFS client calls open() to read a file.
2) The NameNode provides the EDEK to the client.
3) The HDFS client passes the EDEK and EZK-id/version to the KMS.
4) The KMS requests the EZK from the key server.
5) The KMS decrypts the EDEK using the EZK.
6) The KMS provides the DEK to the HDFS client.
7) The HDFS client reads the encrypted data blocks, decrypting them with the DEK.
I'd like to highlight again that all these steps are completely transparent and the end user doesn't notice any difference while working with HDFS.
Tip: Key Trustee Server and Key Trustee KMS
For those of you who are just starting to work with HDFS data encryption, the terms Key Trustee Server and KMS may be a bit confusing. Which component do you need to use, and for what purpose? From Cloudera's documentation:
Key Trustee Server is an enterprise-grade virtual safe-deposit box that stores and manages cryptographic keys. With Key Trustee Server, encryption keys are separated from the encrypted data, ensuring that sensitive data is protected in the event that unauthorized users gain access to the storage media.
Key Trustee KMS - for HDFS Transparent Encryption, Cloudera provides Key Trustee KMS, a customized Key Management Server. The KMS service is a proxy that interfaces with a backing key store on behalf of HDFS daemons and clients. Both the backing key store and the KMS implement the Hadoop KeyProvider client API. Encryption and decryption of EDEKs happen entirely on the KMS. More importantly, the client requesting creation or decryption of an EDEK never handles the EDEK's encryption key (that is, the encryption zone key).
This picture (again from Cloudera's documentation) shows that the KMS is an intermediate service between the NameNode and the Key Trustee Server:
Practical Tip: HDFS transparent encryption Linux operations
// to get the list of keys:
# hadoop key list
Listing keys for KeyProvider: org.apache.hadoop.crypto.key.kms.LoadBalancingKMSClientProvider@5386659f
myKey
// to get the list of encryption zones:
# hdfs crypto -listZones
/tmp/EZ  myKey
// to encrypt existing data, you need to copy it into an encryption zone:
# hadoop distcp /user/dir /encryption_zone
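One more handy check: you can confirm that a particular file is really covered by an encryption zone by asking HDFS for its encryption info. A minimal sketch, assuming the /tmp/EZ zone and the test.file from the example above are still in place:
// an encrypted file reports its cipher suite and EDEK; a file outside a zone reports no encryption info
# hdfs crypto -getFileEncryptionInfo -path /tmp/EZ/test.file
# hdfs crypto -getFileEncryptionInfo -path /tmp/test.file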
Step 5. Audit. Cloudera Navigator
Auditing tracks who does what on the cluster, making it easy to identify improper access attempts. Fortunately, Cloudera provides an easy and efficient way to do this: Cloudera Navigator. Cloudera Navigator is included with BDA and Big Data Cloud Service and is accessible through Cloudera Manager. You can log in with the "admin" password from Cloudera Manager. After logon, choose the "Audit" section, where you can create different audit reports, such as "which files did user afilanov create on HDFS in the last hour":
Step 6. User management and tools in Hadoop. Group Mapping
HDFS is a filesystem and, as we discussed earlier, it has ACLs for managing file permissions. As you know, those three magic numbers define the access rules for owner-group-others. But how do you determine which groups a user belongs to? There are two types of user group lookup - LDAP based and Unix shell based. In Cloudera Manager this is defined through the hadoop.security.group.mapping parameter. To check the list of groups for a certain user from the Linux console, just run:
$ hdfs groups hive
hive : hive oinstall hadoop
Step 6. User management and tools in Hadoop. Connecting to Hive from the bash console
The two most common ways of connecting to Hive from the shell are 1) the Hive CLI and 2) beeline. The first is deprecated because it bypasses the security in HiveServer2; it communicates directly with the metastore. Therefore, beeline is the recommended tool; it communicates with HiveServer2, which allows the authorization rules to engage. The Hive CLI is a big security back door, and it's highly recommended to disable it. To accomplish this, you need to configure Hive properly. You can use the hadoop.proxyuser.hive.groups parameter to allow only users belonging to the groups specified in the proxy list to connect to the metastore (the application components); as a consequence, a user who does not belong to these groups and runs the Hive CLI will not be able to connect to the metastore. Go to Cloudera Manager -> Hive -> Configuration, type "hadoop.proxyuser.hive.groups" in the search bar and add the hive, impala and hue users. Restart the Hive server. You will then be able to connect with the Hive CLI only as a privileged user (one that belongs to the hive, hue or impala groups).
# kinit -kt hive.keytab hive/scajvm1bda04.vm.oracle.com
# klist
Ticket cache: FILE:/tmp/krb5cc_0
Default principal: hive/scajvm1bda04.vm.oracle.com@ORALCE.TEST
...
# hive
...
hive> show tables;
OK
emp
emp_limited
Time taken: 1.045 seconds, Fetched: 2 row(s)
hive> exit;
# kinit afilanov@BDA.COM
Password for afilanov@BDA.COM:
# klist
Ticket cache: FILE:/tmp/krb5cc_0
Default principal: afilanov@BDA.COM
...
# hive
...
hive> show tables;
FAILED: SemanticException org.apache.hadoop.hive.ql.metadata.HiveException: org.apache.hadoop.hive.ql.metadata.HiveException: org.apache.thrift.transport.TTransportException: java.net.SocketException: Connection reset
hive>
Great, now we have locked down the old Hive CLI, so it's a good time to move to the modern beeline console. To run beeline, invoke it from the command line and use the following connection string:
beeline> !connect jdbc:hive2://<FQDN to HS2>:10000/default;principal=hive/<FQDN to HS2>@<YOUR REALM>;
For example:
# beeline
Java HotSpot(TM) 64-Bit Server VM warning: ignoring option MaxPermSize=512M; support was removed in 8.0
Beeline version 1.1.0-cdh5.11.1 by Apache Hive
beeline> !connect jdbc:hive2://scajvm1bda04.vm.oracle.com:10000/default;principal=hive/scajvm1bda04.vm.oracle.com@ORALCE.TEST;
Note: before connecting with beeline you must obtain the Kerberos ticket that is used to confirm your identity.
# kinit afilanov@BDA.COM
Password for afilanov@BDA.COM:
# klist
Ticket cache: FILE:/tmp/krb5cc_0
Default principal: afilanov@BDA.COM
Valid starting     Expires            Service principal
08/14/17 04:53:27  08/14/17 14:53:17  krbtgt/BDA.COM@BDA.COM
        renew until 08/21/17 04:53:27
Note that you may have SSL/TLS encryption enabled for Hive (Oracle recommends it). You can check by running:
$ openssl s_client -debug -connect node03.us2.oraclecloud.com:10000
If you don't have TLS/SSL encryption for Hive, you can enable it by running:
[root@node01 ~]# bdacli enable hadoop_network_encryption
If you enable Hive TLS/SSL encryption (to ensure the integrity and confidentiality of the connection between your client and the server), you need to use a somewhat tricky authentication with beeline: you have to provide the SSL trust store file and the trust store password. When a client connects to a server and the server sends its public certificate across to begin the encrypted connection, the client must determine whether it 'trusts' the server's certificate. To do this, it checks the server's certificate against a list of things it has been configured to trust, called a trust store. You can find the trust store details with the Cloudera Manager REST API:
# curl -X GET -u "admin:admin1" -k -i https://<Cloudera Manager Host>:7183/api/v15/cm/config
...
 "name" : "TRUSTSTORE_PASSWORD",
    "value" : "Hh1SDy8wMyKpoz25vgIB7fMwjJkkaLtA6FR4SzW1bULVs9dKhgVNQwvaoRy1GVbN"
  },{
    "name" : "TRUSTSTORE_PATH",
    "value" : "/opt/cloudera/security/jks/scajvm.truststore"
  }
Alternatively, you can use the bdacli tool on BDA:
# bdacli getinfo cluster_https_truststore_password
# bdacli getinfo cluster_https_truststore_path
Now you know the truststore password and the truststore path. If you doubt that they match, you can take a look at the truststore file content:
afilanov-mac:~ afilanov$ keytool -list -keystore testbdcs.truststore
Enter keystore password:
Keystore type: JKS
Keystore provider: SUN
Your keystore contains 5 entries
cfclbv3874.us2.oraclecloud.com, Apr 13, 2018, trustedCertEntry,
Certificate fingerprint (SHA1): F0:5D:28:36:99:67:FB:C0:B1:D5:B3:75:DF:D6:51:9B:DF:EB:3E:3A
cfclbv3871.us2.oraclecloud.com, Apr 13, 2018, trustedCertEntry,
Certificate fingerprint (SHA1): AF:3A:20:90:04:0A:27:B5:BD:DF:83:32:C7:4A:AF:AF:C4:97:E1:30
cfclbv3873.us2.oraclecloud.com, Apr 13, 2018, trustedCertEntry,
Certificate fingerprint (SHA1): 30:09:B9:A8:79:D7:F4:02:3F:72:8C:05:F1:A4:BF:04:9B:8B:78:CA
cfclbv3870.us2.oraclecloud.com, Apr 13, 2018, trustedCertEntry,
Certificate fingerprint (SHA1): EA:F0:38:1E:BB:89:E2:05:38:CA:F2:FB:4D:41:82:75:BE:5D:F7:88
cfclbv3872.us2.oraclecloud.com, Apr 13, 2018, trustedCertEntry,
Certificate fingerprint (SHA1): C5:7D:F2:FA:96:8C:AB:4A:D2:03:02:DA:D3:F5:0C:7C:45:8E:26:E7
After this, connect to the database. For example, in my case I used:
# beeline
Java HotSpot(TM) 64-Bit Server VM warning: ignoring option MaxPermSize=512M; support was removed in 8.0
Beeline version 1.1.0-cdh5.11.1 by Apache Hive
beeline> !connect jdbc:hive2://scajvm1bda04.vm.oracle.com:10000/default;ssl=true;sslTrustStore=/opt/cloudera/security/jks/scajvm.truststore;trustStorePassword=Hh1SDy8wMyKpoz25vgIB7fMwjJkkaLtA6FR4SzW1bULVs9dKhgVNQwvaoRy1GVbN;principal=hive/scajvm1bda04.vm.oracle.com@ORALCE.TEST;
Please note that the truststore contains only public certificates and is not a big secret; it is generally not a problem to use it in scripts and to share it.
Alternatively, if you don't want to use such a long connection string every time, you can put the truststore credentials into the Linux environment:
# export HADOOP_OPTS="-Djavax.net.ssl.trustStore=/opt/cloudera/security/jks/scajvm.truststore -Djavax.net.ssl.trustStorePassword=Hh1SDy8wMyKpoz25vgIB7fMwjJkkaLtA6FR4SzW1bULVs9dKhgVNQwvaoRy1GVbN"
# beeline
...
beeline> !connect jdbc:hive2://scajvm1bda04.vm.oracle.com:10000/default;ssl=true;principal=hive/scajvm1bda04.vm.oracle.com@ORALCE.TEST;
You can also automate all the steps above by scripting them:
$ cat beeline.login
#!/bin/bash
export CMUSR=admin
export CMPWD=admin_password
tspass=`bdacli getinfo cluster_https_truststore_password`
tspath=`bdacli getinfo cluster_https_truststore_path`
domain=`bdacli getinfo cluster_domain_name`
realm=`bdacli getinfo cluster_kerberos_realm`
hivenode=`json-select --jpx=HIVE_NODE /opt/oracle/bda/install/state/config.json`
set +o histexpand
echo "!connect jdbc:hive2://$hivenode:10000/default;ssl=true;sslTrustStore=$tspath;trustStorePassword=$tspass;principal=hive/$hivenode@$realm"
beeline -u "jdbc:hive2://$hivenode:10000/default;ssl=true;sslTrustStore=$tspath;trustStorePassword=$tspass;principal=hive/$hivenode@$realm"
$ ./beeline.login
Note: if you are using a HiveServer2 load balancer, you should specify the balancer host in the principal.
Step 6. User management and tools in Hadoop. HUE and LDAP authentication
One more interesting thing you can do with your Active Directory (or any other LDAP implementation) is to integrate HUE with LDAP and use LDAP passwords to authenticate your users in HUE. Before doing this you have to enable TLSv1 in the Java settings on the HUE server; there is a detailed MOS note on how to do this - search for: Disables TLSv1 by Default For Cloudera Manager/Hue/And in System-Wide Java Configurations (Doc ID 2250841.1). After this, you may want to watch these YouTube videos to understand how easy the integration is: Authenticate Hue with LDAP and Search Bind or Authenticate Hue with LDAP and Direct Bind. It's really not too hard. Potentially, you may need to define your base_dn, and there is a good article about how to do this. Next, you may need a bind user to make the first connection and import all the other users (here is the explanation from Cloudera Manager: "Distinguished name of the user to bind as. This is used to connect to LDAP/AD for searching user and group information. This may be left blank if the LDAP server supports anonymous binds."). For this purpose I used my AD account afilanov. After this I log in to HUE with the afilanov login/password, click on the user name and choose "Manage Users", then Add/Sync LDAP users. Optionally, you can put in a Username pattern and click Sync. Here we go! Now we have the list of LDAP users imported into HUE, and we can use any of these accounts to log in to HUE.
Appendix A. Performance impact of HDFS Transparent Encryption
As we saw in this post, HDFS transparent encryption is a good way to protect your data. But many users, remembering that nothing is free, are curious how expensive it is. Let's enable HDFS transparent encryption and then run a test case to measure the performance degradation.
First of all, we need to create an encryption zone (as the HDFS user):
# kinit -kt `find / -name hdfs.keytab -printf "%T+\t%p\n" | sort|tail -1|cut -f 2` hdfs/`hostname`
# hadoop key create myKey
# hadoop fs -mkdir /tmp/EZ/
# hdfs crypto -createZone -keyName myKey -path /tmp/EZ/
# hadoop fs -chown oracle:oracle /tmp/EZ
After this we need to copy the data over there. I used distcp:
# hadoop distcp -m 200 -skipcrccheck -update /user/hive/warehouse/parq.db /tmp/EZ/parq_encrypted.db
Please note that you have to use the -update and -skipcrccheck flags to copy into the secured zone. To generate the metadata, I used the following script:
#!/bin/bash
rm -f tableNames.txt
rm -f HiveTableDDL.txt
hive -e "use $1; show tables;" > tableNames.txt
wait
cat tableNames.txt |while read LINE
   do
   hive -e "use $1;show create table $LINE" >>HiveTableDDL.txt
   echo  -e "\n" >> HiveTableDDL.txt
   done
rm -f tableNames.txt
echo "Table DDL generated"
Note: for a small database you could use export/import scripts instead; this approach is much slower than the distcp option, but more convenient. To export an entire database, run:
# cat export_hive.sh
#!/bin/bash
EXPORT_DIR=/tmp/export
EXPORT_DB=parq
HiveTables=$(hive -e "use $EXPORT_DB;show tables;" 2>/dev/null | egrep -v "WARN|^$|^Logging|^OK|^Time\ taken")
hdfs dfs -mkdir -p $EXPORT_DB 2>/dev/null
for Table in $HiveTables
do
    hive -e "EXPORT TABLE $EXPORT_DB.$Table TO '$EXPORT_DIR/$EXPORT_DB.db/$Table';"
done
and for the import, run:
# cat import_hive.sh
#!/bin/bash
EXPORT_DIR=/tmp/export
IMPORT_DIR=/tmp/EZ
EXPORT_DB=parq
IMPORT_DB=parq_encrypted
HDFS_NAMESPACE=hdfs://lab-bda-ns
HiveTables=$(hive -e "use $EXPORT_DB;show tables;" 2>/dev/null | egrep -v "WARN|^$|^Logging|^OK|^Time\ taken")
hdfs dfs -mkdir -p $EXPORT_DB 2>/dev/null
for Table in $HiveTables
do
echo "IMPORT EXTERNAL TABLE $IMPORT_DB.$Table from '$HDFS_NAMESPACE$EXPORT_DIR/$EXPORT_DB.db/$Table' LOCATION '$HDFS_NAMESPACE$IMPORT_DIR/$IMPORT_DB.db/$Table';"
hive -e "IMPORT EXTERNAL TABLE $IMPORT_DB.$Table from '$HDFS_NAMESPACE$EXPORT_DIR/$EXPORT_DB.db/$Table' LOCATION '$HDFS_NAMESPACE$IMPORT_DIR/$IMPORT_DB.db/$Table';"
done
After pointing the tables to the new secure location, we are ready to run the test. I used 60 queries from the TPC-DS benchmark with Spark SQL as the processing engine. The results were quite positive: the test run over encrypted data takes 65.8 minutes, over non-encrypted data 59.3 minutes. We may conclude that processing encrypted data is roughly 11% slower in our case. The diagrams below show detailed CPU and IO usage for both cases; for writes, there is no difference between writing to the encrypted zone and to an unencrypted directory. One very important note is that you have to use the AES/CTR/NoPadding algorithm: it improves performance drastically. For example, copying 1.1TB of data takes 27 minutes with AES versus 5.1 hours with RC4.
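Before kicking off a large copy or benchmark, it can be worth double-checking which ciphers the cluster is actually configured with. A small sketch - which of the two properties matters for your run depends on whether you are looking at wire encryption or at-rest encryption:
// encryption algorithm for HDFS data transfer (the network encryption step earlier in this post)
# hdfs getconf -confKey dfs.encrypt.data.transfer.algorithm
// cipher suite used by HDFS transparent (at-rest) encryption
# hdfs getconf -confKey hadoop.security.crypto.cipher.suite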
Appendix B. Configuring Key Trustee HDFS Transparent Encryption
There are two ways to configure HDFS Transparent Encryption on BDA clusters:
1) Have Mammoth install the Key Trustee Servers on the BDA cluster and manage them there - How to Enable HDFS Transparent Encryption with Key Trustee Servers Set Up on the BDA on Oracle Big Data Appliance V4.5 and Higher with bdacli (Doc ID 2166648.1)
2) Have the customer install the Key Trustee Servers on their own servers and configure Mammoth to point to them - How to Enable HDFS Transparent Encryption with Key Trustee Servers Set Up off the BDA on Oracle Big Data Appliance V4.4 and Higher with bdacli (Doc ID 2111343.1)
Both ways are fully supported, but the second method offers a higher level of security (because the keys are stored on different servers from HDFS), so we recommend the second method.


Big Data

How Enabling CDSW Will Help You Make Better Use of Your Big Data Appliance

No one has to elaborate on the interest and importance of Data Science, so we won't go into why you should be looking at frameworks and tools to enable AI/ML and more fun things on your Hadoop infrastructure. One way to do this on Oracle Big Data Appliance is to use Cloudera Data Science Workbench (CDSW). See at the end of this post for some information on CDSW and its benefits. How does it work? Assuming you want to go with CDSW for your data science needs, here is what is being enabled with Big Data Appliance and what we did to enable support for CDSW. CDSW will run on (a set of) edge nodes on the cluster. These nodes must adhere to some specific OS versions, and so we released a new BDA base image for edge nodes that provides Oracle Linux 7.x with UEK 4. CDSW supports Oracle Linux 7 as of CDSW 1.1 (more version information here). With the OS version squared away, we are set to support CDSW, and on a BDA (schematic shown below) with 8 nodes, you would re-image the two edges to the BDA OL7 base image, configure the network and integrate the nodes as edges into the cluster. After this you apply the CDSW install as documented by Cloudera.   As you can see in the image, the two edge nodes are running OL7, but they form an integral part of the BDA cluster. They are also covered under the embedded Cloudera Enterprise Data Hub license. The remainder of the cluster nodes, as would be done in almost all instances, remains your regular OL6 OS, with the Hadoop stack installed. Cloudera Manager if available for you to administer the cluster (no changes there of course). And that really is it. Detailed steps for Oracle customers are tested as well as published via My Oracle Support. What is Cloudera Data Science Workbench? [From Cloudera - Neither I nor Oracle take credit for the below]  The Cloudera Data Science Workbench (CDSW) is a self-service environment for data science on Cloudera Enterprise. Based on Cloudera’s acquisition of data science startup Sense.io, CDSW allows data scientists to use their favorite open source languages -- including R, Python, and Scala -- and libraries on a secure enterprise platform with native Apache Spark and Apache Hadoop integration, to accelerate analytics projects from exploration to production. CDSW delivers the following benefits: For data scientists:  Use R, Python, or Scala with their favorite libraries and frameworks, directly from a web browser. Directly access data in secure Hadoop clusters with Spark and Impala. Share insights with their entire team for reproducible, collaborative research. For IT professionals:  Give your data science team the freedom to work how they want, when they want.  Stay compliant with out-of-the-box support for full Hadoop security, especially Kerberos. Run on Private Cloud, Cloud at Customer, or Public Cloud. Read more on CDSW here. [End Cloudera bit] If you are reading this you must be interested in Analytics, AI/ML on Hadoop. This post is very cool and uses the freely downloadable Big Data Lite VM. Check it out...


Big Data

If You Struggle With Keeping your BDAs up to date, Then Read This

[Updated on October 15th to reflect the release of BDA 4.10, with CDH 5.12.1] One of the interesting aspects of keeping your Oracle Big Data infrastructure up to date (Hadoop, but also the OS and the JDK) is trying to get a hold of the latest information enabling everyone to plan their upgrades and see what is coming. The following is a list of versions released over the past quarters and a look ahead to what is coming. What is the Schedule? The intention is to release a software bundle - often referred to as a Mammoth bundle (the install utility is called Mammoth) - for our systems roughly 4-8 weeks after Cloudera releases their release. We are getting in the habit to actually release the BDA versions with the .1 update to the Cloudera version. As an example: CDH 5.11.0 was released on April 19, 2017 CDH 5.11.1 was released on June 13, 2017 BDA 4.9.0 was release on June 18, 2017, picking up 5.11.1 for both CDH and CM So, what is going on in the time between a CDH release and a BDA release? We do a few things (all on the same hardware our customers run): We pre-test with pre-GA drops and try to uncover major issues early We do the full OS, MySQL Database and Java upgrades (a Mammoth bundle ups the infra, not just Hadoop and Spark), test those and then do the below. As part of this, we also run security scans to ensure we pick up the latest security fixes - which is of course one of the reasons we update the OS with every release  We fully test deploying secure and non-secure clusters with the latest version as soon as the final bits drop (this is not weeks in advance, but when the SW is released), and we run smoke tests  We fully test upgrading clusters (secure and non-secure) from a variety of versions to the new version on BDA hardware We fully test Node Migration, which is our automated way of dealing with node failures. Eg. in the unlikely event that node 2 fails, you run a single command to migrate node 2 and you are back in full HA mode... We update the Big Data Connectors and other related components and run smoke tests We update relevant parameters to comply with best practices and with the BDA hardware profiles to optimize the clusters before we ship them Add BDA specific features like (we just enabled OL5 to OL6 migration of the clusters in BDA 4.9 as an example of such features) Make sure that we do all of this quickly again to pick up the .1 and not the .0 if we pick up .1 Stuff I tend to forget about... What is Past and what is Next The table below captures - and will be updated going forward - where we are now and what is coming next in terms of estimated timing for releases. All SUBJECT TO CHANGE WITHOUT NOTICE (also see the note below on Safe Harbor): Date BDA Version CDH Version Comments Apr 11, 2017 4.8.0 5.10.1 MySQL version uptick Jun 18, 2017 4.9.0 5.11.1 OL5 --> OL6 migration Oct 15, 2017 4.10.0 5.12.1 Final OL5 release | Faster Node migration Futures:       Dec 2017 4.11.0 5.13.x             As we move forward, we will attempt to keep this up to date, so folks can look ahead a little. For Cloudera's release sequence and times, please refer to Cloudera's communications. To configure and set up BDA systems, ensure you download the configurator utility (here) and review the documentation on Mammoth and BDA in general (here). Learn more about BDA, Mammoth on OTN. Please Note Safe Harbor Statement below: The preceding is intended to outline our general product direction. It is intended for information purposes only, and may not be incorporated into any contract. 
It is not a commitment to deliver any material, code, or functionality, and should not be relied upon in making purchasing decisions. The development, release, and timing of any features or functionality described for Oracle’s products remains at the sole discretion of Oracle.


Big Data SQL

Big Data SQL Quick Start. Binary Images and Big Data SQL – Part 22

Big Data SQL Quick Start. Binary Images and Big Data SQL – Part 22 Many thanks to Dario Vega, who is the actual author of this content, I'm just publishing it in the Big Data SQL blog. Create a hive table with a binary field and cast to BLOB type in RDBMS when using big data sql For text files, hive is storing in a base64 representation the binary fields. Normally, there is no problem with newline character and not extra work inside the Oracle database the conversion is done by Big Data SQL Using json files org.apache.hadoop.hive.serde2.SerDeException: java.io.IOException: JsonSerDe does not support BINARY type org.openx.data.jsonserde.JsonSerDe is generating null values, So you need write in base64 and do the transformation in the database. TIP : The standard ECMA-404 “The JSON Data Interchange Format” is suggesting “JSON is not indicated for applications requiring binary data” but it is used for many people including our cloud services (of course using base 64). Using ORC, parquet, avro it is working well When using avro-tools the json file is generated using base32 but each format is storing using their own representation [oracle@tvpbdaacn13 dvega]$ /usr/bin/avro-tools tojson avro.file.dvega | more {"zipcode":{"string":"00720"}, "lastname":{"string":"ALBERT"}, "firstname":{"string":"JOSE"}, "ssn":{"long":253181087}, "gender":{"string":"male"}, "license":{"bytes":"S11641384"} } [oracle@tvpbdaacn13 dvega]$ /usr/bin/parquet-tools head parquet.file.dvega zipcode = 00566 lastname = ALEXANDER firstname = PETER ssn = 637221663 gender = male license = UzY4NTkyNTc4 Simulating using Linux tools On hive: create table image_new_test (img binary); On Oracle: SQL> CREATE TABLE image_new_test ( IMG BLOB ) ORGANIZATION EXTERNAL ( TYPE ORACLE_HIVE DEFAULT DIRECTORY DEFAULT_DIR ACCESS PARAMETERS ( com.oracle.bigdata.cluster= tvpbdaacluster3 com.oracle.bigdata.tablename: pmt.image_new_test ) ); On Linux: base64 --w 10000000 YourImage.PNG > YourImage.BASE64 #Be sure to have only one line before copy to hadoop. If not fix wc -l YourImage.BASE64 # you can concat many images on the same BASE64 file - one image by line hadoop fs -put Capture.BASE64 hdfs://tvpbdaacluster3-ns/user/hive/warehouse/pmt.db/image_new_test or use load hive commands Validate using SQL Developer: Compare to the original one: Copying images stored in the database to Hadoop Original tables: SQL> create table image ( id number, img BLOB); insert an image using sqldeveloper REM create an external table to copy the dmp files to hadoop CREATE TABLE image_dmp ORGANIZATION EXTERNAL ( TYPE oracle_datapump DEFAULT DIRECTORY DEFAULT_DIR LOCATION ('filename1.dmp') ) AS SELECT * FROM image; Hive Tables: # copy files to hadoop eg. 
on /user/dvega/images/filename1.dmp CREATE EXTERNAL TABLE image_hive_dmp ROW FORMAT SERDE 'oracle.hadoop.hive.datapump.DPSerDe' STORED AS INPUTFORMAT 'oracle.hadoop.hive.datapump.DPInputFormat' OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat' LOCATION '/user/oracle/dvega/images/'; create table image_hive_text as select * from image_hive_dmp ; Big Data SQL tables: CREATE TABLE IMAGE_HIVE_DMP ( ID NUMBER , IMG BLOB ) ORGANIZATION EXTERNAL ( TYPE ORACLE_HIVE DEFAULT DIRECTORY DEFAULT_DIR ACCESS PARAMETERS ( com.oracle.bigdata.cluster= tvpbdaacluster3 com.oracle.bigdata.tablename: pmt.image_hive_dmp ) ); CREATE TABLE IMAGE_HIVE_TEXT ( ID NUMBER , IMG BLOB ) ORGANIZATION EXTERNAL ( TYPE ORACLE_HIVE DEFAULT DIRECTORY DEFAULT_DIR ACCESS PARAMETERS ( com.oracle.bigdata.cluster= tvpbdaacluster3 com.oracle.bigdata.tablename: pmt.image_hive_text ) );
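As a quick sanity check (a sketch only - the scott/tiger@//dbhost:1521/orclpdb connect string is a placeholder), you can compare the LOB sizes in the original database table with what comes back through the Big Data SQL external table; matching byte counts are a good sign the round trip preserved the binary content:
$ sqlplus -s scott/tiger@//dbhost:1521/orclpdb <<'EOF'
SELECT id, DBMS_LOB.GETLENGTH(img) AS img_bytes FROM image;
SELECT id, DBMS_LOB.GETLENGTH(img) AS img_bytes FROM image_hive_text;
EOF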


Big Data SQL

Big Data SQL Quick Start. Complex Data Types – Part 21

Big Data SQL Quick Start. Complex Data Types – Part 21 Many thanks to Dario Vega, who is the actual author of this content. I'm just publishing it on this blog. A common potentially mistaken approach that people take regarding the integration of NoSQL, Hive and ultimately BigDataSQL is to use only a RDBMS perspective and not an integration point of view. People generally think about all the features and data types they're already familiar with from their experience using one of these products; rather than realizing that the actual data is stored in the Hive (or NoSQL) database rather than RDBMS. Or without understanding that the data will be querying from RDBMS.  When using Big Data SQL with complex types, we are thinking to use JSON/SQL without taking care of differences between Oracle Database and Hive use of Complex Types. Why ? Because the complex types are mapped to varchar2 in JSON format, so we are reading the data in JSON style instead of the original system.  The Best sample of this is from a Json perspective JSON ECMA-404 - Map type does not exist.  Programming languages vary widely on whether they support objects, and if so, what characteristics and constraints the objects offer. The models of object systems can be wildly divergent and are continuing to evolve. JSON instead provides a simple notation for expressing collections of name/value pairs. Most programming languages will have some feature for representing such collections, which can go by names like record, struct, dict, map, hash, or object. The following built-in collection functions are supported in Hive: int size (Map<K.V>) Returns the number of elements in the map type. array<K> map_keys(Map<K.V>) Returns an unordered array containing the keys of the input map. array<V> map_values(Map<K.V>)Returns an unordered array containing the values of the input map. Are they supported in RDBMS? the answer is NO but may be YES if using APEX PL/SQL or JAVA programs.  In the same way, there is also a difference between Impala and Hive. Lateral views. In CDH 5.5 / Impala 2.3 and higher, Impala supports queries on complex types (STRUCT, ARRAY, or MAP), using join notation rather than the EXPLODE() keyword. See Complex Types (CDH 5.5 or higher only) for details about Impala support for complex types. The Impala complex type support produces result sets with all scalar values, and the scalar components of complex types can be used with all SQL clauses, such as GROUP BY, ORDER BY, all kinds of joins, subqueries, and inline views. The ability to process complex type data entirely in SQL reduces the need to write application-specific code in Java or other programming languages to deconstruct the underlying data structures. Best practices We would advise taking a conservative approach. This is because the mappings between the NoSQL data model, the Hive data model, and the Oracle RDBMS data model is not 1-to-1. For example, the NoSQL data model is quite a rich and there are many things one can do with nested classes in NoSQL that have no counterpart in either Hive or Oracle Database (or both). As a result, integration of the three technologies had to take a 'least-common-denominator' approach; employing mechanisms common to all three. 
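Before looking at the queries, it may help to see what a Hive table with a MAP column looks like. The DDL below is only a sketch of the kind of definition behind the rmvtable_hive_parquet table used in the examples that follow; the real table may differ:
# illustrative Hive DDL for a table with a map<string,string> column
$ hive -e "CREATE TABLE rmvtable_hive_parquet (
             zipcode   string,
             lastname  string,
             firstname string,
             ssn       bigint,
             gender    string,
             license   binary,
             phoneinfo map<string,string>)
           STORED AS PARQUET"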
Edit But let me show a sample Impala code `phoneinfo` map<string,string> impala> SELECT ZIPCODE ,LASTNAME ,FIRSTNAME ,SSN ,GENDER ,PHONEINFO.* FROM rmvtable_hive_parquet, rmvtable_hive_parquet.PHONEINFO AS PHONEINFO WHERE zipcode = '02610' AND lastname = 'ACEVEDO' AND firstname = 'TAMMY' AND ssn = 576228946 ; +---------+----------+-----------+-----------+--------+------+--------------+ | zipcode | lastname | firstname | ssn | gender | KEY | VALUE | +---------+----------+-----------+-----------+--------+------+--------------+ | 02610 | ACEVEDO | TAMMY | 576228946 | female | WORK | 617-656-9208 | | 02610 | ACEVEDO | TAMMY | 576228946 | female | cell | 408-656-2016 | | 02610 | ACEVEDO | TAMMY | 576228946 | female | home | 213-879-2134 | +---------+----------+-----------+-----------+--------+------+--------------+ Oracle code: `phoneinfo` IS JSON SQL> SELECT /*+ MONITOR */ a.json_column.zipcode ,a.json_column.lastname ,a.json_column.firstname ,a.json_column.ssn ,a.json_column.gender ,a.json_column.phoneinfo FROM pmt_rmvtable_hive_json_api a WHERE a.json_column.zipcode = '02610' AND a.json_column.lastname = 'ACEVEDO' AND a.json_column.firstname = 'TAMMY' AND a.json_column.ssn = 576228946 ; ZIPCODE : 02610 LASTNAME : ACEVEDO FIRSTNAME : TAMMY SSN : 576228946 GENDER : female PHONEINFO :{"work":"617-656-9208","cell":"408-656-2016","home":"213-879-2134"} QUESTION : How to transform this JSON - PHONEINFO in two “arrays” keys, values- Map behavior expected. Unfortunately, the nested path JSON_TABLE operator is only available for JSON ARRAYS. In the other side, when using JSON, we can access to each field as columns. SQL> SELECT /*+ MONITOR */ ZIPCODE ,LASTNAME ,FIRSTNAME ,SSN ,GENDER ,LICENSE ,a.PHONEINFO.work ,a.PHONEINFO.home ,a.PHONEINFO.cell FROM pmt_rmvtable_hive_orc a WHERE zipcode = '02610' AND lastname = 'ACEVEDO' AND firstname = 'TAMMY' AND ssn = 576228946; ZIPCODE LASTNAME FIRSTNAME SSN GENDER LICENSE WORK HOME CELL -------------------- -------------------- -------------------- ---------- -------------------- ------------------ --------------- --------------- --------------- 02610 ACEVEDO TAMMY 576228946 female 533933353734363933 617-656-9208 213-879-2134 408-656-2016 and what about using map columns on the where clause Looking for a specific phone number Impala code `phoneinfo` map<string,string> SELECT ZIPCODE ,LASTNAME ,FIRSTNAME ,SSN ,GENDER ,PHONEINFO.* FROM rmvtable_hive_parquet, rmvtable_hive_parquet.PHONEINFO AS PHONEINFO WHERE PHONEINFO.key = 'work' AND PHONEINFO.value = '617-656-9208' ; +---------+------------+-----------+-----------+--------+------+--------------+ | zipcode | lastname | firstname | ssn | gender | KEY | VALUE | +---------+------------+-----------+-----------+--------+------+--------------+ | 89878 | ANDREWS | JEREMY | 848834686 | male | WORK | 617-656-9208 | | 00183 | GRIFFIN | JUSTIN | 976396720 | male | WORK | 617-656-9208 | | 02979 | MORGAN | BONNIE | 904775071 | female | WORK | 617-656-9208 | | 14462 | MCLAUGHLIN | BRIAN | 253990562 | male | WORK | 617-656-9208 | | 83193 | BUSH | JANICE | 843046328 | female | WORK | 617-656-9208 | | 57300 | PAUL | JASON | 655837757 | male | WORK | 617-656-9208 | | 92762 | NOLAN | LINDA | 270271902 | female | WORK | 617-656-9208 | | 14057 | GIBSON | GREGORY | 345334831 | male | WORK | 617-656-9208 | | 04336 | SAUNDERS | MATTHEW | 180588967 | male | WORK | 617-656-9208 | ... 
| 23993 | VEGA | JEREMY | 123967808 | male | WORK | 617-656-9208 | +---------+------------+-----------+-----------+--------+------+--------------+ Fetched 852 ROW(s) IN 99.80s But let me continue showing the same code on Oracle (querying on work phone). Oracle code `phoneinfo` IS JSON SELECT /*+ MONITOR */ ZIPCODE ,LASTNAME ,FIRSTNAME ,SSN ,GENDER ,PHONEINFO FROM pmt_rmvtable_hive_parquet a WHERE JSON_QUERY("A"."PHONEINFO" FORMAT JSON , '$.work' RETURNING VARCHAR2(4000) ASIS WITHOUT ARRAY WRAPPER NULL ON ERROR)='617-656-9208' ; 35330 SIMS DOUGLAS 295204437 male {"work":"617-656-9208","cell":"901-656-9237","home":"303-804-7540"} 43466 KIM GLORIA 358875034 female {"work":"617-656-9208","cell":"978-804-8373","home":"415-234-2176"} 67056 REEVES PAUL 538254872 male {"work":"617-656-9208","cell":"603-234-2730","home":"617-804-1330"} 07492 GLOVER ALBERT 919913658 male {"work":"617-656-9208","cell":"901-656-2562","home":"303-804-9784"} 20815 ERICKSON REBECCA 912769190 female {"work":"617-656-9208","cell":"978-656-0517","home":"978-541-0065"} 48250 KNOWLES NANCY 325157978 female {"work":"617-656-9208","cell":"901-351-7476","home":"213-234-8287"} 48250 VELEZ RUSSELL 408064553 male {"work":"617-656-9208","cell":"978-227-2172","home":"901-630-7787"} 43595 HALL BRANDON 658275487 male {"work":"617-656-9208","cell":"901-351-6168","home":"213-227-4413"} 77100 STEPHENSON ALBERT 865468261 male {"work":"617-656-9208","cell":"408-227-4167","home":"408-879-1270"} 852 ROWS selected. Elapsed: 00:05:29.56 In this case, we can also use the dot-notation A.PHONEINFO.work = '617-656-9208' Note: for make familiar with Database JSON API you may use follow blog series: https://blogs.oracle.com/jsondb
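For completeness, the dot-notation variant mentioned above looks like this end to end (a sketch - the scott/tiger@//dbhost:1521/orclpdb connect string is a placeholder; the external table is the same pmt_rmvtable_hive_parquet used earlier):
$ sqlplus -s scott/tiger@//dbhost:1521/orclpdb <<'EOF'
SELECT zipcode, lastname, firstname, ssn, gender, phoneinfo
FROM   pmt_rmvtable_hive_parquet a
WHERE  a.phoneinfo.work = '617-656-9208';
EOF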


Big Data SQL

Big Data SQL Quick Start. Custom SerDe – Part 20

Big Data SQL Quick Start. Custom SerDe – Part 20 Many thanks to Bilal Ibdah, who is actual author of this content, I'm just publishing it in the Big Data SQL blog. A modernized data warehouse is a data warehouse augmented with insights and data from a Big Data environment, typically Hadoop, now rather than moving and pushing the Hadoop data to a database, companies tend to expose this data through a unified layer that allows access to all data storage platforms, Hadoop, Oracle DB & NoSQL to be more specific. The problem lies when the data that we want to expose is stored in its native format and in the lowest granularity possible, for example packet data, which can be in a binary format (PCAP), typical uses of packet data is in the telecommunications industry where this data is generated from a packet core, and can contain raw data records, known in the telecom industry as XDRs. Here as an example of traditional architecture when source data is loading into mediation and after this TEXT (CSV) files parsed to some ETL engine and then load data into Database: here is an alternative architecture, when you load the data directly to the HDFS (which is the part of your logical datawarehouse) and after this parse it on the fly during SQL running: In this blog we’re going to use Oracle Big Data SQL to expose and access raw data stored in PCAP format living in hadoop. The first step is up store the PCAP files in HDFS using the “copyFromLocal” command. This is what the file pcap file looks like in HDFS: In order to expose this file using Big Data SQL, we need to register this file in the Hadoop Metastore, once it’s registered in the metastore Big Data SQL can access the metadata, create an external table, and run pure Oracle SQL queries on the file, but registering this file requires to unlock the content using a custom SerDe, more details here. Start by downloading the PCAP project from GitHub here, the project contains two components: The hadoop-pcap-lib, which can be used in MapReduce jobs and, The hadoop-pcap-serde, which can be used to query PCAPs in HIVE For this blog, we will only use the serde component. If the serde project hasn’t been compiled, compile it in an IDE or in a cmd window using the command “mvn package -e -X” Copy the output jar named “hadoop-pcap-serde-1.1-SNAPSHOT-jar-with-dependencies.jar” found in the target folder to each node in your hadoop cluster: Then add the pcap serde to the HIVE environment variables through Cloudera Manager: Then save the changes and restart HIVE (you might also need to redeploy the configuration and restart the stale services). Now let’s create a HIVE table and test the serde; copy the below to create a HIVE table: DROP table pcap; ADD JAR hadoop-pcap-serde-0.1-jar-with-dependencies.jar; SET net.ripe.hadoop.pcap.io.reader.class=net.ripe.hadoop.pcap.DnsPcapReader; CREATE EXTERNAL TABLE pcap (ts bigint, ts_usec string, protocol string, src string, src_port int, dst string, dst_port int, len int, ttl int, dns_queryid int, dns_flags string, dns_opcode string, dns_rcode string, dns_question string, dns_answer array<string>, dns_authority array<string>, dns_additional array<string>) ROW FORMAT SERDE 'net.ripe.hadoop.pcap.serde.PcapDeserializer' STORED AS INPUTFORMAT 'net.ripe.hadoop.pcap.io.PcapInputFormat' OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat' LOCATION 'hdfs:///user/oracle/pcap/'; Now it’s time to test the serde on HIVE, let’s run the below query: select * from pcap limit 5; The query ran successfully. 
Next we will create an Oracle external table that points to the pcap file using Big Data SQL. For this purpose we need to add the PCAP SerDe jar to the Big Data SQL environment variables (this must be done on each node in your Hadoop cluster):
Create a directory on each server in the Oracle Big Data Appliance, such as "/home/oracle/pcapserde/"
Copy the SerDe jar to each node in your Big Data Appliance
Browse to /opt/oracle/bigdatasql/bdcell-12.1
Add the pcap jar file to the environment variables list in the configuration file "bigdata.properties"
The class also needs to be updated in the bigdata.properties file on the database nodes. First we need to copy the jar to the database nodes:
Copy the jar to the database side
Add the jar to the class path
Create the database external table and run the query
Restart the "bdsql" service in Cloudera Manager
After this we are good to define the external table in Oracle RDBMS and query it! Just in case, I will highlight that in the last query we query (read: parse and query) binary data on the fly.
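The per-node copy steps described above are easy to script. This is only a sketch - the node names and the target directory are illustrative, not the actual cluster layout:
# push the SerDe jar to every node in the cluster (adjust the node list for your environment)
$ for node in bdanode01 bdanode02 bdanode03 bdanode04 bdanode05 bdanode06; do
    ssh ${node} "mkdir -p /home/oracle/pcapserde"
    scp hadoop-pcap-serde-1.1-SNAPSHOT-jar-with-dependencies.jar ${node}:/home/oracle/pcapserde/
  done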


Data Warehousing

Connecting Apache Zeppelin to your Oracle Data Warehouse

In my last posts I provided an overview of the Apache Zeppelin open source project which is a new style of application called a “notebook”. These notebook applications typically runs within your browser so as an end user there is no desktop software to download and install. Interestingly, I had a very quick response to this article asking about how to setup a connection within Zeppelin to an Oracle Database. Therefore, in this post I am going to look at how you can install the Zeppelin server and create a connection to your Oracle data warehouse. This aim of this post is to walk you through the following topics: Installing Zeppelin Configuring Zeppelin What is an interpreter Finding and installing the Oracle JDBC drivers Setting up a connection to an Oracle PDB Firstly a quick warning! There are a couple of different versions of Zeppelin available for download. At the moment I am primarily using version 0.6.2 which works really well. Currently, for some reason I am seeing performance problems with the latest iterations around version 0.7.x and this issue. I have discussed this a few people here at Oracle and we are all seeing the same behaviour - queries will run, they just take 2-3 minutes longer for some unknown reason compared with earlier versions, pre-0.7.x, of Zeppelin. In the interests of completeness in this post I will cover setting up a 0.6.2 instance of Zeppelin as well as a 0.7.1 instance. Installing Zeppelin The first thing you need to decide is where to install the Zeppelin software. You can run on your own PC or on a separate server or on the same server that is running your Oracle Database. I run all my linux based database environments within Virtualbox images so I always install onto the same virtual machine as my Oracle database - makes life easier for moving demos around when I am heading off to user conference. Step two is to download the software. The download page is here: https://zeppelin.apache.org/download.html. Simply pick the version you want to run and download the corresponding compressed file - my recommendation, based on my experience, is to stick with version 0.6.2 which was released on Oct 15, 2016. I always select to download the full application - “Binary package with all interpreters” just to make life easy and it also gives me access the full range of connection options which, as you will discover in my next post, is extremely useful. Installing Zeppelin - Version 0.6.2 After downloading the zeppelin-0.6.2-bin-all.tgz file onto my Linux Virtualbox machine I simply expand the file to create a “zeppelin-0.6.2-bin-all” directory. The resulting directory structure looks like this: Of course you can rename the folder name to something more meaningful, such as “my-zeppelin” if you wish….obviously, the underlying folder structure remains the same! Installing Zeppelin - Version 0.7.x The good news is that if you want to install one of the later versions of Zeppelin then the download and unzip process is exactly the same. At this point in time there are two versions of 0.7, however, both 0.7.0 and 0.7.1 seem to suffer from poor query performance when using the JDBC driver (I have only tested the JDBC driver against Oracle Database but I presume the same performance issues are affecting other types of JDBC-related connections). As with the previous version of Zeppelin you can, if required, change the default directory name to something more suitable. Now we have our notebook software unpacked and ready to go! 
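For reference, the download-and-unpack step for either version can be done entirely from the command line. A typical sequence - the mirror URL is illustrative, pick one from the download page above:
$ wget https://archive.apache.org/dist/zeppelin/zeppelin-0.6.2/zeppelin-0.6.2-bin-all.tgz
$ tar -xzf zeppelin-0.6.2-bin-all.tgz
$ mv zeppelin-0.6.2-bin-all my-zeppelin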
Configuring Zeppelin (0.6.2 and 0.7.x) This next step is optional. If you have installed the Zeppelin software on the same server or virtual environment that runs your Oracle Database then you will need to tweak the default configuration settings to ensure there are no clashes with the various Oracle Database services. By default, you access the Zeppelin Notebook home page via the port 8080. Depending on your database environment this may or may not cause problems. In my case, this port was already being used by APEX, therefore, it was necessary to change the default port… Configuring the Zeppelin http port If you look inside the “conf” directory there will be a file named “zeppelin-site.xml.template”, rename this to “zeppelin-site.xml”. Find the following block of tags: <property> <name>zeppelin.server.port</name> <value>8080</value> <description>Server port.</description> </property> the default port settings in the conf file will probably clash with the APEX environment in your Oracle Database. Therefore, you will need to change the port setting to another value, such as: <property> <name>zeppelin.server.port</name> <value>7081</value> <description>Server port.</description> </property> Save the file and we are ready to go! It is worth spending some time reviewing the other settings within the conf file that let you use cloud storage services, such as the Oracle Bare Metal Cloud Object Storage service. For my purposes I was happy to accept the default storage locations for managing my notebooks and I have not tried to configure the use of an SSL service to manage client authentication. Obviously, there is a lot more work that I need to do around the basic setup and configuration procedures which hopefully I will be able to explore at some point in time - watch this space! OK, now we have everything in place: software, check…. port configuration, check. It’s time to start your engine! Starting Zeppelin This is the easy part. Within the bin directory there is a shell script to run the Zeppelin daemon: . ../my-zeppelin/bin/zeppelin-daemon.sh start There is a long list of command line environment settings that you can use, see here: https://zeppelin.apache.org/docs/0.6.2/install/install.html. In my Virtualbox environment I found it useful to configure the following settings: ZEPPELIN_MEM: amount of memory available to Zeppelin. The default setting is - -Xmx1024m -XX:MaxPermSize=512m ZEPPELIN_INTP_MEM: amount of memory available to the Zeppelin Interpreter (connection) engine and the default setting is derived from the setting of ZEPPELIN_MEM ZEPPELIN_JAVA_OPTS: simply lists any additional JVM options  therefore, my startup script looks like this: set ZEPPELIN_MEM=-Xms1024m -Xmx4096m -XX:MaxPermSize=2048m set ZEPPELIN_INTP_MEM=-Xms1024m -Xmx4096m -XX:MaxPermSize=2048m set ZEPPELIN_JAVA_OPTS="-Dspark.executor.memory=8g -Dspark.cores.max=16" . ../my-zeppelin/bin/zeppelin-daemon.sh start    Fingers crossed, once Zeppelin has started the following message should appear on your command line:  Zeppelin start                                             [  OK  ]   Connecting to Zeppelin Everything should now be in place to test whether your Zeppelin environment is up and running. Open a browser and type the ip address/host name and port reference which in my case is: http://localhost:7081/#/ then the home page should appear: The landing pad interface is nice and simple. In the top right-hand corner you will see a green light which tells me that the Zeppelin service is up and running. 
“anonymous” is my user id because I have not enabled client side authentication. In the main section of the welcome screen you will see links to the help system and the community pages, which is where you can log any issues that you find. The Notebook section is where all the work is done and this is where I am going to spend the next post exploring in some detail. If you are used using a normal BI tool then Zeppelin (along with most other notebook applications) will take some getting used to because it creating reports follows is more of scripting-style process rather than a wizard-driven click-click process you get with products like Oracle Business Intelligence. Anyway, more on this later, What is an Interpreter? To build notebooks in Zeppelin you need to make connections to your data sources. This is done using something called an “Interpreter”. This is a plug-in which enables Zeppelin to use not only a specific query language but also provides access to backend data-processing capabilities. For example, it is possible to include shell scripting code within a Zeppelin notebook by using the %sh interpreter. To access an Oracle Database we use the JDBC interpreter. Obviously, you might want to have lots of different JDBC-based connections - maybe you have an Oracle 11g instance, a 12cR1 instance and a 12c R2 instance. Zeppelin allows you to create new interpreters and define their connection characteristics. It’s at this point that version 0.6.2 and versions 0.7.x diverge. Each has its own setup and configuration process for interpreters so I will explain the process for each version separately. Firstly, we need to track down some JDBC files… Configuring your JDBC files Finally, we have reached the point of this post - connecting Zeppelin to your Oracle data warehouse. But before we dive into setting up connections we need to track down some Oracle specific jdbc files. You will need to locate one of the following files to use with Zeppelin: ojdbc7.jar  (Database 12c Release 1) or ojdbc8.jar (Database 12c Release 2). You can either copy the relevant file to your Zeppelin server or simply point the Zeppelin interpreter to the relevant directory. My preference is to keep everything contained within the Zeppelin folder structure so I have taken my Oracle JDBC files and moved them to my Zeppelin server. If you want to find the JDBC files that come with your database version then you need to find the jdbc folder within your version-specific folder. In my 12c Release 2 environment this was located in the folder shown below: alternatively, I could have copied the files from my local SQL Developer installation: take the jdbc file(s) and copy them to the /interpreter/jdbc directory within your Zeppelin installation directory, as shown below: Creating an Interpreter for Oracle Database At last we are finally ready to create a connection to our Oracle Database! Make a note of the directory containing the Oracle JDBC file because you will need that information during the configuration process. There is a difference between the different versions of Zeppelin in terms of creating a connection to an Oracle database/PDB. Personally, I think the process in version 0.7.x makes more sense but the performance of jdbc is truly dreadful for some reason. There is obviously been a major change of approach in terms of how connections are managed within Zeppelin and this seems to causing a few issues. 
Digging around in the documentation it would appear that 0.8.x version will be available shortly so I am hoping the version 0.7x connection issues will be resolved! Process for creating a connection using version 0.6.2 Starting from the home page (just click on the word “Zeppelin” in the top left corner of your browser or open a new window and connect to http://localhost:7081/#/), then click on the username “anonymous” which will reveal a pulldown menu. Select “Interpreter” as shown below: this will take you to the home page for managing your connections, or interpreters. Each query language and data processing language has its own interpreter and these are all listed in alphabetical order. scroll down until you find the entry for jdbc: here you will see that the jdbc interpreter is already configured for two separate connections: postgres and hive. By clicking on the “edit” button on the right-hand side we can add new connection attributes and in this case I have removed the hive and postgres attributes and added new attributes osql.driver osql.password osql.url osql.user the significance of the “osql.” prefix will become obvious when we start to build our notebooks - essentially this will be our reference to these specific connection details. I have added a dependency by including an artefact that points to the location of my jdbc file. In the screenshot below you will see that I am connecting to the example sales history schema owned by user sh, password sh, which I have installed in my pluggable database dw2pdb2. The listener port for my jdbc connection is 1521. If you have access to SQL Developer then an easy solution for testing your connection details is to setup a new connection and run the test connection routine. If SQL Developer connects to your database/pdb using your jdbc connection string then Zeppelin should also be able to connect successfully. FYI…error messages in Zeppelin are usually messy and long listings of a Java program stack. Not easy to workout where the problem actually originates. Therefore, the more you can test outside of Zeppelin the easier life will be - at least that is what I have found! Below is my enhanced configuration for the jdbc interpreter: The default.driver is simply the entry point into the Oracle jdbc driver which is oracle.jdbc.driver.OracleDriver. The last task is to add an artifact [sic] that points to the location of the Oracle JDBC file. In this case I have pointed to the 12c Release 1 driver stored in the ../zeppelin/intepreter/jdbc folder. 
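To recap the 0.6.2 setup in command form - this is a sketch, where the Zeppelin directory, the JDBC jar location and the connection details (PDB dw2pdb2, listener port 1521, user sh) are simply the examples used above:
# copy the Oracle JDBC driver into Zeppelin's jdbc interpreter folder
$ cp $ORACLE_HOME/jdbc/lib/ojdbc7.jar /home/oracle/my-zeppelin/interpreter/jdbc/
# interpreter properties entered on the jdbc interpreter page (values are illustrative)
#   osql.driver    oracle.jdbc.driver.OracleDriver
#   osql.url       jdbc:oracle:thin:@//localhost:1521/dw2pdb2
#   osql.user      sh
#   osql.password  sh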
Process for creating a connection using version 0.7.x As before, starting from the home page (just click on the word “Zeppelin” in the top left corner of your browser or open a new window and connect to http://localhost:7081/#/), then click on the username “anonymous” which will reveal a pulldown menu shown below: now with version 0.7.0 and 0.7.1 we need to actually create a new interpreter, therefore, just click on the “+Create” button: this will bring up the “Create new interpreter” form that will allow you to define the attributes for the new interpreter: I will name my new interpreter “osql” and assign it to the JDBC group: this will pre-populate the form with the default attributes needed to define a JDBC-type connection such as: default.driver: driver entry point into the Oracle JDBC driver default.password: Oracle user password default.url: JDBC connection string to access the Oracle database/pDB  default.user: Oracle username the initial form will look like this: and in my case I need to connect to a PDB called dw2pdb2 on the same server accessed via the listener port 1521, the username is sh and the password is sh. The only non-obvious entry is the default.driver which is oracle.jdbc.driver.OracleDriver. As before, the last task is to add an artifact [sic] that points to the location of the Oracle JDBC file. In this case I have pointed to the 12c Release 2 driver stored in the ../zeppelin/intepreter/jdbc folder. Once you have entered the configuration settings, hit Save and your form should look like this:   Testing your new interpreter To test the your interpreter will successfully connect to your database/pdb and run a SQL statement we need to create a new notebook. Go back to the home page and click on the “Create new note” link in the list on the left side of the screen. Enter a name for your new note: which will bring you to the notebook screen which is where you write your scripts - in this case SQL statements. This is similar in layout and approach as many worksheet-based tools (SQL Developer, APEX SQL Worksheet etc etc). If you are using version 0.6.x of Zeppelin then you can bypass the following… If you are using version 0.7.x then we have to bind our SQL interpreter (osql) to this new note which will allow us to run SQL commands against the sh schema. To add the osql interpreter simply click on the gear icon in the top right-hand side of the screen: this will then show you the list of interpreters which are available to this new note. You can switch interpreters on and off by clicking on them and for this example I have reduced the number of interpreters to just the following: markup (md), shell scripting (sh), file management (file), our Oracle SH pdb connection (osq) and jdbc connections (jdbc). Once you are done, click on the “Save” button to return to the note. I will explain the layout the of the note interface in my next post. For the purposes of testing the connection to my pdb I need to use the “osql” interpreter and give it a SQL statement to run. This is two-lines of code as shown here On the right side of the screen there is a triangle icon which is will execute or “Run” my SQL statement: SELECT sysdate FROM dual note that I have not included a semi-colon (;) at the end of the SQL statement! In version 0.6.2 if you include the semi-colon (;) you will get a java error. Version 0.7x is a little more tolerant and does not object to having or not having a semi-colon (;). 
Using my Virtualbox environment the first time I make a connection to execute a SQL statement the query takes 2-3 minutes to establish the connection to my PDB and then run the query. This is true even for simple queries such as SELECT * FROM dual. Once the first query has completed then all subsequent queries run in the normal expected timeframe (i.e. around the same time as executing the query from within SQL Developer). Eventually, the result will be displayed. By default, output is shown in tabular layout (as you can see from the list of available icons, "graph-based layouts are also available" …and we have now established that the connection to our SH schema is working. Summary In this post we have covered the following topics: How to install Zeppelin How configure and start Zeppelin Finding and installing the correct Oracle JDBC drivers Set up a connection to an Oracle PDB and tested the connection As we have seen during this post, there are some key differences between the 0.6.x and 0.7.x versions of Zeppelin in terms of the way interpreters (connections) are defined. Now we have a fully working environment (Zeppelin connected to my Oracle 12c Release 2 PDB which includes sales history sample schema). Therefore, in my next post I am going to look at how you can use the powerful notebook interface to access remote data files, load data into a schema, create both tabular and graph-based reports, briefing books and even design simple dashboards. Stay tuned for more information about how to use Zeppelin with Oracle Database. If you are already using Zeppelin against your Oracle Database and would like to share your experiences that would be great - please use the comments feature below or feel free to send me an email: keith.laker@oracle.com. (image at top of post is courtesy of wikipedia)  
