
Information, tips, tricks and sample code for Big Data Warehousing in an autonomous, cloud-driven world

Recent Posts

Autonomous

Keeping Your Autonomous Data Warehouse Secure with Data Safe - Part 2

In part 1 of this series we looked at how to get your ADW and OCI environment ready for using Data Safe. If you missed part 1 or need a quick refresher then the blog post is here: https://blogs.oracle.com/datawarehousing/keeping-your-autonomous-data-warehouse-secure-with-data-safe-part-1. In this post we are going to explore the process of connecting an Autonomous Data Warehouse instance to our newly deployed Data Safe environment. Remember that you deploy your Data Safe control center within a specific OCI regional data center - as is the case with all our other cloud services. Therefore, if you switch to a different data center then you will need to deploy a new Data Safe environment. Hope that makes sense!

Launching the Data Safe Service Console

In part 1 we got to the point of enabling Data Safe in the Frankfurt data center. Now when we login to Oracle Cloud using our newly created OCI credentials we can pop open the hamburger menu, select Data Safe and arrive on the Data Safe landing pad. The next step is to launch the Service Console (you may be wondering...why doesn't the Service Console just open automatically, since the landing pad page is empty apart from the Service Console button! Great question, and we will come back to this towards the end of this series of posts when the landing pad page will show a lot more information). After clicking on the Service Console button a new window pops open. Right now there is no information showing on any of our graphs or any of the other pages. This is because we have not yet registered our data warehouse instance, so that's the next step.

Registering an ADW with Data Safe

We need to register our ADW with Data Safe before we can generate any of the reports that are part of the Data Safe library. To do that we need to go to the tab marked "Target" in the horizontal menu at the top of the page. Clicking on the "Register" button will pop open a form where we can input the connection details for our ADW... Data Safe is not limited to just working with Autonomous Databases - it supports a range of target database types. Note that Data Safe supports only serverless deployments for Autonomous Database; "Dedicated" is not currently supported. There is more information here: https://docs.oracle.com/en/cloud/paas/data-safe/udscs/supported-target-databases.html

For ADW (and ATP) we first need to change the connection type to TLS, which will add some additional fields to the form - this will be most recognisable if you have spent time configuring connections to ADW from DI/ETL or BI tools. It looks as if a lot of information is now required to register our ADW instance, but the good news is that just about all the information we need is contained within a small zip file which we can download from our OCI ADB console. Essentially we need the wallet file for our instance. But first let's quickly complete the fields in the top part of the form. Ok, now we need the information about our ADW instance and here's how you get it:

Collecting Connection Information For ADW

If we flip over to the OCI console page for our ADW instance we can see that there is a line for something called "OCID", which is the first piece of information we need to collect. There are two links next to it: "Show" and "Copy". Click on "Copy" and then flip over to our Data Safe page and paste in the OCID reference. Now we need things like hostname, port, service name and target distinguished name along with various secure wallet files.
To get this information we need to download the wallet file, which can be accessed by clicking on the "DB Connection" button. On the pop-up form click the "Download Wallet" button and enter a password...note this down because we are going to need it again shortly. Once the file has been downloaded, find the file on your filesystem and unzip it. The result will be a folder containing the wallet and network configuration files.

Ok, back to our Target registration form on Data Safe...the data for the next four fields can all be found in the tnsnames.ora file. We are going to use the "low service" for this connection because running Data Safe reports is not an urgent, rush-rush workload. If you have no idea what a "low service" is then it might be a good idea to quickly read through the section on "Managing Concurrency and Priorities on Autonomous Data Warehouse" in section 12 of the ADW documentation. In simple terms, when we connect to an ADW instance we need to select a service (low, medium or high). These services map to the LOW, MEDIUM and HIGH consumer groups, which have the following characteristics:

HIGH: Highest resources, lowest concurrency. Queries run in parallel.
MEDIUM: Fewer resources, higher concurrency. Queries run in parallel.
LOW: Least resources, highest concurrency. Queries run serially.

Anyway, as long as the jobs run then we are going to be happy. Therefore, we need to find the details for our low-service connection in the tnsnames.ora file, which will look something like this:

adwdemo_low = (description=(address=(protocol=tcps)(port=1522)(host=xxxxxx.oraclecloud.com))(connect_data=(service_name=xxxxx_adwdemo_low.xxxxxxx))(security=(ssl_server_cert_dn="CN=xxxxxx.oraclecloud.com,OU=Oracle,O=Oracle Corporation,L=Redwood City,ST=California,C=US")))

You can copy & paste the host, port, service_name and ssl_server_cert_dn into the four fields below the "TLS" pulldown menu entry. So now our form looks like this... Now for the last few steps: make sure the wallet type is set to "JKS Wallet". For the Certificate/Wallet pick the "truststore.jks" file from our downloaded and unzipped connection file. In the same directory/folder we can pick "keystore.jks" for the "Keystore Wallet". The next field needs the password we used on the OCI ADW console page when we downloaded the connection zip file, so paste that in. Lastly, add the ADW instance username/password that we created in Part 1 of this series of blog posts - our user was called DATASAFE.

Before you click on the "Test Connection" button we need to run a PL/SQL script to give our new DATASAFE database user some privileges that will allow Data Safe to run through its library of checks. Click on the download button then search for the PL/SQL script dscs_privileges.sql. Using SQL Developer (or any other tool) we need to login as our standard ADMIN user and run that script (copy & paste will do the trick). Check the log for the script and you should see something like this:

Enter value for USERNAME (case sensitive matching the username from dba_users)
Setting USERNAME to DATASAFE
Enter value for TYPE (grant/revoke)
Setting TYPE to GRANT
Enter value for MODE (audit_collection/audit_setting/data_discovery/masking/assessment/all)
Setting MODE to ALL
Granting AUDIT_COLLECTION privileges to "DATASAFE" ...
Granting AUDIT_SETTING privileges to "DATASAFE" ...
Granting DATA_DISCOVERY role to "DATASAFE" ...
Granting MASKING role to "DATASAFE" ...
Granting ASSESSMENT role to "DATASAFE" ...
Done.
Disconnected from Oracle Database 18c Enterprise Edition Release 18.0.0.0.0 - Production Version 18.4.0.0.0

(If you want to double-check the grants at this point, there is a quick verification query at the end of this post.) NOW we can test the connection...and everything should work and we should see a nice big green tick. Finally we are ready to click on "Register Target". Now our target page shows the details of our newly registered database. Of course that's all we have done - register our ADW as a new target - so all the other pages are still empty, including the home page.

Wrap up for Part 2

In part 1 we set up our environment ready for using Data Safe and we enabled Data Safe within our regional data center - in this case the Frankfurt data center. In this post, Part 2, we have successfully added our existing ADW instance as a new target database in Data Safe.

Coming up in Part 3

In the next post we will start to explore some of the data discovery and data masking features that are part of Data Safe.

Learn More...

Our security team has created a lot of great content to help you learn more about Data Safe so here are my personal bookmarks:
Documentation - start here: https://docs.oracle.com/en/cloud/paas/data-safe/udscs/oracle-data-safe-overview.html
Data Safe page on Oracle.com - https://www.oracle.com/database/technologies/security/data-safe.html
Database Security Blog: https://blogs.oracle.com/cloudsecurity/db-sec
https://blogs.oracle.com/cloudsecurity/keep-your-data-safe-with-oracle-autonomous-database-today
https://blogs.oracle.com/cloudsecurity/keeping-your-data-safe-part-4-auditing-your-cloud-databases
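Postscript - verifying the privileges script: if you want to double-check that dscs_privileges.sql really did grant everything to the DATASAFE user before you hit "Test Connection", a quick data dictionary query run as the ADMIN user will show the roles it received. This is only a minimal sketch - the exact role names can vary between Data Safe releases, so treat the output as indicative rather than definitive:

SELECT grantee, granted_role
  FROM dba_role_privs
 WHERE grantee = 'DATASAFE'
 ORDER BY granted_role;

If the query returns no rows, re-run the script as ADMIN and check its log output again before retrying the connection test.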


Autonomous

Keeping Your Autonomous Data Warehouse Secure with Data Safe - Part 1

One of the big announcements at OpenWorld 2019 in San Francisco was Oracle Data Safe - a totally new cloud-based security control center for your Oracle Autonomous Data Warehouse (ADW) - and it's completely FREE! So what exactly does it do? Well, in summary, Data Safe delivers essential data security capabilities as a service on Oracle Cloud Infrastructure. Essentially, it helps you understand the sensitivity of your data, evaluate risks to data, mask sensitive data, implement and monitor security controls, assess user security, monitor user activity, and address data security compliance requirements. Maybe a little video will help...

Data Safe Console

The main console dashboard page for Data Safe gives you a fantastic window directly into the types of data sets sitting inside Autonomous Data Warehouse. It means you can:

Assess if your database is securely configured
Review and mitigate risks based on GDPR Articles/Recitals, Oracle Database STIG Rules, and CIS Benchmark recommendations
Assess user risk by highlighting critical users, roles and privileges
Configure audit policies and collect user activity to identify unusual behavior
Discover sensitive data and understand where it is located
Remove risk from non-production data sets by masking sensitive data

So let's see how easy it is to connect an existing Autonomous Data Warehouse to Data Safe and learn about the types of security reviews you can run on your data sets...

Getting ADW ready to work with Data Safe

To make this more useful to everyone I am going to take an existing ADW instance and create a new user called LOCAL_SH. Then I am going to copy the supplementary demographics, countries and customers tables from the read-only sales history demo schema to my new local_sh schema. This will give me some "sensitive" data points for Data Safe to discover when I connect my ADW to Data Safe (there is also a quick manual sanity check for this sketched at the end of this post).

CREATE USER local_sh IDENTIFIED BY "Welcome1!Welcome1";
GRANT DWROLE TO local_sh;
CREATE TABLE local_sh.supplementary_demographics AS SELECT * FROM sh.supplementary_demographics;
CREATE TABLE local_sh.customers AS SELECT * FROM sh.customers;
CREATE TABLE local_sh.countries AS SELECT * FROM sh.countries;

So what does the customers table look like? As you can see, there are definitely some columns that would help to personally identify someone. Those types of columns need to be hidden or masked from our development teams and business users... Now we know that we have some very sensitive data! If you are not quite as lucky as me and you are starting from a completely clean ADW and need to load some of your own data, then check out the steps in our documentation guide that explains how to load your own data into ADW: https://docs.oracle.com/en/cloud/paas/autonomous-data-warehouse-cloud/tasks_load_data.html

The next step is to create a user within my ADW instance that I will use for my Data Safe connection process, so that my security review process is not tied to one of my application users.

CREATE USER datasafe IDENTIFIED BY "Welcome1!DataSafe1";
GRANT DWROLE TO datasafe;

What you are going to see later is that we have to run an installation script, which is available from the Data Safe console when we register a database. This user is going to own the required roles and privileges for running Data Safe, which is why I don't want to tie it to one of my existing application users.

Do I have the right type of Cloud Account?
If you are new to Oracle Cloud or maybe created your account within the last 12 months then you can probably skip this section. For those of you who have had cloud accounts on Oracle Cloud for over two years, when you try to access Data Safe you may get an interesting warning message about federated accounts...

Accessing Data Safe

After you log into your cloud account click on the hamburger (three horizontal lines) menu in the top left corner. The pop-out menu will have Data Safe listed underneath Autonomous Transaction Processing... Click on Data Safe and this screen might appear! If this message doesn't appear then jump ahead a few paragraphs to "Enabling Data Safe". Don't panic, all is not lost at this point. All you need to do is create a new OCI user. In the same hamburger menu list scroll down to the section for "Governance and Administration", select "Identity" and then select "Users"... Click on the big blue "Create User" button at the top of the screen and then fill in the boxes to create a new OCI user which we will use as the owner of your Data Safe environment. After you create the new user a welcome email should arrive with a link to reset the password...something similar to this:

Having created a completely new user it's important to enable all the correct permissions so that Data Safe can access the resources and autonomous database instances within your tenancy. For my user, called DataSafe, I already have an “administrators” group within my tenancy that contains all the required privileges needed for Data Safe. In the real world it's probably prudent to set up a new group just for Data Safe administrators and then assign OCI privileges to just that group. There is more information about this process here: https://docs.oracle.com/en/cloud/paas/data-safe/udscs/required-permission-enabling-oracle-data-safe.html

Quick Recap

We have an existing Autonomous Data Warehouse instance set up. I have used SQL Developer to create a new user to own a small data set containing potentially sensitive information, copied some tables from an existing schema that I know has sensitive data points (so I now have a working data set), and we have set up a new OCI user to own our Data Safe deployment.

Enabling Data Safe

Now we are ready to start working with Data Safe. From the hamburger menu select "Data Safe" and you should see this screen if it's the first time you have used it in your tenancy-region. You can see below that I am working in our Frankfurt data center and this is the first time I have logged into Data Safe, so to get to the next stage all we need to do is click on the big blue button to enable Data Safe. At which point we get the usual "working..." screen, followed by the "ready to get to work" screen...

Wrap-up for Part 1

This is a wrap for this particular post. In the next instalment we will look at how to register an Autonomous Data Warehouse instance and then run some of the security reports that can help us track down those objects that contain sensitive data about our customers.

Learn More...
Our security team has created a lot of great content to help you learn more about Data Safe so here are my personal bookmarks:
Documentation - start here: https://docs.oracle.com/en/cloud/paas/data-safe/udscs/oracle-data-safe-overview.html
Data Safe page on Oracle.com - https://www.oracle.com/database/technologies/security/data-safe.html
Database Security Blog: https://blogs.oracle.com/cloudsecurity/db-sec
https://blogs.oracle.com/cloudsecurity/keep-your-data-safe-with-oracle-autonomous-database-today
https://blogs.oracle.com/cloudsecurity/keeping-your-data-safe-part-4-auditing-your-cloud-databases
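Postscript - a quick manual look for sensitive columns: before Data Safe runs its automated discovery (covered from Part 2 onwards), you can eyeball the obvious candidates in the new schema yourself with a simple data dictionary query. This is only a minimal, name-based sketch that assumes the LOCAL_SH schema created above - Data Safe's sensitive data discovery goes far beyond pattern matching on column names:

SELECT table_name, column_name
  FROM all_tab_columns
 WHERE owner = 'LOCAL_SH'
   AND (column_name LIKE '%NAME%'
     OR column_name LIKE '%EMAIL%'
     OR column_name LIKE '%PHONE%'
     OR column_name LIKE '%INCOME%')
 ORDER BY table_name, column_name;

Columns such as the customer names, email addresses and phone numbers in the copied customers table are exactly the kind of data points we want Data Safe to flag later on.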


Autonomous

Key Highlights from an Autonomous OpenWorld 2019

It's taken longer than expected (too many things to do post-OpenWorld!) but I have finally finished and published my review of OpenWorld 2019 and it is now available in the Apple iBooks store.

Why do you need this book? Well, as usual there were so many sessions and hands-on labs at this year's conference that it really is hard to know where to start when you click on the link (https://events.rainfocus.com/widget/oracle/oow19/catalogow19?) to access the content catalog guide. There are some filters that help you narrow down the huge list but it can take time to search and download all the most important content linked to Autonomous Database. To save you all that time searching, I have put together a complete review in beautiful iBook format! And if you didn't manage to get to San Francisco then here is the perfect way to learn about the key messages, announcements, roadmaps, features and hands-on training that will help you get the most from your Autonomous Database experience.

So what's in the book? The guide includes all the key sessions, labs and key announcements from this year's Oracle OpenWorld conference, broken down into the following sections:

Chapter 1 - Welcome and key video highlights
Chapter 2 - List of key sessions, labs and videos with links to download the related presentations
Chapter 3 - Details of all the links you need to keep up to date on Oracle’s strategy and products for Data Warehousing and Big Data. This covers all our websites, blogs and social media pages.
Chapter 4 - Everything you need to justify being at OpenWorld 2020

Where can I get it? If you have an Apple device then you will definitely want to get the iBook version which is available here: http://books.apple.com/us/book/id1483761470 If you are still stuck on Windows or stuck using an Android device then the PDF version is the way to go. The download link for this version is here: https://www.dropbox.com/s/7ejlmhmwhwpbdpw/ADB-Review-oow19.pdf?dl=0

Feedback: If you think anything is missing then let me know via this blog by leaving a comment or send me an email (keith.laker@oracle.com). Hope you find the review useful and I look forward to seeing you all at Moscone Center next year for OpenWorld 2020.


Autonomous

Getting A Sneak Preview into the Autonomous World Of 19c

If you have been using Autonomous Data Warehouse for a while now you will know that when you create a new data warehouse your instance will be based on Database 18c. But you probably also spotted that Database 19c is already available via some of our other cloud services, and LiveSQL also runs Database 19c. Which probably makes you wonder when Autonomous Data Warehouse will be upgraded to 19c? The good news is that we are almost there! We have taken the first step by releasing a "19c Preview" mode so you can test your applications and tools against this latest version of the Oracle Database in advance of ADW being autonomously upgraded to 19c. So if you are using Autonomous Data Warehouse today then now is the time to start testing your data warehouse tools and apps using the just-released "19c Preview" feature.

Where is "19c Preview" available?

The great news is that you can enable this feature in any of our data centers! We are rolling it out right now, so if you don't see the exact flow outlined below when you select your data center then don't panic - it just means we haven't got to your data center yet, but we will, just give us a bit more time! If you can see the option to enable preview (I will show you how to do this in the next section) then let's work through the options to build your first 19c Preview instance.

How to Enable 19c Preview

Scenario 1 - creating a new instance based on 19c

Let's assume that in this case you want to create a completely new instance for testing your tools and apps. The great news is that the process is almost identical to the existing process for creating a new data warehouse. We have just added two additional mouse clicks. So after you login to the main OCI console you will navigate to your autonomous database console. Note that in this screenshot I am using our US, Ashburn data center. This is my default data center. Since this feature is available across all our data centers it doesn't matter which data center you use. Click on the big blue "Create Autonomous Database" button to launch the pop-up create form...this should look familiar. I have added a display name for my new instance, "ADW 19c Preview", and then I set the database name: in this case "ADW19CP". Looks nice and simple so far! Next we pick the workload type - in my example I selected the "Data Warehouse" workload...

...now this brings us to one of our other recent additions to ADW - you can now opt for a "Serverless" vs. a "Dedicated" configuration. So what's the difference? Essentially:

Serverless is a simple and elastic deployment choice. Oracle autonomously operates all aspects of the database lifecycle from database placement to backup and updates.

Dedicated is a private cloud in public cloud deployment choice. A completely dedicated compute, storage, network and database service for only a single tenant. You get customizable operational policies to guide Autonomous Operations for workload placement, workload optimization, update scheduling, availability level, over provisioning and peak usage.

There is more information in one of my recent blog posts, "There's a minor tweak to our UI - DEDICATED". In this demo I am going to select "Serverless"...and we are almost there. Just note that as you scroll down, the "Auto Scaling" box has been automatically selected, which is now the default behaviour for all new data warehouse instances. Of course if you don't want auto scaling enabled simply untick the box and move on to the next step... Finally we get to the most important tick box on the form!
The text above the tick box says "New Database Preview Version 19c Available" and all you need to do is tick the "Enable Preview Mode" box. Of course, as with everything in life, you need to pay attention to the small print, so carefully read the text in the yellow information box: you can't actually move forward until you agree to the T&Cs related to using preview mode. The most important part is that preview mode is time boxed and ends on December 1st 2019. That means, right now, you have about 3 weeks for testing! Once you confirm agreement to the T&Cs you can scroll down, add your administrator password and finally click on the usual big blue "Create Autonomous Database" button. Notice that the instance page on the service console now has a yellow banner telling you when your preview period will end. There is also a marker on the OCI console page so you can easily spot a "preview" instance in your list of existing autonomous database instances...

Now let's move on to scenario 2 and create a 19c clone - if you have no idea what a "clone" is then this blog post might help: "What is cloning and what does it have to do with Autonomous Data Warehouse?"

Scenario 2 - cloning an existing instance to 19c

You may already have an existing data warehouse instance and want to check that everything that's working today (ETL jobs, scripts, reports etc.) will still work when ADW moves to Database 19c. The easiest way to do this is to simply clone your existing instance and transform it into a 19c instance during the cloning process. Let's assume that you are on the service console page for your instance...click on the "Actions" button and select "Create Clone". The first step is to select the type of clone you want to create. For testing purposes it's likely that you will want to have the same data in your 19c clone as in your original data warehouse, as this will make it easier to test reports and scripts. This is what I have done below by selecting "Full Clone". Of course, if you just want to make sure that your existing tools and applications can connect to your new 19c ADW then a metadata clone could well be sufficient. The choice is yours! The rest of the form is as per the usual cloning process until you get towards the bottom where you will spot the new section to enable "19c Preview Mode". Click to enable the preview mode, agree to the T&Cs, and you're done! Simply add your administrator password and finally click on the usual big blue "Create Autonomous Database" button. That's it! Welcome to the new world of Autonomous Database 19c. Happy testing and keep an eye on your calendar because preview mode ends on December 1st, 2019. If you want more information about preview versions for Autonomous Database then checkout the overview page in the documentation which is here: https://docs.oracle.com/en/cloud/paas/autonomous-data-warehouse-cloud/user/autonomous-preview.html

What's New in 19c Autonomous Database

Here is a quick summary of some of the default settings in Oracle Database Preview Version 19c for Autonomous Database:

Real-Time Statistics: Enabled by default. Real-Time Statistics enables the database to automatically gather real-time statistics during conventional DML operations. Fresh statistics enable the optimizer to produce more optimal plans. See Real-Time Statistics for more information.

High-Frequency Automatic Optimizer Statistics Collection: Enabled by default. High-Frequency Automatic Optimizer Statistics Collection enables the database to gather optimizer statistics every 15 minutes for objects that have stale statistics. See About High-Frequency Automatic Optimizer Statistics Collection for more information.

High-Frequency SQL Plan Management Evolve Advisor Task: Enabled by default. When enabled, the database will assess the opportunity for automatic SQL plan changes to improve the performance for known statements every hour. A frequent execution means that the Optimizer has more opportunities to find and evolve to better performing plans. See Managing the SPM Evolve Advisor Task for more information.

Automatic Indexing: Disabled by default. To take advantage of automatic indexing you need to enable it yourself (a quick sketch of how to do that follows at the end of this post). When enabled, automatic indexing automates the index management tasks in an Oracle database: it automatically creates, rebuilds, and drops indexes based on the changes in application workload, thus improving database performance. See Managing Auto Indexes for more information.

Enjoy the autonomous world of Database 19c.
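Since automatic indexing is the one feature in the list above that you have to switch on yourself, here is a minimal sketch of how you might try it out in a 19c preview instance using the DBMS_AUTO_INDEX package. Run it as the ADMIN user and treat it as an illustration to verify against the 19c documentation, not a recommended configuration:

-- let automatic indexing create and use new indexes automatically
EXEC DBMS_AUTO_INDEX.CONFIGURE('AUTO_INDEX_MODE', 'IMPLEMENT');

-- or, more cautiously, only report the indexes it would have created
EXEC DBMS_AUTO_INDEX.CONFIGURE('AUTO_INDEX_MODE', 'REPORT ONLY');

-- review what the feature has been doing (returns a text report)
SELECT DBMS_AUTO_INDEX.REPORT_ACTIVITY() FROM dual;

Setting AUTO_INDEX_MODE back to 'OFF' disables the feature again, which makes it easy to compare plans with and without automatic indexing during your preview testing.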


Autonomous

Autonomous Wednesday at OpenWorld 2019 - List of Must See Sessions

Here you go folks...OpenWorld is so big this year that, to help you get the most from Wednesday, I have put together a cheat sheet listing all the best sessions. Enjoy your Wednesday and make sure you drink lots of water! If you want this agenda on your phone (iPhone or Android) then check out our smartphone web app by clicking here. AGENDA - WEDNESDAY 09:00AM Moscone South - Room 207/208 SOLUTION KEYNOTE: A New Vision for Oracle Analytics T.K. Anand, Senior Vice President, Analytics, Oracle 09:00 AM - 10:15 AM SOLUTION KEYNOTE: A New Vision for Oracle Analytics 09:00 AM - 10:15 AM Moscone South - Room 207/208 In this session learn about the bright new future for Oracle Analytics, where customers and partners benefit from augmented analytics working together with Oracle Autonomous Data Warehouse, automating the delivery of personalized insights to fuel innovation without limits. SPEAKERS:T.K. Anand, Senior Vice President, Analytics, Oracle Moscone South - Room 152B Managing One of the Largest IoT Systems in the World with Autonomous Technologies Manuel Martin Marquez, Senior Project Leader, Cern Organisation Européenne Pour La Recherche Nucléaire Sebastien MASSON, Oracle DBA, CERN 09:00 AM - 09:45 AM Managing One of the Largest IoT Systems in the World with Autonomous Technologies 09:00 AM - 09:45 AM Moscone South - Room 152B CERN’s particle accelerator control systems produce more than 2.5 TB of data per day from more than 2 million heterogeneous signals. This IoT system and data is used by scientists and engineers to monitor magnetic field strengths, temperatures, and beam intensities among many other parameters to determine if the equipment is operating correctly. These critical data management and analytics tasks represent important challenges for the organization, and key technologies including big data, machine learning/AI, IoT, and autonomous data warehouses, coupled with cloud-based models, can radically optimize the operations. Attend this session to learn from CERN’s experience with IoT systems, Oracle’s cloud, and autonomous solutions. SPEAKERS:Manuel Martin Marquez, Senior Project Leader, Cern Organisation Européenne Pour La Recherche Nucléaire Sebastien MASSON, Oracle DBA, CERN Moscone South - Room 213 Strategy and Roadmap for Oracle Data Integrator and Oracle Enterprise Data Quality Jayant Mahto, Senior Software Development Manager, Oracle 09:00 AM - 09:45 AM Strategy and Roadmap for Oracle Data Integrator and Oracle Enterprise Data Quality 09:00 AM - 09:45 AM Moscone South - Room 213 This session provides a detailed look into Oracle Data Integrator and Oracle Enterprise Data Quality, Oracle’s strategic products for data integration and data quality. See product overviews, highlights from recent customer implementations, and future roadmap plans, including how the products will work with ADW. SPEAKERS:Jayant Mahto, Senior Software Development Manager, Oracle Moscone South - Room 203 The Hidden Data Economy and Autonomous Data Management Paul Sonderegger, Senior Data Strategist, Oracle 09:00 AM - 09:45 AM The Hidden Data Economy and Autonomous Data Management 09:00 AM - 09:45 AM Moscone South - Room 203 Inside every company is a hidden data economy. But because there are no market prices for data inside a single firm, most executives don’t think of it this way. They should.
Seeing enterprise data creation, use, and management in terms of supply, demand, and transaction costs will enable companies to compete more effectively on data. In this session learn to see data economy hiding in your company and see Oracle’s vision for helping you get the most out of it. SPEAKERS:Paul Sonderegger, Senior Data Strategist, Oracle Moscone West - Room 3021 Hands-on Lab: Oracle Machine Learning Mark Hornick, Senior Director Data Science and Big Data, ORACLE Marcos Arancibia Coddou, Product Manager, Data Science and Big Data, Oracle Charlie Berger, Sr. Director Product Management, Data Science and Big Data, Oracle 09:00 AM - 10:00 AM Hands-on Lab: Oracle Machine Learning 09:00 AM - 10:00 AM Moscone West - Room 3021 In this introductory hands-on-lab, try out the new Oracle Machine Learning Zeppelin-based notebooks that come with Oracle Autonomous Database. Oracle Machine Learning extends Oracle’s offerings in the cloud with its collaborative notebook environment that helps data scientist teams build, share, document, and automate data analysis methodologies that run 100% in Oracle Autonomous Database. Interactively work with your data, and build, evaluate, and apply machine learning models. Import, export, edit, run, and share Oracle Machine Learning notebooks with other data scientists and colleagues. Share and further explore your insights and predictions using the Oracle Analytics Cloud. SPEAKERS:Mark Hornick, Senior Director Data Science and Big Data, ORACLE Marcos Arancibia Coddou, Product Manager, Data Science and Big Data, Oracle Charlie Berger, Sr. Director Product Management, Data Science and Big Data, Oracle 10:00AM Moscone South - Room 214 Oracle Essbase 19c: Roadmap on the Cloud Raghuram Venkatasubramanian, Product Manager, Oracle Ashish Jain, Product Manager, Oracle 10:00 AM - 10:45 AM Oracle Essbase 19c: Roadmap on the Cloud 10:00 AM - 10:45 AM Moscone South - Room 214 In this session learn about new analytic platform on Oracle Autonomous Data Warehouse and the role of Oracle Essbase as an engine for data analysis at the speed of thought. Learn how you can perform analysis on data that is stored in Oracle Autonomous Data Warehouse without having to move it into Oracle Essbase. Learn about zero footprint Oracle Essbase and how it provides an architecture that is efficient both in terms of performance and resource utilization. The session also explores new innovations in the future Oracle Essbase roadmap. SPEAKERS:Raghuram Venkatasubramanian, Product Manager, Oracle Ashish Jain, Product Manager, Oracle Moscone West - Room 3000 The Autonomous Trifecta: How a University Leveraged Three Autonomous Technologies Erik Benner, VP Enterprise Transformation, Mythics, Inc. Carla Steinmetz, Senior Principal Consultant, Mythics, Inc. 10:00 AM - 10:45 AM The Autonomous Trifecta: How a University Leveraged Three Autonomous Technologies 10:00 AM - 10:45 AM Moscone West - Room 3000 The challenges facing organizations are often more complex than what one technology can solve. Most solutions require more than just a fast database, or an intelligent analytics tool. True solutions need to store data, move data, and report on data—ideally with all of the components being accelerated with machine learning. In this session learn how Adler University migrated to the cloud with Oracle Autonomous Data Warehouse, Oracle Data Integration, and Oracle Analytics. 
Learn how the university enhanced the IT systems that support its mission of graduating socially responsible practitioners, engaging communities, and advancing social justice. SPEAKERS:Erik Benner, VP Enterprise Transformation, Mythics, Inc. Carla Steinmetz, Senior Principal Consultant, Mythics, Inc. Moscone South - Room 152C Graph Databases and Analytics: How to Use Them Melli Annamalai, Senior Principal Product Manager, Oracle Hans Viehmann, Product Manager EMEA, Oracle 10:00 AM - 10:45 AM Graph Databases and Analytics: How to Use Them 10:00 AM - 10:45 AM Moscone South - Room 152C Graph databases and graph analysis are powerful new tools that employ advanced algorithms to explore and discover relationships in social networks, IoT, big data, data warehouses, and complex transaction data for applications such as fraud detection in banking, customer 360, public safety, and manufacturing. Using a data model designed to represent linked and connected data, graphs simplify the detection of anomalies, the identification of communities, the understanding of who or what is the most connected, and where there are common or unnatural patterns in data. In this session learn about Oracle’s graph database and analytic technologies for Oracle Cloud, Oracle Database, and big data including new visualization tools, PGX analytics, and query language. SPEAKERS:Melli Annamalai, Senior Principal Product Manager, Oracle Hans Viehmann, Product Manager EMEA, Oracle Moscone South - Room 213 Data Architect's Dilemma: Many Specialty Databases or One Multimodel Database? Tirthankar Lahiri, Senior Vice President, Oracle Juan Loaiza, Executive Vice President, Oracle 10:00 AM - 10:45 AM Data Architect's Dilemma: Many Specialty Databases or One Multimodel Database? 10:00 AM - 10:45 AM Moscone South - Room 213 The most fundamental choice for an enterprise data architect to make is between using a single multimodel database or different specialized databases for each type of data and workload. The decision has profound effects on the architecture, cost, agility, and stability of the enterprise. This session discusses the benefits and tradeoffs of each of these alternatives and also provides an alternative solution that combines the best of the multimodal architecture with a powerful multimodel database. Join this session to find out what is the best choice for your enterprise. SPEAKERS:Tirthankar Lahiri, Senior Vice President, Oracle Juan Loaiza, Executive Vice President, Oracle 10:30AM Moscone West - Room 3021 Hands-on Lab: Oracle Big Data SQL Martin Gubar, Director of Product Management, Big Data and Autonomous Database, Oracle Eric Vinck, Principal Sales Consultant, EMEA Oracle Solution Center, Oracle, Oracle 10:30 AM - 11:30 AM Hands-on Lab: Oracle Big Data SQL 10:30 AM - 11:30 AM Moscone West - Room 3021 Modern data architectures encompass streaming data (e.g. Kafka), Hadoop, object stores, and relational data. Many organizations have significant experience with Oracle Databases, both from a deployment and skill set perspective. This hands-on lab on walks through how to leverage that investment. Learn how to extend Oracle Database to query across data lakes (Hadoop and object stores) and streaming data while leveraging Oracle Database security policies. 
SPEAKERS:Martin Gubar, Director of Product Management, Big Data and Autonomous Database, Oracle Eric Vinck, Principal Sales Consultant, EMEA Oracle Solution Center, Oracle, Oracle Moscone West - Room 3023 Hands-on Lab: Oracle Multitenant John Mchugh, Senior Principal Product Manager, Oracle Thomas Baby, Architect, Oracle Patrick Wheeler, Senior Director, Product Management, Oracle 10:30 AM - 11:30 AM Hands-on Lab: Oracle Multitenant 10:30 AM - 11:30 AM Moscone West - Room 3023 This is your opportunity to get up close and personal with Oracle Multitenant. In this session learn about a very broad range of Oracle Multitenant functionality in considerable depth. Warning: This lab has been filled to capacity quickly at every Oracle OpenWorld that it has been offered. It is strongly recommended that you sign up early. Even if you're only able to get on the waitlist, it's always worth showing up just in case there's a no-show and you can grab an available seat. SPEAKERS:John Mchugh, Senior Principal Product Manager, Oracle Thomas Baby, Architect, Oracle Patrick Wheeler, Senior Director, Product Management, Oracle Moscone West - Room 3019 HANDS-ON LAB: RESTful Services with Oracle REST Data Services and Oracle Autonomous Database Jeff Smith, Senior Principal Product Manager, Oracle Ashley Chen, Senior Product Manager, Oracle Colm Divilly, Consulting Member of Technical Staff, Oracle Elizabeth Saunders, Principal Technical Staff, Oracle 10:30 AM - 11:30 AM HANDS-ON LAB: RESTful Services with Oracle REST Data Services and Oracle Autonomous Database 10:30 AM - 11:30 AM Moscone West - Room 3019 In this session learn to develop and deploy a RESTful service using Oracle SQL Developer, Oracle REST Data Services, and Oracle Autonomous Database. Then connect these services as data sources to different Oracle JavaScript Extension Toolkit visualization components to quickly build rich HTML5 applications using a free and open source JavaScript framework. SPEAKERS:Jeff Smith, Senior Principal Product Manager, Oracle Ashley Chen, Senior Product Manager, Oracle Colm Divilly, Consulting Member of Technical Staff, Oracle Elizabeth Saunders, Principal Technical Staff, Oracle 10:45AM The Exchange - Ask Tom Theater Scaling Open Source R and Python for the Enterprise Marcos Arancibia Coddou, Product Manager, Oracle Data Science and Big Data, Oracle 10:45 AM - 11:05 AM Scaling Open Source R and Python for the Enterprise 10:45 AM - 11:05 AM The Exchange - Ask Tom Theater Open source environments such as R and Python offer tremendous value to data scientists and developers. Scalability and performance on large data sets, however, is not their forte. Memory constraints and single-threaded execution can significantly limit their value for enterprise use. With Oracle Advanced Analytics’ R and Python interfaces to Oracle Database, users can take their R and Python to the next level, deploying for enterprise use on large data sets with ease of deployment. In this session learn the key functional areas of Oracle R Enterprise and Oracle Machine Learning for Python and see how to get the best combination of open source and Oracle Database. 
SPEAKERS:Marcos Arancibia Coddou, Product Manager, Oracle Data Science and Big Data, Oracle Moscone South - Room 156B Demystifying Graph Analytics for the Non-expert Peter Jeffcock, Big Data and Data Science, Cloud Business Group, Oracle Sherry Tiao, Oracle 11:15 AM - 12:00 PM Demystifying Graph Analytics for the Non-expert 11:15 AM - 12:00 PM Moscone South - Room 156B This session is aimed at the non-expert: somebody who wants to know how it works so they can ask the technical experts to apply it in new ways to generate new kinds of value for the business. Look behind the curtain to see how graph analytics works. Learn how it enables use cases, from giving directions in your car, to telling the tax authorities if your business partner’s first cousin is conspiring to cheat on payments. SPEAKERS:Peter Jeffcock, Big Data and Data Science, Cloud Business Group, Oracle Sherry Tiao, Oracle Moscone South - Room 214 Oracle Autonomous Data Warehouse: How to Connect Your Tools and Applications George Lumpkin, Vice President, Product Management, Oracle Keith Laker, Senior Principal Product Manager, Oracle 11:15 AM - 12:00 PM Oracle Autonomous Data Warehouse: How to Connect Your Tools and Applications 11:15 AM - 12:00 PM Moscone South - Room 214 What exactly do you need to know to connect your existing on-premises and cloud tools to Oracle Autonomous Data Warehouse? Come to this session and learn about the connection architecture of Autonomous Data Warehouse. For example, learn how to set up and configure Java Database Connectivity connections, how to configure SQLNet, and which Oracle Database Cloud driver you need. Learn how to use Oracle wallets and Java key store files, and what to do if you have a client that is behind a firewall and your network configuration requires an HTTP proxy. All types of connection configurations are explored and explained. SPEAKERS:George Lumpkin, Vice President, Product Management, Oracle Keith Laker, Senior Principal Product Manager, Oracle Moscone South - Room 152B 11 Months with Oracle Autonomous Transaction Processing Eric Grancher, Head of Database Services group, IT department, CERN 11:15 AM - 12:00 PM 11 Months with Oracle Autonomous Transaction Processing 11:15 AM - 12:00 PM Moscone South - Room 152B Oracle Autonomous Transaction Processing and Oracle Autonomous Data Warehouse represent a new way to deploy applications, with the platform providing a performant environment with advanced and automated features. In this session hear one company’s experience with the solution over the past 11 months, from setting up the environment to application deployment, including how to work with it on a daily basis, coordinating with maintenance windows, and more.
SPEAKERS:Eric Grancher, Head of Database Services group, IT department, CERN Moscone South - Room 155B Oracle Data Integration Cloud: Database Migration Service Deep Dive Alex Kotopoulis, Director, Product Management – Data Integration Cloud, Oracle Chai Pydimukkala, Senior Director of Product Management – Data Integration Cloud, Oracle 11:15 AM - 12:00 PM Oracle Data Integration Cloud: Database Migration Service Deep Dive 11:15 AM - 12:00 PM Moscone South - Room 155B Oracle Database Migration Service is a new Oracle Cloud service that provides an easy-to-use experience to migrate databases into Oracle Autonomous Transaction Processing, Oracle Autonomous Data Warehouse, or databases on Oracle Cloud Infrastructure. Join this Oracle Product Management–led session to see how Oracle Database Migration Service makes life easier for DBAs by offering an automated means to address the complexity of enterprise databases and data sets. Use cases include offline migrations for batch database migrations, online migrations requiring minimized downtime, and schema conversions for heterogeneous database migrations. The session also shows how it provides the added value of monitoring, auditability, and data validation to deliver real-time progress. SPEAKERS:Alex Kotopoulis, Director, Product Management – Data Integration Cloud, Oracle Chai Pydimukkala, Senior Director of Product Management – Data Integration Cloud, Oracle 11:30AM Moscone South - Room 301 Cloud Native Data Management Gerald Venzl, Master Product Manager, Oracle Maria Colgan, Master Product Manager, Oracle 11:30 AM - 12:15 PM Cloud Native Data Management 11:30 AM - 12:15 PM Moscone South - Room 301 The rise of the cloud has brought many changes to the way applications are built. Containers, serverless, and microservices are now commonplace in a modern cloud native architecture. However, when it comes to data persistence, there are still many decisions and trade-offs to be made. But what if you didn’t have to worry about how to structure data? What if you could store data in any format, independent of having to know if a cloud service could handle it? What if performance, web-scale, or security concerns no longer held you back when you were writing or deploying apps? Sound like a fantasy? This session shows you how you can combine the power of microservices with the agility of cloud native data management to make this fantasy a reality. SPEAKERS:Gerald Venzl, Master Product Manager, Oracle Maria Colgan, Master Product Manager, Oracle 12:00PM Moscone West - Room 3019 HANDS-ON LAB: Low-Code Development with Oracle Application Express and Oracle Autonomous Database David Peake, Senior Principal Product Manager, Oracle Marc Sewtz, Senior Software Development Manager, Oracle 12:00 PM - 01:00 PM HANDS-ON LAB: Low-Code Development with Oracle Application Express and Oracle Autonomous Database 12:00 PM - 01:00 PM Moscone West - Room 3019 Oracle Application Express is a low-code development platform that enables you to build stunning, scalable, secure apps with world-class features that can be deployed anywhere. In this lab start by initiating your free trial for Oracle Autonomous Database and then convert a spreadsheet into a multiuser, web-based, responsive Oracle Application Express application in minutes—no prior experience with Oracle Application Express is needed. Learn how you can use Oracle Application Express to solve many of your business problems that are going unsolved today. 
SPEAKERS:David Peake, Senior Principal Product Manager, Oracle Marc Sewtz, Senior Software Development Manager, Oracle Moscone West - Room 3021 Hands-on Lab: Oracle Autonomous Data Warehouse Hermann Baer, Senior Director Product Management, Oracle Keith Laker, Senior Principal Product Manager, Oracle Nilay Panchal, Product Manager, Oracle Yasin Baskan, Senior Principal Product Manager, Oracle 12:00 PM - 01:00 PM Hands-on Lab: Oracle Autonomous Data Warehouse 12:00 PM - 01:00 PM Moscone West - Room 3021 In this hands-on lab discover how you can access, visualize, and analyze lots of different types of data using a completely self-service, agile, and fast service running in the Oracle Cloud: oracle Autonomous Data Warehouse. See how quickly and easily you can discover new insights by blending, extending, and visualizing a variety of data sources to create data-driven briefings on both desktop and mobile browsers—all without the help of IT. It has never been so easy to create visually sophisticated reports that really communicate your discoveries, all in the cloud, all self-service, powered by Oracle Autonomous Data Warehouse. Oracle’s perfect quick-start service for fast data loading, sophisticated reporting, and analysis is for everybody. SPEAKERS:Hermann Baer, Senior Director Product Management, Oracle Keith Laker, Senior Principal Product Manager, Oracle Nilay Panchal, Product Manager, Oracle Yasin Baskan, Senior Principal Product Manager, Oracle 12:30PM Moscone South - Room 152C Oracle Autonomous Data Warehouse and Oracle Analytics Cloud Boost Agile BI Reiner Zimmermann, Senior Director, DW & Big Data Product Management, Oracle Christian Maar, CEO, 11880 Solutions AG Pawarit Ruengsuksilp, Business Development Officer, Forth Corporation PCL 12:30 PM - 01:15 PM Oracle Autonomous Data Warehouse and Oracle Analytics Cloud Boost Agile BI 12:30 PM - 01:15 PM Moscone South - Room 152C 11880.com is a midsize media company in Germany that used to do traditional BI using Oracle Database technology and Oracle BI. In this session, learn how the switch to Oracle Autonomous Data Warehouse and Oracle Analytics Cloud enabled 11880.com to give more responsibility directly to business users, who are now able to do real self-service BI. They can start their own database, load data, and analyze it the way they want and need it without the need to ask IT and wait for days or weeks to have a system set up and running. SPEAKERS:Reiner Zimmermann, Senior Director, DW & Big Data Product Management, Oracle Christian Maar, CEO, 11880 Solutions AG Pawarit Ruengsuksilp, Business Development Officer, Forth Corporation PCL Moscone South - Room 214 JSON in Oracle Database: Common Use Cases and Best Practices Beda Hammerschmidt, Consulting Member of Technical Staff, Oracle 12:30 PM - 01:15 PM JSON in Oracle Database: Common Use Cases and Best Practices 12:30 PM - 01:15 PM Moscone South - Room 214 JSON is a popular data format in modern applications (web/mobile, microservices, etc.) that brings about increased demand from customers to store, process, and generate JSON data using Oracle Database. JSON supports a wide range of requirements including real-time payment processing systems, social media analytics, and JSON reporting. This session offers common use cases, explains why customers pick JSON over traditional relational or XML data models, and provides tips to optimize performance. 
Discover the Simple Oracle Document Access (SODA) API that simplifies the interaction with JSON documents in Oracle Database and see how a self-tuning application can be built over Oracle Documents Cloud. SPEAKERS:Beda Hammerschmidt, Consulting Member of Technical Staff, Oracle 02:00PM Moscone North - Hall F Main Keynote Oracle Executives 2:00 p.m. – 3:00 p.m. Main Keynote 2:00 p.m. – 3:00 p.m. Moscone North - Hall F Main OpenWorld Keynote SPEAKERS:Oracle Executives 03:45PM Moscone South - Room 214 Top 10 SQL Features for Developers/DBAs in the Latest Generation of Oracle Database Keith Laker, Senior Principal Product Manager, Oracle 03:45 PM - 04:30 PM Top 10 SQL Features for Developers/DBAs in the Latest Generation of Oracle Database 03:45 PM - 04:30 PM Moscone South - Room 214 SQL is at the heart of every enterprise data warehouse running on Oracle Database, so it is critical that your SQL code is optimized and makes use of the latest features. The latest generation of Oracle Database includes a lot of important new features for data warehouse and application developers and DBAs. This session covers the top 10 most important features, including new and faster count distinct processing, improved support for processing extremely long lists of values, and easier management and optimization of data warehouse and operational queries. SPEAKERS:Keith Laker, Senior Principal Product Manager, Oracle Moscone West - Room 3023 Hands-on Lab: Oracle Database In-Memory Andy Rivenes, Product Manager, Oracle 03:45 PM - 04:45 PM Hands-on Lab: Oracle Database In-Memory 03:45 PM - 04:45 PM Moscone West - Room 3023 Oracle Database In-Memory introduces an in-memory columnar format and a new set of SQL execution optimizations including SIMD processing, column elimination, storage indexes, and in-memory aggregation, all of which are designed specifically for the new columnar format. This lab provides a step-by-step guide on how to get started with Oracle Database In-Memory, how to identify which of the optimizations are being used, and how your SQL statements benefit from them. The lab uses Oracle Database and also highlights the new features available in the latest release. Experience firsthand just how easy it is to start taking advantage of this technology and its performance improvements. SPEAKERS:Andy Rivenes, Product Manager, Oracle 04:00PM Moscone South - Room 306 Six Technologies, One Name: Flashback—Not Just for DBAs Connor Mcdonald, Database Advocate, Oracle 04:00 PM - 04:45 PM Six Technologies, One Name: Flashback—Not Just for DBAs 04:00 PM - 04:45 PM Moscone South - Room 306 There is a remarkable human condition where you can be both cold and sweaty at the same time. It comes about three seconds after you press the Commit button and you realize that you probably needed to have a WHERE clause on that “delete all rows from the SALES table” SQL statement. But Oracle Flashback is not just for those “Oh no!” moments. It also enables benefits for developers, ranging from data consistency and continuous integration to data auditing. Tucked away in Oracle Database, Enterprise Edition, are six independent and powerful technologies that might just save your career and open up a myriad of other benefits as well. Learn more in this session.
SPEAKERS:Connor Mcdonald, Database Advocate, Oracle 04:15PM Moscone South - Room 213 The Changing Role of the DBA Maria Colgan, Master Product Manager, Oracle Jenny Tsai-Smith, Vice President, Oracle 04:15 PM - 05:00 PM The Changing Role of the DBA 04:15 PM - 05:00 PM Moscone South - Room 213 The advent of the cloud and the introduction of Oracle Autonomous Database presents opportunities for every organization, but what's the future role for the DBA? In this session explore how the role of the DBA will continue to evolve, and get advice on key skills required to be a successful DBA in the world of the cloud. SPEAKERS:Maria Colgan, Master Product Manager, Oracle Jenny Tsai-Smith, Vice President, Oracle 04:45PM Moscone South - Room 214 Remove Silos and Query the Data Warehouse, Data Lake, and Streams with Oracle SQL Martin Gubar, Director of Product Management, Big Data and Autonomous Database, Oracle 04:45 PM - 05:30 PM Remove Silos and Query the Data Warehouse, Data Lake, and Streams with Oracle SQL 04:45 PM - 05:30 PM Moscone South - Room 214 The latest data architectures take the approach of using the right tool for the right job. Data lakes have become the repository for capturing and analyzing raw data. Data warehouses continue to manage enterprise data, with Oracle Autonomous Data Warehouse greatly simplifying optimized warehouse deployments. Kafka is key capability capturing for real-time streams. This architecture makes perfect sense until you need answers to questions that require correlations across these sources. And you need business users—using their familiar tools and applications—to be able to find the answers themselves. This session outlines how to break down these data silos to query across these sources with security and without moving mountains of data. SPEAKERS:Martin Gubar, Director of Product Management, Big Data and Autonomous Database, Oracle Moscone South - Room 152B Creating a Multitenant Sandbox with Oracle Cloud Infrastructure Andrew Westwood, VP, Principal Engineer - Innovation, Bank of the West 04:45 PM - 05:30 PM Creating a Multitenant Sandbox with Oracle Cloud Infrastructure 04:45 PM - 05:30 PM Moscone South - Room 152B Bank of the West needed a multitenanted sandbox database environment in the cloud to work with multiple potential partners and with internal customers simultaneously and independently of each other. The environment needed to provide rapid deployment, data segregation and versioning, data isolation that ensured each partner could access only the information it specifically needed, and self-service that enabled each team to query and build their own database objects and create their own restful services on the sandbox data. In this session learn why Bank of the West chose Oracle Cloud Infrastructure, Oracle Application Container Cloud, and Oracle Application Express. SPEAKERS:Andrew Westwood, VP, Principal Engineer - Innovation, Bank of the West 06:30PM " Chase Center Mission Bay Blocks 29-32 San Francisco, CA Oracle CloudFest.19 "John Mayer. 6:30 p.m.–11 p.m. "Oracle CloudFest.19 6:30 p.m.–11 p.m. Chase Center Mission Bay Blocks 29-32 San Francisco, CA You’ve been energized by the fresh ideas and brilliant minds you’ve engaged with at Oracle OpenWorld. Now cap off the event with an evening of inspiration and celebration with John Mayer. John is many things—a guitar virtuoso, an Instagram live host, a storyteller—and a seven-time Grammy Award-winning performer. 
Here’s your chance to savor his distinctive and dynamic body of work that’s touched millions worldwide. * Included in full conference pass. Can be purchased for an additional $375 with a Discover pass. SPEAKERS:John Mayer
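As promised above, here is the small flashback illustration for the scenario the Flashback session describes (that stray DELETE without a WHERE clause). It is a minimal sketch only, run from Python with the cx_Oracle driver: the SALES table, the credentials, the "myadw_high" TNS alias and the ten-minute window are all illustrative assumptions and are not taken from the session itself.

# Minimal sketch: use a flashback query to recover rows lost to a stray DELETE.
# The SALES table, credentials and "myadw_high" alias are placeholders only.
import cx_Oracle

conn = cx_Oracle.connect("demo_user", "demo_password", "myadw_high")
cur = conn.cursor()

# Read the table as it looked ten minutes ago, before the accidental DELETE.
cur.execute("""
    SELECT COUNT(*)
    FROM   sales AS OF TIMESTAMP (SYSTIMESTAMP - INTERVAL '10' MINUTE)""")
print("Rows ten minutes ago:", cur.fetchone()[0])

# Re-insert whatever exists in the flashback image but is missing right now.
cur.execute("""
    INSERT INTO sales
    SELECT * FROM sales AS OF TIMESTAMP (SYSTIMESTAMP - INTERVAL '10' MINUTE)
    MINUS
    SELECT * FROM sales""")
conn.commit()

The same AS OF TIMESTAMP clause works interactively in any SQL tool, and FLASHBACK TABLE ... TO BEFORE DROP covers the even more alarming case of a dropped table (both depend on undo retention and the recycle bin being available).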


Autonomous

Autonomous Data Warehouse + Oracle Analytics Cloud Hands-on Lab at #OOW19

If you are at Moscone Center for this week's OpenWorld conference then don't miss your chance to get some free hands-on time with Oracle Analytics Cloud querying Autonomous Data Warehouse. We have three sessions left to go this week: Tuesday, September 17, 03:45 PM - 04:45 PM | Moscone West - Room 3021 Wednesday, September 18, 12:00 PM - 01:00 PM | Moscone West - Room 3021 Thursday, September 19, 10:30 AM - 11:30 AM | Moscone West - Room 3021 Each session is being led by Philippe Lions, Oracle's own analytics guru and master of data visualization. Philippe and his team built a special version of their standard ADW+OAC workshop which walks you through connecting OAC to ADW, getting immediate valuable insight out of your ADW data, deepening analysis by mashing up additional data and leveraging OAC interactive visualization features. At a higher level the aim is to take a data set which you have never seen before and quickly and easily discover insights into brand and category performance over time, highlighting major sales weaknesses within a specific category. All in just under 45 minutes! Here is Philippe explaining the workshop flow in the embedded video. What everyone ends up with is the report shown below which identifies the sales weaknesses in the mobile phone category, specifically Android phones. All this is done using the powerful data interrogation features of Oracle Analytics Cloud. Make sure you sign up for an ADW hands-on lab running tomorrow, Wednesday and Thursday and learn from our experts in data visualization and analytics. If you want to try this workshop at home then everything you need is here: https://www.oracle.com/solutions/business-analytics/data-visualization/tutorials.html
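If you want a feel for the kind of question the workshop answers before you get hands-on, here is a rough, purely illustrative sketch of an equivalent analysis done programmatically against ADW with the cx_Oracle driver. The workshop itself is entirely point-and-click in OAC, and the SALES table, its columns, the years compared and the connection details below are assumptions, not the lab's actual data set.

# Rough programmatic equivalent of the insight the lab surfaces visually in OAC:
# which brand/category combination shows the biggest year-on-year sales drop.
# The SALES table, its columns and the connection details are illustrative only.
import cx_Oracle

conn = cx_Oracle.connect("demo_user", "demo_password", "myadw_high")
cur = conn.cursor()
cur.execute("""
    SELECT prod_category, prod_brand,
           SUM(CASE WHEN sale_year = 2018 THEN amount_sold ELSE 0 END) AS last_year,
           SUM(CASE WHEN sale_year = 2019 THEN amount_sold ELSE 0 END) AS this_year
    FROM   sales
    GROUP  BY prod_category, prod_brand
    ORDER  BY SUM(CASE WHEN sale_year = 2019 THEN amount_sold ELSE 0 END)
            - SUM(CASE WHEN sale_year = 2018 THEN amount_sold ELSE 0 END)""")
category, brand, last_year, this_year = cur.fetchone()  # largest decline first
print(f"Weakest performer: {brand} ({category}): {last_year:,.0f} -> {this_year:,.0f}")

The point of the lab, of course, is that OAC gets you to the same answer with drag-and-drop visualizations rather than hand-written SQL.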


Autonomous

Need to Query Autonomous Data Warehouse directly in Slack?

We have one of THE coolest demos ever in the demogrounds area at OpenWorld (head straight to the developer tools area and look for the Chatbot and Digital Assistant demo booths). My colleagues in the Oracle Analytics team have combined Oracle Analytics Cloud with the new Digital Assistant tool to create an amazing experience for users who want to get data out using what I would call slightly unusual channels! The first step is to create a connection to ADW or ATP using the pre-built connection wizards - as shown below there are dozens of different connection wizards to help you get started... Then once the connection is set up and the data sets/projects built you swap over to your favorite front-end collaboration tool such as Slack or Microsoft Teams and start writing your natural language query, which is translated into SQL and the results are returned directly to your collaboration tool where, depending on what the APIs support, you will get a tabular report, a summary analysis and a graphical report. The Oracle Analytics Cloud experts on the demo booth kindly walked me through their awesome demo in the embedded video. Need to do the same on your mobile phone? Then this has you covered as well (apologies, I had my thumb over the lens for a while in this video - complete amateur). So if you are at Moscone this week and want to see quite possibly the coolest demo at the show then head to the demogrounds and make for the developer tools area which is clearly marked with a huge overhead banner. If you get stuck then swing by the ADW booth which is right next to the escalators as you come down to the lower area in Moscone South. Enjoy OpenWorld 2019 folks!
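To give a rough feel for the plumbing behind this kind of integration, here is a minimal sketch that runs a query against ADW with the cx_Oracle driver and pushes the answer into a Slack channel through an incoming webhook. To be clear, this is not how the Oracle Analytics Cloud / Digital Assistant demo is actually built (that does the natural-language-to-SQL translation for you); the credentials, wallet path, TNS alias, webhook URL and SALES query below are all illustrative assumptions.

# Minimal sketch: run a SQL query against ADW and post the result to Slack.
# Illustrative only - the OAC/Digital Assistant demo handles the natural language
# to SQL translation itself; credentials, paths and URLs here are placeholders.
import os
import cx_Oracle
import requests

# Point the driver at the unzipped ADW wallet directory (illustrative path).
os.environ["TNS_ADMIN"] = "/opt/oracle/wallet_myadw"

conn = cx_Oracle.connect("demo_user", "demo_password", "myadw_high")
cur = conn.cursor()
cur.execute("""
    SELECT prod_category, SUM(amount_sold)
    FROM   sales
    GROUP  BY prod_category
    ORDER  BY 2 DESC""")
lines = [f"{category}: {total:,.0f}" for category, total in cur]

# Post a simple text summary to a Slack channel via an incoming webhook.
requests.post(
    "https://hooks.slack.com/services/T000/B000/XXXX",
    json={"text": "Sales by category:\n" + "\n".join(lines)},
)

Microsoft Teams offers a broadly similar incoming-webhook mechanism, so the last step translates fairly directly if Teams is your collaboration tool of choice.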


Autonomous

Autonomous Tuesday at OpenWorld 2019 - List of Must See Sessions

Here you go folks...OpenWorld is so big this year that, to help you get the most from Tuesday, I have put together a cheat sheet listing all the best sessions. Enjoy your Tuesday and make sure you drink lots of water! If you want this agenda on your phone (iPhone or Android) then check out our smartphone web app by clicking here. AGENDA - TUESDAY 08:45AM Moscone South - Room 204 Using Graph Analytics for New Insights Melliyal Annamalai, Senior Principal Product Manager, Oracle 08:45 AM - 10:45 AM Using Graph Analytics for New Insights 08:45 AM - 10:45 AM Moscone South - Room 204 Graph is an emerging data model for analyzing data. Graphs enable navigation of large and complex data warehouses and intuitive detection of complex relationships for new insights into your data. Powerful algorithms for Graph models such as ranking, centrality, community identification, and path-finding routines support fraud detection, recommendation engines, social network analysis, and more. In this session, learn how to load a graph; insert nodes and edges with Graph APIs; and traverse a graph to find connections and do high-performance Graph analysis with PGQL, a SQL-like graph query language. Also learn how to use visualization tools to work with Graph data. SPEAKERS: Melliyal Annamalai, Senior Principal Product Manager, Oracle Moscone South - Room 301 Fraud Detection with Oracle Autonomous Data Warehouse and Oracle Machine Learning Charlie Berger, Sr. Director Product Management, Machine Learning, AI and Cognitive Analytics, Oracle 08:45 AM - 10:45 AM Fraud Detection with Oracle Autonomous Data Warehouse and Oracle Machine Learning 08:45 AM - 10:45 AM Moscone South - Room 301 Oracle Machine Learning is packaged with Oracle Autonomous Data Warehouse. Together, they can transform your cloud into a powerful anomaly- and fraud-detection cloud solution. Oracle Autonomous Data Warehouse’s Oracle Machine Learning extensive library of in-database machine learning algorithms can help you discover anomalous records and events that stand out—in a multi-peculiar way. Using unsupervised learning techniques (SQL functions), Oracle Autonomous Data Warehouse and Oracle Machine Learning SQL notebooks enable companies to build cloud solutions to detect anomalies, noncompliance, and fraud (taxes, expense reports, people, claims, transactions, and more). In this session, see example scripts, Oracle Machine Learning notebooks, and best practices and hear customer examples. SPEAKERS: Charlie Berger, Sr. Director Product Management, Machine Learning, AI and Cognitive Analytics, Oracle 11:15AM Moscone South - Room 215/216 Oracle Database: What's New and What's Coming Next Dominic Giles, Master Product Manager, Oracle 11:15 AM - 12:00 PM Oracle Database: What's New and What's Coming Next 11:15 AM - 12:00 PM Moscone South - Room 215/216 In this informative session learn about recent Oracle Database news and developments, and take a sneak peek into what's coming next from the Oracle Database development team.
SPEAKERS: Dominic Giles, Master Product Manager, Oracle     Moscone South - Room 214 Oracle Partitioning: What Everyone Should Know Hermann Baer, Senior Director Product Management, Oracle   11:15 AM - 12:00 PM Oracle Partitioning: What Everyone Should Know 11:15 AM - 12:00 PM Moscone South - Room 214 Oracle Partitioning has proven itself over the course of decades and ensures that tens of thousands of systems run successfully day in and out. But do you really know Oracle Partitioning? Whether you are using it already or are planning to leverage it, this is the time for you to check your knowledge. Attend this session to learn about all the things you already should know about Oracle Partitioning, including the latest innovations of this most widely used functionality in the Oracle Database. SPEAKERS: Hermann Baer, Senior Director Product Management, Oracle     Moscone South - Room 156B Drop Tank: A Cloud Journey Case Study Timothy Miller, CTO, Drop Tank LLC Shehzad Ahmad, Oracle   11:15 AM - 12:00 PM Drop Tank: A Cloud Journey Case Study 11:15 AM - 12:00 PM Moscone South - Room 156B Drop Tank, based out of Burr Ridge, Illinois, was started by a couple of seasoned execs from major oil and fuel brands. Seeing the need for POS connectivity in an otherwise segmented industry, they designed and created proprietary hardware devices. Fast forward a couple years, and Drop Tank is now a full-fledged loyalty program provider for the fuel industry, allowing for an end-to-end solution. Attend this session to learn more. SPEAKERS: Timothy Miller, CTO, Drop Tank LLC Shehzad Ahmad, Oracle     Moscone West - Room 3022C HANDS-ON LAB: Migrate Databases into Oracle Cloud with Oracle Database Migration Service Alex Kotopoulis, Director, Product Management – Data Integration Cloud, Oracle Chai Pydimukkala, Senior Director of Product Management – Data Integration, Oracle Julien Testut, Senior Principal Product Manager – Data Integration Cloud, Oracle Sachin Thatte, Senior Director, Software Development, Data Integration Cloud, Oracle David Allan, Architect - Data Integration Cloud, Oracle Shubha Sundar, Software Development Director, Data Integration Cloud, Oracle   11:15 AM - 12:15 PM HANDS-ON LAB: Migrate Databases into Oracle Cloud with Oracle Database Migration Service 11:15 AM - 12:15 PM Moscone West - Room 3022C Join this hands-on lab to gain experience with Oracle Database Migration Service, which provides an easy-to-use experience to assist in migrating databases into Oracle Autonomous Transaction Processing, Oracle Autonomous Data Warehouse, and databases on Oracle Cloud Infrastructure. The service provides an automated means to address the complexity of enterprise databases and data sets. Use cases include offline migrations for batch migration of databases, online migrations needing minimized database downtime, and schema conversions for heterogeneous migrations from non-Oracle to Oracle databases. 
Learn how Oracle Database Migration Service delivers monitoring, auditability, and data validation to ensure real-time progress and activity monitoring of processes SPEAKERS: Alex Kotopoulis, Director, Product Management – Data Integration Cloud, Oracle Chai Pydimukkala, Senior Director of Product Management – Data Integration, Oracle Julien Testut, Senior Principal Product Manager – Data Integration Cloud, Oracle Sachin Thatte, Senior Director, Software Development, Data Integration Cloud, Oracle David Allan, Architect - Data Integration Cloud, Oracle Shubha Sundar, Software Development Director, Data Integration Cloud, Oracle     Moscone South - Room 211 Security Architecture for Oracle Database Cloud Tammy Bednar, Sr. Director of Product Management, Database Cloud Services, Oracle   11:15 AM - 12:00 PM Security Architecture for Oracle Database Cloud 11:15 AM - 12:00 PM Moscone South - Room 211 Oracle enables enterprises to maximize the number of mission-critical workloads they can migrate to the cloud while continuing to maintain the desired security posture and reducing the overhead of building and operating data center infrastructure. By design, Oracle provides the security of cloud infrastructure and operations (cloud operator access controls, infrastructure security patching, and so on), and you are responsible for securely configuring your cloud resources. This provides unparalleled control and transparency with applications running on Oracle Cloud. Security in the cloud is a shared responsibility between you and Oracle. Join this session to gain a better understanding of the Oracle Cloud security architecture. SPEAKERS: Tammy Bednar, Sr. Director of Product Management, Database Cloud Services, Oracle     Moscone West - Room 3023 Hands-on Lab: Oracle Database In-Memory Andy Rivenes, Product Manager, Oracle   11:15 AM - 12:15 PM Hands-on Lab: Oracle Database In-Memory 11:15 AM - 12:15 PM Moscone West - Room 3023 Oracle Database In-Memory introduces an in-memory columnar format and a new set of SQL execution optimizations including SIMD processing, column elimination, storage indexes, and in-memory aggregation, all of which are designed specifically for the new columnar format. This lab provides a step-by-step guide on how to get started with Oracle Database In-Memory, how to identify which of the optimizations are being used, and how your SQL statements benefit from them. The lab uses Oracle Database and also highlights the new features available in the latest release. Experience firsthand just how easy it is to start taking advantage of this technology and its performance improvements. SPEAKERS: Andy Rivenes, Product Manager, Oracle     Moscone West - Room 3021 Hands-on Lab: Oracle Machine Learning Mark Hornick, Senior Director Oracle Data Science and Big Data Marcos Arancibia Coddou, Product Manager, Oracle Data Science and Big Data, Oracle Charlie Berger, Sr. Director Product Management, Oracle Data Science and Big Data, Oracle   11:15 AM - 12:15 PM Hands-on Lab: Oracle Machine Learning 11:15 AM - 12:15 PM Moscone West - Room 3021 In this introductory hands-on-lab, try out the new Oracle Machine Learning Zeppelin-based notebooks that come with Oracle Autonomous Database. Oracle Machine Learning extends Oracle’s offerings in the cloud with its collaborative notebook environment that helps data scientist teams build, share, document, and automate data analysis methodologies that run 100% in Oracle Autonomous Database. 
Interactively work with your data, and build, evaluate, and apply machine learning models. Import, export, edit, run, and share Oracle Machine Learning notebooks with other data scientists and colleagues. Share and further explore your insights and predictions using the Oracle Analytics Cloud. SPEAKERS: Mark Hornick, Senior Director Oracle Data Science and Big Data Marcos Arancibia Coddou, Product Manager, Oracle Data Science and Big Data, Oracle Charlie Berger, Sr. Director Product Management, Oracle Data Science and Big Data, Oracle 12:00PM The Exchange - Ask Tom Theater Ask TOM: SQL Pattern Matching Chris Saxon, Developer Advocate, Oracle 12:00 PM - 12:20 PM Ask TOM: SQL Pattern Matching 12:00 PM - 12:20 PM The Exchange - Ask Tom Theater SQL is a powerful language for accessing data. Using analytic functions, you can gain lots of insights from your data. But there are many problems that are hard or outright impossible to solve using them. Introduced in Oracle Database 12c, the row pattern matching clause, match_recognize, fills this gap. With this it's easy to write efficient SQL to answer many previously tricky questions. This session introduces the match_recognize clause. See worked examples showing how it works and how it's easier to write and understand than traditional SQL solutions. This session is for developers, DBAs, and data analysts who need to do advanced data analysis (there is a tiny illustrative match_recognize example at the end of this agenda). SPEAKERS: Chris Saxon, Developer Advocate, Oracle 12:30PM Moscone West - Room 2002 The DBA's Next Great Job Rich Niemiec, Chief Innovation Officer, Viscosity North America 12:30 PM - 01:15 PM The DBA's Next Great Job 12:30 PM - 01:15 PM Moscone West - Room 2002 What's the next role for the DBA as the autonomous database and Oracle enhancements free up time? This session explores the DBA role (managing more databases with autonomous database) and the integration of AI and machine learning. Topics covered include what's next for the DBA with the autonomous database, important skills for the future such as machine learning and AI, how the merging of tech fields causes both pain and opportunity, and future jobs and the world ahead where Oracle will continue to lead. SPEAKERS: Rich Niemiec, Chief Innovation Officer, Viscosity North America 12:45PM Moscone West - Room 3021 Hands-on Lab: Oracle Essbase Eric Smadja, Oracle Ashish Jain, Product Manager, Oracle Mike Larimer, Oracle 12:45 PM - 01:45 PM Hands-on Lab: Oracle Essbase 12:45 PM - 01:45 PM Moscone West - Room 3021 The good thing about the new hybrid block storage option (BSO) is that many of the concepts that we learned over the years about tuning a BSO cube still have merit. Knowing what a block is, and why it is important, is just as valuable today in hybrid BSO as it was 25 years ago when BSO was introduced. In this lab learn best practices for performance optimization, how to manage data blocks, how dimension ordering still impacts things such as calculation order, the reasons why the layout of a report can impact query performance, how logs will help the Oracle Essbase developer debug calculation flow and query performance, and how new functionality within Smart View will help developers understand and modify the solve order.
SPEAKERS: Eric Smadja, Oracle Ashish Jain, Product Manager, Oracle Mike Larimer, Oracle   01:30PM The Exchange - Ask Tom Theater Extracting Real Value from Data Lakes with Machine Learning Marcos Arancibia Coddou, Product Manager, Oracle Data Science and Big Data, Oracle   01:30 PM - 01:50 PM Extracting Real Value from Data Lakes with Machine Learning 01:30 PM - 01:50 PM The Exchange - Ask Tom Theater In this session learn how to interface with and analyze large data lakes using the tools provided by the Oracle Big Data platform (both on-premises and in the cloud), which includes the entire ecosystem of big data components, plus several tools and interfaces ranging from data loading to machine learning and data visualization. Using notebooks for machine learning makes the environment easy and intuitive, and because the platform is open users can choose the language they feel most comfortable with, including R, Python, SQL, Java, etc. Customer stories of how to achieve success on the Oracle Big Data platform are shared, and the proven architecture of the solutions and techniques should help anyone starting a big data project today. SPEAKERS: Marcos Arancibia Coddou, Product Manager, Oracle Data Science and Big Data, Oracle   01:45PM Moscone South - Room 214 Choosing and Using the Best Performance Tools for Every Situation Mike Hallas, Architect, Oracle Juergen Mueller, Senior Director, Oracle John Zimmerman, Oracle   01:45 PM - 02:30 PM Choosing and Using the Best Performance Tools for Every Situation 01:45 PM - 02:30 PM Moscone South - Room 214 How do you investigate the performance of Oracle Database? The database is highly instrumented with counters and timers to record important activity, and this instrumentation is enabled by default. A wealth of performance data is stored in the Automatic Workload Repository, and this data underpins various performance tools and reports. With so many options available, how do you choose and use the tools with the best opportunity for improvement? Join this session to learn how real-world performance engineers apply different tools and techniques, and see examples based on real-world situations. SPEAKERS: Mike Hallas, Architect, Oracle Juergen Mueller, Senior Director, Oracle John Zimmerman, Oracle     Moscone South - Room 213 Oracle Multitenant: Best Practices for Isolation Can Tuzla, Senior Product Manager, Oracle Patrick Wheeler, Senior Director, Product Management, Oracle   01:45 PM - 02:30 PM Oracle Multitenant: Best Practices for Isolation 01:45 PM - 02:30 PM Moscone South - Room 213 The dramatic efficiencies of deploying databases in cloud configurations are worthless if they come at the expense of isolation. Attend this session to learn about the technologies that deliver the isolation behind Oracle Autonomous Data Warehouse. SPEAKERS: Can Tuzla, Senior Product Manager, Oracle Patrick Wheeler, Senior Director, Product Management, Oracle     Moscone South - Room 211 Oracle Autonomous Data Warehouse: Customer Panel Holger Friedrich, CTO, sumIT AG Reiner Zimmermann, Senior Director, DW & Big Data Product Management, Oracle Joerg Otto, Head of IT, IDS GmbH - Analysis and Reporting Services Manuel Martin Marquez, Senior Project Leader, Cern Organisation Européenne Pour La Recherche Nucléaire James Anthony, CTO, Data Intensity Ltd   01:45 PM - 02:30 PM Oracle Autonomous Data Warehouse: Customer Panel 01:45 PM - 02:30 PM Moscone South - Room 211 In this panel session, hear customers discuss their experiences using Oracle Autonomous Data Warehouse. 
Topics include both the business and use cases for Oracle Autonomous Data Warehouse and why this service is the perfect fit for their organization. The panel includes customers from Oracle's Global Leaders program, including Droptank, QLX, Hertz, 11880.com, Unior, Deutsche Bank, Caixa Bank, and many more. SPEAKERS: Holger Friedrich, CTO, sumIT AG Reiner Zimmermann, Senior Director, DW & Big Data Product Management, Oracle Joerg Otto, Head of IT, IDS GmbH - Analysis and Reporting Services Manuel Martin Marquez, Senior Project Leader, Cern Organisation Européenne Pour La Recherche Nucléaire James Anthony, CTO, Data Intensity Ltd     Moscone South - Room 215/216 Oracle Autonomous Database: Looking Under the Hood Yasin Baskan, Director Product Manager, Oracle   01:45 PM - 02:30 PM Oracle Autonomous Database: Looking Under the Hood 01:45 PM - 02:30 PM Moscone South - Room 215/216 This session takes you under the hood of Oracle Autonomous Database so you can really understand what makes it tick. Learn about key database features and how these building-block features put the “autonomous” into Autonomous Database. See how to manage and monitor an Autonomous Database using all the most popular tools including Oracle Cloud Infrastructure’s management console, Oracle SQL Developer, Oracle Application Express, REST, and more. Take a deep-dive into what you really need to know to make your DBA career thrive in an autonomous-driven world. SPEAKERS: Yasin Baskan, Director Product Manager, Oracle     Moscone South - Room 152C How Oracle Autonomous Data Warehouse/Oracle Analytics Cloud Can Help FinTech Marketing Charlie Berger, Sr. Director Product Management, Machine Learning, AI and Cognitive Analytics, Oracle Pawarit Ruengsuksilp, Business Development Officer, Forth Corporation PCL   01:45 PM - 02:30 PM How Oracle Autonomous Data Warehouse/Oracle Analytics Cloud Can Help FinTech Marketing 01:45 PM - 02:30 PM Moscone South - Room 152C Forth Smart has a unique FinTech business based in Thailand. It has more than 120,000 vending machines, and customers can top-up their mobile credit with cash, as well as sending money to bank accounts, ewallets, and more. It uses Oracle Database and Oracle Analytics Cloud for segmenting its customer base, improving operational efficiency, and quantifying the impact of marketing campaigns. This has allowed Forth Smart to identify cross-selling opportunities and utilize its marketing budget. In this session learn about Forth Smart's journey to introduce Oracle Analytics platforms. Hear about its recent trials to introduce technologies, such as Oracle Autonomous Data Warehouse, to help it stay ahead of competitors. SPEAKERS: Charlie Berger, Sr. Director Product Management, Machine Learning, AI and Cognitive Analytics, Oracle Pawarit Ruengsuksilp, Business Development Officer, Forth Corporation PCL   03:15PM Moscone South - Room 207/208 Solution Keynote: Oracle Cloud Infrastructure Clay Magouyrk, Senior Vice President, Software Development, Oracle   03:15 PM – 05:00 PM Solution Keynote: Oracle Cloud Infrastructure 03:15 PM – 05:00 PM Moscone South - Room 207/208 In this session join Oracle Cloud Infrastructure senior leadership for the latest news and updates about Oracle's new class of cloud. 
SPEAKERS: Clay Magouyrk, Senior Vice President, Software Development, Oracle     Moscone South - Room 214 Oracle Machine Learning: Overview of New Features and Roadmap Mark Hornick, Senior Director, Data Science and Machine Learning, Oracle Marcos Arancibia Coddou, Product Manager, Oracle Data Science and Big Data, Oracle Charlie Berger, Sr. Director Product Management, Machine Learning, AI and Cognitive Analytics, Oracle   03:15 PM – 04:00 PM Oracle Machine Learning: Overview of New Features and Roadmap 03:15 PM – 04:00 PM Moscone South - Room 214 Oracle extended its database into one that discovers insights and makes predictions. With more than 30 algorithms as parallelized SQL functions, Oracle Machine Learning and Oracle Advanced Analytics eliminate data movement, preserve security, and leverage database parallelism and scalability. Oracle Advanced Analytics 19c delivers unequaled performance and a library of in-database machine learning algorithms. In Oracle Autonomous Database, Oracle Machine Learning provides collaborative notebooks for data scientists. Oracle Machine Learning for Python, Oracle R Enterprise, Oracle R Advanced Analytics for Hadoop, cognitive analytics for images and text, and deploying models as microservices round out Oracle’s ML portfolio. In this session hear the latest developments and learn what’s next. SPEAKERS: Mark Hornick, Senior Director, Data Science and Machine Learning, Oracle Marcos Arancibia Coddou, Product Manager, Oracle Data Science and Big Data, Oracle Charlie Berger, Sr. Director Product Management, Machine Learning, AI and Cognitive Analytics, Oracle   03:45PM Moscone West - Room 3022C Introduction to Oracle Data Integration Service, a Fully Managed Serverless ETL Service Julien Testut, Senior Principal Product Manager – Data Integration Cloud, Oracle Sachin Thatte, Senior Director, Software Development, Data Integration Cloud, Oracle Denis Gray, Senior Director of Product Management – Data Integration Cloud, Oracle David Allan, Architect - Data Integration Cloud, Oracle Shubha Sundar, Software Development Director, Data Integration Cloud, Oracle Abhiram Gujjewar, Director Product Management, OCI, Oracle SOMENATH DAS, Senior Director, Software Development - Data Integration Cloud, Oracle   03:45 PM - 04:45 PM Introduction to Oracle Data Integration Service, a Fully Managed Serverless ETL Service 03:45 PM - 04:45 PM Moscone West - Room 3022C Cloud implementations need data integration to be successful. To maximize value from data, you must decide how to transform and aggregate data as it is moved from source to target. Join this hands-on lab to have firsthand experience with Oracle Data Integration Service, a key component of Oracle Cloud Infrastructure that provides fully managed data integration and extract/transform/load (ETL) capabilities. Learn how to implement processes that will prepare, transform, and load data into data warehouses or data lakes and hear how Oracle Data Integration Service integrates with Oracle Autonomous Databases and Oracle Cloud Infrastructure. 
SPEAKERS: Julien Testut, Senior Principal Product Manager – Data Integration Cloud, Oracle Sachin Thatte, Senior Director, Software Development, Data Integration Cloud, Oracle Denis Gray, Senior Director of Product Management – Data Integration Cloud, Oracle David Allan, Architect - Data Integration Cloud, Oracle Shubha Sundar, Software Development Director, Data Integration Cloud, Oracle Abhiram Gujjewar, Director Product Management, OCI, Oracle SOMENATH DAS, Senior Director, Software Development - Data Integration Cloud, Oracle     Moscone West - Room 3019 hands-on lab: Low-Code Development with Oracle Application Express and Oracle Autonomous Database David Peake, Senior Principal Product Manager, Oracle Marc Sewtz, Senior Software Development Manager, Oracle   03:45 PM - 04:45 PM hands-on lab: Low-Code Development with Oracle Application Express and Oracle Autonomous Database 03:45 PM - 04:45 PM Moscone West - Room 3019 Oracle Application Express is a low-code development platform that enables you to build stunning, scalable, secure apps with world-class features that can be deployed anywhere. In this lab start by initiating your free trial for Oracle Autonomous Database and then convert a spreadsheet into a multiuser, web-based, responsive Oracle Application Express application in minutes—no prior experience with Oracle Application Express is needed. Learn how you can use Oracle Application Express to solve many of your business problems that are going unsolved today. SPEAKERS: David Peake, Senior Principal Product Manager, Oracle Marc Sewtz, Senior Software Development Manager, Oracle   Moscone West - Room 3021 Hands-on Lab: Oracle Autonomous Data Warehouse Hermann Baer, Senior Director Product Management, Oracle Keith Laker, Senior Principal Product Manager, Oracle Nilay Panchal, Product Manager, Oracle Yasin Baskan, Senior Principal Product Manager, Oracle   03:45 PM - 04:45 PM Hands-on Lab: Oracle Autonomous Data Warehouse 03:45 PM - 04:45 PM Moscone West - Room 3021 In this hands-on lab discover how you can access, visualize, and analyze lots of different types of data using a completely self-service, agile, and fast service running in the Oracle Cloud: oracle Autonomous Data Warehouse. See how quickly and easily you can discover new insights by blending, extending, and visualizing a variety of data sources to create data-driven briefings on both desktop and mobile browsers—all without the help of IT. It has never been so easy to create visually sophisticated reports that really communicate your discoveries, all in the cloud, all self-service, powered by Oracle Autonomous Data Warehouse. Oracle’s perfect quick-start service for fast data loading, sophisticated reporting, and analysis is for everybody. SPEAKERS: Hermann Baer, Senior Director Product Management, Oracle Keith Laker, Senior Principal Product Manager, Oracle Nilay Panchal, Product Manager, Oracle Yasin Baskan, Senior Principal Product Manager, Oracle   04:15PM Moscone South - Room 214 JSON Document Store in Oracle Autonomous Database: A Developer Perspective Roger Ford, Product Manager, Oracle   04:15 PM - 05:00 PM JSON Document Store in Oracle Autonomous Database: A Developer Perspective 04:15 PM - 05:00 PM Moscone South - Room 214 JSON Document Store is a service that works alongside Oracle Autonomous Data Warehouse and Oracle Autonomous Transaction Processing. 
In this session see how JSON Document Store makes it easy for developers to create straightforward document-centric applications utilizing the power of Oracle Database without needing to write SQL code or to plan schemas in advance. Learn about the full development lifecycle, from JSON source to an end-user application running on Oracle Cloud. SPEAKERS: Roger Ford, Product Manager, Oracle   05:00PM The Exchange - Ask Tom Theater Ask TOM: SQL Pattern Matching Chris Saxon, Developer Advocate, Oracle   05:00 PM - 05:20 PM Ask TOM: SQL Pattern Matching 05:00 PM - 05:20 PM The Exchange - Ask Tom Theater SQL is a powerful language for accessing data. Using analytic functions, you can gain lots of insights from your data. But there are many problems that are hard or outright impossible to solve using them. Introduced in Oracle Database 12c, the row pattern matching clause, match_recognize, fills this gap. With this it's easy-to-write efficient SQL to answer many previously tricky questions. This session introduces the match_recognize clause. See worked examples showing how it works and how it's easier to write and understand than traditional SQL solutions. This session is for developers, DBAs, and data analysts who need to do advanced data analysis. SPEAKERS: Chris Saxon, Developer Advocate, Oracle   05:15PM Moscone South - Room 211 What’s New in Oracle Optimizer Nigel Bayliss, Senior Principal Product Manager, Oracle   05:15 PM - 06:00 PM What’s New in Oracle Optimizer 05:15 PM - 06:00 PM Moscone South - Room 211 This session covers all the latest features in Oracle Optimizer and provides an overview of the most important differences you can expect to see as you upgrade from earlier releases. Discover all the new, enhanced, and fully automated features relating to Oracle Optimizer. For example, learn how Oracle Database delivers better SQL execution plans with real-time statistics, protects you from query performance regression with SQL quarantine and automatic SQL plan management, and self-tunes your workload with automatic indexing. SPEAKERS: Nigel Bayliss, Senior Principal Product Manager, Oracle     Moscone South - Room 214 Oracle Advanced Compression: Essential Concepts, Tips, and Tricks for Enterprise Data Gregg Christman, Product Manager, Oracle   05:15 PM - 06:00 PM Oracle Advanced Compression: Essential Concepts, Tips, and Tricks for Enterprise Data 05:15 PM - 06:00 PM Moscone South - Room 214 Oracle Database 19c provides innovative row and columnar compression technologies that improve database performance while reducing storage costs. This enables organizations to utilize optimal compression based on usage. Automatic policies enable dynamic, lifecycle-aware compression including lower compression for faster access to hot/active data and higher compression for less frequently accessed inactive/archival data. In this session learn about the essential concepts, as well as compression best practices, and tips and tricks to ensure your organization is achieving the best possible compression savings, and database performance, for your enterprise data. 
SPEAKERS: Gregg Christman, Product Manager, Oracle     Moscone South - Room 152C Billboards to Dashboards: OUTFRONT Media Uses Oracle Analytics Cloud to Analyze Marketing Tim Vlamis, Vice President & Analytics Strategist, VLAMIS SOFTWARE SOLUTIONS INC Dan Vlamis, CEO - President, VLAMIS SOFTWARE SOLUTIONS INC Derek Hayden, VP - Data Strategy & Analytics, OUTFRONT Media   05:15 PM - 06:00 PM Billboards to Dashboards: OUTFRONT Media Uses Oracle Analytics Cloud to Analyze Marketing 05:15 PM - 06:00 PM Moscone South - Room 152C As one of the premier outdoor and out-of-home media organizations in the world, OUTFRONT Media (Formerly CBS Outdoor) serves America’s largest companies with complex marketing and media planning needs. In this session learn how the OUTFRONT’s analytics team migrated from Oracle Business Intelligence Cloud Service to Oracle Analytics Cloud with an Oracle Autonomous Data Warehouse backend. Hear what lessons were learned and what best practices to follow for Oracle Autonomous Data Warehouse and Oracle Analytics Cloud integration and see how sales analytics and location analysis are leading the organization to new analytics insights. Learn how a small team produced gigantic results in a short time. SPEAKERS: Tim Vlamis, Vice President & Analytics Strategist, VLAMIS SOFTWARE SOLUTIONS INC Dan Vlamis, CEO - President, VLAMIS SOFTWARE SOLUTIONS INC Derek Hayden, VP - Data Strategy & Analytics, OUTFRONT Media     Moscone South - Room 212 Oracle Autonomous Transaction Processing Dedicated Deployment: The End User’s Experience Robert Greene, Senior Director, Product Management, Oracle Jim Czuprynski, Enterprise Data Architect, Viscosity NA   05:15 PM - 06:00 PM Oracle Autonomous Transaction Processing Dedicated Deployment: The End User’s Experience 05:15 PM - 06:00 PM Moscone South - Room 212 Oracle Autonomous Transaction Processing dedicated deployment is the latest update to Oracle Autonomous Database. In this session see the service in action and hear a preview customer describe their experience using a dedicated deployment. Walk through the process of setting up and using HammerDB to run some quick performance workloads and then put the service through its paces to test additional functionality in the areas of private IP, fleet administrator resource isolation, service setup, bulk data loading, patching cooperation controls, transparent application continuity, and more. Hear directly from one of the earliest end users about their real-world experience using Autonomous Database. SPEAKERS: Robert Greene, Senior Director, Product Management, Oracle Jim Czuprynski, Enterprise Data Architect, Viscosity NA     Moscone West - Room 3019 Hands-on lab: RESTful Services with Oracle REST Data Services and Oracle Autonomous Database Jeff Smith, Senior Principal Product Manager, Oracle Ashley Chen, Senior Product Manager, Oracle Colm Divilly, Consulting Member of Technical Staff, Oracle Elizabeth Saunders, Principal Technical Staff, Oracle   05:15 PM - 06:15 PM Hands-on lab: RESTful Services with Oracle REST Data Services and Oracle Autonomous Database 05:15 PM - 06:15 PM Moscone West - Room 3019 In this session learn to develop and deploy a RESTful service using Oracle SQL Developer, Oracle REST Data Services, and Oracle Autonomous Database. Then connect these services as data sources to different Oracle JavaScript Extension Toolkit visualization components to quickly build rich HTML5 applications using a free and open source JavaScript framework. 
SPEAKERS: Jeff Smith, Senior Principal Product Manager, Oracle Ashley Chen, Senior Product Manager, Oracle Colm Divilly, Consulting Member of Technical Staff, Oracle Elizabeth Saunders, Principal Technical Staff, Oracle Moscone West - Room 3023 Hands-on Lab: Oracle Multitenant John Mchugh, Senior Principal Product Manager, Oracle Thomas Baby, Architect, Oracle Patrick Wheeler, Senior Director, Product Management, Oracle 05:15 PM - 06:15 PM Hands-on Lab: Oracle Multitenant 05:15 PM - 06:15 PM Moscone West - Room 3023 This is your opportunity to get up close and personal with Oracle Multitenant. In this session learn about a very broad range of Oracle Multitenant functionality in considerable depth. Warning: This lab has been filled to capacity quickly at every Oracle OpenWorld that it has been offered. It is strongly recommended that you sign up early. Even if you're only able to get on the waitlist, it's always worth showing up just in case there's a no-show and you can grab an available seat. SPEAKERS: John Mchugh, Senior Principal Product Manager, Oracle Thomas Baby, Architect, Oracle Patrick Wheeler, Senior Director, Product Management, Oracle Moscone West - Room 3021 Hands-on Lab: Oracle Big Data SQL Martin Gubar, Director of Product Management, Big Data and Autonomous Database, Oracle Eric Vinck, Principal Sales Consultant, EMEA Oracle Solution Center, Oracle 05:15 PM - 06:15 PM Hands-on Lab: Oracle Big Data SQL 05:15 PM - 06:15 PM Moscone West - Room 3021 Modern data architectures encompass streaming data (e.g. Kafka), Hadoop, object stores, and relational data. Many organizations have significant experience with Oracle Databases, both from a deployment and skill set perspective. This hands-on lab walks through how to leverage that investment. Learn how to extend Oracle Database to query across data lakes (Hadoop and object stores) and streaming data while leveraging Oracle Database security policies. SPEAKERS: Martin Gubar, Director of Product Management, Big Data and Autonomous Database, Oracle Eric Vinck, Principal Sales Consultant, EMEA Oracle Solution Center, Oracle 06:00PM Moscone South - Room 309 Building Apps with an Autonomous Multi-model Database in the Cloud Keith Laker, Senior Principal Product Manager, Oracle Nilay Panchal, Product Manager, Oracle 06:00 PM - 06:45 PM Building Apps with an Autonomous Multi-model Database in the Cloud 06:00 PM - 06:45 PM Moscone South - Room 309 Oracle Autonomous Database provides full support for relational data and non-relational data such as JSON, XML, text, and spatial. It brings new capabilities, delivers dramatically faster performance, moves more application logic and analytics into the database, provides cloud-ready JSON and REST services to simplify application development, and enables analysis on dramatically larger datasets—making it ideally suited for the most advanced multi-model cloud-based applications. Learn more in this session. SPEAKERS: Keith Laker, Senior Principal Product Manager, Oracle Nilay Panchal, Product Manager, Oracle Oracle Autonomous Database Product Management Team Designed by: Keith Laker Senior Principal Product Manager | Analytic SQL and Autonomous Database Oracle Database Product Management Scotscroft, Towers Business Park, Wilmslow Road, Didsbury. M20 2RY.
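Both Ask TOM Theater slots in the agenda above cover the match_recognize clause, and a tiny worked example makes it much easier to picture what those sessions are about. The following sketch is the classic "V-shape in a price series" pattern, run from Python with the cx_Oracle driver; the STOCK_PRICES table, its columns and the connection details are illustrative assumptions, not material from the sessions themselves.

# Minimal sketch: the classic MATCH_RECOGNIZE example - find "V" shapes
# (a run of falling prices followed by a run of rising prices) per symbol.
# The STOCK_PRICES table (symbol, trade_date, price) is an illustrative assumption.
import cx_Oracle

conn = cx_Oracle.connect("demo_user", "demo_password", "myadw_high")
cur = conn.cursor()
cur.execute("""
    SELECT *
    FROM   stock_prices
    MATCH_RECOGNIZE (
      PARTITION BY symbol
      ORDER BY trade_date
      MEASURES STRT.trade_date       AS start_date,
               LAST(DOWN.trade_date) AS bottom_date,
               LAST(UP.trade_date)   AS end_date
      ONE ROW PER MATCH
      AFTER MATCH SKIP TO LAST UP
      PATTERN (STRT DOWN+ UP+)
      DEFINE DOWN AS down.price < PREV(down.price),
             UP   AS up.price > PREV(up.price)
    )""")
for symbol, start_date, bottom_date, end_date in cur:
    print(symbol, start_date, "->", bottom_date, "->", end_date)

Writing the same V-shape search with self-joins or analytic functions alone takes far more SQL and is much harder to read, which is exactly the gap the sessions describe.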


Autonomous

Where to find us in the demogrounds at #oow19....

If you are at OpenWorld tomorrow then I expect that you have lots of questions that need answering. Well, the demogrounds is the place for that! Our development teams will be at the demogrounds all week ready to answer all your questions. The demo area is located in the base of Moscone South and it will formally open at 10:00 am tomorrow (Monday) morning. The full list of opening times is: Monday 10:00am, Tuesday 10:30am, Wednesday 10:00am. The full agenda for #OOW19 is here: https://www.oracle.com/openworld/agenda.html. So where can you find us? And by "us" I mean the analytic SQL, parallel execution, PL/SQL and Autonomous Data Warehouse development teams. Well, here is some information to guide you to the right area: Got questions about Analytics SQL, parallel execution and PL/SQL? You need to find demo booth ADB-014. Which probably isn't that helpful really! So here is a short video showing you how to find the Analytics SQL, parallel execution and PL/SQL demo booth once you have got yourself to Moscone South and come down the moving stairway/escalator...then follow in my footsteps and we will be ready and waiting for you! Got questions about Autonomous Data Warehouse and Oracle Analytics Cloud? Well, these booths are in a different area so you need to take a slightly different route, which is outlined in a second short video. Hopefully that's all clear now! Enjoy OpenWorld - we are looking forward to answering your questions.


Autonomous

Autonomous Monday at OpenWorld 2019 - List of Must See Sessions

Here you go folks...OpenWorld is so big this year that, to help you get the most from Monday, I have put together a cheat sheet listing all the best sessions. Enjoy your Monday and make sure you drink lots of water! If you want this agenda on your phone (iPhone or Android) then check out our smartphone web app by clicking here. AGENDA - MONDAY 08:30AM Moscone West - Room 3019 RESTful Services with Oracle REST Data Services and Oracle Autonomous Database Jeff Smith, Senior Principal Product Manager, Oracle Ashley Chen, Senior Product Manager, Oracle Colm Divilly, Consulting Member of Technical Staff, Oracle Elizabeth Saunders, Principal Technical Staff, Oracle 08:30 AM - 09:30 AM RESTful Services with Oracle REST Data Services and Oracle Autonomous Database 08:30 AM - 09:30 AM Moscone West - Room 3019 In this session learn to develop and deploy a RESTful service using Oracle SQL Developer, Oracle REST Data Services, and Oracle Autonomous Database. Then connect these services as data sources to different Oracle JavaScript Extension Toolkit visualization components to quickly build rich HTML5 applications using a free and open source JavaScript framework. SPEAKERS:Jeff Smith, Senior Principal Product Manager, Oracle Ashley Chen, Senior Product Manager, Oracle Colm Divilly, Consulting Member of Technical Staff, Oracle Elizabeth Saunders, Principal Technical Staff, Oracle 09:00AM Moscone South - Room 201 Oracle Database 19c for Developers Chris Saxon, Developer Advocate, Oracle 09:00 AM - 09:45 AM Oracle Database 19c for Developers 09:00 AM - 09:45 AM Moscone South - Room 201 Another year brings another new release: Oracle Database 19c. As always, this brings a host of goodies to help developers. This session gives an overview of the best features for developers in recent releases. It covers JSON support, SQL enhancements, and PL/SQL improvements. It also shows you how to use these features to write faster, more robust, more secure applications with Oracle Database. If you’re a developer or a DBA who regularly writes SQL or PL/SQL and wants to keep up to date, this session is for you. SPEAKERS:Chris Saxon, Developer Advocate, Oracle Moscone South - Room 156B A Unified Platform for All Data Peter Jeffcock, Big Data and Data Science, Cloud Business Group, Oracle Aali Masood, Senior Director, Big Data Go-To-Market, Oracle 09:00 AM - 09:45 AM A Unified Platform for All Data 09:00 AM - 09:45 AM Moscone South - Room 156B Seamless access to all your data, no matter where it’s stored, should be a given. But the proliferation of new data storage technologies has led to new kinds of data silo. In this session learn how to build an information architecture that will stand the test of time and make all data available to the business, whether it’s in Oracle Database, object storage, Hadoop, or elsewhere.
SPEAKERS:Peter Jeffcock, Big Data and Data Science, Cloud Business Group, Oracle Aali Masood, Senior Director, Big Data Go-To-Market, Oracle Moscone South - Room 210 Oracle Data Integration Cloud: The Future of Data Integration Now Chai Pydimukkala, Senior Director of Product Management – Data Integration Cloud, Oracle Arun Patnaik, Vice President, Architect - Data Integration Cloud, Oracle 09:00 AM - 09:45 AM Oracle Data Integration Cloud: The Future of Data Integration Now 09:00 AM - 09:45 AM Moscone South - Room 210 The rapid adoption of enterprise cloud–based solutions brings new data integration challenges as data moves from ground-to-cloud and cloud-to-cloud. Join this session led by Oracle Product Management to hear about the Oracle Data Integration Cloud strategy, vision, and roadmap to solve data needs now and into the future. Explore Oracle’s next-generation data integration cloud services and learn about the evolution of data integration in the cloud with a practical path forward for customers using Oracle Data Integration Platform Cloud, Oracle Data Integrator, Oracle GoldenGate, Oracle Enterprise Metadata Management, and more. Learn the crucial role of data integration in successful cloud implementations. SPEAKERS:Chai Pydimukkala, Senior Director of Product Management – Data Integration Cloud, Oracle Arun Patnaik, Vice President, Architect - Data Integration Cloud, Oracle Moscone West - Room 3016 All Analytics, All Data: No Nonsense Shyam Varan Nath, Director IoT & Cloud, BIWA User Group Dan Vlamis, CEO - President, VLAMIS SOFTWARE SOLUTIONS INC Charlie Berger, Sr. Director Product Management, Machine Learning, AI and Cognitive Analytics, Oracle 09:00 AM - 09:45 AM All Analytics, All Data: No Nonsense 09:00 AM - 09:45 AM Moscone West - Room 3016 In this session learn from people who have implemented real-world solutions using Oracle's business analytics, machine learning, and spatial and graph technologies. Several Oracle BI, Data Warehouse, and Analytics (BIWA) User Community customers explain how they use Oracle's analytics and machine learning, cloud, autonomous, and on-premises applications to extract more information from their data. Come join the community and learn from people that have been there and done that. SPEAKERS:Shyam Varan Nath, Director IoT & Cloud, BIWA User Group Dan Vlamis, CEO - President, VLAMIS SOFTWARE SOLUTIONS INC Charlie Berger, Sr. Director Product Management, Machine Learning, AI and Cognitive Analytics, Oracle 10:00AM YBCA Theater SOLUTION KEYNOTE: Autonomous Data Management Andrew Mendelsohn, Executive Vice President Database Server Technologies, Oracle 10:00 AM - 11:15 AM SOLUTION KEYNOTE: Autonomous Data Management 10:00 AM - 11:15 AM YBCA Theater Oracle is proven to be the database of choice for managing customers' operational and analytical workloads on-premises and in the cloud, with truly unique and innovative autonomous database services. In his annual Oracle OpenWorld address, Oracle Executive Vice President Andy Mendelsohn discusses what's new and what's coming next from the Database Development team.
SPEAKERS:Andrew Mendelsohn, Executive Vice President Database Server Technologies, Oracle Moscone South - Room 210 Oracle Data Integration Cloud: Data Catalog Service Deep Dive Denis Gray, Senior Director of Product Management – Data Integration Cloud, Oracle Abhiram Gujjewar, Director Product Management, OCI, Oracle 10:00 AM - 10:45 AM Oracle Data Integration Cloud: Data Catalog Service Deep Dive 10:00 AM - 10:45 AM Moscone South - Room 210 The new Oracle Data Catalog enables transparency, traceability, and trust in enterprise data assets in Oracle Cloud and beyond. Join this Oracle Product Management–led session to learn how to get control of your enterprise’s data, metadata, and data movement. The session includes a walkthrough of enterprise metadata search, business glossary, data lineage, automatic metadata harvesting, and native metadata integration with Oracle Big Data and Oracle Autonomous Database. Learn the importance of a metadata catalog and overall governance for big data, data lakes, and data warehouses as well as metadata integration for PaaS services. See how Oracle Data Catalog enables collaboration across enterprise data personas. SPEAKERS:Denis Gray, Senior Director of Product Management – Data Integration Cloud, Oracle Abhiram Gujjewar, Director Product Management, OCI, Oracle Moscone West - Room 3021 Hands-on Lab: Oracle Big Data SQL Martin Gubar, Director of Product Management, Big Data and Autonomous Database, Oracle Eric Vinck, Principal Sales Consultant, EMEA Oracle Solution Center, Oracle 10:00 AM - 11:00 AM Hands-on Lab: Oracle Big Data SQL 10:00 AM - 11:00 AM Moscone West - Room 3021 Modern data architectures encompass streaming data (e.g. Kafka), Hadoop, object stores, and relational data. Many organizations have significant experience with Oracle Databases, both from a deployment and skill set perspective. This hands-on lab walks through how to leverage that investment. Learn how to extend Oracle Database to query across data lakes (Hadoop and object stores) and streaming data while leveraging Oracle Database security policies. SPEAKERS:Martin Gubar, Director of Product Management, Big Data and Autonomous Database, Oracle Eric Vinck, Principal Sales Consultant, EMEA Oracle Solution Center, Oracle Moscone West - Room 3023 Hands-on Lab: Oracle Multitenant John Mchugh, Senior Principal Product Manager, Oracle Thomas Baby, Architect, Oracle Patrick Wheeler, Senior Director, Product Management, Oracle 10:00 AM - 11:00 AM Hands-on Lab: Oracle Multitenant 10:00 AM - 11:00 AM Moscone West - Room 3023 This is your opportunity to get up close and personal with Oracle Multitenant. In this session learn about a very broad range of Oracle Multitenant functionality in considerable depth. Warning: This lab has been filled to capacity quickly at every Oracle OpenWorld that it has been offered. It is strongly recommended that you sign up early. Even if you're only able to get on the waitlist, it's always worth showing up just in case there's a no-show and you can grab an available seat.
SPEAKERS:John Mchugh, Senior Principal Product Manager, Oracle Thomas Baby, Architect, Oracle Patrick Wheeler, Senior Director, Product Management, Oracle Moscone West - Room 3019 Low-Code Development with Oracle Application Express and Oracle Autonomous Database David Peake, Senior Principal Product Manager, Oracle Marc Sewtz, Senior Software Development Manager, Oracle 10:00 AM - 11:00 AM Low-Code Development with Oracle Application Express and Oracle Autonomous Database 10:00 AM - 11:00 AM Moscone West - Room 3019 Oracle Application Express is a low-code development platform that enables you to build stunning, scalable, secure apps with world-class features that can be deployed anywhere. In this lab start by initiating your free trial for Oracle Autonomous Database and then convert a spreadsheet into a multiuser, web-based, responsive Oracle Application Express application in minutes—no prior experience with Oracle Application Express is needed. Learn how you can use Oracle Application Express to solve many of your business problems that are going unsolved today. SPEAKERS:David Peake, Senior Principal Product Manager, Oracle Marc Sewtz, Senior Software Development Manager, Oracle 11:30AM Moscone West - Room 3023 Hands-on Lab: Oracle Database In-Memory Andy Rivenes, Product Manager, Oracle 11:30 AM - 12:30 PM Hands-on Lab: Oracle Database In-Memory 11:30 AM - 12:30 PM Moscone West - Room 3023 Oracle Database In-Memory introduces an in-memory columnar format and a new set of SQL execution optimizations including SIMD processing, column elimination, storage indexes, and in-memory aggregation, all of which are designed specifically for the new columnar format. This lab provides a step-by-step guide on how to get started with Oracle Database In-Memory, how to identify which of the optimizations are being used, and how your SQL statements benefit from them. The lab uses Oracle Database and also highlights the new features available in the latest release. Experience firsthand just how easy it is to start taking advantage of this technology and its performance improvements. SPEAKERS:Andy Rivenes, Product Manager, Oracle Moscone West - Room 3021 Hands-on Lab: Oracle Autonomous Data Warehouse Hermann Baer, Senior Director Product Management, Oracle Keith Laker, Senior Principal Product Manager, Oracle Nilay Panchal, Product Manager, Oracle Yasin Baskan, Senior Principal Product Manager, Oracle 11:30 AM - 12:30 PM Hands-on Lab: Oracle Autonomous Data Warehouse 11:30 AM - 12:30 PM Moscone West - Room 3021 In this hands-on lab discover how you can access, visualize, and analyze lots of different types of data using a completely self-service, agile, and fast service running in the Oracle Cloud: oracle Autonomous Data Warehouse. See how quickly and easily you can discover new insights by blending, extending, and visualizing a variety of data sources to create data-driven briefings on both desktop and mobile browsers—all without the help of IT. It has never been so easy to create visually sophisticated reports that really communicate your discoveries, all in the cloud, all self-service, powered by Oracle Autonomous Data Warehouse. Oracle’s perfect quick-start service for fast data loading, sophisticated reporting, and analysis is for everybody. 
SPEAKERS:Hermann Baer, Senior Director Product Management, Oracle Keith Laker, Senior Principal Product Manager, Oracle Nilay Panchal, Product Manager, Oracle Yasin Baskan, Senior Principal Product Manager, Oracle 12:15PM The Exchange - Ask Tom Theater Using Machine Learning and Oracle Autonomous Database to Target Your Best Customers Charlie Berger, Sr. Director Product Management, Machine Learning, AI and Cognitive Analytics, Oracle 12:15 PM - 12:35 PM Using Machine Learning and Oracle Autonomous Database to Target Your Best Customers 12:15 PM - 12:35 PM The Exchange - Ask Tom Theater Oracle Machine Learning extends Oracle’s offerings in the cloud with its collaborative notebook environment that helps data scientist teams build, share, document, and automate data analysis methodologies that run 100% in Oracle Autonomous Database. In this session learn how to interactively work with your data, and build, evaluate, and apply machine learning models. Import, export, edit, run, and share Oracle Machine Learning notebooks with other data scientists and colleagues, all on Oracle Autonomous Database. SPEAKERS:Charlie Berger, Sr. Director Product Management, Machine Learning, AI and Cognitive Analytics, Oracle Moscone West - Room 3024C Node.js SODA APIs on Oracle Autonomous Database - BYOL Dan Mcghan, Developer Advocate, Oracle 12:30 PM - 02:30 PM Node.js SODA APIs on Oracle Autonomous Database - BYOL 12:30 PM - 02:30 PM Moscone West - Room 3024C PLEASE NOTE: YOU MUST BRING YOUR OWN LAPTOP (BYOL) TO PARTICIPATE IN THIS HANDS-ON LAB. Oracle Database has many features for different types of data, including spatial and graph, XML, text, and SecureFiles. One of the latest incarnations is Simple Oracle Document Access (SODA), which provides a set of NoSQL-style APIs that enable you to create collections of documents (often JSON), retrieve them, and query them—all without any knowledge of SQL. In this hands-on lab, create an Oracle Autonomous Database instance and learn how to connect to it securely. Then use SODA APIs to complete a Node.js-based REST API powering the front end of a “todo” tracking application. Finally learn how to use the latest JSON functions added to the SQL engine to explore and project JSON data relationally. The lab includes an introduction to more-advanced usage and APIs. SPEAKERS:Dan Mcghan, Developer Advocate, Oracle 01:00PM Moscone West - Room 3021 Hands-on Lab: Oracle Essbase Eric Smadja, Oracle Ashish Jain, Product Manager, Oracle Mike Larimer, Oracle 01:00 PM - 02:00 PM Hands-on Lab: Oracle Essbase 01:00 PM - 02:00 PM Moscone West - Room 3021 The good thing about the new hybrid block storage option (BSO) is that many of the concepts that we learned over the years about tuning a BSO cube still have merit. Knowing what a block is, and why it is important, is just as valuable today in hybrid BSO as it was 25 years ago when BSO was introduced. In this lab learn best practices for performance optimization, how to manage data blocks, how dimension ordering still impacts things such as calculation order, the reasons why the layout of a report can impact query performance, how logs will help the Oracle Essbase developer debug calculation flow and query performance, and how new functionality within Smart View will help developers understand and modify the solve order.
SPEAKERS:Eric Smadja, Oracle Ashish Jain, Product Manager, Oracle Mike Larimer, Oracle 01:45PM Moscone South - Room 213 Oracle Multitenant: Seven Sources of Savings Patrick Wheeler, Senior Director, Product Management, Oracle 01:45 PM - 02:30 PM Oracle Multitenant: Seven Sources of Savings 01:45 PM - 02:30 PM Moscone South - Room 213 You've heard that the efficiencies of Oracle Multitenant help reduce capital expenses and operating expenses. How much can you expect to save? Attend this session to understand the many ways Oracle Multitenant can help reduce costs so that you can arrive at an answer to this important question. SPEAKERS:Patrick Wheeler, Senior Director, Product Management, Oracle Moscone South - Room 215/216 Rethink Database IT with Autonomous Database Dedicated Robert Greene, Senior Director, Product Management, Oracle Juan Loaiza, Executive Vice President, Oracle 01:45 PM - 02:30 PM Rethink Database IT with Autonomous Database Dedicated 01:45 PM - 02:30 PM Moscone South - Room 215/216 Larry Ellison recently launched a new dedicated deployment option for Oracle Autonomous Transaction Processing. But what does this mean to your organization and how you achieve your key data management goals? This session provides a clear understanding of how dedicated deployment works and illustrates how it can simplify your approach to data management and accelerate your transition to the cloud. SPEAKERS:Robert Greene, Senior Director, Product Management, Oracle Juan Loaiza, Executive Vice President, Oracle Moscone South - Room 211 Oracle Autonomous Data Warehouse: Update, Strategy, and Roadmap George Lumpkin, Vice President, Product Management, Oracle Keith Laker, Senior Principal Product Manager, Oracle 01:45 PM - 02:30 PM Oracle Autonomous Data Warehouse: Update, Strategy, and Roadmap 01:45 PM - 02:30 PM Moscone South - Room 211 The session provides an overview of Oracle’s product strategy for Oracle Autonomous Data Warehouse. Learn about the capabilities and associated business benefits of recently released features. There is a focus on themes for future releases of the product roadmap as well as the key capabilities being planned. This is a must-attend session for Oracle DBAs, database developers, and cloud architects. SPEAKERS:George Lumpkin, Vice President, Product Management, Oracle Keith Laker, Senior Principal Product Manager, Oracle 01:00PM Moscone West - Room 3019 Building Microservices Using Oracle Autonomous Database Jean De Lavarene, Software Development Director, Oracle Kuassi Mensah, Director, Product Management, Oracle Simon Law, Product Manager, Oracle Pablo Silberkasten, Software Development Manager, Oracle 01:00 PM - 02:00 PM Building Microservices Using Oracle Autonomous Database 01:00 PM - 02:00 PM Moscone West - Room 3019 In this hands-on lab build a Java microservice connecting to Oracle Autonomous Database using the reactive zstreams Ingestor Java library, Helidon SE, Oracle’s JDBC driver, and the connection pool library (UCP). See how to containerize the service with Docker and perform container orchestration using Kubernetes. 
SPEAKERS:Jean De Lavarene, Software Development Director, Oracle Kuassi Mensah, Director, Product Management, Oracle Simon Law, Product Manager, Oracle Pablo Silberkasten, Software Development Manager, Oracle 02:30PM Moscone South - Room 203 Ten Amazing SQL Features Connor Mcdonald, Database Advocate, Oracle 02:30 PM - 03:15 PM Ten Amazing SQL Features 02:30 PM - 03:15 PM Moscone South - Room 203 Sick and tired of writing thousands of lines of middle-tier code and still having performance problems? Let’s become fans once again of the database by being reintroduced to just how powerful the SQL language really is! Coding is great fun, but we do it to explore complex algorithms, build beautiful applications, and deliver fantastic solutions for our customers, not just to do boring data processing. By expanding our knowledge of SQL facilities, we can let all the boring work be handled via SQL rather than a lot of middle-tier code—and get performance benefits as an added bonus. This session highlights some SQL techniques for solving problems that would otherwise require a lot of complex coding. SPEAKERS:Connor Mcdonald, Database Advocate, Oracle Moscone West - Room 3021 Hands-on Lab: Oracle Machine Learning Mark Hornick, Senior Director Data Science and Big Data, Oracle Marcos Arancibia Coddou, Product Manager, Data Science and Big Data, Oracle Charlie Berger, Sr. Director Product Management, Data Science and Big Data, Oracle 02:30 PM - 03:30 PM Hands-on Lab: Oracle Machine Learning 02:30 PM - 03:30 PM Moscone West - Room 3021 In this introductory hands-on-lab, try out the new Oracle Machine Learning Zeppelin-based notebooks that come with Oracle Autonomous Database. Oracle Machine Learning extends Oracle’s offerings in the cloud with its collaborative notebook environment that helps data scientist teams build, share, document, and automate data analysis methodologies that run 100% in Oracle Autonomous Database. Interactively work with your data, and build, evaluate, and apply machine learning models. Import, export, edit, run, and share Oracle Machine Learning notebooks with other data scientists and colleagues. Share and further explore your insights and predictions using the Oracle Analytics Cloud. SPEAKERS:Mark Hornick, Senior Director Data Science and Big Data, Oracle Marcos Arancibia Coddou, Product Manager, Data Science and Big Data, Oracle Charlie Berger, Sr. Director Product Management, Data Science and Big Data, Oracle 02:45PM Moscone South - Room 213 Choosing the Right Database Cloud Service for Your Application Tammy Bednar, Sr. Director of Product Management, Database Cloud Services, Oracle 02:45 PM - 03:30 PM Choosing the Right Database Cloud Service for Your Application 02:45 PM - 03:30 PM Moscone South - Room 213 Oracle Cloud provides automated, customer-managed Oracle Database services in flexible configurations to meet your needs, large or small, with the performance of dedicated hardware. On-demand and highly available Oracle Database, on high-performance bare metal servers or Exadata, is 100% compatible with on-premises Oracle workloads and applications: seamlessly move between the two platforms. Oracle Database Cloud allows you to start at the cost and capability level suitable to your use case and then gives you the flexibility to adapt as your requirements change over time. Join this session to learn about Oracle Database Cloud services. SPEAKERS:Tammy Bednar, Sr. 
Director of Product Management, Database Cloud Services, Oracle Moscone South - Room 211 Roadmap for Oracle Big Data Appliance and Oracle Big Data Cloud Martin Gubar, Director of Product Management, Big Data and Autonomous Database, Oracle Alexey Filanovskiy, Senior Product Manager, Oracle 02:45 PM - 03:30 PM Roadmap for Oracle Big Data Appliance and Oracle Big Data Cloud 02:45 PM - 03:30 PM Moscone South - Room 211 Driven by megatrends like AI and the cloud, big data platforms are evolving. But does this mean it’s time to throw away everything being used right now? This session discusses the impact megatrends have on big data platforms and how you can ensure that you properly invest for a solid future. Topics covered include how platforms such as Oracle Big Data Appliance cater to these trends, as well as the roadmap for Oracle Big Data Cloud on Oracle Cloud Infrastructure. Learn how the combination of these products enable customers and partners to build out a platform for big data and AI, both on-premises and in Oracle Cloud, leveraging the synergy between the platforms and the trends to solve actual business problems. SPEAKERS:Martin Gubar, Director of Product Management, Big Data and Autonomous Database, Oracle Alexey Filanovskiy, Senior Product Manager, Oracle Oracle Autonomous Database Product Management Team Designed by: Keith Laker Senior Principal Product Manager | Analytic SQL and Autonomous Database Oracle Database Product Management Scotscroft, Towers Business Park, Wilmslow Road, Didsbury. M20 2RY.

Here you go folks...OpenWorld is so big this year so to help you get the most from Monday I have put together a cheat sheet listing all the best sessions. Enjoy your Monday and make sure you...

Autonomous Database Now Supports Accessing the Object Storage with OCI Native Authentication

Loading and querying external data from Oracle Cloud Object Storage are among the most common operations performed in Autonomous Database. Accessing Object Storage requires users to have credentials, which they can create via the CREATE_CREDENTIAL procedure of the DBMS_CLOUD package. Object Storage credentials are based on the user's OCI username and an Oracle-generated token string (also known as an 'auth token'). While auth-token-based credentials are still supported, DBMS_CLOUD now supports creating OCI native credentials as well! In this blog post, we are going to cover how to create a native credential and use it in an operation that requires Object Storage authentication. Let's start...

As you may already know, the syntax to create a credential with a username and an auth token in ADB is as follows:

DBMS_CLOUD.CREATE_CREDENTIAL (
  credential_name IN VARCHAR2,
  username        IN VARCHAR2,
  password        IN VARCHAR2 DEFAULT NULL);

The CREATE_CREDENTIAL procedure is now overloaded to provide native authentication with the following syntax:

DBMS_CLOUD.CREATE_CREDENTIAL (
  credential_name IN VARCHAR2,
  user_ocid       IN VARCHAR2,
  tenancy_ocid    IN VARCHAR2,
  private_key     IN VARCHAR2,
  fingerprint     IN VARCHAR2);

In native authentication, the username and password parameters are replaced with the user_ocid, tenancy_ocid, private_key, and fingerprint parameters. user_ocid and tenancy_ocid are pretty self-explanatory: they correspond to the user's and the tenancy's OCIDs respectively (check out "Where to Get the Tenancy's OCID and User's OCID" for more details).

The private_key parameter specifies the generated private key in PEM format, and there are a couple of important details worth mentioning here. Currently, a private key that is created with a passphrase is not supported, so you need to make sure you generate a key with no passphrase (check out "How to Generate an API Signing Key" for more details on how to create a private key with no passphrase). Additionally, the private key that you provide for this parameter should contain only the key itself, without any header or footer (e.g. '-----BEGIN RSA PRIVATE KEY-----', '-----END RSA PRIVATE KEY-----').

Lastly, the fingerprint parameter specifies the fingerprint that can be obtained either after uploading the public key to the console (see "How to Upload the Public Key") or via the OpenSSL commands (see "How to Get the Key's Fingerprint").
Once you gather all the necessary info and generate your private key, your CREATE_CREDENTIAL call should look similar to this:

BEGIN
  DBMS_CLOUD.CREATE_CREDENTIAL (
    credential_name => 'OCI_NATIVE_CRED',
    user_ocid       => 'ocid1.user.oc1..aaaaaaaatfn77fe3fxux3o5lego7glqjejrzjsqsrs64f4jsjrhbsk5qzndq',
    tenancy_ocid    => 'ocid1.tenancy.oc1..aaaaaaaapwkfqz3upqklvmelbm3j77nn3y7uqmlsod75rea5zmtmbl574ve6a',
    private_key     => 'MIIEogIBAAKCAQEAsbNPOYEkxM5h0DF+qXmie6ddo95BhlSMSIxRRSO1JEMPeSta0C7WEg7g8SOSzhIroCkgOqDzkcyXnk4BlOdn5Wm/BYpdAtTXk0sln2DH/GCH7l9P8xC9cvFtacXkQPMAXIBDv/zwG1kZQ7Hvl7Vet2UwwuhCsesFgZZrAHkv4cqqE3uF5p/qHfzZHoevdq4EAV6dZK4Iv9upACgQH5zf9IvGt2PgQnuEFrOm0ctzW0v9JVRjKnaAYgAbqa23j8tKapgPuREkfSZv2UMgF7Z7ojYMJEuzGseNULsXn6N8qcvr4fhuKtOD4t6vbIonMPIm7Z/a6tPaISUFv5ASYzYEUwIDAQABAoIBACaHnIv5ZoGNxkOgF7ijeQmatoELdeWse2ZXll+JaINeTwKU1fIB1cTAmSFv9yrbYb4ubKCJuYZJeC6I92rT6gEiNpr670Pn5n43cwblszcTryWOYQVxAcLkejbPA7jZd6CW5xm/vEgRv5qgADVCzDCzrij0t1Fghicc+EJ4BFvOetnzEuSidnFoO7K3tHGbPgA+DPN5qrO/8NmrBebqezGkOuOVkOA64mp467DQUhpAvsy23RjBQ9iTuRktDB4g9cOdOVFouTZTnevN6JmDxufu9Lov2yvVMkUC2YKd+RrTAE8cvRrn1A7XKkH+323hNC59726jT57JvZ+ricRixSECgYEA508e/alxHUIAU9J/uq98nJY/6+GpI9OCZDkEdBexNpKeDq2dfAo9pEjFKYjH8ERj9quA7vhHEwFL33wk2D24XdZl6vq0tZADNSzOtTrtSqHykvzcnc7nXv2fBWAPIN59s9/oEKIOdkMis9fps1mFPFiN8ro4ydUWuR7B2nM2FWkCgYEAxKs/zOIbrzVLhEVgSH2NJVjQs24S8W+99uLQK2Y06R59L0Sa90QHNCDjB1MaKLanAahP30l0am0SB450kEiUD6BtuNHH8EIxGL4vX/SYeE/AF6tw3DqcOYbLPpN4CxIITF0PLCRoHKxARMZLCJBTMGpxdmTNGyQAPWXNSrYEKFsCgYBp0sHr7TxJ1WtO7gvvvd91yCugYBJAyMBr18YY0soJnJRhRL67A/hlk8FYGjLW0oMlVBtduQrTQBGVQjedEsepbrAcC+zm7+b3yfMb6MStE2BmLPdF32XtCH1bOTJSqFe8FmEWUv3ozxguTUam/fq9vAndFaNre2i08sRfi7wfmQKBgBrzcNHN5odTIV8l9rTYZ8BHdIoyOmxVqM2tdWONJREROYyBtU7PRsFxBEubqskLhsVmYFO0CD0RZ1gbwIOJPqkJjh+2t9SH7Zx7a5iV7QZJS5WeFLMUEv+YbYAjnXK+dOnPQtkhOblQwCEY3Hsblj7Xz7o=',
    fingerprint     => '4f:0c:d6:b7:f2:43:3c:08:df:62:e3:b2:27:2e:3c:7a');
END;
/

PL/SQL procedure successfully completed.

We should now be able to see our new credential in the dba_credentials table:

SELECT owner, credential_name
FROM   dba_credentials
WHERE  credential_name LIKE '%NATIVE%';

OWNER CREDENTIAL_NAME
----- ---------------
ADMIN OCI_NATIVE_CRED

Let's go ahead and create an external table using our new credential:

BEGIN
  DBMS_CLOUD.CREATE_EXTERNAL_TABLE(
    table_name      => 'CHANNELS_EXT',
    credential_name => 'OCI_NATIVE_CRED',
    file_uri_list   => 'https://objectstorage.us-phoenix-1.oraclecloud.com/n/adb/b/bucket_testpdb/o/channels.txt',
    format          => json_object('delimiter' value ','),
    column_list     => 'CHANNEL_ID NUMBER, CHANNEL_DESC VARCHAR2(20), CHANNEL_CLASS VARCHAR2(20), CHANNEL_CLASS_ID NUMBER, CHANNEL_TOTAL VARCHAR2(13), CHANNEL_TOTAL_ID NUMBER');
END;
/

PL/SQL procedure successfully completed.

SELECT count(*) FROM channels_ext;

COUNT(*)
--------
       5

To summarize, in addition to auth-token-based authentication, you can now also use OCI native authentication; the CREATE_CREDENTIAL procedure is overloaded to accommodate both options, as we demonstrated above.
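As a quick sanity check, you can also use the new native credential to list the contents of a bucket before pointing any external tables at it. Here is a minimal sketch using DBMS_CLOUD.LIST_OBJECTS, reusing the example bucket URI from above (substitute your own bucket and credential names):

-- List objects in the bucket with the native credential created above
SELECT object_name, bytes
FROM   DBMS_CLOUD.LIST_OBJECTS(
         'OCI_NATIVE_CRED',
         'https://objectstorage.us-phoenix-1.oraclecloud.com/n/adb/b/bucket_testpdb/o/');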


Big Data

Oracle Big Data SQL 4.0 – Query Server

One of the popular new Big Data SQL features is its Query Server. You can think of Query Server as an Oracle Database 18c query engine that uses the Hive metastore to capture table definitions. Data isn't stored in Query Server; it allows you to access data in Hadoop, NoSQL, Kafka and Object Stores (Oracle Object Store and Amazon S3) using Oracle SQL.

Installation and Configuration

Architecturally, here's what a Big Data SQL deployment with Query Server looks like. There are two parts to the Big Data SQL deployment:

- Query Server is deployed to an edge node of the cluster (eliminating resource contention with services running on the data nodes)
- "Cells" are deployed to data nodes. These cells are responsible for scanning, filtering and processing of data and returning summarized results to Query Server

Query Server setup is handled by Jaguar, the Big Data SQL install utility. As part of the installation, update the installer configuration file, bds-config.json, to simply specify that you want to use Query Server and the host that it should be deployed to (that host should be a "gateway" server). Also, include the Hive databases that should synchronize with Query Server (here we're specifying all):

{
  "edgedb": {
    "node": "your-host.com",
    "enabled": "true",
    "sync_hive_db_list": "*"
  }
}

Jaguar will automatically detect the Hive source and the Hadoop cluster security configuration information and configure Query Server appropriately. Hive metadata will be synchronized with Query Server (either full metadata replacement or incremental updates) using the PL/SQL API (dbms_bdsqs.sync_hive_databases) or through the cluster management framework (see the picture of Cloudera Manager below).

For secure clusters, you will log into Query Server using Kerberos, just like you would access other Hadoop cluster services. Similar to Hive metadata, Kerberos principals can be synchronized through your cluster admin tool (Cloudera Manager or Ambari), Jaguar (jaguar sync_principals) or PL/SQL (DBMS_BDSQS_ADMIN.ADD_KERBEROS_PRINCIPALS and DBMS_BDSQS_ADMIN.DROP_KERBEROS_PRINCIPALS).

Query Your Data

Once your Query Server is deployed, query your data using Oracle SQL. There is a bdsql user that is automatically created, and data is accessible through the bdsqlusr PDB:

sqlplus bdsql@bdsqlusr

You will see schemas defined for all your Hive databases, and external tables within those schemas that map to your Hive tables. The full Oracle SQL language is available to you (queries, not inserts/updates/deletes). Authorization will leverage the underlying privileges set up on the Hadoop cluster; there are no authorization rules to replicate.

You can create new external tables using the Big Data SQL drivers (a sketch using the ORACLE_HDFS driver is shown at the end of this post):

- ORACLE_HIVE - to leverage tables using Hive metadata (note, you probably don't need to do this because the external tables are already available)
- ORACLE_HDFS - to create tables over HDFS data for which there is no Hive metadata
- ORACLE_BIGDATA - to create tables over object store sources

Query Server provides a limited-use Oracle Database license. This allows you to create external tables over sources, but not internal tables. Although there is nothing physically stopping the creation of internal tables, you will find that any internal table created will be deleted when the Query Server restarts.

The beauty of Query Server is that you get to use the powerful Oracle SQL language and the mature Oracle Database optimizer. It means that your existing query applications will be able to use Query Server as they would any other Oracle Database. No need to change your queries to support a less rich query engine. Correlate near real-time Kafka data with information captured in your data lake. Apply advanced SQL functions like Pattern Matching and time series analyses to gain insights from all your data - and watch your insights and productivity soar :-).
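To give a feel for what creating your own external table looks like, here is a minimal, hedged sketch using the ORACLE_HDFS driver; the table name, columns and HDFS path are hypothetical, and the access parameters are kept to the bare minimum:

-- External table over delimited text files sitting in HDFS (no Hive metadata required)
CREATE TABLE movie_log_hdfs (
  click_time TIMESTAMP,
  user_id    NUMBER,
  movie_id   NUMBER,
  rating     NUMBER
)
ORGANIZATION EXTERNAL (
  TYPE ORACLE_HDFS
  DEFAULT DIRECTORY DEFAULT_DIR
  ACCESS PARAMETERS (
    com.oracle.bigdata.fileformat=textfile
  )
  LOCATION ('/user/hive/warehouse/movie_logs/')
)
REJECT LIMIT UNLIMITED;

-- Once created, it is queried like any other Oracle table
SELECT movie_id, COUNT(*) AS views
FROM   movie_log_hdfs
GROUP  BY movie_id;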


Big Data

Oracle Big Data SQL 4.0 - Great New Performance Feature

Big Data SQL 4.0 introduces a data processing enhancement that can have a dramatic impact on query performance: distributed aggregation using in-memory capabilities. Big Data SQL has always done a great job of filtering data on the Hadoop cluster. It does this using the following optimizations: 1) column projection, 2) partition pruning, 3) storage indexes, and 4) predicate pushdown.

Column projection is the first optimization. If your table has 200 columns and you are only selecting one, then only a single column's data will be transferred from the Big Data SQL Cell on the Hadoop cluster to the Oracle Database. This optimization is applied to all file types - CSV, Parquet, ORC, Avro, etc.

The image below shows the other parts of the data elimination steps. Let's say you are querying a 100TB data set.

- Partition Pruning: Hive partitions data by a table's column(s). If you have two years of data and your table is partitioned by day, and the query is only selecting 2 months, then in this example 90% of the data will be "pruned", or not scanned.
- Storage Index: SIs are a fine-grained data elimination technique. Statistics are collected for each file's data blocks based on query usage patterns, and these statistics are used to determine whether or not it's possible that data for the given query is contained within that block. If the data does not exist in that block, then the block is not scanned (remember, a block can represent a significant amount of data - oftentimes 128MB). This information is automatically maintained and stored in a lightweight, in-memory structure.
- Predicate Pushdown: Certain file types - like Parquet and ORC - are really database files. Big Data SQL is able to push predicates into those files and only retrieve the data that meets the query criteria.

Once those scan elimination techniques are applied, Big Data SQL Cells will process and filter the remaining data, returning the results to the database.

In-Memory Aggregation

In-memory aggregation has the potential to dramatically speed up queries. Prior to Big Data SQL 4.0, Oracle Database performed the aggregation over the filtered data sets that were returned by Big Data SQL Cells. With in-memory aggregation, summary computations are run across the Hadoop cluster data nodes. The massive compute power of the cluster is used to perform aggregations.

Below, detailed activity is captured at the customer location level; the query is asking for a summary of activity by region and month (a sketch of this kind of query is shown at the end of this post). When the query is executed, processing is distributed to each data node on the Hadoop cluster. Data elimination techniques and filtering are applied, and then each node will aggregate the data up to region/month. This aggregated data is then returned to the database tier from each cell, and the database then completes the aggregation and applies other functions.

Big Data SQL is using an extension to the in-memory aggregation functionality offered by Oracle Database. Check out the documentation for details on the capabilities and where you can expect a good performance gain. The results can be rather dramatic, as illustrated by the chart found below. This test compares running the same queries with aggregation offload disabled and then enabled. It shows 1) a simple, single-table "count(*)" query, 2) a query against a single table that performs a group by, and 3) a query that joins a dimension table to a fact table. The second and third examples also show increasing the number of columns accessed by the query.
In this simple test, performance improved from 13x to 36x :-). Lots of great new capabilities in Big Data SQL 4.0.  This one may be my favorite :-).
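To make the region/month example above concrete, here is a hedged sketch of the kind of query that benefits from aggregation offload; the table and column names are hypothetical. With the feature enabled, each cell computes its local region/month partial aggregates and only those summaries are shipped back to the database, which then finalizes the result.

-- Detail rows live on the Hadoop cluster; each Big Data SQL cell returns
-- partial SUMs per (region, month) instead of raw customer-location rows.
SELECT d.region,
       TRUNC(f.activity_date, 'MM') AS activity_month,
       SUM(f.amount)                AS total_amount
FROM   customer_activity_ext f
JOIN   customer_locations d ON d.location_id = f.location_id
GROUP  BY d.region, TRUNC(f.activity_date, 'MM');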


Autonomous

SQL Developer Web comes to Autonomous Data Warehouse - oh YES!

If you log into your cloud console and create a new autonomous data warehouse, or if you have an existing data warehouse instance, then there is great news - you can now launch SQL Developer Web directly from the service console. There is no need to download and install the full desktop version of SQL Developer anymore. If you want a quick overview of this feature then there is a great video by Jeff Smith (Oracle Product Manager for SQL Developer) on YouTube: https://www.youtube.com/watch?v=asHlUW-Laxk. In the video Jeff gives an overview and a short demonstration of this new UI.

ADMIN-only Access

First off, straight out of the box, only the ADMIN user can access SQL Developer Web - which makes perfect sense when you think about it! Therefore, the ADMIN user is always going to be the first person to connect to SQL Dev Web, and then they enable access for other users/schemas as required. A typical autonomous workflow will look something like this:

- Create a new ADW instance
- Open the Service Console
- Connect to SQL Dev Web as the ADMIN user
- Enable each schema/user via the ords_admin.enable_schema package
- Send the schema-specific URL to each developer

Connecting as the ADMIN user

From the Administration tab on the service console you will see that we added two new buttons - one to access APEX (more information here) and one to access SQL Developer Web. As this is on the Administration tab, the link for SQL Developer Web, not surprisingly, provides a special admin-only URL which, once you are logged in as the admin user, brings you to the home screen, and the admin user has some additional features enabled for monitoring their autonomous data warehouse via the hamburger menu in the top left corner.

The Dashboard view displays general status information about the data warehouse:

- Database Status: Displays the overall status of the database.
- Alerts: Displays the number of Error alerts in the alert log.
- Database Storage: Displays how much storage is being used by the database.
- Sessions: Displays the status of open sessions in the database.
- Physical IO Panel: Displays the rates of physical reads and writes of database data.
- Waits: Displays how many wait events are occurring in the database for various reasons.
- Quick Links: Provides buttons to open the Worksheet and Data Modeler. It also provides a button to open the Oracle Application Express sign-in page for the current database.

Home Page View

This page has some cool features - there is a timeline that tracks when objects got added to the database, and there is an associated quick-glance view that shows the status of those objects so you know, if it's a table, whether it has been automatically analyzed and the stats are up to date.

Enabling the users/schemas

To allow a developer to access their schema and log in, the ADMIN user has to run a small PL/SQL script to enable the schema; that process is outlined here: https://docs.oracle.com/en/database/oracle/sql-developer-web/19.1/sdweb/about-sdw.html#GUID-A79032C3-86DC-4547-8D39-85674334B4FE (a sketch of the call is included just below). Once that's done the ADMIN user can provide the developer with their personal URL to access SQL Developer Web. Essentially, this developer URL is the same as the URL the ADMIN user gets from the service console, but with the /admin/ segment of the URL replaced by the /schema-alias/ specified during the "enable-user-access" step. The doc lays this out very nicely.

Guided Demo

Overall, adding SQL Dev Web to Autonomous Data Warehouse is going to make life so much easier for DBAs and developers.
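For reference, here is a minimal sketch of that enable-schema call; the schema name and URL alias below are hypothetical, and the linked documentation has the full parameter list:

BEGIN
  -- Run as ADMIN: expose the DWUSER schema in SQL Developer Web
  -- under the URL alias 'dwuser' (both names are just examples).
  ORDS_ADMIN.ENABLE_SCHEMA(
    p_enabled             => TRUE,
    p_schema              => 'DWUSER',
    p_url_mapping_type    => 'BASE_PATH',
    p_url_mapping_pattern => 'dwuser',
    p_auto_rest_auth      => TRUE);
  COMMIT;
END;
/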
For most tasks, SQL Developer Web can now be the go-to interface for working in the database, which means you don't have to download and install a desktop tool (which in most corporate environments creates all sorts of problems due to locked-down Windows and Mac desktops).

Where to get more information

When it comes to SQL Developer there is only one URL you need, and it belongs to Jeff Smith, the product manager for SQL Developer: https://www.thatjeffsmith.com/. Jeff's site contains everything you could ever want to know about using SQL Developer Desktop and SQL Developer Web. There are overview videos, tutorial videos, feature videos, tips & tricks and much more. Have fun with SQL Developer Web and Autonomous Data Warehouse!


Autonomous

APEX comes to Autonomous Data Warehouse - oh YES!

A big "Autonomous Welcome" to all our APEX developers, because your favorite low-code development environment is now built into Autonomous Data Warehouse. And before you ask - YES, if you have existing autonomous data warehouse instances you will find an APEX launch button got added to the Admin tab on your service console (see the screen capture below).

APEX comes to ADW. YES!

APEX is now bundled with Autonomous Data Warehouse (even existing data warehouse instances have been updated). What does this mean? It means that you now have free access to Oracle's premier low-code development platform: Application Express (APEX). This provides a low-code development environment that enables customers and partners to build stunning, scalable, secure apps with world-class features, fully supported by Autonomous Database.

As an application developer you can now benefit from a simple but very powerful development platform powered by an autonomous database. It's the perfect combination: low-code development meets zero-management database. You can focus on building rich, sophisticated applications with APEX and the database will take care of itself. There are plenty of great use cases for APEX combined with Autonomous Database, and from a data warehouse perspective two key ones stand out:

1) A replacement for a data mart built around spreadsheets

We have all done it at some point in our careers - used spreadsheets to build business-critical applications and reporting systems. We all know this approach is simply a disaster waiting to happen! Yet almost every organization utilizes spreadsheets to share and report on data. Why? Because spreadsheets are so easy to create - anyone can put together a spreadsheet once they have the data. Once created, they often send it out to colleagues who then tweak the data and pass it on to other colleagues, and so forth. This inevitably leads to numerous copies with different data and totally flawed business processes. A far better solution is to have a single source of truth stored in a fully secured database with a browser-based app that everyone can use to maintain the data. Fortunately, you now have one! Using Autonomous Data Warehouse and APEX, any user can go from a spreadsheet to a web app in a few clicks. APEX provides a very powerful but easy-to-use wizard that in just a few clicks can transform your spreadsheet into a fully populated table in Oracle Autonomous Data Warehouse, complete with a fully functioning app with a report and form for maintaining the data.

One of the key benefits of switching to APEX is that your data becomes completely secure. Autonomous Data Warehouse automatically encrypts data at rest and in transit, you can apply data masking profiles to any sensitive data that you share with others, and Oracle takes care of making sure you have all the very latest security patches applied. Lastly, all your data is automatically backed up.

2) Sharing external data with partners and customers

Many data warehouses make it almost impossible to share data with partners. This can make it very hard to improve your business processes. Providing an app to enable your customers to interact with you and see the same data sets can greatly improve customer satisfaction and lead to repeat business. However, you don't want to expose your internal systems on the Internet, and you have concerns about security, denial-of-service attacks, and web site uptime. By combining Autonomous Data Warehouse with APEX you can now safely develop public-facing apps.
Getting Started with APEX!

Getting started with APEX is really easy. Below you will see that I have put together a quick animation which guides you through the process of logging in to your APEX workspace from Autonomous Data Warehouse. What you see above is the process of logging in to APEX for the first time. In this situation you connect as the ADMIN user to the reserved workspace called "INTERNAL". Once you log in you will be required to create a new workspace and assign a user to that workspace to get things set up. In the above screenshots a new workspace called GKL is created for the user GKL. At that point everything becomes fully focused on APEX and your Autonomous Data Warehouse just fades into the background, taking care of itself. It could not be simpler!

Learn More about APEX

If you are completely new to APEX then I would recommend jumping over to the dedicated Application Express website - apex.oracle.com. On this site you will find that the APEX PM team has put together a great 4-step process to get you up and running with APEX: https://apex.oracle.com/en/learn/getting-started/ - quick note: obviously, you can skip step 1, which covers how to request an environment on our public APEX service, because you have your own dedicated environment within your very own Autonomous Data Warehouse. Enjoy your new, autonomous APEX-enabled environment!


Autonomous

There's a minor tweak to our UI - DEDICATED

You may have spotted from all the recent online news headlines and social media activity that we launched a new service for transactional workloads - ATP Dedicated. It allows organizations to rethink how they deliver database IT, enabling a customizable private database cloud in the public cloud. Obviously this does not affect you if you are using Autonomous Data Warehouse, but it does have a subtle impact because our UI has had to change slightly. You will notice in the top left corner of the main console page we now have three types of services:

- Autonomous Database
- Autonomous Container Database
- Autonomous Exadata Infrastructure

From a data warehouse perspective you are only interested in the first one in that list: Autonomous Database. In the main table that lists all your instances you can see there is a new column headed "Dedicated Infrastructure". For ADW, this will always show "No", as you can see below.

If you create a new ADW you will notice that the pop-up form has now been replaced by a full-width page to make it easier to focus on the fields you need to complete. The new auto-scaling feature is still below the CPU Core Count box (for more information about auto scaling with ADW see this blog post).

…and that's about it for this modest little tweak to our UI. So nothing major, just a subtle change visible when you click on the "Transaction Processing" box. Moving on...


Autonomous

How to Create a Database Link from an Autonomous Data Warehouse to a Database Cloud Service Instance

Autonomous Data Warehouse (ADW) now supports outgoing database links to any database that is accessible from an ADW instance, including Database Cloud Service (DBCS) and other ADW/ATP instances. To use database links with ADW, the target database must be configured to use TCP/IP with SSL (TCPS) authentication. Since both ADW and ATP use TCPS authentication by default, setting up a database link between these services is pretty easy and takes only a few steps. We covered the ADB-to-ADB linking process in the first of this two-part series of blog posts about using database links, see Making Database Links from ADW to other Databases. That post explained the simplest use case to configure and use. On the other hand, enabling TCPS authentication in a database that doesn't have it configured (e.g. in DBCS) requires some additional steps that need to be followed carefully. In this blog post, I will demonstrate how to create a database link from an ADW instance to a DBCS instance, including the steps to enable TCPS authentication. Here is an outline of the steps that we are going to follow:

- Enable TCPS authentication in DBCS
- Connect to the DBCS instance from a client via TCPS
- Create a DB link from ADW to DBCS
- Create a DB link from DBCS to ADW (optional)

Enable TCPS Authentication in DBCS

A DBCS instance uses the TCP/IP protocol by default. Configuring TCPS in DBCS involves several steps that need to be performed manually. Since we are going to modify the default listener to use TCPS and it's configured under the grid user, we will be using both the oracle and grid users. Here are the steps needed to enable TCPS in DBCS:

- Create wallets with self-signed certificates for server and client
- Exchange certificates between server and client wallets (export/import certificates)
- Add the wallet location to the server and client network files
- Add a TCPS endpoint to the database listener

Create wallets with self-signed certificates for server and client

As part of enabling TCPS authentication, we need to create individual wallets for the server and the client. Each of these wallets has to have its own certificate, and the two sides exchange certificates with one another. For the sake of this example, I will be using self-signed certificates. The client wallet and certificate can be created on the client side; however, I'll be creating my client wallet and certificate on the server and moving them to my local system later on. See Configuring Secure Sockets Layer Authentication for more information. Let's start...

Set up wallet directories with the root user:

[root@dbcs0604 u01]$ mkdir -p /u01/server/wallet
[root@dbcs0604 u01]$ mkdir -p /u01/client/wallet
[root@dbcs0604 u01]$ mkdir /u01/certificate
[root@dbcs0604 /]# chown -R oracle:oinstall /u01/server
[root@dbcs0604 /]# chown -R oracle:oinstall /u01/client
[root@dbcs0604 /]# chown -R oracle:oinstall /u01/certificate

Create a server wallet with the oracle user:

[oracle@dbcs0604 ~]$ cd /u01/server/wallet/
[oracle@dbcs0604 wallet]$ orapki wallet create -wallet ./ -pwd Oracle123456 -auto_login
Oracle PKI Tool Release 18.0.0.0.0 - Production
Version 18.1.0.0.0
Copyright (c) 2004, 2017, Oracle and/or its affiliates. All rights reserved.
Operation is successfully completed.

Create a server certificate with the oracle user:

[oracle@dbcs0604 wallet]$ orapki wallet add -wallet ./ -pwd Oracle123456 -dn "CN=dbcs" -keysize 1024 -self_signed -validity 3650 -sign_alg sha256
Oracle PKI Tool Release 18.0.0.0.0 - Production
Version 18.1.0.0.0
Copyright (c) 2004, 2017, Oracle and/or its affiliates. All rights reserved.
Operation is successfully completed.

Create a client wallet with the oracle user:

[oracle@dbcs0604 wallet]$ cd /u01/client/wallet/
[oracle@dbcs0604 wallet]$ orapki wallet create -wallet ./ -pwd Oracle123456 -auto_login
Oracle PKI Tool Release 18.0.0.0.0 - Production
Version 18.1.0.0.0
Copyright (c) 2004, 2017, Oracle and/or its affiliates. All rights reserved.
Operation is successfully completed.

Create a client certificate with the oracle user:

[oracle@dbcs0604 wallet]$ orapki wallet add -wallet ./ -pwd Oracle123456 -dn "CN=ctuzla-mac" -keysize 1024 -self_signed -validity 3650 -sign_alg sha256
Oracle PKI Tool Release 18.0.0.0.0 - Production
Version 18.1.0.0.0
Copyright (c) 2004, 2017, Oracle and/or its affiliates. All rights reserved.
Operation is successfully completed.

Exchange certificates between server and client wallets (export/import certificates)

Export the server certificate with the oracle user:

[oracle@dbcs0604 wallet]$ cd /u01/server/wallet/
[oracle@dbcs0604 wallet]$ orapki wallet export -wallet ./ -pwd Oracle123456 -dn "CN=dbcs" -cert /tmp/server.crt
Oracle PKI Tool Release 18.0.0.0.0 - Production
Version 18.1.0.0.0
Copyright (c) 2004, 2017, Oracle and/or its affiliates. All rights reserved.
Operation is successfully completed.

Export the client certificate with the oracle user:

[oracle@dbcs0604 wallet]$ cd /u01/client/wallet/
[oracle@dbcs0604 wallet]$ orapki wallet export -wallet ./ -pwd Oracle123456 -dn "CN=ctuzla-mac" -cert /tmp/client.crt
Oracle PKI Tool Release 18.0.0.0.0 - Production
Version 18.1.0.0.0
Copyright (c) 2004, 2017, Oracle and/or its affiliates. All rights reserved.
Operation is successfully completed.

Import the client certificate into the server wallet with the oracle user:

[oracle@dbcs0604 wallet]$ cd /u01/server/wallet/
[oracle@dbcs0604 wallet]$ orapki wallet add -wallet ./ -pwd Oracle123456 -trusted_cert -cert /tmp/client.crt
Oracle PKI Tool Release 18.0.0.0.0 - Production
Version 18.1.0.0.0
Copyright (c) 2004, 2017, Oracle and/or its affiliates. All rights reserved.
Operation is successfully completed.

Import the server certificate into the client wallet with the oracle user:

[oracle@dbcs0604 wallet]$ cd /u01/client/wallet/
[oracle@dbcs0604 wallet]$ orapki wallet add -wallet ./ -pwd Oracle123456 -trusted_cert -cert /tmp/server.crt
Oracle PKI Tool Release 18.0.0.0.0 - Production
Version 18.1.0.0.0
Copyright (c) 2004, 2017, Oracle and/or its affiliates. All rights reserved.
Operation is successfully completed.

Change permissions for the server wallet with the oracle user

We need to set the permissions for the server wallet so that it can be accessed when we restart the listener after enabling the TCPS endpoint:

[oracle@dbcs0604 wallet]$ cd /u01/server/wallet
[oracle@dbcs0604 wallet]$ chmod 640 cwallet.sso

Add the wallet location to the server and client network files

Creating server and client wallets with self-signed certificates and exchanging certificates were the initial steps towards the TCPS configuration. We now need to modify both the server and client network files so that they point to their corresponding wallet location and are ready to use the TCPS protocol. Here's how those files look in my case:

Server-side $ORACLE_HOME/network/admin/sqlnet.ora under the grid user:

# sqlnet.ora Network Configuration File: /u01/app/18.0.0.0/grid/network/admin/sqlnet.ora
# Generated by Oracle configuration tools.
NAMES.DIRECTORY_PATH= (TNSNAMES, EZCONNECT)

wallet_location =
  (SOURCE=
    (METHOD=File)
    (METHOD_DATA=
      (DIRECTORY=/u01/server/wallet)))

SSL_SERVER_DN_MATCH=(ON)

Server-side $ORACLE_HOME/network/admin/listener.ora under the grid user:

wallet_location =
  (SOURCE=
    (METHOD=File)
    (METHOD_DATA=
      (DIRECTORY=/u01/server/wallet)))

LISTENER=(DESCRIPTION=(ADDRESS_LIST=(ADDRESS=(PROTOCOL=IPC)(KEY=LISTENER))))  # line added by Agent
ASMNET1LSNR_ASM=(DESCRIPTION=(ADDRESS_LIST=(ADDRESS=(PROTOCOL=IPC)(KEY=ASMNET1LSNR_ASM))))  # line added by Agent
ENABLE_GLOBAL_DYNAMIC_ENDPOINT_ASMNET1LSNR_ASM=ON  # line added by Agent
VALID_NODE_CHECKING_REGISTRATION_ASMNET1LSNR_ASM=SUBNET  # line added by Agent
ENABLE_GLOBAL_DYNAMIC_ENDPOINT_LISTENER=ON  # line added by Agent
VALID_NODE_CHECKING_REGISTRATION_LISTENER=SUBNET  # line added by Agent

Server-side $ORACLE_HOME/network/admin/tnsnames.ora under the oracle user:

# tnsnames.ora Network Configuration File: /u01/app/oracle/product/18.0.0.0/dbhome_1/network/admin/tnsnames.ora
# Generated by Oracle configuration tools.

LISTENER_CDB1 =
  (ADDRESS = (PROTOCOL = TCPS)(HOST = dbcs0604)(PORT = 1521))

CDB1_IAD1W9 =
  (DESCRIPTION =
    (ADDRESS = (PROTOCOL = TCPS)(HOST = dbcs0604)(PORT = 1521))
    (CONNECT_DATA =
      (SERVER = DEDICATED)
      (SERVICE_NAME = cdb1_iad1w9.sub05282047220.vcnctuzla.oraclevcn.com)
    )
    (SECURITY=
      (SSL_SERVER_CERT_DN="CN=dbcs"))
  )

PDB1 =
  (DESCRIPTION =
    (ADDRESS = (PROTOCOL = TCPS)(HOST = dbcs0604)(PORT = 1521))
    (CONNECT_DATA =
      (SERVER = DEDICATED)
      (SERVICE_NAME = pdb1.sub05282047220.vcnctuzla.oraclevcn.com)
    )
    (SECURITY=
      (SSL_SERVER_CERT_DN="CN=dbcs"))
  )

Add a TCPS endpoint to the database listener

Now that we are done with configuring our wallets and network files, we can move on to the next step, which is configuring the TCPS endpoint for the database listener. Since our listener is configured under grid, we will be using the srvctl command to modify and restart it. Here are the steps:

[grid@dbcs0604 ~]$ srvctl modify listener -p "TCPS:1521/TCP:1522"
[grid@dbcs0604 ~]$ srvctl stop listener
[grid@dbcs0604 ~]$ srvctl start listener
[grid@dbcs0604 ~]$ srvctl stop database -database cdb1_iad1w9
[grid@dbcs0604 ~]$ srvctl start database -database cdb1_iad1w9
[grid@dbcs0604 ~]$ lsnrctl status

LSNRCTL for Linux: Version 18.0.0.0.0 - Production on 05-JUN-2019 16:07:24
Copyright (c) 1991, 2018, Oracle. All rights reserved.
Connecting to (DESCRIPTION=(ADDRESS=(PROTOCOL=IPC)(KEY=LISTENER)))
STATUS of the LISTENER
------------------------
Alias                     LISTENER
Version                   TNSLSNR for Linux: Version 18.0.0.0.0 - Production
Start Date                05-JUN-2019 16:05:50
Uptime                    0 days 0 hr. 1 min. 34 sec
Trace Level               off
Security                  ON: Local OS Authentication
SNMP                      OFF
Listener Parameter File   /u01/app/18.0.0.0/grid/network/admin/listener.ora
Listener Log File         /u01/app/grid/diag/tnslsnr/dbcs0604/listener/alert/log.xml
Listening Endpoints Summary...
  (DESCRIPTION=(ADDRESS=(PROTOCOL=ipc)(KEY=LISTENER)))
  (DESCRIPTION=(ADDRESS=(PROTOCOL=tcps)(HOST=10.0.0.4)(PORT=1521)))
  (DESCRIPTION=(ADDRESS=(PROTOCOL=tcp)(HOST=10.0.0.4)(PORT=1522)))
Services Summary...
Service "867e3020a52702dee053050011acf8c0.sub05282047220.vcnctuzla.oraclevcn.com" has 1 instance(s).
  Instance "cdb1", status READY, has 2 handler(s) for this service...
Service "8a8e0ea41ac27e2de0530400000a486a.sub05282047220.vcnctuzla.oraclevcn.com" has 1 instance(s).
  Instance "cdb1", status READY, has 2 handler(s) for this service...
Service "cdb1XDB.sub05282047220.vcnctuzla.oraclevcn.com" has 1 instance(s).
  Instance "cdb1", status READY, has 1 handler(s) for this service...
Service "cdb1_iad1w9.sub05282047220.vcnctuzla.oraclevcn.com" has 1 instance(s).
  Instance "cdb1", status READY, has 2 handler(s) for this service...
Service "pdb1.sub05282047220.vcnctuzla.oraclevcn.com" has 1 instance(s).
  Instance "cdb1", status READY, has 2 handler(s) for this service...
The command completed successfully

Please note that in the first step we added the TCPS endpoint to port 1521 and a TCP endpoint to port 1522 of the default listener. It's also possible to keep port 1521 as is and add a TCPS endpoint to a different port (e.g. 1523).

Connect to the DBCS Instance from a Client via TCPS

We should have TCPS authentication configured now. Before we move on to testing, let's take a look at the client-side network files (please note the public IP address of the DBCS instance in tnsnames.ora):

Client-side tnsnames.ora:

CDB1 =
  (DESCRIPTION =
    (ADDRESS = (PROTOCOL = TCPS)(HOST = 132.145.151.208)(PORT = 1521))
    (CONNECT_DATA =
      (SERVER = DEDICATED)
      (SERVICE_NAME = cdb1_iad1w9.sub05282047220.vcnctuzla.oraclevcn.com)
    )
    (SECURITY=
      (SSL_SERVER_CERT_DN="CN=dbcs"))
  )

PDB1 =
  (DESCRIPTION =
    (ADDRESS = (PROTOCOL = TCPS)(HOST = 132.145.151.208)(PORT = 1521))
    (CONNECT_DATA =
      (SERVER = DEDICATED)
      (SERVICE_NAME = pdb1.sub05282047220.vcnctuzla.oraclevcn.com)
    )
    (SECURITY=
      (SSL_SERVER_CERT_DN="CN=dbcs"))
  )

Client-side sqlnet.ora:

WALLET_LOCATION =
  (SOURCE =
    (METHOD = FILE)
    (METHOD_DATA =
      (DIRECTORY = /Users/cantuzla/Desktop/wallet)
    )
  )

SSL_SERVER_DN_MATCH=(ON)

In order to connect to the DBCS instance from the client, you need to add an ingress rule for the port that you want to use (e.g. 1521) in the security list of your virtual cloud network (VCN) in OCI, as shown below. We can now try to establish a client connection to PDB1 in our DBCS instance (CDB1):

ctuzla-mac:~ cantuzla$ cd Desktop/InstantClient/instantclient_18_1/
ctuzla-mac:instantclient_18_1 cantuzla$ ./sqlplus /nolog

SQL*Plus: Release 18.0.0.0.0 Production on Wed Jun 5 09:39:56 2019
Version 18.1.0.0.0
Copyright (c) 1982, 2018, Oracle. All rights reserved.

SQL> connect c##dbcs/DBcs123_#@PDB1
Connected.
SQL> select * from dual;

D
-
X

Create a DB Link from ADW to DBCS

We now have working TCPS authentication in our DBCS instance. Here are the steps from the documentation that we will follow to create a database link from ADW to DBCS:

Copy the target database wallet (the client wallet cwallet.sso that we created in /u01/client/wallet) to Object Store.

Create credentials to access your Object Store where you store the cwallet.sso. See CREATE_CREDENTIAL Procedure for details.

Upload the target database wallet to the data_pump_dir directory on ADW using DBMS_CLOUD.GET_OBJECT:

SQL> BEGIN
       DBMS_CLOUD.GET_OBJECT(
         credential_name => 'OBJ_STORE_CRED',
         object_uri      => 'https://objectstorage.us-phoenix-1.oraclecloud.com/n/adwctraining8/b/target-wallet/o/cwallet.sso',
         directory_name  => 'DATA_PUMP_DIR');
     END;
     /

PL/SQL procedure successfully completed.

On ADW, create credentials to access the target database. The username and password you specify with DBMS_CLOUD.CREATE_CREDENTIAL are the credentials for the target database that you use to create the database link. Make sure the username consists of all uppercase letters. For this example, I will be using the C##DBCS common user that I created in my DBCS instance:

SQL> BEGIN
       DBMS_CLOUD.CREATE_CREDENTIAL(
         credential_name => 'DBCS_LINK_CRED',
         username        => 'C##DBCS',
         password        => 'DBcs123_#');
     END;
     /

PL/SQL procedure successfully completed.
Create the database link to the target database using DBMS_CLOUD_ADMIN.CREATE_DATABASE_LINK:

SQL> BEGIN
       DBMS_CLOUD_ADMIN.CREATE_DATABASE_LINK(
         db_link_name       => 'DBCSLINK',
         hostname           => '132.145.151.208',
         port               => '1521',
         service_name       => 'pdb1.sub05282047220.vcnctuzla.oraclevcn.com',
         ssl_server_cert_dn => 'CN=dbcs',
         credential_name    => 'DBCS_LINK_CRED');
     END;
     /

PL/SQL procedure successfully completed.

Use the database link you created to access the data on the target database:

SQL> select * from dual@DBCSLINK;

D
-
X

Create a DB Link from DBCS to ADW (Optional)

Although the previous section concludes the purpose of this blog post, here's something extra for those who are interested. In just a couple of additional steps, we can also create a DB link from DBCS to ADW:

Download your ADW wallet.

Upload the wallet to your DBCS instance using sftp or an FTP client.

Unzip the wallet:

[oracle@dbcs0604 ~]$ cd /u01/targetwallet
[oracle@dbcs0604 targetwallet]$ unzip Wallet_adwtuzla.zip
Archive: Wallet_adwtuzla.zip
  inflating: cwallet.sso
  inflating: tnsnames.ora
  inflating: truststore.jks
  inflating: ojdbc.properties
  inflating: sqlnet.ora
  inflating: ewallet.p12
  inflating: keystore.jks

Set the GLOBAL_NAMES parameter to FALSE. This step is very important; if you skip it, your DB link will not work.

SQL> alter system set global_names=FALSE;

System altered.

SQL> sho parameter global

NAME                   TYPE        VALUE
---------------------- ----------- -----------
allow_global_dblinks   boolean     FALSE
global_names           boolean     FALSE
global_txn_processes   integer     1

Create a DB link as follows (notice the my_wallet_directory clause pointing to where we unzipped the ADW wallet):

create database link ADWLINK
connect to ADMIN identified by ************
using '(description=
         (retry_count=20)(retry_delay=3)
         (address=(protocol=tcps)(port=1522)(host=adb.us-ashburn-1.oraclecloud.com))
         (connect_data=(service_name=ctwoqpkdfcuwpsd_adwtuzla_high.adwc.oraclecloud.com))
         (security=(my_wallet_directory=/u01/targetwallet)(ssl_server_cert_dn="CN=adwc.uscom-east-1.oraclecloud.com,OU=Oracle BMCS US,O=Oracle Corporation,L=Redwood City,ST=California,C=US")))';

Database link created.

Use the database link you created to access the data on the target database (your ADW instance in this case):

SQL> select * from dual@ADWLINK;

D
-
X

That's it! In this blog post, we covered how to enable TCPS authentication in DBCS and create an outgoing database link from ADW to our DBCS instance. As bonus content, we also explored how to create a DB link in the opposite direction, that is, from DBCS to ADW. Even though we focused on the DBCS configuration, these steps can be applied when setting up a database link between ADW and any other Oracle database.
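One small housekeeping note that isn't covered above: if you ever need to recreate a link (for example after the target's certificate or connection details change), there is a companion procedure for removing it. A minimal, hedged sketch using the link name from this post:

-- Drop the link created above so it can be recreated with new connection details
BEGIN
  DBMS_CLOUD_ADMIN.DROP_DATABASE_LINK(db_link_name => 'DBCSLINK');
END;
/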


Autonomous

Making Database Links from ADW to other Databases

Autonomous Database now fully supports database links. What does this mean? It means that from within your Autonomous Data Warehouse you can make a connection to any other database (on-premise or in the cloud), including other Autonomous Data Warehouse instances and/or Autonomous Transaction Processing instances. Before I dive into an example, let's take a small step backwards and get a basic understanding of what a database link actually is.

What Are Database Links?

A database link is a pointer that defines a one-way communication path from, in this case, an Autonomous Data Warehouse instance to another database. The link is one-way in the sense that a client connected to Autonomous Data Warehouse A can use a link stored in Autonomous Data Warehouse A to access information (schema objects such as tables, views etc) in remote database B; however, users connected to database B cannot use the same link to access data in Autonomous Data Warehouse A. If local users on database B want to access data on Autonomous Data Warehouse A, then they must define their own link to Autonomous Data Warehouse A. There is more information about database links in the Administrator's Guide.

Why Are Database Links Useful?

In a lot of situations it can be really useful to have access to the very latest data without having to wait for the next run of the ETL processing. Being able to reach directly into other databases using a DBLINK can be the fastest way to get an up-to-the-minute view of what's happening with sales orders, or expense claims, or trading positions etc. Another use case is to make use of dblinks within the ETL processing itself by pulling data from remote databases into staging tables for further processing. This means the ETL process imposes a minimal processing overhead on the remote databases, since all that is typically being executed is a basic SQL SELECT statement.

There are additional security benefits as well. Consider an example where employees submit expense reports to an Accounts Payable (A/P) application and that information needs to be viewed within a financial data mart. The data mart users should be able to connect to the A/P database and run queries to retrieve the desired information. The mart users do not need to be A/P application users to do their analysis or run their ETL jobs; they should only be able to access A/P information in a controlled, secured way.

Setting Up A Database Link in ADW

There are not many steps involved in creating a new database link since all the hard work happens under the covers. The first step is to check that you can actually access the target database, i.e. you have a username and password along with all the connection information. To use database links with Autonomous Data Warehouse the target database must be configured to use TCP/IP with SSL (TCPS) authentication. Fortunately, if you want to connect to another Autonomous Data Warehouse or Autonomous Transaction Processing instance then everything is already in place, because ADBs use TCP/IP with SSL (TCPS) authentication by default. For other cloud and on-premise databases you will most likely have to configure them to use TCP/IP with SSL (TCPS) authentication. I will try and cover this topic in a separate blog post. Word of caution here…don't forget to check your Network ACLs settings if you are connecting to another ATP or ADW instance since your attempt to connect might get blocked! There is more information about setting up Network ACLs here.
Scenario 1 - Connecting an Autonomous Data Warehouse to your Autonomous Transaction Processing instance

Let's assume that I have an ATP instance running a web store application that contains information about sales orders, distribution channels, customers, products etc. I want to access some of that data in real time from within my sales data mart. The first step is to get hold of the secure connection information for my ATP instance - essentially I need the cwallet.sso file that is part of the client credential file. If I click on the "APDEMO" link above I can access the information about that autonomous database, and in the list of "management" buttons is the facility to download the client credentials file... this gets me a zip file containing a series of files, two of which are needed to create a database link: cwallet.sso contains all the security credentials, and tnsnames.ora contains all the connection information that I am going to need.

Uploading the wallet file...

Next I go to my Object Storage page and create a new bucket to store my wallet file. In this case I have just called it "wallet". In reality you will probably name your buckets to identify the target database, such as "atpdemo_wallet", simply because every wallet for each database will have exactly the same name - cwallet.sso - so you will need a way to identify the target database each wallet is associated with and avoid overwriting each wallet. Within my bucket I click on the blue "Upload" button to find the cwallet.sso file and move it to my Object Storage bucket. Once my wallet file is in my bucket I then need to set up my autonomous data warehouse to use that file when it makes a connection to my ATP instance. This is where we step out of the cloud GUI and switch to a client tool like SQL Developer. I have already defined my SQL Developer connection to my Autonomous Data Warehouse, which means I can start building my new database link.

Step 1 - Moving the wallet file

To allow Autonomous Data Warehouse to access the wallet file for my target ATP database I need to put it in a special location - the data_pump_dir directory. This is done by using DBMS_CLOUD.GET_OBJECT as follows:

BEGIN
  DBMS_CLOUD.GET_OBJECT(
    credential_name => 'DEF_CRED_NAME',
    object_uri      => 'https://objectstorage.us-phoenix-1.oraclecloud.com/n/adwc/b/adwc_user/o/cwallet.sso',
    directory_name  => 'DATA_PUMP_DIR');
END;
/

If you execute the above command, all you will get back in the console is a message something like this: "PL/SQL procedure successfully completed". So to find out if the file actually got moved you can use the following query against the data_pump_dir directory:

SELECT *
FROM   table(dbms_cloud.list_files('DATA_PUMP_DIR'))
WHERE  object_name LIKE '%.sso'

which hopefully returns a result within SQL Developer that confirms my wallet file is now available to my Autonomous Data Warehouse.

Step 2 - Setting up authentication

When my database link process connects to my target ATP instance it obviously needs a valid username and password on that target ATP instance. I could rely on an account in my Autonomous Data Warehouse that matches an account in my ATP instance, but chances are you will want to use a specific account on the target database, so a credential is required.
This can be set up relatively quickly using the following command:

BEGIN
  DBMS_CLOUD.CREATE_CREDENTIAL(
    credential_name => 'ATP_DB_LINK_CRED',
    username => 'scott',
    password => 'tiger');
END;
/

Step 3 - Defining the new database link

For this step I am going to need access to the tnsnames.ora file to extract specific pieces of information about my ATP instance. Don’t forget that for each autonomous instance there is a range of connections identified by resource group names such as “low”, “medium”, “high”, “tp_urgent” etc. When defining your database link make sure you select the correct entry from your tnsnames file. You will need to find the following identifiers:

hostname
port
service name
ssl_server_cert_dn

In the example below I am using the “low” resource group connection:

BEGIN
  DBMS_CLOUD_ADMIN.CREATE_DATABASE_LINK(
    db_link_name => 'SHLINK',
    hostname => 'adb.us-phoenix-1.oraclecloud.com',
    port => '1522',
    service_name => 'example_low.adwc.example.oraclecloud.com',
    ssl_server_cert_dn => 'CN=adwc.example.oraclecloud.com,OU=Oracle BMCS PHOENIX,O=Oracle Corporation,L=Redwood City,ST=California,C=US',
    credential_name => 'ATP_DB_LINK_CRED');
END;
/

Alternatively, I could configure the database link to authenticate using the current user within my Autonomous Data Warehouse (assuming that I had a corresponding account in my Autonomous Transaction Processing instance).

That’s all there is to it! Everything is now in place, which means I can directly query my transactional data from my data warehouse. For example, if I want to see the table of distribution channels for my tp_app_orders then I can simply query the channels table as follows:

SELECT
  channel_id,
  channel_desc,
  channel_class,
  channel_class_id,
  channel_total,
  channel_total_id
FROM channels@SHLINK;

which returns the following:

and if I query my tp_app_orders table I can see the live data in my Autonomous Transaction Processing instance:

All Done!

That's it. It's now possible to connect your Autonomous Data Warehouse to any other database running on-premise or in the cloud, including other Autonomous Database instances. This makes it even quicker and easier to pull data from existing systems into your staging tables, or even just query data directly from your source applications to get the most up-to-date view.

In this post you will have noticed that I created a new database link between an Autonomous Data Warehouse and an Autonomous Transaction Processing instance. Whilst this is a great use case, I suspect that many of you will want to connect your Autonomous Data Warehouse to an on-premise database. Well, as I mentioned at the start of this post, there are some specific requirements for using database links with Autonomous Data Warehouse where the target instance is not an autonomous database, and we will deal with those in the next post: How to Create a Database Link from an Autonomous Data Warehouse to a Database Cloud Service Instance. For more information about using database links with ADW click here.
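As a side note, if you ever need to remove or redefine a link (for example after rotating the wallet or changing the target service), DBMS_CLOUD_ADMIN also provides a drop procedure. A minimal sketch, reusing the SHLINK name from the example above:

BEGIN
  DBMS_CLOUD_ADMIN.DROP_DATABASE_LINK(
    db_link_name => 'SHLINK');
END;
/

After dropping the link you can simply re-run CREATE_DATABASE_LINK with the updated connection details.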


Autonomous

Autonomous Data Warehouse - Now with Spatial Intelligence

We are pleased to announce that Oracle Autonomous Data Warehouse now comes with spatial intelligence! If you are completely new to Oracle Autonomous Data Warehouse (where have you been for the last 18 months?) then here is a quick recap of the key features:

What is Oracle Autonomous Data Warehouse?

Oracle Autonomous Data Warehouse provides a self-driving, self-securing, self-repairing cloud service that eliminates the overhead and human errors associated with traditional database administration. Oracle Autonomous Data Warehouse takes care of configuration, tuning, backup, patching, encryption, scaling, and more. Additional information can be found at https://www.oracle.com/database/autonomous-database.html.

Special Thanks...

This post has been prepared by David Lapp, who is part of the Oracle Spatial and Graph product management team. He is extremely well known within our spatial and graph community. If you want to follow David's posts on the Spatial and Graph blog then use this link; the spatial and graph blog is here.

Spatial Features

The core set of Spatial features has been enabled on Oracle Autonomous Data Warehouse. Highlights of the enabled features are: native storage and indexing of point/line/polygon geometries; spatial analysis and processing, such as proximity, containment, combining geometries, and distance/area calculations; geofencing to monitor objects entering and exiting areas of interest; and linear referencing to analyze events and activities located along linear networks such as roads and utilities. For details on enabled Spatial features, please see the Oracle Autonomous Data Warehouse documentation.

Loading Your Spatial Data into ADW

In Oracle Autonomous Data Warehouse, data loading is typically performed using either Oracle Data Pump or Oracle/3rd party data integration tools. There are a few different ways to load and configure your spatial data sets:

Load existing spatial data
Load GeoJSON, WKT, or WKB and convert to Spatial using SQL
Load coordinates and convert to Spatial using SQL

Obviously the files containing your spatial data sets can be located in your on-premise data center or maybe your desktop computer, but for the fastest data loading performance Oracle Autonomous Data Warehouse also supports loading from files stored in Oracle Cloud Infrastructure Object Storage and other cloud file stores. Details can be found here: https://docs.oracle.com/en/cloud/paas/autonomous-data-warehouse-cloud/user/load-data.html.

Configuring Your Spatial Data

Routine Spatial data configuration is performed using Oracle SQL Developer GUIs or SQL commands for:

Insertion of Spatial metadata
Creation of a Spatial index
Validation of Spatial data

A small sketch of these configuration steps follows at the end of this section.

Example Use Case

The Spatial features enabled for Oracle Autonomous Data Warehouse support the most common use cases in data warehouse contexts. Organizations in industries such as insurance, finance, and public safety require data warehouses to perform a wide variety of analytics. These data warehouses provide the clues to answer questions such as: What are the major risk factors for a potential new insurance policy? What are the patterns associated with fraudulent bank transactions? What are the predictors of various types of crimes? In all of these data warehouse scenarios, location is an important factor, and the Spatial features of Oracle Autonomous Data Warehouse enable building and analyzing the dimensions of geographic data.
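Before moving on to the example use case in detail, here is a minimal sketch of the configuration steps listed above. It assumes a hypothetical POLICIES table with a GEOMETRY column of type SDO_GEOMETRY and a WKT_GEOMETRY text column holding loaded WKT strings; none of these names come from the post itself.

-- Convert a loaded WKT string into an SDO_GEOMETRY value (SRID 4326 = WGS84)
UPDATE policies
   SET geometry = SDO_GEOMETRY(wkt_geometry, 4326);

-- Insert the Spatial metadata for the GEOMETRY column
INSERT INTO user_sdo_geom_metadata (table_name, column_name, diminfo, srid)
VALUES ('POLICIES', 'GEOMETRY',
        SDO_DIM_ARRAY(
          SDO_DIM_ELEMENT('LONGITUDE', -180, 180, 0.05),
          SDO_DIM_ELEMENT('LATITUDE',   -90,  90, 0.05)),
        4326);
COMMIT;

-- Create the Spatial index used by operators such as SDO_WITHIN_DISTANCE
CREATE INDEX policies_sidx ON policies (geometry)
  INDEXTYPE IS MDSYS.SPATIAL_INDEX;

-- Validate the geometries (returns 'TRUE' for valid rows)
SELECT policy_id,
       SDO_GEOM.VALIDATE_GEOMETRY_WITH_CONTEXT(geometry, 0.05) AS validation
  FROM policies;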
Using the insurance scenario as an example, the major steps for location analysis are:

Load historical geocoded policy data including outcomes such as claims and fraud
Load geospatial reference data for proximity such as businesses and transportation features
Use Spatial to calculate location-based metrics

For example, let's find the number of restaurants within 5 miles, and the distance to the nearest restaurant:

-- Count within distance
-- Use a SQL statement with SDO_WITHIN_DISTANCE
-- and DML to build the result data
SELECT policy_id, count(*) as no_restaurant_5_mi
FROM policies, businesses
WHERE businesses.type = 'RESTAURANT'
AND SDO_WITHIN_DISTANCE(
        businesses.geometry,
        policies.geometry,
        'distance=5 UNIT=mile') = 'TRUE'
GROUP BY policy_id;

POLICY_ID  NO_RESTAURANT_5_MI
81902842   5
86469385   1
36378345   3
36323540   3
36225484   2
40830185   5
40692826   1
...

Now we can expand the above query to use the SDO_NN function to do further analysis and find the closest restaurant within the group of restaurants that are within a mile radius of a specific location. Something like the following:

-- Distance to nearest
-- The SDO_NN function does not perform an implicit join
-- so use PL/SQL with DML to build the result data
DECLARE
  distance_mi NUMBER;
BEGIN
  FOR item IN (SELECT * FROM policies)
  LOOP
    execute immediate
      'SELECT sdo_nn_distance(1) FROM businesses '||
      'WHERE businesses.type = ''RESTAURANT'' '||
      'AND SDO_NN(businesses.geometry,:1,'||
      '''sdo_batch_size=10 unit=mile'', 1) = ''TRUE'' '||
      'AND ROWNUM=1'
    INTO distance_mi USING item.geometry;
    DBMS_OUTPUT.PUT_LINE(item.policy_id||' '||distance_mi);
  END LOOP;
END;
/

POLICY_ID RESTAURANT_MI
81902842 4.100
86469385 1.839
36378345 4.674
36323540 3.092
36225484 1.376
40830185 2.237
40692826 4.272
44904642 2.216
...

Generate the desired spectrum of location-based metrics by stepping through combinations of proximity targets (i.e. restaurants, convenience stores, schools, hospitals, police stations ...) and distances (i.e. 0.25 mi, 0.5 mi, 1 mi, 3 mi, 5 mi ...). Combine these location-based metrics with traditional metrics (i.e. value of property, age of policy holder, household income ...) for analytics to identify predictors of outcomes.

To enable geographic aggregation, start with a geographic hierarchy with geometry at the most detailed level. For example, a geographic hierarchy where ZONE rolls up to SUB_REGION, which rolls up to REGION:

DESCRIBE geo_hierarchy
Name                  Type
--------------------  --------------
ZONE                  VARCHAR2(30)
GEOMETRY              SDO_GEOMETRY
SUB_REGION            VARCHAR2(30)
REGION                VARCHAR2(30)

Use Spatial to calculate containment (things found within a region) at the detailed level, which by extension associates the location with all levels of the geo-hierarchy for aggregations:

-- Calculate containment
--
-- The SDO_ANYINTERACT function performs an implicit join
-- so, use a SQL statement with DML to build the result data
--
SELECT policy_id, zone
FROM policies, geo_hierarchy
WHERE SDO_ANYINTERACT(policies.geometry, geo_hierarchy.geometry) = 'TRUE';

POLICY_ID ZONE
81902842  A23
86469385  A21
36378345  A23
36323540  A23
36225484  B22
40830185  C05
40692826  C10
44904642  B16
...

With these and similar operations, analytics may be performed, including the calculation of additional location-based metrics and aggregation by geography.
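To round off the "aggregation by geography" idea, here is a minimal sketch that joins the containment results to the hierarchy and rolls policy counts up to the REGION and SUB_REGION levels; it reuses the hypothetical POLICIES and GEO_HIERARCHY tables shown above.

-- Roll policy counts up the geographic hierarchy using the containment join
SELECT g.region,
       g.sub_region,
       COUNT(*) AS policy_count
  FROM policies p,
       geo_hierarchy g
 WHERE SDO_ANYINTERACT(p.geometry, g.geometry) = 'TRUE'
 GROUP BY g.region, g.sub_region
 ORDER BY g.region, g.sub_region;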
Summary

For important best practices and further details on the use of these and many other Spatial operations, please refer to the Oracle Autonomous Data Warehouse documentation: https://www.oracle.com/database/autonomous-database.html.


Autonomous

Using Oracle Management Cloud to Monitor Autonomous Databases

How To Monitor All Your Autonomous Database Instances

The latest release of Autonomous Database (which, as you should already know, covers both Autonomous Data Warehouse and Autonomous Transaction Processing) has brought integration with Oracle Management Cloud (OMC). This is great news for cloud DBAs and cloud fleet managers since it means you can now monitor all your Autonomous Database instances from within a single, integrated console.

So What Is Oracle Management Cloud?

Oracle Management Cloud is a suite of integrated monitoring and management services that can bring together information about all your autonomous database instances so you can monitor and manage everything from a single console. In a much broader context, where you need to manage a complete application ecosystem or data warehouse ecosystem, Oracle Management Cloud can help you eliminate multiple management and infrastructure information silos, resolve issues faster across your complete cloud ecosystem, and run IT like a business.

What About My Service Console?

Each Autonomous Database instance has its own service console for managing and monitoring that specific instance (application database, data mart, data warehouse, sandbox etc.). It has everything you, as a DBA or business user, need to understand how your database is performing and using resources. To date this has been the console that everyone has used for monitoring. But, as you can see, this service console only shows you what’s happening within a specific instance. If you have not looked at the Service Console before then check out Section 9 in the documentation (in this case for Autonomous Data Warehouse, but the same applies to ATP): Managing and Monitoring Performance of Autonomous Data Warehouse.

As more and more business teams deploy more and more Autonomous Database instances for their various projects, the challenge has been for anyone tasked with monitoring all these instances: how do you get a high-level overview of what’s been deployed and in use across a whole organization? That’s where Oracle Management Cloud (OMC) comes in... and the great news is that OMC monitoring of Autonomous Databases is completely free!

OMC Takes Monitoring To The Next Level

The purpose of this post is to look at the newly released integration between Autonomous Database and the Oracle Database Management part of Oracle Management Cloud. Let’s look at how to set up Oracle Database Management to discover your Autonomous Databases, how to monitor your Autonomous Database instances and check for predefined alerts, and how to use the Performance Hub page. This is what we are aiming to set up in this blog post... the Oracle Database Fleet Home page, which as you can see is telling me that I have 8 autonomous database instances - 6 ADW and 2 ATP instances - and that 7 of those instances are up and running while one is currently either starting up or shutting down (identified as yellow).

Getting Started...

Before you get started with OMC it’s worth taking a step back and thinking about how you want to manage your OMC instance. My view is that it makes sense to create a completely new, separate cloud account which will own your OMC instance (or instances if you want to have more than one). It’s not a requirement, but in my opinion it keeps things nice and simple, and your DBAs and fleet managers then typically won’t need access to the individual cloud accounts being used by each team for their autonomous database projects.
So the first step is probably going to be registering a new cloud account and setting up users to access your OMC instance. Once you have a cloud account set up, there is some initial user configuration that needs to be completed before you can start work with your OMC instance. The setup steps are outlined in the documentation - see here. To help you, I have also covered these steps in the PowerPoint presentation which is provided at the end of this blog post.

Creating an OMC Instance

Starting from your My Services home page, click on the big “Create Instance” button and find the “Management Cloud” service in the list of all available services... this will allow you to provide the details for your new OMC instance.

Oracle Cloud will send you an email notification as soon as your new OMC instance is ready, but it only takes a few minutes and then you can navigate to the “Welcome” page for your new instance, which looks like this:

The next step is to set up “Discovery Profiles” for each of your cloud accounts. This will require many of the IDs and pieces of information that were part of the user account setup process, so you may want to look back over that stage of the process for a quick refresh. As you can see below, a discovery profile can be configured to look for only Autonomous Data Warehouse instances, only Autonomous Transaction Processing instances, or both within a specific cloud account. Of course, if you have multiple cloud accounts (maybe each department or project team has their own cloud account) then you will need to create discovery profiles for each cloud account. This gives you the ultimate flexibility in terms of setting up OMC in a way that best suits how you and your team want to work. There is more detailed information available in the OMC documentation, see here.

The discovery process starts as soon as you click on the “Start” button in the top right-hand corner of the page. It doesn’t take long, and the console provides feedback on the status of the discovery job. Once the job or jobs have completed you can navigate to the “Oracle Database” option in the left-hand menu, which will bring you to this page - Oracle Database Fleet Home.

You can customise the graphs and information displayed on this page. For example, the heat map in the middle of the page can display metrics for DB Time, execution rate, network I/O, space used or transaction rate. You can switch between different types of views, such as listing the data as a table rather than a graph:

and because there is so much available data there is a Filter menu that allows you to focus on instances that are up or down, ADW instances vs. ATP instances, database version, and you can even focus in on a specific data center or group of data centers. Once you have set up your filters you can bookmark that view by saving the filters...

In the section headed “Fleet Members”, clicking on one of the instances listed in the name column will drill into the performance analysis for that instance. This takes all the information from the Fleet Home page and brings the focus down to that specific instance. For example, selecting my demo instance, which is the last row in the table above, brings me to this page...

You will notice that this contains a mixture of information from the OCI console page and the service console page for my demo instance, so it provides me with a great overview of how many CPUs are allocated, the database version, the amount of storage allocated, database activity and a list of any SQL statements.
It is a sort of mash-up of my OCI console page and service console page. If you then go to the Performance Hub page, we can start to investigate what’s currently happening within my demo instance...

As with the previous screens, I can customize the metrics displayed on the graph, although this time there is a huge library of metrics to choose from:

and OMC allows me to drill in to my currently running SQL statement (highlighted in blue in the above screenshot) to look at the resource usage...

and I can get right down to the SQL execution plan...

Take Monitoring To The Next Level With OMC

As you can see, Oracle Management Cloud takes monitoring of your autonomous database instances to a whole new level. Now you can get a single, integrated view for managing all your autonomous database instances, across all your different cloud accounts and across all your data centres. For cloud DBAs and cloud fleet managers this is definitely the way to go and, more importantly, OMC is free for Autonomous Database customers.

If you are already using OMC to monitor other parts of your Oracle Cloud deployments (mobile apps, GoldenGate, data integration tools, IaaS, SaaS) then monitoring Autonomous Database instances can now be included in your day-to-day use of OMC. Which means you can use a single console to manage a complete application ecosystem and/or a complete data warehouse ecosystem. For cloud DBAs and cloud fleet managers life just got a whole lot easier! Happy monitoring with Oracle Management Cloud.

Where To Get More Information:

Step-by-step setup guide in PDF format is here.
Autonomous Data Warehouse documentation is here.
Autonomous Transaction Processing documentation is here.
OMC documentation for Autonomous Database is here.


Autonomous

Loading data into Autonomous Data Warehouse using Datapump

Oracle introduced Autonomous Data Warehouse over a year ago, and one of the most common questions that customers ask me is how they can move their data/schemas to ADW (Autonomous Data Warehouse) with minimal effort. My answer to that is to use Data Pump, also known as expdp/impdp. ADW doesn't support the traditional import and export utilities, so you have to use Data Pump. Oracle suggests using the schema and parallel parameters when running Data Pump; set parallel according to the number of OCPUs that you have for your ADW instance. Oracle also suggests excluding index, cluster, indextype, materialized_view, materialized_view_log, materialized_zonemap and db_link objects. This is done in order to save space and speed up the data load process. At present you can only use data_pump_dir as an option for the directory; this is the default directory created in ADW. Also, you don't have to worry about the performance of the database, since ADW uses technologies like storage indexes, machine learning, etc. to achieve optimal performance. You can use a file stored in Oracle Object Storage, Amazon S3 storage or Azure Blob Storage as your dumpfile location. I will be using Oracle Object Storage in this article.

We will be using the steps below to load data:

1) Export the schema of your current database using expdp
2) Upload the .dmp file to object storage
3) Create an Authentication Token
4) Log in to ADW using SQL*Plus
5) Create credentials in Autonomous Data Warehouse to connect to Object Storage
6) Run the import using Data Pump
7) Verify the import

Instead of writing more, let me show you how easy it is to do.

Step 1: Export the schema of your current database using expdp

Use expdp on your current database to run the export. Copy that dump file and put it in a location from where you can upload it to object storage.

Step 2: Upload the .dmp file to object storage

In order to upload the .dmp file to object storage, log in to your cloud console and click Object Storage. Once in Object Storage, select the compartment that you want to use and create a bucket. I am going to use the compartment "Liftandshift" and create the bucket "LiftandShiftADW". Next click on the bucket and click Upload to upload the .dmp file. At this point you can use either the CLI (Command Line Interface) or the GUI (Graphical User Interface) to upload the .dmp file. If your .dmp file is larger than 2 GiB then you have to use the CLI. I am going to use the GUI since I have a small schema for demonstration purposes. Select the .dmp file that you want to upload to object storage and then click Upload Object. Once you're done, your .dmp file will show up under Objects in your Bucket Details section.

Step 3: Create an Authentication Token

The Authentication Token will help us access Object Storage from the Autonomous Database. Under the Governance and Administration section, click on the Identity tab and go to Users. Click on the authorized user ID and then click on Auth Tokens under Resources on the left side to generate the Auth Token. Click Generate Token, give it a description, and then click Generate Token again and it will create the token for you. Remember to save the token: once the token is created, you won't be able to retrieve it again. You can click on the copy button and copy the token to a notepad. Once done, you can hit the Close button on the screen.

Step 4: Log in to ADW using SQL*Plus

Go to the ADW homepage and click on the ADW database you have created.
Once on the database page, click on DB Connection, then click on the Download button to download the wallet. Once the zip file is downloaded, hit the Close button.

Download the latest version of Instant Client from the Oracle website: https://www.oracle.com/technetwork/database/database-technologies/instant-client/downloads/index.html. Unzip all the files in one location. I used the location "C:\instantclient\instantclient_18_3" on my system. Once unzipped you will be able to use sqlplus.exe and impdp.exe at that location. Also move the compressed wallet file to that location and unzip the file.

Next, update the entries in the sqlnet.ora file and point it to the location of your wallet. I have changed mine to "C:\instantclient\instantclient_18_3" as shown below. Test the connectivity using sqlplus.exe and make sure you are able to connect using the user ID admin.

Step 5: Create credentials in Autonomous Data Warehouse to connect to Object Storage

Use the script below to create credentials in ADW, and use the Authentication Token created earlier as the password.

BEGIN
  DBMS_CLOUD.CREATE_CREDENTIAL(
    credential_name => 'DEF_CRED_NAME',
    username => 'oracleidentitycloudservice/ankur.saini@oracle.com',
    password => '<password>'     <------------ (Use the Authentication Token value here instead of the password)
  );
END;
/

Step 6: Run the import using Data Pump

Since my ADW instance is built using 1 OCPU, I won't be using parallel as an option. I used the script below to run the import:

./impdp.exe admin/<Password>@liftandshift_high directory=data_pump_dir credential=def_cred_name dumpfile=https://swiftobjectstorage.us-ashburn-1.oraclecloud.com/v1/orasenatdoracledigital05/AnkurObject/hrapps.dmp exclude=index,cluster,indextype,materialized_view,materialized_view_log,materialized_zonemap,db_link

Step 7: Verify the import

Log in to the database using SQL*Plus or SQL Developer and verify the import (a small verification sketch follows after the reference link below). You can see how easy it is to move the data to ADW, and that there is not a huge learning curve. Now you can be more productive and focus on your business.

Reference: https://docs.oracle.com/en/cloud/paas/autonomous-data-warehouse-cloud/user/load-data.html#GUID-297FE3E6-A823-4F98-AD50-959ED96E6969
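As a footnote to Step 7, here is a minimal verification sketch run from SQL*Plus or SQL Developer. The HR schema name and the EMPLOYEES table are assumptions for illustration; substitute whatever schema you imported and compare the numbers against the source database.

-- Compare object counts in ADW against the source database
SELECT object_type, COUNT(*) AS object_count
  FROM dba_objects
 WHERE owner = 'HR'
 GROUP BY object_type
 ORDER BY object_type;

-- Spot-check row counts for a few key tables
SELECT COUNT(*) AS employee_rows FROM hr.employees;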


Autonomous

So you have your CSV, TSV and JSON data lying in your Oracle Cloud Object Store. How do you get it over into your Autonomous Database?

You have finally gotten your ducks in a row to future-proof your data storage, and uploaded all your necessary production data into Oracle Cloud's object store. Splendid! Now, how do you get this data into your Autonomous Database? Here I provide some practical examples of how to copy over your data from the OCI object store to your Autonomous Data Warehouse (ADW). You may use a similar method to copy data into your Autonomous Transaction Processing (ATP) instance too. We will dive into the meanings of some of the widely used parameters, which will help you and your teams derive quick business value by creating your Data Warehouse in a jiffy!

An extremely useful feature of the fully managed ADW service is the ability to copy data lying in your external object store quickly and easily. The DBMS_CLOUD.COPY_DATA API procedure enables this behavior of copying (or loading) your data into your database from data files lying in your object store, enabling your ADW instance to run queries and analyses on said data.

A few prerequisites to get us running these analyses:

Make sure you have a running ADW instance with a little storage space, a credentials wallet and a working connection to your instance. If you haven’t done this already you can follow this simple Lab 1 tutorial.

Use this link to download the data files for the following examples. You will need to unzip and upload these files to your Object Store. Once again, if you don’t know how to do this, follow Lab 3 Step 4 in this tutorial, which uploads files to a bucket in the Oracle Cloud Object Store, the most streamlined option. You may also use AWS or Azure object stores if required; refer to the documentation for more information on this.

You will provide the URLs of the files lying in your object store to the API. If you already created your object store bucket's URL in the lab you may use that; otherwise, to create it, use the URL below and replace the placeholders <region_name>, <tenancy_name> and <bucket_name> with your object store bucket's region, tenancy and bucket names. The easiest way to find this information is to look at your object's details in the object store, by opening the right-hand menu and clicking "Object details" (see screenshot below).

https://objectstorage.<region_name>.oraclecloud.com/n/<tenancy_name>/b/<bucket_name>/o/

Note: You may also use a SWIFT URL for your file here if you have one.

Have the latest version of SQL Developer installed (ADW requires at least v18.3).

Comma Separated Value (CSV) Files

CSV files are one of the most common file formats out there. We will begin by using a plain and simple CSV format file for Charlotte's (NC) weather history dataset, which we will use as the data for our first ADW table. Open this weather history '.csv' file in a text editor to have a look at the data. Notice each field is separated by a comma, and each row ends by going to the next line (i.e. with a newline '\n' character). Also note that the first line is not data, but metadata (column names).

Let us now write a script to create a table, with the appropriate column names, in our ADW instance, and copy this data file lying in our object store into it. We will specify the format of the file as CSV. The format parameter in the DBMS_CLOUD.COPY_DATA procedure takes a JSON object, which can be provided in two possible formats.
format => '{"format_option" : "format_value"}'
format => json_object('format_option' value 'format_value')

The second format option has been used in the script below.

set define on
define base_URL = <paste Object Store or SWIFT URL created above here>

create table WEATHER_REPORT_CSV (REPORT_DATE VARCHAR2(20),
    ACTUAL_MEAN_TEMP NUMBER,
    ACTUAL_MIN_TEMP NUMBER,
    ACTUAL_MAX_TEMP NUMBER,
    AVERAGE_MIN_TEMP NUMBER,
    AVERAGE_MAX_TEMP NUMBER,
    AVERAGE_PRECIPITATION NUMBER(5,2));

begin
  DBMS_CLOUD.COPY_DATA(
    table_name =>'WEATHER_REPORT_CSV',
    credential_name =>'OBJ_STORE_CRED',
    file_uri_list =>'&base_URL/Charlotte_NC_Weather_History.csv',
    format =>
      json_object('type' value 'csv',
                  'skipheaders' value '1',
                  'dateformat' value 'mm/dd/yy'));
end;
/

Let us break down and understand this script. We first create the WEATHER_REPORT_CSV table with the appropriately named columns for our destination table. We then invoke the "COPY_DATA" procedure in the DBMS_CLOUD API package and provide it the table name we created in our Data Warehouse, our user credentials (we created this in the prerequisites), the object store file list that contains our data, and a format JSON object that describes the format of our file to the API.

The format parameter is a constructed JSON object with format options 'type' and 'skipheaders'. The type specifies the file format as CSV, while skipheaders tells the API how many rows are metadata headers which should be skipped. In our file, that is 1 row of headers. The 'dateformat' parameter specifies the format of the date column in the file we are reading from; we will look at this parameter in more detail in the examples below.

Great! If this was successful, we have our first data warehouse table containing data from an object store file. If you do see errors during this copy_data process, follow Lab 3 Step 12 to troubleshoot them with the help of the necessary log file. If required, you can also drop this table with the "DROP TABLE" command.

On running this copy data without errors, you now have a working data warehouse table. You may now query and join the WEATHER_REPORT_CSV table with other tables in your Data Warehouse instance with the regular SQL or PL/SQL you know and love. As an example, let us find the days in our dataset during which it was pleasant in Charlotte.

SELECT * FROM WEATHER_REPORT_CSV where actual_mean_temp > 69 and
       actual_mean_temp < 74;

Tab Separated Value (TSV) Files

Another popular file format involves tab delimiters, or TSV files. In the files you downloaded, look for the Charlotte weather history '.gz' file. Unzip it, open the ".tsv" file inside in a text editor and have a look as before. You will notice each row in this file is ended by a pipe '|' character instead of a newline character, and the fields are separated by tab spaces. Oftentimes applications you work with will output data in less intelligible formats such as this one, and so below is a slightly more advanced example of how to pass such data into DBMS_CLOUD.
Let’s run the following script (note that it includes every format option discussed in the explanations that follow):

create table WEATHER_REPORT_TSV (REPORT_DATE VARCHAR2(20),
    ACTUAL_MEAN_TEMP NUMBER,
    ACTUAL_MIN_TEMP NUMBER,
    ACTUAL_MAX_TEMP NUMBER,
    AVERAGE_MIN_TEMP NUMBER,
    AVERAGE_MAX_TEMP NUMBER,
    AVERAGE_PRECIPITATION NUMBER(5,2));

begin
  DBMS_CLOUD.COPY_DATA(
    table_name =>'WEATHER_REPORT_TSV',
    credential_name =>'OBJ_STORE_CRED',
    file_uri_list =>'&base_URL/Charlotte_NC_Weather_History.gz',
    format => json_object(
                          'ignoremissingcolumns' value 'true',
                          'removequotes' value 'true',
                          'dateformat' value 'mm/dd/yy',
                          'delimiter' value '\t',
                          'recorddelimiter' value '''|''',
                          'skipheaders' value '1',
                          'rejectlimit' value '1',
                          'compression' value 'gzip'
                         )
  );
end;
/

SELECT * FROM WEATHER_REPORT_TSV where actual_mean_temp > 69 and
       actual_mean_temp < 74;

Let us understand the new parameters here:

'ignoremissingcolumns' value 'true': Notice there is no data for the last column "AVERAGE_PRECIPITATION". This parameter allows the copy data script to skip over columns from the column list that have no data in the data file.

'removequotes' value 'true': The first column 'date' has data surrounded by double quotes. For this data to be converted to an Oracle date type, these quotes need to be removed. Note that when using the type parameter for CSV files, as we did in the first example, this removequotes option is true by default.

'dateformat' value 'mm/dd/yy': If we expect a date column to be converted and stored into an Oracle date column (after removing the double quotes of course), we should provide the date column's format. If we don't provide a format, the date column will use the database's default date format. You can see the dateformat documentation here.

'delimiter' value '\t': Fields in this file are tab delimited, so the delimiter we specify is the special tab character.

'recorddelimiter' value '''|''': Each record or row in our file is delimited by a pipe '|' symbol, and so we specify this parameter, which separates out each row. Note that unlike the delimiter parameter, the recorddelimiter must be enclosed in single quotes as shown here. A nuance here is that the last row in your dataset doesn't need the record delimiter when it is the default newline character, but it does for other character record delimiters, to indicate the end of that row. Also note that since ADW is Linux/UNIX based, source data files with newline as the record delimiter that have been created on Windows must use "\r\n" as the format option. Both of these nuances will likely have updated functionality in future releases.

'rejectlimit' value '1': We need this parameter here to fix an interesting problem. Unlike with the newline character, if we don't specify a pipe record delimiter at the very end of the file, we get an error because the API doesn't recognize where the last row's last column ends. If we do specify the pipe record delimiter, however, the API expects a new line because the record has been delimited, and we get a null error for the last non-existent row. To fix situations like this, where we know we might have one or more problem rows, we use the reject limit parameter to allow some number of rows to be rejected. If we use 'unlimited' as our reject limit, then any number of rows may be rejected. The default reject limit is 0.

'compression' value 'gzip': Notice the .tsv file is zipped into a gzip ".gz" file, which we have used in the URL.
We use this parameter so the file will be unzipped appropriately before the table is created. As before, once this is successful, the table structure is created, after which the data is loaded into the table from the data file in the object store. We can then proceed to query the table in our Data Warehouse.

Field Lists - For More Granular Parsing Options

A more advanced feature of DBMS_CLOUD.COPY_DATA is the field_list parameter, which borrows its feature set from the field_list parameter of the Oracle Loader access driver. This parameter allows you to specify more granular information about the fields being loaded. For example, let's use "Charlotte_NC_Weather_History_Double_Dates.csv" from the list of files in our object store. This file is similar to our first CSV example, except it has a copy of the date column in a different date format. Now, if we were to specify a date format in the format parameter, it would apply universally to all date columns. With the field_list parameter, we can specify two different date formats for the two date columns. We do need to list all the columns and their types when including the field_list; not mentioning any type parameters simply uses the default VARCHAR2 values.

create table WEATHER_REPORT_DOUBLE_DATE (REPORT_DATE VARCHAR2(20),
    REPORT_DATE_COPY DATE,
    ACTUAL_MEAN_TEMP NUMBER,
    ACTUAL_MIN_TEMP NUMBER,
    ACTUAL_MAX_TEMP NUMBER,
    AVERAGE_MIN_TEMP NUMBER,
    AVERAGE_MAX_TEMP NUMBER,
    AVERAGE_PRECIPITATION NUMBER(5,2));

begin
  DBMS_CLOUD.COPY_DATA(
    table_name =>'WEATHER_REPORT_DOUBLE_DATE',
    credential_name =>'OBJ_STORE_CRED',
    file_uri_list =>'&base_URL/Charlotte_NC_Weather_History_Double_Dates.csv',
    format => json_object('type' value 'csv', 'skipheaders' value '1'),
    field_list => 'REPORT_DATE DATE ''mm/dd/yy'',
                   REPORT_DATE_COPY DATE ''yyyy-mm-dd'',
                   ACTUAL_MEAN_TEMP,
                   ACTUAL_MIN_TEMP,
                   ACTUAL_MAX_TEMP,
                   AVERAGE_MIN_TEMP,
                   AVERAGE_MAX_TEMP,
                   AVERAGE_PRECIPITATION'
  );
end;
/

SELECT * FROM WEATHER_REPORT_DOUBLE_DATE where actual_mean_temp > 69 and actual_mean_temp < 74;

It's important to recognize that the date format parameters are there to provide the API with the information it needs to read the data file. The output format from your query will be your database default (based on your NLS parameters). This can also be formatted in your query using TO_CHAR.

JSON Files

You may be familiar with JSON files for unstructured and semi-structured data. The "PurchaseOrders.txt" file contains JSON purchase order data, which when parsed and formatted looks like the following. Using JSON data in an ADW instance can be as simple as putting each JSON document into a table row as a BLOB, and using the powerful, native JSON features that the Oracle Database provides to parse and query it. You can also view the JSON documentation for additional features here. Let's try this!
Copy and run the script below:

CREATE TABLE JSON_DUMP_FILE_CONTENTS (json_document blob);

begin
  DBMS_CLOUD.COPY_DATA(
    table_name =>'JSON_DUMP_FILE_CONTENTS',
    credential_name =>'OBJ_STORE_CRED',
    file_uri_list =>'&base_URL/PurchaseOrders.dmp',
    field_list => 'json_document CHAR(5000)'
  );
end;
/

COLUMN Requestor FORMAT A30
COLUMN ShippingInstructions FORMAT A30

SELECT JSON_VALUE(json_document,'$.Requestor') as Requestor,
       JSON_VALUE(json_document,'$.ShippingInstructions.Address.city') as ShippingInstructions
FROM JSON_DUMP_FILE_CONTENTS where rownum < 50;

The query above lists all the PO requestors and the city where their shipment is to be delivered. Here, we have simply created one column 'json_document' in the table 'JSON_DUMP_FILE_CONTENTS'. We do not incur the time it takes to validate these JSON documents, and are instead directly querying the table using the database's JSON_VALUE feature. This means the check for well-formed JSON data happens on the fly, which would fail unless you properly skip over the failed data. Here, 'COPY_DATA' will not check for valid JSON data, but will simply check that the data is of the correct native datatype (less than 5000 characters long), that is, the datatype of the table's column.

For better performance on large JSON data files, we can also make use of the database's JSON features to parse and insert the JSON data from this ADW table into a new table 'j_purchaseorder' ahead of time, as below. Note that this insert statement actually brings the data into your ADW instance. You benefit from doing this as it checks that your JSON data is well-formed and valid ahead of time, and you therefore incur less of a performance impact when you query this JSON data from your ADW instance.

CREATE TABLE j_purchaseorder
 (id          VARCHAR2 (32) NOT NULL,
  date_loaded TIMESTAMP (6) WITH TIME ZONE,
  po_document BLOB
  CONSTRAINT ensure_json CHECK (po_document IS JSON));

INSERT INTO j_purchaseorder (id, date_loaded, po_document)
SELECT SYS_GUID(), SYSTIMESTAMP, json_document
FROM json_dump_file_contents
WHERE json_document IS JSON;

We can now query down JSON paths using the JSON simplified syntax, as with the following query:

SELECT po.po_document.Requestor,
       po.po_document.ShippingInstructions.Address.city
FROM j_purchaseorder po;

Beyond Copying Data into your Autonomous Data Warehouse

Here, we've gone through simple examples of how to copy your Oracle object store data into your Autonomous Data Warehouse instance. In following posts, we will walk through more ways you might use to load your data, from on-premise or cloud-based storage, as well as more detail on how you might troubleshoot any data loading errors you may encounter. See you in the next one!


Autonomous

Oracle Autonomous Databases - Accessing Apache Avro Files

Apache Avro is a common data format in big data solutions. Now, these types of files are easily accessible to Oracle Autonomous Databases. One of Avro's key benefits is that it enables efficient data exchange between applications and services. Data storage is compact and efficient, and the file format itself supports schema evolution. It does this by including the schema within each file, with an explanation of the characteristics of each field.

In a previous post about Autonomous Data Warehouse and accessing parquet files, we talked about using a utility called parquet-tools to review parquet files. A similar tool - avro-tools - is available for Avro files. Using avro-tools, you can create Avro files, extract the schema from a file, convert an Avro file to JSON, and much more (check out the Apache Avro home for details). A schema file is used to create the Avro files. This schema file describes the fields, data types and default values. The schema becomes part of the generated Avro file, which allows applications to read the file and understand its contents. Autonomous Database uses this schema to automate table creation. Similar to parquet sources, Autonomous Database reads the schema to create the columns with the appropriate Oracle Database data types. Avro files may include complex types - like arrays, structs, maps and more; Autonomous Database supports Avro files that contain Oracle data types.

Let's take a look at an example. Below, we have a file - movie.avro - that contains information about movies (thanks to Wikipedia for providing info about the movies). We'll use the avro-tools utility to extract the schema:

$ avro-tools getschema movie.avro
{
  "type" : "record",
  "name" : "Movie",
  "namespace" : "oracle.avro",
  "fields" : [ {
    "name" : "movie_id",
    "type" : "int",
    "default" : 0
  }, {
    "name" : "title",
    "type" : "string",
    "default" : ""
  }, {
    "name" : "year",
    "type" : "int",
    "default" : 0
  }, {
    "name" : "budget",
    "type" : "int",
    "default" : 0
  }, {
    "name" : "gross",
    "type" : "double",
    "default" : 0
  }, {
    "name" : "plot_summary",
    "type" : "string",
    "default" : ""
  } ]
}

The schema is in an easy-to-read JSON format. Here, we have movie_id, title, year, budget, gross and plot_summary columns.

The data has been loaded into an Oracle Object Store bucket called movies. The process for making this data available to ADW is identical to the steps for parquet - so check out that post for details. At a high level, you will:

Create a credential that is used to authorize access to the object store bucket
Create an external table using dbms_cloud.create_external_table
Query the data!

1. Create the credential

begin
  DBMS_CLOUD.create_credential (
    credential_name => 'OBJ_STORE_CRED',
    username => '<user>',
    password => '<password>'
  );
end;
/

2. Create the table

begin
  dbms_cloud.create_external_table (
    table_name =>'movies_ext',
    credential_name =>'OBJ_STORE_CRED',
    file_uri_list =>'https://objectstorage.ca-toronto-1.oraclecloud.com/n/<tenancy>/b/<bucket>/o/*',
    format =>  '{"type":"avro",  "schema": "first"}'
  );
end;
/

Things got a little easier when specifying the URI list.
Instead of transforming the URI into a specific format, you can use the same path that is found in the OCI object browser: Object Details -> URL Path (accessed from the OCI Console: Oracle Cloud -> Object Storage - "movies" bucket). This URL Path was specified in the file_uri_list parameter - although a wildcard was used instead of the specific file name.

Now that the table is created, we can look at its description:

SQL> desc movies_ext
Name         Null? Type
------------ ----- --------------
MOVIE_ID           NUMBER(10)
TITLE              VARCHAR2(4000)
YEAR               NUMBER(10)
BUDGET             NUMBER(10)
GROSS              BINARY_DOUBLE
PLOT_SUMMARY       VARCHAR2(4000)

And, run queries against that table:

SQL> select title, year, budget, plot_summary from movies_ext where title like 'High%';

All of this is very similar to parquet - especially from a usability standpoint. With both file formats, the metadata in the file is used to automate the creation of tables. However, there is a significant difference when it comes to processing the data. Parquet is a columnar format that has been optimized for queries. Column projection and predicate pushdown are used to enhance performance by minimizing the amount of data that is scanned and subsequently transferred from the object store to the database. The same is not true for Avro; the entire file needs to be scanned and processed. So, if you will be querying this data frequently, consider alternative storage options and use tools like dbms_cloud.copy_data to easily load the data into Autonomous Database.
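One simple way to do that last step - a sketch, not the only option - is to materialize the external table's contents into a regular ADW table with a CTAS, and point your frequent queries at the local copy instead of rescanning the Avro files. The MOVIES table name below is just an illustration.

-- Load the Avro data into a local table once...
CREATE TABLE movies AS
  SELECT * FROM movies_ext;

-- ...then query the local copy rather than scanning the object store each time
SELECT title, year, gross
  FROM movies
 WHERE year >= 2000
 ORDER BY gross DESC;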


Autonomous

How To Update The License Type For Your Autonomous Database

How To Update The License Type For Your Autonomous Database

As of today (Wednesday April 17) you can now quickly and easily change the type of licensing for your Autonomous Database from BYOL to a new cloud subscription or vice-versa. It's as easy as 1-2-3. So, assuming you have already created an autonomous database instance, how do you go about changing your licensing? Let me show you!

Step 1 - Accessing the management console

The first stage involves signing into your cloud account using your tenancy name and cloud account. Then you can navigate to either your "Autonomous Data Warehouse" or "Autonomous Transaction Processing" landing pad as shown below. Let's now change the type of license for the autonomous database instance "pp1atpsep2". If we click on the blue text of the instance name, which is in the first column of the table, this will take us to the instance management console page as shown below. Notice in the above image that, on the right-hand side, the console shows the current license type as set to "Bring Your Own License", which is often referred to as a BYOL license.

Step 2 - Selecting "Update License Type" from the Actions menu

Now click on the "Actions" button in the row of menu buttons as shown below:

Step 3 - Change the "License Type"

The pop-up form shows the current type of license associated with our autonomous database instance "pp1atpsep2", which in this case is set to BYOL. If you want more information about what is and is not covered by a BYOL license then visit the BYOL FAQ page, which is here. In this case we are going to flip to using a new cloud subscription, as shown below:

That's it! All that's left to do is click on the blue Update button and the new licensing model will be applied to our autonomous database instance. At this point your autonomous database will switch into "Updating" mode, as shown below. However, the database is still up and accessible. There is no downtime. When the update is complete the status will return to "Available" and the console will show that the license type has changed to "License Included" as shown below.

Summary

Congratulations, you have successfully swapped your BYOL license for a new cloud subscription license for your autonomous database, with absolutely no downtime or impact on your users. In this post I have shown you how to quickly and easily change the type of license associated with your autonomous database. An animation of the complete end-to-end process is shown below:

Featured Neon "Change" image courtesy of wikipedia


Hadoop Best Practices

Big Data. See How Easily You Can Do Disaster Recovery

Earlier I have written about various aspects of Big Data High Availability, and I intentionally avoided the Disaster Recovery topic. High Availability answers the question of how the system keeps working when one component (like a Name Node or KDC) fails within one system (like one Hadoop cluster); Disaster Recovery answers the question of what to do if the entire system fails (the Hadoop cluster, or even the data center, goes down). In this blog I would also like to talk about backups and how to deal with human mistakes (these are not strictly DR topics, but they are quite close).

Also, I'd like to introduce a few terms. From Wikipedia:

Business Continuity: Involves keeping all essential aspects of a business functioning despite significant disruptive events.

Disaster Recovery (DR): A set of policies and procedures to enable the recovery or continuation of vital technology infrastructure and systems following a disaster.

Step 1. Protect the system from human errors. HDFS snapshots.

HDFS snapshot functionality has been in the Hadoop portfolio for a while. It is a great way to protect the system from human mistakes. There are a few simple steps to enable it (the full snapshot documentation can be found here):

- go to Cloudera Manager and drill down into the HDFS service;
- then go to the "File Browser" and navigate to the directory which you would like to protect with snapshots;
- click on the "Enable Snapshots" button.

As soon as the command finishes, you have a directory protected by snapshots! You may take snapshots on demand, or you may create a snapshot policy which is repeated periodically (recommended). In order to make that work you have to go to Cloudera Manager -> Backup -> Snapshot Policies:

- Click on "Create Policy" (Note: you have to enable snapshots for a given directory before creating a policy)
- and fill in the form.

Easy, but very powerful. It's a good time for a demo. Let's imagine that we have a directory with critical datasets on HDFS:

[root@destination_cluster15 ~]# hadoop fs -ls /tmp/snapshot_demo/
Found 2 items
drwxr-xr-x   - bdruser supergroup          0 2019-02-13 14:12 /tmp/snapshot_demo/dir1
drwxr-xr-x   - bdruser supergroup          0 2019-02-13 14:12 /tmp/snapshot_demo/dir2

Then a user accidentally deletes one of the directories:

[root@destination_cluster15 ~]# hadoop fs -rm -r -skipTrash /tmp/snapshot_demo/dir1
Deleted /tmp/snapshot_demo/dir1
[root@destination_cluster15 ~]# hadoop fs -ls /tmp/snapshot_demo/
Found 1 items
drwxr-xr-x   - bdruser supergroup          0 2019-02-13 14:12 /tmp/snapshot_demo/dir2

Fortunately, it's quite easy to restore the state of this directory using snapshots:

- go to Cloudera Manager -> HDFS -> File Browser;
- choose the option "Restore from snapshot";
- choose the appropriate snapshot and click "Restore";
- check what you have:

[root@destination_cluster15 ~]# hadoop fs -ls /tmp/snapshot_demo/
Found 2 items
drwxr-xr-x   - hdfs supergroup          0 2019-02-13 14:32 /tmp/snapshot_demo/dir1
drwxr-xr-x   - hdfs supergroup          0 2019-02-13 14:32 /tmp/snapshot_demo/dir2

Note: a snapshot restore returns the directory to the state it was in when the snapshot was taken.
For example, if you create a directory after taking a snapshot and then restore to that snapshot, you will no longer have that directory:

[root@destination_cluster15 ~]# hadoop fs -mkdir /tmp/snapshot_demo/dir3
[root@destination_cluster15 ~]# hadoop fs -ls /tmp/snapshot_demo/
Found 3 items
drwxr-xr-x   - hdfs    supergroup          0 2019-02-13 14:32 /tmp/snapshot_demo/dir1
drwxr-xr-x   - hdfs    supergroup          0 2019-02-13 14:32 /tmp/snapshot_demo/dir2
drwxr-xr-x   - bdruser supergroup          0 2019-02-13 14:35 /tmp/snapshot_demo/dir3

and after restoring from the earlier snapshot:

[root@destination_cluster15 ~]# hadoop fs -ls /tmp/snapshot_demo/
Found 2 items
drwxr-xr-x   - hdfs supergroup          0 2019-02-13 14:36 /tmp/snapshot_demo/dir1
drwxr-xr-x   - hdfs supergroup          0 2019-02-13 14:36 /tmp/snapshot_demo/dir2

I have only two directories. Another common case is when a user changes file permissions or the file owner by accident and wants to revert the change:

[root@destination_cluster15 ~]# hadoop fs -ls /tmp/snapshot_demo/
Found 2 items
drwxr-xr-x   - hdfs supergroup          0 2019-02-13 14:36 /tmp/snapshot_demo/dir1
drwxr-xr-x   - hdfs supergroup          0 2019-02-13 14:36 /tmp/snapshot_demo/dir2
[root@destination_cluster15 ~]# hadoop fs -chown yarn:yarn /tmp/snapshot_demo/*
[root@destination_cluster15 ~]# hadoop fs -ls /tmp/snapshot_demo/
Found 2 items
drwxr-xr-x   - yarn yarn          0 2019-02-13 14:36 /tmp/snapshot_demo/dir1
drwxr-xr-x   - yarn yarn          0 2019-02-13 14:36 /tmp/snapshot_demo/dir2

Restore from the snapshot and you have the previous file owner back:

[root@destination_cluster15 ~]# hadoop fs -ls /tmp/snapshot_demo/
Found 2 items
drwxr-xr-x   - hdfs supergroup          0 2019-02-13 14:38 /tmp/snapshot_demo/dir1
drwxr-xr-x   - hdfs supergroup          0 2019-02-13 14:38 /tmp/snapshot_demo/dir2

Conclusion: snapshots are a very powerful tool for protecting your file system from human mistakes. Snapshots store only the delta (the changes), so they will not consume much space if you don't delete data frequently.

Step 2.1. Backup data. On-premise backup. NFS Storage.

Backups in the Hadoop world are a ticklish topic. The reason is time to recovery: how long will it take to bring the data back into the production system? Big Data systems tend to be relatively inexpensive and hold massive datasets, so it may be easier to have a second cluster (details on how to do this come later in this blog). But if you have reasons to take backups, you may consider either NFS storage (if you want to keep the backup on-premise in your own data center) or Object Store in Oracle Cloud Infrastructure (OCI) (if you want to keep the backup outside of your data center) as options.

In the case of NFS storage (like Oracle ZFS), you have to mount the NFS storage at the same directory path on every Hadoop node, like this:

Run on each BDA node:
[root]# mount nfs_storage_ip:/stage/files /tmp/src_srv

Now you have shared storage on every server, which means that every single Linux server has the same directory. This allows you to run the distcp command (which was originally developed for copying large amounts of data between HDFS filesystems). To start a parallel copy, just run:

$ hadoop distcp -m 50 -atomic hdfs://nnode:8020/tmp/test_load/* file:///tmp/src_srv/files/;

This creates a MapReduce job that copies the data from HDFS to the NFS share (mounted on the local file system of each node) using 50 mappers.

Step 2.2. Backup data. Cloud. Object Storage.

Object Store is a key element for every cloud provider, and Oracle is not an exception.
The documentation for Oracle Object Storage can be found here. Object Store provides some benefits, such as:

- Elasticity. Customers don't have to plan ahead how much space they need. Need some extra space? Simply load data into Object Store. There is no difference in the process for copying 1GB or 1PB of data.
- Scalability. It scales infinitely. At least theoretically :)
- Durability and Availability. Object Store is a first-class citizen in every cloud story, so all vendors do their best to maintain 100% availability and durability. If a disk goes down, it shouldn't worry you. If a node running the object store software goes down, it shouldn't worry you. As the user you simply put data into and read data from Object Store.
- Cost. In a cloud, Object Store is the most cost-efficient solution.

Nothing comes for free, and as downsides I would highlight:

- Performance in comparison with HDFS or local block devices. Whenever you read data from Object Store, you read it over the network.
- Inconsistency of performance. You are not alone on the object store, and obviously under the hood it uses physical disks, which have their own throughput. If many users start to read and write data to/from Object Store, you may get performance which is different from what you used to get a day, week or month ago.
- Security. Unlike filesystems, Object Store has no fine-grained file permission policies, and customers will need to reorganize and rebuild their security standards and policies.

Before running a backup, you will need to configure the OCI Object Store in your Hadoop system. After you configure your object storage, you can check the bucket that you intend to copy to:

[root@source_clusternode01 ~]# hadoop fs -ls oci://BDAx7Backup@oraclebigdatadb/

Now you can trigger the actual copy by running either distcp:

[root@source_clusternode01 ~]# hadoop distcp -Dmapred.job.queue.name=root.oci -Dmapreduce.task.timeout=6000000 -m 240 -skipcrccheck -update -bandwidth 10240 -numListstatusThreads 40 /user/hive/warehouse/parq.db/store_returns oci://BDAx7Backup@oraclebigdatadb/

or ODCP, an Oracle-built tool (you can find more info about ODCP here):

[root@source_clusternode01 ~]# odcp --executor-cores 3 --executor-memory 9 --num-executors 100  hdfs:///user/hive/warehouse/parq.db/store_sales oci://BDAx7Backup@oraclebigdatadb/

and after the copy is done, you will be able to see all your data in the OCI Object Store bucket.

Step 2.3. Backup data. Big Data Appliance metadata.

This is the easiest section for me to write, because Oracle support engineers have made a huge effort writing a support note that tells customers how to take backups of metadata. For more details please refer to: How to Backup Critical Metadata on Oracle Big Data Appliance Prior to Upgrade V2.3.1 and Higher Releases (Doc ID 1623304.1).

Step 2.4. Backup data. MySQL.

Separately, I'd like to mention that a MySQL backup is very important, and you can get familiar with it here: How to Redirect a MySQL Database Backup on BDA Node 3 to a Different Node on the BDA or on a Client/Edge Server (Doc ID 1926159.1).

Step 3. HDFS Disaster Recovery. Recommended Architecture.

Here I'd like to share the Oracle-recommended architecture for a Disaster Recovery setup. We recommend having the same hardware and software environment for the production and DR environments. If you want to have less powerful nodes on the DR side, you should ask yourself what you are going to do in case of a disaster. What is going to happen when you switch all production applications over to the DR side?
Step 3. HDFS Disaster Recovery. Recommended Architecture
Here I'd like to share the Oracle-recommended architecture for a Disaster Recovery setup. We recommend having the same hardware and software environment for the Production and DR environments. If you want to have less powerful nodes on the DR side, you should ask yourself what you are going to do in case of a disaster: what is going to happen when you switch all production applications over to the DR side? Will it be able to handle that workload? Also, one very clear recommendation from Oracle is to have a small BDA (3-6 nodes) for running tests. Here is a rough separation of duties for these three clusters:
Production (Prod):
- Runs the production workload
Disaster Recovery (DR):
- Same (or almost the same) BDA hardware configuration
- Runs non-critical ad-hoc queries
- Takes over in case of unplanned (disaster) or planned (upgrade) outages of prod
Test:
Use a small BDA cluster (3-6 nodes) to test different things, such as:
- Upgrades
- Changing settings (HDFS, YARN)
- Testing new engines (add and test HBase, Flume, ...)
- Testing integration with other systems (AD, Database)
- Testing Dynamic Resource Pools
...
Note: for the test environment you may also consider the Oracle Cloud offering.

Step 3.1 HDFS Disaster Recovery. Big Data Disaster Recovery (BDR).
Now we are approaching the most interesting part of the blog: disaster recovery. Cloudera offers a tool out of the box called Big Data Disaster Recovery (BDR), which lets Hadoop administrators easily create replication policies and schedule data replication through a web interface. Let me show an example of how to do this replication with BDR. I have two BDA clusters, source_cluster and destination_cluster. Under the hood BDR uses a special version of distcp with many performance and functional optimizations.

Step 3.1.1 HDFS Disaster Recovery. Big Data Disaster Recovery (BDR). Network throughput
It's crucial to understand the network throughput between the clusters. To measure it you may use any tool convenient for you; I personally prefer iperf.
Note: iperf has two modes - UDP and TCP. Use TCP for measurements in the context of BDR, because BDR uses TCP connections.
After installation it's quite easy to run. First, make one machine (let's say one on the destination cluster) the server and run iperf in listening mode:

[root@destination_cluster15 ~]# iperf -s

On the source machine run the client command, which sends TCP traffic to the server machine for 10 minutes at maximum bandwidth:

[root@source_clusternode01 ~]# iperf -c destination_cluster15 -b 10000m -t 600

Once you have these numbers, you know what you can count on when you run the copy job.

Step 3.1.2 HDFS Disaster Recovery. Big Data Disaster Recovery (BDR). Ports.
It's quite typical for different Hadoop clusters to sit in different data centers behind firewalls. Before you start running BDR jobs, please make sure that all necessary ports are open on both sides.

Step 3.1.3 HDFS Disaster Recovery. Big Data Disaster Recovery (BDR). Kerberos Naming Recommendations.
We (as well as Cloudera) generally recommend using different KDC realms with trust between them and a different realm name for each cluster: all user principals obtain their credentials from AD, while the MIT KDC stores the service principals. More details on BDA security good practices can be found here. I'll assume that both clusters are kerberized (Kerberos is almost the default nowadays), so we will need to do some configuration around this. Detailed steps on how to set up Kerberos for two clusters that use different KDCs can be found here. If you want to know how to set up a trusted relationship between clusters, you can refer here. I just want to briefly highlight the most important steps.
1) On the source cluster go to "Administration -> Settings", search for "KDC Server Host" and set the hostname of the source KDC. Do the same for "KDC Admin Server Host".
This is important because when the destination cluster reaches out to the source and asks for the KDC, it does not read /etc/krb5.conf as you might think - it reads the KDC address from this property.
2) Both clusters are in the same domain. This is quite a probable and common case. You can recognize it by the following error message:
"Peer cluster has domain us.oracle.com and realm ORACLE.TEST but a mapping already exists for this domain us.oracle.com with realm US.ORACLE.COM. Please use hostname(s) instead of domain(s) for realms US.ORACLE.COM and ORACLE.TEST, so there are no conflicting domain to realm mapping."
It's easy to fix by adding the exact host names under the "domain_realm" section in /etc/krb5.conf:
[domain_realm]
source_clusternode01.us.oracle.com = ORACLE.TEST
source_clusternode02.us.oracle.com = ORACLE.TEST
source_clusternode03.us.oracle.com = ORACLE.TEST
source_clusternode04.us.oracle.com = ORACLE.TEST
source_clusternode05.us.oracle.com = ORACLE.TEST
source_clusternode06.us.oracle.com = ORACLE.TEST
destination_cluster13.us.oracle.com = US.ORACLE.COM
destination_cluster14.us.oracle.com = US.ORACLE.COM
destination_cluster13.us.oracle.com = US.ORACLE.COM
.us.oracle.com = US.ORACLE.COM
us.oracle.com = US.ORACLE.COM

Note: here you do a host-to-realm mapping, because I have two different realms and two different KDCs but only one domain. If I use any host outside of the given list, I need to specify the default realm for the domain (the last two rows).
3) On the destination cluster add the realm of the source cluster to /etc/krb5.conf:
[root@destination_cluster13 ~]# cat /etc/krb5.conf
...
[realms]
 US.ORACLE.COM = {
  kdc = destination_cluster13.us.oracle.com:88
  kdc = destination_cluster14.us.oracle.com:88
  admin_server = destination_cluster13.us.oracle.com:749
  default_domain = us.oracle.com
 }
ORACLE.TEST = {
kdc = source_clusternode01.us.oracle.com
admin_server = source_clusternode01.us.oracle.com
default_domain = us.oracle.com
}
...
Try to obtain credentials and explore the source cluster's HDFS:
[root@destination_cluster13 ~]# kinit oracle@ORACLE.TEST
Password for oracle@ORACLE.TEST: 
[root@destination_cluster13 ~]# klist 
Ticket cache: FILE:/tmp/krb5cc_0
Default principal: oracle@ORACLE.TEST

Valid starting     Expires            Service principal
02/04/19 22:47:42  02/05/19 22:47:42  krbtgt/ORACLE.TEST@ORACLE.TEST
    renew until 02/11/19 22:47:42
[root@destination_cluster13 ~]# hadoop fs -ls hdfs://source_clusternode01:8020
19/02/04 22:47:54 WARN security.UserGroupInformation: PriviledgedActionException as:oracle@ORACLE.TEST (auth:KERBEROS) cause:javax.security.sasl.SaslException: GSS initiate failed [Caused by GSSException: No valid credentials provided (Mechanism level: Fail to create credential. (63) - No service creds)]

It fails, but that's not a big surprise - the clusters don't have a trusted relationship yet. Let's fix this.
Note: Big Data Disaster Recovery doesn't require a trusted Kerberos relationship between the clusters (plain distcp does), but to make debugging and some other operational activities easier, I'd recommend adding it.
On the destination cluster:
[root@destination_cluster13 ~]# kadmin.local 
kadmin.local:  addprinc krbtgt/ORACLE.TEST@US.ORACLE.COM
WARNING: no policy specified for krbtgt/ORACLE.TEST@US.ORACLE.COM; defaulting to no policy
Enter password for principal "krbtgt/ORACLE.TEST@US.ORACLE.COM": 
Re-enter password for principal "krbtgt/ORACLE.TEST@US.ORACLE.COM": 
Principal "krbtgt/ORACLE.TEST@US.ORACLE.COM" created.
On the source cluster:
[root@source_clusternode01 ~]# kadmin.local 
kadmin.local:  addprinc krbtgt/ORACLE.TEST@US.ORACLE.COM
WARNING: no policy specified for krbtgt/ORACLE.TEST@US.ORACLE.COM; defaulting to no policy
Enter password for principal "krbtgt/ORACLE.TEST@US.ORACLE.COM": 
Re-enter password for principal "krbtgt/ORACLE.TEST@US.ORACLE.COM": 
Principal "krbtgt/ORACLE.TEST@US.ORACLE.COM" created.

Make sure that you create the same principal with the same password on both KDCs. Now try to explore the source cluster's HDFS from the destination cluster again:
[root@destination_cluster13 ~]# hadoop fs -ls hdfs://source_clusternode01:8020
Found 4 items
drwx------   - hbase hbase               0 2019-02-04 22:34 /hbase
drwxr-xr-x   - hdfs  supergroup          0 2018-03-14 06:46 /sfmta
drwxrwxrwx   - hdfs  supergroup          0 2018-10-31 15:41 /tmp
drwxr-xr-x   - hdfs  supergroup          0 2019-01-07 09:30 /user

Bingo! It works. Now we have to do the same on both clusters to allow the reverse direction:
19/02/05 02:02:02 INFO util.KerberosName: No auth_to_local rules applied to oracle@US.ORACLE.COM
19/02/05 02:02:03 WARN security.UserGroupInformation: PriviledgedActionException as:oracle@US.ORACLE.COM (auth:KERBEROS) cause:javax.security.sasl.SaslException: GSS initiate failed [Caused by GSSException: No valid credentials provided (Mechanism level: Fail to create credential. (63) - No service creds)]

Same error, and the same fix for it. I simply automate this by running the following commands on both KDCs:
delprinc -force krbtgt/US.ORACLE.COM@ORACLE.TEST
delprinc -force krbtgt/ORACLE.TEST@US.ORACLE.COM
addprinc -pw "welcome1" krbtgt/US.ORACLE.COM@ORACLE.TEST
addprinc -pw "welcome1" krbtgt/ORACLE.TEST@US.ORACLE.COM

4) Make sure that in /var/kerberos/krb5kdc/kdc.conf you have:
default_principal_flags = +renewable, +forwardable

Step 3.1.4 HDFS Disaster Recovery. Big Data Disaster Recovery (BDR). SSL
The next assumption is that your Cloudera Manager works over an encrypted channel, so when you try to add the source peer you will most probably get an exception. In order to fix this:
a.
Check certificate for Cloudera Manager (run this command on the destination cluster): [root@destination_cluster13 ~]# openssl s_client -connect source_clusternode05.us.oracle.com:7183 CONNECTED(00000003) depth=0 C = , ST = , L = , O = , OU = , CN = source_clusternode05.us.oracle.com verify error:num=18:self signed certificate verify return:1 depth=0 C = , ST = , L = , O = , OU = , CN = source_clusternode05.us.oracle.com verify return:1 --- Certificate chain  0 s:/C=/ST=/L=/O=/OU=/CN=source_clusternode05.us.oracle.com    i:/C=/ST=/L=/O=/OU=/CN=source_clusternode05.us.oracle.com --- Server certificate -----BEGIN CERTIFICATE----- MIIDYTCCAkmgAwIBAgIEP5N+XDANBgkqhkiG9w0BAQsFADBhMQkwBwYDVQQGEwAx CTAHBgNVBAgTADEJMAcGA1UEBxMAMQkwBwYDVQQKEwAxCTAHBgNVBAsTADEoMCYG A1UEAxMfYmRheDcyYnVyMDlub2RlMDUudXMub3JhY2xlLmNvbTAeFw0xODA3MTYw MzEwNDVaFw0zODA0MDIwMzEwNDVaMGExCTAHBgNVBAYTADEJMAcGA1UECBMAMQkw BwYDVQQHEwAxCTAHBgNVBAoTADEJMAcGA1UECxMAMSgwJgYDVQQDEx9iZGF4NzJi dXIwOW5vZGUwNS51cy5vcmFjbGUuY29tMIIBIjANBgkqhkiG9w0BAQEFAAOCAQ8A MIIBCgKCAQEAkLwi9lAsbiWPVUpQNAjtGE5Z3pJOExtJMSuvnj02FC6tq6I09iJ0 MsTu6+Keowv5CUlhfxTy1FD19ZhX3G7OEynhlnnhJ+yjprYzwRDhMHUg1LtqWib/ osHR1QfcDfLsByBKO0WsLBxCz/+OVm8ZR+KV/AeZ5UcIsvzIRZB4V5tWP9jziha4 3upQ7BpSvQhd++eFb4wgtiBsI8X70099ZI8ctFpmPjxtYHQSGRGdoZZJnHtPY4IL Vp0088p+HeLMcanxW7CSkBZFn9nHgC5Qa7kmLN4EHhjwVfPCD+luR/k8itH2JFw0 Ub+lCOjSSMpERlLL8fCnETBc2nWCHNQqzwIDAQABoyEwHzAdBgNVHQ4EFgQUkhJo 0ejCveCcbdoW4+nNX8DjdX8wDQYJKoZIhvcNAQELBQADggEBAHPBse45lW7TwSTq Lj05YwrRsKROFGcybpmIlUssFMxoojys2a6sLYrPJIZ1ucTrVNDspUZDm3WL6eHC HF7AOiX4/4bQZv4bCbKqj4rkSDmt39BV+QnuXzRDzqAxad+Me51tisaVuJhRiZkt AkOQfAo1WYvPpD6fnsNU24Tt9OZ7HMCspMZtYYV/aw9YdX614dI+mj2yniYRNR0q zsOmQNJTu4b+vO+0vgzoqtMqNVV8Jc26M5h/ggXVzQ/nf3fmP4f8I018TgYJ5rXx Kurb5CL4cg5DuZnQ4zFiTtPn3q5+3NTWx4A58GJKcJMHe/UhdcNvKLA1aPFZfkIO /RCqvkY= -----END CERTIFICATE----- b. Go to the source cluster and find a file which has this certificate (run this command on the source cluster): [root@source_clusternode05 ~]# grep -iRl "MIIDYTCCAkmgAwIBAgIEP5N+XDANBgkqhkiG9w0BAQsFADBhMQkwBwYDVQQGEwAx" /opt/cloudera/security/|grep "cert" /opt/cloudera/security/x509/node.cert /opt/cloudera/security/x509/ssl.cacerts.pem   c. Make sure that each node has different certificate by calculating the hash (run this command on the source cluster): [root@source_clusternode05 ~]# dcli -C "md5sum /opt/cloudera/security/x509/node.cert" 192.168.8.170: cc68d7f5375e3346d312961684d728c0  /opt/cloudera/security/x509/node.cert 192.168.8.171: 9259bb0102a1775b164ce56cf438ed0e  /opt/cloudera/security/x509/node.cert 192.168.8.172: 496fd4e12bdbfc7c6aab35d970429a72  /opt/cloudera/security/x509/node.cert 192.168.8.173: 8637b8cfb5db843059c7a0aeb53071ec  /opt/cloudera/security/x509/node.cert 192.168.8.174: 4aabab50c256e3ed2f96f22a81bf13ca  /opt/cloudera/security/x509/node.cert 192.168.8.175: b50c2e40d04a026fad89da42bb2b7c6a  /opt/cloudera/security/x509/node.cert [root@source_clusternode05 ~]#    d. rename this certificates (run this command on the source cluster):   [root@source_clusternode05 ~]# dcli -C cp /opt/cloudera/security/x509/node.cert /opt/cloudera/security/x509/node_'`hostname`'.cert e. 
Check the new names (run this command on the source cluster):
[root@source_clusternode05 ~]# dcli -C "ls /opt/cloudera/security/x509/node_*.cert"
192.168.8.170: /opt/cloudera/security/x509/node_source_clusternode01.us.oracle.com.cert
192.168.8.171: /opt/cloudera/security/x509/node_source_clusternode02.us.oracle.com.cert
192.168.8.172: /opt/cloudera/security/x509/node_source_clusternode03.us.oracle.com.cert
192.168.8.173: /opt/cloudera/security/x509/node_source_clusternode04.us.oracle.com.cert
192.168.8.174: /opt/cloudera/security/x509/node_source_clusternode05.us.oracle.com.cert
192.168.8.175: /opt/cloudera/security/x509/node_source_clusternode06.us.oracle.com.cert
f. Pull those certificates from the source cluster to one node of the destination cluster (run this command on the destination cluster):
[root@destination_cluster13 ~]# for i in {1..6}; do export NODE_NAME=source_clusternode0$i.us.oracle.com; scp root@$NODE_NAME:/opt/cloudera/security/x509/node_$NODE_NAME.cert /opt/cloudera/security/jks/node_$NODE_NAME.cert; done;
g. Propagate them to all the nodes of the destination cluster (run this command on the destination cluster):
[root@destination_cluster13 ~]# for i in {4..5}; do scp /opt/cloudera/security/jks/node_source_clusternode0*.cert root@destination_cluster1$i:/opt/cloudera/security/jks; done;
h. On the destination host obtain the truststore password and truststore location (run these commands on the destination cluster):
[root@destination_cluster13 ~]# bdacli getinfo cluster_https_truststore_path
Enter the admin user for CM (press enter for admin): 
Enter the admin password for CM: 
/opt/cloudera/security/jks/cdhs49.truststore

[root@destination_cluster13 ~]# bdacli getinfo cluster_https_truststore_password
Enter the admin user for CM (press enter for admin): 
Enter the admin password for CM: 
dl126jfwt1XOGUlNz1jsAzmrn1ojSnymjn8WaA7emPlo5BnXuSCMtWmLdFZrLwJN
i. And set them as environment variables on all hosts of the destination cluster (run this command on the destination cluster):
[root@destination_cluster13 ~]# export TRUSTORE_PASSWORD=dl126jfwt1XOGUlNz1jsAzmrn1ojSnymjn8WaA7emPlo5BnXuSCMtWmLdFZrLwJN
[root@destination_cluster13 ~]# export TRUSTORE_FILE=/opt/cloudera/security/jks/cdhs49.truststore
j. Now we are ready to add the certificates to the destination cluster's truststore (run this command on all hosts of the destination cluster):
[root@destination_cluster13 ~]# for i in {1..6}; do export NODE_NAME=source_clusternode0$i.us.oracle.com; keytool -import -noprompt  -alias $NODE_NAME -file /opt/cloudera/security/jks/node_$NODE_NAME.cert -keystore $TRUSTORE_FILE -storepass $TRUSTORE_PASSWORD; done;
Certificate was added to keystore
Certificate was added to keystore
Certificate was added to keystore
Certificate was added to keystore
Certificate was added to keystore
k. To validate that the certificates were added, run (run this command on the destination cluster):
[root@destination_cluster13 ~]# keytool -list -keystore $TRUSTORE_FILE -storepass $TRUSTORE_PASSWORD
Keystore type: jks
Keystore provider: SUN

Your keystore contains 9 entries

destination_cluster14.us.oracle.com, May 30, 2018, trustedCertEntry, 
Certificate fingerprint (SHA1): B3:F9:70:30:77:DE:92:E0:A3:20:6E:B3:96:91:74:8E:A9:DC:DF:52
source_clusternode02.us.oracle.com, Feb 1, 2019, trustedCertEntry, 
Certificate fingerprint (SHA1): 3F:6E:B9:34:E8:F9:0B:FF:CF:9A:4A:77:09:61:E9:07:BF:17:A0:F1
source_clusternode05.us.oracle.com, Feb 1, 2019, trustedCertEntry, 
Certificate fingerprint (SHA1): C5:F0:DB:93:84:FA:7D:9C:B4:C9:24:19:6F:B3:08:13:DF:B9:D4:E6
destination_cluster15.us.oracle.com, May 30, 2018, trustedCertEntry, 
Certificate fingerprint (SHA1): EC:42:B8:B0:3B:25:70:EF:EF:15:DD:E6:AA:5C:81:DF:FD:A2:EB:6C
source_clusternode03.us.oracle.com, Feb 1, 2019, trustedCertEntry, 
Certificate fingerprint (SHA1): 35:E1:07:F0:ED:D5:42:51:48:CB:91:D3:4B:9B:B0:EF:97:99:87:4F
source_clusternode06.us.oracle.com, Feb 1, 2019, trustedCertEntry, 
Certificate fingerprint (SHA1): 16:8E:DF:71:76:C8:F0:D3:E3:DF:DA:B2:EC:D5:66:83:83:F0:7D:97
destination_cluster13.us.oracle.com, May 30, 2018, trustedCertEntry, 
Certificate fingerprint (SHA1): 76:C4:8E:82:3C:16:2D:7E:C9:39:64:F4:FC:B8:24:40:CD:08:F8:A9
source_clusternode01.us.oracle.com, Feb 1, 2019, trustedCertEntry, 
Certificate fingerprint (SHA1): 26:89:C2:2B:E3:B8:8D:46:41:C6:C0:B6:52:D2:C4:B8:51:23:57:D2
source_clusternode04.us.oracle.com, Feb 1, 2019, trustedCertEntry, 
Certificate fingerprint (SHA1): CB:98:23:1F:C0:65:7E:06:40:C4:0C:5E:C3:A9:78:F3:9D:E8:02:9E
[root@destination_cluster13 ~]# 
l. Now do the same on the other nodes of the destination cluster.

Step 3.1.5 HDFS Disaster Recovery. Big Data Disaster Recovery (BDR). Create replication user
On the destination cluster you need to create the user that will be used for replication and make it a member of the HDFS supergroup:
[root@destination_cluster15 ~]# dcli -C "useradd bdruser -u 2000"
[root@destination_cluster15 ~]# dcli -C "groupadd supergroup -g 2000"
[root@destination_cluster15 ~]# dcli -C "usermod -g supergroup bdruser"
And after this verify that the user belongs to the supergroup:
[root@destination_cluster15 ~]# hdfs groups bdruser
bdruser : supergroup

Step 3.1.6 HDFS Disaster Recovery. Big Data Disaster Recovery (BDR). Create separate job for encrypted zones
It's possible to copy data from an encrypted zone, but there is a trick to it. If you just try it, you will find this error in the BDR logs:
java.io.IOException: Checksum mismatch between hdfs://distcpSourceNS/tmp/EZ/parq.db/customer/000001_0 and hdfs://cdhs49-ns/tmp/EZ/parq.db/customer/.distcp.tmp.4101922333172283041
Fortunately, this problem is easily solved: you just need to skip checksum calculation for encrypted zones. It is good practice to create a separate job for copying data from encrypted zones and to exclude the encrypted directories from the general backup job.
Example: you have a directory that you want to exclude (/tmp/excltest/bad) from the common copy job. To do this, go to the "Advanced" settings and add a "Path Exclusion". In my example you need to put .*\/tmp\/excltest\/bad+.* and you can test this regexp by creating the following directory structure and adding the Path Exclusion.
[root@source_clusternode05 ~]# hadoop fs -mkdir /tmp/excltest/
[root@source_clusternode05 ~]# hadoop fs -mkdir /tmp/excltest/good1
[root@source_clusternode05 ~]# hadoop fs -mkdir /tmp/excltest/good2
[root@source_clusternode05 ~]# hadoop fs -mkdir /tmp/excltest/bad
Note: it may be quite hard to create and validate a regular expression (this is Java), so for this purpose you may use this on-line resource.

Step 3.1.7 HDFS Disaster Recovery. Big Data Disaster Recovery (BDR). Enable Snapshots on the Source Cluster
Replication without snapshots may fail; when snapshots are enabled, the replication job automatically creates one before copying. Some replications, especially those that require a long time to finish, can fail because source files are modified during the replication process. You can prevent such failures by using snapshots in conjunction with replication. This use of snapshots is automatic with CDH versions 5.0 and higher. To take advantage of it, you must enable snapshots for the relevant directories (this is also called making the directory snapshottable). When the replication job runs, it checks whether the specified source directory is snapshottable. Before replicating any files, the replication job creates point-in-time snapshots of these directories and uses them as the source for the file copies.
What happens when you copy data without snapshots? Test case:
1) Start copying (a decent amount of data)
2) In the middle of the copy process, delete files from the source
3) Get an error:
ERROR distcp.DistCp: Job failed to copy 443 files/dirs. Please check Copy Status.csv file or Error Status.csv file for error messages
INFO distcp.DistCp: Used diff: false
WARN distcp.SnapshotMgr: No snapshottable directories have found. Reason: either run-as-user does not have permissions to get snapshottable directories or source path is not snapshottable.
ERROR org.apache.hadoop.ipc.RemoteException(java.io.FileNotFoundException): File does not exist: /bdax7_4/store_sales/000003_0
To overcome this:
1) On the source go to CM -> HDFS -> File Browser, pick the right directory and click on "Enable Snapshots":
2) After this, when you run the job it will automatically take a snapshot and copy from it:
3) If you delete data, the copy job will still finish from the snapshot it took. Note that you can't delete the entire directory, but you can delete all files from it:
[root@destination_cluster15 ~]# hadoop fs -rm -r -skipTrash /bdax7_4
rm: The directory /bdax7_4 cannot be deleted since /bdax7_4 is snapshottable and already has snapshots
[root@destination_cluster15 ~]# hadoop fs -rm -r -skipTrash /bdax7_4/*
Deleted /bdax7_4/new1.file
Deleted /bdax7_4/store_sales
Deleted /bdax7_4/test.file
[root@destination_cluster15 ~]# hadoop fs -ls /bdax7_4
[root@destination_cluster15 ~]#
4) The copy completes successfully!

Step 3.1.8 HDFS Disaster Recovery. Big Data Disaster Recovery (BDR). Rebalancing data
HDFS tries to keep data evenly distributed across all nodes in a cluster, but after intensive writes it may be useful to run a rebalance. The default rebalancing threshold is 10%, which is a bit high; it makes sense to change the "Rebalancing Threshold" from 10 to 2 (Cloudera Manager -> HDFS -> Instances -> Balancer -> Configuration). Also, in order to speed up the rebalance, we can increase the value of "dfs.datanode.balance.max.concurrent.moves" from 10 to 1000 (the number of block moves permitted in parallel).
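As an aside, if you prefer the command line to the Cloudera Manager UI, the balancer can also be started directly with the same threshold. A minimal sketch (run it as the hdfs superuser):

# Run the HDFS balancer with a 2% threshold, matching the setting discussed above
sudo -u hdfs hdfs balancer -threshold 2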
After making these changes, save them and run the rebalance. On clusters with heavy read/write/delete activity we may also hit an imbalance within a single node (data unevenly distributed across the multiple disks of one node). Here is the Cloudera blog post about it. In short, we have to go to "Cloudera Manager -> HDFS -> Configuration -> HDFS Service Advanced Configuration Snippet (Safety Valve) for hdfs-site.xml" and add dfs.disk.balancer.enabled as the name and true as the value. Sometimes you may have a real data skew problem, which can easily be fixed by running a rebalance.
Note: if you want to visualize the data distribution, you can check this tool developed at CERN.

Step 3.1.9 HDFS Disaster Recovery. Big Data Disaster Recovery (BDR). Hive
There is also an option for copying Hive metadata as well as the actual data. Note: if you would like to copy one particular database schema, you need to specify it in the copy wizard and also provide a regular expression for the tables you want to copy ([\w].+ for all tables). For example, this replication policy will copy all tables from the database "parq". If you leave the expression blank, nothing will be copied. More examples of regular expressions can be found here.

Step 3.1.10 HDFS Disaster Recovery. Big Data Disaster Recovery (BDR). Sentry
Sentry is the default authorization mechanism in Cloudera, and you may want to replicate the authorization rules to the second cluster. Unfortunately, there is no mechanism for this embedded in BDR, so you have to come up with your own solution.
Note: you have to configure Sentry in a certain way to make it work with BDR. Please refer here for more details.

Step 3.1.11 HDFS Disaster Recovery. Big Data Disaster Recovery (BDR). Advantages and Disadvantages of BDR
Generally speaking, I can recommend BDR as long as it meets your needs and requirements. Here is a brief summary of its advantages and disadvantages.
Advantages (+):
- It's available out of the box, no need to install anything
- It's free of charge
- It's relatively easy to configure and start working with basic examples
Disadvantages (-):
- It's not a real-time tool. Users have to schedule batch jobs that run periodically
- There is no transparent way to fail over. If the primary side fails, users have to manually switch their applications over to the new cluster
- BDR (distcp under the cover) is a MapReduce job, which takes significant resources
- Because of this, and because of the nature of MapReduce, copying one big file is not parallelized (it is copied in a single thread)
- Hive changes are not fully replicated (dropped tables have to be backed up manually; see the sketch after this list for one way to approach that)
- Replicating a big number of files (or a Hive table with a big number of partitions) takes a long time to finish. I can say that it's nearly impossible to replicate a directory with around 1 million objects (files/directories)
- It only supports Cloudera-to-Cloudera or Cloudera-to-Object-Store copies. There is no way to copy to Hortonworks (but after the merger of these companies it's not a huge problem anymore)
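On the Hive limitation mentioned in the list above: one simple mitigation, which is not part of BDR itself, is to periodically dump the table DDL on the source cluster so that dropped or changed tables can be recreated by hand. A rough sketch using the hive CLI (the database name parq comes from the example above; beeline would work similarly, and the output path is just a placeholder):

#!/bin/bash
# Hedged sketch: export "SHOW CREATE TABLE" statements for every table in the parq database.
OUT=/opt/backups/hive_ddl_$(date +%Y%m%d).sql
for t in $(hive -S -e 'SHOW TABLES IN parq'); do
  echo "-- parq.${t}" >> "${OUT}"
  hive -S -e "SHOW CREATE TABLE parq.${t};" >> "${OUT}"
done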
Step 3.2 HDFS Disaster Recovery. Wandisco
If you have hit one of the challenges that I've described above, it makes sense to take a look at an alternative solution called WANdisco.

Step 3.2.1 HDFS Disaster Recovery. Wandisco. Edge (proxy) node
With WANdisco you will need to prepare some proxy nodes on the source and destination side. We recommend using one of the Big Data Appliance nodes for this; this MOS note will guide you on how to free up one of the nodes so it can act as the proxy node: How to Remove Non-Critical Nodes from a BDA Cluster (Doc ID 2244458.1)

Step 3.2.2 HDFS Disaster Recovery. Wandisco. Installation
WANdisco Fusion is enterprise-class software. It requires careful gathering of the environment requirements for the installation, especially with multi-homed networking as on Oracle BDA. Once the environment is fully understood, care must be taken in completing the installation screens by following the documentation closely.
Note: if you also want to handle clusters with BDR that don't have the WANdisco Fusion software, you have to install the Fusion client on them.

Step 3.2.3 HDFS Disaster Recovery. Wandisco. Architecture
For my tests I used two Big Data Appliances (Starter Rack - 6 nodes). WANdisco requires its software to be installed on edge nodes, so I converted Node06 into the edge node for WANdisco purposes. The final architecture looks like this:

Step 3.2.4 HDFS Disaster Recovery. Wandisco. Replication by example
Here I'd like to show how to set up replication between two clusters. You need to install the WANdisco Fusion software on both clusters. As soon as you install Fusion on the second (DR) cluster, you need to perform Induction (peering) with the first (Prod) cluster. As a result of the installation you get a WebUI for WANdisco Fusion (it's recommended to install it on the edge node); you go there and set up the replication rules.
Go to the replication tab and click on the "create" button. After this, specify the path that you would like to replicate and choose the source of truth. Then click on the "make consistent" button to kick off the replication. You can monitor the list of files and permissions that have not been replicated yet, and you can monitor the performance of the replication in real time.
On the destination cluster you may see files that have not been replicated yet (metadata only) with the suffix "_REPAIR_":
[root@dstclusterNode01 ~]#  hadoop fs -ls /tmp/test_parq_fusion/store_returns/
19/03/26 22:48:38 INFO client.FusionUriUtils: fs.fusion.check.underlyingFs: [true], URI: [hdfs://gcs-lab-bdax72orl-ns], useFusionForURI: [true]
19/03/26 22:48:38 INFO client.FusionCommonFactory: Initialized FusionHdfs with URI: hdfs://gcs-lab-bdax72orl-ns, FileSystem: DFS[DFSClient[clientName=DFSClient_NONMAPREDUCE_142575995_1, ugi=hdfs/dstclusterNode01.us.oracle.com@US.ORACLE (auth:KERBEROS)]], instance: 1429351083, version: 2.12.4.3
Found 26 items
drwxrwxrwx   - hdfs supergroup          0 2019-03-26 18:29 /tmp/test_parq_fusion/store_returns/.fusion
-rw-r--r--   3 hdfs supergroup          0 2019-03-26 22:48 /tmp/test_parq_fusion/store_returns/000000_0._REPAIR_
-rw-r--r--   3 hdfs supergroup          0 2019-03-26 22:48 /tmp/test_parq_fusion/store_returns/000001_0._REPAIR_
-rw-r--r--   3 hdfs supergroup          0 2019-03-26 22:48 /tmp/test_parq_fusion/store_returns/000002_0._REPAIR_
-rw-r--r--   3 hdfs supergroup          0 2019-03-26 22:48 /tmp/test_parq_fusion/store_returns/000003_0._REPAIR_
If you put a file on one side, it will automatically appear on the other side (no action required).

Step 3.2.5 HDFS Disaster Recovery. Wandisco. Hive
The Fusion Plugin for Live Hive enables WANdisco Fusion to replicate Hive's metastore, allowing WANdisco Fusion to maintain a replicated instance of Hive's metadata and, in future, support Hive deployments that are distributed between data centers.
The Fusion Plugin for Live Hive extends WANdisco Fusion by replicating Apache Hive metadata. With it, WANdisco Fusion maintains a Live Data environment that includes Hive content, so that applications can access, use, and modify a consistent view of data everywhere, spanning platforms and locations, even at petabyte scale. WANdisco Fusion ensures the availability and accessibility of critical data everywhere. You can find more details here.

Step 3.2.6 HDFS Disaster Recovery. Wandisco. Sentry
Use the Fusion Plugin for Live Sentry to extend the WANdisco Fusion server with the ability to replicate policies among Apache Sentry Policy Provider instances. It coordinates activities that modify Sentry policy definitions among multiple instances of the Sentry Policy Provider across separate clusters, to maintain common policy enforcement in each cluster. The Fusion Plugin for Live Sentry uses WANdisco Fusion for coordination and replication. You can find more details here.

Step 3.2.7 HDFS Disaster Recovery. Wandisco. Advantages and Disadvantages
Talking about WANdisco's disadvantages, I have to say that it was very hard to install. The WANdisco folks promised that this will be improved in the future, but time will tell.
Advantages (+):
- It's real-time. You just load data into one cluster and the other cluster immediately picks up the changes
- It's active-active replication. You can load data into both clusters and the data sync happens automatically
- Sentry policy replication
- It uses fewer resources than BDR
- Replication policies are easy to manage through the WebUI
- WANdisco supports replication across different Hadoop distributions
- WANdisco supports multiple endpoints (multi-target replication). A replication rule isn't limited to just a source and a target (e.g. Prod, DR, Object Store)
Disadvantages (-):
- A common trade-off for additional features is additional complexity during installation. This is the case with WANdisco Fusion
- It costs extra money (BDR is free)
- It requires a special Hadoop client. As a consequence, if you want to replicate data with BDR to some remote clusters, you need to install the WANdisco Fusion Hadoop client on them

Step 3.3 HDFS Disaster Recovery. Conclusion
I'll leave it to the customer to decide which replication approach is better. I'd just say that it's a good approach to start with Big Data Disaster Recovery (because it's free and ready to use out of the box) and, if you run into challenges with it, to then try the WANdisco software.

Step 4.1 HBase Disaster Recovery
In this blog post I've focused on HDFS and Hive data replication. If you want to replicate HBase to a remote cluster, you can find all the details on how to do this here.

Step 5.1 Kafka Disaster Recovery
Kafka is another place where users may store data and want to replicate it. Cloudera recommends using MirrorMaker to do this.

Step 6.1 Kudu Disaster Recovery
There is another option available for customers to store their data - Kudu. As of today (03/01/2019), Kudu doesn't have a solution for replicating data to the Disaster Recovery side.

Step 7.1 Solr Disaster Recovery
Solr, or Cloudera Search, is another engine for storing data. You can get familiar with its DR best practices by reading this blog from Cloudera.
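Coming back to Step 5.1 for a moment: just to illustrate the moving parts, a minimal MirrorMaker invocation looks roughly like the sketch below. The two properties files are assumptions - one pointing the consumer at the source cluster and the other pointing the producer at the DR cluster - and topic selection and tuning are deployment-specific:

# Hedged sketch of the classic MirrorMaker shipped with Kafka:
# consume every topic from the source cluster and republish it on the DR cluster.
kafka-mirror-maker \
  --consumer.config source_consumer.properties \
  --producer.config dr_producer.properties \
  --num.streams 4 \
  --whitelist '.*'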


Autonomous

Querying external CSV, TSV and JSON files from your Autonomous Data Warehouse

I would like to provide here some practical examples and best practices of how to make use of the powerful data loading and querying features of the Autonomous Data Warehouse (ADW) in the Oracle Cloud. We will dive into the meaning of the more widely used parameters, which will help you and your teams derive business value out of your Data Warehouse in a jiffy! An extremely useful feature of this fully managed service is the ability to directly query data lying in your external object store, without incurring the time and cost of physically loading your data from an object store into your Data Warehouse instance. The DBMS_CLOUD.CREATE_EXTERNAL_TABLE procedure enables this behavior, creating a table structure over your external object store data and allowing your ADW instance to directly run queries and analyses on it. A few precursor requirements to get us running these analyses:
Make sure you have a running ADW instance, a credentials wallet and a working connection to your instance. If you haven't done this already, follow Lab 1 in this tutorial.
Use this link to download the data files for the following examples. You will need to unzip and upload these files to your Object Store. Once again, if you don't know how to do this, follow Lab 3 Step 4 in this tutorial, which uploads files to a bucket in the Oracle Cloud Object Store, the most streamlined option. You may also use AWS or Azure object stores if required; refer to the documentation for more information on this.
If you are using the Oracle Cloud Object Store as in Lab 3 above, you will need Swift URLs for the files lying in your object store. If you already created your object store bucket's URL in the lab, you may use that; otherwise, create it from the URL below by replacing the placeholders <region_name>, <tenancy_name> and <bucket_name> with your object store bucket's region, tenancy and bucket names. The easiest way to find this information is to look at your object's details in the object store, by opening the right-hand menu and clicking "Object details" (see screenshot below).
https://swiftobjectstorage.<region_name>.oraclecloud.com/v1/<tenancy_name>/<bucket_name>
Note: In coming updates you will be able to use this object store URL directly in the DBMS_CLOUD API calls, instead of a Swift URL.
Preferably have the latest version of SQL Developer (ADW requires v18.3 and above).

Comma Separated Value (CSV) Files

CSV files are one of the most common file formats out there. We will begin by using a plain and simple CSV format file for Charlotte's (NC) Weather History dataset, which we will use as the data for our first external table. Open this Weather History '.csv' file in a text editor to have a look at the data. Notice each field is separated by a comma and each row ends by going to the next line (i.e. each row ends with a newline '\n' character). Also note that the first line is not data, but metadata (column names).
Let us now write a script to create an external table in our ADW over this data file lying in our object store. We will specify all the column names, and the format of the file as CSV. The format parameter in the DBMS_CLOUD.CREATE_EXTERNAL_TABLE procedure takes a JSON object, which can be provided in two possible formats:
format => '{"format_option" : "format_value" }'
format => json_object('format_option' value 'format_value')
The second format option has been used in the script below.
set define on define base_URL = <paste SWIFT URL created above here> BEGIN DBMS_CLOUD.CREATE_EXTERNAL_TABLE( table_name =>'WEATHER_REPORT_CSV', credential_name =>'OBJ_STORE_CRED', file_uri_list =>'&base_URL/Charlotte_NC_Weather_History.csv', format => json_object('type' value 'csv', 'skipheaders' value '1',   'dateformat' value 'mm/dd/yy'), column_list => 'REPORT_DATE DATE, ACTUAL_MEAN_TEMP NUMBER, ACTUAL_MIN_TEMP NUMBER, ACTUAL_MAX_TEMP NUMBER, AVERAGE_MIN_TEMP NUMBER, AVERAGE_MAX_TEMP NUMBER, AVERAGE_PRECIPITATION NUMBER' ); END; / Let us breakdown and understand this script. We are invoking the “CREATE_EXTERNAL_TABLE” procedure in the DBMS_CLOUD API  package. We are then providing the table name we want in our Data Warehouse, our user credentials (we created this in the pre-requisites), the object store file list that contains our data, a format JSON object that describes the format of our file to the API, and a list of named columns for our destination table. The format parameter is a constructed JSON object with format options ‘type’ and ‘skipheaders’. The type specifies the file format as CSV, while skipheaders tells the API how many rows are metadata headers which should be skipped. In our file, that is 1 row of headers. The 'dateformat' parameter specifies the format of the date column in the file we are reading from; We will look at this parameter in more detailed examples below. Great! If this was successful, we have our first external table. Once you have created an external table, it’s a good idea to validate that this external table structure works with your underlying data, before directly querying the table and possibly hitting a runtime error. Validating the table creates logs of any errors in case your external table was created incorrectly, which helps debug and fix any issues. Use the rowcount option in VALIDATE_EXTERNAL_TABLE if your data is large, to limit the validation to the specified number of rows. BEGIN  DBMS_CLOUD.VALIDATE_EXTERNAL_TABLE (       table_name => 'WEATHER_REPORT_CSV'  ); END; / If you do see errors during validation, follow Lab 3 Step 12 to troubleshoot them with the help of the log file. If required, you can also drop this table like you would any other table with the “DROP TABLE” command. On running this validation without errors, you now have a working external table which sits on top of the data in your object store. You may now query and join the WEATHER_REPORT_CSV table as though it is any other table in your Data Warehouse! Let us find the days in our dataset during which it was pleasant in Charlotte. SELECT * FROM WEATHER_REPORT_CSV where actual_mean_temp > 69 and        actual_mean_temp < 74;   Tab Separated Value (TSV) Files   Another popular file format involves tab delimiters or TSV files. In the files you downloaded look for the Charlotte Weather History ‘.gz’ file. Unzip, open and have look at the ".tsv" file in it in a text editor as before. You will notice each row in this file is ended by a pipe ‘|’ character instead of a newline character, and the fields are separated by tabspaces. Oftentimes applications you might work with will output data in less intelligible formats such as this one, and so below is a slightly more advanced example of how to pass such data into DBMS_CLOUD. 
Let’s run the following script: BEGIN   DBMS_CLOUD.CREATE_EXTERNAL_TABLE (     table_name =>'WEATHER_REPORT_TSV',     credential_name =>'OBJ_STORE_CRED',     file_uri_list =>'&base_URL/Charlotte_NC_Weather_History.gz',     format => json_object('removequotes' value 'true',                           'dateformat' value 'mm/dd/yy',                           'delimiter' value '\t',                           'recorddelimiter' value '''|''',                           'skipheaders' value '1'),     column_list => 'REPORT_DATE DATE,     ACTUAL_MEAN_TEMP NUMBER,     ACTUAL_MIN_TEMP NUMBER,     ACTUAL_MAX_TEMP NUMBER,     AVERAGE_MIN_TEMP NUMBER,     AVERAGE_MAX_TEMP NUMBER,     AVERAGE_PRECIPITATION NUMBER'  ); END; / SELECT * FROM WEATHER_REPORT_TSV where actual_mean_temp > 69 and        actual_mean_temp < 74; Whoops! You just hit a runtime error. An important lesson here is that we ran a query directly, without validating the external table like in the previous example. Thus we ran into an error even though the “CREATE_EXTERNAL_TABLE” went through without errors. This is because the “CREATE_EXTERNAL_TABLE” procedure simply creates a table structure (or metadata) over the data, but will not actually check to see whether the data itself is valid; That occurs at validation or runtime. Without validation, our only option would be to visually decipher the problem with the code. Here’s the real working script this time: DROP TABLE WEATHER_REPORT_TSV; BEGIN  DBMS_CLOUD.CREATE_EXTERNAL_TABLE (     table_name =>'WEATHER_REPORT_TSV',     credential_name =>'OBJ_STORE_CRED',     file_uri_list =>'&base_URL/Charlotte_NC_Weather_History.gz',     format => json_object('ignoremissingcolumns' value 'true',                           'removequotes' value 'true',                           'dateformat' value 'mm/dd/yy',                           'delimiter' value '\t',                           'recorddelimiter' value '''|''',                           'skipheaders' value '1',                           'rejectlimit' value '1',                           'compression' value 'gzip'),     column_list => 'REPORT_DATE DATE,     ACTUAL_MEAN_TEMP NUMBER,     ACTUAL_MIN_TEMP NUMBER,     ACTUAL_MAX_TEMP NUMBER,     AVERAGE_MIN_TEMP NUMBER,     AVERAGE_MAX_TEMP NUMBER,     AVERAGE_PRECIPITATION NUMBER' ); END; / SELECT * FROM WEATHER_REPORT_TSV where actual_mean_temp > 69 and        actual_mean_temp < 74; Let us understand the new parameters here, and why our previous script failed: 'ignoremissingcolumns' value 'true': Notice there is no data for the last column “AVERAGE_PRECIPITATION”. This parameter allows the create external table script to skip over columns from the column list, that have no data in the data file. 'removequotes' value 'true': The first column ‘date’ has data surrounded by double quotes. For this data to be converted to an Oracle date type, these quotes need to be removed. Note that when using the type parameter for CSV files as we did in the first example, this removequotes option is true by default. 'dateformat' value 'mm/dd/yy': If we expect a date column to be converted and stored into an Oracle date column (after removing the double quotes of course), we should provide the date column’s format. If we don’t provide a format, the date column will look for the database's default date format. You can see the dateformat documentation here. 'delimiter' value '\t': Fields in this file are tab delimited, so the delimiter we specify is the special character. 
'recorddelimiter' value '''|''': Each record or row in our file is delimited by a pipe '|' symbol, so we specify this parameter to separate out each row. Note that unlike the delimiter parameter, the recorddelimiter must be enclosed in single quotes as shown here. A nuance here is that the last row in your dataset doesn't need the record delimiter when it is the default newline character; it does, however, for other character record delimiters, to indicate the end of that row. Also note that since ADW is Linux/UNIX based, source data files that use newline as the record delimiter but were created on Windows must use "\r\n" as the format option. Both these nuances will likely have updated functionality in future releases.
'rejectlimit' value '1': We need this parameter here to fix an interesting problem. Unlike with the newline character, if we don't specify a pipe record delimiter at the very end of the file, we get an error because the API doesn't recognize where the last row's last column ends. If we do specify the pipe record delimiter, however, the API expects a new line because the record has been delimited, and we get a null error for the last, non-existent row. To fix situations like this, where we know we might have one or more problem rows, we use the reject limit parameter to allow some number of rows to be rejected. If we use 'unlimited' as our reject limit, then any number of rows may be rejected. The default reject limit is 0.
'compression' value 'gzip': Notice the .tsv file is zipped into a gzip ".gz" file, which we have used in the URL. We use this parameter so the file will be unzipped appropriately before the table is created.
As before, once this is successful, the external table structure has been created on top of the data in the object store. It may be validated using the VALIDATE_EXTERNAL_TABLE procedure. In the script above we have already queried it as a table in your Data Warehouse.
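For example, a validation limited to a sample of rows (using the rowcount option mentioned in the CSV section; parameter names as per the DBMS_CLOUD documentation) could look like this:

BEGIN
  -- Validate only the first 100 rows of the external table created above
  DBMS_CLOUD.VALIDATE_EXTERNAL_TABLE (
     table_name => 'WEATHER_REPORT_TSV',
     rowcount   => 100
  );
END;
/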
Field Lists - For more Granular parsing options:

A more advanced feature of DBMS_CLOUD.CREATE_EXTERNAL_TABLE is the field_list parameter, which borrows its feature set from the field_list parameter of the Oracle Loader access driver. This parameter allows you to specify more granular information about the fields being loaded. For example, let's use "Charlotte_NC_Weather_History_Double_Dates.csv" from the list of files in our object store. This file is similar to our first CSV example, except it has a copy of the date column in a different date format. Now, if we were to specify a date format in the format parameter, it would apply universally to all date columns. With the field_list parameter, we can specify two different date formats for the two date columns. We do need to list all the columns and their types when including the field_list; not mentioning any type parameters simply uses the default VARCHAR2 values.
BEGIN
  DBMS_CLOUD.CREATE_EXTERNAL_TABLE (
   table_name =>'WEATHER_REPORT_DOUBLE_DATE',
   credential_name =>'OBJ_STORE_CRED',
   file_uri_list =>'&base_URL/Charlotte_NC_Weather_History_Double_Dates.csv',
   format => json_object('type' value 'csv',  'skipheaders' value '1'),
   field_list => 'REPORT_DATE DATE ''mm/dd/yy'',
                  REPORT_DATE_COPY DATE ''yyyy-mm-dd'',
                  ACTUAL_MEAN_TEMP,
                  ACTUAL_MIN_TEMP,
                  ACTUAL_MAX_TEMP,
                  AVERAGE_MIN_TEMP,
                  AVERAGE_MAX_TEMP,
                  AVERAGE_PRECIPITATION',
   column_list => 'REPORT_DATE DATE,
                   REPORT_DATE_COPY DATE,
                   ACTUAL_MEAN_TEMP NUMBER,
                   ACTUAL_MIN_TEMP NUMBER,
                   ACTUAL_MAX_TEMP NUMBER,
                   AVERAGE_MIN_TEMP NUMBER,
                   AVERAGE_MAX_TEMP NUMBER,
                   AVERAGE_PRECIPITATION NUMBER'
 );
END;
/

SELECT * FROM WEATHER_REPORT_DOUBLE_DATE where actual_mean_temp > 69 and actual_mean_temp < 74;

It's important to recognize that the date format parameters provide the API with the information it needs to read the data file. The output format of your query will be your database default (based on your NLS parameters); this can also be formatted in your query using TO_CHAR.

JSON Files

You may be familiar with JSON files for unstructured and semi-structured data. The "PurchaseOrders.txt" file contains JSON purchase order data, which is much easier to read once parsed and formatted. Using JSON data in an ADW instance can be as simple as putting each JSON document into a table row as a BLOB, and using the powerful, native JSON features that the Oracle Database provides to parse and query it. You can also view the JSON documentation for additional features here. Let's try this! Copy and run the script below:
BEGIN
  DBMS_CLOUD.CREATE_EXTERNAL_TABLE (
   table_name =>'JSON_FILE_CONTENTS',
   credential_name =>'OBJ_STORE_CRED',
   file_uri_list =>'&base_URL/PurchaseOrders.txt',
   column_list => 'json_document blob',
   field_list => 'json_document CHAR(5000)'
);
END;
/

COLUMN Requestor FORMAT A30
COLUMN ShippingInstructions FORMAT A30
SELECT JSON_VALUE(json_document,'$.Requestor') as Requestor,
       JSON_VALUE(json_document,'$.ShippingInstructions.Address.city') as ShippingInstructions
FROM JSON_FILE_CONTENTS where rownum < 50;

The query above lists all the PO requestors and the city where their shipment is to be delivered. Here, we have simply created one column 'json_document' in the external table 'JSON_FILE_CONTENTS'. We do not incur the time it takes to validate these JSON documents; instead we directly query the external table using the database's JSON_VALUE feature. This means the check for well-formed JSON data happens on the fly, and a query would fail unless you properly skip over the bad data. Here, 'VALIDATE_EXTERNAL_TABLE' will not check for valid JSON data; it will simply check that the data is of the correct native datatype (less than 5000 characters long), that is, the datatype of the table's column. For better performance on large JSON data files, we can also use this external table together with the database's JSON features to parse and insert the JSON data into a new table 'j_purchaseorder' ahead of time, as below. Note that this insert statement actually brings the data into your ADW instance.
You benefit from doing this because it checks that your JSON data is well-formed and valid ahead of time, so you incur less of a performance impact when you query this JSON data from your ADW instance.
CREATE TABLE j_purchaseorder
 (id          VARCHAR2 (32) NOT NULL,
  date_loaded TIMESTAMP (6) WITH TIME ZONE,
  po_document BLOB
  CONSTRAINT ensure_json CHECK (po_document IS JSON));

INSERT INTO j_purchaseorder (id, date_loaded, po_document)
SELECT SYS_GUID(), SYSTIMESTAMP, json_document FROM json_file_contents
   WHERE json_document IS JSON;

We can now query down JSON paths using the JSON simplified syntax, as in the following query:

SELECT po.po_document.Requestor,
       po.po_document.ShippingInstructions.Address.city
               FROM j_purchaseorder po;

Copying Data into your Autonomous Data Warehouse

Here, we've gone through examples of accessing your object store data via external tables in your ADW. In following posts, I will walk you through examples of how to use the DBMS_CLOUD.COPY_DATA API to copy that data from your files directly into your Data Warehouse, as well as how to diagnose issues while loading your ADW with data using the bad and log files. See you in the next one!


Autonomous

Oracle Autonomous Data Warehouse - Access Parquet Files in Object Stores

Parquet is a file format that is commonly used by the Hadoop ecosystem. Unlike CSV, which may be easy to generate but not necessarily efficient to process, parquet is really a "database" file type. Data is stored in compressed, columnar format and has been designed for efficient data access. It provides predicate pushdown (i.e. extracting data based on a filter expression), column pruning and other optimizations.
Autonomous Database now supports querying and loading data from parquet files stored in object stores and takes advantage of these query optimizations. Let's take a look at how to create a table over a parquet source and then show an example of a data access optimization - column pruning.
We'll start with a parquet file that was generated from the ADW sample data used for tutorials (download here). This file was created using Hive on Oracle Big Data Cloud Service. To make it a little more interesting, a few other fields from the customer file were added (denormalizing data is fairly common with Hadoop and parquet).

Review the Parquet File

A CSV file can be read by any tool (including the human eye), whereas you need a little help with parquet. To see the structure of the file, you can use a tool to parse its contents. Here, we'll use parquet-tools (I installed it on a Mac using brew, but it can also be installed from github):
$ parquet-tools schema sales_extended.parquet
message hive_schema {
  optional int32 prod_id;
  optional int32 cust_id;
  optional binary time_id (UTF8);
  optional int32 channel_id;
  optional int32 promo_id;
  optional int32 quantity_sold;
  optional fixed_len_byte_array(5) amount_sold (DECIMAL(10,2));
  optional binary gender (UTF8);
  optional binary city (UTF8);
  optional binary state_province (UTF8);
  optional binary income_level (UTF8);
}
You can see the parquet file's columns and data types, including prod_id, cust_id, income_level and more. To view the actual contents of the file, we'll use another option of the parquet-tools utility:
$ parquet-tools head sales_extended.parquet
prod_id = 13
cust_id = 987
time_id = 1998-01-10
channel_id = 3
promo_id = 999
quantity_sold = 1
amount_sold = 1232.16
gender = M
city = Adelaide
state_province = South Australia
income_level = K: 250,000 - 299,999
prod_id = 13
cust_id = 1660
time_id = 1998-01-10
channel_id = 3
promo_id = 999
quantity_sold = 1
amount_sold = 1232.16
gender = M
city = Dolores
state_province = CO
income_level = L: 300,000 and above
prod_id = 13
cust_id = 1762
time_id = 1998-01-10
channel_id = 3
promo_id = 999
quantity_sold = 1
amount_sold = 1232.16
gender = M
city = Cayuga
state_province = ND
income_level = F: 110,000 - 129,999
The output is truncated, but you can get a sense for the data contained in the file.

Create an ADW Table

We want to make this data available to our data warehouse. ADW makes it really easy to access parquet data stored in object stores using external tables. You don't need to know the structure of the data (ADW will figure that out by examining the file) - only the location of the data and an auth token that provides access to the source. In this example, the data is stored in an Oracle Cloud Infrastructure Object Storage bucket called "tutorial_load_adw".
Using the DBMS_CLOUD package, we will first create a credential using an auth token that has access to the data:
begin
  DBMS_CLOUD.create_credential (
    credential_name => 'OBJ_STORE_CRED',
    username => 'user@oracle.com',
    password => 'the-password'
  ) ;
end;
/
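Before building anything on top of this credential, it can be handy to sanity-check that it actually reaches the bucket. One way to do that is a quick listing with DBMS_CLOUD.LIST_OBJECTS (a sketch; the URI placeholders follow the same pattern used in the rest of this post):

-- List what the credential can see in the tutorial_load_adw bucket
SELECT object_name, bytes
FROM   DBMS_CLOUD.LIST_OBJECTS(
         'OBJ_STORE_CRED',
         'https://swiftobjectstorage.<datacenter>.oraclecloud.com/v1/<obj-store-namespace>/tutorial_load_adw/');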
Next, create the external table. Notice that you don't need to know anything about the structure of the data; simply point to the file, and ADW will examine its properties and automatically derive the schema:
begin
    dbms_cloud.create_external_table (
    table_name =>'sales_extended_ext',
    credential_name =>'OBJ_STORE_CRED',
    file_uri_list =>'https://swiftobjectstorage.<datacenter>.oraclecloud.com/v1/<obj-store-namespace>/<bucket>/sales_extended.parquet',
    format =>  '{"type":"parquet",  "schema": "first"}'
    );
end;
/
A couple of things to be aware of. First, the URI for the file needs to follow a specific format, and this is well documented here (https://docs.oracle.com/en/cloud/paas/autonomous-data-warehouse-cloud/user/dbmscloud-reference.html#GUID-5D3E1614-ADF2-4DB5-B2B2-D5613F10E4FA). Here, we're pointing to a particular file, but you can also use wildcards ("*" and "?") or simply list the files using comma-separated values.
Second, notice the format parameter. Specify that the type of the file is "parquet". Then, you can instruct ADW how to derive the schema (columns and their data types): 1) analyze the schema of the first parquet file that ADW finds in the file_uri_list, or 2) analyze all the schemas for all the parquet files found in the file_uri_list. Because these are simply files captured in an object store, there is no guarantee that each file's metadata is exactly the same: "File1" may contain a field called "address" while "File2" may be missing that field. Examining each file to derive the columns is a bit more expensive (but it is only run one time), yet it may be required if the first file does not contain all the required fields.
The data is now available for query:
desc sales_extended_ext;
Name           Null? Type           
-------------- ----- -------------- 
PROD_ID              NUMBER(10)     
CUST_ID              NUMBER(10)     
TIME_ID              VARCHAR2(4000) 
CHANNEL_ID           NUMBER(10)     
PROMO_ID             NUMBER(10)     
QUANTITY_SOLD        NUMBER(10)     
AMOUNT_SOLD          NUMBER(10,2)   
GENDER               VARCHAR2(4000) 
CITY                 VARCHAR2(4000) 
STATE_PROVINCE       VARCHAR2(4000) 
INCOME_LEVEL         VARCHAR2(4000)

select prod_id, quantity_sold, gender, city, income_level from sales_extended_ext where rownum < 10;

Query Optimizations with Parquet Files

As mentioned at the beginning of this post, parquet files support column pruning and predicate pushdown. This can drastically reduce the amount of data that is scanned and returned by a query and improve query performance. Let's take a look at an example of column pruning. This file has 11 columns, but imagine there were 911 columns instead and you were interested in querying only one. Instead of scanning and returning all 911 columns in the file, column pruning will only process the single column that was selected by the query.
Here, we'll query similar data: one file is delimited text while the other is parquet (interestingly, the parquet file is a superset of the text file, yet is one-fourth the size due to compression). We will vary the number of columns used for each query:
- Query a single parquet column
- Query all parquet columns
- Query a single text column
- Query all the text columns
The results table above was captured from the ADW Monitored SQL Activity page. Notice that the I/O bytes for text remain unchanged regardless of the number of columns processed. The parquet queries, on the other hand, process the columnar source efficiently, retrieving only the columns that were requested by the query. As a result, the parquet query eliminated nearly 80% of the data stored in the file. Predicate pushdown can have similar results with large data sets, filtering the data returned by the query.
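To make the comparison concrete, the kind of query pair being contrasted looks like this (illustrative queries against the external table created above, not the exact statements behind the monitoring screenshot):

-- Single-column query: with column pruning only the income_level column chunks are read
SELECT income_level, COUNT(*) FROM sales_extended_ext GROUP BY income_level;

-- All-column query: every column in the parquet file has to be processed
SELECT * FROM sales_extended_ext WHERE rownum <= 10;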
The parquet queries on the other hand process the columnar source efficiently – only retrieving the columns that were requested by the query. As a result, the parquet query eliminated nearly 80% of the data stored in the file. Predicate pushdown can have similar results with large data sets – filtering the data returned by the query.

We know that people will want to query this data frequently and will require optimized access. After examining the data, we now know it looks good and will load it into a table using another DBMS_CLOUD procedure – COPY_DATA. First, create the table and then load it from the source:

CREATE TABLE SALES_EXTENDED (
  PROD_ID        NUMBER,
  CUST_ID        NUMBER,
  TIME_ID        VARCHAR2(30),
  CHANNEL_ID     NUMBER,
  PROMO_ID       NUMBER,
  QUANTITY_SOLD  NUMBER(10,0),
  AMOUNT_SOLD    NUMBER(10,2),
  GENDER         VARCHAR2(1),
  CITY           VARCHAR2(30),
  STATE_PROVINCE VARCHAR2(40),
  INCOME_LEVEL   VARCHAR2(30)
);

-- Load data
begin
  dbms_cloud.copy_data(
    table_name => 'SALES_EXTENDED',
    credential_name =>'OBJ_STORE_CRED',
    file_uri_list =>'https://swiftobjectstorage.<datacenter>.oraclecloud.com/v1/<obj-store-namespace>/<bucket>/sales_extended.parquet',
    format => '{"type":"parquet", "schema": "first"}'
  );
end;
/

The data has now been loaded. No mapping between source and target columns is required; the procedure performs a match based on column name. If a match is not found, the column is ignored.

That's it! ADW can now access the data directly from the object store – giving people the ability to query data as soon as it lands – and then, for optimized access, load it into the database.
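As a quick aside, the same parquet-tools utility we used earlier can also dump the file-level metadata: the row groups plus the compression codec and encoded sizes for every column. That is a nice way to see where the four-to-one size advantage over the text file comes from. The command below is just a sketch; the exact output layout depends on the parquet-tools version installed:

$ parquet-tools meta sales_extended.parquet
# prints the row groups and, for each column, the codec (e.g. SNAPPY),
# the encodings and the compressed vs. uncompressed sizes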


Hadoop Best Practices

Big Data Resource Management Looks Hard, But It Isn't

Hadoop is an ecosystem consisting of multiple different components in which each component (or engine) consumes certain resources. There are a few resource management techniques that allow administrators to define how finite resources are divided across multiple engines. In this post, I'm going to talk about these different techniques in detail. I'll divide the resource management topic into these sub-topics:

1. Dealing with low latency engines (realtime/near-realtime)
2. Division of resources across multiple engines within a single cluster
3. Division of resources between different users within a single technology

1. Low latency engines

These engines assume low latency response times. Examples are NoSQL databases (HBase, MongoDB, Cassandra...) or message based systems like Kafka. These systems should be placed on a dedicated cluster if you have real and specific low latency Service Level Agreements (SLAs). Yes, for a highly utilized HBase or a highly utilized Kafka with strict SLAs we do recommend putting it on a dedicated cluster; otherwise the SLAs can't be met.

2. Division of resources across multiple engines in a single cluster

It's quite common to put multiple processing engines (such as YARN, Impala, etc.) in a single cluster. As soon as this happens, administrators face the challenge of how to divide resources among these engines. The short answer for Cloudera clusters is "Static Service Pools". In Cloudera Manager you can find the "Static Service Pool" configuration option here:

You use this functionality to divide resources between different processing engines, such as:

- YARN
- Impala
- Big Data SQL
- Etc.

When applied, under the hood these engines use Linux cgroups to partition resources. I've already explained how to set up Static Service Pools in the context of Big Data SQL, so please review that post for more details on configuring Static Service Pools.

3. Division of resources between different users within a single technology

3.1 Resource division within YARN. Dynamic Service Pools

To work with resource allocation in YARN, Cloudera Manager offers "Dynamic Service Pools". Their purpose is to divide resources between different user groups inside YARN (a single engine). Because you work on YARN, many different engines are impacted - i.e. those engines that run inside the YARN framework, for example Spark and Hive (MapReduce). The following steps are planned to be automated for Big Data Appliance and Big Data Cloud Service in an upcoming release, but if you want to apply this beforehand or on your own Cloudera clusters, here are the high level steps:

a) Enable Fair Scheduler Preemption
b) Configure example pools: for Low Priority, Medium Priority and High Priority jobs
c) Setup placement rules

The following sections dive deeper into the details.

3.1.1. Enable Fair Scheduler Preemption

To enable fair scheduler preemption go to Cloudera Manager -> YARN -> Configuration. Then set:

- yarn.scheduler.fair.preemption = true
- yarn.scheduler.fair.preemption.cluster-utilization-threshold = 0.7
This is the utilization threshold after which preemption kicks in. The utilization is computed as the maximum ratio of usage to capacity among all resources.

Next we must configure:

- yarn.scheduler.fair.allow-undeclared-pools = false
When set to true, pools specified in applications but not explicitly configured are created at runtime with default settings. When set to false, applications specifying pools not explicitly configured run in a pool named default.
This setting applies when an application explicitly specifies a pool and when the application runs in a pool named with the username associated with the application.

- yarn.scheduler.fair.user-as-default-queue = false
When set to true, the Fair Scheduler uses the username as the default pool name in the event that a pool name is not specified. When set to false, all applications are run in a shared pool, called default.

Note: the parameter "Enable ResourceManager ACLs" should be set to true by default, but it's worth checking it, just in case. "yarn.admin.acl" shouldn't be equal to '*'; set it equal to "yarn".

After modifying these configuration settings you will need to restart the YARN cluster to activate them. The next step is the configuration of the Dynamic Service Pools.

3.1.2. Configure example pools

Go to Cloudera Manager -> Dynamic Resource Pool Configuration. Here we recommend (in future BDA/BDCS versions we will create these by default for you) creating three pools:

- low
- medium
- high

We also recommend that you remove the root.default pool as shown below:

Different pools will use different resources (CPU and memory). To illustrate this I'll run 3 jobs and put them into the different pools: high, medium, low:

[root@bdax72bur09node01]# hadoop jar hadoop-mapreduce-examples.jar pi -Dmapred.job.queue.name=root.low 1000 1000000000000
[root@bdax72bur09node01]# hadoop jar hadoop-mapreduce-examples.jar pi -Dmapred.job.queue.name=root.medium 1000 1000000000000
[root@bdax72bur09node01]# hadoop jar hadoop-mapreduce-examples.jar pi -Dmapred.job.queue.name=root.high 1000 1000000000000

After running these for a while, navigate to Cloudera Manager -> YARN -> Resource Pools and take a look at "Fair Share VCores" (or memory). In this diagram we can see that vCores are allocated according to our configured proportions: 220/147/73, roughly the same as 15/10/5.

The second important configuration is the limit on maximum pool usage: we recommend putting a cap on each resource pool so that small applications can jump into the cluster even if a long-running job has been launched. Here is a small test case.

- Run a long running job in the root.low pool (which takes days to finish):

[root@bdax72bur09node01]# hadoop jar hadoop-mapreduce-examples.jar pi -Dmapred.job.queue.name=root.low 1000 1000000000000

Check resource usage. This graph shows that we have some unused CPU, as we wanted. Also, notice below that we have some pending containers, which shows that our application wants to run more tasks, but as expected YARN disallows this. So, despite having spare resources in the cluster, YARN disallows using them because of the cap on maximum resource usage for certain pools.
- Now run a small job which belongs to a different pool:

hive> set mapred.job.queue.name=root.medium;
hive> select count(1) from date_dim;
...

Now, jump to the resource usage page (Cloudera Manager -> YARN -> Resource Pools). Here we can see that the number of pending containers for the first job hasn't changed. This is the reserve we allocated for newly started small jobs. It enables short start up times for small jobs, so no preemption is needed and end users will not feel like their jobs hang.

The third key piece of the resource management configuration is preemption. We recommend configuring different preemption levels for each pool (double check that you enabled preemption earlier at the cluster level). There are two configuration settings to change:

Fair Share Preemption Threshold - This is a value between 0 and 1. If set to x and the fair share of the resource pool is F, we start preempting resources from other resource pools if the allocation is under (x * F). In other words, it defines how far below its fair share a pool must fall before preemption starts.

Fair Share Preemption Timeout - The number of seconds a resource pool is under its fair share before it will try to preempt containers to take resources from other resource pools. In other words, this setting defines when YARN will start to preempt.

To configure these, go to Cloudera Manager -> Dynamic Service Pools -> Edit (for a certain pool) -> Preemption. We suggest the following settings:

For high: immediately start preemption if the job didn't get all requested resources, which is implemented as below:

For medium: wait 60 seconds before starting preemption if the job didn't get at least 80% of the requested resources:

And for low: wait 180 seconds before starting preemption if a job didn't get at least 50% of its requested resources:

These parameters define how aggressively containers will be preempted (how quickly a job will get the resources it requires). Here is my test case - I've run a long running job in the root.low pool and then run the same query in parallel, assigning it to the low, medium and high pools respectively:

hive> set mapred.job.queue.name=root.low;
hive> select count(1) from store_sales;
...
hive> set mapred.job.queue.name=root.medium;
hive> select count(1) from store_sales;
...
hive> set mapred.job.queue.name=root.high;
hive> select count(1) from store_sales;
...

As a measure of the result we can consider elapsed time (which consists of waiting time, which differs according to our configuration, plus execution time, which also differs because of resource usage).
This table shows the result:

Pool name   Elapsed time, min
low         14.5
medium      8.8
high        8

In the graph below you can see how preemption was accomplished.

There is another aspect of preemption, which is whether a pool can be preempted at all. To set this up, go to Cloudera Manager -> Dynamic Resource Pools -> root.high pool and click on "Edit". After this, click on "Preemption" and disable preemption from the root.high pool. That means that nobody can preempt tasks from this pool. Note: this setting makes the root.high pool extremely strong and you may have to consider enabling preemption again.

3.1.3. Setup Placement rules

Another key component of dynamic resource management is placement rules. Placement rules define which pool a job will be assigned to. By default, we suggest keeping it simple. To configure them, go to Cloudera Manager -> Dynamic Service Pools -> Placement Rules.

With this configuration your user may belong to one of the secondary groups, which we named low/medium/high. If not, you can define the pool for the job at runtime. If you don't do that, by default the job will be allocated resources in the low pool. So, if the administrator knows what to do she will put the user in a certain group, and if users know what to do, they will specify a certain pool (low, medium or high) at runtime. We recommend that administrators define this for the system. For example, I have a user alex, who belongs to the secondary group "medium":

[root@bdax72bur09node01]# id alex
uid=1002(alex) gid=1006(alex) groups=1006(alex),1009(medium)

So, if I try to specify a different pool at runtime (consider it as a user trying to cheat the system settings), it will not override the medium group:

[root@bdax72bur09node01]# sudo -u alex hadoop jar hadoop-mapreduce-examples.jar pi -Dmapred.job.queue.name=root.high 1000 1000000000000

3.1.4. How to specify a resource pool at runtime?

While there are a few engines, let's focus on MapReduce (Hive) and Spark. Earlier in this blog I showed how to specify a pool for a MapReduce job with the mapred.job.queue.name parameter. You can specify it with the -D parameter when you launch the job from the console:

[root@bdax72bur09node01]# hadoop jar hadoop-mapreduce-examples.jar pi -Dmapred.job.queue.name=root.low 1000 1000000000000

Or, in the case of Hive, you can set it as a session parameter:

hive> set mapred.job.queue.name=root.low;

Another engine is Spark, and here you simply add the "queue" parameter:

[root@bdax72bur09node01]# spark-submit --class org.apache.spark.examples.SparkPi --queue root.high spark-examples.jar

In the spark2-shell console you specify the same parameter:

[root@bdax72bur09node01 lib]# spark2-shell --queue root.high
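Whichever way the pool is chosen (placement rule, secondary group or an explicit parameter), it is worth double checking where a job actually ended up. Here is a minimal sketch using the standard yarn CLI; the application id is just a placeholder for one of your own jobs:

[root@bdax72bur09node01]# yarn application -list -appStates RUNNING,ACCEPTED
# the "Queue" column shows the pool each application was placed in
[root@bdax72bur09node01]# yarn application -status <application_id>
# prints the report for a single application, including its queue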
3.1.5. Group Mapping

The first thing that the Resource Manager checks is the user's secondary group. How do you define this? I've covered it earlier in my security blog post, but in a nutshell it is defined either with LDAP mapping or a Unix shell, under the "hadoop.security.group.mapping" parameter in Cloudera Manager (HDFS -> Configuration).

Below is a list of useful commands which can be used for managing users and groups on BDA/BDCS/BDCC (note that all users and groups have to exist on each cluster node and have the same id):

// Add new user
# dcli -C "useradd  -u 1102 user1"
// Add new group
# dcli -C "groupadd -g 1111 high"
// Add user to the group
# dcli -C "usermod -a -G high user1"
// Remove user from the group
# dcli -C "gpasswd -d user1 high"
// Delete user
# dcli -C "userdel user1"
// Delete group
# dcli -C "groupdel high"

Oracle and Cloudera recommend using "org.apache.hadoop.security.ShellBasedUnixGroupsMapping" plus SSSD to replicate users from Active Directory. From Cloudera's documentation: "The LdapGroupsMapping library may not be as robust a solution needed for large organizations in terms of scalability and manageability, especially for organizations managing identity across multiple systems and not exclusively for Hadoop clusters. Support for the LdapGroupsMapping library is not consistent across all operating systems."

3.1.6. Control user access to a certain pool (ACL)

You may want to restrict the group of users that has access to certain pools (especially to high and medium). You accomplish this with ACLs in Dynamic Resource Pools. First, just a reminder that you have to set "yarn.admin.acl" to something different than '*'; set it equal to "yarn". Before setting restrictions for certain pools, you need to set up a restriction for the root pool: due to a Cloudera Manager bug, you can't leave the ACLs for submission and administration empty (otherwise everyone will be allowed to submit jobs to any pool), so set them to the desired groups or, as a workaround, set them to ",". After this you are ready to create rules for the other pools.

Let's start with low. Since it plays the role of the default pool in our configuration, we should allow everyone to submit jobs there. Next, let's move on to medium. There we will configure access for users who belong to the groups medium and etl. And finally, we configure the high pool. Here we will allow job submission for users who belong to the groups managers or high.

Let's do a quick test and take a user who does not belong to any of the privileged groups (medium, high, etl, managers):

[root@bdax72bur09node01 hadoop-mapreduce]# hdfs groups user2
user2 : user2

and run a job:

[root@bdax72bur09node01 hadoop-mapreduce]# sudo -u user2 hadoop jar hadoop-mapreduce-examples.jar pi 1 10
...
18/11/19 21:22:31 INFO mapreduce.Job: Job job_1542663460911_0026 running in uber mode : false
18/11/19 21:22:31 INFO mapreduce.Job:  map 0% reduce 0%
18/11/19 21:22:36 INFO mapreduce.Job:  map 100% reduce 0%

After this, check that this user was placed in the root.low queue (everyone is allowed to run jobs there). So far so good.
Now, let's try to submit a job from the same user to the high priority pool:

[root@bdax72bur09node01 hadoop-mapreduce]# sudo -u user2 hadoop jar hadoop-mapreduce-examples.jar pi -Dmapred.job.queue.name=root.high 1 10
...
18/11/19 21:27:33 WARN security.UserGroupInformation: PriviledgedActionException as:user2 (auth:SIMPLE) cause:java.io.IOException: org.apache.hadoop.yarn.exceptions.YarnException: Failed to submit application_1542663460911_0027 to YARN : User user2 cannot submit applications to queue root.high

To put this job in the pool root.high, we need to add this user to one of the groups listed in the ACL for the pool root.high; let's use managers (create it first):

[root@bdax72bur09node01 hadoop-mapreduce]# dcli -C "groupadd -g 1112 managers"
[root@bdax72bur09node01 hadoop-mapreduce]# dcli -C "usermod -a -G managers user2"

Second try and validation:

[root@bdax72bur09node01 hadoop-mapreduce]# hdfs groups user2
user2 : user2 managers
[root@bdax72bur09node01 hadoop-mapreduce]# sudo -u user2 hadoop jar hadoop-mapreduce-examples.jar pi -Dmapred.job.queue.name=root.high 1 10
...
18/11/19 21:34:00 INFO mapreduce.Job: Job job_1542663460911_0029 running in uber mode : false
18/11/19 21:34:00 INFO mapreduce.Job:  map 0% reduce 0%
18/11/19 21:34:05 INFO mapreduce.Job:  map 100% reduce 0%

Wonderful! Everything works as expected. Once again, it is important to set the root pool ACLs to some value. If you don't want to put the list of allowed groups in the root pool and want to add it later, or if you want to have a pool like root.low where everyone can submit their jobs, simply use the workaround with the "," character.

3.1.7. Analyzing resource usage

The tricky thing with pools is that sometimes many people or divisions use the same pool and it's hard to determine who gets which portion. The following script pulls the per-user vcore-seconds and memory-seconds for the last day from the ResourceManager REST API:

[root@bdax72bur09node01 hadoop-mapreduce]# cat getresource_usage.sh
#!/bin/bash

STARTDATE=`date -d " -1 day " +%s%N | cut -b1-13`
ENDDATE=`date +%s%N | cut -b1-13`
result=`curl -s "http://bdax72bur09node04:8088/ws/v1/cluster/apps?finishedTimeBegin=$STARTDATE&finishedTimeEnd=$ENDDATE"`
if [[ $result =~ "standby RM" ]]; then
result=`curl -s "http://bdax72bur09node05:8088/ws/v1/cluster/apps?finishedTimeBegin=$STARTDATE&finishedTimeEnd=$ENDDATE"`
fi
#echo $result
echo $result | python -m json.tool | sed 's/["|,]//g' | grep -E "user|coreSeconds" | awk ' /user/ { user = $2 } /vcoreSeconds/ { arr[user]+=$2 ; } END { for (x in arr) {print "yarn." x ".cpums="arr[x]} } '
echo $result | python -m json.tool | sed 's/["|,]//g' | grep -E "user|memorySeconds" | awk ' /user/ { user = $2 } /memorySeconds/ { arr1[user]+=$2 ; } END { for (y in arr1) {print "yarn." y ".memorySeconds="arr1[y]} } '
3.2.1. Impala Admission Control

Another popular engine in the Hadoop world is Impala. Impala has its own mechanism to control resources, called admission control. Many MPP systems recommend queueing queries in the case of high concurrency instead of running them all in parallel, and Impala is no exception. To configure this, go to Cloudera Manager -> Dynamic Resource Pools -> Impala Admission Control.

Admission Control has a few key parameters for configuring a queue:

- Max Running Queries - Maximum number of concurrently running queries in this pool
- Max Queued Queries - Maximum number of queries that can be queued in this pool
- Queue Timeout - The maximum time a query can be queued waiting for execution in this pool before timing out

So, up to Max Running Queries queries will run concurrently; after that, up to Max Queued Queries queries will be queued. They will stay in the queue for at most Queue Timeout, after which they are cancelled.

Example: I configured Max Running Queries = 3 (allow 3 simultaneous SQL statements), Max Queued Queries = 2 (allow two queued queries) and left Queue Timeout at the default of 60 seconds. After this, I ran 6 queries and waited for a minute. Three queries were successfully executed; the three other queries failed for different reasons. One query was rejected right away, because there was no place for it in the queue (3 running, 2 queued, the next one rejected). The two others sat in the queue for 60 seconds, but since the running queries did not finish within this timeout, they failed as well.
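For completeness: a session can also be pointed at a specific admission control pool explicitly, much like mapred.job.queue.name for YARN jobs. The snippet below is only a sketch and assumes the pool names mirror the low/medium/high pools configured above:

[root@bdax72bur09node01]# impala-shell -i bdax72bur09node01
[bdax72bur09node01:21000] > set request_pool=root.low;
[bdax72bur09node01:21000] > select count(1) from date_dim;

The REQUEST_POOL query option tells admission control which pool the session's queries should be admitted to; without it, the placement depends on the pool configuration.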


Hadoop Best Practices

Use Big Data Appliance and Big Data Cloud Service High Availability, or You'll Blame Yourself Later

In this blog post, I'd like to briefly review the high availability functionality in Oracle Big Data Appliance and Big Data Cloud Service. The good news is that most of these features are available out of the box on your systems, and no extra steps are required from your end - one of the key value-adds of leveraging a hardened system from Oracle. A special shout-out to Sandra and Ravi from our team for helping with this blog post. For this post on HA, we'll subdivide the content into the following topics:

1. High Availability in the hardware components of the system
2. High Availability within a single node
3. Hadoop components High Availability

1. High Availability in Hardware Components

When we are talking about an on-premise solution, it is important to understand the fault tolerance and HA built into the actual hardware you have on the floor. Based on Oracle Exadata and the experience we have in managing mission critical systems, a BDA is built out of components that handle hardware faults and simply stay up and running. Networking is redundant, power supplies in the racks are redundant, ILOM software tracks the health of the system and ASR pro-actively logs SRs on hardware issues if needed. You can find a lot more information here.

2. High availability within a single node

Talking about high availability within a single node, I'd like to focus on disk failures. In large clusters, disk failures do occur but should - in general - not cause any issues for BDA and BDCS customers. First let's have a look at the disk layout (minus data directories) for the Oracle system:

[root@bdax72bur09node02 ~]# df -h|grep -v "/u"
Filesystem      Size  Used Avail Use% Mounted on
devtmpfs        126G     0  126G   0% /dev
tmpfs           126G  8.0K  126G   1% /dev/shm
tmpfs           126G   67M  126G   1% /run
tmpfs           126G     0  126G   0% /sys/fs/cgroup
/dev/md2        961G   39G  874G   5% /
/dev/md6        120G  717M  113G   1% /ssddisk
/dev/md0        454M  222M  205M  53% /boot
/dev/sda1       191M   16M  176M   9% /boot/efi
/dev/sdb1       191M     0  191M   0% /boot/rescue-efi
cm_processes    126G  309M  126G   1% /run/cloudera-scm-agent/process

Next, let's take a look at where the critical services store their data.

- Name Node. Arguably the most critical HDFS component.
It stores the FSimage file and edits on the hard disks; let's check where:

[root@bdax72bur09node02 ~]# df -h /opt/hadoop/dfs/nn
Filesystem      Size  Used Avail Use% Mounted on
/dev/md2        961G   39G  874G   5% /

- Journal Node:

[root@bdax72bur09node02 ~]# df /opt/hadoop/dfs/jn
Filesystem     1K-blocks   Used Available Use% Mounted on
/dev/md6       124800444 733688 117704132   1% /ssddisk
[root@bdax72bur09node02 ~]# ls -l /opt/hadoop/dfs/jn
lrwxrwxrwx 1 root root 15 Jul 15 22:58 /opt/hadoop/dfs/jn -> /ssddisk/dfs/jn

- Zookeeper:

[root@bdax72bur09node02 ~]# df /var/lib/zookeeper
Filesystem     1K-blocks   Used Available Use% Mounted on
/dev/md6       124800444 733688 117704132   1% /ssddisk

All of these services store their data on the RAID devices /dev/md2 and /dev/md6. Let's take a look at what they consist of:

[root@bdax72bur09node02 ~]# mdadm --detail /dev/md2
/dev/md2:
...
     Array Size : 1023867904 (976.44 GiB 1048.44 GB)
  Used Dev Size : 1023867904 (976.44 GiB 1048.44 GB)
   Raid Devices : 2
  Total Devices : 2
...
 Active Devices : 2
...
    Number   Major   Minor   RaidDevice State
       0       8        3        0      active sync   /dev/sda3
       1       8       19        1      active sync   /dev/sdb3

So, md2 is a one terabyte mirrored RAID. We are safe if one of the disks fails.

[root@bdax72bur09node02 ~]# mdadm --detail /dev/md6
/dev/md6:
...
     Array Size : 126924800 (121.04 GiB 129.97 GB)
  Used Dev Size : 126924800 (121.04 GiB 129.97 GB)
   Raid Devices : 2
  Total Devices : 2
...
 Active Devices : 2
...
    Number   Major   Minor   RaidDevice State
       0       8      195        0      active sync   /dev/sdm3
       1       8      211        1      active sync   /dev/sdn3

So, md6 is a mirrored SSD RAID. Again, we are safe if one of the disks fails.
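Since these mirrored devices are what protect the NameNode, JournalNode and ZooKeeper data, it is worth knowing how to check their health, for example after a disk replacement. A minimal sketch using the standard Linux software RAID tools (the device names are simply the ones from the output above):

[root@bdax72bur09node02 ~]# cat /proc/mdstat
# lists every md device, its member disks and whether a resync/rebuild is in progress
[root@bdax72bur09node02 ~]# mdadm --detail /dev/md2 | grep -E "State|Devices"
# "State : clean" with 2 active devices means the mirror is fully in sync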
Fine, let's move on to the Hadoop components.

3. High Availability of Hadoop Components

3.1 Default service distribution on BDA/BDCS

We briefly took a look at the hardware layout of BDA/BDCS and how data is laid out on disk. Now let's look at the Hadoop software details. By default, when you deploy BDCS or configure and create a BDA cluster, the services are distributed as follows (Node05 to NodeNN means node 5 up to the last node of the cluster):

- Cloudera Manager Server: Node03
- Cloudera Manager Agent: all nodes
- Balancer: Node01
- DataNode: all nodes
- NameNode: Node01, Node02
- Failover Controller: Node01, Node02
- JournalNode: Node01, Node02, Node03
- ZooKeeper: Node01, Node02, Node03
- ResourceManager: Node03, Node04
- NodeManager: Node01 and Node02 (in clusters of eight nodes or less), Node03, Node04, Node05 to NodeNN
- JobHistory: Node03
- SparkHistoryServer: Node03
- MySQL Primary: Node03; MySQL Backup: Node02
- Navigator Audit Server and Navigator Metadata Server: Node03
- Oozie: Node04
- Oracle Data Integrator Agent: Node04
- Hive Metastore: Node01, Node04
- HttpFS: Node02
- Hue Server: Node01, Node04
- Hue Load Balancer: Node01, Node04
- Big Data SQL (if enabled): all nodes
- Kerberos KDC (if MIT Kerberos is enabled and on-BDA KDCs are being used): Node01, Node02
- Sentry Server (if enabled): Node01, Node02
- Active Navigator Key Trustee Server (if HDFS Transparent Encryption is enabled): Node01; Passive Navigator Key Trustee Server: Node02

Let me talk about the High Availability implementation of some of these services. This configuration may change in the future; you can check for updates here.

3.2 Services with High Availability configured by default

As of today (November 2018) we support high availability for the following Hadoop components:

1) Name Node
2) YARN
3) Kerberos Key Distribution Center
4) Sentry
5) Hive Metastore Service
6) HUE

3.2.1 Name Node High Availability

As you may know, the Oracle solutions are based on the Cloudera Hadoop distribution. Here you can find a detailed explanation of how HDFS high availability is achieved; the good news is that all those configuration steps are done for you on BDA and BDCS, so you simply have it by default. Let me show a small demo of NameNode high availability. First, let's check the list of nodes which run this service:

[root@bdax72bur09node01 ~]# hdfs getconf -namenodes
bdax72bur09node01.us.oracle.com bdax72bur09node02.us.oracle.com

The easiest way to determine which node is active is to go to Cloudera Manager -> HDFS -> Instances; in my case the bdax72bur09node02 node is active. I'll run an hdfs list command in a loop, reboot the active NameNode, and take a look at how the system behaves:

[root@bdax72bur09node01 ~]# for i in {1..100}; do hadoop fs -ls hdfs://gcs-lab-bdax72-ns|tail -1; done;
drwxr-xr-x   - root root          0 2018-09-11 17:21 hdfs://gcs-lab-bdax72-ns/user/root/benchmarks
drwxr-xr-x   - root root          0 2018-09-11 17:21 hdfs://gcs-lab-bdax72-ns/user/root/benchmarks
drwxr-xr-x   - root root          0 2018-09-11 17:21 hdfs://gcs-lab-bdax72-ns/user/root/benchmarks
drwxr-xr-x   - root root          0 2018-09-11 17:21 hdfs://gcs-lab-bdax72-ns/user/root/benchmarks
18/11/01 19:53:53 INFO retry.RetryInvocationHandler: Exception while invoking getFileInfo of class ClientNamenodeProtocolTranslatorPB over bdax72bur09node02.us.oracle.com/192.168.8.171:8020 after 1 fail over attempts. Trying to fail over immediately.
...
18/11/01 19:54:16 INFO retry.RetryInvocationHandler: Exception while invoking getFileInfo of class ClientNamenodeProtocolTranslatorPB over bdax72bur09node02.us.oracle.com/192.168.8.171:8020 after 5 fail over attempts. Trying to fail over after sleeping for 11022ms. java.net.ConnectException: Call From bdax72bur09node01.us.oracle.com/192.168.8.170 to bdax72bur09node02.us.oracle.com:8020 failed on connection exception: java.net.ConnectException: Connection timed out; For more details see:  http://wiki.apache.org/hadoop/ConnectionRefused     at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)     at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)     at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)     at java.lang.reflect.Constructor.newInstance(Constructor.java:423)     at org.apache.hadoop.net.NetUtils.wrapWithMessage(NetUtils.java:791)     at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:731)     at org.apache.hadoop.ipc.Client.call(Client.java:1508)     at org.apache.hadoop.ipc.Client.call(Client.java:1441)     at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:230)     at com.sun.proxy.$Proxy10.getFileInfo(Unknown Source)     at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.getFileInfo(ClientNamenodeProtocolTranslatorPB.java:786)     at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)     at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)     at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)     at java.lang.reflect.Method.invoke(Method.java:498)     at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:258)     at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:104)     at com.sun.proxy.$Proxy11.getFileInfo(Unknown Source)     at org.apache.hadoop.hdfs.DFSClient.getFileInfo(DFSClient.java:2167)     at org.apache.hadoop.hdfs.DistributedFileSystem$20.doCall(DistributedFileSystem.java:1265)     at org.apache.hadoop.hdfs.DistributedFileSystem$20.doCall(DistributedFileSystem.java:1261)     at org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)     at org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:1261)     at org.apache.hadoop.fs.Globber.getFileStatus(Globber.java:64)     at org.apache.hadoop.fs.Globber.doGlob(Globber.java:272)     at org.apache.hadoop.fs.Globber.glob(Globber.java:151)     at org.apache.hadoop.fs.FileSystem.globStatus(FileSystem.java:1715)     at org.apache.hadoop.fs.shell.PathData.expandAsGlob(PathData.java:326)     at org.apache.hadoop.fs.shell.Command.expandArgument(Command.java:235)     at org.apache.hadoop.fs.shell.Command.expandArguments(Command.java:218)     at org.apache.hadoop.fs.shell.FsCommand.processRawArguments(FsCommand.java:102)     at org.apache.hadoop.fs.shell.Command.run(Command.java:165)     at org.apache.hadoop.fs.FsShell.run(FsShell.java:315)     at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)     at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:84)     at org.apache.hadoop.fs.FsShell.main(FsShell.java:372) Caused by: java.net.ConnectException: Connection timed out ...   
drwxr-xr-x   - root root          0 2018-09-11 17:21 hdfs://gcs-lab-bdax72-ns/user/root/benchmarks
drwxr-xr-x   - root root          0 2018-09-11 17:21 hdfs://gcs-lab-bdax72-ns/user/root/benchmarks

So, as we can see, due to the unavailability of one of the name nodes, the second one took over its responsibility. The client only experiences a short pause while the failover happens. In Cloudera Manager we can see that the NameNode service on node02 is not available, but despite this, users can keep working with the cluster without outages or any extra actions.

3.2.2 YARN High Availability

YARN is another key Hadoop component and it's also highly available by default with the Oracle solutions. Cloudera requires you to do some configuration, but on BDA and BDCS all these steps are done as part of the service deployment. Let's do the same test with the YARN ResourceManager. In Cloudera Manager we find the nodes which run the ResourceManager service and reboot the active one (to reproduce a hardware failure). I'll run some MapReduce code and restart the bdax72bur09node04 node (which hosts the active ResourceManager).

[root@bdax72bur09node01 hadoop-mapreduce]# hadoop jar hadoop-mapreduce-examples.jar pi 1 1
Number of Maps  = 1
Samples per Map = 1
Wrote input for Map #0
Starting Job
18/11/01 20:08:03 INFO client.ConfiguredRMFailoverProxyProvider: Failing over to rm16
18/11/01 20:08:03 INFO input.FileInputFormat: Total input paths to process : 1
18/11/01 20:08:04 INFO mapreduce.JobSubmitter: number of splits:1
18/11/01 20:08:04 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1541115989562_0002
18/11/01 20:08:04 INFO impl.YarnClientImpl: Submitted application application_1541115989562_0002
18/11/01 20:08:04 INFO mapreduce.Job: The url to track the job: http://bdax72bur09node04.us.oracle.com:8088/proxy/application_1541115989562_0002/
18/11/01 20:08:04 INFO mapreduce.Job: Running job: job_1541115989562_0002
18/11/01 20:08:07 INFO retry.RetryInvocationHandler: Exception while invoking getApplicationReport of class ApplicationClientProtocolPBClientImpl over rm16. Trying to fail over immediately.
java.io.EOFException: End of File Exception between local host is: "bdax72bur09node01.us.oracle.com/192.168.8.170"; destination host is: "bdax72bur09node04.us.oracle.com":8032; : java.io.EOFException; For more details see:  http://wiki.apache.org/hadoop/EOFException     at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)     at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)     at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)     at java.lang.reflect.Constructor.newInstance(Constructor.java:423)     at org.apache.hadoop.net.NetUtils.wrapWithMessage(NetUtils.java:791)     at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:764)     at org.apache.hadoop.ipc.Client.call(Client.java:1508)     at org.apache.hadoop.ipc.Client.call(Client.java:1441)     at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:230)     at com.sun.proxy.$Proxy13.getApplicationReport(Unknown Source)     at org.apache.hadoop.yarn.api.impl.pb.client.ApplicationClientProtocolPBClientImpl.getApplicationReport(ApplicationClientProtocolPBClientImpl.java:187)     at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)     at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)     at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)     at java.lang.reflect.Method.invoke(Method.java:498)     at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:258)     at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:104)     at com.sun.proxy.$Proxy14.getApplicationReport(Unknown Source)     at org.apache.hadoop.yarn.client.api.impl.YarnClientImpl.getApplicationReport(YarnClientImpl.java:408)     at org.apache.hadoop.mapred.ResourceMgrDelegate.getApplicationReport(ResourceMgrDelegate.java:302)     at org.apache.hadoop.mapred.ClientServiceDelegate.getProxy(ClientServiceDelegate.java:154)     at org.apache.hadoop.mapred.ClientServiceDelegate.invoke(ClientServiceDelegate.java:323)     at org.apache.hadoop.mapred.ClientServiceDelegate.getJobStatus(ClientServiceDelegate.java:423)     at org.apache.hadoop.mapred.YARNRunner.getJobStatus(YARNRunner.java:698)     at org.apache.hadoop.mapreduce.Job$1.run(Job.java:326)     at org.apache.hadoop.mapreduce.Job$1.run(Job.java:323)     at java.security.AccessController.doPrivileged(Native Method)     at javax.security.auth.Subject.doAs(Subject.java:422)     at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1920)     at org.apache.hadoop.mapreduce.Job.updateStatus(Job.java:323)     at org.apache.hadoop.mapreduce.Job.isComplete(Job.java:621)     at org.apache.hadoop.mapreduce.Job.monitorAndPrintJob(Job.java:1366)     at org.apache.hadoop.mapreduce.Job.waitForCompletion(Job.java:1328)     at org.apache.hadoop.examples.QuasiMonteCarlo.estimatePi(QuasiMonteCarlo.java:306)     at org.apache.hadoop.examples.QuasiMonteCarlo.run(QuasiMonteCarlo.java:354)     at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)     at org.apache.hadoop.examples.QuasiMonteCarlo.main(QuasiMonteCarlo.java:363)     at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)     at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)     at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)     at 
java.lang.reflect.Method.invoke(Method.java:498)
    at org.apache.hadoop.util.ProgramDriver$ProgramDescription.invoke(ProgramDriver.java:71)
    at org.apache.hadoop.util.ProgramDriver.run(ProgramDriver.java:144)
    at org.apache.hadoop.examples.ExampleDriver.main(ExampleDriver.java:74)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at org.apache.hadoop.util.RunJar.run(RunJar.java:221)
    at org.apache.hadoop.util.RunJar.main(RunJar.java:136)
Caused by: java.io.EOFException
    at java.io.DataInputStream.readInt(DataInputStream.java:392)
    at org.apache.hadoop.ipc.Client$Connection.receiveRpcResponse(Client.java:1113)
    at org.apache.hadoop.ipc.Client$Connection.run(Client.java:1006)
18/11/01 20:08:07 INFO client.ConfiguredRMFailoverProxyProvider: Failing over to rm15
18/11/01 20:08:09 INFO mapreduce.Job: Job job_1541115989562_0002 running in uber mode : false
18/11/01 20:08:09 INFO mapreduce.Job:  map 0% reduce 0%
18/11/01 20:08:23 INFO mapreduce.Job:  map 100% reduce 0%
18/11/01 20:08:29 INFO mapreduce.Job:  map 100% reduce 100%
18/11/01 20:08:29 INFO mapreduce.Job: Job job_1541115989562_0002 completed successfully

Well, in the logs we can clearly see that we failed over to the second ResourceManager. In Cloudera Manager we can see that node03 took over the active role. So, even when losing an entire node that hosts a ResourceManager, users do not lose the ability to submit their jobs.

3.2.3 Kerberos Key Distribution Center (KDC)

In fact, the majority of production Hadoop clusters run in secure mode, which means Kerberized clusters. The Kerberos Key Distribution Center is the key component for this. The good news is that when you install Kerberos with BDA or BDCS, you automatically get a standby KDC on your BDA/BDCS.

3.2.4 Sentry High Availability

If Kerberos is the authentication method (it defines who you are), users quite frequently want to pair it with an authorization tool. In the Cloudera world the de facto default tool is Sentry. Since the BDA 4.12 software release we support Sentry High Availability out of the box. Cloudera has detailed documentation which explains how it works.

3.2.5 Hive Metastore Service High Availability

When we are talking about Hive, it's very important to keep in mind that it consists of many components; this is easy to see in Cloudera Manager. Whenever you work with Hive tables, you go through several logical layers. To keep it simple, let's consider one case, where we use beeline to query some Hive tables. We need HiveServer2, the Hive Metastore Service and the metastore backend RDBMS to be available. Let's connect and make sure that the data is available:

0: jdbc:hive2://bdax72bur09node04.us.oracle.c (closed)> !connect jdbc:hive2://bdax72bur09node04.us.oracle.com:10000/default;
Connecting to jdbc:hive2://bdax72bur09node04.us.oracle.com:10000/default;
Enter username for jdbc:hive2://bdax72bur09node04.us.oracle.com:10000/default;:
Enter password for jdbc:hive2://bdax72bur09node04.us.oracle.com:10000/default;:
Connected to: Apache Hive (version 1.1.0-cdh5.14.2)
Driver: Hive JDBC (version 1.1.0-cdh5.14.2)
Transaction isolation: TRANSACTION_REPEATABLE_READ
1: jdbc:hive2://bdax72bur09node04.us.oracle.c> show databases;
...
+----------------+--+
| database_name  |
+----------------+--+
| csv            |
| default        |
| parq           |
+----------------+--+

Now, let's shut down HiveServer2 and make sure that we can no longer connect to the database:

1: jdbc:hive2://bdax72bur09node04.us.oracle.c> !connect jdbc:hive2://bdax72bur09node04.us.oracle.com:10000/default;
Connecting to jdbc:hive2://bdax72bur09node04.us.oracle.com:10000/default;
Enter username for jdbc:hive2://bdax72bur09node04.us.oracle.com:10000/default;:
Enter password for jdbc:hive2://bdax72bur09node04.us.oracle.com:10000/default;:
Could not open connection to the HS2 server. Please check the server URI and if the URI is correct, then ask the administrator to check the server status.
Error: Could not open client transport with JDBC Uri: jdbc:hive2://bdax72bur09node04.us.oracle.com:10000/default;: java.net.ConnectException: Connection refused (Connection refused) (state=08S01,code=0)
1: jdbc:hive2://bdax72bur09node04.us.oracle.c>

As expected, the connection failed. To protect against this, go to Cloudera Manager -> Hive -> Instances -> Add Role and add an extra HiveServer2 instance (add it to node05). After this we need to install a load balancer:

[root@bdax72bur09node06 ~]# yum -y install haproxy
Loaded plugins: langpacks
Resolving Dependencies
--> Running transaction check
---> Package haproxy.x86_64 0:1.5.18-7.el7 will be installed
--> Finished Dependency Resolution

Dependencies Resolved

================================================================================
 Package         Arch          Version               Repository          Size
================================================================================
Installing:
 haproxy         x86_64        1.5.18-7.el7          ol7_latest         833 k

Transaction Summary
================================================================================
Install  1 Package

Total download size: 833 k
Installed size: 2.6 M
Downloading packages:
haproxy-1.5.18-7.el7.x86_64.rpm                            | 833 kB  00:00:01
Running transaction check
Running transaction test
Transaction test succeeded
Running transaction
  Installing : haproxy-1.5.18-7.el7.x86_64                                  1/1
  Verifying  : haproxy-1.5.18-7.el7.x86_64                                  1/1

Installed:
  haproxy.x86_64 0:1.5.18-7.el7

Complete!
Now we need to configure haproxy. Open the configuration file:

[root@bdax72bur09node06 ~]# vi /etc/haproxy/haproxy.cfg

This is an example of my haproxy.cfg:

global
    log         127.0.0.1 local2
    chroot      /var/lib/haproxy
    pidfile     /var/run/haproxy.pid
    maxconn     4000
    user        haproxy
    group       haproxy
    daemon
    # turn on stats unix socket
    stats socket /var/lib/haproxy/stats

#---------------------------------------------------------------------
# common defaults that all the 'listen' and 'backend' sections will
# use if not designated in their block
#---------------------------------------------------------------------
defaults
    mode                    http
    log                     global
    option                  httplog
    option                  dontlognull
    option http-server-close
    option forwardfor       except 127.0.0.0/8
    option                  redispatch
    retries                 3
    timeout http-request    10s
    timeout queue           1m
    timeout connect         10s
    timeout client          1m
    timeout server          1m
    timeout http-keep-alive 10s
    timeout check           10s
    maxconn                 3000

#---------------------------------------------------------------------
# main frontend which proxys to the backends
#---------------------------------------------------------------------
frontend  main *:5000
    acl url_static       path_beg       -i /static /images /javascript /stylesheets
    acl url_static       path_end       -i .jpg .gif .png .css .js
    use_backend static          if url_static

#---------------------------------------------------------------------
# static backend for serving up images, stylesheets and such
#---------------------------------------------------------------------
backend static
    balance     roundrobin
    server      static 127.0.0.1:4331 check

#---------------------------------------------------------------------
# round robin balancing between the various backends
#---------------------------------------------------------------------
listen hiveserver2 :10005
    mode tcp
    option tcplog
    balance source
    server hiveserver2_1 bdax72bur09node04.us.oracle.com:10000 check
    server hiveserver2_2 bdax72bur09node05.us.oracle.com:10000 check

Then go to Cloudera Manager and set up the load balancer hostname/port (according to how we configured it in the previous step). After all these changes have been made, try to connect again:
beeline> !connect jdbc:hive2://bdax72bur09node06.us.oracle.com:10005/default;
...
INFO  : OK
+----------------+--+
| database_name  |
+----------------+--+
| csv            |
| default        |
| parq           |
+----------------+--+
3 rows selected (2.08 seconds)

Great, it works! Now try shutting down one of the HiveServer2 instances:

0: jdbc:hive2://bdax72bur09node06.us.oracle.c> !connect jdbc:hive2://bdax72bur09node06.us.oracle.com:10005/default;
...
1: jdbc:hive2://bdax72bur09node06.us.oracle.c> show databases;
...
+----------------+--+
| database_name  |
+----------------+--+
| csv            |
| default        |
| parq           |
+----------------+--+

This still works! Now let's move on and have a look at what we have for Hive Metastore Service high availability. The really great news is that it is enabled by default on BDA and BDCS. To show this, I'll shut down the metastore services one by one and check whether the beeline connection still works. Shut down the service on node01 and try to connect/query through beeline:

1: jdbc:hive2://bdax72bur09node06.us.oracle.c> !connect jdbc:hive2://bdax72bur09node06.us.oracle.com:10005/default;
...
INFO  : OK
+----------------+--+
| database_name  |
+----------------+--+
| csv            |
| default        |
| parq           |
+----------------+--+

It works. Now I'm going to start up the service on node01 and shut it down on node04:

1: jdbc:hive2://bdax72bur09node06.us.oracle.c> !connect jdbc:hive2://bdax72bur09node06.us.oracle.com:10005/default;
...
INFO  : OK
+----------------+--+
| database_name  |
+----------------+--+
| csv            |
| default        |
| parq           |
+----------------+--+

It works again! So, we are safe with the Hive Metastore service. BDA and BDCS use MySQL as the backing database for the metastore. As of today there is no full high availability for the MySQL database, so we use Master - Slave replication (in the future we hope to have HA for MySQL), which allows us to switch to the slave if the master fails. Today, you will need to perform a node migration if the master node fails (node03 by default); I'll explain this later in this blog. To find out where the MySQL master is, run this:

[root@bdax72bur09node01 tmp]#  json-select --jpx=MYSQL_NODE /opt/oracle/bda/install/state/config.json
bdax72bur09node03

To find out the slave, run:

[root@bdax72bur09node01 tmp]# json-select --jpx=MYSQL_BACKUP_NODE /opt/oracle/bda/install/state/config.json
bdax72bur09node02

3.2.6 HUE High Availability

Hue is quite a popular tool for working with Hadoop data. It's also possible to run Hue in HA mode - Cloudera explains it here - but with BDA and BDCS you have it out of the box since the 4.12 software version. By default you have a Hue server and a Hue load balancer available on node01 and node04: in case of unavailability of Node01 or Node04, users can easily keep using Hue without any extra actions, just by switching to the other load balancer URL.

3.3 Migrate Critical Nodes

One of the greatest features of Big Data Appliance is the capability to migrate all roles off a critical node.
For example, some nodes contain many critical services, like node03 (Cloudera Manager, ResourceManager, MySQL...). Fortunately, BDA has a simple way to migrate all roles from a critical node to a non-critical node. You can find all the details in MOS (Node Migration on Oracle Big Data Appliance V4.0 OL6 Hadoop Cluster to Manage a Hardware Failure (Doc ID 1946210.1)). Let's consider a case where we lose (because of a hardware failure, for example) one of the critical servers - node03, which contains the active MySQL RDBMS and Cloudera Manager. To fix this we need to migrate all roles of this node to some other server. To migrate all roles from node03, just run:

[root@bdax72bur09node01 ~]# bdacli admin_cluster migrate bdax72bur09node03

All the details are in the MOS note, but briefly:

1) There are two major types of migration:
- Migration of critical nodes
- Reprovisioning of non-critical nodes
2) When you migrate a critical node, you cannot choose the non-critical node to which the services are migrated (mammoth does this for you; generally it will be the first available non-critical node).
3) After the failed server comes back to the cluster (or a new one is added), you should reprovision it as a non-critical node.
4) You don't need to switch the services back; just leave things as they are after the migration is done - the new node takes over all roles from the failed one.

In my example, I've migrated one of the critical nodes, which hosted the active MySQL RDBMS and Cloudera Manager. To check where the active RDBMS is now, you can run:

[root@bdax72bur09node01 tmp]#  json-select --jpx=MYSQL_NODE /opt/oracle/bda/install/state/config.json
bdax72bur09node05

Note: to find the slave RDBMS, run:

[root@bdax72bur09node01 tmp]# json-select --jpx=MYSQL_BACKUP_NODE /opt/oracle/bda/install/state/config.json
bdax72bur09node02

Cloudera Manager now runs on node05, and the ResourceManager was also migrated to node05. The migration process decommissions the node. After the failed node comes back to the cluster, we need to reprovision it (deploy the non-critical services on it) - in other words, re-commission the node.

3.4 Redundant Services

There are certain Hadoop services which are configured on BDA in a redundant way. You shouldn't worry about high availability for these services:

- Data Node. By default, HDFS is configured with 3x redundancy. If you lose one node, you still have two more copies.
- Journal Node. By default, you have 3 instances of the JournalNode configured. Missing one is not a big deal.
- Zookeeper. By default, you have 3 instances of ZooKeeper configured. Missing one is not a big deal.

4. Services with no High Availability configured by default

There are certain services on BDA which don't have a High Availability configuration by default:

- Oozie. If you need High Availability for Oozie, you can check Cloudera's documentation.
- Cloudera Manager.
3.4 Redundant Services
There are certain Hadoop services which are configured on BDA in a redundant way, so you shouldn't worry about their high availability:
- Data Node. By default, HDFS is configured with 3x replication. If you lose one node, you still have two more copies of each block.
- Journal Node. By default, you have 3 instances of the JournalNode configured. Missing one is not a big deal.
- ZooKeeper. By default, you have 3 instances of ZooKeeper configured. Missing one is not a big deal.
4. Services with no High Availability configured by default
There are certain services on BDA which don't have a high availability configuration by default:
- Oozie. If you need high availability for Oozie, you can check Cloudera's documentation.
- Cloudera Manager. It's also possible to configure Cloudera Manager for high availability, as explained here, but I'd recommend using node migration instead, as I showed above.
- Impala. By default, neither BDA nor BDCS configures high availability for Impala (yet), but it's quite important. You can find all the details here, but briefly, to configure HA for Impala you need to:
a. Configure haproxy (I extended the existing haproxy config, done for HiveServer2), by adding:
listen impala :25003
    mode tcp
    option tcplog
    balance leastconn
      server symbolic_name_1 bdax72bur09node01.us.oracle.com:21000 check
      server symbolic_name_2 bdax72bur09node02.us.oracle.com:21000 check
      server symbolic_name_3 bdax72bur09node03.us.oracle.com:21000 check
      server symbolic_name_4 bdax72bur09node04.us.oracle.com:21000 check
      server symbolic_name_5 bdax72bur09node05.us.oracle.com:21000 check
      server symbolic_name_6 bdax72bur09node06.us.oracle.com:21000 check
b. Go to Cloudera Manager -> Impala -> Configuration, search for "Impala Daemons Load Balancer" and add the haproxy host:port there.
c. Log in to Impala using the haproxy host:port:
[root@bdax72bur09node01 bin]# impala-shell -i bdax72bur09node06:25003
...
Connected to bdax72bur09node06:25003
...
[bdax72bur09node06:25003] >
Talking about Impala, it's worth mentioning that there are two more services - the Impala Catalog Service and the Impala StateStore. These are not mission-critical services. From Cloudera's documentation: "The Impala component known as the statestore checks on the health of Impala daemons on all the DataNodes in a cluster, and continuously relays its findings to each of those daemons." and "The Impala component known as the catalog service relays the metadata changes from Impala SQL statements to all the Impala daemons in a cluster. ... Because the requests are passed through the statestore daemon, it makes sense to run the statestored and catalogd services on the same host." I'll run a quick test:
- disable the Impala daemon on node01, and disable the StateStore and the Catalog Server;
- connect to the load balancer and run a query:
[root@bdax72bur09node01 ~]# impala-shell -i bdax72bur09node06:25003
....
[bdax72bur09node06:25003] > select count(1) from test_table;
...
+------------+
| count(1)   |
+------------+
| 6659433869 |
+------------+
Fetched 1 row(s) in 1.76s
[bdax72bur09node06:25003] >
So, as we can see, Impala can keep serving queries even without the StateStore and the Catalog Service.
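One addition that is not in the original post: the listen block above balances only port 21000, which is what impala-shell uses. JDBC and ODBC clients connect to the Impala daemons on port 21050, so if those clients should also go through the load balancer, a second listen block is typically added. This is only a sketch following the same pattern; the listener port 25004 is just a placeholder - adjust ports and host names for your cluster.
# Hypothetical additional haproxy block for Impala JDBC/ODBC clients (daemon port 21050).
listen impala_jdbc :25004
    mode tcp
    option tcplog
    balance leastconn
      server symbolic_name_7  bdax72bur09node01.us.oracle.com:21050 check
      server symbolic_name_8  bdax72bur09node02.us.oracle.com:21050 check
      server symbolic_name_9  bdax72bur09node03.us.oracle.com:21050 check
      server symbolic_name_10 bdax72bur09node04.us.oracle.com:21050 check
      server symbolic_name_11 bdax72bur09node05.us.oracle.com:21050 check
      server symbolic_name_12 bdax72bur09node06.us.oracle.com:21050 check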
Appendix A.
Even though BDA has multiple high availability features, it's always useful to make a backup before any significant operation, such as an upgrade. For detailed information, please follow the My Oracle Support (MOS) note: How to Backup Critical Metadata on Oracle Big Data Appliance Prior to Upgrade V2.3.1 and Higher Releases (Doc ID 1623304.1)

In this blog post, I'd like to briefly review the high availability functionality in Oracle Big Data Appliance and Big Data Cloud Service. The good news on all of this is that most of these features...

Big Data

Influence Product Roadmap: Introducing the new Big Data Idea Lab

You have ideas, you have feedback, and you want to be involved in the products and services you use. Of course you do, so here is the new Idea Lab for Big Data, where you can submit your ideas and vote on ideas submitted by others. Visit the Big Data Idea Lab now. What does the Idea Lab let you do, and how do we use your feedback? For all our products and services we (Product Management) define a set of features and functionality that will enhance the products and solve customer problems. We then set out to prioritize these features and functions, and a big driver of this is the impact said features have on you, our customers. Until now we really used direct interaction with customers as the yardstick for that impact and that prioritization. That will change with the Idea Lab, where we will have direct, recorded and scalable input available on features and ideas. Of course we are also looking for input on new features and things we had not thought about. That is the other part of the Idea Lab: giving us new ideas, new functions and features, and anything that you think would help you use our products better in your company. As we progress in releasing new functionality, the Idea Lab will be a running tally of our progress, and we promise to keep you updated on where we are going in roadmap posts on this blog (see this example: Start Planning your Upgrade Strategy to Cloudera 6 on Oracle Big Data Now), and on the Idea Lab. So, please use the Idea Lab, submit and vote, and visit often to see what is new and keep us tracking towards better products. And thanks in advance for your efforts!

You have ideas, you have feedback, and you want to be involved in the products and services you use. Of course you do, so here is the new Idea Lab for Big Data, where you can submit your ideas and vote on...

Autonomous

Thursday at OpenWorld 2018 - Your Must-See Sessions

  Day three is a wrap so now is the perfect time to start planning your Day 4 session at  OpenWorld 2018. Here’s your absolutely Must-See agenda for Thursday at OpenWorld 2018... My favorite session of the whole conference is today - Using Analytic Views for Self-Service Business Intelligence, which is at 9:00am in Room 3005, Moscone West. Multi-Dimensional models inside the database are very powerful and totally cool. AVs uniquely deliver sophisticated analytics from very simple SQL. If you only get to one session today then make it this one! Of course, today is your final chance to get some much-needed real hands-on time with Autonomous Data Warehouse in the ADW Hands-on Lab at 1:30pm - 2:30 at the Marriott Marquis (Yerba Buena Level) - Salon 9B. The product management team will help you sign up for a free cloud account and then get you working on our completely free workshop. Don't miss it, just bring your own laptop!   THURSDAY's MUST-SEE GUIDE    Don't worry if you are not able to join us in San Francisco for this year's conference because I will be providing a comprehensive review after the conference closes on Thursday. The review will include links to download the presentations for each of my Must-See sessions and links to any hands-on lab content as well. Have a great conference. If you are here in San Francisco then enjoy the conference - it's going to be an awesome conference this year.  Technorati Tags: Analytics, Autonomous, Big Data, Cloud, Conference, Data Warehousing, OpenWorld, SQL Analytics

  Day three is a wrap so now is the perfect time to start planning your Day 4 session at  OpenWorld 2018. Here’s your absolutely Must-See agenda for Thursday at OpenWorld 2018... My favorite session of...

Autonomous

Wednesday at OpenWorld 2018 - Your Must-See Sessions

  Here’s your absolutely Must-See agenda for Wednesday at OpenWorld 2018... Day one is a wrap so now is the perfect time to start planning your Day 3 session at  OpenWorld 2018. The list is packed full of really excellent speakers from Oracle product management talking about Autonomous Database and the Cloud. These sessions are what Oracle OpenWorld is all about: the chance to learn from the real technical experts. Highlight of today is two additional chances to get some real hands-on time with Autonomous Data Warehouse in the ADW Hands-on Lab at 12:45pm - 1:45pm and then again at 3:45pm - 4:45pm, both at the Marriott Marquis (Yerba Buena Level) - Salon 9B. We will help you sign up for a free cloud account and then get you working on our completely free workshop. Don't miss it, just bring your own laptop!   WEDNESDAY'S MUST-SEE GUIDE    Don't worry if you are not able to join us in San Francisco for this year's conference because I will be providing a comprehensive review after the conference closes on Thursday. The review will include links to download the presentations for each of my Must-See sessions and links to any hands-on lab content as well. Have a great conference. If you are here in San Francisco then enjoy the conference - it's going to be an awesome conference this year.  

  Here’s your absolutely Must-See agenda for Wednesday at OpenWorld 2018... Day one is a wrap so now is the perfect time to start planning your Day 3 session at  OpenWorld 2018. The list is packed full...

Autonomous

Tuesday at OpenWorld 2018 - Your Must-See Sessions

  Here’s your absolutely Must-See agenda for Tuesday at OpenWorld 2018... Day one is a wrap so now is the perfect time to start planning your Day 2 session at  OpenWorld 2018. The list is packed full of really excellent speakers from Oracle product management talking about Autonomous Database and the Cloud. These sessions are what Oracle OpenWorld is all about: the chance to learn from the real technical experts. Highlight of today is the chance to get some real hands-on time with Autonomous Data Warehouse in the ADW Hands-on Lab at 3:45 PM - 4:45 PM at the Marriott Marquis (Yerba Buena Level) - Salon 9B. We will help you sign up for a free cloud account and then get you working on our completely free workshop. Don't miss it, just bring your own laptop!   TUESDAY'S MUST-SEE GUIDE    Don't worry if you are not able to join us in San Francisco for this year's conference because I will be providing a comprehensive review after the conference closes on Thursday. The review will include links to download the presentations for each of my Must-See sessions and links to any hands-on lab content as well. Have a great conference. If you are here in San Francisco then enjoy the conference - it's going to be an awesome conference this year.  

  Here’s your absolutely Must-See agenda for Tuesday at OpenWorld 2018... Day one is a wrap so now is the perfect time to start planning your Day 2 session at  OpenWorld 2018. The list is packed full of...

Autonomous

Managing Autonomous Data Warehouse Using oci-curl

Every now and then we get questions about how to create and manage an Autonomous Data Warehouse (ADW) instance using REST APIs. ADW is an Oracle Cloud Infrastructure (OCI) based service, which means you can use the OCI REST APIs to manage your ADW instances as an alternative to using the OCI web interface. I want to provide a few examples of doing this using the bash function oci-curl provided in the OCI documentation. This was the easiest method for me to use; you can also use the OCI command line interface or the SDKs to do the same operations.
oci-curl
oci-curl is a bash function provided in the documentation that makes it easy to get started with the REST APIs. You will need to complete a few setup operations before you can start calling it. Start by copying the function code from the documentation into a shell script on your machine. I saved it into a file named oci-curl.sh, for example. You will see the following section at the top of the file; you need to replace these four values with your own.
# TODO: update these values to your own
local tenancyId="ocid1.tenancy.oc1..aaaaaaaaba3pv6wkcr4jqae5f15p2b2m2yt2j6rx32uzr4h25vqstifsfdsq";
local authUserId="ocid1.user.oc1..aaaaaaaat5nvwcna5j6aqzjcaty5eqbb6qt2jvpkanghtgdaqedqw3rynjq";
local keyFingerprint="20:3b:97:13:55:1c:5b:0d:d3:37:d8:50:4e:c5:3a:34";
local privateKeyPath="/Users/someuser/.oci/oci_api_key.pem";
How to find or generate these values is explained in the documentation here; let's walk through those steps now.
Tenancy ID
The first one is the tenancy ID. You can find your tenancy ID at the bottom of any page in the OCI web interface, as indicated in this screenshot. Copy and paste the tenancy ID into the tenancyId argument in your oci-curl shell script.
Auth User ID
This is the OCI ID of the user who will perform actions using oci-curl. This user needs to have the privileges to manage ADW instances in your OCI tenancy. You can find your user OCI ID by going to the users screen as shown in this screenshot. Click the Copy link in that screen, which copies the OCI ID for that user into the clipboard. Paste it into the authUserId argument in your oci-curl shell script.
Key Fingerprint
The first step for getting the key fingerprint is to generate an API signing key. Follow the documentation to do that. I am running these commands on a Mac, and for demo purposes I am not using a passphrase; see the documentation for Windows commands and for using a passphrase to encrypt the key file.
mkdir ~/.oci
openssl genrsa -out ~/.oci/oci_api_key.pem 2048
chmod go-rwx ~/.oci/oci_api_key.pem
openssl rsa -pubout -in ~/.oci/oci_api_key.pem -out ~/.oci/oci_api_key_public.pem
For your API calls to authenticate against OCI you need to upload the public key file. Go to the user details screen for your user on the OCI web interface and select API Keys on the left. Click the Add Public Key button, copy and paste the contents of the file oci_api_key_public.pem into the text field, and click Add to finish the upload. After you upload your key you will see its fingerprint in the user details screen as shown below. Copy and paste the fingerprint text into the keyFingerprint argument in your oci-curl shell script.
Private Key Path
Lastly, change the privateKeyPath argument in your oci-curl shell script to the path of the key file you generated in the previous step. For example, I set it as below on my machine.
local privateKeyPath="/Users/ybaskan/.oci/oci_api_key.pem"; At this point, I save my updated shell script as oci-curl.sh and I will be calling this function to manage my ADW instances. Create an ADW instance Let's start by creating an instance using the function. Here is my shell script for doing that, createdb.sh. #!/bin/bash . ./oci-curl.sh oci-curl database.us-phoenix-1.oraclecloud.com post ./request.json "/20160918/autonomousDataWarehouses" Note that I first source the file oci-curl.sh which contains my oci-curl function updated with my OCI tenancy information as explained previously. I am calling the CreateAutonomousDataWarehouse REST API to create a database. Note that I am running this against the Phoenix data center (indicated by the first argument, database.us-phoenix-1.oraclecloud.com), if you want to create your database in other data centers you need to use the relevant endpoint listed here. I am also referring to a file named request.json which is a file that contains my arguments for creating the database. Here is the content of that file. { "compartmentId" : "ocid1.tenancy.oc1..aaaaaaaaro2vctz2hianklgq77hguo6jzcs6ezyheouqfsald4x3nubpwr2a", "dbName" : "adwdb1", "displayName" : "adwdb1", "adminPassword" : "WelcomePMADWC18", "cpuCoreCount" : 1, "dataStorageSizeInTBs" : 1, "licenseModel" : "LICENSE_INCLUDED" } As seen in the file I am creating a database named adwdb1 with 1 CPU and 1TB storage. You can create your database in any of your compartments, to find the compartment ID which is required in this file, go to the compartments page on the OCI web interface, find the compartment you want to use and click the Copy link to copy the compartment ID into the clipboard. Paste it into the compartmentId argument in your request.json file. Let's run the script to create an ADW instance. ./createdb.sh { "compartmentId" : "ocid1.tenancy.oc1..aaaaaaaaro2vctz2hianklgq77hguo6jzcs6ezyheouqfsald4x3nubpwr2a", "connectionStrings" : null, "cpuCoreCount" : 1, "dataStorageSizeInTBs" : 1, "dbName" : "adwdb1", "definedTags" : { }, "displayName" : "adwdb1", "freeformTags" : { }, "id" : "ocid1.autonomousdwdatabase.oc1.phx.abyhqljszen442afj6ogitgs3dwk2iunhv7zdndllf6o6is6xg2ku5a7uf3a", "licenseModel" : "LICENSE_INCLUDED", "lifecycleDetails" : null, "lifecycleState" : "PROVISIONING", "serviceConsoleUrl" : null, "timeCreated" : "2018-09-06T19:56:48.077Z" As you see the lifecycle state is listed as provisioning which indicates the database is being provisioned. If you now go to the OCI web interface you will see the new database as being provisioned. Listing ADW instances Here is the script, listdb.sh, I use to list the ADW instances in my compartment. I use the ListAutonomousDataWarehouses REST API for this. #!/bin/bash . ./oci-curl.sh oci-curl database.us-phoenix-1.oraclecloud.com get "/20160918/autonomousDataWarehouses?compartmentId=ocid1.tenancy.oc1..aaaaaaaaro2vctz2hianklgq77hguo6jzcs6ezyheouqfsald4x3nubpwr2a" As you see it has one argument, compartmentId, which I set to the ID of my compartment I used in the previous example when creating a new ADW instance. When you run this script it gives you a list of databases and information about them in JSON which looks pretty ugly. 
./listdb.sh [{"compartmentId":"ocid1.tenancy.oc1..aaaaaaaaro2vctz2hianklgq77hguo6jzcs6ezyheouqfsald4x3nubpwr2a","connectionStrings":{"high":"adwc.uscom-west-1.oraclecloud.com:1522/nqpobuuiaedvlyf_adwdb1_high.adwc.oraclecloud.com","low":"adwc.uscom-west-1.oraclecloud.com:1522/nqpobuuiaedvlyf_adwdb1_low.adwc.oraclecloud.com","medium":"adwc.uscom-west-1.oraclecloud.com:1522/nqpobuuiaedvlyf_adwdb1_medium.adwc.oraclecloud.com"},"cpuCoreCount":1,"dataStorageSizeInTBs":1,"dbName":"adwdb1","definedTags":{},"displayName":"adwdb1","freeformTags":{},"id":"ocid1.autonomousdwdatabase.oc1.phx.abyhqljszen442afj6ogitgs3dwk2iunhv7zdndllf6o6is6xg2ku5a7uf3a","licenseModel":"LICENSE_INCLUDED","lifecycleDetails":null,"lifecycleState":"AVAILABLE","serviceConsoleUrl":"https://adwc.uscom-west-1.oraclecloud.com/console/index.html?tenant_name=OCID1.TENANCY.OC1..AAAAAAAARO2VCTZ2HIANKLGQ77HGUO6JZCS6EZYHEOUQFSALD4X3NUBPWR2A&database_name=ADWDB1&service_type=ADW","timeCreated":"2018-09-06T19:56:48.077Z"},{"compartmentId":"ocid1.tenancy.oc1..aaaaaaaaro2vctz2hianklgq77hguo6jzcs6ezyheouqfsald4x3nubpwr2a","connectionStrings":{"high":"adwc.uscom-west-1.oraclecloud.com:1522/nqpobuuiaedvlyf_testdw_high.adwc.oraclecloud.com","low":"adwc.uscom-west-1.oraclecloud.com:1522/nqpobuuiaedvlyf_testdw_low.adwc.oraclecloud.com","medium":"adwc.uscom-west-1.oraclecloud.com:1522/nqpobuuiaedvlyf_testdw_medium.adwc.oraclecloud.com"},"cpuCoreCount":1,"dataStorageSizeInTBs":1,"dbName":"testdw","definedTags":{},"displayName":"testdw","freeformTags":{},"id":"ocid1.autonomousdwdatabase.oc1.phx.abyhqljtcioe5c5sjteosafqfd37biwde66uqj2pqs773gueucq3dkedv3oq","licenseModel":"LICENSE_INCLUDED","lifecycleDetails":null,"lifecycleState":"AVAILABLE","serviceConsoleUrl":"https://adwc.uscom-west-1.oraclecloud.com/console/index.html?tenant_name=OCID1.TENANCY.OC1..AAAAAAAARO2VCTZ2HIANKLGQ77HGUO6JZCS6EZYHEOUQFSALD4X3NUBPWR2A&database_name=TESTDW&service_type=ADW","timeCreated":"2018-07-31T22:39:14.436Z"}] You can use a JSON beautifier to make it human-readable. For example, I use Python to view the same output in a more readable format. 
./listdb.sh | python -m json.tool [ { "compartmentId": "ocid1.tenancy.oc1..aaaaaaaaro2vctz2hianklgq77hguo6jzcs6ezyheouqfsald4x3nubpwr2a", "connectionStrings": { "high": "adwc.uscom-west-1.oraclecloud.com:1522/nqpobuuiaedvlyf_adwdb1_high.adwc.oraclecloud.com", "low": "adwc.uscom-west-1.oraclecloud.com:1522/nqpobuuiaedvlyf_adwdb1_low.adwc.oraclecloud.com", "medium": "adwc.uscom-west-1.oraclecloud.com:1522/nqpobuuiaedvlyf_adwdb1_medium.adwc.oraclecloud.com" }, "cpuCoreCount": 1, "dataStorageSizeInTBs": 1, "dbName": "adwdb1", "definedTags": {}, "displayName": "adwdb1", "freeformTags": {}, "id": "ocid1.autonomousdwdatabase.oc1.phx.abyhqljszen442afj6ogitgs3dwk2iunhv7zdndllf6o6is6xg2ku5a7uf3a", "licenseModel": "LICENSE_INCLUDED", "lifecycleDetails": null, "lifecycleState": "AVAILABLE", "serviceConsoleUrl": "https://adwc.uscom-west-1.oraclecloud.com/console/index.html?tenant_name=OCID1.TENANCY.OC1..AAAAAAAARO2VCTZ2HIANKLGQ77HGUO6JZCS6EZYHEOUQFSALD4X3NUBPWR2A&database_name=ADWDB1&service_type=ADW", "timeCreated": "2018-09-06T19:56:48.077Z" }, { "compartmentId": "ocid1.tenancy.oc1..aaaaaaaaro2vctz2hianklgq77hguo6jzcs6ezyheouqfsald4x3nubpwr2a", "connectionStrings": { "high": "adwc.uscom-west-1.oraclecloud.com:1522/nqpobuuiaedvlyf_testdw_high.adwc.oraclecloud.com", "low": "adwc.uscom-west-1.oraclecloud.com:1522/nqpobuuiaedvlyf_testdw_low.adwc.oraclecloud.com", "medium": "adwc.uscom-west-1.oraclecloud.com:1522/nqpobuuiaedvlyf_testdw_medium.adwc.oraclecloud.com" }, "cpuCoreCount": 1, "dataStorageSizeInTBs": 1, "dbName": "testdw", "definedTags": {}, "displayName": "testdw", "freeformTags": {}, "id": "ocid1.autonomousdwdatabase.oc1.phx.abyhqljtcioe5c5sjteosafqfd37biwde66uqj2pqs773gueucq3dkedv3oq", "licenseModel": "LICENSE_INCLUDED", "lifecycleDetails": null, "lifecycleState": "AVAILABLE", "serviceConsoleUrl": "https://adwc.uscom-west-1.oraclecloud.com/console/index.html?tenant_name=OCID1.TENANCY.OC1..AAAAAAAARO2VCTZ2HIANKLGQ77HGUO6JZCS6EZYHEOUQFSALD4X3NUBPWR2A&database_name=TESTDW&service_type=ADW", "timeCreated": "2018-07-31T22:39:14.436Z" } ] Scaling an ADW instance To scale an ADW instance you need to use the UpdateAutonomousDataWarehouse REST API with the relevant arguments. Here is my script, updatedb.sh, I use to do that. #!/bin/bash . ./oci-curl.sh oci-curl database.us-phoenix-1.oraclecloud.com put ./update.json "/20160918/autonomousDataWarehouses/$1" As you see it uses the file update.json as the request body and also uses the command line argument $1 as the database OCI ID. The file update.json has the following argument in it. { "cpuCoreCount" : 2 } I am only using cpuCoreCount as I want to change my CPU capacity, you can use other arguments listed in the documentation if you need to. To find the database OCI ID for your ADW instance you can either look at the output of the list databases API I mentioned above or you can go the ADW details page on the OCI web interface which will show you the OCI ID. Now, I call it with my database ID and the scale operation is submitted. 
./updatedb.sh ocid1.autonomousdwdatabase.oc1.phx.abyhqljszen442afj6ogitgs3dwk2iunhv7zdndllf6o6is6xg2ku5a7uf3a { "compartmentId" : "ocid1.tenancy.oc1..aaaaaaaaro2vctz2hianklgq77hguo6jzcs6ezyheouqfsald4x3nubpwr2a", "connectionStrings" : { "high" : "adwc.uscom-west-1.oraclecloud.com:1522/nqpobuuiaedvlyf_adwdb1_high.adwc.oraclecloud.com", "low" : "adwc.uscom-west-1.oraclecloud.com:1522/nqpobuuiaedvlyf_adwdb1_low.adwc.oraclecloud.com", "medium" : "adwc.uscom-west-1.oraclecloud.com:1522/nqpobuuiaedvlyf_adwdb1_medium.adwc.oraclecloud.com" }, "cpuCoreCount" : 1, "dataStorageSizeInTBs" : 1, "dbName" : "adwdb1", "definedTags" : { }, "displayName" : "adwdb1", "freeformTags" : { }, "id" : "ocid1.autonomousdwdatabase.oc1.phx.abyhqljszen442afj6ogitgs3dwk2iunhv7zdndllf6o6is6xg2ku5a7uf3a", "licenseModel" : "LICENSE_INCLUDED", "lifecycleDetails" : null, "lifecycleState" : "SCALE_IN_PROGRESS", "serviceConsoleUrl" : "https://adwc.uscom-west-1.oraclecloud.com/console/index.html?tenant_name=OCID1.TENANCY.OC1..AAAAAAAARO2VCTZ2HIANKLGQ77HGUO6JZCS6EZYHEOUQFSALD4X3NUBPWR2A&database_name=ADWDB1&service_type=ADW", "timeCreated" : "2018-09-06T19:56:48.077Z" } If you go to the OCI web interface again you will see that the status for that ADW instance is shown as Scaling in Progress. Stopping and Starting an ADW Instance To stop and start ADW instances you need to use the StopAutonomousDataWarehouse and the StartAutonomousDataWarehouse REST APIs. Here is my stop database script, stopdb.sh. #!/bin/bash . ./oci-curl.sh oci-curl database.us-phoenix-1.oraclecloud.com POST ./empty.json /20160918/autonomousDataWarehouses/$1/actions/stop As you see it takes one argument, $1, which is the database OCI ID as I used in the scale example before. It also refers to the file empty.json which is an empty JSON file with the below content. { } As you will see this requirement is not mentioned in the documentation, but the call will give an error if you do not provide the empty JSON file as input. Here is the script running with my database OCI ID. ./stopdb.sh ocid1.autonomousdwdatabase.oc1.phx.abyhqljszen442afj6ogitgs3dwk2iunhv7zdndllf6o6is6xg2ku5a7uf3a { "compartmentId" : "ocid1.tenancy.oc1..aaaaaaaaro2vctz2hianklgq77hguo6jzcs6ezyheouqfsald4x3nubpwr2a", "connectionStrings" : { "high" : "adwc.uscom-west-1.oraclecloud.com:1522/nqpobuuiaedvlyf_adwdb1_high.adwc.oraclecloud.com", "low" : "adwc.uscom-west-1.oraclecloud.com:1522/nqpobuuiaedvlyf_adwdb1_low.adwc.oraclecloud.com", "medium" : "adwc.uscom-west-1.oraclecloud.com:1522/nqpobuuiaedvlyf_adwdb1_medium.adwc.oraclecloud.com" }, "cpuCoreCount" : 1, "dataStorageSizeInTBs" : 1, "dbName" : "adwdb1", "definedTags" : { }, "displayName" : "adwdb1", "freeformTags" : { }, "id" : "ocid1.autonomousdwdatabase.oc1.phx.abyhqljszen442afj6ogitgs3dwk2iunhv7zdndllf6o6is6xg2ku5a7uf3a", "licenseModel" : "LICENSE_INCLUDED", "lifecycleDetails" : null, "lifecycleState" : "STOPPING", "serviceConsoleUrl" : "https://adwc.uscom-west-1.oraclecloud.com/console/index.html?tenant_name=OCID1.TENANCY.OC1..AAAAAAAARO2VCTZ2HIANKLGQ77HGUO6JZCS6EZYHEOUQFSALD4X3NUBPWR2A&database_name=ADWDB1&service_type=ADW", "timeCreated" : "2018-09-06T19:56:48.077Z" Likewise, you can start the database using a similar call. Here is my script, startdb.sh, that does that. #!/bin/bash . ./oci-curl.sh oci-curl database.us-phoenix-1.oraclecloud.com POST ./empty.json /20160918/autonomousDataWarehouses/$1/actions/start Here it is running for my database. 
./startdb.sh ocid1.autonomousdwdatabase.oc1.phx.abyhqljszen442afj6ogitgs3dwk2iunhv7zdndllf6o6is6xg2ku5a7uf3a { "compartmentId" : "ocid1.tenancy.oc1..aaaaaaaaro2vctz2hianklgq77hguo6jzcs6ezyheouqfsald4x3nubpwr2a", "connectionStrings" : { "high" : "adwc.uscom-west-1.oraclecloud.com:1522/nqpobuuiaedvlyf_adwdb1_high.adwc.oraclecloud.com", "low" : "adwc.uscom-west-1.oraclecloud.com:1522/nqpobuuiaedvlyf_adwdb1_low.adwc.oraclecloud.com", "medium" : "adwc.uscom-west-1.oraclecloud.com:1522/nqpobuuiaedvlyf_adwdb1_medium.adwc.oraclecloud.com" }, "cpuCoreCount" : 1, "dataStorageSizeInTBs" : 1, "dbName" : "adwdb1", "definedTags" : { }, "displayName" : "adwdb1", "freeformTags" : { }, "id" : "ocid1.autonomousdwdatabase.oc1.phx.abyhqljszen442afj6ogitgs3dwk2iunhv7zdndllf6o6is6xg2ku5a7uf3a", "licenseModel" : "LICENSE_INCLUDED", "lifecycleDetails" : null, "lifecycleState" : "STARTING", "serviceConsoleUrl" : "https://adwc.uscom-west-1.oraclecloud.com/console/index.html?tenant_name=OCID1.TENANCY.OC1..AAAAAAAARO2VCTZ2HIANKLGQ77HGUO6JZCS6EZYHEOUQFSALD4X3NUBPWR2A&database_name=ADWDB1&service_type=ADW", "timeCreated" : "2018-09-06T19:56:48.077Z" Other Operations on ADW Instances I gave some examples of common operations on an ADW instance, to use REST APIs for other operations you can use the same oci-curl function and the relevant API documentation. For demo purposes, as you saw I have hardcoded some stuff like OCIDs, you can further enhance and parameterize these scripts to use them generally for your ADW environment. Next, I will post some examples of managing ADW instances using the command line utility oci-cli.
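The original post stops at start and stop operations. For completeness, here is a sketch - an assumption based on the same pattern and the DeleteAutonomousDataWarehouse REST API, not something taken from the post - of what a deletedb.sh could look like; double-check the API reference before using anything like this, since termination is irreversible.
#!/bin/bash
# Hypothetical deletedb.sh - follows the same pattern as the scripts above.
# $1 is the OCID of the ADW instance to terminate.
. ./oci-curl.sh
oci-curl database.us-phoenix-1.oraclecloud.com delete "/20160918/autonomousDataWarehouses/$1"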

Every now and then we get questions about how to create and manage an Autonomous Data Warehouse (ADW) instance using REST APIs. ADW is an Oracle Cloud Infrastructure (OCI) based service, this means...

See How Easily You Can Query Object Store with Big Data Cloud Service (BDCS)

What is Object Store? Object Store is becoming a more and more popular storage type, especially in the cloud. It provides benefits such as:
- Elasticity. Customers don't have to plan ahead for how much space they need. Need some extra space? Simply load more data into Object Store.
- Scalability. It scales infinitely. At least theoretically :)
- Durability and Availability. Object Store is a first-class citizen in every cloud story, so all vendors do their best to maintain 100% availability and durability. If a disk goes down, it shouldn't worry you. If a node running the object store software goes down, it shouldn't worry you. As a user, you just put data in and read data from Object Store.
- Cost. In a cloud, Object Store is the most cost-efficient solution.
Nothing comes for free, and as downsides I would highlight:
- Performance, in comparison with HDFS or local block devices. Whenever you read data from Object Store, you read it over the network.
- Inconsistency of performance. You are not alone on the object store, and under the hood it uses physical disks, which have their own throughput. If many users start to read and write data to/from Object Store, you may get performance that differs from what you used to get a day, a week, or a month ago.
- Security. Unlike filesystems, Object Store has no fine-grained permission policies, and customers will need to reorganize and rebuild their security standards and policies.
Based on the points above, we can conclude that Object Store is well suited as a way to share data across many systems, as well as a historical layer for certain information management systems. If we compare Object Store with HDFS (both are schema-on-read systems, which simply store data and apply a schema at runtime, when a user runs a query), I would personally differentiate them like this: HDFS is "write once, read many", Object Store is "write once, read few". So it's more of a historical tier (cheaper and slower) than HDFS. In the context of information data management, Object Store sits at the bottom of the pyramid.
How to copy data to Object Store
Let's imagine that we have Big Data Cloud Service (BDCS) and want to archive some data from HDFS to Object Store (for example, because we are running out of capacity on HDFS). There are multiple ways to do this (I've written about this earlier here), but I'll pick ODCP - the Oracle-built tool for copying data between multiple sources, including HDFS and Object Store. You can find the full documentation here, but I'll only show a brief example of how I did it on my test cluster. First we need to define the Object Store credentials on the client node (in my case it's one of the BDCS nodes), where we will run the client:
[opc@node01 ~]$ export CM_ADMIN=admin
[opc@node01 ~]$ export CM_PASSWORD=Welcome1!
[opc@node01 ~]$ export CM_URL=https://cmhost.us2.oraclecloud.com:7183
[opc@node01 ~]$ bda-oss-admin add_swift_cred --swift-username "storage-a424392:alexey@oracle.com" --swift-password "MyPassword-" --swift-storageurl "https://storage-a422222.storage.oraclecloud.com/auth/v2.0/tokens" --swift-provider bdcstorage
After this, we can check that it appears:
[opc@node01 ~]$ bda-oss-admin list_swift_creds -t
PROVIDER  USERNAME                                                    STORAGE URL
bdcstorage storage-a424392:alexey.filanovskiy@oracle.com               https://storage-a422222.storage.oraclecloud.com/auth/v2.0/tokens
Next, we need to copy the data from HDFS to Object Store:
[opc@node01 ~]$ odcp hdfs:///user/hive/warehouse/parq.db/ swift://tpcds-parq.bdcstorage/parq.db
...
[opc@node01 ~]$ odcp hdfs:///user/hive/warehouse/csv.db/ swift://tpcds-parq.bdcstorage/csv.db
Now we have the data in Object Store:
[opc@node01 ~]$  hadoop fs -du -h  swift://tpcds-parq.bdcstorage/parq.db
...
74.2 K   74.2 K   swift://tpcds-parq.bdcstorage/parq.db/store
14.4 G   14.4 G   swift://tpcds-parq.bdcstorage/parq.db/store_returns
272.8 G  272.8 G  swift://tpcds-parq.bdcstorage/parq.db/store_sales
466.1 K  466.1 K  swift://tpcds-parq.bdcstorage/parq.db/time_dim
...
This is a good time to define the tables in the Hive Metastore. I'll show the example for just one table; the rest I did with a script:
0: jdbc:hive2://node03:10000/default> CREATE EXTERNAL TABLE store_sales
( ss_sold_date_sk           bigint
, ss_sold_time_sk           bigint
, ss_item_sk                bigint
, ss_customer_sk            bigint
, ss_cdemo_sk               bigint
, ss_hdemo_sk               bigint
, ss_addr_sk                bigint
, ss_store_sk               bigint
, ss_promo_sk               bigint
, ss_ticket_number          bigint
, ss_quantity               int
, ss_wholesale_cost         double
, ss_list_price             double
, ss_sales_price            double
, ss_ext_discount_amt       double
, ss_ext_sales_price        double
, ss_ext_wholesale_cost     double
, ss_ext_list_price         double
, ss_ext_tax                double
, ss_coupon_amt             double
, ss_net_paid               double
, ss_net_paid_inc_tax       double
, ss_net_profit             double
)
STORED AS PARQUET
LOCATION 'swift://tpcds-parq.bdcstorage/parq.db/store_sales'
Make sure that you have the required libraries in place for Hive and for Spark:
[opc@node01 ~]$ dcli -C cp /opt/oracle/bda/bdcs/bdcs-rest-api-app/current/lib-hadoop/hadoop-openstack-spoc-2.7.2.jar /opt/cloudera/parcels/CDH-5.13.1-1.cdh5.13.1.p0.2/bin/../lib/hadoop-mapreduce/
[opc@node01 ~]$ dcli -C cp /opt/oracle/bda/bdcs/bdcs-rest-api-app/current/lib-hadoop/hadoop-openstack-spoc-2.7.2.jar /opt/cloudera/parcels/SPARK2-2.2.0.cloudera2-1.cdh5.12.0.p0.232957/lib/spark2/jars/
Now we are ready for the test!
Why should you use smart data formats? Predicate Push Down
In the Big Data world there is a class of file formats often called smart formats (for example, ORC and Parquet). They carry metadata inside the file, which can dramatically speed up some queries. The most powerful feature is predicate pushdown, which filters data in the place where it actually lives, without moving it over the network. Each Parquet page stores minimum and maximum values, which allows entire pages to be skipped. The following SQL predicates can be used for filtering data: < <= = != >= >. It's better to see this once than to hear about it many times, so let's run some queries.
0: jdbc:hive2://node03:10000/default> select count(1) from parq_swift.store_sales;
...
+-------------+--+
|     _c0     |
+-------------+--+
| 6385178703  |
+-------------+--+
1 row selected (339.221 seconds)
If we take a look at the resource utilization, we can see that the network is quite heavily utilized. Now let's try the same with the CSV files:
0: jdbc:hive2://node03:10000/default> select count(1) from csv_swift.store_sales;
+-------------+--+
|     _c0     |
+-------------+--+
| 6385178703  |
+-------------+--+
1 row selected (762.38 seconds)
We see the same picture - high network utilization - but the query takes even longer. That's because CSV is a row format and we cannot do column pruning.
Now let's feel the power of predicate pushdown and use an equality predicate in the query:
0: jdbc:hive2://node03:10000/default> select count(1) from parq_swift.store_sales where ss_ticket_number=50940847;
...
+------+--+
| _c0  |
+------+--+
| 6    |
+------+--+
1 row selected (74.689 seconds)
Now we can see that with Parquet files we barely utilize the network. Let's see how it goes with the CSV files:
0: jdbc:hive2://node03:10000/default> select count(1) from csv_swift.store_sales where ss_ticket_number=50940847;
...
+------+--+
| _c0  |
+------+--+
| 6    |
+------+--+
1 row selected (760.682 seconds)
As expected, CSV files don't get any benefit from the WHERE predicate. But not all functions can be offloaded. To illustrate this, I'll run a query with a cast function over the Parquet files:
0: jdbc:hive2://node03:10000/default> select count(1) from parq_swift.store_sales where cast(ss_promo_sk as string) like '%3303%';
...
+---------+--+
|   _c0   |
+---------+--+
| 959269  |
+---------+--+
1 row selected (133.829 seconds)
As we can see, in this case we move part of the data set to the BDCS instance and process it there.
Column projection
Another feature of Parquet is its columnar format, which means the fewer columns we use, the less data we bring back to BDCS. Let me illustrate this by running the same query with one column and with 24 columns (I'll use the cast function, which is not pushed down):
0: jdbc:hive2://node03:10000/default> select ss_ticket_number from parq_swift.store_sales
. . . . . . . . . . . . . . . . . . . . > where
. . . . . . . . . . . . . . . . . . . . > cast(ss_ticket_number as string) like '%50940847%';
...
127 rows selected (128.887 seconds)
Now I run the query over the same data, but request 24 columns:
0: jdbc:hive2://node03:10000/default> select
. . . . . . . . . . . . . . . . . . . . > ss_sold_date_sk
. . . . . . . . . . . . . . . . . . . . > ,ss_sold_time_sk
. . . . . . . . . . . . . . . . . . . . > ,ss_item_sk
. . . . . . . . . . . . . . . . . . . . > ,ss_customer_sk
. . . . . . . . . . . . . . . . . . . . > ,ss_cdemo_sk
. . . . . . . . . . . . . . . . . . . . > ,ss_hdemo_sk
. . . . . . . . . . . . . . . . . . . . > ,ss_addr_sk
. . . . . . . . . . . . . . . . . . . . > ,ss_store_sk
. . . . . . . . . . . . . . . . . . . . > ,ss_promo_sk
. . . . . . . . . . . . . . . . . . . . > ,ss_ticket_number
. . . . . . . . . . . . . . . . . . . . > ,ss_quantity
. . . . . . . . . . . . . . . . . . . . > ,ss_wholesale_cost
. . . . . . . . . . . . . . . . . . . . > ,ss_list_price
. . . . . . . . . . . . . . . . . . . . > ,ss_sales_price
. . . . . . . . . . . . . . . . . . . . > ,ss_ext_discount_amt
. . . . . . . . . . . . . . . . . . . . > ,ss_ext_sales_price
. . . . . . . . . . . . . . . . . . . . > ,ss_ext_wholesale_cost
. . . . . . . . . . . . . . . . . . . . > ,ss_ext_list_price
. . . . . . . . . . . . . . . . . . . . > ,ss_ext_tax
. . . . . . . . . . . . . . . . . . . . > ,ss_coupon_amt
. . . . . . . . . . . . . . . . . . . . > ,ss_net_paid
. . . . . . . . . . . . . . . . . . . . > ,ss_net_paid_inc_tax
. . . . . . . . . . . . . . . . . . . . > ,ss_net_profit
. . . . . . . . . . . . . . . . . . . . > from parq_swift.store_sales
. . . . . . . . . . . . . . . . . . . . > where
. . . . . . . . . . . . . . . . . . . . > cast(ss_ticket_number as string) like '%50940847%';
...
127 rows selected (333.641 seconds)
I think after seeing these numbers you will always select only the columns you need.
Object Store vs HDFS performance
Now I'm going to show some performance numbers for Object Store and for HDFS. This is not an official benchmark, just numbers that should give you an idea of how Object Store performance compares with HDFS.
Querying Object Store with Spark SQL
As a bonus, I'd like to show how to query the object store with Spark SQL.
[opc@node01 ~]$ spark2-shell
....
scala> import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.SparkSession

scala> val warehouseLocation = "file:${system:user.dir}/spark-warehouse"
warehouseLocation: String = file:${system:user.dir}/spark-warehouse

scala> val spark = SparkSession.builder().appName("SparkSessionZipsExample").config("spark.sql.warehouse.dir", warehouseLocation).enableHiveSupport().getOrCreate()
18/07/09 05:36:32 WARN sql.SparkSession$Builder: Using an existing SparkSession; some configuration may not take effect.
spark: org.apache.spark.sql.SparkSession = org.apache.spark.sql.SparkSession@631c244c

scala> spark.catalog.listDatabases.show(false)
+----------+---------------------+----------------------------------------------------+
|name      |description          |locationUri                                         |
+----------+---------------------+----------------------------------------------------+
|csv       |null                 |hdfs://bdcstest-ns/user/hive/warehouse/csv.db       |
|csv_swift |null                 |hdfs://bdcstest-ns/user/hive/warehouse/csv_swift.db |
|default   |Default Hive database|hdfs://bdcstest-ns/user/hive/warehouse              |
|parq      |null                 |hdfs://bdcstest-ns/user/hive/warehouse/parq.db      |
|parq_swift|null                 |hdfs://bdcstest-ns/user/hive/warehouse/parq_swift.db|
+----------+---------------------+----------------------------------------------------+

scala> spark.catalog.listTables.show(false)
+--------------------+--------+-----------+---------+-----------+
|name                |database|description|tableType|isTemporary|
+--------------------+--------+-----------+---------+-----------+
|customer_demographic|default |null       |EXTERNAL |false      |
|iris_hive           |default |null       |MANAGED  |false      |
+--------------------+--------+-----------+---------+-----------+

scala> val resultsDF = spark.sql("select count(1) from parq_swift.store_sales where cast(ss_promo_sk as string) like '%3303%' ")
resultsDF: org.apache.spark.sql.DataFrame = [count(1): bigint]

scala> resultsDF.show()
[Stage 1:==>                                                  (104 + 58) / 2255]
In fact, there is no difference for Spark SQL between Swift and HDFS; all the performance considerations I mentioned above still apply.
Parquet files. Warning!
After looking at these results you may want to convert everything to Parquet files, but don't rush to do so. Parquet is schema-on-write, which means you effectively do ETL when you convert data to it. ETL brings optimization, but also the possibility of making a mistake during the transformation. Here is an example. I have a table with timestamps, which obviously cannot be less than 0:
hive> create table tweets_parq
      ( username  string,
        tweet     string,
        TIMESTAMP smallint
      )
      STORED AS PARQUET;

hive> INSERT OVERWRITE TABLE tweets_parq select * from  tweets_flex;
We defined the timestamp as smallint, which is not enough for some of the data:
hive> select TIMESTAMP from tweets_parq
...
------------
1472648470
-6744
and as a consequence we got an overflow and ended up with negative timestamps. Converting to a smart format such as Parquet is a transformation, and during that transformation you can make a mistake; that is why it's better to also preserve the data in its original format.
Conclusion
1) Object Store is not a competitor to HDFS. HDFS is a schema-on-read system, which can give you good performance (though definitely lower than a schema-on-write system, such as a database). Object Store itself gives you elasticity. It's a good option for historical data that you plan to use only infrequently.
2) Object Store adds significant startup overhead, so it's not suitable for interactive queries.
3) If you put data on Object Store, consider using smart file formats such as Parquet. They give you the benefits of predicate pushdown as well as column projection.
4) Converting to a smart format such as Parquet is a transformation, and during that transformation you can make a mistake; that is why it's better to also preserve the data in its original format.
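As a quick addendum to the warning above (this sketch is not from the original post): the fix for that particular mistake is simply to pick a type wide enough for the data before converting - for example a bigint, assuming the values are epoch-style timestamps.
hive> create table tweets_parq
      ( username  string,
        tweet     string,
        TIMESTAMP bigint    -- wide enough for epoch timestamps, unlike smallint
      )
      STORED AS PARQUET;
hive> INSERT OVERWRITE TABLE tweets_parq select * from tweets_flex;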

What is Object Store? Object Store is becoming a more and more popular storage type, especially in the cloud. It provides benefits such as: - Elasticity. Customers don't have to plan ahead for how much space they...

Big Data

Start Planning your Upgrade Strategy to Cloudera 6 on Oracle Big Data Now

Last week Cloudera announced the general availability of Cloudera CDH 6 (read more from Cloudera here). With that, many of the ecosystem components switched to a newer base version, which should provide significant benefits for customer applications. This post describes Oracle's strategy to support our customers in taking up C6 quickly and efficiently, with minimal disruption to their infrastructure. The Basics One of the key differences with C6 is its set of core component versions, which are summarized here for everyone's benefit:
Apache Hadoop 3.0
Apache Hive 2.1
Apache Parquet 1.9
Apache Spark 2.2
Apache Solr 7.0
Apache Kafka 1.0
Apache Sentry 2.0
Cloudera Manager 6.0
Cloudera Navigator 6.0
and much more... for full details, always check the Cloudera download bundles or Oracle's documentation. Now what does this all mean for Oracle's Big Data platform (cloud and on-premises) customers? Upgrading the Platform This is the part where running Big Data Cloud Service, Big Data Appliance and Big Data Cloud at Customer makes a big difference. As with minor updates, where we move the entire stack (OS, JDK, MySQL, Cloudera CDH and everything else), we will also do this for your CDH 5.x to CDH 6.x move. What to expect:
Target Version: CDH 6.0.1, which at the point of writing this post has not been released.
Target Dates: November 2018, with a dependency on the actual 6.0.1 release date.
Automated Upgrade: Yes - as with minor releases, CDH and the entire stack (OS, MySQL, JDK) will be upgraded using the Mammoth utility.
As always, Oracle is building this all in house, and we are testing the migration across a number of scenarios for technical correctness. Application Impact The first thing to start planning for is what a version uptick like this means for your applications. Will everything work nicely as before? Well, that is where the hard work comes in: testing the actual applications on a C6 version. In general, we would recommend configuring a small BDA/BDCS/BDCC cluster, loading some data (also note the paragraph below on Erasure Coding in that respect) and then doing the appropriate functional testing. Once that is all running satisfactorily and per your expectations, you would start to upgrade existing clusters. What about Erasure Coding? This is the big feature that will become available in the 6.1 timeframe. Just to be clear, Erasure Coding is not in the first versions supported by Cloudera. Therefore it will also not be supported on the Oracle platforms, which are based on 6.0.1 (note the 0 in the middle :-) ). As usual, once 6.1 is available, Oracle will offer that as a release to upgrade to, and we will at that time address the details around Erasure Coding, how to get there, and how to leverage it on the Oracle Big Data solutions. To give everyone a quick 10,000-foot guideline: keep using regular block encoding (the current HDFS structure) for best performance, and use Erasure Coding for storage savings, while understanding that more network traffic can impact raw performance. Do I have to Move? No. You do not have to move to CDH 6, nor do you need to switch to Erasure Coding. We do expect one more 5.x release, most likely 5.16, and will release this on our platforms as well. That is of course a fully supported release. It is then - generally speaking - up to your timelines to move to the C6 platform. As we move closer to the C6 release on BDA, BDCS and BDCC we will provide updates on specific versions to migrate from, dates, timelines etc.
Should you have questions, contact us in the big data community. The preceding is intended to outline our general product direction. It is intended for information purposes only, and may not be incorporated into any contract. It is not a commitment to deliver any material, code, or functionality, and should not be relied upon in making purchasing decisions. The development, release, and timing of any features or functionality described for Oracle’s products remains at the sole discretion of Oracle.  

Last week Cloudera announced the general availability of Cloudera CDH 6 (read more from Cloudera here). With that, many of the ecosystem components switched to a newer base version, which should...

Big Data

Roadmap Update: BDA 4.13 with Cloudera CDH 5.15.x

In our continued quest to keep all of you informed about the latest versions, releases and approximate timelines, here is the next update. BDA 4.13 will have the following features, versions etc.: Of course an update to CDH. In this case we are going to uptake CDH 5.15.0. However, the release date of 5.15.1 is pretty close to our planned date, and so we may choose to pick up that version. We are adding the following features:
Support for SHA-2 certificates with Kerberos
Upgrade and Expand for Kafka clusters (the create was introduced in BDA 4.12)
A disk shredding utility, with which you can easily "erase" data on the disks. We expect most customers to use this on cloud nodes
Support for Active Directory in Big Data Manager
We obviously will update the JDK and the Linux OS to the latest version, as well as apply the latest security updates. Same for MySQL. Then there is of course the important question of timelines. Right now - subject to change and the below-mentioned safe harbor statement - we are looking at mid August as the planned date, assuming we are going with 5.15.0. If you are interested in discussing or checking up on the dates and features, or have other questions, see our new community, or visit the same community using the direct link to our community home. As always, feedback and comments are welcome. Please note the Safe Harbor Statement below: The preceding is intended to outline our general product direction. It is intended for information purposes only, and may not be incorporated into any contract. It is not a commitment to deliver any material, code, or functionality, and should not be relied upon in making purchasing decisions. The development, release, and timing of any features or functionality described for Oracle’s products remains at the sole discretion of Oracle.

BDA 4.13 will have the following features, versions etc.: Of course an update to CDH. In this case we are going to uptake CDH 5.15.0. However, the release date of 5.15.1 is pretty close to our planned...

Big Data

Need Help with Oracle Big Data: Here is Where to Go!

We are very excited to announce our newly launched big data community on the Cloud Customer Connect Community. As of today we are live and ready to help, discuss and ensure your questions are answered and your comments are taken on board. How do you find us? Easy: go to the community main page, then do the following to start asking your questions on big data. Once you are here, click on the thing you want from us - in this case I would assume that you want answers to some of your questions. So, click on Answers in the menu bar, and then on Platform (PaaS). From there, just look in the data management section and choose Big Data. All set... now you are ready - provided you are a member of course - to start asking questions and, if you know some answers, helping others in the community. What do we cover in this community? Great question. Since the navigation and the title allude to Cloud, you would expect we cover our cloud service. And that is correct. But, because we are Oracle, we do have a wide portfolio, and you will have questions around an entire ecosystem of tools, utilities and solutions, as well as potentially architecture questions and ideas. So, rather than limiting questions, ideas and thoughts, we figured we would broaden the scope to what we think the community will be discussing. And so here are some of the things we hope we can cover:
Big Data Cloud Service (BDCS) - of course
The Cloudera stack included
Specific cloud features like: Bursting/shrinking, One-click Secure Clusters, Easy Upgrade, Networking / Port Management and more...
Big Data Spatial and Graph, which is included in BDCS
Big Data Connectors and ODI, also included in BDCS
Big Data Manager and its notebook feature (Zeppelin based) and other cool features
Big Data SQL Cloud Service and of course the general software features in Big Data SQL
Big Data Best Practices
Architecture Patterns and Reference Architectures
Configuration and Tuning / Setup
When to use what tools or technologies
Service and Product roadmaps and announcements
And more
Hopefully that will trigger all of you (and us) to collaborate, discuss and make our community a fun and helpful one. Who is on here from Oracle? Well, hopefully a lot of people will join us, both from Oracle and from customers, partners and universities/schools. But we, as the product development team, will be manning the front lines. So you have product management resources, some architects and some developers working in the community. And with that, see you all soon in the community!

We are very excited to announce our newly launched big data community on the Cloud Customer Connect Community. As of today we are live and ready to help, discuss and ensure your questions are answered...

Big Data SQL

Big Data SQL Quick Start. Kerberos - Part 26

In the Hadoop world, Kerberos is the de facto standard for securing a cluster, so it goes without saying that Big Data SQL has to support Kerberos. Oracle has good documentation on how to install Big Data SQL over a Kerberized cluster, but today I'd like to show a couple of typical steps for testing and debugging a Kerberized installation. First of all, a few words about the test environment. It has 4 nodes: 3 nodes for the Hadoop cluster (vm0[1-3]) and one for the database (vm04). Kerberos tickets should be initiated from a keytab file, which has to be present on the database side (in case of RAC, on each database node) and on each Hadoop node. Let's check that we have a valid Kerberos ticket on the database node:
[oracle@vm04 ~]$ id
uid=500(oracle) gid=500(oinstall) groups=500(oinstall),501(dba)
[oracle@scaj0602bda09vm04 ~]$ klist
Ticket cache: FILE:/tmp/krb5cc_500
Default principal: oracle/martybda@MARTYBDA.ORACLE.COM

Valid starting     Expires            Service principal
07/23/18 01:15:58  07/24/18 01:15:58  krbtgt/MARTYBDA.ORACLE.COM@MARTYBDA.ORACLE.COM
    renew until 07/30/18 01:15:01
Let's check that we have access to HDFS from the database host:
[oracle@vm04 ~]$ cd $ORACLE_HOME/bigdatasql
[oracle@vm04 bigdatasql]$ ls -l|grep hadoop*env
-rw-r--r-- 1 oracle oinstall 2249 Jul 12 15:41 hadoop_martybda.env
[oracle@vm04 bigdatasql]$ source hadoop_martybda.env
[oracle@vm04 bigdatasql]$ hadoop fs -ls
...
Found 4 items
drwx------   - oracle hadoop          0 2018-07-13 06:00 .Trash
drwxr-xr-x   - oracle hadoop          0 2018-07-12 05:10 .sparkStaging
drwx------   - oracle hadoop          0 2018-07-12 05:17 .staging
drwxr-xr-x   - oracle hadoop          0 2018-07-12 05:14 oozie-oozi
[oracle@vm04 bigdatasql]$
Everything seems OK. Let's do the same from a Hadoop node:
[root@vm01 ~]# su - oracle
[oracle@scaj0602bda09vm01 ~]$ id
uid=1000(oracle) gid=1001(oinstall) groups=1001(oinstall),127(hive),1002(dba)
[oracle@vm01 ~]$ klist
Ticket cache: FILE:/tmp/krb5cc_1000
Default principal: oracle/martybda@MARTYBDA.ORACLE.COM

Valid starting     Expires            Service principal
07/23/18 01:15:02  07/24/18 01:15:02  krbtgt/MARTYBDA.ORACLE.COM@MARTYBDA.ORACLE.COM
    renew until 07/30/18 01:15:02
Let's also check that we have access to the environment and create a test file on HDFS:
[oracle@vm01 ~]$ echo "line1" >> test.txt
[oracle@vm01 ~]$ echo "line2" >> test.txt
[oracle@vm01 ~]$ hadoop fs -mkdir /tmp/test_bds
[oracle@vm01 ~]$ hadoop fs -put test.txt /tmp/test_bds
Now let's jump to the database node and create an external table over this file:
[root@vm04 bin]# su - oracle
[oracle@vm04 ~]$ . oraenv <<< orcl
ORACLE_SID = [oracle] ? The Oracle base has been set to /u03/app/oracle
[oracle@vm04 ~]$ sqlplus / as sysdba

SQL*Plus: Release 12.1.0.2.0 Production on Mon Jul 23 06:39:06 2018

Copyright (c) 1982, 2014, Oracle.  All rights reserved.

Connected to:
Oracle Database 12c Enterprise Edition Release 12.1.0.2.0 - 64bit Production
With the Partitioning, OLAP, Advanced Analytics and Real Application Testing options

SQL> alter session set container=PDBORCL;

Session altered.

SQL> CREATE TABLE bds_test (line VARCHAR2(4000))
     ORGANIZATION EXTERNAL
     ( TYPE ORACLE_HDFS
       DEFAULT DIRECTORY DEFAULT_DIR
       LOCATION ('/tmp/test_bds')
     )
     REJECT LIMIT UNLIMITED;

Table created.
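A side note that is not part of the original post: the DDL above relies on the default Hadoop cluster registered with Big Data SQL. If more than one cluster is registered, the cluster can be named explicitly through an access parameter. The following is only a sketch (table name and parameter usage are illustrative), so check the Big Data SQL reference for the exact syntax in your release.
SQL> -- Hypothetical variant of the table above with the cluster named explicitly.
SQL> CREATE TABLE bds_test_explicit (line VARCHAR2(4000))
     ORGANIZATION EXTERNAL
     ( TYPE ORACLE_HDFS
       DEFAULT DIRECTORY DEFAULT_DIR
       ACCESS PARAMETERS (com.oracle.bigdata.cluster=martybda)
       LOCATION ('/tmp/test_bds')
     )
     REJECT LIMIT UNLIMITED;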
And sure enough, this is the two-row file we created in the previous step:
SQL> select * from bds_test;

LINE
------------------------------------
line1
line2
Now let's go through some typical Kerberos failure cases and talk about how to recognize them.
Kerberos ticket missing on the database side
Let's simulate the case where the Kerberos ticket is missing on the database side. It's pretty easy; we just use the kdestroy command:
[oracle@vm04 ~]$ kdestroy
[oracle@vm04 ~]$ klist
klist: No credentials cache found (ticket cache FILE:/tmp/krb5cc_500)
extproc caches the Kerberos ticket, so to make our change take effect we need to restart extproc. First, we need to find the name of the extproc:
[oracle@vm04 admin]$ cd $ORACLE_HOME/hs/admin
[oracle@vm04 admin]$ ls -l
total 24
-rw-r--r-- 1 oracle oinstall 1170 Mar 27 01:04 extproc.ora
-rw-r----- 1 oracle oinstall 3112 Jul 12 15:56 initagt.dat
-rw-r--r-- 1 oracle oinstall  190 Jul 12 15:41 initbds_orcl_martybda.ora
-rw-r--r-- 1 oracle oinstall  489 Mar 27 01:04 initdg4odbc.ora
-rw-r--r-- 1 oracle oinstall  406 Jul 12 15:11 listener.ora.sample
-rw-r--r-- 1 oracle oinstall  244 Jul 12 15:11 tnsnames.ora.sample
The name consists of the database SID and the Hadoop cluster name, so our extproc name is bds_orcl_martybda. Let's stop and start it:
[oracle@vm04 admin]$ mtactl stop bds_orcl_martybda

ORACLE_HOME = "/u03/app/oracle/12.1.0/dbhome_orcl"
MTA init file = "/u03/app/oracle/12.1.0/dbhome_orcl/hs/admin/initbds_orcl_martybda.ora"

oracle 16776 1 0 Jul12 ? 00:49:25 extprocbds_orcl_martybda -mt
Stopping MTA process "extprocbds_orcl_martybda -mt"...

MTA process "extprocbds_orcl_martybda -mt" stopped!

[oracle@vm04 admin]$ mtactl start bds_orcl_martybda

ORACLE_HOME = "/u03/app/oracle/12.1.0/dbhome_orcl"
MTA init file = "/u03/app/oracle/12.1.0/dbhome_orcl/hs/admin/initbds_orcl_martybda.ora"

MTA process "extprocbds_orcl_martybda -mt" is not running!

Checking MTA init parameters...

[O]  INIT_LIBRARY=$ORACLE_HOME/lib/libkubsagt12.so
[O]  INIT_FUNCTION=kubsagtMTAInit
[O]  BDSQL_CLUSTER=martybda
[O]  BDSQL_CONFIGDIR=/u03/app/oracle/12.1.0/dbhome_orcl/bigdatasql/databases/orcl/bigdata_config

MTA process "extprocbds_orcl_martybda -mt" started!
oracle 19498 1 4 06:58 ? 00:00:00 extprocbds_orcl_martybda -mt
Now the Kerberos ticket cache has been cleared and extproc restarted. Let's try to query the HDFS data:
[oracle@vm04 admin]$ sqlplus / as sysdba

SQL*Plus: Release 12.1.0.2.0 Production on Mon Jul 23 07:00:26 2018

Copyright (c) 1982, 2014, Oracle.  All rights reserved.

Connected to:
Oracle Database 12c Enterprise Edition Release 12.1.0.2.0 - 64bit Production
With the Partitioning, OLAP, Advanced Analytics and Real Application Testing options

SQL> alter session set container=PDBORCL;

Session altered.

SQL> select * from bds_test;
select * from bds_test
*
ERROR at line 1:
ORA-29913: error in executing ODCIEXTTABLEOPEN callout
ORA-29400: data cartridge error
KUP-11504: error from external driver: java.lang.Exception: Error initializing JXADProvider: Failed on local exception: java.io.IOException: javax.security.sasl.SaslException: GSS initiate failed [Caused by GSSException: No valid credentials provided (Mechanism level: Failed to find any Kerberos tgt)]; Host Details : local host is: "m04.vm.oracle.com/192.168.254.5"; destination host is: "vm02.vm.oracle.com":8020;
Remember this error: if you see it, it means that you don't have a valid Kerberos ticket on the database side. Let's bring everything back and make sure that our environment works properly again.
Let's bring everything back and make sure that our environment works properly again:

[oracle@vm04 admin]$ crontab -l
15 1,7,13,19 * * * /bin/su - oracle -c "/usr/bin/kinit oracle/martybda@MARTYBDA.ORACLE.COM -k -t /u03/app/oracle/12.1.0/dbhome_orcl/bigdatasql/clusters/martybda/oracle.keytab"
[oracle@vm04 admin]$ /usr/bin/kinit oracle/martybda@MARTYBDA.ORACLE.COM -k -t /u03/app/oracle/12.1.0/dbhome_orcl/bigdatasql/clusters/martybda/oracle.keytab
[oracle@vm04 admin]$ klist 
Ticket cache: FILE:/tmp/krb5cc_500
Default principal: oracle/martybda@MARTYBDA.ORACLE.COM

Valid starting     Expires            Service principal
07/23/18 07:03:46  07/24/18 07:03:46  krbtgt/MARTYBDA.ORACLE.COM@MARTYBDA.ORACLE.COM
    renew until 07/30/18 07:03:46
[oracle@vm04 admin]$ mtactl stop bds_orcl_martybda
...
[oracle@vm04 admin]$ mtactl start bds_orcl_martybda
...
[oracle@scaj0602bda09vm04 admin]$ sqlplus / as sysdba
...
SQL> alter session set container=PDBORCL;

Session altered.

SQL> select * from bds_test;

LINE
----------------------------------------
line1
line2

SQL> 

Kerberos ticket missing on the Hadoop side

The other case is when the Kerberos ticket is missing on the Hadoop side (for the oracle user). Let's take a look at what happens then. Again I'll use the kdestroy command, this time on each Hadoop node:

[oracle@vm01 ~]$ id
uid=1000(oracle) gid=1001(oinstall) groups=1001(oinstall),127(hive),1002(dba)
[oracle@vm01 ~]$ kdestroy

After performing these steps on every Hadoop node, let's go back to the database side and run the query again:

[oracle@vm04 bigdata_config]$ sqlplus / as sysdba
...
SQL> alter session set container=PDBORCL;

Session altered.

SQL> select * from bds_test;

LINE
----------------------------------------
line1
line2

SQL> 

At first glance everything looks fine, but let's check the execution statistics:

SQL> select n.name, s.value /* , s.inst_id, s.sid */ from v$statname n, gv$mystat s where n.name like '%XT%' and s.statistic# = n.statistic#;

NAME                                                              VALUE
---------------------------------------------------------------- ----------
cell XT granules requested for predicate offload                 1
cell XT granule bytes requested for predicate offload            12
cell interconnect bytes returned by XT smart scan                8192
cell XT granule predicate offload retries                        3
cell XT granule IO bytes saved by storage index                  0
cell XT granule IO bytes saved by HDFS tbs extent map scan       0

"cell XT granule predicate offload retries" is not equal to 0, which means that all the real processing happened on the database side. If you query a 10TB table on HDFS, you will bring all 10TB back over the wire and process it on the database side. Not good. So, if the Kerberos ticket is missing on the Hadoop side, the query will finish, but Smart Scan will not work.

Renewal of Kerberos tickets

One of the key Kerberos principles is that tickets have an expiration time and have to be renewed. During installation Big Data SQL creates a crontab job which does this on the database side as well as on the Hadoop side.
If it is missing for some reason, you can use this one as an example:

[oracle@vm04 ~]$ crontab -l
15 1,7,13,19 * * * /bin/su - oracle -c "/usr/bin/kinit oracle/martybda@MARTYBDA.ORACLE.COM -k -t /u03/app/oracle/12.1.0/dbhome_orcl/bigdatasql/clusters/martybda/oracle.keytab"

One note: you will always use the oracle principal for Big Data SQL, but if you want fine-grained control over access to HDFS, you have to use the Multi-User Authorization feature, as explained here.

Conclusion
1) Big Data SQL works over Kerberized clusters.
2) You have to have Kerberos tickets on the database side as well as on the Hadoop side.
3) If the Kerberos ticket is missing on the database side, the query will fail.
4) If the Kerberos ticket is missing on the Hadoop side, the query will not fail, but it will run in fallback mode, moving all the blocks over the wire to the database node and processing them there. You don't want to do that :)
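Since points 2-4 come down to "keep a valid ticket on every node", a quick health check can save some head scratching before you start digging into ORA errors or offload statistics. Here is a minimal sketch of such a check; it assumes the dcli utility (available on Big Data Appliance / Big Data Cloud Service) and that it is run as root, so treat it as an illustration rather than a supported tool.

#!/bin/bash
# check_bds_tickets.sh - sketch: verify that the oracle user holds a valid Kerberos ticket
# on the local database node and on all Hadoop nodes (dcli assumed, as on BDA/BDCS).

echo "== Database node =="
su - oracle -c 'klist -s && echo "ticket OK" || echo "NO VALID TICKET - run kinit from the keytab"'

echo "== Hadoop nodes =="
dcli -C 'su - oracle -c "klist -s && echo ticket OK || echo NO VALID TICKET"'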


Hadoop Best Practices

Secure Kafka Cluster

A while ago I wrote up Oracle best practices for building a secure Hadoop cluster; you can find the details here. In that blog I intentionally didn't mention Kafka security, because the topic deserves a dedicated article. Now it's time to do that, and this post is devoted to Kafka security only.

Kafka security challenges

1) Encryption in motion. By default you communicate with a Kafka cluster over an unsecured network, and anyone who can listen to the network between your client and the Kafka cluster can read the message content. The way to avoid this is to use on-wire encryption - SSL/TLS. With SSL/TLS you encrypt the data on the wire between your client and the Kafka cluster. Communication without SSL/TLS: SSL/TLS communication:   After you enable SSL/TLS communication, writing or reading a message to/from the Kafka cluster follows this sequence of steps:

2) Authentication. Now that we encrypt traffic between client and server, there is another challenge - the server doesn't know whom it is communicating with. In other words, you have to enable a mechanism that stops UNKNOWN users from working with the cluster. The default authentication mechanism in the Hadoop world is the Kerberos protocol. Here is the workflow showing the sequence of steps needed to enable secure communication with Kafka:   Kerberos is the trusted way to authenticate users on a cluster and make sure that only known users can access it.

3) Authorization. Once users are authenticated on your cluster (so you know you are dealing with Bob or Alice), you may want to apply authorization rules, such as setting permissions for certain users or groups - in other words, define what each user can and cannot do. Sentry can help you with this. Sentry's philosophy is that users belong to groups, groups have roles, and roles have permissions.

4) Encryption at rest. The final security aspect is encryption at rest - protecting the data stored on disk. Kafka is not intended for long-term data storage, but it can keep data for days or even weeks, and we have to make sure the data stored on disk cannot be stolen and then read without the encryption key.

Security implementation. Step 1 - SSL/TLS

There is no strict sequence of steps for a security implementation, but as a first step I recommend doing the SSL/TLS configuration. As a baseline I took Cloudera's documentation. To keep your security setup organized, create a directory on your Linux machine where you will put all the files (start with one machine; later you will need to do the same on the other Kafka servers):

$ sudo mkdir -p /opt/kafka/security
$ sudo chown -R kafka:kafka /opt/kafka/security

A Java KeyStore (JKS) is a repository of security certificates - either authorization certificates or public key certificates - plus the corresponding private keys, used for instance in SSL encryption. We need to generate a key pair (a public key and the associated private key). keytool wraps the public key into an X.509 self-signed certificate, which is stored as a single-element certificate chain; this certificate chain and the private key are stored in a new keystore entry identified by the alias selfsigned.
# keytool -genkeypair -keystore keystore.jks -keyalg RSA -alias selfsigned -dname "CN=localhost" -storepass 'welcome2' -keypass 'welcome3'

If you want to check the content of the keystore, you can run the following command:

# keytool -list -v -keystore keystore.jks
...
Alias name: selfsigned
Creation date: May 30, 2018
Entry type: PrivateKeyEntry
Certificate chain length: 1
Certificate[1]:
Owner: CN=localhost
Issuer: CN=localhost
Serial number: 2065847b
Valid from: Wed May 30 12:59:54 UTC 2018 until: Tue Aug 28 12:59:54 UTC 2018
...

As the next step, we need to extract a copy of the certificate from the Java keystore that was just created:

# keytool -export -alias selfsigned -keystore keystore.jks -rfc -file server.cer
Enter keystore password: welcome2

Then create a trust store by making a copy of the default Java trust store. The main difference between a trustStore and a keyStore is that the trustStore (as the name suggests) stores certificates from trusted Certificate Authorities (CAs), which are used to verify the certificate presented by the server in an SSL connection, while the keyStore stores the private key and its own identity certificate, which the program presents to the other party (server or client) to verify its identity. You can find more details here. In my case, on Big Data Cloud Service, I ran the following command:

# cp /usr/java/latest/jre/lib/security/cacerts /opt/kafka/security/truststore.jks

Check that the files are in place:

# ls -lrt
-rw-r--r-- 1 root root 113367 May 30 12:46 truststore.jks
-rw-r--r-- 1 root root   2070 May 30 12:59 keystore.jks
-rw-r--r-- 1 root root   1039 May 30 13:01 server.cer

Put the certificate that was just extracted from the keystore into the trust store (note: "changeit" is the standard password):

# keytool -import -alias selfsigned -file server.cer -keystore truststore.jks -storepass changeit

Check the file size afterwards (the trust store is bigger because it now includes the new certificate):

# ls -lrt
-rw-r--r-- 1 root root   2070 May 30 12:59 keystore.jks
-rw-r--r-- 1 root root   1039 May 30 13:01 server.cer
-rw-r--r-- 1 root root 114117 May 30 13:06 truststore.jks

This may seem complicated, so I decided to depict all of these steps in one diagram. So far, all of these steps have been performed on a single (randomly chosen broker) machine.
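To make the single-machine part easier to reproduce, here it is collected into one minimal sketch. The passwords and the CN=localhost subject are the sample values used above; in a real deployment you would use proper host names (or SANs) and stronger passwords.

#!/bin/bash
# prepare_kafka_ssl.sh - sketch of the keystore/truststore preparation described above.
# Sample passwords and CN=localhost come from the example; adjust before real use.
set -e

SEC_DIR=/opt/kafka/security
STOREPASS='welcome2'
KEYPASS='welcome3'
TRUSTPASS='changeit'

mkdir -p "${SEC_DIR}" && cd "${SEC_DIR}"

# 1. Generate a key pair and a self-signed certificate inside the keystore
keytool -genkeypair -keystore keystore.jks -keyalg RSA -alias selfsigned \
        -dname "CN=localhost" -storepass "${STOREPASS}" -keypass "${KEYPASS}"

# 2. Export the certificate from the keystore
keytool -export -alias selfsigned -keystore keystore.jks -rfc -file server.cer \
        -storepass "${STOREPASS}"

# 3. Seed the trust store from the default JDK trust store and import our certificate
cp /usr/java/latest/jre/lib/security/cacerts truststore.jks
keytool -import -noprompt -alias selfsigned -file server.cer \
        -keystore truststore.jks -storepass "${TRUSTPASS}"

chown -R kafka:kafka "${SEC_DIR}"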
But you will need the keystore and truststore files on each Kafka broker, so let's copy them (note: this syntax works on Big Data Appliance, Big Data Cloud Service and Big Data Cloud at Customer):

# dcli -C "mkdir -p /opt/kafka/security"
# dcli -C "chown kafka:kafka /opt/kafka/security"
# dcli -C -f /opt/kafka/security/keystore.jks -d /opt/kafka/security/keystore.jks
# dcli -C -f /opt/kafka/security/truststore.jks -d /opt/kafka/security/truststore.jks

After doing all these steps, you need to make some configuration changes in Cloudera Manager for each node (go to Cloudera Manager -> Kafka -> Configuration). In addition, on each node you have to change the listeners in "Kafka Broker Advanced Configuration Snippet (Safety Valve) for kafka.properties". Also make sure that in Cloudera Manager security.inter.broker.protocol is set to SSL. After a node restart, when all brokers are up and running, let's test it:

# openssl s_client -debug -connect kafka1.us2.oraclecloud.com:9093 -tls1_2
...
Certificate chain
 0 s:/CN=localhost
   i:/CN=localhost
---
Server certificate
-----BEGIN CERTIFICATE-----
MIICxzCCAa+gAwIBAgIEIGWEezANBgkqhkiG9w0BAQsFADAUMRIwEAYDVQQDEwls
b2NhbGhvc3QwHhcNMTgwNTMwMTI1OTU0WhcNMTgwODI4MTI1OTU0WjAUMRIwEAYD
VQQDEwlsb2NhbGhvc3QwggEiMA0GCSqGSIb3DQEBAQUAA4IBDwAwggEKAoIBAQCI
53T82eoDR2e9IId40UPTj3xg3khl1jdjNvMiuB/vcI7koK0XrZqFzMVo6zBzRHnf
zaFBKPAQisuXpQITURh6jrVgAs1V4hswRPrJRjM/jCIx7S5+1INBGoEXk8OG+OEf
m1uYXfULz0bX9fhfl+IdKzWZ7jiX8FY5dC60Rx2RTpATWThsD4mz3bfNd3DlADw2
LH5B5GAGhLqJjr23HFjiTuoQWQyMV5Esn6WhOTPCy1pAkOYqX86ad9qP500zK9lA
hynyEwNHWt6GoHuJ6Q8A9b6JDyNdkjUIjbH+d0LkzpDPg6R8Vp14igxqxXy0N1Sd
DKhsV90F1T0whlxGDTZTAgMBAAGjITAfMB0GA1UdDgQWBBR1Gl9a0KZAMnJEvxaD
oY0YagPKRTANBgkqhkiG9w0BAQsFAAOCAQEAaiNdHY+QVdvLSILdOlWWv653CrG1
2WY3cnK5Hpymrg0P7E3ea0h3vkGRaVqCRaM4J0MNdGEgu+xcKXb9s7VrwhecRY6E
qN0KibRZPb789zQVOS38Y6icJazTv/lSxCRjqHjNkXhhzsD3tjAgiYnicFd6K4XZ
rQ1WiwYq1254e8MsKCVENthQljnHD38ZDhXleNeHxxWtFIA2FXOc7U6iZEXnnaOM
Cl9sHx7EaGRc2adIoE2GXFNK7BY89Ip61a+WUAOn3asPebrU06OAjGGYGQnYbn6k
4VLvneMOjksuLdlrSyc5MToBGptk8eqJQ5tyWV6+AcuwHkTAnrztgozatg==
-----END CERTIFICATE-----
subject=/CN=localhost
issuer=/CN=localhost
---
No client certificate CA names sent
Server Temp Key: ECDH, secp521r1, 521 bits
---
SSL handshake has read 1267 bytes and written 441 bytes
---
New, TLSv1/SSLv3, Cipher is ECDHE-RSA-AES256-GCM-SHA384
Server public key is 2048 bit
Secure Renegotiation IS supported
Compression: NONE
Expansion: NONE
SSL-Session:
    Protocol  : TLSv1.2
    Cipher    : ECDHE-RSA-AES256-GCM-SHA384
    Session-ID: 5B0EAC6CA8FB4B6EA3D0B4A494A4660351A4BD5824A059802E399308C0B472A4
    Session-ID-ctx:
    Master-Key: 60AE24480E2923023012A464D16B13F954A390094167F54CECA1BDCC8485F1E776D01806A17FB332C51FD310730191FE
    Key-Arg   : None
    Krb5 Principal: None
    PSK identity: None
    PSK identity hint: None
    Start Time: 1527688300
    Timeout   : 7200 (sec)
    Verify return code: 18 (self signed certificate)

Well, it seems our SSL connection is up and running. Time to try putting some messages into a topic:

#  kafka-console-producer  --broker-list kafka1.us2.oraclecloud.com:9093  --topic foobar
...
18/05/30 13:56:28 WARN clients.NetworkClient: Connection to node -1 could not be established. Broker may not be available.
18/05/30 13:56:28 WARN clients.NetworkClient: Connection to node -1 could not be established. Broker may not be available.

The reason for this error is that the client is not configured properly yet. We need to create and use client.properties and jaas.conf files:

# cat /opt/kafka/security/client.properties
security.protocol=SSL
ssl.truststore.location=/opt/kafka/security/truststore.jks
ssl.truststore.password=changeit

-bash-4.1# cat jaas.conf
KafkaClient {
      com.sun.security.auth.module.Krb5LoginModule required
      useTicketCache=true;
    };

# export KAFKA_OPTS="-Djava.security.auth.login.config=/opt/kafka/security/jaas.conf"

Now you can try to produce messages again:

# kafka-console-producer --broker-list kafka1.us2.oraclecloud.com:9093  --topic foobar --producer.config client.properties
...
Hello SSL world

No errors - that's already good! Let's try to consume the message:

# kafka-console-consumer --bootstrap-server kafka1.us2.oraclecloud.com:9093 --topic foobar --from-beginning  --consumer.config /opt/kafka/security/client.properties
...
Hello SSL world

Bingo! We have established secure communication between the Kafka cluster and the Kafka client and written a message over it.

Security implementation. Step 2 - Kerberos

So far, Kafka is running on a Kerberized cluster, yet we just wrote and read data from the cluster without a Kerberos ticket:

$ klist
klist: No credentials cache found (ticket cache FILE:/tmp/krb5cc_1001)

This is not how it is supposed to work - if the cluster is protected by Kerberos, it should be impossible to do anything without a ticket. Fortunately, it is relatively easy to configure communication with a Kerberized Kafka cluster. First, make sure that you have enabled Kerberos authentication in Cloudera Manager (Cloudera Manager -> Kafka -> Configuration). Second, go back to Cloudera Manager and change the value of security.inter.broker.protocol to SASL_SSL.

Note: Simple Authentication and Security Layer (SASL) is a framework for authentication and data security in Internet protocols. It decouples authentication mechanisms from application protocols, in theory allowing any authentication mechanism supported by SASL to be used in any application protocol that uses SASL. Very roughly, for the purposes of this blog post you may treat SASL as equivalent to Kerberos.

After this change, you will need to modify the listeners protocol on each broker (to SASL_SSL) in the "Kafka Broker Advanced Configuration Snippet (Safety Valve) for kafka.properties" setting. Now you are ready to restart the Kafka cluster and write/read data to/from it. Before doing that, modify the Kafka client credentials:

$ cat /opt/kafka/security/client.properties
security.protocol=SASL_SSL
sasl.kerberos.service.name=kafka
ssl.truststore.location=/opt/kafka/security/truststore.jks
ssl.truststore.password=changeit

After this you can try to read data from the Kafka cluster:

$ kafka-console-consumer --bootstrap-server kafka1.us2.oraclecloud.com:9093 --topic foobar --from-beginning  --consumer.config /opt/kafka/security/client.properties
...
Caused by: org.apache.kafka.common.KafkaException: javax.security.auth.login.LoginException: Could not login: the client is being asked for a password, but the Kafka client code does not currently support obtaining a password from the user. not available to garner  authentication information from the user
...

The error message may mislead you; the real reason is the absence of a Kerberos ticket:

$ klist
klist: No credentials cache found (ticket cache FILE:/tmp/krb5cc_1001)
$ kinit oracle
Password for oracle@BDACLOUDSERVICE.ORACLE.COM:
$ kafka-console-consumer --bootstrap-server kafka1.us2.oraclecloud.com:9093 --topic foobar --from-beginning  --consumer.config /opt/kafka/security/client.properties
...
Hello SSL world

Great, it works! But now we would have to run kinit every time before reading or writing data to the Kafka cluster. For convenience we can use a keytab instead. To do this, go to the KDC server and generate a keytab file there:

# kadmin.local
Authenticating as principal hdfs/admin@BDACLOUDSERVICE.ORACLE.COM with password.
kadmin.local: xst -norandkey -k testuser.keytab testuser
Entry for principal oracle with kvno 2, encryption type aes256-cts-hmac-sha1-96 added to keytab WRFILE:oracle.keytab.
Entry for principal oracle with kvno 2, encryption type aes128-cts-hmac-sha1-96 added to keytab WRFILE:oracle.keytab.
Entry for principal oracle with kvno 2, encryption type des3-cbc-sha1 added to keytab WRFILE:oracle.keytab.
Entry for principal oracle with kvno 2, encryption type arcfour-hmac added to keytab WRFILE:oracle.keytab.
Entry for principal oracle with kvno 2, encryption type des-hmac-sha1 added to keytab WRFILE:oracle.keytab.
Entry for principal oracle with kvno 2, encryption type des-cbc-md5 added to keytab WRFILE:oracle.keytab.
kadmin.local:  quit
# ls -l
...
-rw-------  1 root root    436 May 31 14:06 testuser.keytab
...

Now that we have the keytab file, we can copy it to the client machine and use it for Kerberos authentication. Don't forget to change the owner of the keytab file to the user who will run the scripts:

$ chown opc:opc /opt/kafka/security/testuser.keytab

We also need to modify the jaas.conf file:

$ cat /opt/kafka/security/jaas.conf
KafkaClient {
      com.sun.security.auth.module.Krb5LoginModule required
      useKeyTab=true
      keyTab="/opt/kafka/security/testuser.keytab"
      principal="testuser@BDACLOUDSERVICE.ORACLE.COM";
    };

Now we are fully ready to consume messages from the topic. Even though oracle is the Kerberos principal on the OS, we connect to the cluster as testuser (according to jaas.conf):

$ kafka-console-consumer --bootstrap-server kafka1.us2.oraclecloud.com:9093 --topic foobar --from-beginning  --consumer.config /opt/kafka/security/client.properties
...
18/05/31 15:04:45 INFO authenticator.AbstractLogin: Successfully logged in.
18/05/31 15:04:45 INFO kerberos.KerberosLogin: [Principal=testuser@BDACLOUDSERVICE.ORACLE.COM]: TGT refresh thread started.
...
Hello SSL world

Security implementation. Step 3 - Sentry

One step ago we configured authentication, which answers the question "who am I?". Now it is time to set up an authorization mechanism, which answers the question "what am I allowed to do?". Sentry has become a very popular engine in the Hadoop world and we will use it for Kafka authorization. As I wrote earlier, Sentry's philosophy is that users belong to groups, groups have roles, and roles have permissions. We will follow this model with Kafka as well. But we start with some service configuration first (Cloudera Manager -> Kafka -> Configuration). It is also very important to add the kafka user to "sentry.service.admin.group" in the Sentry configuration (Cloudera Manager -> Sentry -> Configuration).

Now that we know who connects to the cluster, we can restrict him or her from reading certain topics (in other words, perform authorization).

Note: to perform administrative operations with Sentry, you have to work as the kafka user:

$ id
uid=1001(opc) gid=1005(opc) groups=1005(opc)
$ sudo find /var -name kafka*keytab -printf "%T+\t%p\n" | sort|tail -1|cut -f 2
/var/run/cloudera-scm-agent/process/1171-kafka-KAFKA_BROKER/kafka.keytab
$ sudo cp /var/run/cloudera-scm-agent/process/1171-kafka-KAFKA_BROKER/kafka.keytab /opt/kafka/security/kafka.keytab
$ sudo chown opc:opc /opt/kafka/security/kafka.keytab

Obtain a Kafka ticket:

$ kinit -kt /opt/kafka/security/kafka.keytab kafka/`hostname`
$ klist
Ticket cache: FILE:/tmp/krb5cc_1001
Default principal: kafka/kafka1.us2.oraclecloud.com@BDACLOUDSERVICE.ORACLE.COM

Valid starting     Expires            Service principal
05/31/18 15:52:28  06/01/18 15:52:28  krbtgt/BDACLOUDSERVICE.ORACLE.COM@BDACLOUDSERVICE.ORACLE.COM
    renew until 06/05/18 15:52:28

Before configuring and testing Sentry with Kafka, we need to create an unprivileged user to whom we will give grants (the kafka user is privileged and bypasses Sentry).
There are a few simple steps. Create the test (unprivileged) user on each Hadoop node (this syntax works on Big Data Appliance, Big Data Cloud Service and Big Data Cloud at Customer):

# dcli -C "useradd testsentry -u 1011"

Remember that Sentry relies heavily on groups, so we have to create one and put the "testsentry" user in it:

# dcli -C "groupadd testsentry_grp -g 1017"

After the group has been created, put the user into it:

# dcli -C "usermod -g testsentry_grp testsentry"

Check that everything is as we expect:

# dcli -C "id testsentry"
10.196.64.44: uid=1011(testsentry) gid=1017(testsentry_grp) groups=1017(testsentry_grp)
10.196.64.60: uid=1011(testsentry) gid=1017(testsentry_grp) groups=1017(testsentry_grp)
10.196.64.64: uid=1011(testsentry) gid=1017(testsentry_grp) groups=1017(testsentry_grp)
10.196.64.65: uid=1011(testsentry) gid=1017(testsentry_grp) groups=1017(testsentry_grp)
10.196.64.61: uid=1011(testsentry) gid=1017(testsentry_grp) groups=1017(testsentry_grp)

Note: the user ID and group ID have to be the same on every machine. Now verify that Hadoop can look up the group:

# hdfs groups testsentry
testsentry : testsentry_grp

All these steps have to be performed as root. Next, create a testsentry principal in the KDC (not strictly mandatory, but it keeps things organized and easy to understand). Go to the KDC host and run the following commands:

# kadmin.local 
Authenticating as principal root/admin@BDACLOUDSERVICE.ORACLE.COM with password. 
kadmin.local:  addprinc testsentry
WARNING: no policy specified for testsentry@BDACLOUDSERVICE.ORACLE.COM; defaulting to no policy
Enter password for principal "testsentry@BDACLOUDSERVICE.ORACLE.COM": 
Re-enter password for principal "testsentry@BDACLOUDSERVICE.ORACLE.COM": 
Principal "testsentry@BDACLOUDSERVICE.ORACLE.COM" created.
kadmin.local:  xst -norandkey -k testsentry.keytab testsentry
Entry for principal testsentry with kvno 1, encryption type aes256-cts-hmac-sha1-96 added to keytab WRFILE:testsentry.keytab.
Entry for principal testsentry with kvno 1, encryption type aes128-cts-hmac-sha1-96 added to keytab WRFILE:testsentry.keytab.
Entry for principal testsentry with kvno 1, encryption type des3-cbc-sha1 added to keytab WRFILE:testsentry.keytab.
Entry for principal testsentry with kvno 1, encryption type arcfour-hmac added to keytab WRFILE:testsentry.keytab.
Entry for principal testsentry with kvno 1, encryption type des-hmac-sha1 added to keytab WRFILE:testsentry.keytab.
Entry for principal testsentry with kvno 1, encryption type des-cbc-md5 added to keytab WRFILE:testsentry.keytab.

Now the unprivileged user is fully set up and it is time to configure Sentry policies. Since kafka is a superuser, we run the admin commands as the kafka user; to obtain its credentials we run:

$ kinit -kt /opt/kafka/security/kafka.keytab kafka/`hostname`
$ klist 
Ticket cache: FILE:/tmp/krb5cc_1001
Default principal: kafka/kafka1.us2.oraclecloud.com@BDACLOUDSERVICE.ORACLE.COM

Valid starting     Expires            Service principal
06/15/18 01:37:53  06/16/18 01:37:53  krbtgt/BDACLOUDSERVICE.ORACLE.COM@BDACLOUDSERVICE.ORACLE.COM
    renew until 06/20/18 01:37:53

First we need to create a role. Let's call it testsentry_role:

$ kafka-sentry -cr -r testsentry_role

Let's check that the role has been created:

$ kafka-sentry -lr
...
admin_role
testsentry_role
[opc@cfclbv3872 ~]$ 

As soon as the role is created, we need to give it permissions on a certain topic: write, describe and (since our test user will also consume from it) read:

$ kafka-sentry -gpr -r testsentry_role -p "Host=*->Topic=testTopic->action=write"
$ kafka-sentry -gpr -r testsentry_role -p "Host=*->Topic=testTopic->action=describe"
$ kafka-sentry -gpr -r testsentry_role -p "Host=*->Topic=testTopic->action=read"

Next, we have to allow a consumer group to read and describe from this topic:

$ kafka-sentry -gpr -r testsentry_role -p "Host=*->Consumergroup=testconsumergroup->action=read"
$ kafka-sentry -gpr -r testsentry_role -p "Host=*->Consumergroup=testconsumergroup->action=describe"

The next step is linking the role and the group: we assign testsentry_role to testsentry_grp (the group automatically inherits all of the role's permissions):

$ kafka-sentry -arg -r testsentry_role -g testsentry_grp

After this, let's check that our mapping worked:

$ kafka-sentry -lr -g testsentry_grp
...
testsentry_role

Now let's review the list of permissions that our role has:

$ kafka-sentry -r testsentry_role -lp
...
HOST=*->CONSUMERGROUP=testconsumergroup->action=read
HOST=*->TOPIC=testTopic->action=write
HOST=*->TOPIC=testTopic->action=describe
HOST=*->TOPIC=testTopic->action=read

It is also very important to have the consumer group in the client properties file:

$ cat /opt/kafka/security/client.properties
security.protocol=SASL_SSL
sasl.kerberos.service.name=kafka
ssl.truststore.location=/opt/kafka/security/truststore.jks
ssl.truststore.password=changeit
group.id=testconsumergroup

With everything set up, we switch to the testsentry user for testing:

$ kinit -kt /opt/kafka/security/testsentry.keytab testsentry
$ klist 
Ticket cache: FILE:/tmp/krb5cc_1001
Default principal: testsentry@BDACLOUDSERVICE.ORACLE.COM

Valid starting     Expires            Service principal
06/15/18 01:38:49  06/16/18 01:38:49  krbtgt/BDACLOUDSERVICE.ORACLE.COM@BDACLOUDSERVICE.ORACLE.COM
    renew until 06/22/18 01:38:49

Test writes:

$ kafka-console-producer --broker-list kafka1.us2.oraclecloud.com:9093 --topic testTopic --producer.config /opt/kafka/security/client.properties
...
> testmessage1
> testmessage2
>

Everything seems fine; now let's test a read:

$ kafka-console-consumer --bootstrap-server kafka1.us2.oraclecloud.com:9093 --topic testTopic --from-beginning  --consumer.config /opt/kafka/security/client.properties
...
testmessage1
testmessage2

Now, to show Sentry in action, I'll try to read messages from another topic, one that is outside the topics allowed for our test group:

$ kafka-console-consumer --from-beginning --bootstrap-server kafka1.us2.oraclecloud.com:9093 --topic foobar --consumer.config /opt/kafka/security/client.properties
...
18/06/15 02:54:54 INFO internals.AbstractCoordinator: (Re-)joining group testconsumergroup
18/06/15 02:54:54 WARN clients.NetworkClient: Error while fetching metadata with correlation id 13 : {foobar=UNKNOWN_TOPIC_OR_PARTITION}
18/06/15 02:54:54 WARN clients.NetworkClient: Error while fetching metadata with correlation id 15 : {foobar=UNKNOWN_TOPIC_OR_PARTITION}
18/06/15 02:54:54 WARN clients.NetworkClient: Error while fetching metadata with correlation id 16 : {foobar=UNKNOWN_TOPIC_OR_PARTITION}
18/06/15 02:54:54 WARN clients.NetworkClient: Error while fetching metadata with correlation id 17 : {foobar=UNKNOWN_TOPIC_OR_PARTITION}

So, as we can see, we cannot read from a topic that we are not authorized to read.
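Granting access for every new application comes down to the same handful of kafka-sentry calls, so it can be convenient to wrap them in a small function. The sketch below simply replays the commands shown above with the role, group, topic and consumer group passed as parameters; it assumes you already hold the kafka (Sentry admin) Kerberos ticket.

#!/bin/bash
# grant_kafka_access.sh - sketch wrapping the kafka-sentry grants shown above.
# Usage: ./grant_kafka_access.sh <role> <os_group> <topic> <consumer_group>
# Assumes a valid kafka (Sentry admin) Kerberos ticket is already in the ticket cache.
ROLE=$1; GROUP=$2; TOPIC=$3; CGROUP=$4

kafka-sentry -cr -r "$ROLE" 2>/dev/null || true   # create the role if it does not exist yet

# producer and consumer privileges on the topic
for ACTION in write describe read; do
  kafka-sentry -gpr -r "$ROLE" -p "Host=*->Topic=${TOPIC}->action=${ACTION}"
done

# consumer group privileges
for ACTION in read describe; do
  kafka-sentry -gpr -r "$ROLE" -p "Host=*->Consumergroup=${CGROUP}->action=${ACTION}"
done

# link the role to the OS/Hadoop group and show the result
kafka-sentry -arg -r "$ROLE" -g "$GROUP"
kafka-sentry -r "$ROLE" -lp

For example: ./grant_kafka_access.sh testsentry_role testsentry_grp testTopic testconsumergroup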
To systemize all of this, I'd like to put the user-group-role-privileges flow in one picture, and also summarize the steps required to get the list of privileges for a certain user (testsentry in my example):

// Run as the superuser - kafka
$ kinit -kt /opt/kafka/security/kafka.keytab kafka/`hostname`
$ klist 
Ticket cache: FILE:/tmp/krb5cc_1001
Default principal: kafka/cfclbv3872.us2.oraclecloud.com@BDACLOUDSERVICE.ORACLE.COM

Valid starting     Expires            Service principal
06/19/18 02:38:26  06/20/18 02:38:26  krbtgt/BDACLOUDSERVICE.ORACLE.COM@BDACLOUDSERVICE.ORACLE.COM
    renew until 06/24/18 02:38:26

// Get the list of groups a certain user belongs to
$ hdfs groups testsentry
testsentry : testsentry_grp

// Get the list of roles for a certain group
$ kafka-sentry -lr -g testsentry_grp
...
  testsentry_role

// Get the list of permissions for a certain role
$ kafka-sentry -r testsentry_role -lp
...
HOST=*->CONSUMERGROUP=testconsumergroup->action=read
HOST=*->TOPIC=testTopic->action=describe
HOST=*->TOPIC=testTopic->action=write
HOST=*->TOPIC=testTopic->action=read
HOST=*->CONSUMERGROUP=testconsumergroup->action=describe

Based on what we saw above, our user testsentry can read and write to the topic testTopic; to read data he has to belong to the consumer group "testconsumergroup".

Security implementation. Step 4 - Encryption at rest

The last part of the security journey is encryption of the data you store on disk. There are multiple ways to do this; one of the most common is Navigator Encrypt.


Big Data

Big Data SQL 3.2.1 is Now Available

Just wanted to give a quick update. I am pleased to announce that Oracle Big Data SQL 3.2.1 is now available. This release provides support for Oracle Database 12.2.0.1. Here are some key details:

Existing customers using Big Data SQL 3.2 do not need to take this update; Oracle Database 12.2.0.1 support is the reason for the update.
Big Data SQL 3.2.1 can be used for both Oracle Database 12.1.0.2 and 12.2.0.1 deployments.
For Oracle Database 12.2.0.1, Big Data SQL 3.2.1 requires the April Release Update plus the Big Data SQL 3.2.1 one-off patch.
The software is available on ARU. The Big Data SQL 3.2.1 installer will be available on edelivery soon.
Big Data SQL 3.2.1 Installer (Patch 28071671). Note, this is the complete installer; it is not a patch.
Oracle Database 12.2.0.1 April Release Update (Patch 27674384). Ensure your Grid Infrastructure is also on the 12.2.0.1 April Release Update (if you are using GI).
Big Data SQL 3.2.1 one-off on top of the April RU (Patch 26170659). Ensure you pick the appropriate release on the download page. This patch must be applied to each database server and to Grid Infrastructure.

Also, check out the new Big Data SQL Tutorial series on the Oracle Learning Library. The series includes numerous videos that help you understand Big Data SQL capabilities. It includes:

Introducing the Oracle Big Data Lite Virtual Machine and Hadoop
Introduction to Oracle Big Data SQL
Hadoop and Big Data SQL Architectures
Oracle Big Data SQL Performance Features
Information Lifecycle Management


Event Hub Cloud Service. Hello world

In earlier days I wrote a blog about the Oracle Reference Architecture and the concepts of Schema on Read and Schema on Write. Schema on Read is well suited to a Data Lake, which can ingest any data as it is, without any transformation, and preserve it for a long period of time. At the same time you have two types of data - streaming data and batch. Batch could be log files or RDBMS archives. Streaming data could be IoT, sensors, or Golden Gate replication logs. Apache Kafka is a very popular engine for acquiring streaming data. It has multiple advantages, like scalability, fault tolerance and high throughput. Unfortunately, Kafka is hard to manage. Fortunately, the Cloud simplifies many routine operations. Oracle has three options for deploying Kafka in the Cloud:

1) Use Big Data Cloud Service, where you get a full Cloudera cluster and can deploy Apache Kafka as part of CDH.
2) Event Hub Cloud Service Dedicated. Here you have to specify server shapes and some other parameters, but the rest is done by the Cloud automagically.
3) Event Hub Cloud Service. This service is fully managed by Oracle; you don't even need to specify any compute shapes. The only things to decide are how long you need to store data in the topic and how many partitions you need (partitions = performance).

Today I'm going to tell you about the last option, the fully managed cloud service. It's really easy to provision: just log in to your Cloud account and choose the "Event Hub" Cloud service, then open the service console. Next, click on "Create service". Fill in the parameters - the two key ones are Retention period and Number of partitions. The first defines how long messages will be stored, the second defines the performance of read and write operations. Click next, confirm and wait a while (usually not more than a few minutes); after a short while you will be able to see the provisioned service.

Hello world flow

Today I want to show a "Hello world" flow: how to produce (write) and consume (read) a message with Event Hub Cloud Service. The flow is, step by step:

1) Obtain an OAuth token
2) Produce a message to a topic
3) Create a consumer group
4) Subscribe to the topic
5) Consume the message

Now I'm going to show it in some detail.

OAuth and Authentication token (Step 1)

To work with Event Hub Cloud Service you have to be familiar with the concepts of OAuth and OpenID. If you are not, you can watch the short video or go through this step-by-step tutorial. In a few words, OAuth token authorization (it tells what I am allowed to access) is a method of restricting access to resources. One of the main ideas is to decouple the user (a real human - the Resource Owner) from the Application (the Client). The human knows the login and password, but the Client (Application) does not use them every time it needs to reach the Resource Server (which holds some info or content). Instead, the Application obtains an Authorization token once and uses it for working with the Resource Server. This is brief; here you may find a more detailed explanation of what OAuth is.

Obtain a token for an Event Hub Cloud Service client

As you can see, to get access to the Resource Server (read: Event Hub messages) you need to obtain an authorization token from the Authorization Server (read: IDCS). Here I'd like to show the step-by-step flow for obtaining this token.
I will start from the end and show the command (REST call) that you have to run to get the token:

#!/bin/bash
curl -k -X POST -u "$CLIENT_ID:$CLIENT_SECRET" \
-d "grant_type=password&username=$THEUSERNAME&password=$THEPASSWORD&scope=$THESCOPE" \
"$IDCS_URL/oauth2/v1/token" \
-o access_token.json

As you can see, many parameters are required to obtain the OAuth token. Let's take a look at where you can get them. Go to the service, click on the topic you want to work with, and there you will find the IDCS Application; click on it. You will be redirected to the IDCS Application page, where most of the credentials can be found. Click on Configuration: on this page you will immediately find the Client ID and Client Secret (think of them as a login and password). Scroll down to the section called Resources, click on it, and you will find another two variables you need for the OAuth token - Scope and Primary Audience. One more required parameter, IDCS_URL, you can find in your browser address bar. Now you have almost everything you need, except the login and password; these are your Oracle Cloud login and password (what you use to log in to http://myservices.us.oraclecloud.com). With all the required credentials in hand, you are ready to write a script that automates all of this:

#!/bin/bash
export CLIENT_ID=7EA06D3A99D944A5ADCE6C64CCF5C2AC_APPID
export CLIENT_SECRET=0380f967-98d4-45e9-8f9a-45100f4638b2
export THEUSERNAME=john.dunbar
export THEPASSWORD=MyPassword
export SCOPE=/idcs-1d6cc7dae45b40a1b9ef42c7608b9afe-oehtest
export PRIMARY_AUDIENCE=https://7EA06D3A99D944A5ADCE6C64CCF5C2AC.uscom-central-1.oraclecloud.com:443
export THESCOPE=$PRIMARY_AUDIENCE$SCOPE
export IDCS_URL=https://idcs-1d6cc7dae45b40a1b9ef42c7608b9afe.identity.oraclecloud.com
curl -k -X POST -u "$CLIENT_ID:$CLIENT_SECRET" \
-d "grant_type=password&username=$THEUSERNAME&password=$THEPASSWORD&scope=$THESCOPE" \
"$IDCS_URL/oauth2/v1/token" \
-o access_token.json

After running this script, you will have a new file called access_token.json.
Field access_token it's what you need: $ cat access_token.json {"access_token":"eyJ4NXQjUzI1NiI6InVUMy1YczRNZVZUZFhGbXFQX19GMFJsYmtoQjdCbXJBc3FtV2V4U2NQM3MiLCJ4NXQiOiJhQ25HQUpFSFdZdU9tQWhUMWR1dmFBVmpmd0UiLCJraWQiOiJTSUdOSU5HX0tFWSIsImFsZyI6IlJTMjU2In0.eyJ1c2VyX3R6IjoiQW1lcmljYVwvQ2hpY2FnbyIsInN1YiI6ImpvaG4uZHVuYmFyIiwidXNlcl9sb2NhbGUiOiJlbiIsInVzZXJfZGlzcGxheW5hbWUiOiJKb2huIER1bmJhciIsInVzZXIudGVuYW50Lm5hbWUiOiJpZGNzLTFkNmNjN2RhZTQ1YjQwYTFiOWVmNDJjNzYwOGI5YWZlIiwic3ViX21hcHBpbmdhdHRyIjoidXNlck5hbWUiLCJpc3MiOiJodHRwczpcL1wvaWRlbnRpdHkub3JhY2xlY2xvdWQuY29tXC8iLCJ0b2tfdHlwZSI6IkFUIiwidXNlcl90ZW5hbnRuYW1lIjoiaWRjcy0xZDZjYzdkYWU0NWI0MGExYjllZjQyYzc2MDhiOWFmZSIsImNsaWVudF9pZCI6IjdFQTA2RDNBOTlEOTQ0QTVBRENFNkM2NENDRjVDMkFDX0FQUElEIiwiYXVkIjpbInVybjpvcGM6bGJhYXM6bG9naWNhbGd1aWQ9N0VBMDZEM0E5OUQ5NDRBNUFEQ0U2QzY0Q0NGNUMyQUMiLCJodHRwczpcL1wvN0VBMDZEM0E5OUQ5NDRBNUFEQ0U2QzY0Q0NGNUMyQUMudXNjb20tY2VudHJhbC0xLm9yYWNsZWNsb3VkLmNvbTo0NDMiXSwidXNlcl9pZCI6IjM1Yzk2YWUyNTZjOTRhNTQ5ZWU0NWUyMDJjZThlY2IxIiwic3ViX3R5cGUiOiJ1c2VyIiwic2NvcGUiOiJcL2lkY3MtMWQ2Y2M3ZGFlNDViNDBhMWI5ZWY0MmM3NjA4YjlhZmUtb2VodGVzdCIsImNsaWVudF90ZW5hbnRuYW1lIjoiaWRjcy0xZDZjYzdkYWU0NWI0MGExYjllZjQyYzc2MDhiOWFmZSIsInVzZXJfbGFuZyI6ImVuIiwiZXhwIjoxNTI3Mjk5NjUyLCJpYXQiOjE1MjY2OTQ4NTIsImNsaWVudF9ndWlkIjoiZGVjN2E4ZGRhM2I4NDA1MDgzMjE4NWQ1MzZkNDdjYTAiLCJjbGllbnRfbmFtZSI6Ik9FSENTX29laHRlc3QiLCJ0ZW5hbnQiOiJpZGNzLTFkNmNjN2RhZTQ1YjQwYTFiOWVmNDJjNzYwOGI5YWZlIiwianRpIjoiMDkwYWI4ZGYtNjA0NC00OWRlLWFjMTEtOGE5ODIzYTEyNjI5In0.aNDRIM5Gv_fx8EZ54u4AXVNG9B_F8MuyXjQR-vdyHDyRFxTefwlR3gRsnpf0GwHPSJfZb56wEwOVLraRXz1vPHc7Gzk97tdYZ-Mrv7NjoLoxqQj-uGxwAvU3m8_T3ilHthvQ4t9tXPB5o7xPII-BoWa-CF4QC8480ThrBwbl1emTDtEpR9-4z4mm1Ps-rJ9L3BItGXWzNZ6PiNdVbuxCQaboWMQXJM9bSgTmWbAYURwqoyeD9gMw2JkwgNMSmljRnJ_yGRv5KAsaRguqyV-x-lyE9PyW9SiG4rM47t-lY-okMxzchDm8nco84J5XlpKp98kMcg65Ql5Y3TVYGNhTEg","token_type":"Bearer","expires_in":604800} Create Linux variable for it: #!/bin/bash export TOKEN=`cat access_token.json |jq .access_token|sed 's/\"//g'` Well, now we have Authorization token and may work with our Resource Server (Event Hub Cloud Service).  Note: you also may check documentation about how to obtain OAuth token. Produce Messages (Write data) to Kafka (Step 2) The first thing that we may want to do is produce messages (write data to a Kafka cluster). To make scripting easier, it's also better to use some environment variables for common resources. For this example, I'd recommend to parametrize topic's end point, topic name, type of content to be accepted and content type. Content type is completely up to developer, but you have to consume (read) the same format as you produce(write). The key parameter to define is REST endpoint. 
Go to PSM, click on the topic name and copy everything up to "restproxy". You will also need the topic name, which you can take from the same window. Now we can write a simple script to produce one message to Kafka:

#!/bin/bash
export OEHCS_ENDPOINT=https://oehtest-gse00014957.uscom-central-1.oraclecloud.com:443/restproxy
export TOPIC_NAME=idcs-1d6cc7dae45b40a1b9ef42c7608b9afe-oehtest
export CONTENT_TYPE=application/vnd.kafka.json.v2+json
curl -X POST \
-H "Authorization: Bearer $TOKEN" \
-H "Content-Type: $CONTENT_TYPE" \
--data '{"records":[{"value":{"foo":"bar"}}]}' \
$OEHCS_ENDPOINT/topics/$TOPIC_NAME

If everything is fine, the Linux console will return something like:

{"offsets":[{"partition":1,"offset":8,"error_code":null,"error":null}],"key_schema_id":null,"value_schema_id":null}

Create Consumer Group (Step 3)

The first step towards reading data from OEHCS is creating a consumer group. We will reuse environment variables from the previous step, but just in case I'll include them in this script:

#!/bin/bash
export OEHCS_ENDPOINT=https://oehtest-gse00014957.uscom-central-1.oraclecloud.com:443/restproxy
export CONTENT_TYPE=application/vnd.kafka.json.v2+json
export TOPIC_NAME=idcs-1d6cc7dae45b40a1b9ef42c7608b9afe-oehtest
curl -X POST \
-H "Authorization: Bearer $TOKEN" \
-H "Content-Type: $CONTENT_TYPE" \
--data '{"format": "json", "auto.offset.reset": "earliest"}' \
$OEHCS_ENDPOINT/consumers/oehcs-consumer-group \
-o consumer_group.json

This script generates an output file which contains the variables we will need to consume messages.

Subscribe to a topic (Step 4)

Now you are ready to subscribe to the topic (export the environment variables if you didn't do so before):

#!/bin/bash
export BASE_URI=`cat consumer_group.json |jq .base_uri|sed 's/\"//g'`
export TOPIC_NAME=idcs-1d6cc7dae45b40a1b9ef42c7608b9afe-oehtest
curl -X POST \
-H "Authorization: Bearer $TOKEN" \
-H "Content-Type: $CONTENT_TYPE" \
-d "{\"topics\": [\"$TOPIC_NAME\"]}" \
$BASE_URI/subscription

If everything is fine, this request will not return anything.

Consume (Read) messages (Step 5)

Finally, we come to the last step - consuming messages. Again, it's a simple curl request:

#!/bin/bash
export BASE_URI=`cat consumer_group.json |jq .base_uri|sed 's/\"//g'`
export H_ACCEPT=application/vnd.kafka.json.v2+json
curl -X GET \
-H "Authorization: Bearer $TOKEN" \
-H "Accept: $H_ACCEPT" \
$BASE_URI/records

If everything works the way it is supposed to, you will see output like:

[{"topic":"idcs-1d6cc7dae45b40a1b9ef42c7608b9afe-oehtest","key":null,"value":{"foo":"bar"},"partition":1,"offset":17}]

A consolidated sketch that chains all five steps together follows at the end of this post.

Conclusion

Today we saw how easy it is to create a fully managed Kafka topic in Event Hub Cloud Service, and we also took our first steps with it - writing and reading a message. Kafka is a really popular message bus engine, but it's hard to manage. The Cloud simplifies this and allows customers to concentrate on the development of their applications. Here I also want to give some useful links:

1) If you are not familiar with the REST API, I'd recommend going through this blog
2) There is an online tool which helps to validate your curl requests
3) Here you can find some useful examples of producing and consuming messages
4) If you are not familiar with OAuth, here is a nice tutorial which shows an end-to-end example
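For convenience, here is the consolidated sketch referenced above. It simply chains steps 1-5 (token, produce, consumer group, subscribe, consume) into one script and reuses the sample endpoints, credentials and topic name from this post, so treat all the values as placeholders for your own environment.

#!/bin/bash
# oehcs_hello_world.sh - consolidated sketch of steps 1-5 (sample values from this post).
set -e

# --- Step 1: obtain an OAuth token ---
export CLIENT_ID=7EA06D3A99D944A5ADCE6C64CCF5C2AC_APPID
export CLIENT_SECRET=0380f967-98d4-45e9-8f9a-45100f4638b2
export THEUSERNAME=john.dunbar
export THEPASSWORD=MyPassword
export THESCOPE=https://7EA06D3A99D944A5ADCE6C64CCF5C2AC.uscom-central-1.oraclecloud.com:443/idcs-1d6cc7dae45b40a1b9ef42c7608b9afe-oehtest
export IDCS_URL=https://idcs-1d6cc7dae45b40a1b9ef42c7608b9afe.identity.oraclecloud.com
curl -k -sS -X POST -u "$CLIENT_ID:$CLIENT_SECRET" \
  -d "grant_type=password&username=$THEUSERNAME&password=$THEPASSWORD&scope=$THESCOPE" \
  "$IDCS_URL/oauth2/v1/token" -o access_token.json
export TOKEN=`cat access_token.json | jq .access_token | sed 's/\"//g'`

# --- Common variables ---
export OEHCS_ENDPOINT=https://oehtest-gse00014957.uscom-central-1.oraclecloud.com:443/restproxy
export TOPIC_NAME=idcs-1d6cc7dae45b40a1b9ef42c7608b9afe-oehtest
export CONTENT_TYPE=application/vnd.kafka.json.v2+json

# --- Step 2: produce one message ---
curl -sS -X POST -H "Authorization: Bearer $TOKEN" -H "Content-Type: $CONTENT_TYPE" \
  --data '{"records":[{"value":{"foo":"bar"}}]}' \
  "$OEHCS_ENDPOINT/topics/$TOPIC_NAME"

# --- Step 3: create a consumer group ---
curl -sS -X POST -H "Authorization: Bearer $TOKEN" -H "Content-Type: $CONTENT_TYPE" \
  --data '{"format": "json", "auto.offset.reset": "earliest"}' \
  "$OEHCS_ENDPOINT/consumers/oehcs-consumer-group" -o consumer_group.json
export BASE_URI=`cat consumer_group.json | jq .base_uri | sed 's/\"//g'`

# --- Step 4: subscribe to the topic ---
curl -sS -X POST -H "Authorization: Bearer $TOKEN" -H "Content-Type: $CONTENT_TYPE" \
  -d "{\"topics\": [\"$TOPIC_NAME\"]}" "$BASE_URI/subscription"

# --- Step 5: consume messages ---
curl -sS -X GET -H "Authorization: Bearer $TOKEN" -H "Accept: $CONTENT_TYPE" \
  "$BASE_URI/records"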


Data Warehousing

Autonomous Data Warehouse is LIVE!

That’s right: Autonomous Data Warehouse Cloud is LIVE and available in the Oracle Cloud.

ADWC Launch Event at Oracle Conference Center

We had a major launch event on Thursday last week at the Oracle Conference Center in Redwood Shores which got a huge amount of press coverage. Larry Ellison delivered the main keynote, covering how our next-generation cloud service is built on the self-driving Oracle Autonomous Database technology, which leverages machine learning to deliver unprecedented performance, reliability and ease of deployment for data warehouses. As an autonomous cloud service, it eliminates error-prone manual management tasks and, most importantly for a lot of readers of this blog, frees up DBA resources, which can now be applied to implementing more strategic business projects. The key highlights of our Oracle Autonomous Data Warehouse Cloud include:

Ease of Use: Unlike traditional cloud services with complex, manual configurations that require a database expert to specify data distribution keys and sort keys, build indexes, reorganize data or adjust compression, Oracle Autonomous Data Warehouse Cloud is a simple "load and go" service. Users specify tables, load data and then run their workloads in a matter of seconds - no manual tuning is needed.

Industry-Leading Performance: Unlike traditional cloud services, which use generic compute shapes for database cloud services, Oracle Autonomous Data Warehouse Cloud is built on the high-performance Oracle Exadata platform. Performance is further enhanced by fully-integrated machine learning algorithms which drive automatic caching, adaptive indexing and advanced compression.

Instant Elasticity: Oracle Autonomous Data Warehouse Cloud allocates new data warehouses of any size in seconds and scales compute and storage resources independently of one another with no downtime. Elasticity enables customers to pay for exactly the resources that the database workloads require as they grow and shrink.

To highlight these three unique aspects of Autonomous Data Warehouse Cloud, the launch included a live, on-stage demo of ADWC and Oracle Analytics Cloud. If you have never seen a new data warehouse delivered in seconds rather than days, then pay careful attention to the demo video below, where George Lumpkin creates a new fully autonomous data warehouse with a few mouse clicks and then starts to query one of the sample schemas shipped with ADWC using OAC. Probably the most important section was the panel discussion with a handful of our early adopter customers, hosted by Steve Daheb, Senior Vice President, Oracle Cloud. As always, it’s great to hear customers talk about how the simplicity and speed of ADWC are bringing about significant changes to the way they think about their data. If you missed all the excitement, the keynote, demos and discussions, then here is some great news: we recorded everything for you, so you can watch it from the comfort of your desk. Below are the links to the three main parts of the launch:

Video: Larry Ellison, CTO and Executive Chairman, Oracle, introduces Oracle Autonomous Database Cloud. Oracle Autonomous Database Cloud eliminates complexity and human error, helping to ensure higher reliability, security, and efficiency at the lowest cost.
Video: Steve Daheb, Senior Vice President, Oracle Cloud, discusses the benefits of Oracle Autonomous Cloud Platform with Oracle customers:
- Paul Daugherty, Accenture
- Benjamin Arnulf, Hertz
- Michael Morales, QMP Health
- Al Cordoba, QLX

Video: George Lumpkin, Vice President of Product Management, Oracle, demonstrates the self-driving, self-securing, and self-repairing capabilities of Oracle Autonomous Data Warehouse Cloud.

So what's next?

So you are all fired up and you want to learn more about Autonomous Data Warehouse Cloud! Where do you go? The first place to visit is the ADWC home page on cloud.oracle.com: https://cloud.oracle.com/datawarehouse

Can I Try It? Yes you can! We have a great program that lets you get started with Oracle Cloud for free with $300 in free credits. Using your credits (which will probably last you around 30 days depending on how you configure your ADWC) you will be able to get valuable hands-on time to try loading some of your own workloads and testing integration with our other cloud services such as Analytics Cloud and Data Integration Cloud.

Are there any tutorials to help me get started? Yes there are! We have quick start tutorials covering both Autonomous Data Warehouse Cloud and our bundled SQL notebook application called Oracle Machine Learning, just click here:

Provisioning Autonomous Data Warehouse Cloud
Connecting SQL Developer and Creating Tables
Loading Your Data
Running a Query on Sample Data
Creating Projects and Workspaces in OML
Creating and Running Notebooks
Collaborating in OML
Creating a SQL Script
Running SQL Statements

Is the documentation available? Yes it is! The documentation set for ADWC is right here and the documentation set for Oracle Machine Learning is right here.

Anything else I need to know? Yes there is! Over the next few weeks I will be posting links to more videos where our ADWC customers will talk about their experiences of using ADWC during the last couple of months. There will be information about some deep-dive online tutorials that you can use as part of your free $300 trial, along with lots of other topics that are too numerous to list. If you have a burning question about Oracle Autonomous Data Warehouse Cloud then feel free to reach out to me via email: keith.laker@oracle.com


Object Store Service operations. Part 1 - Loading data

One of the most common and clear trends in the IT market is Cloud, and one of the most common and clear trends in the Cloud is Object Store. You can find some introductory information here. Many Cloud providers, including Oracle, assume that the data lifecycle starts from the Object Store: you land data there and then either read or load it with different services, such as ADWC or BDCS, for example. Oracle has two flavors of Object Store Service (OSS): OSS on OCI (Oracle Cloud Infrastructure) and OSS on OCI-C (Oracle Cloud Infrastructure Classic). In this post I'm going to focus on OSS on OCI-C, mostly because OSS on OCI was perfectly explained by Hermann Baer here and by Rachna Thusoo here.

Upload/Download files

As in Hermann's blog, I'll focus on the most frequent operations: upload and download. There are multiple ways to do them, for example:

- Oracle Cloud WebUI
- REST API
- FTM CLI tool
- Third-party tools such as CloudBerry
- Big Data Manager (via ODCP)
- Hadoop client with the Swift API
- Oracle Storage Software Appliance

Let's start with the easiest one - the web interface.

Upload/Download files. WebUI.

For sure you have to start by logging in to cloud services. Then go to the Object Store Service, drill down into the Service Console, and you will see the list of containers within your OSS. To create a new container (bucket in OCI terminology), simply click on "Create Container" and give it a name. After it has been created, click on it and go to the "Upload object" button. Click and click again, and here we are: the file is in the container. Now let's try to upload a bigger file... oops, we got an error. So it seems there is a 5GB limit. Fortunately, there is "Large object upload", which allows us to upload files bigger than 5GB. And what about downloading? It's easy: simply click download and land the file on the local file system.

Upload/Download files. REST API.

The WebUI may be a good way to upload data when a human operates it, but it's not very convenient for scripting. If you want to automate your file uploads, you can use the REST API. You can find all the details about the REST API here; alternatively, you may use the script I'm publishing below, which hints at some basic commands:

#!/bin/bash
shopt -s expand_aliases
alias echo="echo -e"
USER="alexey.filanovskiy@oracle.com"
PASS="MySecurePassword"
OSS_USER="storage-a424392:${USER}"
OSS_PASS="${PASS}"
OSS_URL="https://storage-a424392.storage.oraclecloud.com/auth/v1.0"
echo "curl -k -sS -H \"X-Storage-User: ${OSS_USER}\" -H \"X-Storage-Pass:${OSS_PASS}\" -i \"${OSS_URL}\""
out=`curl -k -sS -H "X-Storage-User: ${OSS_USER}" -H "X-Storage-Pass:${OSS_PASS}" -i "${OSS_URL}"`
while [ $? -ne 0 ]; do
  echo "Retrying to get token\n"
  sleep 1;
  out=`curl -k -sS -H "X-Storage-User: ${OSS_USER}" -H "X-Storage-Pass:${OSS_PASS}" -i "${OSS_URL}"`
done
AUTH_TOKEN=`echo "${out}" | grep "X-Auth-Token" | sed 's/X-Auth-Token: //;s/\r//'`
STORAGE_TOKEN=`echo "${out}" | grep "X-Storage-Token" | sed 's/X-Storage-Token: //;s/\r//'`
STORAGE_URL=`echo "${out}" | grep "X-Storage-Url" | sed 's/X-Storage-Url: //;s/\r//'`
echo "Token and storage URL:"
echo "\tOSS url: ${OSS_URL}"
echo "\tauth token: ${AUTH_TOKEN}"
echo "\tstorage token: ${STORAGE_TOKEN}"
echo "\tstorage url: ${STORAGE_URL}"
echo "\nContainers:"
for CONTAINER in `curl -k -sS -u "${USER}:${PASS}" "${STORAGE_URL}"`; do
  echo "\t${CONTAINER}"
done
FILE_SIZE=$((1024*1024*1))
CONTAINER="example_container"
FILE="file.txt"
LOCAL_FILE="./${FILE}"
FILE_AT_DIR="/path/file.txt"
LOCAL_FILE_AT_DIR=".${FILE_AT_DIR}"
REMOTE_FILE="${CONTAINER}/${FILE}"
REMOTE_FILE_AT_DIR="${CONTAINER}${FILE_AT_DIR}"
for f in "${LOCAL_FILE}" "${LOCAL_FILE_AT_DIR}"; do
  if [ ! -e "${f}" ]; then
    echo "\nInfo: File "${f}" does not exist. Creating ${f}"
    d=`dirname "${f}"`
    mkdir -p "${d}";
    tr -dc A-Za-z0-9 </dev/urandom | head -c "${FILE_SIZE}" > "${f}"
    #dd if="/dev/random" of="${f}" bs=1 count=0 seek=${FILE_SIZE} &> /dev/null
  fi;
done;
echo "\nActions:"
echo "\tListing containers:\t\t\t\tcurl -k -vX GET -u \"${USER}:${PASS}\" \"${STORAGE_URL}/\""
echo "\tCreate container \"oss://${CONTAINER}\":\t\tcurl -k -vX PUT -u \"${USER}:${PASS}\" \"${STORAGE_URL}/${CONTAINER}\""
echo "\tListing objects at container \"oss://${CONTAINER}\":\tcurl -k -vX GET -u \"${USER}:${PASS}\" \"${STORAGE_URL}/${CONTAINER}/\""
echo "\n\tUpload \"${LOCAL_FILE}\" to \"oss://${REMOTE_FILE}\":\tcurl -k -vX PUT -T \"${LOCAL_FILE}\" -u \"${USER}:${PASS}\" \"${STORAGE_URL}/${CONTAINER}/\""
echo "\tDownload \"oss://${REMOTE_FILE}\" to \"${LOCAL_FILE}\":\tcurl -k -vX GET -u \"${USER}:${PASS}\" \"${STORAGE_URL}/${REMOTE_FILE}\" > \"${LOCAL_FILE}\""
echo "\n\tDelete \"oss://${REMOTE_FILE}\":\tcurl -k -vX DELETE -u \"${USER}:${PASS}\" \"${STORAGE_URL}/${REMOTE_FILE}\""
echo "\ndone"

I put the content of this script into a file called oss_operations.sh, gave it execute permission and ran it:

$ chmod +x oss_operations.sh
$ ./oss_operations.sh

The output looks like this:

curl -k -sS -H "X-Storage-User: storage-a424392:alexey.filanovskiy@oracle.com" -H "X-Storage-Pass:MySecurePass" -i "https://storage-a424392.storage.oraclecloud.com/auth/v1.0"
Token and storage URL:
    OSS url: https://storage-a424392.storage.oraclecloud.com/auth/v1.0
    auth token: AUTH_tk45d49d9bcd65753f81bad0eae0aeb3db
    storage token: AUTH_tk45d49d9bcd65753f81bad0eae0aeb3db
    storage url: https://storage.us2.oraclecloud.com/v1/storage-a424392
Containers:
    123_OOW17
    1475233258815
    1475233258815-segments
    Container
...
Actions:
    Listing containers: curl -k -vX GET -u "alexey.filanovskiy@oracle.com:MySecurePassword" "https://storage.us2.oraclecloud.com/v1/storage-a424392/"
    Create container "oss://example_container": curl -k -vX PUT -u "alexey.filanovskiy@oracle.com:MySecurePassword" "https://storage.us2.oraclecloud.com/v1/storage-a424392/example_container"
    Listing objects at container "oss://example_container": curl -k -vX GET -u "alexey.filanovskiy@oracle.com:MySecurePassword" "https://storage.us2.oraclecloud.com/v1/storage-a424392/example_container/"

    Upload "./file.txt" to "oss://example_container/file.txt": curl -k -vX PUT -T "./file.txt" -u "alexey.filanovskiy@oracle.com:MySecurePassword" "https://storage.us2.oraclecloud.com/v1/storage-a424392/example_container/"
    Download "oss://example_container/file.txt" to "./file.txt": curl -k -vX GET -u "alexey.filanovskiy@oracle.com:MySecurePassword" "https://storage.us2.oraclecloud.com/v1/storage-a424392/example_container/file.txt" > "./file.txt"

    Delete "oss://example_container/file.txt": curl -k -vX DELETE -u "alexey.filanovskiy@oracle.com:MySecurePassword" "https://storage.us2.oraclecloud.com/v1/storage-a424392/example_container/file.txt"

done

Upload/Download files. FTM CLI.

The REST API may seem a bit cumbersome and hard to use, but the good news is that there is a kind of intermediate solution: the command-line interface, FTM CLI. The full documentation is available here, but I'd like to briefly explain what you can do with it. You can download it here, and after unpacking it's ready to use:

$ unzip ftmcli-v2.4.2.zip
...
$ cd ftmcli-v2.4.2
$ ls -lrt
total 120032
-rwxr-xr-x 1 opc opc 1272 Jan 29 08:42 README.txt
-rw-r--r-- 1 opc opc 15130743 Mar 7 12:59 ftmcli.jar
-rw-rw-r-- 1 opc opc 107373568 Mar 22 13:37 file.txt
-rw-rw-r-- 1 opc opc 641 Mar 23 10:34 ftmcliKeystore
-rw-rw-r-- 1 opc opc 315 Mar 23 10:34 ftmcli.properties
-rw-rw-r-- 1 opc opc 373817 Mar 23 15:24 ftmcli.log

Note the file ftmcli.properties; it can simplify your life if you configure it once. The documentation is here, and this is my example of the config:

$ cat ftmcli.properties
#saving authkey
#Fri Mar 30 21:15:25 UTC 2018
rest-endpoint=https\://storage-a424392.storage.oraclecloud.com/v1/storage-a424392
retries=5
user=alexey.filanovskiy@oracle.com
segments-container=all_segments
max-threads=15
storage-class=Standard
segment-size=100

Now we have all the connection details and can use the CLI with minimal typing. There are a few basic commands available with FTM CLI, but as a first step I suggest authenticating the user (enter the password once):

$ java -jar ftmcli.jar list --save-auth-key
Enter your password:

If you use "--save-auth-key", it saves your password and will not ask for it next time:

$ java -jar ftmcli.jar list
123_OOW17
1475233258815
...

You may refer to the documentation for the full list of commands, or simply run ftmcli without any arguments:

$ java -jar ftmcli.jar
...
Commands:
upload             Upload a file or a directory to a container.
download           Download an object or a virtual directory from a container.
create-container   Create a container.
restore            Restore an object from an Archive container.
list               List containers in the account or objects in a container.
delete             Delete a container in the account or an object in a container.
describe           Describes the attributes of a container in the account or an object in a container.
set                Set the metadata attribute(s) of a container in the account or an object in a container.
set-crp            Set a replication policy for a container.
copy               Copy an object to a destination container.
Let's try to walk through the standard flow for OSS: create a container, upload a file to it, list the objects in the container, describe the container properties and finally delete it.
# Create container
$ java -jar ftmcli.jar create-container container_for_blog
Name: container_for_blog
Object Count: 0
Bytes Used: 0
Storage Class: Standard
Creation Date: Fri Mar 30 21:50:15 UTC 2018
Last Modified: Fri Mar 30 21:50:14 UTC 2018
Metadata
---------------
x-container-write: a424392.storage.Storage_ReadWriteGroup
x-container-read: a424392.storage.Storage_ReadOnlyGroup,a424392.storage.Storage_ReadWriteGroup
content-type: text/plain;charset=utf-8
accept-ranges: bytes
Custom Metadata
---------------
x-container-meta-policy-georeplication: container
# Upload file to container
$ java -jar ftmcli.jar upload container_for_blog file.txt
Uploading file: file.txt to container: container_for_blog
File successfully uploaded: file.txt
Estimated Transfer Rate: 16484KB/s
# List files in the container
$ java -jar ftmcli.jar list container_for_blog
file.txt
# Get container metadata
$ java -jar ftmcli.jar describe container_for_blog
Name: container_for_blog
Object Count: 1
Bytes Used: 434
Storage Class: Standard
Creation Date: Fri Mar 30 21:50:15 UTC 2018
Last Modified: Fri Mar 30 21:50:14 UTC 2018
Metadata
---------------
x-container-write: a424392.storage.Storage_ReadWriteGroup
x-container-read: a424392.storage.Storage_ReadOnlyGroup,a424392.storage.Storage_ReadWriteGroup
content-type: text/plain;charset=utf-8
accept-ranges: bytes
Custom Metadata
---------------
x-container-meta-policy-georeplication: container
# Delete container
$ java -jar ftmcli.jar delete container_for_blog
ERROR:Delete failed. Container is not empty.
# Delete with force option
$ java -jar ftmcli.jar delete -f container_for_blog
Container successfully deleted: container_for_blog
Another great thing about FTM CLI is that it makes it easy to manage upload performance out of the box. In ftmcli.properties there is a property called "max-threads", which may vary between 1 and 100. Here is a test case that illustrates this:
-- Generate a 10GB file
$ dd if=/dev/zero of=file.txt count=10240 bs=1048576
-- Upload the file in one thread (around an 18MB/sec rate)
$ java -jar ftmcli.jar upload container_for_blog /home/opc/file.txt
Uploading file: /home/opc/file.txt to container: container_for_blog
File successfully uploaded: /home/opc/file.txt
Estimated Transfer Rate: 18381KB/s
-- Change the number of threads from 1 to 99 in the config file
$ sed -i -e 's/max-threads=1/max-threads=99/g' ftmcli.properties
-- Upload the file in 99 threads (around a 68MB/sec rate)
$ java -jar ftmcli.jar upload container_for_blog /home/opc/file.txt
Uploading file: /home/opc/file.txt to container: container_for_blog
File successfully uploaded: /home/opc/file.txt
Estimated Transfer Rate: 68449KB/s
So FTM CLI is a simple and at the same time powerful tool for working with the Object Store, and it can help you script your operations.
Upload/Download files. CloudBerry.
Another way to interact with OSS is to use an application; for example, you may use CloudBerry Explorer for OpenStack Storage. There is a great blog post which explains how to configure CloudBerry for Oracle Object Store Service Classic, so I will start from the point where it is already configured.
Whenever you log in it looks like this:
You may easily create a container in CloudBerry:
And, of course, you may easily copy data from your local machine to OSS:
There is not much to add here: CloudBerry is a convenient tool for browsing Object Stores and doing small copies between a local machine and OSS. To me it feels like a Total Commander for OSS.
Upload/Download files. Big Data Manager and ODCP.
Big Data Cloud Service (BDCS) has a great component called Big Data Manager. This is a tool developed by Oracle which allows you to manage and monitor a Hadoop cluster. Among other features, Big Data Manager (BDM) lets you register an Object Store in the Stores browser and easily drag and drop data between OSS and other sources (Database, HDFS...). When you copy data to or from HDFS you use ODCP, an optimized version of the Hadoop DistCp tool, which is a very fast way to copy data back and forth. Fortunately, JP has already written about this feature, so I can simply give the link. If you want to see concrete performance numbers, you can go here to the A-Team blog page. Without Big Data Manager, you can manually register OSS on a Linux machine and invoke the copy command from bash. The documentation has all the details; I will show just one example:
# add account:
$ export CM_ADMIN=admin
$ export CM_PASSWORD=SuperSecurePasswordCloderaManager
$ export CM_URL=https://cfclbv8493.us2.oraclecloud.com:7183
$ bda-oss-admin add_swift_cred --swift-username "storage-a424392:alexey.filanovskiy@oracle.com" --swift-password "SecurePasswordForSwift" --swift-storageurl "https://storage-a424392.storage.oraclecloud.com/auth/v2.0/tokens" --swift-provider bdcstorage
# list of credentials:
$ bda-oss-admin list_swift_creds
Provider: bdcstorage
    Username: storage-a424392:alexey.filanovskiy@oracle.com
    Storage URL: https://storage-a424392.storage.oraclecloud.com/auth/v2.0/tokens
# check files on OSS at swift://[container name].[provider created in the step before]/:
$ hadoop fs -ls swift://alextest.bdcstorage/
18/03/31 01:01:13 WARN http.RestClientBindings: Property fs.swift.bdcstorage.property.loader.chain is not set
Found 3 items
-rw-rw-rw- 1 279153664 2018-03-07 00:08 swift://alextest.bdcstorage/bigdata.file.copy
drwxrwxrwx - 0 2018-03-07 00:31 swift://alextest.bdcstorage/customer
drwxrwxrwx - 0 2018-03-07 00:30 swift://alextest.bdcstorage/customer_address
Now you have OSS configured and ready to use. You may copy data with ODCP; here you can find the full list of supported sources and destinations. For example, if you want to copy data from HDFS to OSS, you run:
$ odcp hdfs:///tmp/file.txt swift://alextest.bdcstorage/
ODCP is a very efficient way to move data from HDFS to the Object Store and back. If you come from the Hadoop world and are used to the Hadoop fs API, you may use it with the Object Store as well (after configuring it as above); for example, to load data into OSS you run:
$ hadoop fs -put /home/opc/file.txt swift://alextest.bdcstorage/file1.txt
Upload/Download files. Oracle Storage Cloud Software Appliance.
The Object Store is a fairly new concept, and of course there is a way to smooth the migration. Years ago, when HDFS was new and unfamiliar, many people didn't know how to work with it, and technologies such as the NFS Gateway and HDFS-fuse appeared. Both of them allowed you to mount HDFS on a Linux filesystem and work with it as with a normal filesystem. The Oracle Cloud Infrastructure Storage Software Appliance allows something similar. You can find all the documentation here, a brief video here, and the software download here.
In this blog I will just show one example of its usage. This picture helps explain how the Storage Cloud Software Appliance works: you can see that the customer needs to install an on-premise docker container which carries the whole required stack. I'll skip the details, which you can find in the documentation above, and will just show the concept.
# Check oscsa status
[on-prem client] $ oscsa info
Management Console: https://docker.oracleworld.com:32769
If you have already configured an OSCSA FileSystem via the Management Console, you can access the NFS share using the following port.
NFS Port: 32770
Example: mount -t nfs -o vers=4,port=32770 docker.oracleworld.com:/<OSCSA FileSystem name> /local_mount_point
# Run oscsa
[on-prem client] $ oscsa up
On the docker image (which you deploy on some on-premise machine) you can find a WebUI where you configure the Storage Appliance. After login, you see a list of configured Object Stores. In this console you can connect the linked container with this on-premise host; after it has been connected, you will see a "disconnect" option. After you connect a device, you have to mount it:
[on-prem client] $ sudo mount -t nfs -o vers=4,port=32770 localhost:/devoos /oscsa/mnt
[on-prem client] $ df -h|grep oscsa
localhost:/devoos 100T 1.0M 100T 1% /oscsa/mnt
Now you can upload a file into the Object Store:
[on-prem client] $ echo "Hello Oracle World" > blog.file
[on-prem client] $ cp blog.file /oscsa/mnt/
The copy to the Object Store is asynchronous, so after a while you will be able to find the file there. The only restriction that I wasn't able to overcome is that the filename changes during the copy.
Conclusion.
The Object Store is here, and it will become more and more popular. That means there is no way to escape it and you have to get familiar with it. This post showed that there are multiple ways to work with it, from user-friendly tools like CloudBerry down to the low-level REST API; a quick cheat sheet of the commands used above follows below.
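For quick reference, the approaches covered above boil down to a handful of commands. The container names, file names and URLs here are simply the examples used earlier in this post, so substitute your own:
# REST API (Swift): upload a file with curl
$ curl -k -vX PUT -T "./file.txt" -u "user:password" "${STORAGE_URL}/example_container/"
# FTM CLI: upload a file to a container
$ java -jar ftmcli.jar upload example_container file.txt
# Hadoop fs API (after registering the Swift provider)
$ hadoop fs -put /home/opc/file.txt swift://alextest.bdcstorage/file.txt
# ODCP: distributed copy between HDFS and OSS
$ odcp hdfs:///tmp/file.txt swift://alextest.bdcstorage/
# Storage Software Appliance: plain filesystem copy to the NFS mount
$ cp file.txt /oscsa/mnt/
(CloudBerry and Big Data Manager cover the same operations through their graphical interfaces.)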


Data Warehousing

Loading Data to the Object Store for Autonomous Data Warehouse Cloud

So you got your first service instance of your autonomous data warehouse set up, you experienced the performance of the environment using the sample data, went through all tutorials and videos and are getting ready to rock-n-roll. But the one thing you’re not sure about is this Object Store. Yes, you used it successfully as described in the tutorial, but what’s next? And what else is there to know about the Object Store? First and foremost, if you are interested in understanding a bit more about what this Object Store is, you should read the following blog post from Rachna, the Product Manager for the Object Store among other things. It introduces the Object Store, how to set it up and manage files with the UI, plus a couple of simple command line examples (don’t get confused by the term ‘BMC’, that’s the old name of Oracle’s Cloud Infrastructure; that’s true for the command line utility as well, which is now called oci). You should read that blog post to get familiar with the basic concepts of the Object Store and a cloud account (tenant). The documentation and blog posts are great, but now you actually want to use it to load data into ADWC. This means loading more (and larger) files, more need for automation, and more flexibility. This post will focus on exactly that: becoming productive with command line utilities without being a developer, and leveraging the power of the Oracle Object Store to upload many files in one go and even upload larger files in parallel without any major effort. The blog post will cover both:
The Oracle oci command line interface for managing files
The Swift REST interface for managing files
Using the oci command line interface
The Oracle oci command line interface (CLI) is a tool that enables you to work with Oracle Cloud Infrastructure objects and services. It’s a thin layer on top of the oci APIs (typically REST) and one of Oracle’s open source projects (the source code is on GitHub). Let’s quickly step through what you have to do to use this CLI. If you do not want to install anything, that is fine, too. In that case feel free to jump to the REST section in this post right away, but you’re going to miss out on some cool stuff that the CLI provides you out of the box. Getting going with the utility is really simple, as simple as one-two-three:
1. Install oci cli following the installation instructions on github. I just did this on an Oracle Linux 7.4 VM instance that I created in the Oracle Cloud and had the utility up and running in no time.
2. Configure your oci cli installation. You need a user created in the Oracle Cloud account that you want to use, and that user must have the appropriate privileges to work with the object store. A keypair is used for signing API requests, with the public key uploaded to Oracle. Only the user calling the API should possess the private key. All this is described in the configuration section of the CLI. That is probably the part of the setup that takes the most time. Make sure you have UI console access when doing this, since you have to upload the public key for your user.
3. Use oci cli. After successful setup you can use the command line interface to manage your buckets for storing all your files in the Cloud, among other things.
First steps with oci cli
The focus of the command line interface is on ease of use and making its usage as self-explanatory as possible, with a comprehensive built-in help system in the utility.
Whenever you want to know something without looking around, use the --help, -h, or -? syntax for a command, irrespective of how many parameters you have already entered. So you can start with oci -h and let the utility guide you. For the purpose of file management the important category is the object store category, with the main tasks of:
Creating, managing, and deleting buckets. This task is probably done by an administrator for you, but we will cover it briefly nevertheless.
Uploading, managing, and downloading objects (files). That’s your main job in the context of the Autonomous Data Warehouse Cloud, and that’s what we are going to do now.
Creating a bucket
Buckets are containers that store objects (files). Like other resources, buckets belong to a compartment, a collection of resources in the Cloud that can be used as an entity for privilege management. To create a bucket you have to know the compartment id. That is the only time we have to deal with these cloud-specific unique identifiers; all other object (file) operations use names. So let’s create a bucket. The following creates a bucket named myFiles in my account ADWCACCT in a compartment given to me by the Cloud administrator.
$ oci os bucket create --compartment-id ocid1.tenancy.oc1..aaaaaaaanwcasjdhfsbw64mt74efh5hneavfwxko7d5distizgrtb3gzj5vq --namespace-name adwcaact --name myFiles
{
  "data": {
    "compartment-id": "ocid1.tenancy.oc1..aaaaaaaanwcasjdhfsbw64mt74efh5hneavfwxko7d5distizgrtb3gzj5vq",
    "created-by": "ocid1.user.oc1..aaaaaaaaomoqtk3z7y43543cdvexq3y733pb5qsuefcbmj2n5c6ftoi7zygq",
    "etag": "c6119bd6-98b6-4520-a05b-26d5472ea444",
    "metadata": {},
    "name": "myFiles",
    "namespace": "adwcaact",
    "public-access-type": "NoPublicAccess",
    "storage-tier": "Standard",
    "time-created": "2018-02-26T22:16:30.362000+00:00"
  },
  "etag": "c6119bd6-98b6-4520-a05b-26d5472ea733"
}
The operation returns the metadata of the bucket after successful creation. We’re ready to upload and manage files in the object store.
Upload your first file with oci cli
You can upload a single file very easily with the oci command line interface. And, as promised before, you do not even have to remember any ocid in this case.
$ oci os object put --namespace adwcacct --bucket-name myFiles --file /stage/supplier.tbl
Uploading object  [####################################]  100%
{
  "etag": "662649262F5BC72CE053C210C10A4D1D",
  "last-modified": "Mon, 26 Feb 2018 22:50:46 GMT",
  "opc-content-md5": "8irNoabnPldUt72FAl1nvw=="
}
After a successful upload you can check the md5 sum of the file; that’s basically the fingerprint confirming that the data on the other side (in the cloud) is not corrupt and is the same as the local copy (on the machine where the data is coming from). The only “gotcha” is that OCI is using base64 encoding, so you cannot just do a simple md5. The following command solves this for me on my Mac:
$ openssl dgst -md5 -binary supplier.tbl | openssl enc -base64
8irNoabnPldUt72FAl1nvw==
Now that’s a good start. I can use this command in any shell program, like the following, which loads all files in a folder sequentially to the object store:
for i in `ls *.tbl`
do
  oci os object put --namespace adwcacct --bucket-name myFiles --file $i
done
You can write it to load multiple files in parallel, load only files that match a specific name pattern, etc. You get the idea. Whatever you can do with a shell you can do.
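For instance, the sequential loop above can be turned into a simple throttled parallel upload that only picks up files matching a pattern. This is just a sketch, not official tooling: the namespace and bucket are the ones used above, and the concurrency of 4 is an arbitrary choice.
#!/bin/bash
# Upload all *.tbl files, keeping at most 4 uploads running at any time (sketch)
MAX_JOBS=4
for f in *.tbl; do
  oci os object put --namespace adwcacct --bucket-name myFiles --file "$f" &
  # throttle: wait until fewer than MAX_JOBS background uploads remain
  while [ "$(jobs -rp | wc -l)" -ge "$MAX_JOBS" ]; do
    sleep 1
  done
done
wait   # wait for the remaining uploads to finish
echo "All uploads completed"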
Alternatively, if it's just about loading all the files in a directory, you can achieve the same with the oci cli by using its bulk upload capabilities. The following briefly shows the bulk upload in action:
$ oci os object bulk-upload -ns adwcacct -bn myFiles --src-dir /MyStagedFiles
{
  "skipped-objects": [],
  "upload-failures": {},
  "uploaded-objects": {
    "chan_v3.dat": {
      "etag": "674EFB90B1A3CECAE053C210D10AC9D9",
      "last-modified": "Tue, 13 Mar 2018 17:43:28 GMT",
      "opc-content-md5": "/t4LbeOiCz61+Onzi/h+8w=="
    },
    "coun_v3.dat": {
      "etag": "674FB97D50C34E48E053C230C10A1DF8",
      "last-modified": "Tue, 13 Mar 2018 17:43:28 GMT",
      "opc-content-md5": "sftu7G5+bgXW8NEYjFNCnQ=="
    },
    "cust1v3.dat": {
      "etag": "674FB97D52274E48E053C210C10A1DF8",
      "last-modified": "Tue, 13 Mar 2018 17:44:06 GMT",
      "opc-content-md5": "Zv76q9e+NTJiyXU52FLYMA=="
    },
    "sale1v3.dat": {
      "etag": "674FBF063F8C50ABE053C250C10AE3D3",
      "last-modified": "Tue, 13 Mar 2018 17:44:52 GMT",
      "opc-content-md5": "CNUtk7DJ5sETqV73Ag4Aeg=="
    }
  }
}
Uploading a single large file in parallel
Ok, now we can load one or many files to the object store. But what do you do if you have a single large file that you want to get uploaded? The oci command line offers built-in multi-part loading where you do not need to split the file beforehand. The command line provides built-in capabilities to (A) transparently split the file into sized parts and (B) control the parallelism of the upload.
$ oci os object put -ns adwcacct -bn myFiles --file lo_aa.tbl --part-size 100 --parallel-upload-count 4
While the load is ongoing you can list all in-progress uploads, but unfortunately without any progress bar or so; the progress bar is reserved for the initiating session:
$ oci os multipart list -ns adwcacct -bn myFiles
{
  "data": [
    {
      "bucket": "myFiles",
      "namespace": "adwcacct",
      "object": "lo_aa.tbl",
      "time-created": "2018-02-27T01:19:47.439000+00:00",
      "upload-id": "4f04f65d-324b-4b13-7e60-84596d0ef47f"
    }
  ]
}
While a serial process for a single file gave me somewhere around 35 MB/sec upload on average, the parallel load sped things up quite a bit, so it’s definitely cool functionality (note that your mileage will vary and is probably mostly dependent on your Internet/proxy connectivity and bandwidth). If you’re interested in more details about how that works, here is a link from Rachna, who explains the inner workings of this functionality in more detail.
Using the Swift REST interface
Now, after having covered the oci utility, let’s briefly look into what we can do out of the box, without the need to install anything. Yes, without installing anything you can leverage the REST endpoints of the object storage service. All you need to know is your username/SWIFT password and your environment details, e.g. which region you’re uploading to, the account (tenant) and the target bucket. This is where the real fun starts, and this is where it can become geeky, so we will focus only on the two most important aspects of dealing with files and the object store: uploading and downloading files.
Understanding how to use Openstack Swift REST
File management with REST is just as simple as it is with the oci cli command. Similar to the setup of the oci cli, you have to know the basic information about your Cloud account, namely:
a user in the cloud account that has the appropriate privileges to work with a bucket in your tenancy.
This user also has to be configured to have a SWIFT password (see here how that is done).
a bucket in one of the object stores in a region (we are not going to discuss how to use REST to do this). The bucket/region defines the REST endpoint; for example, if you are using the object store in Ashburn, VA, the endpoint is https://swiftobjectstorage.us-ashburn-1.oraclecloud.com
The URI for accessing your bucket is built as follows: <object store rest endpoint>/v1/<tenant name>/<bucket name>
In my case, for the simple example, it would be https://swiftobjectstorage.us-ashburn-1.oraclecloud.com/v1/adwcacct/myFiles
If you have all this information you are set to upload and download files.
Uploading an object (file) with REST
Uploading a file is putting a file into the Cloud, so the REST command is a PUT. You also have to specify the file you want to upload and how the file should be named in the object store. With this information you can write a simple little shell script like the following, which takes both the file and the bucket name as input:
# usage: upload_oss.sh <file> <bucket>
file=$1
bucket=$2
curl -v -X PUT \
 -u 'jane.doe@acme.com:)#sdswrRYsi-In1-MhM.!' \
 --upload-file ${file} \
 https://swiftobjectstorage.us-ashburn-1.oraclecloud.com/v1/adwcacct/${bucket}/${file}
So if you want to upload multiple files in a directory, similar to what we showed for the oci cli command, you just save this little script, say as upload_oss.sh, and call it just like you called oci cli (file first, then bucket, to match the script's arguments):
for i in `ls *.tbl`
do
  upload_oss.sh $i myFiles
done
Downloading an object (file) with REST
While we expect you to upload data to the object store way more often than you download information, let’s quickly cover that, too. So you want to get a file from the object store? Well, the REST command GET will do this for you. It is just as intuitive as uploading, and you might be able to guess the complete syntax already. Yes, it is:
curl -v -X GET \
 -u 'jane.doe@acme.com:)#sdswrRYsi-In1-MhM.!' \
 https://swiftobjectstorage.us-ashburn-1.oraclecloud.com/v1/adwcacct/myFiles/myFileName \
 --output myLocalFileName
That’s about all you need to get started uploading all your files to the Oracle Object Store so that you can then consume them from within the Autonomous Data Warehouse Cloud. Happy uploading!


The Data Warehouse Insider

Roadmap Update: What you need to know about Big Data Appliance 4.12

As part of our continuous efforts to ensure transparency in release planning and availability for our big data stack, below is an update to the original roadmap post.
Current Release
As discussed, the 4.11 release delivered the following:
An updated CDH, now delivering 5.13.1
Updates to the Operating System (OL6), including security updates
Java updates
The release consciously pushed back some features to ensure the Oracle environments pick up the latest CDH releases within our (roughly) 4 week goal.
Next up is BDA 4.12
Thanks to the longer development time we carved out for the feature-carrying 4.12 release, we are able to schedule a set of very interesting components into this release. At a high level, the following are planned to be in 4.12:
Configure a Kafka cluster on dedicated nodes on the BDA
Set up (and include) Big Data Manager on BDA. For more information on Big Data Manager, see these videos (or click the one further down) on what cool things you can do with the Zeppelin Notebooks, ODCP and drag-and-drop copying of data
Full BDA clusters on OL7. After we enabled the edge nodes for OL7 to support Cloudera Data Science Workbench, we are now delivering full clusters on OL7. Note that we have not yet delivered an in-place upgrade path to migrate from an OL6 based cluster to an OL7 cluster
High Availability for more services in CDH, by leveraging and pre-configuring best practices. These new HA setup steps are updated regularly and are fully supported as part of the system going forward: Hive Service, Sentry Service, Hue Service
On BDA X7-2 hardware, 2 SSDs are included. When running on X7, the Journal Node metadata and Zookeeper data are put onto these SSDs instead of the regular OS disks. This ensures better performance for highly loaded master nodes.
Of course the software will have undergone testing, and we do run infrastructure security scans on the system. We include any Linux updates that are available when we freeze the image and ship those. Any violation that crops up after the release can, and indeed should, be addressed by updating the OS from the official OL repo.
Lastly, we are looking to release in early April and are finalizing the actual Cloudera CDH release. We may use 5.14.1, but there is a chance that we switch and jump to 5.15.0 depending on timing.
And one more Thing
Because Big Data Appliance is an engineered system, customers expect robust movement between versions. Upgrading the entire system, which is where BDA differs from just a Cloudera cluster, is an important part of the value proposition but is also fairly complex. With 4.12 we place additional emphasis on addressing previously seen upgrade issues, and we will be doing this as an ongoing priority on all BDA software releases. So expect even more robust upgrades going forward.
Lastly Please Note Safe Harbor Statement below:
The preceding is intended to outline our general product direction. It is intended for information purposes only, and may not be incorporated into any contract. It is not a commitment to deliver any material, code, or functionality, and should not be relied upon in making purchasing decisions. The development, release, and timing of any features or functionality described for Oracle’s products remains at the sole discretion of Oracle.


Big Data SQL

Big Data SQL Quick Start. Multi-user Authorization - Part 25

One of the major Big Data SQL benefits is security. You work with the data that you store in HDFS or other sources through Oracle Database, which means that you can apply many Database features, such as Data Redaction, VPD or Database Vault. These features, in conjunction with the database schema/grant privilege model, allow you to protect data on the database side (when an intruder tries to reach the data through the database). But it's also important to keep in mind that data stored on HDFS may be required for other purposes (Spark, Solr, Impala...) and those need some other mechanism for protection. In the Hadoop world, Kerberos is the most popular way to protect data (it is an authentication method). Kerberos in conjunction with HDFS ACLs gives you the opportunity to protect data at the file system level. HDFS as a file system has the concepts of user and group, and the files you store on HDFS have different privileges for the owner, the group and all others.
Conclusion: to work with Kerberized clusters, Big Data SQL needs a valid Kerberos ticket to work with HDFS files. Fortunately, all this setup has been automated and is available within the standard Oracle Big Data SQL installer. For more details please check here.
Big Data SQL and Kerberos.
Well, usually customers have a Kerberized cluster, and to work with it we need a valid Kerberos ticket. But this raises the question: which principal do you need with Big Data SQL? The answer is easy: oracle. In prior Big Data SQL releases, all Big Data SQL ran on the Hadoop cluster as the same user, oracle. This has the following consequences:
- You are unable to authorize access to data based on the user that is running a query
- Hadoop cluster audits show that all data queried through Big Data SQL was accessed by oracle
What if I already have data that is used by other applications and has different privileges (belonging to different users and groups)? For that, Big Data SQL 3.2 introduced a new feature: Multi-User Authorization.
Hadoop impersonation.
At the foundation of Multi-User Authorization lies a Hadoop feature called impersonation. I took this description from here: "A superuser with username ‘super’ wants to submit job and access hdfs on behalf of a user joe. The superuser has Kerberos credentials but user joe doesn’t have any. The tasks are required to run as user joe and any file accesses on namenode are required to be done as user joe. It is required that user joe can connect to the namenode or job tracker on a connection authenticated with super’s Kerberos credentials. In other words super is impersonating the user joe." In the same manner, "oracle" is the superuser and other users are impersonated.
Multi-User Authorization key concepts.
1) Big Data SQL will identify the trusted user that is accessing data on the cluster. By executing the query as the trusted user:
- Authorization rules specified in Hadoop will be respected
- Authorization rules specified in Hadoop do not need to be replicated in the database
- Hadoop cluster audits identify the actual Big Data SQL query user
2) Consider the Oracle Database as the entity that is providing the trusted user to Hadoop
3) You must map the database user that is running a query in Oracle Database to a Hadoop user
4) You must identify the actual user that is querying the Oracle table and pass that identity to Hadoop:
- This may be an Oracle Database user (i.e. a schema)
- A lightweight user may come from session-based contexts (see SYS_CONTEXT)
- The user/group map must be available through OS lookup in Hadoop
Demonstration.
The full documentation for this feature can be found here; now I'm going to show a few of the most popular cases with code examples. To work with the relevant objects, you need to grant the following permissions to the user who will manage the mapping table:
SQL> grant select on BDSQL_USER_MAP to bikes;
SQL> grant execute on DBMS_BDSQL to bikes;
SQL> grant BDSQL_ADMIN to bikes;
In my case, this is the user "bikes". Just in case, clean up the permissions for user BIKES:
SQL> begin
DBMS_BDSQL.REMOVE_USER_MAP (current_database_user =>'BIKES');
end;
/
Check that the mapping table is empty:
SQL> select * from SYS.BDSQL_USER_MAP;
and after this run a query:
SQL> select /*+ MONITOR */ * from bikes.weather_ext;
This is the default mode, without any mapping, so I assume that I'll contact HDFS as the oracle user. To double-check this, I review the audit files:
$ cd /var/log/hadoop-hdfs
$ tail -f hdfs-audit.log |grep central_park
2018-03-01 17:42:10,938 INFO ... ugi=oracle ... ip=/10.0.0.10 cmd=open ... src=/data/weather/central_park_weather.csv..
Here it is clear that the oracle user reads the file (ugi=oracle). Let's check the permissions of the file backing this external table:
$ hadoop fs -ls /data/weather/central_park_weather.csv
-rw-r--r-- 3 oracle oinstall 26103 2017-10-24 13:03 /data/weather/central_park_weather.csv
So everybody may read it. Remember this, and let's try to create the first mapping:
SQL> begin
DBMS_BDSQL.ADD_USER_MAP(
current_database_user =>'BIKES',
syscontext_namespace => null,
syscontext_parm_hadoop_user => 'user1'
);
end;
/
This mapping says that the database user BIKES will always be mapped to the OS user user1. Let's find this in the file permission table. Run the query again and check the user who reads this file:
SQL> select /*+ MONITOR */ * from bikes.weather_ext;
$ cd /var/log/hadoop-hdfs
$ tail -f hdfs-audit.log |grep central_park
2018-03-01 17:42:10,938 INFO ... ugi=user1... ip=/10.0.0.10 cmd=open ... src=/data/weather/central_park_weather.csv..
It's interesting that user1 doesn't exist on the Hadoop OS:
# id user1
id: user1: No such user
If the user does not exist (the user1 case), it can only read world-readable files. Let me revoke the read permission for everyone else and run the query again:
$ sudo -u hdfs hadoop fs -chmod 640 /data/weather/central_park_weather.csv
$ hadoop fs -ls /data/weather/central_park_weather.csv
-rw-r----- 3 oracle oinstall 26103 2017-10-24 13:03 /data/weather/central_park_weather.csv
Now it fails. To make it work I can create a "user1" account on each Hadoop node and add it to the oinstall group:
$ useradd user1
$ usermod -a -G oinstall user1
Run the query again and check the user who reads this file:
SQL> select /*+ MONITOR */ * from bikes.weather_ext;
$ cd /var/log/hadoop-hdfs
$ tail -f hdfs-audit.log |grep central_park
2018-03-01 17:42:10,938 INFO ... ugi=user1... ip=/10.0.0.10 cmd=open ... src=/data/weather/central_park_weather.csv..
Here we are! We can read the file because of the group permissions. What if I want to map this schema to hdfs or some other powerful user? Let's try:
SQL> begin
DBMS_BDSQL.REMOVE_USER_MAP (current_database_user =>'BIKES');
DBMS_BDSQL.ADD_USER_MAP(
current_database_user =>'BIKES',
syscontext_namespace => null,
syscontext_parm_hadoop_user => 'hdfs'
);
end;
/
This attempt raises an exception, and the reason is that the hdfs user is on the blacklist for impersonation.
$ cat $ORACLE_HOME/bigdatasql/databases/orcl/bigdata_config/bigdata.properties| grep impersonation
....
# Impersonation properties
impersonation.enabled=true
impersonation.blacklist='hue','yarn','oozie','smon','mapred','hdfs','hive','httpfs','flume','HTTP','bigdatamgr','oracle'
...
The second scenario is authorization with a thin client, or with CLIENT_IDENTIFIER. In a multi-tier architecture (where we have an application tier and a database tier), it may be a challenge to differentiate between multiple users of the same application who share the same schema. Below is an example which illustrates this: we have an application which connects to the database as the HR_APP user, but many people may use this application and this database login. To differentiate these human users we can use the dbms_session.set_identifier procedure (you can find more details here). The Big Data SQL multi-user authorization feature allows the SYS_CONTEXT user to be used for authorization on Hadoop. Below is a test case which illustrates this.
-- Remove the previous rule related to the BIKES user --
SQL> begin
DBMS_BDSQL.REMOVE_USER_MAP (current_database_user =>'BIKES');
end;
/
-- Add a new rule, which says that if the database user is BIKES, the Hadoop user has to be taken from USERENV as CLIENT_IDENTIFIER --
SQL> begin
DBMS_BDSQL.ADD_USER_MAP(
current_database_user =>'BIKES',
syscontext_namespace => 'USERENV',
syscontext_parm_hadoop_user => 'CLIENT_IDENTIFIER'
);
end;
-- Check the current database user (schema) --
SQL> select user from dual;
BIKES
-- Check CLIENT_IDENTIFIER from USERENV --
SQL> select SYS_CONTEXT('USERENV', 'CLIENT_IDENTIFIER') from dual;
NULL
-- Run any query against Hadoop --
SQL> select /*+ MONITOR */ * from bikes.weather_ext;
-- Check in the Hadoop audit logs --
-bash-4.1$ tail -f hdfs-audit.log |grep central_park
2018-03-01 18:14:40 ... ugi=oracle ... src=/data/weather/central_park_weather.csv
-- Set CLIENT_IDENTIFIER --
SQL> begin
dbms_session.set_identifier('Alexey');
end;
/
-- Check CLIENT_IDENTIFIER for the current session --
SQL> select SYS_CONTEXT('USERENV', 'CLIENT_IDENTIFIER') from dual;
Alexey
-- Run the query again over the HDFS data --
SQL> select /*+ MONITOR */ * from bikes.weather_ext;
-- Check in the Hadoop audit logs --
-bash-4.1$ tail -f hdfs-audit.log |grep central_park
2018-03-01 18:17:43 ... ugi=Alexey ... src=/data/weather/central_park_weather.csv
The third way is to use the authenticated user identity: users connecting to the database (via Kerberos, a database user, etc...) have their authenticated identity passed to Hadoop. To make it work, simply run:
SQL> begin
DBMS_BDSQL.ADD_USER_MAP(
current_database_user => '*' ,
syscontext_namespace => 'USERENV',
syscontext_parm_hadoop_user => 'AUTHENTICATED_IDENTITY');
end;
/
After this, your user on HDFS will be the one returned by:
SQL> select SYS_CONTEXT('USERENV', 'AUTHENTICATED_IDENTITY') from dual;
BIKES
For example, if I log on to the database as BIKES (as a database user), on HDFS I'll be authenticated as the bikes user:
-bash-4.1 $ tail -f hdfs-audit.log |grep central_park
2018-03-01 18:23:23 ... ugi=bikes... src=/data/weather/central_park_weather.csv
To check all the rules you have for multi-user authorization, you can run the following query:
SQL> select * from SYS.BDSQL_USER_MAP;
I hope this feature will allow you to build a robust security bastion around your data in HDFS.


Big Data

Advanced Data Protection using Big Data SQL and Database Vault - Introduction

According to the latest analyst reports and data breach statistics, Data Protection is shaping up to be the most important IT issue of the coming years! Due to increasing threats and cyber-attacks, new privacy regulations such as the European Union (EU) General Data Protection Regulation (GDPR) are being implemented and enforced, and the increasing adoption of Public Cloud further legitimizes these new Cyber Security requirements. Data Lake/Hub environments can be a treasure trove of sensitive data, so data protection must be considered in almost all Big Data SQL implementations. Fortunately, Big Data SQL is able to propagate several of the data protection capabilities of the Oracle Multi-Model Database, such as Virtual Private Database (aka Row Level Security) or Data Redaction, described in a previous post (see Big Data SQL Quick Start. Security - Part4.). But now is the time to speak about one of the most powerful ones: Database Vault. Clearly, databases are a common target, and 81% of 2017 hacking-related breaches leveraged stolen and/or weak passwords. So, once legitimate internal credentials are acquired (ideally, from the attacker's point of view, those of system accounts), accessing interesting data is just a matter of time. Hence, while Alexey has described all the security capabilities you could put in place to Secure your Hadoop Cluster, once hackers get legitimate database credentials, it's done... unless you add another Cyber Security layer to manage fine-grained access. And here comes Database Vault1. This introductory post is the first of a series where we'll illustrate the security capabilities that can be combined with Big Data SQL in order to propagate these protections to Oracle and non-Oracle data stores: NoSQL clusters (Oracle NoSQL DB, HBase, Apache Cassandra, MongoDB...), Hadoop (Hortonworks and Cloudera), Kafka (Confluent and Apache, with the 3.2 release of Big Data SQL). In essence, Database Vault allows a separation of duties between the operators (DBAs) and application users. As a result, data is protected from users with system privileges (SYSTEM, which should be locked and never used, named DBA accounts...), while those users can still continue to do their job:
Moreover, Database Vault has the ability to add fine-grained security layers to control precisely who accesses which objects (tables, views, PL/SQL code...), from where (e.g. edge nodes only), and when (e.g. only during an application maintenance window):
As explained in the previous figure, Database Vault introduces the concepts of Realms and Command Rules. From the documentation: A realm is a grouping of database schemas, database objects, and/or database roles that must be secured for a given application. Think of a realm as a zone of protection for your database objects. A schema is a logical collection of database objects such as tables (including external tables, hence allowing to work with Big Data SQL), views, and packages, and a role is a collection of privileges. By arranging schemas and roles into functional groups, you can control the ability of users to use system privileges against these groups and prevent unauthorized data access by the database administrator or other powerful users with system privileges. Oracle Database Vault does not replace the discretionary access control model in the existing Oracle database. It functions as a layer on top of this model for both realms and command rules. Oracle Database Vault provides two types of realms: regular and mandatory.
A regular realm protects an entire database object (such as a schema). This type of realm restricts all users except users who have direct object privilege grants. With regular realms, users with direct object grants can perform DML operations but not DDL operations. A mandatory realm restricts user access to objects within a realm. Mandatory realms block both object privilege-based and system privilege-based access. In other words, even an object owner cannot access his or her own objects without proper realm authorization if the objects are protected by mandatory realms. After you create a realm, you can register a set of schema objects or roles (secured objects) for realm protection and authorize a set of users or roles to access the secured objects. For example, you can create a realm to protect all existing database schemas that are used in an accounting department. The realm prohibits any user who is not authorized to the realm to use system privileges to access the secured accounting data.
And also: A command rule protects Oracle Database SQL statements (SELECT, ALTER SYSTEM), database definition language (DDL), and data manipulation language (DML) statements. To customize and enforce the command rule, you associate it with a rule set, which is a collection of one or more rules. The command rule is enforced at run time. Command rules affect anyone who tries to use the SQL statements it protects, regardless of the realm in which the object exists.
One important point to emphasize is that Database Vault will audit any access violation on protected objects, ensuring governance and compliance over time. To summarize:
In the next parts of this series, I'll present the following three use cases in order to demonstrate some of the Database Vault capabilities in the context of Big Data SQL:
Protect data from users with system privileges (DBA…)
Access data only if a super manager is connected too
Prevent users from creating EXTERNAL tables for Big Data SQL
In the meantime, you can discover practical information by reading one of our partner white papers.
1: Database Vault is a database option and has to be licensed accordingly; it is available on Oracle Database Enterprise Edition only. Notice that Database Cloud Service High Performance and Extreme Performance, as well as Exadata Cloud Service and Exadata Cloud at Customer, include this capability in the cloud subscription.
Thanks to Alan, Alexey and Martin for their helpful reviews!
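To make the realm concept a little more concrete ahead of those posts, here is a minimal, hypothetical sketch using the standard Database Vault administration package. The realm name, protected schema and grantee are invented for illustration, and exact parameters can vary between database releases, so treat this as a taster rather than a recipe:
BEGIN
  -- Create a realm around the schema that owns the Big Data SQL external tables
  DBMS_MACADM.CREATE_REALM(
    realm_name    => 'Big Data SQL Protection Realm',
    description   => 'Protects the BDS schema, including its external tables',
    enabled       => DBMS_MACUTL.G_YES,
    audit_options => DBMS_MACUTL.G_REALM_AUDIT_FAIL);
  -- Place every object of the (hypothetical) BDS schema inside the realm
  DBMS_MACADM.ADD_OBJECT_TO_REALM(
    realm_name   => 'Big Data SQL Protection Realm',
    object_owner => 'BDS',
    object_name  => '%',
    object_type  => '%');
  -- Authorize only the application account; system privileges alone are no longer enough
  DBMS_MACADM.ADD_AUTH_TO_REALM(
    realm_name   => 'Big Data SQL Protection Realm',
    grantee      => 'BDS_APP',
    auth_options => DBMS_MACUTL.G_REALM_AUTH_OWNER);
END;
/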


Learn more about using Big Data Manager - importing data, notebooks and other useful things

In one of the previous posts on this blog (see How Easily You Can Copy Data Between Object Store and HDFS) we discussed some functionality enabled by a tool called Big Data Manager, based upon the distributed (Spark based) copy utility. Since then a lot of useful features have been added to Big Data Manager, and to share them with the world, they are now recorded and published on YouTube. The library consists of a number of videos on the following topics (the video library is here):
Working with Archives
File Imports
Working with Remote Data
Importing Notebooks from GitHub
For some background, Big Data Manager is a utility that is included with Big Data Cloud Service, Big Data Cloud at Customer and soon with Big Data Appliance. Its primary goal is to enable users to quickly achieve tasks like copying files and publishing data via a Notebook interface. In this case, the interface is based on and leverages Zeppelin notebooks. The notebooks run on a node within the cluster and have direct access to the local data elements. As is shown in some of the videos, Big Data Manager enables easy file transport between Object Stores (incl. Oracle's and Amazon's) and HDFS. This transfer is based on ODCP, which leverages Apache Spark in the cluster to enable high volume and high performance file transfers. You can see more here: Free new tutorial: Quickly uploading files with Big Data Manager in Big Data Cloud Service


Big Data

Oracle Big Data Lite 4.11 is Available

The latest release of Oracle Big Data Lite is now available for download on OTN.  Version 4.11 has the following products installed and configured: Oracle Enterprise Linux 6.9 Oracle Database 12c Release 1 Enterprise Edition (12.1.0.2) - including Oracle Big Data SQL-enabled external tables, Oracle Multitenant, Oracle Advanced Analytics, Oracle OLAP, Oracle Partitioning, Oracle Spatial and Graph, and more. Cloudera Distribution including Apache Hadoop (CDH5.13.1) Cloudera Manager (5.13.1) Oracle Big Data Spatial and Graph 2.4 Oracle Big Data Connectors 4.11 Oracle SQL Connector for HDFS 3.8.1 Oracle Loader for Hadoop 3.9.1 Oracle Data Integrator 12c (12.2.1.3.0) Oracle R Advanced Analytics for Hadoop 2.7.1 Oracle XQuery for Hadoop 4.9.1 Oracle Data Source for Apache Hadoop 1.2.1 Oracle Shell for Hadoop Loaders 1.3.1 Oracle NoSQL Database Enterprise Edition 12cR1 (4.5.12) Oracle JDeveloper 12c (12.2.1.2.0) Oracle SQL Developer and Data Modeler 17.3.1 with Oracle REST Data Services 3.0.7 Oracle Data Integrator 12cR1 (12.2.1.3.0) Oracle GoldenGate 12c (12.3.0.1.2) Oracle R Distribution 3.3.0 Oracle Perfect Balance 2.10.0 Check out the download page for the latest samples and useful links to help you get started with Oracle's Big Data platform. Enjoy!


Big Data

Free new tutorial: Quickly uploading files with Big Data Manager in Big Data Cloud Service

Sometimes the simplest tasks make life (too) hard. Consider simple things like uploading some new data sets into your Hadoop cluster in the cloud and then getting to work on the thing you really need to do: analyzing that data. This new free tutorial shows you how to easily and quickly do the grunt work with Big Data Manager in Big Data Cloud Service (learn more here), enabling you to worry about analytics, not moving files. The approach taken here is to take a file that resides on your desktop and drag and drop it into HDFS on Oracle Big Data Cloud Service... as easy as that, and you are now off doing analytics by right-clicking and adding the data into a Zeppelin Notebook. Within the notebook, you get to see how Big Data Manager enables you to quickly generate a Hive schema definition from the data set and then start to do some analytics. Mechanics made easy! You can, and always should, look at leveraging Object Storage as your entry point for data, as discussed in this other Big Data Manager How To article: See How Easily You Can Copy Data Between Object Store and HDFS. For more advanced analytics, have a look at Oracle's wide-ranging set of cloud services or open source tools like R, and the high performance version of R: Oracle R Advanced Analytics for Hadoop.


Big Data

New Release: BDA 4.11 is now Generally Available

As promised, this update to Oracle Big Data Appliance has come quickly. We just uploaded the bits and are in the process of uploading both the documentation and the configurator. You can find the latest software on MyOracleSupport. So what is new? BDA Software 4.11.0 contains a few new things, but is mainly intended to keep our software releases close to the Cloudera releases, as discussed in this roadmap post. This latest version picks up:
Cloudera CDH 5.13.1 and Cloudera Manager 5.13.1
Parcels for Kafka 3.0, Spark 2.2 and Key Trustee Server 5.13, included in the BDA Software Bundle
Kudu, now included in the CDH parcel
The team also made a number of small but significant updates:
Cloudera Manager cluster hosts are now configured with TLS Level 3 - this includes encrypted communication with certificate verification of both the Cloudera Manager Server and Agents, to verify identity and prevent spoofing by untrusted Agents running on hosts
Update to ODI Agent 12.2.1.3.0
Updates to Oracle Linux 6, JDK 8u151 and MySQL 5.7.20
It is important to remember that with 4.11.0 we no longer support upgrading OL5 based clusters. Review New Release: BDA 4.10 is now Generally Available for some details on this.
Links:
Documentation: http://www.oracle.com/technetwork/database/bigdata-appliance/documentation/index.html
Configurator: http://www.oracle.com/technetwork/database/bigdata-appliance/downloads/index.html
That's all folks; more new releases, features and good stuff to come in 2018.


Data Warehousing

SQL Pattern Matching Deep Dive - the book

Those of you with long memories might just be able to recall a whole series of posts I did on SQL pattern matching which were taken from a deep dive presentation that I prepared for the BIWA User Group Conference. The title of each blog post started with SQL Pattern Matching Deep Dive... and the series covered a set of 6 posts:
Part 1 - Overview
Part 2 - Using MATCH_NUMBER() and CLASSIFIER()
Part 3 - Greedy vs. reluctant quantifiers
Part 4 - Empty matches and unmatched rows?
Part 5 - SKIP TO where exactly?
Part 6 - State machines
There are a lot of related posts derived from that core set of 6 posts, along with other presentations and code samples. One of the challenges, even when searching via Google, was tracking down all the relevant content. Therefore, I have spent the last 6-9 months converting all my deep dive content into a book - an Apple iBook. I have added a lot of new content based on discussions I have had at user conferences, questions posted on the developer SQL forum, discussions with my development team and some new presentations developed for the OracleCode series of events. To make life easier for everyone I have split the content into two volumes, and just in time for Thanksgiving Volume 1 is now available in the iBook Store - it's free to download! This first volume covers the following topics:
Chapter 1: Introduction - Background to the book and an explanation of how some of the features within the book are expected to work
Chapter 2: Industry specific use cases - In this section we will review a series of use cases and provide conceptual, simplified SQL to solve these business requirements using the new SQL pattern matching functionality.
Chapter 3: Syntax for MATCH_RECOGNIZE - The easiest way to explore the syntax of 12c's new MATCH_RECOGNIZE clause is to look at a simple example...
Chapter 4: How to use built-in measures for debugging - In this section I am going to review the two built-in measures that we have provided to help you understand how your data set is mapped to your pattern.
Chapter 5: Patterns and Predicates - This chapter looks at how predicates affect the results returned by MATCH_RECOGNIZE.
Chapter 6: Next Steps - This final section provides links to additional information relating to SQL pattern matching.
Chapter 7: Credits
My objective is that by the end of this two-part series you will have a good, solid understanding of how MATCH_RECOGNIZE works, how it can be used to simplify your application code and how to test your code to make sure it is working correctly. In a couple of weeks I will publish information about the contents of Volume 2 and when I hope to have it finished! As usual, if you have any comments about the contents of the book then please email me directly at keith.laker@oracle.com


Big Data SQL

Using Materialized Views with Big Data SQL to Accelerate Performance

One of Big Data SQL’s key benefits is that it leverages the great performance capabilities of Oracle Database 12c.  I thought it would be interesting to illustrate an example – and in this case we’ll review a performance optimization that has been around for quite a while and is used at thousands of customers:  Materialized Views (MVs). For those of you who are unfamiliar with MVs – an MV is a precomputed summary table.  There is a defining query that describes that summary.  Queries that are executed against the detail tables comprising the summary will be automatically rewritten to the MV when appropriate: In the diagram above, we have a 1B row fact table stored in HDFS that is being accessed thru a Big Data SQL table called STORE_SALES.  Because we know that users want to query the data using a product hierarchy (by Item), a geography hierarchy (by Region) and a mix (by Class & QTR) – we created three summary tables that are aggregated to the appropriate levels. For example, the “by Item” MV has the following defining query: CREATE MATERIALIZED VIEW mv_store_sales_item ON PREBUILT TABLE ENABLE QUERY REWRITE AS (   select ss_item_sk,          sum(ss_quantity) as ss_quantity,          sum(ss_ext_wholesale_cost) as ss_ext_wholesale_cost,          sum(ss_net_paid) as ss_net_paid,          sum(ss_net_profit) as ss_net_profit   from bds.store_sales   group by ss_item_sk ); Queries executed against the large STORE_SALES that can be satisfied by the MV will now be automatically rewritten: SELECT i_category,        SUM(ss_quantity) FROM bds.store_sales, bds.item_orcl WHERE ss_item_sk = i_item_sk   AND i_size in ('small', 'petite')   AND i_wholesale_cost > 80 GROUP BY i_category; Taking a look at the query’s explain plan, you can see that even though store_sales is the table being queried – the table that satisfied the query is actually the MV called mv_store_sales_item.  The query was automatically rewritten by the optimizer. Explain plan with the MV: Explain plan without the MV: Even though Big Data SQL optimized the join and pushed the predicates and filtering down to the Hadoop nodes – the MV dramatically improved query performance: With MV:  0.27s Without MV:  19s This is to be expected as we’re querying a significantly smaller and partially aggregated data.  What’s nice is that query did not need to change; simply the introduction of the MV sped up the processing. What is interesting here is that the query selected data at the Category level – yet the MV is defined at the Item level.  How did the optimizer know that there was a product hierarchy?  And that Category level data could be computed from Item level data?  The answer is metadata.  A dimension object was created that defined the relationship between the columns: CREATE DIMENSION BDS.ITEM_DIM LEVEL ITEM IS (ITEM_ORCL.I_ITEM_SK) LEVEL CLASS IS (ITEM_ORCL.I_CLASS) LEVEL CATEGORY IS (ITEM_ORCL.I_CATEGORY) HIERARCHY PROD_ROLLUP ( ITEM CHILD OF CLASS CHILD OF   CATEGORY  )  ATTRIBUTE ITEM DETERMINES ( ITEM_ORCL.I_SIZE, ITEM_ORCL.I_COLOR, ITEM_ORCL.I_UNITS, ITEM_ORCL.I_CURRENT_PRICE,I_WHOLESALE_COST ); Here, you can see that Items roll up into Class, and Classes roll up into Category.  The optimizer used this information to allow the query to be redirected to the Item level MV. A good practice is to compute these summaries and store them in Oracle Database tables.  However, there are alternatives.  For example, you may have already computed summary tables and stored them in HDFS.  
You can leverage these summaries by creating an MV over a pre-built Big Data SQL table.  Consider the following example where a summary table was defined in Hive and called csv.mv_store_sales_qtr_class.  There are two steps required to leverage this summary: Create a Big Data SQL table over the hive source Create an MV over the prebuilt Big Data SQL table Let’s look at the details.  First, create the Big Data SQL table over the Hive source (and don’t forget to gather statistics!):   CREATE TABLE MV_STORE_SALES_QTR_CLASS     (       I_CLASS VARCHAR2(100)     , SS_QUANTITY NUMBER     , SS_WHOLESALE_COST NUMBER     , SS_EXT_DISCOUNT_AMT NUMBER     , SS_EXT_TAX NUMBER     , SS_COUPON_AMT NUMBER     , D_QUARTER_NAME VARCHAR2(30)     )     ORGANIZATION EXTERNAL     (       TYPE ORACLE_HIVE       DEFAULT DIRECTORY DEFAULT_DIR       ACCESS PARAMETERS       (         com.oracle.bigdata.tablename: csv.mv_store_sales_qtr_class       )     )     REJECT LIMIT UNLIMITED; -- Gather statistics exec  DBMS_STATS.GATHER_TABLE_STATS ( ownname => '"BDS"', tabname => '"MV_STORE_SALES_QTR_CLASS"', estimate_percent => dbms_stats.auto_sample_size, degree => 32 ); Next, create the MV over the Big Data SQL table: CREATE MATERIALIZED VIEW mv_store_sales_qtr_class ON PREBUILT TABLE WITH REDUCED PRECISION ENABLE QUERY REWRITE AS (     select i.I_CLASS,     sum(s.ss_quantity) as ss_quantity,        sum(s.ss_wholesale_cost) as ss_wholesale_cost, sum(s.ss_ext_discount_amt) as ss_ext_discount_amt,        sum(s.ss_ext_tax) as ss_ext_tax,        sum(s.ss_coupon_amt) as ss_coupon_amt,        d.D_QUARTER_NAME     from DATE_DIM_ORCL d, ITEM_ORCL i, STORE_SALES s     where s.ss_item_sk = i.i_item_sk       and s.ss_sold_date_sk = date_dim_orcl.d_date_sk     group by d.D_QUARTER_NAME,            i.I_CLASS     ); Queries against STORE_SALES that can be satisfied by the MV will be rewritten: Here, the following query used the MV: - What is the quarterly performance by category with yearly totals? select          i.i_category,        d.d_year,        d.d_quarter_name,        sum(s.ss_quantity) quantity from bds.DATE_DIM_ORCL d, bds.ITEM_ORCL i, bds.STORE_SALES s where s.ss_item_sk = i.i_item_sk   and s.ss_sold_date_sk = d.d_date_sk   and d.d_quarter_name in ('2005Q1', '2005Q2', '2005Q3', '2005Q4') group by rollup (i.i_category, d.d_year, d.D_QUARTER_NAME) And, the query returned in a little more than a second: Looking at the explain plan, you can see that the query is executed against the MV – and the EXTERNAL TABLE ACCESS (STORAGE FULL) indicates that Big Data SQL Smart Scan kicked in on the Hadoop cluster. MVs within the database can be automatically updated by using change tracking.  However, in the case of Big Data SQL tables, the data is not resident in the database – so the database does not know that the summaries are changed.  Your ETL processing will need to ensure that the MVs are kept up to date – and you will need to set query_rewrite_integrity=stale_tolerated. MVs are an old friend.  They have been used for years to accelerate performance for traditional database deployments.  They are a great tool to use for your big data deployments as well!  
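One practical footnote on the stale_tolerated remark above: enabling rewrite against summaries whose freshness the database cannot track is a one-line setting, shown here at the session level (it can also be set at the system level with ALTER SYSTEM, given the appropriate privileges):
ALTER SESSION SET query_rewrite_integrity = stale_tolerated;
-- query_rewrite_enabled must also be TRUE, which is the default in recent releases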


Big Data SQL

Big Data SQL Quick Start. Correlate real-time data with historical benchmarks – Part 24

In Big Data SQL 3.2 we have introduced a new capability - Kafka as a data source. I've posted some details about how it works, with some simple examples, over here. But now I want to talk about why you would want to run queries over Kafka. Here is Oracle's concept picture of a data warehouse: you have a stream (real-time data), a data lake where you land raw information, and cleaned Enterprise data. This is just a concept, which could be implemented in many different ways; one of them is depicted here. Kafka is the hub for streaming events, where you accumulate data from multiple real-time producers and provide this data to many consumers (which could be real-time processing, such as Spark Streaming, or a batch load into the next data warehouse tier, such as Hadoop). In this architecture, Kafka contains the stream data and can answer the question "what is going on right now", whereas the database stores operational data and Hadoop stores historical data - those two sources answer the question "how it used to be". Big Data SQL allows you to run SQL over all three sources and correlate real-time events with history.

Example of using Big Data SQL over Kafka and other sources.

So, above I've explained the concept of why you may need to query Kafka with Big Data SQL; now let me give a concrete example. Input for the demo:

- We have a company, called MoviePlex, which sells video content all around the world
- There are two stream datasets: network data, which contains information about network errors, the condition of routing devices and so on, and a second data source with the facts of movie sales
- Both stream into Kafka in real time
- We also have historical network data, which we store in HDFS (because of the cost of this data), historical sales data (which we store in the database) and multiple dimension tables, stored in the RDBMS as well

Based on this we have a business case: monitor the revenue flow, correlate current traffic with the historical benchmark (depending on the day of the week and the hour of the day) and try to find the reason in case of failures (network errors, for example). Using Oracle Data Visualization Desktop, we've created a dashboard which shows how real-time traffic correlates with the statistical benchmark and also shows the number of network errors by country. The blue line is the historical benchmark. Over time we see that some errors appear in some countries (left dashboard), but current revenue is more or less the same as it used to be. After a while revenue starts going down, and the trend keeps going. There are a lot of network errors in France. Let's drill down into itemized traffic: indeed, we caught that overall revenue goes down because of France, and the cause is network errors.

Conclusion:
1) Kafka stores real-time data and answers the question "what is going on right now"
2) The database and Hadoop store historical data and answer the question "how it used to be"
3) Big Data SQL can query data from Kafka, Hadoop and the database within a single query (joining the datasets) - a sketch of such a query follows below
4) This allows us to correlate historical benchmarks with real-time data through a SQL interface and use this with any SQL-compatible BI tool
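To make point 3 above concrete, here is a minimal sketch of what such a cross-source query might look like. The names are hypothetical stand-ins, not the actual demo schema: movie_sales_kafka is assumed to be an ORACLE_HIVE external table over the sales topic (with a JSON value column carrying an amount field), and sales_benchmark an ordinary Oracle table holding the pre-computed day-of-week / hour-of-day averages.

-- Compare revenue flowing through Kafka right now with the historical benchmark
-- stored in the database (hypothetical tables and columns, for illustration only)
SELECT TO_CHAR(k.timestamp, 'DY')        AS day_of_week,
       TO_CHAR(k.timestamp, 'HH24')      AS hour_of_day,
       SUM(TO_NUMBER(k.value.amount))    AS revenue_now,        -- JSON dot notation over the message value
       MAX(b.avg_revenue)                AS revenue_benchmark   -- pre-computed historical average
FROM   movie_sales_kafka k,
       sales_benchmark   b
WHERE  b.day_of_week = TO_CHAR(k.timestamp, 'DY')
AND    b.hour_of_day = TO_CHAR(k.timestamp, 'HH24')
GROUP BY TO_CHAR(k.timestamp, 'DY'),
         TO_CHAR(k.timestamp, 'HH24');

A BI tool can run exactly this kind of statement, which is what makes the dashboard above possible without any special streaming connector.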


Big Data SQL

Big Data SQL Quick Start. Big Data SQL over Kafka – Part 23

Big Data SQL 3.2 brings a few interesting features. Among them, one of the most interesting is the ability to read Kafka. Before drilling down into details, I'd like to explain in a nutshell what Kafka is.

What is Kafka? The full scope of information about Kafka can be found here, but in a nutshell it's a distributed, fault-tolerant messaging system. It allows you to connect many systems in an organized fashion. Instead of connecting each system peer to peer, you can land all your messages company-wide on one system and consume them from there, like this: Kafka is a kind of data hub, where you land the messages and serve them afterwards.

More technical details. I'd like to introduce a few key Kafka terms.
1) Kafka Broker. This is the Kafka service, which you run on each server and which handles all read and write requests.
2) Kafka Producer. The process which writes data into Kafka.
3) Kafka Consumer. The process which reads data from Kafka.
4) Message. The name describes itself; I just want to add that messages have a key and a value. In contrast to NoSQL databases, Kafka's key is not indexed. It has application purposes (you may put some application logic in the key) and administrative purposes (each message with the same key goes to the same partition).
5) Topic. A set of messages is organized into topics. Database folks would compare a topic to a table.
6) Partition. It's good practice to divide a topic into partitions for performance and maintenance purposes. Messages with the same key go to the same partition. If a key is absent, messages are distributed in round-robin fashion.
7) Offset. The offset is the position of each message in the topic. The offset is indexed, which allows you to quickly access a particular message.

When do you delete data? One of the basic Kafka concepts is retention: Kafka does not keep data forever, nor does it wait for all consumers to read a message before deleting it. Instead, the Kafka administrator configures a retention policy for each topic - either the amount of time to store messages before deleting them, or how much data to store before older messages are purged. Two parameters control this: log.retention.ms and log.retention.bytes. The latter is the amount of data to retain in the log for each topic partition; this is a limit per partition, so multiply by the number of partitions to get the total data retained for the topic.

How do you query Kafka data with Big Data SQL? To query the Kafka data you need to create a Hive table first. Let me show an end-to-end example.
I have a JSON file:

$ cat web_clicks.json
{ click_date: "38041", click_time: "67786", date: "2004-02-26", am_pm: "PM", shift: "second", sub_shift: "evening", item_sk: "396439", web_page: "646"}
{ click_date: "38041", click_time: "41831", date: "2004-02-26", am_pm: "AM", shift: "first", sub_shift: "morning", item_sk: "90714", web_page: "804"}
{ click_date: "38041", click_time: "60334", date: "2004-02-26", am_pm: "PM", shift: "second", sub_shift: "afternoon", item_sk: "151944", web_page: "867"}
{ click_date: "38041", click_time: "53225", date: "2004-02-26", am_pm: "PM", shift: "first", sub_shift: "afternoon", item_sk: "175796", web_page: "563"}
{ click_date: "38041", click_time: "47515", date: "2004-02-26", am_pm: "PM", shift: "first", sub_shift: "afternoon", item_sk: "186943", web_page: "777"}
{ click_date: "38041", click_time: "73633", date: "2004-02-26", am_pm: "PM", shift: "second", sub_shift: "evening", item_sk: "118004", web_page: "647"}
{ click_date: "38041", click_time: "43133", date: "2004-02-26", am_pm: "AM", shift: "first", sub_shift: "morning", item_sk: "148210", web_page: "930"}
{ click_date: "38041", click_time: "80675", date: "2004-02-26", am_pm: "PM", shift: "second", sub_shift: "evening", item_sk: "380306", web_page: "484"}
{ click_date: "38041", click_time: "21847", date: "2004-02-26", am_pm: "AM", shift: "third", sub_shift: "morning", item_sk: "55425", web_page: "95"}
{ click_date: "38041", click_time: "35131", date: "2004-02-26", am_pm: "AM", shift: "first", sub_shift: "morning", item_sk: "185071", web_page: "118"}

and I'm going to load it into Kafka with the standard Kafka tool "kafka-console-producer":

$ cat web_clicks.json | kafka-console-producer --broker-list bds2:9092,bds3:9092,bds4:9092,bds5:9092,bds6:9092 --topic json_clickstream

To check that the messages have appeared in the topic, you can use the following command:

$ kafka-console-consumer --zookeeper bds1:2181,bds2:2181,bds3:2181 --topic json_clickstream --from-beginning

After loading this file into the Kafka topic, I create a table in Hive. Make sure that you have oracle-kafka.jar and kafka-clients*.jar in your hive.aux.jars.path, and then run the following DDL in Hive:

hive> CREATE EXTERNAL TABLE json_web_clicks_kafka
row format serde 'oracle.hadoop.kafka.hive.KafkaSerDe'
stored by 'oracle.hadoop.kafka.hive.KafkaStorageHandler'
tblproperties(
 'oracle.kafka.table.key.type'='long',
 'oracle.kafka.table.value.type'='string',
 'oracle.kafka.bootstrap.servers'='bds2:9092,bds3:9092,bds4:9092,bds5:9092,bds6:9092',
 'oracle.kafka.table.topics'='json_clickstream'
);

hive> describe json_web_clicks_kafka;
hive> select * from json_web_clicks_kafka limit 1;

As soon as the Hive table has been created, I create the ORACLE_HIVE table in Oracle:

SQL> CREATE TABLE json_web_clicks_kafka (
 topic varchar2(50),
 partitionid integer,
 VALUE varchar2(4000),
 offset integer,
 timestamp timestamp,
 timestamptype integer
)
ORGANIZATION EXTERNAL
(TYPE ORACLE_HIVE
 DEFAULT DIRECTORY DEFAULT_DIR
 ACCESS PARAMETERS
 (
  com.oracle.bigdata.cluster=CLUSTER
  com.oracle.bigdata.tablename=default.json_web_clicks_kafka
 )
)
PARALLEL
REJECT LIMIT UNLIMITED;

Keep in mind that you also need to add oracle-kafka.jar and kafka-clients*.jar to your bigdata.properties file on the database side and on the Hadoop side. I have a dedicated blog post about how to do this here.
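Before querying, you can sanity-check that the database can actually see the Hive definition. This is a minimal sketch and an assumption on my part: it relies on the ALL_HIVE_TABLES dictionary view that ships with Big Data SQL being available in your environment, and on the Hive table name being stored in lowercase.

-- Confirm the Hive metadata is visible from the Oracle side
SELECT cluster_id, database_name, table_name
FROM   all_hive_tables
WHERE  table_name = 'json_web_clicks_kafka';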
Now we are ready to query:

SQL> SELECT * FROM json_web_clicks_kafka WHERE ROWNUM < 3;

json_clickstream 209 { click_date: "38041", click_time: "43213"..."} 0 26-JUL-17 05.55.51.762000 PM 1
json_clickstream 209 { click_date: "38041", click_time: "74669"... } 1 26-JUL-17 05.55.51.762000 PM 1

Oracle 12c provides powerful capabilities for working with JSON, such as the dot notation API. It allows us to easily query the JSON data as a structure:

SELECT t.value.click_date,
       t.value.click_time
FROM json_web_clicks_kafka t
WHERE ROWNUM < 3;

38041 40629
38041 48699

Working with AVRO messages. In many cases, customers use Avro as a flexible, self-describing format for exchanging messages through Kafka. This is also supported, and in a very easy and flexible way. I have a topic which contains Avro messages, and I define a Hive table over it:

CREATE EXTERNAL TABLE web_sales_kafka
row format serde 'oracle.hadoop.kafka.hive.KafkaSerDe'
stored by 'oracle.hadoop.kafka.hive.KafkaStorageHandler'
tblproperties(
 'oracle.kafka.table.key.type'='long',
 'oracle.kafka.table.value.type'='avro',
 'oracle.kafka.table.value.schema'='{"type":"record","name":"avro_table","namespace":"default","fields":
   [{"name":"ws_sold_date_sk","type":["null","long"],"default":null},
    {"name":"ws_sold_time_sk","type":["null","long"],"default":null},
    {"name":"ws_ship_date_sk","type":["null","long"],"default":null},
    {"name":"ws_item_sk","type":["null","long"],"default":null},
    {"name":"ws_bill_customer_sk","type":["null","long"],"default":null},
    {"name":"ws_bill_cdemo_sk","type":["null","long"],"default":null},
    {"name":"ws_bill_hdemo_sk","type":["null","long"],"default":null},
    {"name":"ws_bill_addr_sk","type":["null","long"],"default":null},
    {"name":"ws_ship_customer_sk","type":["null","long"],"default":null}
   ]}',
 'oracle.kafka.bootstrap.servers'='bds2:9092',
 'oracle.kafka.table.topics'='web_sales_avro'
);

describe web_sales_kafka;
select * from web_sales_kafka limit 1;

Here I set 'oracle.kafka.table.value.type'='avro' and I also have to specify 'oracle.kafka.table.value.schema'. After this we have structure. In a similar way I define a table in the Oracle RDBMS:

SQL> CREATE TABLE WEB_SALES_KAFKA_AVRO (
 "WS_SOLD_DATE_SK" NUMBER,
 "WS_SOLD_TIME_SK" NUMBER,
 "WS_SHIP_DATE_SK" NUMBER,
 "WS_ITEM_SK" NUMBER,
 "WS_BILL_CUSTOMER_SK" NUMBER,
 "WS_BILL_CDEMO_SK" NUMBER,
 "WS_BILL_HDEMO_SK" NUMBER,
 "WS_BILL_ADDR_SK" NUMBER,
 "WS_SHIP_CUSTOMER_SK" NUMBER,
 topic varchar2(50),
 partitionid integer,
 KEY NUMBER,
 offset integer,
 timestamp timestamp,
 timestamptype INTEGER
)
ORGANIZATION EXTERNAL
( TYPE ORACLE_HIVE
  DEFAULT DIRECTORY "DEFAULT_DIR"
  ACCESS PARAMETERS
  (
   com.oracle.bigdata.tablename: web_sales_kafka
  )
)
REJECT LIMIT UNLIMITED;

And we are good to query the data!

Performance considerations.
1) Number of partitions. This is the most important thing to keep in mind; there is a nice article about how to choose the right number of partitions. For Big Data SQL purposes I'd recommend using a number of partitions somewhat higher than the number of CPU cores on your Big Data SQL cluster.
2) Query fewer columns. Use the column pruning feature; in other words, list only the necessary columns in your SELECT and WHERE clauses. Here is an example. I've created a void PL/SQL function, which does nothing.
But PL/SQL can't be offloaded to the cell side, so all the data is moved back to the database side:

SQL> create or replace function fnull(input number) return number is
  Result number;
begin
  Result := input;
  return(Result);
end fnull;
/

After this I ran a query that touches one column and checked how much data was returned to the database side:

SQL> SELECT MIN(fnull(WS_SOLD_DATE_SK)) FROM WEB_SALES_KAFKA_AVRO;

"cell interconnect bytes returned by XT smart scan"  5741.81 MB

Then I repeated the same test case with 10 columns:

SQL> SELECT MIN(fnull(WS_SOLD_DATE_SK)),
       MIN(fnull(WS_SOLD_TIME_SK)),
       MIN(fnull(WS_SHIP_DATE_SK)),
       MIN(fnull(WS_ITEM_SK)),
       MIN(fnull(WS_BILL_CUSTOMER_SK)),
       MIN(fnull(WS_BILL_CDEMO_SK)),
       MIN(fnull(WS_BILL_HDEMO_SK)),
       MIN(fnull(WS_BILL_ADDR_SK)),
       MIN(fnull(WS_SHIP_CUSTOMER_SK)),
       MIN(fnull(WS_SHIP_CDEMO_SK))
FROM WEB_SALES_KAFKA_AVRO;

"cell interconnect bytes returned by XT smart scan"  32193.98 MB

Hopefully this test case clearly shows that you should select only the columns you actually need (a short sketch after this list of considerations shows how to read this statistic in your own session).

3) Indexes. There are no indexes other than the offset. The fact that there is a key column should not mislead you - it's not indexed. Only the offset gives you quick random access.

4) Warm up your data. If you want to read the same data quickly many times, you have to warm it up by running "select *" style queries. Kafka relies on the Linux filesystem cache, so to read the same dataset faster on subsequent runs, you have to read it a first time. Here is an example:
- I clean up the Linux filesystem cache:
$ dcli -C "sync; echo 3 > /proc/sys/vm/drop_caches"
- I run the first query:
SELECT COUNT(1) FROM WEB_RETURNS_JSON_KAFKA t
It took 278 seconds; the second and third runs took only 92 seconds.

5) Use a bigger replication factor. Here is an example. I have two tables: one created over a Kafka topic with replication factor = 1, the second over a Kafka topic with replication factor = 3.
SELECT COUNT(1) FROM JSON_KAFKA_RF1 t
This query took 278 seconds for the first run and 92 seconds for the next runs.
SELECT COUNT(1) FROM JSON_KAFKA_RF3 t
This query took 279 seconds for the first run, but only 34 seconds for the next runs.

6) Compression considerations. Kafka supports different types of compression. If you store the data in JSON or XML format, the compression rate can be significant. Here are some example numbers:

Data format and compression type    Size of the data, GB
JSON on HDFS, uncompressed          273.1
JSON in Kafka, uncompressed         286.191
JSON in Kafka, Snappy               180.706
JSON in Kafka, GZIP                 52.2649
AVRO in Kafka, uncompressed         252.975
AVRO in Kafka, Snappy               158.117
AVRO in Kafka, GZIP                 54.49

Compression may save some space on the disks, but taking into account that Kafka is primarily used as a temporal store (for a week or a month, say), I'm not sure it makes much sense. Also, you will pay some performance penalty when querying this data (and burn more CPU). I ran a query like:

SQL> select count(1) from ...

and got the following results:

Type of compression    Elapsed time, sec
uncompressed           76
snappy                 80
gzip                   92

So uncompressed is the leader; GZIP and Snappy are slower (not dramatically, but slower). Taking this into account, as well as the fact that Kafka is a temporal store, I wouldn't recommend using compression without an exceptional need.

7) Parallelize your processing. If for some reason you are using a small number of partitions, you can use the Hive metadata parameter "oracle.kafka.partition.chunk.size" to increase parallelism.
This parameter defines the size of an input split. So, if you set it to 1MB and your topic holds 4MB in total, the topic will be processed with 4 parallel threads. Here is the test case:

- Drop the Kafka topic:
$ kafka-topics --delete --zookeeper cfclbv3870:2181,cfclbv3871:2181,cfclbv3872:2181 --topic store_sales

- Create it again with only one partition:
$ kafka-topics --create --zookeeper cfclbv3870:2181,cfclbv3871:2181,cfclbv3872:2181 --replication-factor 3 --partitions 1 --topic store_sales

- Check it:
$ kafka-topics --describe --zookeeper cfclbv3870:2181,cfclbv3871:2181,cfclbv3872:2181 --topic store_sales
...
Topic:store_sales PartitionCount:1 ReplicationFactor:3 Configs:
Topic: store_sales Partition: 0 Leader: 79 Replicas: 79,76,77 Isr: 79,76,77
...

- Check the size of the input file:
$ du -h store_sales.dat
19G store_sales.dat

- Load the data into the Kafka topic:
$ cat store_sales.dat | kafka-console-producer --broker-list cfclbv3870.us2.oraclecloud.com:9092,cfclbv3871.us2.oraclecloud.com:9092,cfclbv3872.us2.oraclecloud.com:9092,cfclbv3873.us2.oraclecloud.com:9092,cfclbv3874.us2.oraclecloud.com:9092 --topic store_sales --request-timeout-ms 30000 --batch-size 1000000

- Create the Hive external table:
hive> CREATE EXTERNAL TABLE store_sales_kafka
row format serde 'oracle.hadoop.kafka.hive.KafkaSerDe'
stored by 'oracle.hadoop.kafka.hive.KafkaStorageHandler'
tblproperties(
 'oracle.kafka.table.key.type'='long',
 'oracle.kafka.table.value.type'='string',
 'oracle.kafka.bootstrap.servers'='cfclbv3870:9092,cfclbv3871:9092,cfclbv3872:9092,cfclbv3873:9092,cfclbv3874:9092',
 'oracle.kafka.table.topics'='store_sales'
);

- Create the Oracle external table:
SQL> CREATE TABLE STORE_SALES_KAFKA (
 TOPIC VARCHAR2(50),
 PARTITIONID NUMBER,
 VALUE VARCHAR2(4000),
 OFFSET NUMBER,
 TIMESTAMP TIMESTAMP,
 TIMESTAMPTYPE NUMBER
)
ORGANIZATION EXTERNAL
( TYPE ORACLE_HIVE
  DEFAULT DIRECTORY DEFAULT_DIR
  ACCESS PARAMETERS
  (
   com.oracle.bigdata.tablename=default.store_sales_kafka
  )
)
REJECT LIMIT UNLIMITED
PARALLEL;

- Run the test query:
SQL> SELECT COUNT(1) FROM store_sales_kafka;
It took 142 seconds.

- Re-create the Hive external table with the 'oracle.kafka.partition.chunk.size' parameter set to 1MB:
hive> CREATE EXTERNAL TABLE store_sales_kafka
row format serde 'oracle.hadoop.kafka.hive.KafkaSerDe'
stored by 'oracle.hadoop.kafka.hive.KafkaStorageHandler'
tblproperties(
 'oracle.kafka.chop.partition'='true',
 'oracle.kafka.partition.chunk.size'='1048576',
 'oracle.kafka.table.key.type'='long',
 'oracle.kafka.table.value.type'='string',
 'oracle.kafka.bootstrap.servers'='cfclbv3870:9092,cfclbv3871:9092,cfclbv3872:9092,cfclbv3873:9092,cfclbv3874:9092',
 'oracle.kafka.table.topics'='store_sales'
);

- Run the query again:
SQL> SELECT COUNT(1) FROM store_sales_kafka;
Now it took only 7 seconds.

A 1MB split is quite small; for big topics we recommend using 256MB.

8) Querying small topics. Sometimes you need to query really small topics (a few hundred messages, for example), but very frequently. In that case, it makes sense to create the topic with fewer partitions.
Here is the test case example:

- Create a topic with 1000 partitions:
$ kafka-topics --create --zookeeper cfclbv3870:2181,cfclbv3871:2181,cfclbv3872:2181 --replication-factor 3 --partitions 1000 --topic small_topic

- Load only one message into it:
$ echo "test" | kafka-console-producer --broker-list cfclbv3870.us2.oraclecloud.com:9092,cfclbv3871.us2.oraclecloud.com:9092,cfclbv3872.us2.oraclecloud.com:9092,cfclbv3873.us2.oraclecloud.com:9092,cfclbv3874.us2.oraclecloud.com:9092 --topic small_topic

- Create the Hive external table:
hive> CREATE EXTERNAL TABLE small_topic_kafka
row format serde 'oracle.hadoop.kafka.hive.KafkaSerDe'
stored by 'oracle.hadoop.kafka.hive.KafkaStorageHandler'
tblproperties(
 'oracle.kafka.table.key.type'='long',
 'oracle.kafka.table.value.type'='string',
 'oracle.kafka.bootstrap.servers'='cfclbv3870:9092,cfclbv3871:9092,cfclbv3872:9092,cfclbv3873:9092,cfclbv3874:9092',
 'oracle.kafka.table.topics'='small_topic'
);

- Create the Oracle external table:
SQL> CREATE TABLE small_topic_kafka (
 topic varchar2(50),
 partitionid integer,
 VALUE varchar2(4000),
 offset integer,
 timestamp timestamp,
 timestamptype integer
)
ORGANIZATION EXTERNAL
(TYPE ORACLE_HIVE
 DEFAULT DIRECTORY DEFAULT_DIR
 ACCESS PARAMETERS
 (
  com.oracle.bigdata.tablename=default.small_topic_kafka
 )
)
PARALLEL
REJECT LIMIT UNLIMITED;

- Query all rows from it:
SQL> SELECT * FROM small_topic_kafka;
It took 6 seconds.

- Create a topic with only one partition, put a single message into it, and run the same SQL query over it:
$ kafka-topics --create --zookeeper cfclbv3870:2181,cfclbv3871:2181,cfclbv3872:2181 --replication-factor 3 --partitions 1 --topic small_topic
$ echo "test" | kafka-console-producer --broker-list cfclbv3870.us2.oraclecloud.com:9092,cfclbv3871.us2.oraclecloud.com:9092,cfclbv3872.us2.oraclecloud.com:9092,cfclbv3873.us2.oraclecloud.com:9092,cfclbv3874.us2.oraclecloud.com:9092 --topic small_topic
SQL> SELECT * FROM small_topic_kafka;
Now it takes only 0.5 seconds.

9) Type of data in Kafka messages. You have a few options for storing data in Kafka messages, and you certainly want pushdown processing. Big Data SQL supports pushdown operations only for JSON. This means that everything you can expose through JSON will be pushed down to the cell side and processed there.

Example - a query which can be pushed down to the cell side (JSON):
SQL> SELECT COUNT(1) FROM WEB_RETURN_JSON_KAFKA t WHERE t.VALUE.after.WR_ORDER_NUMBER=233183247;

- A query which cannot be pushed down to the cell side (XML):
SQL> SELECT COUNT(1) FROM WEB_RETURNS_XML_KAFKA t
WHERE XMLTYPE(t.value).EXTRACT('/operation/col[@name="WR_ORDER_NUMBER"]/after/text()').getNumberVal() = 233183247;

If the amount of data is not significant, you can use Big Data SQL to process it as-is. If we are talking about big data volumes, you can process it once and convert it into a different file format on HDFS with a Hive query:

hive> select xpath_int(value,'/operation/col[@name="WR_ORDER_NUMBER"]/after/text()') from WEB_RETURNS_XML_KAFKA limit 1;

10) JSON vs AVRO format in Kafka topics. In continuation of the previous point, you may be wondering which semi-structured format to use. The answer is easy - use whatever your data source produces; there is no significant performance difference between Avro and JSON. For example, a query like:

SQL> SELECT COUNT(1) FROM WEB_RETURNS_avro_kafka t WHERE t.WR_ORDER_NUMBER=233183247;

completes in 112 seconds in the JSON case and in 105 seconds in the Avro case, while the JSON topic takes 286.33 GB and the Avro topic takes 202.568 GB. There is some difference, but not enough to be worth converting from the original format.
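Several of the measurements above (the column-pruning test in particular) rely on the "cell interconnect bytes returned by XT smart scan" session statistic. A minimal sketch for reading it yourself, assuming access to the standard v$mystat and v$statname views; the statistic name is taken verbatim from the output above.

-- Run the query of interest first, then read the session statistic
-- (or capture it before and after and compute the delta)
SELECT n.name,
       ROUND(s.value / 1024 / 1024, 2) AS mb_returned
FROM   v$statname n,
       v$mystat   s
WHERE  s.statistic# = n.statistic#
AND    n.name = 'cell interconnect bytes returned by XT smart scan';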
How do you bring data from OLTP databases into Kafka? Use GoldenGate! Oracle GoldenGate is the well-known product for capturing commit logs on the database side and bringing the changes into a target system. The good news is that Kafka can be that target system. I'll skip the detailed explanation of this feature, because it's already explained in great depth here.

Known issue: running the Kafka broker on a wildcard address. By default, Kafka doesn't use the wildcard address (0.0.0.0) for brokers and picks a specific IP address instead. This can be a problem in a multi-network Kafka cluster, where one network is used for the interconnect and a second for external connections. Luckily, there is an easy way to solve this and start the Kafka broker on the wildcard address:
1) Go to: Kafka > Instances (Select Instance) > Configuration > Kafka Broker > Advanced > Kafka Broker Advanced Configuration Snippet (Safety Valve) for kafka.properties
2) and add:
listeners=PLAINTEXT://0.0.0.0:9092
advertised.listeners=PLAINTEXT://server.example.com:9092


Oracle Big Data SQL 3.2 is Now Available

Big Data SQL 3.2 has been released and is now available for download on edelivery. This new release has many exciting new features, with a focus on simpler install and configuration, support for new data sources, enhanced security and improved performance.

Big Data SQL has expanded its data source support to include querying data streams - specifically Kafka topics. This enables streaming data to be joined with dimensions and facts in Oracle Database or HDFS. It's never been easier to combine data from streams, Hadoop and Oracle Database.

New security capabilities enable Big Data SQL to automatically leverage the underlying authorization rules on source data (i.e. ACLs on HDFS data) and then augment them with Oracle's advanced security policies. In addition, to prevent impersonation, Oracle Database servers now authenticate against Big Data SQL Server cells. Finally, secure Big Data SQL installations have become much easier to set up; Kerberos ticket renewals are now automatically configured.

There have been significant performance improvements as well. Oracle now provides its own optimized Parquet driver, which delivers a significant performance boost - both in terms of speed and the ability to query many columns. Support for CLOBs is also now available, which facilitates efficient processing of large JSON and XML documents.

Finally, there have been significant enhancements to the out-of-box experience. The installation process has been simplified, streamlined and made much more robust.
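To give a feel for the CLOB support, here is a minimal sketch of what an external table over large JSON documents might look like. This is an illustration under assumptions, not an excerpt from the release: the Hive table default.json_docs and its single large-document column are hypothetical, and the exact access-parameter requirements for CLOB columns should be checked against the Big Data SQL 3.2 documentation.

SQL> CREATE TABLE json_docs_ext (
  doc CLOB   -- large JSON documents that would overflow a VARCHAR2(4000) column
)
ORGANIZATION EXTERNAL
( TYPE ORACLE_HIVE
  DEFAULT DIRECTORY DEFAULT_DIR
  ACCESS PARAMETERS
  (
   com.oracle.bigdata.tablename=default.json_docs
  )
)
REJECT LIMIT UNLIMITED;

-- Standard JSON functions then apply to the CLOB column as usual
SELECT JSON_VALUE(doc, '$.order_number')
FROM   json_docs_ext
WHERE  ROWNUM < 10;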


Big Data

Roadmap Update for Big Data Appliance Releases (4.11 and beyond)

With the release of BDA version 4.10 we added a number of interesting features, but for various reasons we slipped behind our targets in taking up the Cloudera updates within a reasonable time. To understand what we do before we ship the latest CDH on BDA, and why we think that time is well spent, review this post. That all said, we have decided to rejigger the releases and do the following:

Focus BDA 4.11 solely on taking up the latest CDH 5.13.1 and the related OS and Java updates, thus catching up with the CDH release timeline
Move all features that were planned for 4.11 to the next release, which will then be on track to take up CDH 5.14 on our regular schedule

So what does this mean in terms of release timeframes, and what does it mean for what we talked about at Openworld for BDA (shown as an image below; review the full slide deck, including our cloud updates, at the Openworld site)?

BDA version 4.11.0 will have the following updates:
Uptake of CDH 5.13.1 - as usual, because we will be very close to the first update to 5.13, we will include that and time our BDA release as close to it as possible. This would get us to BDA 4.11.0 around mid December, assuming the CDH update keeps its dates
Updates to the latest OS versions, kernel etc., bringing us to the state of the art on Oracle Linux 6, and including all security patches
Updates to MySQL and Java, again ensuring all security patches are included

BDA version 4.12.0 will have the following updates:
Uptake of CDH 5.14.x - we are still evaluating the dates and timing for this CDH release and whether we go with the .0 or .1 version. The goal is to deliver this release 4 weeks or so after CDH drops. Expect early calendar 2018, with more precise updates coming to this forum as we know more.
Roadmap features as follows:
Dedicated Kafka cluster on BDA nodes
Full cluster on OL7 (aligning with the OL7 edge nodes)
Big Data Manager available on BDA
Non-stop Hadoop, proceeding on making more and more components HA out of the box
Fully managed BDA edge nodes
The usual OS, Java and MySQL updates per the normal release cadence
Updates to related components like Big Data Connectors etc.

All of this means that we pulled in the 4.11.0 version to the mid December time frame, while we pushed out the 4.12.0 version by no more than a week or so... So this looks like a win-win on all fronts.

Please note the Safe Harbor Statement below: The preceding is intended to outline our general product direction. It is intended for information purposes only, and may not be incorporated into any contract. It is not a commitment to deliver any material, code, or functionality, and should not be relied upon in making purchasing decisions. The development, release, and timing of any features or functionality described for Oracle's products remains at the sole discretion of Oracle.
