Monday May 02, 2016
Friday Apr 29, 2016
By Alexey Filanovskiy-Oracle on Apr 29, 2016
Monday Apr 25, 2016
By Klaker-Oracle on Apr 25, 2016
Welcome to the third post in this deep-dive series on SQL pattern matching using the MATCH_RECOGNIZE feature that is part of Database 12c.
In the first part of this series we looked at a wide range of topics including ensuring query consistency, how to correctly use predicates and how to manage sorting. In the second part we looked at using the built-in measures to understand how a data set is matched to a pattern.
In this post I am going to review the concepts of greedy and reluctant quantifiers. I will breakdown this down into a number of areas: 1) Overview of regular expressions, 2) understanding quantifiers, and 3) greedy vs. reluctant quantifiers. The examples in this post use the built-in measures to help show the difference between greedy and reluctant matching. If you are not familiar with the MATCH_NUMBER() function or the CLASSIFIER() function then please take some time to read the second post in this series.
Overview of regular expressions[Read More]
Thursday Apr 21, 2016
By Yasin.Baskan-Oracle on Apr 21, 2016
Parallel execution uses the producer/consumer model when executing a SQL statement. The execution plan is divided up into DFOs, each DFO is executed by a PX server set. Data is sent from one PX server set (producer) to another PX server set (consumer) using different types of distribution methods.
In this post we will look at the most common distribution methods used in joins in Oracle Database 11gR2. The distribution method is chosen by the optimizer and depends on the DOP, the cardinality of the tables, the number of partitions, etc...
Monday Apr 18, 2016
By Alexey Filanovskiy-Oracle on Apr 18, 2016
Using bloom filter for join operations:
Friday Apr 15, 2016
By Klaker-Oracle on Apr 15, 2016
We are starting to see a significant change in the way we analyze data as a result of the growth of interest in big data and the newer concept of Internet of Things. Ever since databases were first created everyone has been obsessed, quite rightly so, with ensuring queries returned the correct answer - i.e. precise, accurate answers. This key requirement is derived from the need to run operational, transactional applications. If we check our bank balance online we want the figure we see to be accurate right down to the last cent and for a good reason. Yet increasingly both as part of our online as well as offline experiences we deal with numbers that are not 100% accurate and somehow we manage to make good use of these approximate answers. Here are a couple of examples of where we already using approximations: route planning on our smartphones and crowd counting information in newspapers...[Read More]
Thursday Apr 14, 2016
By Alexey Filanovskiy-Oracle on Apr 14, 2016
Big Data SQL is the way to acsess data that stored in HDFS through Oracle RDBMS, using Oracle external table mechanism. In context of security "table" is key word, which means that you may apply standard security approches to those tables. Today I want to give you couple examples with:
- Oracle Data Redaction features.
Oracle Data Redaction
I don't want to replace documentation of the Oracle Data Redaction within this blog, all available information you could find there, but just let me remind you the main idea of this feature. Very often databases contain sensitive data, like credit card number, SSN or other personal information. It could be useful to have this data in unchanged format for resolve different issue with billing department, but at the same time another departments (like call center) may need only part of this information (like 4 last digits of the credit cards) and for security complience you are not able to show them original data.
Wednesday Apr 13, 2016
By Klaker-Oracle on Apr 13, 2016
Yes it's that time of year again! If you have a story to tell about data warehousing, big data and SQL analytics then we want to hear from you because the OpenWorld 2016 call for presentations is now open. Mark your calendars: this year Oracle OpenWorld conference will be held in September on 18 - 22, 2016 at the Moscone Center in San Francisco.
We are looking for proposals that describe insights and improvements that attendees can put to use in their own jobs: exciting innovations, strategies to modernize their business, different or easier ways to implement key features, unique use cases, lessons learned, the best of best practices...[Read More]
Tuesday Apr 12, 2016
By Klaker-Oracle on Apr 12, 2016
Welcome to the second post in this deep dive series on SQL pattern matching using the new MATCH_RECOGNIZE feature that is part of Database 12c. In the first part of this series we looked at the areas of ensuring query consistency, how to correctly use predicates and how to manage sorting.
In this post I am going to review the two built-in measures that we have provided to help you understand how your data set is mapped to the pattern that you have defined. This post will breakdown into three areas: 1) a review of the built-in measures, 2) understanding how to control the output (number of rows returned) and lastly I will bring these two topics together with some examples...[Read More]
Thursday Mar 31, 2016
By Yasin.Baskan-Oracle on Mar 31, 2016
Before 11gR2 Parallel Execution (PX) in the Oracle Database used direct reads to scan large tables and bypassed the buffer cache. As memory sizes got bigger and SGA sizes got bigger as a result, IMPQ was introduced in 11gR2 to make use of the SGA for data warehouse workloads. IMPQ is a performance enhancement feature that caches the data in the SGA so that further scans against the same data can avoid IO and read the data from the buffer cache much more faster. IMPQ makes use of the aggregated buffer cache across all nodes in a RAC cluster so that more data can be cached than can be done in a single server's memory. It is not a RAC only feature, it kicks in for single instance databases too.
As you may already know PX uses granules to scan objects. Each object in a statement is divided into granules and granules are assigned to PX servers so that each process can read a part of the object. A granule can be a partition or a range of blocks depending on the physical definition of the object, the degree of parallelism (DOP), and the execution plan.
With IMPQ, for the first statement that accesses an object each PX server reads its granules and caches them in the buffer cache of the node it is on. In the following example a table is divided into four granules, PX server P1 on node 1 reads two of them and stores them in the buffer cache. PX server P2 on node 2 does the same for the other two granules. Note that the number of granules changes based on the object size and the DOP, I show four granules here for demonstration purposes.
Now, half of the table is cached in node 1 and the other half is cached on node 2. When another parallel statement comes and accesses the same table, as you can see in the below picture the PX servers assigned to this statement (P3 and P4) will read the data from their node's buffer cache instead of going to storage and doing IO. Note that cache fusion is not used to read data from other node's memory, all buffer cache access is local.
In the following example, table T is 700MB and I have a 2-node RAC database, each node with a 500MB buffer cache. We see that the first execution did close to 90K physical reads and finished in around 4 seconds. The second execution shows no physical reads and finished in less than a second. This means the table was cached during the first execution and read from the cache during the second execution. Note that the first execution of the statement can take longer than without IMPQ because of the overhead of populating the buffer cache.
SQL> set autot on stat SQL> select /*+ parallel(2) */ count(*) from t; COUNT(*) ---------- 179982 Elapsed: 00:00:03.58 Statistics ---------------------------------------------------------- 7 recursive calls 4 db block gets 90486 consistent gets 89996 physical reads 248 redo size 544 bytes sent via SQL*Net to client 551 bytes received via SQL*Net from client 2 SQL*Net roundtrips to/from client 0 sorts (memory) 0 sorts (disk) 1 rows processed SQL> select /*+ parallel(2) */ count(*) from t; COUNT(*) ---------- 179982 Elapsed: 00:00:00.68 Statistics ---------------------------------------------------------- 6 recursive calls 0 db block gets 90485 consistent gets 0 physical reads 248 redo size 544 bytes sent via SQL*Net to client 551 bytes received via SQL*Net from client 2 SQL*Net roundtrips to/from client 0 sorts (memory) 0 sorts (disk) 1 rows processed
If IMPQ is enabled there are two decisions to be made for a statement to use IMPQ.
1. First the query coordinator (QC) decides to use IMPQ for an object or not. This decision is based on the object size to be scanned. By default if the object size is between 2% and 80% of the total buffer cache size in the cluster the object is a candidate for IMPQ. If the object is larger IMPQ will not be used and that object will be accessed using direct reads bypassing the buffer cache. For smaller objects IMPQ will not be used but the object can be cached like an object accessed by a serial query, Cache Fusion will be used to transfer data between nodes when necessary. For object size Oracle looks at the optimizer stats if they have been gathered, if not it looks at the actual object size.
If the query is using partition granules the object size is the total size of the partitions after static partition pruning. For dynamic partition pruning where the exact partitions to be scanned are determined at runtime the total table size is used in the IMPQ decision.
The total buffer cache size is calculated as the size of the buffer cache of the node where the QC resides multiplied by the number of nodes.
2. After the decision to use IMPQ is made the QC decides which granules will be cached on which node. It affinitizes each granule to a node so that subsequent accesses to the same granules are done on the same nodes. For partition granules this affinity is done based on the partition number, for block range granules it is done based on the data block address of the first block in the granule.
At this point the granules are assigned to PX servers on each node. Here each PX server makes the decision to use direct reads or buffered reads based on a set of heuristics and tries to cache as much data as possible. Depending on this decision, it is still possible to see direct reads even if the table is a candidate for IMPQ.
The smart scan capability of Exadata kicks in if an object is scanned using direct reads. If IMPQ tries to cache an object the object will be scanned using buffered reads and smart scan will not kick in. In this case you will see the wait event cell multiblock physical read instead of cell smart table scan. Not using smart scans means you will be using the database nodes to do filtering and projection. This can cause a query to run slower if that query benefits a lot from smart scan.
To prevent this PX servers use adaptive offloading on Exadata. The first scan of an object bypasses offloading and populates the buffer cache. In subsequent scans of the same object each PX server calculates the buffer cache scan rate and smart scan rate. Depending on the ratio of these rates PX servers will favor direct reads or buffer cache reads. So, if smart scan is very fast for your query you will see that PX servers will use more direct reads than buffer cache reads. For example, if the ratio of the smart scan rate to buffer cache read rate is 3:1 PX servers will scan 3x number of granules using direct reads compared to the number of granules scanned from the buffer cache. This rate comparison is done for each query execution by PX servers.
As of 184.108.40.206 IMPQ assumes the RAC cluster is homogenous which means every RAC instance has the same buffer cache size. If you have instances with different buffer cache sizes you can see IMPQ being used or not depending on which instance your session is on.
For example, if you have a 2-node RAC with node 1 having a 300MB buffer cache and node 2 having a 700MB buffer cache, IMPQ will use 600MB as the total buffer cache size if you are on node 1, it will use 1400MB if you are on node 2. So, if you query a table of 700MB, IMPQ may or may not kick in, if you are on node 2 it will try to cache the table, if you are on node 1 it will not.
To make use of IMPQ efficiently we recommend sizing the buffer cache equally on all instances.
As of 220.127.116.11 granule-node affinity in IMPQ depends on the statement DOP. If you query a table with DOP=4 and then query the same table with DOP=8 the node affinity may not be the same. This means the second query may do IO to read the table even if the table is totally cached. This is because the node affinity depends on the starting data block address of a granule, DOP change means the number and size of granules can change which means the starting data block address can change.
Auto DOP automatically rounds up the DOP to a multiple of the number of instances so that every instance has equal number of PX servers. This means the table will be cached on all instances uniformly. If you are using hints to specify the DOP or if you are using Database Resource Manager(DBRM) to limit the DOP, make sure to set the DOP to a multiple of the number of instances to get the same behavior, this is because hints and DBRM override Auto DOP.
Tuesday Mar 22, 2016
By Alexey Filanovskiy-Oracle on Mar 22, 2016
Monday Mar 21, 2016
By Klaker-Oracle on Mar 21, 2016
This is the start of a series of posts based on a presentation that I put together for the recent annual BIWA conference at Oracle HQ. The Oracle BI, DW and Analytics user community always puts on a great conference and this year was the best yet. You can download any or all of the presentations from this year’s conference by following this link. My pattern matching deep dive presentation started life about a year ago as a post covering some of the new keywords in the explain plan that are linked to pattern matching, see here. It has now expanded to cover a much wider range of topics.
The aim of this group of posts is to help you understand the underlying mechanics of the MATCH_RECOGNIZE clause. During these posts we will explore key concepts such as: how to get consistent results, using built-in debugging functions, deterministic vs. non-deterministic state machines, back-tracking (what is it and how to identify when it is occurring), and finally greedy vs. reluctant quantifiers. If you need a quick refresher on how MATCH_RECOGNIZE works then I would recommend that you take a look at the following links[Read More]
Thursday Mar 17, 2016
By Mgubar-Oracle on Mar 17, 2016
In summary, Big Data SQL 3.0:
- Expands support for Hadoop platforms - covering Hortonworks HDP, Cloudera CDH on commodity hardware as well as on Oracle Big Data Appliance
- Expands support for database platforms - covering Oracle Database 12c on commodity hardware as well as on Oracle Exadata
- Improves performance through new features like Predicate Push-Down on top of Smart Scan and Storage Indexes
To learn more:
- Big Data SQL Landing Page on oracle.com
- Data Sheet for Big Data SQL 3.0
- White paper on Big Data SQL
- Big Data SQL Documentation
Wednesday Mar 16, 2016
By Jean-Pierre Dijcks-Oracle on Mar 16, 2016
Oracle Maximum Availability Architecture (MAA) is Oracle's best practices blueprint based on proven Oracle high availability technologies, along with expert recommendations and customer experiences. MAA best practices have been highly integrated into the design and operational capability of Oracle Big Data Appliance, and together they provide the most comprehensive highly available solution for Big Data.
Oracle MAA papers are published at the MAA home page of the Oracle Technology Network (OTN) website. Oracle Big Data Appliance (BDA) Maximum Availability Architecture is a best-practices blueprint for achieving an optimal high-availability deployment using Oracle high-availability technologies and recommendations.
The Oracle BDA MAA exercise for this paper was executed on Oracle Big Data Appliance and Oracle Exadata Database Machine to validate high availability and to measure downtime in various outage scenarios. The current release of this technical paper covers the first phase of the overall Oracle BDA MAA project. The project comprises the following two phases:
Phase 1: High Availability and Outage scenarios at a single site
Phase 2: Disaster Recovery Scenarios across multiple sites
Tuesday Mar 15, 2016
The data warehouse insider is written by the Oracle product management team and sheds lights on all thing data warehousing and big data.
- Big Data SQL Quick Start. Partition Pruning - Part7.
- Big Data SQL Quick Start. Predicate Push Down - Part6.
- SQL Pattern Matching Deep Dive - Part 3, greedy vs. reluctant quantifiers
- Common Distribution Methods in Parallel Execution
- Big Data SQL Quick Start. Joins. Bloom Filter and other features - Part5.
- Is an approximate answer just plain wrong?
- Big Data SQL Quick Start. Security - Part4.
- Oracle OpenWorld 2016 call for papers is OPEN!
- SQL Pattern Matching Deep Dive - Part 2, using MATCH_NUMBER() and CLASSIFIER()
- In-Memory Parallel Query