Today I'm going to explain how to run Big Data SQL and YARN as multiple tenants on the same cluster. I think it's a quite common scenario: you may want to store historical data and query it with Big Data SQL, and you may also want to run ETL jobs within the same cluster. If so, resource management becomes one of the main requirements. In other words, you have to guarantee certain performance regardless of the other jobs. For example:
1) You may need to finish your ETL as fast as possible. In this case MapReduce, which runs on YARN, has the higher priority.
2) You build critical reports with Big Data SQL, and in this case Big Data SQL has to have higher priority than YARN.
Life without a resource manager.
Let's start from the beginning. I have MapReduce (YARN) jobs and Big Data SQL queries running on the same cluster. This works perfectly fine unless you exceed your CPU or IO capacity. Let me give you an example. I picked a small dataset to query (my goal was not to exceed the CPU limit):
1) I ran the MapReduce job (a Hive query) and it finished in 165 seconds.
2) I ran the Big Data SQL query and it finished in 30 seconds.
3) I ran Big Data SQL together with Hive: BDS finished in 31 seconds, Hive in 170 seconds. Almost the same results!
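A quick back-of-the-envelope check on the timings above shows how little overhead concurrent execution added while the cluster stayed below its CPU limit:

```python
# Elapsed times (seconds) from the runs above.
hive_alone, bds_alone = 165, 30
hive_concurrent, bds_concurrent = 170, 31

# Relative slowdown of each engine when both run at the same time.
hive_overhead = (hive_concurrent - hive_alone) / hive_alone
bds_overhead = (bds_concurrent - bds_alone) / bds_alone

print(f"Hive overhead: {hive_overhead:.1%}")  # Hive overhead: 3.0%
print(f"BDS overhead: {bds_overhead:.1%}")    # BDS overhead: 3.3%
```

Only around 3% overhead for each engine, which is why no resource manager is needed as long as the workloads don't contend for CPU or IO.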
But as soon as you run a query that reaches the CPU boundary, your engines (Big Data SQL and YARN) start to share the CPU between the two processes. A resource manager will not increase your CPU capacity, but it lets you define how resources are shared between those two processes.
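To illustrate the proportional sharing (a sketch with assumed share values, not Cloudera defaults): cgroup CPU shares are relative weights, so under full contention each process group gets a fraction of the CPU proportional to its share:

```python
# Hypothetical cpu.shares values for the two process groups.
shares = {"big_data_sql": 750, "yarn": 250}

total = sum(shares.values())
# Under full CPU contention each group gets share/total of the CPU;
# when one group is idle, the other may use the spare capacity.
fractions = {name: s / total for name, s in shares.items()}

print(fractions)  # {'big_data_sql': 0.75, 'yarn': 0.25}
```

Note that the shares only bite under contention; they cap nothing when the cluster has spare CPU.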
How to enable Resource Sharing between YARN and Big Data SQL.
Cloudera has a very powerful mechanism for sharing resources: the "Static Service Pool". Under the hood it uses Linux cgroups, which define the proportion of CPU and IO resources between processes. The easiest way to enable it is to use Cloudera Manager:
1) Go to the "Cluster -> Static Service Pool":
2) Go to the configuration:
3) Enable Cgroup Management and set Cgroup CPU Shares and Cgroup IO Weight
It's interesting that the Linux CPU share may vary between 2 and 262144, while the IO weight varies between 100 and 1000. I recommend changing those two settings synchronously (in other words, keep both values between 100 and 1000). After restarting the corresponding processes, Resource Management will be enabled.
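A minimal sketch of that recommendation (the 70/30 split and the helper name are assumptions for illustration, not values from Cloudera Manager): pick a percentage per pool, multiply by 10, and use the same number for both settings, since 100–1000 is legal for both the CPU share and the IO weight:

```python
# Valid ranges for the two cgroup settings, per the text above.
CPU_SHARE_RANGE = (2, 262144)
IO_WEIGHT_RANGE = (100, 1000)

def pool_settings(percent: int) -> dict:
    """Turn a desired percentage (10..100) into one value that is
    legal for both Cgroup CPU Shares and Cgroup IO Weight."""
    value = percent * 10  # 10% -> 100, 100% -> 1000
    assert IO_WEIGHT_RANGE[0] <= value <= IO_WEIGHT_RANGE[1]
    assert CPU_SHARE_RANGE[0] <= value <= CPU_SHARE_RANGE[1]
    return {"cpu_shares": value, "io_weight": value}

# Hypothetical 70/30 split between Big Data SQL and YARN.
print(pool_settings(70))  # {'cpu_shares': 700, 'io_weight': 700}
print(pool_settings(30))  # {'cpu_shares': 300, 'io_weight': 300}
```

Keeping one number for both settings makes the CPU and IO proportions consistent, which is the point of changing the two handlers synchronously.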
Trust, but verify.
That's all theory, and every theory has to be proven by concrete examples. I played a bit with Static Service Pools in the context of multitenant Big Data SQL and a Hive query (read: YARN) on the same cluster. For benchmarking I picked the simplest query, one which uses neither Storage Indexes nor Predicate Push Down and which returns exactly 0 rows:
In the case of Hive this query looks very similar:
First of all, I ran the Big Data SQL query and the Hive query sequentially. Below you can find the CPU and IO profiles for Big Data SQL and Hive:
The Hive query was done in 890.628 seconds, BDS in 391 seconds.
I started my test by running those queries without any resource management, just launching the two statements simultaneously:
Big Data SQL (BDS) took 731 seconds.
Hive finished within 1434.75 seconds.
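Comparing these numbers with the sequential runs above quantifies the contention:

```python
# Elapsed times (seconds) from the sequential and concurrent runs above.
bds_alone, hive_alone = 391, 890.628
bds_shared, hive_shared = 731, 1434.75

print(f"BDS slowdown: {bds_shared / bds_alone:.2f}x")    # BDS slowdown: 1.87x
print(f"Hive slowdown: {hive_shared / hive_alone:.2f}x") # Hive slowdown: 1.61x
```

Both engines slow down substantially once they fight for the same CPU and disks, with no way to control which one wins.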
After this, I enabled cgroup resource management (via Static Service Pool) and ran the Hive and Big Data SQL queries simultaneously.
In my tests I only played with the CPU shares, which indirectly handle IO as well. I collected the results into the table below:
| CPU shares configuration (BDS/Hive) | Big Data SQL, elapsed time (seconds) | Hive, elapsed time (seconds) |
|---|---|---|
This table shows:
1) Static Service Pool works
2) It's a coarse-grained handle. In other words, you couldn't expect exact proportions from it.