Over the past couple of weeks I have been looking at one of the Apache open source projects called Zeppelin. It’s a new style of application called a “notebook” which typically runs within your browser. The idea behind notebook-style applications like Zeppelin is to deliver an adhoc data-discovery tool - at least that is how I see it being used. Like most notebook-style applications, Zeppelin provides a number of useful data-discovery features such as:
Zeppelin is essentially a scripting environment for running ordinary SQL statements along with a lot of other languages such as Spark, Python, Hive, R etc. These are controlled by a feature called “interpreters” and there is a list of the latest interpreters available here.
A good example of a notebook-type of application is R Studio which many of you will be familiar with because we typically use it when demonstrating the R capabilities within Oracle Advanced Analytics. However, R Studio is primarily aimed at data scientists whilst Apache Zeppelin is aimed at other types of report developers and business users although it does have a lot of features that data scientists will find useful.
What’s a good use case for Zeppelin? Well, what I like about Zeppelin is that you can quickly and easily create a notebook, or workflow, that downloads a log file from a URL, reformats the data in the file and then displays the resulting data set as a graph/table.
Nothing really earth-shattering in that type of workflow except that Zeppelin is easy to install, it’s easy to setup (once you understand its architecture), and it seems to be easy to share your results. Here’s a really simple workflow described above that I built to load data from a file, create an external table over the data file and then run a report:
This little example shows how notebooks differ from traditional BI tools. Each of the headings in the above image (Download data from web url, Create directory to data file location, Drop existing staging table etc etc) is a separate paragraph within the “Data Access Tutorial” notebook.
The real power is that each paragraph can use a different language such as SQL, or java, shell scripting or python etc etc. In the workbook shown above I start by running a shell script that pulls a data file from a remote server. Then using a SQL paragraph I create a directory object to access the data file. The next SQL paragraph drops my existing staging table and the subsequent SQL paragraph creates the external table over the data file. The final SQL paragraph looks like this:
%osqlselect * from ext_bank_data
where %osql tells me the language, or interpreter, I am using which in this case is SQL connecting to a specific schema in my database.
You can even build relatively simple briefing books containing data from different data sets and even different data sources (Zeppelin supports an ever growing number of data sources) - in this case I connected Zeppelin to two different schemas in two different PDBs:
What’s really nice is that I can even view these notebooks on my smartphone (iPhone) as you can see below. The same notebook shown above appears on my iPhone screen in a vertical layout style to make best use of the screen real estate:
I am really liking Apache Zeppelin because it’s so simple to setup (I have various versions running on Mac OSX and Oracle Linux) and start. It has just enough features to be very useful and not overwhelming. I like the fact that I can create notebooks, or reports, using a range of different languages and show data from a range of different schemas/PDBs/database alongside each other. It is also relatively easy to share those results. And I can open my notebooks (reports) on my iPhone.
There is a limited set of available visualizations within the notebook (report) editor when you are using a SQL-based interpreter (connector). Essentially you have a basic, scrollable table and five types of graph to choose for viewing your data. You can interactively change the layout of the graph by clicking on the “settings” link but there are no formatting controls to alter the x or y labels - if you look carefully at the right-hand area graph in the first screenshot you will probably spot that the time value labels on the x-axis overlap each other.
Now, this may be obvious but I would not call Zeppelin a data integration tool nor a BI tool for reasons that will become clear during the next set of blog posts.
Having said that, overall, Zeppelin is a very exciting and clever product. It is relatively easy to setup connections to your Oracle Database, the scripting framework is very powerful and there are visualization features are good enough. It's a new type of application that is just about flexible enough for data scientists, power users and report writers.
In my next series of blog posts, which I aiming to write over the next couple of weeks, I will explain how to download and install Apache Zeppelin, how to setup connections to an Oracle Database and how to use some of the scripting features to build reports similar to the ones above. If you are comfortable with writing your own shell scripts, SQL scripts, markup scripts for formatting text then Zeppelin is very flexible tool.
If you are already using Zeppelin against your Oracle Database and would like to share your experiences that would be great - please use the comments feature below or feel free to send me an email: firstname.lastname@example.org.
(image at top of post is courtesy of wikipedia).