Today we are pleased to announce the availability of Tribuo, a Java Machine Learning (ML) library, as open source. We’re releasing it under an Apache 2.0 license on GitHub for the wider ML community to use.
In Oracle Labs' Machine Learning Research Group, we've been working on deploying ML models into large production systems for years. During this time we've noticed a crucial gap between the expectations of an enterprise system and the features provided by most ML libraries. Large software systems want to use building blocks which describe themselves and know when their inputs or outputs are invalid.
In contrast, most ML libraries expect a pile of float arrays to train a model. Then at deployment time, they expect the input to be a float array, and they produce yet another float array as the predicted output. The description of what any of these arrays mean, or what the input/output floats should look like, is left to another system: a wiki, a bug tracker, or a code comment. We don’t think developers want to add yet another database table per ML model just to explain what that array of output floats means.
Tracking models in production is also tricky because it requires external systems to keep the link between a deployed model and the training procedure and data. Usually the burden of these extra requirements falls on the teams who incorporate ML libraries into their products or systems, but in our group, we believe it's far better to embed this into the ML library itself.
Finally, most popular ML libraries are written in dynamically-typed languages like Python and R, whereas most enterprise systems are written in a statically-typed language like Java. As a result, even implementing simple ML components requires significant code maintenance and system overhead, since code has to be written in multiple languages and operate in multiple runtimes.
Our group has spent the past few years building an ML library to meet these needs. The library is called Tribuo, from the Latin for "to assign or apportion". Tribuo is written in Java, and runs on Java 8 or later. All the relevant information and documentation, along with tutorials and getting started guides, is available on Tribuo's website, tribuo.org. We've been using Tribuo in production inside Oracle for several years now, and we're excited to share it with you.
Tribuo provides the standard ML functionality that you'd expect from an ML library: classification, clustering, anomaly detection, and regression algorithms. Tribuo has data loading pipelines, text processing pipelines, and feature level transformations for operating on data once it's been loaded in. It's also got a full suite of evaluations for each of the supported prediction tasks.
Unlike other systems, Tribuo knows what its inputs are, and can describe the range and type of each input. Each feature is named, so you can't confuse it with another feature just because the input processing system gave it the same id number (in fact, in Tribuo, you don't ever need to see its id number). This means a Tribuo Model knows when you've given it features it's never seen before, which is particularly useful when working with natural language processing. Tribuo's models also know what their outputs are, and those outputs are strongly typed. No more staring at a float wondering if it's a probability, a regressed value, or a cluster id; in Tribuo each of these is a separate type, and the model can describe the types and ranges it knows about.
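To make the idea of named features and typed outputs concrete, here is a minimal sketch in plain Java. It is not Tribuo's API: `NamedFeatureSketch`, `NamedLinearModel`, `Label`, the feature names, and the weights are all hypothetical, made up for illustration. It only shows how a model keyed by feature name can report inputs it has never seen and return something more descriptive than a bare float.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical illustration only: none of these types are Tribuo's API.
class NamedFeatureSketch {

    /** A typed prediction: a class label plus its score, not a bare double. */
    static final class Label {
        final String name;
        final double score;

        Label(String name, double score) {
            this.name = name;
            this.score = score;
        }

        @Override
        public String toString() {
            return "Label(" + name + ", score=" + score + ")";
        }
    }

    /** A toy linear scorer keyed by feature name rather than array index. */
    static final class NamedLinearModel {
        private final Map<String, Double> weights;

        NamedLinearModel(Map<String, Double> weights) {
            this.weights = new HashMap<>(weights);
        }

        /** Returns the feature names in the example that the model has no weight for. */
        Set<String> unknownFeatures(Map<String, Double> example) {
            Set<String> unknown = new TreeSet<>(example.keySet());
            unknown.removeAll(weights.keySet());
            return unknown;
        }

        /** Scores only the known features and wraps the result in a typed Label. */
        Label predict(Map<String, Double> example) {
            double sum = 0.0;
            for (Map.Entry<String, Double> e : example.entrySet()) {
                Double w = weights.get(e.getKey());
                if (w != null) {
                    sum += w * e.getValue();
                }
            }
            return new Label(sum >= 0 ? "positive" : "negative", sum);
        }
    }

    /** Builds a two-feature sentiment model with made-up weights. */
    static NamedLinearModel exampleModel() {
        Map<String, Double> weights = new HashMap<>();
        weights.put("word:great", 1.5);
        weights.put("word:awful", -2.0);
        return new NamedLinearModel(weights);
    }

    public static void main(String[] args) {
        Map<String, Double> example = new HashMap<>();
        example.put("word:great", 1.0);
        example.put("word:unseen", 1.0);

        NamedLinearModel model = exampleModel();
        System.out.println("Unknown features: " + model.unknownFeatures(example));
        System.out.println("Prediction: " + model.predict(example));
    }
}
```

Because the example is a map from names to values, an unseen word shows up as an unknown name instead of silently landing in the wrong array slot, and the caller gets a `Label` back rather than a number whose meaning has to be looked up elsewhere.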
Keeping track of how any given production model was generated is tricky using other ML libraries, as their models don't store the training data source, transformations, or the training algorithm hyperparameters. There are libraries which layer tracking code on top of an existing model training script, but we feel that this information should be embedded into the model (or evaluation) itself. This training time information, coupled with the information about model inputs and outputs stored in every Tribuo model, means that they are self-describing.
Tribuo's use of strongly typed inputs and outputs means it can track the model construction process, from the point data is loaded into Tribuo, through any train/test splits or dataset transformations, through model training (recording all the hyperparameters), and finally to evaluation on a test set. This tracking (or provenance) information is baked into all the models and evaluations.
Tribuo's provenance system is for more than just tracking models in production. Each provenance can generate a configuration which precisely rebuilds the training pipeline to reproduce the model or evaluation (assuming you've still got the original data), or to build a tweaked model on new data or new hyperparameters. This means you always know what a Tribuo model is, where it came from, and how to recreate it if required. It even records all the PRNG seeds, so a model training run is perfectly reproducible.
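The reproducibility idea described here can be sketched in a few lines of plain Java. This is not Tribuo's provenance API: `ProvenanceSketch`, `Provenance`, `TrainedModel`, `train`, and `rebuild` are hypothetical names, and the toy data and hyperparameters are invented. The sketch assumes a tiny linear regressor trained with seeded stochastic gradient descent, and shows that a model which records its hyperparameters and PRNG seed can be retrained to identical weights from that record alone (given the same data).

```java
import java.util.Arrays;
import java.util.Random;

// Hypothetical illustration only: none of these types are Tribuo's API.
class ProvenanceSketch {

    /** An immutable record of everything needed to repeat a training run. */
    static final class Provenance {
        final String dataSource;
        final int epochs;
        final double learningRate;
        final long seed;

        Provenance(String dataSource, int epochs, double learningRate, long seed) {
            this.dataSource = dataSource;
            this.epochs = epochs;
            this.learningRate = learningRate;
            this.seed = seed;
        }

        @Override
        public String toString() {
            return "Provenance(data=" + dataSource + ", epochs=" + epochs
                    + ", learningRate=" + learningRate + ", seed=" + seed + ")";
        }
    }

    /** A model that carries its own provenance alongside its parameters. */
    static final class TrainedModel {
        final double[] weights;
        final Provenance provenance;

        TrainedModel(double[] weights, Provenance provenance) {
            this.weights = weights;
            this.provenance = provenance;
        }
    }

    /** Fits a toy linear regressor with seeded stochastic gradient descent. */
    static TrainedModel train(Provenance p, double[][] features, double[] targets) {
        Random rng = new Random(p.seed);
        double[] w = new double[features[0].length];
        for (int j = 0; j < w.length; j++) {
            w[j] = rng.nextGaussian() * 0.01; // seeded random initialisation
        }
        for (int epoch = 0; epoch < p.epochs; epoch++) {
            for (int step = 0; step < features.length; step++) {
                int i = rng.nextInt(features.length); // seeded example sampling
                double prediction = 0.0;
                for (int j = 0; j < w.length; j++) {
                    prediction += w[j] * features[i][j];
                }
                double error = prediction - targets[i];
                for (int j = 0; j < w.length; j++) {
                    w[j] -= p.learningRate * error * features[i][j];
                }
            }
        }
        return new TrainedModel(w, p);
    }

    /** Re-runs training using only what the model recorded about itself. */
    static TrainedModel rebuild(TrainedModel model, double[][] features, double[] targets) {
        return train(model.provenance, features, targets);
    }

    public static void main(String[] args) {
        double[][] features = {{1.0, 0.0}, {0.0, 1.0}, {1.0, 1.0}};
        double[] targets = {2.0, 3.0, 5.0};

        Provenance p = new Provenance("in-memory toy data", 20, 0.1, 42L);
        TrainedModel original = train(p, features, targets);
        TrainedModel rebuilt = rebuild(original, features, targets);

        System.out.println(original.provenance);
        System.out.println("Identical weights: " + Arrays.equals(original.weights, rebuilt.weights));
    }
}
```

The design point is that `rebuild` takes nothing but the model and the data: every other input to training lives in the provenance record, so there is no separate script or wiki page to keep in sync.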
Tribuo provides interfaces to ONNX Runtime, TensorFlow and XGBoost. This allows models stored in ONNX format, or trained in TensorFlow or XGBoost, to be deployed alongside Tribuo's native models. Our group contributes to all three projects: we wrote ONNX Runtime's Java support, have contributed patches to ensure XGBoost works across platforms and Java versions, and have contributed training support to the upcoming TensorFlow JVM releases.
Our TensorFlow and XGBoost interfaces also allow the training of Tribuo models using these systems. When trained through Tribuo they provide all the type safety and provenance benefits that every Tribuo model has. The XGBoost support is fully functional and we've been using it in production internally for years. TensorFlow support is still experimental as we're awaiting the first release from the TensorFlow JVM SIG before Tribuo's TF API can be finalised. That first TF JVM release will also enable training TF models in Java without defining anything in Python first.
We're excited to share Tribuo with the world, and we hope to build and contribute to the Machine Learning ecosystem on the Java platform. Tribuo's development has always been led by our users' needs internally, and we'd like to continue this approach by incorporating community feedback. We accept code contributions to Tribuo under the Oracle Contributor Agreement, and more details are available in our GitHub docs.
Adam is a Machine Learning researcher, who finished his PhD in Information
Theory and feature selection in 2012. His thesis won the British Computer
Society Distinguished Dissertation award in 2013. He's interested in
distributed machine learning, Bayesian inference, and structure learning. And
writing code that requires as many GPUs as possible, because he enjoys building
He's the lead developer of Tribuo, a Java Machine Learning library.