Machine Learning | Oracle Labs - IRML Blog1
https://blogs.oracle.com/irml/machine-learning-2/rss
Wed, 26 Jun 2019 20:47:19 +0000
FeedCreator 1.7.3
MLRG is hiring: Machine Learning Researcher
https://blogs.oracle.com/irml/irml-is-hiring%3A-machine-learning-researcher
<p>The Mission of Oracle Labs is straightforward: Identify, explore, and transfer new technologies that have the potential to substantially improve Oracle's business. The Machine Learning Research Group (MLRG, formerly the Information Retrieval and Machine Learning group) in Oracle Labs is working to develop, scale and deploy Machine Learning throughout all of Oracle's Products and Services.</p>
<p>Our team conducts research related to the development and application of techniques and technologies for Information Retrieval, Machine Learning and Statistical Natural Language Processing (NLP) in a variety of topic areas of interest to Oracle and its customers. Along with developing new techniques, researchers in MLRG are afforded the opportunity to work with Oracle product groups to transfer them into commercial products.</p>
<p>Oracle Labs encourages the publication of results in academic conferences and journals. At Oracle Labs, you won't get lost in the crowd and you'll have a chance to drive your own research agenda.</p>
<p>MLRG has Research Scientist positions available in Burlington, Massachusetts. Research areas of interest to the group include:</p>
<ul>
<li>Statistical NLP including named entity recognition, entity linking, and relationship extraction in a variety of domains (general text, medical, e-commerce)</li>
<li>Classification techniques, especially text classification. We deal with many problems where it is difficult or expensive to get training data and where there can be large class imbalances or large numbers of classes</li>
<li>Feature selection and structure learning</li>
<li>Deep learning, especially for image classification in commercial domains and for statistical NLP</li>
<li>Model interpretation and visualization. How do we help researchers, data scientists, and product developers understand the decisions that a model is making?</li>
<li>Privacy-preserving machine learning</li>
<li>Scalable machine learning</li>
</ul>
<p>Candidates must have a Ph.D. degree in Computer Science, Machine Learning, or a related technical field. Education and/or 3-5 years of experience in the following areas is required:</p>
<ul>
<li>Graduate-level research experience in one of the areas described above</li>
<li>Using standard machine learning and/or NLP toolkits</li>
<li>Proficiency in modern programming and scripting languages, such as Java, Scala, C++, or Python</li>
<li>Proficiency in relevant statistical mathematics</li>
<li>Ability to develop novel ML and/or NLP techniques and to apply them to real-world problems faced by Oracle's product groups and customers</li>
</ul>
<p>In addition, education or experience in the following areas is preferred:</p>
<ul>
<li>Applying existing machine learning and natural language processing techniques and technologies to real-world problems</li>
<li>Database technologies such as Oracle and MySQL</li>
<li>Big Data platforms such as Hadoop, Spark, or MPI</li>
<li>Familiarity with special-purpose computing architectures as applied to machine learning</li>
</ul>
<p>Researchers in Oracle Labs are expected to work closely with Oracle product groups to transfer new approaches and technologies into Oracle's products. Researchers are expected to innovate and create novel technologies, patent new IP, and publish research results in academic conferences and journals.</p>
<p>You can learn more about the Machine Learning Research Group at our <a class="external-link" href="https://labs.oracle.com/pls/apex/f?p=labs:49:::::P49_PROJECT_ID:7" rel="nofollow">project page</a>.</p>
<p>To apply, search for requisition number 180002UE on the <a href="https://oracle.taleo.net/careersection/2/jobsearch.ftl?lang=en&alt=1">Oracle careers site</a>, or go directly to the <a href="https://oracle.taleo.net/careersection/2/jobdetail.ftl?job=180002UE&lang=en">job listing</a>.</p>
Machine Learning
Tue, 06 Dec 2016 15:22:00 +0000
https://blogs.oracle.com/irml/irml-is-hiring%3A-machine-learning-researcher
Adam Pocock
Exponential stochastic cellular automata: How we achieved more than 1 billion tokens per second ...
https://blogs.oracle.com/irml/exponential-stochastic-cellular-automata%3A-how-we-achieved-more-than-1-billion-tokens-per-second-for-lda
<p>This blog post is about our AISTATS 2016 paper on scaling inference for LDA and related models, which you can download <a href="http://jmlr.org/proceedings/papers/v51/zaheer16.pdf">here</a>. Manzil Zaheer, who joined us from CMU for the summer of 2015, contributed to this work during his internship.</p>
<p>As computers become more parallel rather than faster, Bayesian inference is doomed unless it can exploit new computational resources like multicores and GPUs. Stochastic gradient descent (SGD) has been crucial in enabling other areas of machine learning to scale (parameter estimation in Markov random fields, deep learning, training classifiers, etc.), but unfortunately it is difficult to apply to Bayesian inference. <a href="http://www.jmlr.org/papers/v14/hoffman13a.html">Stochastic variational inference</a> (SVI) and <a href="http://papers.nips.cc/paper/4883-stochastic-gradient-riemannian-langevin-dynamics-on-the-probability-simplex">Langevin dynamics</a> have shown some promise, but they do not always work in high dimensions. Further, the latter completely breaks down in the context of batch-based SGD as the noise dominates the gradient.</p>
<p>Might <em>maximum a posteriori</em> (MAP) inference in a Bayesian model suffice instead? Sometimes this is appropriate, especially for large datasets for which we can appeal to the concentration of measure. So if we're willing to accept MAP in lieu of the full posterior then we have some options for scalable inference. Suppose, for example, that we want to scale a model such as <a href="http://jmlr.org/papers/v3/blei03a.html">latent Dirichlet allocation</a> (LDA) to a large dataset. There are several commonly used tools available, including <a href="http://spark.apache.org/">Spark's MLlib</a> and <a href="http://www.vldbarc.org/pvldb/vldb2010/pvldb_vol3/R63.pdf">YahooLDA</a>, which aim to scale up LDA. MLlib has two algorithms available, online variational inference and expectation maximization (EM), while YahooLDA employs an approximate collapsed Gibbs sampler (CGS). All of these algorithms are perfectly reasonable, but each has its own set of computational properties.</p>
<p>In CGS we first integrate out some of the parameters (hence "collapsed") and then perform Gibbs sampling on the remaining latent variables. That is, the sampler visits each latent variable in turn, sampling a new assignment to the variable and then updating the appropriate sufficient statistics before moving on to the next variable. Note that during this process we only need to keep track of a single assignment for each latent variable. Thus, CGS is appealing because it enables the use of sparse data structures which reduce memory and communication bottlenecks in distributed environments. However, because the process of collapsing usually causes dependencies between all of the latent variables, CGS is inherently sequential and thus difficult to parallelize.</p>
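<p>To make the sequential dependency concrete, here is a minimal sketch of one CGS sweep for LDA in plain Python. The function names, the count layout (<code>n_dk</code>, <code>n_kw</code>, <code>n_k</code>) and the symmetric hyperparameters <code>alpha</code> and <code>beta</code> are illustrative assumptions, not code from YahooLDA or from our implementation.</p>

```python
import random

def init_cgs(docs, num_topics, vocab_size, rng):
    """Assign a random topic to every token and build the matching counts."""
    z = [[rng.randrange(num_topics) for _ in doc] for doc in docs]
    n_dk = [[0] * num_topics for _ in docs]               # document-topic counts
    n_kw = [[0] * vocab_size for _ in range(num_topics)]  # topic-word counts
    n_k = [0] * num_topics                                # tokens per topic
    for d, doc in enumerate(docs):
        for w, k in zip(doc, z[d]):
            n_dk[d][k] += 1
            n_kw[k][w] += 1
            n_k[k] += 1
    return z, n_dk, n_kw, n_k

def cgs_sweep(docs, z, n_dk, n_kw, n_k, alpha, beta, vocab_size, rng):
    """Visit every token in turn; each draw sees the counts left by the previous one."""
    num_topics = len(n_k)
    for d, doc in enumerate(docs):
        for i, w in enumerate(doc):
            k_old = z[d][i]
            # Remove the token's current assignment (note the decrements).
            n_dk[d][k_old] -= 1
            n_kw[k_old][w] -= 1
            n_k[k_old] -= 1
            # Conditional over topics given every other token's assignment.
            weights = [
                (n_dk[d][k] + alpha) * (n_kw[k][w] + beta) / (n_k[k] + vocab_size * beta)
                for k in range(num_topics)
            ]
            k_new = rng.choices(range(num_topics), weights=weights)[0]
            # Record the new assignment before moving on to the next token.
            z[d][i] = k_new
            n_dk[d][k_new] += 1
            n_kw[k_new][w] += 1
            n_k[k_new] += 1
```

<p>Only one integer per token (<code>z</code>) and sparse counts are kept, but every draw reads counts that the previous draw just modified, which is what makes the loop hard to run in parallel.</p>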
<p>EM alternates between computing the expectation of the latent variables and updating the sufficient statistics for the parameters. Since the latent variables only depend on each other through the parameters, they are conditionally independent during the E-step. Thus, we can compute all of these variables in parallel. For latent Dirichlet allocation (LDA), this means that we could parallelize at the granularity of words! Unfortunately, EM requires estimating expectations for each latent variable, which calls for dense data structures that put tremendous pressure on memory and communication bandwidth.</p>
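<p>As a rough illustration of why the E-step is both parallel and dense, the sketch below computes per-token topic responsibilities from fixed point estimates. The names <code>theta</code> (document-topic probabilities) and <code>phi</code> (topic-word probabilities) and the function itself are assumptions made for this example.</p>

```python
def e_step(docs, theta, phi):
    """Return resp[d][i][k] = P(topic k | token i of document d, theta, phi)."""
    num_topics = len(phi)
    resp = []
    for d, doc in enumerate(docs):
        doc_resp = []
        for w in doc:
            # Each token depends only on the fixed parameters, so every
            # iteration of this loop could run at the same time.
            unnorm = [theta[d][k] * phi[k][w] for k in range(num_topics)]
            total = sum(unnorm)
            # A dense vector of K numbers is produced for every token.
            doc_resp.append([u / total for u in unnorm])
        resp.append(doc_resp)
    return resp
```

<p>The result holds K floating-point values per token, compared with the single integer per token that a Gibbs sampler keeps.</p>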
<p>So while CGS is easy to distribute and hard to parallelize, EM is easy to parallelize but hard to distribute. This got us thinking: is there a way to combine the parallel nature of EM with the sparsity of a collapsed Gibbs sampler? Motivated by this purely computational goal, we temporarily threw everything we knew about statistics out the window and settled upon an algorithm that would satisfy both the sparsity and massively parallel desiderata. In essence, the result is a form of <a href="https://statistics.stanford.edu/sites/default/files/OLK%20NSF%20301.pdf">stochastic EM (SEM)</a>. The idea is to insert an extra stochastic step (S-step) after the E-step in EM. That is, once we have computed the expectations for the latent variables, we estimate each expectation with a single sample from its distribution. Thus, like collapsed Gibbs sampling, it employs sparse data structures, yet like EM, all the variables can be processed in parallel. Here is the resulting algorithm that we recently presented at <a href="http://www.aistats.org/">AISTATS 2016</a>:</p>
<p>Alternate the following steps:</p>
<ul>
<li>S-step: in parallel, compute the expectations of the latent variables and estimate the expectation with a single draw from the variable's conditional probability distribution.</li>
<li>M-step: update the sufficient statistics.</li>
</ul>
<p>An implementation can have two copies of the data structures that store the sufficient statistics. During one round it reads the sufficient statistics from copy A and increments the counts in copy B (as it imputes values in the S-step). Then, in the next round, it reads the fresh sufficient statistics from copy B and increments the counts in copy A. This process is similar to double-buffering in computer graphics.</p>
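<p>The alternation and the two-copy scheme can be sketched for LDA as follows. This is a sequential toy version, not our GPU implementation: the function names, the dictionary layout, the random initialization, and the assumption that the per-token conditional has the same form as in CGS (but read from frozen counts) are all illustrative choices.</p>

```python
import random

def new_stats(num_docs, num_topics, vocab_size):
    """Allocate zeroed sufficient statistics."""
    return {
        "doc_topic": [[0] * num_topics for _ in range(num_docs)],
        "topic_word": [[0] * vocab_size for _ in range(num_topics)],
        "topic": [0] * num_topics,
    }

def clear_stats(stats):
    """Reset a buffer to zero before it is written in the next round."""
    for row in stats["doc_topic"]:
        row[:] = [0] * len(row)
    for row in stats["topic_word"]:
        row[:] = [0] * len(row)
    stats["topic"][:] = [0] * len(stats["topic"])

def esca_round(docs, read, write, alpha, beta, rng):
    """Read frozen counts from `read`; accumulate the sampled counts in `write`."""
    num_topics = len(read["topic"])
    vocab_size = len(read["topic_word"][0])
    clear_stats(write)
    for d, doc in enumerate(docs):
        for w in doc:
            # `read` never changes during the round, so tokens are independent
            # and this loop could be executed in parallel.
            weights = [
                (read["doc_topic"][d][k] + alpha)
                * (read["topic_word"][k][w] + beta)
                / (read["topic"][k] + vocab_size * beta)
                for k in range(num_topics)
            ]
            k = rng.choices(range(num_topics), weights=weights)[0]
            # Increments only; the sampled topic itself is not stored.
            write["doc_topic"][d][k] += 1
            write["topic_word"][k][w] += 1
            write["topic"][k] += 1

def run_esca(docs, num_topics, vocab_size, rounds, alpha=0.1, beta=0.01, seed=0):
    rng = random.Random(seed)
    read = new_stats(len(docs), num_topics, vocab_size)
    write = new_stats(len(docs), num_topics, vocab_size)
    # Assumed initialization: impute a uniformly random topic for each token.
    for d, doc in enumerate(docs):
        for w in doc:
            k = rng.randrange(num_topics)
            read["doc_topic"][d][k] += 1
            read["topic_word"][k][w] += 1
            read["topic"][k] += 1
    for _ in range(rounds):
        esca_round(docs, read, write, alpha, beta, rng)
        read, write = write, read  # swap the roles of copy A and copy B
    return read
```

<p>For example, <code>run_esca([[0, 1, 2, 1], [2, 3, 3, 0, 1]], num_topics=3, vocab_size=4, rounds=5)</code> returns the sufficient statistics from the last round; the per-token topic assignments are never stored.</p>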
<p>Note that for a model such as LDA, the algorithm has a number of desirable computational properties:</p>
<ul>
<li>
<p>The memory footprint is minimal since we only store the data and sufficient statistics. In contrast to MCMC methods, we do not store the assignments to the per-word latent variables <em>z</em>. In contrast to variational methods, we do not store the variational parameters. Further, variational methods require K memory accesses (one for each topic) per word. In contrast, the S-step ensures we only have a single access (for the sampled topic) per word. Such reduced pressure on the memory bandwidth can improve performance significantly for highly parallel applications.</p>
</li>
<li>
<p>We can further reduce the memory footprint by compressing the sufficient statistics with approximate counters. This is possible because updating the sufficient statistics only requires increments as in <a href="http://www.jmlr.org/proceedings/papers/v37/tristan15.html">Mean-for-Mode</a> (an earlier approach to running LDA on GPUs developed last year). In contrast, CGS decrements counts, preventing the use of approximate counters.</p>
</li>
<li>
<p>An implementation can be lock-free (in that it does not use locks, but assumes atomic increments) because the double buffering ensures we always read from one data structure and write to another.</p>
</li>
<li>
<p>Finally, the algorithm is able to fully benefit from <a href="http://ieeexplore.ieee.org/xpls/abs_all.jsp?arnumber=92917">Vose's alias method</a> for constant time random sampling because the cost of constructing the table can be amortized over all the parallel computation in a single S-step.</p>
</li>
</ul>
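<p>For readers unfamiliar with the alias method mentioned in the last item, here is a textbook-style sketch in Python. The function names are ours, and the code is a generic illustration rather than the table construction used in our implementation.</p>

```python
import random

def build_alias_table(probs):
    """Preprocess a discrete distribution in O(n) so each draw costs O(1)."""
    n = len(probs)
    scaled = [p * n for p in probs]
    prob = [0.0] * n   # probability of keeping column i's own outcome
    alias = [0] * n    # outcome used when column i's own outcome is rejected
    small = [i for i, s in enumerate(scaled) if s < 1.0]
    large = [i for i, s in enumerate(scaled) if s >= 1.0]
    while small and large:
        s = small.pop()
        l = large.pop()
        prob[s] = scaled[s]
        alias[s] = l
        # The large entry donates mass to fill the small column up to 1.
        scaled[l] = (scaled[l] + scaled[s]) - 1.0
        (small if scaled[l] < 1.0 else large).append(l)
    # Anything left over has mass 1 up to rounding error.
    for i in large + small:
        prob[i] = 1.0
    return prob, alias

def alias_draw(prob, alias, rng):
    """Draw one outcome: pick a column uniformly, then flip a biased coin."""
    i = rng.randrange(len(prob))
    return i if rng.random() < prob[i] else alias[i]
```

<p>Building the table costs time linear in the number of outcomes, after which each draw takes constant time. Because the distributions stay fixed for the whole S-step, one table can serve many draws.</p>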
<p>So yes, the algorithm has nice computational properties, but will it work from a statistical standpoint? On the one hand, the algorithm might work because it resembles EM and statistical dependencies could be captured through the time lag. On the other hand, the algorithm resembles a parallel Gibbs sampler or stochastic cellular automaton (SCA), which are known in the general case to <a href="http://jphys.journaldephysique.org/articles/jphys/abs/1988/10/jphys_1988__49_10_1647_0/jphys_1988__49_10_1647_0.html">converge to the wrong stationary distribution</a>.</p>
<p>With this in mind, we were eager to see if the algorithm would actually work, so we implemented it for latent Dirichlet allocation (LDA) and compared it to collapsed Gibbs sampling. We expected one of two outcomes. Either (a) the experiment would succeed because the computational efficiency would outweigh the loss in statistical accuracy relative to CGS, so that although the final model log-likelihood might be slightly worse, the algorithm would reach it much more quickly, or (b) the experiment would be a complete failure because the statistical error would be so large that no amount of computational efficiency could salvage the model.</p>
<p>But in fact there was a third possibility that we had not considered: ESCA (our exponential stochastic cellular automaton algorithm) and CGS actually converged to the same likelihood! Even when we did a massive sweep over the hyperparameters, each instance of the algorithm converged to a likelihood similar to that of the corresponding CGS. Delighted, we were eager to understand why it worked so well. Is it actually converging to the correct stationary distribution after all? If so, what properties of LDA allow this to happen? Do these properties apply to other models as well? To what class of models can we generalize? We spent the next several months investigating these questions.</p>
<p>Our initial results on this problem form the bulk of the AISTATS <a href="http://jmlr.org/proceedings/papers/v51/zaheer16.pdf">paper</a> (authored by Manzil Zaheer, Michael Wick, Jean-Baptiste Tristan, Alex Smola, and Guy Steele). For example, there are certain conditions under which SEM is known to converge and we were able to show that under reasonable assumptions, LDA (and a broader class of Bayesian models, those with an exponential family likelihood) satisfies these conditions. We also found an interesting connection between stochastic EM and stochastic gradient descent with a Frank-Wolfe update. However, more work needs to be done to further establish the statistical validity of the algorithm. For this, we seek the collective wisdom of the machine learning community.</p>
<p>Michael</p>
Machine Learning
Sat, 18 Jun 2016 20:47:00 +0000
https://blogs.oracle.com/irml/exponential-stochastic-cellular-automata%3A-how-we-achieved-more-than-1-billion-tokens-per-second-for-lda
Adam Pocock