Thursday Nov 20, 2014

The Information Retrieval and Machine Learning group in Oracle Labs is looking for a solid research software engineer

My group in Oracle Labs, the Information Retrieval and Machine Learning group (IRML for short), is looking for a senior research software engineer.

The Mission of Oracle Labs is to identify, explore, and transfer new technologies that have the potential to substantially improve Oracle's business. Oracle's commitment to R&D is a driving factor in the development of technologies that have kept Oracle at the forefront of the computer industry. Although many of Oracle's leading-edge technologies originate in its product development organizations, Oracle Labs is the sole organization at Oracle that is devoted exclusively to research.

The Information Retrieval and Machine Learning Group does research and collaborates with a number of groups inside Oracle in areas such as search relevance, scalable search systems, feature selection, large-scale hierarchical classification, sentiment analysis, named entity recognition, entity linking, and coreference resolution.

Job Responsibilities

The engineer will work closely with the other members of IRML to build prototype systems that we can transfer to our collaborators. The transfer of these systems will often involve consulting with the collaborators to explain the prototypes to them and to determine how they can be integrated into their existing systems, so the successful candidate must possess excellent written and oral communication skills.

Our goal is to build near-production ready systems that clearly demonstrate the value of our research and are easy for our collaborators to make productive use of.

Candidate Profile

  • 5+ years of experience implementing machine learning, NLP, or information retrieval algorithms in production systems.
  • Experience building scalable systems that run on distributed platforms.
  • Experience with NLP toolkits such as ClearNLP, Mallet, or Factorie.
  • Extensive experience with Java; experience with Scala is a plus.
  • Experience with Big Data systems like Hadoop and Spark.
  • Experience with information retrieval systems, data mining, machine learning, natural language processing, and statistical techniques.
  • Experience in extracting and manipulating extremely large datasets.
  • Practical understanding of the mathematics behind modern machine learning, linear algebra and statistics.
  • Ability to communicate the design of algorithms and systems to other members of the group and to management.


Burlington, Massachusetts

Application Process

If you're interested, please apply using Oracle's careers site. The application process will likely take about 10-15 minutes. Please direct any further questions to

Friday Jan 08, 2010

Leaving Sun

Seems like I've been seeing a lot of these posts lately, and now I guess I get to write my own. Today is my last day at Sun. I'm moving on to a new opportunity. It's not really anything to do with the pending acquisition, it's more that it's a good time for me to make the move.

Before I go, though, I wanted to say thanks to some people who made my ten years (and two months!) at Sun really great. Bill Woods (now at ITA Software) and Bob Sproull took a chance on a kid (well, I was 31; grad school takes a looooong time) who'd never had a real job and I really do appreciate it. I learned a lot working here in Sun Labs. I've had a great time working with Jeff Alexander, who started at Sun Labs to work on the Advanced Search Technologies project and the Minion search engine, and went on to do a great job of designing and implementing the Data Store for the AURA Project. Thanks to Bill, Geoff, Karl, Meg, and Miriam (and then Karl again) for managing (or at least, attempting to manage) me over the years. Thanks to Josh Simons for being my SEED mentor and for lots of great discussions over the years. Thanks, too, to Jim Waldo, who was always willing to give me the straight dope. This one's for you, Jim!

I don't think any of them are left any more, but I owe the Portal Server team a real debt of gratitude. They were also willing to take a chance: on a search engine written by some weird labs guy. Thanks Mark, Ashok and everyone else.

All in all, it's hard to overstate what a great group of people Sun Labs is, especially the Labs here in Burlington ("Large enough to be interesting, small enough to be fun"). I mean, you give a guy an earworm from 30 Rock and he actually makes Cheesy Blasters:

That's pretty amazing right there. Also, Cheesy Blasters are awesome. I'm going to continue blogging (perhaps less sporadically) over at the new blog, if you would like to continue to read about search stuff.

Tuesday Aug 11, 2009

Homebrew Hybrid Storage Pool

I had a bit of trouble with a slow iscsi connection to my downstairs Solaris box, and so I tried something crazy. I used a RAM disk for a ZIL. This is fine, as long as your machine never, ever (ever!) goes down. I made things slightly less crazy by replacing the RAM disk with a file vdev. This allows me to power down the machine, but the iscsi performance goes back to being pretty terrible.

The answer is to use an SSD as a ZIL vdev, so I talked the wife into letting me get a 16GB SATA II drive, which I just put into the machine. There was a bit of a tense moment (note: this is what we in the business call an understatement) when the file systems on my big pool didn't appear right away (me: ls /tank, machine: I got nothing, me: WTF!?), but they appeared eventually and all I had to do was svcadm clear a couple of services that depended on them. Note to self: make those services dependent on the ZFS file systems being available.

The SSD was all the way off on c11d0 for some reason, but ZFS was happy to replace my ZIL with the new vdev, so now I'm sitting here watching this:

root@blue:~# zpool status -v tank
  pool: tank
 state: ONLINE
status: One or more devices is currently being resilvered.  The pool will
	continue to function, possibly in a degraded state.
action: Wait for the resilver to complete.
 scrub: resilver in progress for 0h21m, 57.62% done, 0h15m to go

	NAME                     STATE     READ WRITE CKSUM
	tank                     ONLINE       0     0     0
	  raidz1                 ONLINE       0     0     0
	    c4d0                 ONLINE       0     0     0
	    c4d1                 ONLINE       0     0     0
	    c5d0                 ONLINE       0     0     0
	    c5d1                 ONLINE       0     0     0
	  replacing              ONLINE       0     0     0
	    /rpool/slog/current  ONLINE       0     0     0
	    c11d0                ONLINE       0     0     0

errors: No known data errors

Yes, I called my big pool tank. I'm a ZFS nerd, I guess. Tomorrow I'll plug the MacBook into the gigE and see how Time Machine does over iscsi. I'm hoping for big numbers.

Update: I'm glad that I didn't stay up until it finished:

root@blue:~# zpool status -v tank
  pool: tank
 state: ONLINE
 scrub: resilver completed after 2h37m with 0 errors on Wed Aug 12 01:27:18 2009

	tank        ONLINE       0     0     0
	  raidz1    ONLINE       0     0     0
	    c4d0    ONLINE       0     0     0
	    c4d1    ONLINE       0     0     0
	    c5d0    ONLINE       0     0     0
	    c5d1    ONLINE       0     0     0
	  c11d0     ONLINE       0     0     0

errors: No known data errors

Why do you want a free ISMIR registration?

Sun is sponsoring ISMIR again this year. As part of our sponsorship, we've been given two registrations. One of the registrations is going to our intern François as our way of saying thanks for all his hard work, but we still have one to give away.

We'd really like for it to go to someone who really needs it, so if you would like to get our second registration, please send me a paragraph (just one paragraph!) explaining why you need our free registration. Send it to me by this Friday, August 14th, at noon EST, and a panel of experts (well, me and Paul) will look at them and decide who gets the registration.

And don't worry if you've already registered: if we select you, you'll get a refund.

Saturday Jul 25, 2009

Dear ZFS and Time Slider teams: Will you marry me?

I'm sure my wife and your wives (or husbands) and children will understand.

You see, I was working on my home system this afternoon, writing code instead of enjoying the summer weather, when I hit the following:

stgreen@blue:~/Projects/silv/work$ hg verify
** unknown exception encountered, details follow
** report bug details to
** or
** Mercurial Distributed SCM (version 1.1.2)
** Extensions loaded: 
Traceback (most recent call last):
  File "/usr/bin/hg", line 20, in ?
  File "/usr/lib/python2.4/vendor-packages/mercurial/", line 379, in parseindex
    index, nodemap, cache = parsers.parse_index(data, inline)
ValueError: corrupt index file

This made me, shall we say, unhappy. It also made me realize that I hadn't done a push to the "main" hg repository since I started the new project that I had just recently gotten working, so I was looking at losing more than a thousand lines of code.

But there you were, ZFS and Time Slider, ready to pick me up and get me back in the game:

You are the wind beneath my wings:

stgreen@blue:~/Desktop/work$ hg verify
checking changesets
checking manifests
crosschecking files in changesets and manifests
checking files
3209 files, 266 changesets, 3862 total revisions

I only lost 15 minutes of work, and those 15 minutes don't even matter, because for those 15 minutes, I was screwing around in a virtualized Ubuntu getting GWT hosted mode working. I only lost two small changes.

I have no idea what caused this problem. ZFS isn't reporting any errors on the drive, but the hg and virtualbox forums suggest that the vboxsf filesystem might be corrupting files. So, note to self: push to the main hg repository before cloning to the virtualized Ubuntu!

And I'm 100% serious about that marriage thing.

Wednesday Jun 17, 2009

Crikey! Fix a bug and look what you get!

Our intern arrived at the beginning of June and promptly began putting bugs into my pristine and utterly bug free search engine code.

One of the bugs he planted/discovered was that the recent change to use a BST for part of the dictionary lookup resulted in a BST search that was a bit, shall we say, overzealous in searching the tree. I fixed the bug and committed the change. Today I had a bit of time, so I decided to re-run the multi threading tests to see how the change affected our dictionary lookup times.

I was expecting the times to be worse, because I thought that perhaps we were not finding the right entries (but the dictionary code checks for that), but here's what I found:

# of threads | Total Number of Lookups | Average Lookup Time (ms)

That's a huge speedup at the low end: somewhere between 3 and 5 times! Something weird happens in the middle where the results are worse, but then at the top end it settles down to be more than twice as fast. Also notice that I added a result for 200 threads (hey, why not, right?)

Here's the graph:

I have to tell you: it's pretty cool to see a machine with 256 hardware threads looking like this in prstat:

  1529 stgreen   100M   86M cpu169  30    0   4:00:48  71% java/223

Onward to query performance...

Thursday May 28, 2009

Scalin' Dictionaries 2: Electric BST

I've been working on the scalability properties of the dictionaries again. The last time, thanks to Jeff, we managed to get the dictionary to scale pretty well up to 32 threads and acceptably up to 64 threads by removing the synchronization bottlenecks on allocating lookup states during dictionary lookups and by using positional reads on the files containing the dictionaries.

Once we'd done this, I went back to collect/analyze and had a look at where the synchronization time was being spent during a run of our dictionary test program with 128 threads. I was surprised to see that a lot of synchronization time was being spent adding elements to the caches that we're maintaining. This was a bit weird, because there's no explicit synchronization in the java.util.LinkedHashMap that we were using for the cache. I suspected that we were hitting an allocation problem with the objects that it allocated to put on the linked list that preserves the insertion order.

Aside from the synchronization problems, the main problem with the caches, in particular the entry-by-name cache, is that we're not getting that many hits on the cache during querying. In some of our query tests, the cache hit rate is about 10%, which means we're doing a lot of synchronization for not a lot of benefit.

So, I did away with both the entry-by-name cache and the name-by-position cache that we were using. The name-by-position cache actually was used: over time it was building up the top few layers of the binary search tree for the names in the dictionary. Unfortunately, this useful property was overwhelmed by the need to synchronize on the cache to fetch out the entry name at a given position while binary searching.

So I decided to add an honest-to-goodness binary search tree to the dictionary. This tree is populated when the dictionary is loaded, and since it never changes, it's completely thread safe. Because we took out the other caches and because the BST is so simple, we can afford to devote more entries to the BST. Every level that we add to the tree may save us a read from the underlying file, which is good, because the fastest read is the one that you don't have to make.

Each node in the BST stores the name at that position in the dictionary and the upper and lower extents of the search, so that once we've traversed the BST we know which chunk of the dictionary entries to continue the search in.
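To make the idea concrete, here is a minimal sketch of a read-only BST that covers the top levels of the binary search. It is not the Minion code: the class and method names are made up, and a sorted in-memory list stands in for the on-disk dictionary, so every probe in the second loop represents a read that would really go to the file.

```java
import java.util.List;

/** Immutable BST over the top levels of a binary search; safe to share across threads. */
final class DictionaryBst {

    /** One node: the name at position mid, plus the extents [lo, hi] that it covers. */
    private static final class Node {
        final String name;
        final int lo, mid, hi;
        final Node left, right;

        Node(String name, int lo, int mid, int hi, Node left, Node right) {
            this.name = name;
            this.lo = lo;
            this.mid = mid;
            this.hi = hi;
            this.left = left;
            this.right = right;
        }
    }

    private final List<String> sortedNames; // stand-in for the on-disk dictionary
    private final Node root;

    DictionaryBst(List<String> sortedNames, int levels) {
        this.sortedNames = sortedNames;
        this.root = build(0, sortedNames.size() - 1, levels);
    }

    /** Builds the tree once, at load time; nothing is modified afterwards. */
    private Node build(int lo, int hi, int levels) {
        if (lo > hi || levels == 0) {
            return null;
        }
        int mid = (lo + hi) >>> 1;
        return new Node(sortedNames.get(mid), lo, mid, hi,
                build(lo, mid - 1, levels - 1),
                build(mid + 1, hi, levels - 1));
    }

    /** Returns the position of name in the dictionary, or -1 if it is absent. */
    int lookup(String name) {
        int lo = 0;
        int hi = sortedNames.size() - 1;
        Node n = root;
        // Walk the in-memory tree: no file reads and no locks needed.
        while (n != null) {
            int cmp = name.compareTo(n.name);
            if (cmp == 0) {
                return n.mid;
            }
            if (cmp < 0) {
                lo = n.lo;
                hi = n.mid - 1;
                n = n.left;
            } else {
                lo = n.mid + 1;
                hi = n.hi;
                n = n.right;
            }
        }
        // Continue the binary search in the remaining chunk of entries.
        while (lo <= hi) {
            int mid = (lo + hi) >>> 1;
            int cmp = name.compareTo(sortedNames.get(mid));
            if (cmp == 0) {
                return mid;
            }
            if (cmp < 0) {
                hi = mid - 1;
            } else {
                lo = mid + 1;
            }
        }
        return -1;
    }
}
```

Because every field is final and the tree is built before anyone can see it, lookups from any number of threads need no synchronization.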

Here are the results for the new dictionary test, where we use a 1024 node BST (i.e., a tree 9 levels deep), along with a graph (note the log scale!) comparing this one to the old one:

# of threads | Total Number of Lookups | Average Lookup Time (ms)

That gives us a speedup of about 1.8 times at 64 and 128 threads, and a slightly smaller speedup of 1.6 times with smaller numbers of threads.

The last time I did this, I got a comment asking why we were even hitting the disk when we had so much memory on this box. That's a good question, but I'm pretty sure that we are hitting the disk. If we copy the index into a RAM disk and run the dictionary test, here are the numbers and the corresponding graph:

# of threads | Total Number of Lookups | Average Lookup Time (ms)

So, yeah, I'd say we're hitting the disk. That 9.67ms per lookup with 128 threads is pretty nice. That's about 123 times faster than the original code was doing with a disk-based lookup.

While I was debugging my BST, I went ahead and modified the DictionaryFactory so that when you open a dictionary that is smaller than the specified cache size, we just return you a CachedDiskDictionary that loads the dictionary into a hash table when the dictionary is opened, since it all would have been cached eventually anyways.

Tuesday May 19, 2009

Scaling a dictionary

Another post about Minion's dictionaries today. We recently got hold of a really big box: it has 256 hardware threads and 256GB of RAM. This led us to ask the question: How does Minion scale on this kind of hardware? Our initial experiments running queries on a couple of million documents with a varying number of threads (powers of 2 up to 128) showed us that as we increased the number of threads we were spending more and more time doing dictionary lookups.

Because of our EIDAP philosophy, we need to be sure that our dictionaries have good performance, especially in the multi-threaded case. We've tried out things on 4 or 8 processor machines, but nothing like the new beast. Although I'm writing about it, Jeff did all of the hard work here. The Sun Studio collect/analyze tools turned out to be exceedingly useful for doing this analysis.

We built a test program that selects a number of terms from a dictionary on-disk and then looks them up in the dictionary. A lot. For the runs that we'll be describing, we selected 10,000 terms. This list of terms is handed out to a number of threads. Each thread shuffles its list and then looks up the terms from its list in the dictionary until a time limit (300 seconds by default) passes.
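A stripped-down version of a test like this might look like the sketch below. It is not the actual test program: the class name is made up, a plain Map stands in for the on-disk dictionary, and the average is computed on the assumption that it means total thread time divided by total lookups.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicLong;

final class LookupBenchmark {

    /** Runs the lookup loop on nThreads threads for about millis ms; returns the total lookups. */
    static long run(Map<String, Integer> dictionary, List<String> terms,
                    int nThreads, long millis) {
        AtomicLong total = new AtomicLong();
        long deadline = System.nanoTime() + TimeUnit.MILLISECONDS.toNanos(millis);
        ExecutorService pool = Executors.newFixedThreadPool(nThreads);
        for (int t = 0; t < nThreads; t++) {
            pool.execute(() -> {
                // Each thread shuffles its own copy of the term list.
                List<String> mine = new ArrayList<>(terms);
                Collections.shuffle(mine);
                long count = 0;
                // The clock is checked once per pass over the list, not once per lookup.
                do {
                    for (String term : mine) {
                        dictionary.get(term);
                        count++;
                    }
                } while (System.nanoTime() - deadline < 0);
                total.addAndGet(count);
            });
        }
        pool.shutdown();
        try {
            pool.awaitTermination(millis + 10_000, TimeUnit.MILLISECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return total.get();
    }

    /** Assumed definition: thread-milliseconds spent divided by lookups completed. */
    static double averageLookupMillis(long totalLookups, int nThreads, long millis) {
        return (double) nThreads * millis / totalLookups;
    }
}
```

With the numbers from the post, this would be called with 10,000 terms and a 300,000 ms limit, once for each thread count.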

Here's the state of affairs before we started:

Number of threads | Total Number of Lookups | Average Lookup Time (ms)

Oh, dear. Not very good: we're pretty close to doubling the time when we double the number of threads, which is kind of the opposite of the definition of scalability. These times are fairly acceptable when we're doing querying with a small number of threads, because they're swamped by all of the other work that we're doing, like uncompressing postings. Once we get up to larger numbers of threads (around 16 or 32), the dictionary lookup time starts to dominate the times for the other work.

We started out by inspecting the code for doing a get from the dictionary. I described the way that it worked in a previous post, but the basic idea is that we do a binary search to find the term. We have an LRA cache for entries in the dictionary indexed by name that is meant to speed up access for commonly used terms. We also have an LRA cache for entries indexed by their position in the dictionary that is meant to speed up the binary search. Since dictionary access is multithreaded, we need to synchronize the cache accesses.
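For readers who haven't built one, here is a rough sketch of a synchronized cache along these lines. It assumes that "LRA" means least recently accessed and that the cache sits on a java.util.LinkedHashMap; the class name is hypothetical and this is not the Minion implementation.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** A bounded cache that evicts the least recently accessed entry; every call takes the lock. */
final class SynchronizedLraCache<K, V> {
    private final Map<K, V> map;

    SynchronizedLraCache(final int capacity) {
        // accessOrder = true keeps entries ordered from least to most recently accessed.
        this.map = new LinkedHashMap<K, V>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
                return size() > capacity;
            }
        };
    }

    // Even get() reorders the internal linked list, so reads need the lock too.
    synchronized V get(K key) {
        return map.get(key);
    }

    synchronized void put(K key, V value) {
        map.put(key, value);
    }

    synchronized int size() {
        return map.size();
    }
}
```

The thing to notice is that a read is also a write as far as the map is concerned, so every thread doing a lookup queues up on the same lock.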

This was the only synchronization that was happening in the dictionary's get method, so we figured that was what was causing the scalability bottleneck. Unfortunately, a quick change to reduce the amount of synchronization by removing the entry-by-position cache didn't make any difference!

This is where collect/analyze comes in. It turns out that it can do a pretty good job of giving you visibility into where your Java code is spending its synchronization time, but it also shows you what's happening underneath Java as well. Jeff ran up the tools on our big box and I have to say that we were surprised at what he found.

The first step of a dictionary fetch is to create a lookup state that contains copies of the file-backed buffers containing the dictionary's data. Although we provided for a way to re-use a lookup state, the test program was generating a new lookup state for every dictionary lookup, which meant that it was duplicating the file-backed buffers for every lookup. The tools showed us two synchronization bottlenecks in the buffer code: the first was that we were using a non-static logger, and getting the logger caused synchronization to happen. The second was that we were blocking when allocating the byte arrays that we used to buffer the on-disk dictionary data.

We were surprised that the allocator wasn't doing very well in a multithreaded environment, and it turns out that there are (at least) two multithreaded allocators that you can use in Solaris. Unfortunately, Java isn't linked against these libraries, so using them would require LD_PRELOAD tricks when running Minion. We've always tried to avoid having to have a quadruple bucky Java invocation, and we didn't want to start now.

The answer was to use thread-local storage to store a per-thread lookup state. When a thread does its first dictionary lookup the lookup state gets created and then that lookup state is used for all future lookups. Don't worry: Jeff was careful to make sure that the lookup states associated with unused threads will get garbage collected.
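In modern Java the pattern is only a few lines. This is a sketch under assumptions, not Jeff's code: LookupState here is a hypothetical stand-in holding a single scratch buffer, and ThreadLocal.withInitial is a Java 8 convenience that wasn't available back then.

```java
final class LookupStates {

    /** Hypothetical stand-in for the per-lookup buffers described above. */
    static final class LookupState {
        final byte[] buffer = new byte[8192];
    }

    // One LookupState per thread, created lazily on that thread's first call.
    private static final ThreadLocal<LookupState> STATE =
            ThreadLocal.withInitial(LookupState::new);

    /** Returns the calling thread's lookup state, allocating it only once per thread. */
    static LookupState current() {
        return STATE.get();
    }
}
```

A thread's value becomes eligible for garbage collection once the thread itself has died, which is the property that matters for unused threads.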

Once we had that working, we re-ran our tests and got better results, but still not great. So, back to collect/analyze. Now we were seeing a bottleneck when reading data from the file. This turned out to be synchronization on the RandomAccessFile in the FileReadableBuffer. In order to read a chunk of data from the dictionary, we need to seek to a particular position in the file and then read the data. Of course, this needs to be atomic!

An NIO FileChannel offers a positional read method that does the seek-and-read without requiring synchronization (this may not be the case on some OSes, so caveat optimizer!) Our final modification was therefore to introduce a new file-backed buffer implementation, NIOFileReadableBuffer, that uses a FileChannel and an NIO buffer to store the data.
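Here is a small sketch of what a positional read looks like. PositionalReader is a made-up name for illustration, not the NIOFileReadableBuffer class itself; the only real API it leans on is FileChannel.read(ByteBuffer, long), which reads at an absolute offset and leaves the channel's own position alone.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.Arrays;

final class PositionalReader implements AutoCloseable {
    private final FileChannel channel;

    PositionalReader(Path file) throws IOException {
        this.channel = FileChannel.open(file, StandardOpenOption.READ);
    }

    /** Reads up to length bytes starting at position, with no separate seek step. */
    byte[] read(long position, int length) throws IOException {
        ByteBuffer buf = ByteBuffer.allocate(length);
        long pos = position;
        while (buf.hasRemaining()) {
            int n = channel.read(buf, pos);
            if (n < 0) {
                break; // hit end of file before filling the buffer
            }
            pos += n;
        }
        // Return only the bytes that were actually read.
        return Arrays.copyOf(buf.array(), buf.position());
    }

    @Override
    public void close() throws IOException {
        channel.close();
    }
}
```

Since there is no shared file position to protect, many threads can call read on the same instance without a lock in our code.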

We added a configuration option to the dictionaries so that we could select one or the other of the file-backed buffer implementations and then re-ran our tests. Here are the results after this change, along with a nice graph.

# of threads | Total Number of Lookups | Average Lookup Time (ms) | Speedup

Clearly, this is a keeper. At 32 threads we're doing a lot better than we were at 4 threads and almost better than we were doing at 2 threads! We start to see the times doubling again as we get to 64 and 128 threads.

Because of the nature of the test, we're not hitting the dictionary cache very much, so I expect we're starting to run into some contention for the disk here (the index is residing on a ZFS pool that's on disks that are in the box). Of course, I thought I knew what the problem was at the beginning, so back to collect/analyze we go!

Friday May 01, 2009

Running Mahout on Elastic MapReduce

Here in the Labs we have a Big Data reading group. The idea is that we get together once a week and discuss a paper of interest. We've covered a lot of the famous ones, like the initial papers for GFS and MapReduce. A couple of weeks ago, I volunteered to tackle the paper from Stanford that lays out methods for running a number of standard machine learning techniques in a MapReduce framework.

The Apache Mahout project was started to build the algorithms described in the paper on the Hadoop MapReduce framework (the original paper describes running the algorithms on multicore processors). They've also brought in the Taste Collaborative Filtering framework, which is interesting to us as recommendation folks. As it turns out, they had just released Mahout 0.1 around the time we were going to read the paper.

Coincidentally, Amazon had just announced their Elastic MapReduce (EMR) service that lets you run a MapReduce job on EC2 instances, so I decided to see what it would take to get Mahout running on EMR.

I didn't manage to get it running in time for the reading group, but one Mahout issue and a few "Oh, that's the way it works"es later, I had it running.

Apparently I'm the first person to have run Mahout on Elastic MapReduce, which just shows, as my father used to say, that brute force has an elegance all its own.

If you're interested, the details are on the Mahout wiki.

Tuesday Apr 21, 2009

Distributed Key Stores: Roll your own!

Leonard Lin has some interesting notes on distributed key stores. He implemented a distributed key store for a client based on Tokyo Cabinet and a consistent hashing scheme to distribute data.
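If you haven't run into consistent hashing before, the sketch below shows the general technique. It is not Leonard's implementation: the class name, the use of CRC32 as the hash, and the virtual-node scheme are all assumptions picked to keep the example short.

```java
import java.nio.charset.StandardCharsets;
import java.util.Map;
import java.util.TreeMap;
import java.util.zip.CRC32;

/** A hash ring: each key belongs to the first node point at or after the key's hash. */
final class ConsistentHashRing {
    private final TreeMap<Long, String> ring = new TreeMap<>();
    private final int pointsPerNode;

    ConsistentHashRing(int pointsPerNode) {
        this.pointsPerNode = pointsPerNode;
    }

    // CRC32 is used only because it is in the JDK and deterministic; a real store
    // would want a better-distributed hash. Colliding points simply overwrite here.
    private static long hash(String s) {
        CRC32 crc = new CRC32();
        crc.update(s.getBytes(StandardCharsets.UTF_8));
        return crc.getValue();
    }

    /** Places several points for the node around the ring to even out the load. */
    void addNode(String node) {
        for (int i = 0; i < pointsPerNode; i++) {
            ring.put(hash(node + "#" + i), node);
        }
    }

    void removeNode(String node) {
        for (int i = 0; i < pointsPerNode; i++) {
            ring.remove(hash(node + "#" + i));
        }
    }

    /** Returns the node responsible for key, wrapping around at the end of the ring. */
    String nodeFor(String key) {
        if (ring.isEmpty()) {
            throw new IllegalStateException("no nodes in the ring");
        }
        Map.Entry<Long, String> e = ring.ceilingEntry(hash(key));
        return (e != null ? e : ring.firstEntry()).getValue();
    }
}
```

The payoff is that removing a node only moves the keys that lived on that node; everything else stays where it was.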

The really interesting part of this, though, is that he had a look at a lot of the available options (admittedly on a pretty tight schedule) and his conclusion is that, given the maturity of the options available, you could probably write your own in a day.

I'm pretty interested in retrying some of this evaluation with our own data. I'm not sure how the numbers will compare given that we're doing inverted indexing on the text in the items that we're storing, but it will be interesting to find out!

Thursday Apr 09, 2009

Compare and contrast: The Tragically Hip edition

A new album from The Tragically Hip came out this week. I'm listening to it now. I bought it from Amazon, and I just went to Amazon to see if the fine folks there could recommend any related artists for me. Today, after I searched for the band, Amazon offered a link to the "Tragically Hip Store", where "a fan can be a fan." I'm a fan, and I sure do want a place to be a fan, so let's check it out!

Here's The Tragically Hip Store at Amazon:

Now, don't get me wrong, I like Amazon (it's where I bought the new album, after all), but I'm very surprised to see that there's no recommendations on the page. There's links to the albums, but wouldn't a fan who's looking for a place to be a fan already have the albums and possibly be looking for other music?

Over there is The Tragically Hip on the Explaura, where it's all about the recommendations and a fan can find other things to be a fan of.

Wednesday Apr 08, 2009

My god, it's full of sshes

Here's a handy hint for controlling a load of EC2 instances (or some other griddy type of thing) from a single terminal in MacOS.

Check out the awesome screenshot:

Of course, if your data store is running on 128 instances, you're going to have teeny-tiny terminal windows...

Monday Apr 06, 2009

The Aussie Long Tail

As a Canadian, I grew up in a culture that was dominated to a pretty large extent by American culture (e.g., watch any good Canadian movies lately?) The one area where this was not as much of an issue was music, mostly due to the CanCon regulations that required radio stations to air a certain percentage of Canadian music. This could lead to some bad results (hearing enough Celine Dion on the radio, yet? No? Well, here's some more!), but also to some good results.

I did my postdoc work at Macquarie University in Sydney. I was already a fan of a couple of Aussie bands (most notably Midnight Oil) but I was pleasantly surprised to discover a lot of other great Aussie bands (e.g., Regurgitator and Spiderbait — you gotta love a band that can put together a two minute song that leaves you wanting more) mostly by listening to Triple M and Triple J.

A bug report for the Explaura dropped me right back into this Aussie music. I must admit that I had forgotten how much I liked some of this music. To a first approximation you would never hear about these bands in the States, so that's one in the win column for the Explaura.

Friday Apr 03, 2009

The AURA Data Store: An Overview

Paul said some pretty nice things about The Explaura today. He mentioned that I hadn't talked much about the backend system architecture, which we call the AURA Data Store (or just "the data store"). This is something that I will be talking about in future posts, but since he mentioned it, I wanted to give you an idea of what the AURA data store is like.

Here's a snapshot of our data store browsing and visualization tool (click for the full sized version):

You can see that the data store has almost 2 million items in it. These are the actual artists along with items for the photos, videos, albums and so on. Along with that are about 1.2 million attentions. An attention is a triple consisting of a user, an item, and an attention type. So, for example, if we know that a particular user listened to a particular artist then we would add an attention for that user with the item corresponding to that artist, and an attention type of LISTEN.
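As a data structure, an attention is about as simple as it gets. The sketch below is illustrative only and is not the AURA class: the field names and key formats are made up, LISTEN is the type mentioned above, and VIEWED is a hypothetical second type added so the enum has more than one value.

```java
/** An attention: a (user, item, attention type) triple. */
final class Attention {

    /** LISTEN comes from the example above; VIEWED is a hypothetical extra type. */
    enum Type { LISTEN, VIEWED }

    final String userKey;
    final String itemKey;
    final Type type;

    Attention(String userKey, String itemKey, Type type) {
        this.userKey = userKey;
        this.itemKey = itemKey;
        this.type = type;
    }

    @Override
    public String toString() {
        return "(" + userKey + ", " + itemKey + ", " + type + ")";
    }
}
```

Recording that a user listened to an artist is then just new Attention("user-1", "artist-42", Attention.Type.LISTEN), with both keys being placeholders.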

The layout of the visualization tool gives you a pretty good idea of the layout of the data store. At the top level there are data store heads, which is what clients of the data store will use for adding and fetching items and attention and for doing searches in the data store.

The data store heads talk to the second level of the data store, which is composed of a number of partition clusters. Each of the partition clusters manages a number of replicants (currently that number is 1, but stay tuned throughout the summer), which is meant to provide for scalability and redundancy in the data store.

Each replicant is composed of a BDB key-value store and a Minion search index. BDB is responsible for storing and retrieving the item and attention data and Minion is responsible for almost all of the searching and all of the similarity computations.

The addition of the Minion search index means that the AURA data store is a little different from typical key value stores because it makes it possible to ask the kinds of queries that a search engine can handle (e.g., find the artists that contain the word progressive in their Wikipedia bios, find artists whose social tags are similar to a particular artist) as well as the kind of queries that a traditional key-value store can handle (e.g., fetch me the item with this key.)

We refer to this data store as a 16-way store, because there are 16 partition clusters and 16 replicants. Each of the boxes that you see in the visualization represents a JVM running a chunk of the data store code. We use Jini in combination with our configuration toolkit to do service registration and discovery. It's all running on the Project Caroline infrastructure, and we'll be migrating it to the new Sun Cloud offering as soon as we can.

In our load tests, a data store like this one was capable of serving about 14,000 concurrent users doing typical recommendation tasks (looking at items, finding similar items, adding attention data) with sub-500ms response time at the client.

Thursday Apr 02, 2009

The new machine

Doop-de-do, logging in to the new machine.

[stgreen@hogwarts 13:38:21 ~]$ 

Huh. I wonder how many processors this machine has.

[stgreen@hogwarts 13:38:22 ~]$ /usr/sbin/psrinfo -v
Status of virtual processor 0 as of: 04/02/2009 13:38:26
  on-line since 03/31/2009 14:56:19.
  The sparcv9 processor operates at 1414 MHz,
        and has a sparcv9 floating point processor.
Status of virtual processor 1 as of: 04/02/2009 13:38:26
  on-line since 03/31/2009 14:56:22.
  The sparcv9 processor operates at 1414 MHz,
        and has a sparcv9 floating point processor.
Status of virtual processor 2 as of: 04/02/2009 13:38:26
  on-line since 03/31/2009 14:56:22.
  The sparcv9 processor operates at 1414 MHz,
        and has a sparcv9 floating point processor.

[...Many lines deleted...]

Status of virtual processor 253 as of: 04/02/2009 13:38:27
  on-line since 03/31/2009 14:56:23.
  The sparcv9 processor operates at 1414 MHz,
        and has a sparcv9 floating point processor.
Status of virtual processor 254 as of: 04/02/2009 13:38:27
  on-line since 03/31/2009 14:56:23.
  The sparcv9 processor operates at 1414 MHz,
        and has a sparcv9 floating point processor.
Status of virtual processor 255 as of: 04/02/2009 13:38:27
  on-line since 03/31/2009 14:56:23.
  The sparcv9 processor operates at 1414 MHz,
        and has a sparcv9 floating point processor.

OK. That's a lot of processors there. I wonder how much memory it has.

[stgreen@hogwarts 13:38:27 ~]$ /usr/sbin/prtconf | head -2
System Configuration:  Sun Microsystems  sun4v
Memory size: 261856 Megabytes

Uh, wow. That's a lot of RAM. I wonder how much disk space.

[stgreen@hogwarts 13:43:49 ~]$ df -hl 
Filesystem             size   used  avail capacity  Mounted on
/dev/dsk/c0t0d0s0       15G    10G   4.6G    69%    /
swap                   161G   414M   161G     1%    /tmp
swap                   161G    56K   161G     1%    /var/run
scratch                134G    60G    74G    45%    /scratch

Heh. So I guess at start up we should just cache the disk then, right?

Anyone have any single instance, multi-threaded search scalability tests they want me to try?


This is Stephen Green's blog. It's about the theory and practice of text search engines, with occasional forays into Machine Learning and statistical NLP. Steve is the PI of the Information Retrieval and Machine Learning project in Oracle Labs.

