Tuesday May 06, 2008

Operation SwoRDFish: The Business End

You’ve probably heard from some that structured metadata is what the Web really needs to evolve to the next level. “It’s all about the Semantic Web”. You’ve probably heard from others that structured metadata on the Web will never work because we’re all too lazy, too bad at spelling and too heavily infiltrated by crooks and spies. “It’s all about metacrap”. Who knows how that movie will end, but what I do know is that we’ve quietly been putting structured metadata to work on our own little corner of the Web for years. Well, not always so quietly: you might have heard of Sun’s SwoRDFish project. It’s a pioneering effort to use Web-friendly metadata, and specifically RDF, to identify and describe Sun products across systems and departments. I found myself right at the center of things when the project really gained legs, back in 2003 or so, driving the effort to develop a shared taxonomy of products on the SwoRDFish system. That was about the time I became involved in CME and the maturing of Starlight, the main CMS for global sun.com Web sites.

Starlight needed a consistent way to tag pages related to products, for a variety of reasons. A customer who follows a link to a Sun success story might learn that the Sun Fire X2200 M2 Server is an effective hardware platform for high-scale Web application architectures, and we want to make sure that they can find up-to-date information about that product on that page. On the other hand, if a customer comes straight to a product feature page, we want to make available just the tabs for the other standard pages associated with that product. We don’t want to have a downloads tab for a hardware product, so we’re concerned about identity and also general categorization of the product. Sun’s product Web pages use SwoRDFish identifiers and taxonomy to solve many such problems, and it’s become a crucial part of our machinery for dynamic page rendering.

For a quick look at the first example in pictures, the following is from our Customer Success Stories landing page:

The “By Product” tab uses selection by SwoRDFish IDs. If you hover your mouse over “Coolthreads” you’ll find the link is something like http://www.sun.com/customers/index.xml?p=8871f410-44cb-11da-ac39-080020a9ed93 . Yes, urn:uuid:8871f410-44cb-11da-ac39-080020a9ed93 is the SwoRDFish URI for the Coolthreads server product category. Clicking on it produces the generated landing page for all customer success stories that touch on that product category.
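As a rough illustration of how such an identifier moves between its URN form and its link form, here is a minimal Python sketch. The helper names are hypothetical, and the assumption that the `p` parameter is simply the UUID with the `urn:uuid:` prefix removed is inferred from the example above, not taken from Starlight's actual code.

```python
import uuid
from urllib.parse import urlparse, parse_qs, urlencode

URN_PREFIX = "urn:uuid:"


def urn_to_link(urn, base="http://www.sun.com/customers/index.xml"):
    """Build a 'by product' link from a urn:uuid: identifier (hypothetical helper)."""
    if not urn.startswith(URN_PREFIX):
        raise ValueError("expected a urn:uuid: identifier")
    # uuid.UUID rejects malformed values and normalizes to lowercase hyphenated hex.
    product_id = str(uuid.UUID(urn[len(URN_PREFIX):]))
    return base + "?" + urlencode({"p": product_id})


def link_to_urn(link):
    """Recover the urn:uuid: identifier from the 'p' query parameter of a link."""
    query = parse_qs(urlparse(link).query)
    product_id = str(uuid.UUID(query["p"][0]))
    return URN_PREFIX + product_id


if __name__ == "__main__":
    coolthreads = "urn:uuid:8871f410-44cb-11da-ac39-080020a9ed93"
    link = urn_to_link(coolthreads)
    print(link)
    print(link_to_urn(link))
```

Validating the value with `uuid.UUID` on the way in and out means a mistyped identifier fails loudly instead of silently producing an empty landing page.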

There is also a useful separation of responsibility here. The product metadata and taxonomy are now maintained by experts on the SwoRDFish team, taking their input from product managers. This is general institutional knowledge that is used across Sun, and not just on the product marketing Web pages. Starlight queries SwoRDFish for the metadata, and makes this available for publishers to add capabilities such as “success story by product”. We have built some cool internal tools for content authors to navigate the SwoRDFish ontology to make tagging and publishing quick and simple.

SwoRDFish was definitely on the bleeding edge of things. We learned lots of lessons, and we’d all do things a bit differently if we were starting out now. For one thing, rather than using UUID URIs we’d use good old “http://”. We’re actually working on some things in that direction that I hope to be able to talk about more, soon. Perhaps we’ll even get to the point of opening up the richness of our data and metadata to publishers, partners, and others outside Sun as well as inside. In this age we realize that our community is our most effective marketing arm. It’s certainly getting easier to carry the small-S semantic web message inside and outside Sun. Now that people see mashups everywhere, they understand the importance of connecting the data behind Web sites, as well as linking pages. I definitely see a bright future for the marriage of sound document design and rich metadata design that’s fueled the success of Starlight to date. Sometimes the bleeding edge may cut you, but when a bleeding-edge idea does take root and proves its value so thoroughly, the satisfaction more than makes up for the scars.

Friday Apr 18, 2008

Lovely data, lovely model

The most important step in refining data is writing down what makes it tick. In my last blog entry I outlined the process by which my team refined the data flow for Sun Web sites and mentioned the Unified Product Data Model (UPDM). UPDM is a formal means for organizing the information for Sun Web sites, with an emphasis on product data. The goal in developing UPDM was to reduce the redundancy and inefficiency in the data sets for the Starlight publishing platform, and then to extend the benefits to e-commerce sites, each of which happens to be hosted on a different platform.

UPDM provides definitions, basic business rules and relationships between facets of product data, including hierarchy and the attributes of each category, product and part. It’s maintained in a simple XML format which can be transformed to HTML, a spreadsheet, or even a UML diagram. The following listing is a snippet from the actual core UPDM 1.0 model used in Starlight.

<?xml version="1.0" encoding="UTF-8"?>
  <label>Data Model Browser - UPDM 1.0</label>
   <p>The UPDM Data Model Browser describes concepts and attributes that are 
      core to the <b>Sun Unified Product Data Model</b>. This 
      version [UPDM 1.0] covers product data elements as they are represented 
      in <b>Sun.com, shop.sun.com, and Sun Catalogue</b>. Product elements 
      that describe transactions, implementation, or presentation are not 
      included in UPDM...</p>

  <concept id="product">
    <explanation>Actual product entity.  Representation of the unit offered 
                 to the market by Sun (i.e. Sun Java System Application Server 
                 Platform Edition 9.0; Sun SPARC Enterprise M5000 Server.)
      Use the id as a stand-in for the product itself
    <association ref="swordfish-id"/>
    <association ref="name">
      <constraint>Strictly syndicated through SwoRDFish</constraint>
    <association ref="description"/>
    <association ref="image"/>
      <concept id="plc-date">
        <label>Product life cycle date</label>
        <explanation>A date on which a change of PLC status occurs</explanation>
          <data type="date"/>
        <comment>Related to the price effectivity date</comment>

  <concept id="industry">
    <explanation>Industry for which suited or targeted</explanation>
    <association ref="swordfish-id"/>
      Use the SwoRDFish ID as a stand-in for the industry itself

We developed UPDM 1.0 at a time when there were not as many good options for expressing such data models. RDF and XMI carried too much baggage, and we wanted something simple and clear, although RDF does play an important role in how things are bound together in the implementation, as I’ll discuss another day. Again, we can generate all these other representations as needed. As an example, the following picture is a UML class diagram generated from the XML above.

OK, a bit of an eye-chart. But when you develop a broad data model such as UPDM, it’s a real eye-opener, and you gain more than just the end product. You learn a lot about what business problems and business rules are not really well expressed anywhere, and are only to be found in someone’s head. Sometimes you learn about the key tensions between how different roles and departments interpret and process information.

To take one example, at Sun what we sell for hardware, the actual SKUs, are called “parts” in the marketing department, including e-commerce. In many other departments, and in a lot of the vendor software we use, these are called “products”. We don’t really market at this level, though. We market the families of these, such as “Sun Fire T2000”, and these are what we call “products” and what others call “product families”.

UPDM itself doesn’t provide any magic to reconcile such differences. It does provide the best you can hope for – a framework for writing down the knowledge so it’s open, shared, and even accessible through code. Then you have half a chance to build some magic on top of the model.
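To make “accessible through code” a little more concrete, here is a minimal Python sketch that reads a model of this shape and flattens it into spreadsheet-style rows. The fragment is a hypothetical, well-formed cut-down of the listing above: the `<model>` root element and the closing tags are my assumptions, not the real UPDM 1.0 file, and the function names are purely illustrative.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed fragment modeled on the UPDM listing.
# The <model> root element and the closing tags are assumptions.
UPDM_FRAGMENT = """
<model>
  <concept id="product">
    <explanation>Actual product entity.</explanation>
    <association ref="swordfish-id"/>
    <association ref="name">
      <constraint>Strictly syndicated through SwoRDFish</constraint>
    </association>
    <association ref="description"/>
    <association ref="image"/>
  </concept>
  <concept id="industry">
    <explanation>Industry for which suited or targeted</explanation>
    <association ref="swordfish-id"/>
  </concept>
</model>
"""


def summarize(model_xml):
    """Return {concept id: [(association ref, constraint text or None), ...]}."""
    root = ET.fromstring(model_xml)
    summary = {}
    for concept in root.iter("concept"):
        # Only direct <association> children belong to this concept.
        summary[concept.get("id")] = [
            (assoc.get("ref"), assoc.findtext("constraint"))
            for assoc in concept.findall("association")
        ]
    return summary


def to_rows(summary):
    """Flatten the summary into rows suitable for a spreadsheet export."""
    rows = [("concept", "association", "constraint")]
    for concept_id, associations in summary.items():
        for ref, constraint in associations:
            rows.append((concept_id, ref, constraint or ""))
    return rows


if __name__ == "__main__":
    for row in to_rows(summarize(UPDM_FRAGMENT)):
        print("\t".join(row))
```

The same traversal could just as easily feed an HTML or UML generator; the point is only that once the model is written down in a regular form, every other view is a small transformation away.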

Tuesday Apr 01, 2008

Johnny Data Can't Read

Captain Data Model Chronicles

When I started as data architect for Sun’s Web Publishing Engineering (WPE) department we were just coming out of pilot for Starlight, a unified content platform for Sun’s Web sites. (Back then, my team, Content Management Engineering - CME - didn't even exist yet!) As we began building Starlight, we had a few key goals: increased reuse, improved standardization and globalization. The system is a document-driven platform, which is a good thing, but back in its early days it suffered a bit from lack of a data modeler’s touch. I saw pretty quickly that it wouldn’t scale very well across the wide variety of applications on Sun’s Web space. We had to take a number of steps to improve the system to meet its ambitions.

At the beginning stage the document database was in what you can think of as preschool form. Documents, authoring and presentation templates were designed pretty much ad-hoc, as business needs drove them. There was no real organization to any of this, so that very often when similar needs came along later on, people ended up reinventing the wheel. This led to inefficiency throughout the content life-cycle. On one end, authors would have to get used to multiple templates to create similar documents. On the other end those developing Web applications on Starlight had to create multiple overlapping presentation templates. As a result the workflow was quite a tangle, as I illustrate below.

The first step was to formalize document design. Starlight initially focused on marketing Sun products, and we were exploring how we could better share content and data between the product marketing sites and the various e-commerce sites worldwide. This became an effort to create a Unified Product Data Model (UPDM). We then applied UPDM to formalize the design of many of the Starlight documents, and we used the extensibility of UPDM to establish a model for pages not as closely identified with products. I’d love to discuss UPDM more, because it was a core achievement that provided a foundation for so much of what followed, but for now I’ll continue with the main story.

Once we’d formalized document structures we could identify redundant templates and combine them, and we could also make the workflow much clearer and more efficient. We were able to accomplish this back in 2005 smoothly enough that you probably didn’t notice. For the most part, we made no change to the CMS toolkit or personnel, nor to the resulting pages. All we did was apply data architecture to make these more efficient. Call it data grammar school, illustrated in the following diagram.

Much less confusion. The document design is clearly defined, which makes it easier for Web applications to identify and pull the content they require, and makes useful middleware out of the ad-hoc authoring and presentation support tools. But regardless of how carefully you try to control the proliferation of document templates, you can’t fit every new need into existing ones. You may never regress all the way back to the chaos of preschool, but over time you can definitely lose some of the benefits of all the careful organization. As Starlight grew in application we quickly saw that we needed to organize things even further.

A Web space as broad as Sun’s may have thousands of different permutations of source documents and presentation pages, but for the most part there are basics that you use over and over again. You have titles, links, keywords, images, personal and organizational contact information, prose snippets, and so on. We turned that into a library of document components, defined in RELAX NG, and incorporating UPDM and other standards from inside and outside Sun (as an example of the latter we heavily reused components from XHTML 2.0). As such most of the document templates became nothing more than a bunch of components snapped together, so that the complexity of the middleware and application query no longer has to scale as dramatically with the number of document templates.
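The idea of templates as components snapped together can be sketched in a few lines of Python. This is only an analogy: the real library was defined in RELAX NG, and the component names, template names and validation rules below are hypothetical stand-ins.

```python
# Each reusable component carries its own (deliberately simple) validity check.
COMPONENTS = {
    "title": lambda v: isinstance(v, str) and bool(v.strip()),
    "link": lambda v: isinstance(v, str) and v.startswith(("http://", "https://")),
    "keywords": lambda v: isinstance(v, list) and all(isinstance(k, str) for k in v),
}

# A template is nothing more than a named list of components (hypothetical names).
TEMPLATES = {
    "success-story": ["title", "link", "keywords"],
    "feature-page": ["title", "keywords"],
}


def validate(template, document):
    """Return the names of components that are missing or invalid in a document."""
    return [
        name
        for name in TEMPLATES[template]
        if name not in document or not COMPONENTS[name](document[name])
    ]


def extract(component, documents):
    """Pull one component out of any documents, whatever template produced them."""
    return [doc[component] for doc in documents if component in doc]


if __name__ == "__main__":
    story = {"title": "A success story", "link": "http://www.sun.com/", "keywords": ["web"]}
    feature = {"title": "A feature page", "keywords": ["server"]}
    print(validate("success-story", story))
    print(extract("title", [story, feature]))
```

Note that `extract` never needs to know which template a document came from. That is the property that keeps middleware and query complexity from growing with the number of templates.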

At the same time, we had to re-engineer our rendering technology to take advantage of componentization of content. Parallel to our library of document components there is a separate Sun project, Web Design Standards, that defines in fine detail how Sun material should be presented on the Web. We organized our rendering templates so that they gather the needed content components as input, and generate the needed Web Design components as output. The result is a system where building blocks can be readily identified and reused throughout the publication process. Call it data college.

At this point we have the foundations for efficiently creating and routing content, allowing publishers to focus on what they want to communicate and enable in their Web applications. Of course this is only the jumping-off point for tackling even harder problems, such as how to better find and sell our products and services, and how to engage the community more readily on the Web. These are the challenges to which we’ve turned our attention in the past year or so, and the most important factor allowing us to deal with these grown-up problems is that finally, at least in CME, Johnny Data can read -- and write.

Sunday Mar 30, 2008

Captain Data Modeler Chronicles: Prologue

It’s been a while since I’ve written here. The key reason is that I’ve made a transition from Captain Data Modeler to directorship of Sun’s Content Management Engineering department. I suppose the nice message is that a lot of the hard work to balance good data architecture and practical business need is what put me on the radar for promotion. Of course the downside is that now I have to peer longingly over a desk piled high with budgets, vendor contracts, and HR priorities in order to catch a glimpse of the bits and bytes in the distance. I do miss those bits and bytes, and how they would always ground me in the comfort of tangible, creative deliverables. There are days when I’m a bit jealous of my engineering team, who get to dive in and immerse themselves in the bits and bytes every day. Ah well, business needs first.

But on the bright side I get to pull the lens back and take a broader look at how we use data on the Web to put our strategies to work. In the past whirlwind year and a half I’ve overseen data flows from legacy data stores, ERP, isolated data silos and files from all sorts of footlockers and broom closets, and I’ve had to conduct that data into new Web site venues and features, low- and high-volume e-commerce, unification of product documentation, community sites like BigAdmin, developer resource sites, and much, much more. The first thing that occurs to me, sitting at this lookout point, is that Sun has so much information that we somehow manage to squeeze outside our firewall through various tiny slits. We’re certainly ahead of the marketplace in opening up data to serve customers and partners, but we can do more, and I’m working to see that we do.

We all know that it’s now a much more collaborative marketplace, thanks to the Web. At Sun our marketplace contains some of the best brains in technology, and if we could open up more information in forms that they could easily digest, the possibilities are endless. The most obvious thing we need to provide is more Web feeds, in Atom and RSS, and it would be nice if we offered more data in JSON form, which is now one of the preferred inputs for mash-ups. In general we’d like to provide more content and data in source data formats such as well-defined XML and JSON. Right now too much of what we provide is in presentation formats such as HTML and PDF. And, in some cases, it is still all rolled up with the business rules that govern its current use.

But to get where you’re going it helps to remember where you’ve come from. I think data architecture for Sun’s Web content, while not perfect, is in pretty good shape to expand its usage as ambitiously as I want. There are some interesting lessons in how we tamed the data to that extent. I gave a presentation at XML 2007 (my co-presenter Uche Ogbuji was not able to make it for unfortunate personal reasons), covering some of the work we’d done in data architecture, and focusing on some of the lessons learned for managing collections of XML. The presentation was very well received, and that gives me the impression that we’re ahead of the curve in what we’ve accomplished behind the scenes, and that this doesn’t manifest enough in what you see on Sun’s Web sites. My experience at XML 2007 encouraged me to discuss such things more often here: not just some of the neat things we’re doing inside the firewall, but also how we plan to put them to the service of Sun’s customers, Sun’s community and partners, and ultimately Sun’s strategy.

Friday Aug 06, 2004

Enter, stage right: Captain Data Modeler

Have you read Martin's latest blog on Tips for Averting Bad Web Experiences? If so, cool... read on right here, y'all.
If not, go there first, then be sure to come back here to read my additional thoughts on some of Martin's 10+ tips.

One of my favorite tips in Martin's list is: #4 Beware of arrogant label-makers. In thinking about that tip, in my mind, it basically boils down to...
Don't force your audience to talk [understand] the internal language you walk [use and sometimes change quarterly] in order to find what they need on your web site. There's no doubt in my mind that 9 times out of 10, most customers and partners are not hip to [in the know of] any given company's internal, specialized terminology, and acronyms.

The development dialog on this can be a tough one:
Marketing Professional: "But, now we all call it: Newfangled Feature with YaYa Functionality [NfFwYYF]"
Metadata Modeler: "Do you think our customers will follow navigation or search for that term specifically?"
Marketing Professional: "Hmmm, well... Yeah, especially if I send them some opt-in direct email marketing on the NfFwYYF too!"

Perhaps true for the select audience in the know, but the longevity of the design is certain to be enhanced by figuring out why the customer needs your "NfFwYYF" and giving it a clear-cut home on the web w/ navigation to it that the audience is more likely to understand.

We could take the www.sun.com/solutions site (a portion of sun.com heavily driven by metadata models) as a case in point.
When the experts started designing this section of sun.com they realized: our authors/employees and our customers/audience speak different languages... Therefore, it was critical that our metadata model address the needs of the internal content authors, while keeping the goals and language of the customer clearly in the forefront of our design. Our metadata model had to map the two vocabularies together and bridge the gap. Captain Data Modeler saves the day!

(Sidebar: How come there aren't any really geeky super heroes? OK, sure, Spider-Man is kinda geeky without his suit on, but what kind of super hero outfit would Captain Data Modeler wear? A black and white body suit with a bunch of open and close brackets? Sorta like the Riddler: "I am a man of few words, but many riddles". Captain Data Modeler: "I am a man (or woman!) of little content, but many models to put it in..." But I digress...)

Step #1-- Let the authors focus on authoring
Like many companies, Sun has some highly specialized internal terminology, code phrases, and product names (à la NfFwYYF) that our employees understand and use when creating content stories, case studies, partner profile info, etc. To author our content efficiently, we decided that employees should associate metadata with content (tag content) using Solution terms from a highly specialized, internal-centric vocabulary. This vocabulary became our "Solutions Authoring Taxonomy". It was a flat taxonomy (simply a list of terms in the vocabulary) which included just enough information to not overwhelm content authors.

Step #2 -- Create a Navigation Scheme that Maps Customer Goals to the Tagged Content
Customers need to solve a business problem or infrastructure challenge, and if they don't find language they understand on a web site, they're apt to surf right on by. We decided that customers should be able to navigate and find Solutions using the words and phrases they understand. This multi-level hierarchy became the "Solutions Web Navigation Taxonomy". Now, gluing the two taxonomies together was the fun part. (Figuring out the terms in the vocabularies was only the half of it! Now we had term relationships to map!) Many hours were spent by subject matter experts and customer-driven design advocates tying the unique identifiers from the Authoring Taxonomy to the customer goals of the Web Navigation Taxonomy.

Use Case: So when a customer's goal is to:
Gain Competitive Advantage
    by : Delivering Superior Product Quality

The customer is led to all kinds of information authored and tagged with specific terms used internally, like:

  • Adaptive Engineering
  • ERP
  • Product Development
  • SCM
  • Service Delivery
  • Visualization

However, the only thing the customer needs to know is their goal, not our internal terminology for it.

What's also cool: Because these internal terms can be mapped to many different customer goals, the content is authored once and can be repurposed as business climates, goals and challenges evolve. In the end, the web site was redesigned, modeled, and targeted at decision-makers, line-of-business leaders, and IT managers, and it communicates how Sun, with its partners, addresses the most common business challenges. As these business challenges evolve, the metadata behind the site is ready for evolution.
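For the code-minded, the two-taxonomy mapping can be sketched roughly like this in Python. The first goal and its six terms come from the use case above; the second goal, the content items and their tags are made up for illustration, and plain term names stand in for the unique identifiers the real mapping was built on.

```python
# Web Navigation Taxonomy -> Authoring Taxonomy terms.
# Keys are (customer goal, sub-goal) pairs in the customer's own words.
NAVIGATION_TO_AUTHORING = {
    ("Gain Competitive Advantage", "Delivering Superior Product Quality"): [
        "Adaptive Engineering",
        "ERP",
        "Product Development",
        "SCM",
        "Service Delivery",
        "Visualization",
    ],
    # Hypothetical second goal, showing one internal term serving several goals.
    ("Reduce Cost", "Streamlining Operations"): ["ERP", "SCM"],
}

# Hypothetical content items, each tagged once by its author with internal terms.
TAGGED_CONTENT = {
    "case-study-1": {"ERP"},
    "white-paper-2": {"Visualization", "Product Development"},
    "partner-profile-3": {"Grid Computing"},
}


def content_for_goal(goal, subgoal):
    """Return the content items whose tags match the terms mapped to a customer goal."""
    terms = set(NAVIGATION_TO_AUTHORING.get((goal, subgoal), []))
    return sorted(item for item, tags in TAGGED_CONTENT.items() if tags & terms)


if __name__ == "__main__":
    print(content_for_goal("Gain Competitive Advantage", "Delivering Superior Product Quality"))
    print(content_for_goal("Reduce Cost", "Streamlining Operations"))
```

When goals change, only `NAVIGATION_TO_AUTHORING` needs editing; the tagged content itself stays untouched, which is the "author once, repurpose later" benefit described above.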

Ok, I've rambled past my limit. More Adventures of Captain Data Modeler later.


Passionate about data engineering strategy and solutions for Sun’s external web sites. Happiest when building taxonomies, data models, and high performing teams.

Kristen Harris
Web Data Engineering

