Friday Jun 20, 2014

Empirical Discovery - Concept and Workflow Model

Concept models are a powerful tool for articulating the essential elements and relationships that define new or complex things we need to understand.  We've previously defined empirical discovery as a new method, looking at antecedents, and also comparing and contrasting the distinctive characteristics of Empirical Discovery with other knowledge creation and insight seeking methods.  I'm now sharing our concept model of Empirical Discovery, which identifies the most important actors, activities, and outcomes of empirical discovery efforts, to complement the written definition by illustrating how the method works in practice.

In this model, we illustrate the activities of the three kinds of people most central to discovery efforts: Insight Consumers, Data Scientists, and Data Engineers. We have robust definitions of all the major actors involved in discovery (used to drive product development), and may share some of these personas, profiles, and snapshots subsequently.  For reading this model, understand Insight Consumers as the people who rely on insights from discovery efforts to effect and manage the operations of the business.  Data Scientists are the sensemakers who achieve insights and create data products and analytical models through discovery efforts.  Data Engineers enable discovery efforts by building the enterprise data analysis infrastructure necessary for discovery, and often implement the outcomes of empirical discovery by building new tools based on the insights and models Data Scientists create.

A key assumption of this model is that discovery is by definition an iterative and serendipitous method, relying on frequent back-steps and unpredictable repetition of activities as a necessary aspect of how discovery efforts unfold.  This model also assumes the data, methods, and tools shift during discovery efforts, in keeping with the evolution of motivating questions, and the achievement of interim outcomes.  Similarly, discovery efforts do not always involve all of these elements.

To keep the essential structure and relationships between elements clear and in the foreground, we have not shown all of the possible iterative loops or repeated steps. Some closely related concepts are grouped together, to allow reading the model on two levels of detail.

For a simplified view, follow the links between named actors and groups of concepts shown with colored backgrounds and labels.  In this reading, an Insight Consumer articulates questions to a Data Scientist, who combines domain knowledge with the Empirical Discovery Method (yellow) to direct the application of Analytical Tools (blue) and Models (salmon) to Data Sets (green) drawn from Data Sources (magenta).  The Data Scientist shares Insights resulting from discovery efforts with the Insight Consumer, while Data Engineers may implement the models or data products created by the Data Scientist by turning them into tools and infrastructure for the rest of the business.  For a more detailed view of the specific concepts and activities common to Empirical discovery efforts, follow the links between the individual concepts within these named groups.  (Note: there are two kinds of connections; solid arrows indicating definite relationships, and for the Data Sets and Models groups, dashed arrows indicating possible paths of evolution.  More on this to follow)

Another way to interpret the two levels of detail in this model is as descriptions of formal vs. informal implementations of the empirical discovery method.  People and organizations who take a more formal approach to empirical discovery may require explicitly defined artifacts and activities that address each major concept, such as predictions and experimental results.  In less formal approaches, Data Scientists may implicitly address each of the major concepts and activities, such as framing hypotheses, or tracking the states of data sets they are working with, without any formal artifact or decision gateway.  This situational flexibility follows from the applied nature of the empirical discovery method, which does not require scientific standards of proof and reproducibility to generate valued outcomes.

The story begins in the upper right corner, when an Insight Consumer articulates a belief or question to a Data Scientist, who then translates this motivating statement into a planned discovery effort that addresses the business goal. The Data Scientist applies the Empirical Discovery Method (concepts in yellow); possibly generating a hypothesis and accompanying predictions which will be tested by experiments, choosing data from the range of available data sources (grouped in magenta), and selecting initial analytical methods consistent with the domain, the data sets (green), and the analytical or reference models (salmon) they will work with.  Given the particulars of the data and the analytical methods, the Data Scientist employs specific analytical tools (blue) such as algorithms and statistical or other measures, based on factors such as expected accuracy, and speed or ease of use.  As the effort progresses through iterations, or insights emerge, experiments may be added or revised, based on the conclusions the Data Scientist draws from the results and their impact on starting predictions or hypotheses.  

For example, an Insight Consumer who works in a product management capacity for an on-line social network, with a business goal of increasing users’ level of engagement with the service, wishes to identify opportunities to recommend users establish new connections with other similar and possibly known users based on unrecognized affinities in their posted profiles.  The Data Scientist translates this business goal into a series of experiments investigating predictions about which aspects of user profiles more effectively predict the likelihood of creating new connections in response to system-generated recommendations for similarity.  The Data Scientist frames experiments that rely on data from the accumulated logs of user activities within the network that have been anonymized to comply with privacy policies, selecting specific working sets of data to analyze based on awareness of the shape and nature of the attributes that appear directly in users’ profiles both across the entire network, and among pools of similar but unconnected users. The Data Scientist plans to begin with analytical methods useful for predictive modeling of the effectiveness of recommender systems in network contexts, such as measurements of the affinity of users’ interests based on semantic analysis of social objects shared by users within this network and also publicly in other online media, and also structural or topological measures of relative position and distance from the field of network science.  The Data Scientist chooses a set of standard social network analysis algorithms and measures, combined with custom models for interpreting user activity and interest unique to this network.  The Data Scientist has predefined scripts and open source libraries available for ready application to data (MLlib, Gephi, Weka, Pandas, etc.) in the form of Analytical tools, which she will combine in sequences according to the desired analytical flow for each experiment.
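
As a rough illustration of the two kinds of measures in this example (affinity of interests, and structural position in the network), here is a minimal sketch in plain Python.  The user names, interests, and connections are invented, and the two measures I've chosen (Jaccard similarity and a count of shared connections) are assumptions for illustration only; a real effort would choose its own measures and tools.

```python
from itertools import combinations

# Hypothetical toy data: profile interests and existing connections.
interests = {
    "ana": {"cycling", "jazz", "python"},
    "ben": {"cycling", "jazz", "cooking"},
    "cat": {"chess", "cooking"},
    "dev": {"python", "jazz"},
}
connections = {("ana", "cat"), ("ben", "cat"), ("cat", "dev")}


def neighbors(user):
    """Users directly connected to `user` (connections are undirected)."""
    return {b if a == user else a for a, b in connections if user in (a, b)}


def jaccard(a, b):
    """Interest affinity: shared interests divided by all interests."""
    union = interests[a] | interests[b]
    return len(interests[a] & interests[b]) / len(union) if union else 0.0


def candidate_pairs():
    """Score every pair of users who are not yet connected."""
    scored = []
    for a, b in combinations(sorted(interests), 2):
        if (a, b) in connections or (b, a) in connections:
            continue
        common = len(neighbors(a) & neighbors(b))  # structural measure
        scored.append((a, b, round(jaccard(a, b), 2), common))
    # Rank by affinity first, then by number of shared connections.
    return sorted(scored, key=lambda s: (s[2], s[3]), reverse=True)


ranked = candidate_pairs()
```

Each entry in `ranked` is a candidate recommendation: two unconnected users, their interest affinity, and how many connections they already share.  Whether such scores actually predict new connections is exactly what the experiments would test.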

The nature of analytical engagement with data sets varies during the course of discovery efforts, with different types of data sets playing different roles at specific stages of the discovery workflow.  Our concept map simplifies the lifecycle of data for purposes of description, identifying five distinct and recognizable ways data are used by the Data Scientist, with five corresponding types of data sets.  In some cases, formal criteria on data quality, completeness, accuracy, and content govern which stage of the data lifecycle any given data set is at.  In most discovery efforts, however, Data Scientists themselves make a series of judgements about when and how the data in hand is suitable for use.  The dashed arrows linking the five types of data sets capture the approximate and conditional nature of these different stages of evolution.  In practice, discovery efforts begin with exploration of data that may or may not be relevant for focused analysis, but which requires some direct engagement and attention to rule in or out of consideration. Focused analytical investigation of the relevant data follows, made possible by the iterative addition, refinement and transformation (wrangling - more on this in later posts) of the exploratory data in hand.  At this stage, the Data Scientist applies analytical tools identified by their chosen analytical method.  The model building stage seeks to create explicit, formal, and reusable models that articulate the patterns and structures found during investigation.  When validation of newly created analytical models is necessary, the Data Scientist uses appropriate data - typically data that was not part of explicit model creation.  Finally, training data is sometimes necessary to put models into production - either using them for further steps in analytical workflows (which can be very complex), or in business operations outside the analytical context.
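
To make the point about validation data concrete, here is a minimal sketch of holding records back so that validation uses data that played no part in building the model.  The function name, the 25% share, and the fixed seed are my own illustrative assumptions, not something the concept model prescribes.

```python
import random


def split_for_discovery(records, validation_share=0.25, seed=7):
    """Hold back a share of records so validation uses data
    that was not part of model creation."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)  # seeded so the split is repeatable
    cut = int(len(shuffled) * validation_share)
    return {"validation": shuffled[:cut], "model_building": shuffled[cut:]}


records = list(range(20))  # stand-in for real records
sets = split_for_discovery(records)
```

In practice the judgement about which data is suitable for which stage is rarely this mechanical, which is why the arrows between data set types in the model are dashed.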

Because so much discovery activity requires transformation of the data before or during analysis, there is great interest in the Data Science and business analytics industries in how Data Scientists and sensemakers work with data at these various stages.  Much of this attention focuses on the need for better tools for transforming data in order to make analysis possible.  This model does not explicitly represent wrangling as an activity, because it is not directly a part of the empirical discovery method; transformation is done only as and when needed to make analysis possible.  However, understanding the nature of wrangling and transformation activities is a very important topic for grasping discovery, so I’ll address it in later postings. (We have a good model for this too…)

Empirical discovery efforts aim to create one or more of the three types of outcomes shown in orange: insights, models, and data products.  Insights, as we’ve defined them previously, are discoveries that change people’s perspective or understanding, not simply the results of analytical activity, such as the end values of analytical calculations, the generation of reports, or the retrieval and aggregation of stored information.   

One of the most valuable outcomes of discovery efforts is the creation of externalized models that describe behavior, structure or relationships in clear and quantified terms.  The models that result from empirical discovery efforts can take many forms (google ‘predictive model’ for a sense of the tremendous variation in what people active in business analytics consider to be a useful model), but their defining characteristic is that a model always describes aspects of a subject of discovery and analysis that are not directly present in the data itself.  For example, if given the node and edge data identifying all of the connections between people in the social network above, one possible model resulting from analysis of the network structure is a descriptive readout of the topology of the network as scale-free, with some set of subgraphs, a range of node centrality values, a matrix of possible shortest paths between nodes or subgraphs, etc.  It is possible to make sense of, interpret, or circulate a model independently of the data it describes and is derived from.
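
A small sketch may help show how a model describes things that are not directly present in the data.  Starting from nothing but an edge list (invented here), the code derives degree centrality values and a table of shortest path lengths.  The function names are hypothetical, and degree centrality is just one of many possible centrality measures.

```python
from collections import deque

# Hypothetical edge list standing in for connection data.
edges = [("a", "b"), ("a", "c"), ("a", "d"), ("d", "e")]


def build_adjacency(edges):
    """Turn undirected edges into a node -> neighbors mapping."""
    adjacency = {}
    for u, v in edges:
        adjacency.setdefault(u, set()).add(v)
        adjacency.setdefault(v, set()).add(u)
    return adjacency


def shortest_path_lengths(adjacency, start):
    """Breadth-first search: hop counts from `start` to each reachable node."""
    dist = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for nxt in adjacency[node]:
            if nxt not in dist:
                dist[nxt] = dist[node] + 1
                queue.append(nxt)
    return dist


def describe(edges):
    """Build a descriptive model of the network from its edges."""
    adjacency = build_adjacency(edges)
    n = len(adjacency)
    return {
        # Degree centrality: share of the other nodes each node touches.
        "degree_centrality": {u: len(vs) / (n - 1) for u, vs in adjacency.items()},
        "shortest_paths": {u: shortest_path_lengths(adjacency, u) for u in adjacency},
    }


model = describe(edges)
```

Note that `model` can be read, shared, and interpreted without going back to `edges`, which is the sense in which a model is independent of the data it is derived from.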

Data Scientists also engage with models in distinct and recognizable ways during discovery efforts.  Reference models, determined by the domain of investigation, often guide exploratory analysis of discovery subjects by providing Data Scientists with general explanations and quantifications for processes and relationships common to the domain.  And the models generated as insight and understanding accumulate during discovery evolve in stages from initial articulation through validation to readiness for production implementation, which means being put into effect directly on the operations of the business.

Data products are best understood as 'packages' of data which have utility for other analytical or business purposes, such as a list of users in the social network who will form new connections in response to system-generated suggestions of other similar users.  Data products are not literally finished products that the business offers for external sale or consumption.  And as background, we assume operationalization or ‘implementation’ of the outcomes of empirical discovery efforts to change the functioning of the business is the goal of different business processes, such as product development.  While empirical discovery focuses on achieving understanding, rather than making things, this is not the only thing Data Scientists do for the business.  The classic definition of Data Science as aimed at creating new products based on data which impact the business, is a broad mandate, and many of the position descriptions for data science jobs require participation in product development efforts.

Two or more kinds of outcomes are often bundled together as the results of a genuinely successful discovery effort; for example, an insight that two apparently unconnected business processes are in fact related through mutual feedback loops, and a model explicitly describing and quantifying the nature of the relationships as discovered through analysis.

There’s more to the story, but as one trip through the essential elements of empirical discovery, this is a logical point to pause and ask what might be missing from this model? And how can it be improved?

Monday Jun 09, 2014

The Sensemaking Spectrum for Business Analytics: Translating from Data to Business Through Analysis

One of the most compelling outcomes of our strategic research efforts over the past several years is a growing vocabulary that articulates our cumulative understanding of the deep structure of the domains of discovery and business analytics.

Modes are one example of the deep structure we’ve found.  After looking at discovery activities across a very wide range of industries, question types, business needs, and problem solving approaches, we've identified distinct and recurring kinds of sensemaking activity, independent of context.  We label these activities Modes: Explore, compare, and comprehend are three of the nine recognizable modes.  Modes describe *how* people go about realizing insights.  (The modes are grounded in programmatic research and formal academic discussion, covered elsewhere.)  By analogy to languages, modes are the 'verbs' of discovery activity.  When applied to the practical questions of product strategy and development, the modes of discovery allow one to identify what kinds of analytical activity a product, platform, or solution needs to support across a spread of usage scenarios, and then make concrete and well-informed decisions about every aspect of the solution, from high-level capabilities, to which specific types of information visualizations better enable these scenarios for the types of data users will analyze.

The modes are a powerful generative tool for product making, but if you've spent time with young children, or had a really bad hangover (or both at the same time...), you understand the difficulty of communicating using only verbs.

So I'm happy to share that we've found traction on another facet of the deep structure of discovery and business analytics.  Continuing the language analogy, we've identified some of the ‘nouns’ in the language of discovery: specifically, the consistently recurring aspects of a business that people are looking for insight into.  We call these discovery Subjects, since they identify *what* people focus on during discovery efforts, rather than *how* they go about discovery as with the Modes.

Defining the collection of Subjects people repeatedly focus on allows us to understand and articulate sensemaking needs and activity in more specific, consistent, and complete fashion.  In combination with the Modes, we can use Subjects to concretely identify and define scenarios that describe people’s analytical needs and goals.  For example, a scenario such as ‘Explore [a Mode] the attrition rates [a Measure, one type of Subject] of our largest customers [Entities, another type of Subject]’ clearly captures the nature of the activity (exploration of trends vs. deep analysis of underlying factors) and the central focus (attrition rates for customers above a certain set of size criteria), from which follow many of the specifics needed to address this scenario in terms of data, analytical tools, and methods.
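
One way to see how a Mode and Subjects combine into a scenario is to write the structure down.  The sketch below is only an illustration: the class and field names are hypothetical, not part of our framework.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Subject:
    kind: str  # the type of Subject, e.g. "Measure" or "Entities"
    name: str  # what the scenario is actually about


@dataclass(frozen=True)
class Scenario:
    mode: str        # one of the Modes, e.g. "Explore"
    subjects: tuple  # the Subjects the activity focuses on

    def describe(self):
        """Render the scenario as an annotated sentence."""
        focus = " of ".join(f"{s.name} [{s.kind}]" for s in self.subjects)
        return f"{self.mode} [Mode] the {focus}"


scenario = Scenario(
    mode="Explore",
    subjects=(
        Subject("Measure", "attrition rates"),
        Subject("Entities", "our largest customers"),
    ),
)
```

Calling `scenario.describe()` gives back the annotated sentence from the example above, with each part labeled by its role.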

We can also use Subjects to translate effectively between the different perspectives that shape discovery efforts, reducing ambiguity and increasing impact on both sides of the perspective divide.  One example is translation from the language of business, which often motivates analytical work by asking questions in business terms, to the perspective of analysis.  The question posed to a Data Scientist or analyst may be something like “Why are sales of our new kinds of potato chips to our largest customers fluctuating unexpectedly this year?” or “Where can we innovate, by expanding our product portfolio to meet unmet needs?”.  Analysts translate questions and beliefs like these into one or more empirical discovery efforts that more formally and granularly indicate the plan, methods, tools, and desired outcomes of analysis.  From the perspective of analysis this second question might become, “Which customer needs of type ‘A’, identified and measured in terms of ‘B’, that are not directly or indirectly addressed by any of our current products, offer ‘X’ potential for ‘Y’ positive return on the investment ‘Z’ required to launch a new offering, in time frame ‘W’?  And how do these compare to each other?”.  Translation also happens from the perspective of analysis to the perspective of data, in terms of availability, quality, completeness, format, volume, etc.

By implication, we are proposing that most working organizations — small and large, for profit and non-profit, domestic and international, and in the majority of industries — can be described for analytical purposes using this collection of Subjects.  This is a bold claim, but simplified articulation of complexity is one of the primary goals of sensemaking frameworks such as this one.  (And, yes, this is in fact a framework for making sense of sensemaking as a category of activity - but we’re not considering the recursive aspects of this exercise at the moment.)

Compellingly, we can place the collection of subjects on a single continuum — we call it the Sensemaking Spectrum — that simply and coherently illustrates some of the most important relationships between the different types of Subjects, and also illuminates several of the fundamental dynamics shaping business analytics as a domain.  As a corollary, the Sensemaking Spectrum also suggests innovation opportunities for products and services related to business analytics.

The first illustration below shows Subjects arrayed along the Sensemaking Spectrum; the second illustration presents examples of each kind of Subject.  Subjects appear in colors ranging from blue to reddish-orange, reflecting their place along the Spectrum, which indicates whether a Subject addresses more the viewpoint of systems and data (Data centric and blue), or people (User centric and orange).  This axis is shown explicitly above the Spectrum.  Annotations suggest how Subjects align with the three significant perspectives of Data, Analysis, and Business that shape business analytics activity.  This rendering makes explicit the translation and bridging function of Analysts as a role, and analysis as an activity.


Subjects are best understood as fuzzy categories, rather than tightly defined buckets.  For each Subject, we suggest some of the most common examples: Entities may be physical things such as named products, or locations (a building, or a city); they could be Concepts, such as satisfaction; or they could be Relationships between entities, such as the variety of possible connections that define linkage in social networks.  Likewise, Events may indicate a time and place in the dictionary sense; or they may be Transactions involving named entities; or take the form of Signals, such as ‘some Measure had some value at some time’ - what many enterprises understand as alerts.

The central story of the Spectrum is that though consumers of analytical insights (represented here by the Business perspective) need to work in terms of Subjects that are directly meaningful to their perspective, such as Themes, Plans, and Goals, the working realities of data (condition, structure, availability, completeness, cost) and the changing nature of most discovery efforts make direct engagement with source data in this fashion impossible.  Accordingly, business analytics as a domain is structured around the fundamental assumption that sensemaking depends on analytical transformation of data.  Analytical activity incrementally synthesizes more complex and larger scope Subjects from data in its starting condition, accumulating insight (and value) by moving through a progression of stages in which increasingly meaningful Subjects are iteratively synthesized from the data, and recombined with other Subjects.  The end goal of ‘laddering’ successive transformations is to enable sensemaking from the business perspective, rather than the analytical perspective.

Synthesis through laddering is typically accomplished by specialized Analysts using dedicated tools and methods. Beginning with some motivating question such as seeking opportunities to increase the efficiency (a Theme) of fulfillment processes to reach some level of profitability by the end of the year (Plan), Analysts will iteratively wrangle and transform source data Records, Values and Attributes into recognizable Entities, such as Products, that can be combined with Measures or other data into the Events (shipment of orders) that indicate the workings of the business.  
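
As a minimal sketch of one rung of this laddering, the code below turns flat source Records into Product Entities, then combines those Entities with a Measure (units shipped) into shipment Events.  The field names, product names, and values are invented for illustration, and real wrangling is far messier than this.

```python
# Hypothetical source Records: flat rows as they might arrive from an order system.
records = [
    {"sku": "P-1", "product_name": "Sea Salt Chips", "order_id": 101, "qty": 3, "shipped_on": "2014-06-02"},
    {"sku": "P-1", "product_name": "Sea Salt Chips", "order_id": 102, "qty": 1, "shipped_on": "2014-06-03"},
    {"sku": "P-2", "product_name": "Barbecue Chips", "order_id": 102, "qty": 2, "shipped_on": "2014-06-03"},
]


def to_entities(records):
    """Step 1: collapse repeated attribute values into Product Entities."""
    return {r["sku"]: {"sku": r["sku"], "name": r["product_name"]} for r in records}


def to_events(records, products):
    """Step 2: combine Entities with a Measure (units) into shipment Events."""
    events = {}
    for r in records:
        event = events.setdefault(
            r["order_id"],
            {"order_id": r["order_id"], "shipped_on": r["shipped_on"], "products": [], "units": 0},
        )
        event["products"].append(products[r["sku"]]["name"])
        event["units"] += r["qty"]
    return list(events.values())


products = to_entities(records)
shipments = to_events(records, products)
```

Each step produces Subjects that sit further to the right on the Spectrum than its inputs: `shipments` is closer to the workings of the business than the rows it was built from.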

More complex Subjects (to the right of the Spectrum) are composed of or make reference to less complex Subjects: a business Process such as Fulfillment will include Activities such as confirming, packing, and then shipping orders.  These Activities occur within or are conducted by organizational units such as teams of staff or partner firms (Networks), composed of Entities which are structured via Relationships, such as supplier and buyer.  The fulfillment process will involve other types of Entities, such as the products or services the business provides.  The success of the fulfillment process overall may be judged according to a sophisticated operating efficiency Model, which includes tiered Measures of business activity and health for the transactions and activities included.  All of this may be interpreted through an understanding of the operational domain of the business's supply chain (a Domain).

We'll discuss the Spectrum in more depth in succeeding posts.

Friday Mar 28, 2014

Data Science Highlights: An Investigation of the Discipline

I've posted a substantial readout summarizing some of the more salient findings from a long-running programmatic research effort into data science. This deck shares synthesized findings around many of the facets of data science as a discipline, including practices, workflow, tools, org models, skills, etc. This readout distills a very wide range of inputs, including: direct interviews, field-based ethnography, community participation (real-world and on-line), secondary research from industry and academic sources, analysis of hiring and investment activity in data science over several years, descriptive and definitional artifacts authored by practitioners, analysts, educators, and other external actors, media coverage of data science, historical antecedents, the structure and evolution of professional disciplines, and more.

I consider it a sort of business-anthropology-style investigation of data science, conducted from the viewpoint of product making's primary aspects: strategy, management, design, and delivery.

I learned a great deal during the course of this effort, and expect to continue to learn, as data science will continue to evolve rapidly for the next several years.

Data science practitioners looking at this material are invited to provide feedback about where these materials are accurate or inaccurate, and most especially about what is missing, and what is coming next for this very exciting field.

Wednesday Mar 26, 2014

Data Science and Empirical Discovery: A New Discipline Pioneering a New Analytical Method

One of the essential patterns of science and industry in the modern era is that new methods for understanding — what I’ll call sensemaking from now on — often emerge hand in hand with new professional and scientific disciplines.  This linkage between new disciplines and new methods follows from the  deceptively simple imperative to realize new types of insight, which often means analysis of new kinds of data, using new techniques, applied from newly defined perspectives. New viewpoints and new ways of understanding are literally bound together in a sort of symbiosis.

One familiar example of this dynamic is the rapid development of statistics during the 18th and 19th centuries, in close parallel with the rise of new social science disciplines including economics (originally political economy) and sociology, and natural sciences such as astronomy and physics.  On a very broad scale, we can see the pattern in the tandem evolution of the scientific method for sensemaking, and the codification of modern scientific disciplines based on precursor fields such as natural history and natural philosophy during the scientific revolution.

Today, we can see this pattern clearly in the simultaneous emergence of Data Science as a new and distinct discipline accompanied by Empirical Discovery, the new sensemaking and analysis method Data Science is pioneering.  Given its dramatic rise to prominence recently, declaring Data Science a new professional discipline should inspire little controversy. Declaring Empirical Discovery a new method may seem bolder, but with the essential pattern of new disciplines appearing in tandem with new sensemaking methods in mind, it is more controversial to suggest Data Science is a new discipline that lacks a corresponding new method for sensemaking.  (I would argue it is the method that makes the discipline, not the other way around, but that is a topic for fuller treatment elsewhere.)

What is empirical discovery?  While empirical discovery is a new sensemaking method, we can build on two existing foundations to understand its distinguishing characteristics, and help craft an initial definition.  The first of these is an understanding of the empirical method. Consider the following description:

"The empirical method is not sharply defined and is often contrasted with the precision of the experimental method, where data are derived from the systematic manipulation of variables in an experiment.  ...The empirical method is generally characterized by the collection of a large amount of data before much speculation as to their significance, or without much idea of what to expect, and is to be contrasted with more theoretical methods in which the collection of empirical data is guided largely by preliminary theoretical exploration of what to expect. The empirical method is necessary in entering hitherto completely unexplored fields, and becomes less purely empirical as the acquired mastery of the field increases. Successful use of an exclusively empirical method demands a higher degree of intuitive ability in the practitioner."

Data Science as practiced is largely consistent with this picture.  Empirical prerogatives and understandings shape the procedural planning of Data Science efforts, rather than theoretical constructs.  Semi-formal approaches predominate over explicitly codified methods, signaling the importance of intuition.  Data scientists often work with data that is on-hand already from business activity, or data that is newly generated through normal business operations, rather than seeking to acquire wholly new data that is consistent with the design parameters and goals of formal experimental efforts.  Much of the sensemaking activity around data is explicitly exploratory (what I call the 'panning for gold' stage of evolution - more on this in subsequent postings), rather than systematic in the manipulation of known variables.  These exploratory techniques are used to address relatively new fields such as the Internet of Things, wearables, and large-scale social graphs and collective activity domains such as instrumented environments and the quantified self.  These new domains of application are not mature in analytical terms; analysts are still working to identify the most effective techniques for yielding insights from data within their bounds.  

The second relevant perspective is our understanding of discovery as an activity that is distinct and recognizable in comparison to generalized analysis: from this, we can summarize discovery as sensemaking intended to arrive at novel insights, through exploration and analysis of diverse and dynamic data in an iterative and evolving fashion.

Looking deeper, one specific characteristic of discovery as an activity is the absence of formally articulated statements of belief and expected outcomes at the beginning of most discovery efforts.  Another is the iterative nature of discovery efforts, which can change course in non-linear ways and even 'backtrack' on the way to arriving at insights: both the data and the techniques used to analyze data change during discovery efforts.  Formally defined experiments are much more clearly determined from the beginning, and their definition is less open to change during their course. A program of related experiments conducted over time may show iterative adaptation of goals, data and methods, but the individual experiments themselves are not malleable and dynamic in the fashion of discovery.  Discovery's emphasis on novel insight as preferred outcome is another important characteristic; by contrast, formal experiments are repeatable and verifiable by definition, and the degree of repeatability is a criterion of well-designed experiments.  Discovery efforts often involve an intuitive shift in perspective that is recountable and retraceable in retrospect, but cannot be anticipated.

Building on these two foundations, we can define Empirical Discovery as a hybrid, purposeful, applied, augmented, iterative and serendipitous method for realizing novel insights for business, through analysis of large and diverse data sets.

Let’s look at these facets in more detail.  

Empirical discovery primarily addresses the practical goals and audiences of business (or industry), rather than scientific, academic, or theoretical objectives.  This is tremendously important, since  the practical context impacts every aspect of Empirical Discovery. 

'Large and diverse data sets' reflects the fact that Data Science practitioners engage with Big Data as we currently understand it; situations in which the confluence of data types and volumes exceeds the capabilities of business analytics to practically realize insights in terms of tools, infrastructure, practices, etc.

Empirical discovery uses a rapidly evolving hybridized toolkit, blending a wide range of general and advanced statistical techniques with sophisticated exploratory and analytical methods from a wide variety of sources that includes data mining, natural language processing, machine learning, neural networks, Bayesian analysis, and emerging techniques such as topological data analysis and deep learning.

What's most notable about this hybrid toolkit is that Empirical Discovery does not originate novel analysis techniques; it borrows tools from established disciplines such as information retrieval, artificial intelligence, computer science, and the social sciences.  Many of the more specialized or apparently exotic techniques data science and empirical discovery rely on, such as support vector machines, deep learning, or measuring mutual information in data sets, have established histories of usage in academic or other industry settings, and have reached reasonable levels of maturity.  Empirical discovery's hybrid toolkit is transposed from one domain of application to another, rather than invented.

Empirical Discovery is an applied method in the same way Data Science is an applied discipline: it originates in and is adapted to business contexts, it focuses on arriving at useful insights to inform business activities, and it is not used to conduct basic research.  At this early stage of development, Empirical Discovery has no independent and articulated theoretical basis and does not (yet) advance a distinct body of knowledge based on theory or practice. All viable disciplines have a body of knowledge, whether formal or informal, and applied disciplines have only their cumulative body of knowledge to distinguish them, so I expect this to change.

Empirical discovery is not only applied, but explicitly purposeful in that it is always set in motion and directed by an agenda from a larger context, typically the specific business goals of the organization acting as a prime mover and funding data science positions and tools.  Data Science practitioners effect Empirical Discovery by making it happen on a daily basis - but wherever there is empirical discovery activity, there is sure to be intentionality from a business view.  For example, even in organizations with a formal hack time policy, our research suggests there is little or no completely undirected or self-directed empirical discovery activity, whether conducted by formally recognized Data Science practitioners, business analysts, or others. 

One very important implication of the situational purposefulness of Empirical Discovery is that there is no direct imperative for generating a body of cumulative knowledge through original research: the insights that result from Empirical Discovery efforts are judged by their practical utility in an immediate context.  There is also no explicit scientific burden of proof or verifiability associated with Empirical Discovery within its primary context of application.  Many practitioners encourage some aspects of verifiability, for example, by annotating the various sources of data used for their efforts and the transformations involved in wrangling data on the road to insights or data products, but this is not a requirement of the method.  Another implication is that empirical discovery does not adhere to any explicit moral, ethical, or value-based missions that transcend working context.  While Data Scientists often interpret their role as transformative, this is in reference to business.  Data Science is not medicine, for example, with a Hippocratic oath.

Empirical Discovery is an augmented method in that it depends on computing and machine resources to increase human analytical capabilities: It is simply impractical for people to manually undertake many of the analytical techniques common to Data Science.  An important point to remember about augmented methods is that they are not automated; people remain necessary, and it is the combination of human and machine that is effective at yielding insights.  In the problem domain of discovery, the patterns of sensemaking activity leading to insight are intuitive, non-linear, and associative; activities with these characteristics are not fully automatable with current technology. And while many analytical techniques can be usefully automated within boundaries, these tasks typically make up just a portion of a complete discovery effort.  For example, using latent class analysis to explore a machine-sampled subset of a larger data corpus is task-specific automation complementing human perspective at particular points of the Empirical Discovery workflow.  This dependence on machine augmented analytical capability is recent within the history of analytical methods.  In most of the modern era -- roughly the later 17th, 18th, 19th and early 20th centuries -- the data employed in discovery efforts was manageable 'by hand', even when using the newest mathematical and analytical methods emerging at the time.  This remained true until the effective commercialization of machine computing ended the need for human computers as a recognized role in the middle of the 20th century.

The reality of most analytical efforts -- even those with good initial definition -- is that insights often emerge in response to and in tandem with changing and evolving questions which were not identified, or perhaps not even understood, at the outset.  During discovery efforts, analytical goals and techniques, as well as the data under consideration, often shift in unpredictable ways, making the path to insight dynamic and non-linear.  Further, the sources of and inspirations for insight are  difficult or impossible to identify both at the time and in retrospect. Empirical discovery addresses the complex and opaque nature of discovery with iteration and adaptation, which combine  to set the stage for serendipity.

With this initial definition of Empirical Discovery in hand, the natural question is what this means for Data Science and business analytics.  Three things stand out for me.  First, I think one of the central roles played by Data Science is in pioneering the application of existing analytical methods from specialized domains to serve general business goals and perspectives, seeking effective ways to work with the new types (graph, sensor, social, etc.) and tremendous volumes (yotta, yotta, yotta...) of business data at hand in the Big Data moment and realize insights.

Second, following from this, Empirical Discovery is a methodological framework within and through which a great variety of analytical techniques at differing levels of maturity and from other disciplines are vetted for business analytical utility in iterative fashion by Data Science practitioners.

And third, it seems this vetting function is deliberately part of the makeup of empirical discovery, which I consider a very clever way to create a feedback loop that enhances Data Science practice by using Empirical Discovery as a discovery tool for refining its own methods.

Monday Mar 10, 2014

Big Data is a Condition (Or, "It's (Mostly) In Your Head")

Unsurprisingly, definitions of Big Data run the gamut from the turgid to the flip, making room to include the trite, the breathless, and the simply un-inspiring in the big circle around the campfire. Some of these definitions are useful in part, but none of them captures the essence of the matter.   Most are mistakes in kind, trying to ground and capture Big Data as a 'thing' of some sort that is measurable in objective terms. Anytime you encounter a number, this is the school of thought.

Some approach Big Data as a state of being, most often a simple operational state of insufficiency of some kind: typically of resources like analysts, compute power, or storage for handling data effectively; occasionally of something less quantifiable like clarity of purpose and criteria for management.  Anytime you encounter phrasing that relies on the reader to interpret and define the particulars of the insufficiency, this is the school of thought.

I see Big Data as a self-defined (perhaps diagnosed is more accurate) condition, but one that is based on idiosyncratic interpretation of current and possible future situations in which understanding of, planning for, and activity around data are central. 

Here's my working definition: Big Data is the condition in which very high actual or expected difficulty in working successfully with data combines with very high anticipated but unknown value and benefit, leading to the a priori assumption that currently available information management and analytical capabilities are broadly insufficient, making new and previously unknown capabilities seemingly necessary.

Friday Dec 06, 2013

Strata New York Video: Designing Big Data Interactions With the Language of Discovery

I'm late in making it available here, but O'Reilly Media published the video recording of my presentation on The Language of Discovery: A Toolkit For Designing Big Data Interactions from last year's (2012) Strata conference in NY. Looking back at this, I'm happy to say that while my thinking on several of the key ideas has advanced quite a bit in the past 12 months (see our more recent materials), the core ideas and concepts remain vital. Those are, briefly:
  • Big Data is useless unless people can engage with it effectively
  • Discovery is a critical and inadequately acknowledged aspect of sense making that is core to realizing value from Big Data
  • Discovery is literally the most important human/machine interaction in the emerging Age of Insight
  • Providing discovery capability requires understanding people's needs and goals
  • The Language of Discovery is an effective tool for understanding discovery needs and activities, and designing solutions
  • There are known patterns and structure in discovery activities that you can use to create discovery solutions
I've posted it to Vimeo for easier viewing - slides are here for those who wish to follow along - enjoy!

The Language of Discovery: A Toolkit For Designing Big Data Interactions (Joe Lamantia) from Joe Lamantia on Vimeo.

Tuesday Oct 22, 2013

Understanding Data Science: Recent Studies

If you need a deeper understanding of data science than Drew Conway's popular Venn diagram model, or Josh Wills' tongue-in-cheek characterization, "Data Scientist (n.): Person who is better at statistics than any software engineer and better at software engineering than any statistician," two relatively recent studies are worth reading.

'Analyzing the Analyzers,' an O'Reilly e-book by Harlan Harris, Sean Patrick Murphy, and Marck Vaisman, suggests four distinct types of data scientists -- effectively personas, in a design sense -- based on analysis of self-identified skills among practitioners.  The scenario format dramatizes the different personas, making what could be a dry statistical readout of survey data more engaging.  The survey-only nature of the data, the restriction of scope to just skills, and the suggested models of skill-profiles make this feel like the sort of exercise that data scientists undertake as an everyday task: collecting data, analyzing it using a mix of statistical techniques, and sharing the model that emerges from the data mining exercise.  That's not an indictment, simply an observation about the consistent feel of the effort as a product of data scientists, about data science.

And the paper 'Enterprise Data Analysis and Visualization: An Interview Study' by researchers Sean Kandel, Andreas Paepcke, Joseph Hellerstein, and Jeffrey Heer considers data science within the larger context of industrial data analysis, examining analytical workflows, skills, and the challenges common to enterprise analysis efforts, and identifying three archetypes of data scientist.  As an interview-based study, the data the researchers collected is richer, and there's correspondingly greater depth in the synthesis.  The scope of the study included a broader set of roles than data scientist (enterprise analysts) and involved questions of workflow and organizational context for analytical efforts in general.  I'd suggest this is useful as a primer on analytical work and workers in enterprise settings for those who need a baseline understanding; it also offers some genuinely interesting nuggets for those already familiar with discovery work.

We've undertaken a considerable amount of research into discovery, analytical work/ers, and data science over the past three years -- part of our programmatic approach to laying a foundation for product strategy and highlighting innovation opportunities -- and both studies complement and confirm much of the direct research into data science that we conducted. There were a few important differences in our findings, which I'll share and discuss in upcoming posts.



