Data scientist is one of the fastest-growing professions, and this role is becoming more commonplace across a variety of business sectors. But what does a successful data science team look like?
Now is a good time to ask this question because 85% of data projects fail to deliver commercial value or see no business adoption, according to Gartner. The truth is that succeeding at data science in a business setting is a complex challenge. You can read about the unique strategic and technical requirements in my previous blog posts. This post covers the last piece of the puzzle: delivery management of data science projects.
When looking at the project-delivery pipeline, it’s important to remember that data science is (a) still an emerging business function and (b) constantly evolving from ongoing innovation. These realities not only affect the available capacity of data science teams to deliver projects, but also explain why teams need the autonomy to define their research process.
Many data scientists find themselves in tech companies running agile software development cycles, which can feel a little like fitting a square peg into a round hole. What differentiates the data science function from software development is the amount of open-ended research that comes with the role, and this is where standard agile processes can collide with the interests of data scientists. While research is an essential part of data science, its open-endedness and the lack of clarity around its business value do not fit standard agile processes very well.
Any data science team needs to spend a significant amount of its time on research to build out its foundational layer, adapting a continuously evolving field to the specific business context. We can break this foundational research work into three categories:
Long term: Conducted by specialized research teams at established data-driven high-tech companies. (This is not your typical commercial data scientist.)
Medium term: Building out scalable foundations by adapting the ever-evolving key domains in data science to the business-specific context.
Short term: Application-specific challenges such as solution design, presentation layers, algorithm selection, and validation methods for models and metrics.
Many businesses fast-forward their research backlog by hiring specific expertise from relevant domains such as NLP or computer vision. Regardless, there needs to be continuous investment in the foundational layer (see Figure 1) to keep resources relevant and to provide the data science team with opportunities for development. This is crucial for team motivation and retention.
Figure 1 (Source: Jan Teichmann)
Once we break the data science responsibilities into two workstreams, it becomes clear which one belongs in an agile delivery pipeline and on a project management board like Jira, and which one collides with standard agile processes.
Building out the foundational layer and managing this workstream should be an internal responsibility of the data scientists. In my experience, how this is done turns out, time and again, to be very specific to each team. Generally, the workstream shown in Figure 1 is the appropriate one for proofs of concept without defined success criteria or deadlines. Rather than trying to force that chaos into a fixed process, it’s best to be transparent about the resources allocated to this workstream, set expectations around collaboration and knowledge transfer, and communicate the progress and value of this work, e.g., through blog posts and community meetups.
From the team’s solid foundations, we scale up a growing number of applications and business use cases that become part of a standard delivery pipeline with stricter project management requirements:
Figure 2: Adapted Flow Chart originally by Shay Palachy
The delivery of data science projects is a team sport for science, product development, and engineering. The key to coordinating and prioritizing it is proof-of-value thinking in project delivery management, rather than the proof-of-concept thinking that motivates the team’s research efforts.
The Scoping phase has to start with a specific product need. The aim is to reach an agreement between product and data science about the scope of a successful project and the metrics it will address. This phase is a great litmus test for an established product-and-science relationship. In a mature collaboration, suitable data science projects have not already been solutionized by the product teams. The ideation process should be heavily data-informed to define a realistic project scope and should lie with the data science team.
In the Research phase, the data science team will be able to draw from the foundations it has built. This restricts the outstanding research to short-term, application-specific challenges such as the final solution design (including presentation layers and APIs), algorithm selection, and suitable validation methods for the model and agreed metrics. As such, it is appropriate to timebox this phase, while accepting that the research phase may have to be repeated. A successful research phase concludes with a sense check by engineering and a confirmation that the scope and agreed metrics remain valid.
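As an illustration of validating candidate algorithms against an agreed metric, here is a minimal sketch using k-fold cross-validation in plain Python. Everything in it is an assumption made for the example: the toy data, the two candidate models, the choice of mean absolute error as the metric, and the threshold `AGREED_MAX_MAE`. A real project would substitute its own data, candidates, and the metric agreed during scoping.

```python
from statistics import mean

# Hypothetical threshold agreed with product during scoping (assumption).
AGREED_MAX_MAE = 1.0


def fit_mean(xs, ys):
    """Baseline candidate: always predict the mean of the training targets."""
    m = mean(ys)
    return lambda x: m


def fit_line(xs, ys):
    """Candidate: ordinary least squares on a single feature."""
    mx, my = mean(xs), mean(ys)
    var = sum((x - mx) ** 2 for x in xs)
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / var if var else 0.0
    intercept = my - slope * mx
    return lambda x: slope * x + intercept


def cross_val_mae(fit, xs, ys, k=5):
    """Mean absolute error averaged over k interleaved folds."""
    fold_errors = []
    for fold in range(k):
        pairs = list(enumerate(zip(xs, ys)))
        train = [p for i, p in pairs if i % k != fold]
        test = [p for i, p in pairs if i % k == fold]
        model = fit([x for x, _ in train], [y for _, y in train])
        fold_errors.append(mean(abs(model(x) - y) for x, y in test))
    return mean(fold_errors)


# Toy data (assumption): y = 2x + 1 with a small deterministic wobble.
xs = [float(i) for i in range(20)]
ys = [2 * x + 1 + (0.3 if i % 2 else -0.3) for i, x in enumerate(xs)]

candidates = {"mean_baseline": fit_mean, "linear": fit_line}
results = {name: cross_val_mae(fit, xs, ys) for name, fit in candidates.items()}

# Only candidates that meet the agreed metric move on to development.
passing = [name for name, mae in results.items() if mae <= AGREED_MAX_MAE]
print(results)
print(passing)
```

Keeping a trivial baseline among the candidates makes it easier to judge whether a more complex model actually earns its place before the timebox runs out.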
Once the research phase is complete, the actual model Development starts. Data scientists write code, but their way of working differs a lot from that of software engineers. This phase is highly iterative, with a large amount of back and forth between model development and model evaluation to optimize a solution. This is why data scientists use notebooks rather than traditional IDEs: notebooks allow them to prototype at pace.
Notebooks are great for fast prototyping, but they are terrible for reproducibility and enterprise deployment. The maturity of the data science development environment and the support from engineering to access and process large amounts of data are crucial. They not only affect productivity during model development, but also assure the quality of the solution and help avoid surprises during the deployment that follows.
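One common way to reduce the reproducibility problem is to move an experiment out of ad hoc notebook cells into a function that takes an explicit configuration and uses a locally seeded random generator. The following is only a sketch under that assumption: `ExperimentConfig`, `config_fingerprint`, and `run_experiment` are hypothetical names, and the "experiment" is a toy estimate rather than a real model.

```python
import hashlib
import json
import random
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class ExperimentConfig:
    # All fields are illustrative assumptions.
    seed: int = 42
    n_samples: int = 100
    noise: float = 0.1


def config_fingerprint(cfg: ExperimentConfig) -> str:
    """Stable short hash of the config, useful for tagging results."""
    payload = json.dumps(asdict(cfg), sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()[:12]


def run_experiment(cfg: ExperimentConfig) -> dict:
    """Toy experiment: estimate the mean of noisy samples around 1.0."""
    rng = random.Random(cfg.seed)  # local RNG, no hidden global state
    samples = [1.0 + rng.gauss(0.0, cfg.noise) for _ in range(cfg.n_samples)]
    return {
        "config": asdict(cfg),
        "fingerprint": config_fingerprint(cfg),
        "estimate": sum(samples) / len(samples),
    }


first = run_experiment(ExperimentConfig())
second = run_experiment(ExperimentConfig())

# Same config and seed give the same result, regardless of cell order.
print(first["fingerprint"], first["estimate"] == second["estimate"])
```

Because the result carries its own configuration and fingerprint, a number quoted in a report can be traced back to the exact settings that produced it.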
The Deployment of models is a challenging topic in itself. Commonly, the model that data scientists developed in the previous phase requires significant work to translate it into an enterprise solution that can be deployed on the existing stack. This includes requirements for logging, monitoring, auditing, authentication, SLAs, etc.
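To give a feel for what some of these requirements mean in code, the sketch below wraps a model with logging, a per-request ID for auditing, and simple counters for monitoring. It is an illustration only: `MonitoredModel`, the logger name, and the stand-in model (Python's built-in `sum`) are assumptions, and authentication and SLA enforcement are deliberately left out.

```python
import logging
import time
import uuid

logger = logging.getLogger("model_service")  # hypothetical logger name


class MonitoredModel:
    """Wraps any callable model with logging, audit IDs, and basic metrics."""

    def __init__(self, predict_fn, model_version):
        self._predict_fn = predict_fn
        self.model_version = model_version
        self.request_count = 0
        self.error_count = 0
        self.total_latency_s = 0.0

    def predict(self, features):
        request_id = str(uuid.uuid4())  # key for the audit trail
        self.request_count += 1
        start = time.perf_counter()
        try:
            prediction = self._predict_fn(features)
        except Exception:
            self.error_count += 1
            logger.exception(
                "request=%s version=%s failed", request_id, self.model_version
            )
            raise
        finally:
            # Latency is recorded for both successful and failed requests.
            self.total_latency_s += time.perf_counter() - start
        logger.info(
            "request=%s version=%s prediction=%r",
            request_id, self.model_version, prediction,
        )
        return {"request_id": request_id, "prediction": prediction}

    def metrics(self):
        """Snapshot that a monitoring system could collect."""
        avg = self.total_latency_s / self.request_count if self.request_count else 0.0
        return {
            "requests": self.request_count,
            "errors": self.error_count,
            "avg_latency_s": avg,
        }


# Stand-in for a trained model (assumption): it sums its input features.
service = MonitoredModel(predict_fn=sum, model_version="0.1.0")
response = service.predict([1.0, 2.0, 3.0])
print(response["prediction"], service.metrics())
```

Writing this wrapper once and reusing it for every model is a small-scale version of the argument for treating deployment as a shared platform concern.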
Solving deployment ad hoc for each model does not scale at all. Additionally, a big question mark remains: Who owns the operational responsibilities of a deployed model in production? Models are unique in their complex lifecycle management requirements in production. It’s therefore important to look at platforming data science, which you can read more about in my previous article about Machine Learning Logistics and Rendezvous Architecture.