The Missing Information in Business Metrics
https://biopmllc.com/strategy/the-missing-information-in-business-metrics/
Mon, 01 Mar 2021

Modern businesses generate and consume increasingly large amounts of data.  Information is needed to support operational and strategic decisions.  Despite the advent of Big Data tools and technology, most organizations I have worked with aren't able to take advantage of the data or tools in their daily work.  While greater awareness of human visual perception and cognition has improved dashboard designs, effective decision-making is often limited by the type of information monitored.

It is common to see summary statistics (such as sum, average, median, and standard deviation) used in reports and dashboards.  In addition, various metrics are used as Key Performance Indicators (KPIs).  For example, in manufacturing, management often uses Overall Equipment Effectiveness (OEE) to gauge efficiency.  In quality, process capability indices (e.g., Cpk) are used to evaluate the process's ability to meet customer requirements.  In marketing, the Net Promoter Score (NPS) helps assess customer satisfaction.

All of these are statistics, which are simply functions of data. But what does each of them tell us? What do we want to know from the data? What specific information is needed for the decision?

Unfortunately, most people who use performance metrics or statistics never stop to ask these basic questions.  I discussed some specific mistakes in using process capability indices last July.  A more general problem is that statistics can hide the information we need to know.

For example, last year I was coaching a Six Sigma Green Belt (GB) working in Quality.  A manufacturing process had a worsening Cpk.  The project was to increase the Cpk to meet the customer’s demanding requirement. Each time we met, the GB would show me how the Cpk had changed.  But Cpk is a function of both the process center (average) and the process variation (standard deviation), which comes from a number of sources (shifts, parts, measurements, etc.).  The root causes of the Cpk change were not uncovered until we looked deeper into the respective changes in the average and in the different contributors to the standard deviation.  
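To make that dependence concrete, here is a minimal Python sketch (the specification limits and data below are illustrative, not the client's), showing how Cpk is computed from the process average and standard deviation, and how a mean shift and added variation from a single source (such as measurement) can produce a similar-looking drop in Cpk:

```python
import numpy as np

def cpk(data, lsl, usl):
    """Process capability index: distance from the mean to the nearer
    spec limit, in units of 3 standard deviations."""
    mu, sigma = np.mean(data), np.std(data, ddof=1)
    return min((usl - mu) / (3 * sigma), (mu - lsl) / (3 * sigma))

rng = np.random.default_rng(1)
lsl, usl = 90.0, 110.0

baseline = rng.normal(100, 2, 500)                           # centered, sd ~ 2
shifted = rng.normal(103, 2, 500)                            # same spread, mean shifted
noisy = rng.normal(100, 2, 500) + rng.normal(0, 2, 500)      # extra variation, e.g., measurement

for name, x in [("baseline", baseline), ("mean shift", shifted), ("extra variation", noisy)]:
    print(f"{name:16s} mean={np.mean(x):6.2f}  sd={np.std(x, ddof=1):4.2f}  Cpk={cpk(x, lsl, usl):4.2f}")
```

Two very different root causes yield a similar Cpk, which is why watching the metric alone could not reveal what had actually changed in the process.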

The key takeaway is that when multiple contributors influence a metric, we cannot just monitor the change in the metric alone.  We must go deeper and seek other information needed for our decisions.

Many people may recall from statistics training that teachers always tell them to "plot the data!"  It is important to visualize the original data instead of relying on statistics alone because statistics don't tell you the whole story.  The famous example illustrating this point is Anscombe's quartet: four sets of (x, y) data with nearly identical descriptive statistics (mean, variance, and correlation) and even the same linear regression fit and R².  However, when visualized in scatter plots, they look drastically different.  If we only looked at one or a few statistics, we would miss the differences.  Again, statistics can hide useful information we need.
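The quartet is easy to reproduce.  Here is a minimal Python sketch using the published Anscombe (1973) values; it prints the nearly identical summary statistics, and a scatter plot of each set (e.g., with matplotlib) shows the four very different patterns:

```python
import numpy as np

# Anscombe (1973): four (x, y) sets with nearly identical summary statistics
x123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
quartet = {
    "I":   (x123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    "II":  (x123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    "III": (x123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    "IV":  ([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
            [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}

for name, (x, y) in quartet.items():
    x, y = np.array(x), np.array(y)
    slope, intercept = np.polyfit(x, y, 1)    # least-squares line
    r = np.corrcoef(x, y)[0, 1]
    print(f"{name:3s} mean_y={y.mean():.2f}  var_y={y.var(ddof=1):.2f}  "
          f"r={r:.3f}  fit: y = {intercept:.2f} + {slope:.2f}x  R2={r**2:.2f}")
# The printed statistics are nearly identical across the four sets;
# only plotting the raw points reveals how different they really are.
```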

Nowadays, there is too much data to digest, and modern tools can conveniently summarize and display it.  When we use data to inform our business decisions, it's easy to fall into the practice of looking only at the attractive summary in a report or on a dashboard.  The challenge of using data for decision making is to know what information we want and where to get it.

Guess who wrote the following about monitoring information for decisions?

With the coming of the computer this feedback element will become even more important, for the decision maker will in all likelihood be even further removed from the scene of action. Unless he or she accepts, as a matter of course, that he or she had better go out and look at the scene of action, he or she will be increasingly divorced from reality.

Peter Drucker in 1967.  He further wrote:

All a computer can handle is abstractions. And abstractions can be relied on only if they are constantly checked against concrete results.  Otherwise, they are certain to mislead.

Metrics and statistics are abstractions of reality – not the reality.  We must know how to choose and interpret these abstractions and how to complement them with other types of information.1

1. For more discussion on “go out and look” (aka Go Gemba), see my blog Creating Better Strategies.

The Practical Value of a Statistical Method
https://biopmllc.com/strategy/the-practical-value-of-a-statistical-method/
Tue, 01 Dec 2020

Shortly after I wrote my last blog, "On Statistics as a Method of Problem Solving," I received the latest issue of Quality Progress, the official publication of the American Society for Quality.  A statistics article, "Making the Cut – Critical values for Pareto comparisons remove statistical subjectivity," caught my attention because Pareto analysis is one of my favorite tools in continuous improvement.

It was written by two professors “with more than 70 years of combined experience in the quality arena and the use of Pareto charts in various disciplines” and covers a brief history of Pareto analysis and its use in quality to differentiate the vital few causes from the trivial many.

The authors introduced a statistical method to address the issue of "practitioners who collect data, construct a Pareto chart and subjectively identify the vital few categories on which to focus."  The main point is that two adjacent categories, sorted by occurrence in descending order, may not be statistically different in terms of their underlying frequency (e.g., rate of failure) due to sampling error.

Based on hypothesis testing, the method includes two simple tools:

  1. Critical values below which the lower occurrence category is deemed significantly different from the higher one
  2. A p-value for each pair of occurrence observations of the adjacent categories to measure the significance in the difference

Using a real data set (published by different authors) as an example, they showed that only some adjacent categories are significantly different and are therefore candidates for making the cut.
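The article's specific critical values and formulas are not reproduced here, but the flavor of such a pairwise comparison can be sketched.  One conventional framing – my assumption, not necessarily the authors' derivation – conditions on the combined count of two adjacent categories: under the null hypothesis of equal underlying rates, the higher-ranked category's count follows a Binomial(n, 0.5) distribution.  The counts below are illustrative, not the published data set.

```python
from scipy.stats import binomtest

def adjacent_pareto_pvalue(count_hi, count_lo):
    """Two-sided p-value for whether two adjacent Pareto categories could
    share the same underlying rate, conditioning on their combined total."""
    n = count_hi + count_lo
    return binomtest(count_hi, n, p=0.5, alternative="two-sided").pvalue

# Illustrative counts for categories already sorted in descending order
counts = {"A": 120, "B": 95, "C": 60, "D": 52, "E": 15}
names = list(counts)
for hi, lo in zip(names, names[1:]):
    p = adjacent_pareto_pvalue(counts[hi], counts[lo])
    print(f"{hi} vs {lo}: p = {p:.3f}")
```

Typically only some adjacent pairs come out significant, which is the pattern the authors describe.  Whether that changes any practical decision is another matter.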

I see the value in raising the awareness of statistical thinking in decision making (which is desperately needed in science and industry).  However, in practice, the method is far less useful than it appears and can lead to improper applications of statistical methods.

Here are but a few reasons.

  • The purpose of Pareto charts is exploratory analysis, not binary decision-making, i.e. deciding which categories make the cut into the vital few.  As a data visualization tool, a Pareto chart shows, overall, whether there is a Pareto effect – an obvious 80/20 distribution in the data not only indicates an opportunity to apply the Pareto principle but also gives insight into the nature of the underlying cause system.  
  • Using the hypothesis test to answer an unnecessary question is waste.  Overall, if the Pareto effect is strong, the decision is obvious, and the hypothesis test to distinguish between categories is not needed.  If the overall effect is not strong enough to make the obvious decision, the categorization method used is not effective in prioritization, and therefore, other approaches should be considered.  
  • Prioritization decisions depend on resources and other considerations, not category occurrence ranking alone.  This is true even if the Pareto effect is strong.  People making prioritization decisions based solely on Pareto analysis are making a management mistake that cannot be overcome by statistical methods. 
  • The result of the hypothesis test offers no incremental value – it does not change the decisions made without such tests.  For example, if the fourth ranking category is found not statistically different from the third and there are only enough resources to work on three categories, what should the decision be? How would the hypothesis test improve our decision? Equally unhelpful, a test result of significant difference merely confirms our decision. 
  • The claim of “removing subjectivity” by using the hypothesis test is misleading.  The decision in any hypothesis test depends on the risk tolerance of the decision maker, i.e. the alpha (or significance level) used to make the decision whether a given p-value is significant is chosen subjectively.  The choice of a categorization method also depends on subject matter expertise – another subjective factor.  For example, two categories could have been defined as one.  In addition, many decisions in a statistical analysis involve some degrees of expert judgment and therefore introduce subjectivity.  Such decisions may include whether the data is a probability sample, whether the data can be modeled as binomial, whether the process that generated the data was stable, etc.  

Without sufficient understanding of statistical theory and practical knowledge in its applications, one can easily be overwhelmed by statistical methods presented by the “experts.”  Before considering a statistical method, ask the question “how much can it practically improve my decision?”  In addition, “One must never forget the importance of subject matter.” (Deming)

On Statistics as a Method of Problem Solving
https://biopmllc.com/strategy/on-statistics-as-a-method-of-problem-solving/
Sun, 01 Nov 2020

If you have taken a class in statistics, whether in college or as a part of professional training, how much has it helped you solve problems?

Based on my observation, the answer is mostly not much. 

The primary reason is that most people are never taught statistics properly.   Terms like null hypothesis and p-value just don’t make intuitive sense, and statistical concepts are rarely presented in the context of scientific problem solving. 

In the era of Big Data, machine learning, and artificial intelligence, one would expect improved statistical thinking and skills in science and industry.  However, the teaching and practice of statistical theory and methods remain poor – probably no better than when W. E. Deming wrote his 1975 article “On Probability As a Basis For Action.” 

I have witnessed many incorrect practices in the teaching and application of statistical concepts and tools.  There are mistakes unknowingly made by users inadequately trained in statistical methods, for example, failing to meet the assumptions of a method or not considering the impact of the sample size (or statistical power).  This lack of technical knowledge can be remedied by continued study of the theory.

The bigger problem I see is that statistical tools are used for the wrong purpose or the wrong question by people who are supposed to know what they are doing — the professionals.  To the less sophisticated viewers, the statistical procedures used by those professionals look proper or even impressive.  To most viewers, if the method, logic, or conclusion doesn’t make sense, it must be due to their lack of understanding.  

An example of using statistics for the wrong purpose is p-hacking – the common practice of manipulating the experiment or analysis until the p-value reaches the desired value and therefore supports the desired conclusion.

Not all bad practices are as easily detectable as p-hacking.  They often use statistical concepts and tools for the wrong question.  One category of such examples is failing to differentiate enumerative and analytic problems, a distinction that Deming wrote about extensively in his work, including the article mentioned above.  I also touched on this in my blog Understanding Process Capability.

In my opinion, the underlying issue in using statistics to answer the wrong questions is the gap between subject matter experts, who try to solve problems but lack adequate understanding of probability theory, and statisticians, who understand the theory but do not have experience solving real-world scientific or business problems.

Here is an example.  A well-known statistical software company provides a "decision making with data" training.  Their example of using a hypothesis test is to evaluate whether a process is on target after some improvement.  They set the null hypothesis as the process mean being equal to the desired target.

The instructors explain that “the null hypothesis is the default decision” and “the null is true unless our data tell us otherwise.” Why would anyone collect data and perform statistical analysis if they already believe that the process is on target?  If you are statistically savvy, you will recognize that you can reject any hypothesis by collecting a large enough sample. In this case, you will eventually conclude that the process is not on target.
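A quick simulation makes the point.  In this hedged sketch (illustrative numbers: a process off target by a practically irrelevant 0.05 units with a standard deviation of 1), the one-sample t-test p-value collapses as the sample size grows:

```python
import numpy as np
from scipy.stats import ttest_1samp

rng = np.random.default_rng(7)
target = 100.0
true_mean, sigma = 100.05, 1.0   # off target by a practically irrelevant amount

for n in [30, 300, 3000, 30000, 300000]:
    sample = rng.normal(true_mean, sigma, n)
    p = ttest_1samp(sample, popmean=target).pvalue
    print(f"n = {n:6d}  p-value = {p:.4f}")
# With H0: mean == target, the p-value eventually drops below any alpha,
# so we "conclude" the process is off target no matter how trivial the gap.
```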

The instructors further explain “It might seem counterintuitive, but you conduct this analysis to test that the process is not on target. That is, you are testing that the changes are not sufficient to bring the process to target.” It is counterintuitive because the decision maker’s natural question after the improvement is “does the process hit the target” not “does the process not hit the target?”

The reason, I suppose, for choosing such a counterintuitive null hypothesis is that it is convenient to formulate the null hypothesis by setting the process mean to a known value and then calculate the probability of observing the collected data (i.e. the sample) from this hypothetical process.

What’s really needed in this problem is not statistical methods, but scientific methods of knowledge acquisition. We have to help decision makers understand the right questions. 

The right question in this example is not “does the process hit the target?” which is another example of process improvement goal setting based on desirability, not a specific opportunity. [See my blog Achieving Improvement for more discussion.]  

The right question should be “do the observations fall where we expect them to be, based on our knowledge of the change made?”  This “where” is the range of values estimated based on our understanding of the change BEFORE we collect the data, which is part of the Plan of the Plan-Do-Study-Act or Plan-Do-Check-Act (PDSA or PDCA) cycle of scientific knowledge acquisition and continuous improvement.   
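As a hedged illustration of what that Plan step can look like – the numbers and the simple normal-approximation range below are my assumptions, not a prescribed method – we state the predicted mean and a range for the sample average before collecting data, then check where the observations actually fall:

```python
import numpy as np

# Plan: based on our understanding of the change, predict the new process
# mean and state, BEFORE collecting data, where the sample average should fall.
predicted_mean = 102.0        # expected effect of the change (assumption)
process_sigma = 1.5           # known from historical data (assumption)
n = 25                        # planned sample size
se = process_sigma / np.sqrt(n)
low, high = predicted_mean - 3 * se, predicted_mean + 3 * se
print(f"Predicted range for the sample average: {low:.2f} to {high:.2f}")

# Do / Study: collect the sample and compare with the prediction.
rng = np.random.default_rng(11)
observed = rng.normal(101.0, process_sigma, n)   # what the process actually did
xbar = observed.mean()
print(f"Observed average: {xbar:.2f} -> "
      f"{'consistent with' if low <= xbar <= high else 'outside'} our prediction")
```

Either outcome teaches us something: agreement builds confidence in our understanding of the change, and disagreement points to knowledge we still lack.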

If we cannot estimate this range with its associated probability density, then we don't know enough about our change and its impact on the process.  In other words, we are just messing around without using a scientific method.  No application of statistical tools can help – they are just window dressing.

With the right question asked, a hypothesis test is unnecessary, and there is no false hope that the process will hit the desired target.  We will improve our knowledge based on how well the observations match our expected or predicted range (i.e. Study/Check).   We will continue to improve based on specific opportunities generated with our new knowledge.

What is your experience in scientific problem solving?

What Does the Data Tell Us?
https://biopmllc.com/analytics/what-does-the-data-tell-us/
Wed, 01 Apr 2020

It's March 31, 2020.  In the past 3 months, the novel coronavirus (COVID-19) has changed the world we live in.  As the virus spreads around the globe, everyone is anxiously watching the latest statistics on confirmed cases and deaths attributed to the disease in various regions.  With the latest technology, timely data is more accessible to the public than ever before.

With the availability of data comes the challenge of proper comprehension and communication of it.  I am not talking about advanced data analytics or visualization but communication and interpretation of simple numbers, counts, ratios, and percentages.    

The COVID-19 pandemic has provided us ample examples of such data.   If not careful, even simple data can be misinterpreted and lead to incorrect conclusions or actions. 

Cumulative counts (or totals) never go down; they are monotonically increasing.  The total confirmed cases always increase over time even when the daily new cases are dropping.  The total is therefore not the most effective way to communicate trends, unless we compare it with some established models.  The change in daily cases gives better insight into the progress.

Even the daily change should be interpreted with caution.  A jump or drop in new cases on any single day may not mean much because of chance variation inherent in data collection.  It is more reliable to fit the data to a model over a number of days to understand the trend.
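A small sketch with synthetic counts (purely illustrative) shows the routine arithmetic: derive daily new cases from the cumulative total, then smooth them over a window of days before judging the trend:

```python
import numpy as np
import pandas as pd

# Synthetic cumulative counts: an underlying trend plus day-to-day reporting noise
rng = np.random.default_rng(3)
daily_true = np.round(np.linspace(50, 200, 30) + rng.normal(0, 25, 30)).clip(min=0)
cumulative = pd.Series(daily_true.cumsum(), name="cumulative")

daily_new = cumulative.diff().fillna(cumulative.iloc[0])   # cumulative -> daily
smoothed = daily_new.rolling(window=7).mean()              # 7-day moving average

print(pd.DataFrame({"cumulative": cumulative,
                    "daily_new": daily_new,
                    "7-day_avg": smoothed}).tail(8).round(1))
# The cumulative column only ever rises; the smoothed daily column is what
# actually shows whether the trend is accelerating or slowing.
```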

The range of a dataset gets bigger as more data is collected.  Even extreme values that occur infrequently will show up if the sample size is large.  Younger people are less likely to have severe symptoms if infected by the virus.  The initial data on hospitalization or mortality show predominantly older patients, the most vulnerable population.  As more cases are collected, the patient age range will naturally expand to include very young patients who need hospitalization or even die.  But this increase in the number of younger patients does not necessarily mean that the virus has become deadlier for the younger population.

The percentage of hospitalized patients who are under 65 years of age is, by itself, not the right measure of the disease risk to the younger population.  There are significantly more people younger than 65 than older in a general population.  Each person's risk should be adjusted by the size of the age group.  In addition, the severity of each hospitalized patient's illness is different, and pre-existing health conditions also play a critical role in recovery or survival.

Mortality is the ratio of the number of the deceased to the number of confirmed cases.  The numerator is likely more accurate than the denominator.  Most patients who died of COVID-19-related complications are probably counted, whereas the confirmed cases represent mainly those infected people who have severe symptoms – known to be a minority of all infections.  Therefore, the calculated mortality is likely an overestimate at the initial stage of the pandemic, when the prevalence of the disease is uncertain.

In the above examples, it only takes some awareness to avoid data misinterpretation.  For critical decisions, we must understand the context of the data, e.g. where the data came from, what data is collected, how it is collected, what data is missing, etc.

We should never forget that the data we see is usually collected from a sample of the population we are trying to understand.  Any statistic (or calculation) from the sample data, such as a count or average, is not what interests us most.  What we truly want to know is some population attribute estimated from the sample data.  We cannot measure the entire population, e.g. test everyone to see who is infected, and have to rely on the sample data available to us.  Different samples can give drastically different data.  We must understand what that sample is and how it was selected in order to infer from the data.

For example, the sample may not be representative of the population.  The people who have been tested for the new coronavirus represent a sample.  But if only seriously ill people are tested, they do not represent the general population if we want to understand how deadly the virus is.

Equally important is the method of measurement.  All tests have errors.  An infected person could give a negative test result (i.e. a false negative), and an uninfected person could give a positive result (i.e. false positive).  The probabilities of such errors depend on the test.  Different tests on the same people can give different results. 
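Bayes' rule shows how much these errors matter in practice.  In this sketch the sensitivity, specificity, and prevalence values are illustrative assumptions, not the characteristics of any actual COVID-19 test:

```python
def positive_predictive_value(sensitivity, specificity, prevalence):
    """P(infected | positive test) via Bayes' rule."""
    true_pos = sensitivity * prevalence
    false_pos = (1 - specificity) * (1 - prevalence)
    return true_pos / (true_pos + false_pos)

# Illustrative numbers only: a test with 95% sensitivity and 98% specificity
for prevalence in [0.01, 0.05, 0.20]:
    ppv = positive_predictive_value(0.95, 0.98, prevalence)
    print(f"prevalence = {prevalence:4.0%}  ->  P(infected | positive) = {ppv:.0%}")
# At 1% prevalence, only about a third of the people who test positive are
# actually infected, even with a fairly accurate test.
```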

To analyze data properly, trained professionals depend on probability theory and sophisticated methods.  For most people, though, it helps to know that what’s not in the data could be more important than the data.

Can You Trust Your Data?
https://biopmllc.com/operations/can-you-trust-your-data/
Mon, 30 Dec 2019

Data is a new buzzword.  Big Data, data science, data analytics, etc. are words that surround us every day.  With the abundance of data, the challenges of data quality and accessibility become more prevalent and relevant to organizations that want to use data to support decisions and create value.  One question about data quality is "can we trust the data we have?"  No matter what analysis we perform, it's "garbage in, garbage out."

This is one reason that Measurement System Analysis (MSA) is included in all Six Sigma training.  Because Six Sigma is a data-driven business improvement methodology, data is used in every step of the problem-solving process, commonly following the Define-Measure-Analyze-Improve-Control (or DMAIC) framework.  The goal of MSA is to ensure that the measurement system is adequate for the intended purpose.   For example, a typical MSA evaluates the accuracy and precision of the data. 
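In its most basic form, such a check can be sketched as follows (illustrative numbers; a full Gage R&R or analytical method validation involves far more structure, as discussed next): measure a reference standard repeatedly, then compare the bias and the repeatability against the tolerance the data must support.

```python
import numpy as np

# Repeated measurements of a reference standard with a known value (illustrative data)
reference_value = 10.00
tolerance = 0.50                      # e.g., the spec width the data must support
measurements = np.array([10.02, 9.98, 10.05, 10.01, 9.97, 10.04, 10.00, 10.03])

bias = measurements.mean() - reference_value             # accuracy
repeatability = measurements.std(ddof=1)                 # precision
precision_to_tolerance = 6 * repeatability / tolerance   # common rule-of-thumb ratio

print(f"bias = {bias:+.3f}")
print(f"repeatability (sd) = {repeatability:.3f}")
print(f"P/T ratio = {precision_to_tolerance:.0%}  "
      "(a common rule of thumb treats under ~10% as good and over ~30% as inadequate)")
```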

In science and engineering, much more comprehensive and rigorous studies of a measurement system are performed for specific purposes.  For example, the US Food and Drug Administration (FDA) publishes a guidance document: Analytical Procedures and Methods Validation for Drugs and Biologics, which states

“Data must be available to establish that the analytical procedures used in testing meet proper standards of accuracy, sensitivity, specificity, and reproducibility and are suitable for their intended purpose.”

While the basic principles and methods have been available for decades, most organizations lack the expertise to apply them properly.  In spite of good intentions to improve data quality, many make the mistake of sending newly trained Six Sigma Green Belts (GB’s) or Black Belts (BB’s) to conduct MSA and fix measurement system problems.  The typical Six Sigma training material in MSA (even at the BB level) is severely insufficient if the trainees are not already proficient in science, statistical methods, and business management.  Most GB’s and BB’s are ill-prepared to address data quality issues.

Here are just a few examples of improper use of MSA associated with Six Sigma projects.

  • Starting Six Sigma projects to improve operational metrics (such as cycle time and productivity) without a general assessment of the associated measurement systems.  If the business metrics are used routinely in decision making by the management, it should not be a GB’s job to question the quality of these data in their projects.  It is management’s responsibility to ensure the data are collected and analyzed properly before trying to improve any metric.
  • A GB is expected to conduct an MSA on a data source before a business reason or goal is specified.  Is it the accuracy or the precision that is of most concern, and why?  How accurate or precise do we want to be?  MSA is not a check-box exercise and consumes the organization's time and money.  The key question is "is the data or measurement system good enough for the specific purpose or question?"
  • Asking a GB to conduct an MSA in the Measure phase and expecting him/her to fix any inadequacy as a part of a Six Sigma project.  In most cases, changing the measurement system is a project by itself.  It is out of scope of the Six Sigma project.  Unless the system is so poor that it invalidates the project, the GB should pass the result from the MSA to someone responsible for the system and move on with his/her project.
  • A BB tries to conduct a Gage Repeatability & Reproducibility (R&R) study on production data when a full analytical method validation is required.  A typical Gage R&R only includes a few operators to study measurement variation, whereas in many processes there are far more sources of variation in the system, which requires a much more comprehensive study.  This happens when the BB lacks domain expertise and advanced training in statistical methods.

To avoid such common mistakes, organizations should consider the following simple steps.

  1. Identify critical data and assign their respective owners
  2. Understand how the data are used, by whom, and for what purpose
  3. Decide the approach to validate the measurement systems and identify gaps
  4. Develop and execute plans to improve the systems
  5. Use data to drive continuous improvement, e.g. using Six Sigma projects

Data brings us opportunities.  Is your organization ready?

Capabilities of Future Leaders
https://biopmllc.com/organization/capabilities-of-future-leaders/
Mon, 22 Oct 2018

What capabilities are required for future leaders in life sciences? How can organizations develop such leaders? A recent McKinsey article, Developing Tomorrow's Leaders in Life Sciences, addresses this exact question. Using data from their 2017 survey on leadership development in life sciences, the authors illustrated the gaps and opportunities and presented five critical skills.

  1. Adaptive mind-set
  2. 3-D savviness
  3. Partnership skills
  4. Agile ways of working
  5. A balanced field of vision

It is a well written article with useful insights and actionable recommendations for effective leadership development. However, there is one flaw – presentation of the survey data. Did you notice any issues in the figures?

I can see at least two problems that undermine the credibility and impact of the article.

Inconsistent numbers
The stacked bar charts have four individual groups (C-suite, Top team, Middle managers, and Front line) in addition to the Overall. In Exhibit 1, for example, the percentages of the respondents that strongly agree with the statement “My organization has a clear view of the 2-3 leadership qualities and capabilities that it wants to be excellent at” are 44, 44, 30, and 33%, respectively. Overall, can 44% of them strongly agree? No. But that is the number presented.

It doesn’t take a mathematical genius to know that the overall (or weighted average) has to be within the range of the individual group values, i.e. 30 < overall < 44. Similarly, it is not possible to have an 8% overall “Neither agree or disagree” when the individual groups have 11, 9, 16, and 17%. The same inconsistency pattern happens in Exhibits 4 and 5.

Which numbers are correct?

No mention of sample size
Referring to Exhibit 1, the authors compared the executive responses in the “strongly agree” category (“less than 50 percent”) to those of middle managers and frontline staff (“around 30 percent”), stating there is a drop from the executives to the staff. But can a reader make an independent judgment whether the difference between the two groups really exists? No, because the numbers alone, without a measure of uncertainty, cannot support the conclusion.

We all know that the survey like this only measures a limited number of people, or a sample, from each target group. The resulting percent values are only estimates of the true but unknown values and are subject to sampling errors due to random variation, i.e. a different set of respondents will result in a different percent value.

The errors can be large in such surveys depending on the sample size. For example, if 22 out of 50 people in one group agree with the statement, the true percent value may be somewhere in the range of 30-58% (or 44±14%). If 15 out of 50 agree in another group, its true value may be in the range of 17-43% (or 30±13%). There is a considerable overlap between the two ranges. Therefore, the true proportions of the people who agree with the statement may not be different. In contrast, if the sample size is 100, the data are 44/100 vs. 30/100, the same average proportions as the first example. The ranges where the true values may lie are tighter, 34-54% (44±10%) vs. 21-39% (30±9%). Now it is more likely that the two groups have different proportions of people who agree with the statement.
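The ranges quoted above follow from a standard normal-approximation 95% confidence interval for a proportion; the short sketch below reproduces them (the n = 50 and n = 100 sample sizes are the hypothetical ones used in this paragraph):

```python
import math

def approx_95ci(successes, n):
    """Normal-approximation 95% confidence interval for a proportion."""
    p = successes / n
    margin = 1.96 * math.sqrt(p * (1 - p) / n)
    return p - margin, p + margin

for successes, n in [(22, 50), (15, 50), (44, 100), (30, 100)]:
    low, high = approx_95ci(successes, n)
    print(f"{successes}/{n}: {successes/n:.0%}  ->  roughly {low:.0%} to {high:.0%}")
# 22/50 -> ~30% to 58% and 15/50 -> ~17% to 43% (wide and overlapping);
# 44/100 -> ~34% to 54% and 30/100 -> ~21% to 39% (tighter, likely different).
```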

Not everyone needs to know how to calculate the above ranges or determine the statistical significance of the observed difference. But decision makers who consume data should have a basic awareness of the sample size and its impact on the reliability of the values presented. Drawing conclusions without necessary information could lead to wrong decisions, waste, and failures.

Beyond the obvious errors and omissions discussed above, numerous other errors and biases are common in the design, conduct, analysis, and presentation of surveys or other data. For example, selection bias can lead to samples not representative of the target population being analyzed. Awareness of such errors and biases can help leaders ask the right questions and demand the right data and analysis to support the decisions.

In the Preface of Out of the Crisis, W. Edwards Deming made it clear that "The aim of this book is transformation of the style of American management" and "Anyone in management requires, for transformation, some rudimentary knowledge of science—in particular, something about the nature of variation and about operational definitions."

Over the three and a half decades since Out of the Crisis was first published, the world has produced orders of magnitude more data. The pace is accelerating. However, the ability of management to understand and use data has hardly improved.

The authors of the McKinsey article are correct about 3-D savviness: “To harness the power of data, design, and digital (the three d’s) and to stay on top of the changes, leaders need to build their personal foundational knowledge about what these advanced technologies are and how they create business value.” That foundational knowledge can be measured in one way by their ability to correctly use and interpret stacked bar charts.

Now, more than ever, leaders need the rudimentary knowledge of science.

Road to the Data-rich Future
https://biopmllc.com/strategy/road-to-the-data-rich-future/
Sun, 26 Aug 2018

Hardly a day goes by without an article about a business or organization using data analytics, machine learning, or artificial intelligence to solve tough problems or even disrupt an industry. With the wide availability of computers and other digital devices, capturing and storing data becomes easier. This represents unprecedented opportunities to gain knowledge and insight from data.

However, turning the opportunities into fruitful results can be a bumpy journey. I expect organizations to encounter even greater difficulty than they did when implementing Six Sigma, a data-driven business improvement methodology.

Since the 1980’s, many organizations have implemented Six Sigma (often along with other methodologies) to improve their performance. Some were able to transform the entire organization’s culture and capabilities to achieve sustained improvement, while many were only able to achieve isolated and/or temporary gains. There is no question that change leadership and organizational change management capability played a critical role. Implementing data analytics is no exception.

In addition, Six Sigma and Big Data analytics share some unique challenges, one of which is the requirement for data and the expertise in extracting insight from the data. I have seen countless Six Sigma projects fail to deliver the promise because of poor data availability or quality and/or lack of skilled resources. Unable to achieve quick and significant improvement, some organizations have given up on Six Sigma and shifted more effort to Lean or Agile. But the underlying causes of deficiencies in data and analytics capabilities are not addressed and will inevitably impede implementation of data analytics initiatives.

Therefore, organizations considering investing in data analytics should seriously assess these two risk areas.

Poor data quality
I use “quality” here loosely to mean two things, usefulness and absence of defects.

Not all data are equally useful and can help us develop insight or solve problems. What data should be captured, stored, and processed? Data that is readily available may not be useful to the problem we try to solve, whereas potentially useful data can be costly to collect. Who can help decide and prioritize what data to collect?

It is a known fact but may be surprising to some people that data scientists spend more time cleaning up data than analyzing it. Useful data rarely come in a complete, accurate, and consistent format. A Forbes article reports that data scientists spend about 80% of their time on collecting, cleaning, and organizing data. I concur that data cleaning is the most time-consuming and least enjoyable task. No business wants their scarce and highly paid resources to spend the majority of the time on non-value added activities. What can they do about it?

Lack of resources with analytics and subject matter expertise
To solve business problems, analytics experts need not only computer science and statistical skills but also general operations and business knowledge. Ideally, they also have subject matter expertise. But such talent is the exception rather than the norm. The iterative process of collecting, cleaning, modeling, and interpreting data requires close collaboration among analytics experts, subject matter experts, and management. My observation has been that most subject matter experts are not familiar with even the basic concepts of computer science and statistics, the backbone of analytics. Simply hiring a few Black Belts never worked for Six Sigma; acquiring data scientists is not enough if the rest of the organization is ill prepared. A Center of Excellence model tends to centralize the analytics expertise and delay broad engagement and ownership across the organization.

These areas are but two important considerations as leaders develop a comprehensive approach to mitigate risks in analytics programs. Leaders should follow a strategy development process and resist one-off efforts, such as technology installation or talent acquisition. By evaluating all aspects of the current operating model with respect to their vision of digital capabilities for the future organization, they can develop a cohesive plan for a smoother ride to the destination.
