Despite the hype, there’s a brewing crisis in data science. Quite a few times in recent months, I have heard executives say ‘data doesn’t work’, ‘it isn’t worth it’ or ‘we have more data than is useful’ when discussing “big data”. The problem in each of these cases is the same: there is a big difference between truly implementing a big data solution and just collecting a lot of data.
At a basic level, defining ‘big data’ is challenging for many people, even at data-driven companies. Most definitions center on the size of the data set, e.g. ‘more data than can be handled by a traditional RDBMS’ or ‘processing more than X billion records per day’. Unfortunately, these simple definitions entirely miss the point of ‘big data’ and have helped create a data science crisis, with organizations questioning the value of innovative (and often expensive) data solutions.
These situations would occur less often if big data were defined more holistically, shifting the focus from the amount of data to the overall strategic approach – something like: ‘Big Data’ is the implementation of the specialized tools, processes and skills needed to create practical applications of highly diverse and voluminous data sets. While this definition may be a bit pedestrian, it addresses the two critical issues in succeeding with big data: 1) implementation is a central part of the discipline, and 2) big data is a complex, cross-disciplinary exercise. Viewed in this light, most detractors would realize that they have not actually engaged in big data, but rather have deployed pieces of a solution.
So how should an organization approach big data to be successful? Up to this point, the main focus has been on the storage, processing or statistical techniques required to handle ‘very large’ data sets (there is a great write-up here). But this is incomplete from a solution standpoint. A true solution must include a well-developed strategy for data ‘implementation’. In fact, any ‘big data’ project that does not include a holistic implementation plan is not a big data solution but rather a storage and processing solution, where the main deliverable is probably ‘some reports’.
To be truly transformative and maximize the value of data assets, the following implementation features should be part of solution design from the start:
-Identify which operations personnel will act on the outputs of the data solution, and ensure they have the necessary skill sets – or start training or hiring as soon as possible.
-Design the operational processes that will use the outputs. Yes, this can be done before there are actual outputs: use mock-ups. If this cannot be done ahead of time, there is a serious flaw in the solution concept, and the likely outcome will be a team of data scientists generating ‘interesting insights’ on expensive hardware.
-Assume you will need more data. The best way to improve analytic results is to increase data inputs, so effective solutions will continually add and evolve data types and sources. Identify who will manage this growth from business, analytics and technology perspectives.
-What counts as material ‘speed’ for the business must be articulated clearly across all data deliverables. Hundreds of decisions made throughout the implementation require a balance of speed versus accuracy; one such knob is sketched in the code after this list. These trade-offs must be deliberate or there *will* be misalignment (and dissatisfaction).
-Engineering and analytics requirements will leap-frog each other. Data science will need engineering to deliver infrastructure for large modeling/mining projects; once engineering delivers, analytics will need time to iterate at the new scale, and once they catch up, they will have new requirements and the cycle starts all over. Plan for this dynamic with development roadmaps to ensure that neither group becomes the bottleneck.
-Figure out the economics. This might be the toughest part of implementation, but considering upfront how data solutions will be measured is crucial, since ‘insights’ are often difficult to value. While many cases are obvious, productivity measures can be employed for more esoteric outputs, e.g. revenue per employee (see the worked example after this list).
-Coordination + accountability is a must. It seems obvious, but someone must be in charge of making decisions and ensuring that all groups are executing. Organizational distance is the surest way for a company to find itself with a huge data store and no way to use it (even after spending considerable $). This person should be fluent enough in all areas to evaluate progress and empowered to make decisions.
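To make the speed-versus-accuracy point concrete, here is a minimal Python sketch of one such knob: estimating a metric from a random sample rather than the full data set. The function and parameter names are hypothetical, not from any particular tool.

```python
import random

# Hypothetical sketch of a speed-versus-accuracy knob: estimate a metric
# from a random sample instead of the full data set. Names are illustrative.
def estimate_mean(values, sample_rate=0.1, seed=42):
    """Estimate the mean from a random sample.

    A lower sample_rate runs faster but yields a noisier estimate; the
    'right' rate is a business decision, not a purely technical one.
    """
    rng = random.Random(seed)
    k = max(1, int(len(values) * sample_rate))
    return sum(rng.sample(values, k)) / k

data = [random.gauss(100, 15) for _ in range(1_000_000)]
print(estimate_mean(data, sample_rate=0.01))  # fast, rough estimate
print(estimate_mean(data, sample_rate=0.5))   # slower, tighter estimate
```

Choosing the sample rate is exactly the kind of decision that should be made deliberately with the business, not left to whoever happens to write the job.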
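And for the economics bullet, a purely illustrative example of tracking a simple productivity baseline, revenue per employee, before and after a data solution ships (all figures are made up):

```python
# Purely illustrative: a coarse productivity baseline (revenue per
# employee) tracked before and after a data solution launches.
# All figures below are made up.
def revenue_per_employee(total_revenue: float, headcount: int) -> float:
    return total_revenue / headcount

before = revenue_per_employee(50_000_000, 250)  # pre-launch baseline
after = revenue_per_employee(56_000_000, 255)   # post-launch period

print(f"Before: ${before:,.0f} per employee")   # Before: $200,000 per employee
print(f"After:  ${after:,.0f} per employee")    # After:  $219,608 per employee
print(f"Change: {after / before - 1:.1%}")      # Change: 9.8%
```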
By addressing these implementation items, organizations can claim they are doing ‘big data’ and not just storing a bunch of data and deploying tools.
Taking this comprehensive view is critical if we want to avoid a data science crisis, where data assets, platforms and personnel are largely regarded as very expensive experiments.