Have you ever wondered what might have happened if we had been able to predict the 2008 financial crisis well before it threw the global economy into turmoil? Was there data available and accessible to suggest that the supply of houses was outrunning demand, that borrowers could default on their mortgages, and that the derivatives and other investments tied to them would lose value? In hindsight, one can say the price of not knowing was an unprecedented crisis that left many businesses reeling under its impact.
Dark data refers to all the information assets organizations collect, process and store during regular business transactions, but fail to use for other purposes (for example, analytics, deciphering business context or inter-relationships and direct monetizing).
During World War II, Allied planes would come back from battle with bullet holes. The initial plan was to reinforce the parts of the planes that were most commonly hit by enemy fire. Abraham Wald, a mathematician, pointed out that there was another way to look at the data: perhaps certain areas of the returning planes showed no bullet holes because the planes that were shot in those areas never actually returned. This insight was later borne out and led to the armor being reinforced on the parts of the plane where the returning aircraft showed no bullet holes. This is a classic example of survivor bias (also called survivorship bias) leading to dark data. Survivor bias is the logical error of concentrating on the people or things that made it past some selection process and overlooking those that did not, typically because of their lack of visibility. This is just one of the many ways in which dark data manifests itself.
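Wald's reasoning can be illustrated with a small simulation. The sketch below is a toy model only: the plane sections and the loss probabilities are hypothetical numbers chosen for illustration, not historical figures, and each plane is assumed to take exactly one hit in a uniformly random section.

```python
import random

# Hypothetical probability that a hit in each section brings the plane down.
LOSS_PROBABILITY = {"engine": 0.8, "cockpit": 0.6, "fuselage": 0.1, "wings": 0.1}


def simulate(num_planes=10_000, seed=0):
    """Return (hits on all planes, hits seen on returning planes) per section."""
    rng = random.Random(seed)
    sections = list(LOSS_PROBABILITY)
    all_hits = {s: 0 for s in sections}
    observed_hits = {s: 0 for s in sections}
    for _ in range(num_planes):
        # Every plane takes one hit, in a section picked uniformly at random.
        section = rng.choice(sections)
        all_hits[section] += 1
        # Only planes that survive return to base and can be inspected.
        if rng.random() >= LOSS_PROBABILITY[section]:
            observed_hits[section] += 1
    return all_hits, observed_hits


all_hits, observed_hits = simulate()
for section in LOSS_PROBABILITY:
    print(f"{section:9s} all={all_hits[section]:5d} seen={observed_hits[section]:5d}")
```

Although hits are spread evenly across sections in this model, the returning planes show far fewer engine and cockpit hits than fuselage or wing hits. The missing holes mark the vulnerable areas, and the lost planes are the dark data.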
The story behind the data is arguably more important than the data itself. The reason we are missing certain pieces of data may be more meaningful than the data we have.
This is where it becomes critical for data analysts and data scientists to ask the right questions and to have the right data collection and validation strategy. Ronald H. Coase, a renowned British economist, famously said, “…if you torture the data long enough, it will confess to anything”. Put simply, data can be made to support almost any conclusion, so it pays to look beyond the obvious. When you start looking at the story behind the data you are presented with, many unknowns get answered. This can include looking at when the data was collected, what methods and assumptions were applied during collection, and what transformations the data has undergone.
Once you have diligently performed such ‘forensics’ and consciously determined which data elements or attributes are missing, you are in the realm of the ‘known unknowns’. There are many statistical and machine-learning-based approaches to address these gaps, including new surveys, various data imputation methods, substitution by proxies, and the more recent ML-based synthetic data generation techniques, which are gaining currency in many scenarios.
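As one illustration of a data imputation method, the sketch below fills missing values with the median of the observed ones and keeps a record of which entries were missing, since the pattern of missingness can itself carry meaning. The function name impute_with_median and the income figures are hypothetical, and median imputation is only one of the simplest options, not a recommendation for every dataset.

```python
from statistics import median


def impute_with_median(values):
    """Replace None with the median of the observed values.

    Returns the filled list and a mask marking which entries were missing.
    """
    observed = [v for v in values if v is not None]
    if not observed:
        raise ValueError("cannot impute: no observed values")
    fill = median(observed)
    was_missing = [v is None for v in values]
    filled = [fill if v is None else v for v in values]
    return filled, was_missing


# Hypothetical survey responses; None marks a respondent who gave no answer.
incomes = [42_000, None, 58_000, 61_000, None, 39_000]
filled, was_missing = impute_with_median(incomes)
print(filled)
print(was_missing)
```

Keeping the mask alongside the filled values lets a later analysis check whether the missing entries cluster in a particular group, which is often the more important finding.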
However, it is not enough to be satisfied with the ‘known unknowns’. How do you deal with the realm of the ‘unknown unknowns’? One promising development has been the evolution of graph technologies, which make it possible to uncover hidden patterns and relationships between seemingly unrelated pieces of information. Increasingly, graph technologies are being used to build a deeper understanding of the context linking the entities in a transaction or a set of transactions. For example, they are being applied to detect fraud rings and identity theft in banking and insurance, and to dialogue generation for improving human-machine interaction.
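To make the fraud-ring idea concrete, here is a minimal sketch that links accounts through shared attributes and then finds connected groups. The account records, the attribute strings, and the find_rings function are all hypothetical, and connected components are just one basic graph technique; production systems would typically rely on a graph database and richer scoring.

```python
from collections import defaultdict


def find_rings(accounts, min_size=3):
    """Group accounts linked, directly or indirectly, by shared attributes."""
    # Build a two-sided graph: accounts connect to the attribute values they use.
    neighbours = defaultdict(set)
    for account, attributes in accounts.items():
        for attribute in attributes:
            neighbours[("account", account)].add(("attr", attribute))
            neighbours[("attr", attribute)].add(("account", account))

    seen, rings = set(), []
    for account in accounts:
        start = ("account", account)
        if start in seen:
            continue
        # Depth-first traversal collects one connected component.
        stack, members = [start], set()
        seen.add(start)
        while stack:
            node = stack.pop()
            if node[0] == "account":
                members.add(node[1])
            for nxt in neighbours[node]:
                if nxt not in seen:
                    seen.add(nxt)
                    stack.append(nxt)
        if len(members) >= min_size:
            rings.append(sorted(members))
    return rings


# Hypothetical accounts: A1 and A3 share nothing directly, only through A2.
accounts = {
    "A1": {"phone:555-0101", "addr:12 Elm St"},
    "A2": {"phone:555-0101", "device:d-77"},
    "A3": {"device:d-77", "addr:9 Oak Ave"},
    "A4": {"phone:555-0199", "addr:3 Pine Rd"},
}
print(find_rings(accounts))  # [['A1', 'A2', 'A3']]
```

No single record ties A1 to A3, yet the graph view places them in the same group. That indirect relationship is the kind of ‘unknown unknown’ that a row-by-row review of the same data would not reveal.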
In conclusion, it is critically important for both business leaders and big data practitioners to pause and reflect on the price of not knowing and whether they have the right data strategy – in terms of people, process and technology – to effectively deal with dark data to unlock greater value for the enterprise.
Naveen Kamat, Executive Director & CTO, Data & AI Services, Kyndryl India