Following my earlier blog “Data is the new Oil”, let me share a few thoughts on the other side, or rather ‘the darker side’, of predictive modelling and data science projects – the challenges organizations face when practicing data science. The challenges are so formidable that the famous line from The Dark Knight – “when an unstoppable force meets an immovable object” – holds just about right. Let’s dig in to see what they are.
Stakeholders are bewildered by the plethora of data coming in all formats and in enormous volumes. Predictive analytics can generate significant improvements in efficiency, decision making, and ROI. But predictive analytics isn’t always successful, and in all likelihood the majority of models built are never used operationally. The most common reasons are –
- Obstacles in Management
- Obstacles with Data
- Obstacles with Modelling
Obstacles in Management:
To deliver value, predictive models have to be deployed. Deployment in and of itself often requires a significant shift in resources for an organization, so the project needs support from management to make the transition from research and development to an operational solution. If program management is not a champion of the predictive modelling project, perfectly good models will go unused due to a lack of resources and a lack of the political will to obtain those resources. Companies fear this, and many (good) models end up shelved.
Obstacles with Data:
Predictive models require data in the form of a single table or flat file containing rows and columns – two-dimensional data. If the data is stored across transactional databases, keys need to be identified so the data sources can be joined into that single view or table. Projects can fail before they even begin if the keys needed to build the modelling table do not exist. Even if the data can be joined into a single table, the data becomes meaningless if the primary inputs or outputs are not populated sufficiently or consistently. For example, consider a customer acquisition model. Predictive models need examples of customers who were contacted and did not respond as well as those who were contacted and did respond. If active customers are stored in one table and marketing contacts (leads) in a separate table, several problems thwart modelling efforts. First, unless the customer table records the campaign each customer was acquired from, it may be impossible to reconstruct the list of leads in a campaign along with a label indicating whether each lead responded to the contact.
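To make the join concrete, here is a minimal sketch in pandas. The file names and columns (customers.csv, leads.csv, customer_id, campaign_id, acquisition_campaign_id) are hypothetical; a real project would use whatever keys the source systems actually share.

```python
import pandas as pd

# Hypothetical source tables: one row per active customer, one row per marketing contact.
customers = pd.read_csv("customers.csv")  # customer_id, acquisition_campaign_id, ...
leads = pd.read_csv("leads.csv")          # lead_id, campaign_id, customer_id (filled only if the lead converted)

# The join is only possible because both tables share the customer_id key.
flat = leads.merge(
    customers[["customer_id", "acquisition_campaign_id"]],
    on="customer_id",
    how="left",
)

# Label each lead: 1 if the contact was acquired through the same campaign, else 0.
# Leads that never converted have no matching customer row, so the comparison is False.
flat["responded"] = (flat["campaign_id"] == flat["acquisition_campaign_id"]).astype(int)
```

Without the campaign reference on the customer table, that last step is impossible and the target label cannot be reconstructed at all.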
Overcoming these data obstacles is where 60–70% of the time and effort in a modelling project goes. The data that feeds a model needs careful scrutiny: even if it has been scrubbed and cleaned in a database or data warehouse, it may never have been examined from a modelling standpoint. A few of the issues typically encountered are listed below:
- Incorrect data values
- Inconsistency in data formats
- Outliers
- Missing Values
- Derived attributes
- Skewness
Every one of these can do major damage to an otherwise good model, so they are treated with zero tolerance.
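As a quick illustration – a minimal sketch with pandas, assuming a hypothetical modelling_table.csv and an invented state_code column – each of these issues can be screened for in a few lines before any model is trained:

```python
import pandas as pd

df = pd.read_csv("modelling_table.csv")  # hypothetical flat modelling table

# Missing values: share of nulls per column.
print(df.isna().mean().sort_values(ascending=False))

# Incorrect values / inconsistent formats: inspect the distinct values of a categorical field.
print(df["state_code"].value_counts(dropna=False).head(20))

# Outliers: flag numeric values more than three standard deviations from the mean.
numeric = df.select_dtypes("number")
z_scores = (numeric - numeric.mean()) / numeric.std()
print((z_scores.abs() > 3).sum())

# Skewness: heavily skewed inputs often warrant a log or rank transform (a derived attribute).
print(numeric.skew().sort_values(ascending=False))
```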
Obstacles with Modelling:
Perhaps the biggest obstacle to building predictive models from the analyst’s perspective is OVERFITTING, meaning the model is too complex and has essentially memorized the training data. The effect of overfitting is twofold: the model performs poorly on new data, and the interpretation of the model is unreliable. If care is not taken in the experimental design, the extent of overfitting is not known until the model has already been deployed and begins to fail.
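A common guard is to hold out data the model never sees during training and compare performance on it with performance on the training set. The sketch below uses scikit-learn with synthetic data as a stand-in for a real modelling table; the specific algorithm and metric are my assumptions for illustration, not a prescription.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import cross_val_score, train_test_split

# Synthetic stand-in for a real modelling table.
X, y = make_classification(n_samples=2000, n_features=20, random_state=42)

# Keep a holdout set the model never sees during training.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

model = RandomForestClassifier(n_estimators=200, random_state=42)
model.fit(X_train, y_train)

train_auc = roc_auc_score(y_train, model.predict_proba(X_train)[:, 1])
test_auc = roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])
print(f"train AUC: {train_auc:.3f}  holdout AUC: {test_auc:.3f}")  # a large gap signals overfitting

# Cross-validation gives a more stable estimate of out-of-sample performance.
cv_auc = cross_val_score(model, X_train, y_train, cv=5, scoring="roc_auc")
print(f"5-fold CV AUC: {cv_auc.mean():.3f} +/- {cv_auc.std():.3f}")
```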
A second obstacle occurs when analysts become too ambitious about the kind of model that can be built with the available data in the time frame allotted. If they try to hit a home run and cannot complete the model in time, no model will be deployed at all. Often a better strategy is to build a simpler model first to ensure a model of some value will be ready for deployment; the model can be augmented and improved later if time permits.
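As a rough illustration of the “simple model first” strategy – again a sketch with scikit-learn on the same kind of synthetic stand-in data, not a definitive recipe – a regularized logistic regression makes a serviceable baseline that can be shipped quickly and refined later:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for a real modelling table.
X, y = make_classification(n_samples=2000, n_features=20, random_state=42)

# A simple, fast baseline: scale the inputs and fit a logistic regression.
baseline = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
baseline_auc = cross_val_score(baseline, X, y, cv=5, scoring="roc_auc")
print(f"baseline AUC: {baseline_auc.mean():.3f} +/- {baseline_auc.std():.3f}")

# If the baseline already meets the business need, deploy it; invest in a fancier model only if time permits.
```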
In my upcoming blog posts, I will elaborate on some of the methodologies that are widely adopted to treat these issues and mitigate the effect they can have on a model.