3 Steps to Get Clean, Structured Data You Trust
CloudFactory
It’s data scientists’ least favorite part of the job, yet it consumes most of their time. Data wrangling - or gathering and preparing dirty data so it can be used in critical business applications - is the biggest problem in data science today, according to a Kaggle survey. So how can you get clean, structured data you can trust?
First, a recap. As we discussed in the first article of this series, dirty data is data that is invalid or unusable. In the second article, we learned that fewer than half (44%) of people trust their organization’s data to make important business decisions. In fact, more than half of executives (52%) said they rely on educated guesses or gut feelings even when they use their data to make decisions. Guessing is risky and can be costly.
And then there’s dark data, which is hidden and could be valuable to your business. When you don’t evaluate that data for strategic use, there’s an opportunity cost. That’s what United Airlines faced when an estimate showed that hidden dark data within its antiquated system was costing the airline $1 billion, largely from bad assumptions about how much each traveler might pay for a seat.
Unstructured data, which must be modified in some way to make it compatible with the system that will consume it, also can hold tremendous value for your business. If it were cleaned or otherwise enriched, it could be used to build new products, to solve painful problems, and to disrupt entire industries.
Leveraging data for your business strategy is no simple task. It takes discovery, planning, and relentless optimization as you execute. It also takes clean data. Data cleansing - also called data wrangling, data munging, or data scrubbing - is a necessary and significant part of data science and a growing priority for businesses around the globe.
Start by establishing consensus internally on how data will be gathered, managed, and used by and for the business. Expect that process alone to take several months. Work in parallel to validate your system of record and document decisions made. Anticipate any regulations that could present challenges for your business’s use of data, such as GDPR, and engage legal experts to develop the standards your business will use to comply. Document your data governance and educate employees, as appropriate, on it and on the importance of clean data to the business.
We learned in article two that when designing an IoT system, four key factors are required to produce high-quality data over time. According to James Branigan, IoT software platform developer and founder of Bright Wolf, those factors are:
Integration will be critical in this process. As you add functionality to your system of record, consider carefully how each platform integrates with your tech stack. Think about how each one stores and manages data. If you can, clean data as you import it into new platforms.
As we learned in article two of this series, you can address some quality issues as you join data. A developer can use scripts and coding tools to merge data consistently and accurately for two or more relatively small data sources. You still may find you need to remove duplicates, adjust case and date/time formats, and regionalize spelling (e.g., British English vs. American English).
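To make that concrete, here is a minimal sketch in Python of cleaning records as two small sources are merged. It is an illustration only, not a prescribed method: the field names (`email`, `note`, `signup`), the spelling map, and the list of accepted date formats are all assumptions made for the example, and a real project would need rules that fit its own data.

```python
from datetime import datetime

# Hypothetical British-to-American spelling map; extend it for your own data.
SPELLING_MAP = {"colour": "color", "organisation": "organization"}

# Date formats we assume the sources use. Ambiguous formats (day-first vs.
# month-first) must be settled per source; this sketch assumes day-first.
DATE_FORMATS = ("%Y-%m-%d", "%d/%m/%Y")


def normalize_date(value):
    """Return the date as YYYY-MM-DD, or None if no known format matches."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(value.strip(), fmt).date().isoformat()
        except ValueError:
            continue
    return None


def normalize_text(value):
    """Collapse whitespace, lowercase, and apply the spelling map word by word."""
    words = value.strip().lower().split()
    return " ".join(SPELLING_MAP.get(word, word) for word in words)


def merge_sources(*sources):
    """Merge lists of records, cleaning each one and keeping the first per email."""
    seen = set()
    merged = []
    for source in sources:
        for record in source:
            clean = {
                "email": record["email"].strip().lower(),
                "note": normalize_text(record["note"]),
                "signup": normalize_date(record["signup"]),
            }
            if clean["email"] not in seen:  # the email is the duplicate key
                seen.add(clean["email"])
                merged.append(clean)
    return merged


# Two small, made-up sources that describe the same customer differently.
crm = [
    {"email": "Ana@Example.com ", "note": "Prefers the Colour blue", "signup": "2018-03-05"},
]
billing = [
    {"email": "ana@example.com", "note": "prefers the color blue", "signup": "05/03/2018"},
    {"email": "bo@example.com", "note": "New Organisation", "signup": "17/11/2017"},
]

merged = merge_sources(crm, billing)
```

In this sketch the first record seen for an email address wins. Whether to keep the first, the newest, or a blend of duplicate records is a governance decision worth documenting before you write the script.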
Be sure to establish how much you plan to iterate your process or the way you manage data over time, as it may dictate what is available to you down the road. For example, if you want more control over how data is consumed or reported, you may want to consider an open source solution that gives you the power to create or adjust particular features, such as accuracy thresholds for data work.
Deloitte predicted a burgeoning need for the augmented workforce in its 2017 Global Human Capital Trends Report. According to the report, as connectivity and cognitive technology accelerate and change the nature of work, “organizations must reconsider how they design jobs, organize work, and plan for future growth.”
It’s reasonable to expect the structure of your organization to look quite different in 2020 than it does today. It could include teams of cloud workers who manage data. These teams will specialize in functional areas where your core team doesn’t - and where there would be no strategic benefit to their trying to learn them.
Many companies outsource data gathering, cleaning, and enrichment, often for one of two reasons: 1) so data science and engineering teams can redirect their focus to strategy, iteration, and quality; and 2) so operations teams can achieve greater efficiency, quality, or cost with outsourced teams.
When it comes to cleaning data, think carefully about how best to source and deploy every facet of your workforce. Crowdsourcing is a good option for short-term projects but can be inherently inefficient over the long term, as it requires multiple people to do the same task so that double or triple consensus can determine which answer is most accurate. Managed cloud labor can be helpful when your process is clear, you need agility to iterate quickly, and quality is crucial.
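For a sense of why consensus adds overhead, here is a small Python sketch of how double or triple consensus might be computed. It is a simplified assumption about how such a check could work, not a description of any particular crowdsourcing platform; the function name `consensus` and the `min_agreement` parameter are hypothetical.

```python
from collections import Counter


def consensus(labels, min_agreement=2):
    """Return the winning label if enough workers agree on it, else None.

    `labels` holds one answer per worker for the same task. A result of
    None means the task needs another worker or a manual review.
    """
    if not labels:
        return None
    ranked = Counter(labels).most_common(2)
    top_label, top_votes = ranked[0]
    # A tie between the two leading answers is not consensus.
    if len(ranked) > 1 and ranked[1][1] == top_votes:
        return None
    return top_label if top_votes >= min_agreement else None


# Three workers label the same image; two of them agree.
double = consensus(["cat", "cat", "dog"], min_agreement=2)
# The same answers fall short when all three workers must agree.
triple = consensus(["cat", "cat", "dog"], min_agreement=3)
```

Every task here is paid for two or three times to produce one trusted answer, which is where the long-term inefficiency comes from.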
Companies that have developed a standard of governance for data collection, storage, and enrichment are ahead of the game. Those that have shored up their tech-and-human stack to maintain data quality over time are doing even better. And if you’re planning for your augmented workforce, you’re well on your way. Consider it your goal for the year - as it may take you that long - to establish or optimize your process in even one of these areas, and you’ll be ahead of most of your peers.
No matter how you clean your data, you’re not alone. Others like you, along with countless data scientists, are pondering these same issues and finding new ways to clean decades-old data while they apply the latest technology to streamline data collection, cleaning, and enrichment processes. Take heart; technology is bound to make the process easier. And the lessons you learn along the way will inform your strategy and many of your processes. So get started - the data won’t clean itself! At least, not yet.
Written by Nanette George, Senior Marketing Manager at CloudFactory.