We waste a lot of time waiting for spectacular new material, yet we rarely sit down and take a very close look at the material we already have.
Old data can often be used in ways that were not anticipated by the original data collector.
There is more old data than new data, and the proportion of old data will only keep growing, because all new data becomes old data over time.
Like all types of data, old data has great value, but we need to be smart if we hope to uncover what it has to tell us.
Data repurposing
Data repurposing involves taking pre-existing data and performing any of the following:
- Using pre-existing data to answer questions that were not asked by the original data designer and collector;
- Combining pre-existing data with additional data of the same type to produce an aggregated data set that can answer questions that could not be answered with a single source (see the sketch after this list);
- Re-analyzing the data to validate predictions, theories or conclusions drawn from the original studies;
- Re-analyzing the data using alternative or improved methods to obtain greater accuracy and reliability than the results originally produced;
- Integrating heterogeneous data sets to answer questions or develop concepts that span a wide range of scientific fields (see [Heterogeneous Data]);
- Finding subsets in populations that were once thought to be homogeneous;
- Searching for new relationships between data objects;
- Creating new data sets on-the-fly (on demand, in real time) by linking data;
- Creating new concepts or new ways of thinking about old concepts based on re-examining the data;
- Fine-tuning existing data models;
- Remodeling systems from scratch.
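To make two of these items concrete (aggregating pre-existing data of the same type, and finding subsets in a population once thought to be homogeneous), here is a minimal sketch in Python using pandas and scikit-learn. The file names and column names (site_a.csv, site_b.csv, measurement, age) are hypothetical placeholders, not references to any particular data set.

```python
# Minimal sketch of two repurposing operations, under assumed placeholder
# names: two legacy CSV files sharing a schema with two numeric columns.
import pandas as pd
from sklearn.cluster import KMeans

# 1. Aggregate pre-existing data of the same type from two sources.
site_a = pd.read_csv("site_a.csv")   # hypothetical legacy file
site_b = pd.read_csv("site_b.csv")   # hypothetical legacy file
combined = pd.concat([site_a, site_b], ignore_index=True)

# 2. Look for subsets in a population assumed to be homogeneous by
#    clustering two numeric columns (placeholder names).
features = combined[["measurement", "age"]].dropna()
subgroups = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(features)

# Summarise each candidate subgroup; clearly distinct profiles suggest
# the original population was not homogeneous after all.
print(features.assign(subgroup=subgroups).groupby("subgroup").mean())
```

Distinct subgroup profiles in the summary would point to meaningful subsets in a supposedly uniform population, which is exactly the kind of question the original data collectors never asked.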
Despite the prevalence of old data, most data scientists focus their efforts on newly acquired data or on data that does not yet exist and may emerge in the unknown future.
There are several reasons why old data is not a popular target:
- Most old data is proprietary and cannot be accessed by anyone other than its owners;
- In many cases, the owners of proprietary data do not know its contents, or even that the data exists, and cannot appreciate the value it holds for themselves and for others;
- Old data is often stored in formats so dated that few scientists are willing or able to venture into them;
- Much old data lacks adequate annotation: there is simply not enough information about the data (how it was collected, what the data means, what the purpose of the collection was, …) to support useful analysis;
- Much old data has not been indexed to today’s standards, so there is no practical method of searching its content;
- Much old data can be considered poor data, because it was not collected under quality assurance standards that would support useful analysis of its content.
The chaotic state of old data is conveyed by the jargon that permeates the field of data repurposing (and any type of data analysis), such as data mining and data scraping; anything that needs to be scraped or mined cannot be very clean.