I love creating data-derived products. I’ve been saying for years that one man’s metadata is another man’s data, and with cheap cloud storage and compute, sensors and devices everywhere, and the rise of data scientists, a golden age of data is upon us.
However, getting past the high fives and Big Data drivel requires nuanced understanding and focused execution, or the promise of a data-driven world won’t come close to the hype. In this installment, I discuss some in-the-weeds approaches to creating data-derived products. For me, this means using metadata/logs, productized for a non-adjacent market, and selling something of value. This requires technical, business and creative skills.
The data of which I speak could be any collection of log or transactional data. Thinking expansively about what could be logged is essential. In a future post I’ll discuss how different collections of transactional data might be of interest to specific markets, but until then we’ll focus on constructing a data product.
As the data originator, you are in a unique position to perform value-enhancing activities. What you offer and restrict determines how you capture multiple streams of value while minimizing customer/partner/channel conflict.
The key lies in decomposing data, then re-aggregating and packaging it into a product. This approach provides a great means of price discrimination and offers a multitude of levers in pricing and product definition. At a high level, the process looks like the graphic below, but please don’t stop here!
Using a hypothetical set of log data from an app publisher concerned with location data, imagine the following record layout:
This table contains implicit data that can be further decomposed to offer additional granularity. Specifically, it can be broken into additional fields that provide more detailed information about location, platform/OS version and maybe something useful about IP ranges. Additional external data could be used to provide demographics, behavioral characteristics (based on location or device type) and possibly more insight about usage behavior based on IP. This approach is roughly analogous to code refactoring.
A modified record layout looks something like the one below (1).
Using a variety of statistical techniques one could decompose the above fields further, but the results would become increasingly probabilistic/inferential, so the above is a reasonably strong representation of decomposed source data. This raw data (meaning technically valid and consistent records (2)) can now be aggregated in any number of ways to capture value related to specific markets and customers.
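As an illustration, the decomposition step might look like the following sketch. The field names, record values and parsing logic are all hypothetical, since the actual record layout isn’t reproduced here; the point is simply that implicit fields (timestamp, user agent, IP) yield explicit, finer-grained ones.

```python
from datetime import datetime

# Hypothetical raw record from an app publisher's location log.
raw = {
    "timestamp": "2015-06-01T14:32:07Z",
    "lat": 40.7411,
    "lng": -73.9897,
    "user_agent": "MyApp/2.1 (iPhone; iOS 8.3)",
    "ip": "203.0.113.42",
}

def decompose(record):
    """Break implicit fields into explicit, finer-grained ones."""
    ts = datetime.strptime(record["timestamp"], "%Y-%m-%dT%H:%M:%SZ")
    platform = "iOS" if "iOS" in record["user_agent"] else "other"
    return {
        **record,
        "date": ts.date().isoformat(),      # timestamp -> calendar date
        "hour": ts.hour,                    # timestamp -> hour of day
        "day_of_week": ts.strftime("%A"),   # timestamp -> weekday name
        "platform": platform,               # user agent -> platform/OS
        "ip_prefix": ".".join(record["ip"].split(".")[:3]),  # coarse network block
    }
```

Each derived field is a new lever: it can be delivered, withheld or aggregated independently of the raw record.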
Data aggregation happens along multiple dimensions, and understanding the numerous mechanisms to version/control data provides additional opportunities to productize. From my time running Urban Mapping, we often used elements of aggregation to charge more (or less) yet still maintain pricing control across markets and stages of the company, reflecting willingness to pay. Below is a summary of levers that can be used to formulate data-derived products. Depending on the characteristics of your data, some may not make sense.
- Updates – Also referred to as data aging. How often are customers provided updates? Some market segments will want real-time data while others (less sophisticated, mature industries) might be satisfied with weekly or monthly reporting data.
- Resolution – This is easiest to understand for geographic data. Point data (i.e., lng/lat of a place or person) could be aggregated to administrative or other boundaries (postcode, county, grid, custom region, country etc…). A global macro hedge fund would likely not be interested in GDP statistics for all 300+ US regions, but a retail brand with 2000+ locations across the US would want to understand regional differences in GDP and not be satisfied with a national statistic.
- Precision – Truncate instrument observations or summary data from float to n decimals. For example, temperature observations may be logged to three decimals at the source, but precision can be held back to one decimal, an integer, rounding to the nearest 5, etc… A marketing customer need not receive the same level of precision as a scientific one.
- Bin/Group – Also referred to as quantization. The idea here is to replace underlying records/observations with representative values. An example of this is representing the distribution of records (like a histogram) in a table, such as providing the number of events every hour (as opposed to per occurrence). At a high level, this could look similar to summary statistics in the form of a static report or Tableau-style workbook (while restricting access to underlying data).
- Index/Mask – Maintain observation-level records, but ‘disguise’ them in some way that provides a relative measure of the underlying data. Indexing things like temperature, education level, income/revenue, distance and age is immediately actionable for a data consumer. Other ways of doing this use exogenous data. If records contain (say) customer count, it can be represented as a percentage of the total/addressable population, density (count per unit area of land) or other broad-based measures.
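To make the levers concrete, here is a minimal sketch of three of them (precision, bin/group, index). The functions and data are invented for illustration; a production pipeline would apply the same ideas at scale.

```python
from datetime import datetime

def reduce_precision(value, decimals=1):
    """Precision lever: truncate a float to n decimals before delivery."""
    factor = 10 ** decimals
    return int(value * factor) / factor

def bin_hourly(timestamps):
    """Bin/Group lever: replace per-occurrence records with hourly counts."""
    counts = {}
    for hour in (t.hour for t in timestamps):
        counts[hour] = counts.get(hour, 0) + 1
    return counts

def index_values(values):
    """Index/Mask lever: express observations relative to their mean (= 100),
    hiding absolute magnitudes while preserving relative structure."""
    mean = sum(values) / len(values)
    return [round(100 * v / mean, 1) for v in values]
```

The same source records can pass through any combination of these functions, which is precisely what makes them pricing levers: each customer profile gets a different composition, not different data.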
Data packaging is associated with customer requirements and feeds into product management. A data manifest can define customer profiles, making production easy to automate. Packaging embodies UX/interaction for the customer. For example, is the product delivered as tabular data, Tableau-style workbook, PDF or interactive dashboard? Some of the levers described in aggregating (above) can actually be performed at this stage instead, as rights management/user access is a function of dashboards.
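A data manifest of the kind described could be as simple as the following sketch. The customer names and profile keys are invented; the idea is that a single declarative structure drives the automated packaging step.

```python
# Hypothetical manifest: per-customer packaging profiles read by the pipeline.
MANIFEST = {
    "hedge_fund": {
        "updates": "real-time",
        "resolution": "national",
        "precision": 3,
        "delivery": "tabular",
    },
    "retail_brand": {
        "updates": "monthly",
        "resolution": "postcode",
        "precision": 1,
        "delivery": "dashboard",
    },
}

def package_for(customer):
    """Look up a customer's profile to drive the packaging step."""
    profile = MANIFEST[customer]
    return f"{profile['delivery']} / {profile['resolution']} / {profile['updates']}"
```

Adding a customer segment then becomes a manifest entry rather than bespoke engineering, which is what keeps the consultative sales process from bloating the product line.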
But how does one determine which package best fits a customer? Enter the art of product management and a consultative sales process, a critical function in any early/first-generation product. For instance, if the customer insights group of a retail brand lacks expertise in spatial analysis, it would struggle to work with raw data. If the brand were provided a more directly actionable deliverable, such as a summary (PDF) report, interactive dashboard (a la Tableau, etc…), summary files or a presentation, insight and value to the customer might be greater than if it received tabular data.
Thoughts, comments, accusations? I’d love to hear your perspective in public comments or by contacting me directly.
(1) Timestamp could have also been decomposed into hours, days, weeks etc…
(2) This vastly oversimplifies the role of ETL and presumes sufficient data quality. Research indicates that somewhere between 70 and 90% of data analysis time is spent on various elements of data gymnastics, yet data ingestion rarely gets the attention it deserves.