[Note: this post is the latest in my series of “startups that weren’t.” You can read more about other ideas I’ve (for now!) put in the idea graveyard.]
Not too long ago, quant research strategies employed by hedge funds were one of the few places time series data was analyzed at scale. But with the flood of telemetry from probes, remote sensing and transactional/log data of all stripes, novel analytic/big data/machine learning techniques for predictive purposes have exploded. To explain this opportunity I use movement/location data, but other flavors of transactional data apply as well. It’s an easier one for me to explain with 10+ years of experience in geo.
At the risk of being overly pedantic, models serve to approximate reality. Data can help better inform reality once a model is developed, but without additional input, the model isn’t likely to magically improve. In the context of spatially-referenced things (I dislike the term GIS as it carries a lot of baggage), what one observes may not be the totality of what exists. When applying geotech to a specific application, vertical, process, etc., this question may be of paramount importance, or not.
To understand any market with precision, ground truth must be employed to calibrate models. The reason for this is straightforward: unless your panel (in this case likely location events from an SDK) sees the complete universe of offline activity, bias is present. This bias can be demographic, technical or behavioral, and it can range from a non-issue to a product-killer.
Why does one care about bias? If you care to make claims about a population, you’ll be sampling. How does your panel stack up to the overall US population in terms of demographics, mobility patterns and technology? If the panel is skewed with 90% iPhone users, be aware. If the panel is predominantly suburban, be aware. If location events are logged arbitrarily between 7-12 hours, be aware. If the panel is 500,000 monthly active users, be aware.
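The skewed-panel problem above can be made concrete with post-stratification weighting, a standard way to reweight a biased panel toward a reference population. This is a minimal sketch; the panel and population shares below are hypothetical, not figures from the post.

```python
# Sketch: post-stratification weights to correct panel bias.
# Shares below are hypothetical illustrations, not real measurements.

panel_share = {"ios": 0.90, "android": 0.10}        # skewed panel
population_share = {"ios": 0.55, "android": 0.45}   # reference population

# Weight each stratum by (population share / panel share) so weighted
# panel counts match the population mix.
weights = {k: population_share[k] / panel_share[k] for k in panel_share}

# An Android user's events now count ~4.5x as much as an iPhone user's
# when estimating population-level behavior.
print(weights)
```

Demographic and behavioral skews can be handled the same way, provided you have a trustworthy reference distribution to weight against, which is exactly where ground truth comes in.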
The good news is that these details matter far less outside of alpha-exploiting quant funds. In retail and CPG, indices, period-on-period change, share of wallet and market share are about as sophisticated as one needs. Easy-to-understand examples of this are reports from xAd, PlaceIQ, Foursquare and others. But when absolute precision is called for, it isn’t possible to hide under binned values. Enter the world of ground truth.
What is Ground Truth, and Why Do I Care?
In essence, ground truth is an indisputable record of fact: the number of airplanes that take off from SFO in a given day, the soybean yield in Ellis County, Kansas for a given harvest, offramp traffic on Exit 261A in Florida over a holiday weekend, visitors to a Lidl supermarket the Saturday prior to Labor Day, 500mb pressure maps for next week, the US Bureau of Labor Statistics Non-farm Payroll report for August or any other number of discrete measurements. Because models can only be as good as the data that goes into them, historical data is invaluable, and more is better. It is also harder to come by than you may think. As an aside, don’t forget the difference between forecasting/predicting and measuring/reporting.
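To show what “calibrating a model against ground truth” means in the simplest case, here is a sketch that fits a least-squares line mapping panel-observed counts to true counts. All the numbers are made up for illustration; a real calibration would need far more history and a proper error model.

```python
# Sketch: calibrate panel-observed counts against ground truth with a
# simple least-squares line (truth ≈ a * panel_count + b).
# All numbers are invented for illustration.

panel = [120, 340, 210, 500, 80]        # devices seen by the SDK panel
truth = [1450, 4100, 2500, 6050, 1000]  # e.g. turnstile/counter totals

n = len(panel)
mean_p = sum(panel) / n
mean_t = sum(truth) / n

# Closed-form simple linear regression.
cov = sum((p - mean_p) * (t - mean_t) for p, t in zip(panel, truth))
var = sum((p - mean_p) ** 2 for p in panel)
a = cov / var
b = mean_t - a * mean_p

def estimate_visits(panel_count: float) -> float:
    """Scale a raw panel observation up to a population estimate."""
    return a * panel_count + b
```

The point of aggregating long histories of ground truth is precisely to fit and backtest mappings like this one; with only a handful of paired observations, the fit is meaningless.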
Historically, ground truth has come from independent sources like government and NGOs. Increasingly it can come from private actors, but there are relatively few places where this data can be captured. Sporting venues, public parks and travel nodes are obvious candidates. However, because of critical reliance on historical data for back-testing and model training in machine learning, a ground truth source identified today may not be usable for 18 months or more (this depends on the number of observations, industry focus, required precision and other factors). So the sooner one can aggregate sources of ground truth (and surreptitiously convince others to log new ones), the sooner this data will be of use in model development. With fewer words, it looks like this:
Observed v Ground Truth Data
In short, the business idea was to aggregate and track granular sources of ground truth to allow development of more robust/accurate models that can be used for reporting and forecasting. I looked at the opportunity from a few perspectives and they are outlined below. Later in this post I evaluate the good, bad and ugly, and my conclusions are (naturally) at the end.
[Ed: When xAd rebranded as GroundTruth I couldn’t help wince as they are most definitely not offering services that provide ground truth, per the above.]
Something about this opportunity spoke to me, likely a result of a decade in geo where without data, you are nowhere. At Urban Mapping we’d often create our own data products by stringing together public record requests, performing lots of normalization/ETL to get data to ‘behave’ and creating novel data products. I enjoy going into a wormhole to find and create new data products (like a database of all freestanding USPS blue post boxes, complete with attributes!).
This product would entail identifying many ground truth sources, constantly filing/appealing state and federal public record requests, ingesting historical data and capturing updates. There is also an emerging category of private actors who traffic in what I refer to as ‘synthetic ground truth’ and are a potential gold mine of historical data. I had a reasonably good handle on what data would be valuable for several markets, so sources dealing with location was a good place to start.
Enterprise data consumers are far from uniform: some insist on raw transaction logs, others want something more refined/normalized, and some want polished reports/summaries viewed in Tableau or something similar. Knowing this, it would be critical to offer versioning capabilities. Ground truth data would be available by attributes such as observation period, geography, feature types and other metadata below:
Ground Truth Attributes
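As a sketch of how those attributes might hang together, here is one possible data model for a versioned slice of a ground-truth source. The class and field names are my own illustrative guesses, not a spec from the post.

```python
from dataclasses import dataclass
from datetime import date

@dataclass(frozen=True)
class GroundTruthDataset:
    """One versioned slice of a ground-truth source (field names illustrative)."""
    source: str        # e.g. "FL DOT traffic counters"
    geography: str     # e.g. "Exit 261A, I-95"
    feature_type: str  # e.g. "vehicle_count"
    period_start: date
    period_end: date
    version: int = 1   # bumped when the source restates its history

    def key(self) -> tuple:
        """Stable identifier so a customer can pin an exact version."""
        return (self.source, self.geography, self.feature_type,
                self.period_start, self.period_end, self.version)

ds = GroundTruthDataset("FL DOT traffic counters", "Exit 261A, I-95",
                        "vehicle_count", date(2017, 9, 1), date(2017, 9, 4))
```

Versioning matters because government sources routinely restate historical figures; a quant customer backtesting a strategy needs to know exactly which revision of the history their model saw.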
The market I was most familiar with was the buy side in financial services: precision is of utmost importance in hedge funds, and as one moves through decreasing levels of asset volatility by manager (asset management, equity analysis, mutual funds, pension funds, etc.), this degree of required precision drops. The reason is pretty simple: a mutual fund will typically hold a position far longer than a hedge fund, so volatility is absorbed by the broader market. The intra-day/minute/second opportunities are not what low-volatility investors are looking for. With capital on the line and a specific trading strategy to deploy it, quant-oriented funds embrace these fluctuations where signal can be exploited.
In broad brush strokes I identify precision and accuracy as represented below. The important thing to note, which few outside finance seem to understand, is that the buy side is incredibly diverse in terms of sophistication. “Hedge fund” is getting closer to a meaningless term as it tends to intoxicate the mind with ideas of high finance. The number of actors capable of ingesting raw data is very small. And funds can be difficult to work with for a variety of reasons, but if one is able to cross the chasm and emerge as table stakes, like 1010data, you are fortunate.
Precision and Confidence by Financial Services Market
Because of limited (and varying) sophistication, the broader market may not have a statistical need to incorporate ground truth. As the customer base widens, the requirements become more flexible. Hence the looming question: how large is this market?
While the business opportunity presents many of the things I love (exploring new markets, creating data arbitrage opportunities and creating second-order data products), it is fundamentally a data aggregation/normalization play. I am skeptical of the barriers to entry, yet privileged relationships with sources of truth (quantity and/or quality) and hyper-efficient ETL can be defensible. To generate revenue, sophisticated customers would have to find that aggregated ground truth sufficiently increases confidence in their models. Backtesting and sensitivity analysis could fill a PhD dissertation on their own, so the potential R&D effort to get to a validation point could be significant.
As trust is developed with key customers, the plan would be to co-discover what complementary data/services would be of value–ground truth that supports alpha hunting in different industries and geographies are obvious places to start.
Sales and Distribution
I’m enamored with the adoption of IPython/Jupyter in the data science community, and selling an interface directly to the people who know what they want sounds compelling. I’ll call this the Twilio model; it can be great until you have higher-ticket items that can’t be provisioned by a dev. However, technical users are generally not buyers in financial services. Those who tend to be less technical control budgets, so channel awareness becomes important, with some kind of limited distribution arrangements through large financial services information providers as a sort of freemium model.
What To Do?
For the time being anyway, this isn’t something I am pursuing. I’m at least partially guilty of over-analyzing before I get into market, or perhaps it just wasn’t exciting enough to me. The clear unbridled rush to get anything alt-data related into a pipeline is a gravy train, but that doesn’t mean it is a good one to jump on. Asking customers to take a chance on something which could increase model confidence (maybe) 10% in the abstract might be worth it, but when stacked against other opportunities, it might not be.
This is an area of data science that I find especially interesting: the more we can do to tame spurious correlations, the better. But it involves a great deal of experimentation and trial and error, something industry is less excited to support.
I’d love to hear any thoughts, feedback or criticisms you have!