In my 10+ years growing Urban Mapping, the web-mapping business I started in 2005, I’ve seen history repeat itself. A lot. As a self-proclaimed member of the georatti, I can say this current wave of geotech euphoria is nothing new: product and corporate development tend to go through the same cycles over and over again, choosing to ignore the past. In a sense, this is the post I’ve waited a decade to write: a blend of my experience with industry news and inside scuttlebutt to paint a picture of what it all really means. This post focuses on data, and subsequent ones will look at market forces and industry changes.
Why Geo Is Different
The underlying technologies in the FOSS geostack (largely, but not exclusively, PostGIS, Mapnik, TileMill, Django and Leaflet) have moved mountains in the past few years and are the foundations of what gets built for the web. But unlike virtually anything else in enterprise software, geoprocessing requires data. Try geocoding without data. In other areas of business software (e.g., accounting, stats/business intelligence, HRIS), organization-specific data is required, and it tends to be a byproduct of the software. The one quasi-exception is CRM, where purchased data can be incorporated. In geo, you will derive zero value without data.
Performing cheaper operations on crappy data won’t win you awards, but it will certainly draw the ire of users. That is why data is, by far, the single most crucial piece in making it all work. And it costs a lot. $500m-$1b seems about right.
Making software can be fun. It can automate tasks. It can get you a promotion, a new job, accolades and much more. Data will do none of these things for you. Accurate, updated and clean data will prompt management to shrug, because this is what they expect. Unfortunately, that is not the data you will get. Management will quickly grow annoyed with the idiosyncratic complexities of normalization, standardization and other techniques that get data to “behave” in the way management expects. Blaming data providers can feel good, but ultimately you will have to work with them to up their game.
If you have global ambitions, there is no one source of data. You will deal with multiple private data providers and state-directed mapping providers. All maintain their own formats and quirks. Do all data providers provide similar metadata? Doubtful. What about accuracy? You will never know. The world is too large a place to manually fact check. Acknowledge that it will always be a game of catch-up requiring thousands to support the effort. I can’t stress enough how important and complex data is. At Urban Mapping, we evaluated data by accuracy, confidence (not the same as accuracy), precision, provenance, currency, completeness, granularity, geographical scope, legality and a host of other filters. There is no panacea. To make this practical, here are several illustrations of how data issues are endemic to the domain:
Name: it might be New York City to me, NYC to you, 3651000 to a statistician, and don’t get me started about ISO 3166-2. What about endonyms, and for which languages? Do you want to include only ‘official’ names? Official by whose standards: USGS, UN, The World Bank, a national mapping agency? Truth is relative.
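One common way to cope with name variants is an alias table that maps every spelling you have seen to a single canonical ID. The sketch below is only an illustration: the `ALIASES` table, the `normalize` and `resolve` helpers, and the choice of the statistical code quoted above as the canonical ID are all assumptions, not a description of any particular gazetteer.

```python
import unicodedata

# Hypothetical alias table: keys are normalized name variants, values are
# the canonical ID we (arbitrarily) chose for the place.
ALIASES = {
    "new york city": "3651000",
    "nyc": "3651000",
}


def normalize(name: str) -> str:
    """Lowercase, strip accents and punctuation, collapse whitespace."""
    decomposed = unicodedata.normalize("NFKD", name)
    no_accents = "".join(c for c in decomposed if not unicodedata.combining(c))
    cleaned = "".join(c if c.isalnum() or c.isspace() else " " for c in no_accents)
    return " ".join(cleaned.lower().split())


def resolve(name: str):
    """Return the canonical ID for a name variant, or None if unknown."""
    key = normalize(name)
    if key in ALIASES.values():
        # The caller already passed a canonical ID.
        return key
    return ALIASES.get(key)
```

An unknown variant returns `None` rather than a guess, which pushes the "whose name is official?" decision back to a human who can extend the table.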
Geographic boundaries: related to Name above, geographic borders are not absolute, and they change over time. Should Eastern Ukraine be part of Russia? Obviously Russia thinks so, but not Ukraine, NATO or the rest of the world. If you are offering a service in Russia, China or any other country with contested boundaries (over 40 countries), getting this wrong can be a deal breaker: a direct affront to national pride/sovereignty that could get your offering shut down in the host country. Which borders and languages should be presented to a French speaker viewing a map in China? This is not an abstract question.
Taking it one step further, boundaries evolve over time: for example, US Congressional Districts change every ten years, and if you wish to analyze data for multiple time periods, you must maintain boundaries that reflect these different time periods or risk incorrect analysis.
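A minimal sketch of what "maintaining boundaries for different time periods" can look like: each boundary version carries the date it took effect, and a lookup picks the version in force on a given day. The dates and labels in `VERSIONS` are made-up placeholders (a real record would hold geometry), and `boundary_as_of` is a hypothetical helper.

```python
from bisect import bisect_right
from datetime import date

# Hypothetical versions of one district, sorted by effective date.
# The labels stand in for real geometry; the dates are placeholders.
VERSIONS = [
    (date(2003, 1, 1), "boundary_v1"),
    (date(2013, 1, 1), "boundary_v2"),
]


def boundary_as_of(versions, when):
    """Return the boundary in force on `when`, or None if none existed yet.

    `versions` must be sorted by effective date, oldest first.
    """
    starts = [start for start, _ in versions]
    i = bisect_right(starts, when)  # number of versions that started on or before `when`
    if i == 0:
        return None
    return versions[i - 1][1]
```

Joining each observation to the boundary that was valid on the observation's own date, rather than to today's boundary, is what avoids the incorrect analysis described above.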
Scale: ever zoom in to a coastline in Google Maps and notice how the coastline doesn’t always match the political boundaries? This is the essence of scale. Because you will be aggregating data from multiple providers, your issues of scale will likely compound: postal code boundaries may not match up with national boundaries or satellite imagery, or jagged coastlines (intended only for low zoom levels) may put roads or towns in the ocean.
Provenance: where does your data come from? How often is it updated? Did you know the US government does not actually provide postal code boundaries? ZIP code-like boundaries offered by the US Census might be sufficient, but do you know how they differ from private sources? What about the fact that they aren’t actually boundaries at all, but collections of routes traversed by mail carriers?
Attributes: a solid gazetteer and historical data are table stakes. If you are trying to build, say, a database of health code violations, you will find that there is no single way to achieve an “apples to apples” comparison. Some cities rank scores using letters, others employ a numerical score, some include free text descriptions, some use enumerations. Some might claim exemptions under a freedom of information statute, others may only release data aggregated by ZIP. The greater the collection of attributes, the messier this becomes.
In the context of POI, the 15ish million businesses in the US alone are constantly opening, closing and everything else in between. But many data providers tend to offer an arbitrary update schedule, leaving you in a never-ending state of catch-up. When you source from multiple vendors, redundant listings will become commonplace, requiring you to develop a data conflation strategy. Refine has been a useful tool, but your Python scripts will be specific to your workflow. Unfortunately, you won’t know which attributes from which vendors you should use, resulting in significant headaches. Are you given coordinates or a human-readable address? These are not the inverse of each other. The ever-opinionated Mike Dobson has some thoughtful observations.
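To give a flavor of what a conflation script might start with, here is a rough sketch that flags two vendor listings as probable duplicates when they sit close together and have similar names. Everything in it is an assumption for illustration: the record layout (`name`, `lat`, `lon`), the helper names and the 50 m / 0.8 thresholds are not from any vendor or tool, and a real pipeline would need far more than this.

```python
import math
from difflib import SequenceMatcher


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in meters on a spherical Earth."""
    r = 6371000.0  # mean Earth radius in meters
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = p2 - p1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))


def name_similarity(a, b):
    """Similarity ratio in [0, 1] between two case-folded names."""
    return SequenceMatcher(None, a.lower().strip(), b.lower().strip()).ratio()


def likely_same_place(rec_a, rec_b, max_meters=50.0, min_similarity=0.8):
    """Heuristic duplicate test; both thresholds are arbitrary assumptions."""
    close = haversine_m(rec_a["lat"], rec_a["lon"], rec_b["lat"], rec_b["lon"]) <= max_meters
    similar = name_similarity(rec_a["name"], rec_b["name"]) >= min_similarity
    return close and similar


# Fictional listings from two hypothetical vendors.
vendor_a = {"name": "Joe's Pizza", "lat": 40.7306, "lon": -73.9866}
vendor_b = {"name": "Joes Pizza", "lat": 40.7307, "lon": -73.9866}
print(likely_same_place(vendor_a, vendor_b))
```

Note that this only decides whether two records describe the same place. It says nothing about which vendor's phone number, hours or category to keep once they are merged, which is the harder problem.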
I’ve been working on several follow-up posts that look at industry dynamics, what’s next with data/service providers, the current state of the Mapping Wars and maybe more. Let me know your thoughts!