indeed30 41 minutes ago

Ten years ago, I worked for a company that had billions of sensor readings from mobile phones. The idea was to use crowdsourced data to create truly detailed, real-world coverage maps, and then sell that data to marketing and network operations teams at telcos.

We used reverse geocoding extensively — but never down to street addresses, always to a higher level. We wanted to split measurements by country, region, city — any geographic unit. When you deal with country borders, you get a lot of weird measurements as phones roam onto foreign networks. We weren’t interested in reporting on the experience of users roaming while abroad, so we needed shapefiles good enough to filter all that out and to partition the rest of the data cleanly.

We built a 30-machine Spark cluster on AWS back when Spark was still super early — around v0.7, definitely before 1.0. At the time, you pretty much had to use Scala with Spark if you cared about performance. Most of the workload was point-in-polygon tests. Before that, we were using a brutally hacky pipeline involving PostGIS, EMR, and Pig, and it was hell.
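The core of that workload can be sketched in a few lines (a minimal ray-casting point-in-polygon test in Python, purely for illustration; our production code was Scala on Spark and had to handle edge cases like vertices on the ray, poles, and the antimeridian):

```python
def point_in_polygon(lon, lat, polygon):
    """Ray-casting test: cast a horizontal ray from the point and count
    how many polygon edges it crosses. Odd number of crossings = inside.

    `polygon` is a list of (lon, lat) vertices.
    """
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Does this edge straddle the point's latitude?
        if (y1 > lat) != (y2 > lat):
            # Longitude at which the edge crosses that latitude
            x_cross = x1 + (lat - y1) * (x2 - x1) / (y2 - y1)
            if lon < x_cross:
                inside = not inside
    return inside

# Toy square "country" border
square = [(0, 0), (10, 0), (10, 10), (0, 10)]
print(point_in_polygon(5, 5, square))   # True
print(point_in_polygon(15, 5, square))  # False
```

On Spark you'd broadcast the (pre-simplified) polygons and run a test like this inside a filter over the measurement RDD/DataFrame.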

It was incredibly fun, but looking back now, I can see so clearly all the mistakes I made.

Dachande663 a day ago

Fun fact that was dredged up because the author mentions Australia: GPS points change. Their example coordinates give 6 decimal places, accurate to about 10-15cm. A few years back, Australia shifted all locations by 1.8m to account for continental drift (the continent is moving north at ~7cm/year). So even storing coordinates as a source of truth can be hazardous. We had to move several thousand points for a client when this happened.
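As a rough sanity check on that precision figure (using the standard ~111.32 km per degree of latitude; longitude degrees shrink toward the poles, so this is the worst case):

```python
# One degree of latitude spans roughly 111.32 km, so each decimal place
# of a coordinate is worth about a tenth of the previous one:
METERS_PER_DEGREE = 111_320
for places in range(4, 8):
    print(places, "decimals ->", METERS_PER_DEGREE * 10**-places, "m")
# 6 decimal places comes out to ~0.11 m, i.e. the 10-15 cm ballpark above
```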

  • taeric 6 minutes ago

    This is a large part of why surveying is done to landmarks.

  • jandrewrogers a day ago

    Even accounting for tectonic drift, there is a concept of positioning reproducibility that is separate from precision. In general the precision of the measurements is much higher than the reproducibility of the same measurements. That is, you may be able to measure a fixed point on the Earth using an instrument with 1cm precision at a specific point in time but if you measure that same point every hour for a year with the same instrument, the disagreement across measurements will often be >10cm (sometimes much greater), which is much larger than e.g. tectonic drift effects.

    For this reason, many people use the reproducibility rather than instrument precision as the noise floor. It doesn’t matter how precise an instrument you use if the “fixed point” you are measuring doesn’t sit still relative to any spatial reference system you care to use.
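    A toy simulation of that distinction (all numbers assumed, purely illustrative): a 1 cm-precision instrument taking hourly measurements for a year of a point that slowly wanders ~10 cm.

```python
import random
import statistics

random.seed(42)

INSTRUMENT_SIGMA = 0.01        # 1 cm instrument precision (assumed)
N = 24 * 365                   # hourly measurements for a year
STEP_SIGMA = 0.10 / N ** 0.5   # point wanders ~10 cm over the year (assumed)

true_pos = 0.0
measurements = []
for _ in range(N):
    true_pos += random.gauss(0, STEP_SIGMA)  # the "fixed" point drifts
    measurements.append(true_pos + random.gauss(0, INSTRUMENT_SIGMA))

spread = statistics.pstdev(measurements)
print(f"instrument precision: {INSTRUMENT_SIGMA * 100:.0f} cm")
print(f"spread of repeated measurements: {spread * 100:.1f} cm")
# The spread (the reproducibility) dwarfs the instrument precision.
```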

    • ta1243 2 hours ago

      A typical domestic GPS will give you worst-case accuracy of about 5m, but a good one will be sub-metre, and by taking enough measurements over time, especially with DGPS or RTK, you'll get to less than 10cm.

      After 20 years at 7cm per year, that's 1.4m. That's the same order of magnitude as the error of a domestic receiver.

    • raducu 8 hours ago

      > if the “fixed point” you are measuring doesn’t sit still relative to any spatial reference system you care to use.

      But do those points actually move, or does the air medium change the measurements?

      I ask because I once saw a very interesting documentary about how accurate mapping started in England: with fixed points, measuring the angles between those points to a high degree of precision.

      My mental model has always been that those points are all fixed, but now that you mention it, why should they be fixed?

      After all, my 7th-grade teacher clearly demonstrated the thermal deformation of copper rods, and all bridges have gaps that allow for thermal expansion. So wouldn't the same apply to soil at the scale of tens of kilometres?

      • jandrewrogers 16 minutes ago

        Fixed points actually move relative to each other. This is measurable even locally if you are doing high-precision localization e.g. with LIDAR. The geometry of relationships between objects is in constant motion but below the threshold of what a human can sense. There are many identifiable causes of this motion that vary with locality (tidal, thermal, hydrodynamic, tectonic, geophysical, et al). Additionally, there are local time dilation effects, both static and transient, that influence measurement but aren’t actually motion.

        This comes up concretely when doing long-baseline interferometry. Lasers are used to precisely measure the distance between receivers in adjacent structures for use in time-of-flight calculations. Over the course of a day, the distance between those structures as measured may vary by multiple centimeters, which is why they measure it.

      • throwup238 2 hours ago

        The air medium does add noise to the measurement depending on wavelength but it’s also small things adding up like the repeatability of the angle the satellite is at when it measures that same point. An arc-second of error at 400km is over a meter so even a fraction of an arc-second is enough to introduce a lot of noise between measurements.
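        The arc-second figure checks out with simple trigonometry (taking 400 km as a rough satellite altitude, as in the comment):

```python
import math

altitude_m = 400_000                 # rough LEO altitude from the comment
one_arcsec = math.radians(1 / 3600)  # one arc-second in radians
error_m = altitude_m * math.tan(one_arcsec)
print(f"{error_m:.2f} m")  # ~1.94 m, so "over a meter" indeed
```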

    • Robotbeat 21 hours ago

      The whole accuracy vs precision thing.

      • jandrewrogers 21 hours ago

        Related but slightly different. The accuracy is real but it is only valid at a point in time. Consequently, you can have both high precision and high accuracy that nonetheless give different measurements depending on when the measurements were made.

        In most scientific and engineering domains, a high-precision, high-accuracy measurement is assumed to be reproducible.

        • hnaccount_rng 9 hours ago

          I think this is a charitable interpretation of the remark which deprives GP of learning something (sorry if this comes across as condescending; I'm genuinely trying to point out an imo relevant difference).

          No, it's not accuracy vs precision at all. That statement is about a property of the measurement tool, which can have systematic offsets [0] (think of an analogue clock where the manufacturer glued the hands on with a slight shift) or can simply be imprecise (think of a clock that has a minute hand but no second hand).

          The thing pointed out by the original comment is a change in the _measured_ system, which is something fundamentally different. No improvement in the measurement tool [1] can help here, as it is reality that changes. Even writing down the measurement time only helps so much, since typically you aren't interested in precisely the time of measurement and will implicitly assume the real world is static.

          [0] The real reason for those is that it is _much_ simpler to build a precise relative measurement tool than an absolute one (i.e. it's easier to say "bigger than that other thing" than "this large"). One example is CO2 concentration measurement: readings are often relative to outdoor CO2, which is, unfortunately, not stable.

          [1] Assuming that the tool is only allowed to work on one point in time. If you include e.g. a weather modelling supercomputer in your definition of tools, that would again work.

  • xucheng a day ago

    Can this be solved by storing a timestamp of the record along with the precise GPS coordinates? Could we then use some database to compute the drift between then and now?

    • jandrewrogers a day ago

      Yes, in fact it should essentially be mandatory because the spatial reference system for GPS is not fixed to a point on Earth. This has become a major issue for old geospatial data sets in the US where no one remembered to record when the coordinates were collected.

      To correct for these cases you need to be able to separately attribute drift vectors due to the spatial reference system, plate tectonics, and other geophysical phenomena. Without a timestamp that allows you to precisely subtract out the spatial reference system drift vector, the magnitude of the uncertainty is quite large.
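      As a sketch of the kind of correction a timestamp enables (the velocity vector below is an assumed placeholder; real pipelines take per-location velocities from a plate motion model, e.g. ITRF station velocities):

```python
import math
from datetime import datetime, timezone

# Assumed local plate velocity in metres/year (east, north). A placeholder:
# real corrections look up the velocity for the point's location.
PLATE_VELOCITY_M_PER_YR = (0.02, 0.065)  # roughly Australia's ~7 cm/yr drift

M_PER_DEG_LAT = 111_320  # rough metres per degree of latitude

def propagate(lon, lat, observed_at, target):
    """Shift a coordinate from its observation epoch to a target epoch."""
    years = (target - observed_at).days / 365.25
    dlat = PLATE_VELOCITY_M_PER_YR[1] * years / M_PER_DEG_LAT
    dlon = PLATE_VELOCITY_M_PER_YR[0] * years / (
        M_PER_DEG_LAT * math.cos(math.radians(lat))
    )
    return lon + dlon, lat + dlat

# A Sydney-ish point observed in 2005, propagated to 2025
obs = datetime(2005, 1, 1, tzinfo=timezone.utc)
now = datetime(2025, 1, 1, tzinfo=timezone.utc)
lon2, lat2 = propagate(151.2093, -33.8688, obs, now)
print(lon2, lat2)  # shifted ~1.3 m north and ~0.4 m east of the stored values
```

Without the `observed_at` timestamp, the elapsed-time term is unknowable and the drift cannot be subtracted out.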

    • omcnoe 14 hours ago

      You don’t need to store a timestamp, but rather the local coordinate reference system that the coordinates are in. When revisions like this are made, it’s done by updating the specification of a specific local coordinate reference system.

      WGS84 is global, but for most precise local work more specific national coords are used instead.

    • haneefmubarak a day ago

      I mean, certainly - if you store both GPS time and derived coordinates from the same sampling, then you can always later interpret it as needed - whether relative to legal or geographical boundaries etc as you might want to interpret in the future.

  • AlotOfReading a day ago

    GPS coordinates actually account for the motion of the Earth's tectonic plates. The problem is that it's a highly approximate model that doesn't reflect areas like Australia very well.

    There's a great visualizer of the coordinate velocity from the Earthscope team:

    https://www.unavco.org/software/visualization/GPS-Velocity-V...

    • jandrewrogers 20 hours ago

      GPS coordinates do not account for tectonic motion. It is a synthetic spheroidal model that is not fixed to any point on Earth. The meridians are derived from the average motion of many objects, some of which are not on the planetary surface.

      The motion of tectonic plates can be calculated relative to this spatial reference system but they are not part of the spatial reference system and would kind of defeat the purpose if they were.

      • AlotOfReading 17 hours ago

        The corrections are incorporated into the datum. WGS84 is updated every 6 months to follow ITRF by changing the tracking station locations as the plates move around.

        • meindnoch 17 hours ago

          That's about correcting the ground stations' coordinates. It doesn't help keeping your house's GPS coordinates fixed. If the tectonic plate your house is built on moves a meter over the course of a decade, then your house's GPS coords will change in the lower decimals, and eventually your government's land registry will need to update those values.

          • paganel 10 hours ago

            Hopefully there are still governments that don’t keep such detailed land registries, or any land registries at all, for that matter. Some of us don’t want the State to see everything, at almost every moment in time.

            • coldtea 8 hours ago

              If you want to keep your land then you need to keep it in such detail in some registry.

              Else it's trivial for someone to claim it or parts of it. Before such registries, tons of people lost their land or part of it, went bankrupt trying to save it, or murdered each other over land border disputes.

              There are lots of records a state shouldn't have. Something fixed and stationary that needs protection from encroaching, like land limits, doesn't seem it should be one of them.

            • Scarblac 9 hours ago

              And have endless disputes over who owns what exactly? Allow companies to kick people off their land because they can't do anything about it?

              IMO having a good land ownership registry is one of the most important things to count as a developed country.

            • meindnoch 7 hours ago

              If the state doesn't know that you own a piece of land, then you don't own that piece of land. Simple as.

              • sayamqazi 6 hours ago

                You simply do not own any piece of land at all. The state owns all the land. You simply lease it.

                • meindnoch 5 hours ago

                  Well, in some countries this might be true.

                  • immibis 4 hours ago

                    It's true in all countries to the extent that it's true in any countries, but it's only partially true to begin with. The reality is that ownership doesn't exist. What actually exists is a credible threat to use violence if certain conditions are violated. In a civilized society this comes from the state. In a failed state it comes from somewhere else.

        • jandrewrogers 17 hours ago

          If WGS84 was correcting for tectonic drift it would imply that the coordinates of the terrestrial fixed points used to compute the reference meridian never change under WGS84. Rebasing the coordinates of terrestrial fixed points prior to calculation disregards tectonic drift in the reference meridian calculation, it doesn’t correct it. It is a noise reduction exercise to minimize the influence of plate tectonics on meridian drift. The meridian uses non-terrestrial fixed points too that don’t have a concept of tectonic drift (but may introduce their own idiosyncratic sources of noise).

          Basically, these are corrections to their “fixed points” to make them behave more like actual fixed points in the reference meridian model. It doesn’t eliminate tectonic drift effects when using coordinates in that spatial reference system.

    • janzer 21 hours ago

      I'm pretty positive that is showing the reverse, i.e. how much a given "location" is moving using gps coordinates. Not adjusting the gps coordinates to refer to a constant "location".

    • meindnoch 21 hours ago

      >GPS coordinates actually account for the motion of the Earth's tectonic plates.

      What?

      • 867-5309 19 hours ago

        accounts for (something), phrasal verb meaning "considers; incorporates; takes on board" as opposed to the more obvious "gives rise to; is responsible for". I had to read twice too

        • meindnoch 17 hours ago

          Yeah, I know what "accounts for" means.

          I just can't comprehend how GPS coordinates could account for the tectonic plates' motion. Never heard of such a thing, and can't see how it would work on a conceptual, mathematical level.

          • n4r9 7 hours ago

            They don't, and you're right it wouldn't make sense.

  • pavel_lishin a day ago

    Damn! 7cm per year feels blazing fast when you consider the fact that it's a whole continent.

    • niccl 14 hours ago

      A way to think about it I've seen a few times: continental drift is roughly the same order of magnitude as the rate your fingernails grow!

    • XorNot 17 hours ago

      I mean I'm still mind blown that the Three Gorges dam in China literally changed the rotational speed of the Earth, and thus the length of the day.

      • voidUpdate 8 hours ago

        You can change the length of the day just by spinning counterclockwise :P

  • RainyDayTmrw 18 hours ago

    This is one of many reasons why property surveying records use so many seemingly obscure or redundant points of reference, in case anyone wonders why modern property surveying isn't just recording lots of GPS coordinates.

  • akst a day ago

    My knowledge of geospatial datasets is fairly shallow, but I’ve worked a bit with Australian map data, and I’m assuming you are referring to the different CRSs, GDA2020 and GDA94?

    I’d imagine older coordinates would work with the earlier CRS?

    But I can understand that not all coordinates specify their CRS. This hasn’t really been an issue for me personally, but I’ve mostly worked with NSW spatial data and Australian Bureau of Statistics geodata.

  • sleepy_keita 15 hours ago

    Japan publishes new CRSes after large earthquakes to account for drift. The M9 earthquake in 2011 recorded a maximum shift of 5 meters!

  • atoav a day ago

    In the past year or so I have thought a lot about how to design tables and columns within databases, and there is nearly nothing that wouldn't get more robust by adding a "valid_from" and "valid_till" and accepting multiple values. Someone's name is Foo? What if they change it to Bar at some point and you need to access something from before under the old name?

    If you have only a name field with a single value, that is going to require a crazy workaround. If your names reference a person together with a date, that is much easier. But you need to make that decision pretty early.
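    A minimal sqlite3 sketch of that pattern (table and column names are made up for illustration):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("""
    CREATE TABLE person_name (
        person_id  INTEGER NOT NULL,
        name       TEXT    NOT NULL,
        valid_from TEXT    NOT NULL,  -- ISO dates compare correctly as text
        valid_till TEXT    NOT NULL DEFAULT '9999-12-31'
    )
""")
# Foo renamed themselves to Bar in mid-2020
con.execute("INSERT INTO person_name VALUES (1, 'Foo', '2010-01-01', '2020-06-01')")
con.execute("INSERT INTO person_name VALUES (1, 'Bar', '2020-06-01', '9999-12-31')")

def name_as_of(person_id, date):
    """Return the name that was valid on the given date, or None."""
    row = con.execute(
        "SELECT name FROM person_name"
        " WHERE person_id = ? AND valid_from <= ? AND ? < valid_till",
        (person_id, date, date),
    ).fetchone()
    return row[0] if row else None

print(name_as_of(1, "2015-03-14"))  # Foo
print(name_as_of(1, "2023-03-14"))  # Bar
```

The half-open `[valid_from, valid_till)` convention keeps adjacent rows from overlapping.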

    • taeric 2 minutes ago

      One that routinely surprises me is that this is not easy to do in any popular contacts application. I actually would like to keep every address I've ever had in there, in case I need to remember it for some reason later. Maybe I just want to reminisce. I don't want to accidentally have it as an "active" one, though.

    • jandrewrogers 21 hours ago

      The tradeoff is that this is very expensive at the scale of large geospatial data models both in terms of performance and storage. In practice, it is much more common to just take regular snapshots of the database. If you want to go back in time, you have to spin-up an old snapshot of the database model.

      A less obvious issue is that to make this work well, you need to do time interval intersection searches/joins at scale. There is a dearth of scalable data structures and algorithms for this in databases.

    • nradov 18 hours ago

      Anyone who works with human names should take a look at the HL7 V3 and FHIR data models, which were designed for healthcare. They support name validity ranges and a bunch of other related metadata. It can be challenging to efficiently represent those abstract data models in a traditional relational database, because with a fully normalized schema you end up needing a lot of joins.

    • pavel_lishin a day ago

      If you have an "audit" table, where you write a copy of the data before updating it in the primary table, that's a decision you can make at any point.

      Of course, you don't get that historical data, but you do get it going forward from there.
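      A hand-rolled sketch of that pattern in sqlite3 (illustrative names; a trigger copies the old row into the audit table before each UPDATE):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE place (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE place_audit (
        id INTEGER, old_name TEXT,
        changed_at TEXT DEFAULT CURRENT_TIMESTAMP
    );
    -- Copy the old row into the audit table before every UPDATE
    CREATE TRIGGER place_audit_trg BEFORE UPDATE ON place
    BEGIN
        INSERT INTO place_audit (id, old_name) VALUES (OLD.id, OLD.name);
    END;
""")
con.execute("INSERT INTO place VALUES (1, 'Kinmen County')")
con.execute("UPDATE place SET name = 'Kinmen' WHERE id = 1")

print(con.execute("SELECT old_name FROM place_audit").fetchall())
# [('Kinmen County',)]
```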

      • homebrewer 20 hours ago

        SQL 2011 defines temporal tables, which few FOSS databases support. I used it in mariadb:

        https://mariadb.com/kb/en/temporal-tables/

        and if your schema doesn't change much, it's practically free to implement, much easier and simpler than copypasting audit tables, or relying on codegen to do the same.

      • tough a day ago

        something like https://www.pgaudit.org/ ?

        Basically you keep an history of all changes so you can always roll-back / get that data if needed?

        • pavel_lishin a day ago

          The last time we did this, we basically hand-rolled our own, with a database trigger to insert data into a different table whenever an `UPDATE` statement happened.

          But this seems like it's probably a better solution.

          • tough a day ago

            I haven't used pgaudit yet, so I can't vouch for it, but I have it on my backburner of things to try for exactly this use case!

            I think the real magic is that it leverages the WAL (write-ahead log) from the pg engine itself, which you could certainly hook into directly too, but I'm not a db expert here.

    • boramalper a day ago

      See also "Eventual Business Consistency"[0] by Kent Beck. Really good read.

      > Double-dated data—we tag each bit of business data with 2 dates:

      > * The date on which the data changed out in the real world, the effective date.

      > * The date on which the system found out about the change, the posting date.

      > Using effective & posting dates together we can record all the strange twists & turns of feeding data into a system.

      [0] https://tidyfirst.substack.com/p/eventual-business-consisten...

      • atoav 10 hours ago

        Thanks for posting this. I read this a while ago, but it is worth revisiting.

  • cameldrv 12 hours ago

    I think Australia has its own datum for this reason, one that can float against WGS84.

    • jcattle 9 hours ago

      Almost all continents have their own datum. For Japan there's a special case where they now have 18 geodetic zones. Each zone is defined as parts of the crust that tend to move somewhat homogeneously.

      Basically, after the 2011 earthquake they had a geodetic mess on their hands, with coordinates all over the place since the ground had moved so much. That's why they later changed their approach.

andrew_eu a day ago

I have a memorable reverse geocoding story.

I was working with a team that was wrapping up a period of many different projects (including a reverse geocoding service) and adopting one major system to design and maintain. The handover was set to be after the new year holidays and the receiving teams had their own exciting rewrites planned. I was on call the last week of the year and got an alert that sales were halted in Taiwan due to some country code issue and our system seemed at fault. The customer facing application used an address to determine all sorts of personalization stuff: what products they're shown, regulatory links, etc. Our system was essentially a wrapper around Google Maps' reverse geocoding API, building in some business logic on top of the results.

That morning, at 3am, the API stopped serving the country code for queries of Kinmen County. It would keep the rest of the address the same, but just omit the country code, totally botching assumptions downstream. Google Maps seemingly realized all of a sudden what strait the island was in, and silently removed what some people dispute.

Everyone else on the team was on holiday and I couldn't feasibly get a review for any major mitigations (e.g. switching to OSM or some other provider). So I drew a simple polygon around the island, wrote a small function to check if the given coordinates were in the polygon, and shipped the hotfix. Happily, the whole reverse geocoding system was scrapped with a replacement by February.

  • modeless 18 hours ago

    Wow, I had no idea that Taiwan controlled an island less than three miles from mainland China, essentially surrounded by China in a bay. (The main island is 80+ miles away.) I'm really surprised China has allowed that for 80 years. Unsurprisingly, the beach looks like this: https://www.google.com/maps/place/Shuang+Kou+Zhan+Dou+Cun/@2...

    Also interesting that there's a Japanese island only 60 miles from Taiwan on the other side. I guess claims to small Pacific islands have been weird for a long time.

    • nradov 18 hours ago

      If the Chinese Communist Party decides to escalate the pressure on Taiwan then one likely scenario is some sort of blockade against those small islands close to the mainland.

  • marc_abonce 13 hours ago

    I faced the same issue with locations inside Crimea and Kashmir. The Google Places API wouldn't return a country code for those regions. At the time I couldn't find any documentation from Google specifying which inhabited locations return a null country code, I assume they want to avoid any potential controversy. Unfortunately this lack of documentation makes it harder to work around this issue.

jandrewrogers a day ago

Most people don’t have an intuitive sense of just how technically difficult mapping from real geospatial coordinates to feature spaces is. This is a great example of a relatively simple case. You are essentially doing inference on a sparse data model with complex local non-linearities throughout. If you add in dynamic relationships, like things that move in space, it becomes another order of magnitude worse. We frequently don’t have enough data to make a reliable inference even in theory and you need a way of reliably determining that.

This problem has been the subject of intense interest by the defense research community for decades. It has been conjectured to be an AI-complete type problem for at least ten years, i.e. solving it is equivalent to solving AGI. The current crop of LLM type AI persistently fails at this class of problems, which is one of the arguments for why LLM tech can’t lead to true AGI.

  • TimTheTinker a day ago

    Just putting this out there. This is one area where Esri's software really shines. They have so many software offerings and so much is said about different things you can do with ArcGIS (and competing systems), but the capability of their projection engine and geocoding systems - the code that lies at its heart - is unmatched, by far, at least as of 5 years ago when I left for a different company.

    I had long conversations with Esri's projection engine lead. Really remarkable guy - he's got graduate degrees in geography and math (including a PhD) and he's an excellent C/C++ developer. That kind of expertise trifecta is rare. I'd walk by his office and sometimes see him working out an integral of a massive equation on his whiteboard (not that he didn't also use a CAS). "Oh yeah, I'm adding support for a new projection this week."

    • jandrewrogers 21 hours ago

      Many people don’t appreciate the extent that building robust geospatial systems requires seriously hardcore mathematics and physics skills. All of the mapping companies have really smart PhDs wrangling with these problems. I’ve always enjoyed talking with them about the subtleties of the challenges. There are so many nuances that never occurred to me until they mentioned them.

sinuhe69 a day ago

Not my area of expertise, but isn't this a perfectionism problem? I mean, most places have a clear and simple address. For the rest, either a human can solve it, or we can make a few examples and let an AI do the work. We can go back to them later and revise them if we need to. Addresses don't change often, so I think things can stay the same for a long time.

Except for emergency dispatch and a few high-profile use cases, a good enough address lets the user find the neighbourhood. They still have the GPS coordinates or some other form of location encoding, so they can find the exact location easily. I'd say 99.9% of cases are like that. The rest can be solved quickly by looking at the map!

  • ryandrake a day ago

    You can call it perfectionism or you can call it "doing it right." I think this gets at a fundamental difference in philosophy among [software] engineers: We have a problem with a lot of edge cases, where a "good enough" solution can be done quickly. What do we do? There's a class of engineers who say 1. Do the "good enough" solution and ignore/error on the edge cases--we'll fix them later somehow (may or may not have an actual plan to do this). And there's a class of engineers who say 2. We cannot solve this problem correctly yet and need more research and better data.

    Unfortunately (in my view), group #1 is making all the products and is responsible for the majority of applications of technology that get deployed. Obviously this is the case because they will take on projects that group #2 cannot, and have no compunction against shipping them. And we can see the results with our eyes. Terrible software that constantly underestimates the number and frequency of these "edge cases" and defects. Terrible software that still requires the user to do legwork in many cases because the developers made an incorrect assumption or had bad input data.

    AI is making this problem even worse, because now we don't even know what the systems can and cannot do. LLMs nondeterministically fail in ways that sometimes can't even be directly corrected with code, and all engineering can do is stochastically fix defects by "training with better models."

    I don't know how we get out of this: Every company is understandably biased towards "doing now" rather than "waiting" to research more and make a better product, and the doers outcompete the researchers.

    • sbarre 18 hours ago

      > Unfortunately (in my view), group #1 is making all the products and is responsible for the majority of applications of technology that get deployed.

      This is an interesting take, and I think I see where you're coming from..

      My first thought on "why" is that so many products today are free to the user, meaning the money is made elsewhere, and so the experience presented to the user can be a lot more imperfect or non-exhaustive than it would otherwise have to be if someone was paying to use that experience.

      So edge cases can be ignored because really you're looking for a critical mass of eyeballs to sell to advertisers or to harvest usage data from, etc.. If a small portion of your users has a bad time or experiences errors, well, you get what you pay for as they say..

      And does that kind of pervasiveness now mean that many engineers think this is just the way to go no matter what?

  • mootothemax 18 hours ago

    > most places have a clear and simple address

    That depends on your definition of "clear and simple" and "address" :) While a lot boils down to use case - are you trying to navigate somewhere, or link a string to an address? - even figuring out what counts as an address can be hard work. Is an address the entrance to a building? Or a building that accepts postal deliveries? Is the "shell" of a building that contains a bunch of flats/apartments but doesn't itself have a postal delivery point or bills registered directly to it an address? How about the address a location was known by 1 year ago? 2 years ago? 10 years ago?

    Parks and other public spaces can be fun; they may have many local names that are completely different from the "official" name - and it's a big "if" whether an official name exists at all. Heck, most _roads_ have official names that are anything but the names people refer to them by. I have a screaming obsession with the road directly in front of Buckingham Palace which, despite what you see on Google Maps, is registered as "unnamed road" in all of the official sources.

    > Addresses don't change often

    At the individual level, perhaps. In aggregate? Addresses change all the time, sometimes unrecognisably so. City and town boundaries are forever expanding and contracting, and the borders between countries are hardly static either (and if you're ever near the Netherlands / Belgium border, make a quick trip to Baarle-Hertog and enjoy the full madness). Thanks to intercontinental relative movement, the coordinates we log against locations have a limited shelf life too. All of the things I used to think were certain...

    If someone hasn't done "falsehoods programmers believe about addresses," I think its time might be now!

    Edit: answering myself with https://www.mjt.me.uk/posts/falsehoods-programmers-believe-a...

  • jandrewrogers a day ago

    The update rate for a global map data model, all of which are still woefully incomplete in many contexts, is surprisingly high. The territory underlying the map is a lot less static than people assume. Also, local reality is often much less “regular” than people assume such that a person really can’t figure it out reliably. Currently there are literally thousands of people tasked with incorporating these changes because it has proven to be resistant to automation thus far due to the pervasiveness of edge cases. For your basic global map data model, these are the edge cases that are left after several thousand heuristic and empirically derived rules have been applied.

    It is a deeply complex data model that changes millions of times a day in unpredictable ways. Unfortunately, many applications are very sensitive to the local accuracy of the model, which is much higher variance than average accuracy. Only trying to be “good enough” in an 80/20 rule sense is the same as “broken”. The updates are also noisy and often contain errors, so the process has to be resilient to those errors.

    The resistance of the problem to automation and the high rate of change have made it extremely expensive to asymptotically converge on a model with consistently acceptable accuracy for the vast majority of applications.

  • Beretta_Vexee 6 hours ago

    The author's problem is similar to many real-world business problems. A simple example is directing delivery drivers to the correct entrance of a large site with multiple entrances depending on the type of delivery (mail to door A, van to door B, semi-trailer to door C).

    Sending a Romanian truck driver to a vague address in Holland such as ‘port terminal B somewhere on road 999’ and leaving him to figure it out for himself is not a solution.

  • edent a day ago

    I am deeply guilty of being a perfectionist!

    Ultimately, I just want something which is a nice balance between being useful for a human and not so long that it is overwhelming.

    • curiousObject a day ago

      You’re the author?

      The final step in the process “Wait for complaints” seems like a smart acceptance of the “perfect is the enemy of good” challenge

      Publish and be damned, or as we say now: Move fast and break things

  • smitty1e a day ago

    I was going to take this tack.

    80% of the problem is just transforming floating point coordinates into API calls.

    Getting to something useful with it is the hard 20%, and it will be a diminishing returns problem after that.

    While I'm no LLM proponent, that last mile might be a good AI application.

nedt 2 hours ago

Is it even worth it? Most users will just enter the address into the map or routing app of their choice to see directions on a map. For something inside a park an address really isn't that useful, but even outside one, most people don't know where a specific house number is.

Also, having zones might not be super useful. When I'm in a city near the border of a district, I wouldn't want to search only in my current district. Something on the far side of my district might be much harder to reach than something in the neighbouring one.

Giving a place nearby, like a landmark, can aid in finding interesting places, but in the end a simple radius search, or route distance search or even something next to a path, might be much more useful. Which is more or less what is being done when you visualize points on a map.

Staying closer to coordinates also gets rid of localization issues. And that's not just different languages and scripts but also how addresses are used worldwide. There are some important cultural differences.

vintermann a day ago

Genealogy applications run into this a lot. The person of interest lived at Engeset. FamilySearch has geocoded a place called "Engeset, Møre og Romsdal, Norway". So that's it, right? Not so fast, [there are at least 3 Engesets in Møre og Romsdal](https://www.google.com/maps/search/Engeset/@62.3358577,6.225...).

But that's at least better than when it's some local place name it has never heard of, which it decides sounds most similar to a place in Afghanistan (this happens all the time).

And to add to it, there are administrative regions, and ecclesiastical regions. Do you put them in the parish, or in the municipality? The birth in the parish and the baptism in the municipality, maybe? How about the burial then...

  • modeless a day ago

    Converting from a name/address to coordinates is geocoding. Reverse geocoding is mapping from coordinates to a name/address.

    • vintermann 10 hours ago

      Well, you need both. For instance, to know that "Hauan" is likely a small named place near the other named places this person is associated with, and not the similarly named place in Afghanistan. Any time there's ambiguity about a name, you need the reverse too, to resolve it.

punnerud a day ago

I created this to solve my own need for reverse geocoding: https://github.com/punnerud/rgcosm (Saving me thousands of $ compared to Google API)

Uses an OpenStreetMap file, Python, and SQLite3.

First it finds all addresses within a +/- square around the lat/lon, then calculates the distance for each candidate in that smaller list (Pythagoras), and picks the closest. If no address is found in the first search, it expands the square until a set maximum.
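
The two-step search described above can be sketched with just the standard library. The table name, columns, and sample rows here are illustrative assumptions, not the actual rgcosm schema:

```python
import sqlite3

# Toy in-memory table standing in for the OSM-derived address data.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE addresses (name TEXT, lat REAL, lon REAL)")
conn.executemany(
    "INSERT INTO addresses VALUES (?, ?, ?)",
    [("Town Hall", 59.913, 10.738), ("Old Bridge", 59.920, 10.750),
     ("Far Farm", 60.500, 11.000)],
)
conn.execute("CREATE INDEX idx_latlon ON addresses (lat, lon)")

def reverse_geocode(lat, lon, step=0.01, max_delta=0.1):
    delta = step
    while delta <= max_delta:
        # 1. Cheap indexed pre-filter: a +/- delta square around the point.
        rows = conn.execute(
            "SELECT name, lat, lon FROM addresses "
            "WHERE lat BETWEEN ? AND ? AND lon BETWEEN ? AND ?",
            (lat - delta, lat + delta, lon - delta, lon + delta),
        ).fetchall()
        if rows:
            # 2. Pythagoras on the small candidate list; fine at city scale,
            # though strictly you'd scale the lon difference by cos(lat).
            return min(rows, key=lambda r: (r[1] - lat) ** 2 + (r[2] - lon) ** 2)[0]
        delta *= 2  # nothing found: expand the square and retry
    return None

print(reverse_geocode(59.914, 10.740))  # → Town Hall
```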

  • davidmurdoch a day ago

    Just curious if you looked into using S2 cells for this? It's what Pokemon Go uses for its coordinate system. http://s2geometry.io/devguide/s2cell_hierarchy.html

    • punnerud a day ago

      Isn’t the main purpose of S2 to be able to scan from different “directions”? That's more of a Google Maps concern, viewing the world as a spherical object, compared to SQLite3 just using a simple B-tree index on lat+lon?

      • davidmurdoch a day ago

        The individual cells being uniformly sized, and being able to easily compute neighboring cells, seem useful for the described algorithm. I haven't given much thought to its applicability here, but it sounded similar to a search pattern I once implemented in pgsql to locate items on a map within proximity of a given latlong.

andrewaylett a day ago

It's a lot more expensive, but measuring navigation distance rather than straight line distance would avoid the "river" issue. Although depending on the routing engine and dataset it might well introduce more issues where points can be really close on foot but the only known route is a driving route.

  • edent a day ago

    If you know of an API which does navigation distance to POI, I'd love to hear about it!

    • dalke 7 hours ago

      To share a related river issue: a few years back the city of Gothenburg, Sweden changed how it allocates students to schools. It used to be that a student would get a school in their specific section of the city ("stadsdel"). The new system lets you choose schools, specified in rank order, even in another part of the city.

      You could select from a list of schools, with distances given as straight-line distance (or perhaps as route distance? I can't tell from the articles I've read), which meant some of the schools across the river were considered "close".

      In one case, a student had a 45 minute commute to get to school, due to waiting for the ferry. The parents listed it as their 5th choice, based on the stated distance. The more critical factor should be travel time, but the computer system doesn't take mass transit schedules into account.

      As a contributing factor, the list of schools did not include the name of the stadsdel in the list of schools. This extra sort of reverse geocoding - trivially available - might have helped them realize the issue.

      One news story about it, in Swedish: https://www.svt.se/nyheter/lokalt/vast/kaos-i-antagningssyst...

    • mootothemax 18 hours ago

      You can self-host Valhalla or run it locally (https://github.com/valhalla/valhalla), reading in data from OSM as a starting point.

      (For my purposes, I went with local running, generating walking-distance isochrones across pretty much well the entire UK)

    • petre a day ago

      Check out GraphHopper. But if your POIs are from OSM, OSRM might be okay as well.

AlotOfReading a day ago

I haven't found a better way to do this than the Google Maps solution [0]:

You write a query of all the different kinds of addresses you'd like to display. The query result is a list of valid candidate addresses for the point matching at least one format that you can rank based on whatever criteria you like.

[0] https://developers.google.com/maps/documentation/geocoding/r...

  • mvdtnz 20 hours ago

    It sounds like the author is more interested in getting city or town names from a coordinate. Google maps is massively overkill and horrendously expensive for this use case. I mentioned in another comment I do this in a game I wrote and can complete queries in microseconds.

    https://news.ycombinator.com/item?id=43814231

rovr138 a day ago

Have you looked at the GeoNames database? https://www.geonames.org/

Info and schema are here: https://download.geonames.org/export/dump/readme.txt

Could be a good source. Not sure how good it is worldwide, but the countries I’ve used it for, it’s been useful and pretty good.

Try the search too, https://www.geonames.org/search.html?q=R%C3%ADo+grande&count...

Not just roads; there are rivers and other things too.

  • juliansimioni a day ago

    Geonames is a great dataset, in fact it's one of the "OG" open-source databases of the modern era, dating back to 2005.

    It has fairly comprehensive coverage of countries, cities, and major landmarks. It also has stable, simple identifiers that are somewhat of a lingua-franca in the geospatial data world (i.e. Geonames ID 5139572 points to the Statue of Liberty and if you have other data that you need to unambiguously associate with the one Statue of Liberty in New York Harbor, putting a `geonames_id` column in your database with that integer will pretty much solve it, and will allow anyone else you work with to understand the connection clearly too).

    However, to be honest, it hasn't really kept pace with modern times. The velocity of changes and updates is pretty low, it doesn't actively grow the community anymore. The data format is simple and rigid and built on old tech that's increasingly hard to work with. You can trust Geonames to have the Statue of Liberty, but not the latest restaurants in NYC.

    For a problem like the post author has of finding ways everyday people can easily navigate to something like a park bench that might not have a single address associated with it, or even if it does, needs more granularity to find _that_ specific bench in a park with 100 benches, Geonames probably won't help.

    Source: I'm co-founder of Geocode Earth, one of the geocoding companies linked in the blog post. We use Geonames as one source of POI data amongst many others.

  • edent a day ago

    That does look interesting. I could search through it for a lat & long, but it looks like it only gives a name (e.g. "Silicon Oasis") without a corresponding country. Food for thought though.

    Thanks!

    • rovr138 a day ago

      Yeah. It’s not flat.

      You can use the admin fields; it's a recursive query to find the hierarchy.

      I have a recursive CTE for it (thanks to ChatGPT).

      Could also be done on save, since they shouldn’t change for locations.

      The recursiveness gives you a benefit, though: if you extract the type and save the intermediate steps, you can start grouping things together at different levels, which is one of the use cases you mentioned.

johnlk a day ago

It’s almost more of a UX challenge than anything. The feedback widget idea at the end could offer a crowdsourced solution, the same way Twitch solved translation via crowdsourcing.

nerdralph a day ago

Part of the problem is the different ways addresses are expressed throughout the world. I was born and grew up in Canada, and was confused when I started dealing with companies in China. Instead of street addresses, many are given by province, city, district, sub-district, and a building number.

Another problem is choosing which authority gives the "correct" address. I've seen many cases where the official postal address city/town name is different from the one in the 911 database. For example Canada Post will say some street addresses are in Dartmouth, while the official civic address is really Cole Harbour. https://www.canadapost-postescanada.ca/ac/ https://nsgi.novascotia.ca/civic-address-finder/

Even streets can have multiple official names/aliases. People who live on "East Bay Hwy", also live on "Highway 4", which is an alias.

jillesvangurp 20 hours ago

This was a while ago, but about 12 years back I experimented with putting the whole of OpenStreetMap into Elasticsearch.

Reverse geocoding then becomes a problem of figuring out which polygons contain the point with a simple query and which POIs/streets/etc. are closest based on perpendicular distance. For that, I simply did a radius search and some post processing on any street segments. Probably not perfect for everything. But it worked well enough. My goal was actually being able to group things by neighborhood and microneighborhoods (e.g. squares, nightlife areas, etc.).

This should work well enough with anything that allows for geospatial queries. In a pinch you can use geohashes (I actually did this because geospatial search was still a bit experimental in ES).
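
The point-in-polygon half of a pipeline like this doesn't strictly need a search engine; the classic even-odd ray-casting test is only a few lines. A minimal sketch, treating lat/lon as planar coordinates (fine for small polygons away from the poles and the antimeridian):

```python
def point_in_polygon(lat, lon, polygon):
    """Even-odd ray casting: cast a ray east from the point and count how
    many polygon edges it crosses. An odd count means the point is inside.
    `polygon` is a list of (lat, lon) vertices; planar approximation."""
    inside = False
    n = len(polygon)
    for i in range(n):
        lat1, lon1 = polygon[i]
        lat2, lon2 = polygon[(i + 1) % n]
        # Consider only edges that straddle the point's latitude.
        if (lat1 > lat) != (lat2 > lat):
            # Longitude at which the edge crosses that latitude.
            cross_lon = lon1 + (lat - lat1) * (lon2 - lon1) / (lat2 - lat1)
            if lon < cross_lon:
                inside = not inside
    return inside

# A 1x1 degree square as a toy "neighbourhood" polygon.
square = [(0.0, 0.0), (0.0, 1.0), (1.0, 1.0), (1.0, 0.0)]
print(point_in_polygon(0.5, 0.5, square))  # → True
print(point_in_polygon(2.0, 0.5, square))  # → False
```

In practice a database or search engine does the coarse candidate filtering (geohash prefix, bounding box, or R-tree) and a test like this settles the exact containment.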

the_arun a day ago

Nicely written article. So simple yet interesting. I wish more people made projects like these.

  • edent a day ago

    Thank you! I appreciate that :-)

amelius a day ago

Why not take the OpenStreetMap address (which is long), chop it into a list of short combinations, then do a lookup for each combination, and see which short address gives you the best (geographically closest) match?

dadadad100 a day ago

This problem also exists for services like Uber. Their solution seems easier: drop a pin on a map. Perhaps working so hard to find a textual description is missing the simpler solution.

blacklight a day ago

As the developer of a GPS tracking app that relies a lot on OpenStreetMap, I've faced many of these problems myself. A couple of learned lessons/insights:

- I avoid relying on any generic location name/description provided by these APIs. Always prefer structured data whenever possible, and build the locality name from those components (bonus points if you let the user specify a custom format).

- Identifying those components itself is tricky. As the author mentioned, there are countries that have states, others that have regions, others that have counties, or districts, or any combination of those. And there are cities that have suburbs, neighbourhoods, municipalities, or any combination. Oh, and let's not even get started with address names - house numbers? extensions? localization variants - e.g. even the same API may sometimes return "Marrakesh" and sometimes "Marrakech"? and how about places like India, where nearby amenities are commonly used instead of house numbers? I'm not aware of any public APIs out there that provide these "expected" taxonomies, preferably from lat/long input, but I'd love to be proven wrong. In the absence of that, I would suggest that it is better to avoid second-guessing - unless your software is only intended to run in a specific country, or in a limited number of countries, and you can afford to hardcode those rules. It's probably a good option to provide a sensible default, and then let the user override it. Oh, and good catch about abbreviations - I'd avoid them unless the user explicitly enables them, if you want to avoid the "does everybody know that IL is Illinois?" problem. Just use "Illinois" instead, at least by default.

- Localization of addresses is a tricky problem only on the surface. My proposed approach is that, again, the user is king. Provide English by default (unless you want to launch your software in a specific country), and let the user override the localization. I feel like Nominatim's approach is probably the cleanest: honor the `Accept-Language` HTTP header if available, and if not, fall back to English. And then just expose that as a setting to the user.

- Bounding boxes/polygons can help a lot with solving the proximity/perimeter issue. But they aren't always present/sufficiently accurate in OSM data. And their proper usage usually requires the client's code to run some non-trivial lat/long geometry processing code, even to answer trivial questions such as "is this point inside of this enclosed amenity?" Oh, and let's not even get started with the "what's the exact lat/long of this address?" problem. Is it the entrance of the park? The middle of it? I remember that when I worked with the Bing API in the past it provided more granular information at the level of rooftop location, entrance location etc.

- Providing location information for public benches isn't what I'd call an orthodox use-case for geo software, so I'm not entirely sure how to solve the "why doesn't everything have an address?" problem :)
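
For the proximity/perimeter point above, when a proper polygon is missing or unreliable, a common fallback is a great-circle radius test around the amenity's centroid. A minimal sketch; `within_amenity` and the circle-around-centroid treatment are illustrative assumptions, not how any particular API works:

```python
from math import radians, sin, cos, asin, sqrt

def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in metres between two lat/lon points."""
    R = 6_371_000  # mean Earth radius in metres
    dlat = radians(lat2 - lat1)
    dlon = radians(lon2 - lon1)
    a = sin(dlat / 2) ** 2 + cos(radians(lat1)) * cos(radians(lat2)) * sin(dlon / 2) ** 2
    return 2 * R * asin(sqrt(a))

def within_amenity(lat, lon, center_lat, center_lon, radius_m):
    """Crude 'is this point inside this amenity?' check: treat the amenity
    as a circle around its centroid. A real polygon test is better when
    the boundary data exists and is accurate."""
    return haversine_m(lat, lon, center_lat, center_lon) <= radius_m

# Two points roughly 70 m apart in central London.
print(within_amenity(51.5080, -0.1281, 51.5074, -0.1278, 100))  # → True
```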

dpmdpm a day ago

I read this as Reverse Genociding Is Hard, thought I was on a Nethack forum, and thought, No, it's pretty easy with a cursed scroll.

  • mtmail 20 hours ago

    At least one spellchecker likes to correct "genocide" to "geocode". On social media I saw rage posts about how Jews and Palestinians are being geocoded.

morkalork a day ago

If I were giving directions to another human and not using house addresses I'd say something like "Queen street about half way down the block between Crawford and Shaw"

  • edent a day ago

    That's great for cities with a grid layout, but ignores most of the world.

    How would you give directions to something in the middle of a park?

    • zild3d 4 hours ago

      send a map pin :P

  • Propelloni a day ago

    Fascinating to read. Around here people would do something like "follow the road (hand pointing down the road) at the first t-crossing turn left into a smaller road (hand pointing in the meant direction) continue for, I don't know, a few minutes. On your right (hand shows the meant direction) you'll see the park. A road comes up on the left and directly opposite of it is an entrance into the park. Go into the park and follow the path until you reach the first crossing. Turn left (hand shows meant direction) then follow until you reach the end of the park. The park should make a right turn there. The bench should be to your left."

    Rarely if ever do people use road names to direct pedestrians or car drivers. I guess people don't know them. I wouldn't.

mvdtnz 20 hours ago

I dealt with this exact issue and went with that exact solution in my browser-based geography game[0].

What the author is looking for is administrative divisions and boundaries[1], in particular probably down to level 3 which is the depth my game goes to. These differ in size greatly by country. With admin boundaries you need to accept there is no one-size-fits-all solution and embrace the quirks of the different countries.

For my game I downloaded a complete database of global admin boundaries[2] and imported them into PostgreSQL for lightning fast querying using PostGIS.

[0] https://guesshole.com

[1] https://en.wikipedia.org/wiki/List_of_administrative_divisio...

[2] https://gadm.org/data.html

gmoore a day ago

Maybe the 'three words' model? Seems like it would be specific enough to locate a bench.

  • Liquid_Fire 21 hours ago

    That's essentially equivalent to coordinates since you still need to translate it to some human-understandable form, so it doesn't solve the problem.

  • cjs_ac a day ago

    WhatThreeWords is a proprietary algorithm and has problems with homophones.

    • voidUpdate 8 hours ago

      I thought W3W was specifically designed to mitigate homophone problems, it seems pointless otherwise...

1970-01-01 19 hours ago

>But how do you go from a string of digits to something human readable?

Hasn't What3Words already solved this?

  • robin_reala 19 hours ago

    It’s a closed and proprietary system that contains locations like master.beats.slave near Brooklyn and whites.power.life in Washington (to pick just a couple of examples off the top of my head). If your only requirement is “human readable” and assumes the human knows how to read and pronounce English words then I guess it kinda does.