
Behind the Scenes: Algolia Places

All of us have at some point or another been stuck in the middle of nowhere, desperately trying to locate an address with our “fat fingers” for company; or abandoned a purchase in our virtual shopping baskets because it was too much trouble to fill out our entire address.

Algolia being Algolia, we wondered how we could reduce instances of this in our own way, and Algolia Places was born.

In case you haven’t heard, Algolia Places allows you to create beautiful address auto-completion menus with just a few lines of code; it can also be personalized and customized to suit your use case, not to mention enriched with additional sources of data. Today, we’d like to share with you the story of how we built Algolia Places.

How did we do it?

Step 1: Data Collection

To build Algolia Places we relied on the open-source datasets provided by OpenStreetMap & GeoNames. These datasets are actually very different from each other and we chose them for precisely this reason.

  • The OpenStreetMap dataset contains map data: it basically consists of the geo representation (polygons, lines & points) of about 200 million geographical features.
  • The GeoNames dataset contains geographical names for about 9 million features: it’s a regular list in TSV format, with the name of every single city/country/place associated with some metadata (population, zip codes, …).

Step 2: The Indices

  1. The city & country index

In order to build the city and country search experience, we exclusively used the GeoNames dataset – it’s a pretty exhaustive list and is quite simple to parse.

For every single city & country name in the TSV files, we created an Algolia record. These records not only include the variations and translated names of countries & cities (if available), but also some metadata such as associated postcodes, populations, locations and the various tags & booleans we use for the ranking/filtering capabilities of our API.

The result? ~4 million records.
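
To make that record-generation step a bit more concrete, here is a minimal Python sketch of what turning GeoNames TSV rows into Algolia-style records could look like. The column positions follow the publicly documented allCountries.txt layout, and the output fields mirror the record schema shown later in this post; this is an illustration, not our actual pipeline code.

import csv

# Column positions in the standard GeoNames "allCountries.txt" dump
# (tab-separated, no header row), per the GeoNames readme.
NAME, ALT_NAMES, LAT, LNG, FEATURE_CLASS, COUNTRY, POPULATION = 1, 3, 4, 5, 6, 8, 14

def geonames_to_records(path):
    """Yield one Algolia-style record per populated place (feature class 'P')."""
    with open(path, encoding="utf-8") as tsv:
        for row in csv.reader(tsv, delimiter="\t", quoting=csv.QUOTE_NONE):
            if row[FEATURE_CLASS] != "P":  # keep cities, towns and villages only
                continue
            alt_names = [name for name in row[ALT_NAMES].split(",") if name]
            yield {
                "locale_names": {"default": [row[NAME], *alt_names]},
                "country_code": row[COUNTRY].lower(),
                "population": int(row[POPULATION] or 0),
                "_geoloc": [{"lat": float(row[LAT]), "lng": float(row[LNG])}],
                "is_city": True,
                "is_country": False,
            }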

  2. The address indices

To build the address search experience (including countries, cities, streets & notable places), we used both OpenStreetMap and GeoNames.

The OpenStreetMap initiative is wonderful but the underlying data is not always on point. Why?

  • It may not always be exhaustive (especially for non-famous places)
  • It may sometimes include erroneous values (from our experience, postcodes used to be wrong)
  • It may contain duplicates
  • It may not always follow the same conventions across countries
  • It may lack some internationalisation/translations

To convert the OSM map data into Algolia records, we imported the whole OSM planet (40GB compressed) into a PostgreSQL+PostGIS engine. The resulting database was a whopping 600GB!

We then used Nominatim to export this huge database to an XML format, rebuilding the hierarchy of places. This actually brought an interesting problem to light: the raw OSM map data doesn’t have any hierarchy; for instance, there is no obvious hierarchy/link between San Francisco, California and the United States. Nominatim rebuilds that hierarchy in the exported format so that another tool can easily process it.

The resulting XML file weighs 150GB and looks something like this:

<osmStructured version="0.1" generator="Nominatim">
 <add>
  <feature place_id="914093" type="R" id="71525" key="boundary" value="administrative" rank="12" importance="12" parent_place_id="913159" parent_type="R" parent_id="8649">
   <names>
    <name type="alt_name:fr">Lutèce</name>
    <name type="alt_name:vi">Ba Lê</name>
    <name type="loc_name:fr">Panam</name>
    <name type="name">Paris</name>
    <name type="name:af">Parys</name>
    <name type="name:am">ፓሪስ</name>
    <name type="name:an">París</name>
    <name type="name:ar">باريس</name>
   …
    <name type="old_name:vi">Ba Lê</name>
    <name type="ref">75</name>
   </names>
   <adminLevel>6</adminLevel>
   <address>
    <state rank="8" type="R" id="8649" key="boundary" value="administrative" distance="0.236572214414259" isaddress="t"/>
   </address>
   <tags>
    <tag type="wikipedia">fr:Paris</tag>
   </tags>
   <osmGeometry>POLYGON((2.224122 48.854199,2.224158 48.854615,2.224257 48.855241,2.224317 48.85555,2.224371 ….))</osmGeometry>
  </feature>
  …
 </add>
</osmStructured>

We then built a SAX parser to process the file and rebuild the feature hierarchy by following the “parent_id” and “parent_place_id” XML attributes.
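
To give a concrete idea of what such a streaming pass looks like, here is a simplified Python SAX handler; the element and attribute names follow the osmStructured excerpt above, but the handler (and the in-memory structure it builds) is illustrative rather than our actual parser, which had to be far more careful about memory.

import xml.sax

class FeatureHandler(xml.sax.ContentHandler):
    """Collect every <feature>'s names and parent link while streaming the file."""

    def __init__(self):
        super().__init__()
        self.features = {}   # place_id -> {"names": {...}, "parent": parent_place_id}
        self.current = None
        self.name_type = None
        self.chars = []

    def startElement(self, tag, attrs):
        if tag == "feature":
            self.current = {"id": attrs["place_id"],
                            "parent": attrs.get("parent_place_id"),
                            "names": {}}
        elif tag == "name" and self.current is not None:
            self.name_type = attrs.get("type", "name")
            self.chars = []

    def characters(self, content):
        if self.name_type is not None:
            self.chars.append(content)   # text may arrive in several chunks

    def endElement(self, tag):
        if tag == "name" and self.current is not None:
            self.current["names"][self.name_type] = "".join(self.chars)
            self.name_type = None
        elif tag == "feature" and self.current is not None:
            self.features[self.current["id"]] = self.current
            self.current = None

handler = FeatureHandler()
xml.sax.parse("nominatim_export.xml", handler)   # hypothetical file name

# Resolving a feature's ancestry is then a matter of following "parent" links.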

It turned out to be a tiny bit more complex than expected:

  • A lot of features came with duplicates and inconsistencies:
    • Some cities have both “boundary/administrative” and “place/city” features, while some have just one of them
    • A street is composed of several segments, so you might see several features representing the same street
    • Some street names are common to several cities, so we needed multiple records within Algolia

  • Some of the hierarchy couldn’t be resolved:

    • Some features did not have any parents
    • Some streets were attached to their countries but not cities
    • Some cities are also states, so the hierarchy was very confusing

  • Some features also lacked some metadata:

    • They weren’t associated with the corresponding population
    • Some counties didn’t have postcodes
    • Some cities didn’t have translations

Deduplication is generally super easy, but when you need to deduplicate 150GB of data, it leads to all sorts of new problems – in our case, we never seemed to have enough RAM! We also wanted to make sure that parsing was fast enough in order to avoid waiting days to build the Algolia records … after all, milliseconds matter!
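
One memory-frugal way to approach deduplication at that scale (not necessarily the exact heuristic we ended up with) is to fingerprint each candidate record and keep only small digests in RAM rather than the full records:

import hashlib

def dedup_key(record):
    """Fingerprint a record by its default name and a coarse location."""
    name = record["locale_names"]["default"][0].casefold()
    point = record["_geoloc"][0]
    raw = f'{name}|{round(point["lat"], 2)}|{round(point["lng"], 2)}'
    return hashlib.md5(raw.encode("utf-8")).digest()  # 16 bytes per seen feature

def dedup(records):
    """Drop records whose fingerprint has already been seen."""
    seen = set()
    for record in records:
        key = dedup_key(record)
        if key not in seen:
            seen.add(key)
            yield record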

So we tried to leverage the previously built GeoNames index as much as possible to fill in the missing data we hit with OSM; but since these two datasets don’t share the same IDs, aggregating them was obviously way more complex.
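
As a rough idea of what that aggregation can look like without shared IDs, here is a sketch that matches an OSM-derived record against GeoNames entries by normalized name, country and proximity, and copies a missing population over. The distance threshold and the geonames_by_name lookup structure are hypothetical, not the exact logic we used.

from math import asin, cos, radians, sin, sqrt

def distance_km(a, b):
    """Great-circle (haversine) distance between two {lat, lng} points."""
    lat1, lng1, lat2, lng2 = map(radians, (a["lat"], a["lng"], b["lat"], b["lng"]))
    h = sin((lat2 - lat1) / 2) ** 2 + cos(lat1) * cos(lat2) * sin((lng2 - lng1) / 2) ** 2
    return 2 * 6371 * asin(sqrt(h))

def enrich_with_geonames(osm_record, geonames_by_name):
    """Fill in a missing population from the closest matching GeoNames entry."""
    name = osm_record["locale_names"]["default"][0].casefold()
    candidates = [
        g for g in geonames_by_name.get(name, [])
        if g["country_code"] == osm_record["country_code"]
        and distance_km(g["_geoloc"][0], osm_record["_geoloc"][0]) < 20  # roughly the same place
    ]
    if candidates and not osm_record.get("population"):
        closest = min(candidates,
                      key=lambda g: distance_km(g["_geoloc"][0], osm_record["_geoloc"][0]))
        osm_record["population"] = closest.get("population", 0)
    return osm_record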

For performance reasons (more on that under “Search strategy”), we decided to build multiple indices from those records:

  • 1 index for the whole planet (~20M records)
  • 1 index per country (~6M records for the US, ~1.5M records for France, ….)

That’s about 60GB of indices.

Just to give you an idea, the overall parsing + record generation + indexing takes ~12 hours per run today.

The record schema

Here’s what our final record schema looks like:

{
  "is_city": true,
  "is_country": false,
  "_tags": ["boundary", "boundary/administrative", "country/us", "city"],
  "country_code": "us",
  "_geoloc": [{"lng": -122.4192704, "lat": 37.7792768}],
  "country": {
    "zh": "美国",
    "ro": "Statele Unite ale Americii",
    "it": "Stati Uniti d'America",
    "hu": "Amerikai Egyesült Államok",
    "ar": "الولايات المتّحدة الأمريكيّة",
    "de": "Vereinigte Staaten von Amerika",
    "default": "United States of America",
    "pt": "Estados Unidos da América",
    "pl": "Stany Zjednoczone Ameryki",
    "fr": "États-Unis d'Amérique",
    "ru": "Соединённые Штаты Америки",
    "es": "Estados Unidos de América",
    "nl": "Verenigde Staten van Amerika",
    "ja": "アメリカ合衆国"
  },
  "admin_level": 8,
  "locale_names": {
    "zh": ["旧金山"],
    "default": ["San Francisco", "SF"],
    "pt": ["São Francisco"],
    "ru": ["Сан-Франциско"],
    "ja": ["サンフランシスコ"]
  },
  "importance": 16,
  "is_popular": true,
  "county": {
    "default": [ "San Francisco City and County", "San Francisco", "SF"],
    "ru": ["Сан-Франциско"]
  },
  "is_highway": false,
  "administrative": ["California"],
  "population": 815358,
  "postcode": ["94101", "94102", "94103", "94104", ….
  ]
}
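
Records in this shape are then pushed to the indices. As an illustration, batch indexing with a recent version of the Algolia Python client could look something like this (the credentials, index name and batch size are placeholders):

from algoliasearch.search_client import SearchClient

client = SearchClient.create("YOUR_APP_ID", "YOUR_ADMIN_API_KEY")  # placeholder credentials
index = client.init_index("places_planet")                          # hypothetical index name

def index_records(records):
    """Push generated records to the index in batches of 1,000."""
    batch = []
    for record in records:
        batch.append(record)
        if len(batch) == 1000:
            index.save_objects(batch, {"autoGenerateObjectIDIfNotExist": True})
            batch = []
    if batch:
        index.save_objects(batch, {"autoGenerateObjectIDIfNotExist": True})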

 

Step 3: Index configuration

We constructed the underlying Algolia indices with the following configuration:

  • Search works with all localized names, but considers the default name as the most important
  • The names are considered more important than the city, the postcode, the county, the administrative area, the suburb, the village and even the country.
  • We make sure to rank countries above cities and cities above streets, and to surface the most highly populated places first.
  • In case the population is not set, we fall back on OSM’s importance field, which reflects the importance of the underlying place.
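
Here is a rough sketch of what such a configuration could look like through the Algolia Python client. The attribute names come from the record schema above and the priorities follow the bullets just listed, but this is an approximation rather than the exact production settings.

from algoliasearch.search_client import SearchClient

client = SearchClient.create("YOUR_APP_ID", "YOUR_ADMIN_API_KEY")  # placeholder credentials
index = client.init_index("places_planet")                          # hypothetical index name

index.set_settings({
    # Localized names all matter, but the default name matters most.
    "searchableAttributes": [
        "unordered(locale_names.default)",
        "unordered(locale_names)",
        "unordered(city)",
        "unordered(postcode)",
        "unordered(county)",
        "unordered(administrative)",
        "unordered(suburb)",
        "unordered(village)",
        "unordered(country)",
    ],
    # Countries above cities, cities above streets, the most populated
    # places first, with OSM's importance as a fallback when the
    # population is missing.
    "customRanking": [
        "desc(is_country)",
        "desc(is_city)",
        "desc(population)",
        "desc(importance)",
    ],
})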

Step 4: The search strategy

We built a custom API endpoint to query the indices and implemented a custom querying strategy:

If the query is performed from your backend (and therefore doesn’t specify any aroundLatLng query parameter) or your source IP address isn’t geo-localized, the results will come from all around the world because we target the “planet” index.

Or, if the query is performed from the end-user browser or device (and hence specifies an aroundLatLng query parameter) or has a source IP address that can be geo-localized, the results will be composed of:

  • Nearby places
    • Places around you (<10km): this is a query using our aroundRadius feature,
    • Places in your country: this is a query targeting the specific country index,
  • Popular places all around the world using a query targeting the “planet” index.
  • Specifying a country query parameter will override this behavior, restricting the results to the specified countries only.
  • Numerical tokens in the query string are treated as optional words to make sure we always find the address even if the postcode & house number are wrong.
  • We also define a list of stopwords and tag every stopword of the query string as optional.
  • We query the underlying indices multiple times to ensure that:
    • Popular places will always be retrieved first.
    • Nearby places will always rank better than places in your country, which rank better than world-wide results.
    • If both a city and an address match the query, the city will be retrieved first.
  • If the query doesn’t retrieve any results, we fall back on a degraded query strategy where all words are considered optional and we only target cities.
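
Putting these rules together, here is a simplified Python sketch of that querying strategy. The index names and the extract_optional_words helper are hypothetical, and the real endpoint performs this orchestration (including the final merge) server-side.

from algoliasearch.search_client import SearchClient

client = SearchClient.create("YOUR_APP_ID", "YOUR_SEARCH_API_KEY")  # placeholder credentials

def extract_optional_words(query):
    """Hypothetical helper: numeric tokens (postcode, house number) are made optional."""
    return [token for token in query.split() if token.isdigit()]

def places_query(query, lat_lng=None, country_code=None):
    """Sketch of the strategy above (assumes country_code was resolved from the IP/position)."""
    planet = client.init_index("places_planet")              # hypothetical index names
    params = {"optionalWords": extract_optional_words(query)}

    if lat_lng is None:
        # Backend query with no usable geo information: world-wide results only.
        return [planet.search(query, params)]

    # 1. Places around the end user (~10km radius).
    nearby = planet.search(query, {**params,
                                   "aroundLatLng": lat_lng,
                                   "aroundRadius": 10_000})
    # 2. Places in the end user's country.
    in_country = client.init_index(f"places_{country_code}").search(query, params)
    # 3. Popular places all around the world.
    popular = planet.search(query, params)

    # Merging keeps popular places first, then nearby places, then
    # in-country places, then world-wide results.
    return [popular, nearby, in_country]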

The Result

We’ve had a great time building Algolia Places and the results today make all those coffee-fueled sleepless nights absolutely worth it. Here’s a quick look at them:

  • We’re hosting the Algolia Places infrastructure on 1 Algolia Cluster + 9 DSN replicas
    • 3 main servers in Germany
    • 4 DSN servers in France
    • 4 DSN servers in the US
    • 1 DSN server in Singapore
  • Our Algolia Places infrastructure is currently processing 60 queries per second and provides answers in 20-30ms on average.
  • Our infrastructure is still at <3% of its capacity.
  • We’ve reached 3,200 stars on the GitHub repository of our frontend JavaScript library, along with a bunch of positive feedback.

Want to use it?

Go for it, it could be FREE!

We’re providing a non-authenticated free plan limited to 1,000 queries per day, and we increase the free plan usage to 100,000 queries per month if you sign up.

Try Algolia Places now!