
Fixing a bad Google address autocomplete schema parser

One of my clients relies heavily on Google address autocomplete to provide a mapped address schema as well as GPS latitude and longitude coordinates. This information is used for a variety of purposes and plays a large part in their sales, quoting, and fulfillment process.

The client relied on custom code to convert the Google address schema into the one used internally to represent their address domain.

As more and more orders were quoted using autocomplete addresses provided by Google, agents in the business more frequently reported addresses that differed from what either the customer or the operator had expected.

What am I up against?

The main culprits were neighborhoods, military bases, townships, places, and other addresses that don't fit the usual shape of a simple street, city, state, and postal code.

What began as a very simple parser and mapping tool quickly devolved into something that wasn't reliable and often resulted in addresses that did not represent what Google reported.

Rather than try to explain what's going on, I'll just show the various snippets and maybe explain some bits here and there. Before I get too far into it: the documentation on how a developer should interpret a Google autocomplete result is a huge mystery.

Given that, there were two versions of this autocomplete address parsing. Each seemed to be written with no observable standards but maybe both implementations derived some inspiration from the other:

reKeyGeography (geography) {

      const keyMap = {
          administrative_area_level_1: 'addressRegion',
          country: 'addressCountry',
          locality: 'addressLocality',
          postal_code: 'postalCode',
          route: 'streetAddress2',
          street_number: 'streetAddress1',
          sublocality_level_1: 'addressLocality',
          administrative_area_level_3: 'addressLocality',
          neighborhood: 'addressLocality'
      }
      let nextGeography = {}

      for (let i of Object.keys(geography)) {

          const key = keyMap[i] ? keyMap[i] : i

          Object.assign(nextGeography, {
              [key]: geography[i]
          })
      }

      if (nextGeography.streetAddress1 && nextGeography.streetAddress2) {

          Object.assign(nextGeography, {
              streetAddress: `${nextGeography.streetAddress1} ${nextGeography.streetAddress2}`
          })

          delete nextGeography.streetAddress1
          delete nextGeography.streetAddress2
      }

      if (moment().diff(moment(_initData.postalCodeSuffixEnabledAsOfDate, 'YYYY-MM-DD'), 'days') >= 0) {
        if (typeof geography.postal_code_suffix !== 'undefined' &&
          typeof nextGeography.postalCode !== 'undefined') {
          nextGeography = {
            ...nextGeography,
            postalCode: `${nextGeography.postalCode}-${geography.postal_code_suffix}`
          }
        }
      }

      return nextGeography
  }

  reverseGeocode () {

      const { jobsite, onSetJobsite } = this.props
      const { latitude: lat, longitude: lng } = jobsite

      const geocoder = new google.maps.Geocoder

      geocoder.geocode({ location: { lat, lng }}, (results, status) => {

          if (status === google.maps.GeocoderStatus.OK) {

              let place = results[0]

              if (place) {

                  // Preserve original name
                  Object.assign(place, {
                      name: jobsite.addressLocality
                  })

                  const geography = this.getJobsiteGeography(place, true)

                  onSetJobsite(geography)

              } else {
                  alert('No results found')
              }

          } else {
              alert('Geocoder failed due to: ' + status)
          }
      })
  }

  getJobsiteGeography (place, isReverseGeo = false) {

      let geography = {
          isComplete: false,
          isEligableForGeocode: false,
          jobsite: null
      }
      const { types } = place

      // Bail, invalid address. No geometry, no types
      // ex. 1063 U.S. 1, NJ, United States
      if (types === undefined) {
          return geography
      }

      // Establishment; route; neighborhood
      // ex. Angel Stadium of Anaheim, East Gene Autry Way, Anaheim, CA, United States
      // ex. Camp Dodge, Johnston, IA 50131, USA
      if (types.indexOf('establishment') > -1 ||
          types.indexOf('route') > -1 ||
          types.indexOf('neighborhood') > -1) {

          return Object.assign({}, geography, {
              isComplete: true,
              jobsite: this.formatGeography(place, place.name)
          })
      }

      // Premise; street address
      // ex. 1063 McGaw Avenue, Irvine, CA, United States
      if (types.indexOf('premise') > -1 ||
          types.indexOf('street_address') > -1) {

          const name = isReverseGeo === true ? place.name : ''

          return Object.assign({}, geography, {
              isComplete: true,
              jobsite: this.formatGeography(place, name)
          })
      }

      // Military base; city, state
      // ex. March Air Reserve Base, CA, United States
      // ex. Irvine, CA
      if (types.indexOf('locality') > -1) {

          return Object.assign({}, geography, {
              isEligableForGeocode: true,
              jobsite: this.formatGeography(place)
          })
      }

      // Zip code; suburb
      // ex. 92614
      // ex. Brooklyn, NY, United States
      if (types.indexOf('postal_code') > -1 ||
          types.indexOf('sublocality_level_1') > -1) {

          return Object.assign({}, geography, {
              isEligableForGeocode: true,
              jobsite: this.formatGeography(place)
          })
      }
  }

And the second:

function parseGoogleAddress(_addressObj) {
    const address = {}
    const addressObj = _addressObj || {}
    const mappedAddress = {}
    let component, compLen, i, j, type, typeLen

    if (addressObj.geometry && addressObj.geometry.location) {
        if (typeof addressObj.geometry.location.lat === 'function') {
            address.latitude = addressObj.geometry.location.lat()
        }
        if (typeof addressObj.geometry.location.lng === 'function') {
            address.longitude = addressObj.geometry.location.lng()
        }
    }

    if (typeof addressObj.address_components !== 'undefined' && addressObj.address_components !== null) {
        compLen = addressObj.address_components.length
        for (i = 0; i < compLen; i++) {
            component = addressObj.address_components[i]
            if (typeof component.types !== 'undefined' && component.types !== null) {
                typeLen = component.types.length
                for (j = 0; j < typeLen; j++) {
                    type = component.types[j]
                    mappedAddress[type] = component.short_name
                }
            }
        }
    }

    const streetAddress = []
    if (mappedAddress.street_number) {
        streetAddress.push(mappedAddress.street_number)
    }
    if (mappedAddress.route) {
        streetAddress.push(mappedAddress.route)
    }

    address.streetAddress = streetAddress.join(' ') || ''
    // Begin locality mapping
    //  - For now, we need to observe some special cases. From the Google Docs:
    //
    //   "Brooklyn and other parts of New York City do not use the city as part of the address. They use sublocality_level_1 instead."
    address.addressLocality = ''
    if (mappedAddress.neighborhood && !mappedAddress.locality) {
        address.addressLocality = mappedAddress.neighborhood
    } else if (mappedAddress.administrative_area_level_3 && !mappedAddress.locality) {
        address.addressLocality = mappedAddress.administrative_area_level_3
    } else if (mappedAddress.sublocality && !mappedAddress.locality) {
        // I notice sometimes the sub-locality comes through as `sublocality` as well as `sublocality_level_1`
        address.addressLocality = mappedAddress.sublocality
    } else if (mappedAddress.sublocality_level_1 && !mappedAddress.locality) {
        address.addressLocality = mappedAddress.sublocality_level_1
    } else if (mappedAddress.locality) {
        address.addressLocality = mappedAddress.locality
    }
    // End locality mapping
    address.addressNeighborhood = mappedAddress.neighborhood || ''
    address.addressRegion = mappedAddress.administrative_area_level_1 || ''
    address.postalCode = mappedAddress.postal_code || ''
    address.addressCountry = mappedAddress.country || ''
    address.postalCode = address.postalCode + '-' + mappedAddress.postal_code_suffix

    const label = []
    if (address.addressLocality !== '') {
        label.push(address.addressLocality)
    }
    if (address.addressRegion !== '') {
        label.push(address.addressRegion)
    }
    const locationLabel = label.join(', ')
    address.label = locationLabel

    return address
}

Obviously, both of these snippets look very crude and endlessly prone to bugs, never mind that additional maintenance only adds more complexity.
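To make two of the failure modes concrete, here is a minimal sketch reduced from the snippets above. The input object is made up for illustration (it is not a captured Google response), and the key order is an assumption. The point is only that when several Google types map to the same target key, whichever is visited last wins, and that appending a suffix unconditionally produces garbage when the suffix is missing.

```javascript
// Reduced copy of the keyMap from the first snippet: three Google types
// all collapse onto the same target key.
const keyMap = {
    locality: 'addressLocality',
    administrative_area_level_3: 'addressLocality',
    neighborhood: 'addressLocality'
}

function reKey (geography) {
    const next = {}
    for (const key of Object.keys(geography)) {
        // Later keys silently overwrite earlier ones.
        next[keyMap[key] || key] = geography[key]
    }
    return next
}

// Hypothetical input: the township is listed after the city.
const rekeyed = reKey({
    locality: 'Indianapolis',
    administrative_area_level_3: 'Wayne Township'
})

console.log(rekeyed.addressLocality) // 'Wayne Township'

// The second snippet appends the suffix even when Google did not return one.
const missingSuffix = undefined
const postalCode = '12065' + '-' + missingSuffix

console.log(postalCode) // '12065-undefined'
```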

Lots of Googling

Since I knew the existing implementations were about to reach the end of their life, I started researching the right way to approach a problem like this.

A few quick Google searches, asking a lot of different questions, turned up a ton of content.

First I wanted to just understand how Google represented addresses, and how the geocoder worked. Google has a great set of API specs and these docs really helped me to understand their address schema:

With that information in hand, I wanted to humor myself and query the Stack Overflow pool for any potential slam-dunk answers.

This question provided the most insight into how to understand a given API response. This was when I understood the concept of priority of single and combinations of address component field types:

What I mean by that is, if you take a look at the Google API docs, you'll see that there are a few dozen different address component types, only a small set of which are ever returned for a given address.

Any address field, let's use the city name as an example, could be mapped from several address component types as referenced by Google. These types could be, sorted by priority:

  • locality
  • administrative_area_level_3
  • administrative_area_level_4
  • administrative_area_level_5
  • administrative_area_level_6
  • administrative_area_level_7
  • sublocality
  • sublocality_level_1
  • sublocality_level_2
  • sublocality_level_3
  • sublocality_level_4
  • sublocality_level_5
  • neighborhood
  • postal_town

So if a response has both locality and administrative_area_level_3, use the value of locality to represent the city name in the address.
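As a rough sketch of that rule, the helper below walks a priority list and returns the first type present in a response. `pickByPriority`, the shortened priority list, and the sample components are made up for illustration; only the `address_components` entry shape (`types`, `long_name`, `short_name`) comes from Google's response format.

```javascript
// City-name candidates, highest priority first (shortened for the example).
const cityPriority = [
    'locality',
    'administrative_area_level_3',
    'sublocality',
    'neighborhood'
]

// Returns the long_name of the component matching the highest-priority type,
// or undefined when none of the types are present.
function pickByPriority (addressComponents, priority) {
    for (const type of priority) {
        const match = addressComponents.find(c => (c.types || []).includes(type))
        if (match) return match.long_name
    }
    return undefined
}

// Hand-written sample, not a captured API response.
const sampleComponents = [
    { long_name: 'Wayne Township', short_name: 'Wayne Township', types: ['administrative_area_level_3', 'political'] },
    { long_name: 'Indianapolis', short_name: 'Indianapolis', types: ['locality', 'political'] }
]

console.log(pickByPriority(sampleComponents, cityPriority)) // 'Indianapolis'
```

Note that the order of the components in the response does not matter here; only the order of the priority list does.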

OK, that's cool.

I also looked for

This was something I found just searching for things like natural language parsing and machine learning. None of it was anything the client's systems were in any way, shape, or form capable of doing, but it was a fun thought experiment I suppose.

Since both Google's and the client's address schemas were represented in JSON, I wondered if there was a way to just put one through some kind of interpreter and have it spit out the result:

It initially sounded really promising, but as I read into it more it felt more theoretical or academic and not something that was in a form ready to be consumed by an end user.

So, armed with all of the above knowledge, I set out to create my implementation.

Understand the target address schema

The address the client used assumed it was a physical location in the United States.

Each address had:

An address country, represented by an ISO 3166-1 alpha-2 country code.

An address locality and an address region, terms I learned about via Wikipedia.

A postal code, which could be a ZIP or ZIP+4.

And finally, the street address and street address 2 lines.
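To pin that shape down, here is a small sketch of what such an address object and a shape check might look like. The field names follow the snippets earlier in the article; `looksLikeClientAddress` and both regular expressions are assumptions for illustration, not the client's actual validation.

```javascript
// Example of the target shape; the values are illustrative.
const exampleClientAddress = {
    streetAddress: '800 W Katella Ave',
    streetAddress2: '',
    addressLocality: 'Anaheim',
    addressRegion: 'CA',
    postalCode: '92802',
    addressCountry: 'US'
}

// ZIP (5 digits) or ZIP+4 (5 digits, hyphen, 4 digits).
const US_POSTAL_CODE = /^\d{5}(-\d{4})?$/
// ISO 3166-1 alpha-2 is two uppercase letters (shape check only, not a registry lookup).
const ALPHA_2 = /^[A-Z]{2}$/

function looksLikeClientAddress (address) {
    return typeof address.streetAddress === 'string' &&
        typeof address.addressLocality === 'string' && address.addressLocality !== '' &&
        typeof address.addressRegion === 'string' && address.addressRegion !== '' &&
        US_POSTAL_CODE.test(address.postalCode) &&
        ALPHA_2.test(address.addressCountry)
}

console.log(looksLikeClientAddress(exampleClientAddress)) // true
```

A check like this would also have flagged a postal code such as `92802-undefined` before it reached a quote.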

So what was breaking?

Jobsite Address generates some bad results:

Cities were replaced by townships:

"7600 Rockville Rd, Indianapolis, IN 46214" became "7600 Rockville Rd, Wayne Township, IN, 46214, US"

Sometimes, the correct city name would not get returned:

"4 HALFMOON Crossing Blvd, Halfmoon, NY 12065" became "4 Halfmoon Crossing, Clifton Park, NY, 12065, US"

Lastly, addresses that were considered Places (without a real street address per se) were incorrect:

"Anaheim Convention Center" became "Anaheim Convention Center, Anaheim, CA, 92802, US"

What should it look like?

Before I fixed it, I really had to understand what result was expected after entering these addresses into the client's system.

What I discovered was that cities should not be replaced by townships:

"7600 Rockville Rd, Indianapolis, IN 46214" should become "7600 Rockville Rd, Indianapolis, IN 46214, USA"

City names should always be returned:

"4 HALFMOON Crossing Blvd, Halfmoon, NY 12065" should become "Home Depot, 4 Halfmoon Crossing, Clifton Park, NY 12065-4171, US"

Lastly, addresses considered as Places should be correct:

"Anaheim Convention Center" should become "Anaheim Convention Center, 800 W Katella Ave, Anaheim, CA 92802, US"

Now I had a pretty clear understanding of what I should expect and how I should parse and map the Google address schema to the one used by the client.

What I ended up with was the below:

Converting Google address schema to Client

digraph G {

    graph [pad="0.5", ranksep="5"];
    node [shape=plain]
    rankdir=LR;

    subgraph cluster_google_address_schema {

        label="Google Address Schema"

        google_address_components [
            shape=plaintext
            label=<
                <TABLE BORDER="0" CELLBORDER="1" CELLSPACING="0">
                    <TR><TD PORT="administrative_area_level_1">administrative_area_level_1</TD></TR>
                    <TR><TD PORT="administrative_area_level_2">administrative_area_level_2</TD></TR>
                    <TR><TD PORT="administrative_area_level_3">administrative_area_level_3</TD></TR>
                    <TR><TD PORT="administrative_area_level_4">administrative_area_level_4</TD></TR>
                    <TR><TD PORT="administrative_area_level_5">administrative_area_level_5</TD></TR>
                    <TR><TD PORT="administrative_area_level_6">administrative_area_level_6</TD></TR>
                    <TR><TD PORT="administrative_area_level_7">administrative_area_level_7</TD></TR>
                    <TR><TD PORT="airport">airport</TD></TR>
                    <TR><TD PORT="bus_station">bus_station</TD></TR>
                    <TR><TD PORT="colloquial_area">colloquial_area</TD></TR>
                    <TR><TD PORT="country">country</TD></TR>
                    <TR><TD PORT="establishment">establishment</TD></TR>
                    <TR><TD PORT="floor">floor</TD></TR>
                    <TR><TD PORT="intersection">intersection</TD></TR>
                    <TR><TD PORT="landmark">landmark</TD></TR>
                    <TR><TD PORT="locality">locality</TD></TR>
                    <TR><TD PORT="natural_feature">natural_feature</TD></TR>
                    <TR><TD PORT="neighborhood">neighborhood</TD></TR>
                    <TR><TD PORT="point_of_interest">point_of_interest</TD></TR>
                    <TR><TD PORT="park">park</TD></TR>
                    <TR><TD PORT="parking">parking</TD></TR>
                    <TR><TD PORT="plus_code">plus_code</TD></TR>
                    <TR><TD PORT="political">political</TD></TR>
                    <TR><TD PORT="post_box">post_box</TD></TR>
                    <TR><TD PORT="postal_code">postal_code</TD></TR>
                    <TR><TD PORT="postal_code_suffix">postal_code_suffix</TD></TR>
                    <TR><TD PORT="postal_town">postal_town</TD></TR>
                    <TR><TD PORT="premise">premise</TD></TR>
                    <TR><TD PORT="room">room</TD></TR>
                    <TR><TD PORT="route">route</TD></TR>
                    <TR><TD PORT="street_address">street_address</TD></TR>
                    <TR><TD PORT="street_number">street_number</TD></TR>
                    <TR><TD PORT="sublocality">sublocality</TD></TR>
                    <TR><TD PORT="sublocality_level_1">sublocality_level_1</TD></TR>
                    <TR><TD PORT="sublocality_level_2">sublocality_level_2</TD></TR>
                    <TR><TD PORT="sublocality_level_3">sublocality_level_3</TD></TR>
                    <TR><TD PORT="sublocality_level_4">sublocality_level_4</TD></TR>
                    <TR><TD PORT="sublocality_level_5">sublocality_level_5</TD></TR>
                    <TR><TD PORT="subpremise">subpremise</TD></TR>
                    <TR><TD PORT="train_station">train_station</TD></TR>
                    <TR><TD PORT="transit_station">transit_station</TD></TR>
                </TABLE>
            >
        ]
    }

    subgraph cluster_convert_google_to_us_address {

        label="Google Address to Client US Address"

        subgraph cluster_build_street_address {

            label="Build Street Address"

            build_client_street_address [
                shape=rectangle
                label=<
                    <TABLE BORDER="0" CELLBORDER="1" CELLSPACING="0">
                        <tr><td><b>First Available</b></td></tr>
                        <TR><TD PORT="premise_and_street_address">"premise, street_address"</TD></TR>
                        <TR><TD PORT="premise_and_route_and_street_number">"premise, street_number + route"</TD></TR>
                    </TABLE>
                >
            ]
        }

        subgraph cluster_build_street_address_2 {

            label="Build Street Address 2"

            build_client_street_address_2 [
                shape=rectangle
                label=<
                    <TABLE BORDER="0" CELLBORDER="1" CELLSPACING="0">
                        <TR><TD PORT="floor_and_room">"floor, room"</TD></TR>
                    </TABLE>
                >
            ]
        }

        subgraph cluster_build_address_locality {

            label="Build Address Locality"

            build_client_address_locality [
                shape=rectangle
                label=<
                    <TABLE BORDER="0" CELLBORDER="1" CELLSPACING="0">
                        <tr><td><b>First Available</b></td></tr>
                        <TR><TD PORT="locality">locality</TD></TR>
                        <TR><TD PORT="administrative_area_level_3">administrative_area_level_3</TD></TR>
                        <TR><TD PORT="administrative_area_level_4">administrative_area_level_4</TD></TR>
                        <TR><TD PORT="administrative_area_level_5">administrative_area_level_5</TD></TR>
                        <TR><TD PORT="administrative_area_level_6">administrative_area_level_6</TD></TR>
                        <TR><TD PORT="administrative_area_level_7">administrative_area_level_7</TD></TR>
                        <TR><TD PORT="sublocality">sublocality</TD></TR>
                        <TR><TD PORT="sublocality_level_1">sublocality_level_1</TD></TR>
                        <TR><TD PORT="sublocality_level_2">sublocality_level_2</TD></TR>
                        <TR><TD PORT="sublocality_level_3">sublocality_level_3</TD></TR>
                        <TR><TD PORT="sublocality_level_4">sublocality_level_4</TD></TR>
                        <TR><TD PORT="sublocality_level_5">sublocality_level_5</TD></TR>
                        <TR><TD PORT="neighborhood">neighborhood</TD></TR>
                        <TR><TD PORT="postal_town">postal_town</TD></TR>
                    </TABLE>
                >
            ]
        }

        subgraph cluster_build_address_region {

            label="Build Address Region"

            build_client_address_region [
                shape=rectangle
                label=<
                    <TABLE BORDER="0" CELLBORDER="1" CELLSPACING="0">
                        <tr><td><b>First Available</b></td></tr>
                        <TR><TD PORT="administrative_area_level_1">administrative_area_level_1</TD></TR>
                        <TR><TD PORT="administrative_area_level_2">administrative_area_level_2</TD></TR>
                    </TABLE>
                >
            ]
        }

        subgraph cluster_build_postal_code {

            label="Build Postal Code"

            build_client_postal_code [
                shape=rectangle
                label=<
                    <TABLE BORDER="0" CELLBORDER="1" CELLSPACING="0">
                        <TR><TD PORT="postal_code_plus_suffix">"postal_code-postal_code_suffix"</TD></TR>
                    </TABLE>
                >
            ]
        }

        subgraph cluster_build_address_country {

            label="Build Address Country"

            build_client_address_country [
                shape=rectangle
                label=<
                    <TABLE BORDER="0" CELLBORDER="1" CELLSPACING="0">
                        <TR><TD PORT="address_country">address_country</TD></TR>
                    </TABLE>
                >
            ]
        }
    }

    subgraph cluster_client_address_schema {

        label="Client Address Schema"

        client_address [
            shape=plaintext
            label=<
                <TABLE BORDER="0" CELLBORDER="1" CELLSPACING="0">
                    <TR><TD PORT="client_street_address">Street Address</TD></TR>
                    <TR><TD PORT="client_street_address_2">Street Address 2</TD></TR>
                    <TR><TD PORT="client_address_locality">Address Locality</TD></TR>
                    <TR><TD PORT="client_address_region">Address Region</TD></TR>
                    <TR><TD PORT="client_postal_code">Postal Code</TD></TR>
                    <TR><TD PORT="client_address_country">Address Country</TD></TR>
                </TABLE>
            >
        ]
    }

    google_address_components:country -> build_client_address_country:address_country

    google_address_components:postal_code -> build_client_postal_code:postal_code_plus_suffix
    google_address_components:postal_code_suffix -> build_client_postal_code:postal_code_plus_suffix

    google_address_components:administrative_area_level_1 -> build_client_address_region:administrative_area_level_1
    google_address_components:administrative_area_level_2 -> build_client_address_region:administrative_area_level_2

    google_address_components:administrative_area_level_3 -> build_client_address_locality:administrative_area_level_3
    google_address_components:administrative_area_level_4 -> build_client_address_locality:administrative_area_level_4
    google_address_components:administrative_area_level_5 -> build_client_address_locality:administrative_area_level_5
    google_address_components:administrative_area_level_6 -> build_client_address_locality:administrative_area_level_6
    google_address_components:administrative_area_level_7 -> build_client_address_locality:administrative_area_level_7
    google_address_components:locality -> build_client_address_locality:locality
    google_address_components:neighborhood -> build_client_address_locality:neighborhood
    google_address_components:postal_town -> build_client_address_locality:postal_town
    google_address_components:sublocality -> build_client_address_locality:sublocality
    google_address_components:sublocality_level_1 -> build_client_address_locality:sublocality_level_1
    google_address_components:sublocality_level_2 -> build_client_address_locality:sublocality_level_2
    google_address_components:sublocality_level_3 -> build_client_address_locality:sublocality_level_3
    google_address_components:sublocality_level_4 -> build_client_address_locality:sublocality_level_4
    google_address_components:sublocality_level_5 -> build_client_address_locality:sublocality_level_5

    google_address_components:street_address -> build_client_street_address:premise_and_street_address
    google_address_components:route -> build_client_street_address:premise_and_route_and_street_number
    google_address_components:street_number -> build_client_street_address:premise_and_route_and_street_number
    google_address_components:premise -> build_client_street_address:premise_and_street_address
    google_address_components:premise -> build_client_street_address:premise_and_route_and_street_number

    google_address_components:floor -> build_client_street_address_2:floor_and_room
    google_address_components:room -> build_client_street_address_2:floor_and_room

    build_client_street_address -> client_address:client_street_address
    build_client_street_address_2 -> client_address:client_street_address_2
    build_client_address_locality -> client_address:client_address_locality
    build_client_address_region -> client_address:client_address_region
    build_client_postal_code -> client_address:client_postal_code
    build_client_address_country -> client_address:client_address_country
}

Finally I could start designing an implementation.

I'd do it in 3 parts:

Map all of the Google address information

  1. Create google address class
  2. In a single pass, populate the Google address object using all of the address components and all other properties in the root of the object

Basically, this was just:

function getComponentName (component) {
    if (typeof component.types === 'undefined') {
        throw new Error('Unknown input - missing `types`')
    }
    // Keep only the known keys that appear in this component's types, in key-list order
    const intersection = googleAddressComponentKeys.filter(value => component.types.includes(value))
    if (intersection.length === 0) {
        throw new Error('Unable to determine component name')
    }
    return intersection[0]
}

function mapGoogleAddressFromAddressComponents (in_googleAddress = {}) {
    const googleAddress = {}
    const { address_components = [] } = in_googleAddress

    for (let i = 0; i < address_components.length; i++) {
        const component = address_components[i];
        const componentName = getComponentName(component)
        googleAddress[componentName] = component[googleAddressComponentsNameKey[componentName]]
    }

    return googleAddress
}

googleAddressComponentKeys and googleAddressComponentsNameKey were just lookup tables I created so I could quickly change how Google results are interpreted without adding a bunch of code.
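For readers wondering what those two tables could look like, here is a guess at their shape, inferred only from how they are used above. The entries, the choice of `short_name` versus `long_name`, and the `readComponent` helper are all assumptions, and the list is shortened.

```javascript
// Assumed shape: an ordered list of the component types we care about.
// A type such as 'political' is deliberately absent, so it is never picked.
const googleAddressComponentKeys = [
    'street_number',
    'route',
    'locality',
    'administrative_area_level_3',
    'administrative_area_level_1',
    'country',
    'postal_code',
    'postal_code_suffix'
]

// Assumed shape: which field of a component to read for each type.
const googleAddressComponentsNameKey = {
    street_number: 'long_name',
    route: 'short_name',
    locality: 'long_name',
    administrative_area_level_3: 'long_name',
    administrative_area_level_1: 'short_name', // 'IN' rather than 'Indiana'
    country: 'short_name',                     // 'US' rather than 'United States'
    postal_code: 'long_name',
    postal_code_suffix: 'long_name'
}

// Hypothetical helper showing how the two tables work together.
function readComponent (component) {
    const name = googleAddressComponentKeys.find(key => component.types.includes(key))
    return name ? { [name]: component[googleAddressComponentsNameKey[name]] } : {}
}

const read = readComponent({
    long_name: 'Indiana',
    short_name: 'IN',
    types: ['administrative_area_level_1', 'political']
})

console.log(read) // { administrative_area_level_1: 'IN' }
```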

Create the client address schema

  1. Create a client address factory
  2. Create a generator for each property on the client address
  3. Create prioritized lists for each client address property

When I say prioritized lists, I mean what I talked about at the beginning of this article: which Google address components take precedence over others when more than one is available.

Once I had these two things done I could then:

  1. Pass the Google address object to the client address factory
  2. Populate the lists with functions that each return a value for a client address property based on the Google address object. These would be similar to
"street_number" + "route" => streetAddress

and so on, ordered by which should be used first when one is fully populated and others are not.
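A minimal sketch of such a list for the street address is below. The rule set and the `firstResult` helper are assumptions for illustration, and `premise`, which the diagram above also feeds into the street address, is left out to keep the example short.

```javascript
// Each rule returns a string when all of its inputs are present, otherwise undefined.
// Rules are ordered from most to least preferred.
const streetAddressRules = [
    g => g.street_address,
    g => (g.street_number && g.route) ? `${g.street_number} ${g.route}` : undefined
]

// Runs the rules in order and returns the first non-blank string result.
function firstResult (rules, googleAddress) {
    for (const rule of rules) {
        const value = rule(googleAddress)
        if (typeof value === 'string' && value.trim() !== '') return value
    }
    return undefined
}

const street = firstResult(streetAddressRules, {
    street_number: '7600',
    route: 'Rockville Rd'
})

console.log(street) // '7600 Rockville Rd'
```

Keeping each rule as a small function means a new special case is one more entry in the list rather than another branch in an if/else chain.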

The final implementation of this looked like:

function buildClientAddressLocality (googleAddress = {}) {
    const addressLocality = [
        googleAddress.locality,
        googleAddress.administrative_area_level_3,
        googleAddress.administrative_area_level_4,
        googleAddress.administrative_area_level_5,
        googleAddress.administrative_area_level_6,
        googleAddress.administrative_area_level_7,
        googleAddress.sublocality,
        googleAddress.sublocality_level_1,
        googleAddress.sublocality_level_2,
        googleAddress.sublocality_level_3,
        googleAddress.sublocality_level_4,
        googleAddress.sublocality_level_5,
        googleAddress.neighborhood,
        googleAddress.postal_town,
    ].filter(x => typeof x !== 'undefined' && x.trim() !== '')[0]
    if(typeof addressLocality === 'undefined')
        throw new Error('Unable to build address locality')
    // Take the first available
    return addressLocality
}

and a function like it was created for each of the client address components shown in the diagram above.
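As an illustration of the same pattern applied to another field, a postal code builder might look roughly like the sketch below. The body is an assumption, not the code that shipped; it only shows the suffix being appended when Google actually returned one.

```javascript
// Sketch of a postal code builder following the locality builder's conventions.
function buildClientPostalCode (googleAddress = {}) {
    const { postal_code: postalCode, postal_code_suffix: suffix } = googleAddress

    if (typeof postalCode !== 'string' || postalCode.trim() === '')
        throw new Error('Unable to build postal code')

    // Only append the +4 suffix when one is present.
    return (typeof suffix === 'string' && suffix.trim() !== '')
        ? `${postalCode}-${suffix}`
        : postalCode
}

console.log(buildClientPostalCode({ postal_code: '12065', postal_code_suffix: '4171' })) // '12065-4171'
console.log(buildClientPostalCode({ postal_code: '46214' })) // '46214'
```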

All of these business-rule functions then fed into the central place where the client's address schema object was built.

function buildClientAddress (googleAddress = {}) {
    const clientAddress = {
        addressCountry: buildClientAddressCountry(googleAddress),
        addressLocality: buildClientAddressLocality(googleAddress),
        addressRegion: buildClientAddressRegion(googleAddress),
        streetAddress: buildClientStreetAddress(googleAddress),
        streetAddress2: buildClientStreetAddress2(googleAddress),
        postalCode: buildClientPostalCode(googleAddress)
    };

    return clientAddress
}

I also had some fun writing some tests for this, learning about glob patterns in Node.js, and technically breaking Google's TOS.

More on that later!