Force PushedFP
  • Home
  • Blog
  • Workbooks

How I fixed a broken product catalog search

The objective

I recently completed a project to refine a product catalog search, not so dissimilar to the one I did for company names in SalesForce.

For this project, I was tasked with improving and fixing search functionality for products. The search parameters here were that product names, product ids, and product akas should all be searchable and return the top 6 results. Product AKAs here are just alternative names the products go by, usually unofficial terms customers or people in the industry like to refer to them as.

Search was to be performed as-you-type and show up to 6 results at a time, each supposedly ranked by relevancy. Whatever relevancy meant here was unknown as there were no exact specifications on how search should work since the requirements for the feature were mostly developer-driven at the time.

This feature previously used Fuse.js to power the search, however it wasn't properly configured and the search results were hardly accurate or usable for the actual product requirements. Fuse JS is a great library, however it's pretty overkill for such a simple requirement such as this when only a couple of metrics are being searched against and the set of terms is fixed. Seemed like the entire feature was built on the assumption of a simple plug-and-play using this.

Breaking down the existing implementation

The first part of the existing search implementation to fix was how the search data was generated. What I found was that each searchable string, in this case the product name and all of the "AKAs", were passed through an overly complicated sequence of string manipulations, to "clean" the data I suppose.

catalog.products.map((product) => {
  product.searchString = product.name
    .concat(` ${product.akas}`)
    .replace(/[^\w\s!?]/g, '')
    .toLowerCase()
    .split(' ')
    .filter((item, i, all) => i === all.indexOf(item))
    .join(' ')
  return product
})

Once the search set had been completed, it was passed into a new Fuse constructor to do whatever magic needed to get ready to perform the search.

const fuse = new Fuse(catalog.products, {
  keys: ['searchString'],
  shouldSort: true
})

The above Fuse instantiation was also performed on every keystroke, which made didn't amake any sense.

The next part was the point where the input was captured and passed to the search algorithm. This is where the Fuse JS library was first introduced and revealed more patterns where reliance on 3rd party libraries was a little overdone. When I first dove into this, Fuse JS version 2.5 was being implemented but in that form it was fairly feature rich.

const list = fuse.search(e.target.value)
  .slice(0, 6)

The Fuse JS implementation was using the field composed in the Node JS backed named searchString and specifying sorting should be enabled, even though this was the default behavior.

What exactly happens when the search is performed is a black box, since for one the documentation is out of date as well as the documentation doesn't go into detail on how fuzzy searching is working or how things like ranking and sorting work. Without knowing this, I don't believe this is something that can be used reliably.

I'm not sure if there was some a motivation here to keep search "fast and efficient" or use a simple out of the box solution, but in the way it was implemented it was accomplishing the objective.

My approach

Since Fuse JS was not cutting it, I started from the beginning on what was available for search, and started asking more questions what the expected results were when entering in search strings.

I threw out all of the fuzzy search requirement, since trying to do this can soak up a lot of development time and the benefit of having it is far outweighed by the cost.

I found that the product schema was made up of the following:

  • Akas
  • Name
  • Product Id

Akas and Name I've already addressed but there was also a Product Id to go along with the results. I had originally planned on extending the Fuse JS code to accept the productId key since it looked like there was support for it. However, when I did the results didn't make any sense and trying to add weights to different keys felt really arbitrary and prone to error.

The first thing I would need to do was rebuild the search set. I kept the same toLowerCase operations on the akas and product names, but stayed away from any of the cleaning steps like removing special chars etc. There aren't many special characters in the product data, and the ones that are there are specific to the product itself indicating feet or inches etc.

I create 3 buckets containing the product data along with the respective index of where the product appears in the main product list. My plan for this index is I'll be using it to track hits on the search string and then later map it to the product in order to build the results list.

const akasArr = []
const productIdsArr = []
const productNamesArr = []
catalog.products.forEach((product, i) => {
  productIdsArr.push({
   i,
   productId: product.productId
  })
  akasArr.push({
   aka: (product.akas || '').toLowerCase(),
   i
  })
  productNamesArr.push({
   i,
   name: (product.name || '').toLowerCase()
  })
}

catalog.searchData = {
  akasArr,
  productIdsArr,
  productNamesArr
}

Once the new search set was complete, I could move onto implementing the new search algorithm. I created a simple interface which only took the catalog and the search string from the text input field. The output of this function would return the new product result set limited to the 6 results.

const matchedProducts = applyProductScoringAndSort(
  catalog,
  e.target.value
)

After speaking with a product manager, the search will follow these priorities:

  1. Attempt to match a product ID
  2. Match against any product names
  3. Match against any product akas

A product Id in this context is two integers separated by a dash. Since I can just do a simple regex on the search string to see if it matches the desired format, it's a simple check and won't negatively impact performance much at all. Performance isn't much of a concern anyways since this is a process that's running at a really low frequency anyways.

let productIdMatches = []
if(searchString.match(/\d+-/)) {
    productIdMatches = catalog.searchData.productIdsArr
        .filter(({ productId }) => productId.indexOf(searchString) === 0)
        .map(match => {
          return match.i
        })
}

If the search string looks like a product Id, I do a simple filter on all of the product Ids currently in the catalog, and track the indexes of the ones that matched.

Next I need to match on the product name which gets a little more complex than what was done for the product Id.

My plan here is to:

  1. Filter out all products not matching a computed regex of the search string
  2. Calculate the Levenshtein edit distance of each matching product name to the search string
  3. Calculate the intersection length of each matching product name to the search string
  4. Sort all matched products first by intersection length, and then by the Levenshtein edit distance.

Building a regex of the search string will match individual word tokens in the product name when the product name is more than one word.

In order to calculate the Levenshtein edit distance, this function wil be used:

export const levenshteinDistance = (str1 = '', str2 = '') => {
   const track = Array(str2.length + 1).fill(null).map(() =>
    Array(str1.length + 1).fill(null)
   );
   for (let i = 0; i <= str1.length; i += 1) {
      track[0][i] = i;
   }
   for (let j = 0; j <= str2.length; j += 1) {
      track[j][0] = j;
   }
   for (let j = 1; j <= str2.length; j += 1) {
      for (let i = 1; i <= str1.length; i += 1) {
         const indicator = str1[i - 1] === str2[j - 1] ? 0 : 1;
         track[j][i] = Math.min(
            track[j][i - 1] + 1, // deletion
            track[j - 1][i] + 1, // insertion
            track[j - 1][i - 1] + indicator, // substitution
         );
      }
   }
   return track[str2.length][str1.length];
};

To help me accomplish this, I'll have to 1) build a regex of the search string and 2) get all of the words that make up the search string itself.

const productNameSearchRegex = new RegExp(searchString.replace(' ', '|'))
const productNameSearchTokens = searchString.split(' ')

Having that, I can now calculate the edit distance and the intersection length in one pass through all of the products on each keypress.

The Levenshtein distance I've already mentioned how it's calculated, however the intersection length is simply the total of words existing in both the words of the search string and the words of the product name.

The intersection length is useful for search since the higher the number, obviously the more similar the search string is to the product name. Results with a high intersection score will most likely also have a lower Levenshtein edit distance.

catalog.searchData.productNamesArr
  .filter(({ name }) => productNameSearchRegex.test(name))
  .map(match => {
    const matchTokens = match.name.split(' ')
    return {
      ...match,
      editScore: levenshteinDistance(searchString, match.name),
      intersectionScore: productNameSearchTokens
        .filter(value => matchTokens.includes(value))
        .length
    }
  })

Once I have a list of matches, I perform a sort based off descending intersection length and then ascending Levenshtein edit distance. This will give me the most relevant results towards the top.

Since I only care about product indexes, I make sure to map the resulting array to the indexes only.

  [...].sort((a, b) => {
    if(a.intersectionScore !== b.intersectionScore) {
      return b.intersectionScore - a.intersectionScore
    }
    return a.editScore - b.editScore
  })
  .map(match => {
    return match.i
  })

This last bit is where I assemble the search results and the limit the length to 6 according to the product spec.

Keeping in mind the priority or results, I'm just concatenating all 3 of the arrays and then mapping them back to the actual products in order to get the product data. This will later be used to build the product list showing the name and making the product Id available for API calls.

const matchedProducts = [
    ...productIdMatches,  // Highest priority to product id matches
    ...productNameMatches,
    ...akasMatches,       // Lowest priority to aka matches
  ]
  .map(matchIndex => {
    return catalog.products[matchIndex]
  })
  .slice(0, 6)

Below is the new search function in its entirety, adopting the interface used above and is the function that's injected into the UI when performing the search as the text input field is interacted with.

export const productSearchAndSort = (catalog, rawSearchString) => {
  const searchString = rawSearchString.toLowerCase()

  let productIdMatches = []
  if(searchString.match(/\d+-/)) {
      // When search string prefix matches any amount of decimals and a dash:
      //  Find all product ids using a prefix match
      productIdMatches = catalog.searchData.productIdsArr
          .filter(({ productId }) => productId.indexOf(searchString) === 0)
          .map(match => {
            return match.i
          })
  }

  // Find all product names using an include match
  const productNameSearchRegex = new RegExp(searchString.replace(' ', '|'))
  const productNameSearchTokens = searchString.split(' ')
  const productNameMatches = catalog.searchData.productNamesArr
    .filter(({ name }) => productNameSearchRegex.test(name))
    .map(match => {
      const matchTokens = match.name.split(' ')
      return {
        ...match,
        editScore: levenshteinDistance(searchString, match.name),
        intersectionScore: productNameSearchTokens
          .filter(value => matchTokens.includes(value))
          .length
      }
    })
    .sort((a, b) => {
      if(a.intersectionScore !== b.intersectionScore) {
        return b.intersectionScore - a.intersectionScore
      }
      return a.editScore - b.editScore
    })
    .map(match => {
      return match.i
    })

  // Find all product names using an include match
  const akasMatches = catalog.searchData.akasArr
    .filter(({ aka }) => aka.indexOf(searchString) >= 0)
    .map(match => {
      return match.i
    })

  const matchedProducts = [
      ...productIdMatches,  // Highest priority to product id matches
      ...productNameMatches,
      ...akasMatches,       // Lowest priority to aka matches
    ]
    .filter((matchIndex, i, self) => {
      return self.indexOf(matchIndex) === i
    })
    .map(matchIndex => {
      return catalog.products[matchIndex]
    })
    .slice(0, 6)
  return matchedProducts
}