Build smarter, ship faster, and stand out from the crowd.

Subscribe or follow on X for updates when new posts go live.

Improving company name search using Salesforce and Levenshtein distance

SalesForce should just do this, right?

I only wish this were true.

All too often there is usually a false expectation reached about how a certain 3rd party API or functionality should work that makes it difficult to approach a viable development solution.

I had discovered that some custom SalesForce search functionality that had been built for a client didn't provide the most of results when searching for partial company names and trying to explain the gap to other teams was proving to be difficult. Obviously, I had to change their minds and get them to see it my way.

Perils of a naive fuzzy search approach

The search functionaliy used a standard SOQL search implementation, so doing custom queries using an API wasn’t possible.

Currently, a simple fuzzing algorithm was being used, leaving in any company suffixes and then an ILIKE was performed after replacing all of the spaces with percentages.

static string GetAccountIdsWhereNameLikeQuery(string companyName, IReadOnlyList<string> orderBy, int? limit)
{
     var mungedSearchString = (
         Regex.Replace(
                 companyName,
                 @"[^0-9a-zA-Z ]+",
                 "%"
             )
             .ToLower()
             .Replace(" inc", "%")
             .Replace(" ", "%")
     );

     var orderByClause = (
         orderBy != null
             ? $"ORDER BY {orderBy.Join(", ")}"
             : ""
     );

     var limitClause = (
         limit.HasValue
             ? $"LIMIT {limit}"
             : ""
     );

     return $"SELECT Id FROM Account WHERE RecordType.Name = 'Customer' AND Name LIKE \'%{mungedSearchString}%\' {orderByClause} {limitClause}";
}

Looking into the published product requirements, I didn't see anything specifically stating how it should work or what an agent would expect to see in the search results. I think this was probably just a familiarity problem with what SalesForce was capable of or just an easy to make assumption that doing search isn't a big deal.

Basically the requirements I had looked like the following:

Enter text in a text field to search accounts in SalesForce by
Display results on a page in a list view

Obviously I had work to do to make sure that developers and QA personnel are on the same page when it comes to trying to ship code. Also, when the feature is released to the business the agents are going to need to know how to expect it to function.

What I believe happened in the initial implementation was an engineer had even less requirements to go on and just wanted to take an approach that wasn't a plain exact match. The thought might have been if a SQL ILIKE was turned up to 11, the algorithm would cast as wide a net as possible and return a good amount of results for someone to interpret.

Implementation issues

As can be seen in the code snippet above, there's also some initiative to remove a common business entity "Inc" from the search string before running the query. This is valid assumption but lacks depth, as "Inc" is just one of many that might need to be ignored, which I'll address later in the post.

Another error I saw was that there didn't seem to be any additional processing done on the results. I noticed by studying the query that the any results returned wouldn't have any ordering, and all returned accounts would be treated with the same relative importance. When I observed the results returned from SalesForce, the results seemed to be scored by a closeness score to the search string.

When I compared the actual results of what SalesForce provided and what the previous implementation returned, the comparison wasn’t even close. Sometimes no responses at all were returned when SalesForce returned several. Something was clearly wrong.

SalesForce does do this, but…

Fuzzy matching on data is an entire industry, and due to its programmatic complexity is not well-suited to the SalesForce platform. Most companies that offer comprehensive data de-duplication and other master data management services have their core engine outside of Salesforce.

What the implied behavior of the company name search suggests is admirable and definitely very fun, but in practical terms not possible unless very hard limits are established on the scope of what can be deduped i.e. exact, or near-exact matches.

Entire companies will devote all of their engineering talent to building fuzzy logic systems, but it will still turn out to be limited in scope while being extremely complex and expensive to develop.

Implementing a defined solution

The first thing I did to get started on this was have a conversation with the product manager on what users expected to see in search results. The first couple of answers I got were "a list of companies matching the search" or "all accounts containing what's in the search box". So it wasn't so much an engineering issue as it was just going into more detail of the expected output..

I asked another question:

If the search string matches multiple account names, some exactly and some partially, which appear first?

Well, the exact match would come first, and the partial matches would come after that.

Boom. I had the start of a specification for this feature.

Several more questions were asked and I eventually had the product specification I needed to build something useful to provide to an agent.

Here's the algorithm I came up with which I'll go into detail below:

Remove special characters not inside words and not their own word
If the search string does not solely consist of a common legal business entity, remove all occurences
Convert multiple whitespace to a single whitespace
Surround processed search term with % and OR w/ raw search string surrounded by %
Place processed search term into ILIKE for SOQL and search
Apply Levenshtein edit distance algorithm to all results
Sort by edit distance ascending (Exact matches at the top)

For the last two steps, I chose to use the Levenshtein distance algorithm since it sounded like the most straightforward way to find all results based off a search string that was not guaranteed to be an exact match for any result. It was also the simplest way to implement a form of fuzzy-matching without spending a ton of time in development or costing a lot of compute cycles.

I won't go into a ton of detail about how it works but the simplest way to explain it is the algorithm takes in two strings and finds the number of single-character changes that would need to be made to either string to make the two be exact matches. The smaller the resulting number, the more similar the two strings are.

Cleaning the data

Before an actual search can be performed, the search string needs to be prepared to be inserted into a SalesForce SOQL query.

The first thing to do is remove any common legal business entities. Since a search string might contain one of these, I want to make sure there are no intersections with other companies that only have the same business legal entity in common, polluting the search results.

The most efficient and straightforward way I can think to do this is to tokenize the entire string, and then remove the token if it exists a business entity, then convert the tokens back to string a string separating them with a space.

var commonBusinessEntities = new List<string>() {
    "association",
    "co",
    "company",
    "corp",
    "corporation",
    "dba",
    "inc",
    "limited",
    "ltd",
    "lc",
    "llc",
    "pllc",
    "lp",
    "llp",
    "lllp",
    "pbd"
};

var commonBusinessEntitiesRemoved = convertedToSingleWhitespaces
    .Split(' ')
    .Where(x => !commonBusinessEntities.Contains(x))
    .Join(" ");

Here I'm mostly removing all special characters since usually most of the time agents use special characters to pretty up a company name or do some separation that isn't valid. There is the potential here of some search rankings to be skewed but that's more of an edge case concern I don't want to deal with at this time.

var disallowedCharsRemoved = Regex.Replace(
    searchStr,
    @"\s+\W(?!\s)|(?<!\s)\W\s+",
    " "
);

With this last string replace I'm simply replacing multiple space characters with a single one, since after removing the disallowed characters, I've potentially added additional spaces.

var convertedToSingleWhitespaces = Regex.Replace(
        disallowedCharsRemoved,
        @"\s+",
        " "
    )
    .ToLower();

The final step before performing the search is to properly escape the search string methods advised by SalesForce. Figuring this out took some research as my initial results looked great except for some edge case terms whose results didn't make much sense.

var escapedILikeSearchString = commonBusinessEntitiesRemoved
        .Replace(@"\", @"\\")
        .Replace("'", @"\'")
        .Replace("\"", "\\\"")
        .Replace("_", @"\_")
        .Replace("%", @"\%");

var escapedSearchString = searchStr
    .Replace(@"\", @"\\")
    .Replace("'", @"\'")
    .Replace("\"", "\\\"")
    .Replace("_", @"\_")
    .Replace("%", @"\%");

Querying SalesForce

Once I had the search string in the form I wanted, it was now only a matter of crafting the SOQL statement. Here I search for both the ILIKE search string as well as the initial search string. Doing so helps find any result in the event I missed a cleaning step and the agent has entered a string of an unknown form.

return $"SELECT {AccountFieldsToQuery} FROM Account WHERE RecordType.Name = 'Customer' AND (Name LIKE \'%{escapedILikeSearchString}%\' OR Name LIKE \'%{escapedSearchString}%\')";

Apply Levenshtein distance and Sort

After the results have been returned from SalesForce, I perform a pass for all accounts against the Levenshtein distance algorithm, and sort by score ascending.

You can see here I'm passing in the Levenshtein distance utility method along with the results since I want to be able to inject the algorithm as a dependency to easily swap out other algorithms at a later date if necessary.

return new GetAccountIdsByCompanyNameResponse
{
    AccountIds = await _accountQuery.GetAccountIdsWhereNameLike(
        r.CompanyName,
        Utilities.ScoreAndSortByLevenshteinDistance
    ),
};

accounts.Select(x => new
    {
        x.Id,
        Score = LevenshteinDistance(companyName, x.Name)
    })
    .OrderBy(x => x.Score)
    .Select(x => x.Id)
    .ToList();

There isn't much to this algorithm; I think I found it from an academic resource on computing the Levenshtein distance for sentences.

static int LevenshteinDistance (string firstWord, string secondWord)
{
    var n = firstWord.Length + 1;
    var m = secondWord.Length + 1;
    var matrixD = new int [n, m];

    const int deletionCost = 1;
    const int insertionCost = 1;

    for (var i = 0; i < n; i ++)
    {
        matrixD [i, 0] = i;
    }

    for (var j = 0; j < m; j ++)
    {
        matrixD [0, j] = j;
    }

    for (var i = 1; i < n; i ++)
    {
        for (var j = 1; j < m; j ++)
        {
            var substitutionCost = firstWord[i - 1] == secondWord[j - 1] ? 0 : 1;
            int[] possibilities = {
                matrixD[i - 1, j] + deletionCost, // delete
                matrixD [i, j - 1] + insertionCost, // insert
                matrixD [i - 1, j - 1] + substitutionCost // replacement
            };
            matrixD [i, j] = possibilities.Min();
        }
    }

    return matrixD [n - 1, m - 1];
}

How I fixed company name search

Here's the complete code:

public Task<IReadOnlyList<string>> GetAccountIdsWhereNameLike(string companyName, Func<string, IReadOnlyList<SfAccount>, IReadOnlyList<string>> _applyScoringAndSort, int? limit=default(int?))
{
    if (companyName == null)
        throw new ArgumentNullException(nameof(companyName));

    var results = _readLogCache.GetOrUpdate(
        CacheKey(
            "account-ids-where-name-like",
            $"{ToBase64(companyName)}?limit={limit}"
        ),
        () => _salesForceSoapClientRequestExecutor.ExecuteQuery(
            GetAccountIdsWhereNameLikeQuery(companyName, limit),
            AsSfAccount
        ),
        (result, now) => result == null ? default(Instant?) : now + Duration.FromSeconds(5)
    );

    IReadOnlyList<string> accountIds = _applyScoringAndSort(companyName, results.Result);

    return Task.FromResult(accountIds);
}

static string GetAccountIdsWhereNameLikeQuery(string searchStr)
{
    var commonBusinessEntities = new List<string>() {
        "association",
        "co",
        "company",
        "corp",
        "corporation",
        "dba",
        "inc",
        "limited",
        "ltd",
        "lc",
        "llc",
        "pllc",
        "lp",
        "llp",
        "lllp",
        "pbd"
    };

    if (commonBusinessEntities.Contains(searchStr.ToLowerInvariant()))
    {
        return $"SELECT {AccountFieldsToQuery} FROM Account WHERE RecordType.Name = 'Customer' AND Name LIKE \'%{searchStr}%\' {limitClause}";
    }

    var disallowedCharsRemoved = Regex.Replace(
        searchStr,
        @"\s+\W(?!\s)|(?<!\s)\W\s+",
        " "
    );

    var convertedToSingleWhitespaces = Regex.Replace(
            disallowedCharsRemoved,
            @"\s+",
            " "
        )
        .ToLower();

    var commonBusinessEntitiesRemoved = convertedToSingleWhitespaces
        .Split(' ')
        .Where(x => !commonBusinessEntities.Contains(x))
        .Join(" ");

    // https://developer.salesforce.com/docs/atlas.en-us.soql_sosl.meta/soql_sosl/sforce_api_calls_soql_select_quotedstringescapes.htm
    var escapedILikeSearchString = commonBusinessEntitiesRemoved
            .Replace(@"\", @"\\")
            .Replace("'", @"\'")
            .Replace("\"", "\\\"")
            .Replace("_", @"\_")
            .Replace("%", @"\%");

    var escapedSearchString = searchStr
        .Replace(@"\", @"\\")
        .Replace("'", @"\'")
        .Replace("\"", "\\\"")
        .Replace("_", @"\_")
        .Replace("%", @"\%");

    return $"SELECT {AccountFieldsToQuery} FROM Account WHERE RecordType.Name = 'Customer' AND (Name LIKE \'%{escapedILikeSearchString}%\' OR Name LIKE \'%{escapedSearchString}%\') {limitClause}";
}

And I call it like this:

public async Task<GetAccountIdsByCompanyNameResponse> Get(GetAccountIdsByCompanyName r)
{
    if (r.CompanyName.IsNullOrEmpty())
        throw new HttpError(HttpStatusCode.BadRequest, "Please specify CompanyName");

    return new GetAccountIdsByCompanyNameResponse
    {
        AccountIds = await _accountQuery.GetAccountIdsWhereNameLike(
            r.CompanyName,
            Utilities.ScoreAndSortByLevenshteinDistance,
            r.Limit
        ),
    };
}

I also use dependency injection to wire up the ranking and scoring, using the Levenshtein distance algorithm. Writing the code in this manner will easily allow another engineer to easily make changes to the existing algorithm, as well as maybe back it up with some tests, without having to perform a complete integration test of the query.

The code can also swap out the Levenshtein distance algorithm altogether for a different one, while leaving the query mechanism fully intact and not jeopardize any of the functionality there.

There is conversation that maybe Levenshtein is too slow, and another method might be more performant.

public static IReadOnlyList<string> ScoreAndSortByLevenshteinDistance(string companyName, IReadOnlyList<SfAccount> accounts)
 {
     return accounts.Select(x => new
         {
             x.Id,
             Score = LevenshteinDistance(companyName, x.Name)
         })
         .OrderBy(x => x.Score)
         .Select(x => x.Id)
         .ToList();
 }

 static int LevenshteinDistance (string firstWord, string secondWord)
 {
     var n = firstWord.Length + 1;
     var m = secondWord.Length + 1;
     var matrixD = new int [n, m];

     const int deletionCost = 1;
     const int insertionCost = 1;

     for (var i = 0; i < n; i ++)
     {
         matrixD [i, 0] = i;
     }

     for (var j = 0; j < m; j ++)
     {
         matrixD [0, j] = j;
     }

     for (var i = 1; i < n; i ++)
     {
         for (var j = 1; j < m; j ++)
         {
             var substitutionCost = firstWord[i - 1] == secondWord[j - 1] ? 0 : 1;
             int[] possibilities = {
                 matrixD[i - 1, j] + deletionCost, // delete
                 matrixD [i, j - 1] + insertionCost, // insert
                 matrixD [i - 1, j - 1] + substitutionCost // replacement
             };
             matrixD [i, j] = possibilities.Min();
         }
     }

     return matrixD [n - 1, m - 1];
 }

Final search results

Now the results are much better and what's more, it's something the business actually expects to see and QA knows how to test and what results are expected for a given input.

I'm not sure what algorithm SF uses for their account search, but for some search strings it seems to perform worse than what I've implemented here. I'm definitely glad I pushed some on the initial specification implementing a full replication of what that SlaesForce field does since that would've added a ton of complexity to this project.

I took new observations of all of the same comparisons I did before making my change, and the comparisons were very impressive.

So there you have it. That's how I fixed it.