Sorting by Search-Ranking Score: Part 1
Introduction
TL;DR; This blog explains how to sort a list of strings based on a search ranking score. The calculations are explained with real world examples, implemented in Typescript and Rust.
Have you ever tried to implement a word dictionary? If not, just try as a toy project. Soon you will encounter the problem of sorting the search results in such a way that the relevant results should be on top. If the problem of sorting by relevance sounds familiar to you, welcome, you are at the right place. We will learn together how to sort by relevance, what are the different approaches available and more importantly what are the trade offs. This is a series of two parts, this one is the first one, and you can jump to the second part if you are already gone through this one.
What? The search ranking score
Sorting strings other than ascending or descending is not that trivial! It sucks brain blood!
The search ranking score is just a floating point number! It is calculated based on the criteria that you want to use for sorting. Suppose, you want to sort the strings based on their lengths, then the ranking score is just the length of the strings. Well, that is not a real use case! In real life, we rarely need such a simple ranking score, however a practical use case is calculating the ranking score from the relevancy of a search term with the search result. We will see it in details down below.
Once you have the ranking score i.e. a number, you have, in fact, translated the problem of sorting strings into a problem of sorting numbers. You may already know sorting the numbers is much easier to understand and faster in execution compared to sorting strings.
Why? The use cases
Do you really need it? Well, that’s a valid question. Often, the source of data in our software systems is in our control. For example, getting a list of users from a database, sorted by their first names! We simply sort the list in the database using its query syntax or through its API if it’s a NoSQL database.
However, sometimes, you might need to get a list of cities from your database where the city name contains a certain given sequence of characters. For example, searching cities having “berg” in their names! Another example could be implementing a word dictionary! Searching a word, while also keeping user typos in mind. Or even a worse case is when the source of the data is a third-party service provider. You just provide your filter criteria and get the results. Which means you might not have good control over the sorting. Even in the case of your database, sorting the list in the database might not be that easier! Or it may consume more resources and cause performance problems.
That’s why solving the sorting problem in your application codebase is not only easier to understand, maintain and enhance but also faster in your control.
How?
There are several approaches, but we will start with a simpler ones in this first part and then we will discuss more advanced ones in the second part of this blog series.
Substring in source
Consider this use case. Suppose you have a database of locations of the neighborhoods and boroughs of London. Which looks like the following.
Locations |
---|
Forest Gate (Newham) |
Barking (Barking and Dagenham) |
Dagenham (Barking and Dagenham) |
Forest Hill (Lewisham) |
Barking and Dagenham |
Becontree (Barking and Dagenham) |
Becontree Heath (Barking and Dagenham) |
Barnet |
Forestdale (Croydon) |
Barnet Gate (Barnet) |
Brent Cross (Barnet) |
East Barnet (Barnet) |
Highams Park (Waltham Forest) |
Bexley |
Barnehurst (Bexley) |
Bexleyheath (Bexley) |
Brent |
Brent Park (Brent) |
Brentford (Brent) |
Brondesbury (Brent) |
Suppose you have provided a search box on your web page and a user inputs forest
. No one knows that which exact location is the user searching for! Consider another input: dagenham
. It’s a neighborhood “Dagenham” in “Barking and Dagenham” and also it’s in the name of the borough. That makes it complex to sort.
So what’s the solution? First thing first, we need to filter the strings with contain
criteria. If a string contains the search term provided by the user, we will consider it a match. Hence, in the case of forest
, we will get:
Locations |
---|
Forest Gate (Newham) |
Forest Hill (Lewisham) |
Forestdale (Croydon) |
Highams Park (Waltham Forest) |
Whereas in the case of dagenham
, we will get:
Locations |
---|
Barking (Barking and Dagenham) |
Dagenham (Barking and Dagenham) |
Barking and Dagenham |
Becontree (Barking and Dagenham) |
Becontree Heath (Barking and Dagenham) |
Now how do we sort the search results, such that the most relevant location should be at the top! As mentioned above, we need a function to determine the relevancy of the search term to the strings that we got from the database as the search result. Here is the pseudo code of our ranking function.
INPUT: search_term: String
INPUT: source: String
if(source.starts_with(search_term)){
return 1 + len(search_term)/len(source)
} else if(source.contains(search_term)){
return len(search_term)/len(source)
} else{
return -1
}
It’s simple. The key point to note in this calculation is the division of the length of the search term by the length of the source string. The smaller the source string, the higher the score. Another key point is the prefix matching. If a string contains the search-term as the prefix, it’s score gets a boost of 1
. Whereas, if doesn’t contain the search-term at all, we just return -1
.
When we execute the above function over the example search results, we will get the following:
Locations | length | calculation | score |
---|---|---|---|
Forest Gate (Newham) | 20 | 1+(5/20) | 1.25 |
Forest Hill (Lewisham) | 22 | 1+(5/22) | 1.2273 |
Forestdale (Croydon) | 20 | 1+(5/20) | 1.25 |
Highams Park (Waltham Forest) | 29 | 5/29 | 0.1724 |
When sorted by the score in the descending order, we will get either of the “Forest Gate (Newham)” and “Forestdale (Croydon)” at the top position, since the both have equal but the biggest relevance score in the search result.
Now let’s repeat the same process for the search result of dagenham
.
Locations | length | calculation | score |
---|---|---|---|
Barking (Barking and Dagenham) | 30 | 8/30 | 0.267 |
Dagenham (Barking and Dagenham) | 31 | 1+(8/31) | 1.2581 |
Barking and Dagenham | 21 | 8/21 | 0.381 |
Becontree (Barking and Dagenham) | 32 | 8/32 | 0.25 |
Becontree Heath (Barking and Dagenham) | 38 | 8/38 | 0.2105 |
We can see that we sorted with score in the descending order, we will get “Dagenham (Barking and Dagenham)” at the top.
I have implemented this example in Rust and Typescript, and have provided some unit tests for the both languages.
Typescript
Following is a snippet from Typescript.
const calculateScore = (searchTerm: string, source: string): number => {
if (source.startsWith(searchTerm)) {
return 1 + searchTerm.length / source.length;
} else if (source.includes(searchTerm)) {
return searchTerm.length / source.length;
} else {
return -1;
}
};
export const sortByRankingScore = (
searchTerm: string,
searchResult: string[]
): string[] => {
return searchResult
.map((searchItem) => ({
score: calculateScore(searchTerm, searchItem),
searchItem,
}))
.sort((a, b) => b.score - a.score)
.map((sorted) => sorted.searchItem);
};
You can take a look of the unit tests and can grab the code from this Github repository.
Rust
Now let’s take a look of a snippet from Rust code:
#[inline]
fn calculate_score(search_term: &String, source: &String) -> f32 {
if source.starts_with(search_term) {
1f32 + (search_term.len() as f32 / source.len() as f32)
} else if source.contains(search_term) {
search_term.len() as f32 / source.len() as f32
} else {
-1f32
}
}
#[inline]
fn compare_floats(left: &f32, right: &f32) -> Ordering {
if left > right {
Ordering::Greater
} else if left < right {
Ordering::Less
} else {
Ordering::Equal
}
}
/// Sorts the provided source by the relevance of the provided search term.
/// #### Arguments
/// * `search_term` - the term (a non-empty string) to search in the provided source of strings
/// * `source` - the source of strings to search in and then sort by the relevant score, assuming that all strings are non-empty
pub fn sort_by_ranking_score(search_term: &String, source: Vec<String>) -> Vec<String> {
// calculates ranking score
let mut source_scores: Vec<SourceScore> = source
.iter()
.map(|source| SourceScore {
score: calculate_score(search_term, source),
source: source.to_string(),
})
.collect();
// sorts (in-place) in descending order by using the item b as the left argument
source_scores.sort_by(|a, b| compare_floats(&b.score, &a.score));
// return the source strings only
source_scores
.iter()
.map(|source_score| source_score.source.clone())
.collect()
}
You can take a look of the unit tests and can grab the code from this Github repository.
Even though the code in the both languages is pretty self explanatory and contains the comments as well, I have added unit tests as well to make it even more understandable. Moreover, I will be happy to provide further explanations when required. In such case, feel free to open an issue in the GitHub repository.
Analyses
This approach, I call it substring approach, is simple in the sense that it tries to find the search-term as a whole in the source strings. The best case in terms of time complexity is that the source string starts with the search-term. Suppose the length of the search-term is M
, in case of prefix, the time complexity of the worst case would be O(M)
. However, if the first condition is not true, then the algorithm is going to find whether the source string contains the search-term. In that case, the time complexity of the worst case would generally be O(N*M)
. Where N
is the size of the source string. Please note that we are not going into the details of language-wise fine tuning in the implementation of finding a substring. Some languages implement multiple algorithms and use heuristics to get better performance than O(N * M)
.
There is one down side of this approach however! It doesn’t forgive the user for typos! Luckily we have a solution for that. And that is discussed in length in the second part of this series.