Sorting by Search-Ranking Score: Part 2
Introduction
TL;DR; This blog explains how to sort a list of strings based on a search ranking score. The calculations are explained with real world examples, implemented in Typescript and Rust.
This is the second part in the series to explain sorting by search-ranking score. If you haven’t already gone through the first part, please take a look, it should provide a foundation for this second and the last part.
In the first part, we discussed the first approach i.e sorting the source strings by using the search-term as a substring. At the end, we mentioned that the approach is too strict to forgive the typos. In this part, we will see another two approaches and in those we will see that they both ignore the typos. At the end, we will mix the both approaches to make a hybrid approach in order to improve the performance.
Jaccard Index
The Jaccard index or the Jaccard similarity coefficient is used to determine the similarity between two sets. The higher the score, the more similar the sets are. The score will always be between 1 (the exactly same sets) and 0 (the completely different sets). For our use case, we will consider search-term and the source string as two sets of characters. Explaining it in simple words, the Jaccard index is the ratio of common characters between both strings over the difference of the combined length of unique characters in both strings and the common characters. That is;
jaccard_index(search_term, source) = intersection(search_term, source).size/(unique(search_term).size + unique(source).size - intersection(search_term, source).size)
Let’s first understand it with the help of the following example.
source | search-term | intersection: I | unique(search-term) : A | unique(source): B | score: len(I)/(len(A) + len(B) - L) |
---|---|---|---|---|---|
hello |
hello |
helo |
helo |
helo |
4/(4+4-4)=1 |
world |
hello |
ol |
helo |
world |
2/(4+5-2)=0.28571486 |
world |
pumpkin |
[] |
pumkin |
world |
0/(6+5-0)=0 |
pumpkin |
pumpin (1 typo) |
pumin |
pumin |
pumkin |
5/(5+6-5)=0.833333 |
washington |
wasengtun (3 typos) |
wasngt |
wasengtu |
washingto |
6/(8+9-6)=0.545454 |
The table above should be sufficient to help understanding how the calculations are working. Let’s implement it in Typescript and Rust.
Typescript
Following is the optimized version of Jaccard Index function in Typescript. In case something is not intuitive, try to simplify it, or ping me on Linked or Twitter, or open an issue in the Github repo.
function jaccardIndex(a: string, b: string): number {
if (a.length === 0 || b.length === 0) {
return 0;
}
const aCodePoints = new Set(
Array.from(a.toLowerCase(), (c) => c.codePointAt(0))
);
const bCodePoints = new Set();
const intersection = new Set();
for (let i = 0; i < b.length; i++) {
const bc = b[i].toLowerCase().codePointAt(0);
bCodePoints.add(bc);
if (aCodePoints.has(bc)) {
intersection.add(bc);
}
}
return (
intersection.size /
(aCodePoints.size + bCodePoints.size - intersection.size)
);
}
The important point to note is that the above implementation can be used with UTF-8 and 16. Which means you can use it for any language that is not in simple ASCII e.g German, French, Arabic, Hindi etc. You can see the unit tests in the GitHub repo.
Rust
Following is the similar implementation in Rust. The tests reside in the same repo.
fn calculate_jaccard_index(search_term: &String, source: &String) -> f32 {
if search_term.is_empty() || source.is_empty() {
return 0f32;
}
let mut search_term_chars: HashSet<char> = HashSet::new();
search_term.chars().for_each(|c| {
search_term_chars.insert(c);
});
let mut intersection: HashSet<char> = HashSet::new();
let mut source_chars: HashSet<char> = HashSet::new();
source.chars().for_each(|c| {
source_chars.insert(c);
if search_term_chars.contains(&c) {
intersection.insert(c);
}
});
intersection.len() as f32
/ (search_term_chars.len() as f32 + source_chars.len() as f32 - intersection.len() as f32)
}
Analyses
What is the time complexity of calculating Jaccard Index with any of the above implementations? The heavy lifting we are doing is just iterating over the characters of both strings. Hence we can say that the time complexity C
is O(M+N)
where N
is the size of the source string and M
is the size of the search-term.
While Jaccard index is better than our first approach of substring matching, it also has a weakness! It will yield same score for some words that have exactly same characters but in different order. For example, if you calculate the score for top
as the search-term and pot
is the source string, it will yield 1
, which means the both strings are matching. But that is not the fact. It happens, because Jaccard index is using set theory in its calculations, and sets don’t care about the order! Therefore, the both string are assumed to be the exact same strings. Let’s see how Levenshtein distance can help us solve that issue in the following section.
Levenshtein Distance
The Levenshtein distance is another alternative approach to calculate the ranking score. The Levenshtein distance tries to calculate the minimum number of characters that we can change to make the both strings look exactly same. It means that the minimum the required changes, the better the match. Let’s understand the expected score with the help of the following table.
source | search-term | required changes | score |
---|---|---|---|
hello |
hello |
[] |
0 |
world |
hello |
wrld |
4 |
world |
pumpkin |
pumpkin |
7 |
pumpkin |
pumpin (1 typo) |
k |
1 |
washington |
wasengtun (3 typos) |
hio |
3 |
Now let’s the implementations, first in Typescript and then in Rust.
Typescript
Following is the optimized implementation in Typescript. In case something is not intuitive, try to simplify it, or ping me on Linked or Twitter, or open an issue in the Github repo.
function levenshtein(searchTerm: string, sourceString: string): number {
const stLength = searchTerm.length;
const srcLength = sourceString.length;
if (stLength === 0) {
return srcLength;
}
if (srcLength === 0) {
return stLength;
}
const st = searchTerm.toLowerCase();
const ss = sourceString.toLowerCase();
if (st === ss) {
return 0;
}
const matrix = new Array<number>(stLength + 1);
for (let i = 0; i <= stLength; i++) {
matrix[i] = i;
}
for (let j = 1; j <= srcLength; j++) {
let previous = matrix[0];
matrix[0] = j;
for (let i = 1; i <= stLength; i++) {
const current = matrix[i];
if (st[i - 1].codePointAt(0) === ss[j - 1].codePointAt(0)) {
matrix[i] = previous;
} else {
matrix[i] = Math.min(previous, matrix[i], matrix[i - 1]) + 1;
}
previous = current;
}
}
return matrix[stLength];
}
Same as the implementation of Jaccard-Index, the above implementation of Levenshtein distance, supports UTF-8 and 16 characters. You can see the unit tests in the GitHub repo.
Rust
Following is the similar implementation of Levenshtein distance in Rust. The tests reside in the same repo.
fn calculate_levenshtein_distance(search_term: &String, source: &String) -> usize {
let st_length = search_term.len();
let ss_length = source.len();
if st_length == 0 {
return ss_length;
}
if ss_length == 0 {
return st_length;
}
let st = search_term.to_lowercase();
let ss = source.to_lowercase();
if st == ss {
return 0;
}
let mut matrix: Vec<usize> = (0..=(st_length + 1)).collect();
let search_term_chars: Vec<char> = st.chars().collect();
let source_chars: Vec<char> = ss.chars().collect();
for j in 1..(ss_length + 1) {
let mut previous = matrix[0];
matrix[0] = j;
for i in 1..(st_length + 1) {
let current = matrix[i];
if search_term_chars[i - 1] == source_chars[j - 1] {
matrix[i] = previous;
} else {
matrix[i] = std::cmp::min(previous, std::cmp::min(matrix[i], matrix[i - 1])) + 1;
}
previous = current;
}
}
matrix[st_length]
}
Analyses
Calculating Levenshtein distance between two strings is a bit expensive than calculating the Jaccard Index of same strings. We can see that the time complexity of Levenshtein distance is O(M * N)
where M
is the length of the search-term and N
is the length of the source string. However, Levenshtein distance fixes the issue of same characters in a different order.
Hybrid approach
Now that we have seen a few approaches to calculate the relevance score of search-term in list of strings, we can see that they all have some kind of weakness. Either they don’t forgive typos or they mismatch with reverse string or they are a bit expensive. If we use Levenshtein distance alone on a list of strings having size L, then the total cost would be L * O(M * N)
. How about mixing them in a new approach such that we can overcome the mismatch issue and also not give up on the performance. So the idea would be to first calculate the Jaccard index of all strings in the source list and sort with the score, and then apply Levenshtein distance function on top K results and then resort by the new score. With this hybrid approach the total cost would be L * O(M+N) + K * O(M * N) where K < L
. Suppose we have 10000 strings in the source list and we want to apply Levenshtein distance on top 50 only, then the performance would be much better than applying Levenshtein distance on all 10000 strings.
The following implementation in Typescript shows the idea. I am skipping the implementation in Rust and leaving it as an exercise for you. Implementing in Rust should not be hard though.
function hybridSearch(
searchTerm: string,
source: string[],
top: number
): string[] {
const jaccardResult = source.map((item) => {
return {
item,
score: jaccardIndex(searchTerm, item),
};
});
const sortedJaccardResult = jaccardResult.sort((a, b) => {
return b.score - a.score;
});
const topJaccardResult = sortedJaccardResult.slice(0, top).map((item) => {
return {
item: item.item,
score: levenshtein(searchTerm, item.item),
};
});
return topJaccardResult.map((item) => item.item);
}
This snippet concludes the series of sorting by search ranking score. I hope that the series not only helps with sorting problem but also shows how to break complex problems in to simple ones and then solve them easily. Feel free to contact me at Linked or Twitter, or open an issue in the Github repo if you want to provide feedback or ask any question.