Writing a fuzzy string search algorithm in Nim
I needed a fuzzy string search algorithm for my fzf like library and this website's gimmicky search bar.
I did see that the Nim documentation actually has an implementation which is used for searching their documentation.
I tried this and I actually ran into a few issues. It also didn't quite get the results I expected. Which just felt bad.
So I wrote my own:
proc fzfuzzyMatch*(pattern: string, str: string, longestItemLength: int) : tuple[score: int, matched: bool, item: string] =
var
strIndex = 0
patIndex = 0
lastCharMatchedScore = 0
score = 0
numInRow = 0
while (strIndex < str.len) and (patIndex < pattern.len):
var
patternChar = pattern[patIndex].toLowerAscii
strChar = str[strIndex].toLowerAscii
# Ignore certain characters
if patternChar in {'_', ' ', '.'}:
patIndex += 1
continue
if strChar in {'_', ' ', '.'}:
strIndex += 1
continue
if strIndex == 0 and patternChar == strChar:
score += longestItemLength
lastCharMatchedScore += 2
patIndex += 1
strIndex += 1
numInRow += 1
elif strChar == patternChar:
score += int(longestItemLength/strIndex) * (if numInRow == 0: 1 else: (numInRow * 3))
numInRow += 1
strIndex += 1
patIndex += 1
else:
if not (str[strIndex] in {'_', ' ', '.'}):
numInRow = 0
strIndex += 1
result = (
score: max(0, int(score)),
matched: (int(score) > 0),
item: str
)
This is actually or obviously VERY simple.
The fzfuzzyMatch function takes two arguments pattern and str.
pattern in the string representing the search string you would be typing in and str would be one of the string from a list of strings you want to search over.
My algorithm weights the results based on the matched character's position in the string and requires the length of the longest string being searched. This is passed in via longestItemLength.
All it does it iterate over both the pattern and str string at the same time. Both are converted to lowercase to prevent any upper lower case search issues.
Then it takes the first charcter in the pattern string (from the search box) and checks the first character in str for a match. If the first character matches the score has longestItemLength added to it. This if statement is on line 22.
The score is just a value we keep adding to as matches happen to rate a string higher or lower in the overall search rankings.
The next elif is when a character is matched that is not the first character. Here the score has the longestItemLength/strIndex * numInRow added to it.
longestItemLength/strIndex is just a value that becomes smaller as the matches character moves farther down the string (to the right).
numInRow is the number of characters matched in a row. If a character match is missed then this is reset.
That is basically it!
Pretty simple and provides some pretty nice results. If you check out the example in fznim you can see it highlight the matched characters which is a nice way to visualize some of this too.
To run this example you could checkout the repo and compile it with:
git clone git@github.com:bebrws/fznim.git
cd fznim
nimble install -y; nim c examples/fzf.nim; cat fznim.nim | ./examples/fzf