- Published on
Finding multiple matches - Building a Regex Engine Part 7
In the previous posts we have developed a formal regex engine and built some modern features on top. But it still two major problems:
- The engine can only find one match.
- All regex have an implicit start of string anchor. For example the regex
[ab]
can't find any match in the stringca
even though any other engine would find thea
.
The NFA always computes a single match starting at a specific position of the string. Until now, this position was always the first one. We can solve both problems at the same time by executing the NFA multiple times with different initial positions.
Implementation
The first thing we need to do is define the interface to get a single or multiple matches. Different languages do this in different ways. Javascript
by default gets a single match, and if you want all of them you need to use the global flag (g
). Python's regex have the method search
for a single
one and findAll
for all matches. And, for some reason, Java only has a find
method for a single match, and the only way to get all matches
is to call find()
in a loop. Ironically Java has both replace
and replaceAll
methods, so I don't know why there isn't a findAll
🤷♂️.
The only thing all languages (I know) share is that each time you call the find
method it returns the next match. This means the regex
stores the position of the last match to skip it the next time the method is called.
To keep it simple we'll add two methods find
and findAll
. These are added to the NFARegex
class, not to the NFA
class. Why? As I said in the
introduction, a NFA is supposed to find a single match and always start at a specific index. The regex class is the one responsible for controlling when
and how the NFA is executed.
Find
Now we'll have two methods that need to convert from the NFA memory to the output result, so to simplify it we'll move the code does it to it's own method.
class NFARegex {
// ...
_memoryToFinalResult(memory, string) {
const result = {groups: {}};
for (const [group, start, end] of Object.values(memory.GROUP_MATCHES)) {
result.groups[group] = string.substring(start, end);
}
return result;
}
}
We'll also assume the NFA's compute
method also takes the initial position as an argument. I'll explain that later.
The find
method is just a loop that tries to find the first match in every position, beginning at _currentPointer
. The pointer stores
the index where the last match ended, so that each time you call the method it returns the next match.
class NFARegex {
constructor (regex, modes) {
// ...
this._currentPointer = 0;
}
// ...
find(string) {
for (let i = this._currentPointer; i < string.length; i++) {
const r = this.nfa.compute(string, i);
if (!r) continue;
this._currentPointer = r.GROUP_MATCHES[0][2]; // Ending position of the group 0
return this._memoryToFinalResult(r, string);
}
this._currentPointer = string.length; // This is to cache that there are no more matches.
return null;
}
}
To get the position where the match ended it uses the ending index of the group 0. Notice that we aren't adding one to that index. That's because the ending index already points to the character after the group ends:
Notice that each group is actually a right-open interval. When we run
computeGroups
we have already exited the group, soi
points to the first non-consumed letter after the group, not to the group's last letter. This won't be a problem, since the substring function is also right-open.
FindAll
The findAll
method is similar, but here we always start at 0 and even if it finds a match it keeps looking for more. To be able to change the
current position in a smoother way I'll use a while loop.
class NFARegex {
//...
findAll(string) {
let i = 0;
const matches = [];
while (i < string.length) {
const r = this.nfa.compute(string, i);
if (r) {
matches.push(this._memoryToFinalResult(r, string));
i = i === r.GROUP_MATCHES[0][2] ? i + 1 : r.GROUP_MATCHES[0][2];
} else
i++;
}
return matches;
}
}
When it finds a match, it updates the index to the ending position of the match, unless it's an empty match. If you use lazy quantifiers, they might be matches with 0 length. For example the regex a??
and the string a
produces
one empty match. In this case we must advance the pointer one position to avoid getting stuck in an infinite loop.
The final detail
The final touch to get this working is allowing the NFA to start at a non-0 position. In the NFAEngine
we had:
compute(string) {
const stack = [];
stack.push({i: 0,
currentState: this.states[this.initialState], memory: {GROUP_MATCHES:{}, EPSILON_VISITED: []}})
//...
}
The change is as simple as adding an argument and setting i
to that value:
compute(string, i) {
const stack = [];
stack.push({i, // <--- Here
currentState: this.states[this.initialState], memory: {GROUP_MATCHES:{}, EPSILON_VISITED: []}})
//...
}
Finally, we can test it:
const r = new NFARegex("[abc]");
console.log(r.find("abc")); // {groups: {0: "a"}}
console.log(r.find("abc")); // {groups: {0: "b"}}
console.log(r.find("abc")); // {groups: {0: "c"}}
console.log(r.find("abc")); // null
console.log(r.findAll("abc")); // [{groups: {0: "a"}}, {groups: {0: "b"}}, {groups: {0: "c"}}]
Conclusion
With this change we already have a decent regex engine. The interface could be improved even more, with methods like replace
and replaceAll
. It
shouldn't be too difficult, since we have the indexes of each match.
The next posts will be a (hopefully) short mini-series on the remaining features I know about regex: Atomic groups, backreferences, lookahead and lookbehinds. I didn't intend to cover these topics, but I feel that this series would be incomplete without them. Still, I'll make a more simple and experimental implementation. I might not cover all the corner cases (or even realize they exist), since I would like to end this series as fast as possible (which is contradictory with the idea of adding 4 unplanned posts to the series 😆).
Season 1 - Building a Regex Engine (Completed)
- Introduction
- Implementing a NFA
- Creating a NFA from a regex
- Modern features - Part 1: Capturing groups
- Modern features - Part 2: Character classes and escape characters
- Modern features - Part 3: Anchors and multiline mode
- Finding multiple matches
- Modern features - Part 4: Atomic groups
- Modern features - Part 5: Backreferences
- NFA vs DFA