String Algorithms
- Repeated Sequences
- Manacher Algorithm for Longest Palindromic Substring
- KMP Algorithm for Pattern Matching
SECTION INCOMPLETE: Review Content
Repeated Sequences
The generic problem is to find the L
length repeated sequences in a given string. [Leetcode #187]
Rabin-Karp Algorithm
If there are a
unique characters in the string, convert the sub-string of length L
to a a-
decimal number (i.e., base a
). Use hash set to find the repetitions. Time Complexity: O(N)
Bit Manipulation
Instead of converting it to a a-
decimal number convert to a bitmask, and operate accordingly.
Comparison optimization in sliding window approaches: Consider a simpler problem of finding a permutation [Leetcode #567] or anagram [Leetcode #438] with a 26-sized array to count the frequency of the characters. To reduce the computational complexity (i.e. comparing the 26-element arrays at every step of sliding window), keep a variable count
and update accordingly to count the number of mismatched letters. For detailed solution, refer to [Leetcode #567]
Manacher Algorithm for Longest Palindromic Substring
Given a string, find the longest substring which is palindrome. Time Complexity: O(N) [Ref1] [Ref2] [Ref3] [Leetcode #647]
Basic Insights:
- For any
idx
in between(center, right)
interval, the length of palindromic substring is atleast the length calculated for the mirror index (around center, which iscenter - (idx-center)
) - We expand only beyond that since that much length is already known
- If the right boundary expands, we update the
center
andright
def manacher(s):
s = "$#"+"#".join(s)+"#@"
p, center, right = [0]*len(s), 0, 0
for idx in range(1, n-1):
mirror = 2*c-idx
if idx<r: p[idx] = min(r-idx, p[mirror])
while s[idx+(p[idx]+1)]==s[(idx-(p[idx]+1))]: p[idx]+=1
if p[idx]+idx>r: c, r = idx, idx+p[idx]
return p
KMP Algorithm for Pattern Matching
Given a string
and a pattern
, find all occurrences of pattern
.
Building Prefix Array
Given the pattern
, find the longest proper prefix (a proper prefix is prefix with whole string not allowed) which is also a suffix. Return an array lps
such that lps[i]
represents the longest proper prefix of pattern[0..i]
(i
included). A naive algorithm takes $O(n^3)$ time. Note the following insight: $lps[i+1] \leq lps[i]$. Time complexity: O(len(pattern))
- Iteration
i
from1
ton-1
, withlps[0]=0
- Denote
j
as the best prefix length, and initialize withlps[i-1]
. Ifpattern[i] == pattern[j]
thenlps[i]=j+1
otherwise keep settingj
tolps[j-1]
and repeat. Ifj==0
, we setlps[i]=0
and move to the nexti
. [Ref1] [Ref2]
def KMP_prefixArray(pattern):
lps = [0]*len(pattern)
for i in range(1,len(pattern)):
j = lps[i-1]
while j>0 and pattern[i]!=pattern[j]: j = lps[j-1]
if pattern[i]==pattern[j]: lps[i]=j+1
else: lps[i]=0 #Means $j$ was 0
KMP Algorithm
Given the longest proper prefix array (lps
), we do not match all characters multiple times to save time complexity. Time complexity: O(m+n) [Ref1] [Ref2] [Leetcode #214]
i,j
denotes the indexes instring
andpattern
respectively to compare.- If there is a match
s[i]==s[j]
, we increment both pointers - If there is a mismatch, we know that
pattern[0..j-1]
matches withstring[i-j....i-1]
lps[j-1]
is count of characterspattern[0..j-1]
that are both proper prefix and suffix, hence, we do not need to matchlps[j-1]
characters withstring[i-j....i-1]
.
def KMP(string, pattern):
m,n, lps = len(pattern), len(string), KMP_prefixArray(pattern)
i,j,out=0,0,[]
while (i<n):
if string[i]==pattern[j]: i, j = i+1, j+1
if j==m: out.append(i-j)
else:
if j==0: i=i+1
else: j = lps[j-1]