'main': optimization that in my tests shows 2.7s -> 2.2s
Consider indexing long buffers (say 300 chars). E.g. following line:
integer offset=${${buf[start_pos+1,len]}[(i)$needle]}
will gradually go into large indices for first index [a,] and be kept
at large index for second index [,b]. Instead, we can remove already
processed characters from buf:
proc_buf="${proc_buf[offset + $#arg + 1,-1]}"
$offset and $#arg are small, first index will not be large. $proc_buf
will be smaller and smaller and the second index [,-1] will run shorter
and shorter.