Motivation
Understanding what is glob and how glob
matching operates will help you adopt this useful mini-language in your domain whether you’re a user, programmer, or an architect.
What is glob?
Long ago, in UNIX V6, there was a program /etc/glob that would expand wildcard patterns. Soon afterward this became a shell built-in.
Glob, short for global, is many things:
having multiple meaning and such a generic name, not surprisingly, leads to the confusion.
This post focuses on 1. - the DSL to describe text patterns. The pattern language migrated from the Unix shell into variety of platforms, which speaks for its usefulness.
glob
matching has been spotted in:
git
's wildmatch- OpenSSH’s
PATTERNS
- PKI’s Certificates
- routing in Cloudflare
- searching in Datadog
- querying with Postgres (different syntax but the same function)
- TODO: more examples
Grammar
Being a language, it has a grammar, expressed in EBNF, looks like:
pattern:
{ term }
term:
'*' matches any sequence of characters
'?' matches any single character
'[' [ '^' ] { character-range } ']'
character class (must be non-empty)
c matches character c (c != '*', '?', '\\', '[')
'\\' c matches character c
character-range:
c matches character c (c != '\\', '-', ']')
'\\' c matches character c
lo '-' hi matches character c for lo <= c <= hi
There are many flavors of the grammar, but today’s focus is mainly on the wildcard *
, since it’s the most useful and complex piece.
How it works
As mentioned earlier glob is a pattern matching language. The main focus is on *
wildcard and how it matches text.
'*' matches any sequence of characters
Next, some examples:
glob | input text | outcome | comments |
---|---|---|---|
✅ match | empty glob matches empty input ony: literal match | ||
anything | ❌nomatch | ||
he | he | ✅match | literal match: the glob contains no wildcards |
he | hello | ❌nomatch | |
he* | hello | ✅match | prefix match |
he* | hi hello | ❌nomatch | |
*lo | hello | ✅match | postfix match |
*lo | hello world | ❌nomatch | |
*he* | the hello world | ✅match | infix match |
*he* | world | ❌nomatch |
👆 ️note: the empty table cells correspond to empty strings “”.
Since initial version, many glob
implementations extended its functionality, and one of the most used ones is Bash’s extglob
, with new syntax constructs.
Glob vs Regex
ℹ️ Don’t confuse glob with regular expressions(regex): they look alike but aren’t the same.
Unlike glob, regex matches a substring unless expressed to match start or end of the input.
Comparison of glob and the corresponding regular expression:
Glob | Regex | Description |
---|---|---|
^$ | match empty string | |
* | .* | match everything |
. | \. | the dot is treated literally by glob, but it needs to be escaped for regex interpreters |
? | . | match any single symbol |
he | ^he$ | exact match |
he* | ^he.* | prefix |
*he | .*he$ | postfix |
*he* | he | infix. Note regex is shorter this time |
✋ Shells treat inputs with leading
.
specially: the*
glob doesn’t match those, unless explicitly specified, ie:
ls *
doesn’t list “hidden” files starting with.
, butls .*
lists
Implementation
Given that now we understand the DSL, how would we incorporate glob
matching into software? We need to implement it.
Russ Cox posted about glob including the algorithm which is both elegant and performant (there are many with exponential runtime-s out there).
Below Go snippet is simplified to only handle *
wildcard. I’ve commented it while studying the implementation, let me know if there’s a better way to think about it.
func match(glob, str string) bool {
var gx, sx int
var lastgx, nextsx int
for gx < len(glob) || sx < len(str) {
if gx < len(glob) {
switch c := glob[gx]; c {
default:
// literal comparison
// "consume" if both match
if sx < len(str) && str[sx] == c {
gx += 1
sx += 1
continue
}
case '*':
// *lo ~ ablalo
lastgx = gx // remember the '*' position
nextsx = sx + 1 // remember the next position in str
gx += 1 // advance glob position
// next iteration, will
// - either match glob[gx] and str[sx], continuing
// - or no-match
// in which case `gx` will be re-set to `lastgx` from above
// and sx will point to nextsx,
// effectively "consuming" the no-match by '*'
continue
}
}
// glob is exhausted
// try restarting from last '*' and next position in str
// NOTE: `<=` here: signals "reached the end of the string"
if 0 < nextsx && nextsx <= len(str) {
gx = lastgx
sx = nextsx
continue
}
return false
}
return true
}
Algorithm Walkthrough
As an example, below is a matching sequence with glob: *lo*
and input holo
.
Glob | Input | Details |
---|---|---|
*lo* | holo | initial state |
lo* | holo | * consumed; remembering its position and next position in string. Try match leftmost symbols. |
*lo* | olo | no-match in previous iteration: recover * position and consume non-matched symbol by moving to the next input symbol as set remembered in the previous step |
lo* | olo | * consumed; remember again; attempt the literal match |
*lo* | lo | no-match again; recover last * and move to the next symbol |
lo* | lo | * consumed; also literal match; consume both |
o* | o | literal match; continue |
* | consume * ; remember its position again | |
it’s a match: both glob and the input string reached the end simultaneously |
Summary
glob
matching is a useful tool. Despite limited expressiveness, it’s a preferred solution for certain use-cases due to grammar simplicity and good UX.
I’ve picked it up intuitively, while using shell, without really thinking through how it operates, this post addressed the gap.
Next up: regular expressions?