Glob matching

Motivation

Understanding what is glob and how glob matching operates will help you adopt this useful mini-language in your domain whether you’re a user, programmer, or an architect.

What is glob?

Long ago, in UNIX V6, there was a program /etc/glob that would expand wildcard patterns. Soon afterward this became a shell built-in.
— glob(7) — Linux manual page

Glob, short for global, is many things:

having multiple meaning and such a generic name, not surprisingly, leads to the confusion.

This post focuses on 1. - the DSL to describe text patterns. The pattern language migrated from the Unix shell into variety of platforms, which speaks for its usefulness.

glob matching has been spotted in:

git's wildmatch
OpenSSH’s PATTERNS
PKI’s Certificates
routing in Cloudflare
searching in Datadog
querying with Postgres (different syntax but the same function)
TODO: more examples

Grammar

Being a language, it has a grammar, expressed in EBNF, looks like:

pattern:
	{ term }    
term:
	'*'         matches any sequence of characters
	'?'         matches any single character
	'[' [ '^' ] { character-range } ']'
	            character class (must be non-empty)
	c           matches character c (c != '*', '?', '\\', '[')
	'\\' c      matches character c

character-range:
	c           matches character c (c != '\\', '-', ']')
	'\\' c      matches character c
	lo '-' hi   matches character c for lo <= c <= hi

✋ note: original phrasing `non-Separator characters` is replaced with `characters` since we're not dealing with file paths.

There are many flavors of the grammar, but today’s focus is mainly on the wildcard *, since it’s the most useful and complex piece.

How it works

As mentioned earlier glob is a pattern matching language. The main focus is on * wildcard and how it matches text.

 '*'         matches any sequence of characters

Next, some examples:

glob	input text	outcome	comments
		✅ match	empty glob matches empty input ony: literal match
	`anything`	❌nomatch
`he`	`he`	✅match	literal match: the glob contains no wildcards
`he`	`hello`	❌nomatch
`he*`	`hello`	✅match	prefix match
`he*`	`hi hello`	❌nomatch
`*lo`	`hello`	✅match	postfix match
`*lo`	`hello world`	❌nomatch
`he`	`the hello world`	✅match	infix match
`he`	`world`	❌nomatch

👆 ️note: the empty table cells correspond to empty strings “”.

Since initial version, many glob implementations extended its functionality, and one of the most used ones is Bash’s extglob, with new syntax constructs.

Glob vs Regex

ℹ️ Don’t confuse glob with regular expressions(regex): they look alike but aren’t the same.

Unlike glob, regex matches a substring unless expressed to match start or end of the input.

Comparison of glob and the corresponding regular expression:

Glob	Regex	Description
	`^$`	match empty string
`*`	`.*`	match everything
`.`	`\.`	the dot is treated literally by glob, but it needs to be escaped for regex interpreters
`?`	`.`	match any single symbol
`he`	`^he$`	exact match
`he*`	`^he.*`	prefix
`*he`	`.*he$`	postfix
`he`	`he`	infix. Note regex is shorter this time

✋ Shells treat inputs with leading . specially: the * glob doesn’t match those, unless explicitly specified, ie:
ls * doesn’t list “hidden” files starting with ., but
ls .* lists

Implementation

Given that now we understand the DSL, how would we incorporate glob matching into software? We need to implement it.

Russ Cox posted about glob including the algorithm which is both elegant and performant (there are many with exponential runtime-s out there).

Below Go snippet is simplified to only handle * wildcard. I’ve commented it while studying the implementation, let me know if there’s a better way to think about it.


func match(glob, str string) bool {
	var gx, sx int
	var lastgx, nextsx int
	for gx < len(glob) || sx < len(str) {
		if gx < len(glob) {
			switch c := glob[gx]; c {
			default:
				// literal comparison
				// "consume" if both match
				if sx < len(str) && str[sx] == c {
					gx += 1
					sx += 1
					continue
				}
			case '*':
				// *lo ~ ablalo
				lastgx = gx     // remember the '*' position
				nextsx = sx + 1 // remember the next position in str
				gx += 1         // advance glob position
				// next iteration, will
				// - either match glob[gx] and str[sx], continuing
				// - or no-match
				//   in which case `gx` will be re-set to `lastgx` from above
				//   and sx will point to nextsx, 
        //   effectively "consuming" the no-match by '*'
				continue
			}
		}
		// glob is exhausted
		// try restarting from last '*' and next position in str
		// NOTE: `<=` here: signals "reached the end of the string"
		if 0 < nextsx && nextsx <= len(str) {
			gx = lastgx
			sx = nextsx
			continue
		}
		return false
	}

	return true
}

Algorithm Walkthrough

As an example, below is a matching sequence with glob: *lo* and input holo.

Glob	Input	Details
`lo`	`holo`	initial state
`lo*`	`holo`	`*` consumed; remembering its position and next position in string. Try match leftmost symbols.
`lo`	`olo`	no-match in previous iteration: recover `*` position and consume non-matched symbol by moving to the next input symbol as set remembered in the previous step
`lo*`	`olo`	`*` consumed; remember again; attempt the literal match
`lo`	`lo`	no-match again; recover last * and move to the next symbol
`lo*`	`lo`	`*` consumed; also literal match; consume both
`o*`	`o`	literal match; continue
`*`		consume `*`; remember its position again
		it’s a match: both glob and the input string reached the end simultaneously

Summary

glob matching is a useful tool. Despite limited expressiveness, it’s a preferred solution for certain use-cases due to grammar simplicity and good UX.

I’ve picked it up intuitively, while using shell, without really thinking through how it operates, this post addressed the gap.

Next up: regular expressions?