Unbounded repetition
It’s not uncommon to want to match some unbounded repetition. For example, suppose we had a text file containing the text of a book, and we wanted to get all of the words out of it—ignoring punctuation or numbers. What is a word? We might say it’s a sequence of one-or-more letters, with no upper bound.
Alternatively, suppose we wanted to find all of the numbers in a file that mixed text and numbers. We might look for runs of one-or-more digits, ignoring the rest.
Finally, we might want to allow an arbitrary amount of whitespace between things (e.g., a credit card number is four groups of four digits, and the spaces don’t matter). Here we might want to match against zero-or-more whitespace characters.
All three of these examples rely on unbounded repetition. We can express unbounded repetition using curly braces by just leaving the upper number off, as in [a-zA-Z]{1,}.
The {1,} means one-or-more iterations of [a-zA-Z], i.e., Latin letters. The ...{1,} pattern is so common that it can be written directly as ...+, so [a-zA-Z]{1,} becomes [a-zA-Z]+.
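As a quick check that the two spellings agree, here is a minimal sketch in Python's re module. It assumes Python is the regex flavor in use, and the sample sentence is made up for illustration.

```python
import re

# Hypothetical sample text, not taken from any particular book.
text = "Call me Ishmael. Some years ago, in 1851, I went to sea!"

# Open-ended brace form: one or more Latin letters, no upper bound.
words_braces = re.findall(r"[a-zA-Z]{1,}", text)

# Shorthand form: + means the same thing as {1,}.
words_plus = re.findall(r"[a-zA-Z]+", text)

print(words_plus)
print(words_braces == words_plus)
```

Both calls return the same list of words. The punctuation and the number 1851 are skipped because they contain no Latin letters.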
Similarly, the zero-or-more pattern can be written as ...*; this operator is called the Kleene star, named after Stephen Cole Kleene. (The ...+ is sometimes called the Kleene plus.)
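The other two motivating examples, numbers in mixed text and optional whitespace in a credit card number, might be sketched as follows. This again assumes Python's re module. The sample strings, the CARD pattern, and the looks_like_card helper are illustrative names, not part of the original text.

```python
import re

# Hypothetical sample text mixing words and numbers.
mixed = "Order 66 shipped 3 crates on day 12"

# + : one or more digits, with no upper bound on the run length.
numbers = re.findall(r"[0-9]+", mixed)
print(numbers)

# * : zero or more whitespace characters between the groups of four digits.
# \s is the whitespace class; {4} is ordinary bounded repetition.
CARD = re.compile(r"[0-9]{4}\s*[0-9]{4}\s*[0-9]{4}\s*[0-9]{4}")

def looks_like_card(s):
    """Return True if s is four groups of four digits, spaced however."""
    return CARD.fullmatch(s) is not None

print(looks_like_card("1234 5678 9012 3456"))
print(looks_like_card("1234567890123456"))
print(looks_like_card("1234 5678 9012"))
```

Because \s* also accepts zero characters, the same pattern covers the spaced and the unspaced forms. This only checks the shape of the input; it does not validate a real card number.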
For example, here’s a regular expression that matches many floating point numbers: -?(?:\d*\.\d+|\d+)
There’s a lot going on here! Let’s piece it apart:
- Optional negative sign up front (-?)
- An overall non-capturing group (?:...), with two alternatives separated by |
  - \d*\.\d+ is zero-or-more digits, a decimal point, and one or more digits
  - \d+ is just one or more digits
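Putting the pieces listed above back together gives a pattern along the lines of -?(?:\d*\.\d+|\d+). The sketch below assumes that reconstruction and Python's re module; the FLOAT name and the test strings are illustrative.

```python
import re

# Reconstructed from the description above (an assumption, not a quoted pattern):
#   -?            optional negative sign
#   (?: ... )     non-capturing group with two alternatives
#   \d*\.\d+      zero-or-more digits, a decimal point, one-or-more digits
#   \d+           just one-or-more digits
FLOAT = re.compile(r"-?(?:\d*\.\d+|\d+)")

accepted = ["3.14", "-0.5", ".25", "42", "-7"]
rejected = ["", "-", "abc", "1.", "1e5"]

for s in accepted:
    print(repr(s), FLOAT.fullmatch(s) is not None)
for s in rejected:
    print(repr(s), FLOAT.fullmatch(s) is not None)

# Because the group is non-capturing, findall returns the whole match.
found = FLOAT.findall("Temp fell from 3.5 to -2 and then .75 overnight")
print(found)
```

The rejected cases show why the text says "many" rather than "all": a trailing decimal point such as 1. and exponent notation such as 1e5 are not covered by this pattern.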