| Meta | Meaning |
|---|---|
| . | Matches any character. |
| pq | Matches the concatenation of the patterns p and q. This is an invisible infix operator; there is no meta character. |
| p* | Matches the preceding pattern p zero or more times (Kleene star). |
| p+ | Matches the preceding pattern p one or more times (Kleene plus). |
| p? | Matches the preceding pattern p zero times or one time. |
| p\|q | An infix operator. Matches if at least one of the operand patterns p or q matches. This is called alternation. |
| p{m} | Matches the preceding pattern exactly m times. |
| p{m,} | Matches the preceding pattern m or more times. |
| p{m,n} | Matches the preceding pattern from m up to n times. |
| () | Grouping (overrides operator precedence). |
| {} | Escape sequences. |
| [] | Character classes. |
| (*...) | A marked group; its match is added to the list of groups. |


| Escape sequence | Meaning |
|---|---|
| {s} | "\s" (space) |
| {t} | "\t" (tabulator) |
| {n} | "\n" (newline) |
| {r} | "\r" (carriage return) |
| {.} | "." (dot) |
| {*} | "*" (asterisk) |
| {+} | "+" (plus) |
| {(} | "(" (left parenthesis) |
| {L} | "{" (left curly bracket) |
| {R} | "}" (right curly bracket) |
| {a} | [A-Za-z] (alphabetic ASCII characters) |
| {d} | [0-9] (ASCII digits) |
| {x} | [0-9A-Fa-f] (hexadecimal ASCII digits) |
| {l} | Lowercase ASCII letters |
| {u} | Uppercase ASCII letters |
| {ua} | Alphabetic Unicode characters |
| {ul} | Lowercase Unicode letters |
| {uu} | Uppercase Unicode letters |
| {g} | [\u{21}-\u{7e}] (graphical: visible ASCII) |
| {_} | Whitespace |
| {B} | Beginning of the string |
| {E} | End of the string |
| {LB} | Beginning of a line |
| {LE} | End of a line |

use regex: re
# A literal string matches itself
> re("café").match("café")
true
# Matches if any of two alternatives occur:
> r = re("moon|soon")
> [r.match("moon"), r.match("soon")]
[true, true]
# Concatenation binds more tightly than '|',
# but grouping overrides this.
> r = re("(m|s)oon")
> [r.match("moon"), r.match("soon")]
[true, true]
# A digit
re("0|1|2|3|4|5|6|7|8|9")
# A digit, expressed as a character class
re("[0-9]")
# A digit, shortest notation
re("{d}")
# A binary literal
re("(0|1)+")
# A binary literal, digits expressed as a character class
re("[01]+")
# A date (year-month-day)
re("{d}{d}{d}{d}-{d}{d}-{d}{d}")
# A date, leading zeros not needed
re("{d}{d}?{d}?{d}?-{d}{d}?-{d}{d}?")
# A date, using advanced quantifiers
re("{d}{4}-{d}{2}-{d}{2}")
re("{d}{1,4}-{d}{1,2}-{d}{1,2}")
# Muhkuh (moo(ing)? cow)
> re("Mu+h+").match("Muuuuuhhh")
true
# Kleene star
> r = re("(x|y)*")
> ["", "x", "y", "xx", "xy", "yx", "yy",
   "xxx", "xxy", "xxyxxxyyx"].all(|s| r.match(s))
true
# Integer literals
> r = re("[+-]?{d}+")
> r.match("-12")
true
# Simple floating point literals
re("[+-]?{d}+{.}{d}+")
# Full floating point literals
re("[+-]?({d}+({.}{d}*)?|{.}{d}+)([Ee][+-]?{d}+)?")
# Whitespace has no meaning inside of a regular expression
re("""
  [+-]?
  ( {d}+ ({.} {d}*)?
  | {.} {d}+
  )
  ([Ee] [+-]? {d}+)?
""")
# Whitespace has to be stated explicitly
> r = re("a{s}*b")
> r.match("a\s\s\s\s\sb")
true
> r.match("a\s\tb")
false
Exercise for the reader: state a pattern for full floating point literals that excludes integer literals.
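One possible answer is sketched here in Python's standard `re` module rather than in the notation of this document, so that it can be run and checked independently. The idea is to split the pattern into three alternatives, each of which requires either a dot or an exponent. The name `is_float_literal` is hypothetical, and Python writes `\d` and `\.` where this document writes `{d}` and `{.}`.

```python
import re

# A sketch in Python's `re` syntax, not the notation of this document.
# Every alternative contains a dot or an exponent, so a bare integer
# such as "12" cannot match.
FLOAT_ONLY = re.compile(r"""
    [+-]?
    (   \d+ \. \d* ([Ee][+-]?\d+)?   # digits and a dot; fraction and exponent optional
    |   \. \d+     ([Ee][+-]?\d+)?   # leading dot; exponent optional
    |   \d+         [Ee][+-]?\d+     # no dot, so the exponent is mandatory
    )
""", re.VERBOSE)

def is_float_literal(s: str) -> bool:
    """Return True if the whole string is a float literal but not an integer."""
    return FLOAT_ONLY.fullmatch(s) is not None

for candidate in ["12", "12.", ".5", "1e5", "-3.14E-2"]:
    print(candidate, is_float_literal(candidate))
```

Other answers are possible; this one accepts the same strings as the full floating point pattern above, minus those that are plain integer literals.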
# Matches a single character from a list.
re("(a|b|c|d|1|2)")
# Such a list may be written more briefly as a character class.
re("[abcd12]")
# And ranges of characters can be stated,
# using range notation.
re("[a-d12]")
# Any span from one Unicode code point up to another
# Unicode code point can be such a range.
# For example, the Greek alphabet is:
re("[\u{0391}-\u{03a9}\u{03b1}-\u{03c9}]")
# That is:
re("[Α-Ωα-ω]")
# Escape sequences can occur inside of character classes.
re("[{d}{a}]")
# This is the same as:
re("[0-9A-Za-z]")
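To relate the brace notation used so far to conventional backslash syntax, the following Python sketch translates a subset of it into a pattern for Python's standard `re` module. The names `ESCAPES` and `to_python_regex` are hypothetical, the translation table is an assumption derived from the escape sequence table above, and anchors, Unicode classes and marked groups are deliberately left out. It is not how the regex library of this document is implemented.

```python
import re

# Assumed mapping from brace escapes to Python `re` syntax.
# Each entry is (text outside a class, text inside a [...] class).
ESCAPES = {
    "s": (" ", " "),
    "t": (r"\t", r"\t"),
    "n": (r"\n", r"\n"),
    "r": (r"\r", r"\r"),
    ".": (r"\.", "."),
    "*": (r"\*", "*"),
    "+": (r"\+", "+"),
    "(": (r"\(", "("),
    "L": (r"\{", "{"),
    "R": (r"\}", "}"),
    "a": ("[A-Za-z]", "A-Za-z"),
    "d": ("[0-9]", "0-9"),
    "x": ("[0-9A-Fa-f]", "0-9A-Fa-f"),
    "l": ("[a-z]", "a-z"),
    "u": ("[A-Z]", "A-Z"),
    "g": ("[!-~]", "!-~"),
}

def to_python_regex(pattern: str) -> str:
    """Translate a subset of the brace notation into a Python regex string."""
    out = []
    in_class = False
    i = 0
    while i < len(pattern):
        c = pattern[i]
        if c.isspace():
            i += 1  # whitespace carries no meaning in the source notation
        elif c == "{":
            j = pattern.index("}", i)
            name = pattern[i + 1:j]
            if name in ESCAPES:
                out.append(ESCAPES[name][1 if in_class else 0])
            elif re.fullmatch(r"\d+(,\d*)?", name):
                out.append("{" + name + "}")  # quantifier such as {4} or {1,2}
            else:
                raise ValueError(f"unsupported escape: {{{name}}}")
            i = j + 1
        else:
            if c == "[":
                in_class = True
            elif c == "]":
                in_class = False
            out.append(c)
            i += 1
    return "".join(out)

date = to_python_regex("{d}{4}-{d}{2}-{d}{2}")   # "[0-9]{4}-[0-9]{2}-[0-9]{2}"
print(date, bool(re.fullmatch(date, "2016-10-14")))
```

The sketch uses `re.fullmatch` because the `match` method in the examples above appears to test the whole string, which is an inference from those examples rather than a documented fact.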
Often one wants to find all non-overlapping occurrences
of a pattern in a string. If r is a regex,
then r.list(s) returns the list of
all non-overlapping occurrences of r in s.
use regex: re
word = re("{a}+")
text = "The quick brown fox jumps over the lazy dog."
print(word.list(text))
# Output:
# ["The", "quick", "brown", "fox", "jumps",
# "over", "the", "lazy", "dog"]
Groups can be extracted from a string according to a
regular expression. A group is formed by a pair of parentheses that
has an asterisk after the opening parenthesis.
The method groups returns null if the regex does not match,
and otherwise the list of groups.
use regex: re
r = re("(*{d}{d}{d}{d})-(*{d}{d})-(*{d}{d})")
while true
  s = input("Date: ")
  t = r.groups(s)
  if t is null
    print("A well formed date please!")
  else
    print(t)
  end
end
# Date: 2016-10-14
# ["2016", "10", "14"]
Rather than being returned as a list, each non-overlapping
match x may instead be replaced by f(x).
The method r.replace(s,f) does this.
use regex: re
text = "The quick brown fox jumps over the lazy dog."
print(re("{a}+").replace(text,|x| "["+x+"]"))
# [The] [quick] [brown] [fox] [jumps] [over] [the] [lazy] [dog].
use regex: re
function tokenizer(d)
  r = re(d["r"])
  f = d["f"] if "f" in d else null
  return fn|s|
    a = r.list(s)
    return a if f is null else a.map(f)
  end
end
words = tokenizer({
  r = "{a}+"
})
integers = tokenizer({
  r = "{d}+",
  f = int
})
numbers = tokenizer({
  r = "({d}|{.})+",
  f = |x| float(x) if '.' in x else int(x)
})
for line in input
  a = numbers(line)
  print(a)
end
use regex: re
function bind_regex(rs)
  r = re(rs)
  return |s| r.match(s)
end
isalpha_german = bind_regex("[A-Za-zÄÖÜäöüß]*")
isalpha_latin = bind_regex("""[
  A-Z a-z
  \u{00c0}-\u{00d6}
  \u{00d8}-\u{00f6}
  \u{00f8}-\u{024f}
]*""")
As one can see, the range of Latin letters in Unicode is
interrupted by two mathematical operators: the multiplication
sign (\u{d7}) and the division sign (\u{f7}). Matching upper
and lower case is even more complicated. Furthermore, a letter
may in general be followed by combining characters.
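Both observations can be checked against the Unicode character database. The following sketch uses Python's standard `unicodedata` module, not the regex library of this document; the helper name `categories` is hypothetical. It shows that the two gap code points are classified as math symbols while their neighbours are letters, and that a letter written with a combining accent occupies two code points.

```python
import unicodedata

def categories(text: str) -> list[str]:
    """Return the Unicode general category of every code point in text."""
    return [unicodedata.category(ch) for ch in text]

# U+00D7 and U+00F7 are math symbols (Sm); their neighbours are letters (Lu/Ll).
gap = categories("\u00d6\u00d7\u00d8\u00f6\u00f7\u00f8")
print(gap)

# "é" can be a single code point, or "e" followed by a combining acute accent (Mn).
composed = "\u00e9"
decomposed = "e\u0301"
print(categories(decomposed))

# NFC normalization composes the pair into the single code point.
same_after_nfc = unicodedata.normalize("NFC", decomposed) == composed
print(same_after_nfc)
```

A character class such as the one in isalpha_latin therefore accepts the composed form but would reject the combining accent of the decomposed form, so normalizing input before matching is one way to narrow the gap.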