Regular expression syntax
Introduction
Regular expressions are a widely used way of describing patterns for searching text and for checking whether text matches a pattern. Special metacharacters let you add conditions to a pattern, for example that a substring must occur at the beginning of the input string.
Simple comparison
Any character matches itself unless it is a metacharacter. A sequence of characters matches the same sequence in the input string, so the pattern "foobar" matches the substring "foobar" in the input string.
To make a metacharacter or an escape sequence behave as an ordinary character, put "\" before it. For example, the metacharacter "^" usually means the beginning of a line, but written as "\^" it matches the character "^".
Example:
- foobar finds 'foobar'
- \^FooBarPtr finds '^FooBarPtr'
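The two examples above can be tried out with a short sketch. It uses Python's re module as a stand-in for the engine described here, which is an assumption: the sample text is made up, and only syntax that both dialects share is used.

```python
import re

text = "call ^FooBarPtr then foobar"  # made-up sample input

# A plain pattern matches the same character sequence in the input.
plain = re.search(r"foobar", text)

# "\^" matches a literal caret instead of anchoring to the line start.
escaped = re.search(r"\^FooBarPtr", text)

# Without the backslash "^" is an anchor, so nothing is found mid-string.
unescaped = re.search(r"^FooBarPtr", text)

print(plain.group())    # foobar
print(escaped.group())  # ^FooBarPtr
print(unescaped)        # None
```

Matching is case-sensitive here, which is why "foobar" is found at the end of the text and not inside "FooBarPtr".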
Escape sequences
Any character can be specified with an escape sequence, as in Perl or C.
For example, \xnn, in which nn is a sequence of hexadecimal digits, means the character with ASCII code nn. To specify a double-byte (Unicode) character, use the format '\x{nnnn}', in which 'nnnn' is one or more hexadecimal digits.
- \xnn - the character with hexadecimal code nn
- \x{nnnn} - the character with hexadecimal code nnnn (more than 1 byte is possible only in Unicode)
- \t - tab (HT/TAB), also \x09
- \n - new line (NL), also \x0a
- \r - carriage return (CR), also \x0d
- \f - form feed (FF), also \x0c
- \a - bell (BEL), also \x07
- \e - escape (ESC), also \x1b
Example:
- foo\x20bar finds 'foo bar' (note a blank space between the words)
- \tfoobar finds 'foobar' preceded by a tab
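A minimal sketch of these escapes, again using Python's re module as a stand-in (an assumption, not the engine documented here). Python accepts \xnn, \t, \n, \r, \f and \a in patterns but does not accept \x{nnnn} or \e, so those two forms are left out.

```python
import re

# "\x20" is the space character written by its hexadecimal code.
space_match = re.search(r"foo\x20bar", "say foo bar")

# "\t" matches a tab, so 'foobar' is found only when a tab precedes it.
tab_match = re.search(r"\tfoobar", "key\tfoobar")
no_tab = re.search(r"\tfoobar", "key foobar")

# The named escape and the hexadecimal form denote the same character.
same_char = re.fullmatch(r"\x09", "\t") is not None

print(space_match.group())  # foo bar
print(no_tab)               # None
print(same_char)            # True
```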
Character classes
You can define a character class by enclosing a list of characters in square brackets. The class matches any single character from the list. If the first character of the class is "^", the class matches any character not in the list.
Example:
- foob[aeiou]r finds 'foobar', 'foober' and so on, but not 'foobbr', 'foobcr' etc.
- foob[^aeiou]r finds 'foobbr', 'foobcr' and so on, but not 'foobar', 'foober' etc.
Inside the list, the character "-" defines a range of characters. For example, a-z stands for all the characters between "a" and "z".
To include the character "-" itself in the class, put it at the beginning or the end of the list or put "\" before it. To include "]" in the class, put it at the very beginning of the list or put "\" before it.
Example:
- [-az] finds 'a', 'z' and '-'
- [az-] finds 'a', 'z' and '-'
- [a\-z] finds 'a', 'z' and '-'
- [a-z] finds all 26 lowercase Latin letters from 'a' to 'z'
- [\n-\x0D] finds any of the characters #10, #11, #12, #13
- [\d-t] finds a digit, '-' or 't'
- []-a] finds a character from the range ']'..'a'
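The bracket rules above can be checked with a small sketch in Python's re module, used here as a stand-in under the assumption that its bracket syntax agrees with this dialect for these cases. The helper `matches` is hypothetical and only filters a list of candidate strings. The [\d-t] example is omitted because Python rejects a shorthand such as \d as a range endpoint.

```python
import re

def matches(pattern, words):
    """Return the words that the pattern matches in full (illustrative helper)."""
    return [w for w in words if re.fullmatch(pattern, w)]

words = ["foobar", "foober", "foobbr", "foobcr"]
vowel = matches(r"foob[aeiou]r", words)       # a vowel in the fifth position
non_vowel = matches(r"foob[^aeiou]r", words)  # anything except a vowel there

# A hyphen that is first, last or escaped is literal; between two characters it is a range.
chars = ["a", "m", "z", "-"]
literal_hyphen = [matches(p, chars) for p in (r"[-az]", r"[az-]", r"[a\-z]")]
letter_range = matches(r"[a-z]", chars)

# "]" placed first is literal, so this is the range from "]" to "a".
bracket_range = matches(r"[]-a]", ["]", "^", "_", "a", "b", "["])

print(vowel)          # ['foobar', 'foober']
print(letter_range)   # ['a', 'm', 'z']
print(bracket_range)  # [']', '^', '_', 'a']
```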
Metacharacters
Metacharacters fall into several groups, described below.
Metacharacters - line separators
- ^ - beginning of line
- $ - end of line
- \A - beginning of text
- \Z - end of text
- . - any character in the line
Example:
- ^foobar - finds 'foobar' if it's at the beginning of the line
- foobar$ - finds 'foobar' if it's at the end of the line
- ^foobar$ - finds 'foobar' if it's the only word in the line
- foob.r - finds 'foobar', 'foobbr', 'foob1r' etc.
By default, the metacharacter "^" matches only at the beginning of the input text and "$" only at the end. Line separators inside the text are not matched by "^" or "$". To treat the text as multiline, so that "^" also matches after every line separator inside the text and "$" before it, switch on modifier /m (see Modifiers below).
The metacharacters \A and \Z are similar to "^" and "$", but modifier /m does not affect them: they always match only at the beginning or the end of the input text.
By default, the metacharacter "." matches any character. If you switch modifier /s off, "." does not match line separators.
"^" matches at the beginning of the input text. If modifier /m is on, it also matches at the point right after \x0D\x0A, \x0A or \x0D. Note that it does not match in the gap inside the sequence \x0D\x0A.
"$" matches at the end of the input text. If modifier /m is on, it also matches at the point right before \x0D\x0A, \x0A or \x0D. Note that it does not match in the gap inside the sequence \x0D\x0A.
"." matches any character, but if modifier /s is off, "." does not match \x0D\x0A, \x0A or \x0D.
Note that "^.*$" (a pattern for an empty line) does not match the empty string inside the sequence \x0D\x0A, but it does match the empty string inside the sequence \x0A\x0D.
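The effect of the multiline and single-line modes can be sketched in Python's re module, taken here as a rough stand-in. Two differences are assumptions to keep in mind: Python's counterpart of /s (re.DOTALL) is off by default, and Python treats only \x0A as a line separator for "^" and "$", so the \x0D\x0A rules above cannot be reproduced with it.

```python
import re

text = "foobar one\nfoobar two"  # made-up two-line input

# By default "^" matches only at the very start of the text.
default_hits = re.findall(r"^foobar", text)

# With the multiline flag it also matches after each line break.
multiline_hits = re.findall(r"^foobar", text, flags=re.MULTILINE)

# "\A" ignores the multiline flag and still matches only at the start.
start_only = re.findall(r"\Afoobar", text, flags=re.MULTILINE)

# In Python "." skips "\n" unless DOTALL (the counterpart of /s) is set.
dot_default = re.search(r"foob.r", "foob\nr")
dot_all = re.search(r"foob.r", "foob\nr", flags=re.DOTALL)

print(len(default_hits), len(multiline_hits), len(start_only))  # 1 2 1
print(dot_default is None, dot_all is not None)                 # True True
```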
Predefined character classes
- \w - an alphanumeric character or "_"
- \W - not \w
- \d - a digit
- \D - not \d
- \s - any whitespace character (by default - [ \t\n\r\f])
- \S - not \s
The predefined classes \w, \d and \s can be used inside character classes.
Example:
- foob\dr - finds 'foob1r', 'foob6r' etc. but not 'foobar', 'foobbr' etc.
- foob[\w\s]r - finds 'foobar', 'foob r', 'foobbr', 'foob1r' etc. but not 'foob=r' etc.
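A short sketch of the predefined classes, using Python's re module as an assumed stand-in. The helper `matches` and the word list are illustrative only. Note that \w includes digits, so 'foob1r' satisfies foob[\w\s]r.

```python
import re

def matches(pattern, words):
    """Return the words that the pattern matches in full (illustrative helper)."""
    return [w for w in words if re.fullmatch(pattern, w)]

words = ["foob1r", "foob6r", "foobar", "foob r", "foob=r"]

digit_only = matches(r"foob\dr", words)         # a digit in the fifth position
word_or_space = matches(r"foob[\w\s]r", words)  # a word character or whitespace
non_word = matches(r"foob\Wr", words)           # anything that is not \w

print(digit_only)     # ['foob1r', 'foob6r']
print(word_or_space)  # ['foob1r', 'foob6r', 'foobar', 'foob r']
print(non_word)       # ['foob r', 'foob=r']
```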
Metacharacters - word boundaries
- \b - matches at a word boundary
- \B - matches anywhere except at a word boundary
A word boundary (\b) is a point between two characters, one of which satisfies \w and the other \W (in either order). The positions before the beginning and after the end of the string are treated as if they held a \W character.
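The boundary rule can be illustrated with a small sketch in Python's re module, assumed here to follow the same \b and \B definition. The sample text is made up, and the sketch reports the positions where each pattern matches.

```python
import re

text = "foo foobar (foo) barfoo"  # made-up sample input

# "\bfoo\b" needs a boundary on both sides, so only standalone words match.
whole_words = [m.start() for m in re.finditer(r"\bfoo\b", text)]

# "\Bfoo" needs a non-boundary before "foo": it must follow a word character.
inside_word = [m.start() for m in re.finditer(r"\Bfoo", text)]

print(whole_words)  # [0, 12]
print(inside_word)  # [20]
```

The match at position 0 shows that the start of the string counts as a boundary when the first character satisfies \w.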
Metacharacters - quantifiers
Any element of a regular expression can be followed by a quantifier.
A quantifier specifies how many times the preceding character, metacharacter or subexpression may be repeated.
- * - zero or more times ("greedy"), like {0,}
- + - one or more times ("greedy"), like {1,}
- ? - zero or one time ("greedy"), like {0,1}
- {n} - exactly n times ("greedy")
- {n,} - at least n times ("greedy")
- {n,m} - at least n but not more than m times ("greedy")
- *? - zero or more times ("non-greedy"), like {0,}?
- +? - one or more times ("non-greedy"), like {1,}?
- ?? - zero or one time ("non-greedy"), like {0,1}?
- {n}? - exactly n times ("non-greedy")
- {n,}? - at least n times ("non-greedy")
- {n,m}? - at least n but not more than m times ("non-greedy")
So {n,m} sets a minimum of n and a maximum of m repetitions. The quantifier {n} is equivalent to {n,n} and sets exactly n repetitions. The quantifier {n,} sets a minimum of n repetitions. In theory n and m can be arbitrarily large, but large values are strongly discouraged because matching may take a long time.
A curly brace that does not form a valid quantifier is treated as an ordinary character.
Example:
- foob.*r - finds 'foobar', 'foobalkjdflkj9r' and 'foobr'
- foob.+r - finds 'foobar', 'foobalkjdflkj9r' but not 'foobr'
- foob.?r - finds 'foobar', 'foobbr' and 'foobr' but not 'foobalkj9r'
- fooba{2}r - finds 'foobaar'
- fooba{2,}r - finds 'foobaar', 'foobaaar', 'foobaaaar' etc.
- fooba{2,3}r - finds 'foobaar' or 'foobaaar' but not 'foobaaaar'
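The quantifier examples above can be reproduced with a minimal sketch in Python's re module, an assumed stand-in that shares this quantifier syntax. The helper `matches` is hypothetical and tests each candidate string in full.

```python
import re

def matches(pattern, words):
    """Return the words that the pattern matches in full (illustrative helper)."""
    return [w for w in words if re.fullmatch(pattern, w)]

words = ["foobr", "foobar", "foobbr", "foobalkjdflkj9r"]
star = matches(r"foob.*r", words)      # zero or more characters before "r"
plus = matches(r"foob.+r", words)      # at least one character before "r"
optional = matches(r"foob.?r", words)  # at most one character before "r"

repeats = ["foobar", "foobaar", "foobaaar", "foobaaaar"]
exactly_two = matches(r"fooba{2}r", repeats)
two_or_more = matches(r"fooba{2,}r", repeats)
two_to_three = matches(r"fooba{2,3}r", repeats)

print(plus)          # ['foobar', 'foobbr', 'foobalkjdflkj9r']
print(optional)      # ['foobr', 'foobar', 'foobbr']
print(two_to_three)  # ['foobaar', 'foobaaar']
```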
A few words about greedy and non-greedy quantifiers. A greedy quantifier takes as much of the input text as possible, a non-greedy one as little as possible.
Example: in the input string "abbbbc", starting at the first "b", both "b+" and "b*" find "bbbb", whereas "b+?" finds only "b" and "b*?" finds an empty string. "b{2,3}?" finds "bb", but "b{2,3}" finds "bbb".
Switch modifier /g off to put all the quantifiers in the expression into non-greedy mode.
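The "abbbbc" example can be verified with a sketch in Python's re module. This is an assumption-laden stand-in: Python has no /g modifier, so non-greedy behaviour is requested per quantifier with the "?" suffix. The hypothetical helper `match_at` anchors each attempt at the first "b", because a plain search with "b*" would succeed earlier with an empty match before "a".

```python
import re

text = "abbbbc"
first_b = text.index("b")  # position of the first "b", here 1

def match_at(pattern, pos):
    """Match the pattern starting exactly at position pos (illustrative helper)."""
    return re.compile(pattern).match(text, pos).group()

greedy = {p: match_at(p, first_b) for p in ("b+", "b*", "b{2,3}")}
lazy = {p: match_at(p, first_b) for p in ("b+?", "b*?", "b{2,3}?")}

print(greedy)  # {'b+': 'bbbb', 'b*': 'bbbb', 'b{2,3}': 'bbb'}
print(lazy)    # {'b+?': 'b', 'b*?': '', 'b{2,3}?': 'bb'}
```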
Metacharacters - alternatives
You can define a list of alternatives, using the metacharacter "|" to separate them. For example, "fee|fie|foe" finds "fee", "fie" or "foe" (just as "f(e|i|o)e" does).
The first alternative includes everything from the metacharacter "(" or "[", or from the beginning of the expression, up to the first "|". The last alternative includes everything from the last "|" up to the end of the expression or the next metacharacter ")". For this reason, a list of alternatives is usually enclosed in parentheses so that it is clear where it starts and ends.
Alternatives are tried from left to right, and the first one that lets the whole expression match is chosen. This means that alternatives are not necessarily greedy. For example, "foo|foot" applied to the input string "barefoot" finds "foo", because that alternative is tried first and already lets the whole expression match.
Note that the metacharacter "|" is treated as an ordinary character inside a character class. For example, [fee|fie|foe] means the same as [feio|].
Example:
- foo(bar|foo) - finds "foobar" or "foofoo".
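A small sketch of how alternatives behave, using Python's re module as a stand-in on the assumption that it tries alternatives in the same left-to-right order. The candidate strings are made up for illustration.

```python
import re

# Alternatives are tried left to right; the first that succeeds wins.
first_wins = re.search(r"foo|foot", "barefoot").group()
longer_first = re.search(r"foot|foo", "barefoot").group()

# Parentheses limit the alternation to part of the pattern.
grouped = [w for w in ("foobar", "foofoo", "foobaz")
           if re.fullmatch(r"foo(bar|foo)", w)]

# Inside a character class "|" is an ordinary character.
same_class = all(
    bool(re.fullmatch(r"[fee|fie|foe]", c)) == bool(re.fullmatch(r"[feio|]", c))
    for c in "feio|abz"
)

print(first_wins, longer_first)  # foo foot
print(grouped)                   # ['foobar', 'foofoo']
print(same_class)                # True
```

Putting the longer alternative first is the usual way to make "foot" win over "foo".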
Metacharacters - subexpressions
The metacharacters ( ... ) can also be used to define subexpressions, which a quantifier can then repeat as a whole.
Example:
- (foobar){8,10} - finds the string with 8, 9 or 10 copies of "foobar"
- foob([0-9]|a+)r - finds "foob0r", "foob1r", "foobar", "foobaar" etc.
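The two subexpression examples can be sketched in Python's re module, assumed to treat groups the same way for these cases. The second part also reads back what the group matched, which is a feature of the Python API and not something this page describes.

```python
import re

# A quantifier after a group repeats the whole group.
repeated = re.compile(r"(foobar){8,10}")
eight = repeated.fullmatch("foobar" * 8) is not None
seven = repeated.fullmatch("foobar" * 7) is not None
eleven = repeated.fullmatch("foobar" * 11) is not None

# A group can hold alternatives, and its matched text can be read back.
inner = re.compile(r"foob([0-9]|a+)r")
captured = [inner.fullmatch(w).group(1) for w in ("foob0r", "foobar", "foobaar")]

print(eight, seven, eleven)  # True False False
print(captured)              # ['0', 'a', 'aa']
```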
Metacharacters - backreferences
The metacharacters \1 to \9 are interpreted as backreferences. \<n> matches the text previously matched by subexpression #<n>.
Example:
- (.)\1+ - finds "aaaa" and "cc"
- (.+)\1+ - also finds "abab" and "123123"
- (['"]?)(\d+)\1 - finds "13" (in double quotes), '4' (in single quotes), 77 (without quotes) etc.
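The backreference examples can be checked with a sketch in Python's re module, which is assumed to share the \1 syntax. The input strings are made up. The last line shows that a closing quote of a different kind makes the match fail.

```python
import re

# "(.)\1+" finds a character followed by one or more copies of itself.
runs = [m.group() for m in re.finditer(r"(.)\1+", "aaaa b cc d")]

# "(.+)\1+" finds a repeated substring of any length.
repeated = [m.group() for m in re.finditer(r"(.+)\1+", "abab 123123")]

# The closing quote must equal whatever the first group captured.
quoted = re.compile(r"""(['"]?)(\d+)\1""")
numbers = [quoted.fullmatch(s).group(2) for s in ('"13"', "'4'", "77")]
mismatched = quoted.fullmatch("\"13'")

print(runs)        # ['aaaa', 'cc']
print(repeated)    # ['abab', '123123']
print(numbers)     # ['13', '4', '77']
print(mismatched)  # None
```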
Modifiers
Modifiers change how the regular expression engine behaves.
Any modifier can be changed inside a regular expression with the (?...) construct.
#i
Case-insensitive matching (by default it uses the locale selected in the operating system).
#m
Treat the input text as multiline: the metacharacters "^" and "$" match not only at the beginning and the end of the text but also at the beginning and the end of every line in the text.
#s
Treat the input text as a single line: the metacharacter "." matches any character. If this modifier is off, "." does not match line separators.
#g
Greedy mode (on by default). Switch it off to put all the quantifiers into non-greedy mode: when the modifier is off, every '+' works as '+?', every '*' as '*?' etc.
#r
Non-standard modifier. If it is on, it changes which letters are covered by ranges like a-Z.
#(?imsxr-imsxr)
Changes the values of modifiers inside the regular expression.
Example:
- (?i)Saint-Petersburg - finds 'Saint-petersburg' and 'Saint-Petersburg'
- (?i)Saint-(?-i)Petersburg - finds 'Saint-Petersburg' but not 'Saint-petersburg'
- (?i)(Saint-)?Petersburg - finds 'Saint-petersburg' and 'saint-petersburg'
- ((?i)Saint-)?Petersburg - finds 'saint-Petersburg', but not 'saint-petersburg'
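Inline modifiers can be approximated in Python's re module, though the syntax is not identical, so this is only a sketch under stated assumptions. Python accepts a bare (?i) only at the start of a pattern and uses scoped groups, (?i:...) and (?-i:...), where the examples above switch a modifier in the middle of the expression. The helper `matches` and the list of names are illustrative.

```python
import re

def matches(pattern, words):
    """Return the words that the pattern matches in full (illustrative helper)."""
    return [w for w in words if re.fullmatch(pattern, w)]

names = ["Saint-Petersburg", "Saint-petersburg",
         "saint-Petersburg", "saint-petersburg"]

# A leading (?i) makes the whole pattern case-insensitive.
whole = matches(r"(?i)Saint-Petersburg", names)

# (?i:...) limits case-insensitivity to the group, like ((?i)Saint-)?Petersburg.
prefix_only = matches(r"(?i:Saint-)?Petersburg", names)

# (?-i:...) switches the modifier off for part of the pattern,
# like (?i)Saint-(?-i)Petersburg.
suffix_exact = matches(r"(?i)Saint-(?-i:Petersburg)", names)

print(len(whole))    # 4
print(prefix_only)   # ['Saint-Petersburg', 'saint-Petersburg']
print(suffix_exact)  # ['Saint-Petersburg', 'saint-Petersburg']
```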