Ansicht
Dokumentation

ABENREGEX_PCRE_SYNTAX_SPECIALS - REGEX PCRE SYNTAX SPECIALS

ABENREGEX_PCRE_SYNTAX_SPECIALS - REGEX PCRE SYNTAX SPECIALS

BAL Application Log Documentation   General Data in Customer Master  
This documentation is copyright by SAP AG.
SAP E-Book

- Special Characters

The following tables summarize the special characters in PCRE regular expressions.

Note

See also PCRE2 documentation pcre2syntax man page.

Pattern Syntax

Quoting

Syntax Description
\x handle x as a literal if x has no special meaning
\Q...\E, handle enclosed characters as literal

Escaped Characters

Syntax Description
\a alarm (BEL character, 0x07)
\cx control-x, where x is any ASCII printing character
\e escape (0x1B)
\f form feed (0x0C)
\n line feed (by default 0x0A; depends on the active line-feed-mode)
\r carriage return (0x0D)
\t tab (0x09)
\0dd character with octal code 0dd
\ddd character with octal code ddd, or backreference
\o{ddd..} character with octal code ddd..
\N{U+hh..} character with Unicode code point hh..
\xhh character with hex code hh
\x{hh..} character with hex code hh..

Character Types

Syntax Description
. any character except line feed (unless in dotall mode, then any character)
\C one code unit; only allowed in regular expressions created with the class CL_ABAP_REGEX with UNICODE_HANDLING set to RELAXED, as it could partially match UTF-16 characters otherwise
\d a digit (respecting Unicode character properties)
\D a character that is not a digit
\h a horizontal white space character (respecting Unicode character properties)
\H a character that is not a horizontal white space character
\N a character that is not a line feed (depends on the active line-feed-mode)
\p{xx} a character with the xx Unicode character property (see below)
\P{xx} a character without the xx Unicode character property (see below)
\R a line feed sequence; by default matches any Unicode line feed sequence
\s a white space character (respecting Unicode character properties)
\S a character that is not a white space character
\v a vertical white space character (respecting Unicode character properties)
\V a character that is not a vertical white space character
\w a "word" character (respecting Unicode character properties)
\W a "non-word" character
\X a Unicode extended grapheme cluster

General Categories for Properties \p and \P

Based on the general categories as defined by the Unicode standard.

Category Identifier Description
C Other
Cc Control
Cf Format
Cn Unassigned
Co Private use
Cs Surrogate
L Letter
Ll Lower case letter
Lm Modifier letter
Lo Other letter
Lt Title case letter
Lu Upper case letter
L& Ll, Lu, or Lt
M Mark
Mc Spacing mark
Me Enclosing mark
Mn Non-spacing mark
N Number
Nd Decimal number
Nl Letter number
No Other number
P Punctuation
Pc Connector punctuation
Pd Dash punctuation
Pe Close punctuation
Pf Final punctuation
Pi Initial punctuation
Po Other punctuation
Ps Open punctuation
S Symbol
Sc Currency symbol
Sk Modifier symbol
Sm Mathematical symbol
So Other symbol
Z Separator
Zl Line separator
Zp Paragraph separator
Zs Space separator

Script Names for \p and \P

  • Adlam
  • Ahom
  • Anatolian_Hieroglyphs
  • Arabic
  • Armenian
  • Avestan
  • Balinese
  • Bamum
  • Bassa_Vah
  • Batak
  • Bengali
  • Bhaiksuki
  • Bopomofo
  • Brahmi
  • Braille
  • Buginese
  • Buhid
  • Canadian_Aboriginal
  • Carian
  • Caucasian_Albanian
  • Chakma
  • Cham
  • Cherokee
  • Chorasmian
  • Common
  • Coptic
  • Cuneiform
  • Cypriot
  • Cyrillic
  • Deseret
  • Devanagari
  • Dives_Akuru
  • Dogra
  • Duployan
  • Egyptian_Hieroglyphs
  • Elbasan
  • Elymaic
  • Ethiopic
  • Georgian
  • Glagolitic
  • Gothic
  • Grantha
  • Greek
  • Gujarati
  • Gunjala_Gondi
  • Gurmukhi
  • Han
  • Hangul
  • Hanifi_Rohingya
  • Hanunoo
  • Hatran
  • Hebrew
  • Hiragana
  • Imperial_Aramaic
  • Inherited
  • Inscriptional_Pahlavi
  • Inscriptional_Parthian
  • Javanese
  • Kaithi
  • Kannada
  • Katakana
  • Kayah_Li
  • Kharoshthi
  • Khitan_Small_Script
  • Khmer
  • Khojki
  • Khudawadi
  • Lao
  • Latin
  • Lepcha
  • Limbu
  • Linear_A
  • Linear_B
  • Lisu
  • Lycian
  • Lydian
  • Mahajani
  • Makasar
  • Malayalam
  • Mandaic
  • Manichaean
  • Marchen
  • Masaram_Gondi
  • Medefaidrin
  • Meetei_Mayek
  • Mende_Kikakui
  • Meroitic_Cursive
  • Meroitic_Hieroglyphs
  • Miao
  • Modi
  • Mongolian
  • Mro
  • Multani
  • Myanmar
  • Nabataean
  • Nandinagari
  • New_Tai_Lue
  • Newa
  • Nko
  • Nushu
  • Nyakeng_Puachue_Hmong
  • Ogham
  • Ol_Chiki
  • Old_Hungarian
  • Old_Italic
  • Old_North_Arabian
  • Old_Permic
  • Old_Persian
  • Old_Sogdian
  • Old_South_Arabian
  • Old_Turkic
  • Oriya
  • Osage
  • Osmanya
  • Pahawh_Hmong
  • Palmyrene
  • Pau_Cin_Hau
  • Phags_Pa
  • Phoenician
  • Psalter_Pahlavi
  • Rejang
  • Runic
  • Samaritan
  • Saurashtra
  • Sharada
  • Shavian
  • Siddham
  • SignWriting
  • Sinhala
  • Sogdian
  • Sora_Sompeng
  • Soyombo
  • Sundanese
  • Syloti_Nagri
  • Syriac
  • Tagalog
  • Tagbanwa
  • Tai_Le
  • Tai_Tham
  • Tai_Viet
  • Takri
  • Tamil
  • Tangut
  • Telugu
  • Thaana
  • Thai
  • Tibetan
  • Tifinagh
  • Tirhuta
  • Ugaritic
  • Vai
  • Wancho
  • Warang_Citi
  • Yezidi
  • Yi
  • Zanabazar_Square

Character Classes

Syntax Description
[...] positive character class
[^...] negative character class
[x-y] range
[[:xxx:]] positive POSIX named set (see below)
[[:^xxx:]] negative POSIX named set (see below)

Names for POSIX Named Sets

Syntax Description
alnum alphanumeric
alpha alphabetic
ascii 0-127
blank space or tab
cntrl control character
digit decimal digit
graph printing, excluding space
lower lower case letter
print printing, including space
punct printing, excluding alphanumeric
space white space
upper upper case letter
word same as \w
xdigit hexadecimal digit

POSIX named sets also make use of Unicode character properties if applicable.

Quantifiers

Syntax Description
? 0 or 1, greedy
?+ 0 or 1, possessive
?? 0 or 1, lazy
* 0 or more, greedy
*+ 0 or more, possessive
*? 0 or more, lazy
+ 1 or more, greedy
++ 1 or more, possessive
+? 1 or more, lazy
{n} exactly n
{n,m} at least n, no more than m, greedy
{n,m}+ at least n, no more than m, possessive
{n,m}? at least n, no more than m, lazy
{n,} n or more, greedy
{n,}+ n or more, possessive
{n,}? n or more, lazy

Anchors and Basic Assertions

Syntax Description
\b word boundary
\B not a word boundary
^ start of subject (also after an internal line feed, that is a line feed that does not occur at the end of the subject, in multiline mode)
\A start of subject (if matching on a subject is done with a starting offset, \A can never match)
$ end of subject and before a line feed at the end of the subject (also before line feed in multiline mode)
\Z end of subject and before a line feed at the end of the subject
\z end of subject
\G first matching position in subject (true if the current matching position is at the start point of the matching process, which may differ from the start of the subject e.g. if a starting offset is specified)

Reported Match Point Setting

Syntax Description
\K set reported start of match; e.g. the regex foo\Kbar matches foobar but reports that it has matched only bar

\K is respected in positive assertions, but ignored in negative ones.

Alternation

Syntax Description
| start of alternative branch

Grouping and Capturing

Syntax Description
(...) capture group
(?<name>...) named capture group (Perl style)
(?'name'...) named capture group (Perl style)
(?P?<name>...) named capture group (Python style)
(?:...) non-capture group
(?|...) non-capture group; reset group numbers for capture groups in each alternative
(?>...) atomic non-capture group
(*atomic:...) atomic non-capture group

A name must not start with a digit. Unicode names are allowed.

Comments

Syntax Description
(?#...) comment (cannot be nested)
#... extended mode: comment

In extended mode, an unescaped # introduces a comment which in this case continues to immediately after the next line feed character or character sequence in the pattern. This has to be a literal line feed character or character sequence, escape sequences that happen to represent a line feed like \n do not count.

Option Setting

Syntax Description
(?i) caseless / case-insensitive search
(?J) allow duplicate named groups
(?m) multiline mode
(?n) no auto capture
(?s) single line mode (dotall)
(?U) default ungreedy quantifiers (lazy)
(?x) extended mode: ignore white space except in classes
(?xx) same as (?x) but also ignore space and tab in classes
(?-...) unset option(s)
(?^) unset i, m, n, s and x options

Changes of these options within a group are automatically cancelled at the end of the group.

Several options may be set at once, and a mixture of setting and unsetting such as (?i-x) is allowed, but there may be only one hyphen. Setting (but no unsetting) is allowed after (?^ for example (?^in). An option setting may appear at the start of a non-capture group, for example (?i:...).

Special Control Verbs

Special control verbs are only recognized at the very start of a pattern.

Syntax Description
(*LIMIT_DEPTH=d) set the backtracking limit to d
(*LIMIT_HEAP=d) set the heap size limit to d * 1024 bytes
(*LIMIT_MATCH=d) set the match limit to d
(*NOTEMPTY) lock out matching of empty strings entirely
(*NOTEMPTY_ATSTART) lock out matching of empty strings at the start of the subject
(*NO_AUTO_POSSES) prevents quantifiers from automatically being made possessive when what follows cannot match the repeated item (e.g. by default, a+b is handled as a++b as an optimization)
(*NO_DOTSTAR_ANCHOR) disable optimizations that apply to patterns whose top-level branches start with .*
(*NO_JIT) do not JIT-compile this pattern
(*NO_START_OPT) disable several optimizations for quickly reaching a "no match" result; this can be useful if you want callouts or backtracking control verbs to be executed in any case
(*UTF) enable UTF-mode; this verb cannot be used in regular expressions created with the class CL_ABAP_REGEX with UNICODE_HANDLING set to RELAXED, as it would clash with usages of \C
(*UCP) enable usage of Unicode character properties; for ABAP regular expressions this option is already enabled by default

The following special control verbs control the line break convention, i.e. what gets recognized as a line feed character. They do not affect \R:

Syntax Description
(*CR) carriage return only
(*LF) line feed only
(*CRLF) carriage return followed by line feed
(*ANYCRLF) all three of the above
(*ANY) any Unicode line feed sequence
(*NUL) the NUL character (binary zero)

The following special control verbs control what \R matches:

Syntax Description
(*BSR_ANYCRLF) CR, LF and CRLF
(*BSR_UNICODE) any Unicode line feed sequence

Look-ahead and Look-behind Assertions

Syntax Description
(?=...) positive look-ahead
(*pla:...) positive look-ahead
(*positive_look-ahead:...) positive look-ahead
(?!...) negative look-ahead
(*nla:...) negative look-ahead
(*negative_look-ahead:...) negative look-ahead
(?<=...) positive look-behind
(*plb:...) positive look-behind
(*positive_look-behind:...) positive look-behind
(?<!...) negative look-behind
(*nlb:...) negative look-behind
(*negative_look-behind:...) negative look-behind

Each top-level branch of a look-behind must be of fixed length.

Non-Atomic Look-around Assertions

Syntax Description
(?*...) non-atomic positive look-ahead
(*napla:...) non-atomic positive look-ahead
(*non_atomic_positive_look-ahead:...) non-atomic positive look-ahead
(?<*...) non-atomic positive look-behind
(*naplb:...) non-atomic positive look-behind
(*non_atomic_positive_look-behind:...) non-atomic positive look-behind

Backreferences

Syntax Description
\n reference by number n (can be ambiguous, see octal escapes)
\gn reference by number n
\g{n} reference by number n
\g+n relative reference by number n
\g-n relative reference by number n
\g{+n} relative reference by number n
\g{-n} relative reference by number n
\k?<name> reference by name (Perl style)
\k'name' reference by name (Perl style)
\g{name} reference by name (Perl style)
\k{name} reference by name (.NET style)
(?P=name) reference by name (Perl style)

Subroutine References

Syntax Description
(?R) recurse whole pattern
(?n) call subroutine by absolute number n
\g<n> call subroutine by absolute number n (Oniguruma style)
\g'n' call subroutine by absolute number n (Oniguruma style)
(?+n) call subroutine by relative number n
(?-n) call subroutine by relative number n
\g<+n> call subroutine by relative number n
\g'+n' call subroutine by relative number n
\g<-n> call subroutine by relative number n
\g'-n' call subroutine by relative number n
(?&name) call subroutine by name (Perl style)
(?P>name) call subroutine by name (Python style)
\g?<name> call subroutine by name (Oniguruma style)
\g'name' call subroutine by name (Oniguruma style)

Subroutine calls can be recursive. Left recursion is not possible however.

Parsing with PCRE Regular Expression

Conditional Patterns

Syntax Description
(?(condition)yes-pattern) match the yes-pattern if the condition holds
(?(condition)yes-pattern|no-pattern) match the yes-pattern if the condition holds, otherwise match the false-pattern

Where condition can be one of the following or any other assertion like a look-ahead or look-behind assertion:

Syntax Description
n absolute number n reference condition
+n relative number n reference condition
-n relative number n reference condition
?<name> named reference condition (Perl style)
'name' named reference condition (Perl style)
R overall recursion condition
Rn specific number n group recursion condition
R&name specific named group recursion condition
DEFINE define groups for reference (always yields false)
VERSION[>]=n.m test PCRE2 version

Backtracking Control

The following backtracking control verbs are triggered immediately when they are reached:

Syntax Description
(*ACCEPT) force successful match;\lbr if triggered inside a group that is called as a subroutine, only the group is ended successfully;\lbr if triggered inside a positive assertion, the assertion succeeds;\lbrif triggered inside a negative assertion, the assertion fails
(*FAIL) force backtrack
(*F) force backtrack
(?!) force backtrack; actually a negative look-ahead for the empty string, which always matches thus always failing the look-ahead; (*FAIL) and (*F) are synonyms for (?!)
(*MARK:NAME) mark a position with name NAME, see (*SKIP:NAME) below; synonym (:NAME); NAME can contain any sequence of characters that does not include the closing parenthesis; an empty name will cause the mark to have no effect

The following backtracking control verbs are only triggered when a subsequent match failure causes a backtrack to reach them. All of them force a match failure, but differ in what happens afterwards:

Syntax Description
(*COMMIT) overall failure, no advance of starting point
(*PRUNE) advance to next starting character; only applies if the pattern is not anchored, otherwise behaves the same
(*SKIP) advance to current matching position
(*SKIP:NAME) advance to position corresponding to an earlier (*MARK:NAME); if not found, the (*SKIP) is ignored
(*THEN) local failure, backtrack to next alternation

If one of these verbs is located in a group called as a subroutine, its effects are confined to the subroutine call.

Callouts

Syntax Description
(?C) callout (assumed number 0)
(?Cn) callout with numerical data n
(?C"text") callout with string data text

PCRE Regular Expression with Callouts

Replacement Syntax

Capture Group Substitution

Syntax Description
$id substitute for the content of the capture group identified by id, $0 being the content of the whole match; id can either be a number referring to a capture group or the name of a named capture group
\dollar{id} substitute for the content of the capture group identified by id, \dollar{0} being the content of the whole match; id can either be a number referring to a capture group or the name of a named capture group

When referring to a capture group that is not set (i.e. was not participating) in the match, the empty string will be substituted.

When referring to a capture group that does not exist, it is assumed unset and thus the empty string is substituted.

Conditional Substitution

Syntax Description
{id:+matched:unmatched} substitute for matched if the capture group identified by id was set in the match, otherwise for unmatched; id can either be a number referring to a capture group or the name of a named capture group
{id:-default} substitute for the content of the capture group identified by id if said capture group was set in the match, otherwise for default; shorthand for {id:+\dollar{id}:default}; id can either be a number referring to a capture group or the name of a named capture group

When referring to a capture group that is not set (i.e. was not participating) in the match, the empty string will be substituted.

When referring to a capture group that does not exist, it is assumed unset and thus the empty string is substituted.

Case Conversion

Syntax Description
\u the first character after \u that is inserted into the replacement text is converted to uppercase
\U all characters after \U up to the next \L or \E that are inserted into the replacement text are converted to uppercase
\l the first character after \l that is inserted into the replacement text is converted to lowercase
\L all characters after \L up to the next \U or \E that are inserted into the replacement text are converted to lowercase
\E terminates the current upper- or lowercase transformation

Substituting Special Characters

Syntax Description
\t insert a tab (0x09)
\r insert a carriage return (0x0D)
\n insert a line feed
\f insert a form feed (0x0C)
\xhh insert the character with hex code hh
\x{hh..} insert the character with hex code hh..

Quoting

Syntax Description
\\ insert a literal \
\$ insert a literal $
\x if \x has no special meaning, insert a literal x






PERFORM Short Reference   General Data in Customer Master  
This documentation is copyright by SAP AG.

Length: 79625 Date: 20240425 Time: 064640     sap01-206 ( 343 ms )