perlre - Perl regular expressions (Japanese translation 10% complete)

NAME

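perlre - Perl regular expressions (Japanese translation 10% complete)


DESCRIPTION

This page describes the syntax of regular expressions in Perl.

If you haven't used regular expressions before, a quick-start introduction is available in perlrequick [CPAN], and a longer tutorial is available in perlretut [CPAN].

For reference on how regular expressions are used in matching operations, plus various examples of the same, see the discussions of m//, s///, qr// and ?? in "Regexp Quote-Like Operators" in perlop [CPAN].

Matching operations can have various modifiers. Modifiers that relate to the interpretation of the regular expression inside are listed below. Modifiers that alter the way a regular expression is used by Perl are detailed in "Regexp Quote-Like Operators" in perlop [CPAN] and "Gory details of parsing quoted constructs" in perlop [CPAN].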

i

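Do case-insensitive pattern matching.

If use locale is in effect, the case map is taken from the current locale. See perllocale [CPAN].

m

Treat string as multiple lines. That is, change "^" and "$" from matching the start or end of the string to matching the start or end of any line anywhere within the string.

s

Treat string as single line. That is, change "." to match any character whatsoever, even a newline, which normally it would not match.

The /s and /m modifiers both override the $* setting. That is, no matter what $* contains, /s without /m will force "^" to match only at the beginning of the string and "$" to match only at the end (or just before a newline at the end) of the string. Together, as /ms, they let the "." match any character whatsoever, while still allowing "^" and "$" to match, respectively, just after and just before newlines within the string.

x

Extend your pattern's legibility by permitting whitespace and comments.

These are usually written as "the /x modifier", even though the delimiter in question might not really be a slash. Any of these modifiers may also be embedded within the regular expression itself using the (?...) construct.

The /x modifier itself needs a little more explanation. It tells the regular expression parser to ignore whitespace that is neither backslashed nor within a character class. You can use this to break up your regular expression into (slightly) more readable parts. The # character is also treated as a metacharacter introducing a comment, just as in ordinary Perl code. This also means that if you want real whitespace or # characters in the pattern (outside a character class, where they are unaffected by /x), you will either have to escape them or encode them using octal or hex escapes. Taken together, these features go a long way towards making Perl's regular expressions more readable. Be careful not to include the pattern delimiter in a comment: perl has no way of knowing you did not intend to close the pattern early. See the C-comment deletion code in perlop [CPAN].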
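As a minimal sketch of the /x rules just described, the same pattern is written below once compactly and once with /x. The input string and variable names are hypothetical, chosen only for illustration. Note that the literal space after "Time:" has to be escaped under /x, and that no comment contains the "/" delimiter.

```perl
use strict;
use warnings;

# Hypothetical input line, for illustration only.
my $line = "Time: 12:34:56";

# Without /x: compact but dense.
my @compact = $line =~ /Time: (\d\d):(\d\d):(\d\d)/;

# With /x: unescaped whitespace is ignored and "#" starts a comment,
# so the literal space after "Time:" is written as "\ ".
my @spaced = $line =~ /
    Time: \    # literal prefix, then an escaped space
    (\d\d) :   # hours
    (\d\d) :   # minutes
    (\d\d)     # seconds
/x;

print "@spaced\n";    # prints "12 34 56"
```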

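Regular Expressions

The patterns used in Perl pattern matching derive from those supplied in the Version 8 regex routines. (The routines are derived (distantly) from Henry Spencer's freely redistributable reimplementation of the V8 routines.) See "Version 8 Regular Expressions" for details.

In particular the following metacharacters have their standard egrep-ish meanings: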

    \	Quote the next metacharacter
    ^	Match the beginning of the line
    .	Match any character (except newline)
    $	Match the end of the line (or before newline at the end)
    |	Alternation
    ()	Grouping
    []	Character class

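By default, the "^" character is guaranteed to match only the beginning of the string, the "$" character only the end (or before the newline at the end), and Perl does certain optimizations with the assumption that the string contains only one line. Embedded newlines will not be matched by "^" or "$". You may, however, wish to treat a string as a multi-line buffer, such that "^" will match after any newline within the string, and "$" will match before any newline. At the cost of a little more overhead, you can do this by using the /m modifier on the pattern match operator. (Older programs did this by setting $*, but this practice is now deprecated.)

To simplify multi-line substitutions, the "." character never matches a newline unless you use the /s modifier, which in effect tells Perl to pretend the string is a single line, even if it isn't. The /s modifier also overrides the setting of $*, in case you have some (badly behaved) older code that sets it in another module.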
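The effect of /m on "^" and of /s on "." can be sketched as follows. The sample string and variable names are hypothetical and serve only to illustrate the behaviour described above.

```perl
use strict;
use warnings;

my $text = "first line\nsecond line\n";

# Default: "^" anchors only at the very start of the string.
my @default = $text =~ /^(\w+)/g;             # ("first")

# /m: "^" also matches just after each embedded newline.
my @multi = $text =~ /^(\w+)/mg;              # ("first", "second")

# Default: "." stops at a newline; with /s it crosses it.
my ($no_s)   = $text =~ /first(.*)second/;    # no match, so undef
my ($with_s) = $text =~ /first(.*)second/s;   # " line\n"
```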

The following standard quantifiers are recognized:

    *	   Match 0 or more times
    +	   Match 1 or more times
    ?	   Match 1 or 0 times
    {n}    Match exactly n times
    {n,}   Match at least n times
    {n,m}  Match at least n but not more than m times

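(If a curly bracket occurs in any other context, it is treated as a regular character. In particular, the lower bound is not optional.) The "*" modifier is equivalent to {0,}, the "+" modifier to {1,}, and the "?" modifier to {0,1}. n and m are limited to integral values less than a preset limit defined when perl is built. This is usually 32766 on the most common platforms. The actual limit can be seen in the error message generated by code such as this:

    $_ **= $_ , / {$_} / for 2 .. 42;

By default, a quantified subpattern is "greedy", that is, it will match as many times as possible (given a particular starting location) while still allowing the rest of the pattern to match. If you want it to match the minimum number of times possible, follow the quantifier with a "?". Note that the meanings don't change, just the "greediness":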

    *?	   Match 0 or more times
    +?	   Match 1 or more times
    ??	   Match 0 or 1 time
    {n}?   Match exactly n times
    {n,}?  Match at least n times
    {n,m}? Match at least n but not more than m times
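A small sketch of the difference between greedy and minimal quantifiers follows. The sample strings are hypothetical; in each pair only the trailing "?" differs.

```perl
use strict;
use warnings;

my $s = "<<a>> <<b>>";

# Greedy: .+ runs to the last ">>" that still lets the pattern match.
my ($greedy)  = $s =~ /<<(.+)>>/;      # "a>> <<b"

# Minimal: .+? stops at the first ">>".
my ($minimal) = $s =~ /<<(.+?)>>/;     # "a"

# The same applies to counted quantifiers.
my ($max) = "aaaaa" =~ /(a{2,4})/;     # "aaaa"
my ($min) = "aaaaa" =~ /(a{2,4}?)/;    # "aa"
```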
    

Because patterns are processed as double quoted strings, the following also work:

    \t		tab                   (HT, TAB)
    \n		newline               (LF, NL)
    \r		return                (CR)
    \f		form feed             (FF)
    \a		alarm (bell)          (BEL)
    \e		escape (think troff)  (ESC)
    \033	octal char (think of a PDP-11)
    \x1B	hex char
    \x{263a}	wide hex char         (Unicode SMILEY)
    \c[		control char
    \N{name}	named char
    \l		lowercase next char (think vi)
    \u		uppercase next char (think vi)
    \L		lowercase till \E (think vi)
    \U		uppercase till \E (think vi)
    \E		end case modification (think vi)
    \Q		quote (disable) pattern metacharacters till \E

If use locale is in effect, the case map used by \l, \L, \u and \U is taken from the current locale. See perllocale [CPAN]. For documentation of \N{name}, see charnames [CPAN].

You cannot include a literal $ or @ within a \Q sequence. An unescaped $ or @ interpolates the corresponding variable, while escaping will cause the literal string \$ to be matched. You'll need to write something like m/\Quser\E\@\Qhost/.

In addition, Perl defines the following:

    \w	Match a "word" character (alphanumeric plus "_")
    \W	Match a non-"word" character
    \s	Match a whitespace character
    \S	Match a non-whitespace character
    \d	Match a digit character
    \D	Match a non-digit character
    \pP	Match P, named property.  Use \p{Prop} for longer names.
    \PP	Match non-P
    \X	Match eXtended Unicode "combining character sequence",
        equivalent to (?:\PM\pM*)
    \C	Match a single C char (octet) even under Unicode.
	NOTE: breaks up characters into their UTF-8 bytes,
	so you may end up with malformed pieces of UTF-8.
	Unsupported in lookbehind.

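A \w matches a single alphanumeric character (an alphabetic character, or a decimal digit) or _, not a whole word. Use \w+ to match a string of Perl-identifier characters (which isn't the same as matching an English word). If use locale is in effect, the list of alphabetic characters generated by \w is taken from the current locale. See perllocale [CPAN]. You may use \w, \W, \s, \S, \d, and \D within character classes, but if you try to use them as endpoints of a range, that's not a range; the "-" is understood literally. If Unicode is in effect, \s also matches "\x{85}", "\x{2028}", and "\x{2029}". See perlunicode [CPAN] for more details about \pP, \PP, and \X, and perluniintro [CPAN] about Unicode in general. You can define your own \p and \P properties, see perlunicode [CPAN].

The POSIX character class syntax

    [:class:]

is also available. The available classes and their backslash equivalents (if available) are as follows: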

    alpha
    alnum
    ascii
    blank		[1]
    cntrl
    digit       \d
    graph
    lower
    print
    punct
    space       \s	[2]
    upper
    word        \w	[3]
    xdigit
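[1]

A GNU extension equivalent to [ \t], "all horizontal whitespace".

[2]

Not exactly equivalent to \s since [[:space:]] also includes the (very rare) "vertical tabulator", "\cK", chr(11).

[3]

A Perl extension, see above.

For example, use [:upper:] to match all the uppercase characters. Note that the [] are part of the [::] construct, not part of the whole character class. For example:

    [01[:alpha:]%]

matches zero, one, any alphabetic character, and the percent sign.

The following equivalences to Unicode \p{} constructs and equivalent backslash character classes (if available) hold: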

    [:...:]	\p{...}		backslash

    alpha       IsAlpha
    alnum       IsAlnum
    ascii       IsASCII
    blank       IsSpace
    cntrl       IsCntrl
    digit       IsDigit        \d
    graph       IsGraph
    lower       IsLower
    print       IsPrint
    punct       IsPunct
    space       IsSpace
                IsSpacePerl    \s
    upper       IsUpper
    word        IsWord
    xdigit      IsXDigit

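For example [:lower:] and \p{IsLower} are equivalent.

If the utf8 pragma is not used but the locale pragma is, the classes correlate with the usual isalpha(3) interface (except for "word" and "blank").

The assumedly non-obviously named classes are:

cntrl

Any control character. Usually characters that don't produce output as such but instead control the terminal somehow: for example newline and backspace are control characters. All characters with ord() less than 32 are most often classified as control characters (assuming ASCII, the ISO Latin character sets, and Unicode), as is the character with the ord() value of 127 (DEL).

graph

Any alphanumeric or punctuation (special) character.

print

Any alphanumeric or punctuation (special) character or the space character.

punct

Any punctuation (special) character.

xdigit

Any hexadecimal digit. Though this may feel silly ([0-9A-Fa-f] would work just fine) it is included for completeness.

You can negate the [::] character classes by prefixing the class name with a '^'. This is a Perl extension. For example: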

    POSIX           traditional  Unicode

    [:^digit:]      \D           \P{IsDigit}
    [:^space:]      \S           \P{IsSpace}
    [:^word:]       \W           \P{IsWord}

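Perl respects the POSIX standard in that POSIX character classes are only supported within a character class. The POSIX character classes [.cc.] and [=cc=] are recognized but not supported, and trying to use them will cause an error.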
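The following sketch exercises the POSIX class syntax, including the negated Perl extension. The sample string and variable names are hypothetical. Note the doubled brackets: the inner [: :] names the class, and the outer [ ] is the bracketed character class that must contain it.

```perl
use strict;
use warnings;

my $s = "Ab1% xyZ";

# POSIX classes are only valid inside a bracketed character class.
my @upper     = $s =~ /[[:upper:]]/g;       # uppercase letters
my @mixed     = $s =~ /[01[:alpha:]%]/g;    # 0, 1, any letter, or %
my @non_digit = $s =~ /[[:^digit:]]/g;      # Perl's negated form

print join('', @upper), "\n";               # prints "AZ"
```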

Perl defines the following zero-width assertions:

    \b	Match a word boundary
    \B	Match a non-(word boundary)
    \A	Match only at beginning of string
    \Z	Match only at end of string, or before newline at the end
    \z	Match only at end of string
    \G	Match only at pos() (e.g. at the end-of-match position
        of prior m//g)

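A word boundary (\b) is a spot between two characters that has a \w on one side of it and a \W on the other side of it (in either order), counting the imaginary characters off the beginning and end of the string as matching a \W. (Within character classes \b represents backspace rather than a word boundary, just as it normally does in any double-quoted string.) The \A and \Z are just like "^" and "$", except that they won't match multiple times when the /m modifier is used, while "^" and "$" will match at every internal line boundary. To match the actual end of the string and not ignore an optional trailing newline, use \z.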
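A brief sketch of how these anchors differ in practice. The strings and variable names below are hypothetical, picked only to show each assertion once.

```perl
use strict;
use warnings;

my $s = "one\ntwo\n";

my @caret = $s =~ /^(\w+)/mg;     # "^" matches at every line start
my @big_a = $s =~ /\A(\w+)/mg;    # \A matches only at the string start

my $Z  = $s =~ /two\Z/   ? 1 : 0;   # 1: \Z tolerates the final newline
my $z  = $s =~ /two\z/   ? 1 : 0;   # 0: \z does not
my $nz = $s =~ /two\n\z/ ? 1 : 0;   # 1: the newline is matched explicitly

# \b sits between a \w character and a \W character (or a string edge),
# so the "cat" inside "concat" is not counted.
my @words = "cat concat cat." =~ /\bcat\b/g;
```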

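The \G assertion can be used to chain global matches (using m//g), as described in "Regexp Quote-Like Operators" in perlop [CPAN]. It is also useful when writing lex-like scanners, when you have several patterns that you want to match against consecutive substrings of your string; see the previous reference. The actual location where \G will match can also be influenced by using pos() as an lvalue: see "pos" in perlfunc [CPAN]. Currently \G is only fully supported when anchored to the start of the pattern; while it is permitted to use it elsewhere, as in /(?<=\G..)./g, some such uses (/.\G/g, for example) currently cause problems, and it is recommended that you avoid such usage for now.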
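A lex-like scanner of the kind mentioned above might look roughly like this. It is a sketch only: the input, the token names, and the token syntax are all hypothetical. Each pattern is anchored with \G, and the /c modifier (documented in perlop) keeps pos() unchanged when a match fails, so the next alternative is tried at the same position.

```perl
use strict;
use warnings;

# Hypothetical input and token names, for illustration only.
my $src = "x = 42 + y";
my @tokens;

for ($src) {                       # alias $_ to $src
    while (1) {
        if    (/\G\s+/gc)    { next }                      # skip whitespace
        elsif (/\G(\d+)/gc)  { push @tokens, "NUM($1)" }
        elsif (/\G(\w+)/gc)  { push @tokens, "ID($1)" }
        elsif (/\G([=+])/gc) { push @tokens, "OP($1)" }
        else                 { last }    # end of input or unknown character
    }
}

print "@tokens\n";    # prints "ID(x) OP(=) NUM(42) OP(+) ID(y)"
```

The digit pattern is tried before \w+ because \w also matches digits.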

The bracketing construct ( ... ) creates capture buffers. To refer to the digit'th buffer use \<digit> within the match. Outside the match use "$" instead of "\". (The \<digit> notation works in certain circumstances outside the match. See the warning below about \1 vs $1 for details.) Referring back to another part of the match is called a backreference.

There is no limit to the number of captured substrings that you may use. However Perl also uses \10, \11, etc. as aliases for \010, \011, etc. (Recall that 0 means octal, so \011 is the character at number 9 in your coded character set; which would be the 10th character, a horizontal tab under ASCII.) Perl resolves this ambiguity by interpreting \10 as a backreference only if at least 10 left parentheses have opened before it. Likewise \11 is a backreference only if at least 11 left parentheses have opened before it. And so on. \1 through \9 are always interpreted as backreferences.

Examples:

    s/^([^ ]*) *([^ ]*)/$2 $1/;     # swap first two words

    if (/(.)\1/) {                  # find first doubled char
        print "'$1' is the first doubled character\n";
    }

    if (/Time: (..):(..):(..)/) {   # parse out values
        $hours = $1;
        $minutes = $2;
        $seconds = $3;
    }

Several special variables also refer back to portions of the previous match. $+ returns whatever the last bracket match matched. $& returns the entire matched string. (At one point $0 did also, but now it returns the name of the program.) $` returns everything before the matched string. $' returns everything after the matched string. And $^N contains whatever was matched by the most-recently closed group (submatch). $^N can be used in extended patterns (see below), for example to assign a submatch to a variable.

The numbered match variables ($1, $2, $3, etc.) and the related punctuation set ($+, $&, $`, $', and $^N) are all dynamically scoped until the end of the enclosing block or until the next successful match, whichever comes first. (See "Compound Statements" in perlsyn [CPAN].)

NOTE: failed matches in Perl do not reset the match variables, which makes it easier to write code that tests for a series of more specific cases and remembers the best match.
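The following sketch illustrates the note above: after a failed match, $1 still holds the value from the last successful one. The string and variable names are hypothetical. A practical consequence is that capture variables are best read only after checking that the match succeeded.

```perl
use strict;
use warnings;

my $str = "key=value";

$str =~ /(\w+)=(\w+)/;        # succeeds
my $before = $1;              # "key"

$str =~ /(\d+)/;              # fails: there are no digits
my $after = $1;               # still "key", left over from the earlier match

# Safer: only trust $1 when the match itself reported success.
my $number = $str =~ /(\d+)/ ? $1 : undef;
```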

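WARNING: Once Perl sees that you need one of $&, $`, or $' anywhere in the program, it has to provide them for every pattern match. This may substantially slow your program. Perl uses the same mechanism to produce $1, $2, etc., so you also pay a price for each pattern that contains capturing parentheses. (To avoid this cost while retaining the grouping behaviour, use the extended regular expression (?: ... ) instead. (Translator's note: this is simply a Perl extension; the /x modifier is not required.)) But if you never use $&, $` or $', then patterns without capturing parentheses will not be penalized. So avoid $&, $', and $` if you can, but if you can't (and some algorithms really appreciate them), once you've used them once, use them at will, because you've already paid the price. As of 5.005, $& is not so costly as the other two.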

Backslashed metacharacters in Perl are alphanumeric, such as \b, \w, \n. Unlike some other regular expression languages, there are no backslashed symbols that aren't alphanumeric. So anything that looks like \\, \(, \), \<, \>, \{, or \} is always interpreted as a literal character, not a metacharacter. This was once used in a common idiom to disable or quote the special meanings of regular expression metacharacters in a string that you want to use for a pattern. Simply quote all non-"word" characters:

    $pattern =~ s/(\W)/\\$1/g;

(If use locale is set, then this depends on the current locale.) Today it is more common to use the quotemeta() function or the \Q metaquoting escape sequence to disable all metacharacters' special meanings like this:

    /$unquoted\Q$quoted\E$unquoted/

Beware that if you put literal backslashes (those not inside interpolated variables) between \Q and \E, double-quotish backslash interpolation may lead to confusing results. If you need to use literal backslashes within \Q...\E, consult "Gory details of parsing quoted constructs" in perlop [CPAN].
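A short sketch comparing the older hand-written quoting idiom with quotemeta() and \Q...\E. The strings and variable names are hypothetical.

```perl
use strict;
use warnings;

# Hypothetical user-supplied text that happens to contain metacharacters.
my $literal = "a.b*c";

my $by_hand = $literal;
$by_hand =~ s/(\W)/\\$1/g;             # the older idiom, giving a\.b\*c
my $quoted = quotemeta($literal);      # same result for this input

# \Q...\E applies the same quoting while the pattern is interpolated.
my $hit  = "xx a.b*c yy" =~ /\Q$literal\E/ ? 1 : 0;   # 1: literal text found
my $miss = "xx aXbbc yy" =~ /\Q$literal\E/ ? 1 : 0;   # 0: "." and "*" are literal
my $raw  = "xx aXbbc yy" =~ /$literal/     ? 1 : 0;   # 1: unquoted, so they are special
```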

Extended Patterns

Perl also defines a consistent extension syntax for features not found in standard tools like awk and lex. The syntax is a pair of parentheses with a question mark as the first thing within the parentheses. The character after the question mark indicates the extension.

The stability of these extensions varies widely. Some have been part of the core language for many years. Others are experimental and may change without warning or be completely removed. Check the documentation on an individual feature to verify its current status.

A question mark was chosen for this and for the minimal-matching construct because 1) question marks are rare in older regular expressions, and 2) whenever you see one, you should stop and "question" exactly what is going on. That's psychology...

(?#text)

A comment. The text is ignored. If the /x modifier enables whitespace formatting, a simple # will suffice. Note that Perl closes the comment as soon as it sees a ), so there is no way to put a literal ) in the comment.

(?imsx-imsx)

One or more embedded pattern-match modifiers, to be turned on (or turned off, if preceded by -) for the remainder of the pattern or the remainder of the enclosing pattern group (if any). This is particularly useful for dynamic patterns, such as those read in from a configuration file, taken from an argument, or specified in a table somewhere. Consider the case where some patterns want to be case sensitive and some do not: the case insensitive ones merely need to include (?i) at the front of the pattern. For example:

    $pattern = "foobar";
    if ( /$pattern/i ) { } 

    # more flexible:

    $pattern = "(?i)foobar";
    if ( /$pattern/ ) { }

These modifiers are restored at the end of the enclosing group. For example,

    ( (?i) blah ) \s+ \1

will match a repeated (including the case!) word blah in any case, assuming x modifier, and no i modifier outside this group.
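To make the scoping concrete, here is a small sketch using the pattern above. The test strings and variable names are hypothetical. Because (?i) ends with the enclosing group, the backreference \1 is compared case-sensitively.

```perl
use strict;
use warnings;

my $re = qr/ ( (?i) blah ) \s+ \1 /x;

# "Blah" matches the group in any case, and \1 must repeat it exactly.
my $same   = "Blah Blah" =~ $re ? 1 : 0;    # 1
my $differ = "Blah blah" =~ $re ? 1 : 0;    # 0: the repeat differs in case
```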

(?:pattern)
(?imsx-imsx:pattern)

This is for clustering, not capturing; it groups subexpressions like "()", but doesn't make backreferences as "()" does. So

    @fields = split(/\b(?:a|b|c)\b/)

is like

    @fields = split(/\b(a|b|c)\b/)

but doesn't spit out extra fields. It's also cheaper not to capture characters if you don't need to.

Any letters between ? and : act as flags modifiers as with (?imsx-imsx). For example,

    /(?s-i:more.*than).*million/i

is equivalent to the more verbose

    /(?:(?s-i)more.*than).*million/i

(?=pattern)

A zero-width positive look-ahead assertion. For example, /\w+(?=\t)/ matches a word followed by a tab, without including the tab in $&.

(?!pattern)

A zero-width negative look-ahead assertion. For example /foo(?!bar)/ matches any occurrence of "foo" that isn't followed by "bar". Note however that look-ahead and look-behind are NOT the same thing. You cannot use this for look-behind.

If you are looking for a "bar" that isn't preceded by a "foo", /(?!foo)bar/ will not do what you want. That's because the (?!foo) is just saying that the next thing cannot be "foo"--and it's not, it's a "bar", so "foobar" will match. You would have to do something like /(?!foo)...bar/ for that. We say "like" because there's the case of your "bar" not having three characters before it. You could cover that this way: /(?:(?!foo)...|^.{0,2})bar/. Sometimes it's still easier just to say:

    if (/bar/ && $` !~ /foo$/)

For look-behind see below.

(?<=pattern)

A zero-width positive look-behind assertion. For example, /(?<=\t)\w+/ matches a word that follows a tab, without including the tab in $&. Works only for fixed-width look-behind.

(?<!pattern)

A zero-width negative look-behind assertion. For example /(?<!bar)foo/ matches any occurrence of "foo" that does not follow "bar". Works only for fixed-width look-behind.
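The four look-around assertions can be compared side by side in a short sketch. The sample strings and variable names are hypothetical.

```perl
use strict;
use warnings;

# Negative look-ahead: "foo" not followed by "bar".
my @ahead = "foobar foobaz" =~ /foo(?!bar)\w*/g;          # ("foobaz")

# (?!foo)bar only checks what comes next, so it still matches "foobar".
my $wrong = "foobar" =~ /(?!foo)bar/ ? 1 : 0;             # 1

# Negative look-behind is the tool for "bar not preceded by foo".
my $behind_foobar = "foobar" =~ /(?<!foo)bar/ ? 1 : 0;    # 0
my $behind_rebar  = "rebar"  =~ /(?<!foo)bar/ ? 1 : 0;    # 1

# Positive look-behind: a word that follows a tab, tab excluded.
my ($after_tab) = "key\tvalue" =~ /(?<=\t)(\w+)/;         # "value"
```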

(?{ code })

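WARNING: This extended regular expression feature is considered highly experimental, and may be changed or deleted without notice.

This zero-width assertion evaluates any embedded Perl code. It always succeeds, and its code is not interpolated. Currently, the rules to determine where the code ends are somewhat convoluted.

This feature can be used together with the special variable $^N to capture the results of submatches in variables without having to keep track of the number of nested parentheses. For example: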

  $_ = "The brown fox jumps over the lazy dog";
  /the (\S+)(?{ $color = $^N }) (\S+)(?{ $animal = $^N })/i;
  print "color = $color, animal = $animal\n";

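Inside the (?{...}) block, $_ refers to the string the regular expression is matching against. You can also use pos() to know what is the current position of matching within this string.

The code is properly scoped in the following sense: if the assertion is backtracked (compare "Backtracking"), all changes introduced after localization are undone, so that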

  $_ = 'a' x 8;
  m< 
     (?{ $cnt = 0 })			# Initialize $cnt.
     (
       a 
       (?{
           local $cnt = $cnt + 1;	# Update $cnt, backtracking-safe.
       })
     )*  
     aaaa
     (?{ $res = $cnt })			# On success copy to non-localized
					# location.
   >x;

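will set $res = 4. Note that after the match, $cnt returns to the globally introduced value, because the scopes that restrict local operators are unwound.

This assertion may be used as a (?(condition)yes-pattern|no-pattern) switch. If not used in this way, the result of evaluation of code is put into the special variable $^R. This happens immediately, so $^R can be used from other (?{ code }) assertions inside the same regular expression.

The assignment to $^R above is properly localized, so the old value of $^R is restored if the assertion is backtracked; compare "Backtracking".

For reasons of security, this construct is forbidden if the regular expression involves run-time interpolation of variables, unless the perilous use re 'eval' pragma has been used (see re [CPAN]), or the variables contain results of the qr// operator (see "qr/STRING/imosx" in perlop [CPAN]).

This restriction exists because of the widespread and remarkably convenient custom of using run-time determined strings as patterns. For example: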

    $re = <>;
    chomp $re;
    $string =~ /$re/;

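Before Perl knew how to execute interpolated code within a pattern, this operation was completely safe from a security point of view, although it could raise an exception from an illegal pattern. If you turn on use re 'eval', though, it is no longer secure, so you should only do so if you are also using taint checking. Better yet, use the carefully constrained evaluation within a Safe compartment. See perlsec [CPAN] for details about both these mechanisms.

(??{ code })

WARNING: This extended regular expression feature is considered highly experimental, and may be changed or deleted without notice. A simplified version of the syntax may be introduced for commonly used idioms.

This is a "postponed" regular subexpression. The code is evaluated at run time, at the moment this subexpression may match. The result of evaluation is considered as a regular expression and matched as if it were inserted instead of this construct.

The code is not interpolated. As before, the rules to determine where the code ends are currently somewhat convoluted.

The following pattern matches a parenthesized group: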

  $re = qr{
	     \(
	     (?:
		(?> [^()]+ )	# Non-parens without backtracking
	      |
		(??{ $re })	# Group with matching parens
	     )*
	     \)
	  }x;

(?>pattern)

WARNING: This extended regular expression feature is considered highly experimental, and may be changed or deleted without notice.

An "independent" subexpression, one which matches the substring that a standalone pattern would match if anchored at the given position, and it matches nothing other than this substring. This construct is useful for optimizations of what would otherwise be "eternal" matches, because it will not backtrack (see "Backtracking"). It may also be useful in places where the "grab all you can, and do not give anything back" semantic is desirable.

For example: ^(?>a*)ab will never match, since (?>a*) (anchored at the beginning of string, as above) will match all characters a at the beginning of string, leaving no a for ab to match. In contrast, a*ab will match the same as a+b, since the match of the subgroup a* is influenced by the following group ab (see "Backtracking"). In particular, a* inside a*ab will match fewer characters than a standalone a*, since this makes the tail match.

An effect similar to (?>pattern) may be achieved by writing (?=(pattern))\1. The look-ahead matches the same substring as a standalone pattern would, and the following \1 eats the matched string; it therefore makes a zero-length assertion into an analogue of (?>...). (The difference between these two constructs is that the second one uses a capturing group, thus shifting ordinals of backreferences in the rest of a regular expression.)
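The behaviour described in the last two paragraphs can be checked with a small sketch. The test string "aaab" and the variable names are hypothetical.

```perl
use strict;
use warnings;

# (?>a*) grabs every leading "a" and never gives one back,
# so the following "ab" has no "a" left to match.
my $independent = "aaab" =~ /^(?>a*)ab/ ? 1 : 0;        # 0

# An ordinary a* backtracks by one "a" so that "ab" can match.
my $plain = "aaab" =~ /^a*ab/ ? 1 : 0;                  # 1

# The look-ahead emulation behaves like the independent group,
# at the cost of an extra capturing group.
my $emulated = "aaab" =~ /^(?=(a*))\1ab/ ? 1 : 0;       # 0
```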

Consider this pattern:

    m{ \(
	  ( 
	    [^()]+		# x+
          | 
            \( [^()]* \)
          )+
       \) 
     }x

That will efficiently match a nonempty group with matching parentheses two levels deep or less. However, if there is no such group, it will take virtually forever on a long string. That's because there are so many different ways to split a long string into several substrings. This is what (.+)+ is doing, and (.+)+ is similar to a subpattern of the above pattern. Consider how the pattern above detects no-match on ((()aaaaaaaaaaaaaaaaaa in several seconds, but that each extra letter doubles this time. This exponential performance will make it appear that your program has hung. However, a tiny change to this pattern

    m{ \( 
	  ( 
	    (?> [^()]+ )	# change x+ above to (?> x+ )
          | 
            \( [^()]* \)
          )+
       \) 
     }x

which uses (?>...) matches exactly when the one above does (verifying this yourself would be a productive exercise), but finishes in a fourth the time when used on a similar string with 1000000 "a"s. Be aware, however, that this pattern currently triggers a warning message under the use warnings pragma or -w switch saying it "matches null string many times in regex".

On simple groups, such as the pattern (?> [^()]+ ), a comparable effect may be achieved by negative look-ahead, as in [^()]+ (?! [^()] ). This was only 4 times slower on a string with 1000000 "a"s.

The "grab all you can, and do not give anything back" semantic is desirable in many situations where on the first sight a simple ()* looks like the correct solution. Suppose we parse text with comments being delimited by # followed by some optional (horizontal) whitespace. Contrary to its appearance, #[ \t]* is not the correct subexpression to match the comment delimiter, because it may "give up" some whitespace if the remainder of the pattern can be made to match that way. The correct answer is either one of these:

    (?>#[ \t]*)
    #[ \t]*(?![ \t])

For example, to grab non-empty comments into $1, one should use either one of these:

    / (?> \# [ \t]* ) (        .+ ) /x;
    /     \# [ \t]*   ( [^ \t] .* ) /x;

Which one you pick depends on which of these expressions better reflects the above specification of comments.

(?(condition)yes-pattern|no-pattern)
(?(condition)yes-pattern)

WARNING: This extended regular expression feature is considered highly experimental, and may be changed or deleted without notice.

Conditional expression. (condition) should be either an integer in parentheses (which is valid if the corresponding pair of parentheses matched), or look-ahead/look-behind/evaluate zero-width assertion.

For example:

    m{ ( \( )? 
       [^()]+ 
       (?(1) \) ) 
     }x

matches a chunk of non-parentheses, possibly included in parentheses themselves.
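A sketch of the conditional in use follows. To make the outcomes unambiguous, the pattern is anchored with ^ and $ here; the anchors are an addition for illustration and are not part of the example above. The test strings and variable names are hypothetical.

```perl
use strict;
use warnings;

# If group 1 (the opening parenthesis) matched, require a closing one.
my $re = qr{ ^ ( \( )? [^()]+ (?(1) \) ) $ }x;

my $wrapped   = "(abc)" =~ $re ? 1 : 0;    # 1: both parentheses present
my $bare      = "abc"   =~ $re ? 1 : 0;    # 1: neither present
my $open_only = "(abc"  =~ $re ? 1 : 0;    # 0: closing parenthesis missing
my $stray     = "abc)"  =~ $re ? 1 : 0;    # 0: unexpected closing parenthesis
```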

Backtracking

NOTE: This section presents an abstract approximation of regular expression behavior. For a more rigorous (and complicated) view of the rules involved in selecting a match among possible alternatives, see "Combining pieces together".

A fundamental feature of regular expression matching involves the notion called backtracking, which is currently used (when needed) by all regular expression quantifiers, namely *, *?, +, +?, {n,m}, and {n,m}?. Backtracking is often optimized internally, but the general principle outlined here is valid.

For a regular expression to match, the entire regular expression must match, not just part of it. So if the beginning of a pattern containing a quantifier succeeds in a way that causes later parts in the pattern to fail, the matching engine backs up and recalculates the beginning part--that's why it's called backtracking.

Here is an example of backtracking: Let's say you want to find the word following "foo" in the string "Food is on the foo table.":

    $_ = "Food is on the foo table.";
    if ( /\b(foo)\s+(\w+)/i ) {
	print "$2 follows $1.\n";
    }

When the match runs, the first part of the regular expression (\b(foo)) finds a possible match right at the beginning of the string, and loads up $1 with "Foo". However, as soon as the matching engine sees that there's no whitespace following the "Foo" that it had saved in $1, it realizes its mistake and starts over again one character after where it had the tentative match. This time it goes all the way until the next occurrence of "foo". The complete regular expression matches this time, and you get the expected output of "table follows foo."

Sometimes minimal matching can help a lot. Imagine you'd like to match everything between "foo" and "bar". Initially, you write something like this:

    $_ =  "The food is under the bar in the barn.";
    if ( /foo(.*)bar/ ) {
	print "got <$1>\n";
    }

Which perhaps unexpectedly yields:

  got <d is under the bar in the >

That's because .* was greedy, so you get everything between the first "foo" and the last "bar". Here it's more effective to use minimal matching to make sure you get the text between a "foo" and the first "bar" thereafter.

    if ( /foo(.*?)bar/ ) { print "got <$1>\n" }
  got <d is under the >

Here's another example: let's say you'd like to match a number at the end of a string, and you also want to keep the preceding part of the match. So you write this:

    $_ = "I have 2 numbers: 53147";
    if ( /(.*)(\d*)/ ) {				# Wrong!
	print "Beginning is <$1>, number is <$2>.\n";
    }

That won't work at all, because .* was greedy and gobbled up the whole string. As \d* can match on an empty string the complete regular expression matched successfully.

    Beginning is <I have 2 numbers: 53147>, number is <>.

Here are some variants, most of which don't work:

    $_ = "I have 2 numbers: 53147";
    @pats = qw{
	(.*)(\d*)
	(.*)(\d+)
	(.*?)(\d*)
	(.*?)(\d+)
	(.*)(\d+)$
	(.*?)(\d+)$
	(.*)\b(\d+)$
	(.*\D)(\d+)$
    };

    for $pat (@pats) {
	printf "%-12s ", $pat;
	if ( /$pat/ ) {
	    print "<$1> <$2>\n";
	} else {
	    print "FAIL\n";
	}
    }

That will print out:

    (.*)(\d*)    <I have 2 numbers: 53147> <>
    (.*)(\d+)    <I have 2 numbers: 5314> <7>
    (.*?)(\d*)   <> <>
    (.*?)(\d+)   <I have > <2>
    (.*)(\d+)$   <I have 2 numbers: 5314> <7>
    (.*?)(\d+)$  <I have 2 numbers: > <53147>
    (.*)\b(\d+)$ <I have 2 numbers: > <53147>
    (.*\D)(\d+)$ <I have 2 numbers: > <53147>

As you see, this can be a bit tricky. It's important to realize that a regular expression is merely a set of assertions that gives a definition of success. There may be 0, 1, or several different ways that the definition might succeed against a particular string. And if there are multiple ways it might succeed, you need to understand backtracking to know which variety of success you will achieve.

When using look-ahead assertions and negations, this can all get even trickier. Imagine you'd like to find a sequence of non-digits not followed by "123". You might try to write that as

    $_ = "ABC123";
    if ( /^\D*(?!123)/ ) {		# Wrong!
	print "Yup, no 123 in $_\n";
    }

But that isn't going to match; at least, not the way you're hoping. It claims that there is no 123 in the string. Here's a clearer picture of why that pattern matches, contrary to popular expectations:

    $x = 'ABC123';
    $y = 'ABC445';

    print "1: got $1\n" if $x =~ /^(ABC)(?!123)/;
    print "2: got $1\n" if $y =~ /^(ABC)(?!123)/;

    print "3: got $1\n" if $x =~ /^(\D*)(?!123)/;
    print "4: got $1\n" if $y =~ /^(\D*)(?!123)/;

This prints

    2: got ABC
    3: got AB
    4: got ABC

You might have expected test 3 to fail because it seems to be a more general purpose version of test 1. The important difference between them is that test 3 contains a quantifier (\D*) and so can use backtracking, whereas test 1 will not. What's happening is that you've asked "Is it true that at the start of $x, following 0 or more non-digits, you have something that's not 123?" If the pattern matcher had let \D* expand to "ABC", this would have caused the whole pattern to fail.

The search engine will initially match \D* with "ABC". Then it will try to match (?!123) with "123", which fails. But because a quantifier (\D*) has been used in the regular expression, the search engine can backtrack and retry the match differently in the hope of matching the complete regular expression.

The pattern really, really wants to succeed, so it uses the standard pattern back-off-and-retry and lets \D* expand to just "AB" this time. Now there's indeed something following "AB" that is not "123". It's "C123", which suffices.

We can deal with this by using both an assertion and a negation. We'll say that the first part in $1 must be followed both by a digit and by something that's not "123". Remember that the look-aheads are zero-width expressions--they only look, but don't consume any of the string in their match. So rewriting this way produces what you'd expect; that is, case 5 will fail, but case 6 succeeds:

    print "5: got $1\n" if $x =~ /^(\D*)(?=\d)(?!123)/;
    print "6: got $1\n" if $y =~ /^(\D*)(?=\d)(?!123)/;

    6: got ABC

In other words, the two zero-width assertions next to each other work as though they're ANDed together, just as you'd use any built-in assertions: /^$/ matches only if you're at the beginning of the line AND the end of the line simultaneously. The deeper underlying truth is that juxtaposition in regular expressions always means AND, except when you write an explicit OR using the vertical bar. /ab/ means match "a" AND (then) match "b", although the attempted matches are made at different positions because "a" is not a zero-width assertion, but a one-width assertion.

WARNING: particularly complicated regular expressions can take exponential time to solve because of the immense number of possible ways they can use backtracking to try for a match. For example, without internal optimizations done by the regular expression engine, this will take a painfully long time to run:

    'aaaaaaaaaaaa' =~ /((a{0,5}){0,5})*[c]/

And if you used *'s in the internal groups instead of limiting them to 0 through 5 matches, then it would take forever--or until you ran out of stack space. Moreover, these internal optimizations are not always applicable. For example, if you put {0,5} instead of * on the external group, no current optimization is applicable, and the match takes a long time to finish.

A powerful tool for optimizing such beasts is what is known as an "independent group", which does not backtrack (see "(?>pattern)"). Note also that zero-length look-ahead/look-behind assertions will not backtrack to make the tail match, since they are in "logical" context: only whether they match is considered relevant. For an example where side-effects of look-ahead might have influenced the following match, see "(?>pattern)".

Version 8 Regular Expressions

In case you're not familiar with the "regular" Version 8 regex routines, here are the pattern-matching rules not described above.

Any single character matches itself, unless it is a metacharacter with a special meaning described here or above. You can cause characters that normally function as metacharacters to be interpreted literally by prefixing them with a "\" (e.g., "\." matches a ".", not any character; "\\" matches a "\"). A series of characters matches that series of characters in the target string, so the pattern blurfl would match "blurfl" in the target string.

You can specify a character class, by enclosing a list of characters in [], which will match any one character from the list. If the first character after the "[" is "^", the class matches any character not in the list. Within a list, the "-" character specifies a range, so that a-z represents all characters between "a" and "z", inclusive. If you want either "-" or "]" itself to be a member of a class, put it at the start of the list (possibly after a "^"), or escape it with a backslash. "-" is also taken literally when it is at the end of the list, just before the closing "]". (The following all specify the same class of three characters: [-az], [az-], and [a\-z]. All are different from [a-z], which specifies a class containing twenty-six characters, even on EBCDIC based coded character sets.) Also, if you try to use the character classes \w, \W, \s, \S, \d, or \D as endpoints of a range, that's not a range, the "-" is understood literally.
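The following sketch checks the claim about the three equivalent classes against a few sample characters. The characters tested and the variable names are assumptions chosen for illustration.

```perl
use strict;
use warnings;

# Three spellings of the same three-character class: "-", "a" and "z".
my @three = (qr/^[-az]$/, qr/^[az-]$/, qr/^[a\-z]$/);
my $range = qr/^[a-z]$/;    # a real range: "a" through "z"

my %result;
for my $ch ('-', 'a', 'm', 'z') {
    my $in_three = grep { $ch =~ $_ } @three;    # how many of the three match
    my $in_range = ($ch =~ $range) ? 1 : 0;
    $result{$ch} = [$in_three, $in_range];
    print "'$ch': three-character classes matching = $in_three, ",
          "in [a-z] = $in_range\n";
}
```

All three spellings agree on every character, and they differ from [a-z] on "-" and "m".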

Note also that the whole range idea is rather unportable between character sets--and even within a character set a range may produce results you probably didn't expect. A sound principle is to use only ranges that begin and end with letters of the same case ([a-e], [A-E]), or with digits ([0-9]). Anything else is unsafe. If in doubt, spell out the character sets in full.

Characters may be specified using a metacharacter syntax much like that used in C: "\n" matches a newline, "\t" a tab, "\r" a carriage return, "\f" a form feed, etc. More generally, \nnn, where nnn is a string of octal digits, matches the character whose coded character set value is nnn. Similarly, \xnn, where nn are hexadecimal digits, matches the character whose numeric value is nn. The expression \cx matches the character control-x. Finally, the "." metacharacter matches any character except "\n" (unless you use /s).
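A few of these escapes can be checked directly. This is a minimal sketch that assumes an ASCII-based platform, where "A" has the hexadecimal value 41 and a newline has the octal value 012.

```perl
use strict;
use warnings;

my %ok = (
    hex     => ("A"    =~ /^\x41$/)  ? 1 : 0,   # \x41 is "A" (ASCII assumed)
    octal   => ("\n"   =~ /^\012\z/) ? 1 : 0,   # \012 is a newline (ASCII assumed)
    control => (chr(1) =~ /^\cA$/)   ? 1 : 0,   # \cA is control-A
    dot     => ("\n"   =~ /^.\z/)    ? 1 : 0,   # "." does not match "\n"
    dot_s   => ("\n"   =~ /^.\z/s)   ? 1 : 0,   # ... unless /s is in effect
);
print "$_: $ok{$_}\n" for sort keys %ok;
```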

You can specify a series of alternatives for a pattern using "|" to separate them, so that fee|fie|foe will match any of "fee", "fie", or "foe" in the target string (as would f(e|i|o)e). The first alternative includes everything from the last pattern delimiter ("(", "[", or the beginning of the pattern) up to the first "|", and the last alternative contains everything from the last "|" to the next pattern delimiter. That's why it's common practice to include alternatives in parentheses: to minimize confusion about where they start and end.

Alternatives are tried from left to right, so the first alternative found for which the entire expression matches, is the one that is chosen. This means that alternatives are not necessarily greedy. For example: when matching foo|foot against "barefoot", only the "foo" part will match, as that is the first alternative tried, and it successfully matches the target string. (This might not seem important, but it is important when you are capturing matched text using parentheses.)

Also remember that "|" is interpreted as a literal within square brackets, so if you write [fee|fie|foe] you're really only matching [feio|].
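The left-to-right rule and the behaviour of "|" inside square brackets can be sketched as follows. The variable names are illustrative only.

```perl
use strict;
use warnings;

my $text = 'barefoot';

my ($first)  = $text =~ /(foo|foot)/;      # "foo" is tried first and succeeds
my ($second) = $text =~ /(foot|foo)/;      # reordering lets "foot" win
my ($forced) = $text =~ /(foo|foot)\z/;    # the rest of the pattern forces "foot"

# Inside square brackets "|" is just another member of the class.
my $pipe_in_class = ('|' =~ /^[fee|fie|foe]$/) ? 1 : 0;

print "first=$first second=$second forced=$forced pipe=$pipe_in_class\n";
```

The third case shows that "first alternative that works" refers to the entire expression: "foo" is still tried first, but it is rejected because \z cannot match after it.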

Within a pattern, you may designate subpatterns for later reference by enclosing them in parentheses, and you may refer back to the nth subpattern later in the pattern using the metacharacter \n. Subpatterns are numbered based on the left to right order of their opening parenthesis. A backreference matches whatever actually matched the subpattern in the string being examined, not the rules for that subpattern. Therefore, (0|0x)\d*\s\1\d* will match "0x1234 0x4321", but not "0x1234 01234", because subpattern 1 matched "0x", even though the rule 0|0x could potentially match the leading 0 in the second number.
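The backreference example from the paragraph above can be run as is. In this sketch only the variable names are new.

```perl
use strict;
use warnings;

my $re = qr/(0|0x)\d*\s\1\d*/;

# Both numbers start with "0x", so \1 (which holds "0x") matches again.
my $same_prefix = ("0x1234 0x4321" =~ $re) ? 1 : 0;

# Here \1 would have to be "0x" again, but the second number starts with "01".
my $diff_prefix = ("0x1234 01234" =~ $re) ? 1 : 0;

print "same prefix: $same_prefix, different prefix: $diff_prefix\n";
```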

Warning on \1 vs $1

Some people get too used to writing things like:

    $pattern =~ s/(\W)/\\\1/g;

This is grandfathered for the RHS of a substitute to avoid shocking the sed addicts, but it's a dirty habit to get into. That's because in PerlThink, the righthand side of an s/// is a double-quoted string. \1 in the usual double-quoted string means a control-A. The customary Unix meaning of \1 is kludged in for s///. However, if you get into the habit of doing that, you get yourself into trouble if you then add an /e modifier.

    s/(\d+)/ \1 + 1 /eg;    	# causes warning under -w

Or if you try to do

    s/(\d+)/\1000/;

You can't disambiguate that by saying \{1}000, whereas you can fix it with ${1}000. The operation of interpolation should not be confused with the operation of matching a backreference. Certainly they mean two different things on the left side of the s///.
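A short sketch of the recommended $1 forms on the right-hand side. The sample string and variable names are assumptions for illustration.

```perl
use strict;
use warnings;

my $item = 'item 7';

# With /e the replacement is Perl code, so $1 is an ordinary variable.
(my $plus_one = $item) =~ s/(\d+)/$1 + 1/e;

# ${1}000 is unambiguous: the capture followed by three literal zeros.
(my $thousand = $item) =~ s/(\d+)/${1}000/;

print "$plus_one\n$thousand\n";
```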

Repeated patterns matching zero-length substring

WARNING: Difficult material (and prose) ahead. This section needs a rewrite.

Regular expressions provide a terse and powerful programming language. As with most other power tools, power comes together with the ability to wreak havoc.

A common abuse of this power stems from the ability to make infinite loops using regular expressions, with something as innocuous as:

    'foo' =~ m{ ( o? )* }x;

The o? can match at the beginning of 'foo', and since the position in the string is not moved by the match, o? would match again and again because of the * modifier. Another common way to create a similar cycle is with the looping modifier //g:

    @matches = ( 'foo' =~ m{ o? }xg );

or

    print "match: <$&>\n" while 'foo' =~ m{ o? }xg;

or the loop implied by split().

However, long experience has shown that many programming tasks may be significantly simplified by using repeated subexpressions that may match zero-length substrings. Here's a simple example:

    @chars = split //, $string;		  # // is not magic in split
    ($whitewashed = $string) =~ s/()/ /g; # parens avoid magic s// /

Thus Perl allows such constructs, by forcefully breaking the infinite loop. The rules for this are different for lower-level loops given by the greedy modifiers *+{}, and for higher-level ones like the /g modifier or split() operator.
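The following sketch shows that these constructs do terminate. The sample strings are assumptions chosen for illustration.

```perl
use strict;
use warnings;

# split with an empty pattern breaks the string into single characters.
my @chars = split //, 'abc';

# An empty capture group matches before every character and at the end.
(my $spaced = 'abc') =~ s/()/ /g;

# A pattern that can match nothing still yields a finite list under /g.
my @matches = ('foo' =~ m{ o? }xg);
my $o_count = grep { $_ eq 'o' } @matches;

print "chars: @chars\n";
print "spaced: '$spaced'\n";
print "number of matches: ", scalar(@matches), ", of which 'o': $o_count\n";
```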

The lower-level loops are interrupted (that is, the loop is broken) when Perl detects that a repeated expression matched a zero-length substring. Thus

    m{ (?: NON_ZERO_LENGTH | ZERO_LENGTH )* }x;

is made equivalent to

    m{   (?: NON_ZERO_LENGTH )*
       |
         (?: ZERO_LENGTH )?
     }x;
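As a minimal sketch, the innocuous-looking pattern shown earlier can be run to confirm that the loop is broken. The variables $offset and $len are introduced here only for illustration.

```perl
use strict;
use warnings;

my ($offset, $len);
if ('foo' =~ m{ ( o? )* }x) {
    # @- and @+ hold the start and end offsets of the whole match.
    ($offset, $len) = ($-[0], $+[0] - $-[0]);
    print "matched $len characters at offset $offset\n";
}
```

The match returns immediately with an empty match at the start of the string: "f" is not an "o", so the only iteration available is the zero-length one, and Perl stops repeating it.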

The higher-level loops preserve an additional state between iterations: whether the last match was zero-length. To break the loop, the following match after a zero-length match is prohibited from having a length of zero. This prohibition interacts with backtracking (see "Backtracking"), and so the second best match is chosen if the best match is of zero length.

For example:

    $_ = 'bar';
    s/\w??/<$&>/g;

results in <><b><><a><><r><>. At each position of the string the best match given by non-greedy ?? is the zero-length match, and the second best match is what is matched by \w. Thus zero-length matches alternate with one-character-long matches.

Similarly, for repeated m/()/g the second-best match is the match at the position one notch further in the string.

The additional state of being matched with zero-length is associated with the matched string, and is reset by each assignment to pos(). Zero-length matches at the end of the previous match are ignored during split.
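The two cases just described can be reproduced with the following sketch. The string 'ab' and the variable names are assumptions for illustration.

```perl
use strict;
use warnings;

# Zero-length matches alternate with one-character matches.
(my $marked = 'bar') =~ s/\w??/<$&>/g;

# Repeated m/()/g advances one position after each empty match,
# so a two-character string yields three empty matches.
my $empty_matches = () = 'ab' =~ /()/g;

print "$marked\n";
print "empty matches in 'ab': $empty_matches\n";
```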

Combining pieces together

Each of the elementary pieces of regular expressions which were described before (such as ab or \Z) could match at most one substring at the given position of the input string. However, in a typical regular expression these elementary pieces are combined into more complicated patterns using combining operators ST, S|T, S* etc (in these examples S and T are regular subexpressions).

Such combinations can include alternatives, leading to a problem of choice: if we match a regular expression a|ab against "abc", will it match substring "a" or "ab"? One way to describe which substring is actually matched is the concept of backtracking (see "Backtracking"). However, this description is too low-level and makes you think in terms of a particular implementation.

Another description starts with notions of "better"/"worse". All the substrings which may be matched by the given regular expression can be sorted from the "best" match to the "worst" match, and it is the "best" match which is chosen. This substitutes the question of "what is chosen?" by the question of "which matches are better, and which are worse?".

Again, for elementary pieces there is no such question, since at most one match at a given position is possible. This section describes the notion of better/worse for combining operators. In the description below S and T are regular subexpressions.

ST

Consider two possible matches, AB and A'B', where A and A' are substrings which can be matched by S, and B and B' are substrings which can be matched by T.

If A is a better match for S than A', then AB is a better match than A'B'.

If A and A' coincide, then AB is a better match than AB' if B is a better match for T than B'.

S|T

When S can match, it is a better match than when only T can match.

The ordering of two matches that both come from S is the same as their ordering for S alone. Similarly for two matches that both come from T.

S{REPEAT_COUNT}

Matches as SSS...S (repeated as many times as necessary).

S{min,max}

Matches as S{max}|S{max-1}|...|S{min+1}|S{min}.

S{min,max}?

Matches as S{min}|S{min+1}|...|S{max-1}|S{max}.

S?, S*, S+

Same as S{0,1}, S{0,BIG_NUMBER}, S{1,BIG_NUMBER} respectively.

S??, S*?, S+?

Same as S{0,1}?, S{0,BIG_NUMBER}?, S{1,BIG_NUMBER}? respectively.

(?>S)

Matches the best match for S and only that.

(?=S), (?<=S)

Only the best match for S is considered. (This is important only if S has capturing parentheses, and backreferences are used somewhere else in the whole regular expression.)

(?!S), (?<!S)

For this grouping operator there is no need to describe the ordering, since only whether or not S can match is important.

(??{ EXPR })

The ordering is the same as for the regular expression which is the result of EXPR.

(?(condition)yes-pattern|no-pattern)

Recall that which of yes-pattern or no-pattern actually matches is already determined. The ordering of the matches is the same as for the chosen subexpression.

The above recipes describe the ordering of matches at a given position. One more rule is needed to understand how a match is determined for the whole regular expression: a match at an earlier position is always better than a match at a later position.
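Several of these ordering rules can be observed directly through capture groups. The strings and variable names in this sketch are chosen purely for illustration.

```perl
use strict;
use warnings;

my ($alt)      = 'abc'  =~ /(a|ab)/;       # S|T: the first alternative wins
my ($alt_tail) = 'abc'  =~ /(a|ab)c/;      # unless the rest of the pattern rejects it
my ($greedy)   = 'aaaa' =~ /(a{1,3})/;     # S{min,max}: the longest count is best
my ($lazy)     = 'aaaa' =~ /(a{1,3}?)/;    # S{min,max}?: the shortest count is best
my ($earliest) = 'xab'  =~ /(b|ab)/;       # an earlier start beats alternative order
my ($s, $t)    = 'aaa'  =~ /(a*)(a*)/;     # ST: the best match for S is fixed first

print "alt=$alt alt_tail=$alt_tail greedy=$greedy lazy=$lazy ",
      "earliest=$earliest s='$s' t='$t'\n";
```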

Creating custom RE engines

Overloaded constants (see overload [CPAN]) provide a simple way to extend the functionality of the RE engine.

Suppose that we want to enable a new RE escape-sequence \Y| which matches at boundary between whitespace characters and non-whitespace characters. Note that (?=\S)(?<!\S)|(?!\S)(?<=\S) matches exactly at these positions, so we want to have each \Y| in the place of the more complicated version. We can create a module customre to do this:

    package customre;
    use overload;

    sub import {
      shift;
      die "No argument to customre::import allowed" if @_;
      overload::constant 'qr' => \&convert;
    }

    sub invalid { die "/$_[0]/: invalid escape '\\$_[1]'"}

    # We must also take care of not escaping the legitimate \\Y|
    # sequence, hence the presence of '\\' in the conversion rules.
    my %rules = ( '\\' => '\\\\',
                  'Y|' => qr/(?=\S)(?<!\S)|(?!\S)(?<=\S)/ );
    sub convert {
      my $re = shift;
      $re =~ s{ 
                \\ ( \\ | Y . )
              }
              { $rules{$1} or invalid($re,$1) }sgex; 
      return $re;
    }

Now use customre enables the new escape in constant regular expressions, i.e., those without any runtime variable interpolations. As documented in overload [CPAN], this conversion will work only over literal parts of regular expressions. For \Y|$re\Y| the variable part of this regular expression needs to be converted explicitly (but only if the special meaning of \Y| should be enabled inside $re):

    use customre;
    $re = <>;
    chomp $re;
    $re = customre::convert $re;
    /\Y|$re\Y|/;
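To see what the conversion does to a runtime string without setting up a module file, here is a standalone sketch. The names %escape_rules and expand_escapes are hypothetical stand-ins for %rules and customre::convert, and a plain string wrapped in (?:...) is assumed in place of the qr// object.

```perl
use strict;
use warnings;

# Hypothetical copy of the conversion table, with the replacement
# written as a string and grouped so the alternation stays contained.
my %escape_rules = (
    '\\' => '\\\\',
    'Y|' => '(?:(?=\S)(?<!\S)|(?!\S)(?<=\S))',
);

# Hypothetical stand-in for customre::convert.
sub expand_escapes {
    my $re = shift;
    $re =~ s{ \\ ( \\ | Y . ) }
            { $escape_rules{$1} or die "invalid escape '\\$1'" }sgex;
    return $re;
}

my $user_re  = 'foo';    # imagine this arrived at run time
my $pattern  = expand_escapes('\Y|' . $user_re . '\Y|');
my $isolated = ('a foo b' =~ /$pattern/) ? 1 : 0;    # whitespace on both sides
my $embedded = ('afoo b'  =~ /$pattern/) ? 1 : 0;    # no boundary before "f"

print "isolated: $isolated, embedded: $embedded\n";
```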

BUGS

This document varies from difficult to understand to completely and utterly opaque. The wandering prose riddled with jargon is hard to fathom in several places.

This document needs a rewrite that separates the tutorial content from the reference content.


SEE ALSO

perlrequick [CPAN].

perlretut [CPAN].

"Regexp Quote-Like Operators" in perlop [CPAN].

"Gory details of parsing quoted constructs" in perlop [CPAN].

perlfaq6 [CPAN].

"pos" in perlfunc [CPAN].

perllocale [CPAN].

perlebcdic [CPAN].

Mastering Regular Expressions by Jeffrey Friedl, published by O'Reilly and Associates.
