regonaut

Version: v0.0.1
Published: Sep 13, 2025 License: MIT Imports: 7 Imported by: 0

README

regonaut

regonaut is a Go implementation of ECMAScript Regular Expressions.

It aims to be fully compatible with JavaScript's RegExp, including all ES2025 features and the Annex B legacy extensions.

Compatibility is verified against all test262 tests related to regular expressions.

That means a pattern that works in modern browsers or Node.js will behave the same way in Go.

Internally, the engine uses a backtracking approach. See Russ Cox's blog post for background on backtracking vs. other regexp implementations.

Installation

go get github.com/auvred/regonaut

Usage

TL;DR
package main

import (
	"fmt"
	"github.com/auvred/regonaut"
)

func main() {
	re := regonaut.MustCompile(".+(?<foo>bAr)", regonaut.FlagIgnoreCase)
	m := re.FindMatch([]byte("_Bar_"))
	fmt.Printf("Groups[0] - %q\n", m.Groups[0].Data())
	fmt.Printf("Groups[1] - %q\n", m.Groups[1].Data())
	fmt.Printf("NamedGroups[\"foo\"] - %q\n", m.NamedGroups["foo"].Data())
}
Unicode handling

ECMAScript and Go have different models for representing strings, and that difference is central to how this library works.

In ECMAScript, strings are defined as sequences of UTF-16 code units, and they can be ill-formed. For example, a string may contain a lone surrogate such as "\uD800", which is not a valid Unicode character on its own but is still considered a valid ECMAScript string. You can read more about it here.
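Go's standard unicode/utf16 package makes the difference concrete. The sketch below uses only the standard library; toUTF16 and fromUTF16 are illustrative helper names, not part of regonaut. It encodes a string into UTF-16 code units and shows that a lone surrogate cannot survive a round trip through a Go string, because decoding replaces it with U+FFFD.

```go
package main

import (
	"fmt"
	"unicode/utf16"
)

// toUTF16 converts a Go (UTF-8) string into UTF-16 code units.
// Code points above U+FFFF become a surrogate pair.
func toUTF16(s string) []uint16 {
	return utf16.Encode([]rune(s))
}

// fromUTF16 converts UTF-16 code units into a Go string.
// utf16.Decode replaces lone surrogates with U+FFFD, so ill-formed
// ECMAScript strings cannot be represented losslessly this way.
func fromUTF16(units []uint16) string {
	return string(utf16.Decode(units))
}

func main() {
	// "c🐱at": the cat face (U+1F431) is encoded as 0xD83D, 0xDC31.
	fmt.Printf("%#x\n", toUTF16("c\U0001F431at"))

	// A lone high surrogate is a valid ECMAScript string but not valid Unicode text.
	fmt.Printf("%q\n", fromUTF16([]uint16{0xD800}))
}
```

This is why input that may contain lone surrogates has to stay in a []uint16 slice rather than being converted to a Go string first.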

Regular expressions in ECMAScript operate in two modes:

  • Non-Unicode mode: both the pattern and the input string are treated as raw sequences of code units.

  • Unicode mode: both the pattern and the input string are treated as sequences of code points.

Unicode mode is enabled when the u or v flag is provided.

Go, on the other hand, uses UTF-8 encoded strings. Because of this mismatch, the library provides two execution modes:

UTF-8 mode
  • Works with regular Go string values
  • Unicode awareness is always implied (the u flag is always enabled)
  • If you want features specific to the v flag, you must still explicitly enable it
  • Both the pattern and the input must be valid UTF-8 strings
  • They are processed as runes (each rune corresponds to a code point)
  • Capturing group indices are reported as byte offsets within the original UTF-8 string
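Byte offsets and rune positions diverge as soon as the input contains multi-byte characters. The following standard-library sketch (runeOffsets is an illustrative helper, not a regonaut function) lists the byte offset at which each rune of "c🐱at" starts. Assuming half-open byte ranges, a group that captures 🐱 in this input would be expected to span bytes 1 to 5.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// runeOffsets returns the byte offset at which each rune of s starts.
func runeOffsets(s string) []int {
	offsets := make([]int, 0, utf8.RuneCountInString(s))
	for i := range s { // ranging over a string yields byte offsets of runes
		offsets = append(offsets, i)
	}
	return offsets
}

func main() {
	s := "c\U0001F431at" // "c🐱at"

	fmt.Println(utf8.ValidString(s)) // true
	fmt.Println(runeOffsets(s))      // [0 1 5 6]

	// Slicing with byte offsets recovers the 4-byte cat face.
	fmt.Printf("%q\n", s[1:5]) // "🐱"
}
```

Reported offsets can therefore be used directly to slice the original string or byte slice, but they should not be treated as character counts.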
UTF-16 mode
  • Works with []uint16 slices
  • By default, each element of the slice is treated as a single code unit
  • When the u or v flag is used, valid surrogate pairs are combined into single code points, while lone surrogates remain as they are
  • Use this mode only if you specifically need ECMAScript-style UTF-16 handling (e.g., when implementing or testing against a JavaScript engine)
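The pairing rule for the u and v flags can be sketched in a few lines. This is an illustration of the rule as described above, not the library's internal code; codePoints and the two surrogate helpers are hypothetical names.

```go
package main

import "fmt"

func isHighSurrogate(u uint16) bool { return u >= 0xD800 && u <= 0xDBFF }
func isLowSurrogate(u uint16) bool  { return u >= 0xDC00 && u <= 0xDFFF }

// codePoints combines each valid surrogate pair into one code point and
// passes every other code unit through unchanged, including lone surrogates.
func codePoints(units []uint16) []rune {
	out := make([]rune, 0, len(units))
	for i := 0; i < len(units); i++ {
		u := units[i]
		if isHighSurrogate(u) && i+1 < len(units) && isLowSurrogate(units[i+1]) {
			hi := rune(u) - 0xD800
			lo := rune(units[i+1]) - 0xDC00
			out = append(out, 0x10000+hi<<10+lo)
			i++ // the low surrogate has been consumed as part of the pair
			continue
		}
		out = append(out, rune(u))
	}
	return out
}

func main() {
	// Valid pair: 0xD83D 0xDC31 becomes U+1F431.
	fmt.Printf("%U\n", codePoints([]uint16{'c', 0xD83D, 0xDC31, 'a', 't'}))
	// prints [U+0063 U+1F431 U+0061 U+0074]

	// Lone surrogate: kept as is.
	fmt.Printf("%U\n", codePoints([]uint16{0xD800, 'a'}))
	// prints [U+D800 U+0061]
}
```

Without the u or v flag no such pairing happens, so the same input is seen as five separate code units.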
Example
package main

import (
	"fmt"
	"github.com/auvred/regonaut"
)

func main() {
	var pattern = "c(.)(.)"
	var patternUtf16 = []uint16{'c', '(', '.', ')', '(', '.', ')'}

	var source = []byte("c🐱at")
	var sourceUtf16 = []uint16{'c', 0xD83D, 0xDC31, 'a', 't'}

	reUtf8 := regonaut.MustCompile(pattern, 0)
	m1 := reUtf8.FindMatch(source)
	fmt.Printf("UTF-8:                   %q, %q\n", m1.Groups[1].Data(), m1.Groups[2].Data())

	reUtf8Unicode := regonaut.MustCompile(pattern, regonaut.FlagUnicode)
	m2 := reUtf8Unicode.FindMatch(source)
	fmt.Printf("UTF-8 (with 'u' flag):   %q, %q\n", m2.Groups[1].Data(), m2.Groups[2].Data())

	reUtf16 := regonaut.MustCompileUtf16(patternUtf16, 0)
	m3 := reUtf16.FindMatch(sourceUtf16)
	fmt.Printf("UTF-16:                  %#v, %#v\n", m3.Groups[1].Data(), m3.Groups[2].Data())

	reUtf16Unicode := regonaut.MustCompileUtf16(patternUtf16, regonaut.FlagUnicode)
	m4 := reUtf16Unicode.FindMatch(sourceUtf16)
	fmt.Printf("UTF-16 (with 'u' flag):  %#v, %#v\n", m4.Groups[1].Data(), m4.Groups[2].Data())
}

Outputs:

UTF-8:                   "🐱", "a"
UTF-8 (with 'u' flag):   "🐱", "a"
UTF-16:                  []uint16{0xd83d}, []uint16{0xdc31}
UTF-16 (with 'u' flag):  []uint16{0xd83d, 0xdc31}, []uint16{0x61}
| Mode | Flags | Matching semantics | Group 1 (m.Groups[1].Data()) | Group 2 (m.Groups[2].Data()) |
| --- | --- | --- | --- | --- |
| UTF-8 | (none) | Code points (UTF-8 mode implies u) | "🐱" | "a" |
| UTF-8 | u | Code points | "🐱" | "a" |
| UTF-16 | (none) | Code units (surrogates not paired) | []uint16{0xd83d} | []uint16{0xdc31} |
| UTF-16 | u | Code points (surrogates paired) | []uint16{0xd83d, 0xdc31} | []uint16{0x61} |

Note: 🐱 is U+1F431 CAT FACE. In UTF-16 mode without u, it is seen as two separate surrogate code units (0xD83D, 0xDC31), so each . matches one of them. With u, the pair is combined into a single code point.

Local Development

Prerequisites
  • Go
  • Node.js with Type Stripping support (version 22.18.0+, 23.6.0+, or 24+)
  • pnpm
Setup

Make sure the test262 submodule is initialized:

git submodule update --init

Generate the test262 tests:

cd tools
pnpm i
pnpm run gen-test262-tests
cd ..
Running tests
# Run all tests, including test262
go test

# Run all tests, except test262
go test -skip 262

# Run all tests, excluding generated property-escapes tests (they are slow)
go test -skip 262/built-ins/RegExp/property-escapes/generated

License

MIT

Documentation

Overview

Package regonaut is an implementation of ECMAScript Regular Expressions.

Example
re := MustCompile(".+(?<foo>bAr)", FlagIgnoreCase)
m := re.FindMatch([]byte("_Bar_"))
fmt.Printf("Groups[0] - %q\n", m.Groups[0].Data())
fmt.Printf("Groups[1] - %q\n", m.Groups[1].Data())
fmt.Printf("NamedGroups[\"foo\"] - %q\n", m.NamedGroups["foo"].Data())
Output:


Groups[0] - "_Bar"
Groups[1] - "Bar"
NamedGroups["foo"] - "Bar"
Example (Utf8_vs_Utf16)

🐱 is U+1F431 CAT FACE. In UTF-16 mode without 'u', it is seen as two separate surrogate code units (0xD83D, 0xDC31), so each '.' matches one of them. With 'u', the pair is combined into a single code point.

var pattern = "c(.)(.)"
var patternUtf16 = []uint16{'c', '(', '.', ')', '(', '.', ')'}

var source = []byte("c🐱at")
var sourceUtf16 = []uint16{'c', 0xD83D, 0xDC31, 'a', 't'}

reUtf8 := MustCompile(pattern, 0)
m1 := reUtf8.FindMatch(source)
fmt.Printf("UTF-8:                   %q, %q\n", m1.Groups[1].Data(), m1.Groups[2].Data())

reUtf8Unicode := MustCompile(pattern, FlagUnicode)
m2 := reUtf8Unicode.FindMatch(source)
fmt.Printf("UTF-8 (with 'u' flag):   %q, %q\n", m2.Groups[1].Data(), m2.Groups[2].Data())

reUtf16 := MustCompileUtf16(patternUtf16, 0)
m3 := reUtf16.FindMatch(sourceUtf16)
fmt.Printf("UTF-16:                  %#v, %#v\n", m3.Groups[1].Data(), m3.Groups[2].Data())

reUtf16Unicode := MustCompileUtf16(patternUtf16, FlagUnicode)
m4 := reUtf16Unicode.FindMatch(sourceUtf16)
fmt.Printf("UTF-16 (with 'u' flag):  %#v, %#v\n", m4.Groups[1].Data(), m4.Groups[2].Data())
Output:


UTF-8:                   "🐱", "a"
UTF-8 (with 'u' flag):   "🐱", "a"
UTF-16:                  []uint16{0xd83d}, []uint16{0xdc31}
UTF-16 (with 'u' flag):  []uint16{0xd83d, 0xdc31}, []uint16{0x61}

Types

type Flag

type Flag uint16

Flag is a bitmask of RegExp options. The zero value corresponds to /pattern/ with no flags. Combine flags with bitwise OR, e.g. FlagIgnoreCase|FlagMultiline.

const (
	// Case-insensitive matching ("i" flag).
	FlagIgnoreCase Flag = 1 << iota

	// "^" and "$" match line boundaries ("m" flag).
	FlagMultiline

	// "." matches line terminators ("s" flag).
	FlagDotAll

	// Unicode-aware mode ("u" flag).
	// If this flag is set, FlagAnnexB is ignored.
	FlagUnicode

	// Unicode set notation and string properties ("v" flag).
	// If this flag is set, FlagAnnexB is ignored.
	FlagUnicodeSets

	// Sticky match from current position ("y" flag).
	FlagSticky

	// Enables Annex B web-compat features.
	// When FlagUnicode or FlagUnicodeSets is set,
	// this flag is cleared automatically by the compiler.
	FlagAnnexB
)

type Group

type Group struct {

	// Start is the inclusive start index of the captured substring,
	// or -1 if the group did not participate in the match.
	Start int
	// End is the exclusive end index of the captured substring,
	// or -1 if the group did not participate in the match.
	End int
	// Name is the group name if defined, otherwise empty.
	Name string
	// contains filtered or unexported fields
}

Group represents a single captured substring from a regular expression match against UTF-8 encoded input. It is safe for concurrent use by multiple goroutines.

func (Group) Data

func (g Group) Data() []byte

Data returns the captured substring as a UTF-8 byte slice. If the group did not participate in the match (Start == -1), it returns nil.

type GroupUtf16

type GroupUtf16 struct {

	// Start is the inclusive start index of the captured substring,
	// or -1 if the group did not participate in the match.
	Start int
	// End is the exclusive end index of the captured substring,
	// or -1 if the group did not participate in the match.
	End int
	// Name is the group name if defined, otherwise empty.
	Name string
	// contains filtered or unexported fields
}

GroupUtf16 represents a single captured substring from a regular expression match against UTF-16 encoded input. It is safe for concurrent use by multiple goroutines.

func (GroupUtf16) Data

func (g GroupUtf16) Data() []uint16

Data returns the captured substring as a UTF-16 code units slice. If the group did not participate in the match (Start == -1), it returns nil.

type Match

type Match struct {
	// Groups is the ordered list of captures.
	// Groups[0] is the full match; subsequent entries correspond to
	// the capturing groups in the pattern.
	Groups []Group
	// NamedGroups maps a group name to its captured group.
	NamedGroups map[string]Group
}

Match holds the result of a successful match against UTF-8 input. It is safe for concurrent use by multiple goroutines.

type MatchUtf16

type MatchUtf16 struct {
	// Groups is the ordered list of captures.
	// Groups[0] is the full match; subsequent entries correspond to
	// the capturing groups in the pattern.
	Groups []GroupUtf16
	// NamedGroups maps a group name to its captured group.
	NamedGroups map[string]GroupUtf16
}

MatchUtf16 holds the result of a successful match against UTF-16 input. It is safe for concurrent use by multiple goroutines.

type RegExp

type RegExp struct {
	// contains filtered or unexported fields
}

RegExp represents a compiled regular expression. It is safe for concurrent use by multiple goroutines. None of the methods on RegExp mutate internal state.

func Compile

func Compile(pattern string, flags Flag) (*RegExp, error)

Compile parses a regular expression pattern and returns a RegExp that can be applied against UTF-8 encoded input.

The pattern must be a valid ECMAScript regular expression.

func MustCompile

func MustCompile(pattern string, flags Flag) *RegExp

MustCompile is like Compile but panics if the expression cannot be parsed. It simplifies safe initialization of global variables containing regular expressions.

func (*RegExp) FindMatch

func (r *RegExp) FindMatch(source []byte) *Match

FindMatch applies r to a UTF-8 encoded byte slice and returns the first match. If no match is found, it returns nil.

func (*RegExp) FindMatchStartingAt

func (r *RegExp) FindMatchStartingAt(source []byte, pos int) *Match

FindMatchStartingAt applies r to a UTF-8 encoded byte slice beginning the search at pos, where pos is a byte index into source. It returns the first match found at or after pos. If pos is out of range or no match is found, it returns nil.

func (*RegExp) FindNextMatch

func (r *RegExp) FindNextMatch(match *Match) *Match

FindNextMatch searches for the next match of r in the same UTF-8 encoded input as a previously returned match.

The search begins at match.Groups[0].End. If the previous match was zero-length (Start == End), the search position is advanced by one input position before matching again to avoid returning the same empty match repeatedly.

If match is nil, or if no further match is found, FindNextMatch returns nil.
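The advance-on-empty-match rule is easiest to see in a loop that collects every match. The sketch below does not call regonaut. It uses Go's standard regexp package as a stand-in search function, and all names in it (span, findFrom, allMatches, stdlibFinder, demo) are illustrative. It also assumes ASCII input, so that "one input position" is one byte.

```go
package main

import (
	"fmt"
	"regexp"
)

// span is a half-open [Start, End) range of byte offsets.
type span struct{ Start, End int }

// findFrom stands in for a "first match at or after pos" search.
type findFrom func(src []byte, pos int) (span, bool)

// allMatches restarts the search at the end of the previous match and
// steps one position further after an empty match, so the same empty
// match is never returned twice.
func allMatches(find findFrom, src []byte) []span {
	var out []span
	pos := 0
	for pos <= len(src) {
		m, ok := find(src, pos)
		if !ok {
			break
		}
		out = append(out, m)
		pos = m.End
		if m.Start == m.End {
			pos++ // zero-length match: advance to guarantee progress
		}
	}
	return out
}

// stdlibFinder adapts a standard-library regexp to the findFrom shape.
func stdlibFinder(re *regexp.Regexp) findFrom {
	return func(src []byte, pos int) (span, bool) {
		loc := re.FindIndex(src[pos:])
		if loc == nil {
			return span{}, false
		}
		return span{loc[0] + pos, loc[1] + pos}, true
	}
}

// demo collects every match of a* in "baac".
func demo() []span {
	return allMatches(stdlibFinder(regexp.MustCompile(`a*`)), []byte("baac"))
}

func main() {
	for _, m := range demo() {
		fmt.Printf("[%d,%d) ", m.Start, m.End)
	}
	fmt.Println()
}
```

For a* against "baac" the loop yields four matches: an empty one at 0, "aa" at [1,3), and empty ones at 3 and 4. Without the extra step after an empty match, the loop would return the match at offset 0 forever.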

type RegExpUtf16

type RegExpUtf16 struct {
	// contains filtered or unexported fields
}

RegExpUtf16 represents a compiled regular expression. It is safe for concurrent use by multiple goroutines. None of the methods on RegExpUtf16 mutate internal state.

func CompileUtf16

func CompileUtf16(pattern []uint16, flags Flag) (*RegExpUtf16, error)

CompileUtf16 parses a regular expression pattern expressed as UTF-16 code units and returns a RegExpUtf16 that can be applied against UTF-16 encoded input.

The pattern must be a valid ECMAScript regular expression.

func MustCompileUtf16

func MustCompileUtf16(pattern []uint16, flags Flag) *RegExpUtf16

MustCompileUtf16 is like CompileUtf16 but panics if the expression cannot be parsed. It simplifies safe initialization of global variables containing regular expressions.

func (*RegExpUtf16) FindMatch

func (r *RegExpUtf16) FindMatch(source []uint16) *MatchUtf16

FindMatch applies r to a UTF-16 slice and returns the first match. If no match is found, it returns nil.

func (*RegExpUtf16) FindMatchStartingAt

func (r *RegExpUtf16) FindMatchStartingAt(source []uint16, pos int) *MatchUtf16

FindMatchStartingAt applies r to a UTF-16 slice beginning the search at pos, where pos is an index in UTF-16 code units. It returns the first match found at or after pos. If pos is out of range or no match is found, it returns nil.

func (*RegExpUtf16) FindMatchStartingAtSticky

func (r *RegExpUtf16) FindMatchStartingAtSticky(source []uint16, pos int) *MatchUtf16

FindMatchStartingAtSticky applies r to a UTF-16 slice requiring the match to start exactly at pos (sticky behavior), where pos is an index in UTF-16 code units. If the input at pos does not begin a match, or if pos is out of range, it returns nil.

This method is particularly useful for JavaScript engine implementers. The ECMAScript specification defines RegExp.prototype[Symbol.split] to create a new RegExp with the "y" (sticky) flag in order to constrain matching to the current position. By calling FindMatchStartingAtSticky instead, it is possible to avoid the overhead of allocating and compiling a new RegExp object, while still honoring the sticky semantics.

func (*RegExpUtf16) FindNextMatch

func (r *RegExpUtf16) FindNextMatch(match *MatchUtf16) *MatchUtf16

FindNextMatch searches for the next match of r in the same UTF-16 encoded input as a previously returned match.

The search begins at match.Groups[0].End. If the previous match was zero-length (Start == End), the search position is advanced by one input position before matching again to avoid returning the same empty match repeatedly.

If match is nil, or if no further match is found, FindNextMatch returns nil.
