regonaut

Version: v0.0.1 (latest) · Published: Sep 13, 2025 · License: MIT · Imports: 7 · Imported by: 0

Repository

github.com/auvred/regonaut

README ¶

regonaut

regonaut is a Go implementation of ECMAScript Regular Expressions.

It aims to be fully compatible with JavaScript's RegExp, including all ES2025 features and the Annex B legacy extensions.

Compatibility is verified against all test262 tests related to regular expressions.

That means a pattern that works in modern browsers or Node.js will behave the same way in Go.

Internally, the engine uses a backtracking approach. See Russ Cox's blog post for background on backtracking vs. other regexp implementations.

Installation

go get github.com/auvred/regonaut

Usage

TL;DR

package main

import (
	"fmt"
	"github.com/auvred/regonaut"
)

func main() {
	re := regonaut.MustCompile(".+(?<foo>bAr)", regonaut.FlagIgnoreCase)
	m := re.FindMatch([]byte("_Bar_"))
	fmt.Printf("Groups[0] - %q\n", m.Groups[0].Data())
	fmt.Printf("Groups[1] - %q\n", m.Groups[1].Data())
	fmt.Printf("NamedGroups[\"foo\"] - %q\n", m.NamedGroups["foo"].Data())
}

Unicode handling

ECMAScript and Go have different models for representing strings, and that difference is central to how this library works.

In ECMAScript, strings are defined as sequences of UTF-16 code units, and they can be ill-formed. For example, a string may contain a lone surrogate such as "\uD800", which is not a valid Unicode character on its own but is still considered a valid ECMAScript string. You can read more about it here.
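The difference can be seen with Go's standard library alone. The following is a minimal sketch that does not call regonaut at all; the helper names `surrogatePair` and `encodableAsUTF8` are illustrative, not part of any API. It shows that a supplementary character such as 🐱 occupies two UTF-16 code units, and that a lone surrogate like 0xD800 is a legal code unit but not a value UTF-8 can encode.

```go
package main

import (
	"fmt"
	"unicode/utf16"
	"unicode/utf8"
)

// surrogatePair returns the two UTF-16 code units that encode r.
// It is only meaningful for code points above U+FFFF; for anything else
// utf16.EncodeRune returns U+FFFD for both halves.
func surrogatePair(r rune) (hi, lo uint16) {
	h, l := utf16.EncodeRune(r)
	return uint16(h), uint16(l)
}

// encodableAsUTF8 reports whether r can appear in well-formed UTF-8.
// Surrogate code points (U+D800..U+DFFF) cannot.
func encodableAsUTF8(r rune) bool {
	return utf8.ValidRune(r)
}

func main() {
	hi, lo := surrogatePair('🐱')
	fmt.Printf("U+1F431 as UTF-16: %#04x %#04x\n", hi, lo)

	// A lone surrogate is representable in a []uint16 slice...
	fmt.Println("0xD800 is a surrogate:", utf16.IsSurrogate(0xD800))
	// ...but it cannot be stored in a valid UTF-8 string.
	fmt.Println("0xD800 encodable as UTF-8:", encodableAsUTF8(0xD800))
}
```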

Regular expressions in ECMAScript operate in two modes:

- Non-Unicode mode: both the pattern and the input string are treated as raw sequences of code units.
- Unicode mode: both the pattern and the input string are treated as sequences of code points.

Unicode mode is enabled when the u or v flag is provided.

Go, on the other hand, uses UTF-8 encoded strings. Because of this mismatch, the library provides two execution modes:

UTF-8 mode (recommended)

- Works with UTF-8 data: the pattern is a Go string and the input is a []byte
- Unicode awareness is always implied (the u flag is always enabled)
- Features specific to the v flag must still be enabled explicitly
- Both the pattern and the input must be valid UTF-8
- Both are processed as runes (each rune corresponds to a code point)
- Capturing group indices are reported as byte offsets within the original UTF-8 input
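Because offsets are in bytes, they do not line up with code point positions once the input contains multi-byte characters. The sketch below uses only the standard library; `runeIndex` is a hypothetical helper (not part of regonaut) that converts a byte offset, such as one taken from a group's Start or End, into a code point index.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// runeIndex converts a byte offset into src to a code point index.
// byteOffset is assumed to lie on a rune boundary, which is what a
// byte offset reported for a match over valid UTF-8 would be.
func runeIndex(src []byte, byteOffset int) int {
	return utf8.RuneCount(src[:byteOffset])
}

func main() {
	src := []byte("c🐱at")
	// Ranging over a string yields the byte offset at which each rune starts.
	// 🐱 takes 4 bytes in UTF-8, so the rune after it starts at byte 5.
	for i, r := range string(src) {
		fmt.Printf("byte offset %d = code point index %d: %q\n", i, runeIndex(src, i), r)
	}
}
```

Slicing the input with byte offsets (`src[start:end]`) needs no conversion; the helper only matters when a position has to be reported in characters.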

UTF-16 mode

- Works with []uint16 slices
- By default, each element of the slice is treated as a single code unit
- When the u or v flag is used, valid surrogate pairs are combined into single code points, while lone surrogates remain as they are
- Use this mode only if you specifically need ECMAScript-style UTF-16 handling (e.g., when implementing or testing against a JavaScript engine)
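UTF-16 mode needs []uint16 values for both the pattern and the input. One way to produce them from ordinary Go strings is the standard `unicode/utf16` package, sketched below. The helper names `toUTF16` and `fromUTF16` are assumptions for illustration and are not provided by regonaut.

```go
package main

import (
	"fmt"
	"unicode/utf16"
)

// toUTF16 encodes a Go (UTF-8) string as a slice of UTF-16 code units.
func toUTF16(s string) []uint16 {
	return utf16.Encode([]rune(s))
}

// fromUTF16 decodes UTF-16 code units back into a Go string.
// Lone surrogates cannot be stored in a Go string as-is, so
// utf16.Decode replaces them with U+FFFD.
func fromUTF16(u []uint16) string {
	return string(utf16.Decode(u))
}

func main() {
	src := toUTF16("c🐱at")
	fmt.Printf("%#v\n", src)

	// Round trip of well-formed input is lossless.
	fmt.Println(fromUTF16(src) == "c🐱at")

	// Slicing between the two halves of a surrogate pair leaves a lone
	// low surrogate at the front, which decodes to U+FFFD.
	fmt.Printf("%q\n", fromUTF16(src[2:]))
}
```

The lossy decode in the last step is why captures in UTF-16 mode are returned as []uint16 rather than as strings.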

Example

package main

import (
	"fmt"
	"github.com/auvred/regonaut"
)

func main() {
	var pattern = "c(.)(.)"
	var patternUtf16 = []uint16{'c', '(', '.', ')', '(', '.', ')'}

	var source = []byte("c🐱at")
	var sourceUtf16 = []uint16{'c', 0xD83D, 0xDC31, 'a', 't'}

	reUtf8 := regonaut.MustCompile(pattern, 0)
	m1 := reUtf8.FindMatch(source)
	fmt.Printf("UTF-8:                   %q, %q\n", m1.Groups[1].Data(), m1.Groups[2].Data())

	reUtf8Unicode := regonaut.MustCompile(pattern, regonaut.FlagUnicode)
	m2 := reUtf8Unicode.FindMatch(source)
	fmt.Printf("UTF-8 (with 'u' flag):   %q, %q\n", m2.Groups[1].Data(), m2.Groups[2].Data())

	reUtf16 := regonaut.MustCompileUtf16(patternUtf16, 0)
	m3 := reUtf16.FindMatch(sourceUtf16)
	fmt.Printf("UTF-16:                  %#v, %#v\n", m3.Groups[1].Data(), m3.Groups[2].Data())

	reUtf16Unicode := regonaut.MustCompileUtf16(patternUtf16, regonaut.FlagUnicode)
	m4 := reUtf16Unicode.FindMatch(sourceUtf16)
	fmt.Printf("UTF-16 (with 'u' flag):  %#v, %#v\n", m4.Groups[1].Data(), m4.Groups[2].Data())
}

Outputs:

UTF-8:                   "🐱", "a"
UTF-8 (with 'u' flag):   "🐱", "a"
UTF-16:                  []uint16{0xd83d}, []uint16{0xdc31}
UTF-16 (with 'u' flag):  []uint16{0xd83d, 0xdc31}, []uint16{0x61}

Mode	Flags	Matching semantics	Group 1 (`m.Groups[1].Data()`)	Group 2 (`m.Groups[2].Data()`)
UTF-8	—	Code points (UTF-8 mode implies `u`)	`"🐱"`	`"a"`
UTF-8	`u`	Code points	`"🐱"`	`"a"`
UTF-16	—	Code units (surrogates not paired)	`[]uint16{0xd83d}`	`[]uint16{0xdc31}`
UTF-16	`u`	Code points (surrogates paired)	`[]uint16{0xd83d, 0xdc31}`	`[]uint16{0x61}`

Note: 🐱 is U+1F431 CAT FACE. In UTF-16 mode without the u flag, it is seen as two separate surrogate code units (0xD83D, 0xDC31). With u, the two units are paired into a single code point.

Local Development

Prerequisites

- Go
- Node.js with Type Stripping support (version 22.18.0+, 23.6.0+, or 24+)
- pnpm

Setup

Make sure the test262 submodule is initialized:

git submodule update --init

Generate the test262 tests:

cd tools
pnpm i
pnpm run gen-test262-tests
cd ..

Running tests

# Run all tests, including test262
go test

# Run all tests, except test262
go test -skip 262

# Run all tests, excluding generated property-escapes tests (they are slow)
go test -skip 262/built-ins/RegExp/property-escapes/generated

License

MIT

Documentation ¶

Overview ¶

Package regonaut is an implementation of ECMAScript Regular Expressions.

Example ¶

re := MustCompile(".+(?<foo>bAr)", FlagIgnoreCase)
m := re.FindMatch([]byte("_Bar_"))
fmt.Printf("Groups[0] - %q\n", m.Groups[0].Data())
fmt.Printf("Groups[1] - %q\n", m.Groups[1].Data())
fmt.Printf("NamedGroups[\"foo\"] - %q\n", m.NamedGroups["foo"].Data())

Output:


Groups[0] - "_Bar"
Groups[1] - "Bar"
NamedGroups["foo"] - "Bar"

Example (Utf8_vs_Utf16) ¶

The input contains 🐱 (U+1F431 CAT FACE). In UTF-16 mode without 'u', it is seen as two separate surrogate code units (0xD83D, 0xDC31). With 'u', the two units are paired into a single code point.

var pattern = "c(.)(.)"
var patternUtf16 = []uint16{'c', '(', '.', ')', '(', '.', ')'}

var source = []byte("c🐱at")
var sourceUtf16 = []uint16{'c', 0xD83D, 0xDC31, 'a', 't'}

reUtf8 := MustCompile(pattern, 0)
m1 := reUtf8.FindMatch(source)
fmt.Printf("UTF-8:                   %q, %q\n", m1.Groups[1].Data(), m1.Groups[2].Data())

reUtf8Unicode := MustCompile(pattern, FlagUnicode)
m2 := reUtf8Unicode.FindMatch(source)
fmt.Printf("UTF-8 (with 'u' flag):   %q, %q\n", m2.Groups[1].Data(), m2.Groups[2].Data())

reUtf16 := MustCompileUtf16(patternUtf16, 0)
m3 := reUtf16.FindMatch(sourceUtf16)
fmt.Printf("UTF-16:                  %#v, %#v\n", m3.Groups[1].Data(), m3.Groups[2].Data())

reUtf16Unicode := MustCompileUtf16(patternUtf16, FlagUnicode)
m4 := reUtf16Unicode.FindMatch(sourceUtf16)
fmt.Printf("UTF-16 (with 'u' flag):  %#v, %#v\n", m4.Groups[1].Data(), m4.Groups[2].Data())

Output:


UTF-8:                   "🐱", "a"
UTF-8 (with 'u' flag):   "🐱", "a"
UTF-16:                  []uint16{0xd83d}, []uint16{0xdc31}
UTF-16 (with 'u' flag):  []uint16{0xd83d, 0xdc31}, []uint16{0x61}

Index ¶

type Flag
type Group
- func (g Group) Data() []byte
type GroupUtf16
- func (g GroupUtf16) Data() []uint16
type Match
type MatchUtf16
type RegExp
- func Compile(pattern string, flags Flag) (*RegExp, error)
- func MustCompile(pattern string, flags Flag) *RegExp
type RegExpUtf16
- func CompileUtf16(pattern []uint16, flags Flag) (*RegExpUtf16, error)
- func MustCompileUtf16(pattern []uint16, flags Flag) *RegExpUtf16

Types ¶

type Flag ¶

type Flag uint16

Flag is a bitmask of RegExp options. The zero value corresponds to /pattern/ with no flags. Combine flags with bitwise OR, e.g. FlagIgnoreCase|FlagMultiline.

const (
	// Case-insensitive matching ("i" flag).
	FlagIgnoreCase Flag = 1 << iota

	// "^" and "$" match line boundaries ("m" flag).
	FlagMultiline

	// "." matches line terminators ("s" flag).
	FlagDotAll

	// Unicode-aware mode ("u" flag).
	// If this flag is set, FlagAnnexB is ignored.
	FlagUnicode

	// Unicode set notation and string properties ("v" flag).
	// If this flag is set, FlagAnnexB is ignored.
	FlagUnicodeSets

	// Sticky match from current position ("y" flag).
	FlagSticky

	// Enables Annex B web-compat features.
	// When FlagUnicode or FlagUnicodeSets is set,
	// this flag is cleared automatically by the compiler.
	FlagAnnexB
)

type Group ¶

type Group struct {

	// Start is the inclusive start index of the captured substring,
	// or -1 if the group did not participate in the match.
	Start int
	// End is the exclusive end index of the captured substring,
	// or -1 if the group did not participate in the match.
	End int
	// Name is the group name if defined, otherwise empty.
	Name string
	// contains filtered or unexported fields
}

Group represents a single captured substring from a regular expression match against UTF-8 encoded input. It is safe for concurrent use by multiple goroutines.

func (Group) Data ¶

func (g Group) Data() []byte

Data returns the captured substring as a UTF-8 byte slice. If the group did not participate in the match (Start == -1), it returns nil.

type GroupUtf16 ¶

type GroupUtf16 struct {

	// Start is the inclusive start index of the captured substring,
	// or -1 if the group did not participate in the match.
	Start int
	// End is the exclusive end index of the captured substring,
	// or -1 if the group did not participate in the match.
	End int
	// Name is the group name if defined, otherwise empty.
	Name string
	// contains filtered or unexported fields
}

GroupUtf16 represents a single captured substring from a regular expression match against UTF-16 encoded input. It is safe for concurrent use by multiple goroutines.

func (GroupUtf16) Data ¶

func (g GroupUtf16) Data() []uint16

Data returns the captured substring as a UTF-16 code units slice. If the group did not participate in the match (Start == -1), it returns nil.

type Match ¶

type Match struct {
	// Groups is the ordered list of captures.
	// Groups[0] is the full match; subsequent entries correspond to
	// the capturing groups in the pattern.
	Groups []Group
	// NamedGroups maps a group name to its captured group.
	NamedGroups map[string]Group
}

Match holds the result of a successful match against UTF-8 input. It is safe for concurrent use by multiple goroutines.

type MatchUtf16 ¶

type MatchUtf16 struct {
	// Groups is the ordered list of captures.
	// Groups[0] is the full match; subsequent entries correspond to
	// the capturing groups in the pattern.
	Groups []GroupUtf16
	// NamedGroups maps a group name to its captured group.
	NamedGroups map[string]GroupUtf16
}

MatchUtf16 holds the result of a successful match against UTF-16 input. It is safe for concurrent use by multiple goroutines.

type RegExp ¶

type RegExp struct {
	// contains filtered or unexported fields
}

RegExp represents a compiled regular expression. It is safe for concurrent use by multiple goroutines: none of its methods mutate internal state.

func Compile ¶

func Compile(pattern string, flags Flag) (*RegExp, error)

Compile parses a regular expression pattern and returns a RegExp that can be applied against UTF-8 encoded input.

The pattern must be a valid ECMAScript regular expression.

func MustCompile ¶

func MustCompile(pattern string, flags Flag) *RegExp

MustCompile is like Compile but panics if the expression cannot be parsed. It simplifies safe initialization of global variables containing regular expressions.

func (*RegExp) FindMatch ¶

func (r *RegExp) FindMatch(source []byte) *Match

FindMatch applies r to a UTF-8 encoded byte slice and returns the first match. If no match is found, it returns nil.

func (*RegExp) FindMatchStartingAt ¶

func (r *RegExp) FindMatchStartingAt(source []byte, pos int) *Match

FindMatchStartingAt applies r to a UTF-8 encoded byte slice beginning the search at pos, where pos is a byte index into source. It returns the first match found at or after pos. If pos is out of range or no match is found, it returns nil.

func (*RegExp) FindNextMatch ¶

func (r *RegExp) FindNextMatch(match *Match) *Match

FindNextMatch searches for the next match of r in the same UTF-8 encoded input as a previously returned match.

The search begins at match.Groups[0].End. If the previous match was zero-length (Start == End), the search position is advanced by one input position before matching again to avoid returning the same empty match repeatedly.

If match is nil, or if no further match is found, FindNextMatch returns nil.

type RegExpUtf16 ¶

type RegExpUtf16 struct {
	// contains filtered or unexported fields
}

RegExpUtf16 represents a compiled regular expression. It is safe for concurrent use by multiple goroutines: none of its methods mutate internal state.

func CompileUtf16 ¶

func CompileUtf16(pattern []uint16, flags Flag) (*RegExpUtf16, error)

CompileUtf16 parses a regular expression pattern expressed as UTF-16 code units and returns a RegExpUtf16 that can be applied against UTF-16 encoded input.

The pattern must be a valid ECMAScript regular expression.

func MustCompileUtf16 ¶

func MustCompileUtf16(pattern []uint16, flags Flag) *RegExpUtf16

MustCompileUtf16 is like CompileUtf16 but panics if the expression cannot be parsed. It simplifies safe initialization of global variables containing regular expressions.

func (*RegExpUtf16) FindMatch ¶

func (r *RegExpUtf16) FindMatch(source []uint16) *MatchUtf16

FindMatch applies r to a UTF-16 slice and returns the first match. If no match is found, it returns nil.

func (*RegExpUtf16) FindMatchStartingAt ¶

func (r *RegExpUtf16) FindMatchStartingAt(source []uint16, pos int) *MatchUtf16

FindMatchStartingAt applies r to a UTF-16 slice beginning the search at pos, where pos is an index in UTF-16 code units. It returns the first match found at or after pos. If pos is out of range or no match is found, it returns nil.

func (*RegExpUtf16) FindMatchStartingAtSticky ¶

func (r *RegExpUtf16) FindMatchStartingAtSticky(source []uint16, pos int) *MatchUtf16

FindMatchStartingAtSticky applies r to a UTF-16 slice requiring the match to start exactly at pos (sticky behavior), where pos is an index in UTF-16 code units. If the input at pos does not begin a match, or if pos is out of range, it returns nil.

This method is particularly useful for JavaScript engine implementers. The ECMAScript specification defines RegExp.prototype[Symbol.split] to create a new RegExp with the "y" (sticky) flag in order to constrain matching to the current position. By calling FindMatchStartingAtSticky instead, it is possible to avoid the overhead of allocating and compiling a new RegExp object, while still honoring the sticky semantics.

func (*RegExpUtf16) FindNextMatch ¶

func (r *RegExpUtf16) FindNextMatch(match *MatchUtf16) *MatchUtf16

FindNextMatch searches for the next match of r in the same UTF-16 encoded input as a previously returned match.

The search begins at match.Groups[0].End. If the previous match was zero-length (Start == End), the search position is advanced by one input position before matching again to avoid returning the same empty match repeatedly.

If match is nil, or if no further match is found, FindNextMatch returns nil.
