xmltext

package
v0.0.14
Published: Jan 19, 2026 License: MIT Imports: 12 Imported by: 0

README

xmltext

xmltext is a streaming XML 1.0 tokenizer optimized for low-allocation parsing with caller-owned buffers. It is used by internal/xml and the validator to parse XML without building a DOM.

Goals

  • fast, streaming tokenization over io.Reader
  • minimal allocations with caller-owned buffers
  • explicit options for entity expansion and token emission

XML declaration validation

Strict(true) validates XML declarations (<?xml ...?>): version must be 1.0, and encoding and standalone (if present) must follow in that order with valid values.

dec := xmltext.NewDecoder(r, xmltext.Strict(true))
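For reference, a declaration with every field present, in the order strict mode requires:

```xml
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
```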

Encoding

The decoder accepts UTF-8 by default. If the input indicates a non-UTF-8 encoding (BOM or XML declaration), the decoder calls the configured charset reader. When no charset reader is set, it returns an "unsupported encoding" error.

Use WithCharsetReader to provide a decoder; xmltext does not ship charset implementations.

Usage

dec := xmltext.NewDecoder(r,
    xmltext.ResolveEntities(true),
    xmltext.CoalesceCharData(true),
)
var tok xmltext.Token

for {
    err := dec.ReadTokenInto(&tok)
    if err == io.EOF {
        break
    }
    if err != nil {
        return err
    }

    if tok.Kind == xmltext.KindStartElement {
        name := tok.Name
        // use name within the lifetime of this buffer
        _ = name
    }
}

Examples

On-demand entity expansion for text:

dec := xmltext.NewDecoder(r,
    xmltext.ResolveEntities(false),
    xmltext.CoalesceCharData(true),
)
var tok xmltext.Token
scratch := make([]byte, 256)

for {
    err := dec.ReadTokenInto(&tok)
    if err == io.EOF {
        break
    }
    if err != nil {
        return err
    }
    if tok.Kind != xmltext.KindCharData {
        continue
    }

    text := tok.Text
    if tok.TextNeeds {
        for {
            n, err := dec.UnescapeInto(scratch, tok.Text)
            if err == io.ErrShortBuffer {
                scratch = make([]byte, len(scratch)*2+len(tok.Text))
                continue
            }
            if err != nil {
                return err
            }
            text = scratch[:n]
            break
        }
    }
    _ = text
}

Attribute values without forcing expansion:

dec := xmltext.NewDecoder(r, xmltext.ResolveEntities(false))
var tok xmltext.Token
scratch := make([]byte, 256)

for {
    err := dec.ReadTokenInto(&tok)
    if err == io.EOF {
        break
    }
    if err != nil {
        return err
    }
    if tok.Kind != xmltext.KindStartElement {
        continue
    }

    for _, attr := range tok.Attrs {
        name := attr.Name
        value := attr.Value
        if attr.ValueNeeds {
            for {
                n, err := dec.UnescapeInto(scratch, attr.Value)
                if err == io.ErrShortBuffer {
                    scratch = make([]byte, len(scratch)*2+len(attr.Value))
                    continue
                }
                if err != nil {
                    return err
                }
                value = scratch[:n]
                break
            }
        }
        _ = name
        _ = value
    }
}

Retaining token data beyond the next decoder call:

var tok xmltext.Token
err := dec.ReadTokenInto(&tok)
if err != nil {
    return err
}
stable := append([]byte(nil), tok.Name...)
_ = stable

SAX-Style Struct Unmarshaling

Unlike encoding/xml.Unmarshal, which populates structs via reflection, xmltext streams tokens for manual struct population. Track element context and populate fields on events:

type Book struct {
    Title  string
    Author string
    Year   string
}

func UnmarshalBook(r io.Reader) (Book, error) {
    dec := xmltext.NewDecoder(r,
        xmltext.ResolveEntities(true),
        xmltext.CoalesceCharData(true),
    )

    var book Book
    var current string // tracks current element
    var tok xmltext.Token

    for {
        err := dec.ReadTokenInto(&tok)
        if err == io.EOF {
            break
        }
        if err != nil {
            return Book{}, err
        }

        switch tok.Kind {
        case xmltext.KindStartElement:
            current = string(tok.Name)
        case xmltext.KindCharData:
            text := string(tok.Text)
            switch current {
            case "title":
                book.Title = text
            case "author":
                book.Author = text
            case "year":
                book.Year = text
            }
        case xmltext.KindEndElement:
            current = ""
        }
    }
    return book, nil
}

For nested structures, use a stack or state machine to track depth:

type Library struct {
    Books []Book
}

func UnmarshalLibrary(r io.Reader) (Library, error) {
    dec := xmltext.NewDecoder(r,
        xmltext.ResolveEntities(true),
        xmltext.CoalesceCharData(true),
    )

    var lib Library
    var current Book
    var inBook bool
    var field string
    var tok xmltext.Token

    for {
        err := dec.ReadTokenInto(&tok)
        if err == io.EOF {
            break
        }
        if err != nil {
            return Library{}, err
        }

        switch tok.Kind {
        case xmltext.KindStartElement:
            name := string(tok.Name)
            if name == "book" {
                inBook = true
                current = Book{}
            } else if inBook {
                field = name
            }
        case xmltext.KindCharData:
            if !inBook {
                continue
            }
            text := string(tok.Text)
            switch field {
            case "title":
                current.Title = text
            case "author":
                current.Author = text
            case "year":
                current.Year = text
            }
        case xmltext.KindEndElement:
            name := string(tok.Name)
            if name == "book" {
                lib.Books = append(lib.Books, current)
                inBook = false
            }
            field = ""
        }
    }
    return lib, nil
}

This approach avoids reflection and DOM allocation, giving full control over parsing. Use SkipValue() to skip unwanted subtrees efficiently.

Token lifetimes

Token slices are backed by the token's internal buffers and are overwritten on the next ReadTokenInto call that reuses the token. Copy slices if you need to keep them.

ReadValueInto

ReadValueInto writes the next subtree or token payload into dst and returns the number of bytes written. When ResolveEntities(true) is set, entity expansion is applied. It returns io.ErrShortBuffer if dst is too small; the value is still consumed, so retrying with a larger buffer reads the next value, not the failed one.

Error model

Well-formedness errors return *xmltext.SyntaxError, which includes line and column information when TrackLineColumn(true) is enabled.

Footguns

  • token slices are reused; copy them if you need to keep data past the next call
  • ReadTokenInto overwrites the Token contents every time
  • Token retains its largest slices; assign a zero value to release memory
  • ReadValueInto writes into dst; use the returned length to slice the buffer
  • CDATA and CharData merge into a single CharData token when coalescing is on
  • ResolveEntities(false) leaves entity references in Text/Attr values
  • non-UTF-8 encodings require WithCharsetReader

Options

Common options include:

  • WithCharsetReader (decode non-UTF-8 encodings)
  • WithEntityMap (custom named entity replacements)
  • ResolveEntities
  • Strict
  • CoalesceCharData
  • TrackLineColumn
  • EmitComments, EmitPI, EmitDirectives
  • MaxDepth, MaxAttrs, MaxTokenSize
  • FastValidation

MaxDepth, MaxAttrs, and MaxTokenSize are unlimited by default (0). Set them when parsing untrusted input to cap memory growth; tokens exactly MaxTokenSize bytes long are allowed. FastValidation() does not set MaxTokenSize.

Strict validates XML declarations: version must be 1.0, and encoding and standalone (if present) must follow in that order with valid values. In non-strict mode, the declaration is treated like a PI and only checked for general PI well-formedness.

See docs/xmltext-architecture.md for the design and buffer model.

Documentation

Overview

Package xmltext provides a streaming XML 1.0 tokenizer that returns caller-owned bytes and avoids building a DOM.

Index

Constants

This section is empty.

Variables

This section is empty.

Functions

This section is empty.

Types

type Attr added in v0.0.7

type Attr struct {
	Name  []byte
	Value []byte
	// ValueNeeds reports whether Value includes unresolved entity references.
	ValueNeeds bool
}

Attr holds an attribute name and value for a start element token. Name and Value are backed by the Token that produced them.

type Decoder

type Decoder struct {
	// contains filtered or unexported fields
}

Decoder streams XML tokens and copies token bytes into caller-owned storage.

func NewDecoder

func NewDecoder(r io.Reader, opts ...Options) *Decoder

NewDecoder creates a new XML decoder for the reader.

func (*Decoder) InputOffset

func (d *Decoder) InputOffset() int64

InputOffset reports the absolute byte offset of the next read position.

func (*Decoder) ReadTokenInto

func (d *Decoder) ReadTokenInto(dst *Token) error

ReadTokenInto reads the next XML token into dst. Slices in dst are overwritten on the next call that reuses dst.

func (*Decoder) ReadValueInto added in v0.0.7

func (d *Decoder) ReadValueInto(dst []byte) (int, error)

ReadValueInto writes the next element subtree or token into dst and returns the number of bytes written. It returns io.ErrShortBuffer if dst is too small and still consumes the value.

func (*Decoder) Reset

func (d *Decoder) Reset(r io.Reader, opts ...Options)

Reset prepares the decoder for reading from r with new options.

func (*Decoder) SkipValue

func (d *Decoder) SkipValue() error

SkipValue skips the current value without materializing it.

func (*Decoder) StackPointer

func (d *Decoder) StackPointer() string

StackPointer renders the current stack path using local names.

func (*Decoder) UnescapeInto added in v0.0.7

func (d *Decoder) UnescapeInto(dst, data []byte) (int, error)

UnescapeInto expands entity references in data into dst and returns the number of bytes written. It returns io.ErrShortBuffer if dst is too small.

type Kind

type Kind byte

Kind identifies the syntactic kind of an XML token.

const (
	KindNone Kind = iota
	KindStartElement
	KindEndElement
	KindCharData
	KindComment
	KindPI
	KindDirective
	KindCDATA
)

func (Kind) String

func (k Kind) String() string

String returns a stable name for the kind, suitable for debugging.

type Options

type Options struct {
	// contains filtered or unexported fields
}

Options holds decoder configuration values. The zero value means no overrides.

func CoalesceCharData

func CoalesceCharData(value bool) Options

CoalesceCharData merges adjacent text tokens into a single CharData token.

func EmitComments

func EmitComments(value bool) Options

EmitComments controls whether comment tokens are emitted.

func EmitDirectives

func EmitDirectives(value bool) Options

EmitDirectives controls whether directive tokens are emitted.

func EmitPI

func EmitPI(value bool) Options

EmitPI controls whether processing instruction tokens are emitted.

func FastValidation

func FastValidation() Options

FastValidation returns a preset tuned for validation throughput.

func JoinOptions

func JoinOptions(srcs ...Options) Options

JoinOptions combines multiple option sets into one in declaration order. Later options override earlier ones when set.

func MaxAttrs

func MaxAttrs(value int) Options

MaxAttrs limits the number of attributes on a start element.

func MaxDepth

func MaxDepth(value int) Options

MaxDepth limits element nesting depth.

func MaxQNameInternEntries

func MaxQNameInternEntries(value int) Options

MaxQNameInternEntries limits the number of interned QNames. Zero means no limit.

func MaxTokenSize

func MaxTokenSize(value int) Options

MaxTokenSize limits the maximum size of a single token in bytes. Tokens exactly MaxTokenSize bytes long are allowed.

func ResolveEntities

func ResolveEntities(value bool) Options

ResolveEntities controls whether entity references are expanded.

func Strict added in v0.0.7

func Strict(value bool) Options

Strict enables XML declaration validation. It enforces version and encoding/standalone ordering and values.

func TrackLineColumn

func TrackLineColumn(value bool) Options

TrackLineColumn controls whether line and column tracking is enabled.

func WithCharsetReader

func WithCharsetReader(fn func(label string, r io.Reader) (io.Reader, error)) Options

WithCharsetReader registers a decoder for non-UTF-8/UTF-16 encodings.

func WithEntityMap

func WithEntityMap(values map[string]string) Options

WithEntityMap configures custom named entity replacements.

func (Options) QNameInternEntries added in v0.0.10

func (opts Options) QNameInternEntries() (int, bool)

QNameInternEntries reports the configured QName interner limit.

type SyntaxError

type SyntaxError struct {
	// Err is the underlying parser error.
	Err error
	// Path is the stack path at the error location.
	Path string
	// Snippet is a short input slice near the failure point.
	Snippet []byte
	// Offset is the absolute byte offset in the input stream.
	Offset int64
	// Line is the 1-based line number when tracking is enabled.
	Line int
	// Column is the 1-based column number when tracking is enabled.
	Column int
}

SyntaxError reports a well-formedness error with location context.

func (*SyntaxError) Error

func (e *SyntaxError) Error() string

Error formats the syntax error with location and cause.

func (*SyntaxError) Unwrap

func (e *SyntaxError) Unwrap() error

Unwrap exposes the underlying error.

type Token

type Token struct {
	Attrs []Attr
	Text  []byte
	Name  []byte

	Line      int
	Column    int
	TextNeeds bool
	IsXMLDecl bool
	Kind      Kind
	// contains filtered or unexported fields
}

Token is a decoded XML token with caller-owned byte slices. Slices are backed by the Token's internal buffers and remain valid until the next ReadTokenInto call that reuses the Token.

func (*Token) Reserve added in v0.0.9

func (t *Token) Reserve(sizes TokenSizes)

Reserve ensures the token has at least the requested capacities. It resets the buffer lengths to zero.

type TokenSizes added in v0.0.9

type TokenSizes struct {
	Name      int
	Text      int
	Attrs     int
	AttrName  int
	AttrValue int
}

TokenSizes controls initial buffer capacities.
