xmltext

package
v0.0.14
Published: Jan 19, 2026 License: MIT Imports: 12 Imported by: 0

README

xmltext

xmltext is a streaming XML 1.0 tokenizer optimized for low-allocation parsing with caller-owned buffers. It is used by internal/xml and the validator to parse XML without building a DOM.

Goals

  • fast, streaming tokenization over io.Reader
  • minimal allocations with caller-owned buffers
  • explicit options for entity expansion and token emission

XML declaration validation

Strict(true) validates XML declarations (<?xml ...?>): version must be 1.0, and encoding and standalone (if present) must follow in that order with valid values.

dec := xmltext.NewDecoder(r, xmltext.Strict(true))
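For reference, a declaration with every field present, in the order strict mode requires:

```xml
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
```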

Encoding

The decoder accepts UTF-8 by default. If the input indicates a non-UTF-8 encoding (BOM or XML declaration), the decoder calls the configured charset reader. When no charset reader is set, it returns an "unsupported encoding" error.

Use WithCharsetReader to provide a decoder; xmltext does not ship charset implementations.

Usage

dec := xmltext.NewDecoder(r,
    xmltext.ResolveEntities(true),
    xmltext.CoalesceCharData(true),
)
var tok xmltext.Token

for {
    err := dec.ReadTokenInto(&tok)
    if err == io.EOF {
        break
    }
    if err != nil {
        return err
    }

    if tok.Kind == xmltext.KindStartElement {
        name := tok.Name
        // use name within the lifetime of this buffer
        _ = name
    }
}

Examples

On-demand entity expansion for text:

dec := xmltext.NewDecoder(r,
    xmltext.ResolveEntities(false),
    xmltext.CoalesceCharData(true),
)
var tok xmltext.Token
scratch := make([]byte, 256)

for {
    err := dec.ReadTokenInto(&tok)
    if err == io.EOF {
        break
    }
    if err != nil {
        return err
    }
    if tok.Kind != xmltext.KindCharData {
        continue
    }

    text := tok.Text
    if tok.TextNeeds {
        for {
            n, err := dec.UnescapeInto(scratch, tok.Text)
            if err == io.ErrShortBuffer {
                scratch = make([]byte, len(scratch)*2+len(tok.Text))
                continue
            }
            if err != nil {
                return err
            }
            text = scratch[:n]
            break
        }
    }
    _ = text
}

Attribute values without forcing expansion:

dec := xmltext.NewDecoder(r, xmltext.ResolveEntities(false))
var tok xmltext.Token
scratch := make([]byte, 256)

for {
    err := dec.ReadTokenInto(&tok)
    if err == io.EOF {
        break
    }
    if err != nil {
        return err
    }
    if tok.Kind != xmltext.KindStartElement {
        continue
    }

    for _, attr := range tok.Attrs {
        name := attr.Name
        value := attr.Value
        if attr.ValueNeeds {
            for {
                n, err := dec.UnescapeInto(scratch, attr.Value)
                if err == io.ErrShortBuffer {
                    scratch = make([]byte, len(scratch)*2+len(attr.Value))
                    continue
                }
                if err != nil {
                    return err
                }
                value = scratch[:n]
                break
            }
        }
        _ = name
        _ = value
    }
}

Retaining token data beyond the next decoder call:

var tok xmltext.Token
err := dec.ReadTokenInto(&tok)
if err != nil {
    return err
}
stable := append([]byte(nil), tok.Name...)
_ = stable

SAX-Style Struct Unmarshaling

Unlike encoding/xml.Unmarshal, which populates structs via reflection, xmltext streams tokens for manual struct population. Track element context and populate fields on events:

type Book struct {
    Title  string
    Author string
    Year   string
}

func UnmarshalBook(r io.Reader) (Book, error) {
    dec := xmltext.NewDecoder(r,
        xmltext.ResolveEntities(true),
        xmltext.CoalesceCharData(true),
    )

    var book Book
    var current string // tracks current element
    var tok xmltext.Token

    for {
        err := dec.ReadTokenInto(&tok)
        if err == io.EOF {
            break
        }
        if err != nil {
            return Book{}, err
        }

        switch tok.Kind {
        case xmltext.KindStartElement:
            current = string(tok.Name)
        case xmltext.KindCharData:
            text := string(tok.Text)
            switch current {
            case "title":
                book.Title = text
            case "author":
                book.Author = text
            case "year":
                book.Year = text
            }
        case xmltext.KindEndElement:
            current = ""
        }
    }
    return book, nil
}

For nested structures, use a stack or state machine to track depth:

type Library struct {
    Books []Book
}

func UnmarshalLibrary(r io.Reader) (Library, error) {
    dec := xmltext.NewDecoder(r,
        xmltext.ResolveEntities(true),
        xmltext.CoalesceCharData(true),
    )

    var lib Library
    var current Book
    var inBook bool
    var field string
    var tok xmltext.Token

    for {
        err := dec.ReadTokenInto(&tok)
        if err == io.EOF {
            break
        }
        if err != nil {
            return Library{}, err
        }

        switch tok.Kind {
        case xmltext.KindStartElement:
            name := string(tok.Name)
            if name == "book" {
                inBook = true
                current = Book{}
            } else if inBook {
                field = name
            }
        case xmltext.KindCharData:
            if !inBook {
                continue
            }
            text := string(tok.Text)
            switch field {
            case "title":
                current.Title = text
            case "author":
                current.Author = text
            case "year":
                current.Year = text
            }
        case xmltext.KindEndElement:
            name := string(tok.Name)
            if name == "book" {
                lib.Books = append(lib.Books, current)
                inBook = false
            }
            field = ""
        }
    }
    return lib, nil
}

This approach avoids reflection and DOM allocation, giving full control over parsing. Use SkipValue() to skip unwanted subtrees efficiently.

Token lifetimes

Token slices are backed by the token's internal buffers and are overwritten on the next ReadTokenInto call that reuses the token. Copy slices if you need to keep them.

ReadValueInto

ReadValueInto writes the next subtree or token payload into dst and returns the number of bytes written. When ResolveEntities(true) is set, entity expansion is applied. It returns io.ErrShortBuffer if dst is too small; the value is still consumed, so retrying with a larger buffer reads the next value, not the failed one.

Error model

Well-formedness errors return *xmltext.SyntaxError, which includes line and column information when TrackLineColumn(true) is enabled.

Footguns

  • token slices are reused; copy them if you need to keep data past the next call
  • ReadTokenInto overwrites the Token contents every time
  • Token retains its largest slices; assign a zero value to release memory
  • ReadValueInto writes into dst; use the returned length to slice the buffer
  • CDATA and CharData merge into a single CharData token when coalescing is on
  • ResolveEntities(false) leaves entity references in Text/Attr values
  • non-UTF-8 encodings require WithCharsetReader

Options

Common options include:

  • WithCharsetReader (decode non-UTF-8 encodings)
  • WithEntityMap (custom named entity replacements)
  • ResolveEntities
  • Strict
  • CoalesceCharData
  • TrackLineColumn
  • EmitComments, EmitPI, EmitDirectives
  • MaxDepth, MaxAttrs, MaxTokenSize
  • FastValidation

MaxDepth, MaxAttrs, and MaxTokenSize are unlimited by default (0). Set them when parsing untrusted input to cap memory growth; tokens exactly MaxTokenSize bytes long are allowed. FastValidation() does not set MaxTokenSize.

Strict validates XML declarations: version must be 1.0, and encoding and standalone (if present) must follow in that order with valid values. In non-strict mode, the declaration is treated like a PI and only checked for general PI well-formedness.

See docs/xmltext-architecture.md for the design and buffer model.

Documentation

Overview

Package xmltext provides a streaming XML 1.0 tokenizer that returns caller-owned bytes and avoids building a DOM.

Index

Constants

This section is empty.

Variables

This section is empty.

Functions

This section is empty.

Types

type Attr added in v0.0.7

type Attr struct {
	Name  []byte
	Value []byte
	// ValueNeeds reports whether Value includes unresolved entity references.
	ValueNeeds bool
}

Attr holds an attribute name and value for a start element token. Name and Value are backed by the Token that produced them.

type Decoder

type Decoder struct {
	// contains filtered or unexported fields
}

Decoder streams XML tokens and copies token bytes into caller-owned storage.

func NewDecoder

func NewDecoder(r io.Reader, opts ...Options) *Decoder

NewDecoder creates a new XML decoder for the reader.

func (*Decoder) InputOffset

func (d *Decoder) InputOffset() int64

InputOffset reports the absolute byte offset of the next read position.

func (*Decoder) ReadTokenInto

func (d *Decoder) ReadTokenInto(dst *Token) error

ReadTokenInto reads the next XML token into dst. Slices in dst are overwritten on the next call that reuses dst.

func (*Decoder) ReadValueInto added in v0.0.7

func (d *Decoder) ReadValueInto(dst []byte) (int, error)

ReadValueInto writes the next element subtree or token into dst and returns the number of bytes written. It returns io.ErrShortBuffer if dst is too small and still consumes the value.

func (*Decoder) Reset

func (d *Decoder) Reset(r io.Reader, opts ...Options)

Reset prepares the decoder for reading from r with new options.

func (*Decoder) SkipValue

func (d *Decoder) SkipValue() error

SkipValue skips the current value without materializing it.

func (*Decoder) StackPointer

func (d *Decoder) StackPointer() string

StackPointer renders the current stack path using local names.

func (*Decoder) UnescapeInto added in v0.0.7

func (d *Decoder) UnescapeInto(dst, data []byte) (int, error)

UnescapeInto expands entity references in data into dst and returns the number of bytes written. It returns io.ErrShortBuffer if dst is too small.

type Kind

type Kind byte

Kind identifies the syntactic kind of an XML token.

const (
	KindNone Kind = iota
	KindStartElement
	KindEndElement
	KindCharData
	KindComment
	KindPI
	KindDirective
	KindCDATA
)

func (Kind) String

func (k Kind) String() string

String returns a stable name for the kind, suitable for debugging.

type Options

type Options struct {
	// contains filtered or unexported fields
}

Options holds decoder configuration values. The zero value means no overrides.

func CoalesceCharData

func CoalesceCharData(value bool) Options

CoalesceCharData merges adjacent text tokens into a single CharData token.

func EmitComments

func EmitComments(value bool) Options

EmitComments controls whether comment tokens are emitted.

func EmitDirectives

func EmitDirectives(value bool) Options

EmitDirectives controls whether directive tokens are emitted.

func EmitPI

func EmitPI(value bool) Options

EmitPI controls whether processing instruction tokens are emitted.

func FastValidation

func FastValidation() Options

FastValidation returns a preset tuned for validation throughput.

func JoinOptions

func JoinOptions(srcs ...Options) Options

JoinOptions combines multiple option sets into one in declaration order. Later options override earlier ones when set.

func MaxAttrs

func MaxAttrs(value int) Options

MaxAttrs limits the number of attributes on a start element.

func MaxDepth

func MaxDepth(value int) Options

MaxDepth limits element nesting depth.

func MaxQNameInternEntries

func MaxQNameInternEntries(value int) Options

MaxQNameInternEntries limits the number of interned QNames. Zero means no limit.

func MaxTokenSize

func MaxTokenSize(value int) Options

MaxTokenSize limits the maximum size of a single token in bytes. Tokens exactly MaxTokenSize bytes long are allowed.

func ResolveEntities

func ResolveEntities(value bool) Options

ResolveEntities controls whether entity references are expanded.

func Strict added in v0.0.7

func Strict(value bool) Options

Strict enables XML declaration validation. It enforces version and encoding/standalone ordering and values.

func TrackLineColumn

func TrackLineColumn(value bool) Options

TrackLineColumn controls whether line and column tracking is enabled.

func WithCharsetReader

func WithCharsetReader(fn func(label string, r io.Reader) (io.Reader, error)) Options

WithCharsetReader registers a decoder for non-UTF-8/UTF-16 encodings.

func WithEntityMap

func WithEntityMap(values map[string]string) Options

WithEntityMap configures custom named entity replacements.

func (Options) QNameInternEntries added in v0.0.10

func (opts Options) QNameInternEntries() (int, bool)

QNameInternEntries reports the configured QName interner limit.

type SyntaxError

type SyntaxError struct {
	// Err is the underlying parser error.
	Err error
	// Path is the stack path at the error location.
	Path string
	// Snippet is a short input slice near the failure point.
	Snippet []byte
	// Offset is the absolute byte offset in the input stream.
	Offset int64
	// Line is the 1-based line number when tracking is enabled.
	Line int
	// Column is the 1-based column number when tracking is enabled.
	Column int
}

SyntaxError reports a well-formedness error with location context.

func (*SyntaxError) Error

func (e *SyntaxError) Error() string

Error formats the syntax error with location and cause.

func (*SyntaxError) Unwrap

func (e *SyntaxError) Unwrap() error

Unwrap exposes the underlying error.

type Token

type Token struct {
	Attrs []Attr
	Text  []byte
	Name  []byte

	Line      int
	Column    int
	TextNeeds bool
	IsXMLDecl bool
	Kind      Kind
	// contains filtered or unexported fields
}

Token is a decoded XML token with caller-owned byte slices. Slices are backed by the Token's internal buffers and remain valid until the next ReadTokenInto call that reuses the Token.

func (*Token) Reserve added in v0.0.9

func (t *Token) Reserve(sizes TokenSizes)

Reserve ensures the token has at least the requested capacities. It resets the buffer lengths to zero.

type TokenSizes added in v0.0.9

type TokenSizes struct {
	Name      int
	Text      int
	Attrs     int
	AttrName  int
	AttrValue int
}

TokenSizes controls initial buffer capacities.
