Introduction
Since the last release, the TVM community has worked to deliver the following exciting new improvements!
The main tags are below (bold text indicates areas with significant progress): Relax (especially the PyTorch frontend), CUDA, and more.
Please visit the full listing of commits for a complete view: v0.20.dev0...v0.20.0.rc0.
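As a quick, hedged illustration of the workflow targeted by the Relax PyTorch (ExportedProgram) frontend and the new `tvm.compile` interface highlighted in this release, a minimal sketch follows; the model, target string, and exact argument names are illustrative assumptions rather than part of the release itself.

```python
# A minimal sketch, assuming a recent torch and a TVM build containing the
# Relax ExportedProgram importer and the tvm.compile interface from this
# release; exact argument names and defaults may differ between versions.
import torch
from torch.export import export

import tvm
from tvm.relax.frontend.torch import from_exported_program


class MLP(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.fc = torch.nn.Linear(16, 4)

    def forward(self, x):
        return torch.nn.functional.relu(self.fc(x))


# Export the model with torch.export, then translate the ExportedProgram
# into a Relax IRModule.
exported = export(MLP(), (torch.randn(1, 16),))
mod = from_exported_program(exported)
mod.show()

# Compile through the unified entry point introduced in #17710 / #17718;
# the "llvm" target here is an assumption for illustration.
executable = tvm.compile(mod, target="llvm")
```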
Community
None.
RFCs
None.
Adreno
- #17608 - [WINDOWS] Windows build dependencies for Adreno target
BugFix
- #17761 - [FIX][RELAX] fix fusion of transpose + matmul when constant weight
- #17762 - [Fix] Fix OpenCL header in attention utils
- #17711 - [Fix][dlight] add an explicit reduction loop check in Reduce
- #17697 - [Fix] Include `<chrono>` for `std::chrono`
- #17677 - Declare build backend for python package
- #17598 - [TIR][FIX] update FlopEstimator to include missing nodes
- #17601 - [Flashinfer][Fix] fix missing args in flashinfer test
- #17607 - [FIX][TVMC] Fix the mixed precision conversion pipeline
CI
- #17687 - Update images to 20250226-223225-63bc315f
- #17680 - update images to 20250225-035137-aeadc31c
- #17675 - [skip ci] Update github tvmbot
- #17635 - Cleanup legacy files
- #17634 - [skip ci] Improve build time
- #17629 - [skip ci] Robustify CI for SPOT failure
- #17620 - Unpin pytest-profiling
- #17621 - [skip ci] Remove legacy CI runners protection
- #17619 - [Refactor] Remove legacy frontend tests
Dlight
- #17754 - Fix general reduction rule to support non-last reduction axis
- #17663 - [CPU] Add CPU Backend Support for GEMV Optimization
Docker
- #17691 - Fix ml_dtypes downgrade issue introduced by TensorFlow
- #17686 - Update ml_dtypes to 0.5.1+
- #17676 - Use Torch GPU on gpu device
- #17648 - Tensorflow (aka TFLite) upgrade to 2.18.0
- #17643 - Update ml_dtypes version
- #17638 - [skip ci] Update ml_dtypes version
- #17617 - Tensorflow upgrade to 2.18.0
Docs
MetaSchedule
- #17104 - Adding post optimization in MetaSchedule to Improve Scheduling
OpenCL & CLML
- #17571 - [OPENCL][TEXTURE] Improved texture memory planning
Relax
- #17814 - [PyTorch] Add stack.default and sum.default to exported programs translator
- #17820 - [PyTorch] Add support for broadcast_to, narrow ops
- #17822 - [PyTorch] Cleanup tests for ExportedProgram frontend
- #17806 - [PyTorch] Add Softplus Op Support for Exported Program and FX graph
- #17817 - [PyTorch] Support dynamic shapes in ExportedProgram frontend
- #17813 - [PyTorch] Improve ExportedProgram frontend by supporting `unflatten.int`, `hardtanh_.default`, `dropout_.default`, `silu_.default`, `add_.Tensor` and `relu_.default`
- #17812 - [PyTorch] Support argsort, topk ops for ExportedProgram importer
- #17810 - [PyTorch] Add support for argsort, sort, topk ops
- #17809 - [PyTorch] Delete duplicate converter function `_to`
- #17807 - [PyTorch] Fix torch 2.6 compatibility issues
- #17797 - [Pytorch] Update SELU Implementation Using Decomposed Core-Level Ops
- #17802 - [Pytorch] support for arange in exported programs translator
- #17801 - [PyTorch] Support where, cumprod and reciprocal ops for ExportedProgram importer
- #17790 - [PyTorch] Add support for index_select
- #17786 - [PyTorch] Support softshrink op for ExportedProgram
- #17788 - [PyTorch] Add support for where, cumprod and reciprocal ops
- #17785 - [PyTorch] Support prod, std and var ops for ExportedProgram importer
- #17778 - [PyTorch] Support log2, log10 and log1p ops for ExportedProgram importer
- #17772 - [PyTorch] Add support for prod, std and var ops
- #17766 - [PyTorch] Add support for log2, log10 and log1p ops
- #17760 - [PyTorch] Add support for lerp, select and clone ops
- #17751 - [PyTorch] Support one_hot, empty_like ops for ExportedProgram importer
- #17747 - [PyTorch] Support flip, gather, take ops for ExportedProgram importer
- #17738 - [PyTorch] Support elu, celu, selu ops for ExportedProgram importer
- #17726 - [PyTorch] Add support for numel, empty_like and one_hot ops
- #17707 - [PyTorch] Add support for gather, flip and take ops
- #17702 - [PyTorch] Add support for celu, selu, is_floating_point ops
- #17694 - [PyTorch] Add support for elu, hardtanh ops
- #17689 - [PyTorch] Support several binary ops for ExportedProgram importer
- #17672 - [PyTorch] Refactor binary ops tests
- #17679 - [PyTorch] Support several unary ops for ExportedProgram importer
- #17668 - [PyTorch] Add support for and_, lshift, min, or_, rshift, xor ops
- #17664 - [PyTorch] Add support for ge, gt, le, mod, ne ops
- #17659 - [PyTorch] Add support for bitwise_not, isfinite, isinf, isnan, logical_not, sign and square ops
- #17622 - [PyTorch] Add support for abs, ceil, erf, floor, log ops and refactor unary tests
- #17566 - [ONNX] Add prim expression support to Neg converter and update Arange converter to use relax.op.arange
- #17642 - [ONNX] Replace topi.split with relax.op.split in the ONNX frontend
- #17674 - [KVCache] PagedKVCache refactor, FlashInfer JIT and MLA integration
- #17618 - [KVCache] TIR attention kernel support for MLA
- #17615 - [KVCache] Add KV Cache for CPU Runtime
- #17616 - [Runtime][KVCache] Initial interface setup for MLA
- #17782 - [Frontend] Support max/min in frontend op interface
- #17758 - Allow ingesting tensor.chunk() from exported torch program
- #17781 - Enable bfloat16 for softmax struct-info inference
- #17752 - Batch norm correctness on eval mode
- #17774 - check for tensor_meta in exported_program_translator
- #17757 - Tensor.split with uneven tensors
- #17749 - Move TIR backend to gpu_generic
- #17725 - Ingest Tensor.clamp from torch export
- #17724 - Add support to ingest Tensor.expand_as()
- #17723 - Add torch exported program ingestion capability for Tensor.detach(), Tensor.copy_, and aten.lift_fresh_copy
- #17721 - Allow ingesting Upsample module from torch.export either using Size or Scale Factor argument
- #17722 - Allow ingesting vector_norm from torch.export
- #17728 - ingest Tensor.contiguous from torch export
- #17700 - Fix tree attention for Qwen2-1.5 models
- #17682 - Add support for func attr inheritance in SplitLayoutRewritePreproc
- #17654 - [BYOC] OpenCLML offload support for Relax
- #17633 - Pipeline file reorganization
- #17626 - Initial setup of relax backend pipeline
- #17568 - [PASS] Convert layout pass and ops enhanced to support sub indexing
Runtime
- #17614 - [CLML] Profiling options enabled for CLML
- #17570 - [OPENCL] Bugfix
TIR
- #17799 - Fix reduce buffer allocation position
- #17783 - [REFACTOR] Remove legacy tir::any
- #17706 - Minor fix for default GPU schedule
- #17579 - [SoftwarePipeline] Ensure pipeline epilogue and prologue do not overlap
- #17584 - [LoopPartition] enforcement on loop partition control
TVMC
- #17606 - Bug fix
cuda & cutlass & tensorrt
- #17789 - [CUTLASS] Add blockwise scale gemm/bmm kernels
- #17741 - [Codegen][CUDA] Fix codegen of cast among vector bfloat16, fp8 and fp4
- #17708 - [CUDA] FP4 cast and reinterpret support
- #17639 - [CUDA] Remove htanh from unsupported math ops for CUDA 12.8
- #16950 - [Codegen, CUDA] Add FP8 Tensor Core Codegen
web
- #17695 - [WASM] Update wasm include in accordance to kv cache revamp
Misc
- #17796 - [Cublas] Added support for bfloat16 while dispatching to cublas kernels
- #17763 - [Flashinfer] Added jit flow for sampling kernel
- #17811 - [NFC] Fix `explict` typo
- #17780 - [3rdparty] Enable bfloat16 for custom allreduce kernel
- #17784 - [REFACTOR] Phase out StackVM
- #17750 - BugFix: Relax comment
- #17748 - [Codegen] Support codegen for vectorized tir.ShuffleNode
- #17743 - Fix: Change variable i to x in split operation in cross_compilation_and_rpc.py
- #17730 - [Attention] Added caching for flashinfer binaries during JIT
- #17733 - [Refactor] Clean up Relay references in the codebase
- #17739 - [BF16] Support ndarray.asnumpy() to bfloat16 tensor natively using ml_dtypes
- #17734 - Remove Google Analytics
- #17731 - [IR] Compact Functor vtable
- #17736 - Fix typos in comments and strings
- #17670 - [DataType] BF16 Support
- #17727 - [FFI] Fix dynamic FFI index to ensure compatibility
- #17718 - [Refactor] Migrate build API to `tvm.compile`
- #17714 - [FFI] Phase out ctypes fallback in favor of cython
- #17716 - Fix the get_target_compute_version for sm >= 100
- #17710 - [Refactor] Introduce base Executable class and `tvm.compile` interface
- #17713 - [REFACTOR] Cleanup legacy relay runtime data structures
- #17712 - [DataType] Rename FP8 dtypes to standard names
- #17703 - Fix typos in multiple files
- #17693 - updated the assert in BindParams to allow tvm.relax.Constant
- #17701 - [Refactor] Remove legacy TE schedule tag
- #17683 - [MSC] Remove relay
- #17688 - Fix relax.ccl.scatter_from_worker0 assert
- #17630 - [Codegen] FP4 support
- #17685 - [REFACTOR] Cleanup legacy TE-based passes
- #17681 - [REFACTOR] Followup cleanup of relay phase out
- #17678 - Bump 3rdparty/cutlass_fpA_intB_gemm
- #17669 - [REFACTOR] Allow target dependent default tir pipeline dispatch in tir.build()
- #17665 - [REFACTOR] move build flow from C++ to Python
- #17624 - Added support for normal MLA kernel
- #17641 - Pick up vector length from 'zvlXXXb' (RVV) mattr for riscv
- #17666 - [Refactor] Improve TargetHasSVE function with optional target handling
- #17661 - [Refactor] Phase out python dependency `decorator`
- #17662 - [REFACTOR] Phase out te.Schedule c++ components
- #17660 - [REFACTOR] Phase out relay c++ components
- #17655 - Upgrading onnx and onnxrt versions
- #17657 - Update argument order for relax.op.pad to make it round-trippable
- #17658 - [REFACTOR] Phase out te.schedule python components
- #17653 - Update images to 20250214-034537-bd1411f8
- #17656 - [REFACTOR] Phase out relay python components
- #17649 - [Refactor] Phase out python dependency attrs
- #17644 - Bump rollup from 2.79.1 to 2.79.2 in /web
- #17637 - [PYTHON] Build cython by default
- #17631 - Handle vector width (VLEN) for RISCV arches
- #17613 - Bug Fix: Removed unused code
- #17585 - [Relay] Disable InferType if it was done and no changes after previous pass
- #17605 - [Refactor] Phase out legacy example apps
- #17603 - [Refactor] Phase out legacy docs
- #17513 - [GRAPH RT] Additional API support