v0.6.0
Pre-releaseHighlight
- Support Flink hybrid shuffle integration with Apache Celeborn
- Support HARD_SPLIT in PushMergedData
- Optimize skew partition logic for Reduce Mode to avoid sorting shuffle files
- Refine the celeborn RESTful APIs and introduce Celeborn CLI for automation
- Supporting worker Tags
- CongestionController supports control traffic by user/worker traffic speed
- Support read shuffle from S3
- Support CppClient in Celeborn
- HELM Chart Optimization
Improvement
[CELEBORN-2006] LifecycleManager should avoid parsing shufflePartitionType every time
[CELEBORN-2007] Reduce PartitionLocation memory usage
[CELEBORN-2008] SlotsAllocator should select disks randomly in RoundRobin mode
[CELEBORN-1896] delete data from failed to fetch shuffles
[CELEBORN-2004] Filter empty partition before createIntputStream
[CELEBORN-2002][MASTER] Audit shuffle lifecycle in separate log file
[CELEBORN-1993] CelebornConf introduces celeborn..io.threads to specify number of threads used in the client thread pool
[CELEBORN-1995] Optimize memory usage for push failed batches
[CELEBORN-1994] Introduce disruptor dependency to support asynchronous logging of log4j2
[CELEBORN-1965] Rely on all default hadoop providers for S3 auth
[CELEBORN-1982] Slot Selection Perf Improvements
[CELEBORN-1961] Convert Resource.proto from Protocol Buffers version 2 to version 3
[CELEBORN-1931] use gather API for local flusher to optimize write io pattern
[CELEBORN-1916] Support Aliyun OSS Based on MPU Extension Interface
[CELEBORN-1925] Support Flink 2.0
[CELEBORN-1844][CIP-8] introduce tier writer proxy and simplify partition data writer
[CELEBORN-1921] Broadcast large GetReducerFileGroupResponse to prevent Spark driver network exhausted
[CELEBORN-1928][CIP-12] Support HARD_SPLIT in PushMergedData should support handle older worker success response
[CELEBORN-1930][CIP-12] Support HARD_SPLIT in PushMergedData should handle congestion control NPE issue
[CELEBORN-1577][PHASE2] QuotaManager should support interrupt shuffle
[CELEBORN-1856] Support stage-rerun when read partition by chunkOffsets when enable optimize skew partition read
[CELEBORN-1894] Allow skipping already read chunks during unreplicated shuffle read retried
[CELEBORN-1909] Support pre-run static code blocks of TransportMessages to improve performance of protobuf serialization
[CELEBORN-1911] Move multipart-uploader to multipart-uploader/multipart-uploader-s3 for extensibility
[CELEBORN-1910] Remove redundant synchronized of isTerminated in ThreadUtils#sameThreadExecutorService
[CELEBORN-1898] SparkOutOfMemoryError compatible with Spark 4.0 and 4.1
[CELEBORN-1897] Avoid calling toString for too long messages
[CELEBORN-1882] Support configuring the SSL handshake timeout for SSLHandler
[CELEBORN-1883] Replace HashSet with ConcurrentHashMap.newKeySet for ShuffleFileGroups
[CELEBORN-1858] Support DfsPartitionReader read partition by chunkOffsets when enable optimize skew partition read
[CELEBORN-1879] Ignore invalid chunk range generated by splitSkewedPartitionLocations
[CELEBORN-1857] Support LocalPartitionReader read partition by chunkOffsets when enable optimize skew partition read
[CELEBORN-1876] Log remote address on RPC exception for TransportRequestHandler
[CELEBORN-1865] Update master endpointRef when master leader is abnormal
[CELEBORN-1319] Optimize skew partition logic for Reduce Mode to avoid sorting shuffle files
[CELEBORN-1861] Support celeborn.worker.storage.baseDir.diskType option to specify disk type of base directory for worker
[CELEBORN-1757] Add retry when sending RPC to LifecycleManager
[CELEBORN-1859] DfsPartitionReader and LocalPartitionReader should reuse pbStreamHandlers get from BatchOpenStream request
[CELEBORN-1843] Optimize roundrobin for more balanced disk slot allocation
[CELEBORN-1854] Change receive revive request log level to debug
[CELEBORN-1841] Support custom implementation of EventExecutorChooser to avoid deadlock when calling await in EventLoop thread
[CELEBORN-1847][CIP-8] Introduce local and DFS tier writer
[CELEBORN-1850] Setup worker endpoint after initalizing controller
[CELEBORN-1838] Interrupt spark task should not report fetch failure
[CELEBORN-1835][CIP-8] Add tier writer base and memory tier writer
[CELEBORN-1829] Replace waitThreadPoll's thread pool with ScheduledExecutorService in Controller
[CELEBORN-1720] Prevent stage re-run if another task attempt is running or successful
[CELEBORN-1482][CIP-8] Add partition meta handler
[CELEBORN-1812] Distinguish sorting-file from sort-tasks waiting to be submitted
[CELEBORN-1737] Support build tez client package
[CELEBORN-1802] Fail the celeborn master/worker start if CELEBORN_CONF_DIR is not directory
[CELEBORN-1701][FOLLOWUP] Support stage rerun for shuffle data lost
[CELEBORN-1413] Support Spark 4.0
[CELEBORN-1753] Optimize the code for exists
and find
method
[CELEBORN-1777] Add java.security.jgss/sun.security.krb5
to DEFAULT_MODULE_OPTIONS
[CELEBORN-1731] Support merged kv input for Tez
[CELEBORN-1732] Support unordered kv input for Tez
[CELEBORN-1733] Support ordered grouped kv input for Tez
[CELEBORN-1721][CIP-12] Support HARD_SPLIT in PushMergedData
[CELEBORN-1758] Remove the empty user resource consumption from worker heartbeat
[CELEBORN-1729] Support ordered KV output for Tez
[CELEBORN-1730] Support unordered KV output for Tez
[CELEBORN-1612] Add a basic reader writer class to Tez
[CELEBORN-1725] Optimize performance of handling MapperEnd
RPC in LifecycleManager
[CELEBORN-1700] Flink supports fallback to vanilla Flink built-in shuffle implementation
[CELEBORN-1621][CIP-11] Predefined worker tags expr via dynamic configs
[CELEBORN-1545] Add Tez plugin skeleton and dag app master
[CELEBORN-1530] support MPU for S3
[CELEBORN-1618][CIP-11] Supporting tags via DB Config Service
[CELEBORN-1726] Update WorkerInfo when transition worker state
[CELEBORN-1715] RemoteShuffleMaster should check celeborn.client.push.replicate.enabled in constructor
[CELEBORN-1714] Optimize handleApplicationLost
[CELEBORN-1660] Cache available workers and only count the available workers device free capacity
[CELEBORN-1619][CIP-11] Integrate tags manager with config service
[CELEBORN-1071] Support stage rerun for shuffle data lost
[CELEBORN-1697] Improve ThreadStackTrace for thread dump
[CELEBORN-1671] CelebornShuffleReader will try replica if create client failed
[CELEBORN-1682] Add java tools.jar into classpath for JVM quake
[CELEBORN-1660] Using map for workers to find worker fast
[CELEBORN-1673] Support retry create client
[CELEBORN-1577][PHASE1] Storage quota should support interrupt shuffle
[CELEBORN-1642][CIP-11] Support multiple worker tags
[CELEBORN-1636] Client supports dynamic update of Worker resources on the server
[CELEBORN-1651] Support ratio threshold of unhealthy disks for excluding worker
[CELEBORN-1601] Support revise lost shuffles
[CELEBORN-1487][PHASE2] CongestionController support dynamic config
[CELEBORN-1648] Refine AppUniqueId with UUID suffix
[CELEBORN-1490][CIP-6] Introduce tier consumer for hybrid shuffle
[CELEBORN-1645] Introduce ShuffleFallbackPolicy to support custom implementation of shuffle fallback policy for CelebornShuffleFallbackPolicyRunner
[CELEBORN-1620][CIP-11] Support passing worker tags via RequestSlots message
[CELEBORN-1487][PHASE1] CongestionController support control traffic by user/worker traffic speed
[CELEBORN-1637] Enhance config to bypass memory check for partition file sorter
[CELEBORN-1638] Improve the slots allocator performance
[CELEBORN-1617][CIP-11] Support workers tags in FS Config Service
[CELEBORN-1615] Start the http server after all handlers added
[CELEBORN-1574] Speed up unregister shuffle by batch processing
[CELEBORN-1625] Add parameter skipCompress
for pushOrMergeData
[CELEBORN-1597][CIP-11] Implement TagsManager
[CELEBORN-1490][CIP-6] Impl worker write process for Flink Hybrid Shuffle
[CELEBORN-1602] do hard split for push merged data RPC with disk full
[CELEBORN-1594] Refine dynamicConfig template and prevent NPE
[CELEBORN-1513] Support wildcard bind in dual stack environments
[CELEBORN-1587] Change to debug logging on client side for SortBasedPusher trigger push
[CELEBORN-1496] Differentiate map results with only different stageAttemptId
[CELEBORN-1583] MasterClient#sendMessageInner should throw Throwable for celeborn.masterClient.maxRetries is 0
[CELEBORN-1578] Make Worker#timer have thread name and daemon
[CELEBORN-1580] ReadBufferDispacther should notify exception to listener
[CELEBORN-1575] TimeSlidingHub should remove expire node when reading
[CELEBORN-1560] Remove usages of deprecated Files.createTempDir of Guava
[CELEBORN-1567] Support throw FetchFailedException when Data corruption detected
[CELEBORN-1568] Support worker retries in MiniCluster
[CELEBORN-1563] Log networkLocation in WorkerInfo
[CELEBORN-1531] Refactor self checks in master
[CELEBORN-1550] Add support of providing custom dynamic store backend implementation
[CELEBORN-1529] Read shuffle data from S3
[CELEBORN-1518] Add support for Apache Spark barrier stages
[CELEBORN-1542] Master supports to check the worker host pattern on worker registration
[CELEBORN-1535] Support to disable master workerUnavailableInfo expiration
[CELEBORN-1511] Add support for custom master endpoint resolver
[CELEBORN-1541] Enhance the readable address for internal port
[CELEBORN-1533] Log location when CelebornInputStream#fillBuffer fails
[CELEBORN-1524] Support IPv6 hostnames for Apache Ratis
[CELEBORN-1519] Do not update estimated partition size if it is unchanged
[CELEBORN-1521] Introduce celeborn-spi module for authentication extensions
[CELEBORN-1469] Support writing shuffle data to OSS(S3 only)
[CELEBORN-1446] Enable chunk prefetch when initialize CelebornInputStream
[CELEBORN-1509] Reply response without holding a lock
[CELEBORN-1543] Support Flink 1.20
[CELEBORN-1504] Support for Apache Flink 1.16
[CELEBORN-1500] Filter out empty InputStreams
[CELEBORN-1495] CelebornColumnDictionary supports dictionary of float and double column type
[CELEBORN-1494] Support IPv6 addresses in PbSerDeUtils.fromPackedPartitionLocations
[CELEBORN-1483] Add storage policy
[CELEBORN-1479] Report register shuffle failed reason in exception
[CELEBORN-1489] Update Flink support with authentication support
[CELEBORN-1485] Refactor addCounter, addGauge and addTimer of AbstractSource to reduce CPU utilization
[CELEBORN-1472] Reduce CongestionController#userBufferStatuses call times
[CELEBORN-1471] CelebornScalaObjectMapper supports configuring FAIL_ON_UNKNOWN_PROPERTIES to false
Flink hybrid shuffle
[CELEBORN-1867][FLINK] Fix flink client memory leak of TransportResponseHandler#outstandingRpcs for handling addCredit and notifyRequiredSegment response
[CELEBORN-1866][FLINK] Fix CelebornChannelBufferReader request more buffers than needed
[CELEBORN-1490][CIP-6] Support process large buffer in flink hybrid shuffle
[CELEBORN-1490][CIP-6] Add Flink hybrid shuffle doc
[CELEBORN-1490][CIP-6] Impl worker read process in Flink Hybrid Shuffle
[CELEBORN-1490][CIP-6] Introduce tier producer in celeborn flink client
[CELEBORN-1490][CIP-6] Introduce tier factory and master agent in flink hybrid shuffle
[CELEBORN-1490][CIP-6] Enrich register shuffle method
[CELEBORN-1490][CIP-6] Extends FileMeta to support hybrid shuffle
[CELEBORN-1490][CIP-6] Extends message to support hybrid shuffle
RESTful API and CLI
[CELEBORN-1875] Support to get workers topology information with RESTful api
[CELEBORN-1797] Support to adjust the logger level with RESTful API during runtime
[CELEBORN-1750] Return struct worker resource consumption information with RESTful api
[CELEBORN-1707] Audit all RESTful api calls and use separate restAuditFile
[CELEBORN-1632] Support to apply ratis local raft_meta_conf command with RESTful api
[CELEBORN-1599] Container Info REST API
[CELEBORN-1630] Support to apply ratis peer operation with RESTful api
[CELEBORN-1631] Support to apply ratis snapshot operation with RESTful api
[CELEBORN-1633] Return more raft group information
[CELEBORN-1629] Support to apply ratis election operation with RESTful api
[CELEBORN-1609] Support SSL for celeborn RESTful service
[CELEBORN-1608] Reduce redundant response for RESTful api
[CELEBORN-1607] Enable useEnumCaseInsensitive
for openapi-generator
[CELEBORN-1589] Ensure master is leader for some POST request APIs
[CELEBORN-1572] Celeborn CLI initial REST API support
[CELEBORN-1553] Using the request base url as swagger server
[CELEBORN-1537] Support to remove workers unavailable info with RESTful api
[CELEBORN-1546] Support authorization on swagger UI
[CELEBORN-1477] Using openapi-generator apache-httpclient
library instead of jersey2
[CELEBORN-1493] Check admin privileges for http mutative requests
[CELEBORN-1477][CIP-9] Refine the celeborn RESTful APIs
[CELEBORN-1476] Enhance the RESTful response error msg
[CELEBORN-1475] Fix unknownExcludedWorkers filter for /exclude request
[CELEBORN-1318] Support celeborn http authentication
Monitoring
[CELEBORN-1800] Introduce ApplicationTotalCount and ApplicationFallbackCount metric to record the total and fallback count of application
[CELEBORN-1974] ApplicationId as metrics label should be behind a config flag
[CELEBORN-1968] Publish metric for unreleased partition location count when worker was gracefully shutdown
[CELEBORN-1977] Add help/type on prometheus exposed metrics
[CELEBORN-1831] Add ratis commitIndex metrics
[CELEBORN-1817] add committed file size metrics
[CELEBORN-1791] All NettyMemoryMetrics should register to source
[CELEBORN-1804] Shuffle environment metrics of RemoteShuffleEnvironment should use Shuffle.Remote metric group
[CELEBORN-1634][FOLLOWUP] Add rpc metrics into grafana dashboard
[CELEBORN-1766] Add detail metrics about fetch chunk
[CELEBORN-1756] Only gauge hdfs metrics if HDFS storage enabled to reduce metrics
[CELEBORN-1634] Support queue time/processing time metrics for rpc framework
[CELEBORN-1706] Use bytes(IEC) unit instead of bytes(SI) for size related metrics in prometheus dashboard
[CELEBORN-1685] ShuffleFallbackPolicy supports ShuffleFallbackCount metric
[CELEBORN-1680] Introduce ShuffleFallbackCount metrics
[CELEBORN-1444][FOLLOWUP] Add IsDecommissioningWorker to celeborn dashboard
[CELEBORN-1640] NettyMemoryMetrics supports numHeapArenas, numDirectArenas, tinyCacheSize, smallCacheSize, normalCacheSize, numThreadLocalCaches and chunkSize
[CELEBORN-1656] Remove duplicate UserProduceSpeed metrics
[CELEBORN-1627] Introduce instance
variable for celeborn dashboard to filter metrics
[CELEBORN-1582] Publish metric for unreleased shuffle count when worker was decommissioned
[CELEBORN-1501] Introduce application dimension resource consumption metrics of Worker
[CELEBORN-1586] Add available workers Metrics
[CELEBORN-1581] Fix incorrect metrics of DeviceCelebornFreeBytes and DeviceCelebornTotalBytes
[CELEBORN-1491] introduce flusher working queue size metric
CPP Client
[CELEBORN-1978][CIP-14] Add code style checking for cppClient
[CELEBORN-1958][CIP-14] Add testsuite to test writing with javaClient and reading with cppClient
[CELEBORN-1874][CIP-14] Add CICD procedure in github action for cppClient
[CELEBORN-1932][CIP-14] Adapt java's serialization to support cpp serialization for GetReducerFileGroup/Response
[CELEBORN-1915][CIP-14] Add reader's ShuffleClient to cppClient
[CELEBORN-1906][CIP-14] Add CelebornInputStream to cppClient
[CELEBORN-1881][CIP-14] Add WorkerPartitionReader to cppClient
[CELEBORN-1871][CIP-14] Add NettyRpcEndpointRef to cppClient
[CELEBORN-1863][CIP-14] Add TransportClient to cppClient
[CELEBORN-1845][CIP-14] Add MessageDispatcher to cppClient
[CELEBORN-1836][CIP-14] Add Message to cppClient
[CELEBORN-1827][CIP-14] Add messageDecoder to cppClient
[CELEBORN-1821][CIP-14] Add controlMessages to cppClient
[CELEBORN-1819][CIP-14] Refactor cppClient with nested namespace
[CELEBORN-1814][CIP-14] Add transportMessage to cppClient
[CELEBORN-1809][CIP-14] Add partitionLocation to cppClient
[CELEBORN-1799][CIP-14] Add celebornConf to cppClient
[CELEBORN-1785][CIP-14] Add baseConf to cppClient
[CELEBORN-1772][CIP-14] Add memory module to cppClient
[CELEBORN-1761][CIP-14] Add cppProto to cppClient
[CELEBORN-1754][CIP-14] Add exceptions and checking utils to cppClient
[CELEBORN-1751][CIP-14] Add celebornException utils to cppClient
[CELEBORN-1740][CIP-14] Add stackTrace utils to cppClient
[CELEBORN-1741][CIP-14] Add processBase utils to cppClient
[CELEBORN-1724][CIP-14] Add environment setup tools for CppClient development
HELM Chart Optimization
[CELEBORN-1996][HELM] Rename volumes.{master,worker} to {master,worker}.volumes and {master.worker}.volumeMounts
[CELEBORN-1989][HELM] Split securityContext into master.podSecurityContext and worker.podSecurityContext
[CELEBORN-1988][HELM] Split hostNetwork into master.hostNetwork and worker.hostNetwork
[CELEBORN-1987][HELM] Split dnsPolicy into master.dnsPolicy and worker.dnsPolicy
[CELEBORN-1981][HELM] Rename masterReplicas and workerReplicas to master.replicas and worker.replicas
[CELEBORN-1985][HELM] Add new values master.envFrom and worker.envFrom
[CELEBORN-1986][HELM] Rename priorityClass.{master,worker} to {master,worker}.priorityClass
[CELEBORN-1980][HELM] Split environments into master.env and worker.env
[CELEBORN-1972][HELM] Rename affinity.{master,worker} to {master,worker}.affinity
[CELEBORN-1951][HELM] Rename resources.{master,worker} to {master,worker}.resoruces
[CELEBORN-1962][HELM] Split tolerations into master.tolerations and worker.tolerations
[CELEBORN-1955][HELM] Split nodeSelector into master.nodeSelector and worker.nodeSelector
[CELEBORN-1953][HELM] Split podAnnotations into master.annotations and worker.annotations
[CELEBORN-1954][HELM] Add a new value image.registry
[CELEBORN-1952][HELM] Define template helpers for master/worker respectively
[CELEBORN-1532][HELM] Read log4j2 and metrics configurations from file
[CELEBORN-1830] Chart statefulset resources key duplicate
[CELEBORN-1788] Add role and roleBinding helm charts
[CELEBORN-1786] add serviceAccount helm chart
[CELEBORN-1780] Add support for NodePort Service per Master replica
[CELEBORN-1552] automatically support prometheus to scrape metrics for helm chart
Stability and Bug Fix
[CELEBORN-1902] Read client throws PartitionConnectionException
[CELEBORN-2000] Ignore the getReducerFileGroup timeout before shuffle stage end
[CELEBORN-1960] Fix PauseSpentTime only append the interval check time
[CELEBORN-1998] RemoteShuffleEnvironment should not register InputChannelMetrics repeatedly
[CELEBORN-1999] OpenStreamTime should use requestId to record cost time
[CELEBORN-1855] LifecycleManager return appshuffleId for non barrier stage when fetch fail has been reported
[CELEBORN-1948] Fix the issue where replica may lose data when HARD_SPLIT occurs during handlePushMergeData
[CELEBORN-1912] Client should send heartbeat to worker for processing heartbeat to avoid reading idleness of worker which enables heartbeat
[CELEBORN-1992] Ensure hadoop FS are not closed by hadoop ShutdownHookManager
[CELEBORN-1919] Hardsplit batch tracking should be disabled when pushing only a single replica
[CELEBORN-1979] Change partition manager should respect the celeborn.storage.availableTypes
[CELEBORN-1983] Fix fetch fail not throw due to reach spark maxTaskFailures
[CELEBORN-1976] CommitHandler should use celeborn.client.rpc.commitFiles.askTimeout for timeout of doParallelCommitFiles
[CELEBORN-1966] Added fix to get userinfo in celeborn
[CELEBORN-1969] Remove celeborn.client.shuffle.mapPartition.split.enabled to enable shuffle partition split at default for MapPartition
[CELEBORN-1970] Use StatusCode.fromValue instead of Utils.toStatusCode
[CELEBORN-1646][FOLLOWUP] DeviceMonitor should notifyObserversOnError with CRITICAL_ERROR disk status for input/ouput error
[CELEBORN-1929] Avoid unnecessary buffer loss to get better buffer reusability
[CELEBORN-1918] Add batchOpenStream time to fetch wait time
[CELEBORN-1947] Reduce log for CelebornShuffleReader sleeping before inputStream ready
[CELEBORN-1923] Correct Celeborn available slots calculation logic
[CELEBORN-1914] incWriteTime when ShuffleWriter invoke pushGiantRecord
[CELEBORN-1822] Respond to RegisterShuffle with max epoch PartitionLocation to avoid revive
[CELEBORN-1899] Fix configuration bug in shuffle s3
[CELEBORN-1885] Fix nullptr exceptions in FetchChunk after worker restart
[CELEBORN-1889] Fix scala 2.13 complie error
[CELEBORN-1846] Fix the StreamHandler usage in fetching chunk when task attempt is odd
[CELEBORN-1792] MemoryManager resume should use pinnedDirectMemory instead of usedDirectMemory
[CELEBORN-1832] MapPartitionData should create fixed thread pool with registration of ThreadPoolSource
[CELEBORN-1820] Failing to write and flush StreamChunk data should be counted as FETCH_CHUNK_FAIL
[CELEBORN-1818] Fix incorrect timeout exception when waiting on no pending writes
[CELEBORN-1763] Fix DataPusher be blocked for a long time
[CELEBORN-1783] Fix Pending task in commitThreadPool wont be canceled
[CELEBORN-1782] Worker in congestion control should be in blacklist to avoid impact new shuffle
[CELEBORN-1510] Partial task unable to switch to the replica
[CELEBORN-1778] Fix commitInfo NPE and add assert in LifecycleManagerCommitFilesSuite
[CELEBORN-1670] Avoid swallowing InterruptedException in ShuffleClientImpl
Revert "[CELEBORN-1376] Push data failed should always release request body"
[CELEBORN-1770] FlushNotifier should setException for all Throwables in Flusher
[CELEBORN-1769] Fix packed partition location cause GetReducerFileGroupResponse lose location
[CELEBORN-1767] Fix occasional errors in UT when creating workers
[CELEBORN-1765] Fix NPE when removeFileInfo in StorageManager
[CELEBORN-1760] OOM causes disk buffer unable to be released
[CELEBORN-1743] Resolve the metrics data interruption and the job failure caused by locked resources
[CELEBORN-1759] Fix reserve slots might lost partition location between 0.4 client and 0.5 server
[CELEBORN-1749] Fix incorrect application diskBytesWritten metrics
[CELEBORN-1713] RpcTimeoutException should include RPC address in message
[CELEBORN-1728] Fix NPE when failing to connect to celeborn worker
[CELEBORN-1727] Correct the calculation of worker diskInfo actualUsableSpace
[CELEBORN-1718] Fix memory storage file won't hard split when memory file is full and worker has no disks
[CELEBORN-1705] Fix disk buffer size is negative issue
[CELEBORN-1686] Avoid return the same pushTaskQueue
[CELEBORN-1696] StorageManager#cleanFile should remove file info
[CELEBORN-1691] Fix the issue that upstream tasks don't rerun and the current task still retry when failed to decompress in flink
[CELEBORN-1693] Fix storageFetcherPool concurrent problem
[CELEBORN-1674] Fix reader thread name of MapPartitionData
[CELEBORN-1668] Fix NPE when handle closed file writers
[CELEBORN-1669] Fix NullPointerException for PartitionFilesSorter#updateSortedShuffleFiles after cleaning up expired shuffle key
[CELEBORN-1667] Fix NPE & LEAK occurring prior to worker registration
[CELEBORN-1664] Fix secret fetch failures after LEADER master failover
[CELEBORN-1665] CommitHandler should process CommitFilesResponse with COMMIT_FILE_EXCEPTION status
[CELEBORN-1663] Only register appShuffleDeterminate if stage using celeborn for shuffle
[CELEBORN-1661] Make sure that the sortedFilesDb is initialized successfully when worker enable graceful shutdown
[CELEBORN-1662] Handle PUSH_DATA_FAIL_PARTITION_NOT_FOUND in getPushDataFailCause
[CELEBORN-1655] Fix read buffer dispatcher thread terminate unexpectedly
[CELEBORN-1646] Catch exception of Files#getFileStore for DeviceMonitor and StorageManager for input/ouput error
[CELEBORN-1652] Throw TransportableError for failure of sending PbReadAddCredit to avoid flink task get stuck
[CELEBORN-1643] DataPusher handle InterruptedException
[CELEBORN-1579] Fix the memory leak of result partition
[CELEBORN-1564] Fix actualUsableSpace of offerSlotsLoadAware condition on diskInfo
[CELEBORN-1573] Change to debug logging on client side for reserve slots
[CELEBORN-1557] Fix totalSpace of DiskInfo for Master in HA mode
[CELEBORN-1549] Fix networkLocation persistence into Ratis
[CELEBORN-1558] Fix the incorrect decrement of pendingWrites in handlePushMergeData
[CELEBORN-1547] Worker#listTopDiskUseApps should return celeborn.metrics.app.topDiskUsage.count applications
[CELEBORN-1544] ShuffleWriter needs to call close finally to avoid memory leaks
[CELEBORN-1473] TransportClientFactory should register netty memory metric with source for shared pooled ByteBuf allocator
[CELEBORN-1522] Fix applicationId extraction from shuffle key
[CELEBORN-1516] DynamicConfigServiceFactory should support singleton
[CELEBORN-1515] SparkShuffleManager should set lifecycleManager to null after stopping lifecycleManager in Spark 2
[CELEBORN-1439] Fix revive logic bug which will casue data correctness issue and job failiure
[CELEBORN-1507] Prevent invalid Filegroups from being used
[CELEBORN-1478] Fix wrong use partitionId as shuffleId when readPartition
Build
[CELEBORN-1801] Remove out-of-dated flink 1.14 and 1.15
[CELEBORN-1746] Reduce the size of aws dependencies
[CELEBORN-1677][BUILD] Update SCM information for SBT build configuration
[CELEBORN-1649] Bumping up maven to 3.9.9
[CELEBORN-1658] Add Git Commit Info and Build JDK Spec to sbt Manifest
[CELEBORN-1659] Fix sbt make-distribution for cli
[CELEBORN-1616] Shade com.google.thirdparty to prevent dependency conflicts
[CELEBORN-1606] Generate dependencies-client-flink-1.16
[CELEBORN-1600] Enable check server dependencies
[CELEBORN-1556] Update Github actions to v4
[CELEBORN-1559] Fix make distribution script failed to recognize specific profile
[CELEBORN-1526] Fix MR plugin can not run on Hadoop 3.1.0
[CELEBORN-1527] Error prone plugin should exclude target/generated-sources/java of module
[CELEBORN-1505] Algin the celeborn server jackson dependency versions
Documentation
[CELEBORN-1870] Fix typos in in 'Developer' documents
[CELEBORN-1860] Remove unused celeborn..io.enableVerboseMetrics option
[CELEBORN-1823] Remove unused remote-shuffle.job.min.memory-per-partition and remote-shuffle.job.min.memory-per-gate
[CELEBORN-1811] Update default value for celeborn.master.slot.assign.extraSlots
[CELEBORN-1789][DOC] Document on Java Columnar Shuffle
[CELEBORN-1774] Update default value of celeborn..io.mode to whether epoll mode is available
[CELEBORN-1622][CIP-11] Adding documentation for Worker Tags feature
[CELEBORN-1755] Update doc to include S3 as one of storage layers
[CELEBORN-1752] Migration guide for unexpected shuffle RESTful api change since 0.5.0
[CELEBORN-1745] Remove application top disk usage code
[CELEBORN-1719] Introduce celeborn.client.spark.stageRerun.enabled with alternative celeborn.client.spark.fetch.throwsFetchFailure to enable spark stage rerun
[CELEBORN-1687] Highlight flink session cluster issue in doc
[CELEBORN-1684] Fix ambiguous client jar expression of document
[CELEBORN-1678] Add Celeborn CLI User guide in README
[CELEBORN-1635] Introduce Blaze support document
[CLEBORN-1555] Replace deprecated config celeborn.storage.activeTypes in docs and tests
[CELEBORN-1566] Update docs about using HDFS
[CELEBORN-1551] Fix wrong link in quota_management.md
[CELEBORN-1437][DOC] Merge METRICS.md into monitoring.md
[CELEBORN-1436][DOC] Move Rest API out from monitoring.md to webapi.md
[CELEBORN-1486] Introduce ClickHouse Backend in Gluten Support document
[CELEBORN-1466] Add local command in celeborn_ratis_shell.md
Dependencies
[CELEBORN-1895] Bump log4j2 version to 2.24.3
[CELEBORN-1890] Bump Spark from 3.5.4 to 3.5.5
[CELEBORN-1884] Bump rocksdbjni version from 9.5.2 to 9.10.0
[CELEBORN-1877] Bump zstd-jni version from 1.5.2-1 to 1.5.7-1
[CELEBORN-1872] Bump Flink from 1.19.1, 1.20.0 to 1.19.2, 1.20.1
[CELEBORN-1864] Bump Netty version from 4.1.115.Final to 4.1.118.Final
[CELEBORN-1862] Bump Ratis version from 3.1.2 to 3.1.3
[CELEBORN-1842] Bump ap-loader version from 3.0-8 to 3.0-9
[CELEBORN-1806] Bump Spark from 3.5.3 to 3.5.4
[CELEBORN-1712] Bump Netty version from 4.1.109.Final to 4.1.115.Final
[CELEBORN-1748] Deprecate identity provider configs tied with quota
[CELEBORN-1702] Bump Ratis version from 3.1.1 to 3.1.2
[CELEBORN-1708] Bump protobuf version from 3.21.7 to 3.25.5
[CELEBORN-1709] Bump jetty version from 9.4.52.v20230823 to 9.4.56.v20240826
[CELEBORN-1710] Bump commons-io version from 2.13.0 to 2.17.0
[CELEBORN-1672] Bump Spark from 3.4.3 to 3.4.4
[CELEBORN-1666] Bump scala-protoc from 1.0.6 to 1.0.7
[CELEBORN-1525] Bump Ratis version from 3.1.0 to 3.1.1
[CELEBORN-1613] Bump Spark from 3.5.2 to 3.5.3
[CELEBORN-1605] Bump commons-lang3 version from 3.13.0 to 3.17.0
[CELEBORN-1604] Bump rocksdbjni version from 8.11.3 to 9.5.2
[CELEBORN-1562] Bump Spark from 3.5.1 to 3.5.2
[CELEBORN-1512] Bump Flink from 1.19.0 to 1.19.1
[CELEBORN-1499] Bump Ratis version from 3.0.1 to 3.1.0
Credits
- Kerwin Zhang
- Sanskar Modi
- Weijie Guo
- Yuting Wang
- Shaoyun Chen
- Jiashu Xiong
- Aravind Patnam
- Mingxiao Feng
- Fu Chen
- Amandeep Singh
- Lianne Li
- Mridul Muralidharan
- Keyong Zhou
- Cheng Pan
- Nan Zhu
- Yanze Jiang
- Chongchen Chen
- Xu Huang
- Jinqian Fan
- Yi Zhu
- Xinyu Wang
- Pengqi Li
- Xianming Lei
- Veli Yang
- Jianfu Li
- Biao Geng
- Erik Fang
- xy2953396112
- He Zhao
- Arsen Gumin
- Tao Zheng
- Leo Li
- Shengjie Wang
- Nicolas Fraison
- Fei Wang
- Minchu Yang
- Madhukar525722
- avishnus
- Björn Boschman
- Binjie Yang
- Jiaming Xie
- Zhengqi Zhang
- Nicholas Jiang
- Yajun Gao
- Yi Chen
- Wang, Fei
- Aidar Bariev
- Yuxin Tan
- Kun Wan
- Saurabh Dubey
- Ziyi Wu
- Shlomi Uubul
- Bowen Liang
- Haotian Cao
- Guangwei Hong
- Zhao Zhao