Similar Series
Data Engineering Silicon Valley
Conducting real-time similarity search using an approximate nearest neighbor technique.
Problem Statement:
- Query historical price data for approximate nearest neighbors in real time.
Motivation
To let researchers of financial time series find past periods that are
“similar” to the latest period, in real time.
- Would help algorithm developers gain insights into how the time series developed in “similar”
periods in the past.
- Allows cross-referencing of other time series (commodities/other currencies) for “similar” periods
in the past.
- Can be used as a signal in quantitative trading.
Data Pipeline
Overview of challenges
- Compute distance between unevenly spaced time series.
- Compute approximate nearest neighbor in near constant
time.
- Construct a data structure that allows reliable processing,
storage and retrieval of data to quickly respond to queries.
Distance metric between non-uniform time series.
- L1 Analogue
- Satisfies triangle inequality
- Easy to visualize
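A minimal sketch of one possible L1 analogue, since the slides do not spell out the formula. Assumptions: each series is treated as a step function (last observed price carried forward), and the distance is the area between the two step functions over a shared window. The names `step_value` and `l1_distance` are hypothetical. Because this is an L1 norm of the difference of two functions, it satisfies the triangle inequality, and it can be pictured as the shaded area between two curves.

```python
from bisect import bisect_right


def step_value(times, values, t):
    """Value of the step function at time t (last observation carried forward).

    Assumption: before the first observation, the first value is used.
    """
    i = bisect_right(times, t) - 1
    return values[max(i, 0)]


def l1_distance(series_a, series_b, start, end):
    """Area between two step functions on the window [start, end].

    Each series is a (times, values) pair with times sorted ascending.
    The timestamps of the two series do not need to line up.
    """
    times_a, values_a = series_a
    times_b, values_b = series_b
    # Union of all observation times strictly inside the window.
    inner = {t for t in list(times_a) + list(times_b) if start < t < end}
    grid = [start] + sorted(inner) + [end]
    total = 0.0
    # Both step functions are constant on each grid interval,
    # so the integral of |a - b| is an exact sum of rectangles.
    for left, right in zip(grid, grid[1:]):
        gap = abs(step_value(times_a, values_a, left)
                  - step_value(times_b, values_b, left))
        total += gap * (right - left)
    return total


# Two unevenly spaced series observed at different times.
series_a = ([0, 2], [1.0, 3.0])
series_b = ([0, 1, 3], [1.0, 2.0, 2.0])
print(l1_distance(series_a, series_b, 0, 4))
```

Linear interpolation between ticks would be another reasonable choice; the step function is used here only because the integral then reduces to an exact sum.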
Finding the nearest neighbor quickly
- LSH for a generic metric space.
- N pivots
- Use the distance ordering to pivots as a
permutation.
- Example permutation: 32154
- Permutation is used to index the historical
data and perform fast queries.
On Locality-sensitive Indexing in Generic Metric Spaces.
Novak, Kyselak, Zezula 2010
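The pivot-permutation idea can be sketched as follows. This is an illustration under assumptions, not the deck's implementation: `pivot_permutation` is a hypothetical name, pivots are numbered from 1, ties are broken by pivot number, and plain numbers with absolute difference stand in for time series and their distance.

```python
def pivot_permutation(item, pivots, distance):
    """Return pivot numbers (1-based) ordered from nearest to farthest.

    `distance` can be any metric; objects that are close to each other
    tend to see the pivots in a similar order.
    """
    order = sorted(
        range(len(pivots)),
        key=lambda i: (distance(item, pivots[i]), i),  # tie-break on pivot number
    )
    return tuple(i + 1 for i in order)


def abs_diff(x, y):
    """Toy metric used only for this example."""
    return abs(x - y)


pivots = [3.0, 2.0, 1.0, 5.0, 4.0]  # pivot 1 .. pivot 5
print(pivot_permutation(0.0, pivots, abs_diff))
```

With these toy values the nearest pivot is number 3, then 2, 1, 5 and 4, which reproduces the example permutation 32154.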
Applying the idea to unevenly spaced time series.
Query:
Resulting permutation:
13245
Data structure for fast querying of similar permutations:
- Use a nested key-value store.
- Store the full permutations and timestamps in the leaves.
- Total possible number of leaf nodes is n!, where n is the number of pivots.
- Implemented a persistent version using Cassandra tables.
Want to query permutation: 13245
The desired timestamp is at the leaf.
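A rough in-memory sketch of such a nested key-value index, with a Python dict standing in for the Cassandra tables. The names `insert`, `collect` and `query`, the `"_leaf"` key and the integer timestamps are all hypothetical. Falling back to the longest matching prefix when the exact permutation is absent is an assumption about how approximate matches could be served.

```python
LEAF = "_leaf"  # reserved key; pivot numbers are ints, so there is no clash


def insert(index, permutation, timestamp):
    """Walk one level per pivot number and store the entry at the leaf."""
    node = index
    for pivot_id in permutation:
        node = node.setdefault(pivot_id, {})
    node.setdefault(LEAF, []).append((permutation, timestamp))


def collect(node):
    """Gather every (permutation, timestamp) stored at or below a node."""
    found = list(node.get(LEAF, []))
    for key, child in node.items():
        if key != LEAF:
            found.extend(collect(child))
    return found


def query(index, permutation, min_prefix=1):
    """Follow the permutation as deep as it matches, then return what is below.

    Returns an empty list if fewer than `min_prefix` levels match.
    """
    node, depth = index, 0
    for pivot_id in permutation:
        if pivot_id not in node:
            break
        node = node[pivot_id]
        depth += 1
    if depth < min_prefix:
        return []
    return collect(node)


index = {}
insert(index, (1, 3, 2, 4, 5), 1001)
insert(index, (1, 3, 2, 5, 4), 1002)
insert(index, (3, 2, 1, 5, 4), 1003)

print(query(index, (1, 3, 2, 4, 5)))  # exact match at the leaf
print(query(index, (1, 3, 4, 2, 5)))  # falls back to the shared prefix (1, 3)
```

With 5 pivots there are at most 5! = 120 leaves, so the tree stays shallow and each lookup touches at most one node per pivot.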
Further Directions
- Optimize pivot selection.
- Optimize algorithm to find more exact results.
- Consider different distance functions.
- Benchmark accuracy.
- Use the obtained nearest neighbors for research.
Yevgeniy Grechka
MA Statistics UC Berkeley