[B! compression][integer] yass's bookmarks

yass id:yass

yass's bookmarks tagged compression and integer (20)

Frame of Reference and Roaring Bitmaps
Search and analytics, data ingestion, and visualization – all at your fingertips.
yass 2016/06/11
integer

compression

delta encoding

bit packing

lucene

BitSet

bitmap index

roaringbitmap
Link
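On compressing sorted integer sequences
A compressed sorted integer sequence is a general-purpose data structure used in many places, for example in the inverted index of a search engine. Speed matters for search engines, so various data structures such as PForDelta have been studied. H2O, on the other hand, has a mechanism called "cache-aware server push" that server-pushes js and css files not present in the browser cache, and to determine what is already cached, a Bloom filter has to be included in every HTTP request. A Bloom filter can be represented as a sorted integer sequence, so compressing the Bloom filter comes down to compressing that sequence. Speed matters for uses such as search engines, but when the data rides on an HTTP request, space efficiency matters more. Hence Golomb coding, whose space efficiency is close to the theoretical limit (or rather its special case, Rice coding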
yass 2015/11/06
sort

integer

compression

Huffman coding

Golomb coding
Link
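As a rough illustration of the Rice coding mentioned in the entry above, the following sketch encodes the gaps of a strictly increasing integer sequence. The parameter `k`, the function names, and the bit-string output are assumptions chosen for readability; this is not the H2O implementation.

```python
def rice_encode(values, k):
    """Encode a strictly increasing list of non-negative ints as a '0'/'1' string."""
    bits = []
    prev = -1
    for v in values:
        gap = v - prev - 1                 # >= 0 because values strictly increase
        q, r = gap >> k, gap & ((1 << k) - 1)
        bits.append("1" * q + "0")         # quotient in unary: q ones, then a zero
        if k:
            bits.append(format(r, "0{}b".format(k)))  # remainder in k bits
        prev = v
    return "".join(bits)


def rice_decode(bits, k):
    """Inverse of rice_encode for the same k."""
    values, prev, i = [], -1, 0
    while i < len(bits):
        q = 0
        while bits[i] == "1":              # read the unary quotient
            q += 1
            i += 1
        i += 1                             # skip the terminating zero
        r = int(bits[i:i + k], 2) if k else 0
        i += k
        prev = prev + 1 + ((q << k) | r)   # undo the gap transform
        values.append(prev)
    return values
```

A real encoder would pack the bits into bytes and pick `k` from the expected gap size; both steps are left out here.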
http://www.vldb.org/pvldb/vol8/p1816-teller.pdf
yass 2015/10/12
facebook

compression

integer

float

xor

delta encoding

time series database
Link
Efficient Data Storage for Analytics with Parquet 2.0 - Hadoop Summit 2014
Apache Parquet is an open-source columnar storage format for efficient data storage and analytics. It provides efficient compression and encoding techniques that enable fast scans and queries of large datasets. Parquet 2.0 improves on these efficiencies through techniques like delta encoding, dictionary encoding, run-length
yass 2014/06/25
parquet

delta encoding

columnar storage

integer

compression

branch prediction
Link
A BILLION ROWS PER SECOND Metaprogramming Python for Big Data
Ville Tuulos Principal Engineer @ AdRoll ville.tuulos@adroll.com We faced the key technical challenge of modern Business Intelligence: How to query tens of billions of events interactively? Our solution, DeliRoll, is implemented in Python. Everyone knows that Python is SLOW. You can't handle big data with low latency in Python! Small Benchmark Data: 1.5 billion rows, 400 columns - 660GB. Smaller e
yass 2013/09/29
compression

redmine

python

LLVM

integer

columnar storage
Link
Metaprogramming Python for Big Data
yass 2013/09/29
compression

python

LLVM

columnar storage

redshift

integer

video
Link
Compression encodings - Amazon Redshift
Amazon Redshift will no longer support the creation of new Python UDFs starting November 1, 2025. If you would like to use Python UDFs, create the UDFs prior to that date. Existing Python UDFs will continue to function as normal. For more information, see the blog post. Compression encodings: A compression encoding specifies the type of compression that is applied to a column of data values as row
yass 2013/09/23
" if the column contains 10 integers in sequence from 1 to 10, the first will be stored as a 4-byte integer (plus a 1-byte flag), and the next 9 will each be stored as a byte with the value 1 / the full original value is stored, with a leading 1-byte flag. "

redshift

compression

delta encoding

integer

columnar storage
Link
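The delta scheme quoted in the Redshift entry above can be sketched as follows. The byte layout is an assumption made for illustration: the flag value `ESCAPE`, the little-endian order, and the function names are hypothetical and are not taken from Redshift's documentation. Small differences take one byte; anything else is written as a flag byte followed by the full 4-byte value.

```python
import struct

ESCAPE = -128  # hypothetical flag byte meaning "a full 4-byte value follows"


def delta_encode(values):
    """Encode signed 32-bit ints as 1-byte deltas, falling back to flag + full value."""
    out = bytearray()
    prev = None
    for v in values:
        d = None if prev is None else v - prev
        if d is not None and -127 <= d <= 127:
            out += struct.pack("<b", d)           # small delta: one signed byte
        else:
            out += struct.pack("<bi", ESCAPE, v)  # flag byte + original value
        prev = v
    return bytes(out)


def delta_decode(data):
    """Inverse of delta_encode."""
    values, i, prev = [], 0, 0
    while i < len(data):
        (b,) = struct.unpack_from("<b", data, i)
        i += 1
        if b == ESCAPE:
            (prev,) = struct.unpack_from("<i", data, i)
            i += 4
        else:
            prev += b
        values.append(prev)
    return values
```

Under these assumptions the integers 1 to 10 take 5 bytes for the first value plus 9 single-byte deltas, 14 bytes in total, which matches the arithmetic in the quoted passage.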
Binary fingerprint compression - Lukáš Lalinský
yass 2013/09/07
" If I use XOR instead of subtraction for the delta encoding, I will still get high numbers, but they will usually not have many bits set. "

compression

integer

XOR

delta encoding

fingerprint
Link
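The XOR variant of delta encoding quoted in the entry above can be sketched like this. The function names and the sample fingerprint values are made up for illustration; the point is only that neighbouring values that differ in a few bit positions give XOR deltas with few bits set.

```python
def xor_delta(values):
    """Replace each value by its XOR with the previous value (first is kept as-is)."""
    out, prev = [], 0
    for v in values:
        out.append(v ^ prev)
        prev = v
    return out


def xor_undelta(deltas):
    """Inverse of xor_delta."""
    out, prev = [], 0
    for d in deltas:
        prev ^= d
        out.append(prev)
    return out


def popcount(x):
    """Number of set bits in a non-negative int."""
    return bin(x).count("1")


# Made-up 32-bit values that differ from their predecessor in a single bit.
fingerprints = [0xA5A5F00F, 0xA5A5F10F, 0xA5A4F10F]
deltas = xor_delta(fingerprints)
bits_set = [popcount(d) for d in deltas[1:]]
```

The deltas after the first are still large as integers (`0x100` and `0x10000`), but each has only one bit set, which is what a later bit-oriented compression stage can exploit.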
MOPID | Accelerate Your Hiring Process - Hire the Best, 10x Faster
yass 2013/08/13
" The basic idea of PForDelta is as follows: in order to compress a block of k numbers, say, 256 numbers, it first determines a value b such that most of the 256 values to be encoded (say, 90%) are less than 2^b and thus fit into a fixed bit field of b bits each. "

integer

compression

PForDelta

lucene
Link
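The parameter selection step described in the PForDelta quote above can be sketched as follows. This covers only the choice of b and the identification of exceptions, not the full PForDelta layout; the function names and the `percent` parameter are assumptions.

```python
def choose_b(block, percent=90):
    """Smallest b such that at least `percent`% of the values are below 2**b."""
    need = (percent * len(block) + 99) // 100   # integer ceiling, avoids float rounding
    for b in range(33):
        if sum(1 for v in block if v < (1 << b)) >= need:
            return b
    raise ValueError("values do not fit in 32 bits")


def find_exceptions(block, b):
    """Positions and values that do not fit in b bits and must be stored separately."""
    return [(i, v) for i, v in enumerate(block) if v >= (1 << b)]


block = [1, 3, 2, 7, 5, 4, 6, 0, 3, 1000]
b = choose_b(block)                  # 9 of the 10 values are below 2**3
exceptions = find_exceptions(block, b)
```

With this block, nine values fit in 3 bits and only the value 1000 at position 9 is an exception, so the bulk of the block can be packed at 3 bits per value.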
GitHub - maropu/vpacker: A simple integer compression library for C/C++/Java
A simple integer compression library for C/C++/Java ================= Overview ----------- Released vpacker-0.1.0, which compresses a 32-bit or 64-bit integer array. It is assumed to encode a sequence of integers with highly positive skewness. The skewness is a distribution where the mass of the distribution is concentrated on the left, i.e., an element of the sequence is rarely a large integer. T
yass 2013/02/27
"A simple integer compression library for C/C++/Java"

integer

compression

C
Link
SNA Projects Blog : Tech Talk: Michael Deerkoski (Flickr) — “Continuous Deployment at Flickr”
LinkedIn operates the world’s largest professional network with more than 645 million members in over 200 countries and territories. This team builds distributed systems that collect, manage and analyze this digital representation of the world's economy, while our AI experts, data scientists and researchers conduct applied research that fuel LinkedIn’s data-driven products and provide insights tha
yass 2013/02/10
"Kamikaze is a utility package wrapping set implementations on sorted integer arrays. Search indexes, graph algorithms and certain sparse matrix representations tend to make heavy use of sorted integer arrays."

linkedin

integer

compression

PForDelta
Link
GitHub - LinkedInAttic/kamikaze: DocId set compression and set operation library
yass 2013/02/10
compression

PForDelta

integer

simple16
Link
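Compression with gamma, delta, and Golomb codes - naoya's Hatena Diary
Ordinary integers are fixed-length binary codes of 32 bits, that is, 4 bytes, but under a probability distribution in which small numbers occur often and large numbers hardly ever occur, the wasted bits stand out. Variable Byte Code (also called byte-aligned code) is one integer encoding method that removes some of this waste. Details are given in Chapter 5 of Introduction to Information Retrieval (IIR below), which is available at http://nlp.stanford.edu/IR-book/html/htmledition/variable-byte-codes-1.html. As the name suggests, Variable Byte Code is a byte-level variable-length code that treats the first bit of each byte as a continuation bit, and the following 7 bits
yass 2012/11/12
" Unlike the parameter-free gamma and delta codes, Golomb coding needs a parameter that tunes the code length. For an inverted index, this parameter can be derived from the total number of documents and the number of terms. "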

integer

encoding

gamma coding

delta coding

compression

vByte

algorithm

Golomb coding
Link
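A minimal sketch of the Variable Byte Code described in the entry above. It assumes the convention used in IIR Chapter 5, where the high bit of a byte is set to 1 on the last byte of each number and the low 7 bits carry the payload; the function names are hypothetical.

```python
def vb_encode_number(n):
    """Encode one non-negative int; the high bit marks the final byte."""
    chunks = [n % 128]
    n //= 128
    while n > 0:
        chunks.insert(0, n % 128)   # more significant 7-bit groups go first
        n //= 128
    chunks[-1] += 128               # set the continuation bit on the last byte
    return bytes(chunks)


def vb_encode(numbers):
    """Concatenate the codes of all numbers."""
    return b"".join(vb_encode_number(n) for n in numbers)


def vb_decode(data):
    """Inverse of vb_encode."""
    numbers, n = [], 0
    for byte in data:
        if byte < 128:
            n = 128 * n + byte      # middle byte: accumulate 7 more bits
        else:
            numbers.append(128 * n + (byte - 128))
            n = 0
    return numbers
```

For example, 5 fits in a single byte, while 824 needs two bytes (6 and 184), so small gaps in a postings list cost one byte instead of four.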
GitHub - stuhood/gvi: group varint encoding
yass 2012/11/11
" A Java implementation of group varint encoding."

integer

encoding

group varint encoding

java

varint

compression
Link
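[IR] Implementing the encoding scheme described in Google's WSDM'09 talk - tsubosaka's diary
After the MG study group, id:sleepy_yoshi pointed me to the WSDM 2009 talk "Challenges in Building Large-Scale Information Retrieval Systems", so I implemented Group Varint Encoding, the encoding scheme described there. Materials: the talk slides; an article explaining the slides in Japanese. Integer encoding schemes: when lists of document numbers in an inverted index or similar are represented as differences from the previous value, most of the resulting values are small, so spending 4 bytes on each of them wastes storage. For this reason many encoding schemes have been proposed, including Varint Encoding, gamma codes, delta codes, Rice Coding, Simple 9, and pForDelta. Of these, Varint Encoding is often used because it is easy to implement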
yass 2012/11/11
compression

google

integer

encoding

group varint encoding

varint
Link
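Group Varint Encoding, named in the two entries above, can be sketched as follows. The layout is an assumption based on the commonly described format: one tag byte holds four 2-bit fields (byte length minus one, first value in the highest bits), followed by the four values in little-endian order using 1 to 4 bytes each. The function names are hypothetical, and values are assumed to be unsigned 32-bit integers.

```python
def gv_encode(values):
    """Encode unsigned 32-bit ints in groups of four; the last group is zero-padded."""
    out = bytearray()
    padded = list(values) + [0] * (-len(values) % 4)
    for g in range(0, len(padded), 4):
        tag, body = 0, bytearray()
        for v in padded[g:g + 4]:
            nbytes = max(1, (v.bit_length() + 7) // 8)
            tag = (tag << 2) | (nbytes - 1)     # 2 bits per value: length - 1
            body += v.to_bytes(nbytes, "little")
        out.append(tag)
        out += body
    return bytes(out)


def gv_decode(data, count):
    """Decode `count` values; `count` is needed to drop the padding."""
    values, i = [], 0
    while len(values) < count:
        tag = data[i]
        i += 1
        for shift in (6, 4, 2, 0):              # first value sits in the top 2 bits
            nbytes = ((tag >> shift) & 3) + 1
            values.append(int.from_bytes(data[i:i + nbytes], "little"))
            i += nbytes
    return values[:count]
```

Compared with a per-byte continuation bit, the decoder reads one tag and then knows all four lengths at once, which avoids a branch on every byte.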
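Part 11: Compressing the inverted index | gihyo.jp
Introduction: Part 2 mentioned that indexes are compressed in most cases, and Part 7 used pseudocode to explain at which point during index construction the index should be compressed. This part explains concrete methods for compressing an inverted index. Purpose of compression: for medium to large indexes, inverted lists become very long, and a large amount of data is read from disk at search time. An inverted index (and a search engine built on one) compresses its inverted lists to shorten disk read time and so prevent the resulting increase in search time. In this case the compressed inverted list has to be read from disk and then decompressed, but usually the following holds. This is because the speed gap between recent CPUs and disks is large, so decompression, which is mostly CPU work, can be done quickly. So although compression is often thought of as a way to save space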
yass 2012/11/11
PForDelta

compression

encoding

gamma coding

simple9

vByte

inverted index

integer
Link
Google Code Archive - Long-term storage for Google Code Project Hosting.
yass 2012/11/11
compression

integer

encoding

PForDelta

simple9

vByte
Link
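An explanation of Simple-9 - tsubosaka's diary
Continuing from last time, I implement inverted-index compression. This time I introduce Simple-9, the algorithm proposed in [2]. Simple-9 is a compression algorithm that packs as many numbers as possible into a 32-bit word. For example, sixteen 2-bit numbers can be represented in 32 bits. In practice, however, large numbers also occur, so information about the length of the numbers has to be stored as well. Simple-9 uses 4 bits to indicate how the remaining 28 bits are packed. The 28 bits can be laid out in the following nine ways, which is where the name Simple-9 comes from:
upper bits   number of codes   bits per code
0000         28                1
0001         14                2
0010         9                 3
0011         7                 4
0100         5                 5
0101         4                 7
0110         3                 9
0111         2                 14
1000         1                 28
For example ( 3 , 5 , 0 , 0 ,
yass 2012/11/11
" Packs as many numbers as possible into a 32-bit word / for example, sixteen 2-bit numbers can be represented in 32 bits / information about the length of the numbers also has to be stored. Simple-9 uses 4 bits to indicate how the remaining 28 bits are packed "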

integer

encoding

simple9

compression
Link
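The Simple-9 layout described in the entry above (a 4-bit selector plus 28 data bits, with nine possible packings) can be sketched as follows. The greedy choice of selector, the placement of the first value in the lowest bits, and the function names are assumptions for illustration; inputs are assumed to be non-negative integers below 2^28.

```python
# (number of codes, bits per code) for selectors 0 through 8
SELECTORS = [(28, 1), (14, 2), (9, 3), (7, 4), (5, 5), (4, 7), (3, 9), (2, 14), (1, 28)]


def s9_encode(values):
    """Greedily pack non-negative ints (< 2**28) into 32-bit words."""
    words, i = [], 0
    while i < len(values):
        for sel, (count, bits) in enumerate(SELECTORS):
            chunk = values[i:i + count]
            # A selector is usable only if enough values remain and all of them fit.
            if len(chunk) == count and all(v < (1 << bits) for v in chunk):
                word = sel << 28                 # selector in the upper 4 bits
                for j, v in enumerate(chunk):
                    word |= v << (j * bits)      # first value in the lowest bits
                words.append(word)
                i += count
                break
        else:
            raise ValueError("value does not fit in 28 bits")
    return words


def s9_decode(words):
    """Inverse of s9_encode."""
    values = []
    for word in words:
        count, bits = SELECTORS[word >> 28]
        mask = (1 << bits) - 1
        for j in range(count):
            values.append((word >> (j * bits)) & mask)
    return values
```

Twenty-eight 1-bit values fit in a single word, while a mix of small and large values falls back to wider selectors, one word at a time.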
Notice of service termination
Notice of service termination. Thank you for using Yahoo! JAPAN services. The service you tried to access has been discontinued as of today. We appreciate your continued use of Yahoo! JAPAN services.
yass 2012/11/10
encoding

algorithm

compression

integer

RLE
Link
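The frontier of integer sequence compression algorithms - ny23's diary
About two years ago I looked into using integer-sequence compression techniques, which are common in information retrieval, to compress sparse vectors in machine learning (Implementing a cache for online learning - ny23's diary). That implementation compressed the vectors online and saved them to disk, and read them back without explicitly keeping them in memory (leaving that to the OS). The blunt conclusion was that, because the disk IO overhead was large, it made little difference which method was used as long as the data was compressed (in the end I adopted the simplest one, Variable byte code, which can be written in two lines). Since then my knowledge of integer-sequence compression algorithms had stalled at about NewPFD, but just the other day, a proposal of the currently fastest compression algorithm plus the main integer-sequence compression algorithms of the last few years (Simple-8b (J. Software Pract. Exper. 2010), VSEncoding (CIKM 20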
yass 2012/09/17
compression

integer

array

varint
Link