stochastic bug in saving dataframes with "int16" or "int32" columns to HDF5 file #4096


Closed · vfilimonov opened this issue Jul 1, 2013 · 4 comments · Fixed by #4100
Labels: Bug, IO Data

@vfilimonov (Contributor)

There seems to be a strange random bug in saving dataframes with integer columns to HDF5 files. The error looks like:

ValueError: invalid combinate of [values_axes] on appending data [name->values_block_1,cname->values_block_1,axis->None,pos->2,kind->integer] vs current table [name->values_block_1,cname->values_block_1,axis->None,pos->2,kind->integer]

and the full traceback:

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-13-91a60d72b1ca> in <module>()
     25     print raw.dtypes
     26     store = pd.HDFStore('possible_bug9.h5')
---> 27     store.append('raw', raw)
     28     store.close()

/Users/v/.virtual_envs/system/lib/python2.7/site-packages/pandas/io/pytables.pyc in append(self, key, value, columns, **kwargs)
    608             raise Exception("columns is not a supported keyword in append, try data_columns")
    609 
--> 610         self._write_to_group(key, value, table=True, append=True, **kwargs)
    611 
    612     def append_to_multiple(self, d, value, selector, data_columns=None, axes=None, **kwargs):

/Users/v/.virtual_envs/system/lib/python2.7/site-packages/pandas/io/pytables.pyc in _write_to_group(self, key, value, index, table, append, complib, **kwargs)
    869             raise ValueError('Compression not supported on non-table')
    870 
--> 871         s.write(obj = value, append=append, complib=complib, **kwargs)
    872         if s.is_table and index:
    873             s.create_index(columns = index)

/Users/v/.virtual_envs/system/lib/python2.7/site-packages/pandas/io/pytables.pyc in write(self, obj, axes, append, complib, complevel, fletcher32, min_itemsize, chunksize, expectedrows, **kwargs)
   2738         # create the axes
   2739         self.create_axes(axes=axes, obj=obj, validate=append,
-> 2740                          min_itemsize=min_itemsize, **kwargs)
   2741 
   2742         if not self.is_exists:

/Users/v/.virtual_envs/system/lib/python2.7/site-packages/pandas/io/pytables.pyc in create_axes(self, axes, obj, validate, nan_rep, data_columns, min_itemsize, **kwargs)
   2482         # validate the axes if we have an existing table
   2483         if validate:
-> 2484             self.validate(existing_table)
   2485 
   2486     def process_axes(self, obj, columns=None):

/Users/v/.virtual_envs/system/lib/python2.7/site-packages/pandas/io/pytables.pyc in validate(self, other)
   2096                     oax = ov[i]
   2097                     if sax != oax:
-> 2098                         raise ValueError("invalid combinate of [%s] on appending data [%s] vs current table [%s]" % (c,sax,oax))
   2099 
   2100                 # should never get here

ValueError: invalid combinate of [values_axes] on appending data [name->values_block_1,cname->values_block_1,axis->None,pos->2,kind->integer] vs current table [name->values_block_1,cname->values_block_1,axis->None,pos->2,kind->integer]

Interestingly, the error is not deterministic: it depends at least on (i) the set of other (non-integer) columns in the dataframe and (ii) the size of the dataframe.

Unfortunately, I was unable to narrow it down to a "just-code" report, so I have to attach a piece of data and an IPython notebook. This is the minimal code in which I was able to reproduce the bug.

https://www.dropbox.com/s/myo03sbqbulzvaj/pandas_possible_bug.zip

The pandas version is 0.11.0; the PyTables version is 3.0.0.

@jreback (Contributor) commented Jul 1, 2013

See your modified code below:

  • you can move the store outside of the loop
  • add Current RIC and GMT Offset as data columns; this creates them separately from the other blocks

Note that if you read and create this CSV all at once, then this works fine (with your existing code).

This is a buglet, but as you note, tricky to reproduce. The error message is basically saying: you are trying to write the columns differently from what already exists on disk; namely, Current RIC and GMT Offset are reversed.

This is because they are constructed internally in a different order: the internal representation, which puts different dtypes in different blocks, gives a different order to the blocks. I am not exactly sure why that would be the case. Will look further.

import pandas as pd

iter_csv = pd.read_csv('t4.csv.gz', compression='gzip',
                       parse_dates={'timestamp': [2,3]},
                       dayfirst=True, iterator=True, chunksize=10)


store = pd.HDFStore('bug.h5',mode='w')
for chunk in iter_csv:
        raw = chunk
        del raw['#RIC']

        ### These del's are not important, but the dtypes in the error message depend on them
        del raw['Exch Time']
        del raw['Quote Time']
        del raw['Type']
        del raw['Qualifiers']

        mapping = {'None': 0, 'SSTN0': 2, 'SSTc1': 1}
        raw = raw.replace({'Current RIC': mapping})

        ### If we use 'float' (or 'int8') here instead of 'int16', everything works
        raw['Current RIC'] = raw['Current RIC'].fillna(-1).astype('int16')

        ### Alternatively: if you uncomment the next line and comment out the following two, it works again
        #raw = raw[['Current RIC','timestamp']]
        for fld in ['Price','Volume', 'Bid Size','Ask Size','Bid Price','Ask Price']:
                raw[fld] = raw[fld].astype('float')

        raw.set_index('timestamp', inplace=True)

        print(raw.dtypes)
        store.append('raw', raw, data_columns=['Current RIC','GMT Offset'])

store.close()
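
To make the "blocks" explanation above concrete, here is a minimal sketch (the column names and values are made up for illustration): pandas consolidates columns of the same dtype into internal blocks, and HDFStore.append validates the block layout, not just the per-column dtypes, against the table already on disk.

import numpy as np
import pandas as pd

# three columns, three distinct dtypes -> three internal dtype blocks
df = pd.DataFrame({'a': np.arange(3, dtype='int16'),
                   'b': np.arange(3, dtype='int64'),
                   'c': np.linspace(0.0, 1.0, 3)})
print(df.dtypes)
# Two frames can print identical per-column dtypes yet consolidate their
# blocks in a different order (depending on how the columns were built),
# and that layout mismatch is what raises the
# "invalid combinate of [values_axes]" error on append.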

@vfilimonov (Contributor, Author)

Hi @jreback, thank you very much for your suggestion and the prompt bug fix!

Could you please point me to some docs where I can read about the data_columns argument? In particular, I am interested in whether it creates an index or adds size overhead to the HDF file.

@jreback (Contributor) commented Jul 2, 2013

http://pandas.pydata.org/pandas-docs/dev/io.html#query-via-data-columns

Essentially, data_columns allow you to control how the data is stored: each one gets a one-to-one field mapping in PyTables, is automatically indexed, and allows selection via queries. The downside is that it is somewhat slower (not a lot, but don't make every field a data_column; only the ones you think you will need for indexing).
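
For example, a minimal sketch (the file name, key, and column names are hypothetical; the string where= syntax is the one used by current pandas, while older versions used Term objects):

import numpy as np
import pandas as pd

df = pd.DataFrame({'ric': np.array([0, 1, 2], dtype='int16'),
                   'offset': np.array([0, 0, 1], dtype='int16'),
                   'price': [1.0, 2.0, 3.0]})

store = pd.HDFStore('example.h5', mode='w')
# 'ric' and 'offset' get their own indexed PyTables columns;
# 'price' stays packed inside a shared dtype block
store.append('quotes', df, data_columns=['ric', 'offset'])
print(store.select('quotes', where='ric > 0'))  # query on a data column
store.close()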

@vfilimonov (Contributor, Author)

@jreback, thanks again for the explanation!
