stochastic bug in saving dataframes with "int16" or "int32" columns to HDF5 file #4096


Closed · vfilimonov opened this issue Jul 1, 2013 · 4 comments · Fixed by #4100
Labels: Bug, IO Data

@vfilimonov (Contributor)

There seems to be a strange random bug in saving dataframes with integer columns to HDF5 files. The error looks like:

ValueError: invalid combinate of [values_axes] on appending data [name->values_block_1,cname->values_block_1,axis->None,pos->2,kind->integer] vs current table [name->values_block_1,cname->values_block_1,axis->None,pos->2,kind->integer]

and the full traceback:

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-13-91a60d72b1ca> in <module>()
     25     print raw.dtypes
     26     store = pd.HDFStore('possible_bug9.h5')
---> 27     store.append('raw', raw)
     28     store.close()

/Users/v/.virtual_envs/system/lib/python2.7/site-packages/pandas/io/pytables.pyc in append(self, key, value, columns, **kwargs)
    608             raise Exception("columns is not a supported keyword in append, try data_columns")
    609 
--> 610         self._write_to_group(key, value, table=True, append=True, **kwargs)
    611 
    612     def append_to_multiple(self, d, value, selector, data_columns=None, axes=None, **kwargs):

/Users/v/.virtual_envs/system/lib/python2.7/site-packages/pandas/io/pytables.pyc in _write_to_group(self, key, value, index, table, append, complib, **kwargs)
    869             raise ValueError('Compression not supported on non-table')
    870 
--> 871         s.write(obj = value, append=append, complib=complib, **kwargs)
    872         if s.is_table and index:
    873             s.create_index(columns = index)

/Users/v/.virtual_envs/system/lib/python2.7/site-packages/pandas/io/pytables.pyc in write(self, obj, axes, append, complib, complevel, fletcher32, min_itemsize, chunksize, expectedrows, **kwargs)
   2738         # create the axes
   2739         self.create_axes(axes=axes, obj=obj, validate=append,
-> 2740                          min_itemsize=min_itemsize, **kwargs)
   2741 
   2742         if not self.is_exists:

/Users/v/.virtual_envs/system/lib/python2.7/site-packages/pandas/io/pytables.pyc in create_axes(self, axes, obj, validate, nan_rep, data_columns, min_itemsize, **kwargs)
   2482         # validate the axes if we have an existing table
   2483         if validate:
-> 2484             self.validate(existing_table)
   2485 
   2486     def process_axes(self, obj, columns=None):

/Users/v/.virtual_envs/system/lib/python2.7/site-packages/pandas/io/pytables.pyc in validate(self, other)
   2096                     oax = ov[i]
   2097                     if sax != oax:
-> 2098                         raise ValueError("invalid combinate of [%s] on appending data [%s] vs current table [%s]" % (c,sax,oax))
   2099 
   2100                 # should never get here

ValueError: invalid combinate of [values_axes] on appending data [name->values_block_1,cname->values_block_1,axis->None,pos->2,kind->integer] vs current table [name->values_block_1,cname->values_block_1,axis->None,pos->2,kind->integer]

Interestingly, the error is not deterministic: it depends at least on (i) the set of other (non-integer) columns in the dataframe and (ii) the size of the dataframe.

Unfortunately, I was unable to narrow it down to a "just-code" report, so I have to attach a piece of data and an IPython notebook. This is the minimal code in which I was able to reproduce the bug.

https://www.dropbox.com/s/myo03sbqbulzvaj/pandas_possible_bug.zip

The pandas version is 0.11.0; the PyTables version is 3.0.0.

@jreback (Contributor) commented Jul 1, 2013

See your modified code below:

  • you can move the store outside of the loop
  • add Current RIC and GMT Offset as data columns; this creates them separately from the other blocks

Note that if you read and create this CSV all at once, then this works fine (with your existing code).

This is a buglet, but as you note, tricky to reproduce. The error message is basically saying: you are trying to write the columns differently from what already exists on disk; namely, Current RIC and GMT Offset are reversed.

This is because they are constructed internally in a different order: the internal representation, which puts different dtypes in different blocks, gives a different order to the blocks. I am not exactly sure why that would be the case. Will look further.

import pandas as pd

iter_csv = pd.read_csv('t4.csv.gz', compression='gzip',
                       parse_dates={'timestamp': [2,3]},
                       dayfirst=True, iterator=True, chunksize=10)


store = pd.HDFStore('bug.h5',mode='w')
for chunk in iter_csv:
        raw = chunk
        del raw['#RIC']

        ### These del's are not important, but the dtypes in the error message depend on them
        del raw['Exch Time']
        del raw['Quote Time']
        del raw['Type']
        del raw['Qualifiers']

        mapping = {'None': 0, 'SSTN0': 2, 'SSTc1': 1}
        raw = raw.replace({'Current RIC': mapping})

        ### If we use 'float' (or 'int8') here instead of 'int16', everything works
        raw['Current RIC'] = raw['Current RIC'].fillna(-1).astype('int16')

        ### Alternatively: if you uncomment the next line and comment out the following two, it works again
        #raw = raw[['Current RIC','timestamp']]
        for fld in ['Price','Volume', 'Bid Size','Ask Size','Bid Price','Ask Price']:
                raw[fld] = raw[fld].astype('float')

        raw.set_index('timestamp', inplace=True)

        print(raw.dtypes)
        store.append('raw', raw, data_columns=['Current RIC','GMT Offset'])

store.close()
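
To make the "blocks" explanation above concrete, here is a minimal sketch (the column names and values are made up for illustration): pandas consolidates columns of the same dtype into internal blocks, and HDFStore.append validates the block layout, not just the per-column dtypes, against the table already on disk.

import numpy as np
import pandas as pd

# three columns, three distinct dtypes -> three internal dtype blocks
df = pd.DataFrame({'a': np.arange(3, dtype='int16'),
                   'b': np.arange(3, dtype='int64'),
                   'c': np.linspace(0.0, 1.0, 3)})
print(df.dtypes)
# Two frames can print identical per-column dtypes yet consolidate their
# blocks in a different order (depending on how the columns were built),
# and that layout mismatch is what raises the
# "invalid combinate of [values_axes]" error on append.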

@vfilimonov (Contributor, Author)

Hi @jreback, thank you very much for your suggestion and the prompt bug fix!

Could you please point me to some docs where I can read about the data_columns argument? In particular, I am interested in whether it creates an index or adds size overhead to the HDF file.

@jreback (Contributor) commented Jul 2, 2013

http://pandas.pydata.org/pandas-docs/dev/io.html#query-via-data-columns

Essentially, data_columns allow you to control how the data is stored: each one gets a one-to-one field mapping in PyTables, is automatically indexed, and allows selection via queries. The downside is that it is somewhat slower (not a lot, but don't make every field a data_column; only the ones you think you will need for indexing).
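
For example, a minimal sketch (the file name, key, and column names are hypothetical; the string where= syntax is the one used by current pandas, while older versions used Term objects):

import numpy as np
import pandas as pd

df = pd.DataFrame({'ric': np.array([0, 1, 2], dtype='int16'),
                   'offset': np.array([0, 0, 1], dtype='int16'),
                   'price': [1.0, 2.0, 3.0]})

store = pd.HDFStore('example.h5', mode='w')
# 'ric' and 'offset' get their own indexed PyTables columns;
# 'price' stays packed inside a shared dtype block
store.append('quotes', df, data_columns=['ric', 'offset'])
print(store.select('quotes', where='ric > 0'))  # query on a data column
store.close()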

@vfilimonov (Contributor, Author)

@jreback, thanks again for the explanation!
