Wednesday, August 16, 2017

HDFStore: Efficiency between appending data to an existing table and reindexing vs. creating a new table


I have several TB of data (in subsets) in flat files that I want to convert to HDF5 using Python pandas/PyTables/h5py for faster querying and searching. I'm planning to convert each subsection of the data using something like to_hdf and storing them in an HDFStore.
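As a minimal sketch of that conversion step (assuming pandas with PyTables installed; the file name, key, and columns are illustrative), writing with format="table" is what makes the node appendable and queryable later:

```python
import pandas as pd

# Illustrative stand-in for one subset loaded from a flat file.
df = pd.DataFrame({"id": range(5), "value": [0.1, 0.2, 0.3, 0.4, 0.5]})

# format="table" stores a PyTables Table that supports append and
# on-disk queries; data_columns makes "id" usable in where= clauses.
df.to_hdf("converted.h5", key="subset_01", format="table", data_columns=["id"])

# Query the stored table without loading the whole dataset:
hits = pd.read_hdf("converted.h5", "subset_01", where="id > 2")
```

The default format="fixed" is faster to write but cannot be appended to or queried with where=, so "table" is the relevant choice for this workflow.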

Although the stored data will never need to be changed, I might need to append data later on to some particular subsection, and then reindex (for queries) the entire piece.

My question is this: Is it more efficient to append data to an existing table (using store.append) and then reindex the new table, or should I simply create a new table with the data that I need to append?
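The append-then-reindex pattern can be sketched as follows (a hypothetical example, assuming pandas with PyTables; names are illustrative). Writing with index=False and building the on-disk index once after all appends avoids re-sorting the index on every append:

```python
import pandas as pd

base = pd.DataFrame({"id": [1, 2], "value": [10.0, 20.0]})
extra = pd.DataFrame({"id": [3], "value": [30.0]})

with pd.HDFStore("subsets.h5") as store:
    # Write without building the PyTables index up front.
    store.put("subset", base, format="table", index=False, data_columns=["id"])
    store.append("subset", extra, index=False)

    # Build the on-disk index once, after all appends are done.
    store.create_table_index("subset", kind="full")
    n_rows = store.get_storer("subset").nrows
```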

If I do the latter, I might create a LOT (over 100k) of nodes in the HDFStore. Would that degrade node access time?

I tried to look at other answers and also created my own store with a bunch of nodes to see if there was an effect, but I couldn't find anything significant. Any help is appreciated!

1 Answer

I'm not aware of any issues with having a lot of nodes in your HDF5 file. There is no limit on the number of groups in a file (https://support.hdfgroup.org/HDF5/faq/limits.html).

You can also resize datasets, but speed and space performance will depend on the allocation method (contiguous vs. chunked). Read about it in the user guide: https://support.hdfgroup.org/HDF5/doc/UG/HDF5_Users_Guide-Responsive%20HTML5/HDF5_Users_Guide/Datasets/HDF5_Datasets.htm?rhtocid=5.3#TOC_5_5_Allocation_of_Spacebc-15

The h5py implementation supports chunked storage as well as the default contiguous layout.
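For example, in h5py a dataset is only resizable if it is created chunked with a maxshape that allows growth (a small sketch; file and dataset names are illustrative):

```python
import h5py
import numpy as np

with h5py.File("chunked.h5", "w") as f:
    # maxshape=(None,) makes axis 0 unlimited; passing chunks forces
    # chunked (rather than contiguous) allocation.
    dset = f.create_dataset(
        "data", shape=(100,), maxshape=(None,), chunks=(50,), dtype="f8"
    )
    dset[:] = np.arange(100.0)

    # "Appending" to an HDF5 dataset = resize, then write the new region.
    dset.resize((150,))
    dset[100:] = np.arange(50.0)
    final_shape = dset.shape
```

A contiguous dataset (the default when chunks and maxshape are omitted) cannot be resized after creation.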
