I have several TB of data (in subsets) in flat files that I want to convert to HDF5 using Python (pandas/PyTables/h5py) for faster querying and searching. I'm planning to convert each subsection of the data using something like to_hdf and store the results in an HDFStore.
Although the stored data will never need to be changed, I might need to append data later on to some particular subsection, and then reindex (for queries) the entire piece.
My question is this: Is it more efficient to append data to an existing table (using store.append) and then reindex the new table, or should I simply create a new table with the data that I need to append?
If I do the latter, I might create a LOT (over 100k) of nodes in the HDFStore. Would that degrade node access time?
I tried to look at other answers and also created my own store with a bunch of nodes to see if there was an effect, but I couldn't find anything significant. Any help is appreciated!
1 Answer
I'm not aware of any issues with having a lot of nodes in your HDF5 file. There is no limit on the number of groups in a file (https://support.hdfgroup.org/HDF5/faq/limits.html).
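If you want to check node access time yourself, a rough sketch along these lines might help. It assumes h5py and NumPy are installed; the file name, the subset_NNNNNN/data naming scheme, and the node count are made up for illustration, and the file lives only in memory (core driver), so it says nothing about disk I/O on a multi-TB store.

```python
import time

import h5py
import numpy as np

N_NODES = 2000  # illustrative; raise toward 100k to mirror the question

# In-memory file: driver="core" with backing_store=False writes nothing to disk.
with h5py.File("many_nodes.h5", "w", driver="core", backing_store=False) as f:
    for i in range(N_NODES):
        # Intermediate groups (subset_000000, ...) are created automatically.
        f.create_dataset(f"subset_{i:06d}/data", data=np.arange(4))

    n_nodes = len(f)  # number of top-level groups

    # Time a single lookup of the last node by its full path.
    start = time.perf_counter()
    values = f[f"subset_{N_NODES - 1:06d}/data"][:]
    elapsed = time.perf_counter() - start

print(f"{n_nodes} nodes, one lookup took {elapsed:.6f} s")
```

Repeating the lookup for several node counts gives a first impression of whether access time grows with the number of nodes on your setup.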
You can also resize datasets, but speed and space performance will depend on the allocation method (contiguous vs. chunked). Read about it in the user guide: https://support.hdfgroup.org/HDF5/doc/UG/HDF5_Users_Guide-Responsive%20HTML5/HDF5_Users_Guide/Datasets/HDF5_Datasets.htm?rhtocid=5.3#TOC_5_5_Allocation_of_Spacebc-15
h5py supports chunked storage in addition to the default contiguous layout. Note that a dataset has to be chunked if you want to resize it later.
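As a minimal sketch of the append-by-resizing approach in h5py: the dataset name, the three-column shape, the chunk size, and the append_rows helper are all assumptions for illustration, not something from your data. The file is kept in memory so the example leaves nothing on disk.

```python
import h5py
import numpy as np


def append_rows(dset, rows):
    """Grow a resizable dataset along axis 0 and write rows at the end."""
    old = dset.shape[0]
    dset.resize(old + rows.shape[0], axis=0)
    dset[old:] = rows
    return dset.shape[0]


# In-memory file: driver="core" with backing_store=False writes nothing to disk.
with h5py.File("sketch.h5", "w", driver="core", backing_store=False) as f:
    dset = f.create_dataset(
        "subset_a",            # hypothetical node name
        shape=(0, 3),          # start empty
        maxshape=(None, 3),    # unlimited along axis 0, which requires chunking
        chunks=(1024, 3),      # illustrative chunk shape; tune for your access pattern
        dtype="f8",
    )
    append_rows(dset, np.ones((5, 3)))
    total = append_rows(dset, np.zeros((2, 3)))
    final_shape = dset.shape
    chunk_shape = dset.chunks

print(total, final_shape, chunk_shape)
```

The chunk shape is the main tuning knob here: chunks that match how you read the data back tend to matter more than whether you append or create a new node.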