The statistical software Stata allows short text snippets to be saved within a dataset, using notes and/or characteristics.
This feature is of great value to me, as it allows me to save a variety of information, ranging from reminders and to-do lists to details of how I generated the data, or even which estimation method was used for a particular variable.
I am now trying to come up with similar functionality in Python 3.6. So far, I have looked online and consulted a number of posts, none of which, however, address exactly what I want to do.
A few reference posts include:
What is the difference between save a pandas dataframe to pickle and to csv?
What is the fastest way to upload a big csv file in notebook to work with python pandas?
For a small NumPy array, I have concluded that a combination of the function numpy.savez() and a dictionary can adequately store all relevant information in a single file.
For example:
import numpy as np

a = np.array([[2, 4], [6, 8], [10, 12]])
d = {"first": 1, "second": "two", "third": 3}

np.savez('whatever_name.npz', a=a, d=d)

data = np.load('whatever_name.npz', allow_pickle=True)  # allow_pickle is needed for the dict
arr = data['a']
dic = data['d'].tolist()
However, the question remains:
Are there better ways to potentially incorporate other pieces of information in a file containing a NumPy array or a (large) Pandas DataFrame?
I am particularly interested in hearing about the particular pros and cons of any suggestions you may have with examples. The fewer dependencies, the better.
Thank you.
6 Answers
Answer 1
There are many options. I will discuss only HDF5, because I have experience using this format.
Advantages: Portable (can be read outside of Python), native compression, out-of-memory capabilities, metadata support.
Disadvantages: Reliance on a single low-level C API; as a single file, there is a possibility of data corruption; deleting data does not reduce the file size automatically.
In my experience, for performance and portability, avoid pyTables / HDFStore to store numeric data. You can instead use the intuitive interface provided by h5py.
Store an array
import h5py
import numpy as np

arr = np.random.randint(0, 10, (1000, 1000))

f = h5py.File('file.h5', 'w', libver='latest')  # use 'latest' for performance
dset = f.create_dataset('array', shape=(1000, 1000), data=arr,
                        chunks=(100, 100),
                        compression='gzip', compression_opts=9)
Compression & chunking
There are many compression choices, e.g. blosc and lzf are good choices for compression and decompression performance respectively. Note gzip is native; other compression filters may not ship by default with your HDF5 installation.
Chunking is another option which, when aligned with how you read data out-of-memory, can significantly improve performance.
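To make the chunking point concrete, here is a minimal sketch of reading back a single chunk-aligned slice without loading the whole array into memory (the file name is illustrative):

```python
import h5py
import numpy as np

# Write a chunked dataset, then read back one chunk-aligned slice.
# A 100x100 read aligned to the (100, 100) chunks touches a single
# chunk on disk instead of decompressing the whole array.
arr = np.random.randint(0, 10, (1000, 1000))
with h5py.File('chunked_demo.h5', 'w') as f:
    f.create_dataset('array', data=arr, chunks=(100, 100),
                     compression='gzip')

with h5py.File('chunked_demo.h5', 'r') as f:
    block = f['array'][0:100, 0:100]  # only this slice is read into memory

print(block.shape)  # (100, 100)
```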
Add some attributes
dset.attrs['Description'] = 'Some text snippet'
dset.attrs['RowIndexArray'] = np.arange(1000)
Store a dictionary
for k, v in d.items():
    f.create_dataset('dictgroup/' + str(k), data=v)
Out-of-memory access
dictionary = f['dictgroup']
res = dictionary['my_key']
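A sketch of the round trip for the attributes written above, showing that the metadata can be read back later without touching the array itself (file name is illustrative):

```python
import h5py
import numpy as np

# Attributes live alongside the dataset in the same HDF5 file and can
# be retrieved without loading the array data.
with h5py.File('attrs_demo.h5', 'w') as f:
    dset = f.create_dataset('array', data=np.arange(10))
    dset.attrs['Description'] = 'Some text snippet'

with h5py.File('attrs_demo.h5', 'r') as f:
    note = f['array'].attrs['Description']

print(note)  # 'Some text snippet'
```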
There is no substitute for reading the h5py documentation, which exposes most of the C API, but you should see from the above that there is a significant amount of flexibility.
Answer 2
A practical way could be to embed metadata directly inside the NumPy array. The advantage is that, as you'd like, there's no extra dependency and it's very simple to use in code. However, this doesn't fully answer your question, because you still need a mechanism to save the data, and I'd recommend using jpp's solution using HDF5.
To include metadata in an ndarray, there is an example in the documentation. You basically have to subclass ndarray and add a field info or metadata or whatever.
It would give (code from the link above)
import numpy as np

class ArrayWithInfo(np.ndarray):

    def __new__(cls, input_array, info=None):
        # Input array is an already formed ndarray instance
        # We first cast to be our class type
        obj = np.asarray(input_array).view(cls)
        # add the new attribute to the created instance
        obj.info = info
        # Finally, we must return the newly created object:
        return obj

    def __array_finalize__(self, obj):
        # see InfoArray.__array_finalize__ for comments
        if obj is None:
            return
        self.info = getattr(obj, 'info', None)
To save the data through numpy, you'd need to overload the write function or use another solution.
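As a dependency-free alternative to overloading the write function: np.save stores only the raw array buffer, so any attribute attached to the subclass is lost on reload. One simple workaround is to pickle the array together with its metadata in a plain dict (key names and file name here are illustrative):

```python
import pickle

import numpy as np

# Bundle the array and its metadata in one dict and pickle the bundle.
# pickle preserves both, whereas np.save would keep only the raw data.
arr = np.arange(5)
payload = {'data': arr, 'info': 'generated with method X'}

with open('arr_with_info.pkl', 'wb') as fh:
    pickle.dump(payload, fh)

with open('arr_with_info.pkl', 'rb') as fh:
    loaded = pickle.load(fh)

print(loaded['info'])  # 'generated with method X'
```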
Answer 3
I agree with jpp that hdf5 storage is a good option here. The difference between his solution and mine is that mine uses Pandas dataframes instead of numpy arrays. I prefer the dataframe since this allows mixed types, multi-level indexing (even datetime indexing, which is VERY important for my work), and column labeling, which helps me remember how different datasets are organized. Also, Pandas provides a slew of built-in functionalities (much like numpy). Another benefit of using Pandas is that it has an hdf creator built in (i.e. pandas.DataFrame.to_hdf), which I find convenient.
When storing the dataframe to h5 you have the option of storing a dictionary of metadata as well, which can be your notes to self, or actual metadata that does not need to be stored in the dataframe (I use this for setting flags as well, e.g. {'is_agl': True, 'scale_factor': 100, 'already_corrected': False}, etc.). In this regard, there is no difference between using a numpy array and a dataframe. For the full solution see my original question and solution here.
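One way this can be done (assuming pytables is installed) is to attach a free-form metadata dict to the stored frame via the underlying storer's attributes; the file name, key, and flag names below are illustrative:

```python
import numpy as np
import pandas as pd

# Store a dataframe in HDF5 together with a metadata dict attached to
# the same node, so notes travel with the data in a single file.
df = pd.DataFrame(np.random.normal(size=(10, 3)), columns=list('abc'))
meta = {'is_agl': True, 'scale_factor': 100, 'already_corrected': False}

with pd.HDFStore('frame_with_meta.h5', 'w') as store:
    store.put('df', df)
    store.get_storer('df').attrs.metadata = meta  # piggyback on HDF5 attributes

with pd.HDFStore('frame_with_meta.h5', 'r') as store:
    df2 = store['df']
    meta2 = store.get_storer('df').attrs.metadata

print(meta2['scale_factor'])  # 100
```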
Answer 4
jpp's answer is pretty comprehensive; I just wanted to mention that as of pandas v0.22, parquet is a very convenient and fast option with almost no drawbacks versus csv (except perhaps the coffee break).
At the time of writing, you'll also need to
pip install pyarrow
In terms of adding information, you have the metadata, which is attached to the data:
import pyarrow as pa
import pyarrow.parquet as pq
import pandas as pd
import numpy as np

df = pd.DataFrame(np.random.normal(size=(1000, 10)))

tab = pa.Table.from_pandas(df)
tab = tab.replace_schema_metadata({'here': 'it is'})

pq.write_table(tab, 'where_is_it.parq')
pq.read_table('where_is_it.parq')
Pyarrow table
0: double
1: double
2: double
3: double
4: double
5: double
6: double
7: double
8: double
9: double
__index_level_0__: int64
metadata
--------
{b'here': b'it is'}
To get this back to pandas:
tab.to_pandas()
Answer 5
It's an interesting question, although a very open-ended one, I think.
Text Snippets
For text snippets that are literal notes (as in, not code and not data), I really don't know what your use case is, but I don't see why I would deviate from the usual with open() as f: ...
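That is, a plain sidecar text file next to the dataset is often enough; for instance (file name and contents are illustrative):

```python
# Keep free-form notes in a plain text file next to the data file.
notes = "TODO: check outliers in the income column\nGenerated by clean_v2.py"

with open('dataset_notes.txt', 'w') as f:
    f.write(notes)

with open('dataset_notes.txt') as f:
    readback = f.read()

print(readback)
```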
Small collections of various data pieces
Sure, your npz works. Actually, what you are doing is very similar to creating a dictionary with everything you want to save and pickling that dictionary.
See here for a discussion of the differences between pickle and npz (but mainly, npz is optimized for numpy arrays).
Personally, I'd say if you are not storing NumPy arrays, I would use pickle, and even implement a quick MyNotes class that is basically a dictionary to save stuff in, with some additional functionality you may want.
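A hypothetical MyNotes class along those lines, just a dict subclass with one convenience method, pickled alongside your data (class, method, and file names are illustrative):

```python
import pickle

class MyNotes(dict):
    """A dictionary of notes, grouped under keys like 'todo' or 'provenance'."""

    def add(self, key, text):
        # Append a note under the given key, creating the list if needed.
        self.setdefault(key, []).append(text)

notes = MyNotes()
notes.add('todo', 'rerun estimation with robust SEs')
notes.add('provenance', 'merged from survey waves 1-3')

with open('notes.pkl', 'wb') as fh:
    pickle.dump(notes, fh)

with open('notes.pkl', 'rb') as fh:
    restored = pickle.load(fh)

print(restored['todo'][0])
```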
Collection of large objects
For really big np.arrays or dataframes, I have used the HDF5 format before. The good thing is that it is already built into pandas and you can directly call df.to_hdf(). It does need pytables underneath (installation should be fairly painless with pip or conda), but using pytables directly can be a much bigger pain.
Again, this idea is very similar: you are creating an HDFStore, which is pretty much a big dictionary in which you can store (almost any) objects. The benefit is that the format utilizes space in a smarter way by leveraging repetition of similar values. When I was using it to store some ~2GB dataframes, it was able to reduce it by almost a full order of magnitude (~250MB).
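A minimal round trip with the built-in HDF writer, assuming pytables is installed (file and key names are illustrative):

```python
import numpy as np
import pandas as pd

# Write a mixed-type dataframe to HDF5 and read it back.
df = pd.DataFrame({'x': np.arange(5), 'label': list('abcde')})
df.to_hdf('store_demo.h5', key='df', mode='w')

df2 = pd.read_hdf('store_demo.h5', 'df')
print(df2.equals(df))  # True
```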
One last player: feather
Feather is a project created by Wes McKinney and Hadley Wickham on top of the Apache Arrow framework, to persist data in a binary format that is language agnostic (and therefore you can read it from R and Python). However, it is still under development, and the last time I checked they didn't encourage using it for long-term storage (since the specification may change in future versions), but rather just for communication between R and Python.
The two of them just launched Ursa Labs, literally just weeks ago, which will continue growing this and similar initiatives.
Answer 6
You stated as the reasons for this question:
... it allows me to save a variety of information, ranging from reminders and to-do lists, to information about how I generated the data, or even what the estimation method for a particular variable was.
May I suggest a different paradigm than that offered by Stata? The notes and characteristics seem to be very limited and confined to just text. Instead, you should use Jupyter Notebook for your research and data analysis projects. It provides such a rich environment to document your workflow and capture details, thoughts and ideas as you are doing your analysis and research. It can easily be shared, and it's presentation-ready.
Here is a gallery of interesting Jupyter Notebooks across many industries and disciplines to showcase the many features and use cases of notebooks. It may expand your horizons beyond trying to devise a way to tag simple snippets of text to your data.