I have a dataframe and I am looking at the data type associated with each column.
When I run:
In [23]: df.dtype.descr
Out [24]: [(u'date', '<i8'), (u'open', '<f8'), (u'high', '<f8'), (u'low', '<f8'), (u'close', '<f8'), (u'volume', '<f8'), (u'dividend', '<f8'), (u'adj_factor', '<f8'), (u'split_factor', '<f8'), (u'liq', '<f8'), (u'currency', '|O')]
I want to set the currency dtype to S7. I am doing:
In [25]: dtype_new[-1] = (u'currency', "|S7")
In [26]: print dtype_new
Out [27]: [(u'date', '<i8'), (u'open', '<f8'), (u'high', '<f8'), (u'low', '<f8'), (u'close', '<f8'), (u'volume', '<f8'), (u'dividend', '<f8'), (u'adj_factor', '<f8'), (u'split_factor', '<f8'), (u'liq', '<f8'), (u'currency', '|S7')]
It looks to be the correct format. So I try to put it back to my df:
In [28]: df = df.astype(np.dtype(dtype_new))
And I get the error:
TypeError('data type not understood',)
What should I be changing? This was working before I recently updated Anaconda, and I am not aware of what changed. Thanks.
ADJUSTMENT:
df.dtype is
In [23]: records.dtype
Out[23]: dtype((numpy.record, [(u'date', '<i8'), (u'open', '<f8'), (u'high', '<f8'), (u'low', '<f8'), (u'close', '<f8'), (u'volume', '<f8'), (u'dividend', '<f8'), (u'adj_factor', '<f8'), (u'split_factor', '<f8'), (u'liq', '<f8'), (u'currency', 'O')]))
How can I change the last dtype from 'O' to something else? Specifically, a string of fewer than 7 characters.
LASTLY - is this a unicode issue? With Unicode:
In [38]: np.dtype([(u'date', '<i8')])
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-38-8702f0c7681f> in <module>()
----> 1 np.dtype([(u'date', '<i8')])
TypeError: data type not understood
No Unicode:
In [39]: np.dtype([('date', '<i8')])
Out[39]: dtype([('date', '<i8')])
1 Answer
It seems you have hit on the key point with unicode, and it is indeed a sore point.
Let's start from the latest NumPy documentation.
The documentation on dtypes states that:
[(field_name, field_dtype, field_shape), ...]
obj should be a list of fields where each field is described by a tuple of length 2 or 3. (Equivalent to the descr item in the __array_interface__ attribute.) The first element, field_name, is the field name (if this is '' then a standard field name, 'f#', is assigned). The field name may also be a 2-tuple of strings where the first string is either a “title” (which may be any string or unicode string) or meta-data for the field which can be any object, and the second string is the “name” which must be a valid Python identifier. The second element, field_dtype, can be anything that can be interpreted as a data-type. The optional third element field_shape contains the shape if this field represents an array of the data-type in the second element. Note that a 3-tuple with a third argument equal to 1 is equivalent to a 2-tuple. This style does not accept align in the dtype constructor as it is assumed that all of the memory is accounted for by the array interface description.
So the documentation does not really specify whether the field name can be unicode. What we can be sure of from it is that if we define a tuple as the field name, e.g. ((u'date', 'date'), '<i8'), then using unicode for the "title" (notice, still not for the name!) leads to no errors.
Even in this case, though, if you define ((u'date', u'date'), '<i8') you will get an error.
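As a small illustration of the (title, name) form quoted above, the sketch below builds a one-field dtype whose title and name differ. The title text 'Date title' is made up for the example. On Python 3 every str is unicode, so the Py2 restriction on the name does not show up there.

```python
import numpy as np

# The first element of the field tuple is itself a 2-tuple: (title, name).
# 'Date title' is an illustrative title, not something from the question.
dt = np.dtype([((u'Date title', 'date'), '<i8')])

# Only the name appears in dt.names; the title is an extra key in dt.fields.
print(dt.names)
print(dt.fields['date'])
```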
Now, in Py2 you can still use unicode names by calling encode("ascii") on them (u'date'.encode("ascii")), and this should work.
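A minimal sketch of that workaround, assuming the field list comes from something like dtype.descr: the helper below (ascii_field_names is a hypothetical name, not a NumPy function) encodes any name that is not already a native str. On Py2 that turns unicode names into byte strings; on Py3 the names are already str and pass through unchanged.

```python
import numpy as np

def ascii_field_names(descr):
    """Return a copy of a dtype description with native-str field names."""
    fixed = []
    for entry in descr:
        name = entry[0]
        # On Py2, u'date' is not a str, so it gets encoded to a byte string.
        # On Py3, every name is already a str and is left as it is.
        if not isinstance(name, str):
            name = name.encode("ascii")
        fixed.append((name,) + tuple(entry[1:]))
    return fixed

descr = [(u'date', '<i8'), (u'currency', '|S7')]
dt = np.dtype(ascii_field_names(descr))
print(dt)
```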
One big point is that in Py2, NumPy does not allow a dtype with unicode field names to be specified as a list of tuples, but it does allow it with a dictionary.
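For comparison, here is a sketch of the two specification styles side by side, using two of the field names from the question. The dictionary form takes the names and formats as parallel lists; the list-of-tuples form is given native-str names via str() so that it is also accepted on Py2.

```python
import numpy as np

names = [u'date', u'currency']
formats = ['<i8', 'S7']

# Dictionary form: names and formats are parallel lists.
dt_dict = np.dtype({'names': names, 'formats': formats})

# List-of-tuples form: str() yields a native string on both Py2 and Py3
# (this assumes the names are plain ASCII).
dt_list = np.dtype([(str(n), f) for n, f in zip(names, formats)])

print(dt_dict == dt_list)
```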
If I don't use unicode names in Py2, I can change the last field from |O to |S7; if you define the names as unicode strings, you have to use encode("ascii").
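Putting this together, one possible way to get from 'O' to 'S7' is sketched below. The record array is a small made-up stand-in for the asker's data (three fields instead of eleven, with byte-string currency codes as in Py2), and copying field by field is just one conservative option, not the only way to do the conversion.

```python
import numpy as np

# Hypothetical stand-in for the asker's records: the last field is object-typed.
records = np.array(
    [(20160101, 10.5, b"USD"), (20160102, 11.0, b"EUR")],
    dtype=[("date", "<i8"), ("close", "<f8"), ("currency", "O")],
)

old = records.dtype
# Reuse the existing per-field dtypes and swap only the last one.
formats = [old.fields[name][0] for name in old.names]
formats[-1] = np.dtype("S7")

# The dictionary form sidesteps the list-of-tuples name check discussed above.
new_dtype = np.dtype({"names": list(old.names), "formats": formats})

# Allocate the target array and copy one field at a time.
converted = np.empty(records.shape, dtype=new_dtype)
for name in new_dtype.names:
    converted[name] = records[name]

print(converted.dtype)
```

Values longer than 7 bytes would be truncated by the 'S7' field, so it is worth checking the longest currency code first.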
And the bugs involved...
To understand why you see this behaviour, it is useful to look at the bugs/issues reported against NumPy and pandas and the related discussions.
Numpy
https://github.com/numpy/numpy/issues/2407
In the discussion (which I do not reproduce here) you can notice mainly a few things:
- the "issue" has been going on for a while
- one trick people used was to use encode("ascii") on the unicode string
- remember that the 'whatever' string has different defaults (bytes/unicode) in Py2/3
- @hpaulj himself commented beautifully in that issue report that "If the dtype specification is of the list of tuples type, it checks whether each name is a string (as defined by py2 or 3) But if the dtype specification is a dictionary {'names':[ alist], 'formats':[alist]...}, the py2 case also allows unicode names"
Pandas
On the pandas side, a related issue has also been reported: https://github.com/pandas-dev/pandas/pull/13462
It seems to have been fixed not long ago.