Thursday, April 27, 2017

why is blindly using df.copy() a bad idea to fix the SettingWithCopyWarning

There are countless questions about the dreaded SettingWithCopyWarning

I've got a good handle on how it comes about. (Notice I said good, not great)

It happens when a dataframe df is "attached" to another dataframe via an attribute stored in is_copy.

Here's an example

    df = pd.DataFrame([[1]])
    d1 = df[:]
    d1.is_copy
    <weakref at 0x1115a4188; to 'DataFrame' at 0x1119bb0f0>

We can either set that attribute to None or

d1 = d1.copy() 

I've seen devs like @Jeff, and I can't remember who else, warn against doing that, pointing out that the SettingWithCopyWarning has a purpose.

Question
Ok, so what is a concrete example that demonstrates why ignoring the warning by assigning a copy back to the original is a bad idea?

I'll define "bad idea" for clarification.

Bad Idea
It is a bad idea to place code into production that will lead to getting a phone call in the middle of a Saturday night saying your code is broken and needs to be fixed.

Now, how can using df = df.copy() in order to bypass the SettingWithCopyWarning lead to getting that kind of phone call? I want it spelled out because this is a source of confusion and I'm attempting to find clarity. I want to see the edge case that blows up!

4 Answers

Answers 1

Here are my 2 cents on this, with a very simple example of why the warning is important.

So, assuming that I am creating a df such as

    x = pd.DataFrame(list(zip(range(4), range(4))), columns=['a', 'b'])
    print(x)
       a  b
    0  0  0
    1  1  1
    2  2  2
    3  3  3

Now I want to create a new object based on a subset of the original and modify it, such as:

 q = x.loc[:, 'a'] 

Now this is a slice of the original, and whatever I do to it will affect x:

    q += 2
    print(x)  # checking x again, wow! it changed!
       a  b
    0  2  0
    1  3  1
    2  4  2
    3  5  3

This is what the warning is telling you: you are working on a slice, so everything you do to it will be reflected in the original DataFrame.

Now, using .copy(), it won't be a slice of the original, so doing an operation on q won't affect x:

    x = pd.DataFrame(list(zip(range(4), range(4))), columns=['a', 'b'])
    print(x)
       a  b
    0  0  0
    1  1  1
    2  2  2
    3  3  3

    q = x.loc[:, 'a'].copy()
    q += 2
    print(x)  # oh, x did not change because q is a copy now
       a  b
    0  0  0
    1  1  1
    2  2  2
    3  3  3

And by the way, a copy just means that q will be a new object in memory, whereas a slice shares the same underlying data as the original object.

IMO, using .copy() is very safe. As an example, df.loc[:, 'a'] returns a slice but df.loc[df.index, 'a'] returns a copy. Jeff told me that this was unexpected behavior and that : and df.index should behave the same way as an indexer in .loc[]. Calling .copy() on either one returns a copy, so better be safe: use .copy() if you don't want to affect the original dataframe.

Now, .copy() returns a deep copy of the DataFrame, which is a very safe approach for not getting the phone call you are talking about.

But using df.is_copy = None is just a trick that does not copy anything, which is a very bad idea: you will still be working on a slice of the original DataFrame.

One more thing that people tend not to know:

df[columns] may return a view.

df.loc[indexer, columns] also may return a view, but almost always does not in practice. Emphasis on the "may" here.
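One way to take the guesswork out of "may return a view" is to ask NumPy whether two results are backed by the same memory. This is only a minimal sketch, assuming a pandas version that provides Series.to_numpy(); the helper name shares_buffer is made up for illustration and is not part of pandas. What a given indexing expression returns can differ between pandas versions (especially with copy-on-write enabled), so the first check is printed without a promised result. The one thing that should hold everywhere is that an explicit .copy() does not share memory with its source.

```python
import numpy as np
import pandas as pd


def shares_buffer(left, right):
    """Return True if two Series are backed by overlapping memory.

    Hypothetical helper for illustration, not part of pandas.
    """
    return bool(np.shares_memory(left.to_numpy(), right.to_numpy()))


x = pd.DataFrame(list(zip(range(4), range(4))), columns=['a', 'b'])

# Version-dependent: may be a view, a lazy copy, or a real copy.
print(shares_buffer(x['a'], x.loc[:, 'a']))

# An explicit copy owns its data, so writing to it cannot reach x.
q = x.loc[:, 'a'].copy()
print(shares_buffer(x['a'], q))
q += 2
print(x['a'].tolist())
```

A check like this is handy while debugging, but it describes one pandas version on one dataframe layout, so it is probably not something to build program logic on.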

Answers 2

EDIT:

After our comment exchange and from reading around a bit (I even found @Jeff's answer), I may be carrying owls to Athens, but the pandas docs contain this code example:

Sometimes a SettingWithCopy warning will arise at times when there’s no obvious chained indexing going on. These are the bugs that SettingWithCopy is designed to catch! Pandas is probably trying to warn you that you’ve done this:

    def do_something(df):
        foo = df[['bar', 'baz']]  # Is foo a view? A copy? Nobody knows!
        # ... many lines here ...
        foo['quux'] = value  # We don't know whether this will modify df or not!
        return foo

That may be an easily avoided problem for an experienced user/developer, but pandas is not only for the experienced...
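For the case where the function is meant to return a new frame and leave its input alone, one possible way to remove the ambiguity is to make the copy explicit at the point where foo is created. The sketch below is an assumption about intent, not the docs' own code: the value parameter and the sample data are added here only so that the snippet runs on its own.

```python
import pandas as pd


def do_something(df, value):
    # Explicit copy: foo is now unambiguously independent of df.
    foo = df[['bar', 'baz']].copy()
    # ... many lines here ...
    foo['quux'] = value  # modifies foo only, never df
    return foo


# Sample data made up for illustration.
df = pd.DataFrame({'bar': [1, 2], 'baz': [3, 4], 'other': [5, 6]})
result = do_something(df, 0)

print(list(result.columns))
print(list(df.columns))
```

If the intent were instead to add the column to df itself, the copy would be the wrong tool and the assignment should go through df directly.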

Still, you probably will not get a phone call in the middle of the night on a Sunday about this, but it may damage your data integrity in the long run if you don't catch it early.
Also, as Murphy's law states, the most time-consuming and complex data manipulation you do WILL be on a copy that gets discarded before it is used, and you will spend hours trying to debug it!

Note: All of that is hypothetical, because the very definition in the docs is a hypothesis based on the probability of (unfortunate) events... SettingWithCopy is a new-user-friendly warning which exists to warn new users of potentially random and unwanted behavior of their code.


There exists this issue from 2014.
The code that causes the warning in this case looks like this:

    from pandas import DataFrame
    # create example dataframe:
    df = DataFrame({'column1': ['a', 'a', 'a'], 'column2': [4, 8, 9]})
    df
    # assign string to 'column1':
    df['column1'] = df['column1'] + 'b'
    df
    # it works just fine - no warnings
    # now remove one line from dataframe df:
    df = df[df['column2'] != 8]
    df
    # adding string to 'column1' gives warning:
    df['column1'] = df['column1'] + 'c'
    df

And jreback made some comments on the matter:

You are in fact setting a copy.

You prob don't care; it is mainly to address situations like:

df['foo'][0] = 123...  

which sets the copy (and thus is not visible to the user)

This operation, make the df now point to a copy of the original

df = df [df['column2']!=8] 

If you don't care about the 'original' frame, then its ok

If you are expecting that the

df['column1'] = df['column1'] + 'c' 

would actually set the original frame (they are both called 'df' here which is confusing) then you would be surprised.

and

(this warning is mainly for new users to avoid setting the copy)

Finally he concludes:

Copies don't normally matter except when you are then trying to set them in a chained manner.
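To make the "chained manner" concrete, here is a small sketch with made-up data. The chained form is left as a comment because whether it reaches df depends on whether the intermediate df['foo'] is a view or a copy, which varies with the pandas version. The single .loc call hands pandas the frame and both keys in one step, so there is no intermediate object to get wrong.

```python
import pandas as pd

# Sample data made up for illustration.
df = pd.DataFrame({'foo': [1, 2, 3], 'bar': [4, 5, 6]})

# Chained form (two indexing steps), shown only as a comment because its
# effect depends on whether df['foo'] is a view or a copy:
#     df['foo'][0] = 123

# Single indexing step: pandas is handed df itself plus both keys.
df.loc[0, 'foo'] = 123

print(df['foo'].tolist())
```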

From the above we can draw these conclusions:

  1. SettingWithCopyWarning has a meaning and there are (as presented by jreback) situations in which this warning matters and the complications may be avoided.
  2. The warning is mainly a "safety net" for newer users, to make them pay attention to what they are doing and to the fact that it may cause unexpected behavior on chained operations. Thus a more advanced user can turn off the warning (from jreback's answer):
pd.set_option('mode.chained_assignment', None) 

or you could do:

df.is_copy = False 

Answers 3

Update:

TL;DR: I think how to treat the SettingWithCopyWarning depends on the purpose. If one wants to avoid modifying df, then working on df.copy() is safe and the warning is redundant. If one wants to modify df, then using .copy() is the wrong way and the warning needs to be respected.

Disclaimer: I don't have private/personal communications with pandas experts like other answerers. So this answer is based on the official pandas docs, which is what a typical user would rely on, and my own experience.


SettingWithCopyWarning is not the real problem; it warns about the real problem. Users need to understand and solve the real problem, not bypass the warning.

The real problem is that indexing a dataframe may return a copy, and modifying this copy will not change the original dataframe. The warning asks users to check for and avoid that logical bug. For example:

    import numpy as np
    import pandas as pd
    np.random.seed(7)  # reproducibility
    df = pd.DataFrame(np.random.randint(1, 10, (3, 3)), columns=['a', 'b', 'c'])
    print(df)
       a  b  c
    0  5  7  4
    1  4  8  8
    2  8  9  9

    # Setting with chained indexing: does not work and raises the warning.
    df[df.a>4]['b'] = 1
    print(df)
       a  b  c
    0  5  7  4
    1  4  8  8
    2  8  9  9

    # Setting with chained indexing: *may* work in some cases without a warning,
    # but don't rely on it; chained indexing should always be avoided.
    df['b'][df.a>4] = 2
    print(df)
       a  b  c
    0  5  2  4
    1  4  8  8
    2  8  2  9

    # Setting using .loc[]: guaranteed to work.
    df.loc[df.a>4, 'b'] = 3
    print(df)
       a  b  c
    0  5  3  4
    1  4  8  8
    2  8  3  9

About the wrong ways to bypass the warning:

    df1 = df[df.a>4]['b']
    df1.is_copy = None
    df1[0] = -1  # no warning because you trick pandas, but the assignment does not reach df
    print(df)
       a  b  c
    0  5  7  4
    1  4  8  8
    2  8  9  9

    df1 = df[df.a>4]['b']
    df1 = df1.copy()
    df1[0] = -1  # no warning because df1 is a separate object now, but the assignment does not reach df
    print(df)
       a  b  c
    0  5  7  4
    1  4  8  8
    2  8  9  9

So, setting df1.is_copy to False or None is just a way to bypass the warning, not to solve the real problem when assigning. Setting df1 = df1.copy() also bypasses the warning, in another even more wrong way, because df1 no longer holds a weak reference to df and is a totally independent object. So if users want to change values in df, they will receive no warning, but a logical bug. Inexperienced users will not understand why df does not change after being assigned new values. That is why it is advisable to avoid these approaches completely.

If the users only want to work on the copy of the data, that is, strictly not modifying the original df, then it's perfectly correct to call .copy() explicitly. But if they want to modify the data in the original df, they need to respect the warning. The point is, users need to understand what they are doing.

In the case of a warning caused by chained indexing assignment, the correct solution is to avoid assigning values to a copy produced by df[cond1][cond2], and to assign through a single df.loc[cond1, cond2] call instead, which operates on df itself.
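The two intents described above can be sketched side by side. The data and the names mask and subset are made up for illustration. The point is that each intent has its own spelling: one .loc call on df when df should change, and an explicit .copy() when it should not.

```python
import pandas as pd

# Sample data made up for illustration.
df = pd.DataFrame({'a': [5, 4, 8], 'b': [7, 8, 9]})
mask = df['a'] > 4

# Intent 1: change df itself -> one .loc call on df.
df.loc[mask, 'b'] = 3

# Intent 2: work on a separate subset and leave df alone -> explicit copy.
subset = df.loc[mask].copy()
subset['b'] = -1

print(df['b'].tolist())
print(subset['b'].tolist())
```

Written this way, a reader of the code can tell which of the two was meant without knowing whether a particular indexing expression happens to return a view.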

More examples of setting with copy warning/error and solutions are shown in the docs: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy

Answers 4

While the other answers provide good information about why one shouldn't simply ignore the warning, I think your original question has not been answered yet.

@thn points out that whether to use copy() depends completely on the scenario at hand. When you want the original data to be preserved, you use .copy(); otherwise you don't. If you are using copy() to circumvent the SettingWithCopyWarning, you are ignoring the fact that you may introduce a logical bug into your software. As long as you are absolutely certain that this is what you want to do, you are fine.

However, when using .copy() blindly you may run into another issue, which is no longer really pandas specific, but occurs every time you are copying data.

I slightly modified your example code to make the problem more apparent:

    import numpy as np
    import pandas as pd

    @profile
    def foo():
        df = pd.DataFrame(np.random.randn(2 * 10 ** 7))

        d1 = df[:]
        d1 = d1.copy()

    if __name__ == '__main__':
        foo()

When using memory_profiler one can clearly see that .copy() doubles our memory consumption:

    > python -m memory_profiler demo.py
    Filename: demo.py

    Line #    Mem usage    Increment   Line Contents
    ================================================
         4   61.195 MiB    0.000 MiB   @profile
         5                             def foo():
         6  213.828 MiB  152.633 MiB       df = pd.DataFrame(np.random.randn(2 * 10 ** 7))
         7
         8  213.863 MiB    0.035 MiB       d1 = df[:]
         9  366.457 MiB  152.594 MiB       d1 = d1.copy()

This is because there is still a reference (df) which points to the original data frame. Thus, df is not cleaned up by the garbage collector and is kept in memory.

When you are using this code in a production system, you may or may not get a MemoryError, depending on the size of the data you are dealing with and your available memory.

To conclude, it is not a wise idea to use .copy() blindly. Not just because you may introduce a logical bug in your software, but also because it may expose runtime dangers such as a MemoryError.
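The same effect can be checked without a profiler. This is a rough sketch on a much smaller frame than the one profiled above, assuming DataFrame.memory_usage and np.shares_memory are adequate stand-ins for the profiler's numbers. It shows that the copy holds as many bytes as the original and shares none of them, so both allocations coexist while both names are alive.

```python
import numpy as np
import pandas as pd

# Much smaller than the profiled example so that it runs quickly.
df = pd.DataFrame(np.random.randn(10 ** 5))
d1 = df[:]
d1 = d1.copy()

# Bytes held by the data of each frame (index excluded).
bytes_df = int(df.memory_usage(index=False).sum())
bytes_d1 = int(d1.memory_usage(index=False).sum())

# False means the copy has its own buffer, so both allocations coexist
# for as long as both names are alive.
overlap = bool(np.shares_memory(df[0].to_numpy(), d1[0].to_numpy()))

print(bytes_df, bytes_d1, overlap)
```

If the original frame is not needed afterwards, dropping the last reference to it (for example with del df) lets Python release that memory, but only when nothing else still refers to it.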
