
Explaining the migration path for Copy-on-Write

Introduction
The introduction of Copy-on-Write (CoW) is a breaking change that will have some impact on existing pandas code. We will investigate how we can adapt our code to avoid errors when CoW is enabled by default. This is currently planned for the pandas 3.0 release, which is scheduled for April 2024. The first post in this series explained the behavior of Copy-on-Write, while the second post dove into performance optimizations related to Copy-on-Write.
We are planning to add a warning mode that raises a warning for all operations that will change behavior with CoW. The warning will be very noisy for users and thus has to be treated with some care. This post explains common cases and how you can adapt your code to avoid changes in behavior.
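If you want to prepare already, here is a minimal sketch of opting in, assuming a pandas version where these options have landed (CoW itself since pandas 2.0, the warning mode since pandas 2.2):
import pandas as pd

# Opt in to Copy-on-Write behavior ahead of pandas 3.0:
pd.options.mode.copy_on_write = True

# Or enable the warning mode instead, which warns on operations
# whose behavior will change under CoW:
pd.options.mode.copy_on_write = "warn"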
Chained assignment
Chained assignment is a technique where one object is updated through two subsequent operations.
import pandas as pd

df = pd.DataFrame({"x": [1, 2, 3]})
df["x"][df["x"] > 1] = 100
The first operation selects the column "x" while the second operation restricts the number of rows. There are many different combinations of these operations (e.g. combined with loc or iloc). None of these combinations will work under CoW. Instead, they will raise a ChainedAssignmentError warning so that these patterns can be removed, instead of silently doing nothing.
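To see what this means in practice, here is a small sketch assuming CoW is already enabled: the assignment hits a temporary object and never reaches df.
import pandas as pd

pd.options.mode.copy_on_write = True

df = pd.DataFrame({"x": [1, 2, 3]})
df["x"][df["x"] > 1] = 100  # warns with ChainedAssignmentError

print(df)  # "x" is still [1, 2, 3]; the update was lost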
Generally, you can use loc instead:
df.loc[df["x"] > 1, "x"] = 100
The first dimension of loc always corresponds to the row indexer, which means that you are able to select a subset of rows. The second dimension corresponds to the column indexer, which allows you to select a subset of columns.
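A short illustration of both dimensions, as a sketch with a hypothetical second column "y":
df = pd.DataFrame({"x": [1, 2, 3], "y": [4, 5, 6]})

df.loc[df["x"] > 1, "y"] = 0      # boolean row mask, single column label
subset = df.loc[1:2, ["x", "y"]]  # label-based row slice, list of columns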
It is generally faster to use loc when you want to set values into a subset of rows, so this will clean up your code and provide a performance improvement.
This is the obvious case where CoW will have an impact. It will also impact chained inplace operations:
df["x"].replace(1, 100)
The pattern is the same as above. The column selection is the first operation. The replace method tries to operate on the temporary object, which will fail to update the initial object. You can remove these patterns pretty easily by specifying the columns you want to operate on.
df = df.replace({"x": 1}, {"x": 100})
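Alternatively, here is a sketch of the same fix expressed with loc, which avoids the chained pattern just as well:
df.loc[df["x"] == 1, "x"] = 100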
Patterns to avoid
My previous post explains how the CoW mechanism works and how DataFrames share the underlying data. A defensive copy will be performed if two objects share the same data while you modify one object inplace.
df2 = df.reset_index()
df2.iloc[0, 0] = 100
The reset_index operation creates a view of the underlying data. The result is assigned to a new variable df2, which means that two objects share the same data. This holds true until df is garbage collected. The setitem operation will thus trigger a copy. This is completely unnecessary if you don't need the initial object df anymore. Simply reassigning to the same variable will invalidate the reference that is held by the object.
df = df.reset_index()
df.iloc[0, 0] = 100
Summarizing, creating multiple references in the same method keeps unnecessary references alive. Temporary references that are created when chaining different methods together are fine.
df = df.reset_index().drop(...)
This will only keep one reference alive.
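If you want to check whether two objects still share memory, here is a small sketch using NumPy (assuming numeric columns, where to_numpy returns a view under CoW):
import numpy as np
import pandas as pd

df = pd.DataFrame({"a": [1, 2], "b": [3, 4]})
df2 = df.reset_index(drop=True)

# Both objects still point at the same underlying data until one of
# them is modified, at which point CoW kicks in and copies:
np.shares_memory(df["a"].to_numpy(), df2["a"].to_numpy())  # True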
Accessing the underlying NumPy array
pandas currently gives us access to the underlying NumPy array through to_numpy or .values. The returned array is a copy if your DataFrame consists of different dtypes, e.g.:
df = pd.DataFrame({"a": [1, 2], "b": [1.5, 2.5]})
df.to_numpy()

[[1.  1.5]
 [2.  2.5]]
The DataFrame is backed by two arrays which need to be combined into one. This triggers the copy.
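You can verify this with NumPy, as a sketch reusing the mixed-dtype df from above:
import numpy as np

# The 2D result is a freshly allocated float64 array, so it cannot
# share memory with either of the original columns:
np.shares_memory(df.to_numpy(), df["a"].to_numpy())  # False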
The other case is a DataFrame that is only backed by a single NumPy array, e.g.:
df = pd.DataFrame({"a": [1, 2], "b": [3, 4]})
df.to_numpy()

[[1 3]
 [2 4]]
We can directly access the array and get a view instead of a copy. This is much faster than copying all the data. We can now operate on the NumPy array and potentially modify it inplace, which will also update the DataFrame and potentially all other DataFrames that share data. This becomes much more complicated with Copy-on-Write, since we removed many defensive copies. Many more DataFrames will now share memory with each other.
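The same check now comes back True, as a sketch reusing the single-dtype df from above:
import numpy as np

# One homogeneous int64 block backs the whole frame, so to_numpy can
# hand out a view of it:
np.shares_memory(df.to_numpy(), df["a"].to_numpy())  # True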
Because of this, to_numpy and .values will return a read-only array. This means that the resulting array is not writeable.
df = pd.DataFrame({"a": [1, 2], "b": [3, 4]})
arr = df.to_numpy()
arr[0, 0] = 1
This will trigger a ValueError:
ValueError: assignment destination is read-only
You can avoid this in two different ways:
- Trigger a copy manually if you want to avoid updating DataFrames that share memory with your array (see the sketch after the next snippet).
- Make the array writeable. This is a more performant solution but circumvents Copy-on-Write rules, so it should be used with caution:
arr.flags.writeable = True
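For the first option, a minimal sketch: copy explicitly, so that modifying the array can never leak back into a DataFrame.
arr = df.to_numpy().copy()  # arr owns its data now
arr[0, 0] = 1               # safe: no DataFrame shares this memory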
There are cases where this is not possible. One common occurrence is accessing a single column that is backed by PyArrow:
ser = pd.Series([1, 2], dtype="int64[pyarrow]")
arr = ser.to_numpy()
arr.flags.writeable = True
This raises a ValueError:
:
ValueError: cannot set WRITEABLE flag to True of this array
Arrow arrays are immutable, hence it is not possible to make the NumPy array writeable. The conversion from Arrow to NumPy is zero-copy in this case.
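The only way out here is an actual copy, as a sketch reusing ser from above:
arr = ser.to_numpy().copy()  # materialize a writable NumPy copy
arr[0] = 100                 # fine: the immutable Arrow data is untouched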
Conclusion
We've looked at the most invasive Copy-on-Write related changes. These changes will become the default behavior in pandas 3.0. We have also investigated how we can adapt our code to avoid breakage when Copy-on-Write is enabled. The upgrade process should be pretty smooth if you can avoid these patterns.