Deep Dive into Pandas Copy-on-Write Mode — Part III

Explaining the migration path for Copy-on-Write


Introduction

The introduction of Copy-on-Write (CoW) is a breaking change that will have some impact on existing pandas code. We are going to investigate how we can adapt our code to avoid errors once CoW is enabled by default. That is currently planned for the pandas 3.0 release, which is scheduled for April 2024. The first post in this series explained the behavior of Copy-on-Write, while the second post dove into performance optimizations that are related to Copy-on-Write.

We are planning on adding a warning mode that will warn for all operations that change behavior with CoW. The warning will likely be very noisy for users and thus has to be treated with some care. This post explains common cases and how you can adapt your code to avoid changes in behavior.
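If you want to surface these cases ahead of time, you can opt in explicitly (a minimal sketch; the "warn" value assumes a pandas version that already ships the warning mode):

import pandas as pd

# Enable Copy-on-Write behavior globally ...
pd.options.mode.copy_on_write = True

# ... or, where available, enable the warning mode that flags operations
# whose behavior will change under Copy-on-Write.
pd.options.mode.copy_on_write = "warn"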

Chained assignment

Chained assignment is a technique where one object is updated through two subsequent operations.

import pandas as pd

df = pd.DataFrame({"x": [1, 2, 3]})

df["x"][df["x"] > 1] = 100

The first operation selects the column "x", while the second operation restricts the number of rows. There are many different combinations of these operations (e.g. combined with loc or iloc). None of these combinations will work under CoW. Instead, they will raise a ChainedAssignmentError warning to remove these patterns instead of silently doing nothing.
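For illustration, here is one more combination that falls into the same category (a sketch; it behaves just like the example above):

# Selecting the rows first and then the column is also chained assignment.
# It silently does nothing and warns with ChainedAssignmentError under CoW.
df[df["x"] > 1]["x"] = 100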

Generally, you should use loc instead:

df.loc[df["x"] > 1, "x"] = 100

The first dimension of loc always corresponds to the row-indexer. This means that you are able to select a subset of rows. The second dimension corresponds to the column-indexer, which allows you to select a subset of columns.

Using loc is generally faster when you want to set values into a subset of rows, so this will clean up your code and provide a performance improvement.

This is the obvious case where CoW will have an impact. It will also impact chained inplace operations:

df["x"].replace(1, 100)

The pattern is the same as above. The column selection is the first operation. The replace method tries to operate on the temporary object, which will fail to update the initial object. You can remove these patterns pretty easily by specifying the columns you want to operate on.

df = df.replace({"x": 1}, {"x": 100})

Patterns to avoid

My previous post explains how the CoW mechanism works and how DataFrames share the underlying data. A defensive copy will be performed if two objects share the same data while you modify one object inplace.

df2 = df.reset_index()
df2.iloc[0, 0] = 100

The reset_index operation will create a view of the underlying data. The result is assigned to a new variable df2, which means that two objects share the same data. This holds true until df is garbage collected. The setitem operation will thus trigger a copy. This is completely unnecessary if you don't need the initial object df anymore. Simply reassigning to the same variable will invalidate the reference that is held by the object.

df = df.reset_index()
df.iloc[0, 0] = 100

To summarize, creating multiple references in the same method keeps unnecessary references alive.

Temporary references that are created when chaining different methods together are fine.

df = df.reset_index().drop(...)

This will only keep one reference alive.
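If you are unsure whether two objects actually share data, you can check the underlying arrays with NumPy (a small sketch, assuming Copy-on-Write is enabled so that reset_index returns a lazy copy):

import numpy as np
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3]})
df2 = df.reset_index(drop=True)

# Both objects reference the same underlying array until one of them is modified.
np.shares_memory(df["a"].to_numpy(), df2["a"].to_numpy())  # True under CoW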

Accessing the underlying NumPy array

pandas currently gives us access to the underlying NumPy array through to_numpy or .values. The returned array is a copy if your DataFrame consists of different dtypes, e.g.:

df = pd.DataFrame({"a": [1, 2], "b": [1.5, 2.5]})
df.to_numpy()

[[1. 1.5]
[2. 2.5]]

The DataFrame is backed by two arrays which need to be combined into one. This triggers the copy.
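We can verify that the resulting array does not share memory with the DataFrame (a quick check with NumPy, repeating the mixed-dtype example):

import numpy as np
import pandas as pd

df = pd.DataFrame({"a": [1, 2], "b": [1.5, 2.5]})
arr = df.to_numpy()

# The columns have different dtypes, so to_numpy has to allocate a new array.
np.shares_memory(arr, df["a"].to_numpy())  # False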

The other case is a DataFrame that is only backed by a single NumPy array, e.g.:

df = pd.DataFrame({"a": [1, 2], "b": [3, 4]})
df.to_numpy()

[[1 3]
[2 4]]

We can directly access the array and get a view instead of a copy. This is much faster than copying all the data. We can now operate on the NumPy array and potentially modify it inplace, which will also update the DataFrame and potentially all other DataFrames that share data. This becomes much more complicated with Copy-on-Write, since we removed many defensive copies. Many more DataFrames will now share memory with each other.

to_numpy and .values will return a read-only array for this reason. This means that the resulting array is not writeable.

df = pd.DataFrame({"a": [1, 2], "b": [3, 4]})
arr = df.to_numpy()

arr[0, 0] = 1

This will trigger a ValueError:

ValueError: assignment destination is read-only

You can avoid this in two different ways:

  • Trigger a copy manually if you want to avoid updating DataFrames that share memory with your array (see the sketch after this list).
  • Make the array writeable. This is a more performant solution but circumvents Copy-on-Write rules, so it should be used with caution.

arr.flags.writeable = True
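A short sketch of the first option: requesting a copy up front gives you an array that owns its data and is writeable (to_numpy also accepts copy=True; calling .copy() on the result works just as well):

import pandas as pd

df = pd.DataFrame({"a": [1, 2], "b": [3, 4]})

# Ask for a fresh copy so that modifying arr never touches the DataFrame.
arr = df.to_numpy(copy=True)
arr[0, 0] = 100  # fine: arr is writeable and independent of df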

There are cases where making the array writeable is not possible. One common occurrence is if you are accessing a single column that is backed by PyArrow:

ser = pd.Series([1, 2], dtype="int64[pyarrow]")
arr = ser.to_numpy()
arr.flags.writeable = True

This raises a ValueError:

ValueError: cannot set WRITEABLE flag to True of this array

Arrow arrays are immutable, hence it is not possible to make the NumPy array writeable. The conversion from Arrow to NumPy is zero-copy in this case.
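If you do need a writeable array from an Arrow-backed column, copying the converted array is the straightforward way out (a minimal sketch; the explicit copy necessarily gives up the zero-copy behavior):

import pandas as pd

ser = pd.Series([1, 2], dtype="int64[pyarrow]")

# Copy the read-only, zero-copy view into an array that owns its data.
arr = ser.to_numpy().copy()
arr[0] = 100  # modifies only the copy, not the Series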

Conclusion

We've looked at the most invasive Copy-on-Write related changes. These changes will become the default behavior in pandas 3.0. We've also investigated how we can adapt our code to avoid breaking it when Copy-on-Write is enabled. The upgrade process should be pretty smooth if you can avoid these patterns.
