import pandas as pd
d = {'col1': [1, 2], 'col2': [3, 4]}
df = pd.DataFrame(data=d)
print(df.columns)
df
After adding a new column, it appears in the Index returned by df.columns.
df['col3'] = [5, 6]
print(df.columns)
df
After dropping a column using axis=1, it is no longer in the Index object.
x = df.drop(['col1'], axis=1)
display(x.columns)
x
But how does it work for a hierarchical MultiIndex?
columns = pd.MultiIndex.from_product([['head', 'body'], ['x', 'y']],
names=['bodypart', 'coordinates'])
df = pd.DataFrame([[1, 2, 3, 4], [9, 2, 3, 4]], columns=columns)
df
display(df.columns)
display(df.columns.levels)
display(list(df.columns.levels[0]))
What happens if we add another column? Will it appear in the column index?
df['tail', 'x'] = 5
df['tail', 'y'] = 9
display(df)
display(df.columns)
display(df.columns.levels)
display(list(df.columns.levels[0]))
It seems to be in columns level 0!
And what happens if we drop one of the original columns? Is the column Index updated accordingly?
df_drop = df.drop(['body'], axis=1)
display(df_drop)
display(df_drop.columns)
display(df_drop.columns.levels)
display(list(df_drop.columns.levels[0]))
As we can see, dropping a column does not remove it from the levels of the column index (which are stored in a FrozenList)! While some might consider this a bug, the pandas developers regard it as a philosophical question and consider the behaviour to be working as intended.
However, there is a good workaround:
display(df_drop.columns.get_level_values(0).unique())
There is another way: assigning a new column index. While this seems a reasonable approach for some use cases, there might be unforeseen (performance) implications, which is why it is not the default behaviour.
df_drop.columns = df_drop.columns.remove_unused_levels()
display(df_drop.columns.levels)
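The two approaches can be compared side by side. The following is a minimal sketch that rebuilds a small frame from scratch; the names frame, dropped and pruned are illustrative and not part of the notebook above.

```python
import pandas as pd

columns = pd.MultiIndex.from_product(
    [['head', 'body'], ['x', 'y']], names=['bodypart', 'coordinates']
)
frame = pd.DataFrame([[1, 2, 3, 4], [9, 2, 3, 4]], columns=columns)
dropped = frame.drop(['body'], axis=1)

# The level definitions still list the dropped label.
stale_levels = list(dropped.columns.levels[0])

# Workaround 1: ask for the labels that are actually in use.
used_labels = list(dropped.columns.get_level_values(0).unique())

# Workaround 2: build a new index whose levels contain only used labels.
pruned = dropped.columns.remove_unused_levels()
pruned_levels = list(pruned.levels[0])

print(stale_levels, used_labels, pruned_levels)
```

Note that remove_unused_levels returns a new MultiIndex rather than modifying the existing one, which is why the notebook assigns the result back to df_drop.columns.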
And how can we set values for certain rows of a MultiIndexed DataFrame if there is a non-MultiIndexed column?
df['behavior'] = 'cute'
df
df.behavior[df['head', 'x'] == 1] = 'foobar' # this does not work!
df
As you can see, setting the value using chained indexing does not work. The reason lies in how the pandas indexing syntax is translated into Python method calls: the expression first fetches df.behavior and only then assigns into the result, which here is a copy rather than a view of the original data. The official docs provide a detailed explanation. pandas reports this in a warning, which may or may not be visible depending on how you render the notebook.
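The two-step nature of chained indexing can be made explicit by writing out the calls by hand. This is a minimal sketch on a small flat-column frame; the names frame, mask and tmp are illustrative, and whether the original frame is affected depends on the pandas version and on whether the first step returns a view or a copy.

```python
import pandas as pd

# Illustrative frame; not the MultiIndexed df from the text.
frame = pd.DataFrame({'a': [1, 2], 'behavior': ['cute', 'cute']})
mask = frame['a'] == 1

# frame['behavior'][mask] = 'foobar' is evaluated as two separate calls:
tmp = frame.__getitem__('behavior')  # step 1: may return a view or a copy
tmp.__setitem__(mask, 'foobar')      # step 2: assigns into whatever step 1 returned

print(tmp.tolist())                # the intermediate object was modified
print(frame['behavior'].tolist())  # the frame itself may or may not have changed
```

Because pandas cannot guarantee what the first step hands back, the assignment in the second step is not guaranteed to reach the original frame.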
Instead, we should make use of the loc indexer:
df.loc[df['head', 'x'] == 1, ['behavior']] = 'buzzz'
df
Behind the scenes, this relies on the operator overloading of the Python data model, which pandas uses extensively. What looks like an indexing expression on the left-hand side of an assignment actually calls the overloaded __setitem__ method on an internal _LocIndexer object (a plain read such as df.loc[...] would call __getitem__ instead). While the internal code branches are a bit more involved, what happens functionally in our case is that the first argument is used as a boolean Series to select specific rows, and the second argument specifies which column to access (and thereby overwrite).
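How Python dispatches such expressions can be seen with a toy object. The Recorder class below is a hypothetical stand-in written for this illustration; it is not pandas code and does not reflect how _LocIndexer is implemented internally. It only records which special method Python calls and with which key.

```python
class Recorder:
    """Hypothetical stand-in for an indexer object; not pandas code."""

    def __init__(self):
        self.calls = []

    def __getitem__(self, key):
        # Called for reads: obj[key]
        self.calls.append(('__getitem__', key))
        return self

    def __setitem__(self, key, value):
        # Called for assignments: obj[key] = value
        self.calls.append(('__setitem__', key, value))


loc = Recorder()

# Two comma-separated arguments arrive as a single tuple key.
loc['rows', ['behavior']] = 'buzzz'
_ = loc['rows', 'behavior']

for call in loc.calls:
    print(call)
```

The assignment reaches __setitem__ in a single call with the key ('rows', ['behavior']), so the receiving object sees the row selector and the column selector together and can write directly into the underlying data.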