python - Applying a function to a MultiIndex pandas.DataFrame column
I have a MultiIndex pandas DataFrame in which I want to apply a function to one of its columns and assign the result to that same column.
    In [1]:
        import numpy as np
        import pandas as pd
        cols = ['one', 'two', 'three', 'four', 'five']
        df = pd.DataFrame(np.array(list('ABCDEFGHIJKLMNO'), dtype='object').reshape(3,5), index=list('abc'), columns=cols)
        df.to_hdf('/tmp/test.h5', 'df')
        df = pd.read_hdf('/tmp/test.h5', 'df')
        df
    Out[1]:
          one two three four five
        a   A   B     C    D    E
        b   F   G     H    I    J
        c   K   L     M    N    O

        3 rows × 5 columns

    In [2]:
        df.columns = pd.MultiIndex.from_arrays([list('uuull'), ['one', 'two', 'three', 'four', 'five']])
        df['l']['five'] = df['l']['five'].apply(lambda x: x.lower())
        df
    -c:2: SettingWithCopyWarning: A value is trying to be set on a copy of a slice from a DataFrame.
    Try using .loc[row_index,col_indexer] = value instead
    Out[2]:
            u              l
          one two three four five
        a   A   B     C    D    E
        b   F   G     H    I    J
        c   K   L     M    N    O

        3 rows × 5 columns

    In [3]:
        df.columns = ['one', 'two', 'three', 'four', 'five']
        df
    Out[3]:
          one two three four five
        a   A   B     C    D    E
        b   F   G     H    I    J
        c   K   L     M    N    O

        3 rows × 5 columns

    In [4]:
        df['five'] = df['five'].apply(lambda x: x.upper())
        df
    Out[4]:
          one two three four five
        a   A   B     C    D    E
        b   F   G     H    I    J
        c   K   L     M    N    O

        3 rows × 5 columns
As you can see, the function is not applied to the column, I guess because of this warning:

    -c:2: SettingWithCopyWarning: A value is trying to be set on a copy of a slice from a DataFrame.
    Try using .loc[row_index,col_indexer] = value instead

What is strange is that this error happens only sometimes, and I haven't been able to understand when it happens and when it does not.
I managed to apply the function by slicing the DataFrame with .loc, as the warning recommended:
    In [5]:
        df.columns = pd.MultiIndex.from_arrays([list('uuull'), ['one', 'two', 'three', 'four', 'five']])
        df.loc[:,('l','five')] = df.loc[:,('l','five')].apply(lambda x: x.lower())
        df
    Out[5]:
            u              l
          one two three four five
        a   A   B     C    D    e
        b   F   G     H    I    j
        c   K   L     M    N    o

        3 rows × 5 columns
But I would like to understand why this behavior happens when doing dict-like slicing (e.g. df['l']['five']) and not when using .loc slicing.
Note: the DataFrame comes from an HDF file that was not multi-indexed. Is that perhaps the cause of the strange behavior?
Edit: I'm using pandas v.0.13.1 and numpy v.1.8.0.
df['l']['five'] selects the level 0 value 'l' and returns a DataFrame, from which the column 'five' is then selected, returning the accessed Series.
The __getitem__ accessor of a DataFrame (the []) tries to do the right thing and gives you the correct column. However, this is chained indexing (see the pandas documentation on returning a view versus a copy).
To access a MultiIndex, use tuple notation, ('a','b'), together with .loc, which is unambiguous, e.g. df.loc[:,('a','b')]. Furthermore, this allows indexing on multiple axes at the same time (e.g. rows and columns).
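As a rough illustration of the tuple notation, here is a minimal sketch on a frame shaped like the one in the question. The uppercase cell values are an assumption made for the example, and the row labels 'a', 'b', 'c' follow the question's index.

```python
import numpy as np
import pandas as pd

# A frame modeled on the question: two column levels, three rows.
df = pd.DataFrame(
    np.array(list("ABCDEFGHIJKLMNO"), dtype=object).reshape(3, 5),
    index=list("abc"),
    columns=pd.MultiIndex.from_arrays(
        [list("uuull"), ["one", "two", "three", "four", "five"]]
    ),
)

# Columns only: the tuple names the level 0 label, then the level 1 label.
col = df.loc[:, ("l", "five")]

# Rows and columns in the same call.
cell = df.loc["b", ("l", "five")]
first_two = df.loc["a":"b", ("l", "five")]  # label slices include both ends
```

Each of these is one indexing call on df, so there is no intermediate object to worry about.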
So, why does this not work when you do chained indexing and assignment, e.g. df['l']['five'] = value?
df['l'] returns a DataFrame that is singly-indexed. Then another Python operation, df_with_l['five'], selects the Series indexed by 'five' (df_with_l stands for that intermediate frame, as if it were held in another variable). Because pandas sees these operations as separate events (i.e. separate calls to __getitem__), it has to treat them as linear operations that happen one after another.
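The two separate steps can be spelled out explicitly. This is only a sketch on a frame modeled on the question's (uppercase cell values are assumed), with df_with_l as the illustrative name for the intermediate frame.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    np.array(list("ABCDEFGHIJKLMNO"), dtype=object).reshape(3, 5),
    index=list("abc"),
    columns=pd.MultiIndex.from_arrays(
        [list("uuull"), ["one", "two", "three", "four", "five"]]
    ),
)

# Step 1: the first [] returns a frame whose columns are a plain Index.
df_with_l = df["l"]

# Step 2: the second [] runs on that intermediate frame, not on df.
series = df_with_l["five"]

# df['l']['five'] is the same pair of calls written without the variable.
chained = df.__getitem__("l").__getitem__("five")
```

By the time the second call runs, it only sees df_with_l, so nothing ties it back to df.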
Contrast this with df.loc[:,('l','five')], which passes a nested tuple of (:,('l','five')) to a single call to __getitem__. This allows pandas to deal with it as a single entity (and, FYI, it can be quite a bit faster because it can index directly into the frame).
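To see what Python hands to __getitem__ in each case, a tiny stand-in class can simply return the key it receives. KeyRecorder is a hypothetical name used only for this sketch. It is not part of pandas.

```python
class KeyRecorder:
    """Hypothetical helper: returns whatever key the [] operator passes in."""

    def __getitem__(self, key):
        return key


rec = KeyRecorder()

# One pair of brackets means one call, with the whole nested tuple as the key.
single_key = rec[:, ("l", "five")]

# Chained brackets mean two calls; the first one only ever sees 'l'.
first_key = rec["l"]
```

Here single_key is the tuple (slice(None, None, None), ('l', 'five')), while first_key is just 'l'. The second half of a chained lookup arrives in a later, unrelated call.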
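Why does this matter? Since chained indexing is 2 calls, it is possible that either call returns a copy of the data because of the way it is sliced. When you then set a value, you are actually setting it on a copy and not on the original frame. It is impossible for pandas to figure this out because these are 2 separate Python operations that are not connected.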
The SettingWithCopy warning is a 'heuristic' to detect this (meaning it tends to catch most cases but is only a lightweight check). Figuring this out for real is way more complicated.
The .loc operation is a single Python operation, so it can select a slice (which may still be a copy), but it allows pandas to assign that slice back into the frame after it is modified, thus setting the values as you would expect.
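The select, modify, assign-back sequence can be sketched as three explicit steps. This assumes a frame modeled on the question's, with uppercase cell values so that .lower() has a visible effect. The variable names key, piece and lowered are illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    np.array(list("ABCDEFGHIJKLMNO"), dtype=object).reshape(3, 5),
    index=list("abc"),
    columns=pd.MultiIndex.from_arrays(
        [list("uuull"), ["one", "two", "three", "four", "five"]]
    ),
)

# The same key object that df.loc[:, ('l', 'five')] receives.
key = (slice(None), ("l", "five"))

piece = df.loc[key]                          # one read on df
lowered = piece.apply(lambda x: x.lower())   # builds a new Series
df.loc[key] = lowered                        # one write, addressed to df itself
```

Because the write names df and the full column key in a single call, it does not matter whether piece was a view or a copy.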
The reason for having a warning is this. Sometimes when you slice an array you simply get a view back, which means you can set values on it with no problem. However, even a single-dtyped array can generate a copy if it is sliced in a particular way. A multi-dtyped DataFrame (meaning it has, say, float and object data) will almost always yield a copy. Whether a view is created depends on the memory layout of the array.
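The view-versus-copy distinction is easiest to see on a plain NumPy array. This sketch covers only the single-dtype case and uses an arbitrary integer array. It says nothing about how a particular pandas version lays out its data.

```python
import numpy as np

arr = np.arange(12).reshape(3, 4)

view = arr[:, 1]        # basic slicing: shares memory with arr
copy = arr[:, [1, 3]]   # indexing with a list: NumPy allocates new memory

view[0] = 100           # writes through to arr
copy[0, 0] = -1         # changes only the copy

shares_view = np.shares_memory(arr, view)
shares_copy = np.shares_memory(arr, copy)
```

After this runs, arr[0, 1] is 100, so the write through the view reached the original, while the -1 written into the copy is nowhere in arr.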
Note: this doesn't have anything to do with the source of the data.