Analyze Data/Python Libraries

pandas-2. DataFrame

Naranjito 2021. 5. 25. 16:17

2. DataFrame

Two-dimensional, added columns to Series.

df=pd.DataFrame([[1,2,3],[4,5,6],[7,8,9]],index=['one','two','three'],columns=['a','b','c'])
df

>>>	a	b	c
one	1	2	3
two	4	5	6
three	7	8	9

df.values
>>>array([[1, 2, 3],
       [4, 5, 6],
       [7, 8, 9]])
       
df.index
>>>Index(['one', 'two', 'three'], dtype='object')

df.columns
>>>Index(['a', 'b', 'c'], dtype='object')

 

- Create DataFrame with list

data=[['1000', 'Steve', 90.72], 
    ['1001', 'James', 78.09], 
    ['1002', 'Doyeon', 98.43], 
    ['1003', 'Jane', 64.19], 
    ['1004', 'Pilwoong', 81.30],
    ['1005', 'Tony', 99.14]]
    
df=pd.DataFrame(data, index=['one','two','three','one','two','three'],columns=['a','b','c'])
df
>>>	a	b	c
one	1000	Steve	90.72
two	1001	James	78.09
three	1002	Doyeon	98.43
one	1003	Jane	64.19
two	1004	Pilwoong	81.30
three	1005	Tony	99.14

 

- Create DataFrame with dictionary

data={'a':['1000', '1001', '1002', '1003', '1004', '1005'],
      'b':['Steve', 'James', 'Doyeon', 'Jane', 'Pilwoong', 'Tony'],
      'c':[90.72, 78.09, 98.43, 64.19, 81.30, 99.14]}
df=pd.DataFrame(data,index=['one','two','three','one','two','three'])
df
>>>	a	b	c
one	1000	Steve	90.72
two	1001	James	78.09
three	1002	Doyeon	98.43
one	1003	Jane	64.19
two	1004	Pilwoong	81.30
three	1005	Tony	99.14
pd.DataFrame({
    'class':['a','b','c'],
    'val1':(np.random.rand(3)*100).astype(int),
    'val2':(np.random.rand(3)*100).astype(int),
    'val3':(np.random.rand(3)*100).astype(int)
})
>>>
class	val1	val2	val3
0	a	38	80	1
1	b	79	85	44
2	c	90	1	44

 

  • loc

Location, access a group of rows using with index name.

c2=pd.DataFrame(data,index=['A', 'B', 'C','A', 'B', 'C'])
c2
a	b	c
A	1000	Steve	90.72
B	1001	James	78.09
C	1002	Doyeon	98.43
A	1003	Jane	64.19
B	1004	Pilwoong	81.30
C	1005	Tony	99.14

c2.loc[['A','C']]
a	b	c
A	1000	Steve	90.72
A	1003	Jane	64.19
C	1002	Doyeon	98.43
C	1005	Tony	99.14

 

Set value. For example, 

df.loc[item1,item2]=10 : Find item1 and put 10 to item2 of item1.

df.loc[['viper', 'sidewinder'], ['shield']] = 50

>>>
df

            max_speed  shield
cobra               1       2
viper               4      50
sidewinder          7      50

 

df.loc[item1,item2]=100 : Find item1 and create item2, and then put 100 to item2.

df.loc[df.index, 'abc']=100

>>>

	    max_speed	shield	abc
cobra	      1	         2     100.0
viper	      4	         5     100.0
sidewinder    7	         8     100.0

 

  • iloc

Integer Location, access a group of rows using with index number.

df
>>>
	a	b	c	d
0	1	2	3	4
1	100	200	300	400
2	1000	2000	3000	4000

df.iloc[0,2]
>>>
3

df.iloc[[0,2]]
>>>
a	b	c	d
0	1	2	3	4
2	1000	2000	3000	4000

 

  • at

Access a single value for a row/column label pair.

c2.at['A','b']
A    Steve
A     Jane
Name: b, dtype: object

 

  • isin

DataFrame.isin() : Is () in DataFrame? Return dataframe if yes.

c2[c2['c'].isin([90.72])]
a	b	c
A	1000	Steve	90.72

 

  •  drop

Drop specified labels from rows or columns.

- axis=1 : Drop the column

- axis=0 : Drop the row

lst_df.drop(columns=['new'])

 

  • astype

DataFrame.astype() : Change the DataFrame type as ()

c2['c'].astype('int')
A    90
B    78
C    98
A    64
B    81
C    99
Name: c, dtype: int64

 

- .astype('category') : Data can be categorized

train=pd.read_csv("https://raw.githubusercontent.com/developer-sdk/kaggle-python-beginner/master/datas/kaggle-titanic/train.csv")
train['SexCate']=train['Sex'].astype('category')

 

  • cut

It used when you need to segment and sort data values into bins.

- labels : Specifies the labels for the returned bins. Must be the same length as the resulting bins.

pd.cut(train['Age'],3,labels=['young','mid','old'])
>>>
0      young
1        mid
2      young
3        mid
4        mid
       ...  
886      mid
887    young
888      NaN
889    young
890      mid
Name: Age, Length: 891, dtype: category
Categories (3, object): ['young' < 'mid' < 'old']

 

- bins : It values into discrete intervals. It should be one more than number of labels.

pd.cut(train['Fare'],bins=[-1, 0, 8, 15, 32, 100, 600],labels=[1, 2, 3, 4, 5, 6])
0      2
1      5
2      2
3      5
4      3
      ..
886    3
887    4
888    4
889    4
890    2
Name: Fare, Length: 891, dtype: category
Categories (6, int64): [1 < 2 < 3 < 4 < 5 < 6]

 

  • qcut

qcut(DataFrame, 10) : Discrete DataFrame to 10 as same size.

Quantile cut, it discretes same size.

pd.qcut(train['Fare'],10)
0        (-0.001, 7.55]
1      (39.688, 77.958]
2         (7.854, 8.05]
3      (39.688, 77.958]
4         (7.854, 8.05]
             ...       
886      (10.5, 14.454]
887      (27.0, 39.688]
888      (21.679, 27.0]
889      (27.0, 39.688]
890       (7.55, 7.854]
Name: Fare, Length: 891, dtype: category
Categories (10, interval[float64]): [(-0.001, 7.55] < (7.55, 7.854] < (7.854, 8.05] < (8.05, 10.5] ... (21.679, 27.0] < (27.0, 39.688] < (39.688, 77.958] < (77.958, 512.329]]

 

  • factorize

The values can be changed to integer.

df = pd.DataFrame({"A":["a", "b", "c", "a", "d"]})
df.A = df.A.astype('category')
codes, uniques = pd.factorize(df.A)
pd.factorize(df.A)
>>>
(array([0, 1, 2, 0, 3]),
 CategoricalIndex(['a', 'b', 'c', 'd'], categories=['a', 'b', 'c', 'd'], ordered=False, dtype='category'))

pd.factorize(df.A)[0]
>>>
array([0, 1, 2, 0, 3])

pd.factorize(df.A)[1]
>>>
CategoricalIndex(['a', 'b', 'c', 'd'], categories=['a', 'b', 'c', 'd'], ordered=False, dtype='category')

 

  • sort_index

- ascending=True : ascending order, False is descending order

 

  • sort_values

- by='column' : Align the values by column

- ascending : Sort ascending, derault True

df=pd.DataFrame({'col1':['A', 'A', 'B', np.nan, 'D', 'C'],'col2':['a', 'B', 'c', 'D', 'e', 'F'],'col3':[2, 1, 9, 8, 7, 4]})
df
>>>
	col1	col2	col3
0	A	a	2
1	A	B	1
2	B	c	9
3	NaN	D	8
4	D	e	7
5	C	F	4

df.sort_values(by=['col2','col3'])
>>>
col1	col2	col3
1	A	B	1
3	NaN	D	8
5	C	F	4
0	A	a	2
2	B	c	9
4	D	e	7
	total_bill	tip	sex	smoker	day	time	size
0	16.99	1.01	Female	No	Sun	Dinner	2
1	10.34	1.66	Male	No	Sun	Dinner	3
2	21.01	3.50	Male	No	Sun	Dinner	3

df.sort_values(by=['total_bill', 'tip'],ascending=[True, False])

###### Sort the data with multiple columns in different directions ######

 

  • iterrows

Iterate over DataFrame rows as index, Series pairs.

df = pd.DataFrame({
    '이름': ['조조', '조운', '주유'], 
    '통솔': [96, 91, 97], 
    '무력': [78, 97, 71], 
    '지력': [91, 28, 96], 
    '정치': [94, 78, 87], 
    '매력': [96, 81, 87]
})

>>>

	이름	통솔	무력	지력	정치	매력
0	조조	96	78	91	94	96
1	조운	91	97	28	78	81
2	주유	97	71	96	87	87

for index,row in df.iterrows():
    print(index)
>>>
0
1
2

for index,row in df.iterrows():
    print(row)
>>>
이름    조조
통솔    96
무력    78
지력    91
정치    94
매력    96
Name: 0, dtype: object
이름    조운
통솔    91
무력    97
지력    28
정치    78
매력    81
Name: 1, dtype: object
이름    주유
통솔    97
무력    71
지력    96
정치    87
매력    87
Name: 2, dtype: object

 

  • itertuple

Iterate over DataFrame rows as named tuples

for index in df.itertuples():
    print(index)

Pandas(Index=0, 이름='조조', 통솔=96, 무력=78, 지력=91, 정치=94, 매력=96)
Pandas(Index=1, 이름='조운', 통솔=91, 무력=97, 지력=28, 정치=78, 매력=81)
Pandas(Index=2, 이름='주유', 통솔=97, 무력=71, 지력=96, 정치=87, 매력=87)

 

  • groupby

Grouping the DataFrame by column.

df=pd.DataFrame({
    'class':[['A','B','C'][np.random.randint(0,3)] for i in range(0,10)],
    'year':[['2019','2020'][np.random.randint(0,2)] for i in range(0,10)],
    'month':[np.random.randint(5,7) for i in range(0,10)],
    'val1':[np.random.randint(10000,20000) for i in range(0,10)],
    'val2':[np.random.randint(100,300) for i in range(0,10)],
    'val3':[np.random.randint(1,11) for i in range(0,10)]    
    })
    
classgroup=df.groupby(['class','year']).count()
classgroup
			month	val1	val2	val3
class	year				
	A	2019	2	2	2	2
		2020	3	3	3	3
	B	2020	1	1	1	1
	C	2019	1	1	1	1
		2020	3	3	3	3
remainder=df.groupby(lambda x:x//2)
remainder.sum()

month	val1	val2	val3
0	11	34223	428	12
1	12	30678	368	9
2	12	27858	374	11
3	10	26259	280	16
4	11	27201	430	8

 

  • agg

Aggregate

classgroup.agg(['max','min'])

	month	val1	val2	val3
max	3	3	3	3
min	1	1	1	1
remainder.agg({'val1':'min','val2':'max','val3':['min','max']})

val1	val2	val3
min	max	min	max
0	17018	270	2	10
1	14071	222	4	5
2	12782	239	1	10
3	11640	149	8	8
4	10807	233	3	5

 

  • crosstab

Cross Tabulation, compute a simple cross tabulation of two or more factors.

- margins=True : Show subtotals.

pd.crosstab(train['Sex'],train['SexCate'],margins=True)

SexCate	female	male	All
Sex			
female	314	0	314
male	0	577	577
All	314	577	891

 

-values : Array of values to aggregate to the factors. Requires affgunc be specified.

- aggfunc : Aggregate Function

pd.crosstab(train['Sex'],train['Age'],values=train['Survived'], aggfunc=sum)

Age	young	mid	old
Sex			
female	126	95	12
male	56	49	4

 

  • concat

Concatenate

- ignore_index=True : Do not use the index values along the concatenation axis.

s1=pd.Series(['a','b'])
s2=pd.Series(['c','d'])
pd.concat([s1,s2])
>>>
0    a
1    b
0    c
1    d
dtype: object

s1=pd.Series(['a','b'])
s2=pd.Series(['c','d'])
pd.concat([s1,s2],ignore_index=True)
>>>
0    a
1    b
2    c
3    d
dtype: object

 

- join='inner' : Join the column under the same name, default is 'outer'

s1=pd.DataFrame({'col1':['a','b'],'col2':[1,2]})
s2=pd.DataFrame({'col2':['c','d'],'col3':[3,4]})
pd.concat([s1,s2],ignore_index=True)

>>>

      col1	 col2     col3
0	a	  1	   NaN
1	b	  2	   NaN
2	NaN	  c	   3.0
3	NaN	  d	   4.0

pd.concat([s1,s2],ignore_index=True, sort=False, join='inner')

>>>

      col2
0	1
1	2
2	c
3	d

 

  • style

https://colab.research.google.com/drive/1LpEo0f0WpIiaiFCA4lF7RuIfPzWhlQxO?authuser=1#scrollTo=MU2Ga1qMJzwb 

 

  • merge

- on : These must be found in both DataFrames and merging to the intersection of the columns in both DataFrames.

df1 = pd.DataFrame({'lkey': ['foo', 'bar', 'baz', 'foo'],
                    'value': [1, 2, 3, 5]})
df2 = pd.DataFrame({'rkey': ['foo', 'bar', 'baz', 'foo'],
                    'value': [5, 1, 7, 8]})
print(df1)
print('------')
print(df2)
>>>
  lkey  value
0  foo      1
1  bar      2
2  baz      3
3  foo      5
------
  rkey  value
0  foo      5
1  bar      1
2  baz      7
3  foo      8

combined=pd.merge(df1,df2,on='value')
combined
>>>
lkey	value	rkey
0	foo	1	bar
1	foo	5	foo

 

  • pivot_table

Create a spreadsheet-style pivot table as a DataFrame.

dataframe.pivot_table(values=, index=, columns=, fill_value=)

- values : value supposed to insert to value

- index, columns : column

- fill_value : scalar, value to replace missing values with.

- aggfunc : Aggregate Function, it aggregates values by using the function.

rating_crosstab=combined_movies_data.pivot_table(values='rating', index='user_id', columns='movie title', fill_value=0)
df=pd.DataFrame({'a':["foo", "foo", "foo", "foo", "foo"],'b':["one", "one", "one", "two", "two"],'c':["small", "large", "large", "small",
                         "small"],'d':[1, 2, 2, 3, 3]})
df
>>>
	a	b	c	d
0	foo	one	small	1
1	foo	one	large	2
2	foo	one	large	2
3	foo	two	small	3
4	foo	two	small	3

pd.pivot_table(df, values='d',index=['a','b'],columns=['c'],aggfunc=np.sum)
>>>
	c	large	small
a	b		
foo	one	4.0	1.0
two	NaN	6.0

 

  • get_loc

Get Location, slice or boolean mask for requested label.

a=pd.Index(list('abc'))
a.get_loc('b')
>>>
1

 

  • corr_mat

Correlation matrix, returns a correlation matrix of a dataframe. Several different correlation methods can be choosen and the matrix can be created for column or row relations.

 

  • sparse

You can view the object as being compressed where any data matching a specific value(NaN, missing value, though any value can be chosen, including 0) is omitted. The compressed values are not actually stored in the array.

arr=np.random.randn(10)
arr[2:-2]=np.nan
pd.Series(pd.arrays.SparseArray(arr))
>>>
0    0.256553
1    0.671573
2         NaN
3         NaN
4         NaN
5         NaN
6         NaN
7         NaN
8    0.582745
9    1.599741
dtype: Sparse[float64, nan]

 

  • from_spmatrix

Sparse Matrix

 

  • rename

Renaming columns of an existing DataFrame.

- inplace : If True, it will make changes on the original DataFrame.

data={'name':['John', 'Doe', 'Paul'],
      'age':[22, 31, 15]}
df=pd.DataFrame(data)
df.rename(columns={'name':'first name','age':'AGE'},inplace=True)
df
>>>
	first name	AGE
0	John	22
1	Doe	31
2	Paul	15

 

  • values.tolist

Convert DataFrame value to List.

Inst_Att
INST_NM	DRULE_ATT_TYPE_CODE1
0	대구대학교	Attack
1	경기대학교	Attack

df_list=Inst_Att.values.tolist()
df_list
>>>
[['대구대학교', 'Attack'],
 ['경기대학교', 'Attack']]

 

  • columns.tolist

Convert DataFrame columns to List.

0	1	2	3	4	mean
0	0.621096	0.904698	0.629040	0.109003	0.648523	0.325452


cols=df.columns.tolist()
cols
>>>
[0, 1, 2, 3, 4, 'mean']

 

  • drop_duplicates

To get unique values of column, it removes all duplicated values and returns a Pandas series.

 

  • explode

Transform each element of a list-like to a row.

df = pd.DataFrame({'A': [[0, 1, 2], 'foo', [], [3, 4]],
                   'B': 1,
                   'C': [['a', 'b', 'c'], np.nan, [], ['d', 'e']]})
                   
df.explode('A')
>>>
     A  B          C
0    0  1  [a, b, c]
0    1  1  [a, b, c]
0    2  1  [a, b, c]
1  foo  1        NaN
2  NaN  1         []
3    3  1     [d, e]
3    4  1     [d, e]

 

  • How to get column name
for col in df_100.columns:
  print(col)
  
>>>
TW_ATT_IP_SEARCH_DATA
TW_ATT_GEOLOCATION
ACCD_CHARGER_ID
...

 

  • How to remove nan from all columns
for column_name in df.columns:
    df[column_name]=df[column_name].replace(np.nan, '')

 

  • isna

Detect missing values.

NTM_df.isna().sum()

>>>
INST_NM                 0
DRULE_ATT_TYPE_CODE1    0
TW_ATT_IP               0
TW_ATT_PORT             0
TW_DMG_IP               0
TW_DMG_PORT             0
ACCD_DMG_PROTO_NM       1

 

  • mean

Return the mean of the values over the requested axis.

- mean(0) : index

- mean(1) : columns

df['mean']=df.mean(1)
df
>>>
0	1	2	3	4	mean
0	0.279847	0.956025	0.030422	0.994737	0.129731	0.478152
1	0.186938	0.276841	0.011527	0.062197	0.761385	0.259778

 

  • interpolate

Fill NaN values using an interpolation method.

-method='linear' : To guess the missing values and fill the data.

-limit_direction : Consecutive NaNs will be filled in this direction, default is ‘backward’. It indicates whether the missing values will be guessed using the previous values or latter values of the column.

https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.interpolate.html

interpolate_df=cleaned_df.interpolate(method='linear',limit_direction='forward')
print(interpolate_df.isnull().any())
>>>
Category         False
Item             False
Calories         False
Total Fat        False
...

 

  • query

Filter the data with easy way.

df.loc[(df['tip']>6) & (df['total_bill']>=30)]

>>>

	total_bill	tip	sex	smoker	day	time	size
23	39.42	7.58	Male	No	Sat	Dinner	4
59	48.27	6.73	Male	No	Sat	Dinner	4
141	34.30	6.70	Male	No	Thur	Lunch	6

###### Above is same as below ######

df.query("tip>6 & total_bill>=30")

 

  • nunique

Count number of unique elements.

d1=pd.DataFrame({'a':[4,5,6],'b':[4,1,1]})
d1.nunique()

>>>

a    3
b    2

 

  • size

Return the number after count.

d1=pd.DataFrame({'col1':[1,2],'col2':[2,3]})
d1.size

>>>
4

 

  • reset_index

Reset the index.

drop=True : Drop the index column, default is False.

ex_df=pd.DataFrame(np.arange(0,9).reshape(3,3)) 
ex_df

>>>

	0	1	2
0	0	1	2
1	3	4	5
2	6	7	8

ex_df.reset_index()

>>>

	index	0	1	2
0	    0	0	1	2
1	    1	3	4	5
2	    2	6	7	8

ex_df.reset_index(drop=True)

>>>

	    0	1	2
0	    0	1	2
1	    3	4	5
2	    6	7	8