Exercise
Study
[[Section 5 Pandas - 2022 Python for Machine Learning & Data Science Masterclass]]
Conditional Filtering - select rows based on a condition on a column - columns are [[Feature]]s
Conditions - `df['column'] > 50` returns a Series of Boolean values, one for every instance (row); some filtering examples: `df[df['total_bill'] > 30]`, `df[df['sex'] == 'Male']`
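A minimal sketch of boolean-mask filtering, assuming the tips dataset is loaded via seaborn (the course loads it from a CSV):

```python
import seaborn as sns

# Load the tips dataset used throughout this section
df = sns.load_dataset('tips')

# A comparison on a column yields a Boolean Series, one value per row
mask = df['total_bill'] > 30

# Indexing the DataFrame with the mask keeps only the True rows
print(df[mask])
```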
Multiple Conditions - `|` (or), `&` (and), `~` (not); some examples: `df[(df['total_bill'] > 30) & (df['sex'] == 'Male')]`; `df[(df['total_bill'] > 30) & ~(df['sex'] == 'Male')]`; `df[(df['total_bill'] > 30) & (df['sex'] != 'Male')]`; `df[(df['day'] == 'Sun') | (df['day'] == 'Sat')]`. Do not use the built-in `and`, `or`, and `not` keywords - use the symbols.
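A short sketch of combining masks (continuing with the tips DataFrame above; each mask needs its own parentheses):

```python
# Weekend rows: | is element-wise "or"
weekend = df[(df['day'] == 'Sun') | (df['day'] == 'Sat')]

# Big bills not paid by men: ~ negates a Boolean mask
big_not_male = df[(df['total_bill'] > 30) & ~(df['sex'] == 'Male')]
```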
Conditional Operator
`isin()` - whether each element in the DataFrame is contained in the given values - `df[df['day'].isin(['Sat','Sun'])]` - useful for comparing against more than two values.
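A one-line sketch of isin(), which replaces a chain of `==` comparisons joined with `|`:

```python
# Equivalent to (df['day'] == 'Sat') | (df['day'] == 'Sun')
weekend = df[df['day'].isin(['Sat', 'Sun'])]
```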
Useful Methods
Apply on a Single Column - `.apply(function)` (pass the function itself, instead of actually calling it); the function should return a single value, because it will be applied to each row; e.g. `df['last_four'] = df['CC Number'].apply(last_four)`; lambda example: `df['total_bill'].apply(lambda bill: bill*0.18)`. Not everything can be converted to a lambda expression - how to use more than one input?
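A minimal sketch of single-column apply; the body of `last_four` is an assumption, and the `CC Number` column exists only in the course's tips.csv, not in the seaborn version:

```python
# Hypothetical helper matching the course example: last four CC digits
def last_four(num):
    return str(num)[-4:]

# Pass the function itself; pandas calls it once per element
df['last_four'] = df['CC Number'].apply(last_four)

# Same idea as a lambda: an 18% tip for every bill
df['tip_18'] = df['total_bill'].apply(lambda bill: bill * 0.18)
```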
Apply on Multiple Columns
The lambda approach - e.g. `df['Tip Quality'] = df[['total_bill','tip']].apply(lambda df: quality(df['total_bill'],df['tip']), axis=1)`
The vectorization approach - `np.vectorize()` - `np.vectorize(quality)(df['total_bill'], df['tip'])` - per the documentation, `np.vectorize` is provided primarily for convenience rather than performance (it is essentially a for loop), but it turns a non-NumPy-aware Python function into one that can be called element-wise on whole arrays, which is typically much faster here than row-wise `.apply(..., axis=1)`.
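A sketch comparing the two approaches; the body of `quality` is an assumption (the course defines its own rating function):

```python
import numpy as np

# Hypothetical scalar rating function of two arguments
def quality(total_bill, tip):
    return 'Generous' if tip / total_bill > 0.25 else 'Other'

# Row-wise apply: axis=1 hands the lambda one row at a time
df['Tip Quality'] = df[['total_bill', 'tip']].apply(
    lambda row: quality(row['total_bill'], row['tip']), axis=1)

# np.vectorize: wrap once, then call directly on the columns
df['Tip Quality'] = np.vectorize(quality)(df['total_bill'], df['tip'])
```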
Statistical Information and Sorting
`df.describe()` - `df.describe().transpose()`
Sort - `df.sort_values('tip', ascending=False)`; `df.sort_values(['tip','size'])` (sort by more than one column); `df['total_bill'].max()` -> `df['total_bill'].idxmax()` (find the index of the max value)
Correlation checks - `df.corr()`; `df[['total_bill','tip']].corr()`
`value_counts` - `df['sex'].value_counts()`; `.unique()` or `.nunique()` (number of unique elements)
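A short sketch of the summary and sorting calls above:

```python
# Transposing describe() makes a wide summary easier to read
print(df.describe().transpose())

# Sort descending by tip; idxmax() gives the row label of the max
df.sort_values('tip', ascending=False)
print(df.loc[df['total_bill'].idxmax()])

# Pairwise correlation and categorical frequency counts
print(df[['total_bill', 'tip']].corr())
print(df['sex'].value_counts())
```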
Replace - `df['Tip Quality'].replace(to_replace='Other', value='Ok')`; `df['sex'].replace(['Female','Male'], ['F','M'])`
Map - mapping with a [[dict]]: `my_map = {'Dinner':'D','Lunch':'L'}` -> `df['time'].map(my_map)`
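A minimal sketch of replace vs. map (assuming these columns hold plain strings, as in the course CSV):

```python
# replace() swaps the listed values and leaves everything else alone
df['sex'] = df['sex'].replace(['Female', 'Male'], ['F', 'M'])

# map() uses a dict; values missing from the dict become NaN
my_map = {'Dinner': 'D', 'Lunch': 'L'}
df['time'] = df['time'].map(my_map)
```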
Duplicates - `df.duplicated()`; `df.drop_duplicates()`
Between - `df['total_bill'].between(10, 20, inclusive=True)` -> filtering with it: `df[df['total_bill'].between(10, 20, inclusive=True)]` (newer pandas versions take a string such as `inclusive='both'` instead)
Multiple largest/smallest - equivalent to `.sort_values()` plus `.iloc[]` -> `df.nlargest(2, 'tip')` or `df.nsmallest()`
Sample - `df.sample(5)`; `df.sample(frac=0.1)` (grab a fraction of all the data)
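A short sketch of these convenience methods (using the newer `inclusive='both'` spelling):

```python
# Flag duplicates, then drop exact duplicate rows
print(df.duplicated().sum())
deduped = df.drop_duplicates()

# Keep bills between 10 and 20, endpoints included
mid_bills = df[df['total_bill'].between(10, 20, inclusive='both')]

# Two largest tips, without a manual sort + slice
top_tips = df.nlargest(2, 'tip')

# Random 10% sample of the rows
subset = df.sample(frac=0.1)
```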
Missing Data
What Null/NA/NaN objects look like: `pd.NA`; `np.nan`; `pd.NaT` (for missing datetime-like data)
Options for missing data (Ask Why!: [[Section 5 Pandas - 2022 Python for Machine Learning & Data Science Masterclass#^ed0667]]) - see the sketch after this list:
Keep
Remove
Dropping a row - makes sense when a lot of info is missing; often a good idea to calculate the percentage of data being dropped
Dropping a feature - good choice if (almost) every row is missing that particular feature
Replace
Fill with same value - Good choice if NaN was a placeholder
Fill with interpolated or estimated value - Much harder and requires reasonable assumptions
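A minimal sketch of the keep/remove/replace options on a toy DataFrame (column names are hypothetical):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'age': [25, np.nan, 40, 33],
                   'salary': [50000, 60000, np.nan, np.nan]})

# Remove rows: dropna(), reporting what fraction of the data was lost
dropped = df.dropna()
print(f"dropped {1 - len(dropped) / len(df):.0%} of rows")

# Remove a feature instead, if almost every row is missing it
df_no_salary = df.drop(columns=['salary'])

# Replace: fill with a constant placeholder, or interpolate an estimate
filled = df['salary'].fillna(0)
estimated = df['age'].interpolate()
```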
[[Supervised Machine Learning - W3 - Classification]]
Optional lab: Sigmoid function and logistic regression
Sigmoid function: `np.exp()`; the sigmoid function (`sigmoid(z)`): `g = 1/(1+np.exp(-z))`
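A minimal sketch of the lab's sigmoid, vectorized over a NumPy array:

```python
import numpy as np

def sigmoid(z):
    # g(z) = 1 / (1 + e^(-z)), applied element-wise
    return 1 / (1 + np.exp(-z))

print(sigmoid(np.array([-10.0, 0.0, 10.0])))  # ~[0, 0.5, 1]
```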
Logistic Regression - logistic regression as a special case of generalized linear models?
[[Decision boundary]] - The threshold does not need to be 0.5
Linear regression boundaries - why can a [[Linear regression]]-style model be used here? -> Because it computes the $x_1 + x_2 = 3$ line as a decision boundary, instead of a line fit through the cluster of points.
Non-linear decision boundaries
Optional Lab: Decision boundary
[[Cost function for logistic regression]]
Squared error cost - using the squared error cost for logistic regression results in a [[non-convex function]], which has multiple local minima and is thus hard for [[gradient descent]] ([[Pasted image 20220710145310.png]]).
[[Logistic loss function]] - $-\log(f)$ for $y=1$ and $-\log(1-f)$ for $y=0$, where $0 \le f \le 1$ - Important to make sense of it!!!
Optional Lab: Logistic Regression, Logistic Loss
The simplified loss function: $$loss(f_{\mathbf{w},b}(\mathbf{x}^{(i)}), y^{(i)}) = -y^{(i)} \log\left(f_{\mathbf{w},b}\left( \mathbf{x}^{(i)} \right) \right) - \left( 1 - y^{(i)}\right) \log \left( 1 - f_{\mathbf{w},b}\left( \mathbf{x}^{(i)} \right) \right)$$
$y$ only has two possible values ⬆️
Simplified Cost Function for Logistic Regression:
![[Pasted image 20220710162355.png]]
The reason for choosing this as the cost function comes from statistics - [[maximum likelihood estimation]]
Optional Lab: Cost Function for Logistic Regression:
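A minimal sketch of the cost from this lab, averaging the simplified loss over all $m$ examples (shapes assumed: X is (m, n), y and w are 1-D):

```python
import numpy as np

def compute_cost_logistic(X, y, w, b):
    # f = sigmoid(X·w + b), the model's prediction for every example
    f = 1 / (1 + np.exp(-(X @ w + b)))
    # Average of -y*log(f) - (1-y)*log(1-f) over the m examples
    return np.mean(-y * np.log(f) - (1 - y) * np.log(1 - f))
```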
Gradient Descent Implementation - the update rule looks like the same expression used for the linear regression model - but why? (The partial derivatives turn out to have the same algebraic form; the difference is that $f_{\mathbf{w},b}$ is now the sigmoid of $\mathbf{w} \cdot \mathbf{x} + b$ rather than a linear function.)
Optional Lab: Gradient Descent for Logistic Regression - be sure to go through the code several times and make sure you really understand it!!
Calculating the Gradient, Code Description
Gradient Descent Code
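A minimal sketch of the gradient and descent loop under the definitions above; the learning rate and iteration count are assumptions:

```python
import numpy as np

def compute_gradient_logistic(X, y, w, b):
    # dj_dw = (1/m) Σ (f - y) x ;  dj_db = (1/m) Σ (f - y)
    m = X.shape[0]
    f = 1 / (1 + np.exp(-(X @ w + b)))
    err = f - y
    return (X.T @ err) / m, np.sum(err) / m

def gradient_descent(X, y, w, b, alpha=0.1, num_iters=1000):
    # Update w and b simultaneously each iteration
    for _ in range(num_iters):
        dj_dw, dj_db = compute_gradient_logistic(X, y, w, b)
        w = w - alpha * dj_dw
        b = b - alpha * dj_db
    return w, b
```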
Notes
[[pandas]] documentation: https://pandas.pydata.org/pandas-docs/stable/reference/index.html
You cannot index into an integer in Python (an `int` is not subscriptable).
Solution: cast it to a string first with `str()`
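For example (the number here is arbitrary):

```python
num = 3560325168603410
# num[-4:] would raise TypeError: 'int' object is not subscriptable
print(str(num)[-4:])  # '3410'
```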
incentivize v. to motivate or encourage with rewards; wiggly adj. twisting and turning; swaying from side to side; wavy; wriggling
Mathematical Notations for Machine Learning (Markdown)