Exercise
Study
[[Section 5 Pandas - 2022 Python for Machine Learning & Data Science Masterclass]]
Conditional Filtering - select rows based on a condition on a column - columns are [[Feature]]s
Conditions - `df['column'] > 50` returns a Series of Boolean values, one for every instance (row); some filtering examples: `df[df['total_bill'] > 30]`, `df[df['sex'] == 'Male']`
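A minimal sketch of boolean-mask filtering, assuming the tips dataset is loaded via seaborn (the course loads it from a CSV):

```python
import seaborn as sns

# Load the tips dataset used throughout this section
df = sns.load_dataset('tips')

# A comparison on a column yields a Boolean Series, one value per row
mask = df['total_bill'] > 30

# Indexing the DataFrame with the mask keeps only the True rows
print(df[mask])
```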
Multiple Conditions - `|` (or), `&` (and), `~` (not); some examples: `df[(df['total_bill'] > 30) & (df['sex'] == 'Male')]`; `df[(df['total_bill'] > 30) & ~(df['sex'] == 'Male')]`; `df[(df['total_bill'] > 30) & (df['sex'] != 'Male')]`; `df[(df['day'] == 'Sun') | (df['day'] == 'Sat')]`. Do not use the built-in `and`, `or`, and `not` keywords - use the symbols.
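A short sketch of combining masks (continuing with the tips DataFrame above; each mask needs its own parentheses):

```python
# Weekend rows: | is element-wise "or"
weekend = df[(df['day'] == 'Sun') | (df['day'] == 'Sat')]

# Big bills not paid by men: ~ negates a Boolean mask
big_not_male = df[(df['total_bill'] > 30) & ~(df['sex'] == 'Male')]
```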
Conditional Operator
`isin()` - whether each element in the DataFrame is contained in the given values - `df[df['day'].isin(['Sat','Sun'])]` - useful for comparing against more than two values.
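A one-line sketch of isin(), which replaces a chain of `==` comparisons joined with `|`:

```python
# Equivalent to (df['day'] == 'Sat') | (df['day'] == 'Sun')
weekend = df[df['day'].isin(['Sat', 'Sun'])]
```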
Useful Methods
Apply on a Single Column - `.apply(function)` (pass the function itself, instead of actually calling it); the function should return a single value, because it will be applied to each row; e.g. `df['last_four'] = df['CC Number'].apply(last_four)`; lambda example: `df['total_bill'].apply(lambda bill: bill*0.18)`. Not everything can be converted to a lambda expression - how to use more than one input?
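A minimal sketch of single-column apply; the body of `last_four` is an assumption, and the `CC Number` column exists only in the course's tips.csv, not in the seaborn version:

```python
# Hypothetical helper matching the course example: last four CC digits
def last_four(num):
    return str(num)[-4:]

# Pass the function itself; pandas calls it once per element
df['last_four'] = df['CC Number'].apply(last_four)

# Same idea as a lambda: an 18% tip for every bill
df['tip_18'] = df['total_bill'].apply(lambda bill: bill * 0.18)
```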
Apply on Multiple Columns
The lambda approach - e.g. `df['Tip Quality'] = df[['total_bill','tip']].apply(lambda df: quality(df['total_bill'],df['tip']), axis=1)`
The vectorization approach - `np.vectorize()` - `np.vectorize(quality)(df['total_bill'], df['tip'])` - per the documentation, `np.vectorize` is provided primarily for convenience rather than performance (it is essentially a for loop), but it turns a non-NumPy-aware Python function into one that can be called element-wise on whole arrays, which is typically much faster here than row-wise `.apply(..., axis=1)`.
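A sketch comparing the two approaches; the body of `quality` is an assumption (the course defines its own rating function):

```python
import numpy as np

# Hypothetical scalar rating function of two arguments
def quality(total_bill, tip):
    return 'Generous' if tip / total_bill > 0.25 else 'Other'

# Row-wise apply: axis=1 hands the lambda one row at a time
df['Tip Quality'] = df[['total_bill', 'tip']].apply(
    lambda row: quality(row['total_bill'], row['tip']), axis=1)

# np.vectorize: wrap once, then call directly on the columns
df['Tip Quality'] = np.vectorize(quality)(df['total_bill'], df['tip'])
```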
Statistical Information and Sorting
`df.describe()` - `df.describe().transpose()`
Sort - `df.sort_values('tip', ascending=False)`; `df.sort_values(['tip','size'])` (sort by more than one column); `df['total_bill'].max()` -> `df['total_bill'].idxmax()` (find the index of the max value)
Correlation checks - `df.corr()`; `df[['total_bill','tip']].corr()`
`value_counts` - `df['sex'].value_counts()`; `.unique()` or `.nunique()` (number of unique elements)
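A short sketch of the summary and sorting calls above:

```python
# Transposing describe() makes a wide summary easier to read
print(df.describe().transpose())

# Sort descending by tip; idxmax() gives the row label of the max
df.sort_values('tip', ascending=False)
print(df.loc[df['total_bill'].idxmax()])

# Pairwise correlation and categorical frequency counts
print(df[['total_bill', 'tip']].corr())
print(df['sex'].value_counts())
```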
Replace - `df['Tip Quality'].replace(to_replace='Other', value='Ok')`; `df['sex'].replace(['Female','Male'], ['F','M'])`
Map - mapping with a [[dict]]: `my_map = {'Dinner':'D','Lunch':'L'}` -> `df['time'].map(my_map)`
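A minimal sketch of replace vs. map (assuming these columns hold plain strings, as in the course CSV):

```python
# replace() swaps the listed values and leaves everything else alone
df['sex'] = df['sex'].replace(['Female', 'Male'], ['F', 'M'])

# map() uses a dict; values missing from the dict become NaN
my_map = {'Dinner': 'D', 'Lunch': 'L'}
df['time'] = df['time'].map(my_map)
```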
Duplicates - `df.duplicated()`; `df.drop_duplicates()`
Between - `df['total_bill'].between(10, 20, inclusive=True)` -> filtering with it: `df[df['total_bill'].between(10, 20, inclusive=True)]` (newer pandas versions take a string such as `inclusive='both'` instead)
Multiple largest/smallest - equivalent to `.sort_values()` plus `.iloc[]` -> `df.nlargest(2, 'tip')` or `df.nsmallest()`
Sample - `df.sample(5)`; `df.sample(frac=0.1)` (grab a fraction of all the data)
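A short sketch of these convenience methods (using the newer `inclusive='both'` spelling):

```python
# Flag duplicates, then drop exact duplicate rows
print(df.duplicated().sum())
deduped = df.drop_duplicates()

# Keep bills between 10 and 20, endpoints included
mid_bills = df[df['total_bill'].between(10, 20, inclusive='both')]

# Two largest tips, without a manual sort + slice
top_tips = df.nlargest(2, 'tip')

# Random 10% sample of the rows
subset = df.sample(frac=0.1)
```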
Missing Data
What Null/NA/NaN objects look like: `pd.NA`; `np.nan`; `pd.NaT` (for missing datetime-like data)
Options for missing data (Ask Why!: [[Section 5 Pandas - 2022 Python for Machine Learning & Data Science Masterclass#^ed0667]]) - see the sketch after this list:
Keep
Remove
Dropping a row - makes sense when a lot of info is missing; often a good idea to calculate the percentage of data being dropped
Dropping a feature - good choice if (almost) every row is missing that particular feature
Replace
Fill with same value - Good choice if NaN was a placeholder
Fill with interpolated or estimated value - Much harder and requires reasonable assumptions
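A minimal sketch of the keep/remove/replace options on a toy DataFrame (column names are hypothetical):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'age': [25, np.nan, 40, 33],
                   'salary': [50000, 60000, np.nan, np.nan]})

# Remove rows: dropna(), reporting what fraction of the data was lost
dropped = df.dropna()
print(f"dropped {1 - len(dropped) / len(df):.0%} of rows")

# Remove a feature instead, if almost every row is missing it
df_no_salary = df.drop(columns=['salary'])

# Replace: fill with a constant placeholder, or interpolate an estimate
filled = df['salary'].fillna(0)
estimated = df['age'].interpolate()
```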
[[Supervised Machine Learning - W3 - Classification]]
Optional lab: Sigmoid function and logistic regression
Sigmoid function: `np.exp()`; the sigmoid function (`sigmoid(z)`): `g = 1/(1+np.exp(-z))`
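A minimal sketch of the lab's sigmoid, vectorized over a NumPy array:

```python
import numpy as np

def sigmoid(z):
    # g(z) = 1 / (1 + e^(-z)), applied element-wise
    return 1 / (1 + np.exp(-z))

print(sigmoid(np.array([-10.0, 0.0, 10.0])))  # ~[0, 0.5, 1]
```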
Logistic Regression - logistic regression as a special case of generalized linear models?
[[Decision boundary]] - The threshold does not need to be 0.5
Linear regression boundaries - why can a [[Linear regression]]-style model be used here? -> Because it computes the $x_1 + x_2 = 3$ line as a decision boundary, instead of a line fit through the cluster of points.
Non-linear decision boundaries
Optional Lab: Decision boundary
[[Cost function for logistic regression]]
Squared error cost - using the squared error cost for logistic regression results in a [[non-convex function]], which has multiple local minima and is thus hard for [[gradient descent]] ([[Pasted image 20220710145310.png]]).
[[Logistic loss function]] - $-\log(f)$ for $y=1$ and $-\log(1-f)$ for $y=0$, where $0 \le f \le 1$ - Important to make sense of it!!!
Optional Lab: Logistic Regression, Logistic Loss
The simplified loss function: $$loss(f_{\mathbf{w},b}(\mathbf{x}^{(i)}), y^{(i)}) = -y^{(i)} \log\left(f_{\mathbf{w},b}\left( \mathbf{x}^{(i)} \right) \right) - \left( 1 - y^{(i)}\right) \log \left( 1 - f_{\mathbf{w},b}\left( \mathbf{x}^{(i)} \right) \right)$$
$y$ only has two possible values ⬆️
Simplified Cost Function for Logistic Regression:
![[Pasted image 20220710162355.png]]
The reason for choosing this as the cost function comes from statistics - [[maximum likelihood estimation]]
Optional Lab: Cost Function for Logistic Regression:
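A minimal sketch of the cost from this lab, averaging the simplified loss over all $m$ examples (shapes assumed: X is (m, n), y and w are 1-D):

```python
import numpy as np

def compute_cost_logistic(X, y, w, b):
    # f = sigmoid(X·w + b), the model's prediction for every example
    f = 1 / (1 + np.exp(-(X @ w + b)))
    # Average of -y*log(f) - (1-y)*log(1-f) over the m examples
    return np.mean(-y * np.log(f) - (1 - y) * np.log(1 - f))
```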
Gradient Descent Implementation - the update rule looks like the same expression used for the linear regression model - but why? (The partial derivatives turn out to have the same algebraic form; the difference is that $f_{\mathbf{w},b}$ is now the sigmoid of $\mathbf{w} \cdot \mathbf{x} + b$ rather than a linear function.)
Optional Lab: Gradient Descent for Logistic Regression - be sure to go through the code several times and make sure you really understand it!!
Calculating the Gradient, Code Description
Gradient Descent Code
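A minimal sketch of the gradient and descent loop under the definitions above; the learning rate and iteration count are assumptions:

```python
import numpy as np

def compute_gradient_logistic(X, y, w, b):
    # dj_dw = (1/m) Σ (f - y) x ;  dj_db = (1/m) Σ (f - y)
    m = X.shape[0]
    f = 1 / (1 + np.exp(-(X @ w + b)))
    err = f - y
    return (X.T @ err) / m, np.sum(err) / m

def gradient_descent(X, y, w, b, alpha=0.1, num_iters=1000):
    # Update w and b simultaneously each iteration
    for _ in range(num_iters):
        dj_dw, dj_db = compute_gradient_logistic(X, y, w, b)
        w = w - alpha * dj_dw
        b = b - alpha * dj_db
    return w, b
```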
Notes
[[pandas]] documentation: https://pandas.pydata.org/pandas-docs/stable/reference/index.html
You cannot index into an integer in Python (an `int` is not subscriptable).
Solution: cast it to a string first with `str()`
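For example (the number here is arbitrary):

```python
num = 3560325168603410
# num[-4:] would raise TypeError: 'int' object is not subscriptable
print(str(num)[-4:])  # '3410'
```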
incentivize v. to motivate or encourage with rewards; wiggly adj. twisting and turning; swaying from side to side; wavy; wriggling
Mathematical Notations for Machine Learning (Markdown)