Daily Note 10/7/2022

Exercise

Study

  1. [[Section 5 Pandas - 2022 Python for Machine Learning & Data Science Masterclass]]

  2. Conditional Filtering - Select rows based on a condition on a column - Columns are [[Feature]]s

  3. Conditions - df['column'] > 50 - returns a Series of Boolean values, one per instance (row); some filtering examples: df[df['total_bill']>30], df[df['sex'] == 'Male']

  4. Multiple Conditions - |, &, ~; some multiple-condition examples: df[(df['total_bill'] > 30) & (df['sex']=='Male')]; df[(df['total_bill'] > 30) & ~(df['sex']=='Male')]; df[(df['total_bill'] > 30) & (df['sex']!='Male')]; df[(df['day'] =='Sun') | (df['day']=='Sat')]; do not use the built-in and, or, not keywords - use the symbols (see the filtering sketch after this list).

  5. Conditional Operator isin() - whether each element in the DataFrame is contained in the given values - df[df['day'].isin(['Sat','Sun'])] - useful for checking membership against more than two values.

  6. Useful Methods

  7. Apply on Single Column - .apply(function) (just pass the function itself, instead of actually calling it); the function should return one single value (because it is applied to each row); e.g. df['last_four'] = df['CC Number'].apply(last_four); lambda example: df['total_bill'].apply(lambda bill: bill*0.18) - not everything can be converted to a lambda expression; how to use more than one input? See the next items.

  8. Apply on Multiple Columns

  9. The lambda approach - pass axis=1 so each row is handed to the function - e.g. df['Tip Quality'] = df[['total_bill','tip']].apply(lambda df: quality(df['total_bill'],df['tip']), axis=1)

  10. The vectorization approach - np.vectorize() - np.vectorize(quality)(df['total_bill'], df['tip']) - per the documentation it exists primarily for convenience rather than performance, but it lets a non-NumPy-aware Python function accept array inputs, and in practice it is often faster than .apply(..., axis=1) (see the apply/vectorize sketch after this list).

  11. Statistical Information and Sorting

  12. df.describe() - df.describe().transpose()

  13. Sort - df.sort_values('tip', ascending=False); df.sort_values(['tip','size']) (sort by more than one column); df['total_bill'].max() -> df['total_bill'].idxmax() (find the index of the max value); see the statistics/sorting sketch after this list.

  14. Correlation checks - df.corr(); df[['total_bill','tip']].corr();

  15. value_counts - df['sex'].value_counts(); .unique() (array of distinct values) or .nunique() (number of distinct values)

  16. Replace - df['Tip Quality'].replace(to_replace='Other',value='Ok'); df['sex'].replace(['Female','Male'], ['F', 'M']) (see the useful-methods sketch after this list)

  17. Map - mapping with a [[dict]]: my_map = {'Dinner':'D','Lunch':'L'} -> df['time'].map(my_map)

  18. Duplicates - df.duplicated(); df.drop_duplicates()

  19. Between - df['total_bill'].between(10,20,inclusive='both') (older pandas versions used inclusive=True) -> filtering with it: df[df['total_bill'].between(10,20,inclusive='both')]

  20. Multiple largest/smallest - equivalent to .sort_values() followed by .iloc[] -> df.nlargest(2, 'tip') or df.nsmallest(2, 'tip')

  21. Sample - df.sample(5); df.sample(frac=0.1) (grab a fraction of all data)

  22. Missing Data

  23. What Null/NA/NaN objects look like: pd.NA; np.nan; pd.NaT (for missing datetime-like data)

  24. Options for missing data (Ask Why!: [[Section 5 Pandas - 2022 Python for Machine Learning & Data Science Masterclass#^ed0667]]; see the missing-data sketch after this list):

  25. Keep

  26. Remove

  27. Dropping a row - makes sense when a lot of info is missing; often a good idea to calculate what percentage of the data is being dropped

  28. Dropping a feature - good choice if (almost) every row is missing that particular feature

  29. Replace

  30. Fill with same value - Good choice if NaN was a placeholder

  31. Fill with interpolated or estimated value - Much harder and requires reasonable assumptions

  32. [[Supervised Machine Learning - W3 - Classification]]

  33. Optional lab: Sigmoid function and logistic regression

  34. Sigmoid function - implemented with np.exp(): sigmoid(z) is g = 1/(1+np.exp(-z)) (see the sigmoid/decision-boundary sketch after this list)

  35. Logistic Regression - logistic regression as a special case of the generalized linear model?

  36. [[Decision boundary]] - The threshold does not need to be 0.5

  37. Linear regression boundaries - why can a [[Linear regression]]-style model be used? -> Note that it computes the decision line $x_1 + x_2 = 3$, rather than a fitted line through the cluster.

  38. Non-linear decision boundaries

  39. Optional Lab: Decision boundary

  40. [[Cost function for logistic regression]]

  41. Squared error cost - using the squared error cost for logistic regression results in a [[non-convex function]], which has multiple local minima and is therefore hard for [[gradient descent]] ([[Pasted image 20220710145310.png]]).

  42. [[Logistic loss function]] - $-\log(f)$ when $y=1$ and $-\log(1-f)$ when $y=0$, where $0 \le f \le 1$ - Important to make sense of it!!!

  43. Optional Lab: Logistic Regression, Logistic Loss

  44. The simplified loss function: $$loss(f_{\mathbf{w},b}(\mathbf{x}^{(i)}), y^{(i)}) = -y^{(i)} \log\left(f_{\mathbf{w},b}\left( \mathbf{x}^{(i)} \right) \right) - \left( 1 - y^{(i)}\right) \log \left( 1 - f_{\mathbf{w},b}\left( \mathbf{x}^{(i)} \right) \right)$$

  45. $y$ has only two possible values (0 or 1), which is why the two-case loss collapses into the single expression above ⬆️

  46. Simplified Cost Function for Logistic Regression:

  47. ![[Pasted image 20220710162355.png]]

  48. The reason for choosing this as the cost function comes from statistics - [[maximum likelihood estimation]] (see the cost-function sketch after this list)

  49. Optional Lab: Cost Function for Logistic Regression:

  50. Gradient Descent Implementation - the update rule looks the same as the one used for the linear regression model - but why? Because the derivative works out to the same form, $\frac{\partial J}{\partial w_j} = \frac{1}{m}\sum_{i=1}^{m}(f_{\mathbf{w},b}(\mathbf{x}^{(i)})-y^{(i)})x_j^{(i)}$; only the definition of $f$ changes (the sigmoid of the linear function instead of the linear function itself).

  51. Optional Lab: Gradient Descent for Logistic Regression - be sure to go through the code several times until it is fully understood!!

  52. Calculating the Gradient, Code Description

  53. Gradient Descent Code (see the gradient-descent sketch after this list)
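
A minimal sketch of the conditional filtering in items 2-5, using a made-up toy DataFrame in place of the course's tips dataset:

```python
import pandas as pd

# Toy stand-in for the course's tips dataset (values are made up)
df = pd.DataFrame({
    'total_bill': [35.0, 12.5, 48.2, 22.0],
    'sex': ['Male', 'Female', 'Male', 'Female'],
    'day': ['Sun', 'Thur', 'Sat', 'Fri'],
})

# A comparison on a column returns a Boolean Series, one value per row
mask = df['total_bill'] > 30

# Use the mask to keep only the matching rows
print(df[mask])

# Combine conditions with & (and), | (or), ~ (not) -- note the parentheses
print(df[(df['total_bill'] > 30) & (df['sex'] == 'Male')])

# isin() checks membership against several values at once
print(df[df['day'].isin(['Sat', 'Sun'])])
```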
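
A sketch of items 7-10 (apply on one column, apply on several columns, np.vectorize). The data and the body of quality are assumptions, not the course's exact code:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'total_bill': [35.0, 12.5, 48.2],
    'tip': [6.0, 1.0, 10.0],
    'CC Number': ['3560325168603410', '4478071379779230', '6011812112971322'],
})

# Apply on a single column: pass the function object itself (no parentheses)
def last_four(num):
    return str(num)[-4:]

df['last_four'] = df['CC Number'].apply(last_four)

# The lambda form for simple one-liners
df['tip_guess'] = df['total_bill'].apply(lambda bill: bill * 0.18)

# Apply on multiple columns, lambda approach: axis=1 hands each row in as a Series
def quality(total_bill, tip):   # assumed body, for illustration only
    return 'Generous' if tip / total_bill > 0.15 else 'Other'

df['Tip Quality'] = df[['total_bill', 'tip']].apply(
    lambda row: quality(row['total_bill'], row['tip']), axis=1)

# The vectorization approach: same result, usually faster than axis=1
df['Tip Quality'] = np.vectorize(quality)(df['total_bill'], df['tip'])
print(df)
```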
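
A sketch of the statistics and sorting calls in items 12-15, on the same kind of toy data:

```python
import pandas as pd

df = pd.DataFrame({
    'total_bill': [35.0, 12.5, 48.2, 22.0],
    'tip': [6.0, 1.0, 10.0, 3.5],
    'size': [3, 1, 4, 2],
    'sex': ['Male', 'Female', 'Male', 'Female'],
    'day': ['Sun', 'Thur', 'Sat', 'Sun'],
})

print(df.describe().transpose())              # summary stats, one row per feature
print(df.sort_values('tip', ascending=False))
print(df.sort_values(['tip', 'size']))        # sort by more than one column
print(df['total_bill'].max())                 # the max value itself
print(df['total_bill'].idxmax())              # the index of the max value
print(df[['total_bill', 'tip']].corr())       # pairwise correlations
print(df['sex'].value_counts())
print(df['day'].unique(), df['day'].nunique())
```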
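
A sketch of the useful methods in items 16-21 (replace, map, duplicates, between, nlargest/nsmallest, sample), again on made-up data:

```python
import pandas as pd

df = pd.DataFrame({
    'total_bill': [35.0, 12.5, 48.2, 22.0, 15.5],
    'tip': [6.0, 1.0, 10.0, 3.5, 2.0],
    'sex': ['Male', 'Female', 'Male', 'Female', 'Male'],
    'time': ['Dinner', 'Lunch', 'Dinner', 'Dinner', 'Lunch'],
})

# replace: positional pairs map old values to new ones
df['sex'] = df['sex'].replace(['Female', 'Male'], ['F', 'M'])

# map: same idea, via a dict
my_map = {'Dinner': 'D', 'Lunch': 'L'}
df['time'] = df['time'].map(my_map)

print(df.duplicated())         # Boolean Series marking repeated rows
print(df.drop_duplicates())

# between: pandas >= 1.3 takes inclusive='both'; older versions took True
print(df[df['total_bill'].between(10, 20, inclusive='both')])

print(df.nlargest(2, 'tip'))   # shortcut for sort_values + iloc
print(df.nsmallest(2, 'tip'))

print(df.sample(2))            # n rows at random
print(df.sample(frac=0.4))     # a fraction of all rows
```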
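
A sketch of the missing-data options in items 22-31 (keep, remove, replace), on a hypothetical DataFrame with NaNs:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'first_name': ['Tom', np.nan, 'Hugh'],
    'age': [63.0, np.nan, 51.0],
    'salary': [52000.0, np.nan, np.nan],
})

# Keep: many pandas operations simply skip NaN
print(df['age'].mean())

# Remove: drop rows (or features) with missing values
print(df.dropna())            # drop any row containing NaN
print(df.dropna(thresh=2))    # keep rows with at least 2 non-null values
print(df.dropna(axis=1))      # drop columns instead of rows

# Replace: fill with a fixed value, or estimate from neighbors
df['salary'] = df['salary'].fillna(0)     # sensible if NaN was a placeholder
df['age'] = df['age'].interpolate()       # estimated value; needs assumptions
print(df)
```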
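
A sketch of the sigmoid function and the $x_1 + x_2 = 3$ decision boundary from items 34-37; the weights and bias here are hypothetical, chosen only to reproduce that boundary:

```python
import numpy as np

def sigmoid(z):
    """g(z) = 1 / (1 + e^(-z)); works on scalars and arrays."""
    return 1 / (1 + np.exp(-z))

# Hypothetical parameters: with these, the 0.5 threshold gives x1 + x2 = 3
w = np.array([1.0, 1.0])
b = -3.0

def predict(x):
    return sigmoid(np.dot(w, x) + b)

# f = 0.5 exactly when w.x + b = 0, i.e. on the line x1 + x2 = 3
print(predict(np.array([2.5, 2.5])))   # ~0.88 -> class 1
print(predict(np.array([1.0, 1.0])))   # ~0.27 -> class 0
```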
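
A sketch of the cost function from items 40-48, averaging the simplified loss over a toy dataset (the data values are made up):

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def compute_cost_logistic(X, y, w, b):
    """J(w,b) = (1/m) * sum( -y*log(f) - (1-y)*log(1-f) ) over all m examples."""
    m = X.shape[0]
    f = sigmoid(X @ w + b)                           # predictions, shape (m,)
    loss = -y * np.log(f) - (1 - y) * np.log(1 - f)  # simplified loss per example
    return loss.sum() / m

# Made-up data: two features, binary labels
X = np.array([[0.5, 1.5], [1.0, 1.0], [3.0, 0.5], [2.0, 2.0]])
y = np.array([0, 0, 1, 1])
print(compute_cost_logistic(X, y, w=np.array([1.0, 1.0]), b=-3.0))
```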
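
A sketch of the gradient and gradient-descent loop from items 50-53; note the gradient has the same algebraic form as linear regression's, with $f$ now the sigmoid of the linear function:

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def compute_gradient(X, y, w, b):
    """dJ/dw and dJ/db; same form as linear regression, different f."""
    m = X.shape[0]
    err = sigmoid(X @ w + b) - y       # (f - y), shape (m,)
    dj_dw = X.T @ err / m              # shape (n,)
    dj_db = err.sum() / m
    return dj_dw, dj_db

def gradient_descent(X, y, w, b, alpha, iters):
    for _ in range(iters):
        dj_dw, dj_db = compute_gradient(X, y, w, b)
        w = w - alpha * dj_dw          # simultaneous parameter update
        b = b - alpha * dj_db
    return w, b

# Made-up toy data: two features, binary labels
X = np.array([[0.5, 1.5], [1.0, 1.0], [3.0, 0.5], [2.0, 2.0]])
y = np.array([0, 0, 1, 1])
w, b = gradient_descent(X, y, w=np.zeros(2), b=0.0, alpha=0.1, iters=1000)
print(w, b)
```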

Notes

[[pandas]] documentation: https://pandas.pydata.org/pandas-docs/stable/reference/index.html

You cannot index into an integer in Python. Solution: cast it to a string with str() (sketch below).
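
A quick illustration:

```python
num = 123456789
# num[-4:]             # TypeError: 'int' object is not subscriptable
print(str(num)[-4:])   # '6789' - cast to a string first, then slice
```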

incentivize v. to motivate or encourage, especially with a reward; wiggly adj. moving with twisting or undulating motions; wavy

Mathematical Notations for Machine Learning (Markdown)
