Machine Learning Algorithm Cheat Sheet
This is a cheat sheet for various machine learning algorithms! Looking these up one by one is a hassle, so I've gathered the material from around the internet.
Cheat Sheet from *Machine Learning For Dummies*
| Algorithm | Best at | Pros | Cons |
|---|---|---|---|
| Random Forest | Apt at almost any machine learning problem<br>Bioinformatics | Can work in parallel<br>Seldom overfits<br>Automatically handles missing values<br>No need to transform any variable<br>No need to tweak parameters<br>Can be used by almost anyone with excellent results | Difficult to interpret<br>Weaker on regression when estimating values at the extremities of the distribution of response values<br>Biased in multiclass problems toward more frequent classes |
| Gradient Boosting | Apt at almost any machine learning problem<br>Search engines (solving the problem of learning to rank) | Can approximate most nonlinear functions<br>Best-in-class predictor<br>Automatically handles missing values<br>No need to transform any variable | Can overfit if run for too many iterations<br>Sensitive to noisy data and outliers<br>Doesn't work well without parameter tuning |
| Linear regression | Baseline predictions<br>Econometric predictions<br>Modelling marketing responses | Simple to understand and explain<br>Seldom overfits<br>L1 & L2 regularization are effective for feature selection<br>Fast to train<br>Easy to train on big data thanks to its stochastic version | You have to work hard to make it fit nonlinear functions<br>Can suffer from outliers |
| Support Vector Machines | Character recognition<br>Image recognition<br>Text classification | Automatic nonlinear feature creation<br>Can approximate complex nonlinear functions<br>Works only with a portion of the examples (the support vectors) | Difficult to interpret when applying nonlinear kernels<br>Suffers from too many examples; beyond 10,000 examples it starts taking too long to train |
| K-nearest Neighbors | Computer vision<br>Multilabel tagging<br>Recommender systems<br>Spell-checking problems | Fast, lazy training<br>Can naturally handle extreme multiclass problems (like tagging text) | Slow and cumbersome in the prediction phase<br>Can fail to predict correctly due to the curse of dimensionality |
| AdaBoost | Face detection | Automatically handles missing values<br>No need to transform any variable<br>Doesn't overfit easily<br>Few parameters to tweak<br>Can leverage many different weak learners | Sensitive to noisy data and outliers<br>Never the best-in-class predictor |
| Naive Bayes | Face recognition<br>Sentiment analysis<br>Spam detection<br>Text classification | Easy and fast to implement; doesn't require much memory and can be used for online learning<br>Easy to understand<br>Takes prior knowledge into account | Strong and unrealistic feature-independence assumptions<br>Fails at estimating rare occurrences<br>Suffers from irrelevant features |
| Neural Networks | Image recognition<br>Language recognition and translation<br>Speech recognition<br>Vision recognition | Can approximate any nonlinear function<br>Robust to outliers | Very difficult to set up<br>Difficult to tune because of too many parameters, and you also have to decide the architecture of the network<br>Difficult to interpret<br>Easy to overfit |
| Logistic regression | Ordering results by probability<br>Modelling marketing responses | Simple to understand and explain<br>Seldom overfits<br>L1 & L2 regularization are effective for feature selection<br>The best algorithm for predicting probabilities of an event<br>Fast to train<br>Easy to train on big data thanks to its stochastic version | You have to work hard to make it fit nonlinear functions<br>Can suffer from outliers |
| SVD | Recommender systems | Can restructure data in a meaningful way | Difficult to understand why data has been restructured in a certain way |
| PCA | Removing collinearity<br>Reducing the dimensionality of the dataset | Can reduce data dimensionality | Implies strong linear assumptions (components are weighted sums of features) |
| K-means | Segmentation | Fast at finding clusters<br>Can detect outliers in multiple dimensions | Suffers from multicollinearity<br>Clusters are spherical; can't detect groups of other shapes<br>Unstable solutions that depend on initialization |
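Most of the algorithms in the table are available off the shelf in scikit-learn, so the trade-offs above are easy to probe yourself. Below is a minimal sketch (assuming scikit-learn and its built-in breast-cancer dataset, neither of which appears in the original cheat sheet) that cross-validates several of the listed classifiers and then runs the PCA and K-means steps from the last rows:

```python
# A quick, hands-on check of the trade-offs in the table above.
# Assumes scikit-learn; the breast-cancer dataset is just a stand-in.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import (AdaBoostClassifier,
                              GradientBoostingClassifier,
                              RandomForestClassifier)
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)

models = {
    # Tree ensembles: no feature scaling or transformation needed.
    "Random Forest": RandomForestClassifier(random_state=0),
    "Gradient Boosting": GradientBoostingClassifier(random_state=0),
    "AdaBoost": AdaBoostClassifier(random_state=0),
    "Naive Bayes": GaussianNB(),
    # Scale-sensitive models get a StandardScaler in front.
    "Logistic Regression": make_pipeline(
        StandardScaler(), LogisticRegression(max_iter=1000)),
    "SVM (RBF kernel)": make_pipeline(StandardScaler(), SVC()),
    "k-NN": make_pipeline(StandardScaler(), KNeighborsClassifier()),
}

for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=5)  # 5-fold accuracy
    print(f"{name:20s} {scores.mean():.3f} +/- {scores.std():.3f}")

# PCA then K-means, as in the last rows of the table:
# project onto 2 components, then look for 2 spherical clusters.
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

X2 = PCA(n_components=2).fit_transform(StandardScaler().fit_transform(X))
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X2)
print("K-means cluster sizes:", [int((labels == k).sum()) for k in (0, 1)])
```

Note the pipelines around the scale-sensitive models: as the table hints, tree ensembles need no variable transformation, while SVM, k-NN, and logistic regression all benefit from standardized features.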
Microsoft Azure Machine Learning Cheat Sheet