Machine Learning Algorithm Cheat Sheet

01 May 2018 in Data on Machine-Learning

A cheat sheet for various machine learning algorithms! Searching for these every time was a hassle, so I collected material that is available online.

Dummies material

Algorithm	Best at	Pros	Cons
Random Forest	Almost any machine learning problem; Bioinformatics	Can work in parallel; Seldom overfits; Handles missing values automatically (in implementations that support it); No need to transform any variable; Works well with little parameter tweaking; Gives strong results even for non-experts	Difficult to interpret; Weaker on regression when estimating values at the extremes of the response distribution; Biased toward more frequent classes in multiclass problems
Gradient Boosting	Almost any machine learning problem; Search engines (solving the learning-to-rank problem)	Can approximate most nonlinear functions; Often the best-in-class predictor; Handles missing values automatically (in implementations that support it); No need to transform any variable	Can overfit if run for too many iterations; Sensitive to noisy data and outliers; Doesn’t work well without parameter tuning
Linear regression	Baseline predictions; Econometric predictions; Modelling marketing responses	Simple to understand and explain; Seldom overfits; L1 and L2 regularization control overfitting, and L1 also performs feature selection; Fast to train; Easy to train on big data thanks to its stochastic version	Takes a lot of work to fit nonlinear functions; Can suffer from outliers
Support Vector Machines	Character recognition; Image recognition; Text classification	Automatic nonlinear feature creation; Can approximate complex nonlinear functions; The fitted model relies only on a portion of the examples (the support vectors)	Difficult to interpret when applying nonlinear kernels; Scales poorly with the number of examples: beyond roughly 10,000 examples training starts taking too long
K-nearest Neighbors	Computer vision; Multilabel tagging; Recommender systems; Spell checking problems	Fast, lazy training; Can naturally handle extreme multiclass problems (like tagging text)	Slow and cumbersome in the prediction phase; Can fail to predict correctly due to the curse of dimensionality
AdaBoost	Face detection	Handles missing values automatically (in implementations that support it); No need to transform any variable; Doesn’t overfit easily; Few parameters to tweak; Can leverage many different weak learners	Sensitive to noisy data and outliers; Rarely the best-in-class predictor
Naive Bayes	Face recognition; Sentiment analysis; Spam detection; Text classification	Easy and fast to implement; Doesn’t require much memory; Can be used for online learning; Easy to understand; Takes prior knowledge into account	Strong and unrealistic feature independence assumptions; Poor at estimating rare occurrences; Suffers from irrelevant features
Neural Networks	Image recognition; Language recognition and translation; Speech recognition; Vision recognition	Can approximate any nonlinear function; Robust to outliers	Very difficult to set up; Difficult to tune because of the many parameters and the need to choose the network architecture; Difficult to interpret; Easy to overfit
Logistic regression	Ordering results by probability; Modelling marketing responses	Simple to understand and explain; Seldom overfits; L1 and L2 regularization control overfitting, and L1 also performs feature selection; Well suited to predicting the probability of an event; Fast to train; Easy to train on big data thanks to its stochastic version	Takes a lot of work to fit nonlinear functions; Can suffer from outliers
SVD	Recommender systems	Can restructure data in a meaningful way	Difficult to understand why data has been restructured in a certain way
PCA	Removing collinearity; Reducing dimensions of the dataset	Can reduce data dimensionality	Implies strong linear assumptions (components are weighted sums of features)
K-means	Segmentation	Fast at finding clusters; Can detect outliers in multiple dimensions	Suffers from multicollinearity; Assumes spherical clusters, so it can’t detect groups of other shapes; Unstable solutions that depend on initialization
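
As a small illustration of the K-means entry above ("unstable solutions, depends on initialization"), here is a minimal sketch of Lloyd's algorithm in plain Python. It is not taken from the Dummies material: the function names `squared_distance` and `kmeans`, the four toy points, and the two starting positions are all assumptions made up for this example. The same data converges to two different answers depending only on where the centroids start.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, float]


def squared_distance(a: Point, b: Point) -> float:
    """Squared Euclidean distance between two 2-D points."""
    return (a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2


def kmeans(
    points: Sequence[Point],
    init_centroids: Sequence[Point],
    max_iter: int = 100,
) -> Tuple[List[Point], float]:
    """Run Lloyd's algorithm from explicit starting centroids.

    Returns the final centroids and the inertia (sum of squared
    distances from each point to its nearest centroid).
    """
    centroids = list(init_centroids)
    for _ in range(max_iter):
        # Assignment step: attach every point to its nearest centroid.
        clusters: List[List[Point]] = [[] for _ in centroids]
        for p in points:
            nearest = min(
                range(len(centroids)),
                key=lambda i: squared_distance(p, centroids[i]),
            )
            clusters[nearest].append(p)

        # Update step: move each centroid to the mean of its members.
        new_centroids: List[Point] = []
        for old, members in zip(centroids, clusters):
            if members:
                mean_x = sum(x for x, _ in members) / len(members)
                mean_y = sum(y for _, y in members) / len(members)
                new_centroids.append((mean_x, mean_y))
            else:
                # Keep an empty cluster's centroid where it was.
                new_centroids.append(old)

        if new_centroids == centroids:
            break  # Converged: nothing moved.
        centroids = new_centroids

    inertia = sum(min(squared_distance(p, c) for c in centroids) for p in points)
    return centroids, inertia


# Toy data (hypothetical): the corners of a wide 4 x 1 rectangle.
points = [(0.0, 0.0), (0.0, 1.0), (4.0, 0.0), (4.0, 1.0)]

# Start on the left and right edges: finds the natural left/right split.
good_centroids, good_inertia = kmeans(points, [(0.0, 0.0), (4.0, 0.0)])

# Start in the middle, stacked vertically: stuck in a top/bottom split.
bad_centroids, bad_inertia = kmeans(points, [(2.0, 0.0), (2.0, 1.0)])

print(good_centroids, good_inertia)  # [(0.0, 0.5), (4.0, 0.5)] 1.0
print(bad_centroids, bad_inertia)    # [(2.0, 0.0), (2.0, 1.0)] 16.0
```

Both runs have converged in the sense that no centroid moves any more, yet the second one is 16 times worse by inertia. This is why practical implementations usually restart from several initializations and keep the best result.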

Microsoft Azure Machine Learning material

Reference
