Machine Learning
- Classification problem
- Features
- Labels
- Training set
- Term frequency representation - getting the frequency of each word in a sentence
- Output is a categorical value.
- Naive Bayes
- Sentiment Analysis
- Classification setup: take a comment and classify it as positive or negative
- P = Positive, 1-P = Negative
- Probability of a word being positive = frequency of the word in positive comments / frequency of the word across the entire corpus
- Drop connective words (stop words) when processing
- Then multiply the positive probabilities of the other words in the sentence, times the overall probability of a comment being positive, to get the overall score.
- Sentiment labelled sentences data set - IMDB, Yelp and Amazon reviews.
- Scikit-Learn module in Python
- Can be used to classify any number of categories
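A minimal sketch of this bag-of-words plus Naive Bayes setup in scikit-learn; the example sentences and labels below are made up for illustration:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Toy corpus: 1 = positive, 0 = negative (illustrative only)
sentences = ["great movie, loved it", "terrible plot and bad acting",
             "really enjoyable film", "boring and bad"]
labels = [1, 0, 1, 0]

# Term-frequency representation with connective (stop) words dropped
vectorizer = CountVectorizer(stop_words="english")
X = vectorizer.fit_transform(sentences)

# Naive Bayes combines per-word probabilities with the class prior
model = MultinomialNB()
model.fit(X, labels)

print(model.predict(vectorizer.transform(["loved the film"])))
```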
- SVM
- Ad detection - Internet advertisement dataset
- In its basic form handles only binary classification (multiclass needs one-vs-rest or one-vs-one schemes)
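A minimal binary-classification sketch with an SVM; the feature rows here are invented stand-ins for the ad-detection features:

```python
from sklearn.svm import SVC

# Hypothetical features (e.g. image width, height, aspect ratio); 1 = ad, 0 = not an ad
X = [[468, 60, 7.8], [120, 600, 0.2], [728, 90, 8.1], [300, 250, 1.2]]
y = [1, 0, 1, 0]

svm = SVC(kernel="linear")   # finds the widest-margin separating hyperplane
svm.fit(X, y)
print(svm.predict([[470, 58, 8.1]]))
```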
- Regression Problem
- Examples
- Finding stock price returns on a given day
- CAPM model/formula for finding future price
- Demand forecasting
- The mathematical function that maps the independent variables to the output may be linear, polynomial, or non-linear, which gives the regression its name accordingly.
- Used when we want to find the relation between 2 variables.
- Output is a continuous value
- Coefficients or weights define how much impact each independent variable has on the output.
- b0 + b1x1 + b2x2, where x1, x2 are the independent variables which influence the outcome.
- The weights b0, b1, b2 are assigned to the variables
- Stochastic gradient descent - can be used to find the point of least error.
- SGDRegressor, LinearRegression in Scikit learn (from sklearn.linear_model)
- import sklearn.linear_model
- model = sklearn.linear_model.LinearRegression()
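A short sketch of fitting b0 + b1x1 + b2x2 and reading the weights back out; the numbers are invented:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Two independent variables x1, x2 and a continuous output y (toy data)
X = np.array([[1, 2], [2, 1], [3, 4], [4, 3], [5, 5]])
y = np.array([8.1, 6.9, 15.2, 13.8, 20.1])

model = LinearRegression()
model.fit(X, y)

print(model.intercept_)        # b0
print(model.coef_)             # [b1, b2] - impact of each independent variable
print(model.predict([[6, 4]])) # prediction for a new point
```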
- Lasso (Least Absolute Shrinkage and Selection Operator) regression - adds a penalty term of alpha (a hyperparameter) times the L1 norm of the coefficients.
- Ridge regression - adds a penalty term of alpha (a hyperparameter) times the L2 norm of the coefficients.
- Alpha is the regularization parameter to avoid overfitting of the model.
- from sklearn.linear_model import Lasso
- Lasso(alpha=0.5) - change the value of alpha and try to reduce the root mean squared error. (The old normalize=True argument has been removed from recent scikit-learn; scale features with StandardScaler instead.)
- Lasso eliminates unnecessary variables while evaluating the model by setting their coefficients to zero.
- from sklearn.linear_model import Ridge
- Ridge(alpha=0.5)
- Performs similar to Lasso
- Hyperparameter tuning is basically changing the value of alpha appropriately to get the right model
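A sketch of that tuning loop: try several alpha values for Lasso and Ridge and compare the RMSE on held-out data. Features are scaled explicitly (recent scikit-learn dropped normalize=), and the generated dataset is just for illustration.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, Ridge
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic regression data standing in for a real dataset
X, y = make_regression(n_samples=200, n_features=10, noise=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

for alpha in [0.01, 0.1, 0.5, 1.0]:
    for Model in (Lasso, Ridge):
        pipe = make_pipeline(StandardScaler(), Model(alpha=alpha))
        pipe.fit(X_train, y_train)
        rmse = np.sqrt(mean_squared_error(y_test, pipe.predict(X_test)))
        print(Model.__name__, alpha, round(rmse, 2))
```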
- Support vector regression - hyperplane which separates data in n-dimensional space.
- SVM - find widest margin with most distance from nearest points
- import sklearn.neighbors
- model = sklearn.neighbors.KNeighborsRegressor(n_neighbors=3)
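Both regressors side by side on the same toy data (the points are made up):

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor
from sklearn.svm import SVR

X = np.arange(10).reshape(-1, 1)                       # single feature
y = 2.0 * X.ravel() + np.random.RandomState(0).normal(0, 0.5, 10)

svr = SVR(kernel="linear").fit(X, y)                   # margin-based fit
knn = KNeighborsRegressor(n_neighbors=3).fit(X, y)     # average of the 3 nearest points

print(svr.predict([[4.5]]), knn.predict([[4.5]]))
```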
- Collaborative filtering for recommendations - uses past data to identify recommendations.
- Latent Factor Analysis
- Alternating least squares is an optimization technique used to minimize the error in the recommendation problem.
- Regularization term = lambda
- Movielens dataset
- Python module 'implicit'
- implicit.alternating_least_squares
- import heapq
- heapq.nlargest
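The implicit package wraps all of this, but the alternating-least-squares idea can be sketched directly in NumPy: fix the item factors and solve for the user factors, then swap, repeating until the error stops shrinking; heapq.nlargest then picks the top recommendations. The matrix, factor count, and lambda below are all illustrative, and missing ratings are treated as zeros rather than masked as a real implementation would.

```python
import heapq
import numpy as np

R = np.array([[5, 3, 0], [4, 0, 0], [1, 1, 5]], dtype=float)  # toy user-item ratings
k, lam = 2, 0.1                                               # latent factors, regularization (lambda)
rng = np.random.RandomState(0)
U, V = rng.rand(R.shape[0], k), rng.rand(R.shape[1], k)

for _ in range(20):
    # Fix V and solve least squares for U, then fix U and solve for V
    U = np.linalg.solve(V.T @ V + lam * np.eye(k), V.T @ R.T).T
    V = np.linalg.solve(U.T @ U + lam * np.eye(k), U.T @ R).T

scores = U[1] @ V.T                    # predicted scores for user 1 over all items
print(heapq.nlargest(2, range(len(scores)), key=scores.__getitem__))  # top-2 item indices
```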
- Clustering - Unsupervised
- K-means
- Density based
- Distribution based
- Hierarchical Cluster analysis
- We don't use a training set - we apply the clustering algorithm directly. Since we don't train and test, this falls under unsupervised learning.
- Term frequency
- Term frequency - Inverse document frequency (TF-IDF) - commonly found words like 'this', 'the', etc. don't provide as much weightage as less common words that define the context of the sentence. A word's weight is inversely proportional to the number of documents the word is found in.
- Tuple of numbers are points in n-dimensional space.
- Initialize a set of points, one per category, as the k means.
- Assign each point to the cluster belonging to the nearest mean.
- Find the new means. Repeat until the means don't change.
- Sentiment labelled sentences data set or IMDB reviews
- from sklearn.feature_extraction.text import TfidfVectorizer
- from sklearn.cluster import KMeans
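A minimal sketch tying the two together: vectorize sentences with TF-IDF, then cluster them with k-means (the sentences are invented):

```python
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

sentences = ["the battery life is great", "battery drains too fast",
             "the screen is gorgeous", "love this bright screen"]

# TF-IDF down-weights common words and up-weights rarer, context-defining ones
X = TfidfVectorizer(stop_words="english").fit_transform(sentences)

# k-means: pick k centroids, assign each point to the nearest mean, recompute, repeat
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(kmeans.labels_)
```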
- Visualization and dimensionality reduction - Unsupervised
- Principal component analysis
- Kernel PCA
- Locally linear embedding
- A related task is dimensionality reduction, in which the goal is to simplify the data without losing too much information. One way to do this is to merge several correlated features into one. For example, a car's mileage may be very correlated with its age, so the dimensionality reduction algorithm will merge them into one feature that represents the car's wear and tear. This is called feature extraction.
- Some algorithms can deal with partially labeled training data, usually a lot of unlabeled data and a little bit of labeled data. This is called semi-supervised learning.
- Reinforcement Learning is a very different beast. The learning system, called an agent in this context, can observe the environment, select and perform actions, and get rewards in return (or penalties in the form of negative rewards, as in Figure 1-12). It must then learn by itself what is the best strategy, called a policy, to get the most reward over time. A policy defines what action the agent should choose when it is in a given situation.
- In batch learning, the system is incapable of learning incrementally: it must be trained using all the available data. This will generally take a lot of time and computing resources, so it is typically done offline. First the system is trained, and then it is launched into production and runs without learning anymore; it just applies what it has learned. This is called offline learning. If you want a batch learning system to know about new data (such as a new type of spam), you need to train a new version of the system from scratch on the full dataset.
- In online learning, you train the system incrementally by feeding it data instances sequentially, either individually or by small groups called mini-batches. Each learning step is fast and cheap, so the system can learn about new data on the fly. Online learning is great for systems that receive data as a continuous flow (e.g., stock prices) and need to adapt to change rapidly or autonomously. One important parameter of online learning systems is how fast they should adapt to changing data: this is called the learning rate. If you set a high learning rate, then your system will rapidly adapt to new data, but it will also tend to quickly forget the old data.
- It is crucial to use a training set that is representative of the cases you want to generalize to: if the sample is too small, you will have sampling noise, but even very large samples can be non-representative if the sampling method is flawed. This is called sampling bias.
- A critical part of the success of a Machine Learning project is coming up with a good set of features to train on. This process is called feature engineering and involves:
- Feature selection: selecting the most useful features to train on among existing features.
- Feature extraction: combining existing features to produce a more useful one.
- Creating new features by gathering new data.
- Overfitting happens when the model is too complex relative to the amount and noisiness of the training data. The possible solutions are:
- To simplify the model by selecting one with fewer parameters.
- To gather more training data
- To reduce the noise in the training data (e.g., fix data errors and remove outliers)
- Constraining a model to make it simpler and reduce the risk of overfitting is called regularization. The amount of regularization to apply during learning can be controlled by a hyperparameter. A hyperparameter is a parameter of a learning algorithm.
- Underfitting is the opposite of overfitting: it occurs when your model is too simple to learn the underlying structure of the data. For example, a linear model of life satisfaction is prone to underfit; reality is just more complex than the model, so its predictions are bound to be inaccurate, even on the training examples.
- The error rate on new cases is called the generalization error (or out-of-sample error), and by evaluating your model on the test set, you get an estimate of this error. This value tells you how well your model will perform on instances it has never seen before.
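A small sketch of estimating the generalization error with a held-out test set; the dataset and model are placeholders:

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
# Score on data the model has never seen approximates its out-of-sample performance
print(model.score(X_test, y_test))
```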
- A sequence of data processing components is called a data pipeline. Components typically run asynchronously. Each component pulls in a large amount of data, processes it, and spits out the result in another data store, and then some time later the next component in the pipeline pulls this data and spits out its own output, and so on. Each component is fairly self-contained: the interface between components is simply the data store.
- Deep Learning
- Perceptron
- Activation Function - Introduces non-linearity
- Sigmoid
- Relu
- Empirical loss measures the loss over the total dataset (the average of the per-example losses)
- Binary (Softmax) Cross Entropy Loss - can be used with models that output a probability between 0 and 1 (used for classification problems)
- Mean Squared Error Loss - can be used with regression models that output continuous real numbers
- Loss Optimization
- Find the weights that minimize the loss
- Process of training is to find the optimal weights
- The gradient of the loss w.r.t. the weights tells which way is up - we take the negative of the gradient and take a small step in that direction. Repeat until we reach a local minimum.
- Back propagation - Derivatives and Chain rule
- Learning Rate - how do we set this? Adaptive learning rate methods:
- SGD
- Adam
- Adadelta
- Adagrad
- RMSProp
- Stochastic Gradient Descent on a batch of data points
- Increases gradient estimation accuracy
- Can parallelize
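A bare-bones NumPy sketch of that loop for a linear model, just to make "take a small step against the gradient on each mini-batch" concrete; the data, batch size, and learning rate are arbitrary:

```python
import numpy as np

rng = np.random.RandomState(0)
X = rng.rand(200, 2)
y = X @ np.array([3.0, -2.0]) + 0.5           # true weights and bias to recover

w, b, lr = np.zeros(2), 0.0, 0.1              # learning rate controls the step size
for epoch in range(100):
    for i in range(0, len(X), 32):            # mini-batches of 32 points
        xb, yb = X[i:i+32], y[i:i+32]
        err = xb @ w + b - yb                 # prediction error on the batch
        grad_w = 2 * xb.T @ err / len(xb)     # gradient of the MSE w.r.t. the weights
        grad_b = 2 * err.mean()
        w -= lr * grad_w                      # step in the negative gradient direction
        b -= lr * grad_b

print(w, b)                                   # should approach [3, -2] and 0.5
```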
- Overfitting
- Regularization to address this problem to improve generalization
- Dropout technique - randomly set some activations to 0.
- Early Stopping technique - Stop training before we have a chance to overfit
- Loss Metrics
- Regression - MSPE, MSAE, R Square
- Classification - Accuracy, Log loss, ROC-AUC, Precision Recall.
- Unsupervised models - Rand index, Mutual information
- Others - BLEU score(NLP), CV error, Heuristic methods to find K.
- Regression Metrics
- RMSE represents the sample standard deviation of the differences between predicted values and observed values (called residuals).
- MAE is the average of the absolute difference between the predicted values and observed value.
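The same two metrics in code (the predictions and observations are made up):

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error

y_true = np.array([3.0, 5.0, 2.5, 7.0])
y_pred = np.array([2.8, 5.4, 2.9, 6.3])

rmse = np.sqrt(mean_squared_error(y_true, y_pred))  # sample std-dev of the residuals
mae = mean_absolute_error(y_true, y_pred)           # average absolute residual
print(rmse, mae)
```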
- UCI ML repository - Covertype dataset - Available in BigQuery. Classification problem.
- Creating a reproducible dataset - To split the dataset, use hashing and modulo operators, which convert the content to a hash sum. BigQuery's FARM_FINGERPRINT can be used for hashing. Instead of one field, concatenate all the fields (except the label) into a JSON string and then hash that (a Python sketch of the same idea follows this list). Extract the data separately into:
- Training
- Validation
- Test
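BigQuery's FARM_FINGERPRINT does this hashing server-side; the same idea in plain Python looks roughly like this (hashlib stands in for the fingerprint function and the field names are hypothetical):

```python
import hashlib
import json

def split_of(row: dict) -> str:
    """Deterministically assign a row to training/validation/test."""
    # Concatenate all fields except the label into a JSON string, then hash it
    features = {k: v for k, v in row.items() if k != "label"}
    digest = hashlib.md5(json.dumps(features, sort_keys=True).encode()).hexdigest()
    bucket = int(digest, 16) % 10                     # modulo into 10 buckets
    if bucket < 8:
        return "training"                             # ~80%
    return "validation" if bucket == 8 else "test"    # ~10% each

print(split_of({"elevation": 2596, "slope": 3, "label": 5}))
```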
- ML model takes parameters and hyperparameters. Parameters are changed during model training (these are the weights which are tuned), whereas hyperparameters are set before training. Finding the best value for hyperparameters is called Tuning.
- Scikit-learn - preprocessor = ColumnTransformer - use StandardScaler() for numeric features and OneHotEncoder() for categorical features.
- Pipeline(steps=[('preprocessor', preprocessor), ('classifier', SGDClassifier(loss='log', tol=1e-3))])
- pipeline.set_params(classifier__alpha=0.001, classifier__max_iter=200) - note the double underscore addressing the named step's parameter
- classifier__alpha is the regularization parameter.
- pipeline.fit(x_train, y_train)
- pipeline.score(x_validation, y_validation)
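Putting those pieces together, a minimal runnable version might look like this; the column names and rows are invented, and newer scikit-learn spells the loss 'log_loss' instead of 'log':

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy stand-in for the Covertype data
df = pd.DataFrame({"elevation": [2596, 2804, 2785, 2595],
                   "soil_type": ["a", "b", "a", "c"],
                   "label": [0, 1, 0, 1]})
X, y = df.drop(columns="label"), df["label"]

preprocessor = ColumnTransformer([
    ("num", StandardScaler(), ["elevation"]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["soil_type"]),
])
pipeline = Pipeline([("preprocessor", preprocessor),
                     ("classifier", SGDClassifier(loss="log_loss", tol=1e-3))])

# Double underscore routes each hyperparameter to the named pipeline step
pipeline.set_params(classifier__alpha=0.001, classifier__max_iter=200)

x_train, x_val, y_train, y_val = train_test_split(X, y, test_size=0.5,
                                                  stratify=y, random_state=0)
pipeline.fit(x_train, y_train)
print(pipeline.score(x_val, y_val))
```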
- AI Platform can be used to search for the best hyperparameter values. Install Google's cloudml-hypertune package and use hypertune.HyperTune().report_hyperparameter_tuning_metric() to report each trial's metric so the service can pick the best trial.
- Python package fire - used for creating CLI.
- The model can be stored using pickle (pickle.dump) since we are using scikit-learn. If TensorFlow is used, TF's checkpoint/save can be used instead.
- Supply hyperparameters for the model using a config.yaml file.
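A sketch of how fire, cloudml-hypertune, and pickle might fit together in a train.py; the dataset is a stand-in and the metric tag is an assumption:

```python
import pickle

import fire
import hypertune
from sklearn.datasets import load_iris
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import train_test_split


def train(alpha: float = 0.001, max_iter: int = 200, model_path: str = "model.pkl"):
    # Stand-in dataset; a real job would read the exported training/validation splits
    X, y = load_iris(return_X_y=True)
    X_train, X_val, y_train, y_val = train_test_split(X, y, random_state=0)

    model = SGDClassifier(alpha=alpha, max_iter=max_iter).fit(X_train, y_train)
    accuracy = model.score(X_val, y_val)

    # Report this trial's metric so the AI Platform tuning service can compare trials
    hypertune.HyperTune().report_hyperparameter_tuning_metric(
        hyperparameter_metric_tag="accuracy", metric_value=accuracy, global_step=1)

    with open(model_path, "wb") as f:   # scikit-learn model saved with pickle
        pickle.dump(model, f)


if __name__ == "__main__":
    fire.Fire(train)   # CLI: python train.py --alpha=0.01 --max_iter=500
```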
- Create training docker container
- FROM gcr.io/deeplearning-platform-release/base-cpu
- RUN pip install -U fire scikit-learn pandas cloudml-hypertune
- WORKDIR /app
- COPY train.py . (the model training file)
- ENTRYPOINT ["python", "train.py"]
Kubeflow is a cloud-native, multi-cloud solution for ML workflow orchestration.
Kubeflow offers a domain-specific language (DSL) in Python to describe Kubeflow tasks, which are organized as a Directed Acyclic Graph (DAG).
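Assuming the v1 kfp SDK, a pipeline that wraps the training container above might be sketched roughly like this (the image path and arguments are placeholders):

```python
import kfp
from kfp import dsl


@dsl.pipeline(name="covertype-training", description="Train and tune the classifier")
def training_pipeline(alpha: float = 0.001, max_iter: int = 200):
    # Each step runs as a container; KFP wires the steps together into a DAG
    dsl.ContainerOp(
        name="train",
        image="gcr.io/my-project/covertype-trainer:latest",  # hypothetical image
        arguments=["--alpha", alpha, "--max_iter", max_iter],
    )


if __name__ == "__main__":
    kfp.compiler.Compiler().compile(training_pipeline, "pipeline.yaml")
```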