Machine Learning
- Classification problem
- Features
- Labels
- Training set
- Term frequency representation - getting the frequency of each word in a sentence
- Output is a categorical value.
- Naive Bayes
- Sentiment Analysis
- Classification setup: take a comment and classify it as positive or negative
- P = Positive, 1-P = Negative
- Probability of a word being positive = frequency of the word in positive comments / frequency of the word across the entire corpus
- Drop connective words (stop words) when processing
- Then multiply the positive probabilities of the other words in the sentence, times the overall probability of a comment being positive, to get the overall score.
- Sentiment labelled sentences data set - IMDB, Yelp and Amazon reviews.
- Scikit-Learn module in Python
- Can be used to classify any number of categories
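A minimal sketch of this bag-of-words plus Naive Bayes setup in scikit-learn; the example sentences and labels below are made up for illustration:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Toy corpus: 1 = positive, 0 = negative (illustrative only)
sentences = ["great movie, loved it", "terrible plot and bad acting",
             "really enjoyable film", "boring and bad"]
labels = [1, 0, 1, 0]

# Term-frequency representation with connective (stop) words dropped
vectorizer = CountVectorizer(stop_words="english")
X = vectorizer.fit_transform(sentences)

# Naive Bayes combines per-word probabilities with the class prior
model = MultinomialNB()
model.fit(X, labels)

print(model.predict(vectorizer.transform(["loved the film"])))
```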
- SVM
- Ad detection - Internet advertisement dataset
- In its basic form handles only binary classification (multiclass needs one-vs-rest or one-vs-one schemes)
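A minimal binary-classification sketch with an SVM; the feature rows here are invented stand-ins for the ad-detection features:

```python
from sklearn.svm import SVC

# Hypothetical features (e.g. image width, height, aspect ratio); 1 = ad, 0 = not an ad
X = [[468, 60, 7.8], [120, 600, 0.2], [728, 90, 8.1], [300, 250, 1.2]]
y = [1, 0, 1, 0]

svm = SVC(kernel="linear")   # finds the widest-margin separating hyperplane
svm.fit(X, y)
print(svm.predict([[470, 58, 8.1]]))
```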
- Regression Problem
- Examples
- Finding stock price returns on a given day
- CAPM model/formula for finding future price
- Demand forecasting
- The mathematical function that maps the independent variables to the output may be linear, polynomial, or non-linear, which gives the regression its name accordingly.
- Used when we want to find the relation between 2 variables.
- Output is a continuous value
- Coefficients or weights define how much impact each independent variable has on the output.
- b0 + b1x1 + b2x2, where x1, x2 are the independent variables which influence the outcome.
- The weights b0, b1, b2 are assigned to the variables
- Stochastic gradient descent - can be used to find the point of least error.
- SGDRegressor, LinearRegression in Scikit learn (from sklearn.linear_model)
- import sklearn.linear_model
- model = sklearn.linear_model.LinearRegression()
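A short sketch of fitting b0 + b1x1 + b2x2 and reading the weights back out; the numbers are invented:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Two independent variables x1, x2 and a continuous output y (toy data)
X = np.array([[1, 2], [2, 1], [3, 4], [4, 3], [5, 5]])
y = np.array([8.1, 6.9, 15.2, 13.8, 20.1])

model = LinearRegression()
model.fit(X, y)

print(model.intercept_)        # b0
print(model.coef_)             # [b1, b2] - impact of each independent variable
print(model.predict([[6, 4]])) # prediction for a new point
```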
- Lasso (Least Absolute Shrinkage and Selection Operator) regression - adds a penalty term of alpha (a hyperparameter) times the L1 norm of the coefficients.
- Ridge regression - adds a penalty term of alpha (a hyperparameter) times the L2 norm of the coefficients.
- Alpha is the regularization parameter to avoid overfitting of the model.
- from sklearn.linear_model import Lasso
- Lasso(alpha=0.5) - change the value of alpha and try to reduce the root mean squared error. (The old normalize=True argument has been removed from recent scikit-learn; scale features with StandardScaler instead.)
- Lasso eliminates unnecessary variables while evaluating the model by setting their coefficients to zero.
- from sklearn.linear_model import Ridge
- Ridge(alpha=0.5)
- Performs similar to Lasso
- Hyperparameter tuning is basically changing the value of alpha appropriately to get the right model
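A sketch of that tuning loop: try several alpha values for Lasso and Ridge and compare the RMSE on held-out data. Features are scaled explicitly (recent scikit-learn dropped normalize=), and the generated dataset is just for illustration.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, Ridge
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic regression data standing in for a real dataset
X, y = make_regression(n_samples=200, n_features=10, noise=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

for alpha in [0.01, 0.1, 0.5, 1.0]:
    for Model in (Lasso, Ridge):
        pipe = make_pipeline(StandardScaler(), Model(alpha=alpha))
        pipe.fit(X_train, y_train)
        rmse = np.sqrt(mean_squared_error(y_test, pipe.predict(X_test)))
        print(Model.__name__, alpha, round(rmse, 2))
```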
- Support vector regression - hyperplane which separates data in n-dimensional space.
- SVM - find widest margin with most distance from nearest points
- import sklearn.neighbors
- model = sklearn.neighbors.KNeighborsRegressor(n_neighbors=3)
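Both regressors side by side on the same toy data (the points are made up):

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor
from sklearn.svm import SVR

X = np.arange(10).reshape(-1, 1)                       # single feature
y = 2.0 * X.ravel() + np.random.RandomState(0).normal(0, 0.5, 10)

svr = SVR(kernel="linear").fit(X, y)                   # margin-based fit
knn = KNeighborsRegressor(n_neighbors=3).fit(X, y)     # average of the 3 nearest points

print(svr.predict([[4.5]]), knn.predict([[4.5]]))
```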
- Collaborative filtering for recommendations - uses past data to identify recommendations.
- Latent Factor Analysis
- Alternating least squares is an optimization technique used to minimize the error in the recommendation problem.
- Regularization term = lambda
- Movielens dataset
- Python module 'implicit'
- implicit.alternating_least_squares
- import heapq
- heapq.nlargest
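The implicit package wraps all of this, but the alternating-least-squares idea can be sketched directly in NumPy: fix the item factors and solve for the user factors, then swap, repeating until the error stops shrinking; heapq.nlargest then picks the top recommendations. The matrix, factor count, and lambda below are all illustrative, and missing ratings are treated as zeros rather than masked as a real implementation would.

```python
import heapq
import numpy as np

R = np.array([[5, 3, 0], [4, 0, 0], [1, 1, 5]], dtype=float)  # toy user-item ratings
k, lam = 2, 0.1                                               # latent factors, regularization (lambda)
rng = np.random.RandomState(0)
U, V = rng.rand(R.shape[0], k), rng.rand(R.shape[1], k)

for _ in range(20):
    # Fix V and solve least squares for U, then fix U and solve for V
    U = np.linalg.solve(V.T @ V + lam * np.eye(k), V.T @ R.T).T
    V = np.linalg.solve(U.T @ U + lam * np.eye(k), U.T @ R).T

scores = U[1] @ V.T                    # predicted scores for user 1 over all items
print(heapq.nlargest(2, range(len(scores)), key=scores.__getitem__))  # top-2 item indices
```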
- Clustering - Unsupervised
- K-means
- Density based
- Distribution based
- Hierarchical Cluster analysis
- We don't use a training set - we apply the clustering algorithm directly. Since we don't train and test, this falls under unsupervised learning.
- Term frequency
- Term frequency - Inverse document frequency (TF-IDF) - commonly found words like 'this', 'the', etc. don't provide as much weightage as less common words that define the context of the sentence. A word's weight is inversely proportional to the number of documents the word is found in.
- Tuple of numbers are points in n-dimensional space.
- Initialize a set of points, one per category, as the k means.
- Assign each point to the cluster belonging to the nearest mean.
- Find the new means. Repeat until the means don't change.
- Sentiment labelled sentences data set or IMDB reviews
- from sklearn.feature_extraction.text import TfidfVectorizer
- from sklearn.cluster import KMeans
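A minimal sketch tying the two together: vectorize sentences with TF-IDF, then cluster them with k-means (the sentences are invented):

```python
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

sentences = ["the battery life is great", "battery drains too fast",
             "the screen is gorgeous", "love this bright screen"]

# TF-IDF down-weights common words and up-weights rarer, context-defining ones
X = TfidfVectorizer(stop_words="english").fit_transform(sentences)

# k-means: pick k centroids, assign each point to the nearest mean, recompute, repeat
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(kmeans.labels_)
```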
- Visualization and dimensionality reduction - Unsupervised
- Principal component analysis
- Kernel PCA
- Locally linear embedding
- A related task is dimensionality reduction, in which the goal is to simplify the data without losing too much information. One way to do this is to merge several correlated features into one. For example, a car's mileage may be very correlated with its age, so the dimensionality reduction algorithm will merge them into one feature that represents the car's wear and tear. This is called feature extraction.
- Some algorithms can deal with partially labeled training data, usually a lot of unlabeled data and a little bit of labeled data. This is called semi-supervised learning.
- Reinforcement Learning is a very different beast. The learning system, called an agent in this context, can observe the environment, select and perform actions, and get rewards in return (or penalties in the form of negative rewards, as in Figure 1-12). It must then learn by itself what is the best strategy, called a policy, to get the most reward over time. A policy defines what action the agent should choose when it is in a given situation.
- In batch learning, the system is incapable of learning incrementally: it must be trained using all the available data. This will generally take a lot of time and computing resources, so it is typically done offline. First the system is trained, and then it is launched into production and runs without learning anymore; it just applies what it has learned. This is called offline learning. If you want a batch learning system to know about new data (such as a new type of spam), you need to train a new version of the system from scratch on the full dataset.
- In online learning, you train the system incrementally by feeding it data instances sequentially, either individually or by small groups called mini-batches. Each learning step is fast and cheap, so the system can learn about new data on the fly. Online learning is great for systems that receive data as a continuous flow (e.g., stock prices) and need to adapt to change rapidly or autonomously. One important parameter of online learning systems is how fast they should adapt to changing data: this is called the learning rate. If you set a high learning rate, then your system will rapidly adapt to new data, but it will also tend to quickly forget the old data.
- It is crucial to use a training set that is representative of the cases you want to generalize to: if the sample is too small, you will have sampling noise, but even very large samples can be non-representative if the sampling method is flawed. This is called sampling bias.
- A critical part of the success of a Machine Learning project is coming up with a good set of features to train on. This process is called feature engineering and involves:
- Feature selection: selecting the most useful features to train on among existing features.
- Feature extraction: combining existing features to produce a more useful one.
- Creating new features by gathering new data.
- Overfitting happens when the model is too complex relative to the amount and noisiness of the training data. The possible solutions are:
- To simplify the model by selecting one with fewer parameters.
- To gather more training data
- To reduce the noise in the training data (e.g., fix data errors and remove outliers)
- Constraining a model to make it simpler and reduce the risk of overfitting is called regularization. The amount of regularization to apply during learning can be controlled by a hyperparameter. A hyperparameter is a parameter of a learning algorithm.
- Underfitting is the opposite of overfitting: it occurs when your model is too simple to learn the underlying structure of the data. For example, a linear model of life satisfaction is prone to underfit; reality is just more complex than the model, so its predictions are bound to be inaccurate, even on the training examples.
- The error rate on new cases is called the generalization error (or out-of-sample error), and by evaluating your model on the test set, you get an estimate of this error. This value tells you how well your model will perform on instances it has never seen before.
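A small sketch of estimating the generalization error with a held-out test set; the dataset and model are placeholders:

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
# Score on data the model has never seen approximates its out-of-sample performance
print(model.score(X_test, y_test))
```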
- A sequence of data processing components is called a data pipeline. Components typically run asynchronously. Each component pulls in a large amount of data, processes it, and spits out the result in another data store, and then some time later the next component in the pipeline pulls this data and spits out its own output, and so on. Each component is fairly self-contained: the interface between components is simply the data store.
- Deep Learning
- Perceptron
- Activation Function - Introduces non-linearity
- Sigmoid
- Relu
- Empirical loss measures the loss over the total dataset (the average of the per-example losses)
- Binary (Softmax) Cross Entropy Loss - can be used with models that output a probability between 0 and 1 (used for classification problems)
- Mean Squared Error Loss - can be used with regression models that output continuous real numbers
- Loss Optimization
- Find the weights that minimize the loss
- Process of training is to find the optimal weights
- The gradient of the loss w.r.t. the weights tells which way is up - we take the negative of the gradient and take a small step in that direction. Repeat until we reach a local minimum.
- Back propagation - Derivatives and Chain rule
- Learning Rate - how do we set this? Adaptive learning rate methods:
- SGD
- Adam
- Adadelta
- Adagrad
- RMSProp
- Stochastic Gradient Descent on a batch of data points
- Increases gradient estimation accuracy
- Can parallelize
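A bare-bones NumPy sketch of that loop for a linear model, just to make "take a small step against the gradient on each mini-batch" concrete; the data, batch size, and learning rate are arbitrary:

```python
import numpy as np

rng = np.random.RandomState(0)
X = rng.rand(200, 2)
y = X @ np.array([3.0, -2.0]) + 0.5           # true weights and bias to recover

w, b, lr = np.zeros(2), 0.0, 0.1              # learning rate controls the step size
for epoch in range(100):
    for i in range(0, len(X), 32):            # mini-batches of 32 points
        xb, yb = X[i:i+32], y[i:i+32]
        err = xb @ w + b - yb                 # prediction error on the batch
        grad_w = 2 * xb.T @ err / len(xb)     # gradient of the MSE w.r.t. the weights
        grad_b = 2 * err.mean()
        w -= lr * grad_w                      # step in the negative gradient direction
        b -= lr * grad_b

print(w, b)                                   # should approach [3, -2] and 0.5
```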
- Overfitting
- Regularization to address this problem to improve generalization
- Dropout technique - randomly set some activations to 0.
- Early Stopping technique - Stop training before we have a chance to overfit
- Loss Metrics
- Regression - MSPE, MSAE, R Square
- Classification - Accuracy, Log loss, ROC-AUC, Precision Recall.
- Unsupervised models - Rand index, Mutual information
- Others - BLEU score(NLP), CV error, Heuristic methods to find K.
- Regression Metrics
- RMSE represents the sample standard deviation of the differences between predicted values and observed values (called residuals).
- MAE is the average of the absolute difference between the predicted values and observed value.
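The same two metrics in code (the predictions and observations are made up):

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error

y_true = np.array([3.0, 5.0, 2.5, 7.0])
y_pred = np.array([2.8, 5.4, 2.9, 6.3])

rmse = np.sqrt(mean_squared_error(y_true, y_pred))  # sample std-dev of the residuals
mae = mean_absolute_error(y_true, y_pred)           # average absolute residual
print(rmse, mae)
```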
- UCI ML repository - Covertype dataset - Available in BigQuery. Classification problem.
- Creating a reproducible dataset - To split the dataset, use hashing and modulo operators, which convert the content to a hash sum. BigQuery's FARM_FINGERPRINT can be used for hashing. Instead of one field, concatenate all the fields (except the label) into a JSON string and then hash that (a Python sketch of the same idea follows this list). Extract the data separately into:
- Training
- Validation
- Test
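BigQuery's FARM_FINGERPRINT does this hashing server-side; the same idea in plain Python looks roughly like this (hashlib stands in for the fingerprint function and the field names are hypothetical):

```python
import hashlib
import json

def split_of(row: dict) -> str:
    """Deterministically assign a row to training/validation/test."""
    # Concatenate all fields except the label into a JSON string, then hash it
    features = {k: v for k, v in row.items() if k != "label"}
    digest = hashlib.md5(json.dumps(features, sort_keys=True).encode()).hexdigest()
    bucket = int(digest, 16) % 10                     # modulo into 10 buckets
    if bucket < 8:
        return "training"                             # ~80%
    return "validation" if bucket == 8 else "test"    # ~10% each

print(split_of({"elevation": 2596, "slope": 3, "label": 5}))
```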
- ML model takes parameters and hyperparameters. Parameters are changed during model training (these are the weights which are tuned), whereas hyperparameters are set before training. Finding the best value for hyperparameters is called Tuning.
- Scikit-learn - preprocessor = ColumnTransformer - use StandardScaler() for numeric features and OneHotEncoder() for categorical features.
- Pipeline(steps=[('preprocessor', preprocessor), ('classifier', SGDClassifier(loss='log', tol=1e-3))])
- pipeline.set_params(classifier__alpha=0.001, classifier__max_iter=200) - note the double underscore addressing the named step's parameter
- classifier__alpha is the regularization parameter.
- pipeline.fit(x_train, y_train)
- pipeline.score(x_validation, y_validation)
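Putting those pieces together, a minimal runnable version might look like this; the column names and rows are invented, and newer scikit-learn spells the loss 'log_loss' instead of 'log':

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy stand-in for the Covertype data
df = pd.DataFrame({"elevation": [2596, 2804, 2785, 2595],
                   "soil_type": ["a", "b", "a", "c"],
                   "label": [0, 1, 0, 1]})
X, y = df.drop(columns="label"), df["label"]

preprocessor = ColumnTransformer([
    ("num", StandardScaler(), ["elevation"]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["soil_type"]),
])
pipeline = Pipeline([("preprocessor", preprocessor),
                     ("classifier", SGDClassifier(loss="log_loss", tol=1e-3))])

# Double underscore routes each hyperparameter to the named pipeline step
pipeline.set_params(classifier__alpha=0.001, classifier__max_iter=200)

x_train, x_val, y_train, y_val = train_test_split(X, y, test_size=0.5,
                                                  stratify=y, random_state=0)
pipeline.fit(x_train, y_train)
print(pipeline.score(x_val, y_val))
```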
- AI Platform can be used to search for the best hyperparameter values. Install Google's cloudml-hypertune package and use hypertune.HyperTune().report_hyperparameter_tuning_metric() to report each trial's metric so the service can pick the best trial.
- Python package fire - used for creating CLI.
- The model can be stored using pickle (pickle.dump) since we are using scikit-learn. If TensorFlow is used, TF's checkpoint/save can be used instead.
- Supply hyperparameters for the model using a config.yaml file.
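A sketch of how fire, cloudml-hypertune, and pickle might fit together in a train.py; the dataset is a stand-in and the metric tag is an assumption:

```python
import pickle

import fire
import hypertune
from sklearn.datasets import load_iris
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import train_test_split


def train(alpha: float = 0.001, max_iter: int = 200, model_path: str = "model.pkl"):
    # Stand-in dataset; a real job would read the exported training/validation splits
    X, y = load_iris(return_X_y=True)
    X_train, X_val, y_train, y_val = train_test_split(X, y, random_state=0)

    model = SGDClassifier(alpha=alpha, max_iter=max_iter).fit(X_train, y_train)
    accuracy = model.score(X_val, y_val)

    # Report this trial's metric so the AI Platform tuning service can compare trials
    hypertune.HyperTune().report_hyperparameter_tuning_metric(
        hyperparameter_metric_tag="accuracy", metric_value=accuracy, global_step=1)

    with open(model_path, "wb") as f:   # scikit-learn model saved with pickle
        pickle.dump(model, f)


if __name__ == "__main__":
    fire.Fire(train)   # CLI: python train.py --alpha=0.01 --max_iter=500
```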
- Create training docker container
- FROM gcr.io/deeplearning-platform-release/base-cpu
- RUN pip install -U fire scikit-learn pandas cloudml-hypertune
- WORKDIR /app
- COPY train.py . (the model training file)
- ENTRYPOINT ["python", "train.py"]
Kubeflow is a cloud-native, multi-cloud solution for ML workflow orchestration.
Kubeflow offers a domain-specific language (DSL) in Python to describe Kubeflow tasks, which are organized as a Directed Acyclic Graph (DAG).
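Assuming the v1 kfp SDK, a pipeline that wraps the training container above might be sketched roughly like this (the image path and arguments are placeholders):

```python
import kfp
from kfp import dsl


@dsl.pipeline(name="covertype-training", description="Train and tune the classifier")
def training_pipeline(alpha: float = 0.001, max_iter: int = 200):
    # Each step runs as a container; KFP wires the steps together into a DAG
    dsl.ContainerOp(
        name="train",
        image="gcr.io/my-project/covertype-trainer:latest",  # hypothetical image
        arguments=["--alpha", alpha, "--max_iter", max_iter],
    )


if __name__ == "__main__":
    kfp.compiler.Compiler().compile(training_pipeline, "pipeline.yaml")
```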