# Introduction to Machine Learning

If we show a picture to a three-year-old child and ask her whether there is a tree in it, we will almost certainly get the right answer. If we ask a thirty-year-old for the definition of a tree, we are far less likely to get a correct answer. Why? We never learned to recognize a tree by studying its definition.

We learned it by looking at trees; that is, we learned from ‘data’. Learning from data is used in situations where we don’t have an analytic solution, but we do have data that we can use to construct an empirical solution. This premise covers a lot of territory, and indeed learning from data is one of the most widely used techniques in science, engineering, and economics, among other fields.

# Examples of Machine Learning

What do financial forecasting, medical diagnosis, computer vision, and search engines have in common? They all have successfully utilized learning from data. The repertoire of such applications is quite impressive. Let us start the discussion with a real-life application to see how learning from data works.

Consider the problem of predicting how a movie viewer would rate the various movies out there. This is an important problem if you are a company that rents out movies, since you want to recommend to each viewer the movies they will like. Good recommender systems are so important to business that the movie rental company Netflix offered a prize of one million dollars to anyone who could improve their recommendations by a mere $$10\%$$.

The main difficulty in this problem is that the criteria that viewers use to rate movies are quite complex. Trying to model those explicitly is no easy task, so it may not be possible to come up with an analytic solution. But we know that the historical rating data reveal a lot about how people rate movies, so we may be able to construct a good empirical solution. There is a great deal of data available to movie rental companies, since they often ask their viewers to rate the movies that they have already seen.

Netflix Movie Rating Solution

The figure above illustrates a specific approach that was widely used in the million-dollar competition. Here is how it works.

We describe a movie as a long array of different factors, e.g.,

• how much comedy is in it,
• how complicated is the plot,
• how handsome is the lead actor, etc.

Now we describe each viewer with corresponding factors:

• how much do they like comedy,
• do they prefer simple or complicated plots,
• how important are the looks of the lead actor, and so on.

How this viewer will rate that movie is now estimated based on the match/mismatch of these factors. For example, if the movie is pure comedy and the viewer hates comedies, chances are they won’t like it. If we take dozens of these factors describing many facets of a movie’s content and a viewer’s taste, the conclusion based on matching all the factors will be a good predictor of how the viewer will rate the movie.
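The matching idea above can be sketched in a few lines of Python. This is only an illustration: the factor names and values are invented, and scoring the match with a dot product is an assumption about how match/mismatch is turned into a number, not something the text specifies.

```python
# Hypothetical factor names, chosen only for illustration.
FACTORS = ["comedy", "plot_complexity", "lead_actor_looks"]


def predicted_rating(movie, viewer):
    """Score the match between a movie and a viewer as a dot product.

    Factors where the movie's content and the viewer's taste agree in sign
    raise the score; factors where they disagree lower it.
    """
    return sum(movie[f] * viewer[f] for f in FACTORS)


# Movie factors: how much of each quality the movie has (0 = none, 1 = a lot).
pure_comedy = {"comedy": 1.0, "plot_complexity": 0.1, "lead_actor_looks": 0.5}

# Viewer factors: how much the viewer likes each quality (negative = dislikes).
comedy_hater = {"comedy": -1.0, "plot_complexity": 0.5, "lead_actor_looks": 0.0}
comedy_fan = {"comedy": 1.0, "plot_complexity": 0.5, "lead_actor_looks": 0.0}

score_hater = predicted_rating(pure_comedy, comedy_hater)
score_fan = predicted_rating(pure_comedy, comedy_fan)

print("comedy hater:", score_hater)
print("comedy fan:", score_fan)
```

With these made-up numbers, the viewer who dislikes comedy gets a negative score for the pure comedy and the comedy fan gets a positive one, which mirrors the example in the paragraph above.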

The power of machine learning is that this entire process can be automated, without any need for analyzing movie content or viewer taste. To do so, the learning algorithm ‘reverse-engineers’ these factors based solely on previous ratings. It starts with random factors, then tunes these factors to make them more and more aligned with how viewers have rated movies before, until they are ultimately able to predict how viewers rate movies in general.

A model for how a viewer rates a movie

The factors we end up with may not be as intuitive as ‘comedy content’, and in fact can be quite subtle or even incomprehensible. After all, the algorithm is only trying to find the best way to predict how a viewer would rate a movie, not necessarily to explain to us how it is done. This algorithm was part of the winning solution in the million-dollar competition.
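The tuning procedure described above (start from random factors, then adjust them to agree with past ratings) can be sketched as follows. This is a minimal illustration, not the competition's actual method: the update rule is plain stochastic gradient descent on squared error, the ratings are invented toy data, and names such as `learn_factors`, `lr`, and `epochs` are hypothetical.

```python
import random


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def mean_squared_error(ratings, viewers, movies):
    """Average squared gap between known ratings and predicted ratings."""
    total = sum((r - dot(viewers[v], movies[m])) ** 2 for v, m, r in ratings)
    return total / len(ratings)


def learn_factors(ratings, n_viewers, n_movies,
                  n_factors=2, epochs=2000, lr=0.02, seed=0):
    """Tune viewer and movie factors so their dot products match the ratings."""
    rng = random.Random(seed)
    # Start with small random factors for every viewer and every movie.
    viewers = [[rng.uniform(-0.5, 0.5) for _ in range(n_factors)]
               for _ in range(n_viewers)]
    movies = [[rng.uniform(-0.5, 0.5) for _ in range(n_factors)]
              for _ in range(n_movies)]
    for _ in range(epochs):
        for v, m, r in ratings:
            err = r - dot(viewers[v], movies[m])
            for k in range(n_factors):
                vk, mk = viewers[v][k], movies[m][k]
                # Nudge both factors in the direction that shrinks the error.
                viewers[v][k] += lr * err * mk
                movies[m][k] += lr * err * vk
    return viewers, movies


# Toy data, invented for illustration: (viewer id, movie id, rating from 1 to 5).
ratings = [(0, 0, 5.0), (0, 1, 1.0), (1, 0, 4.0),
           (1, 2, 2.0), (2, 1, 5.0), (2, 2, 4.0)]

# With epochs=0 we get the random starting factors back unchanged.
start_viewers, start_movies = learn_factors(ratings, 3, 3, epochs=0)
viewers, movies = learn_factors(ratings, 3, 3)

error_before = mean_squared_error(ratings, start_viewers, start_movies)
error_after = mean_squared_error(ratings, viewers, movies)
print("error with random factors:", error_before)
print("error after tuning:", error_after)
```

Note that nothing in the code names the factors. The two numbers learned per viewer and per movie are whatever values best reproduce the ratings, which is why the resulting factors need not correspond to anything as readable as ‘comedy content’.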

# Components of Machine Learning

The movie rating application captures the essence of learning from data, and so do many other applications from vastly different fields. In order to abstract the common core of the learning problem, we will pick one application and use it as a metaphor for the different components of the problem. Let us take the credit approval problem as our metaphor.

Suppose that a bank receives thousands of credit card applications every day, and it wants to automate the process of evaluating them. Just as in the case of movie ratings, the bank does not know of any magical formula that can pinpoint when credit should be approved, but it has a lot of data. This calls for machine learning, so the bank uses historical records of previous customers to figure out a good formula for credit approval.

Each customer record has personal information related to credit, such as annual salary, years in residence, outstanding loans, etc. The record also keeps track of whether approving credit for that customer was a good idea, i.e., whether the bank made money on that customer. This data guides the construction of a successful formula for credit approval that can be used on future applicants.

Snapshot of Credit Approval Data

Let us give names and symbols to the main components of this learning problem. There is the input $$x$$ (customer information that is used to make a credit decision), the unknown target function $$f: \mathcal{X} \to \mathcal{Y}$$ (ideal formula for credit approval), where $$\mathcal{X}$$ is the input space (set of all possible inputs $$x$$), and $$\mathcal{Y}$$ is the output space (set of all possible outputs, in this case just a yes/no decision). There is a data set $$\mathcal{D}$$ of input-output examples $$(x_1, y_1), \dots, (x_N, y_N)$$, where $$y_n = f(x_n)$$ for $$n = 1, \dots, N$$ (inputs corresponding to previous customers and the correct credit decision for them).

Basic Setup of Machine Learning Problem

The examples are often referred to as data points. Finally, there is the learning algorithm that uses the data set $$\mathcal{D}$$ to pick a formula $$g: \mathcal{X} \to \mathcal{Y}$$ that approximates $$f$$. The algorithm chooses $$g$$ from a set of candidate formulas under consideration, which we call the hypothesis set $$\mathcal{H}$$. For instance, $$\mathcal{H}$$ could be the set of all linear formulas from which the algorithm would choose the best linear fit to the data.

When a new customer applies for credit, the bank will base its decision on $$g$$ (the hypothesis that the learning algorithm produced), not on $$f$$ (the ideal target function, which remains unknown). The decision will be good only to the extent that $$g$$ faithfully replicates $$f$$. To achieve that, the algorithm chooses the $$g$$ that best matches $$f$$ on the training examples of previous customers, with the hope that it will continue to match $$f$$ on new customers.
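To make the roles of the data set, the hypothesis set, and the learning algorithm concrete, here is a small Python sketch. Everything in it is an assumption made for illustration: the customer records are invented, the input is reduced to a single number (annual salary), the hypothesis set contains only simple salary-threshold rules, and the "algorithm" just picks the rule with the fewest mistakes on the training examples.

```python
# Toy data set D, invented for illustration:
# (annual salary in thousands, was approving credit a good idea?)
D = [(20, False), (35, False), (50, True), (65, True), (80, True), (30, False)]


def make_threshold_rule(threshold):
    """Build one candidate formula h: approve when salary >= threshold."""
    def h(salary):
        return salary >= threshold
    return h


# Hypothesis set H: one candidate rule per threshold 10, 20, ..., 90.
H = {t: make_threshold_rule(t) for t in range(10, 100, 10)}


def training_errors(h, data):
    """Count the training examples on which h disagrees with the known outcome."""
    return sum(1 for x, y in data if h(x) != y)


def learning_algorithm(hypotheses, data):
    """Pick the hypothesis that best matches the training examples."""
    best_t = min(hypotheses, key=lambda t: training_errors(hypotheses[t], data))
    return best_t, hypotheses[best_t]


best_threshold, g = learning_algorithm(H, D)

print("chosen threshold:", best_threshold)
print("training errors of g:", training_errors(g, D))
print("approve a new applicant earning 70k?", g(70))
```

The target function $$f$$ never appears in the code. The algorithm only sees its values on the training examples, and the decision for the new applicant comes from $$g$$.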

# Learning Model

There are five components of learning, as discussed in the section above:

• Unknown Target Function $$f: \mathcal{X} \to \mathcal{Y}$$
• Training Data $$(x_1, y_1), (x_2, y_2), \dots, (x_N, y_N)$$
• Final Hypothesis $$g: \mathcal{X} \to \mathcal{Y}$$
• Learning Algorithm $$\mathcal {A}$$
• Hypothesis Set $$\mathcal {H}$$

Of these five components, we have no direct control over the Target Function, the Training Data, or the Final Hypothesis. We control only the Learning Algorithm and the Hypothesis Set. Together, these two are referred to as the Learning Model.