Statistics for Machine Learning
Statistics is regarded as a prerequisite for working in the field of applied machine learning. Statistics helps transform data into information and answer questions about groups of observations. Statistical methods are necessary both to analyse the data used to develop a machine learning model and to evaluate the outcomes of testing multiple machine learning models.
What is Statistics?
Statistics is the area of mathematics that deals with the gathering, processing, interpretation, display, and organisation of quantitative data.
Types of Statistics
Statistics is divided into two subfields:
1. Descriptive:
This branch summarises data, for example with the mean and standard deviation for continuous variables (such as price), and with percentages and frequencies for categorical variables (such as category).
2. Inferential:
Because it is difficult to collect all the data (in statistics, this complete collection is called the population), a subset of the data points, known as a sample, is gathered instead, and conclusions about the entire population are drawn from it. This is known as inferential statistics.
Types of Data
We encounter two sorts of data that we must manage and evaluate: numerical and categorical.
1. Numerical Information
Numerical data is simply a collection of numbers. It is further classified into two types: discrete numerical variables and continuous numerical variables.
a. Discrete Numerical Variable
Discrete variables are those that take a countable set of separate values, such as rank in the classroom, the number of faculty in the department, and so on.
b. Continuous Numeric Variable
Continuous variables are those that can take infinitely many values within a range, rather than a fixed set of values, such as an employee’s compensation.
2. Categorical Information
Categorical data refers to categories, strings, or character data such as name and colour. In general, there are two kinds.
a. Ordinal Variables
An ordinal categorical variable is one whose values have a natural order, such as a student’s grade (A, B, C) or levels like high, medium, and low.
b. Nominal Variables
Nominal variables are those whose categories are just names with no inherent order, such as colour names, subjects, and so on.
Measure of Central Tendency
A measure of central tendency indicates the centrality of the dataset, i.e., what lies at the core of your data. It covers the mean, median, and mode.
1. Mean
The mean is simply the average of all values of a numeric variable. When the data contains outliers, the mean can be misleading, since a single extreme value can distort the average. In that case, the answer is the median.
2. Median
The median is the middle value found after sorting all of the data. If the number of observations is even, the mean of the two middle values is used. It remains unaffected by anomalies as long as fewer than half of the data points are outliers.
3. Mode
The mode of a numeric variable is its most frequently occurring value. NumPy does not offer a method to find the mode, but SciPy does.
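As a minimal sketch with made-up marks, the snippet below computes all three measures of central tendency; note that the return shape of scipy.stats.mode has varied across SciPy versions, so the keepdims argument assumes a reasonably recent release.

```python
import numpy as np
from scipy import stats

marks = np.array([3, 7, 7, 2, 9, 7, 4])

print(np.mean(marks))    # average of all values: ~5.57
print(np.median(marks))  # middle value after sorting: 7.0
print(stats.mode(marks, keepdims=False).mode)  # most frequent value: 7
```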
Measure of Spread
Measures of spread help in understanding how a distribution is dispersed, i.e., how far the data spreads out around its centre.
1. Range
The range defines the gap between your data’s greatest and lowest points (max-min).
2. Percentiles
A percentile is a statistical metric that indicates the value below which a specified percentage of the data falls. The 20th percentile, for example, is the value below which 20% of the data falls. In real-world contexts, such as the JEE Mains test, percentiles are used a lot. So, if the 20th percentile is 35, we may say that 20% of the observations have a value smaller than 35.
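As a quick illustration with invented scores, NumPy's percentile function returns the value below which the given percentage of the data falls:

```python
import numpy as np

scores = np.array([10, 20, 35, 40, 55, 60, 75, 80, 90, 100])

# Value below which roughly 20% of the scores fall
print(np.percentile(scores, 20))
```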
3. Quartiles
Quartiles are the values that split sorted data into four equal parts. The steps to determine the quartiles are as follows.
- Sort the values in ascending order.
- Then divide the collection into four equal parts.
Q2 is also called the median, and the four quartiles correspond to the values at the 25th, 50th, 75th, and 100th percentiles.
4. Interquartile Range (IQR)
A measure of the spread between the upper (75th percentile) and lower (25th percentile) quartiles: IQR = Q3 - Q1. It is a crucial concept in statistics that is utilised in many computations and data preprocessing procedures, such as coping with outliers.
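As a rough sketch over made-up numbers, the quartiles, IQR, range, and the common 1.5 x IQR outlier rule all follow from NumPy's percentile function:

```python
import numpy as np

data = np.array([6, 47, 49, 15, 42, 41, 7, 39, 43, 40, 36])

q1, q2, q3 = np.percentile(data, [25, 50, 75])  # the three quartiles
iqr = q3 - q1                                   # interquartile range
data_range = data.max() - data.min()            # range = max - min

# A common outlier rule: flag points outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR]
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = data[(data < lower) | (data > upper)]
```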
5. Absolute Mean Deviation
The mean absolute deviation, sometimes known as the absolute deviation from the average, quantifies the spread in the data. Simply put, it tells you the mean absolute distance between each item in the collection and the mean.
6. Variance
Variance measures how much the data points deviate from the mean; the only difference from the mean absolute deviation is that the deviations are squared. The variance is calculated by taking the difference between each data point and the mean, squaring it, summing the squares, and averaging the result. The NumPy package has a direct method for calculating variance.
The issue with variance is that, because of the squaring, it is not in the same units as the raw data. As a result, it is not very intuitive to interpret, and most people prefer the standard deviation.
7. Standard Deviation
The standard deviation is simply the square root of the variance, which undoes the squaring of the units, so we get a result in the same units as the data again. NumPy can also compute the standard deviation directly.
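Here is a minimal NumPy sketch, over invented sample values, computing the mean absolute deviation, variance, and standard deviation side by side:

```python
import numpy as np

data = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])

mad_mean = np.mean(np.abs(data - data.mean()))  # mean absolute deviation: 1.5
variance = np.var(data)                         # average squared deviation: 4.0
std_dev = np.std(data)                          # square root of variance: 2.0
```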
8. Median Absolute Deviation (MAD)
MAD is the median of the absolute differences between each data point and the median of the data. NumPy lacks a MAD function, but statsmodels includes a module called robust that has one.
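A small sketch with made-up data, computing MAD directly with NumPy; the statsmodels alternative is noted in a comment, with the caveat that it rescales the result by default so that it is comparable to a standard deviation for normally distributed data:

```python
import numpy as np

data = np.array([1.0, 2.0, 3.0, 4.0, 100.0])  # 100 is an outlier

median = np.median(data)
mad = np.median(np.abs(data - median))  # median absolute deviation
print(mad)  # 1.0 -- barely affected by the outlier

# statsmodels equivalent (rescaled by ~1.4826 by default):
# from statsmodels import robust
# robust.mad(data)
```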
Standard Normal Distribution
1. Skewness
Skewness is a measure of the asymmetry of a distribution, which you can visualise as a histogram or KDE with a peak near the mode of the observations. It is classified into two types: left-skewed data and right-skewed data. Some people count three types, with the third being the symmetric distribution, which corresponds to the normal distribution.
a. Right skewed data (Positively skewed distribution)
A right-skewed distribution is one with a long tail to the right (positive axis). Income is a classic example of a right-skewed distribution: relatively few people have extremely high wealth, while the majority of people are in the middle range.
b. Left skewed data (Negatively skewed distribution)
A left-skewed distribution is one with a long tail to the left (negative axis). As an example, consider student grades: there will be few students with very low marks, while most students fall in the passing range.
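As an illustration, scipy.stats.skew returns a positive value for right-skewed data and a negative value for left-skewed data; the exponential sample below is an invented stand-in for incomes:

```python
import numpy as np
from scipy.stats import skew

rng = np.random.default_rng(0)
incomes = rng.exponential(scale=30_000, size=10_000)  # long right tail

print(skew(incomes))  # positive value -> right (positively) skewed
```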
Central Limit Theorem
The Central Limit Theorem asserts that if we repeatedly draw samples from any population, the distribution of the sample means approaches a normal distribution as the sample size grows, and the average of the sample means will be about the same as the population mean, regardless of the shape of the population's own distribution.
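A small simulation sketch of the idea, using an invented exponential population (clearly non-normal): the means of repeated samples cluster around the population mean in a roughly bell-shaped pattern.

```python
import numpy as np

rng = np.random.default_rng(0)
population = rng.exponential(scale=2.0, size=100_000)  # skewed, non-normal

# Draw 1000 samples of size 50 and record each sample's mean
sample_means = [rng.choice(population, size=50).mean() for _ in range(1000)]

# The average of the sample means is close to the population mean,
# and a histogram of sample_means would look approximately normal
print(np.mean(sample_means), population.mean())
```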
Probability Density Function
If you are familiar with histograms, you will understand how they divide data into bins and depict a distribution. However, if we wish to compare several classes of statistical data on one plot, it is hard to do with histograms and much easier with PDFs. The probability density function is the smooth line drawn over the histogram using KDE (kernel density estimation).
So, if you look at the KDE curve, you'll notice that it passes through each bin, touching its edge. As a result, we may use PDFs to draw KDE curves side by side and analyse multiclass data.
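A minimal sketch of computing a KDE-based PDF with SciPy; the data is made up and plotting is left out:

```python
import numpy as np
from scipy.stats import gaussian_kde

rng = np.random.default_rng(0)
data = rng.normal(loc=50, scale=10, size=1_000)

kde = gaussian_kde(data)                       # fit a kernel density estimate
xs = np.linspace(data.min(), data.max(), 200)  # evaluation grid
density = kde(xs)                              # smooth PDF values along xs
```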
Cumulative Distributive Function
The CDF tells us what fraction of the data is less than a certain value. To determine the CDF, sum all of the histogram bins prior to that point, and the output is the CDF at that value. Another way is to use calculus to find the area under the PDF curve up to the point where you want the CDF. As a result, when we integrate the PDF we get the CDF, and when we differentiate the CDF we obtain the PDF.
When we divide each frequency count by the total count, we obtain the probability density function, and when we compute the cumulative sum of the PDF, we get the CDF.
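That relationship is a one-liner in NumPy; the data below is invented:

```python
import numpy as np

data = np.random.default_rng(1).normal(size=1_000)

counts, bin_edges = np.histogram(data, bins=20)
pdf = counts / counts.sum()  # normalise counts to get an empirical PDF
cdf = np.cumsum(pdf)         # running sum of the PDF gives the CDF
```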
Mathematics in Machine Learning
Statistics and mathematics are two sides of the same coin, and their concepts are frequently employed in conjunction with one another. So in this article, we will explore ML mathematics while also looking at which libraries to use.
1. Linear Algebra
Linear algebra serves as a foundation for understanding the other subjects in ML mathematics, covering the fundamental notions that reappear across the other important areas.
Linear algebra is a method of describing data in an equation-like manner that may then be represented in matrix and vector forms. It is a technique for dealing with different coordinates, particularly higher dimensions and planes.
Linear algebra involves systems of equations, which we can solve using simple cross-multiplication or elimination techniques. It also involves matrices; the NumPy library may be used to code matrices and other arithmetic topics.
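For example, here is a minimal sketch of solving a small system of equations with NumPy; the coefficients are made up:

```python
import numpy as np

# Solve 2x + 3y = 8 and x - y = -1, written as A @ v = b
A = np.array([[2.0, 3.0],
              [1.0, -1.0]])
b = np.array([8.0, -1.0])

solution = np.linalg.solve(A, b)
print(solution)  # [1. 2.] -> x = 1, y = 2
```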
There is also matrix factorization, or matrix decomposition. This approach is useful when we have complicated matrices that are difficult for an algorithm to handle: we simply break the matrix into simpler factors to facilitate calculation.
There are three common options for doing so: the LU decomposition method, the QR decomposition method, and the Cholesky decomposition method.
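As a short sketch, all three decompositions are available in NumPy/SciPy; the matrix below is an invented symmetric positive-definite example so that the Cholesky method applies:

```python
import numpy as np
from scipy.linalg import lu

A = np.array([[4.0, 2.0],
              [2.0, 3.0]])

P, L, U = lu(A)            # LU decomposition (with row pivoting)
Q, R = np.linalg.qr(A)     # QR decomposition
C = np.linalg.cholesky(A)  # Cholesky (needs symmetric positive definite A)

# Each factorisation reconstructs the original matrix:
assert np.allclose(P @ L @ U, A)
assert np.allclose(Q @ R, A)
assert np.allclose(C @ C.T, A)
```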
Finally, there are probability distributions to consider. Probability is an inseparable notion in machine learning, and it is employed in the development of predictions.
Calculus in ML
Calculus is useful in a variety of deep learning and neural network approaches. It aids in the construction of any neural network’s forward and backward propagation. Using calculus, we may create specific error and loss functions.
Differentiation aids optimisation strategies and is useful for determining the maxima and minima of functions. Integration, on the other hand, is useful in a variety of probability-based tasks as well as for calculating areas under complicated algebraic curves.
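As a tiny illustration of derivative-based optimisation, the gradient-descent sketch below minimises an invented one-variable function f(x) = (x - 3)^2 using its derivative f'(x) = 2(x - 3):

```python
def grad(x):
    return 2 * (x - 3)  # derivative of f(x) = (x - 3)**2

x = 0.0              # starting guess
learning_rate = 0.1
for _ in range(100):
    x -= learning_rate * grad(x)  # step against the gradient

print(round(x, 4))   # approaches 3.0, the minimum of f
```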
Conclusion
We’ve studied statistics and their significance in Machine Learning. We began with categories of statistics, what kinds of data we interact with, and fundamental terminology we need to execute certain mathematical and statistical operations to grasp the complexity of data. Additionally, we have also looked at other mathematical concepts that apply to ML. Linear algebra and calculus were the two most important maths principles when it comes to machine learning.