Statistics for Data Science

Data Science

Data Science is the effective extraction of insights and information from data. It is the science of going beyond the raw numbers to find real-world meaning and applications in the data. To extract the information embedded in complex datasets, Data Scientists use a myriad of techniques and tools for modelling, data exploration, and visualization.

Statistics, the most important mathematical tool in this kit, brings a variety of validated techniques to such data exploration. Statistics is an application of mathematics that provides concrete mathematical summaries of data. Rather than using every individual data point, it renders summary values that effectively describe the properties of the dataset, such as its make-up, structure and so on.

Here are the most basic statistical techniques, the ones most popularly used and most effective in Data Science and its practical applications.

(1) Central Tendency

The central tendency is the typical value of a variable in the dataset. For example, if a two-dimensional normal distribution is centred at (110, 110), then (110, 110) is its central tendency and is the natural value with which to summarize the dataset. The central tendency also tells us whether, and how, the dataset is biased towards particular values.

There are two measures commonly used to describe central tendency.

Mean:

The mean is the average value, the point around which the data is distributed. For example, given the five numbers below, here is how the mean is calculated:

Mean = (188 + 2 + 63 + 13 + 52) / 5 = 63.6. This is the mathematical average returned by NumPy and other Python libraries.

Median:

The median is the true middle value of the dataset once it is sorted, and it need not equal the mean. For the sample set above, sorting gives:

[2, 13, 52, 63, 188] → 52

Both the median and the mean can be calculated with simple NumPy one-liners:

numpy.median(array)

numpy.mean(array)
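For the five-number sample above, a short Python snippet (assuming NumPy is installed) confirms both values:

import numpy as np

# The sample dataset used above
data = np.array([2, 13, 52, 63, 188])

print(np.mean(data))    # 63.6 -> the arithmetic average
print(np.median(data))  # 52.0 -> the middle value of the sorted data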

(2) Spread

The spread of data shows whether the values cluster around a single point or are scattered across a wide range. If we plot real-world datasets as Gaussian probability curves, a curve with a small spread has its data points packed into a narrow range, while a curve with a large spread stretches across a much wider range. The standard deviation is the value that quantifies this.

Standard Deviation:

The standard deviation quantifies the spread of data and involves these five steps (a short code sketch follows the list):

1. Calculate mean.

2. For each value calculate the square of its distance from the mean value.

3. Add all the values from Step 2.

4. Divide by the number of data points.

5. Calculate the square root.

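As a minimal sketch (plain Python, with NumPy only for the cross-check), the five steps translate directly into code and agree with numpy.std, which uses the same population formula by default:

import numpy as np

data = [2, 13, 52, 63, 188]

mean = sum(data) / len(data)               # Step 1: calculate the mean
squared = [(x - mean) ** 2 for x in data]  # Step 2: squared distance of each value from the mean
total = sum(squared)                       # Step 3: add all the squared distances
variance = total / len(data)               # Step 4: divide by the number of data points
std = variance ** 0.5                      # Step 5: take the square root

print(std)           # roughly 66.3 for this sample
print(np.std(data))  # NumPy's one-liner gives the same result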

Bigger values indicate greater spread; smaller values mean the data is concentrated around the mean.

In NumPy, the standard deviation is calculated as

numpy.std(array)

(3) Percentiles

The percentile tells us where a data point sits within the range of values, whether it is towards the low end or the high end.

Saying that a value is at the pth percentile means that p% of the data lies below it and the remaining data lies above it.

Take the set of 11 numbers below and arrange them in ascending order.

3, 1, 5, 9, 7, 11, 15, 13, 19, 17, 21. Sorted, this becomes 1, 3, 5, 7, 9, 11, 13, 15, 17, 19, 21. Here 15 sits at roughly the 70th percentile: about 70% of the data lies below it and the rest above it.

The 50th percentile in NumPy is calculated as

numpy.percentile(array, 50)
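Applied to the eleven-number set above (assuming NumPy's default interpolation between ranks), the same call reproduces the example:

import numpy as np

data = [3, 1, 5, 9, 7, 11, 15, 13, 19, 17, 21]

print(np.percentile(data, 50))  # 11.0 -> the median of the sorted set
print(np.percentile(data, 70))  # 15.0 -> the 70th percentile from the example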

(4) Skewness

Skewness measures the asymmetry of the data. A positive skew means the values are concentrated on the left of the distribution with a long tail to the right, while a negative skew means the values are concentrated on the right with a long tail to the left.

Skewness is calculated as the third standardized moment of the distribution: the average of the cubed deviations from the mean, divided by the cube of the standard deviation.

Skewness also tells us how far the data distribution is from a Gaussian, which has zero skewness: the larger the absolute skewness, the further the dataset is from being a Gaussian distribution.

Here’s how we can compute the skewness with SciPy:

scipy.stats.skew(array)
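As a small illustration (the dataset below is made up for the example), a sample with a long right tail produces a positive skewness value:

import numpy as np
from scipy import stats

# Most values are small, with one large outlier forming a right tail
data = np.array([1, 2, 2, 3, 3, 3, 4, 4, 5, 20])

print(stats.skew(data))  # positive -> mass concentrated on the left, tail to the right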

(5) Covariance and Correlation

Covariance

Covariance indicates whether two variables are “related” or not. A positive covariance means that when one variable increases, the other tends to increase as well; a negative covariance means that when one increases, the other tends to decrease.

Correlation

Correlation values lie between -1 and 1 and are calculated as the covariance divided by the product of the standard deviations of the two variables. A correlation of +1 indicates a perfect relationship: when one variable increases, the other moves in the same direction. A negative correlation means that an increase in one variable leads to a decline in the other, with -1 indicating a perfect inverse relationship.
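A quick sketch with two made-up variables (y tends to rise with x) shows both quantities in NumPy:

import numpy as np

x = np.array([1, 2, 3, 4, 5])
y = np.array([2, 4, 5, 4, 6])

print(np.cov(x, y)[0, 1])       # covariance: positive, so the variables move together
print(np.corrcoef(x, y)[0, 1])  # correlation: between -1 and 1, here about 0.85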

Conclusion: 

Knowing the above five concepts is useful when doing PCA (Principal Component Analysis): they help explain the data effectively and summarize the dataset, and measures like correlation drive techniques such as Dimensionality Reduction. When most of the data can be described by summary values such as the mean or median, the remaining detail can often be set aside. If you want to learn data science, try the Imarticus Learning Academy where careers in data science are made.
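As a hedged sketch of that idea (using scikit-learn's PCA on a made-up dataset, neither of which the article prescribes), two strongly correlated features can be summarized by a single component:

import numpy as np
from sklearn.decomposition import PCA

# Hypothetical data: 100 samples with two strongly correlated features
rng = np.random.default_rng(0)
x = rng.normal(size=100)
data = np.column_stack([x, 2 * x + rng.normal(scale=0.1, size=100)])

pca = PCA(n_components=1)             # keep only the dominant direction of variation
reduced = pca.fit_transform(data)
print(pca.explained_variance_ratio_)  # close to 1.0 -> one component captures nearly all the variance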
