Data science is the effective extraction of insights and information from data. It is the science of going beyond the numbers to find real-world applications and meaning in the data. To extract the information embedded in complex datasets, data scientists use a myriad of techniques and tools for modeling, data exploration, and visualization.

Statistics, the most important mathematical tool here, brings a variety of validated techniques for such data exploration. It is an application of mathematics that provides concrete numerical summaries of data. Rather than working with every data point, it yields a few summary values that effectively describe properties of the dataset, such as its make-up and structure.

Here are the most basic statistical techniques, which are widely used and very effective in data science and its practical applications.

(1) Central Tendency

Central tendency is the typical value of a variable in the dataset. When a normal distribution over x and y is centered at (110, 110), most of the data lies near (110, 110), and this value is chosen as the typical summarizing value of the dataset. It also tells us about any bias in the set.

Two measures are commonly used to describe central tendency.

Mean:

The mean is the average value, the balance point around which the data is distributed. Given five numbers, here is how you calculate the mean. Ex: the five numbers are 188, 2, 63, 13 and 52.

Mean = (188 + 2 + 63 + 13 + 52) / 5 = 63.6, aka the mathematical average value, as computed by Numpy and other Python libraries.

Median:

The median is the true middle value of the dataset once it is sorted, and it may not be equal to the mean. Finding the median of the sample set requires sorting it first:

[2, 13, 52, 63, 188] → 52

The median and mean can be calculated using simple numpy Python one-liners:

numpy.median(array)

numpy.mean(array)
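As a minimal sketch, the two one-liners can be run on the five sample numbers from the mean example. The variable names here are illustrative and not part of the original article.

```python
import numpy as np

# The five sample values used in the mean and median examples.
values = np.array([188, 2, 63, 13, 52])

mean_value = np.mean(values)      # sum of the values divided by their count
median_value = np.median(values)  # middle value after sorting

print(mean_value)    # 63.6
print(median_value)  # 52.0
```

The single large value 188 pulls the mean above the median, which is why the two measures can differ noticeably on the same data.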

(2) Spread

The spread of data shows whether the data is clustered around a single value or spread out across a range. In the figure, which treats each distribution as a Gaussian probability curve for a real-world dataset, the blue curve has a small spread, with its data points confined to a narrow range, while the red curve has the largest spread. The figure also shows each curve's standard deviation (SD).

Standard Deviation:

This quantifies the spread of data and involves these 5 steps:

1. Calculate mean.

2. For each value calculate the square of its distance from the mean value.

3. Add all the values from Step 2.

4. Divide by the number of data points.

5. Calculate the square root.

Written as a formula, for N data points x_1, ..., x_N with mean \mu:

$$\sigma = \sqrt{\frac{1}{N}\sum_{i=1}^{N}\left(x_i - \mu\right)^2}$$

Bigger values indicate greater spread. Smaller values mean the data is concentrated around the mean.

In Numpy, the SD is calculated as

numpy.std(array)
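The five steps listed earlier can be sketched directly and compared against numpy.std. This is an illustrative example that reuses the five sample numbers from the mean section; the variable names are assumptions. Note that numpy.std divides by the number of data points by default (ddof=0), which matches Step 4.

```python
import numpy as np

values = np.array([188, 2, 63, 13, 52], dtype=float)

mean_value = values.mean()                      # Step 1: calculate the mean
squared_distances = (values - mean_value) ** 2  # Step 2: squared distance of each value from the mean
total = squared_distances.sum()                 # Step 3: add the values from Step 2
variance = total / len(values)                  # Step 4: divide by the number of data points
std_manual = np.sqrt(variance)                  # Step 5: take the square root

# numpy.std uses ddof=0 by default, i.e. it also divides by the number of points.
std_numpy = np.std(values)

print(std_manual, std_numpy)
```

If the sample standard deviation (dividing by N - 1) is wanted instead, numpy.std accepts ddof=1.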

(3) Percentiles

A percentile shows where a data point sits within the range of values, that is, whether it is low or high.

Saying a value is at the pth percentile means that p% of the data lies below it and the remainder lies above it.

Take the set of 11 numbers below and arrange them in ascending order.

3, 1, 5, 9, 7, 11, 15, 13, 19, 17, 21. Sorted, the set is 1, 3, 5, 7, 9, 11, 13, 15, 17, 19, 21. Here 15 is at the 70th percentile, dividing the set at this number: 7 of the other 10 values (70%) lie below 15 and the remaining 3 lie above it.

The 50th percentile in Numpy is calculated as

numpy.percentile(array, 50)
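A short sketch of the percentile example using the set of 11 numbers from the text. It assumes numpy.percentile's default linear interpolation; the variable names are illustrative.

```python
import numpy as np

values = np.array([3, 1, 5, 9, 7, 11, 15, 13, 19, 17, 21])

# numpy.percentile does not require the input to be sorted.
p50 = np.percentile(values, 50)  # the 50th percentile is the median
p70 = np.percentile(values, 70)

print(np.sort(values))
print(p50, p70)
```

With 11 evenly ranked values, the 50th percentile falls on 11 and the 70th on 15 (up to floating-point rounding), in line with the worked example.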

(4) Skewness

Skewness measures the asymmetry of the data. A positive value means the values are concentrated on the left, with a longer tail to the right, while a negative value means the data points are concentrated on the right, with a longer tail to the left.

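Skewness is commonly calculated as the third standardized moment, where \mu is the mean, \sigma the standard deviation and N the number of data points:

$$\text{skewness} = \frac{\frac{1}{N}\sum_{i=1}^{N}\left(x_i - \mu\right)^3}{\sigma^3}$$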

Skewness tells us how far the data distribution is from Gaussian. The larger the magnitude of the skewness, the further the dataset is from a symmetric Gaussian distribution.

Here’s how we can compute the Skewness in Scipy code:

scipy.stats.skew(array)
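As a hedged illustration, skewness can be computed by hand as the third standardized moment and compared with scipy.stats.skew, whose default setting (bias=True) uses the same population-style definition. The data is the five-number sample from the mean section, and the variable names are assumptions.

```python
import numpy as np
from scipy import stats

values = np.array([188, 2, 63, 13, 52], dtype=float)

mean_value = values.mean()
std_value = values.std()  # population SD (divides by the number of points)

# Third standardized moment: mean cubed distance from the mean, over SD cubed.
skew_manual = np.mean((values - mean_value) ** 3) / std_value ** 3

# scipy.stats.skew with its default bias=True computes the same quantity.
skew_scipy = stats.skew(values)

print(skew_manual, skew_scipy)
```

The result is positive here because the single large value 188 stretches the tail to the right.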

(5) Covariance and Correlation

Covariance

Covariance indicates whether two variables are “related” or not. A positive covariance means that when one variable increases, so does the other; a negative covariance means that when one increases, the other decreases.

Correlation

Correlation values lie between -1 and 1 and are calculated as the covariance divided by the product of the SDs of the two variables. A value of 1 means a perfect positive relationship: an increase in one variable always goes with an increase in the other. A negative value means an increase in one goes with a decline in the other, with -1 being a perfect negative relationship. A value near 0 means there is little linear relationship between the two.
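The relationship between covariance and correlation can be sketched as follows. The paired values x and y are made up for illustration and do not come from the article; the population forms (dividing by the number of points) are used so that they agree with numpy.std's default.

```python
import numpy as np

# Made-up paired measurements, for illustration only.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 4.0, 5.0, 4.0, 5.0])

# Covariance: average product of the two variables' distances from their means.
covariance = np.mean((x - x.mean()) * (y - y.mean()))

# Correlation: covariance divided by the product of the two SDs.
correlation = covariance / (np.std(x) * np.std(y))

# Library equivalents: bias=True makes numpy.cov divide by N instead of N - 1.
cov_numpy = np.cov(x, y, bias=True)[0, 1]
corr_numpy = np.corrcoef(x, y)[0, 1]

print(covariance, cov_numpy)
print(correlation, corr_numpy)
```

Unlike covariance, the correlation does not depend on the units or on whether N or N - 1 is used, since the same factor cancels between numerator and denominator.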

Conclusion: 

When doing PCA (Principal Component Analysis), knowing the above 5 concepts is useful: they help explain data effectively and summarize a dataset in terms such as correlation, which techniques like dimensionality reduction rely on. When most of the data can be described by its mean or median, much of the remaining detail can be ignored. If you want to learn data science, try the Imarticus Learning Academy, where careers in data science are made.
