• POST GRADUATE DIPLOMA IN MANAGEMENT
    Co-created with BIMTECH
    4.8 out of 5 by 6071 learners
    2x industry demand
  • PROFESSIONAL CERTIFICATION IN SUPPLY CHAIN MANAGEMENT AND ANALYTICS
    Co-created with IIT Roorkee
    4.8 out of 5 by 469 learners
    4x industry demand
  • CERTIFICATION IN ARTIFICIAL INTELLIGENCE and MACHINE LEARNING
    Co-created with E&ICT Academy, IIT Guwahati
    4.8 out of 5 by 621 learners
    4x industry demand
  • POST GRADUATE PROGRAM IN DATA ANALYTICS and MACHINE LEARNING
    4.8 out of 5 by 3278 learners
    14 X industry demand

Data Science is the practice of extracting insights and information from data. It is the science of going beyond the numbers to find real-world meaning and applications in the data. To extract the information embedded in complex datasets, Data Scientists use a wide range of techniques and tools for modelling, data exploration, and visualization.

Statistics, the most important mathematical tool in this work, offers a variety of well-validated techniques for such data exploration. Statistics is an application of mathematics that provides a concrete mathematical summary of data. Rather than examining every data point, it produces a small number of values that describe the properties of the whole dataset, such as its make-up and structure.

Here are five basic statistical techniques that are widely used and very effective in Data Science and its practical applications.

(1) Central Tendency

Central tendency is the typical value of a variable in the dataset. For example, if a normal distribution is centered at (110, 110) in the x-y plane, then (110, 110) is its central tendency: the value chosen as the typical summary of the dataset. This also tells us how biased the set is.

Two measures are commonly used to describe central tendency.

Mean:

The mean is the average value, the point around which the data is balanced. For example, take the five numbers 188, 2, 63, 13 and 52:

Mean = (188 + 2 + 63 + 13 + 52) / 5 = 318 / 5 = 63.6, the arithmetic average, as computed by NumPy and other Python libraries.

Median:

The median is the true middle value of the dataset once it is sorted, and it may differ from the mean. For the sample set above, sorting gives:

[2, 13, 52, 63, 188] → 52

The median and mean can be calculated using simple numpy Python one-liners:

numpy.median(array)

numpy.mean(array)
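As a small worked sketch of the two one-liners above, the snippet below applies them to the five sample numbers from this section. The variable names are illustrative only.

```python
import numpy as np

# The five sample numbers used in the mean and median examples.
values = np.array([188, 2, 63, 13, 52])

mean_value = np.mean(values)      # (188 + 2 + 63 + 13 + 52) / 5
median_value = np.median(values)  # middle element of the sorted array

print("sorted:", np.sort(values))
print("mean:", mean_value)
print("median:", median_value)
```

The single large value 188 pulls the mean well above the median, which is why the two measures can disagree on skewed data.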

(2) Spread

The spread of data shows whether the values cluster around a single value or are spread out across a range. In the accompanying figure, which plots Gaussian probability curves for a real-world dataset, the blue curve has a small spread, with data points confined to a narrow range, while the red curve has the largest spread. The figure also shows the standard deviation (SD) of each curve.

Standard Deviation:

This quantifies the spread of data and involves these 5 steps:

1. Calculate mean.

2. For each value calculate the square of its distance from the mean value.

3. Add all the values from Step 2.

4. Divide by the number of data points.

5. Calculate the square root.

In formula form, for N data points x_i with mean \mu:

$$\sigma = \sqrt{\frac{1}{N}\sum_{i=1}^{N}(x_i - \mu)^2}$$

Bigger values indicate greater spread. Smaller values mean the data is concentrated around the mean value.

In Numpy SD is calculated as

numpy.std(array)
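The five steps listed above can be sketched directly in NumPy and compared with the built-in call. The function name `standard_deviation` and the sample values are illustrative assumptions, not part of NumPy. Note that `numpy.std` divides by the number of data points by default (`ddof=0`), which matches Step 4.

```python
import numpy as np

def standard_deviation(values):
    """Population standard deviation, following the five steps in the text."""
    data = np.asarray(values, dtype=float)
    mean = data.sum() / data.size            # Step 1: calculate the mean
    squared_distances = (data - mean) ** 2   # Step 2: squared distance from the mean
    total = squared_distances.sum()          # Step 3: add the values from Step 2
    variance = total / data.size             # Step 4: divide by the number of points
    return np.sqrt(variance)                 # Step 5: take the square root

sample = [188, 2, 63, 13, 52]
print("step by step:", standard_deviation(sample))
print("numpy.std:   ", np.std(sample))
```

If a sample (rather than population) standard deviation is wanted, `numpy.std(array, ddof=1)` divides by one less than the number of data points instead.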

(3) Percentiles

A percentile describes where a data point sits within the range of values, and therefore whether it is low or high.

Saying that a value is at the pth percentile means that p% of the data lies below it and the remainder lies above it.

Take the set of 11 numbers below and arrange them in ascending order.

3, 1, 5, 9, 7, 11, 15, 13, 19, 17, 21. Sorted, the set is 1, 3, 5, 7, 9, 11, 13, 15, 17, 19, 21. Here 15 is at the 70th percentile: 7 of the other 10 values (70%) lie below 15 and the remaining 3 lie above it.

The 50th percentile in Numpy is calculated as

numpy.percentile(array, 50)
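As a quick check of the 11-number example, the sketch below asks NumPy for the 70th and 50th percentiles. It assumes NumPy's default linear interpolation between sorted values, so results are compared with a tolerance rather than exact equality. The variable names are illustrative.

```python
import numpy as np

# The 11 numbers from the example, in their original (unsorted) order.
data = np.array([3, 1, 5, 9, 7, 11, 15, 13, 19, 17, 21])

p70 = np.percentile(data, 70)  # value 70% of the way through the sorted data
p50 = np.percentile(data, 50)  # the 50th percentile is the median

values_below_15 = int(np.sum(data < 15))  # how many values are smaller than 15

print("70th percentile:", p70)
print("50th percentile:", p50)
print("values below 15:", values_below_15, "of", data.size)
```

`numpy.percentile` sorts the data internally, so the input does not need to be arranged in ascending order first.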

(4) Skewness

Skewness measures the asymmetry of the data. A positive skewness means the values are concentrated on the left, with a longer tail to the right, while a negative skewness means the values are concentrated on the right.

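Skewness is commonly calculated as the third central moment divided by the cube of the standard deviation. For N data points x_i with mean \mu and standard deviation \sigma:

$$g_1 = \frac{\frac{1}{N}\sum_{i=1}^{N}(x_i - \mu)^3}{\sigma^3}$$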

Skewness tells us how close the data distribution is to Gaussian. The larger the magnitude of the skewness, the further the dataset is from a Gaussian distribution.

Here’s how we can compute the Skewness in Scipy code:

scipy.stats.skew(array)
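To show what such a call computes, here is a minimal NumPy-only sketch of the moment-based (Fisher-Pearson) skewness coefficient. It assumes this is the definition wanted; it should agree with `scipy.stats.skew(array)` under that function's default settings. The helper name `skewness` is hypothetical.

```python
import numpy as np

def skewness(values):
    """Moment-based skewness: third central moment over variance ** 1.5."""
    data = np.asarray(values, dtype=float)
    deviations = data - data.mean()
    m2 = np.mean(deviations ** 2)  # second central moment (population variance)
    m3 = np.mean(deviations ** 3)  # third central moment
    return m3 / m2 ** 1.5

symmetric = [1, 2, 3, 4, 5]          # evenly balanced around the mean
right_tailed = [2, 13, 52, 63, 188]  # one large value stretches the right tail

print("symmetric:   ", skewness(symmetric))
print("right-tailed:", skewness(right_tailed))
```

The symmetric set gives zero skewness, while the set with a long right tail gives a positive value, matching the sign convention described above.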

(5) Covariance and Correlation

Covariance

The covariance indicates whether two variables are “related” or not. A positive covariance means that when one variable increases, so does the other; a negative covariance means that when one increases, the other decreases.

Correlation

Correlation values lie between -1 and 1 and are calculated as the covariance divided by the product of the standard deviations of the two variables. A correlation of 1 is a perfect positive relationship: an increase in one variable always comes with a proportional increase in the other. A negative correlation means an increase in one variable comes with a decline in the other, and -1 is a perfect inverse relationship.
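The relationship between the two quantities can be sketched as follows. The helper names and the sample data are hypothetical and chosen only for illustration. The sketch uses population formulas (dividing by N), so the covariance corresponds to `numpy.cov(x, y, bias=True)`, while the correlation should agree with `numpy.corrcoef`.

```python
import numpy as np

def covariance(x, y):
    """Population covariance: mean product of deviations from each mean."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    return np.mean((x - x.mean()) * (y - y.mean()))

def correlation(x, y):
    """Covariance divided by the product of the two standard deviations."""
    return covariance(x, y) / (np.std(x) * np.std(y))

# Hypothetical data: y_up rises with x, y_down falls as x rises.
x = np.array([1, 2, 3, 4, 5])
y_up = 2 * x + 1
y_down = -3 * x + 10

print("covariance (rising): ", covariance(x, y_up))
print("correlation (rising):", correlation(x, y_up))
print("correlation (falling):", correlation(x, y_down))
print("numpy.corrcoef check:", np.corrcoef(x, y_up)[0, 1])
```

Because correlation divides out the scale of each variable, it is easier to compare across datasets than raw covariance.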

Conclusion: 

Knowing these five concepts is useful when applying techniques such as Principal Component Analysis (PCA): they help explain the data effectively and summarize the dataset, for example in terms of correlation, in Dimensionality Reduction. When much of the data can be described by a mean or median value, the remaining detail can often be ignored. If you want to learn data science, try the Imarticus Learning Academy, where careers in data science are made.

About Imarticus
Imarticus Learning is India’s leading professional education institute, offering training in Financial Services, Data Analytics & Technology. We’ve successfully transformed the careers of more than 35,000 individuals globally through our Certification, Prodegree, and Post Graduate programs, offered in association with leading and renowned global organisations in the Financial Services, Data Analytics & Technology domain.