
STEP BY STEP GUIDE FOR STATISTICS FOR DATA ANALYTICS

In data analytics, statistics is essential because it offers methods and tools for understanding and interpreting data. Here are some essential statistical concepts that are frequently applied in data analytics:

  • Descriptive Statistics
  • Inferential Statistics
  • Probability
  • Sampling Methods
  • Correlation and Regression
  • Data Distributions
  • Hypothesis Testing

Descriptive Statistics

Descriptive statistics identify and describe the primary properties of a dataset. They give a brief overview of the data, enabling analysts to understand its distribution, variability, and central pattern. The following are a few typical measures in descriptive statistics:

  • Measures of Central Tendency
  • Measures of Variability
  • Measures of Shape and Distribution
  • Percentiles
  • Range
  • Frequency Distribution

Measures of Central Tendency

Measures of central tendency are employed to identify the average or centre value within a dataset. They represent the “centre” of the data distribution with a single value. The three standard measures of central tendency are:

  • Mean
  • Median 
  • Mode

Mean:
Consider the following dataset representing the scores of 10 students on a test:
{85, 90, 78, 92, 88, 95, 82, 86, 91, 89}. To find the mean, we sum up all the values and divide by the total number of observations:
(85 + 90 + 78 + 92 + 88 + 95 + 82 + 86 + 91 + 89) / 10 = 876 / 10 = 87.6. Therefore, the mean score is 87.6.

Median:
Let’s consider the same dataset as above:

{85, 90, 78, 92, 88, 95, 82, 86, 91, 89}.

To find the median, we first arrange the values in ascending order:
78, 82, 85, 86, 88, 89, 90, 91, 92, 95. Since there are 10 observations, an even number, the median is the average of the two middle values, the 5th (88) and the 6th (89): (88 + 89) / 2 = 88.5. Therefore, the median score is 88.5.

Mode:
Take a look at the dataset below, which shows the number of pets that each home in a neighbourhood owns:
{0, 1, 2, 2, 3, 1, 4, 0, 2, 1}.
To find the mode, we identify the value(s) that occur most frequently. In this case, both 1 and 2 occur three times, which makes them the modes of the dataset.
Therefore, the modes for this dataset are 1 and 2.
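
The three measures above can be checked with a few lines of Python. This is a minimal sketch that assumes Python 3.8 or later and uses only the standard-library statistics module; the variable names are illustrative and not part of the original guide.

```python
import statistics

# Test scores and pet counts taken from the examples above.
scores = [85, 90, 78, 92, 88, 95, 82, 86, 91, 89]
pets = [0, 1, 2, 2, 3, 1, 4, 0, 2, 1]

mean_score = statistics.mean(scores)      # sum of values / number of values
median_score = statistics.median(scores)  # averages the two middle values when n is even
pet_modes = statistics.multimode(pets)    # returns every value tied for most frequent

print("mean:", mean_score)
print("median:", median_score)
print("modes:", pet_modes)
```

multimode is used rather than mode because the pets dataset has two values tied for the highest frequency.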

Measures of Variability

Let’s use a sample dataset to describe measures of variability:
Look at the dataset below, which shows the heights (in centimeters) of 8 people: {165, 170, 172, 168, 175, 169, 171, 174}.

  • Variance
  • Standard Deviation

Variance

1. Calculate the mean of the heights:

(165 + 170 + 172 + 168 + 175 + 169 + 171 + 174) / 8 = 1364 / 8 = 170.5.

2. Subtract the mean from each value and square the differences:

(165 – 170.5)^2, (170 – 170.5)^2, (172 – 170.5)^2, (168 – 170.5)^2, (175 – 170.5)^2, (169 – 170.5)^2, (171 – 170.5)^2, (174 – 170.5)^2.

3. The squared differences are: (30.25, 0.25, 2.25, 6.25, 20.25, 2.25, 0.25, 12.25).

4. Calculate the sum of these squared differences:

30.25 + 0.25 + 2.25 + 6.25 + 20.25 + 2.25 + 0.25 + 12.25 = 74.

5. Divide the sum by the total number of observations minus 1 (8 – 1 = 7) to get the sample variance:

74 / 7 ≈ 10.5714.

Standard Deviation

1. Calculate the variance (as explained above): Variance ≈ 10.5714 square centimeters.

2. Take the square root of the variance to obtain the standard deviation: Square root of 10.5714 ≈ 3.2514.

Therefore, the standard deviation of the dataset is approximately 3.2514 centimeters.

  • Standard deviation measures the dispersion or spread of data points around the mean. In this example, it indicates that the heights typically differ from the mean height of 170.5 centimeters by about 3.25 centimeters.
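
The variance and standard deviation steps can be sketched in Python as a check on the hand arithmetic. This is a minimal illustration using the eight heights from the example; the variable names are assumptions made for this sketch, and the standard-library statistics.variance call is used only as a cross-check.

```python
import math
import statistics

# Heights (in centimeters) from the worked example.
heights = [165, 170, 172, 168, 175, 169, 171, 174]

# Step 1: mean.
mean_height = sum(heights) / len(heights)

# Steps 2-4: squared differences from the mean and their sum.
squared_diffs = [(h - mean_height) ** 2 for h in heights]
sum_squared = sum(squared_diffs)

# Step 5: sample variance divides by n - 1, not n.
sample_variance = sum_squared / (len(heights) - 1)

# Standard deviation is the square root of the variance.
sample_sd = math.sqrt(sample_variance)

# Cross-check against the standard library, which also uses n - 1.
assert math.isclose(sample_variance, statistics.variance(heights))

print(f"mean = {mean_height}, variance = {sample_variance:.4f}, sd = {sample_sd:.4f}")
```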

Measures of Shape and Distribution:

Shape and distribution measures give information about the distributional properties of a dataset. They aid in understanding the data’s symmetry, peakedness, and tail behavior. Here are two typical measures of distribution and shape:

  • Skewness
  • Kurtosis

Skewness:

Positive Skewness:


Consider a dataset with the following values for the income levels of a population’s members (in thousands of dollars): {20, 25, 30, 35, 40, 45, 50, 150}.
The distribution has a longer tail on the right side, indicating positive skewness.
The majority of people’s incomes are concentrated in the lower range, whereas only a small number of people have much higher incomes.
In this positively skewed distribution, the mean income (49.375) is higher than the median income (37.5).

Negative Skewness:


Let’s consider an illustrative dataset representing the waiting times (in minutes) at a doctor’s office:
{50, 45, 40, 35, 30, 25, 20, 1}.
The distribution has a longer tail on the left side, indicating negative skewness.
Most individuals wait between 20 and 50 minutes, while one unusually short wait of 1 minute stretches the tail to the left.
The mean waiting time (30.75) is lower than the median waiting time (32.5) in this negatively skewed distribution.

Skewness of Zero:

Consider a dataset representing the heights (in centimeters) of individuals in a sample:

{160, 165, 170, 175, 180, 185}.

The distribution is symmetric, with no imbalance in the tails.

The data are evenly distributed around the mean (172.5), which equals the median, so the skewness is zero.

These examples illustrate the concept of skewness in different datasets. Positive skewness indicates a longer or fatter tail on the right side, negative skewness indicates a longer or fatter tail on the left side, and skewness of zero suggests a symmetric distribution.
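
The sign of the skewness can be computed rather than judged by eye. The sketch below uses one common definition, the third central moment divided by the variance raised to the power 1.5 (moments divided by n); other software may apply a sample-size correction and report slightly different values. The function name and the left_tailed list are assumptions made for illustration only.

```python
def skewness(values):
    """Moment-based skewness: m3 / m2**1.5, with moments divided by n."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((x - mean) ** 2 for x in values) / n  # second central moment
    m3 = sum((x - mean) ** 3 for x in values) / n  # third central moment
    return m3 / m2 ** 1.5

incomes = [20, 25, 30, 35, 40, 45, 50, 150]   # long right tail
heights = [160, 165, 170, 175, 180, 185]      # symmetric
left_tailed = [50, 45, 40, 35, 30, 25, 20, 1] # hypothetical list with a long left tail

for name, data in [("incomes", incomes), ("heights", heights), ("left_tailed", left_tailed)]:
    print(name, round(skewness(data), 4))
```

A positive result corresponds to a right tail, a negative result to a left tail, and a result of zero to a symmetric dataset.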

 

Kurtosis

Consider a dataset representing the returns (in percentages) of a stock over a certain period: {-3, 2, 1, -1, 0, 1, 4, 2}.

To calculate the kurtosis:

Calculate the mean:

(-3 + 2 + 1 + -1 + 0 + 1 + 4 + 2) / 8 = 6 / 8 = 0.75.

Calculate the variance (dividing by the number of observations, 8, as is usual for moment-based kurtosis):

Variance = (1/8) * ((-3 – 0.75)^2 + (2 – 0.75)^2 + (1 – 0.75)^2 + (-1 – 0.75)^2 + (0 – 0.75)^2 + (1 – 0.75)^2 + (4 – 0.75)^2 + (2 – 0.75)^2) = 31.5 / 8 = 3.9375.

Calculate the fourth moment:

Fourth Moment = (1/8) * ((-3 – 0.75)^4 + (2 – 0.75)^4 + (1 – 0.75)^4 + (-1 – 0.75)^4 + (0 – 0.75)^4 + (1 – 0.75)^4 + (4 – 0.75)^4 + (2 – 0.75)^4) = 323.90625 / 8 ≈ 40.4883.

Calculate the kurtosis:

Kurtosis = (Fourth Moment / Variance^2) – 3 ≈ (40.4883 / (3.9375^2)) – 3 ≈ 2.6115 – 3 ≈ -0.3885.

Therefore, the excess kurtosis of the dataset is approximately -0.3885.

Kurtosis measures how heavy the tails of a distribution are compared to a normal distribution, and is often described loosely as peakedness or flatness. In this example, the negative value suggests that the distribution has a flatter peak and lighter tails than a normal distribution, so extreme returns are less prominent in this sample. A positive value would instead indicate a sharper peak and heavier tails, and a higher likelihood of extreme returns or outliers.
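
The kurtosis steps above can be reproduced in a short Python sketch. It follows the same definition as the worked example (fourth central moment divided by the squared variance, minus 3, with both moments divided by n); the function name is an assumption for illustration, and statistical packages that apply bias corrections may report a somewhat different figure.

```python
def excess_kurtosis(values):
    """Excess kurtosis: m4 / m2**2 - 3, with moments divided by n."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((x - mean) ** 2 for x in values) / n  # variance (population form)
    m4 = sum((x - mean) ** 4 for x in values) / n  # fourth central moment
    return m4 / m2 ** 2 - 3

# Stock returns (in percent) from the example.
returns = [-3, 2, 1, -1, 0, 1, 4, 2]

kurt = excess_kurtosis(returns)
print(f"excess kurtosis = {kurt:.4f}")
```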

Percentile:

The 75th percentile is the value below which 75% of the data falls. This means that 75% of the data values in the dataset are less than or equal to the 75th percentile.
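
As a rough illustration, the sketch below computes a percentile using linear interpolation between the two closest ranks, which is one common convention among several; other conventions can give slightly different answers on small datasets. The function name is hypothetical, and the test scores are reused from the mean and median examples.

```python
def percentile(values, p):
    """Percentile p (0-100) by linear interpolation between closest ranks."""
    ordered = sorted(values)
    position = (len(ordered) - 1) * p / 100  # fractional index into the sorted data
    lower = int(position)
    fraction = position - lower
    if lower + 1 >= len(ordered):
        return float(ordered[-1])
    return ordered[lower] + fraction * (ordered[lower + 1] - ordered[lower])

scores = [85, 90, 78, 92, 88, 95, 82, 86, 91, 89]

p75 = percentile(scores, 75)
p50 = percentile(scores, 50)  # the 50th percentile is the median
print("75th percentile:", p75)
print("50th percentile:", p50)
```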

Range

Range, a simple measure of variability, is the difference between a dataset’s maximum and minimum values. It gives a basic sense of how widely the data are spread.

Use these steps to determine a dataset’s range:

1. Sort the dataset in ascending order.

2. Subtract the minimum value from the maximum value.

For example, consider the following dataset representing the heights (in centimeters) of a group of individuals: {160, 165, 170, 175, 180, 185}.

To calculate the range:

Sort the dataset in ascending order: {160, 165, 170, 175, 180, 185}.

Subtract the minimum value (160) from the maximum value (185):

Range = 185 – 160 = 25.

Therefore, the range of the dataset is 25 centimeters. It indicates that there is a 25-centimeter difference between the tallest and shortest individuals in the group.

While the range provides a simple measure of variability, it has limitations as it only considers the extremes of the dataset. It does not provide information about the distribution or variability within the dataset. Other measures, such as variance and standard deviation, offer more comprehensive insights into the spread of data.
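
The range calculation is short enough to sketch in a couple of lines of Python. Sorting helps when working by hand, but the built-in max and min functions find the extremes directly; the variable names here are illustrative.

```python
# Heights (in centimeters) from the range example.
heights = [160, 165, 170, 175, 180, 185]

# Range = maximum value - minimum value.
data_range = max(heights) - min(heights)
print("range:", data_range, "cm")
```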

 

Frequency Distribution

Consider the following dataset representing the scores of students in a class: {75, 80, 65, 70, 85, 90, 75, 80, 75, 85, 80, 70, 75}.

To create a frequency distribution table:

Sort the dataset in ascending order: {65, 70, 70, 75, 75, 75, 75, 80, 80, 80, 85, 85, 90}.

Identify the unique values in the dataset and count how many times each value appears (its frequency).

Value | Frequency
65 | 1
70 | 2
75 | 4
80 | 3
85 | 2
90 | 1

This frequency distribution table shows the unique values in the dataset and their corresponding frequencies. For example, the value 75 appears four times, while the values 65 and 90 appear only once.

Frequency distribution tables provide a summary of the distribution of values in a dataset. They help identify the most common values and provide an overview of the frequency or occurrence of different values. This information can be useful for further analysis and understanding of the dataset.
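
A frequency distribution like the one above can be built with collections.Counter from the Python standard library. This is a minimal sketch using the class scores from the example; the output layout simply mirrors the Value | Frequency table.

```python
from collections import Counter

# Class scores from the frequency distribution example.
scores = [75, 80, 65, 70, 85, 90, 75, 80, 75, 85, 80, 70, 75]

# Counter maps each unique value to the number of times it occurs.
frequencies = Counter(scores)

print("Value | Frequency")
for value in sorted(frequencies):
    print(f"{value} | {frequencies[value]}")
```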

Inferential Statistics

Inferential statistics is the area of statistics that deals with drawing conclusions about a wider population based on sample data. It helps investigators generalise from small datasets and make predictions. Here is an illustration of the idea of inferential statistics:
Let’s imagine a business wishes to gauge how satisfied its staff members are with their jobs. There are 10,000 people working for the company. Rather than conducting a time-consuming and impractical survey of every employee, the company chooses to gather information from a randomly selected sample of 500 employees.

Sampling:

Out of a total workforce of 10,000 employees, 500 are chosen at random by the company. Because every employee has an equal chance of being chosen, bias in the sample is reduced.
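
Simple random sampling of this kind can be sketched with the standard-library random module. The employee IDs 1 to 10,000 and the fixed seed are assumptions made so the illustration is reproducible; a real survey would draw from the company’s actual staff list.

```python
import random

# Hypothetical employee IDs 1..10,000 standing in for the real staff list.
employee_ids = range(1, 10_001)

# A seeded generator makes the draw repeatable for this illustration.
rng = random.Random(42)

# sample() draws without replacement, so no employee is picked twice
# and every employee has the same chance of selection.
sample_ids = rng.sample(employee_ids, k=500)

print(len(sample_ids), "employees selected; first five:", sample_ids[:5])
```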

Data collection:

A job satisfaction survey is given to the 500 employees who were chosen. The survey includes questions about respondents’ general job satisfaction, work-life balance, pay, opportunities for career advancement, and other pertinent issues.

Descriptive Statistics

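As a second illustration, suppose a car manufacturer tests the fuel efficiency of a sample of 100 cars out of 1,000. Descriptive statistics are used to summarise the sample data: for the 100 cars in the sample, the manufacturer computes the mean, median, and standard deviation of the fuel efficiency figures.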

Inferential Statistics:

Based on the sample data, inferential statistics are used to make inferences about the typical fuel efficiency of the 1,000 cars.

Confidence Intervals:

To estimate the range within which the actual population mean fuel economy is likely to lie, the manufacturer calculates a confidence interval, such as a 95% confidence interval. Suppose they find that the sample’s average fuel economy is 50 MPG, with a 95% confidence interval ranging from 48 MPG to 52 MPG. They can then be 95% confident that the true population mean fuel efficiency falls within this range.
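
A confidence interval of this kind can be sketched as mean ± z × (standard deviation / √n), using the normal approximation. In the sketch below the sample mean (50 MPG) and sample size (100) come from the example, but the standard deviation of 10.2 MPG is a hypothetical value chosen only so that the interval comes out near 48 to 52; the function name is also an assumption. For small samples a t-distribution is usually preferred over the normal approximation.

```python
import math
from statistics import NormalDist

def mean_confidence_interval(sample_mean, sample_sd, n, confidence=0.95):
    """Normal-approximation confidence interval for a population mean."""
    # z is about 1.96 for a 95% interval.
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    margin = z * sample_sd / math.sqrt(n)  # z times the standard error
    return sample_mean - margin, sample_mean + margin

# sample_sd=10.2 is a hypothetical figure, not given in the example.
low, high = mean_confidence_interval(sample_mean=50.0, sample_sd=10.2, n=100)
print(f"95% CI: {low:.2f} to {high:.2f} MPG")
```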

Example

Let’s use an example to work through these calculations. Let’s say we’ve gathered the MPG (miles per gallon) figures for a sample of 52 cars. These are the data:

45, 42, 50, 48, 43, 47, 52, 55, 49, 51, 46, 44, 53, 47, 49, 48, 50, 45, 46, 52, 50, 48, 45, 47, 51, 49, 52, 54, 46, 47, 49, 48, 50, 45, 43, 51, 48, 50, 47, 49, 46, 52, 49, 44, 47, 48, 51, 50, 48, 46, 49, 51.

Step 1: Descriptive Statistics
Let’s figure out the sample’s mean, median, and standard deviation.

Mean: The sum of all the values divided by the total number of values.
Sum = 2,512
Mean = Sum / Number of values = 2,512 / 52 ≈ 48.31 MPG

Median: Arrange the values in ascending order and find the middle value:
42, 43, 43, 44, 44, 45, 45, 45, 45, 46, 46, 46, 46, 46, 47, 47, 47, 47, 47, 47,
48, 48, 48, 48, 48, 48, 48, 49, 49, 49, 49, 49, 49, 49, 50, 50, 50, 50, 50, 50,
51, 51, 51, 51, 51, 52, 52, 52, 52, 53, 54, 55
With 52 values, the median is the average of the 26th and 27th values, both of which are 48.
Median = 48 MPG

Standard Deviation:
Calculate the deviation of each value from the mean:
Deviation = Value – Mean
Using the example data and the rounded mean of 48.31, the deviations are:

(-3.31), (-6.31), 1.69, (-0.31), (-5.31), (-1.31), 3.69, 6.69, 0.69, 2.69,
(-2.31), (-4.31), 4.69, (-1.31), 0.69, (-0.31), 1.69, (-3.31), (-2.31), 3.69,
1.69, (-0.31), (-3.31), (-1.31), 2.69, 0.69, 3.69, 5.69, (-2.31), (-1.31),
0.69, (-0.31), 1.69, (-3.31), (-5.31), 2.69, (-0.31), 1.69, (-1.31), 0.69,
(-2.31), 3.69, 0.69, (-4.31), (-1.31), (-0.31), 2.69, 1.69, (-0.31), (-2.31),
0.69, 2.69.
Square each deviation:
Squared Deviation = Deviation^2
Squaring each deviation, we have:

10.9561, 39.8161, 2.8561, 0.0961, 28.1961, 1.7161, 13.6161, 44.7561, 0.4761, 7.2361,
5.3361, 18.5761, 21.9961, 1.7161, 0.4761, 0.0961, 2.8561, 10.9561, 5.3361, 13.6161,
2.8561, 0.0961, 10.9561, 1.7161, 7.2361, 0.4761, 13.6161, 32.3761, 5.3361, 1.7161,
0.4761, 0.0961, 2.8561, 10.9561, 28.1961, 7.2361, 0.0961, 2.8561, 1.7161, 0.4761,
5.3361, 13.6161, 0.4761, 18.5761, 1.7161, 0.0961, 7.2361, 2.8561, 0.0961, 5.3361,
0.4761, 7.2361.

Calculate the sum of squared deviations:
Sum of Squared Deviations = Σ(Squared Deviation)
Adding up the squared deviations, we get:
Sum of Squared Deviations ≈ 425.08

Divide the sum of squared deviations by the number of values minus 1, because the data are a sample:
Variance = Sum of Squared Deviations / (Number of values – 1) = 425.08 / 51 ≈ 8.33

Take the square root of the variance to obtain the standard deviation:
Standard Deviation = Square root of 8.33 ≈ 2.89 MPG
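
With this many values, hand arithmetic is easy to get wrong, so a short Python cross-check is worthwhile. The sketch below recomputes the descriptive statistics directly from the MPG list as printed in the example, which contains 52 values. It divides by n – 1 to obtain a sample variance, which is an assumption about the intended formula; the variable names are illustrative.

```python
import statistics

# MPG figures exactly as listed in the example (52 values).
mpg = [
    45, 42, 50, 48, 43, 47, 52, 55, 49, 51, 46, 44, 53, 47, 49, 48, 50, 45,
    46, 52, 50, 48, 45, 47, 51, 49, 52, 54, 46, 47, 49, 48, 50, 45, 43, 51,
    48, 50, 47, 49, 46, 52, 49, 44, 47, 48, 51, 50, 48, 46, 49, 51,
]

n = len(mpg)
total = sum(mpg)
mean_mpg = total / n
median_mpg = statistics.median(mpg)

# Sum of squared deviations from the (unrounded) mean.
sum_squared = sum((x - mean_mpg) ** 2 for x in mpg)

# Sample variance (n - 1 in the denominator) and standard deviation.
sample_variance = sum_squared / (n - 1)
sample_sd = sample_variance ** 0.5

print(f"n = {n}, sum = {total}, mean = {mean_mpg:.2f}, median = {median_mpg}")
print(f"sum of squares = {sum_squared:.2f}, variance = {sample_variance:.2f}, sd = {sample_sd:.2f}")
```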
