# Basic Concepts of Statistics for Data Scientists and Analysts

## What is statistics?

Statistics is the science concerned with developing and studying methods for collecting, analyzing, interpreting and presenting empirical data.

Statistics keeps us informed about what is happening in the world around us. It matters because we live in an information age, and much of that information is determined mathematically with the help of statistics. To stay correctly informed, sound data and a grasp of statistical concepts are necessary.

A key advantage of statistics is that it presents information in an easily digestible form.

Statistics is a well-known subject that focuses on data collection, data management, data analysis, data interpretation, and data visualisation, among other things. Previously, statisticians, economists, and business leaders used statistics to calculate and present significant data in their fields. Statistics is now used in a variety of roles and industries, including data science, machine learning, data analysis, business intelligence, and computer science.

Certain statistical concepts, such as central tendency and standard deviation, are introduced early in most curricula, but there are many more statistical ideas that data science and machine learning practitioners must learn and apply. Let’s go over the fundamental terminology and categories of statistics.

## Why is it important to understand statistics concepts?

Statistical literacy underpins every stage of data work: framing questions, collecting and cleaning data, choosing appropriate methods, interpreting model outputs, and communicating results with an honest level of confidence.

## Basic statistics concepts for becoming a data scientist

Data scientists are in high demand, and in some cases they are taking over legacy statistician roles. While a career in data science might sound attractive and attainable, prospective data scientists should consider their comfort with statistics before planning their next step.

## Understand the Type of Analytics

**Descriptive analytics**

Descriptive analytics is a field of statistics that focuses on gathering and summarizing raw data so it can be easily interpreted. Generally, descriptive analytics concentrates on historical data, providing the context that is vital for understanding information and numbers.

**Diagnostic Analytics**

Diagnostic Analytics is a form of advanced analytics that examines data or content to answer the question, “Why did it happen?” It is characterized by techniques such as drill-down, data discovery, data mining and correlations.

**Predictive analytics**

Predictive analytics uses historical data, statistical models, and machine learning to answer the question, “What is likely to happen?”

**Prescriptive analytics**

Prescriptive analytics goes a step further and answers the question, “What should we do about it?” by recommending actions based on those predictions.

## Probability

Probability is the measure of the likelihood that an event will occur in a random experiment. For example, if you toss a fair coin four times, the outcome may not be exactly two heads and two tails. But suppose you tossed a coin 2,000 times and got 996 heads: the fraction 996/2,000 equals 0.498, which is very close to 0.5, the expected probability.

Statistical formulas related to probability are used in many ways, including actuarial charts for insurance companies, the likelihood of the occurrence of a genetic disease, political polling, and clinical trials, according to Britannica.
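The long-run behaviour described above is easy to check with a quick simulation. This is a minimal Python sketch using only the standard library; the seed is arbitrary and chosen only for reproducibility.

```python
import random

random.seed(42)  # arbitrary seed, for reproducibility

n_tosses = 2000
# Each toss comes up heads with probability 0.5
heads = sum(random.random() < 0.5 for _ in range(n_tosses))

print(heads / n_tosses)  # the proportion of heads should be close to 0.5
```

By the law of large numbers, the observed proportion gets closer and closer to 0.5 as the number of tosses grows.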

**Conditional probability**

Conditional probability is the probability of one event occurring given that another event has already occurred. It is written P(A|B) = P(A∩B) / P(B): the joint probability of both events divided by the probability of the conditioning event.
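Because a pair of dice has only 36 equally likely outcomes, the formula can be verified by brute-force enumeration. A small Python sketch (the events A and B here are my own illustrative choices):

```python
from itertools import product

# Enumerate all 36 equally likely outcomes of rolling two dice
outcomes = list(product(range(1, 7), repeat=2))

# A: the sum is at least 8; B: the first die shows a 5
B = [o for o in outcomes if o[0] == 5]
A_and_B = [o for o in outcomes if sum(o) >= 8 and o[0] == 5]

p_B = len(B) / 36              # 6/36
p_A_and_B = len(A_and_B) / 36  # (5,3), (5,4), (5,5), (5,6) -> 4/36
p_A_given_B = p_A_and_B / p_B  # (4/36) / (6/36) = 4/6
```

Here P(A|B) = (4/36) / (6/36) = 4/6 ≈ 0.667, matching the definition above.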

**Independent event**

Independent events are events whose occurrence does not affect the probability of any other event occurring. For example, the outcome of a second coin flip does not depend on the outcome of the first. For independent events A and B, P(A∩B) = P(A) · P(B).

**Mutually exclusive events**

Two events are mutually exclusive if they cannot both occur at the same time. In that case, P(A∩B) = 0 and P(A∪B) = P(A) + P(B).

**Bayes’ theorem**

Bayes’ theorem, named after Thomas Bayes, describes the probability of an event based on prior knowledge of conditions that might be related to the event: P(A|B) = P(B|A) · P(A) / P(B).
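The classic illustration is a medical screening test. The numbers below are assumed for illustration only (they do not come from the text): a 1% prevalence, 95% sensitivity, and a 5% false-positive rate.

```python
# Assumed illustrative numbers (not from the text):
p_disease = 0.01             # prior: 1% of the population has the disease
p_pos_given_disease = 0.95   # test sensitivity
p_pos_given_healthy = 0.05   # false-positive rate

# Law of total probability: overall chance of a positive test
p_pos = (p_pos_given_disease * p_disease
         + p_pos_given_healthy * (1 - p_disease))

# Bayes' theorem: P(disease | positive test)
p_disease_given_pos = p_pos_given_disease * p_disease / p_pos
```

Even with a fairly accurate test, the posterior probability works out to only about 16%, because the disease is rare; this is exactly why the prior matters.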

## Measures of Central Tendency

A measure of central tendency is a single value that attempts to describe a set of data by identifying the central position within that set of data.

The mean is the arithmetic average of the data: the sum of the values divided by the number of values.

The median is the middle value when the data are arranged in ascending order (or the average of the two middle values when the count is even).

The mode is the value that appears most often in the data.
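Python’s standard-library statistics module computes all three. A minimal sketch with made-up data:

```python
import statistics

data = [2, 3, 3, 5, 7, 10]

mean = statistics.mean(data)      # 30 / 6 = 5.0
median = statistics.median(data)  # average of the two middle values: (3 + 5) / 2
mode = statistics.mode(data)      # 3 appears most often
```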

## Variability

Variability refers to how spread out the scores in a distribution are; that is, it refers to the amount of spread of the scores around the mean.

**Range**

The difference between the highest and lowest value in the dataset.

**Percentiles, Quartiles and Interquartile Range (IQR)**

**Percentiles**

A measure that indicates the value below which a given percentage of observations in a group of observations falls.

**Quartiles**

Values that divide the ordered data points into four more or less equal parts, or quarters.

**Interquartile Range (IQR)**

A measure of statistical dispersion and variability based on dividing a data set into quartiles. IQR = Q3 − Q1.

**Variance:**

The average squared difference of the values from the mean; it measures how spread out a set of data is relative to its mean.

**Standard Deviation:**

The standard deviation is a statistic that measures the dispersion of a dataset relative to its mean and is calculated as the square root of the variance.
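All of these spread measures are available in Python’s statistics module. A sketch with made-up data; note that `variance` and `stdev` here are the *sample* versions, dividing by n − 1:

```python
import statistics

data = [2, 4, 4, 4, 5, 5, 7, 9]

data_range = max(data) - min(data)            # 9 - 2 = 7
q1, q2, q3 = statistics.quantiles(data, n=4)  # the three quartiles
iqr = q3 - q1                                 # interquartile range
variance = statistics.variance(data)          # sample variance
std_dev = statistics.stdev(data)              # square root of the variance
```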

## Relationship Between Variables

The statistical relationship between two variables is referred to as their correlation. A correlation can be positive, meaning both variables move in the same direction, or negative, meaning that when one variable’s value increases, the other’s decreases.

**Causality**:

Relationship between two events where one event is affected by the other.

**Covariance**:

A quantitative measure of the joint variability between two variables.

**Correlation**:

Measures the strength of the linear relationship between two variables and ranges from −1 to 1; it is the normalized version of covariance.
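Both quantities can be computed directly from their definitions. A minimal pure-Python sketch with made-up data:

```python
import statistics

x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

n = len(x)
mx, my = statistics.mean(x), statistics.mean(y)

# Sample covariance: average product of deviations from the means
cov = sum((a - mx) * (b - my) for a, b in zip(x, y)) / (n - 1)

# Pearson correlation: covariance normalized by the standard deviations
corr = cov / (statistics.stdev(x) * statistics.stdev(y))
```

Dividing the covariance by the product of the standard deviations rescales it into the unit-free range [−1, 1].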

## Probability Distribution

In probability theory and statistics, a probability distribution is a mathematical function that gives the probabilities of occurrence of different possible outcomes for an experiment.

**Bernoulli distribution**

A Bernoulli distribution is a discrete probability distribution for a Bernoulli trial, a random experiment that has only two outcomes (usually called a “success” or a “failure”). The expected value of a Bernoulli random variable X is E[X] = p. For example, if p = 0.4, then E[X] = 0.4.

**Uniform distribution**

In statistics, a uniform distribution is a probability distribution in which all outcomes are equally likely. A deck of cards gives a uniform distribution over suits, because drawing a heart, a club, a diamond, or a spade is equally likely.

**Binomial distribution**

The binomial distribution describes the number of successes in a fixed number of independent trials, each of which has exactly two possible outcomes (the prefix “bi” means two, or twice). For example, a coin toss has only two possible outcomes, heads or tails, and taking a test could have two possible outcomes, pass or fail; each trial counts as either a (S)uccess or a (F)ailure.
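A binomial random variable can be simulated as a sum of Bernoulli trials, which also shows how the two distributions above are connected. A minimal sketch, with an arbitrary seed for reproducibility:

```python
import random

random.seed(0)  # arbitrary seed, for reproducibility

def binomial_sample(n: int, p: float) -> int:
    """Number of successes in n independent Bernoulli(p) trials."""
    return sum(random.random() < p for _ in range(n))

# Simulate many "10 fair coin tosses" experiments
samples = [binomial_sample(10, 0.5) for _ in range(10_000)]
avg = sum(samples) / len(samples)  # should be near the mean n * p = 5
```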

**Normal distribution**

Normal distribution, also known as the Gaussian distribution, is a probability distribution that is symmetric about the mean, showing that data near the mean are more frequent in occurrence than data far from the mean. In graph form, normal distribution will appear as a bell curve.
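The bell-curve shape can be checked empirically: for a normal distribution, roughly 68% of values fall within one standard deviation of the mean. A quick standard-library sketch:

```python
import random

random.seed(1)  # arbitrary seed, for reproducibility

# Draw 10,000 samples from a standard normal (mean 0, standard deviation 1)
samples = [random.gauss(0.0, 1.0) for _ in range(10_000)]

# Fraction of samples within one standard deviation of the mean
within_1sd = sum(-1.0 <= s <= 1.0 for s in samples) / len(samples)
print(within_1sd)  # should be near 0.68
```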

**Poisson distribution**

In statistics, a Poisson distribution is a probability distribution used to model how many times an event is likely to occur over a specified period, given a known average rate.
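The Poisson probability mass function is P(X = k) = λᵏ·e^(−λ) / k!. A minimal sketch; the shop example and its rate are made up for illustration:

```python
from math import exp, factorial

def poisson_pmf(k: int, lam: float) -> float:
    """P(X = k) for a Poisson distribution with mean rate lam."""
    return lam ** k * exp(-lam) / factorial(k)

# If a shop averages 3 customers per hour, the probability of
# seeing exactly 5 customers in a given hour:
p5 = poisson_pmf(5, 3.0)  # roughly 0.10
```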

## Hypothesis testing

Hypothesis testing in statistics is a formal way to check whether the results of a survey or experiment are meaningful: you state a null hypothesis, compute a test statistic from the data, and judge how unlikely the observed result would be if the null hypothesis were true.
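As a worked example, we can test whether the 996-heads-in-2,000-tosses result from the probability section is consistent with a fair coin, using an exact two-sided binomial test. This is a from-scratch sketch, not a reference implementation; for real work you would reach for a library such as scipy.

```python
from math import comb

def exact_binom_test_fair(k_obs: int, n: int) -> float:
    """Two-sided exact binomial test of H0: the coin is fair (p = 0.5).

    With p = 0.5 the pmf is proportional to C(n, k), so we can sum
    exact integer weights and divide by 2**n only once at the end.
    """
    weights = [comb(n, k) for k in range(n + 1)]
    w_obs = weights[k_obs]
    # Total probability of every outcome at least as extreme as k_obs
    return sum(w for w in weights if w <= w_obs) / 2 ** n

p_value = exact_binom_test_fair(996, 2000)
print(p_value)  # well above 0.05: no evidence against a fair coin
```

A large p-value means the observed count is entirely plausible under the null hypothesis, so we fail to reject the claim that the coin is fair.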