Data Science – Intro to Statistics

Introduction to Statistics

Statistics is the science of analyzing data. In any field, whether it’s healthcare, finance, or marketing, understanding data is key to making informed decisions. A model or prediction built from data is only valuable if it can be trusted, so its reliability must be assessed before it is used in real-world applications.

Descriptive Statistics

Descriptive statistics is a method used to summarize and describe important features of a data set. It provides an overview of the main aspects of the data, helping to identify patterns, outliers, and trends. Some common measures used in descriptive statistics include:

Count: The total number of observations or entries in the dataset.
Sum: The total value of all the data points combined.
Standard Deviation: A measure of how spread out the values are around the mean.
Percentile: A value below which a given percentage of observations fall.
Average (Mean): The sum of all values divided by the number of values.

These are the basic statistical measures that help us to become familiar with the data and draw initial insights.
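Each of the measures listed above can also be computed one at a time with pandas. The following is a minimal sketch; the ages values are made up for illustration and are not taken from the health dataset.

```python
import pandas as pd

# Hypothetical sample values, for illustration only
ages = pd.Series([20, 30, 32, 35, 50])

count = ages.count()           # number of non-missing observations
total = ages.sum()             # sum of all values
mean = ages.mean()             # total divided by count
std = ages.std()               # sample standard deviation (divides by n - 1)
median = ages.quantile(0.50)   # 50th percentile: half the values lie below it

print(count, total, mean, std, median)
```

Note that pandas computes the sample standard deviation by default; passing ddof=0 to std() gives the population version instead.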

Using Python for Descriptive Statistics

In Python, we can call the describe() method on a pandas DataFrame to quickly summarize a dataset. This method computes the common descriptive statistics for each numerical column, providing a high-level overview of the data.

import pandas as pd

# Assuming 'full_health_data' is a pandas DataFrame containing the health dataset
print(full_health_data.describe())

This code prints descriptive statistics for every numerical column in full_health_data. Note that describe() does not report a sum; column totals can be computed separately with full_health_data.sum().

Sample Output:

               Age   Weight   Height
    count      100      100      100
    mean        32       80      170
    std          5       10        5
    min         20       60      160
    25%         30       75      165
    50%         32       80      170
    75%         35       85      175
    max         50      120      180

By calling describe(), we get the count, average (mean), standard deviation (std), minimum, maximum, and the 25th, 50th, and 75th percentiles for each column of data. This helps us understand the data’s distribution and central tendency.
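A minimal sketch for trying these calls without the health dataset: a small DataFrame with invented values stands in for full_health_data (the name df and all numbers are assumptions made for illustration). It prints the summary table and also computes the column totals with sum().

```python
import pandas as pd

# Hypothetical stand-in for the health dataset; all values are invented
df = pd.DataFrame({
    "Age": [20, 30, 32, 35, 50],
    "Weight": [60, 75, 80, 85, 120],
    "Height": [160, 165, 170, 175, 180],
})

summary = df.describe()   # summary statistics, one column per numeric column
totals = df.sum()         # column totals, computed separately

print(summary)
print(totals)
```

A single statistic can be read from the summary with label-based indexing, for example summary.loc["mean", "Age"].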

Do You See Anything Interesting Here?

Looking at the output, you may notice trends or anomalies. For example, you could examine the standard deviation to understand how spread out the data is, or you could look at the percentiles to assess the distribution of the data and detect outliers. These are the first steps toward analyzing the data and forming hypotheses.
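As one way to turn percentiles into an outlier check, a common rule of thumb (not prescribed by this tutorial) flags values that lie more than 1.5 times the interquartile range below the 25th percentile or above the 75th percentile. The sketch below applies it to made-up weights; the data and the 1.5 multiplier are assumptions chosen for illustration.

```python
import pandas as pd

# Hypothetical weights; 60 and 120 sit far from the rest
weights = pd.Series([60, 75, 78, 80, 82, 85, 120])

q1 = weights.quantile(0.25)   # 25th percentile
q3 = weights.quantile(0.75)   # 75th percentile
iqr = q3 - q1                 # interquartile range

lower = q1 - 1.5 * iqr        # values below this are flagged
upper = q3 + 1.5 * iqr        # values above this are flagged

outliers = weights[(weights < lower) | (weights > upper)]
print(outliers.tolist())
```

A flagged value is only a candidate for closer inspection; whether it is an error or a genuine extreme depends on the data.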
