
The Math behind Skewness and Kurtosis in Qlik

Igor Alcantara

On October 15th, 2024, a Qlik Community user (Martin) asked a very interesting question: what are the exact mathematical formulas for Skewness and Kurtosis in Qlik? Check the original post here.


Since that information is not available in the Qlik documentation, I decided to do some reverse engineering. My approach was to pick a technology whose formulas are documented, run the same calculations there, and compare the results with the numbers I got in Qlik. What technology? The R Language! Check it all in the video below.




For the exercise shown in the video, I used the Student Performance Factor dataset, which you can download here. An explanation of each concept follows:


Skewness


Skewness measures the asymmetry of a data distribution. It indicates whether the observations in a dataset are concentrated on one side. Specifically, it quantifies how far a distribution departs from symmetry around its mean; any symmetric distribution, such as the normal bell curve, has a skewness of zero.


Interpretation


  • Zero Skewness: The distribution is perfectly symmetrical.


  • Positive Skewness (Right-Skewed):

    • Value: Skewness > 0

    • Characteristics: The right tail (higher values) is longer or fatter than the left tail. Most data are concentrated on the left.

    • Implication: There are extreme high values (outliers) pulling the mean to the right.


  • Negative Skewness (Left-Skewed):

    • Value: Skewness < 0

    • Characteristics: The left tail (lower values) is longer or fatter than the right tail. Most data are concentrated on the right.

    • Implication: There are extreme low values (outliers) pulling the mean to the left.
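The sign rules above can be checked on a few toy samples. This is a minimal sketch in Python, assuming the simple moment-based definition of skewness (third central moment divided by the variance raised to 3/2); the sample values and the helper name `skewness` are made up for illustration and say nothing about which formula Qlik uses.

```python
def skewness(values):
    """Moment-based skewness: m3 / m2**1.5, with moments divided by n."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((x - mean) ** 2 for x in values) / n
    m3 = sum((x - mean) ** 3 for x in values) / n
    return m3 / m2 ** 1.5


right_skewed = [1, 2, 2, 3, 3, 3, 4, 20]    # one extreme high value
symmetric = [1, 2, 3, 4, 5, 6, 7]           # mirror-symmetric around 4
left_skewed = [-x for x in right_skewed]    # mirror image: one extreme low value

print(skewness(right_skewed))   # positive
print(skewness(symmetric))      # zero
print(skewness(left_skewed))    # negative
```

Mirroring the data flips the sign of the skewness but not its size, which is why the left-skewed result is the exact negative of the right-skewed one.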


Kurtosis


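Kurtosis measures the tailedness of a distribution. It indicates how the tails of a distribution differ in weight from those of a normal distribution. In essence, it tells us about the presence of outliers. The thresholds below apply to raw kurtosis, for which a normal distribution scores 3; many tools, including R's e1071 package and Excel's KURT function, report excess kurtosis (raw kurtosis minus 3), in which case the reference value is 0.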


Interpretation


  • Mesokurtic:

    • Value: Kurtosis ≈ 3

    • Characteristics: Similar to the normal distribution.


  • Leptokurtic (Heavy-Tailed):

    • Value: Kurtosis > 3

    • Characteristics: Distribution has heavy tails, often accompanied by a sharp peak.

    • Implication: Higher probability of extreme values (outliers).


  • Platykurtic (Light-Tailed):

    • Value: Kurtosis < 3

    • Characteristics: Distribution has light tails, often accompanied by a flatter peak.

    • Implication: Lower probability of extreme values.
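To make the three categories concrete, here is a small Python sketch. It assumes the raw moment-based kurtosis (fourth central moment divided by the squared variance, so the normal reference is 3); the two samples and the helper name `kurtosis` are hypothetical.

```python
def kurtosis(values):
    """Raw moment-based kurtosis: m4 / m2**2 (about 3 for normal data)."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((x - mean) ** 2 for x in values) / n
    m4 = sum((x - mean) ** 4 for x in values) / n
    return m4 / m2 ** 2


flat = list(range(1, 11))        # evenly spread values, no real tails
heavy = [5] * 18 + [-20, 30]     # tight centre plus two extreme values

for name, data in [("flat", flat), ("heavy", heavy)]:
    k = kurtosis(data)
    # Excess kurtosis simply shifts the reference point from 3 to 0.
    print(name, "raw:", round(k, 2), "excess:", round(k - 3, 2))
```

The evenly spread sample lands below 3 (platykurtic), while the sample with two extreme values lands well above 3 (leptokurtic).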


Importance in Data Analysis


Why Skewness and Kurtosis Matter


  1. Assumptions of Statistical Tests: Many statistical tests assume normality. Skewness and kurtosis help assess how much a dataset deviates from normality.

  2. Outlier Detection: High kurtosis can indicate the presence of outliers.

  3. Data Transformation Decisions: Understanding skewness can guide the need for data transformations (e.g., log transformation) to normalize data.
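As an illustration of the transformation point, the sketch below compares skewness before and after a log transformation. The income figures are invented for the example, and the moment-based `skewness` helper is an assumption, not a statement about any particular tool's formula.

```python
import math


def skewness(values):
    """Moment-based skewness: m3 / m2**1.5, with moments divided by n."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((x - mean) ** 2 for x in values) / n
    m3 = sum((x - mean) ** 3 for x in values) / n
    return m3 / m2 ** 1.5


# Hypothetical incomes (in thousands): most are modest, a few are very large.
incomes = [20, 22, 25, 27, 30, 35, 40, 60, 120, 400]
logged = [math.log(x) for x in incomes]

print("raw skewness:", round(skewness(incomes), 2))
print("log skewness:", round(skewness(logged), 2))
```

The log compresses the large values more than the small ones, so the right tail shortens and the skewness drops, although it does not necessarily reach zero.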


Impact on Mean and Median


  • Skewed Distributions:

    • Right-Skewed: Mean > Median

    • Left-Skewed: Mean < Median

  • The mean is affected by extreme values, while the median is more robust.
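A quick way to see this relationship is with Python's built-in `statistics` module. The samples below are hypothetical and chosen only to show the direction of the effect.

```python
from statistics import mean, median

right_skewed = [1, 2, 2, 3, 3, 3, 4, 20]    # one extreme high value
left_skewed = [-x for x in right_skewed]    # mirror image: one extreme low value

print(mean(right_skewed), median(right_skewed))   # mean above median
print(mean(left_skewed), median(left_skewed))     # mean below median

# Make the outlier ten times larger: the mean moves, the median does not.
more_extreme = right_skewed[:-1] + [200]
print(mean(more_extreme), median(more_extreme))
```

Only the mean reacts when the outlier grows, which is the robustness of the median described above.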


Practical Examples


Example 1: Income Distribution

  • Typical Observation: Most people earn around the median income, but a small number earn significantly more.

  • Skewness: Positively skewed due to high-income outliers.

  • Kurtosis: High kurtosis indicates the presence of these outliers.


Example 2: Test Scores

  • Observation: Most students score high, but a few score very low.

  • Skewness: Negatively skewed because of low-score outliers.

  • Kurtosis: Could be high or low, depending on how extreme the low scores are.


Types (1, 2, 3)


Type 1: Moment-Based (No Bias Correction)


What It Is:

  • Type 1 is the classic textbook definition. The numbering used here follows the type argument of the skewness() and kurtosis() functions in R's e1071 package.

  • It computes skewness and kurtosis directly from the sample central moments, with no adjustment for sample size.


How It Works:

  • The 2nd, 3rd, and 4th central moments are each computed by dividing by n, the number of data points.

  • Skewness is the 3rd central moment divided by the 2nd raised to the power 3/2; kurtosis is the 4th central moment divided by the square of the 2nd.


When to Use It:

  • Large Sample Sizes: With a large dataset (e.g., thousands of observations), the bias is small and all three types give nearly the same result.

  • Educational Purposes: It is the simplest form and the one usually shown first in statistics courses.


Type 2: Full Bias Correction


What It Is:

  • Type 2 rescales the Type 1 result with correction factors that depend on the sample size.

  • It is the definition used by SAS and SPSS.


How It Works:

  • The formulas include additional correction factors that depend on the sample size.

  • These factors are designed to provide an unbiased estimate of skewness and kurtosis for normally distributed data.

When to Use It:

  • Small Samples: The correction matters most when the dataset is small (e.g., fewer than 100 observations).

  • Matching Other Tools: When the results need to agree with software that uses this definition.


Type 3: Sample Standard Deviation in the Denominator


What It Is:

  • Type 3 uses the same central moments as Type 1 in the numerator, but standardizes them with the sample standard deviation (the one computed with n - 1).

  • It is the definition used by MINITAB and BMDP.


How It Works:

  • The Type 1 skewness is multiplied by ((n - 1)/n)^(3/2) and the Type 1 raw kurtosis by ((n - 1)/n)^2, so both shrink slightly.

  • The difference from Type 1 vanishes as the sample grows.


When to Use It:

  • Matching Other Tools: When the results need to agree with software that uses this definition.

  • Large Sample Sizes: As with Type 1, the remaining bias is negligible when the dataset is large.


Key Differences Summarized


  1. Bias Correction:

    • Type 1: No correction; moments are used as computed.

    • Type 2: Correction factors that make the estimates unbiased for normally distributed data.

    • Type 3: Only the adjustment that comes from using the sample standard deviation.


  2. Sample Size Consideration:

    • Type 2 and Type 3: Adjust the Type 1 result based on how many data points you have.

    • Type 1: Makes no adjustment, so it is best suited to samples large enough that adjustments aren't necessary.


  3. Complexity:

    • Type 2: Most complex formulas due to the correction factors.

    • Type 1 and Type 3: Simpler formulas, easier to compute.
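The three types can be put side by side in a short Python sketch. It assumes the type numbering of R's e1071 package (Type 1 uncorrected, Type 2 the SAS/SPSS-style corrected estimate, Type 3 based on the sample standard deviation) and, like that package, reports excess kurtosis. The function names, the `kind` parameter, and the sample data are hypothetical, and nothing here is a claim about which type Qlik implements.

```python
import math


def central_moment(values, order):
    """Central moment of the given order, dividing by n."""
    n = len(values)
    mean = sum(values) / n
    return sum((x - mean) ** order for x in values) / n


def skewness(values, kind):
    """Skewness of type 1, 2 or 3 (numbering assumed from e1071)."""
    n = len(values)
    g1 = central_moment(values, 3) / central_moment(values, 2) ** 1.5
    if kind == 1:
        return g1                                        # uncorrected
    if kind == 2:
        return g1 * math.sqrt(n * (n - 1)) / (n - 2)     # bias-corrected
    if kind == 3:
        return g1 * ((n - 1) / n) ** 1.5                 # sample std. dev.
    raise ValueError("kind must be 1, 2 or 3")


def excess_kurtosis(values, kind):
    """Excess kurtosis of type 1, 2 or 3 (numbering assumed from e1071)."""
    n = len(values)
    g2 = central_moment(values, 4) / central_moment(values, 2) ** 2 - 3
    if kind == 1:
        return g2
    if kind == 2:
        return ((n + 1) * g2 + 6) * (n - 1) / ((n - 2) * (n - 3))
    if kind == 3:
        return (g2 + 3) * (1 - 1 / n) ** 2 - 3
    raise ValueError("kind must be 1, 2 or 3")


sample = [2, 4, 4, 5, 7, 9, 15]   # small made-up sample, n = 7
for kind in (1, 2, 3):
    print(kind, round(skewness(sample, kind), 4),
          round(excess_kurtosis(sample, kind), 4))
```

With only seven values the three results differ noticeably; the Type 2 skewness is algebraically the same quantity that Excel's SKEW function returns, which makes a spreadsheet a convenient cross-check.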


Why Are There Different Types?


  • Accuracy vs. Simplicity: Different types offer a trade-off between computational simplicity and the accuracy of the estimates.

  • Sample Size Impact: In small samples, unadjusted estimates can be significantly biased. Adjustments help improve the estimates.

  • Statistical Practice: Different fields or applications may prefer one type over another based on convention or specific needs.
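The sample-size effect can be seen by looking only at the multipliers that separate the types. The sketch below assumes the e1071 numbering, in which Type 1 is the uncorrected moment estimate and the Type 2 and Type 3 skewness values are the Type 1 value times a factor that depends on n; the function names are hypothetical.

```python
import math


def type2_factor(n):
    """Multiplier from Type 1 to Type 2 skewness: sqrt(n(n-1)) / (n-2)."""
    return math.sqrt(n * (n - 1)) / (n - 2)


def type3_factor(n):
    """Multiplier from Type 1 to Type 3 skewness: ((n-1)/n) ** 1.5."""
    return ((n - 1) / n) ** 1.5


for n in (5, 20, 100, 1000, 10000):
    print(n, round(type2_factor(n), 4), round(type3_factor(n), 4))
```

Both factors approach 1 as n grows, so the choice of type matters for a handful of observations and is practically irrelevant for thousands.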


The Formulas


Now, let's finally get to the answer to the original question: which formulas does Qlik use? Here they are:




The 3rd and 4th Central Moments


Central moments are measures that help us understand the shape and characteristics of a data distribution by looking at how data points differ from the mean (average) value. Let's focus on the 3rd and 4th central moments, which are important for understanding the skewness and kurtosis of a distribution.
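In symbols, and assuming the usual notation of n observations x_1, ..., x_n with mean x̄ (the symbols m_r are introduced here for illustration), the r-th central moment and the moment-based skewness and raw kurtosis built from it can be written as:

```latex
\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i, \qquad
m_r = \frac{1}{n}\sum_{i=1}^{n} (x_i - \bar{x})^r

\text{skewness} = \frac{m_3}{m_2^{3/2}}, \qquad
\text{kurtosis} = \frac{m_4}{m_2^{2}}
```

Here m_2 is the variance (computed with n), so dividing by its powers removes the units of the data and leaves a pure shape measure.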


3rd Central Moment: Measuring Skewness


What Is the 3rd Central Moment?


  • Definition: The 3rd central moment is a way to quantify how asymmetric a distribution is around its mean.

  • Simple Explanation: It tells us whether the data leans more to the left or right of the mean.


In Everyday Terms


  • Imagine: You're looking at the ages of people at a family gathering.

    • If most people are young children with a few older adults, most ages cluster at the low end and the few adults form a long tail to the right (positive skew).

  • The 3rd central moment captures this tilt or lean in the data.


Interpreting the 3rd Central Moment


  • Positive Value:

    • Right-Skewed Distribution.

    • The tail on the right side (higher values) is longer.

    • Example: Income levels where most people earn less, but a few earn a lot more.


  • Negative Value:

    • Left-Skewed Distribution.

    • The tail on the left side (lower values) is longer.

    • Example: Test scores where most students score high, but a few score much lower.


  • Zero:

    • Symmetrical Distribution.

    • Data is evenly distributed around the mean.
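One caveat is that the sign of the 3rd central moment is meaningful on its own, but its size depends on the units of the data. The sketch below uses made-up ages for the family gathering example, first in years and then in months; the helper names are hypothetical.

```python
def central_moment(values, order):
    """Central moment of the given order, dividing by n."""
    n = len(values)
    mean = sum(values) / n
    return sum((x - mean) ** order for x in values) / n


def skewness(values):
    """Standardized 3rd central moment: m3 / m2**1.5."""
    return central_moment(values, 3) / central_moment(values, 2) ** 1.5


ages_years = [3, 4, 5, 6, 7, 8, 35, 62]        # mostly children, two adults
ages_months = [12 * a for a in ages_years]     # same people, different unit

m3_years = central_moment(ages_years, 3)
m3_months = central_moment(ages_months, 3)

print(m3_years > 0, m3_months > 0)             # both positive: right-skewed
print(m3_months / m3_years)                    # grows by 12 cubed
print(skewness(ages_years), skewness(ages_months))  # same value in both units
```

Changing the unit multiplies the 3rd central moment by the cube of the conversion factor, while the standardized skewness stays the same, which is why skewness rather than the raw moment is reported.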


4th Central Moment: Measuring Kurtosis


What Is the 4th Central Moment?


  • Definition: The 4th central moment, once divided by the squared variance, measures the tailedness of the distribution.

  • Simple Explanation: It tells us how heavy or light the tails of the distribution are compared to the middle.


In Everyday Terms


  • Imagine: You're stacking blocks to make a tower.

    • A tall, narrow tower (peaked) represents data where most values are clustered tightly around the mean while a few lie far out in the tails.

    • A short, wide tower (flat) represents data spread out more evenly with fewer extreme values.


Interpreting the 4th Central Moment


  • High Value:

    • Leptokurtic Distribution.

    • Sharp peak and heavy tails.

    • More data in the tails (extreme values).

    • Example: Financial returns with frequent extreme gains or losses.


  • Low Value:

    • Platykurtic Distribution.

    • Flat peak and light tails.

    • Less data in the tails.

    • Example: Uniform distribution where all outcomes are equally likely.


  • Moderate Value:

    • Mesokurtic Distribution.

    • Similar to a normal bell-shaped curve.

    • Example: Heights of people in a large population.
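The uniform-distribution example can be verified by hand. For a continuous uniform distribution on an interval [a, b], the population central moments (written μ_2 and μ_4 here, a notation introduced for this example) have simple closed forms, and their ratio gives the raw kurtosis:

```latex
\mu_2 = \frac{(b-a)^2}{12}, \qquad
\mu_4 = \frac{(b-a)^4}{80}, \qquad
\frac{\mu_4}{\mu_2^{2}} = \frac{144}{80} = 1.8 < 3
```

The result does not depend on a or b, and it sits below the normal reference of 3 (an excess kurtosis of -1.2), which is what makes the uniform distribution platykurtic.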


Why Are They Important?


  • 3rd Central Moment (Skewness):

    • Helps identify asymmetry in the data.

    • Important for statistical modeling and hypothesis testing.

    • Influences mean and median relationships.


  • 4th Central Moment (Kurtosis):

    • Highlights the presence of outliers.

    • Crucial for risk assessment in fields like finance.

    • Affects probability estimates of extreme events.


Conclusion


I believe this is now all explained. If you liked it, please make sure to share it!


© 2024 Data Voyagers
