Dispersion (also called deviation or spread) describes how far observations vary around their average value. Several measures are available for describing the dispersion of observational data, such as the range, quartile deviation, mean deviation, and standard deviation.
A measure of central tendency (mean, median, mode) is a representative value of a frequency distribution, but it does not show how the observations are spread around that central value. Measures of central tendency alone are therefore not sufficient to describe a frequency distribution; we also need a measure of how the observational data are spread.
For example, suppose we have the yields (kg per plot) of three rice varieties, each measured on 5 plots, distributed as follows:
Varieties I : 45 42 42 41 40
Varieties II : 54 48 42 36 30
Varieties III : 45 40 44 41 40
We can see that the mean yields of Varieties I and II are the same, 42 kg, but on closer inspection the variability of the two varieties differs. Variety I may be preferred because it is more consistent: its yields are more uniform than those of Varieties II. In Varieties I the yields stay close to the central value, 42 kg, while in Varieties II they are widely scattered (see the following picture).
This example makes it clear that a measure of central tendency alone is not sufficient to describe a frequency distribution; we also need a measure of how the observational data are spread. A measure of how much the observations vary around their average value is called deviation or dispersion. The measures discussed below are the range, quartile deviation, mean deviation, and standard deviation.
Measures of Dispersion
Range
The simplest measure of dispersion is the range. The range of a group of observational data is the difference between the maximum and minimum values.
$$ Range=\ Maximum\ value-\ Minimum\ value$$
For example, the range for Variety I in the data above is 45 - 40 = 5 (45 is the maximum value and 40 is the minimum value). We often express the range with statements like "yield ranges from 40 to 45 kg per plot". This range is narrower than in the statement "yield ranges from 40 to 60 kg per plot". The first statement indicates that rice yields vary little, while the second indicates the opposite.
The range takes into account only two values, the maximum and the minimum, and ignores all the others, so it is an unstable and unreliable indicator of spread. In particular, the range is strongly influenced by extreme values. In the example above, if the highest yield of Variety I were 60 kg/plot instead of 45 kg/plot, the range would be 60 - 40 = 20 kg/plot.
Our interpretation would then change: we would be inclined to say that the yields are highly variable. Is that right? Looking again, the other yields are almost uniform, ranging from 40 to 42 kg/plot. With the outlying result of 60 kg/plot, however, we tend to say that the yields are diverse, even though this diversity does not represent all the values in the sample/population.
A yield of 60 kg/plot is an example of an extreme and unusual value. Such a value is an outlier. It should be re-checked for correctness or removed from the observation data, because it can lead to inappropriate conclusions.
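The effect of a single extreme value on the range can be checked with a few lines of Python. This is only a minimal sketch: the function name `data_range` and the variable names are chosen here for illustration.

```python
def data_range(values):
    """Return the range: maximum value minus minimum value."""
    return max(values) - min(values)


variety_1 = [45, 42, 42, 41, 40]      # yields in kg per plot
with_outlier = [60, 42, 42, 41, 40]   # highest yield replaced by 60

print(data_range(variety_1))     # 5
print(data_range(with_outlier))  # 20
```

Changing one observation out of five quadruples the range, although the other four yields are unchanged.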
Example 2:
The following case shows another way in which the range can give a misleading picture of the spread of data:
The following are the scores for the 1st and 2nd quizzes of a statistics course. Determine the range for each quiz. What is your conclusion?
1st Quiz: | 1 | 20 | 20 | 20 | 20 | 20 | 20 | 20 | 20 | 20 | 20 |
2nd Quiz: | 2 | 3 | 4 | 5 | 6 | 14 | 15 | 16 | 17 | 18 | 19 |
Answer:
Quiz 1: range = 20-1 = 19
Quiz 2: range = 19-2 = 17
Conclusion :
Based on the range, Quiz 1 is more varied than Quiz 2, because the range for Quiz 1 is larger than the range for Quiz 2. Compare this with the conclusions obtained by using the quartile deviation and standard deviation.
Another weakness of the range is that it does not describe how the data are distributed around the central value. Consider the following example.
Example 3:
Determine the Mean and Range of the following two Varieties. What conclusions can you draw based on the mean (average) and range?
Varieties I | 45 | 42 | 42 | 41 | 40 |
Varieties III | 45 | 40 | 44 | 41 | 40 |
Answer:
Varieties I: Mean = 42; range = 5
Varieties III: Mean = 42; range = 5
Conclusion:
Varieties I and III have the same mean and range, namely mean = 42 and range = 5.
If we used only the range as a measure of dispersion, we would conclude that the yield variability of the two varieties is the same. However, if we look at how the data of each variety are distributed around the central value, we may prefer Varieties I, because its yields lie closer to the central value.
To avoid these weaknesses of the range, other dispersion measures such as the quartile deviation are preferred.
Quartile Deviation
The quartile deviation ignores the values below the first quartile and above the third quartile, so that extreme values at both ends of the data distribution have no influence.
The quartile deviation is half the distance between the first and third quartiles, Q1 and Q3, which equals the average of the distances from each of them to the median, Q2:
$$ \dfrac{\left(Q_3-Q_2\right)+(Q_2-Q_1)}{2}=\dfrac{(Q_3-Q_1)}{2}$$
The quartile deviation is more stable than the range because it is not affected by extreme values, which lie outside the first and third quartiles. However, like the range, the quartile deviation does not take into account the deviations of all the data; it depends only on the values of the first and third quartiles.
Example 4
Determine the value of the quartile deviation in Example 2.
Answer:
To determine the quartile value, the sample data must first be sorted. Incidentally in this example, the data is already sorted.
Next determine the location of the quartile and finally determine the value of the quartile.
Position: | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 |
Quiz 1: | 1 | 20 | 20 | 20 | 20 | 20 | 20 | 20 | 20 | 20 | 20 |
Quiz 2: | 2 | 3 | 4 | 5 | 6 | 14 | 15 | 16 | 17 | 18 | 19 |
n = 11
$$ Location\ Q_i=\dfrac{i}{4}(n+1)$$
Quiz 1:
The location of Q1 = (1/4)(11+1) = 3, so the value of Q1 is the 3rd observation, which is 20
The location of Q3 = (3/4)(11+1) = 9, so the value of Q3 is the 9th observation, which is 20
$$ Quartile\ deviation =\dfrac{(Q_3-Q_1)}{2}=\dfrac{(20-20)}{2}=0$$
Quiz 2:
The location of Q1 = (1/4)(11+1) = 3, so the value of Q1 is the 3rd observation, which is 4
The location of Q3 = (3/4)(11+1) = 9, so the value of Q3 is the 9th observation, which is 17
$$ Quartile\ deviation =\dfrac{(Q_3-Q_1)}{2}=\dfrac{(17-4)}{2}=6.5$$
Conclusion:
Based on the quartile deviation, Quiz 2 is more varied than Quiz 1. (the conclusion is different from the conclusion based on the range)
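The quartile location rule used in this example can be sketched in Python as follows. The function names are illustrative. The linear interpolation for locations that are not whole numbers is an assumption added here, since the example only needs whole-number locations.

```python
def quartile(ordered, i):
    """Value of the i-th quartile using the location rule i/4 * (n + 1).

    Positions are 1-based. For a fractional location, the value is
    interpolated linearly between its two neighbours (an assumption;
    the worked example only has whole-number locations).
    """
    n = len(ordered)
    location = min(max(i * (n + 1) / 4, 1), n)  # keep inside 1..n
    lower = int(location)                       # whole part of the location
    fraction = location - lower
    upper = min(lower + 1, n)
    low_value = ordered[lower - 1]
    high_value = ordered[upper - 1]
    return low_value + fraction * (high_value - low_value)


def quartile_deviation(values):
    """Half the distance between the third and first quartiles."""
    ordered = sorted(values)
    return (quartile(ordered, 3) - quartile(ordered, 1)) / 2


quiz_1 = [1, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20]
quiz_2 = [2, 3, 4, 5, 6, 14, 15, 16, 17, 18, 19]

print(quartile_deviation(quiz_1))
print(quartile_deviation(quiz_2))
```

Statistical packages use several slightly different quartile conventions, so a library function may return other values for the same data.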
Mean Deviation
The mean deviation is the average deviation of the individual values from a central value, which can be either the mean or the median. For raw data, the mean deviation about the median is the smallest possible, so the median is sometimes regarded as the most appropriate reference point. In general, however, the mean deviation is calculated about the mean. A first attempt at a formula would be:
$$ \text{Mean Deviation}=\dfrac{\Sigma (x_i-\overline{x})}{n}$$
This formula uses all the data and expresses the average departure from the mean, but it fails as a measure of dispersion: regardless of how spread out the data are, it always returns zero, because the numerator $\Sigma (x_i-\overline{x})$ is always equal to zero.
There are two ways around this problem, both of which remove the negative signs from the calculation.
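The reason the signed deviations always cancel can be written out in one line, using only the definition of the sample mean:

$$ \Sigma (x_i-\overline{x})=\Sigma x_i-n\overline{x}=\Sigma x_i-n\cdot \dfrac{\Sigma x_i}{n}=0$$

For Varieties I, for instance, the deviations from the mean of 42 are 3, 0, 0, -1 and -2, which add up to 0.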
The first method is to use the following formula:
Sample :
$$ \text{Mean Deviation}=\dfrac{\Sigma |x_i-\overline{x}|}{n}$$
Population :
$$ \text{Mean Deviation} =\dfrac{\Sigma |x_i-\mu |}{N}$$
For data that has been compiled in the form of a frequency table:
Single Data (not grouped by class interval):
$$ \text{Mean Deviation}=\dfrac{\sum^{{\rm k}}_{{\rm i=1}}{f_i|x_i-\overline{x}|}}{\Sigma f_i}=\dfrac{\sum^{{\rm k}}_{{\rm i=1}}{f_i|x_i-\overline{x}|}}{n}$$
Group data (already grouped by a certain interval):
The mean deviation calculated from the frequency distribution of grouped data uses estimated data values, not the original data. To make calculations from grouped data we must assume that all values in a class are equal to its representative value (the class mark or midpoint, $m_i$). The approximate value of the mean deviation can then be calculated using the formula:
$$ \text{Mean Deviation} \approx \dfrac{\sum^{{\rm k}}_{{\rm i=1}}{f_i|m_i-\overline{x}|}}{\Sigma f_i}=\dfrac{\sum^{{\rm k}}_{{\rm i=1}}{f_i|m_i-\overline{x}|}}{n}$$
In the above formulas the numerator is never negative, because absolute values are taken: the modulus sign | | means that negative and positive deviations are both treated as positive.
The second method is to use the sum of the squares of all the data deviation values. This method is known as variance and standard deviation .
Example 5
Determine the value of the mean deviation in Example 2.
Answer:
Quiz 1: mean = 18.27
Quiz 2: mean = 10.82
No | Quiz 1 ($x_i$) | $x_i-\overline{x}$ | $\left|x_i-\overline{x}\right|$ | Quiz 2 ($x_i$) | $x_i-\overline{x}$ | $\left|x_i-\overline{x}\right|$ | |
1 | 1 | -17.27 | 17.27 | 2 | -8.82 | 8.82 | |
2 | 20 | 1.73 | 1.73 | 3 | -7.82 | 7.82 | |
3 | 20 | 1.73 | 1.73 | 4 | -6.82 | 6.82 | |
4 | 20 | 1.73 | 1.73 | 5 | -5.82 | 5.82 | |
5 | 20 | 1.73 | 1.73 | 6 | -4.82 | 4.82 | |
6 | 20 | 1.73 | 1.73 | 14 | 3.18 | 3.18 | |
7 | 20 | 1.73 | 1.73 | 15 | 4.18 | 4.18 | |
8 | 20 | 1.73 | 1.73 | 16 | 5.18 | 5.18 | |
9 | 20 | 1.73 | 1.73 | 17 | 6.18 | 6.18 | |
10 | 20 | 1.73 | 1.73 | 18 | 7.18 | 7.18 | |
11 | 20 | 1.73 | 1.73 | 19 | 8.18 | 8.18 | |
Sum | | | 34.55 | | | 68.18 | |
Quiz 1:
$$ \text{Mean Deviation}=\dfrac{\Sigma |x_i-\overline{x}|}{n}=\dfrac{34.55}{11}=3.141$$
Quiz 2:
$$ \text{Mean Deviation}=\dfrac{\Sigma |x_i-\overline{x}|}{n}=\dfrac{68.18}{11}=6.198$$
Conclusion:
Based on the mean deviation, Quiz 2 is more varied than Quiz 1. (the conclusion is different from the conclusion based on the range)
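The mean deviation about the mean can be computed with a short Python sketch. The function name `mean_deviation` is illustrative. The first printed value shows that the signed deviations cancel, which is why absolute values are needed.

```python
def mean_deviation(values):
    """Average absolute deviation of the values from their mean."""
    n = len(values)
    mean = sum(values) / n
    return sum(abs(x - mean) for x in values) / n


quiz_1 = [1, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20]
quiz_2 = [2, 3, 4, 5, 6, 14, 15, 16, 17, 18, 19]

# Signed deviations cancel out (up to floating-point rounding).
mean_1 = sum(quiz_1) / len(quiz_1)
signed_sum = sum(x - mean_1 for x in quiz_1)
print(abs(signed_sum) < 1e-9)

print(mean_deviation(quiz_1))
print(mean_deviation(quiz_2))
```

Results may differ from a hand calculation in the last digit, because the code does not round the mean before subtracting it.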
Notes:
To determine the mean deviation from the frequency table, the method is similar to examples 7 and 8 .
Additional Examples:
In the same way as above, the mean deviation values for the three varieties are:
Varieties I = 1.2
Varieties II = 7.2
Varieties III = 2.0
Variance and Standard deviation
The mean deviation measures spread by ignoring the signs of the deviations.
Taking absolute values is inconvenient for further mathematical treatment. The second way is to square each deviation, so that negative values become positive. The average of the squared deviations is known as the variance. Taking the square root of the variance brings the measure back to the original unit of the variable (kg/plot rather than kg²/plot² :-) ). The result is known as the standard deviation.
Mathematically, the standard deviation can be calculated using the formula:
$$ \sigma =\sqrt{\dfrac{\Sigma {\left(x_i-\mu \right)}^2}{N}}\ or\ \sqrt{\dfrac{\Sigma x^2_i-\dfrac{{\left(\Sigma x_i\right)}^2}{N}}{N}}\ $$
The population standard deviation is symbolized by σ (read 'sigma') and the sample standard deviation is denoted by s. Because we use the sample to estimate the spread of the population, the sample variance should be an unbiased estimator of the population variance. For that reason, N in the denominator of the above formula is replaced with n - 1 (and μ with the sample mean), so that the formula for the sample standard deviation is as follows:
$$ s=\sqrt{\dfrac{\Sigma {\left(x_i-\overline{x}\right)}^2}{n-1}}\ {\rm or}\ \sqrt{\dfrac{\Sigma x^2_i-\dfrac{{\left(\Sigma x_i\right)}^2}{n}}{n-1}}\ $$
Why should it be replaced with n-1 ?! The proof is beyond the discussion of this blog. :-)
Data on the frequency distribution table:
Single Data :
$$ s=\sqrt{\dfrac{\sum^k_{i=1}{{f_i\left(x_i-\overline{x}\right)}^2}}{n-1}}\ {\rm or}\ s=\sqrt{\dfrac{\Sigma {f_ix}^2_i-\dfrac{{\left(\Sigma f_ix_i\right)}^2}{n}}{n-1}}\ $$
$$ \ {\rm or}\ s=\sqrt{\dfrac{n\Sigma {f_ix}^2_i-{\left(\Sigma f_ix_i\right)}^2}{n(n-1)}}$$
Group data (already grouped by a certain interval):
As in the calculation of the mean deviation, the standard deviation and variance of grouped data are calculated from estimated data values, not the original data. To make calculations from grouped data we must assume that all values in a class are equal to its representative value (the class mark or midpoint, $m_i$). The estimated standard deviation can then be calculated using the formula:
$$ s=\sqrt{\dfrac{\sum^k_{i=1}{{f_i\left(m_i-\overline{x}\right)}^2}}{n-1}}\ {\rm or}\ s=\sqrt{\dfrac{\Sigma {f_im}^2_i-\dfrac{{\left(\Sigma f_im_i\right)}^2}{n}}{n-1}}\ $$
$$ \ {\rm or}\ s=\sqrt{\dfrac{n\Sigma {f_im}^2_i-{\left(\Sigma f_im_i\right)}^2}{n(n-1)}}$$
The square of the standard deviation is known as the variance. In analysis of variance, $ \Sigma x^2-\dfrac{{\left(\Sigma x\right)}^2}{n}$ is known as the Sum of Squares, and the variance is known as the Mean Square.
The standard deviation is the most widely used measure of spread. It takes all the data into account, so it is more stable than the other measures. However, like the mean, it is not robust: extreme values in the data set can inflate it considerably.
The standard deviation (SD) has several other special characteristics. The SD does not change if a constant is added to or subtracted from every element of the data set. The SD does change when every element is multiplied or divided by a constant: when multiplied by a constant, the resulting standard deviation equals the original standard deviation multiplied by the absolute value of that constant.
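These two properties can be checked numerically. The sketch below uses `statistics.stdev` from the Python standard library, which computes the sample standard deviation (n - 1 in the denominator). The constants 10, 3 and -3 are arbitrary choices for illustration.

```python
import statistics

yields = [45, 42, 42, 41, 40]            # Varieties I, kg per plot
sd = statistics.stdev(yields)            # sample SD

shifted = [x + 10 for x in yields]       # add a constant to every value
scaled = [x * 3 for x in yields]         # multiply every value by 3
flipped = [x * -3 for x in yields]       # multiply every value by -3

print(sd)
print(statistics.stdev(shifted))         # unchanged by the shift
print(statistics.stdev(scaled))          # 3 times the original SD
print(statistics.stdev(flipped))         # also 3 times: the sign is lost
```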
Example 6
If the Quiz value data in example 2 is taken from the sample, determine the variance value and standard deviation.
Answer:
To find the sample standard deviation, we can use one of the following formulas:
$$ s=\sqrt{\dfrac{\Sigma {\left(x_i-\overline{x}\right)}^2}{n-1}}\ {\rm or}\ \sqrt{\dfrac{\Sigma x^2_i-\dfrac{{\left(\Sigma x_i\right)}^2}{n}}{n-1}}\ $$
The first formula is the definitional formula. The second formula is the one recommended for manual calculations; its use is shown in Examples 7 and 8. In this example, as an exercise, we use the first formula. It requires the mean, so we must first calculate the mean.
Quiz 1: mean = 18.27
Quiz 2: mean = 10.82
No | Quiz 1 ($x_i$) | $x_i-\overline{x}$ | ${\left(x_i-\overline{x}\right)}^2$ | Quiz 2 ($x_i$) | $x_i-\overline{x}$ | ${\left(x_i-\overline{x}\right)}^2$ | |
1 | 1 | -17.27 | 298.35 | 2 | -8.82 | 77.76 | |
2 | 20 | 1.73 | 2.98 | 3 | -7.82 | 61.12 | |
3 | 20 | 1.73 | 2.98 | 4 | -6.82 | 46.49 | |
4 | 20 | 1.73 | 2.98 | 5 | -5.82 | 33.85 | |
5 | 20 | 1.73 | 2.98 | 6 | -4.82 | 23.21 | |
6 | 20 | 1.73 | 2.98 | 14 | 3.18 | 10.12 | |
7 | 20 | 1.73 | 2.98 | 15 | 4.18 | 17.49 | |
8 | 20 | 1.73 | 2.98 | 16 | 5.18 | 26.85 | |
9 | 20 | 1.73 | 2.98 | 17 | 6.18 | 38.21 | |
10 | 20 | 1.73 | 2.98 | 18 | 7.18 | 51.58 | |
11 | 20 | 1.73 | 2.98 | 19 | 8.18 | 66.94 | |
Sum | | | 328.1818 | | | 453.6364 | |
Quiz 1:
$$ s=\sqrt{\dfrac{\Sigma {\left(x_i-\overline{x}\right)}^2}{n-1}}=\sqrt{\dfrac{328.18}{11-1}}=5.73\ $$
$$ variance=s^2={5.73}^2=32.82\ $$
Quiz 2:
$$ s=\sqrt{\dfrac{453.64}{11-1}}=6.74\ \ $$
$$ variance=s^2={6.74}^2=45.36$$
Conclusion:
Based on the value of variance and standard deviation, Quiz 2 is more varied than Quiz 1. (the conclusion is different from the conclusion based on the range)
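As a cross-check, both versions of the sample variance formula can be sketched in Python and applied to the quiz data. The function names are illustrative.

```python
import math


def variance_definitional(values):
    """Sample variance from squared deviations about the mean."""
    n = len(values)
    mean = sum(values) / n
    return sum((x - mean) ** 2 for x in values) / (n - 1)


def variance_computational(values):
    """Sample variance from the sum of x and the sum of x squared."""
    n = len(values)
    sum_x = sum(values)
    sum_x2 = sum(x * x for x in values)
    return (sum_x2 - sum_x ** 2 / n) / (n - 1)


quiz_1 = [1, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20]
quiz_2 = [2, 3, 4, 5, 6, 14, 15, 16, 17, 18, 19]

for name, data in (("Quiz 1", quiz_1), ("Quiz 2", quiz_2)):
    var = variance_definitional(data)
    print(name, round(var, 2), round(math.sqrt(var), 2))
# Quiz 1 32.82 5.73
# Quiz 2 45.36 6.74
```

The two functions return the same value up to floating-point rounding, so the choice between them is one of convenience.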
Example 7
Calculate the standard deviation and variance values from the following single data frequency table:
No | xi | fi |
1 | 70 | 5 |
2 | 69 | 6 |
3 | 45 | 3 |
4 | 80 | 1 |
5 | 56 | 1 |
Sum | 320 | 16 |
Answer:
For convenience in manual calculations, we use the following standard deviation formula:
$$ s=\sqrt{\dfrac{n\Sigma {f_ix}^2_i-{\left(\Sigma f_ix_i\right)}^2}{n(n-1)}}\ {\rm or}\ s=\sqrt{\dfrac{\Sigma {f_ix}^2_i-\dfrac{{\left(\Sigma f_ix_i\right)}^2}{n}}{n-1}}$$
Next we create a table as shown in the following table:
No | xi | fi | fi.xi | fi.xi² |
1 | 70 | 5 | 350 | 24500 |
2 | 69 | 6 | 414 | 28566 |
3 | 45 | 3 | 135 | 6075 |
4 | 80 | 1 | 80 | 6400 |
5 | 56 | 1 | 56 | 3136 |
Sum | 320 | 16 | 1035 | 68677 |
From the table we obtain:
n = 16
mean = 1035/16 = 64.69
Standard deviation:
$$ s=\sqrt{\dfrac{{\rm 68677}-\dfrac{{\left({\rm 1035}\right)}^2}{16}}{16-1}}=10.73\ $$
$$ {\rm or}\ s=\sqrt{\dfrac{16(68677)-{\left({\rm 1035}\right)}^2}{16(16-1)}}=10.73$$
$$ {\rm Variance}=\ s^2=\dfrac{1725.44}{16-1}=115.03$$
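The frequency-table calculation can be reproduced with the Python sketch below, which follows the manual formula based on Σf·x and Σf·x². The function name `frequency_variance` is illustrative.

```python
import math


def frequency_variance(values, freqs):
    """Sample variance from a frequency table of distinct values."""
    n = sum(freqs)
    sum_fx = sum(f * x for x, f in zip(values, freqs))
    sum_fx2 = sum(f * x * x for x, f in zip(values, freqs))
    return (sum_fx2 - sum_fx ** 2 / n) / (n - 1)


x = [70, 69, 45, 80, 56]   # observed values
f = [5, 6, 3, 1, 1]        # their frequencies

variance = frequency_variance(x, f)
sd = math.sqrt(variance)
print(round(variance, 2), round(sd, 2))
```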
Example 8
Calculate the standard deviation and variance values from the grouped frequency table:
The following table shows the statistics test scores of 80 students arranged in a frequency table. Unlike the example above, this frequency distribution table is built from data that have been grouped into class intervals (number of classes = 7 and class width = 10).
Class- | Test scores | fi |
1 | 31 - 40 | 2 |
2 | 41 - 50 | 3 |
3 | 51 - 60 | 5 |
4 | 61 - 70 | 13 |
5 | 71 - 80 | 24 |
6 | 81 - 90 | 21 |
7 | 91 - 100 | 12 |
Sum | 80 |
Answer:
For convenience in manual calculations, we use the following standard deviation formula:
$$ s=\sqrt{\dfrac{\Sigma {f_im}^2_i-\dfrac{{\left(\Sigma f_im_i\right)}^2}{n}}{n-1}}\ {\rm or}\ \ s=\sqrt{\dfrac{n\Sigma {f_im}^2_i-{\left(\Sigma f_im_i\right)}^2}{n(n-1)}}$$
Next, we build the following table: determine the midpoint (class mark, $m_i$) of each class and complete the remaining columns.
Class | Test scores | fi | mi | fi.mi | fi.mi² |
1 | 31 - 40 | 2 | 35.5 | 71.0 | 2520.5 |
2 | 41 - 50 | 3 | 45.5 | 136.5 | 6210.8 |
3 | 51 - 60 | 5 | 55.5 | 277.5 | 15401.3 |
4 | 61 - 70 | 13 | 65.5 | 851.5 | 55773.3 |
5 | 71 - 80 | 24 | 75.5 | 1812.0 | 136806.0 |
6 | 81 - 90 | 21 | 85.5 | 1795.5 | 153515.3 |
7 | 91 - 100 | 12 | 95.5 | 1146.0 | 109443.0 |
Sum | 80 | 458.5 | 6090.0 | 479670.0 |
From the table we obtain:
n = 80
mean = 6090/80 = 76.13
Standard deviation and variance:
$$ s=\sqrt{\dfrac{\Sigma {f_im}^2_i-\dfrac{{\left(\Sigma f_im_i\right)}^2}{n}}{n-1}}=\sqrt{\dfrac{{\rm 479670}-\dfrac{{\left({\rm 6090}\right)}^2}{80}}{80-1}}=14.26$$
$$ {\rm or}\ s=\sqrt{\dfrac{n\Sigma {f_im}^2_i-{\left(\Sigma f_im_i\right)}^2}{n(n-1)}}=\sqrt{\dfrac{80(479670)-{\left({\rm 6090}\right)}^2}{80(80-1)}}=14.26$$
$$ {\rm Variance}=\ s^2=\dfrac{16068.75}{80-1}=203.40$$
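For grouped data the same calculation applies with the class midpoints standing in for the observations. The Python sketch below reads the class limits and frequencies from the table in this example. The function name `grouped_variance` is illustrative.

```python
import math

# (lower limit, upper limit, frequency) for each class in the table
classes = [(31, 40, 2), (41, 50, 3), (51, 60, 5), (61, 70, 13),
           (71, 80, 24), (81, 90, 21), (91, 100, 12)]


def grouped_variance(classes):
    """Approximate sample variance using class midpoints."""
    midpoints = [(low + high) / 2 for low, high, _ in classes]
    freqs = [f for _, _, f in classes]
    n = sum(freqs)
    sum_fm = sum(f * m for f, m in zip(freqs, midpoints))
    sum_fm2 = sum(f * m * m for f, m in zip(freqs, midpoints))
    return (sum_fm2 - sum_fm ** 2 / n) / (n - 1)


variance = grouped_variance(classes)
sd = math.sqrt(variance)
print(round(sd, 2), round(variance, 2))
```

A quick plausibility check is that the standard deviation cannot exceed the full spread of the scores, which here run from 31 to 100.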
Additional Examples:
In the same way as above, the standard deviation values for the three varieties are:
Varieties I = 1.87
Varieties II = 9.49
Varieties III = 2.35
Measures of Relative Dispersion
Consider Varieties I and Varieties II above. Both varieties have the same mean. For two data distributions with the same or nearly the same mean, we can compare their variability directly by looking at their standard deviations, and we agree that Varieties II is more variable than Varieties I. However, if the means of the two distributions differ considerably, we cannot compare their variability using the standard deviations directly. In that case we must use a measure of relative dispersion.
There are measures of relative dispersion based on the range, quartile deviation, mean deviation, and standard deviation. The most important and most frequently used is the coefficient of variation, which is the relative measure based on the standard deviation.
The Coefficient of Variation is calculated by the following formula:
$$ CV=\dfrac{s}{\overline{x}}\times 100\%$$
The Coefficient of Variation (CV) is a unit-independent measure and is usually expressed as a percentage. A small CV indicates that the data vary little relative to their mean and are said to be more consistent. The CV is not reliable if the mean is close to 0 (zero). The CV is also not meaningful if the data are not measured on a ratio scale.
Example 9:
Consider the data sets for Group A and Group B
A | 2 | 4 | 5 | 6 | 6 | 7 | 7 | 7 | 8 | 9 |
B | 3 | 6 | 7 | 9 | 9 | 10 | 10 | 10 | 11 | 12 |
Group A: Mean = 6.1; s = 2.025
Group B: Mean = 8.7; s = 2.669
$$ CV\left(A\right)=\dfrac{s}{\overline{x}}\times 100\%=\dfrac{2.025}{6.1}\times 100\%=33.2\%$$
$$ CV\left(B\right)=\dfrac{s}{\overline{x}}\times 100\%=\dfrac{2.669}{8.7}\times 100\%=30.7\%$$
The coefficient of variation for group B is smaller than that of group A. The standard deviations show the opposite: the SD of A is smaller than the SD of B. Thus, to judge the relative variability of data sets whose means differ, we should not rely solely on the standard deviation.
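The comparison can be reproduced with the Python standard library. In this sketch the function name `coefficient_of_variation` is illustrative, and `statistics.stdev` supplies the sample standard deviation.

```python
import statistics


def coefficient_of_variation(values):
    """Sample standard deviation as a percentage of the mean."""
    return statistics.stdev(values) / statistics.mean(values) * 100


group_a = [2, 4, 5, 6, 6, 7, 7, 7, 8, 9]
group_b = [3, 6, 7, 9, 9, 10, 10, 10, 11, 12]

cv_a = coefficient_of_variation(group_a)
cv_b = coefficient_of_variation(group_b)
print(round(cv_a, 1))  # 33.2
print(round(cv_b, 1))  # 30.7
```

Group B has the larger standard deviation but the smaller CV, because its mean is also larger.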
Skewness and Kurtosis
The mean and a measure of spread can summarize a data distribution, but they are not sufficient to describe its shape. To describe the shape of a data distribution we use two further concepts, skewness and kurtosis.
Skewness
Skewness means asymmetry. A distribution is said to be symmetrical if the values are evenly distributed on both sides of the mean. For example, the following data are distributed symmetrically about the mean, 3.
x | 1 | 2 | 3 | 4 | 5 |
frequency (f) | 5 | 9 | 12 | 9 | 5 |
In the following example, the data distributions are not symmetrical. The first image is skewed to the left and the second image is skewed to the right.
In a symmetrical distribution of data, the mean, median and mode have the same value.
Several measures are used to express the direction and degree of skewness of a data distribution. The following measures were introduced by Pearson.
Coefficient of Skewness:
$$ S_k=\dfrac{3(mean-median)}{standard\ deviation}$$
Interpretation: For a symmetrical data distribution, Sk = 0. When the data distribution is skewed to the left (negatively skewed), Sk is negative, and when it is skewed to the right (positively skewed), Sk is positive. Sk lies between -3 and 3.
Another measure of skewness is the coefficient β1 (read 'beta-one'):
$$ {\rm population}:\ {\beta }_1=\frac{{\mu }^2_3}{{\mu }^3_2};\ \ \ \ \ {\rm sample}:\ b_1=\frac{m^2_3}{m^3_2}$$
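Pearson's coefficient is simple to compute. The Python sketch below assumes the sample standard deviation is used in the denominator, which the formula itself does not specify. The `right_skewed` data set is hypothetical and included only to show a positive result.

```python
import statistics


def pearson_skewness(values):
    """Pearson's coefficient: 3 * (mean - median) / standard deviation."""
    mean = statistics.mean(values)
    median = statistics.median(values)
    sd = statistics.stdev(values)   # assumption: sample SD
    return 3 * (mean - median) / sd


# The symmetric frequency table above, expanded into raw values.
symmetric = [1] * 5 + [2] * 9 + [3] * 12 + [4] * 9 + [5] * 5
# Hypothetical data with a long right tail.
right_skewed = [1, 2, 2, 3, 3, 3, 4, 10]

print(pearson_skewness(symmetric))      # 0.0, mean equals median
print(pearson_skewness(right_skewed))   # positive
```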
where:
$$ {m_3} = {{\Sigma {{\left( {{x_i} - \bar x} \right)}^3}} \over {n - 1}};\;{\rm{and}}\;{m_2} = {{\Sigma {{\left( {{x_i} - \bar x} \right)}^2}} \over {n - 1}}$$
Interpretation :
The distribution is said to be symmetrical if b1 = 0. Because m3 is squared, b1 itself is never negative; the direction of the skewness is given by the sign of m3 (positive for a right skew, negative for a left skew).
Commonly used Skewness measures :
Population Skewness:
$$ {m_3} = {{\Sigma {{({x_i} - \bar x)}^3}} \over n};\;{m_2} = {{\Sigma {{({x_i} - \bar x)}^2}} \over n}$$
$$ {g_1} = {{{m_3}} \over {m_2^{3/2}}}$$
Sample Skewness:
$$ {G_1} = {{{k_3}} \over {k_2^{3/2}}} = {{\sqrt {n{\mkern 1mu} (n - 1)} } \over {n - 2}}\; \cdot {g_1}$$
Source: D. N. Joanes and C. A. Gill. "Comparing Measures of Sample Skewness and Kurtosis". The Statistician 47(1):183–189.
or the following formula (MS Excel):
$${G_1} = {n \over {(n - 1)(n - 2)}} \cdot \sum {\left( \dfrac{x_i - \bar x}{s} \right)}^3$$
s = standard deviation
NB: the two formulas above produce the same skewness value
Interpretation :
The distribution is said to be symmetrical if the value of G1 = 0. Skewness is positive or negative depending on whether the value of G1 is positive or negative.
According to Bulmer, M. G., in Principles of Statistics (Dover, 1979):
- highly skewed: if the skewness is less than -1 or more than +1.
- moderately skewed: if the skewness is between -1 and -½ or between +½ and +1.
- approximately symmetric: if the skewness is between -½ and +½.
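The claim that the two sample skewness formulas agree can be checked numerically. The following Python sketch implements g1, G1 and the Excel-style formula as written above. The function names are illustrative.

```python
import math


def central_moment(values, r):
    """r-th central moment with n in the denominator."""
    n = len(values)
    mean = sum(values) / n
    return sum((x - mean) ** r for x in values) / n


def skewness_g1(values):
    return central_moment(values, 3) / central_moment(values, 2) ** 1.5


def skewness_G1(values):
    """Sample skewness from g1 (Joanes and Gill form)."""
    n = len(values)
    return math.sqrt(n * (n - 1)) / (n - 2) * skewness_g1(values)


def skewness_excel(values):
    """Sample skewness from standardized deviations (Excel-style form)."""
    n = len(values)
    mean = sum(values) / n
    s = math.sqrt(sum((x - mean) ** 2 for x in values) / (n - 1))
    return n / ((n - 1) * (n - 2)) * sum(((x - mean) / s) ** 3 for x in values)


quiz_1 = [1, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20]
print(skewness_G1(quiz_1))
print(skewness_excel(quiz_1))
```

The single low score in Quiz 1 pulls the tail to the left, so both functions return the same strongly negative value.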
Kurtosis
Kurtosis is a measure of the peakedness of a data distribution.
The distributions in the figure above are all symmetrical to their mean values. However, the three forms are not the same. Blue curves are known as mesokurtic (normal curves), red curves are known as leptokurtic (pointed curves) and green curves are known as platykurtic (flat curves).
Kurtosis is calculated using the Pearson coefficient, β2 (read 'beta - two').
$$ {\rm population}:\ {\beta }_2=\dfrac{{\mu }_4}{{\mu }^2_2};\ \ \ \ \ {\rm sample}:b_2=\dfrac{ m_4}{m^2_2}$$
where:
$$ {m_4} = {{\Sigma {{\left( {{x_i} - \bar x} \right)}^4}} \over {n - 1}};\;{\rm{and}}\;{m_2} = {{\Sigma {{\left( {{x_i} - \bar x} \right)}^2}} \over {n - 1}}$$
Commonly used kurtosis sizes:
Population Kurtosis:
$$ {m_4} = {{\Sigma {{({x_i} - \bar x)}^4}} \over n};\;{m_2} = {{\Sigma {{({x_i} - \bar x)}^2}} \over n}$$
Kurtosis: $$ {a_4} = {{{m_4}} \over {m_2^2}}$$
Excess Kurtosis: $${g_2} = {a_4} - 3$$
Sample Kurtosis:
$$ {G_2} = {{n - 1} \over {(n - 2)(n - 3)}}\; \cdot [(n + 1){g_2} + 6]$$
or the following formula (MS Excel):
$${G_2} = \left\{ {{n(n + 1)} \over {(n - 1)(n - 2)(n - 3)}} \sum {\left( \dfrac{x_i - \bar x}{s} \right)}^4 \right\} - {{3{{(n - 1)}^2}} \over {(n - 2)(n - 3)}}$$
s = standard deviation
NB: Excel uses the Excess Kurtosis value. The calculation results of the two formulas above, produce the same value
Interpretation:
A distribution is said to be:
- Mesokurtic (normal) if kurtosis = 3 (excess kurtosis = 0)
- Leptokurtic if kurtosis > 3 (excess kurtosis > 0)
- Platykurtic if kurtosis < 3 (excess kurtosis < 0)
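The kurtosis formulas can be checked in the same way. The Python sketch below implements the excess kurtosis g2, the sample version G2 and the Excel-style formula as written above, with illustrative function names.

```python
import math


def central_moment(values, r):
    """r-th central moment with n in the denominator."""
    n = len(values)
    mean = sum(values) / n
    return sum((x - mean) ** r for x in values) / n


def excess_kurtosis_g2(values):
    return central_moment(values, 4) / central_moment(values, 2) ** 2 - 3


def kurtosis_G2(values):
    """Sample excess kurtosis from g2 (Joanes and Gill form)."""
    n = len(values)
    g2 = excess_kurtosis_g2(values)
    return (n - 1) / ((n - 2) * (n - 3)) * ((n + 1) * g2 + 6)


def kurtosis_excel(values):
    """Sample excess kurtosis from standardized deviations (Excel-style)."""
    n = len(values)
    mean = sum(values) / n
    s = math.sqrt(sum((x - mean) ** 2 for x in values) / (n - 1))
    total = sum(((x - mean) / s) ** 4 for x in values)
    first = n * (n + 1) / ((n - 1) * (n - 2) * (n - 3)) * total
    return first - 3 * (n - 1) ** 2 / ((n - 2) * (n - 3))


quiz_2 = [2, 3, 4, 5, 6, 14, 15, 16, 17, 18, 19]
print(kurtosis_G2(quiz_2))
print(kurtosis_excel(quiz_2))
```

Both functions return excess kurtosis, so their output is compared with 0 rather than with 3. The Quiz 2 scores fall into two separate clusters with nothing near the mean, which gives a negative (platykurtic) value.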
Examples of Skewness and Kurtosis Calculations
Calculating Measures of Dispersion with Data Processing Applications
SmartstatXL (Excel Add-In)
The calculation of the range, quartile deviation, mean deviation, and standard deviation using SmartstatXL can be studied at the following link: How to Analyze Descriptive Statistics and Normality Test