
Statistics is a set of procedures for collecting, measuring, classifying, calculating, explaining, synthesizing, analyzing, and interpreting systematically obtained quantitative data. Broadly speaking, statistics is divided into two main components: descriptive statistics and inferential statistics. Descriptive statistics uses numerical and graphical procedures to summarize data sets in a clear and understandable way, while inferential statistics provides procedures for drawing conclusions about a population from the sample we observe. Descriptive statistics helps us simplify large amounts of data in a logical way: the data are reduced and summarized so that they are simpler and easier to interpret.

There are two basic methods in descriptive statistics: numerical and graphical.

  • Numerical approaches calculate statistical values from a set of data, such as the mean and standard deviation. These statistics describe the center of the data and how widely the values are spread around it.
  • Graphical methods are better suited than numerical methods for identifying patterns in the data. The numerical approach, on the other hand, is more precise and objective. The two approaches are therefore complementary, and it is wise to use them together.
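
As a minimal sketch of the numerical approach, the snippet below computes the mean and standard deviation with Python's standard `statistics` module. The `values` list is a hypothetical sample made up for illustration, not data from this article.

```python
import statistics

# Hypothetical sample, invented for illustration only.
values = [2, 4, 4, 4, 5, 5, 7, 9]

mean = statistics.mean(values)              # arithmetic mean
sample_sd = statistics.stdev(values)        # sample standard deviation (divides by n - 1)
population_sd = statistics.pstdev(values)   # population standard deviation (divides by n)

print("mean:", mean)
print("sample SD:", round(sample_sd, 3))
print("population SD:", population_sd)
```

Which standard deviation applies depends on whether the data are treated as a sample or as the whole population.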

There are three main characteristics of a single variable:

  • Data distribution (frequency distribution)
  • Measure of center (central tendency)
  • Measure of spread (dispersion)

Info : The full discussion will be described in a separate topic…

Data Distribution

Organizing and summarizing data in tables is often helpful, especially when we are working with large amounts of data. Such a table lists the distinct data values (either single values or values that have been grouped into classes) along with their frequencies. A frequency is the number of times a data value, or a value in a given class, occurs. Data arranged in this way is called a frequency distribution: a list of data values (single or grouped) accompanied by their frequencies. Grouping the data into classes lets the important characteristics of the data be seen immediately. The simplest frequency distribution lists each value of the variable together with its frequency.

A frequency distribution can be presented in two ways: as a table or as a graph. Distributions can also be displayed using percentages. Presenting the distribution as a graph makes it easier to see the characteristics and tendencies of a data set. Charts for quantitative data include histograms and frequency polygons, while charts for qualitative data include bar charts and pie charts. A frequency distribution makes patterns in the data easier to see, but when the data are grouped we lose information about the individual values.
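
The simplest kind of frequency table described above can be sketched in a few lines of Python. The `observations` list is hypothetical, and `collections.Counter` is just one convenient way to count occurrences.

```python
from collections import Counter

# Hypothetical observations, invented for illustration only.
observations = [3, 5, 4, 3, 5, 5, 2, 4, 5, 3]

frequency = Counter(observations)   # maps each distinct value to its count
n = len(observations)

# One row per distinct value: (value, frequency, percentage), sorted by value.
table = [(value, count, 100 * count / n)
         for value, count in sorted(frequency.items())]

print(f"{'value':>5} {'frequency':>9} {'percent':>8}")
for value, count, percent in table:
    print(f"{value:>5} {count:>9} {percent:>7.1f}%")
```

The frequencies always sum to the number of observations, and the percentages to 100.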

Distribution Form

An important aspect of the "description" of a variable is the shape of its distribution, which shows how frequently values fall in various intervals of the variable's range. Usually, a researcher is interested in how well a distribution can be approximated by a normal distribution. Simple descriptive statistics provide some information relevant to this question. For example, skewness measures the asymmetry of the data distribution: if it is clearly different from 0, the distribution is asymmetric, while a skewness of 0 is consistent with a symmetric distribution such as the normal distribution (although zero skewness alone does not prove normality). Kurtosis measures the peakedness and tail weight of the distribution: if the excess kurtosis is clearly different from 0, the distribution is flatter or more pointed than the normal distribution. The excess kurtosis of the normal distribution is 0 (its kurtosis is 3, and many statistical packages report kurtosis minus 3).

More precise information can be obtained from a normality test, which assesses whether the sample plausibly comes from a normally distributed population (for example, the Kolmogorov-Smirnov test or the Shapiro-Wilk W test). However, none of these formal tests can completely replace visual inspection of the data using graphical means, such as a histogram (a graph showing the frequency distribution of a variable). A histogram, especially one with an overlaid normal curve, allows us to evaluate the normality of the empirical distribution and to examine various aspects of its shape qualitatively. For example, a distribution can be bimodal (having two peaks) or multimodal (more than two peaks). This suggests that the sample is not homogeneous and that its elements may come from two or more different populations.
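
To make skewness and kurtosis concrete, here is a minimal sketch that computes the simple moment-based (population) versions. The function name `skewness_and_excess_kurtosis` and both data sets are hypothetical. Statistical packages often apply small-sample corrections, so their output may differ slightly from these values.

```python
def skewness_and_excess_kurtosis(data):
    """Return moment-based skewness and excess kurtosis (kurtosis minus 3)."""
    n = len(data)
    mean = sum(data) / n
    # Central moments of order 2, 3 and 4 (dividing by n, not n - 1).
    m2 = sum((x - mean) ** 2 for x in data) / n
    m3 = sum((x - mean) ** 3 for x in data) / n
    m4 = sum((x - mean) ** 4 for x in data) / n
    skewness = m3 / m2 ** 1.5
    excess_kurtosis = m4 / m2 ** 2 - 3.0
    return skewness, excess_kurtosis


# Hypothetical data sets, invented for illustration only.
symmetric = [1, 2, 3, 4, 5]          # symmetric around 3
right_skewed = [1, 1, 2, 2, 3, 10]   # one large value stretches the right tail

print(skewness_and_excess_kurtosis(symmetric))
print(skewness_and_excess_kurtosis(right_skewed))
```

The symmetric set has zero skewness but a negative excess kurtosis, which illustrates that symmetry alone does not make a distribution normal.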

Central Tendency

One of the most important aspects of describing a distribution is the value at the center of the observations. Any arithmetic measure intended to represent the central value of a data set (a set of observations) is known as a measure of central tendency. Three measures of central tendency are commonly used:

  • Mean
  • Median
  • Mode

The arithmetic mean, often simply called the mean, is the most widely used measure of central tendency. It is calculated by adding up all the observed data values and dividing by the number of observations. The mean is affected by extreme values.

The median is the value that divides the set of observations into two equal parts: 50% of the observations are below the median and 50% are above it. The median of n observations x_1, x_2, ..., x_n is the value located in the middle of the data after it has been sorted. If the number of observations (n) is odd, the median is the middle value; if n is even, the median is obtained by interpolation, as the average of the two middle values. The median is not affected by extreme values.

The mode is the value that occurs most often. To determine the mode, arrange the data in ascending or descending order, then count the frequency of each value. The value with the greatest frequency is the mode. The mode can be used for both numeric and categorical data, and it is not affected by extreme values.
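
The median and mode procedures described above can be sketched as follows. The helper names `median` and `modes` and the sample lists are hypothetical. Python's standard `statistics` module offers equivalent functions (`statistics.median`, `statistics.multimode`).

```python
from collections import Counter


def median(values):
    """Middle value of the sorted data; average of the two middle values if n is even."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2


def modes(values):
    """All values that share the highest frequency (there can be more than one)."""
    counts = Counter(values)
    highest = max(counts.values())
    return [value for value, count in counts.items() if count == highest]


# Hypothetical data, invented for illustration only.
print(median([7, 3, 9, 5, 1]))        # odd n: the single middle value
print(median([7, 3, 9, 5, 1, 11]))    # even n: average of the two middle values
print(modes(["clay", "sand", "clay", "loam", "clay", "sand"]))  # categorical data
print(modes([1, 1, 2, 2, 3]))         # two values tie for the highest frequency
```

Returning a list from `modes` is a design choice that keeps ties visible instead of silently picking one value.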

Important characteristics of a good measure of center

A measure of center (an average) is a value that represents a data distribution, so it should have the following properties:

  • It must take all observations in the data set into account.
  • It should not be affected by extreme values.
  • It must be stable from sample to sample.
  • It must be suitable for use in further statistical analysis.

Of the measures of center, the mean meets almost all of these requirements. The exception is the second one: the mean is influenced by extreme values. For example, if the observations are 2; 4; 5; 6; 6; 6; 7; 7; 8; 9, then the mean, median, and mode are all equal to 6. If the last value were 90 instead of 9, the mean would be 14.10, while the median and mode would be unchanged. Although the median and mode are better in this regard, they do not meet the other requirements. The mean is therefore generally regarded as the best measure of center and is the one most often used in statistical analysis.
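
The example above can be checked with a short script. It uses the data from the text. The variable names and the use of the standard `statistics` module are choices made here for illustration.

```python
import statistics

original = [2, 4, 5, 6, 6, 6, 7, 7, 8, 9]
with_outlier = original[:-1] + [90]   # replace the last value, 9, with 90

for label, data in (("original", original), ("with outlier", with_outlier)):
    print(label,
          "mean:", statistics.mean(data),
          "median:", statistics.median(data),
          "mode:", statistics.mode(data))
```

Only the mean moves (from 6 to 14.1). The median and mode stay at 6.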

When do we use different measures of center?

The appropriate measure of center depends on the nature of the data, the shape of the frequency distribution, and the purpose of the analysis. If the data are qualitative (categorical), only the mode can be used. For example, if we want to know the typical soil type at a location or the typical cropping pattern in an area, we can use the mode. If the data are quantitative, any of these measures can be used, but we must consider the shape of the frequency distribution:

  • When the frequency distribution is skewed (not symmetrical), the median or mode is a more appropriate measure of center.
  • When there are extreme values, whether small or large, the median or mode is more representative than the mean.
  • If the distribution is symmetrical (for example, normal), the mean, median, and mode are approximately equal and any of them can be used. However, the mean is used more often than the others because it satisfies the requirements for a good measure of center.
  • When we are dealing with rates, speeds, and prices, the harmonic mean is more appropriate.

If we are interested in relative changes, as in bacterial growth, cell division, and so on, the geometric mean is the most appropriate mean.
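
As a rough illustration of these two less common means, the sketch below uses `statistics.harmonic_mean` and `statistics.geometric_mean` from Python's standard library (available since Python 3.8). Both scenarios and their numbers are hypothetical.

```python
import statistics

# Hypothetical trip: the same distance is driven at 40 km/h out and 60 km/h back.
speeds = [40, 60]
arithmetic = statistics.mean(speeds)          # overstates the average speed
harmonic = statistics.harmonic_mean(speeds)   # total distance / total time

# Hypothetical growth factors: a culture doubles in one period, then grows 8-fold.
growth_factors = [2.0, 8.0]
geometric = statistics.geometric_mean(growth_factors)  # constant factor with the same overall growth

print("arithmetic mean speed:", arithmetic)
print("harmonic mean speed:", round(harmonic, 2))
print("geometric mean growth factor:", round(geometric, 2))
```

Applying the geometric mean factor in both periods reproduces the same 16-fold overall growth, which is why it suits relative changes.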