This is a relatively short and simple project which will calculate a few simple statistics from an array of numbers. It covers the most basic areas of classical statistics which might seem a bit old-fashioned in an era of big data and machine learning algorithms, but even the most complex of data science investigations are likely to start out with a few simple statistics.
At the heart of this project will be a struct to hold each of the stats we will be calculating, and a function which calculates the stats and uses them to populate the struct when we throw an array of data at it. There will also be a few utility functions, and of course some code to test and demonstrate what we have written.
Create a new folder somewhere and then create the following empty files in it. You can download the source code as a zip or from Github if you prefer.
- data.h
- data.c
- statistics.h
- statistics.c
- main.c
The Statistics
I don't want to mix up the discussion of the various statistics we will be calculating with the discussion of the actual code, so I will run through the statistics first. If you understand this stuff already just skip straight to Coding.
The statistics we'll calculate are the following:
- Count
- Total
- Arithmetic Mean
- Minimum
- Lower Quartile
- Median
- Upper Quartile
- Maximum
- Range
- Inter-Quartile Range
- Standard Deviation of Population
- Standard Deviation of Sample
- Variance of Population
- Variance of Sample
- Skew
Many of these are self-explanatory but a few might not be familiar so I will give a brief overview of those.
Quartiles and Medians
I am sure everyone understands the arithmetic mean: it is what most people think of as the "average" and is just the total of all numbers divided by the count. It is one of several values known as "measures of central tendency" which are intended to give an idea of a central or typical value. However, if the data is not evenly distributed this can give a distorted impression, so the median gives a better idea of a typical or central value. It is quite simply the middle value when the data is sorted into order.
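To make the difference concrete, here is a minimal sketch of the two calculations. The function names mean_of and median_of are hypothetical and are not part of the project files; median_of sorts the array it is given.

```c
#include <assert.h>
#include <stdlib.h>

/* Comparator for qsort: orders doubles ascending. */
static int cmp_double(const void *a, const void *b)
{
    double x = *(const double *)a;
    double y = *(const double *)b;
    return (x > y) - (x < y);
}

/* Arithmetic mean: the total divided by the count. */
double mean_of(const double *values, int count)
{
    assert(count > 0);
    double total = 0.0;
    for (int i = 0; i < count; i++)
        total += values[i];
    return total / count;
}

/* Median: sorts the array in place, then takes the middle value
   (or the mean of the two middle values when the count is even). */
double median_of(double *values, int count)
{
    assert(count > 0);
    qsort(values, (size_t)count, sizeof(double), cmp_double);
    if (count % 2 == 0)
        return (values[count / 2 - 1] + values[count / 2]) / 2.0;
    return values[count / 2];
}
```

For the values 3, 1, 100, 2 and 4 the mean is 22 but the median is 3: the single large value drags the mean well away from what most people would call a typical value, while the median is unaffected.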
The quartiles are examples of percentiles, i.e. the values a certain percentage of the way through the sorted data. Lower and upper quartiles are the values 25% and 75% along respectively. Their purpose is to complement or even replace the minimum and maximum values, which might be what are known as "outliers", i.e. values significantly lower or higher than the main body of values which therefore give a misleading impression of the range. I have used the terms lower quartile, median and upper quartile, but the terms 1st, 2nd and 3rd quartile are also widely used. Quartiles are probably the most widely used percentiles but any percentage can be used; the 10th and 90th percentiles (the first and ninth deciles) are also common.
Calculating quartiles and the median sounds simple: just make sure the data is sorted and pick the relevant values from it. However, it's not quite so simple because if there is an even number of values in the data there is no single middle value. In this case we take the mean of the two central values. The quartiles are the medians of the lower and upper halves of the data and, irrespective of whether the overall count is odd or even, the count of each half may itself be odd or even, requiring the same approach as with the median.
In order to show how the quartiles and median are calculated in each of the four possible permutations, I will show some sample data.
Firstly, the overall count is odd, but the count of the two halves used to calculate the quartiles is even. Values in [square brackets] are the quartiles or the values averaged to calculate the quartiles. Values in (parentheses) are the median or the values averaged to calculate the median. Note that if the count is odd we ignore the median when calculating the quartiles.
0 | [1] | [2] | 3 | (4) | 5 | [6] | [7] | 8
Secondly, the overall count is odd, but this time the count of the two halves used to calculate the quartiles is also odd so we just pick the middle values for the quartiles.
0 | 1 | [2] | 3 | 4 | (5) | 6 | 7 | [8] | 9 | 10
Next, the count is even and the count of the two halves used to calculate the quartiles is also even. The two median values are included in the values used to calculate the quartiles.
0 | 1 | [2] | [3] | 4 | (5) | (6) | 7 | [8] | [9] | 10 | 11
Lastly, the count is even but the count of the two halves used to calculate the quartiles is odd.
0 | 1 | [2] | 3 | (4) | (5) | 6 | [7] | 8 | 9
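The four cases above all follow one rule: each quartile is the median of one half of the sorted data, with the middle value left out of both halves when the count is odd. The sketch below expresses that rule directly. It is only an illustration; quartiles_sorted and median_sorted are hypothetical names, and the project code later in this article handles the four cases with explicit branches instead.

```c
#include <assert.h>

/* Median of n values that are already sorted ascending. */
static double median_sorted(const double *sorted, int n)
{
    assert(n > 0);
    if (n % 2 == 0)
        return (sorted[n / 2 - 1] + sorted[n / 2]) / 2.0;
    return sorted[n / 2];
}

/* Hypothetical helper, not part of the project files: fills in the lower
   quartile, median and upper quartile of data that is already sorted.
   Each half holds n / 2 values; when n is odd the middle value is skipped. */
void quartiles_sorted(const double *sorted, int n,
                      double *lower, double *median, double *upper)
{
    assert(n >= 2);
    int half = n / 2;            /* number of values in each half */
    int upper_start = n - half;  /* steps over the middle value when n is odd */

    *median = median_sorted(sorted, n);
    *lower = median_sorted(sorted, half);
    *upper = median_sorted(sorted + upper_start, half);
}
```

Called on the values 0 to 8 this gives 1.5, 4 and 6.5, matching the first of the four cases above.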
Measures of central tendency give no impression of how widely the data is spread so we will also calculate the range (maximum - minimum) and the inter-quartile range. The latter of course is useful in eliminating the misleading effects of any outliers. Such values are known as measures of spread and another is the standard deviation which deserves a section to itself.
Variances and Standard Deviations
The standard deviation can be thought of as the average (i.e. mean) amount by which values differ from the mean. That is not a precise definition but it gives an impression of what it signifies. Of course the actual mean of the differences from the mean would be 0, as positive and negative differences cancel out. To get round this the variance is calculated as the mean of the squared differences, and its square root is taken to obtain the standard deviation.
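In symbols, for N values x_i with mean x-bar, the definitions can be written as follows. The population versions divide by N, while the sample versions divide by N - 1. The right-hand form of each variance is the algebraically equivalent "sum of squares" version, which is the shape of the formula used in calculate_statistics later in this article.

```latex
\begin{aligned}
\sigma^2 &= \frac{1}{N}\sum_{i=1}^{N}\left(x_i-\bar{x}\right)^2
          = \frac{1}{N}\left(\sum_{i=1}^{N}x_i^2-\frac{\left(\sum_{i=1}^{N}x_i\right)^2}{N}\right),
          &\qquad \sigma &= \sqrt{\sigma^2} \\[6pt]
s^2 &= \frac{1}{N-1}\sum_{i=1}^{N}\left(x_i-\bar{x}\right)^2
     = \frac{1}{N-1}\left(\sum_{i=1}^{N}x_i^2-\frac{\left(\sum_{i=1}^{N}x_i\right)^2}{N}\right),
          &\qquad s &= \sqrt{s^2}
\end{aligned}
```

The two forms agree because expanding the square gives the sum of x_i squared minus N times the square of the mean, and N times the square of the mean is the square of the total divided by N.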
The standard deviation is a useful indicator in its own right but along with the variance it is also used to calculate other statistics such as various coefficients of skewness, as we shall see later, as well as in correlations and regressions and many other applications.
If you want a detailed description of standard deviation take a look at the Wikipedia article.
Another statistic we will calculate is a coefficient of skewness. I say a rather than the as there are plenty to choose from; the one I am using is Pearson's second skewness coefficient (median skewness), which is three times the difference between the mean and the median, divided by the standard deviation. Again you might like to read the Wikipedia article for full details, but briefly it gives an indication of how asymmetric the data is around the median.
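As a small sketch, the coefficient can be isolated in a function of its own. The name pearson_median_skew is hypothetical, and the check for a zero standard deviation is an added assumption: the project code does not guard against it.

```c
#include <assert.h>

/* Pearson's second skewness coefficient:
   3 * (mean - median) / standard deviation.
   Assumption for this sketch: when every value is identical the standard
   deviation is 0, so a skew of 0 is reported instead of dividing by it. */
double pearson_median_skew(double mean, double median, double standard_deviation)
{
    assert(standard_deviation >= 0.0);
    if (standard_deviation == 0.0)
        return 0.0;
    return 3.0 * (mean - median) / standard_deviation;
}
```

With a mean of 10, a median of 8 and a standard deviation of 4 the result is 1.5. The sign shows the direction: positive when the mean is above the median, negative when it is below.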
Coding
To test our little statistics library we will need a few arrays. The median and quartiles are calculated in different ways depending on whether the number of items is odd or even, and on whether the number of items in each half is odd or even. We will therefore need several sets of data of different sizes to test the various permutations, so we'll write a function to populate an array of the specified size with random data.
Let's get the code to create test data out of the way as it is rather boring.
Open data.h and enter the following.
data.h
//--------------------------------------------------------
// FUNCTION PROTOTYPES
//--------------------------------------------------------
void populate_data(double* data, int size);
Now open data.c and enter the function body.
data.c
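#include<stdbool.h>
#include<stdlib.h>
#include<time.h>

//--------------------------------------------------------
// FUNCTION populate_data
//--------------------------------------------------------
void populate_data(double* data, int size)
{
    // seed the generator once only: re-seeding with time(NULL) on every call
    // restarts the same sequence whenever two calls fall within the same second
    static bool seeded = false;

    if(!seeded)
    {
        srand((unsigned int)time(NULL));
        seeded = true;
    }

    for(int i = 0; i < size; i++)
    {
        // whole numbers from 1 to 128
        data[i] = (rand() % 128) + 1;
    }
}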
Now we'll write the header file for the statistics library. This is quite straightforward but apart from the function prototypes it also includes a rather large struct to hold the statistics being calculated.
statistics.h
// --------------------------------------------------------
// STRUCT statistics
// --------------------------------------------------------
typedef struct
{
    int count;
    double total;
    double arithmetic_mean;
    double minimum;
    double lower_quartile;
    double median;
    double upper_quartile;
    double maximum;
    double range;
    double inter_quartile_range;
    double standard_deviation_population;
    double standard_deviation_sample;
    double variance_population;
    double variance_sample;
    double skew;
} statistics;

// --------------------------------------------------------
// FUNCTION PROTOTYPES
// --------------------------------------------------------
void calculate_statistics(double* data, int size, statistics* stats);
void output_data(double* data, int size);
void output_statistics(statistics* stats);
We can now start working on statistics.c which aside from the core calculate_statistics function contains a few simple utility functions. Let's get those out of the way first. (Note that there are a couple of static functions to be used internally, but as they are at the beginning of the file before being called we don't need function prototypes for them.)
statistics.c (part 1)
#include<stdlib.h>
#include<stdbool.h>
#include<stdio.h>
#include<string.h>
#include<math.h>

#include"statistics.h"

//--------------------------------------------------------
// FUNCTION compare_doubles
//--------------------------------------------------------
static int compare_doubles(const void* a, const void* b)
{
    if(*(double*)a < *(double*)b)
        return -1;
    else if(*(double*)a > *(double*)b)
        return 1;
    else
        return 0;
}

//--------------------------------------------------------
// FUNCTION is_even
//--------------------------------------------------------
static bool is_even(int n)
{
    return n % 2 == 0;
}

//--------------------------------------------------------
// FUNCTION output_data
//--------------------------------------------------------
void output_data(double* data, int size)
{
    for(int i = 0; i < size; i++)
    {
        printf("%d\t%lf\n", i, data[i]);
    }
}

//--------------------------------------------------------
// FUNCTION output_statistics
//--------------------------------------------------------
void output_statistics(statistics* stats)
{
    printf("%-33s%12d\n", "Count", stats->count);
    printf("%-33s%12.4lf\n", "Total", stats->total);
    printf("%-33s%12.4lf\n", "Arithmetic Mean", stats->arithmetic_mean);
    printf("%-33s%12.4lf\n", "Minimum", stats->minimum);
    printf("%-33s%12.4lf\n", "Lower Quartile", stats->lower_quartile);
    printf("%-33s%12.4lf\n", "Median", stats->median);
    printf("%-33s%12.4lf\n", "Upper Quartile", stats->upper_quartile);
    printf("%-33s%12.4lf\n", "Maximum", stats->maximum);
    printf("%-33s%12.4lf\n", "Range", stats->range);
    printf("%-33s%12.4lf\n", "Inter-Quartile Range", stats->inter_quartile_range);
    printf("%-33s%12.4lf\n", "Standard Deviation of Population", stats->standard_deviation_population);
    printf("%-33s%12.4lf\n", "Standard Deviation of Sample", stats->standard_deviation_sample);
    printf("%-33s%12.4lf\n", "Variance of Population", stats->variance_population);
    printf("%-33s%12.4lf\n", "Variance of Sample", stats->variance_sample);
    printf("%-33s%12.4lf\n", "Skew", stats->skew);
}
The compare_doubles function is a comparator function passed to the qsort function. For more details on using qsort take a look at the article I wrote on the topic https://www.codedrome.com/using-the-c-librarys-qsort-function.
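As a quick illustration of how a comparator like this plugs into qsort, here is a minimal standalone sketch. The names ascending and sort_doubles are hypothetical and exist only for this example.

```c
#include <assert.h>
#include <stdlib.h>

/* Same idea as compare_doubles: negative, zero or positive depending on
   how the first value compares with the second. */
static int ascending(const void *a, const void *b)
{
    double x = *(const double *)a;
    double y = *(const double *)b;
    if (x < y) return -1;
    if (x > y) return 1;
    return 0;
}

/* Hypothetical wrapper: sorts count doubles in place, smallest first. */
void sort_doubles(double *values, int count)
{
    assert(count >= 0);
    /* arguments: the array, the number of items, the size of one item, the comparator */
    qsort(values, (size_t)count, sizeof(double), ascending);
}
```

A tempting shortcut is to return the difference of the two values converted to int, but that truncates towards zero, so 0.5 and 0.2 would be treated as equal. Explicit comparisons avoid the problem.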
The is_even function uses the modulus (remainder) operator. If you divide an integer by 2 and there is no remainder then the number is even.
The output_data function simply iterates a double array and prints the values, for use in demonstrating and debugging the code.
The output_statistics function is very straightforward: it just prints the members of a statistics struct. Note that the "-" in %-33s left-aligns the string, padding it with spaces to the specified width.
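To see exactly what those format specifiers produce, the sketch below writes one row into a character buffer with snprintf instead of printing it. The helper name format_row is hypothetical. It uses %f rather than %lf; for the printf family the two behave the same with a double.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical helper: writes one label/value row into buffer using the
   same layout as output_statistics (minus the newline) and returns the
   number of characters the full row needs. */
int format_row(char *buffer, size_t buffer_size, const char *label, double value)
{
    assert(buffer != NULL && buffer_size > 0);
    /* %-33s  : label left-aligned, padded with spaces to 33 characters
       %12.4f : value right-aligned in 12 characters with 4 decimal places */
    return snprintf(buffer, buffer_size, "%-33s%12.4f", label, value);
}
```

Formatting the label "Median" with the value 45.0 gives a row of 45 characters: the label followed by 27 spaces of padding, then 45.0000 preceded by 5 spaces.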
We can now implement the calculate_statistics function. Rather than describe the code separately I have commented each main part.
statistics.c (part 2)
//--------------------------------------------------------
// FUNCTION calculate_statistics
//--------------------------------------------------------
void calculate_statistics(double* data, int size, statistics* stats)
{
    double sum_of_squares = 0;
    int lower_quartile_index_1;
    int lower_quartile_index_2;

    // data needs to be sorted for median etc
    qsort(data, size, sizeof(double), compare_doubles);

    output_data(data, size);

    // count is just the size of the data set
    stats->count = size;

    // initialize total to 0, and then iterate data
    // calculating total and sum of squares
    stats->total = 0;
    for(int i = 0; i < size; i++)
    {
        stats->total += data[i];
        sum_of_squares += pow(data[i], 2);
    }

    // the arithmetic mean is simply the total divided by the count
    stats->arithmetic_mean = stats->total / stats->count;

    // method of calculating median and quartiles is different for odd and even count
    if(is_even(stats->count))
    {
        stats->median = (data[((stats->count) / 2) - 1] + data[stats->count / 2]) / 2;

        // even / even
        if(is_even(stats->count / 2))
        {
            lower_quartile_index_1 = (stats->count / 2) / 2;
            lower_quartile_index_2 = lower_quartile_index_1 - 1;

            stats->lower_quartile = (data[lower_quartile_index_1] + data[lower_quartile_index_2]) / 2;
            stats->upper_quartile = (data[stats->count - 1 - lower_quartile_index_1] + data[stats->count - 1 - lower_quartile_index_2]) / 2;
        }
        // even / odd
        else
        {
            lower_quartile_index_1 = ((stats->count / 2) - 1) / 2;

            stats->lower_quartile = data[lower_quartile_index_1];
            stats->upper_quartile = data[stats->count - 1 - lower_quartile_index_1];
        }
    }
    else
    {
        stats->median = data[((stats->count + 1) / 2) - 1];

        // odd / even
        if(is_even((stats->count - 1) / 2))
        {
            lower_quartile_index_1 = ((stats->count - 1) / 2) / 2;
            lower_quartile_index_2 = lower_quartile_index_1 - 1;

            stats->lower_quartile = (data[lower_quartile_index_1] + data[lower_quartile_index_2]) / 2;
            stats->upper_quartile = (data[stats->count - 1 - lower_quartile_index_1] + data[stats->count - 1 - lower_quartile_index_2]) / 2;
        }
        // odd / odd
        else
        {
            lower_quartile_index_1 = (((stats->count - 1) / 2) - 1) / 2;

            stats->lower_quartile = data[lower_quartile_index_1];
            stats->upper_quartile = data[stats->count - 1 - lower_quartile_index_1];
        }
    }

    // the data is sorted so the minimum and maximum are the first and last values
    stats->minimum = data[0];
    stats->maximum = data[size - 1];

    // the range is the difference between the minimum and the maximum
    stats->range = stats->maximum - stats->minimum;

    // and the inter-quartile range is the difference between the upper and lower quartiles
    stats->inter_quartile_range = stats->upper_quartile - stats->lower_quartile;

    // this is the formula for the POPULATION variance
    stats->variance_population = (sum_of_squares - ((pow(stats->total, 2)) / stats->count)) / stats->count;

    // the standard deviation is the square root of the variance
    stats->standard_deviation_population = sqrt(stats->variance_population);

    // the formula for the sample variance is slightly different in that it uses count - 1
    stats->variance_sample = (sum_of_squares - ((pow(stats->total, 2)) / stats->count)) / (stats->count - 1);

    // the sample standard deviation is the square root of the sample variance
    stats->standard_deviation_sample = sqrt(stats->variance_sample);

    // this is Pearson's second skewness coefficient, one of many measures of skewness
    stats->skew = (3.0 * (stats->arithmetic_mean - stats->median)) / stats->standard_deviation_population;
}
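The variance formula above subtracts two totals which can be large and nearly equal when the values are big compared with their spread, and in floating point that can cost precision. One common alternative is Welford's single-pass method, sketched below. This is a swapped-in technique and not what calculate_statistics does; the name variances_welford is hypothetical, and the sketch assumes at least two values because the sample variance divides by count - 1.

```c
#include <assert.h>

/* Sketch of Welford's single-pass method. Writes the population and sample
   variances of the first count values. count must be at least 2 because the
   sample variance divides by count - 1. */
void variances_welford(const double *values, int count,
                       double *variance_population, double *variance_sample)
{
    assert(count >= 2);
    double mean = 0.0;         /* running mean of the values seen so far */
    double sum_sq_diff = 0.0;  /* running sum of squared differences from the mean */

    for (int i = 0; i < count; i++)
    {
        double delta = values[i] - mean;
        mean += delta / (i + 1);
        sum_sq_diff += delta * (values[i] - mean);
    }

    *variance_population = sum_sq_diff / count;
    *variance_sample = sum_sq_diff / (count - 1);
}
```

For the values 2, 4, 4, 4, 5, 5, 7 and 9 the mean is 5 and the squared differences total 32, so the population variance is 4 and the sample variance is 32 / 7. Taking the square root of either gives the corresponding standard deviation.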
Now all we need to do is write a bit of code to test the statistics. This involves nothing more than creating and populating an array of data and a statistics struct, passing them to calculate_statistics, and then printing the results. As you may have noticed, calculate_statistics prints the data after sorting it, which makes checking easier.
Because of the four different methods of calculating the quartiles and median we will need four sets of data of varying sizes to test them. I have chosen 9, 10, 11 and 12, and also included a larger data set of 138 values. The following code is the complete main.c.
main.c
#include<stdio.h>
#include<stdlib.h>

#include"statistics.h"
#include"data.h"

int main(void)
{
    puts("-----------------");
    puts("| codedrome.com |");
    puts("| Statistics    |");
    puts("-----------------\n");

    double data138[138];
    double data9[9];
    double data10[10];
    double data11[11];
    double data12[12];

    statistics s;

    populate_data(data138, 138);
    puts("data138\n--------");
    calculate_statistics(data138, 138, &s);
    output_statistics(&s);

    populate_data(data9, 9);
    puts("\ndata9\n--------");
    calculate_statistics(data9, 9, &s);
    output_statistics(&s);

    populate_data(data10, 10);
    puts("\ndata10\n--------");
    calculate_statistics(data10, 10, &s);
    output_statistics(&s);

    populate_data(data11, 11);
    puts("\ndata11\n--------");
    calculate_statistics(data11, 11, &s);
    output_statistics(&s);

    populate_data(data12, 12);
    puts("\ndata12\n--------");
    calculate_statistics(data12, 12, &s);
    output_statistics(&s);

    return EXIT_SUCCESS;
}
Now compile and build the code by running the following commands in Terminal.
Compile and Run
gcc main.c data.c statistics.c -std=c11 -lm -o main
./main
The output is rather long with all the data printed so I'll just show one of the small data sets.
Program Output
-----------------
| codedrome.com |
| Statistics    |
-----------------

data12
--------
0       13.000000
1       15.000000
2       17.000000
3       31.000000
4       33.000000
5       40.000000
6       50.000000
7       53.000000
8       59.000000
9       64.000000
10      74.000000
11      107.000000
Count                                      12
Total                                556.0000
Arithmetic Mean                       46.3333
Minimum                               13.0000
Lower Quartile                        24.0000
Median                                45.0000
Upper Quartile                        61.5000
Maximum                              107.0000
Range                                 94.0000
Inter-Quartile Range                  37.5000
Standard Deviation of Population      26.4302
Standard Deviation of Sample          27.6054
Variance of Population               698.5556
Variance of Sample                   762.0606
Skew                                   0.1513