
Question on statistical analysis?

I have a population data set containing 4 regions, and I want to run some simple descriptive statistics on it: mean, SD, SE, etc. However, in some cases there are values for only 2 or 3 of the regions, with the other regions having a value of 0. This is not a problem for calculating the mean, but I was wondering what the correct procedure is for calculating the SD. Do I omit the 0 values and calculate the SD from only the regions that have values? I would have included the 0s; however, in a published paper I noticed that, in a similar scenario, the authors reported SD values that omitted the 0s. I have included an example of what I mean below.

Also, I know there are only a few data points, which makes the data unreliable, but that is not the issue at this time.

Thanks for your help!

Region     1     2     3     4     Mean     SD
No.      150   300    80    70    150      106
No.       60   400     0    30    122.5      ?

1 Answer

  • 1 decade ago
    Favorite Answer

    It depends on your variable. If the variable from which the data are drawn can legitimately take the value 0, there is nothing wrong with the zeros, and you must calculate both the mean and the SD using all the numbers, 0s included. Omitting values just because they are zero makes no sense.

    But if the variable from which the data are drawn cannot take the value 0 and there are still some 0s in the data, they can be treated as 'missing data'. Missing data occur when some values of the variable weren't observed, or were observed but then lost, altered or wrongly reported. In such cases you see values that are outside the domain of the variable, so it is not logical to include them in the calculations and analysis. By convention, missing data are marked with a particular number (or letter, if the variable is not numerical) that is not in the domain of the variable. In your case, maybe 0 was the agreed marker for missing data (again, note that this only works if 0 is not in the domain of your observed variable).

    From your table, you can see that the mean of the second row was calculated with the 0 included, since (60+400+0+30)/4 = 122.5, so it seems that the variable can take the value 0.
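
As a quick check of the figures in the table, here is a minimal sketch using Python's standard `statistics` module. The variable names are my own, and I am assuming the SD in the table is the sample SD (dividing by n-1), which is what reproduces the 106 in the first row.

```python
from statistics import mean, stdev

# Values copied from the asker's table (regions 1 to 4).
row1 = [150, 300, 80, 70]
row2 = [60, 400, 0, 30]

# First row: should reproduce the published mean and SD.
row1_mean = mean(row1)
row1_sd = stdev(row1)  # sample SD (n-1 denominator), an assumption

# Second row: mean with the 0 kept versus with the 0 dropped.
row2_mean_with_zero = mean(row2)
row2_mean_without_zero = mean([v for v in row2 if v != 0])

print(row1_mean, round(row1_sd))          # 150 106
print(row2_mean_with_zero)                # 122.5
print(round(row2_mean_without_zero, 1))   # 163.3
```

Only the version that keeps the 0 matches the 122.5 shown in the table.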

    Anyway, there are several methods for dealing with missing data. Some people omit them. This is the simplest approach, but not the best one, because you lose information along with the time and expense you spent gathering the data. A more logical method is to ESTIMATE the missing values from the other values. There are different ways to do this; the most common is to substitute the missing value with the mean of the other values of the variable. In your table, you can estimate the missing 3rd value of the second row by averaging the other numbers in that row. That is, substitute the 0 with

    (60+400+30)/3 ≈ 163.3.

    Now you can calculate the mean, SD and other descriptive statistics including this number.
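
To see how much the choice matters for the second row, here is a minimal sketch comparing the three treatments discussed above: keeping the 0, omitting it, and replacing it with the mean of the other three values. The helper `describe` and the variable names are hypothetical, and I am assuming the sample SD (n-1) and SE = SD/sqrt(n).

```python
from math import sqrt
from statistics import mean, stdev


def describe(values):
    """Return (mean, sample SD, standard error) for a list of numbers."""
    sd = stdev(values)  # sample SD, n-1 denominator (assumption)
    return mean(values), sd, sd / sqrt(len(values))


row2 = [60, 400, 0, 30]

# Treatment 1: the 0 is a real observation, keep it.
with_zero = describe(row2)

# Treatment 2: the 0 marks a missing value, drop it.
observed = [v for v in row2 if v != 0]
omitted = describe(observed)

# Treatment 3: the 0 marks a missing value, fill it with the
# mean of the observed values (simple mean imputation).
fill = mean(observed)
imputed = [v if v != 0 else fill for v in row2]
mean_imputed = describe(imputed)

for label, (m, sd, se) in [
    ("with zero", with_zero),
    ("omitted", omitted),
    ("mean-imputed", mean_imputed),
]:
    print(f"{label:13s} mean={m:7.1f}  SD={sd:6.1f}  SE={se:6.1f}")
```

Under these assumptions the omitted and mean-imputed versions share the same mean (about 163.3), but the imputed SD comes out smaller (about 167.8 versus 205.5), because the filled-in value sits exactly at the mean and adds no spread while increasing n. Keeping the 0 gives a mean of 122.5 and an SD of about 186.6.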
