I have a large data set and need to calculate the mean, standard deviation, min, and max for several columns. The data set uses a "." to denote a missing value for a subject. When I run the mean or sd function, this causes R to return NA. Is there a simple way around this?
My code is just this:
xCAL<-mean(longdata$CAL)
sdCAL<-sd(longdata$CAL)
minCAL<-min(longdata$CAL)
maxCAL<-max(longdata$CAL)
but R returns NA for all of these variables. I get the following warning:
Warning message: In mean.default(longdata$CAL) : argument is not numeric or logical: returning NA
You need to convert your data to numeric before you can do any calculations on it. When you run as.numeric, your "." will be converted to NA, which is what R uses for missing values. Then, all of the functions you mention take an argument na.rm that can be set to TRUE to remove (rm) missing values (na).
If your data is a factor, you need to convert it to character first to avoid loss of information, as explained in the R FAQ.
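To see why the detour through character matters, here is a minimal sketch. The vector x is a made-up stand-in for a column that was read in as a factor; it is not taken from the asker's data.

```r
# Hypothetical stand-in for a column that was read in as a factor
x <- factor(c("10", "20", ".", "10"))

# Direct conversion returns the internal level codes (small integers),
# not the values that were in the file
codes <- as.numeric(x)

# Going through character recovers the values; "." becomes NA.
# as.numeric() warns about the coercion, which is suppressed here on purpose.
vals <- suppressWarnings(as.numeric(as.character(x)))

print(codes)
print(vals)
```

The first result looks numeric but is silently wrong, which is why the two-step conversion is the safer habit.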
Overall, to be safe, try this:
longdata$CAL <- as.numeric(as.character(longdata$CAL))
xCAL <- mean(longdata$CAL, na.rm = TRUE)
sdCAL <- sd(longdata$CAL, na.rm = TRUE)
minCAL <- min(longdata$CAL, na.rm = TRUE)
maxCAL <- max(longdata$CAL, na.rm = TRUE)
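Since the question mentions several columns, the same idea can be applied to all of them at once. This is only a sketch under assumptions: the data frame below is invented to stand in for longdata, and the second column name (WT) is hypothetical.

```r
# Hypothetical data standing in for longdata; WT is a made-up column name
longdata <- data.frame(
  CAL = c("1.5", ".", "2.5", "4.0"),
  WT  = c("60", "72", ".", "81"),
  stringsAsFactors = TRUE
)

cols <- c("CAL", "WT")

# Convert each column to numeric; "." becomes NA (warning suppressed on purpose)
longdata[cols] <- lapply(longdata[cols], function(x) {
  suppressWarnings(as.numeric(as.character(x)))
})

# Result is a matrix: one row per statistic, one column per variable
stats <- sapply(longdata[cols], function(x) {
  c(mean = mean(x, na.rm = TRUE),
    sd   = sd(x, na.rm = TRUE),
    min  = min(x, na.rm = TRUE),
    max  = max(x, na.rm = TRUE))
})

print(stats)
```

If the data comes from read.csv or read.table, passing na.strings = "." when reading the file is another option, since the columns then arrive as numeric with NA already in place.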
Do note that na.rm is a property of the function - it's not magic that works everywhere. If you look at the help pages for ?mean, ?sd, ?min, etc., you'll see the na.rm argument documented. If you want to remove missing values in general, the na.omit() function works well.
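A small illustration of the difference between the two approaches, using invented values:

```r
x <- c(3, NA, 5, 10)

m_default <- mean(x)                # NA, because one value is missing
m_narm    <- mean(x, na.rm = TRUE)  # 6
m_omit    <- mean(na.omit(x))       # 6

# On a data frame, na.omit() drops every row that contains any NA
df <- data.frame(a = c(1, NA, 3), b = c("x", "y", NA))
complete <- na.omit(df)

print(complete)
```

Because na.omit() on a data frame removes whole rows, a value missing in one column also discards that row's values in the other columns. For per-column summaries, na.rm = TRUE usually keeps more of the data.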