To get back to a question asked after the last course (still on nonlife insurance), I will spend some time to discuss ROC curve construction, and interpretation. Consider the dataset we’ve been using last week,
> db = read.table("http://freakonometrics.free.fr/db.txt",header=TRUE,sep=";") > attach(db)
The first step is to get a model. For instance, a logistic regression, where some factors were merged together,
> X3bis=rep(NA,length(X3)) > X3bis[X3%in%c("A","C","D")]="ACD" > X3bis[X3%in%c("B","E")]="BE" > db$X3bis=as.factor(X3bis) > reg=glm(Y~X1+X2+X3bis,family=binomial,data=db)
From this model, we can predict a probability, not a variable,
> S=predict(reg,type="response")
Let denote this variable (actually, we can use the score, or the predicted probability, it will not change the construction of our ROC curve). What if we really want to predict a variable. As we usually do in decision theory. The idea is to consider a threshold , so that
 if , then will be , or “positive” (using a standard terminology)
 si , then will be , or “negative“
Then we derive a contingency table, or a confusion matrix
observed value  
predicted
value

“positive“  “négative“  
“positive“  TP  FP  
“négative“  FN  TN 
where TP are the socalled true positive, TN the true negative, FP are the false positive (or type I error) and FN are the false negative (type II errors). We can get that contingency table for a given threshold
> roc.curve=function(s,print=FALSE){ + Ps=(S>s)*1 + FP=sum((Ps==1)*(Y==0))/sum(Y==0) + TP=sum((Ps==1)*(Y==1))/sum(Y==1) + if(print==TRUE){ + print(table(Observed=Y,Predicted=Ps)) + } + vect=c(FP,TP) + names(vect)=c("FPR","TPR") + return(vect) + } > threshold = 0.5 > roc.curve(threshold,print=TRUE) Predicted Observed 0 1 0 5 231 1 19 745 FPR TPR 0.9788136 0.9751309
Here, we also compute the false positive rates, and the true positive rates,
 TPR = TP / P = TP / (TP + FN) also called sentivity, defined as the rate of true positive: probability to be predicted positve, given that someone is positive (true positive rate)
 FPR = FP / N = FP / (FP + TN) is the rate of false positive: probability to be predicted positve, given that someone is negative (false positive rate)
The ROC curve is then obtained using severall values for the threshold. For convenience, define
> ROC.curve=Vectorize(roc.curve)
First, we can plot (a standard predicted versus observed graph), and visualize true and false positive and negative, using simple colors
> I=(((S>threshold)&(Y==0))((S<=threshold)&(Y==1))) > plot(S,Y,col=c("red","blue")[I+1],pch=19,cex=.7,,xlab="",ylab="") > abline(v=threshold,col="gray")
And for the ROC curve, simply use
> M.ROC=ROC.curve(seq(0,1,by=.01)) > plot(M.ROC[1,],M.ROC[2,],col="grey",lwd=2,type="l")
This is the ROC curve. Now, to see why it can be interesting, we need a second model. Consider for instance a classification tree
> library(tree) > ctr < tree(Y~X1+X2+X3bis,data=db) > plot(ctr) > text(ctr)
To plot the ROC curve, we just need to use the prediction obtained using this second model,
> S=predict(ctr)
All the code described above can be used. Again, we can plot (observe that we have 5 possible values for , which makes sense since we do have 5 leaves on our tree). Then, we can plot the ROC curve,
An interesting idea can be to plot the two ROC curves on the same graph, in order to compare the two models
> plot(M.ROC[1,],M.ROC[2,],type="l") > lines(M.ROC.tree[1,],M.ROC.tree[2,],type="l",col="grey",lwd=2)
The most difficult part is to get a proper interpretation. The tree is not predicting well in the lower part of the curve. This concerns people with a very high predicted probability. If our interest is more on those with a probability lower than 90%, then, we have to admit that the tree is doing a good job, since the ROC curve is always higher, comparer with the logistic regression.
Hi Arthur,
Could you please explain me what are the red and blue plots generated by the piece of code:
I=(((S>threshold)&(Y==0))((S<=threshold)&(Y==1)))
plot(S,Y,col=c("red","blue")
[I+1],pch=19,cex=.7,xlab="Predito",ylab="Observado")
abline(v=threshold,col="gray")
For me, I is a matriz with values "1" at FN and FP, am I correct? But I can't understand what is:
c("red","blue")[I+1]
Thank you, Arthur!
Thank you.
the URls are not “live” . is this on purpose ?
Repards,
pb
a migration issue…. I have to update all the links… sorry, that’s much longer than I thought
Hey man, thanks for the article! Looks good, however at least for me starting with the paragraph you define Shat the blocks of code in the article stop appearing from then onwards. Makes the article much tougher to understand!
can any one tell me what is small s, and these two lines not working what to do?
M.ROC=ROC.curve(seq(0,1,by=.01))
plot(M.ROC[1,],M.ROC[2,],col=”grey”,lwd=2,type=”l”)
Thankyou Arthur for your post. It is clear and easy to follow. I am relatively new to R & have chosen it as one of my dissertation tools. It requires that one knows a lot more than pointnclick (e.g. SPSS). However, it is definitely more exacting and thereby has less forgiving of not thoroughly befriending one’s dataset. I have racked my brains and tried everything I could for the last two days to make this code work for my project. I keep getting the following error:
“Error in table(Observed = TX, Predicted = Ps) :
all arguments must have the same length
In addition: Warning messages:
1: In (Ps == 1) * (TX == 0) :
longer object length is not a multiple of shorter object length
2: In (Ps == 1) * (TX == 1) :
longer object length is not a multiple of shorter object length”.
This (of course) is not tremendously helpful to me. I cannot figure out what is different in my model that does not allow your elegant programme to work. HELP please, Phillip
I do not see why you get those TX variables… indeed, they were not defined here…
please, see also http://freakonometrics.hypotheses.org/9307 (posted just after that post actually)
Hi Arthur,
Comme vous devinez, c’etais les cas incomplets. Merci mille fois.
Phillip
Hello!Thanks a lot for sharing.
I have a question.
M.ROC=ROC.curve(seq(0,1,by=.01))
the parameters of ROC.curve() is “threshold”? or what is the parameters of ROC.curve()?
yes, the parameter is the threshold
Very clear and excellent post. Thanks a lot for sharing.
agreed. It didn’t take me too long to work it out.
After running
plot(M.ROC[1,],M.ROC[2,],type=”l”)
lines(M.ROC.tree[1,],M.ROC.tree[2,],type=”l”,col=”grey”,lwd=2)
I’m getting an error:
Error in lines(M.ROC.tree[1, ], M.ROC.tree[2, ], type = “l”, col = “grey”, :
object ‘M.ROC.tree’ not found
Have I missed something?
yes, I did not type all the code, since it is exactly the same as before… this is a code for my students (mainly), and after some experience, I usually upload incomplete code. It takes 2 minutes to complete it, but I want my student to work a little bit, and to aviod copy and paste of the code.
hello:
i m new in R , i need the complete code its very helpful for my thesis how can i got all code.
M.ROC=ROC.curve(seq(0,1,by=.01))
plot(M.ROC[1,],M.ROC[2,],col=”grey”,lwd=2,type=”l”)
i can’t understand these two lines. Need your help .
did you run those lines ?
the first one creates a matrix with two rows, that will be used in the second to draw a graph, i.e. each dot of the curve is a (x,y) point, x beeing in row 1 and y in row 2
yes i run those lines but these lines not plotting a graph its just showing like x, y c ordinates nothing else. Why?
codes are missing?
Choosing a cutpoint in this manner is at odds with decision theory. Decision theory tells us that a cutpoint cannot be chosen without a loss/utility/cost function being specified.
Indeed. But this was not my point in this post. Actually, I mention decision theory in my course to mention that you can not get a decision ‘process’ that will give you noerror. You can easily get a decision rule that will give no False Positive, and you can get another decision rule that will give no False Negative. What I explain in my course is that any decision rule is somehow a tradeoff between the two types of errors, and indeed, it is related to some loss/utility/cost function. Based on a rule, you will get a cutoff point, but here, we would like to see what happens for all possible cutoffs. I guess your point can be related to my latest paragraph, where I mention that comparing ROC curve is not that simple, and again, it does where your focus is, which is the same as ‘it depends what your loss function is’.
I think this example will lull many readers to use cutpoints where that use will be inconsistent about what we know about optimal decisions. Decision theory shows that there is no single optimal cutpoint except in the unusual case that every subject has the same utility function and you know the function. ROC curves are only useful for onetime group decision making (“one size fits all”).
Perhaps I was not clear… my point is not ‘ROC curves are great’ or ‘ROC curves are useless’. My point was more ‘ROC curves exist and a lot of people use them’. Some students came back to me last week, asking ‘How do we interpret them?’. And the way I see interpretation is ‘if you want to discuss about a curve, first, we have to understand how that curve what plotted’. So I wrote a post to explain how those curves where plotted. ROC curves are based on the idea of using several cutoff points, and for each cutoff point, I see a Decision Theory problem, like ‘that point can be obtained for someone with a very specific loss or utility curve’. My course is not a course on theoretical statistics, but more a statistical toolbox for actuaries. And since the ROC curve is popular (for good reasons or not), I want my students to understand how this curve is obtained. And what could be the possible interpretations.
The fact that a lot of people are drawing ROC curves does not mean that they are useful for the intended purposes. ROC curves play absolutely no role in optimal Bayes decisions, which uses full conditioning and a loss function. Conditioning on X=x instead of X>x is consistent with using the available data. Incomplete conditioning such as X>x causes many artifacts and misinterpretations.
ROC curves exist and are used. Quite often they are used regardless of whether they are decision theoretically optimal due to ease of use, time constraints, and ease of explanation to the nonstat inclined.
If you have an alternative that you prefer please create a post as I would love to read more.
Thank you Arthur for the clear, easy to understand descriptions.
Again, the fact that they are used does not mean they are relevant or well understood. Understanding probabilities and checking their absolute accuracy (e.g., with nonparametric calibration curves) is more important in my opinion.
Thanks for the great post! I have two questions:
1. For the code: abline(v=seuil,col=”gray”), can I know where you defined “seuil”?
2. How you made the interactive sidebyside plot? The codes produce only static plots.
Thank you so much 🙂
sorry, ‘seuil’ is the French for ‘threshold’. I did type in French, and forgot to translate that part when I posted the code on the blog. I did change it in the text. And indeed, the code produces statics plot. You can use the animate package and a loop on s (here my loop was based on values from 100% to 40$, with 1% for each picture).
Thank you so much for the clarifications! I tried to use the animation package but failed to find the correct functions… Is it possible that you post the codes for the animation part? I am really interested in how it works 🙂 It is totally find if you prefer to make it private. Thanks in advance!
‘Sensitivity’, not ‘sensibility’!
ooops, thanks ! such a shame… Today was a busy and stressful day (promise, I will get back to the story which started a few days ago) .
Don’t worry Arthur, it’s not a shame, when you’re french you don’t notice the error 😉