Official Statements
In the past, the base R download.file() could not work with HTTPS URLs and it was necessary to use RCurl. Since R 3.3.0:
All builds have support for https: URLs in the default methods for download.file(), url() and code making use of them. Unfortunately that cannot guarantee that any particular https: URL can be accessed. ... Different access methods may allow different protocols or use private certificate bundles ...
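A minimal sketch of how one might check this on a given installation, assuming a build where capabilities("libcurl") is reported; the helper name fetch_https is hypothetical and only wraps download.file() so that a failure is reported instead of stopping the script:

```r
# TRUE if this R build was compiled with libcurl support.
has_libcurl <- isTRUE(unname(capabilities("libcurl")))

# Hypothetical helper: download `url` to `destfile`, returning TRUE on success.
fetch_https <- function(url, destfile) {
  # Prefer the libcurl method when available, otherwise let R choose.
  method <- if (has_libcurl) "libcurl" else "auto"
  status <- tryCatch(
    download.file(url, destfile = destfile, method = method,
                  mode = "wb", quiet = TRUE),
    error = function(e) {
      message("Download failed: ", conditionMessage(e))
      1L  # non-zero status signals failure, as download.file() does
    }
  )
  invisible(status == 0L)
}

# Example call (needs network access):
# fetch_https("https://www.google.com", "google.html")
```

As the quoted note warns, success for one https: URL does not imply success for another, so the result is worth checking per URL.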
The download.file() help still says:
Contributed package 'RCurl' provides more comprehensive facilities to download from URLs.
which, incidentally, include cookie and header management.
According to the RCurl FAQ (look for "When I try to interact with a URL via https, I get an error"), HTTPS URLs can be handled with:
getURL(url, cainfo="CA bundle")
where "CA bundle" is the path to a certificate authority bundle file. One such bundle is available from the curl site itself:
https://curl.haxx.se/ca/cacert.pem
Current status
For many HTTPS websites download.file() works as stated:
download.file(url = "https://www.google.com", destfile = "google.html")
download.file(url = "https://curl.haxx.se/ca/cacert.pem", destfile = "cacert.pem")
As regards RCurl, using the cacert.pem bundle downloaded above, one might get an error:
library(RCurl)
getURL("https://www.google.com", cainfo = "cacert.pem")
# Error in function (type, msg, asError = TRUE) :
#   SSL certificate problem: unable to get local issuer certificate
In this instance, simply removing the reference to the certificate bundle solves the problem:
getURL("https://www.google.com") # works
getURL("https://www.google.com", ssl.verifypeer = TRUE) # works
ssl.verifypeer = TRUE is passed explicitly to make sure that the success is not due to getURL() silently disabling certificate verification. The argument is documented in the RCurl FAQ.
However, in other instances, the connection fails:
getURL("https://curl.haxx.se/ca/cacert.pem")
# Error in function (type, msg, asError = TRUE) :
#   error:1407742E:SSL routines:SSL23_GET_SERVER_HELLO:tlsv1 alert protocol version
And similarly, using the previously downloaded bundle:
getURL("https://curl.haxx.se/ca/cacert.pem", cainfo = "cacert.pem")
# Error in function (type, msg, asError = TRUE) :
#   error:1407742E:SSL routines:SSL23_GET_SERVER_HELLO:tlsv1 alert protocol version
The same error happens even when certificate verification is disabled:
getURL("https://curl.haxx.se/ca/cacert.pem", ssl.verifypeer = FALSE)
# same error as above
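Since the "alert protocol version" message is raised during the TLS handshake, before any certificate is checked, it seems plausible that the certificate options are not the cause and that the SSL library behind RCurl is. A possible diagnostic, offered as a sketch and not as a confirmed explanation, is to compare the SSL backend reported by base R's libcurl with the one RCurl is linked against. The helper name tls_backends is hypothetical; libcurlVersion() and RCurl::curlVersion() are existing functions.

```r
# Hypothetical helper: collect the SSL/TLS library versions seen by
# base R's libcurl and, if the package is installed, by RCurl.
tls_backends <- function() {
  # attr() yields NULL when R was built without libcurl.
  out <- list(base_r = attr(libcurlVersion(), "ssl_version"))
  if (requireNamespace("RCurl", quietly = TRUE)) {
    out$RCurl <- RCurl::curlVersion()$ssl_version
  }
  out
}

print(tls_backends())
```

If the two entries differ, or RCurl reports an old SSL library, that would be consistent with the server rejecting the protocol version offered by RCurl. Note that on Windows download.file() may use a method other than libcurl, in which case the base_r entry does not describe the code path that succeeded.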
Questions
- How to use HTTPS properly in RCurl?
- As regards mere file downloads (no headers, cookies, etc.), is there any benefit in using RCurl instead of download.file()?
- Has RCurl become obsolete, and should we opt for curl instead?
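For reference on the last question, the same download can be attempted with the curl package. This is a minimal sketch assuming curl is installed; the helper name fetch_with_curl and the User-Agent value are made up for illustration, and whether it succeeds where RCurl fails depends on the libcurl and SSL library it is built against.

```r
# Hypothetical helper: download `url` to `destfile` with the curl package,
# setting one custom header to show that headers can be managed too.
fetch_with_curl <- function(url, destfile) {
  if (!requireNamespace("curl", quietly = TRUE)) {
    stop("The 'curl' package is not installed")
  }
  h <- curl::new_handle()
  curl::handle_setheaders(h, "User-Agent" = "r-https-example")
  # curl_download() returns the destination path invisibly on success.
  curl::curl_download(url, destfile, quiet = TRUE, handle = h)
}

# Example call (needs network access):
# fetch_with_curl("https://curl.haxx.se/ca/cacert.pem", "cacert.pem")
```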