Friday, June 9, 2017

R: download data securely using TLS/SSL


Official Statements

In the past, base R's download.file() could not handle HTTPS URLs, and it was necessary to use RCurl. Since R 3.3.0:

All builds have support for https: URLs in the default methods for download.file(), url() and code making use of them. Unfortunately that cannot guarantee that any particular https: URL can be accessed. ... Different access methods may allow different protocols or use private certificate bundles ...
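As a minimal sketch of what this means in practice, the snippet below checks whether the running R build reports libcurl support and picks a download.file() method accordingly. The helper name https_method() is hypothetical and used only for illustration; the download call is left commented out because it needs network access.

```r
# Hypothetical helper: pick a download.file() method for https: URLs.
# capabilities("libcurl") reports whether this R build has libcurl support.
https_method <- function() {
  if (capabilities("libcurl")[[1]]) "libcurl" else "auto"
}

# Example use (requires network access):
# download.file("https://www.google.com", destfile = "google.html",
#               method = https_method())
```

On builds where the default method already handles https: URLs, passing the method explicitly should make no difference; it only makes the choice visible.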

The download.file() help still says:

Contributed package 'RCurl' provides more comprehensive facilities to download from URLs.

which, by the way, include cookie and header management.

According to the RCurl FAQ (look for "When I try to interact with a URL via https, I get an error"), HTTPS URLs can be managed with:

getURL(url, cainfo="CA bundle") 

where CA bundle is the path to a certificate authority bundle file. One such bundle is available from the curl site itself:
https://curl.haxx.se/ca/cacert.pem
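A small wrapper can make the cainfo usage less error-prone by failing early when the bundle file is missing. This is only a sketch: fetch_with_bundle() is a hypothetical name, and it assumes the bundle has already been saved locally as cacert.pem.

```r
# Hypothetical helper: fetch a URL with RCurl using an explicit CA bundle.
# Stops with a clear message if the bundle file does not exist.
fetch_with_bundle <- function(url, bundle = "cacert.pem") {
  if (!file.exists(bundle)) {
    stop("CA bundle not found: ", bundle)
  }
  # ssl.verifypeer = TRUE keeps certificate verification switched on.
  RCurl::getURL(url, cainfo = normalizePath(bundle), ssl.verifypeer = TRUE)
}

# Example use (requires network access and the RCurl package):
# fetch_with_bundle("https://www.google.com", bundle = "cacert.pem")
```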

Current status

For many HTTPS websites download.file() works as stated:

download.file(url="https://www.google.com", destfile="google.html")
download.file(url="https://curl.haxx.se/ca/cacert.pem", destfile="cacert.pem")

With RCurl, using the cacert.pem bundle downloaded above can produce an error:

library(RCurl)
getURL("https://www.google.com", cainfo = "cacert.pem")
# Error in function (type, msg, asError = TRUE)  :
#   SSL certificate problem: unable to get local issuer certificate

In this instance, simply removing the reference to the certificate bundle solves the problem:

getURL("https://www.google.com")                      # works
getURL("https://www.google.com", ssl.verifypeer=TRUE) # works

ssl.verifypeer = TRUE is used to make sure that the success is not due to getURL() suppressing security checks. The argument is documented in the RCurl FAQ.

However, in other instances, the connection fails:

getURL("https://curl.haxx.se/ca/cacert.pem")
# Error in function (type, msg, asError = TRUE)  :
#   error:1407742E:SSL routines:SSL23_GET_SERVER_HELLO:tlsv1 alert protocol version

And similarly, using the previously downloaded bundle:

getURL("https://curl.haxx.se/ca/cacert.pem", cainfo = "cacert.pem")
# Error in function (type, msg, asError = TRUE)  :
#   error:1407742E:SSL routines:SSL23_GET_SERVER_HELLO:tlsv1 alert protocol version

The same error occurs even when certificate verification is disabled:

getURL("https://curl.haxx.se/ca/cacert.pem", ssl.verifypeer=FALSE)
# same error as above
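Because the failure persists with verification switched off, it is probably not a certificate problem. The "alert protocol version" message usually means the server did not accept the TLS version the client offered. A possible first diagnostic is to compare the SSL libraries that base R, RCurl and curl were built against. The sketch below assumes nothing about which packages are installed and reports NA for missing ones. ssl_backends() is a hypothetical helper name.

```r
# Hypothetical helper: report the SSL/TLS library behind each HTTP client.
ssl_backends <- function() {
  # Coerce a possibly missing value to a single string (NA if absent).
  first_string <- function(x) as.character(x)[1]

  out <- c(base = NA_character_, RCurl = NA_character_, curl = NA_character_)
  # libcurlVersion() is in base R; the SSL library is stored as an attribute.
  out["base"] <- first_string(attr(libcurlVersion(), "ssl_version"))
  if (requireNamespace("RCurl", quietly = TRUE)) {
    out["RCurl"] <- first_string(RCurl::curlVersion()$ssl_version)
  }
  if (requireNamespace("curl", quietly = TRUE)) {
    out["curl"] <- first_string(curl::curl_version()$ssl_version)
  }
  out
}

print(ssl_backends())
```

If RCurl reports an older SSL library than the other two, that would be consistent with download.file() succeeding where getURL() fails, although it does not prove the cause.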

Questions

  1. How to use HTTPS properly in RCurl?
  2. As regards mere file downloads (no headers, cookies, etc.), is there any benefit in using RCurl instead of download.file()?
  3. Has RCurl become obsolete, and should we opt for curl instead?
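One way to explore questions 2 and 3 empirically is to run the same URL through each client and record whether it succeeds. The following is a rough sketch, not a recommendation: try_fetch() and the clients list are hypothetical names, and the final call is commented out because it needs network access plus the RCurl and curl packages.

```r
# Run one fetch attempt; return "ok" on success or the error message on failure.
try_fetch <- function(fetch, url) {
  tryCatch({
    fetch(url)
    "ok"
  }, error = function(e) conditionMessage(e))
}

# Candidate clients, each a function of the URL.
# base and curl write to a temporary file; RCurl returns the body as a string.
clients <- list(
  base  = function(u) download.file(u, destfile = tempfile(), quiet = TRUE),
  RCurl = function(u) RCurl::getURL(u, ssl.verifypeer = TRUE),
  curl  = function(u) curl::curl_download(u, destfile = tempfile())
)

# Requires network access and the RCurl and curl packages:
# sapply(clients, try_fetch, url = "https://curl.haxx.se/ca/cacert.pem")
```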
