Thursday, January 11, 2018

Ruby Net::HTTP: Read the header BEFORE the body (without a HEAD request)?


I'm using Net::HTTP with Ruby to crawl a URL.

I don't want to crawl streaming audio such as: http://listen2.openstream.co/334

In fact, I only want to crawl HTML content, so no PDFs, video, plain text, etc.

Right now, I have both open_timeout and read_timeout set to 10, so even if I do crawl these streaming audio pages, they will time out.

    require 'net/http'
    require 'addressable/uri'

    url = 'http://listen2.openstream.co/334'
    uri = Addressable::URI.parse(url)  # parse first, so uri exists before uri.path is used
    path = uri.path

    req = Net::HTTP::Get.new(path, {
      'Accept' => '*/*',
      'Content-Type' => 'text/plain; charset=utf-8',
      'Connection' => 'keep-alive',
      'Accept-Encoding' => 'Identity'
    })

    resp = Net::HTTP.start(uri.host, uri.inferred_port) do |httpRequest|
      httpRequest.open_timeout = 10
      httpRequest.read_timeout = 10
      # how can I read the headers here before it's streaming the body
      # and then exit b/c the content type is audio?
      httpRequest.request(req)
    end

However, is there a way to check the headers BEFORE I read the body of an HTTP response, to see if it's audio? I want to do so without sending a separate HEAD request.

4 Answers

Answers 1

net/http supports streaming, and you can use this to read the headers before the body.

Code example:

    require 'net/http'

    url = URI('http://stackoverflow.com/questions/41306082/ruby-nethttp-read-the-header-before-the-body-without-head-request')

    Net::HTTP.start(url.host, url.port) do |http|
      request = Net::HTTP::Get.new(url)
      http.request(request) do |response|
        # check headers here, body has not yet been read
        # then call read_body or just body to read the body

        if true # placeholder: replace with a real check on the headers
          response.read_body do |chunk|
            # process body chunks here
          end
        end
      end
    end
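To make the header check concrete, here is a minimal sketch building on the block form of `request`. The helper name `fetch_html_only` and its `timeout:` keyword are my own inventions for illustration, not part of Net::HTTP. One detail worth knowing: as far as I can tell from the net/http source, a body that is still unread gets read anyway once the block finishes normally, so this sketch leaves via `return` from inside the block, which skips the body and lets `Net::HTTP.start` close the connection on the way out.

```ruby
require 'net/http'
require 'uri'

# Hypothetical helper (not from the original answers): returns the body as a
# String for text/html responses, or nil for any other content type.
def fetch_html_only(url, timeout: 10)
  uri = URI(url)
  Net::HTTP.start(uri.host, uri.port,
                  use_ssl: uri.scheme == 'https',
                  open_timeout: timeout, read_timeout: timeout) do |http|
    http.request(Net::HTTP::Get.new(uri)) do |response|
      # Only the status line and headers have been read at this point.
      # Returning here exits the block early, so the body is never read.
      return nil unless response.content_type.to_s.downcase == 'text/html'

      body = String.new
      response.read_body { |chunk| body << chunk }
      return body
    end
  end
end
```

Usage would be something like `html = fetch_html_only('http://listen2.openstream.co/334')`, which should come back `nil` quickly if the server reports an audio content type, instead of waiting for the read timeout.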

Answers 2

Since I did not find a way to properly do this in Net::HTTP, and I saw that you're using the addressable gem as an external dependency already, here's a solution using the wonderful http gem:

    require 'http'

    response = HTTP.get('http://listen2.openstream.co/334')
    # Here are the headers
    puts response.headers

    # Everything ok? Start streaming the response
    body = response.body
    body.stream!

    # now just call `readpartial` on the body until it returns `nil`
    # or some other break condition is met

Sorry if you're required to use Net::HTTP, hopefully someone else will find an answer. A separate HEAD request might indeed be the way to go in that case.

Answers 3

You can do a whole host of network-related things without using a gem. Just use the standard net/http library.

    require 'net/http'

    url = URI 'http://listen2.openstream.co/334'

    Net::HTTP.start(url.host, url.port) do |conn|
      conn.request_get(url) do |resp|
        resp.each do |k_header, v_header|
          # process headers
          puts "#{k_header}: #{v_header}"
        end
        #
        # resp.read_body do |body_chunk|
        #   # process body
        # end
      end
    end

Note: while processing the headers, make sure to check the Content-Type header. For audio content it would normally contain a value such as audio/mpeg.

Hope it helped.
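Since the Content-Type value often carries parameters (for example `text/html; charset=utf-8`), a plain string comparison can miss. Here is a small sketch of a check you could call on the raw header value; `html_content_type?` is a hypothetical helper name, and the list of accepted types is just my assumption about what counts as "HTML" for a crawler.

```ruby
# Hypothetical helper: true when a raw Content-Type header value names HTML.
def html_content_type?(header_value)
  return false if header_value.nil?

  # Drop parameters such as "; charset=utf-8" and normalise case.
  media_type = header_value.split(';').first.to_s.strip.downcase
  %w[text/html application/xhtml+xml].include?(media_type)
end
```

Inside the block you could then write `html_content_type?(resp['content-type'])` and only call `read_body` when it returns true.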

Answers 4

I will add a Ruby example later tonight. For a quick response, though, there is a simple trick to do this.

You can use the HTTP Range header to indicate which range of bytes you want to receive from the server. Here is an example:

curl -XGET http://www.sample-videos.com/audio/mp3/crowd-cheering.mp3 -v -H "Range: bytes=0-1"

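The above example asks the server to return only bytes 0 through 1, i.e. the first two bytes of the resource. Note that this only works if the server supports range requests; otherwise it may ignore the header and send the full response.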
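A rough Ruby equivalent of the curl command, as a sketch only: `probe_with_range` and its `timeout:` keyword are hypothetical names of mine. Because a streaming server may ignore the Range header and answer with a plain 200, the sketch also returns from inside the response block so that the body is not read in either case.

```ruby
require 'net/http'
require 'uri'

# Hypothetical helper: asks for only the first two bytes and reports what the
# server said about the resource, without reading the body.
def probe_with_range(url, timeout: 10)
  uri = URI(url)
  request = Net::HTTP::Get.new(uri)
  request['Range'] = 'bytes=0-1'

  Net::HTTP.start(uri.host, uri.port,
                  use_ssl: uri.scheme == 'https',
                  open_timeout: timeout, read_timeout: timeout) do |http|
    http.request(request) do |response|
      # partial is true only for a 206 Partial Content response;
      # a server that ignores Range will typically answer 200 instead.
      return { status: response.code.to_i,
               partial: response.is_a?(Net::HTTPPartialContent),
               content_type: response.content_type }
    end
  end
end
```

You would call it as `probe_with_range('http://www.sample-videos.com/audio/mp3/crowd-cheering.mp3')` and look at `:content_type` in the returned hash before deciding whether to fetch the page for real.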

FYI: https://developer.mozilla.org/en-US/docs/Web/HTTP/Range_requests

Hope that works for you.

Thanks
