Monday, February 27, 2017

How to avoid “Invalid byte sequence” when looking for link with text using Nokogiri

By Hường Hana 4:30 PM encoding, nokogiri, ruby, ruby-on-rails

I'm using Rails 5 with Ruby 2.4 and scanning a document that I parsed with Nokogiri, looking in a case-insensitive way for a link with text:

a_elt = doc ? doc.xpath('//a').detect { |node| /link[[:space:]]+text/i === node.text } : nil

After getting the HTML of my web page in content, I parse it into a Nokogiri doc using:

doc = Nokogiri::HTML(content)

The problem is, I'm getting

ArgumentError invalid byte sequence in UTF-8

on certain web pages when using the above regular expression.

2.4.0 :002 > doc.encoding
 => "UTF-8"
2.4.0 :003 > doc.xpath('//a').detect { |node| /individual[[:space:]]+results/i === node.text }
ArgumentError: invalid byte sequence in UTF-8
    from (irb):3:in `==='
    from (irb):3:in `block in irb_binding'
    from /Users/davea/.rvm/gems/ruby-2.4.0@global/gems/nokogiri-1.7.0/lib/nokogiri/xml/node_set.rb:187:in `block in each'
    from /Users/davea/.rvm/gems/ruby-2.4.0@global/gems/nokogiri-1.7.0/lib/nokogiri/xml/node_set.rb:186:in `upto'
    from /Users/davea/.rvm/gems/ruby-2.4.0@global/gems/nokogiri-1.7.0/lib/nokogiri/xml/node_set.rb:186:in `each'
    from (irb):3:in `detect'
    from (irb):3
    from /Users/davea/.rvm/gems/ruby-2.4.0@global/gems/railties-5.0.1/lib/rails/commands/console.rb:65:in `start'
    from /Users/davea/.rvm/gems/ruby-2.4.0@global/gems/railties-5.0.1/lib/rails/commands/console_helper.rb:9:in `start'
    from /Users/davea/.rvm/gems/ruby-2.4.0@global/gems/railties-5.0.1/lib/rails/commands/commands_tasks.rb:78:in `console'
    from /Users/davea/.rvm/gems/ruby-2.4.0@global/gems/railties-5.0.1/lib/rails/commands/commands_tasks.rb:49:in `run_command!'
    from /Users/davea/.rvm/gems/ruby-2.4.0@global/gems/railties-5.0.1/lib/rails/commands.rb:18:in `<top (required)>'
    from bin/rails:4:in `require'
    from bin/rails:4:in `<main>'

Is there a way I can rewrite the above to automatically account for the encoding or weird characters and not flip out?

1 Answer

Your question may already have been answered. Have you tried the methods from "Is there any way to clean a file of 'invalid byte sequence in UTF-8' errors in Ruby?"

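Specifically, try removing the invalid bytes and any control characters other than newlines and carriage returns before the detect block runs. scrub! and gsub! are String methods, so call them on the HTML string (content in the question) before parsing it, not on the Nokogiri document:

content.scrub!("")
content.gsub!(/[[:cntrl:]&&[^\n\r]]/, "")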

Remember, scrub! is a Ruby 2.1+ method.
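
As a rough sketch of the cleanup described above, here is a plain-Ruby version with no Nokogiri dependency. The helper names clean_html and link_text_match?, the PATTERN constant, and the sample strings are all assumptions made for illustration, not part of the original question or answer. Note that tab is a control character, so the gsub step also strips tabs; for that reason the matcher below only scrubs and leaves whitespace alone.

```ruby
# Hypothetical pattern, modeled on the one in the question.
PATTERN = /individual[[:space:]]+results/i

# Hypothetical helper: returns a cleaned copy of a raw HTML string.
# Drops invalid bytes first (String#scrub, Ruby 2.1+), then removes
# control characters other than \n and \r. This also removes tabs.
def clean_html(raw)
  raw.scrub("").gsub(/[[:cntrl:]&&[^\n\r]]/, "")
end

# Hypothetical helper: case-insensitive match against text that may
# contain invalid bytes. scrub returns a copy, so the argument is not
# modified, and whitespace (including tabs) is preserved for the match.
def link_text_match?(text)
  PATTERN === text.scrub("")
end

# A UTF-8-tagged string with a stray 0xFF byte, standing in for node.text.
dirty = "Individual  Results\xFF"

puts dirty.valid_encoding?                    # false
puts link_text_match?(dirty)                  # true
puts clean_html("a\x00b\xFF\nc").inspect      # "ab\nc"
```

Applied to the question's code, the same idea would presumably look like doc.xpath('//a').detect { |node| /link[[:space:]]+text/i === node.text.scrub("") }, which scrubs each link's text just before the comparison instead of modifying the whole page up front.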
