Sunday, April 24, 2016

Pig write to S3 via HCatStorer() “succeeds” with 0 bytes written


I created an external Hive (1.0 on EMR) table that is stored in S3. I can successfully use Hive to insert records into this table, query them back, and pull the files directly from the S3 bucket as verification. So far, so good.

I would like to be able to use Pig (v0.14, also on EMR) to both read from and write to this logical table. Loading with HCatLoader() works fine, and dump/explain confirm that my data and schema are as expected.
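For reference, the read path that works can be sketched roughly as follows. This is a minimal sketch, assuming Pig is launched with HCatalog support on the classpath (for example with `pig -useHCatalog`) and that 'mytable' is the S3-backed table described above; the alias name `b` is only illustrative.

```pig
-- Assumption: Pig was started with HCatalog jars available (e.g. pig -useHCatalog).
-- 'mytable' stands for the external, S3-backed Hive table from the question.
b = load 'mytable' using org.apache.hive.hcatalog.pig.HCatLoader();

-- Print the records read through HCatalog.
dump b;

-- Print the schema that HCatLoader derived from the Hive metastore.
describe b;
```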

When I try to write with HCatStorer(), however, I run into problems: Pig reports success, with N records written but 0 bytes written. I see nothing in the log that seems relevant or indicative of a problem, and no data is written into the table or bucket.

    a = load 'myfile' as (foo: int, bar: chararray); -- Just assume that this works.
    dump a; -- Records are there
    describe a; -- Correct schema, as specified above
    store a into 'mytable' using org.apache.hive.hcatalog.pig.HCatStorer();

The output (which, again, contains no other indication of problems that I can see) concludes with:

    Success!
    ...
    Input(s):
    Successfully read 2 records (24235 bytes) from: "myfile"

    Output(s):
    Successfully stored 2 records in: "mytable"

    Counters:
    Total records written : 2
    Total bytes written : 0
    Spillable Memory Manager spill count : 0
    Total bags proactively spilled: 0
    Total records proactively spilled: 0

Of note:

  • This works in the same environment if the table location is in HDFS instead of S3 - for both external and internal tables, and from either Hive or Pig.
  • I can successfully store directly to S3 with e.g. store a into 's3n://mybucket/output' using PigStorage(',');
  • An insert into the same table via the Hive shell works fine.
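The cases above can be put side by side in a single script to narrow down which combination fails. This is only a sketch of a reproduction, not a fix: 'mytable_hdfs' is a hypothetical name for an otherwise identical table whose location is in HDFS, and 's3n://mybucket/output' is the placeholder path from the list above.

```pig
-- Assumption: launched with HCatalog support (e.g. pig -useHCatalog).
a = load 'myfile' as (foo: int, bar: chararray);

-- Case 1: Pig -> S3 directly, no HCatalog involved (reported to work).
store a into 's3n://mybucket/output' using PigStorage(',');

-- Case 2: Pig -> HCatalog -> HDFS-backed table (reported to work).
-- 'mytable_hdfs' is a hypothetical table name used only for this comparison.
store a into 'mytable_hdfs' using org.apache.hive.hcatalog.pig.HCatStorer();

-- Case 3: Pig -> HCatalog -> S3-backed table
-- (reports N records written but 0 bytes, and no files appear in the bucket).
store a into 'mytable' using org.apache.hive.hcatalog.pig.HCatStorer();
```

Comparing the job output and the resulting files for cases 2 and 3 isolates the table location as the only variable.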

So this appears to be a problem with the interplay of Pig/HCatalog/S3 as a stack; any two of these together seem to work fine.

Given that I don't see anything very useful in the Pig log, what else should I look at to debug this? Are there any particular configuration parameters for any of these technologies that I should look at?

0 Answers
