Showing posts with label talend. Show all posts
Sunday, December 17, 2017

Running Hive or Spark using Amazon EMR on Talend?

I am trying to run Hive queries on Amazon AWS using Talend. So far I can create clusters on AWS using the tAmazonEMRManage component. The next steps would be to 1) load the tables with data and 2) run queries against the tables.

My data sits in S3. So far the Talend documentation does not seem to indicate that the Hive components tHiveLoad and tHiveRow support S3, which makes me wonder whether running Hive queries on EMR via Talend is even possible.

The documentation on how to do this is scarce. Has anyone done this successfully, or can anyone point me in the right direction?

0 Answers

Sunday, June 4, 2017

talend - specify jndi as datasource

I have a Talend job that uses a tOracleInput component with the connection type ORACLE CUSTOM. It is working well.

Now I have a requirement to use JNDI for the database connection. Any ideas on how this can be achieved?

1 Answer

Answers 1

First deploy your job as a web service. After that you should be able to change the "Use or register a shared DB Connection" setting in tOracleConnection. There you can define your JNDI datasource.

Source: https://www.talendforge.org/forum/viewtopic.php?pid=50374#p50374

Monday, April 17, 2017

receiving batch error running an update on talend into PostgreSQL database

I have a Talend job that contains a tMap --> tPostgreSQLOutput.

The schema has an integer (key field) and a Date (timestamp) with the format "dd-MM-yyyy HH:mm:ss". The intent is to update the date field with the current date and time (timestamp).

The date is set with this Talend function call in the tMap:

TalendDate.parseDate("yyyy-MM-dd HH:mm:ss", TalendDate.getDate("yyyy-MM-dd HH:mm:ss"))  

I have confirmed the date (timestamp) format, and confirmed that the column has the timestamp data type in the PostgreSQL database. However, I'm getting this error at runtime:

Batch entry 0 UPDATE "bitcoin_options" SET "last_notified" = 2017-04-08 12:02:40.000000 -05:00:00 WHERE "id" = 3 was aborted.  Call getNextException to see the cause. 

I took the query that failed and ran it manually in PostgreSQL. I got this response:

ERROR:  syntax error at or near "11"
LINE 1: ...bitcoin_options" SET "last_notified" = 2017-04-08 11:53:11.0...
                                                             ^

Again, I checked the format and the data type, and compared them against other tables and their upserts: same format, same data type.

In addition, I tried adding a second space between the date and the time, to no avail.

UPDATE 1

I updated the tMap output to:

TalendDate.getCurrentDate(); 

and got the same error. Thanks

UPDATE 2

Here's my layout for Talend:

[Image: Talend job layout]

2 Answers

Answers 1

I figured it out after much trial and error. I had three tPostgresSQLCommit components, two of which were redundant. When I removed the first two and kept just one, I got the proper output.

LESSONS LEARNED: You only need 1 commit.

Answers 2

Notice that your timestamp is not properly formatted. It should be:

UPDATE "bitcoin_options" SET "last_notified" = '2017-04-08 12:02:40.000000 -05:00:00' WHERE "id" = 3

Your query is missing the single quotes surrounding the timestamp. If you add those you should be good to go.
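As a rough illustration of the quoting point, here is a minimal plain-Java sketch that builds the same UPDATE with the timestamp wrapped in single quotes. The class and method names are made up for this example, and the formatter is pinned to UTC only so the output is reproducible; none of this is Talend-generated code. When you control the JDBC call yourself, a parameterized statement (PreparedStatement.setTimestamp) is usually the safer route, since the driver then handles the literal for you.

```java
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.TimeZone;

// Hypothetical helper class, only for illustrating the missing quotes.
class QuotedTimestamp {

    // Formats the date and wraps it in single quotes so SQL treats it as one literal.
    static String toSqlLiteral(Date date) {
        SimpleDateFormat fmt = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss");
        fmt.setTimeZone(TimeZone.getTimeZone("UTC")); // assumption: UTC, for a stable result
        return "'" + fmt.format(date) + "'";
    }

    // Builds the UPDATE from the question with a properly quoted timestamp.
    static String buildUpdate(Date date, int id) {
        return "UPDATE \"bitcoin_options\" SET \"last_notified\" = "
                + toSqlLiteral(date) + " WHERE \"id\" = " + id;
    }

    public static void main(String[] args) {
        System.out.println(buildUpdate(new Date(), 3));
    }
}
```

Without the quotes, PostgreSQL parses 2017-04-08 as arithmetic on integers and then trips over the time part, which matches the "syntax error at or near" message in the question.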

Monday, April 10, 2017

Processing a single group of rows at once in Talend Open Studio for Data Integration

I have a data source where each row has four fields:

company name; year; code; value; 

In my target output row model I want to produce a row like so

company name;year;value1;value2;value3;value4 

where value1 ... valueN are not concatenations of the values for a single code but rather a "mapping" from code to column, e.g. code 50 => "Total Revenues".

So I need to perform the following logic:

  1. First select all the records for the same company name / year
  2. Then apply some custom java logic that performs the mapping between my codes and my fields of the output row.

This is an in-memory map-reduce over about 1M rows. How should this be handled in Talend Open Studio for Data Integration?

3 Answers

Answers 1

select all the records for the same company name / year

You might want to use tAggregateRow (https://help.talend.com/display/TalendOpenStudioComponentsReferenceGuide521EN/18.1+tAggregateRow) to group the flow by company name and year.

apply some custom java logic that performs the mapping between my codes and my fields of the output row.

Talend has a component called tMap that allows you to map input fields into output fields.

In your tMap you can use something like:

(assuming that input is the name of the flow into your tMap and output is the name of your flow out of your tMap)

In output.field1 put input.code == 50 ? input.value : 0

In output.field2 put input.code == 60 ? input.value : 0

In output.field3 put input.code == 70 ? input.value : 0

etc

This is assuming you are ok with leaving the field columns with 0 if the value was for another code.

If you want the value for each code to be in a different output row out of the tMap you can use a logic similar to the above, only putting each test (code == 70? input.value : 0) in a different output table, and then filtering out the rows that have 0 (using a tFilter) after the tMap.

To add output tables you can use the + symbol on the top right of the tMap.

See here for more details on how to use tMap: https://help.talend.com/display/TalendOpenStudioComponentsReferenceGuide54EN/tMap
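One thing to watch with guard expressions like the ones above: if the code column is nullable and therefore typed as Integer rather than int, comparing a null code with == 50 throws a NullPointerException when Java unboxes it. The sketch below shows a null-safe variant as a small helper. The class and method names are hypothetical (it could live in a user routine), and the 0 default mirrors the expressions above.

```java
// Hypothetical helper, not part of Talend; it mirrors "input.code == 50 ? input.value : 0".
class CodeGuard {

    // Returns the value when the row's code matches the wanted code, otherwise 0.
    // Null codes or values are treated as "no match" instead of failing on unboxing.
    static double valueIfCode(Integer code, Double value, int wanted) {
        if (code == null || value == null) {
            return 0.0;
        }
        return code == wanted ? value : 0.0;
    }

    public static void main(String[] args) {
        System.out.println(valueIfCode(50, 12.5, 50));   // code matches
        System.out.println(valueIfCode(60, 12.5, 50));   // other code
        System.out.println(valueIfCode(null, 12.5, 50)); // missing code
    }
}
```

In a tMap expression this would be called as CodeGuard.valueIfCode(input.code, input.value, 50), assuming the helper is made available to the job.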

I hope this helps!

Answers 2

You could do it like this; it is essentially the approach Maira Bay already suggested:

  1. Set up your data source to emit those lines one at a time. I used tFixedFlowInput for that. You'd probably have to read from a file.
  2. Optionally sort by company name and year with a tSortRow.
  3. Map with a tMap the value of each line to the corresponding column in the result line with a guard clause like input.code.equals("code for this column") ? input.value : null .
  4. Aggregate the rows with a tAggregateRow, grouping by company name and year, selecting the first value for each of the value columns - but make sure to ignore the nulls.
  5. Do anything you want with the resulting lines.

I tried that with some sample data, hence the tFixedFlowInput in step 1, and it worked for me on my machine in TOS 6.3.1.

Beware: the proposed solution assumes there is only one value per combination of company name, year and code.
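For anyone who wants to see the map-then-aggregate logic outside of Talend, here is a minimal plain-Java sketch of the same idea: each row is placed into the column that belongs to its code, rows are grouped by company name and year, and the first non-null value per column wins. The Row class, the code-to-column table (50, 60, 70, 80) and the sample data are assumptions for illustration only; this is not what Talend generates.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class PivotSketch {

    // One input row: company name; year; code; value
    static final class Row {
        final String company;
        final int year;
        final int code;
        final double value;

        Row(String company, int year, int code, double value) {
            this.company = company;
            this.year = year;
            this.code = code;
            this.value = value;
        }
    }

    // Assumed mapping from code to output column index (value1..value4).
    static final Map<Integer, Integer> CODE_TO_COLUMN = Map.of(50, 0, 60, 1, 70, 2, 80, 3);

    // Groups by "company;year" and keeps the first value seen for each column.
    static Map<String, Double[]> pivot(List<Row> rows) {
        Map<String, Double[]> out = new LinkedHashMap<>();
        for (Row r : rows) {
            Integer column = CODE_TO_COLUMN.get(r.code);
            if (column == null) {
                continue; // code has no column in the output model
            }
            Double[] values = out.computeIfAbsent(r.company + ";" + r.year, k -> new Double[4]);
            if (values[column] == null) {
                values[column] = r.value; // "first" aggregation, ignoring nulls
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<Row> rows = List.of(
                new Row("Acme", 2016, 50, 100.0),
                new Row("Acme", 2016, 60, 40.0),
                new Row("Beta", 2016, 50, 7.0));
        pivot(rows).forEach((key, values) -> System.out.println(key + ";" + Arrays.toString(values)));
    }
}
```

A LinkedHashMap keeps the groups in first-seen order, so no prior sort is needed here; in the Talend job the optional tSortRow plays a similar role for readability of the output.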

Answers 3

See the solution below, which I believe will fulfill your precise requirement of taking a delimited file data source and transforming it into a denormalized output as specified above.

First I mocked up a file with the same format as you specified. I made the values a logical concatenation of Company, Year, and sequence. This makes it easy to verify the output.

[Image: mocked-up input file]

Next I use that as an input, run it through a sorter, then denormalize on the value field. Finally you can see the output in a tLogRow.

[Image: job design and tLogRow output]

I also included the component view of tDenormalize so you can see how that is done. You can use this technique in any flavor of Talend Open Studio.

[Image: tDenormalize component view]

Wednesday, April 13, 2016

EOFException java.io.EOFException in Talend after x Records

I created a Talend job which does this: read record x from table A, then write record x + 1.000.000 to table A. This works great, but fails after 310 records. It clearly has nothing to do with the values in the records. If I restrict the input query to X >= 1 and X <= 300, it runs successfully without errors, and if I then change it to X >= 301 and X <= 600, it again runs successfully without any errors. But with X >= 1 and X <= 600, it stops after 310 records. I have to process thousands of records, so changing my query every 310 records is not an option.

I tried changing the "commit after" setting from 10.000 to 100, 10, even 1, but that doesn't help.

What can I do?

    java.io.EOFException
    Exception in component tJDBCOutput_1
    java.sql.SQLException: java.io.EOFException
        at com.kewill.jdbc.JdbcUnimsConnection.sendMessage(JdbcUnimsConnection.java:182)
        at com.kewill.jdbc.JdbcUnimsConnection.commit(JdbcUnimsConnection.java:255)
        at local_project.tsdsmd_0_1.tsdsmd.tJDBCInput_1Process(tsdsmd.java:12790)
        at local_project.tsdsmd_0_1.tsdsmd.runJobInTOS(tsdsmd.java:13237)
        at local_project.tsdsmd_0_1.tsdsmd.main(tsdsmd.java:13036)
    Caused by: java.io.EOFException
        at java.io.DataInputStream.readFully(Unknown Source)
        at com.kewill.jdbc.JdbcUnimsSocket.readFully(JdbcUnimsSocket.java:170)
        at com.kewill.jdbc.JdbcUnimsMessage.init(JdbcUnimsMessage.java:114)
        at com.kewill.jdbc.JdbcUnimsMessage.<init>(JdbcUnimsMessage.java:96)
        at com.kewill.jdbc.JdbcUnimsSocket.readMessage(JdbcUnimsSocket.java:122)
        at com.kewill.jdbc.JdbcUnimsSocket.sendMessage(JdbcUnimsSocket.java:106)
        at com.kewill.jdbc.JdbcUnimsSocket.sendMessage(JdbcUnimsSocket.java:89)
        at com.kewill.jdbc.JdbcUnimsConnection.sendMessage(JdbcUnimsConnection.java:166)
        ... 4 more

1 Answer

Answers 1

Let's start by asking more questions.

Questions: How much memory is allocated to the Talend job? Have you turned on debug-level logging in Log4j? Do you have the latest version of the JDBC driver? Can you use the same driver and query in a SQL tool? How long does it take to return 10 results, 300 results, all the results? Does the driver have a setting for the timeout value?
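While those questions are being answered, the observation in the question (ranges of 300 records run fine) suggests a possible stopgap: generate the ranges instead of editing the query by hand. The sketch below is plain Java with made-up names; it only computes the inclusive bounds. How each pair is fed back into the input query (for example through a context variable) is left open, and this does not address the root cause of the EOFException.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: splits an inclusive range into consecutive sub-ranges.
class RangeChunks {

    // Returns {low, high} pairs covering [first, last], each spanning at most 'size' values.
    static List<long[]> split(long first, long last, long size) {
        if (size <= 0) {
            throw new IllegalArgumentException("size must be positive");
        }
        List<long[]> chunks = new ArrayList<>();
        for (long low = first; low <= last; low += size) {
            long high = Math.min(low + size - 1, last);
            chunks.add(new long[] {low, high});
        }
        return chunks;
    }

    public static void main(String[] args) {
        // Prints one WHERE condition per chunk, in the style used in the question.
        for (long[] c : split(1, 600, 300)) {
            System.out.println("X >= " + c[0] + " and X <= " + c[1]);
        }
    }
}
```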
