I have a data source where each row has five fields:
company name; year; code; value;
In my target output row model I want to produce a row like so
company name;year;value1;value2;value3;value4
Where value1,value..N are not concatenation for a single code but rather a "mapping". I.e. code 50 => "Total Revenues"
So I need to perform the following logic:
- First select all the records for the same company name / year
- Then apply some custom java logic that performs the mapping between my codes and my fields of the output row.
This is a in-memory map reduce with about 1M rows. How should this be handled in Talend Open Studio for Data Integration?
3 Answers
Answers 1
select all the records for the same company name / year
You might want to use tAggregate
(https://help.talend.com/display/TalendOpenStudioComponentsReferenceGuide521EN/18.1+tAggregateRow) to group the flow by company name and year
apply some custom java logic that performs the mapping between my codes and my fields of the output row.
Talend has a component called tMap
that allows you to map input fields into output fields.
In your tMap
you can use something like:
(assuming that input
is the name of the flow into your tMap
and output
is the name of your flow out of your tMap
)
In output.field1
put input.code == 50? input.value : 0
In output.field2
put input.code == 60? input.value : 0
In output.field2
put input.code == 70? input.value : 0
etc
This is assuming you are ok with leaving the field
columns with 0
if the value
was for another code
.
If you want the value
for each code
to be in a different output row out of the tMap
you can use a logic similar to the above, only putting each test (code == 70? input.value : 0
) in a different output table, and then filtering out the rows that have 0
(using a tFilter
) after the tMap
.
To add output tables you can use the +
symbol on the top right of the tMap
.
See here for more detials on how to use tMap
: https://help.talend.com/display/TalendOpenStudioComponentsReferenceGuide54EN/tMap
I hope this helps!
Answers 2
You could do it like that, and it's essentially the approach Maira Bay already suggested:
- Set up your data source to emit those lines one at a time. I used tFixedFlowInput for that. You'd probably have to read from a file.
- Optionally sort by company name and year with a tSortRow.
- Map with a tMap the value of each line to the corresponding column in the result line with a guard clause like input.code.equals("code for this column") ? input.value : null .
- Aggregate the rows with a tAggregateRow, grouping by company name and year, selecting the first value for each of the value rows - but make sure to ignore the nulls.
- Do anything you want with the resulting lines.
I tried that with some sample data, hence the tFixedFlowInput in step 1, and it worked for me on my machine in TOS 6.3.1.
Beware: the solution proposed assumes you only got one value per combination of company name, year and code.
Answers 3
See solution below which I believe will fulfill your precise requirement of taking a delimited file data source and transforming it into a denormalized out as specified above.
First I mocked up a file with the same format as you specified. I made the values a logical concatenation of Company, Year, and sequence. This makes it easy to verify the output.
Next I use that as an input, run it thru a sorter, then denormalize on the value field. Finally you can see the output in a tLogRow
.
I also included the component view of tDenormalize
so you can see how that is done. You can use this technique in any falvor of Talend Open Studio
.
0 comments:
Post a Comment