Tuesday, November 24, 2015

How to design a Talend Map/Reduce Job



This post describes how to create a Talend Map/Reduce job. The job reads the file that was written to HDFS (http://sindhujak.blogspot.com/2015/11/how-to-write-file-to-hdfs-using-talend.html) and uses Map/Reduce to count how many times each word occurs.

 - Right-click Job Designs and select Create Map/Reduce Job.
 - Give the job a name and click Finish.

Create Map/Reduce job

The overall design of the job is shown below:

Map/Reduce job design
 - Now we can design the job. You need only the following components:

          - tHDFSInput: Read a file from HDFS. 
          - tNormalize: Split words in the text file.
          - tAggregateRow: Count words using the SUM function.
          - tMap: Change all the words to upper case.
          - tHDFSOutput: Write the word and the occurrence count to a file in HDFS. 
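To make the data flow easier to picture, here is a minimal plain-Java sketch of what the component chain computes. It is not the code Talend generates. It assumes the components run in the order listed above, it works on an in-memory list instead of HDFS, and the class name, the method name and the ";" field separator in the output rows are my own choices for illustration.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;

// Hypothetical stand-in for the job: lines in, "WORD;count" rows out.
class WordCountPipelineSketch {

    static List<String> run(List<String> lines) {
        Map<String, Long> counts = lines.stream()
                // tNormalize step: one row per word, split on a single space
                .flatMap(line -> Arrays.stream(line.split(" ")))
                // skip empty items produced by consecutive spaces
                .filter(word -> !word.isEmpty())
                // tAggregateRow step: group by word and count the rows
                .collect(Collectors.groupingBy(word -> word, TreeMap::new, Collectors.counting()));

        // tMap step: upper-case the word; tHDFSOutput step: format one row per word
        return counts.entrySet().stream()
                .map(e -> e.getKey().toUpperCase(Locale.ROOT) + ";" + e.getValue())
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        // Stand-in for the lines that tHDFSInput would read
        List<String> lines = Arrays.asList("the cat", "the dog");
        run(lines).forEach(System.out::println);
    }
}
```

Because the upper-casing happens after the aggregation in this sketch, "The" and "the" would be counted as two different words and would both appear as "THE" in the output.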



Basic Settings:
Since we have already created the Hadoop connection, set Property Type to Repository and choose that connection.
Browse HDFS and select the path of the text file.
Set Type to Text File.
Click Edit schema and create one column. I named mine “line”.

tHDFSInput basic settings



Basic Settings:
Set Column to normalize to “line”.
Set Item separator to a single space (“ ”).
tNormalize basic settings
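To illustrate what normalizing the “line” column means, here is a small plain-Java sketch that turns one input row into one output row per item. The class and method names are hypothetical, and this is only my approximation of the idea, not the component's actual implementation.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical stand-in for the normalize step.
class NormalizeSketch {

    // Splits one value of the "line" column into one row per item.
    static List<String> normalize(String line, String separator) {
        // Pattern.quote treats the separator as literal text, not as a regex
        String[] items = line.split(Pattern.quote(separator));
        return new ArrayList<>(Arrays.asList(items));
    }

    public static void main(String[] args) {
        // One input row becomes three output rows
        for (String row : normalize("hello big data", " ")) {
            System.out.println(row);
        }
    }
}
```

Note that in plain Java two consecutive spaces produce an empty item between them. I have not verified how tNormalize treats that case, so check the result on your own data.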


Basic Settings:
tAggregateRow Basic Settings

Schema:
tAggregateRow Schema



Schema:





Basic Settings:
tHDFSOutput basic settings


- In the Run View -> Hadoop Configurations, set the Name node, Job tracker and User name.


- Now you can run the job. If your job executed successfully, you should see the following on the console.




- Check the output folder in HDFS that you specified (in this case the folder is called “out”). The output has been written to a file: the words have been changed to upper case and the number of occurrences of each word has been calculated.




- The difference between a Talend Map/Reduce job and a Talend standard job is that the Java code of a Map/Reduce job consists of a Map function and a Reduce function. You can check this in the code section of your design workspace.
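If you have not seen the Map and Reduce split before, the following plain-Java sketch shows the general idea for a word count. It is not the code that Talend generates and it does not use the Hadoop API. The class name, the method names and the in-memory grouping step that stands in for Hadoop's shuffle are all assumptions made for illustration.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical illustration of the Map and Reduce phases of a word count.
class MapReduceSketch {

    // Map function: emit a (word, 1) pair for every word in the line.
    static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String word : line.split(" ")) {
            if (!word.isEmpty()) {
                pairs.add(new SimpleEntry<>(word, 1));
            }
        }
        return pairs;
    }

    // Reduce function: sum all the values that were emitted for one word.
    static int reduce(String word, List<Integer> values) {
        int sum = 0;
        for (int value : values) {
            sum += value;
        }
        return sum;
    }

    // Driver: groups the mapped pairs by key (the job of the shuffle), then reduces each group.
    static Map<String, Integer> run(List<String> lines) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (String line : lines) {
            for (Map.Entry<String, Integer> pair : map(line)) {
                grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>()).add(pair.getValue());
            }
        }
        Map<String, Integer> result = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> group : grouped.entrySet()) {
            result.put(group.getKey(), reduce(group.getKey(), group.getValue()));
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(run(Arrays.asList("a b a", "b c")));
    }
}
```

On a real cluster the Map function runs in parallel on different blocks of the input file, and the framework does the grouping by key before the Reduce function is called.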


