Tuesday, November 24, 2015

How to write a file to HDFS using Talend

Prerequisites

You should have the following installed, configured and running:
- Ubuntu 12.04.4
- Talend Enterprise Data Integration 5.6 (or any other Talend version that includes Big Data)
- Apache Hadoop 1.0: ensure that the Hadoop daemons are running. You can check this by using the jps command, as shown below.


Using JPS
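On a single-node Hadoop 1.x installation, jps should list roughly the daemons below. The process IDs are placeholders and will differ on your machine.

    $ jps
    4123 NameNode
    4238 DataNode
    4391 SecondaryNameNode
    4507 JobTracker
    4655 TaskTracker
    4872 Jps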

Write a file to HDFS

The following describes how to design a job to write a text file from your local machine to HDFS.


- Create a new project and open it in Talend Studio.

- Create a Hadoop cluster connection under the Metadata area in the repository.


Create Hadoop Cluster
- Give the connection a name and click on Next.
- Select the Distribution as Apache and the version of Hadoop you have installed.
- The NameNode URI and JobTracker URI depend on the configuration you provided when installing Hadoop. You can check your core-site.xml and mapred-site.xml files found in <HADOOP_HOME>/conf; the relevant properties are shown below.

Hadoop Connection Configurations
core-site.xml

mapred-site.xml
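For reference, the relevant properties in these two files typically look like the following. The host name and ports here (localhost, 9000, 9001) are only examples; use whatever values you configured during installation.

    <!-- core-site.xml: NameNode URI -->
    <property>
      <name>fs.default.name</name>
      <value>hdfs://localhost:9000</value>
    </property>

    <!-- mapred-site.xml: JobTracker address -->
    <property>
      <name>mapred.job.tracker</name>
      <value>localhost:9001</value>
    </property>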

- Then check the connection and, once it succeeds, click on Finish.
- Now you can set up the HDFS connection. Right-click the newly created Hadoop cluster connection and select “Create HDFS”.
- Give it a name and click on Next.


Create HDFS
- Give the username of the superuser that you created when installing Hadoop.
- Click on “Check” to ensure that the connection is successful.

HDFS Connection

- Now we can design the job. You need just three components:
      - tHDFSConnection: establishes the connection to HDFS.
      - tFileList: iterates over the files in a defined directory.
      - tHDFSPut: writes files to HDFS.

- The overall design of the job is shown below:

Write to HDFS Job Design
- Since we have already created the Hadoop cluster connection in the repository, you can simply click on the HDFS connection you created and drag it onto the design workspace. A list of components you can use will appear; select tHDFSConnection from the list.


tHDFSConnection Basic Settings:
- Because the component was created from the repository connection, its Basic Settings should already be populated.

tHDFSConnection Basic Settings
tFileList Basic Settings:
- Give the path of the folder that contains the text files you want written to HDFS.
- Make sure that you give the correct File mask. Note the asterisk in front of “.txt”: the mask “*.txt” matches any file that ends with “.txt”.

tFileList Basic Settings
    
tHDFSPut Basic Settings:
- Tick the “Use an existing connection” check box.
- Local directory should be the path where the source files are located. Press Ctrl + Space to get a list of variables and select “tFileList_1_CURRENT_FILEDIRECTORY”, which lets you use the directory passed by the tFileList component.
- Give the HDFS directory where you want your files to be written.

tHDFSPut Basic Settings


- Add a file mask and a new name by pressing Ctrl + Space and selecting “tFileList_1_CURRENT_FILE” from the list; the resulting expressions are shown after this list.
- Now you are ready to run your job. Once it has run successfully, browse HDFS using the web UI (http://<hostname/IP>:50070/) or the command line to check that the folder called “TextFiles” has been created and that the file has been written to it.
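For reference, selecting these entries with Ctrl + Space inserts Talend global-variable expressions of the following form into the corresponding fields (Java, as generated by the Studio):

    ((String)globalMap.get("tFileList_1_CURRENT_FILEDIRECTORY"))   // directory of the current file from tFileList
    ((String)globalMap.get("tFileList_1_CURRENT_FILE"))            // name of the current file from tFileList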

HDFS File Directory
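From the command line, a quick check might look like the following. The target directory /TextFiles and the file name sample.txt are assumptions here; substitute the HDFS path you configured in tHDFSPut and the actual file name.

    $ hadoop fs -ls /TextFiles
    $ hadoop fs -cat /TextFiles/sample.txt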



- Click on the file to view its contents.


Text File Contents

How to design a Talend Map/Reduce Job



This post describes how to create a Talend Map/Reduce job. The job reads the file that was written to HDFS (http://sindhujak.blogspot.com/2015/11/how-to-write-file-to-hdfs-using-talend.html) and counts the number of times each word occurs using Map/Reduce.

- Right-click on Job Designs and select Create Map/Reduce Job.
- Give it a name and click on Finish.

Create Map/Reduce job

The overall design of the job is shown below:

Map/Reduce job design
- Now we can design the job. You need just the following components:

          - tHDFSInput: reads the file from HDFS.
          - tNormalize: splits the text into words.
          - tAggregateRow: counts the words using the SUM function.
          - tMap: changes all the words to upper case.
          - tHDFSOutput: writes each word and its occurrence count to a file in HDFS.



tHDFSInput Basic Settings:
- Since we have already created the Hadoop connection in the repository, select Property Type as Repository and choose the connection.
- Browse HDFS and select the path of the text file.
- Select the Type as Text File.
- Click on Edit schema and create one column. I have created one called “line”.

tHDFSInput basic settings



tNormalize Basic Settings:
- Column to normalize should be “line”.
- Item separator should be “ ” (a single space).
tNormalize basic settings


tAggregateRow Basic Settings:
tAggregateRow Basic Settings

tAggregateRow Schema:
tAggregateRow Schema



Schema:





tHDFSOutput Basic Settings:
tHDFSOutput basic settings


- In the Run view, under Hadoop Configuration, set the Name node, Job tracker and User name.


- Now you can run the job. If it executes successfully, you should see the following on the console.




- Check the output folder in HDFS that you specified (in this case the folder is called “out”). You can see that the output has been written to a file: the words have been changed to upper case and the number of occurrences of each word has been calculated. You can also inspect the output from the command line, as shown below.
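A quick way to inspect the result from the command line is sketched below. This assumes the output folder is called “out” under your HDFS user directory, and that the job produces the usual Map/Reduce part files; adjust the path and wildcard to your setup.

    $ hadoop fs -ls out
    $ hadoop fs -cat out/part-*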




- The difference between a Talend Map/Reduce job and a Talend standard job is that in a Map/Reduce job the generated Java code consists of a Map function and a Reduce function. You can check this in the code section of your design workspace; a rough hand-written equivalent is sketched below.
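The generated code itself is tied to the job's components, but a minimal hand-written word count in the classic Hadoop 1.x API gives an idea of what the Map and Reduce functions do in this job (split each line into words, upper-case them, sum the counts). All class names and paths below are illustrative, not Talend's generated code.

    import java.io.IOException;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class WordCount {

        // Map: split each line on spaces and emit (WORD, 1), upper-casing like the tMap step.
        public static class TokenizerMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
            private static final IntWritable ONE = new IntWritable(1);
            private final Text word = new Text();

            @Override
            protected void map(LongWritable key, Text value, Context context)
                    throws IOException, InterruptedException {
                for (String token : value.toString().split(" ")) {
                    if (!token.isEmpty()) {
                        word.set(token.toUpperCase());
                        context.write(word, ONE);
                    }
                }
            }
        }

        // Reduce: sum the counts for each word, like the tAggregateRow SUM step.
        public static class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
            @Override
            protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                    throws IOException, InterruptedException {
                int sum = 0;
                for (IntWritable v : values) {
                    sum += v.get();
                }
                context.write(key, new IntWritable(sum));
            }
        }

        public static void main(String[] args) throws Exception {
            Job job = new Job(new Configuration(), "word count");
            job.setJarByClass(WordCount.class);
            job.setMapperClass(TokenizerMapper.class);
            job.setReducerClass(SumReducer.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);
            FileInputFormat.addInputPath(job, new Path(args[0]));   // e.g. the HDFS input file
            FileOutputFormat.setOutputPath(job, new Path(args[1])); // e.g. the "out" folder
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }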