Tuesday, November 24, 2015

How to write a file to HDFS using Talend

Prerequisites

You should have the following installed, configured and running:
- Ubuntu 12.04.4
- Talend Enterprise Data Integration 5.6, or any other Talend version that includes Big Data.
- Apache Hadoop 1.0: Ensure that the Hadoop daemons (NameNode, DataNode, SecondaryNameNode, JobTracker and TaskTracker) are running. You can check this by using the jps command.


Using JPS

Write a file to HDFS

The following describes how to design a job to write a text file from your local machine to HDFS.


- Create a new project and open it in Talend Studio.

- Create a Hadoop cluster connection under the Metadata area in the repository.


Create Hadoop Cluster
- Give a name and click on Next.
- Select the Distribution as Apache and the version of Hadoop you have installed.
- The NameNode URI and JobTracker URI depend on the configuration you set when you installed Hadoop. You can check your core-site.xml and mapred-site.xml files, found in <HADOOP_HOME>/conf.

Hadoop Connection Configurations
core-site.xml

mapred-site.xml
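If you are unsure which values to enter, the sketch below is one rough way to print the two URIs straight from the configuration files themselves. It assumes Hadoop 1.x property names and that your configuration lives under /usr/local/hadoop/conf; adjust both to your own install.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;

public class PrintClusterUris {
    public static void main(String[] args) {
        // Load the same files the Talend wizard values must match.
        // The path below is an assumption; point it at your <HADOOP_HOME>/conf.
        Configuration conf = new Configuration();
        conf.addResource(new Path("/usr/local/hadoop/conf/core-site.xml"));
        conf.addResource(new Path("/usr/local/hadoop/conf/mapred-site.xml"));

        // Hadoop 1.x property names for the two URIs the wizard asks for.
        System.out.println("NameNode URI:   " + conf.get("fs.default.name"));
        System.out.println("JobTracker URI: " + conf.get("mapred.job.tracker"));
    }
}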

- Then click on the check button to verify the connection, and click on Finish.
- Now you can set up the HDFS connection. You can do this by right-clicking the newly created Hadoop cluster connection and selecting "Create HDFS".
- Give a name and click on Next.


Create HDFS
- Give the username of the superuser that you created when installing Hadoop.
- Click on "Check" to ensure that the connection is successful.

HDFS Connection

- Now we can design the Job. You need just three components (a rough Java equivalent of what they do together is sketched below):
  tHDFSConnection: Establishes the connection to HDFS.
  tFileList: Iterates over a set of files in the defined directory.
  tHDFSPut: Writes files to HDFS.
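For readers who want to see what these three components boil down to, the sketch below does the same work in plain Hadoop 1.x Java code. The NameNode URI, the local source directory and the HDFS target folder are assumptions; substitute your own values.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

import java.io.File;
import java.io.FilenameFilter;

public class PutTextFiles {
    public static void main(String[] args) throws Exception {
        // tHDFSConnection: connect to HDFS (the URI is an example;
        // use the NameNode URI from your own core-site.xml).
        Configuration conf = new Configuration();
        conf.set("fs.default.name", "hdfs://localhost:9000");
        FileSystem fs = FileSystem.get(conf);

        // tFileList: iterate over the *.txt files in the local source
        // directory (the directory below is an example path).
        File localDir = new File("/home/user/textfiles");
        File[] textFiles = localDir.listFiles(new FilenameFilter() {
            public boolean accept(File dir, String name) {
                return name.endsWith(".txt");
            }
        });

        // tHDFSPut: copy each matching file into the HDFS target folder.
        Path target = new Path("/user/hduser/TextFiles");
        fs.mkdirs(target);
        if (textFiles != null) {
            for (File f : textFiles) {
                fs.copyFromLocalFile(new Path(f.getAbsolutePath()),
                                     new Path(target, f.getName()));
            }
        }
        fs.close();
    }
}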

- The overall design of the job is shown below:

Write to HDFS Job Design
- Since we have already created the Hadoop cluster connection in the repository, you can simply click on the HDFS connection you created and drag it to the design workspace. This gives a list of the components you can use; select tHDFSConnection from the list.


     Basic Settings: 

tHDFSConnection Basic Settings
- Next, add the tFileList component and open its Basic Settings:
  - Give the folder path that contains the text files you want written to HDFS.
  - Make sure that you give the correct File mask. Note the asterisk in front of ".txt"; this will match any file name that ends with ".txt".

tFileList Basic Settings
    
- Then add the tHDFSPut component and open its Basic Settings:
  - Tick the "Use an existing connection" check box and select the tHDFSConnection component.
  - Local directory should be the path where the source files are located. Press Ctrl + Space to get a list of available variables and select "tFileList_1_CURRENT_FILEDIRECTORY" from the list, which will use the directory passed by the tFileList component.
  - Give the path in HDFS where you want your files to be written.

tHDFSPut Basic Settings


- Add a file mask and name by pressing Ctrl + Space and selecting "tFileList_1_CURRENT_FILE" from the list.
- Now you are ready to run your job. Once the job has run successfully, browse your HDFS using the web UI (<hostname/IP>:50070/) or the command line (for example, hadoop fs -ls) to check that the folder called "TextFiles" has been created and the file has been written to it.

HDFS File Directory



- Click on the file to view its contents.


Text File Contents
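If you want to double-check the result from code rather than the web UI, here is a small sketch that lists the target folder. The NameNode URI and the HDFS path are the same assumptions used in the earlier sketch, so substitute your own values.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ListTextFiles {
    public static void main(String[] args) throws Exception {
        // Assumed NameNode URI; use the value from your core-site.xml.
        Configuration conf = new Configuration();
        conf.set("fs.default.name", "hdfs://localhost:9000");
        FileSystem fs = FileSystem.get(conf);

        // List the folder created by the job and print each file and its size.
        for (FileStatus status : fs.listStatus(new Path("/user/hduser/TextFiles"))) {
            System.out.println(status.getPath() + "  (" + status.getLen() + " bytes)");
        }
        fs.close();
    }
}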

2 comments:

  1. I followed the steps you have specified to move local files into HDFS, but it doesn't seem to work for me.
    No match file(constant_filter_valeus.csv) exists!
    [ERROR]: etl_3_1.reportfilters_0_1.reportFilters - tHDFSPut_1 - No match file(constant_filter_valeus.csv) exist.
    I am trying to copy the Linux files into HDFS.

  2. Un-checking the regex box worked for me.
