CIFS/SMB to HDFS and FTP to HDFS
Over the past few years of working with Hadoop and HDFS, two types of requests have come up regularly. The first: can we move files from a Windows SMB/CIFS file share, usually containing thousands of CSV or XLSX/XLS files, into Hadoop/HDFS? The second: can we move files from a mainframe into HDFS? This led me to develop two programs using open-source components: FTP2HDFS and CIFS2HDFS. Both programs can run on Hadoop edge nodes and use the Hadoop IPC protocol to move data directly into HDFS, or they can run as a MapReduce job and be scheduled on a regular basis with Oozie or another scheduler. CIFS2HDFS uses the JCIFS library with the Hadoop MapReduce and HDFS APIs, and FTP2HDFS uses commons-net with the Hadoop MapReduce and HDFS APIs.
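To give a sense of the underlying pattern, here is a minimal sketch (not the actual FTP2HDFS source) of how commons-net and the Hadoop FileSystem API can be combined to stream a remote file straight into HDFS from an edge node. The host name, dataset/member names, output path, and credentials are placeholders.

// Minimal sketch only: stream one file from an FTP server directly into HDFS
// using commons-net and the Hadoop FileSystem API. Not the FTP2HDFS source.
import java.io.InputStream;
import org.apache.commons.net.ftp.FTP;
import org.apache.commons.net.ftp.FTPClient;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class FtpToHdfsSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        FTPClient ftp = new FTPClient();
        ftp.connect("zftp.example.com");                 // placeholder host
        ftp.login("testzosuser", "secret");              // real tool resolves the password from the credential store
        ftp.setFileType(FTP.BINARY_FILE_TYPE);
        ftp.enterLocalPassiveMode();

        // Open the remote file as a stream and copy it directly to an HDFS path.
        InputStream in = ftp.retrieveFileStream("'T1000.COPYBOOK(DATA)'");  // placeholder PDS member
        FSDataOutputStream out = fs.create(new Path("/landing/copybook/t1000/DATA"));
        IOUtils.copyBytes(in, out, conf, true);          // closes both streams when finished
        ftp.completePendingCommand();

        ftp.logout();
        ftp.disconnect();
    }
}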
FTP2HDFS How to
Execute FTP2HDFS in Client Mode
Note: You must create a Hadoop credential store to secure the FTP credentials
• hadoop credential create ftptestzos -provider jceks://hdfs/user/username/ftp2hdfs.password.jceks
• ./ftp2hdfs get -ftp_host zftp.example.com -hdfs_outdir /landing/copybook/t1000 -ftp_pds T1000.COPYBOOK -ftp_filename DATA -transfer_type fb -ftp_userid testzosuser -ftp_pwd_alias ftptestzos -ftp_hadoop_cred_path jceks://hdfs/user/username/ftp2hdfs.password.jceks
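For reference, the -ftp_pwd_alias and -ftp_hadoop_cred_path arguments above build on Hadoop's standard CredentialProvider mechanism. The snippet below is only an illustration of how any Hadoop client code can resolve such an alias at runtime via Configuration.getPassword(); the alias and provider path are the ones created in the step above.

// Illustration only: resolve a password alias from a Hadoop JCEKS credential store.
import org.apache.hadoop.conf.Configuration;

public class CredentialLookupSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Point Hadoop at the JCEKS store created with "hadoop credential create".
        conf.set("hadoop.security.credential.provider.path",
                 "jceks://hdfs/user/username/ftp2hdfs.password.jceks");
        // getPassword() checks the configured credential providers first,
        // then falls back to the configuration itself.
        char[] ftpPassword = conf.getPassword("ftptestzos");
        System.out.println("Resolved alias? " + (ftpPassword != null));
    }
}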
Execute FTP2HDFS in MapReduce Mode
Note: You must create a Hadoop credential store to secure the FTP credentials
• hadoop credential create ftptestzos -provider jceks://hdfs/user/username/ftp2hdfs.password.jceks
• Note: It is important to set -ftp_transfer_limit 4; this prevents the MapReduce job from effectively DoSing the FTP server being used.
• ./ftp2hdfs-mr get -ftp_host zftp.example.com -hdfs_outdir /landing/copybook/t1000 -ftp_pds T1000.COPYBOOK -ftp_filename DATA -transfer_type fb -ftp_transfer_limit 4 -ftp_userid testzosuser -ftp_pwd_alias ftptestzos -ftp_hadoop_cred_path jceks://hdfs/user/username/ftp2hdfs.password.jceks
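The MapReduce mode spreads the per-file transfers across the cluster. The sketch below is not the ftp2hdfs-mr implementation; it only illustrates one common way such a job can be structured: feed a list of remote file names to the mappers (here via NLineInputFormat, one file per map task) and have each mapper pull its file into HDFS. The sketch.* configuration keys and the input listing file are assumptions for the example, and a real -ftp_transfer_limit would additionally cap how many transfers hit the FTP server at once.

// Hedged sketch only: a generic "pull files from FTP via MapReduce" job.
// Each input line is a remote file name; each mapper streams its file into HDFS.
import java.io.IOException;
import java.io.InputStream;
import org.apache.commons.net.ftp.FTP;
import org.apache.commons.net.ftp.FTPClient;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.NLineInputFormat;
import org.apache.hadoop.mapreduce.lib.output.NullOutputFormat;

public class FtpPullJobSketch {

    public static class FtpPullMapper
            extends Mapper<LongWritable, Text, NullWritable, NullWritable> {
        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            String remoteFile = value.toString().trim();   // one remote file name per input line
            Configuration conf = context.getConfiguration();
            FileSystem fs = FileSystem.get(conf);

            FTPClient ftp = new FTPClient();
            ftp.connect(conf.get("sketch.ftp.host"));
            // A real tool would resolve the password from the Hadoop credential store here.
            ftp.login(conf.get("sketch.ftp.user"), conf.get("sketch.ftp.password"));
            ftp.setFileType(FTP.BINARY_FILE_TYPE);
            ftp.enterLocalPassiveMode();

            InputStream in = ftp.retrieveFileStream(remoteFile);
            FSDataOutputStream out =
                    fs.create(new Path(conf.get("sketch.hdfs.outdir"), remoteFile));
            IOUtils.copyBytes(in, out, conf, true);
            ftp.completePendingCommand();
            ftp.logout();
            ftp.disconnect();
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("sketch.ftp.host", "zftp.example.com");   // placeholders
        conf.set("sketch.ftp.user", "testzosuser");
        conf.set("sketch.ftp.password", "secret");
        conf.set("sketch.hdfs.outdir", "/landing/copybook/t1000");
        // One remote file name per map task; a real -ftp_transfer_limit would also
        // cap how many of these map tasks run against the FTP server concurrently.
        conf.setInt(NLineInputFormat.LINES_PER_MAP, 1);

        Job job = Job.getInstance(conf, "ftp-to-hdfs sketch");
        job.setJarByClass(FtpPullJobSketch.class);
        job.setMapperClass(FtpPullMapper.class);
        job.setNumReduceTasks(0);
        job.setInputFormatClass(NLineInputFormat.class);
        job.setOutputFormatClass(NullOutputFormat.class);
        NLineInputFormat.addInputPath(job, new Path(args[0]));  // file listing the remote names
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}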
CIFS2HDFS How to
Execute CIFS2HDFS in Client Mode
Note: You must create a Hadoop credential store to secure the SMB/CIFS credentials
hadoop credential create cifstest -provider jceks://hdfs/user/gss2002/cifs2hdfs.password.jceks
Note: It is important to set -cifs_transfer_limit; a value of 4-10 is a good limit and prevents the transfer from effectively DoSing the SMB/CIFS server being used.
Example 1 – Transfer a specific file or files using the -cifs_file argument.
./cifs2hdfs -cifs_domain HDPUSR.SENIA.ORG -cifs_folder /GSS/POC/HDFS_TEST/JUNK/JUNK2/JUNK4/ -cifs_host GSSNAS.HDPUSR.SENIA.ORG -hdfs_outdir ./gss_cifs -cifs_hadoop_cred_path jceks://hdfs/user/gss2002/cifs2hdfs.password.jceks -cifs_pwd_alias cifstest -cifs_userid gss2002 -cifs_file 14924571* -cifs_transfer_limit 2
Example 2 – Transfer Entire Folders
./cifs2hdfs -cifs_domain HDPUSR.SENIA.ORG -cifs_folder /GSS/POC/HDFS_TEST/ -cifs_host GSSNAS.HDPUSR.SENIA.ORG -hdfs_outdir ./gss_cifs -cifs_hadoop_cred_path jceks://hdfs/user/gss2002/cifs2hdfs.password.jceks -cifs_pwd_alias cifstest -cifs_userid gss2002 -cifs_transfer_limit 2
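Under the covers, CIFS2HDFS is described above as pairing the JCIFS library with the Hadoop HDFS API. The following is a hedged sketch of that pairing, not the actual CIFS2HDFS code: it lists files on an SMB share that match a wildcard (analogous to -cifs_file) and copies each one into HDFS. The domain, host, share path, wildcard, and credentials are placeholders drawn from the examples above.

// Hedged sketch only: copy wildcard-matched files from an SMB/CIFS share into HDFS
// using JCIFS 1.x and the Hadoop FileSystem API. Not the CIFS2HDFS source.
import java.io.InputStream;
import jcifs.smb.NtlmPasswordAuthentication;
import jcifs.smb.SmbFile;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class CifsToHdfsSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        // JCIFS 1.x authentication; the real tool resolves the password from the credential store.
        NtlmPasswordAuthentication auth =
                new NtlmPasswordAuthentication("HDPUSR.SENIA.ORG", "gss2002", "secret");
        SmbFile folder = new SmbFile("smb://GSSNAS.HDPUSR.SENIA.ORG/GSS/POC/HDFS_TEST/", auth);

        // List share contents matching a wildcard (analogous to -cifs_file)
        // and copy each matching file into the HDFS output directory.
        for (SmbFile smbFile : folder.listFiles("14924571*")) {
            if (smbFile.isFile()) {
                InputStream in = smbFile.getInputStream();
                FSDataOutputStream out = fs.create(new Path("./gss_cifs", smbFile.getName()));
                IOUtils.copyBytes(in, out, conf, true);
            }
        }
    }
}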
Execute CIFS2HDFS in MapReduce Mode
Note: You must create a Hadoop credential store to secure the SMB/CIFS credentials
hadoop credential create cifstest -provider jceks://hdfs/user/gss2002/cifs2hdfs.password.jceks
Note: It is important to set -cifs_transfer_limit; a value of 4-10 is a good limit and prevents MapReduce from effectively DoSing the SMB/CIFS server being used.
Example 1 – Transfer a specific file or files using the -cifs_file argument.
./cifs2hdfs_mr -cifs_domain HDPUSR.SENIA.ORG -cifs_folder /GSS/POC/HDFS_TEST/JUNK/JUNK2/JUNK4/ -cifs_host GSSNAS.HDPUSR.SENIA.ORG -hdfs_outdir ./gss_cifs -cifs_hadoop_cred_path jceks://hdfs/user/gss2002/cifs2hdfs.password.jceks -cifs_pwd_alias cifstest -cifs_userid gss2002 -cifs_file 14924571* -cifs_transfer_limit 2
Example 2 – Transfer Entire Folders
./cifs2hdfs_mr -cifs_domain HDPUSR.SENIA.ORG -cifs_folder /GSS/POC/HDFS_TEST/ -cifs_host GSSNAS.HDPUSR.SENIA.ORG -hdfs_outdir ./gss_cifs -cifs_hadoop_cred_path jceks://hdfs/user/gss2002/cifs2hdfs.password.jceks -cifs_pwd_alias cifstest -cifs_userid gss2002 -cifs_transfer_limit 2
The next article in the Hadoop series will cover how JRecord can be used to convert EBCDIC variable-length/blocked copybook files FTP'd into HDFS with ftp2hdfs.