Apache Sqoop Tutorial

06 October 2023

Learn the Process to Import/Export Data Between HDFS and RDBMS

To process data with Hadoop, the data must first be loaded into the Hadoop framework. In most cases, bulk data is loaded from relational databases into Hadoop for processing. The traditional approach of writing custom scripts to bulk-load data into Hadoop proved inefficient and time-consuming, and these shortcomings led to the creation of the Sqoop tool. Sqoop is a highly flexible tool that integrates seamlessly with Hadoop and can easily import and export large amounts of data into and out of HDFS. Sqoop was originally developed at Cloudera, entered the Apache Incubator in 2011, and was promoted to an Apache top-level project in 2012. In this Apache Sqoop tutorial, we are going to cover the concepts below in detail.

1) Introduction to Apache Sqoop

2) How does Apache Sqoop work?

3) Sqoop Architecture & Working

4) Apache Sqoop features

5) Sqoop vs Flume vs HDFS

Let's get into the Apache Sqoop tutorial part.

Introduction to Apache Sqoop

In general, applications store and manage their data through a relational database management system (RDBMS), which has made relational databases one of the major sources of Big Data. This data is kept on the database server in a structured form. This is where Apache Sqoop comes in: it is an important component of the Hadoop ecosystem that provides seamless interaction between HDFS and relational database servers. Apache Sqoop is designed to create a smooth data flow between HDFS (Hadoop storage) and relational databases such as Oracle, MySQL, Teradata, SQLite, Postgres, Netezza, etc. Sqoop exports data from HDFS to relational databases and imports data from relational databases into HDFS, simplifying data transfer between Hadoop and external data sources such as relational databases and data warehouses. Sqoop can also move data from external datastores directly into Hadoop tools such as HBase and Hive.

Check out our Big Data Hadoop & Spark training by TrainingHub.io

How Does Sqoop Work?

[Diagram: Sqoop acting as the intermediary for data import and export between external data sources and Hadoop]

The diagram above gives a clear picture of how Sqoop acts as an intermediary between Hadoop and external data sources to carry out data import and export tasks.

There are two main functions associated with the Apache Sqoop tool: Sqoop import and Sqoop export.

a. Sqoop Import: This function imports data from external sources into HDFS. In HDFS, each row of a table is treated as a record. The records are stored either as text in delimited text files or as binary data in SequenceFiles. A minimal command sketch for both functions follows after these two points.

b. Sqoop Export: This function performs bulk data export from HDFS to the RDBMS. Once the imported records have been processed, the resulting data set is sent back to the relational database. The Sqoop export function reads a set of delimited text files from HDFS in parallel, parses them into records, and inserts them as new rows in a target database table.
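To make these two functions concrete, here is a minimal command sketch. The connection string, credentials, table names, and HDFS paths (jdbc:mysql://dbhost:3306/sales, sqoop_user, customers, customer_summary, and the /user/hadoop/... directories) are illustrative placeholders, not values from this tutorial; -P simply prompts for the database password instead of placing it on the command line.

    # Import: copy the RDBMS table "customers" into HDFS as delimited text files
    sqoop import \
      --connect jdbc:mysql://dbhost:3306/sales \
      --username sqoop_user -P \
      --table customers \
      --target-dir /user/hadoop/customers

    # Export: push processed records from HDFS back into the RDBMS table "customer_summary"
    sqoop export \
      --connect jdbc:mysql://dbhost:3306/sales \
      --username sqoop_user -P \
      --table customer_summary \
      --export-dir /user/hadoop/customer_summary \
      --input-fields-terminated-by ','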

Sqoop Architecture & Working

Let us take a closer look at how Apache Sqoop works in real time by using the diagram above.

How does data Import work in Sqoop?

The execution process starts when a Sqoop command is submitted. The main task is divided into subtasks, each of which is handled by an individual map task. A map task is responsible for importing its part of the data into the Hadoop ecosystem, and several map tasks work together to import the whole data set into Hadoop.
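The degree of parallelism is controlled from the command line. As a rough sketch (the connection string, table, split column, and target directory below are placeholders), -m/--num-mappers sets the number of parallel map tasks and --split-by names the column Sqoop uses to divide the work among them:

    sqoop import \
      --connect jdbc:mysql://dbhost:3306/sales \
      --username sqoop_user -P \
      --table orders \
      --split-by order_id \
      -m 8 \
      --target-dir /user/hadoop/orders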

Now let us understand how Sqoop export happens.

Sqoop export works in a similar way to Sqoop import. Here, Sqoop exports a set of files from HDFS to an external RDBMS. The input given to Sqoop consists of files containing records, which become rows in a table. The submitted job is mapped into map tasks, each of which picks up a chunk of data from HDFS. These chunks are exported to the desired destination (for example Oracle, MySQL, or SQL Server), and combining all the chunks gives the whole data set. If aggregations were required, a reduce phase would take care of them, but that is not the case with Sqoop, as it only imports and exports data between Hadoop and the RDBMS.

Based on the options specified by the user, the job launches multiple mappers. For an import, each mapper is assigned a specific share of the data, and Sqoop distributes the input as evenly as possible among the mappers to get high performance. Each mapper then uses JDBC to open a connection to the database, fetches its assigned data, and writes it into HDFS, Hive, or HBase based on the arguments specified on the command line.
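To illustrate that last point, the destination is chosen purely by command-line arguments: the same import can land in plain HDFS files or in a Hive table. The sketch below is illustrative, and the database, table, and path names are placeholders:

    # Default: write delimited files into an HDFS directory
    sqoop import --connect jdbc:mysql://dbhost:3306/sales --username sqoop_user -P \
      --table orders --target-dir /user/hadoop/orders

    # Alternative: load the same table into Hive instead
    sqoop import --connect jdbc:mysql://dbhost:3306/sales --username sqoop_user -P \
      --table orders --hive-import --hive-table orders_hive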

Apache Sqoop Features:

Sqoop comes with advanced features to handle complex data problems. Following are the salient features of Apache Sqoop.

Full Load: Apache Sqoop allows users to load huge volumes of data with a single command. It also supports loading all the tables of a database with a single command.
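A sketch of such a single-command full load (the connection string and warehouse path are placeholders); import-all-tables copies every table in the database into one HDFS warehouse directory:

    sqoop import-all-tables \
      --connect jdbc:mysql://dbhost:3306/sales \
      --username sqoop_user -P \
      --warehouse-dir /user/hadoop/sales_db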

Incremental Load: Sqoop is a highly flexible tool and supports incremental data loads. With this feature, you can load just the new or updated parts of a table instead of re-importing it in full.
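For example, an append-mode incremental import (the check column and last value below are placeholders) pulls only the rows whose check column is greater than the last imported value; Sqoop also offers a lastmodified mode for rows that are updated in place:

    sqoop import \
      --connect jdbc:mysql://dbhost:3306/sales \
      --username sqoop_user -P \
      --table orders \
      --target-dir /user/hadoop/orders \
      --incremental append \
      --check-column order_id \
      --last-value 12345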

Parallel import/export: Sqoop uses the YARN framework to execute its data import and export jobs, which provides fault tolerance on top of parallelism.

Import SQL query results: Sqoop allows you to import the result set of an arbitrary SQL query rather than a whole table.
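A sketch of such a free-form query import (the query, join, and column names are illustrative). With --query, Sqoop requires the literal $CONDITIONS token in the WHERE clause so it can append split predicates, a --split-by column when more than one mapper is used, and an explicit --target-dir; single quotes keep the shell from expanding $CONDITIONS:

    sqoop import \
      --connect jdbc:mysql://dbhost:3306/sales \
      --username sqoop_user -P \
      --query 'SELECT o.order_id, o.total, c.name FROM orders o JOIN customers c ON o.customer_id = c.id WHERE $CONDITIONS' \
      --split-by o.order_id \
      --target-dir /user/hadoop/order_report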

Compression: You can compress your data with the compress argument, which uses the deflate (gzip) algorithm by default, or point Sqoop at another Hadoop compression codec. It is also possible to load a compressed table into Hive.
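For instance (the paths and codec choice are placeholders), --compress turns on compression with the default gzip/deflate codec, and --compression-codec selects a different Hadoop codec:

    sqoop import \
      --connect jdbc:mysql://dbhost:3306/sales \
      --username sqoop_user -P \
      --table orders \
      --target-dir /user/hadoop/orders_compressed \
      --compress \
      --compression-codec org.apache.hadoop.io.compress.SnappyCodec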

Connectors: Apache Sqoop ships with connectors for a wide range of relational databases.

Kerberos Security Integration: Kerberos is a computer network authentication protocol that works on the basis of 'tickets' and allows nodes to communicate securely over an unsecured network. Sqoop supports Kerberos authentication.
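Sqoop picks up Kerberos settings from the Hadoop client configuration rather than through dedicated Sqoop flags, so on a Kerberized cluster a typical session (the principal and job details below are placeholders) simply obtains a ticket before running the job:

    # Obtain a Kerberos ticket for the ETL principal (placeholder principal)
    kinit etl_user@EXAMPLE.COM

    # Then run Sqoop as usual; the Hadoop client libraries authenticate with the ticket
    sqoop import --connect jdbc:mysql://dbhost:3306/sales --username sqoop_user -P \
      --table orders --target-dir /user/hadoop/orders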

Direct data loading into HBase/Hive: With Sqoop you can load data directly into Apache Hive for analysis, and you can also dump your data into HBase.
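The Hive variant of this was sketched earlier (--hive-import); for HBase, a hedged example (the table, column family, and row key names are placeholders) looks like this:

    # Import into an HBase table instead of an HDFS directory
    sqoop import --connect jdbc:mysql://dbhost:3306/sales --username sqoop_user -P \
      --table customers \
      --hbase-table customers \
      --column-family info \
      --hbase-row-key customer_id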

Supports Accumulo: Sqoop also supports Accumulo and allows you to import tables into Accumulo rather than into an HDFS directory.
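A sketch of an Accumulo import, assuming a Sqoop release with Accumulo support (1.4.4 or later) and an existing Accumulo instance; the instance, ZooKeeper quorum, credentials, table, and column family below are placeholders:

    sqoop import --connect jdbc:mysql://dbhost:3306/sales --username sqoop_user -P \
      --table customers \
      --accumulo-table customers \
      --accumulo-column-family info \
      --accumulo-instance accumulo_prod \
      --accumulo-zookeepers zk1:2181,zk2:2181 \
      --accumulo-user accumulo_user --accumulo-password accumulo_pass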

Conclusion: With this, we have come to the end of this Apache Sqoop tutorial. We hope it has given you detailed information on what Sqoop is, how it works, its architecture, its features, and more.

If you are about to start your career in Big Data Hadoop, we suggest you check out our Big Data Hadoop & Spark training by TrainingHub.io. This course has been designed to make you proficient in HDFS, MapReduce, YARN, HBase, Hive, Pig, Flume, Oozie, Sqoop, and more. Happy learning!