Understanding How Hive Converts SQL Queries into Hadoop Jobs

When you execute a SQL query in Apache Hive, it is not run directly by a traditional database engine. Instead, Hive translates it into a Hadoop job, which is then executed across a distributed cluster. This blog post breaks down, step by step, how Hive converts a simple SQL query into a Hadoop MapReduce job.

1. Executing the Query in Hive

Consider the following HiveQL query:

SELECT * FROM productos;

At first glance, this looks like a standard SQL query, but under the hood, Hive transforms it into a Hadoop job.

2. Query Parsing and Logical Plan Generation

Hive first parses the query to verify its syntax and then generates a logical execution plan. This plan consists of a series of operations required to execute the query efficiently.

Running EXPLAIN SELECT * FROM productos; in Hive provides a detailed breakdown of how the query will be executed.
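
An abridged, illustrative plan for a MapReduce execution might look like the sketch below. The exact output varies with the Hive version and configuration; in particular, depending on the hive.fetch.task.conversion setting, a plain SELECT * may be answered by a lightweight fetch task without launching MapReduce at all.

STAGE DEPENDENCIES:
  Stage-1 is a root stage
  Stage-0 depends on stages: Stage-1

STAGE PLANS:
  Stage-1
    Map Reduce
      Map Operator Tree:
          TableScan
            alias: productos
            Select Operator
              ...
              File Output Operator
  Stage-0
    Fetch Operator
      limit: -1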

3. Translating into a MapReduce Job

If Hive is configured to use MapReduce (instead of Tez or Spark), it converts the logical plan into a Hadoop MapReduce job. The job consists of the following components:

Mapper Phase

Since SELECT * only reads data, Hive generates a Mapper-only job without a Reducer. The mapper reads the data from HDFS and passes it along without transformation. A simplified Java equivalent of the Mapper function could be:

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import java.io.IOException;

public class TableScanMapper extends Mapper<LongWritable, Text, NullWritable, Text> {
    @Override
    public void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
        // Emit each input line unchanged; a plain table scan needs no key
        context.write(NullWritable.get(), value);
    }
}

Job Configuration

Hive then prepares a MapReduce job configuration that specifies the input and output locations:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

public class HiveTableScanJob {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "Hive SELECT * FROM productos");

        job.setJarByClass(HiveTableScanJob.class);
        job.setMapperClass(TableScanMapper.class);
        job.setNumReduceTasks(0); // Map-only job: no Reducer needed

        // The mapper emits (NullWritable, Text) pairs, so declare matching output types
        job.setOutputKeyClass(NullWritable.class);
        job.setOutputValueClass(Text.class);

        job.setInputFormatClass(TextInputFormat.class);
        job.setOutputFormatClass(TextOutputFormat.class);

        // Read the table's files from the Hive warehouse and write results to an output directory
        FileInputFormat.addInputPath(job, new Path("/user/hive/warehouse/productos"));
        FileOutputFormat.setOutputPath(job, new Path("/user/hive/output"));

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

4. Executing the Hadoop Job

Once the job is submitted, Hadoop’s YARN ResourceManager schedules its tasks, which run in containers on NodeManagers across the cluster.

To monitor the job, you can use:

yarn application -list

For logs:

yarn logs -applicationId <APPLICATION_ID>

5. Returning Results to Hive

After the job finishes:

  • If the query is part of an INSERT, results are stored in HDFS (/user/hive/output), as in the sketch after this list.
  • If it’s an interactive query, Hive retrieves the results and displays them in the CLI.
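
As a minimal sketch of the first case (the target directory is illustrative), persisting the scan results to HDFS could look like:

INSERT OVERWRITE DIRECTORY '/user/hive/output'
SELECT * FROM productos;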

Conclusion

Understanding how Hive transforms SQL into Hadoop jobs helps in optimizing queries for better performance. Whether using MapReduce, Tez, or Spark, the fundamental process remains the same: Hive parses, optimizes, translates, and executes the query in a distributed manner.

For improved performance, consider switching from MapReduce to Tez or Spark, which provide better query optimization and execution speed.
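
For example, assuming Tez (or Hive on Spark) is installed and configured on the cluster, the execution engine can be switched for the current session:

SET hive.execution.engine=tez;  -- or 'spark'
SELECT * FROM productos;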
