Friday, October 24, 2014

HIVE Q & A

How to configure remote metastore mode in Hive?


Hive provides three metastore modes: embedded, local, and remote. To configure remote mode, set:
hive.metastore.local=false
hive.metastore.uris — set this to the metastore server URI(s)
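A minimal sketch, assuming the metastore Thrift service runs on a host called metastore-host (the hostname is a placeholder; 9083 is the conventional default metastore port):
hive.metastore.local=false
hive.metastore.uris=thrift://metastore-host:9083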

Is it possible to create multiple tables in Hive for the same data?


Is it possible to have multiple tables for one data file? Initially it looks impossible to a regular RDBMS user like any other programmer, but once you think in the Hive context it becomes possible, because Hive merely creates a schema and lays it on top of an existing data file. One can have multiple schemas for one data file: each schema is saved in Hive’s metastore, and the data is not parsed, read, or serialized to disk in the given schema at definition time. The schema is applied only when the data is retrieved. Let’s say a file has 5 columns (Id, Name, Class, Section, Course); we can have multiple schemas by choosing any number of those columns, as sketched below.
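A minimal sketch of two schemas over one file (the location, table names, and delimiter here are illustrative assumptions):
-- full schema over the file
CREATE EXTERNAL TABLE students_full (id INT, name STRING, class STRING, section STRING, course STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' LOCATION '/data/students';
-- a second, narrower schema over the very same file
CREATE EXTERNAL TABLE students_brief (id INT, name STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' LOCATION '/data/students';
Both tables read the same underlying file; the second simply ignores the trailing columns.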


Use of the OVERWRITE keyword in Hive’s LOAD DATA statement


The OVERWRITE keyword in the LOAD DATA statement tells Hive to delete the existing data of the target table before loading. Before we understand OVERWRITE, we need to understand the role and use of LOAD DATA in Hive: it is the command used to load data into an existing Hive table. Note the word existing; for a better understanding of table creation, one can refer to the table-creation notes below.
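A short sketch (the file path and table name are placeholder assumptions):
-- replaces any existing rows in the table
LOAD DATA LOCAL INPATH '/tmp/students.csv' OVERWRITE INTO TABLE students_full;
-- without OVERWRITE the same statement appends instead
LOAD DATA LOCAL INPATH '/tmp/students.csv' INTO TABLE students_full;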



Interaction with HDFS during Hive Table creation



Hive itself provides an RDBMS-like feature of creating a table structure and loading data into the created tables. There is a difference, though, between loading data in an RDBMS and in Hive: Hive follows schema on read while an RDBMS follows schema on write. In short, Hive does not validate the schema at the time of data loading; it validates the schema at the time of data reading. A full schema-on-read vs. schema-on-write comparison is beyond the agenda here.
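A small sketch of schema on read in action (the file path and names are assumptions):
CREATE TABLE t (id INT, name STRING) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';
-- no schema validation happens at load time
LOAD DATA LOCAL INPATH '/tmp/maybe_malformed.csv' INTO TABLE t;
-- rows whose first field is not an integer come back with id as NULL at read time
SELECT * FROM t;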

What is the maximum size of string data type supported by Hive?

The maximum size claimed for the STRING type is 2 GB, though with practical constraints.

What is Metastore and it’s role in Hive?


It’s the central repository of Hive metadata. It has two parts: a service and the backing data store. By default it uses a Derby database on the local disk; this is referred to as the embedded metastore configuration. That configuration carries the limitation that only one session can be served at any given point of time.

Which binary storage formats does Hive support?


Hive natively supports the text file format, but it also supports binary formats: sequence files, Avro datafiles, and RCFiles. Table-creation examples for each follow the list.
  1. Sequence files: a general-purpose binary format that is splittable, compressible, and row oriented. A typical example: if we have lots of small files, we may use a sequence file as a container, where the file name is the key and the content is stored as the value. Its compression support enables a huge gain in performance.
  2. Avro datafiles: like sequence files they are splittable, compressible, and row oriented, but they additionally support schema evolution and multilingual bindings.
  3. RCFiles: Record Columnar Files, a column-oriented storage format. It breaks the table into row splits; within each split it stores all values of the first column, then all values of the second column, and so on.
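A sketch of creating a table in each format (the table and column names are assumptions; STORED AS AVRO requires a reasonably recent Hive release):
CREATE TABLE logs_seq (id INT, msg STRING) STORED AS SEQUENCEFILE;
CREATE TABLE logs_avro (id INT, msg STRING) STORED AS AVRO;
CREATE TABLE logs_rc (id INT, msg STRING) STORED AS RCFILE;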

Difference between Hive managed tables and external tables?

Hive takes care of the end-to-end management of internal (managed) tables, from creation to drop. Let’s see this in a bit more detail.

Creation of Hive Table

Managed Table

When Hive creates a managed (default) table, it follows schema on read and loads the complete file as-is, without any parsing or modification, into the Hive data warehouse directory, and the schema information is saved in the Hive metastore for later operational use. During DROP TABLE it drops the data file from the warehouse directory as well as the schema from the metastore; that is why it is called a Hive managed table.

External Table

When we create a Hive external table, Hive does not load the source file into the Hive data warehouse; it only adds the schema information to the metastore. Hive does not remove or drop anything related to the source file; it only drops the schema information from the metastore at DROP TABLE time. In short, for external tables the user is responsible for managing the data, as the sketch below illustrates.
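A minimal sketch of both flavors (the paths and names are assumptions):
-- managed: Hive places the data under its warehouse directory
CREATE TABLE managed_students (id INT, name STRING);
-- external: the data stays where it already lives
CREATE EXTERNAL TABLE ext_students (id INT, name STRING) LOCATION '/data/students';
-- removes both metadata and data
DROP TABLE managed_students;
-- removes metadata only; /data/students is untouched
DROP TABLE ext_students;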

Which database hive used for Metadata store? What are the metastore configuration hive supports?


Hive uses Derby by default and supports three types of metastore configuration:
  • Embedded Metastore
  • Local Metastore
  • Remote Metastore
Embedded Metastore
Uses a Derby database backed by files stored on local disk. It cannot support multiple sessions at the same time, and the metastore service runs in the same JVM as Hive.
Local Metastore
In this case we need a standalone database such as MySQL, with which the metastore service communicates. The benefit of this approach is that it can support multiple Hive sessions at a time; the service still runs in the same process as Hive.
Remote Metastore
The metastore service and the Hive service run in different processes, again backed by a standalone MySQL-like database. Example settings for the local and remote modes are sketched below.
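A hedged sketch of the relevant settings (the host names, database name, and driver are placeholders; the property names are the standard Hive/JDO ones):
# local metastore: Hive talks JDBC to a standalone MySQL
javax.jdo.option.ConnectionURL=jdbc:mysql://db-host/hive_metastore
javax.jdo.option.ConnectionDriverName=com.mysql.jdbc.Driver
# remote metastore: Hive talks Thrift to a separate metastore process
hive.metastore.uris=thrift://metastore-host:9083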

Is HQL case sensitive?


HQL is not case sensitive.

User Defined Function in Hive

What is a user defined function, or UDF, in Hive? If you are coming from an RDBMS background you will easily catch the concept. In simple words, UDFs are small functions designed to perform a task; they are used to achieve functionality that cannot be obtained through the primitive and native functions. UDFs are supported in most programming languages as well as in RDBMSs; Hive supports user defined functions just as proprietary solutions like Microsoft SQL Server and Oracle do.
Hive supports three types of UDF; usage examples of each follow below.
UDF: a User Defined Function works on a single row and produces a single processed row as outcome. Functions such as IsNumeric(), IsNull(), and IsEmpty() are of this type.
UDAF: a User Defined Aggregate Function works on multiple rows and produces one output row. Aggregates such as count, max, and avg are of this type.
UDTF: a User Defined Table-Generating Function operates on a single row and produces multiple rows, a table, as output.
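A sketch using built-in functions to illustrate each category (the table and column names are assumptions):
-- UDF: one row in, one row out
SELECT upper(employee_name) FROM tbl_employee;
-- UDAF: many rows in, one row out
SELECT count(*), max(salary) FROM tbl_employee;
-- UDTF: one row in, many rows out
SELECT explode(split(course_list, ',')) AS course FROM tbl_students;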

Is a multiline comment supported in Hive scripts?
No. Hive scripts support only single-line comments, which begin with --.
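For instance (the table name is assumed):
-- this whole line is a comment in a Hive script
SELECT * FROM tbl_employee;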


What is SerDe in Apache Hive ?
A SerDe is a short name for Serializer/Deserializer. Hive uses a SerDe (and FileFormat) to read and write data from tables. An important concept behind Hive is that it DOES NOT own the Hadoop File System (HDFS) format that data is stored in. Users are able to write files to HDFS with whatever tools/mechanism takes their fancy (“CREATE EXTERNAL TABLE” or “LOAD DATA INPATH”) and use Hive to correctly “parse” that file format in a way that can be used by Hive. A SerDe is a powerful (and customizable) mechanism that Hive uses to “parse” data stored in HDFS to be used by Hive.

Which classes are used by the Hive to Read and Write HDFS Files
The following classes are used by Hive to read and write HDFS files:
• TextInputFormat/HiveIgnoreKeyTextOutputFormat: these two classes read/write data in plain text file format.
• SequenceFileInputFormat/SequenceFileOutputFormat: these two classes read/write data in Hadoop SequenceFile format.


Hive currently uses these SerDe classes to serialize and deserialize data:
• MetadataTypedColumnsetSerDe: this SerDe is used to read/write delimited records like CSV, tab-separated, and control-A-separated records (quoting is not supported yet).
• ThriftSerDe: this SerDe is used to read/write Thrift serialized objects; the class file for the Thrift object must be loaded first.
• DynamicSerDe: this SerDe also reads/writes Thrift serialized objects, but it understands Thrift DDL, so the schema of the object can be provided at runtime. It also supports a lot of different protocols, including TBinaryProtocol, TJSONProtocol, and TCTLSeparatedProtocol (which writes data in delimited records).
How do you write your own custom SerDe?
• In most cases, users want to write a Deserializer instead of a full SerDe, because they only want to read their own data format rather than write it.
• For example, the RegexDeserializer will deserialize the data using the configuration parameter ‘regex’, and possibly a list of column names.
• If your SerDe supports DDL (basically, a SerDe with parameterized columns and column types), you probably want to implement a protocol based on DynamicSerDe instead of writing a SerDe from scratch, because the framework passes DDL to the SerDe through the “thrift DDL” format, and it is non-trivial to write a “thrift DDL” parser. A table-level usage sketch follows below.
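A hedged sketch of plugging a SerDe into a table definition, using the regex-based SerDe bundled with Hive (the table layout and regex are illustrative assumptions; in older releases the equivalent class lives in the hive-contrib jar):
CREATE TABLE access_log (host STRING, request STRING)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.RegexSerDe'
WITH SERDEPROPERTIES ('input.regex' = '([^ ]*) (.*)')
STORED AS TEXTFILE;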


What is ObjectInspector functionality ?
Hive uses ObjectInspector to analyze the internal structure of the row object and also the structure of the individual columns.
ObjectInspector provides a uniform way to access complex objects that can be stored in multiple formats in memory, including:
• An instance of a Java class (Thrift or native Java)
• A standard Java object (we use java.util.List to represent Struct and Array, and java.util.Map to represent Map)
• A lazily-initialized object (for example, a Struct of string fields stored in a single Java string object with a starting offset for each field)
A complex object can be represented by a pair of ObjectInspector and Java Object. The ObjectInspector not only tells us the structure of the Object, but also gives us ways to access the internal fields inside the Object.
What is the functionality of the Query Processor in Apache Hive?
This component implements the processing framework for converting SQL into a graph of map/reduce jobs, and the execution-time framework to run those jobs in the order of their dependencies.

Does Hive support record-level insert, delete, or update?
Hive does not provide record-level update, insert, or delete; hence it does not provide transactions either. However, users can go with CASE statements and built-in functions of Hive to satisfy those DML operations, so a complex update query in an RDBMS may need many lines of code in Hive, as sketched below.
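A sketch of simulating a record-level UPDATE with INSERT OVERWRITE plus CASE (the table, columns, and predicate are assumptions):
INSERT OVERWRITE TABLE tbl_employee
SELECT employee_name,
       CASE WHEN employee_name = 'John' THEN salary * 1.1 ELSE salary END AS salary
FROM tbl_employee;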

What kind of data warehouse application is suitable for Hive?
Hive is not a full database. The design constraints and limitations of Hadoop and HDFS impose limits on what Hive can do.
Hive is most suited for data warehouse applications, where 
1) Relatively static data is analyzed, 
2) Fast response times are not required, and 
3) The data is not changing rapidly.
Hive doesn’t provide crucial features required for OLTP (Online Transaction Processing); it’s closer to being an OLAP (Online Analytic Processing) tool. So, Hive is best suited for data warehouse applications, where a large data set is maintained and mined for insights, reports, etc.

How can the columns of a table in Hive be written to a file?
Using the awk command in a shell, the output from HiveQL’s DESCRIBE can be written to a file:
hive -S -e "describe table_name;" | awk -F" " '{print $1}' > ~/output
CONCAT function in Hive with example?
The CONCAT function concatenates the input strings. You can specify any number of strings separated by commas.
Example:
CONCAT('Hive','-','performs','-','good','-','in','-','Hadoop');
Output:
Hive-performs-good-in-Hadoop
Here you have to supply the delimiter '-' between every pair of strings. If the delimiter is common to all the strings, Hive provides another function, CONCAT_WS, where you specify the delimiter first:
CONCAT_WS('-','Hive','performs','good','in','Hadoop');
Output: Hive-performs-good-in-Hadoop
REPEAT function in Hive with example?
The REPEAT function repeats the input string the number of times specified in the command.
Example:
REPEAT('Hadoop',3);
Output:
HadoopHadoopHadoop
Note: you can include a space in the input string as well.
TRIM function in Hive with example?
The TRIM function removes the leading and trailing spaces from a string.
Example:
TRIM('  Hadoop  ');
Output:
Hadoop
Note: if you want to remove only leading or trailing spaces, use the following commands respectively:
LTRIM('  Hadoop');
RTRIM('Hadoop  ');
REVERSE function in Hive with example?
REVERSE function will reverse the characters in a string.
Example:
REVERSE('Hadoop');
Output:
poodaH
LOWER or LCASE function in Hive with example?
LOWER or LCASE function will convert the input string to lower case characters.
Example:
LOWER('Hadoop');
LCASE('Hadoop');
Output:
hadoop
Note:
If the characters are already in lower case then they will be preserved.
UPPER or UCASE function in Hive with example?
UPPER or UCASE function will convert the input string to upper case characters.
Example:
UPPER('Hadoop');
UCASE('Hadoop');
Output:
HADOOP
Note:
If the characters are already in upper case then they will be preserved.
Double type in Hive – Important points?
It is important to know about the double type in Hive. The double type in Hive presents data differently from an RDBMS.
See the double type data below:
24624.0
32556.0
3.99893E5
4366.0
E5 represents 10^5 here, so the value 3.99893E5 represents 399893. All calculations are performed accurately using the double type; the maximum value of an IEEE 754 double is about 1.8E308.
Care is needed when exporting double-type data to any RDBMS, since the type may be wrongly interpreted, so it is advised to cast the double type to an appropriate type before exporting.
Rename a table in Hive – How to do it?
Using the ALTER command, we can rename a table in Hive:
ALTER TABLE hive_table_name RENAME TO new_name;
There is another way to rename a table in Hive. ALTER may take more time if the underlying table has many partitions; in that case the EXPORT and IMPORT options can be utilized. Here you save the Hive data into HDFS and import it back into a new table, like below:
EXPORT TABLE tbl_name TO 'HDFS_location';
IMPORT TABLE new_tbl_name FROM 'HDFS_location';
If you prefer to just preserve the data, you can create a new table from the old table, like below:
CREATE TABLE new_tbl_name AS SELECT * FROM old_tbl_name;
DROP TABLE old_tbl_name;
How to change a column data type in Hive?
ALTER TABLE table_name CHANGE column_name column_name new_datatype;
Example: to change the data type of the id column from integer to bigint in a table called employee:
ALTER TABLE employee CHANGE id id BIGINT;
Difference between ORDER BY and SORT BY in Hive?
SORT BY sorts the data within each reducer; you can use any number of reducers for a SORT BY operation.
ORDER BY sorts all of the data together, which has to pass through one reducer; thus ORDER BY in Hive uses a single reducer.
ORDER BY guarantees total order in the output, while SORT BY only guarantees ordering of the rows within a reducer. If there is more than one reducer, SORT BY may give a partially ordered final result, as the sketch below illustrates.
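A short sketch (the table and column names are assumptions; mapred.reduce.tasks is the classic Hadoop-era knob for the reducer count):
-- total order: all rows pass through a single reducer
SELECT * FROM tbl_employee ORDER BY salary;
SET mapred.reduce.tasks=4;
-- each of the 4 reducers emits its own rows in sorted order
SELECT * FROM tbl_employee SORT BY salary;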
RLIKE in Hive?
RLIKE is Hive’s regular-expression matching operator: A RLIKE B evaluates to true if any substring of A matches the Java regular expression B. Users don’t need to put a % symbol for a simple substring match with RLIKE.
Examples:
'Express' RLIKE 'Exp' --> True
'Express' RLIKE '^E.*' --> True (regular expression)
Moreover, RLIKE comes in handy when the string has extra spaces: it satisfies the required scenario without using the TRIM function. Suppose A has the value 'Express  ' (two additional spaces) and B has the value 'Express'; RLIKE works without TRIM:
'Express  ' RLIKE 'Express' --> True
Note:
RLIKE evaluates to NULL if A or B is NULL.
Difference between external table and internal table in Hive?
Hive has a relational database on the master node that it uses to keep track of state. For instance, when you run CREATE TABLE FOO(foo string) LOCATION 'hdfs://tmp/';, this table schema is stored in that database. If you have a partitioned table, the partitions are stored in the database as well (this allows Hive to use lists of partitions without going to the filesystem to find them, etc.). These sorts of things are the “metadata”.
When you drop an internal table, Hive drops the data and it also drops the metadata. When you drop an external table, it only drops the metadata; Hive is then ignorant of that data and does not touch the data itself.

Hive Interview Questions

What is Hive?
Hive is a data warehouse system for Hadoop that facilitates easy data summarization, ad-hoc queries, and the analysis of large datasets stored in Hadoop compatible file systems.
Hive was originally developed at Facebook. It’s now a Hadoop subproject with many contributors. Users need to concentrate only on the top-level Hive language rather than Java MapReduce programs. One of the main advantages of Hive is its SQL-ish nature, which raises its usability to a higher extent.
A hive program will be automatically compiled into map-reduce jobs executed on Hadoop. In addition, HiveQL supports custom map-reduce scripts to be plugged into queries.
Hive example:
Selecting the employee names whose salary is more than 100 dollars from a Hive table called tbl_employee:
SELECT employee_name FROM tbl_employee WHERE salary > 100;
Users are excited to use Hive since it is very similar to SQL.
What are the types of tables in Hive?
There are two types of tables.
1. Managed tables.
2. External tables.
Only the DROP TABLE command differentiates managed and external tables; otherwise, both types of tables are very similar.