ORC table creation from Spark SQL with Snappy compression

11/18/2023

SQL-on-Hadoop Engines + Default File Format

Parquet is first choice for SparkSQL and Impala. Contrary to other studies, we compared the ORC and Parquet file formats by executing each file format on the same processing engine.

Goal: investigate how the overall performance of an engine (Hive or SparkSQL) is affected by changing the columnar file format type or by using a different columnar file format configuration.

The general approach consists of 4 phases. The steps include:
Install and configure all hardware and software components (Operating System, Network, Programming Frameworks, ...).
Install and configure the Big Data Benchmark (Scale Factor, query type, workload type, ...).
Generate the test data, typically using a data generator.
Configure the Big Data platform under test to address the benchmark scenario.
... repeated 3 or more times to ensure the representativeness of the results.
... to make sure that there are no cache effects or undesired influences between the ...

Columnar File Formats and SQL-on-Hadoop Engines

Columnar file formats:
Open source, general purpose columnar file formats.
Take advantage of data encoding and compression strategies.
Can be used or integrated with any data processing framework or engine.

SQL-on-Hadoop engines:
Efficiently query data stored in columnar file formats (typically in HDFS).
Provide a SQL-like dialect (called HiveQL) to work with structured data.
Offer a high-level abstraction on top of a processing engine (like MapReduce ...).

Row-oriented storage:
To select a subset of columns, all rows need to be read.
Data compression or encoding is inefficient because different data types are ...

Column-oriented storage:
It is efficient to scan only a subset of columns.
Data encoding and compression algorithms can take advantage of the data ...
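To make the contrast between the two layouts concrete, here is a minimal plain-Scala sketch. The `Row` and `ColumnStore` types and all function names are made up for illustration and have nothing to do with the real ORC or Parquet readers. The sketch only shows why a column layout lets a query touch just the column it needs, while a row layout has to walk every record.

```scala
object LayoutDemo {
  // Row-oriented layout: all fields of one record are stored together.
  final case class Row(id: Int, name: String, price: Double)

  // Column-oriented layout: all values of one field are stored together.
  final case class ColumnStore(ids: Array[Int], names: Array[String], prices: Array[Double])

  // Pivot a sequence of rows into one array per column.
  def toColumns(rows: Seq[Row]): ColumnStore =
    ColumnStore(rows.map(_.id).toArray, rows.map(_.name).toArray, rows.map(_.price).toArray)

  // Row layout: every record is visited even though only `price` is needed.
  def sumPricesRowStore(rows: Seq[Row]): Double =
    rows.map(_.price).sum

  // Column layout: only the `prices` array is read; `ids` and `names` are never touched.
  def sumPricesColumnStore(store: ColumnStore): Double =
    store.prices.sum
}
```

Because each array in `ColumnStore` holds values of a single type, it is also the natural unit for type-specific encoding and compression, which is the point the list above makes about columnar formats.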
Storage and processing of data-intensive applications.
Complex distributed software systems (Hadoop, Spark etc.).
Big Data benchmarking / Performance optimizations.
Senior Researcher, Lab CTO, Frankfurt Big Data Lab

The Impact of Columnar File Formats on SQL-on-Hadoop Engine Performance: A Study on ORC and Parquet
Published in the journal Concurrency and Computation: Practice and Experience, 2019

Columnar file formats provide an efficient way to store data to be queried by SQL-on-Hadoop engines. Related works consider the performance of processing engine and file format together, which makes it impossible to predict their individual impact. In this work, we propose an alternative approach: by executing each file format on the same processing engine, we compare the different file formats as well as their different parameter settings. We apply our strategy to two processing engines, Hive and SparkSQL, and evaluate the performance of two columnar file formats, ORC and Parquet. We use BigBench (TPCx-BB), a standardized application-level benchmark for Big Data scenarios. Our experiments confirm that the file format selection and its configuration significantly affect the overall performance. We show that ORC generally performs better on Hive, whereas Parquet achieves best performance with SparkSQL. Using ZLIB compression brings up to 60.2% improvement with ORC, while Parquet achieves up to 7% improvement with Snappy. Exceptions are the queries involving text processing, which do not benefit from using any compression.

In this blog, I will detail the code for converting a sequence file to ORC using Spark/Scala. Suppose your existing Hive table is in sequence file format and partitioned by year and month.

Read the database name, table name, partition dates, and output path from the file. These are separated by ~ in the input file.

The code below is 10 times faster than Spark SQL.

    val sc = new SparkContext(args(0), "SeqtoOrc")
    val hiveContext = new org.apache.spark.sql.hive.HiveContext(sc)
    val lines = scala.io.Source.fromFile(args(1)).getLines.toList
    val rdd1 = hiveContext.sql(s"select * from $dbname.$tab where year between '$newyear' and '$year'")
    ... ("year", "month").format("orc").option("compression", "snappy").mode("append"). ...

Once the data is converted to ORC format, create an external table with a structure similar to that of the sequence file table, but in ORC format and pointing to the output path.

    val results = hiveContext.sql("select * from orctable")
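The conversion described in this post can be sketched end to end as follows. This is a rough sketch under several assumptions, not the original code: it uses the Spark 2.x+ `SparkSession` API with Hive support instead of `HiveContext`, it assumes each line of the parameter file has the field order `database~table~startYear~endYear~outputPath`, and it assumes the write is done with `partitionBy("year", "month")` and ends in `save(outputPath)`. The names `SeqToOrc`, `Job`, and `parseLine` are hypothetical.

```scala
import org.apache.spark.sql.{SaveMode, SparkSession}
import scala.io.Source

object SeqToOrc {
  // One line of the parameter file. The field order is an assumption.
  final case class Job(db: String, table: String, startYear: String, endYear: String, outputPath: String)

  // Split a "~"-separated line into its five fields, failing loudly on malformed input.
  def parseLine(line: String): Job =
    line.split("~", -1).map(_.trim) match {
      case Array(db, table, start, end, out) => Job(db, table, start, end, out)
      case other =>
        throw new IllegalArgumentException(
          s"expected 5 '~'-separated fields, got ${other.length}: $line")
    }

  def main(args: Array[String]): Unit = {
    // The master URL is expected to come from spark-submit; args(0) is the parameter file.
    val spark = SparkSession.builder()
      .appName("SeqtoOrc")
      .enableHiveSupport()
      .getOrCreate()

    // Read all jobs eagerly so the file can be closed before Spark work starts.
    val source = Source.fromFile(args(0))
    val jobs =
      try source.getLines().filter(_.trim.nonEmpty).map(parseLine).toList
      finally source.close()

    jobs.foreach { job =>
      // Select the requested partitions from the existing sequence file table.
      val df = spark.sql(
        s"SELECT * FROM ${job.db}.${job.table} " +
          s"WHERE year BETWEEN '${job.startYear}' AND '${job.endYear}'")

      // Write Snappy-compressed ORC files, laid out as year=/month= directories.
      df.write
        .partitionBy("year", "month")
        .format("orc")
        .option("compression", "snappy")
        .mode(SaveMode.Append)
        .save(job.outputPath)
    }

    spark.stop()
  }
}
```

A parameter line such as `sales~orders~2015~2016~/data/orc/orders` (made-up values) would convert the 2015 and 2016 partitions of `sales.orders`. After the job finishes, the external ORC table still has to be declared over the output path and its partitions registered before it can be queried. Changing the `compression` option to `zlib` writes ZLIB-compressed ORC instead.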
