Oct 14, 2017 (2 min read)
Elasticsearch is often used for indexing, searching, and analyzing datasets. Sparkflows makes it very smooth to read any data, then clean and transform it.
In this blog we show how to load data from HDFS into Elasticsearch and read it back into Apache Spark in minutes.
https://www.elastic.co/guide/en/elasticsearch/hadoop/master/spark.html
elasticsearch-hadoop provides native integration between Elasticsearch and Apache Spark.
The workflow below reads the Housing Dataset, which is in CSV format, from HDFS.
It then saves the data into Elasticsearch.
The Documentation processor is just for documentation purposes.
In the above workflow, the Node 'ElasticSearchLoad' takes in the incoming data and loads it into the Elastic Search Index 'sparkflows/housing'.
The below diagram shows the dialog box for the Elastic Search Load Processor.
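Outside the Sparkflows UI, roughly the same load can be sketched in PySpark with the elasticsearch-hadoop connector. This is a minimal sketch under stated assumptions, not what the 'ElasticSearchLoad' node runs internally: the host and port defaults, the CSV header setting, and the helper names `es_connection_options` and `load_csv_into_es` are hypothetical, and the elasticsearch-spark jar is assumed to be on the Spark classpath.

```python
# Spark SQL data source name provided by elasticsearch-hadoop.
ES_FORMAT = "org.elasticsearch.spark.sql"


def es_connection_options(nodes="localhost", port=9200):
    """Build connector settings; the host and port defaults are assumptions."""
    return {"es.nodes": nodes, "es.port": str(port)}


def load_csv_into_es(spark, csv_path, resource="sparkflows/housing", options=None):
    """Read a CSV file and append its rows to an Elasticsearch resource."""
    df = (
        spark.read
        .option("header", "true")       # assumes the first line holds column names
        .option("inferSchema", "true")  # let Spark infer column types
        .csv(csv_path)
    )
    (
        df.write
        .format(ES_FORMAT)
        .options(**(options or es_connection_options()))
        .mode("append")                 # add documents without replacing existing ones
        .save(resource)
    )
    return df
```

With a live `SparkSession` named `spark`, a call might look like `load_csv_into_es(spark, "hdfs:///data/housing.csv")`, where the path is a made-up example.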
The below workflow reads the data from the sparkflows/housing index in Elastic Search and prints out the first few lines.
In the above workflow, the Node 'ElasticSearchRead' reads in the records from the Elastic Search Index 'sparkflows/housing'.
The below diagram shows the dialog box for the Elastic Search Read Processor.
In the above dialog, the 'Refresh Schema' button infers the schema of the index. The node can then pass its output schema down to the next Processor, which makes it easy for us to build the workflow.
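The read side can be sketched the same way in PySpark. When a DataFrame is loaded through the connector, its schema is derived from the Elasticsearch mapping, which is similar in spirit to what 'Refresh Schema' surfaces in the dialog; whether the node does exactly this is an assumption. The helper names and the host and port defaults below are hypothetical.

```python
def es_read_options(nodes="localhost", port=9200):
    """Connector settings for reading; the host and port defaults are assumptions."""
    return {"es.nodes": nodes, "es.port": str(port)}


def read_from_es(spark, resource="sparkflows/housing", options=None):
    """Load an Elasticsearch resource as a DataFrame."""
    return (
        spark.read
        .format("org.elasticsearch.spark.sql")
        .options(**(options or es_read_options()))
        .load(resource)
    )


def preview(df, rows=5):
    """Print the schema and the first few rows."""
    df.printSchema()
    df.show(rows)
```

Given a `SparkSession` named `spark`, `preview(read_from_es(spark))` would print the schema followed by the first five records.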
The SQL field specifies the query to be used for reading from Elasticsearch. It allows us to limit the columns of interest, add where clauses, and so on.
The Elasticsearch Spark connector understands the SQL and translates it into the appropriate Query DSL. The connector pushes the operations down directly to the source, where the data is filtered efficiently so that only the required data is streamed back to Spark. This significantly improves query performance and reduces CPU, memory, and I/O usage on both the Spark and Elasticsearch clusters.
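A minimal PySpark sketch of the same idea follows: register the index as a temporary view and query it with SQL. The column names `price` and `bedrooms`, the view name, the host, and both helper names are hypothetical, since the post does not list the Housing Dataset's fields. The `pushdown` option is on by default in the connector and is set explicitly here only to make the behavior visible.

```python
def build_query(columns, table, where=None):
    """Assemble a simple SELECT statement from column names and an optional filter."""
    sql = "SELECT {} FROM {}".format(", ".join(columns), table)
    if where:
        sql += " WHERE " + where
    return sql


def query_es_with_sql(spark, sql, view_name="housing", resource="sparkflows/housing"):
    """Expose an Elasticsearch resource as a temp view and run SQL against it."""
    df = (
        spark.read
        .format("org.elasticsearch.spark.sql")
        .option("es.nodes", "localhost")  # assumed host
        .option("pushdown", "true")       # let the connector translate filters for Elasticsearch
        .load(resource)
    )
    df.createOrReplaceTempView(view_name)
    return spark.sql(sql)


# Hypothetical columns: keep two fields and filter on one of them.
example_sql = build_query(["price", "bedrooms"], "housing", "price > 500000")
```

Calling `query_es_with_sql(spark, example_sql).explain()` on a live session is one way to inspect the physical plan and check which filters were handed to the data source.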