hTRUNK

How hTRUNK accomplishes XML data processing

Posted by Super User on Monday, 26 October 2015 in hTRUNK

hTRUNK ships with a built-in XML processing component, so XML can now be processed in a few simple steps: create the metadata with the required fields, map the data, and schedule the job. For further details, please refer to the hTRUNK_XML_Processor video.
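The steps above (declare metadata fields, then map each record onto them) can be sketched in plain Python with the standard library. This is only an illustration of the idea, not hTRUNK's actual component; the field names and XML layout here are hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical metadata: the fields we expect to extract from each record.
METADATA_FIELDS = ["id", "name", "amount"]

SAMPLE_XML = """
<records>
  <record><id>1</id><name>alpha</name><amount>10.5</amount></record>
  <record><id>2</id><name>beta</name><amount>20.0</amount></record>
</records>
"""

def parse_records(xml_text, fields):
    """Map each <record> element onto the declared metadata fields."""
    root = ET.fromstring(xml_text)
    return [{f: rec.findtext(f) for f in fields}
            for rec in root.findall("record")]

rows = parse_records(SAMPLE_XML, METADATA_FIELDS)
print(rows[0]["name"])  # alpha
```

Scheduling the mapped job would then be a separate step in the tool itself.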


How Apache Spark complements Hadoop

Posted by Super User on Thursday, 22 October 2015 in hTRUNK

Apache Spark is a general-purpose, lightning-fast data processing engine, suitable for use in a wide range of circumstances. Spark leverages Hadoop's strengths in cluster management and data persistence.

Spark was developed in 2009 in UC Berkeley’s AMPLab and open-sourced in 2010 as an Apache project. According to stats on Apache.org, Spark can “run programs up to 100 times faster than Hadoop MapReduce in memory, or 10 times faster on disk.”

In this blog post, let’s talk about how Spark complements Hadoop. Although Spark is a viable alternative to Hadoop MapReduce in many circumstances, it is not a replacement for Hadoop.

Spark has been designed to run on top of Hadoop and is an alternative to the traditional batch map/reduce model, leveraging Hadoop’s cluster manager (YARN) and underlying storage (HDFS, HBase, etc.). Spark can also run completely separately outside Hadoop, integrating with alternative cluster managers like Mesos and alternative storage platforms such as Cassandra and Amazon S3.

MapReduce is a programming model. In Hadoop MapReduce, data is read from disk and written back to disk between each stage of a job. Spark increases performance by roughly tenfold because it does not have to write intermediate data back to disk; the work is done in memory. Hadoop uses replication to achieve fault tolerance, whereas Spark uses a different data storage model, resilient distributed datasets (RDDs), which guarantee fault tolerance in a clever way that minimizes network I/O.

From the Spark academic paper: "RDDs achieve fault tolerance through a notion of lineage: if a partition of an RDD is lost, the RDD has enough information to rebuild just that partition." This removes the need for replication to achieve fault tolerance.
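The lineage idea from the quote above can be illustrated with a toy, pure-Python sketch (this is not Spark's actual implementation): each derived dataset remembers its parent and the function that produced it, so a lost partition can be recomputed rather than restored from a replica.

```python
class ToyRDD:
    """Minimal stand-in for an RDD: records lineage, recomputes on demand."""
    def __init__(self, partitions, parent=None, fn=None):
        self.partitions = partitions  # list of lists; None marks a lost partition
        self.parent = parent          # lineage: which dataset this was derived from
        self.fn = fn                  # lineage: how each element was derived

    def map(self, fn):
        derived = [[fn(x) for x in p] for p in self.partitions]
        return ToyRDD(derived, parent=self, fn=fn)

    def recover(self, i):
        """Rebuild just partition i from the parent's data and the recorded fn."""
        self.partitions[i] = [self.fn(x) for x in self.parent.partitions[i]]

base = ToyRDD([[1, 2], [3, 4]])
doubled = base.map(lambda x: x * 2)
doubled.partitions[1] = None   # simulate losing one partition
doubled.recover(1)             # rebuild only that partition from lineage
print(doubled.partitions)      # [[2, 4], [6, 8]]
```

Note that only the lost partition is recomputed, and no second copy of the data was ever kept, which is exactly the replication-free fault tolerance the paper describes.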

Spark is an independent project, but it has become another data processing engine in the Hadoop ecosystem, adding more capability to the Hadoop stack. Plus, Spark permits programmers and developers to write applications in Java, Python, or Scala and to build parallel applications designed to take full and fast advantage of a distributed environment.

Spark complements Hadoop by adding:

    Iterative Algorithms in Machine Learning

    Interactive Data Mining and Data Processing

    Data warehousing: a fully Apache Hive-compatible system that can run queries up to 100x faster than Hive

    Stream processing: Log processing and Fraud detection in live streams for alerts, aggregates and analysis

In certain circumstances, Spark’s SQL, streaming, and graph processing capabilities may also prove to be of value.
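The first bullet, iterative machine learning, is where in-memory caching pays off most: MapReduce rereads the input from disk on every pass, while Spark keeps the cached dataset in memory across iterations. A hypothetical pure-Python sketch of such an algorithm (1-D least-squares fit via gradient descent), with the dataset loaded once and reused every iteration:

```python
# Toy iterative algorithm: fit y ≈ w * x by gradient descent.
# The dataset is loaded ONCE and then reused on every pass -- the access
# pattern that benefits from Spark's in-memory caching instead of
# re-reading the input from disk each iteration, as MapReduce would.
data = [(1.0, 2.0), (2.0, 4.1), (3.0, 5.9)]  # (x, y) pairs, cached in memory

w = 0.0
lr = 0.05
for _ in range(200):  # many passes over the same cached data
    grad = sum(2 * (w * x - y) * x for x, y in data) / len(data)
    w -= lr * grad

print(round(w, 2))  # -> 1.99, the least-squares slope
```

In Spark the loop body would be a distributed computation over a cached RDD; the shape of the algorithm, repeated passes over unchanging data, is the same.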

Summary

In this blog post, I discussed how Spark adds value to Hadoop; the signs point to Spark becoming a significant component within Hadoop.
