One year on: Apache Spark continues a winning battle against data management misery

Take a deep breath and herald the arrival of Apache Spark – the missing link in data management set to make all our lives easier. I’ll explain why Spark is so good shortly, but first let’s get started by putting it into context.

Back in the old days of IBM DB2 and Oracle, we used to look at our data in very structured forms such as tables and spreadsheets. No surprise there, considering that tabular data and relational databases were the most efficient way to manage data with the computing power we had. Normalised, structured data spawned a very powerful ecosystem of patterns and tools to analyse data.


All good stuff, yes?

No.

These systems only worked as long as the data was structured and normalised, and the format of the data sources was deterministic. The old systems failed to accommodate a landscape that changed dramatically once we added social feeds, news streams, and sensor data into the mix. This is the data we now call unstructured data or, more commonly, Big Data: data that couldn’t be analysed with the tools and methods we inherited from the days when normalised and relational data prevailed.

What did we do about Big Data? The solution came from Google and Yahoo! In 2004 Google published a paper on a programming model called MapReduce, which streamlined analytics of Big Data by making it easy to split a job across many machines running in parallel. Soon afterwards Hadoop, an open-source implementation backed heavily by Yahoo!, arrived and for many years was the industry standard for analysing Big Data.
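To make the MapReduce idea concrete, here is a minimal single-machine sketch in plain Python. It is not Google's or Hadoop's implementation; the function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) and the sample documents are hypothetical, chosen only to show the three stages of a word count.

```python
from collections import defaultdict
from itertools import chain


def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one document."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle(pairs):
    """Shuffle: group all emitted values by their key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: combine all values for one key into a single result."""
    return key, sum(values)


def word_count(documents):
    # Each map call looks at one document only, which is why a real
    # framework can hand different documents to different machines.
    mapped = chain.from_iterable(map_phase(doc) for doc in documents)
    grouped = shuffle(mapped)
    # Each reduce call looks at one key only, so it parallelises too.
    return dict(reduce_phase(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    print(word_count(["big data is big", "data is everywhere"]))
```

Here everything runs in one process. The point of the model is that the map and reduce steps share no state, so a cluster can run many of them at once.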

So what issues do we face today?

Unfortunately for Hadoop and similar tools, technology never stands still. Big surprise there!

System configuration management has become a significant factor in how fast we can develop and test a new analytics algorithm, and it is something we now have to adapt to.

Enter Apache Spark

The timing couldn’t have been better. A project that started in 2009 at UC Berkeley’s AMPLab has quickly evolved into a top-level Apache project with over 465 contributors by 2014. We are now seeing its unprecedented positive effects on the data management industry.


In contrast to Hadoop’s implementation of MapReduce, Spark can run some workloads up to 100 times faster, a figure that applies to jobs that fit in memory. By allowing user programs to load data into a cluster’s memory and query it repeatedly, Spark is well suited to machine learning algorithms, which typically pass over the same data many times. For the same reason, Spark allows us to explore data interactively.
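The load-once, query-repeatedly pattern can be illustrated with a small plain-Python sketch. This is not Spark code: the class `InMemoryDataset`, the loader `load_sensor_readings`, and the sample readings are all hypothetical stand-ins. In Spark itself the comparable step is persisting a dataset in memory, for example with `cache()`.

```python
class InMemoryDataset:
    """Loads records once, then serves every later query from memory."""

    def __init__(self, loader):
        self._loader = loader    # hypothetical callable standing in for a slow disk read
        self._records = None     # nothing is loaded until the first query
        self.load_count = 0      # how many times the slow load actually ran

    def records(self):
        if self._records is None:
            self._records = list(self._loader())
            self.load_count += 1
        return self._records

    def query(self, predicate):
        # Every query after the first reuses the in-memory copy.
        return [record for record in self.records() if predicate(record)]


def load_sensor_readings():
    # Hypothetical sample data: (sensor id, temperature) pairs.
    return [("s1", 21.5), ("s2", 35.0), ("s1", 22.1), ("s3", 40.2)]


dataset = InMemoryDataset(load_sensor_readings)
hot = dataset.query(lambda r: r[1] > 30)          # first query triggers the load
from_s1 = dataset.query(lambda r: r[0] == "s1")   # second query hits memory only
```

After both queries `dataset.load_count` is still 1. An approach that re-read the source for every query would pay the loading cost each time, which is where iterative and interactive workloads lose the most time.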

Say goodbye to the days when we were forced to model data with one set of tools and then implement and run our models with another, and say hello to a more streamlined approach to data management. Spark closes the gap between data discovery and running analytics in production, giving us an all-in-one way of looking at data built on a state-of-the-art functional programming approach.

But hold your horses, because it gets even better. In the very near future Spark will continue to change and revolutionise the way we look at data, giving us the foundation to get the most out of it in a lean and agile manner.

Artyom Astafurov


Artyom Astafurov, Senior Vice President, IoT/M2M, DataArt

Artyom joined DataArt in 2003, originally as a developer, and quickly moved up the ranks. He has been in charge of establishing DataArt’s regional R&D centers in Voronezh, Kharkov and Kherson, and of expanding project management teams for DataArt’s key accounts. In 2008, Artyom relocated from St. Petersburg to the New York headquarters. As a Senior Vice President, he oversees major U.S. accounts. He plays an active role in the development of DataArt’s software engineering initiatives and is a contributor to the company’s project management and software development knowledge bases.

Prior to DataArt, Artyom worked as a research associate at St. Petersburg State University of Information Technology, Mechanics and Optics (SPbSUITMO). He has an MS in Computer Science from SPbSUITMO.
