Big Data: current trends & next big thing 'Apache Kudu' - Strata + Hadoop 2016

Over the past three days, I attended Strata + Hadoop 2016 at the San Jose Convention Center. For those who don't know Strata + Hadoop, here is a quote from the conference website:

We've assembled the world's best data scientists, analysts, and executives from innovative companies of all sizes to share deep, hard-won knowledge. Compelling data case studies, proven best practices, effective new analytic approaches, and core skills will give you insight around.

To me, the most exciting technology was:

Apache Kudu: fast Analytics on fast data

As of now, in terms of OLAP, enterprises usually do batch processing and real-time processing separately. However, to make truly data-driven decisions, data consumers (internal and external users) want a combined view of both batch and real-time data. The Lambda architecture was born to meet this need, but it's non-trivial to get right (in fact, most companies don't have it in place, even top tech companies).
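To make the "combined view" idea concrete, here is a minimal sketch of the Lambda pattern in plain Python. It is only an illustration under simplifying assumptions: the event tuples, the `cutoff` timestamp, and the function names are all hypothetical, and a real deployment would use separate batch and streaming systems rather than in-memory counters.

```python
from collections import Counter

def batch_view(events, cutoff):
    """Batch layer (assumed): counts per key for events older than the cutoff."""
    return Counter(key for ts, key in events if ts < cutoff)

def speed_view(events, cutoff):
    """Speed layer (assumed): counts per key for events at or after the cutoff."""
    return Counter(key for ts, key in events if ts >= cutoff)

def combined_view(batch, speed):
    """Serving layer: merge both views so consumers see a single result."""
    return batch + speed

# Hypothetical (timestamp, key) events; timestamps >= 10 have not been
# picked up by the batch job yet.
events = [(1, "page_a"), (2, "page_b"), (3, "page_a"), (10, "page_a"), (11, "page_c")]
cutoff = 10

merged = combined_view(batch_view(events, cutoff), speed_view(events, cutoff))
print(dict(merged))
```

Even in this toy version the same counting logic exists twice, once per layer, and the two must agree on the cutoff. Keeping those two code paths consistent is much of what makes the architecture hard to get right.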

Apache Kudu (incubating), which was started by Cloudera, completes Hadoop's storage layer (it is distinct from HDFS) to enable fast analytics on fast data. It supports HBase-like features for fast ingestion of data as it arrives in real time; it also supports Parquet/HDFS-like features for running analytic workloads on both historical and fresh data. Moreover, it makes data mutation possible, whereas HDFS today is designed for immutable use cases only.
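The difference between immutable and mutable storage can be sketched in a few lines of plain Python. This is not Kudu's or HDFS's API; `ImmutableFile` and `MutableTable` are hypothetical classes that only illustrate why correcting one row is cheap in a keyed, mutable store and expensive in a write-once one.

```python
class ImmutableFile:
    """Write-once storage (assumed model): rows cannot change after writing."""

    def __init__(self, rows):
        self._rows = tuple(dict(r) for r in rows)

    def scan(self):
        return [dict(r) for r in self._rows]

    def rewritten_with(self, corrected):
        """Correcting one row means producing a whole new file."""
        return ImmutableFile(
            corrected if r["id"] == corrected["id"] else r for r in self._rows
        )


class MutableTable:
    """Keyed storage (assumed model): one row can be inserted or updated in place."""

    def __init__(self):
        self._rows = {}

    def upsert(self, row):
        self._rows[row["id"]] = dict(row)

    def scan(self):
        return [dict(r) for r in self._rows.values()]


rows = [{"id": 1, "clicks": 10}, {"id": 2, "clicks": 7}]
fix = {"id": 2, "clicks": 8}

old_file = ImmutableFile(rows)
new_file = old_file.rewritten_with(fix)  # copies every row; old_file is unchanged

table = MutableTable()
for r in rows:
    table.upsert(r)
table.upsert(fix)                        # touches only the row with id 2
```

Both approaches end with the same data, but the immutable one had to copy every row to change a single value.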

It's worth mentioning that Kudu is solely a storage layer (just like HDFS); to do analytics work, you are expected to go through one of the existing engines on Hadoop, such as a SQL-on-Hadoop engine. As of now, Kudu is already integrated with Impala, MapReduce, and Spark (beta). Additional frameworks are expected for GA, with Hive currently the highest-priority addition.

Here are some takeaways from the hundreds of sessions:

(1) Spark continues its momentum to become mainstream

Apache Spark is a fast and general engine for large-scale data processing. It provides a sort of all-in-one suite: Spark SQL, Spark Streaming, MLlib (machine learning), and GraphX (graph processing).

With the completion of Project Tungsten, Spark's memory and CPU efficiency has improved greatly. Moreover, Alluxio (formerly Tachyon) pushes the boundary of Spark's memory efficiency further; see "How Baidu combined Tachyon with Spark SQL to increase speed 30-fold".

Databricks, the biggest contributor to and driver of Spark, also provides a cloud offering of Spark, with which you can quickly provision and scale a Spark cluster with just a few clicks. You can also test preview versions of the next Spark release. I've been told that they also try to prioritize your feature requests and bug fixes if you are a cloud customer.

(2) Kafka & realtime streaming are in the spotlight

Confluent.io, which was founded by three Kafka veterans from LinkedIn, is working hard to bring Kafka to the next level. They are also working on a Kafka-as-a-service offering (used much like AWS Kinesis, but with higher throughput and lower latency), targeting a release in early 2017.

While Kafka is almost the de facto queueing system in this space, many different real-time streaming frameworks are competing with each other, e.g. Storm, Heron, Flink, Spark Streaming, and the newly released Kafka Streams.

Now that enterprises have had success with batch loading and processing of data in their production Hadoop clusters, there is increasing focus on real-time data ingestion, processing and analysis.

(3) Enterprises continue to move to the cloud

There are numbers showing that 32% of Hadoop deployments are on-premises only and 29% are cloud only, while the remaining 39% are hybrid. According to some research by Airbnb, it is more cost-effective to use public cloud if you are running fewer than 100k nodes.

Everyone noticed that Dropbox recently moved from AWS to its own on-prem data centers. But there is another side to the story: Zynga moved out of AWS a while back, then decided to move back to the cloud. I was told this is because the gaming company's traffic fluctuates heavily, though I would interpret it as its scale not growing as expected.

The top three cloud providers: Amazon AWS is the clear winner so far; Microsoft Azure is in second place, thanks to its existing enterprise customer base; Google Cloud is playing catch-up, with a new head, Diane Greene (VMware cofounder and former CEO), and recent big wins with the Spotify and Apple deals. Snapchat is also reportedly using Google Cloud.

(4) Traditional RDBMS providers are embracing Hadoop

In the expo hall, I saw Microsoft, Oracle, IBM, and SAP. I would say they have no better choice than to embrace Hadoop. NoSQL and/or NewSQL won't completely eat the RDBMS market, but they will make an RDBMS less necessary for a lot of enterprise use cases.

Microsoft, under the new leadership of Satya Nadella, is moving steadily (if not aggressively) towards open source, e.g. Linux and Hadoop.

(5) Machine learning is still in its infancy

With AlphaGo's historic 4:1 win over Lee Sedol, there is no question that machine learning is at the top of the hype curve. I was impressed by the work on Google TensorFlow, H2O.ai, and Dato, but honestly I was not surprised at all; I feel there is much more work to do.

As one of the keynote speakers (whose name I forget) said, "machine learning now is more about human learning". And I've heard that old joke, "Machine learning is like teenage sex; everyone is talking about it, no one is actually doing it", many times in the past week alone. The same joke used to be told about big data, but now it seems big data is no joke any more. Hooray.

However, there is no doubt that technology will change (disrupt) almost every aspect of human life, e.g. autonomous cars and humanoid robots, just to name a few. There is no easy way, but there must be some way to get there; all we need is to keep working hard to find it.

