Apache Doris just “graduated”: Why you should care about this SQL data warehouse

In case you are wondering who “she” is and what school she went to, Doris is an open source, SQL-based massively parallel processing (MPP) analytical data warehouse that was under development at the Apache Incubator.

Last week, Doris achieved the status of top-level project, which according to the Apache Software Foundation (ASF) means that “it has proven its ability to be properly self-governed.”

The data warehouse was recently released in version 1.0, its eighth release while undergoing development at the incubator (along with six Connector releases). It has been built to support online analytical processing (OLAP) workloads, often used in data science scenarios.

Doris, originally known as Palo, was born inside Chinese internet search giant Baidu as a data warehousing system for its advertisement business before being open sourced in 2017 and entering the Apache Incubator in 2018.

Doris has roots in Apache Impala and Google Mesa

According to the Apache Software Foundation, Doris is based on the integration of Google Mesa and Apache Impala. Impala is an open source MPP SQL query engine that was developed in 2012 and is based on the underpinnings of Google F1.

Mesa, designed around 2014 as a highly scalable analytic data warehousing system, was used to store critical measurement data related to Google’s Internet advertising business.

According to its developers, both at Baidu and at the Apache Incubator, Doris offers a simple design architecture while providing high availability, reliability, fault tolerance, and scalability.

“The simplicity (of developing, deploying and using) and meeting many data serving requirements in single system are the main features of Doris,” the Apache Software Foundation said in a statement, adding that the data warehouse supports multidimensional reporting, user portraits, ad-hoc queries, and real-time dashboards.

Other features of Doris include columnar storage, parallel execution, vectorization technology, query optimization, ANSI SQL, and integration with big data ecosystems via connectors for Apache Flink, Apache Hive, Apache Hudi, Apache Iceberg, Apache Spark, and Elasticsearch, among other systems.

Uptake of open source databases forecast to grow

Uptake of enterprise-grade, open source databases has been expected to grow. In Gartner’s State of the Open-Source DBMS Market 2019 report, the consulting firm predicted that more than 70% of new in-house applications will be developed on an Open Source Database Management System (OSDBMS) or an OSDBMS-based Database Platform-as-a-Service (dbPaaS) by the end of 2022.

In addition, as data proliferates and businesses’ need for real-time analytics grows, a simple yet massively parallel processing database that is also open source seems to be the need of the hour.

“As data volumes have grown, MPP databases became the only realistic way to process data quickly enough or cheaply enough to meet organizations’ demands,” said David Menninger, research director at Ventana Research.

Cloud architecture fuels interest in MPP databases

Another trend fueling MPP databases is the availability of relatively inexpensive cloud-based server instances, which can be used as part of the MPP configuration, thus eliminating the need to procure and install the physical hardware these systems use, Menninger said.

Making a case for Doris, Menninger said that while there are many MPP database options, some of which are open sourced, there isn’t really an open source, MPP MySQL alternative.

“MySQL itself and MariaDB have been extended to support larger analytical workloads, but they were initially designed for transaction processing,” Menninger said, adding that the open source, PostgreSQL-based database Greenplum and hyperscaler services such as Google BigQuery, Amazon Redshift, and Microsoft Synapse could be considered rivals to Doris.

In addition, ClickHouse, Apache Druid, and Apache Pinot could also be considered rivals, said Sanjeev Mohan, former research vice president for big data and analytics at Gartner.

According to the Apache Software Foundation, using Doris could have multiple advantages, such as architectural simplicity and faster query times.

One of the reasons behind Doris’ simplicity is that it does not depend on multiple components for tasks such as cluster management, synchronization, and communication. Its fast query times can be attributed to vectorization, a technique that allows a program or an algorithm to operate on a set of multiple values at one time rather than on a single value.
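To make the idea of vectorization more concrete, here is a minimal Python sketch that contrasts row-at-a-time processing with batch-at-a-time processing over columns. It is only an illustration of the general execution model, not Doris’ actual implementation: the function names, the `clicks` and `cost` columns, and the batch size are all hypothetical, and plain Python gains none of the CPU-level (SIMD) speedups a real engine gets from working on batches.

```python
from typing import Dict, List

# Hypothetical sample data: the same table stored row-wise and column-wise.
ROWS: List[Dict[str, float]] = [{"clicks": i, "cost": i * 0.5} for i in range(10)]
CLICKS: List[int] = [int(row["clicks"]) for row in ROWS]
COST: List[float] = [row["cost"] for row in ROWS]


def sum_clicks_row_at_a_time(rows: List[Dict[str, float]], min_cost: float) -> int:
    """Evaluate the filter and the aggregate once per row."""
    total = 0
    for row in rows:
        if row["cost"] >= min_cost:
            total += int(row["clicks"])
    return total


def filter_ge(column: List[float], threshold: float) -> List[bool]:
    """Operator that takes a whole batch of values and returns a whole mask."""
    return [value >= threshold for value in column]


def masked_sum(column: List[int], mask: List[bool]) -> int:
    """Operator that sums a batch of values, keeping only masked-in entries."""
    return sum(value for value, keep in zip(column, mask) if keep)


def sum_clicks_batched(
    clicks: List[int], cost: List[float], min_cost: float, batch_size: int = 4
) -> int:
    """Run each operator once per batch of column values, not once per row."""
    total = 0
    for start in range(0, len(clicks), batch_size):
        end = start + batch_size
        mask = filter_ge(cost[start:end], min_cost)
        total += masked_sum(clicks[start:end], mask)
    return total


if __name__ == "__main__":
    # Both strategies compute the same answer; only the unit of work differs.
    print(sum_clicks_row_at_a_time(ROWS, 2.0))
    print(sum_clicks_batched(CLICKS, COST, 2.0))
```

The point of the batched version is that each operator is invoked once per block of column values, which is the shape of work that columnar storage makes cheap and that lets a compiled engine apply one instruction to many values at a time.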

Another benefit of the data warehouse, according to the developers at the Apache Software Foundation, is Doris’ ultra-high concurrency support, meaning it can handle simultaneous requests from tens of thousands of users who want to process data and gain insights from the database.

The need for high concurrency has increased because most organizations now give employees across the business access to data so they can generate insights, rather than limiting analytics to C-suite executives.

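The main content of this article is reposted from InfoWorld; the original author is Anirban Ghoshal. It is provided for readers’ reference only. If it infringes your intellectual property or other rights, please contact me with evidence and I will remove it.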

