Presto(Superset)
Presto was open-sourced in 2013
- Provide SQL-based access to anything
- Federated queries
- Catalogs, schemas, and tables
- Lateral join decorrelation; round out the content
- Presto is a SQL query engine that enables SQL access to any data source and can query very large data sets by horizontally scaling query processing
- Presto represents the compute layer, whereas the underlying data sources represent the storage layer
- Every Presto server can function as both a coordinator and a worker, but dedicating a single machine to coordination work alone provides the best performance on larger clusters
- prestosql/presto Docker image (from Docker Hub); Kubernetes Presto operator
- To overcome this inefficiency, the concept of revocable memory was introduced
- A query that is forced to spill to disk may take orders of magnitude longer to execute than a query that runs completely in memory
- To increase query performance, it is recommended to provide multiple spill paths on separate local devices
- Cost-based join enumeration
- Presto enables SQL-based access to external data sources such as relational databases, key-value stores, object storage and others
- The simple query plan is split into plan fragments. A stage is the runtime incarnation of a plan fragment, and it encompasses all the tasks of the work described by the stage’s plan fragment
- The sequence of operators within a task is called a pipeline. The last operator of a pipeline typically places its output pages in the task’s output buffer
- CPU time, memory requirements and network bandwidth usage are the three dimensions that contribute to query execution time, both in single query and concurrent workloads. These dimensions constitute the cost in Presto
- Start the cluster and ramp up usage
- Hive metadata describes how data stored in HDFS maps to schemas, tables, and columns to be queried via SQL. This metadata information is persisted in a database such as MySQL or PostgreSQL and is accessible via the Hive Metastore Service (HMS)
- Presto and the Presto Hive connector do not use the Hive runtime at all. Presto is a replacement for it, and suitable for running interactive queries
- Partitioning is now a standard data organization strategy in distributed file systems, such as HDFS, and object storage, such as S3
- Hive partitions and buckets; bare-metal servers
- Beginning in Hive 3.0, the Metastore is released as a separate package and can be run without the rest of Hive. This is referred to as standalone mode.
- Encryption is the process of transforming data from a readable form into an unreadable form, which is then used in transit or for storage (the latter is also called encryption at rest).
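
The notes on catalogs, schemas, and tables and on federated queries can be illustrated with a small sketch. In Presto a table is addressed as `catalog.schema.table`, and a federated query names tables from more than one catalog. The Python below only models that naming rule; `resolve_table`, `catalogs_used`, the session defaults, and the catalog names `hive` and `mysql` are hypothetical illustrations, not part of any Presto API.

```python
from typing import Iterable, List, NamedTuple, Optional


class QualifiedName(NamedTuple):
    catalog: str
    schema: str
    table: str


def resolve_table(name: str,
                  default_catalog: Optional[str] = None,
                  default_schema: Optional[str] = None) -> QualifiedName:
    """Expand a 1-, 2-, or 3-part table name into catalog.schema.table.

    Simplification: quoted identifiers that contain dots are not handled.
    """
    parts = name.split(".")
    if len(parts) == 3:
        return QualifiedName(*parts)
    if len(parts) == 2 and default_catalog:
        return QualifiedName(default_catalog, parts[0], parts[1])
    if len(parts) == 1 and default_catalog and default_schema:
        return QualifiedName(default_catalog, default_schema, parts[0])
    raise ValueError(f"cannot resolve table name: {name!r}")


def catalogs_used(tables: Iterable[str]) -> List[str]:
    """Return the distinct catalogs referenced by fully qualified names."""
    return sorted({resolve_table(t).catalog for t in tables})


# A federated query joins tables that live in different catalogs
# (the table and column names here are made up for illustration).
federated_sql = """
SELECT o.orderkey, c.name
FROM hive.sales.orders AS o
JOIN mysql.crm.customers AS c ON o.custkey = c.custkey
"""

print(resolve_table("orders", default_catalog="hive", default_schema="sales"))
print(catalogs_used(["hive.sales.orders", "mysql.crm.customers"]))
```

Two distinct catalogs in the result indicate that the query spans two data sources, which is what makes it federated.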
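
The notes on cost and cost-based join enumeration can be made concrete with a toy model. The sketch below treats cost as the three dimensions named above (CPU, memory, network) and picks the cheapest of several candidate join orders. The `Cost` class, the weights, and all numbers are assumptions chosen for illustration; this is not how Presto's optimizer is implemented.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class Cost:
    """Toy cost vector with the three dimensions from the notes."""
    cpu: float
    memory: float
    network: float

    def __add__(self, other: "Cost") -> "Cost":
        # The cost of a plan is the sum of the costs of its parts.
        return Cost(self.cpu + other.cpu,
                    self.memory + other.memory,
                    self.network + other.network)

    def weighted(self, cpu_w: float = 1.0, memory_w: float = 1.0,
                 network_w: float = 1.0) -> float:
        # Collapse the vector to one number so plans can be compared.
        # The weights are hypothetical tuning knobs.
        return (cpu_w * self.cpu
                + memory_w * self.memory
                + network_w * self.network)


def cheapest(candidates: Dict[str, Cost], **weights: float) -> str:
    """Return the name of the candidate plan with the lowest weighted cost."""
    return min(candidates, key=lambda name: candidates[name].weighted(**weights))


# Made-up estimates for two join orders of the same query.
candidates = {
    "orders JOIN customers": Cost(cpu=10.0, memory=4.0, network=8.0),
    "customers JOIN orders": Cost(cpu=9.0, memory=12.0, network=3.0),
}

print(cheapest(candidates))
print(cheapest(candidates, network_w=5.0))
```

Changing the weights changes which join order wins, which is why a workload that is network-bound may prefer a different plan than one that is memory-bound.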
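
The notes on partitioning in HDFS and S3 refer to the Hive-style layout, in which each partition is a directory named `column=value` under the table's root. The sketch below builds and parses such paths and filters them by a partition predicate, so that only matching directories would need to be read. The function names, bucket, and column names are hypothetical, and the escaping Hive applies to special characters in values is not modeled.

```python
from typing import Dict, Iterable, List


def partition_path(table_root: str, partition: Dict[str, str]) -> str:
    """Build a Hive-style partition directory: root/key1=val1/key2=val2."""
    segments = [f"{key}={value}" for key, value in partition.items()]
    return "/".join([table_root.rstrip("/")] + segments)


def parse_partition(path: str, table_root: str) -> Dict[str, str]:
    """Recover the partition keys and values from a partition directory."""
    relative = path[len(table_root.rstrip("/")):].strip("/")
    return dict(segment.split("=", 1)
                for segment in relative.split("/") if "=" in segment)


def prune(paths: Iterable[str], table_root: str,
          predicate: Dict[str, str]) -> List[str]:
    """Keep only the partitions whose values match every predicate entry."""
    return [p for p in paths
            if all(parse_partition(p, table_root).get(key) == value
                   for key, value in predicate.items())]


root = "s3://example-bucket/warehouse/orders"  # hypothetical location
paths = [
    partition_path(root, {"ds": "2020-01-01", "country": "US"}),
    partition_path(root, {"ds": "2020-01-01", "country": "DE"}),
    partition_path(root, {"ds": "2020-01-02", "country": "US"}),
]

# A filter such as WHERE ds = '2020-01-01' only needs two of the three directories.
print(prune(paths, root, {"ds": "2020-01-01"}))
```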
Blogs
- presto-the-definitive-guide
- Presto on YugaByte DB: Interactive OLAP SQL Queries Made Easy
- Presto实现原理和美团的使用实践 (Presto Implementation Principles and Usage in Practice at Meituan), 2014