RDD 探究

总述

At a high level, every Spark application consists of a driver program that runs the user’s main function and executes various parallel operations on a cluster.

RDD

The main abstraction Spark provides is a resilient distributed dataset (RDD), which is a collection of elements partitioned across the nodes of the cluster that can be operated on in parallel.
RDDs are created by starting with a file in the Hadoop file system (or any other Hadoop-supported file system), or an existing Scala collection in the driver program, and transforming it.
Users may also ask Spark to persist an RDD in memory, allowing it to be reused efficiently across parallel operations. Finally, RDDs automatically recover from node failures.

spark 围绕着RDD 的概念展开，可容错
Spark revolves around the concept of a resilient distributed dataset (RDD), which is a fault-tolerant collection of elements that can be operated on in parallel. There are two ways to create RDDs: parallelizing an existing collection in your driver program, or referencing a dataset in an external storage system, such as a shared filesystem, HDFS, H , or any data source offering a Hadoop InputFormat.

RDD工作原理：
主要分为三部分：创建RDD对象，DAG调度器创建执行计划，Task调度器分配任务并调度Worker开始运行。

RDD 探究

浏览：355 2026-05-08

总述

RDD

继续阅读与本文标签相同的文章

特斯拉这颗未启用的摄像头居然是一个彩蛋？

视频服务最高降价34% 阿里云北京云栖大会再次释放技术红利

特别推荐 2026年05月18日星期一

精彩发现

热门标签

RDD 探究

浏览：355 2026-05-08

总述

RDD

继续阅读与本文标签相同的文章

2026-05-18栏目： 教程

2026-05-18栏目： 教程

2026-05-18栏目： 教程

2026-05-18栏目： 教程

2026-05-18栏目： 教程

2026-04-23栏目： 教程

2026-04-23栏目： 教程

2026-04-23栏目： 教程

2026-04-23栏目： 教程

2026-04-24栏目： 教程

特别推荐 2026年05月18日 星期一

精彩发现

热门标签

相关文章

2026-05-18栏目：教程

2026-05-18栏目：教程

2026-05-18栏目：教程

2026-05-18栏目：教程

2026-05-18栏目：教程

2026-04-23栏目：教程

2026-04-23栏目：教程

2026-04-23栏目：教程

2026-04-23栏目：教程

2026-04-24栏目：教程

特别推荐 2026年05月18日星期一