Overview
At a high level, every Spark application consists of a driver program that runs the user’s main function and executes various parallel operations on a cluster.
RDD
- The main abstraction Spark provides is a resilient distributed dataset (RDD), which is a collection of elements partitioned across the nodes of the cluster that can be operated on in parallel.
- RDDs are created by starting with a file in the Hadoop file system (or any other Hadoop-supported file system), or an existing Scala collection in the driver program, and transforming it.
- Users may also ask Spark to persist an RDD in memory, allowing it to be reused efficiently across parallel operations. Finally, RDDs automatically recover from node failures.
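The persistence and recovery behaviour described in the list above can be illustrated with a small single-process toy model. This is only a sketch, not Spark's implementation: `ToyRDD`, `from_collection`, `lose_partition` and `compute_calls` are hypothetical names invented for the illustration. Each partition remembers how to rebuild itself from its parent (its lineage), so a lost cached partition can be recomputed without touching the others.

```python
from typing import Callable, Dict, List, Optional


class ToyRDD:
    """A toy, single-process stand-in for an RDD (not Spark's implementation)."""

    def __init__(self, num_partitions: int,
                 compute: Callable[[int], List[int]]) -> None:
        self.num_partitions = num_partitions
        self._compute = compute          # lineage: how to rebuild partition i
        self._cache: Optional[Dict[int, List[int]]] = None
        self.compute_calls = 0           # counts real (non-cached) computations

    @staticmethod
    def from_collection(data: List[int], num_partitions: int) -> "ToyRDD":
        # Contiguous chunks of (at most) `step` elements per partition.
        step = -(-len(data) // num_partitions)   # ceiling division
        return ToyRDD(num_partitions,
                      lambda i: data[i * step:(i + 1) * step])

    def map(self, fn: Callable[[int], int]) -> "ToyRDD":
        # Lazy: only records how to derive each partition from the parent.
        return ToyRDD(self.num_partitions,
                      lambda i: [fn(x) for x in self.partition(i)])

    def persist(self) -> "ToyRDD":
        # Keep computed partitions in memory for reuse.
        self._cache = {}
        return self

    def partition(self, i: int) -> List[int]:
        if self._cache is not None and i in self._cache:
            return self._cache[i]
        self.compute_calls += 1
        result = self._compute(i)
        if self._cache is not None:
            self._cache[i] = result
        return result

    def lose_partition(self, i: int) -> None:
        # Simulates a node failure that drops one cached partition.
        if self._cache is not None:
            self._cache.pop(i, None)

    def collect(self) -> List[int]:
        return [x for i in range(self.num_partitions)
                for x in self.partition(i)]


squares = ToyRDD.from_collection([1, 2, 3, 4, 5, 6], 3).map(lambda x: x * x).persist()
first = squares.collect()       # computes all three partitions
second = squares.collect()      # served entirely from the cache
squares.lose_partition(1)       # pretend the node holding partition 1 died
third = squares.collect()       # only partition 1 is rebuilt from lineage
```

After the three `collect()` calls, `compute_calls` on `squares` is 4 rather than 9: three computations for the first pass, none for the second, and one to rebuild the lost partition.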
Spark is built around the concept of the RDD, which is fault-tolerant
Spark revolves around the concept of a resilient distributed dataset (RDD), which is a fault-tolerant collection of elements that can be operated on in parallel. There are two ways to create RDDs: parallelizing an existing collection in your driver program, or referencing a dataset in an external storage system, such as a shared filesystem, HDFS, HBase, or any data source offering a Hadoop InputFormat.
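In Spark itself the two creation paths correspond to `SparkContext.parallelize` (for a collection in the driver) and `SparkContext.textFile` (for a text file in external storage). The pure-Python sketch below only mimics what each path produces conceptually; `slice_collection` and `read_lines` are hypothetical helpers, the even-split rule is an assumption made for illustration, and `data.txt` is a made-up file name.

```python
import os
import tempfile
from typing import List, Sequence


def slice_collection(data: Sequence, num_slices: int) -> List[list]:
    """Split an in-memory collection into num_slices contiguous slices."""
    n = len(data)
    return [list(data[i * n // num_slices:(i + 1) * n // num_slices])
            for i in range(num_slices)]


def read_lines(path: str) -> List[str]:
    """Load a text file as a collection of lines, one element per line."""
    with open(path, encoding="utf-8") as f:
        return [line.rstrip("\n") for line in f]


# Way 1: start from a collection that already lives in the driver program.
slices = slice_collection([1, 2, 3, 4, 5], 2)

# Way 2: reference a dataset in external storage (here, a temporary local file).
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "data.txt")   # hypothetical file name
    with open(path, "w", encoding="utf-8") as f:
        f.write("a\nb\nc\n")
    lines = read_lines(path)
```

Here `slices` holds two partitions, `[1, 2]` and `[3, 4, 5]`, and `lines` holds one element per line of the file.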
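How RDDs work:
Execution has three main parts: creating the RDD objects, the DAG scheduler building an execution plan, and the task scheduler assigning tasks and dispatching them to workers to run.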
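The three parts can be sketched as a toy pipeline. This is a simplified model under stated assumptions, not Spark's scheduler: `Node`, `build_stages` and `schedule_tasks` are hypothetical names, the lineage is a single chain, stages are cut wherever a step needs a shuffle, and tasks are handed to workers round-robin (a real scheduler would also weigh data locality and available resources).

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Node:
    """One RDD in the lineage chain (toy model; names are hypothetical)."""
    name: str
    parent: Optional["Node"] = None
    shuffle: bool = False    # True if this step needs data from all parent partitions


def build_stages(final: Node) -> List[List[str]]:
    """Walk the lineage from source to result, cutting a new stage at each shuffle."""
    chain: List[Node] = []
    node: Optional[Node] = final
    while node is not None:
        chain.append(node)
        node = node.parent
    chain.reverse()          # source first

    stages: List[List[str]] = [[]]
    for n in chain:
        if n.shuffle:
            stages.append([])    # a shuffle boundary starts a new stage
        stages[-1].append(n.name)
    return stages


def schedule_tasks(stages: List[List[str]], num_partitions: int,
                   workers: List[str]) -> List[Tuple[int, int, str]]:
    """Create one task per (stage, partition) and assign workers round-robin."""
    tasks: List[Tuple[int, int, str]] = []
    for stage_id in range(len(stages)):
        for p in range(num_partitions):
            worker = workers[len(tasks) % len(workers)]
            tasks.append((stage_id, p, worker))
    return tasks


# 1. Create the RDD objects (lineage only, nothing runs yet).
lines = Node("textFile")
words = Node("flatMap", parent=lines)
pairs = Node("map", parent=words)
counts = Node("reduceByKey", parent=pairs, shuffle=True)

# 2. The DAG scheduler turns the lineage into an execution plan of stages.
stages = build_stages(counts)

# 3. The task scheduler assigns tasks to workers.
tasks = schedule_tasks(stages, num_partitions=2, workers=["worker-1", "worker-2"])
```

With this word-count style lineage, the plan has two stages: the narrow steps `textFile`, `flatMap` and `map` are pipelined in the first, and `reduceByKey` starts the second because it needs a shuffle. Each stage yields one task per partition.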
