Apache Spark

E185661

big data framework, cluster computing framework, distributed data processing engine, open-source software

Apache Spark is an open-source, distributed data processing engine designed for large-scale data analytics, machine learning, and stream processing.

All labels observed (12)

Label	Occurrences
Apache Spark (canonical)	23
Spark SQL	2
SparkR	2
Apache Spark MLlib	1
Apache Spark SQL	1
Apache Spark Streaming	1
Apache Spark programming model	1
MLlib	1
RDD	1
Resilient Distributed Dataset	1
Spark	1
Spark Streaming	1

How this entity was disambiguated

This entity first appeared as the object of triple T1647659 — resolving that mention is where its identity was fixed. The disambiguator weighed these candidate entities and picked the highlighted one (or “None”, minting a new entity). This is how homonymy is resolved: the same surface form can point to different entities.

NED1 Entity disambiguation (via context triple) gpt-5-mini-2025-08-07

Target entity: Apache Spark
Context triple: [Azure Synapse Analytics, supports, Apache Spark]

A. Hadoop
Hadoop is an open-source framework that enables distributed storage and parallel processing of large data sets across clusters of commodity hardware.
B. Google Cloud Dataproc
Google Cloud Dataproc is a managed cloud service for running Apache Hadoop, Spark, and other big data workloads on scalable, automated clusters in Google Cloud.
C. Apache Mesos
Apache Mesos is an open-source cluster manager that abstracts CPU, memory, storage, and other resources away from machines to enable efficient deployment and scaling of distributed applications and frameworks.
D. Google Cloud Dataflow
Google Cloud Dataflow is a fully managed service for developing and executing batch and streaming data processing pipelines, based on Apache Beam, within the Google Cloud ecosystem.
E. AWS Glue
AWS Glue is a fully managed extract, transform, and load (ETL) service from Amazon Web Services that simplifies data preparation and integration for analytics and data warehousing.
F. None of the above. (chosen)
G. Unsure - the case is ambiguous/there is not enough information to decide.

NED2 Entity disambiguation (via description) gpt-5-mini-2025-08-07

Target entity: Apache Spark
Target entity description: Apache Spark is an open-source, distributed data processing engine designed for large-scale data analytics, machine learning, and stream processing.

A. Hadoop
Hadoop is an open-source framework that enables distributed storage and parallel processing of large data sets across clusters of commodity hardware.
B. Google Cloud Dataproc
Google Cloud Dataproc is a managed cloud service for running Apache Hadoop, Spark, and other big data workloads on scalable, automated clusters in Google Cloud.
C. Apache Mesos
Apache Mesos is an open-source cluster manager that abstracts CPU, memory, storage, and other resources away from machines to enable efficient deployment and scaling of distributed applications and frameworks.
D. Google Cloud Dataflow
Google Cloud Dataflow is a fully managed service for developing and executing batch and streaming data processing pipelines, based on Apache Beam, within the Google Cloud ecosystem.
E. AWS Glue
AWS Glue is a fully managed extract, transform, and load (ETL) service from Amazon Web Services that simplifies data preparation and integration for analytics and data warehousing.
F. None of the above. (chosen)
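
The resolution step described in this section can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the pipeline's actual implementation: `EntityStore`, `mint`, and `resolve_mention` are hypothetical names, the generated IDs are placeholders unrelated to real identifiers such as E185661, and the choice itself (made by a model in the source) is passed in as a plain argument.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class EntityStore:
    """Hypothetical in-memory store mapping entity IDs to canonical labels."""
    labels: Dict[str, str] = field(default_factory=dict)

    def mint(self, label: str) -> str:
        """Create a new entity for `label` and return its (placeholder) ID."""
        entity_id = f"E{len(self.labels) + 1}"
        self.labels[entity_id] = label
        return entity_id


def resolve_mention(store: EntityStore, mention: str,
                    candidate_ids: List[str], choice: Optional[int]) -> str:
    """Return the entity ID a mention resolves to.

    `choice` is an index into `candidate_ids`, or None for
    "None of the above", in which case a new entity is minted.
    """
    if choice is None:
        return store.mint(mention)
    return candidate_ids[choice]


# Candidates A-E from the listing above; option F corresponds to choice=None.
store = EntityStore()
candidates = [store.mint(name) for name in (
    "Hadoop", "Google Cloud Dataproc", "Apache Mesos",
    "Google Cloud Dataflow", "AWS Glue")]
spark_id = resolve_mention(store, "Apache Spark", candidates, choice=None)
print(spark_id, store.labels[spark_id])
```

Because none of the five candidates matched, the mention gets its own entity rather than being merged into an existing one, which is the outcome recorded for both NED1 and NED2.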

Statements (91)

Predicate	Object
instanceOf	big data framework; cluster computing framework; distributed data processing engine; open-source software
abbreviation	Apache Spark (self-link; surface form: RDD)
architecture	master-slave architecture
canRunOn	Apache Mesos; YARN (surface form: Hadoop YARN); Kubernetes; standalone cluster manager
canUseStorage	Amazon S3; Azure Data Lake Storage; Google Cloud Storage; HDFS (surface form: Hadoop Distributed File System); local file system
category	big data analytics; data engineering; machine learning platform; stream processing framework
component	GraphX; Apache Spark (self-link; surface form: MLlib); PySpark; ESP8266 microcontrollers (surface form: Spark Core); Apache Spark (self-link; surface form: Spark SQL); Apache Spark (self-link; surface form: Spark Streaming); Apache Spark (self-link; surface form: SparkR); Structured Streaming
coreAbstraction	Apache Spark (self-link; surface form: Resilient Distributed Dataset)
designedFor	batch processing; interactive data analytics; large-scale data processing; machine learning workloads; stream processing
developer	Apache Software Foundation
donatedTo	Apache Software Foundation
donationYear	2013
executionModel	in-memory computing
hasComponent	cluster manager; driver program; executors
initialReleaseDate	2010
integratesWith	Apache Cassandra; Apache HBase; Hadoop (surface form: Apache Hadoop); Apache Hive; Apache Kafka; JDBC data sources
license	Apache License 2.0
optimizedFor	in-memory data processing
originatedAt	UC Berkeley AMPLab
programmingLanguage	Java; Python; R; SQL; Scala
provides	Catalyst query optimizer; Tungsten execution engine; high-level APIs; low-level RDD API
schedulingUnit	job; stage; task
supports	SQL queries; batch processing; data parallelism; distributed computing; fault tolerance; graph processing; lazy evaluation; machine learning algorithms; stream processing; task parallelism
supportsAbstraction	DataFrame; Dataset
supportsDeployment	cloud environments; on-premises clusters
supportsLanguageAPI	Java API; PySpark; Scala (surface form: Scala API); Apache Spark (self-link; surface form: Spark SQL); Apache Spark (self-link; surface form: SparkR)
topLevelProjectSince	2014
useCase	ETL pipelines; data warehousing; graph analytics; log processing; real-time analytics; recommendation systems
website	https://spark.apache.org
writtenIn	Java; Scala

How these facts were elicited

Referenced by (36)

Full triples — surface form annotated when it differs from this entity's canonical label.

Azure Synapse Analytics → supports → Apache Spark

Hadoop → influenced → Apache Spark

Scala → ecosystem → Apache Spark

Apache Software Foundation → overseesProject → Apache Spark

KMeans → implementedIn → Apache Spark (this entity surface form: Apache Spark MLlib)

AWS Glue → programmingModel → Apache Spark

ORC → usedIn → Apache Spark

Google Cloud Dataproc → supportsFramework → Apache Spark

Avro → usedWith → Apache Spark

Apache Mesos → supportsFramework → Apache Spark

Apache Spark → supportsLanguageAPI → Apache Spark (self-link; this entity surface form: SparkR)

Apache Spark → supportsLanguageAPI → Apache Spark (self-link; this entity surface form: Spark SQL)

Apache Spark → coreAbstraction → Apache Spark (self-link; this entity surface form: Resilient Distributed Dataset)

Apache Spark → abbreviation → Apache Spark (self-link; this entity surface form: RDD)

Apache Spark → component → Apache Spark (self-link; this entity surface form: Spark SQL)

Apache Spark → component → Apache Spark (self-link; this entity surface form: Spark Streaming)

Apache Spark → component → Apache Spark (self-link; this entity surface form: MLlib)

Apache Spark → component → Apache Spark (self-link; this entity surface form: SparkR)

Synapse Studio → supports → Apache Spark

Yet Another Resource Negotiator → supportsFramework → Apache Spark

YARN → supportsFramework → Apache Spark

MapReduce → influenced → Apache Spark

Apache Storm → competesWith → Apache Spark (this entity surface form: Apache Spark Streaming)

Apache Hive → runsOn → Apache Spark

Apache HBase → integratesWith → Apache Spark

Apache Mahout → integratesWith → Apache Spark

Google MapReduce → influenced → Apache Spark (this entity surface form: Apache Spark programming model)

HDFS → usedBy → Apache Spark

Apache Pig → executionEngine → Apache Spark (this entity surface form: Spark)

Apache Pig → comparedWith → Apache Spark (this entity surface form: Apache Spark SQL)

NVIDIA RAPIDS → integratesWith → Apache Spark

Databricks → coreTechnology → Apache Spark

ASF → governs → Apache Spark (subject surface form: Apache Software Foundation)

ASF → hasKeyProject → Apache Spark (subject surface form: Apache Software Foundation)

ApacheCon → isRelatedTo → Apache Spark

Cloudera → usesTechnology → Apache Spark
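
The annotation rule stated at the top of this list (a surface form is shown only when it differs from the entity's canonical label) can be expressed as a small formatting function. The sketch below is illustrative only: `render_reference` is a hypothetical helper, and the parenthesised output layout is an assumption rather than the page's actual rendering code.

```python
from typing import Optional


def render_reference(subject: str, predicate: str, canonical: str,
                     surface: Optional[str] = None) -> str:
    """Format one incoming triple for an entity.

    The surface form is appended only when one was recorded and it
    differs from the entity's canonical label.
    """
    line = f"{subject} → {predicate} → {canonical}"
    if surface is not None and surface != canonical:
        line += f" (this entity surface form: {surface})"
    return line


# A mention whose surface form differs from the canonical label is annotated.
print(render_reference("KMeans", "implementedIn", "Apache Spark",
                       "Apache Spark MLlib"))
# A mention that already uses the canonical label is left bare.
print(render_reference("Databricks", "coreTechnology", "Apache Spark",
                       "Apache Spark"))
```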
