MapReduce
E185673
MapReduce is a programming model and processing framework for distributed computation of large data sets across clusters of computers.
All labels observed (4)
| Label | Occurrences |
|---|---|
| MapReduce canonical | 10 |
| MapReduce: Simplified Data Processing on Large Clusters | 5 |
| Apache MapReduce | 1 |
| MapReduce programming model | 1 |
How this entity was disambiguated
This entity first appeared as the object of triple T1647831 — resolving that mention is where its identity was fixed. The disambiguator weighed these candidate entities and picked the highlighted one (or “None”, minting a new entity). This is how homonymy is resolved: the same surface form can point to different entities.
NED1
Entity disambiguation (via context triple)
gpt-5-mini-2025-08-07
Target entity: MapReduce
Context triple: [Hadoop, hasComponent, MapReduce]
- A. Hadoop: Hadoop is an open-source framework that enables distributed storage and parallel processing of large data sets across clusters of commodity hardware.
- B. Google Cloud Dataflow: Google Cloud Dataflow is a fully managed service for developing and executing batch and streaming data processing pipelines, based on Apache Beam, within the Google Cloud ecosystem.
- C. Apache Mesos: Apache Mesos is an open-source cluster manager that abstracts CPU, memory, storage, and other resources away from machines to enable efficient deployment and scaling of distributed applications and frameworks.
- D. Google Cloud Dataproc: Google Cloud Dataproc is a managed cloud service for running Apache Hadoop, Spark, and other big data workloads on scalable, automated clusters in Google Cloud.
- E. Paxos consensus algorithm: The Paxos consensus algorithm is a fault-tolerant protocol for achieving agreement among distributed systems, widely used as a foundation for reliable, replicated state machines and modern distributed databases.
- F. None of above. (chosen)
- G. Unsure - the case is ambiguous/there is not enough information to decide.
NED2
Entity disambiguation (via description)
gpt-5-mini-2025-08-07
Target entity: MapReduce
Target entity description: MapReduce is a programming model and processing framework for distributed computation of large data sets across clusters of computers.
- A. Hadoop: Hadoop is an open-source framework that enables distributed storage and parallel processing of large data sets across clusters of commodity hardware.
- B. Google Cloud Dataflow: Google Cloud Dataflow is a fully managed service for developing and executing batch and streaming data processing pipelines, based on Apache Beam, within the Google Cloud ecosystem.
- C. Apache Mesos: Apache Mesos is an open-source cluster manager that abstracts CPU, memory, storage, and other resources away from machines to enable efficient deployment and scaling of distributed applications and frameworks.
- D. Google Cloud Dataproc: Google Cloud Dataproc is a managed cloud service for running Apache Hadoop, Spark, and other big data workloads on scalable, automated clusters in Google Cloud.
- E. Paxos consensus algorithm: The Paxos consensus algorithm is a fault-tolerant protocol for achieving agreement among distributed systems, widely used as a foundation for reliable, replicated state machines and modern distributed databases.
- F. None of above. (chosen)
Statements (50)
| Predicate | Object |
|---|---|
| instanceOf | distributed computing framework; parallel computing model; programming model |
| abstractsAway | details of data distribution; details of fault tolerance; details of parallelization |
| basedOn | map function; reduce function |
| category | big data technology; distributed data processing framework |
| commonlyUsedWith | Google File System; HDFS (surface form: Hadoop Distributed File System) |
| dataLocalityStrategy | move computation to data |
| dataModel | key-value pairs |
| describedBy | Jeffrey Dean; Sanjay Ghemawat |
| describedIn | MapReduce (self-link; surface form: MapReduce: Simplified Data Processing on Large Clusters) |
| designedFor | fault-tolerant distributed processing |
| developer | Google |
| executionModel | batch processing |
| faultToleranceMechanism | re-execution of failed tasks |
| handles | automatic data distribution; automatic fault recovery; task scheduling |
| hasComponent | Map phase; Reduce phase; Shuffle phase; Sort phase |
| implementedIn | Google internal infrastructure |
| influenced | Hadoop (surface form: Apache Hadoop MapReduce); Apache Spark; Dryad; FlumeJava |
| inspiredBy | functional programming |
| jobInput | input splits |
| jobOutput | output files in distributed file system |
| publicationYear | 2004 |
| purpose | batch data processing; distributed computation; processing large data sets |
| runsOn | cluster of commodity hardware |
| scalesTo | petabytes of data; thousands of machines |
| supports | data parallelism; task parallelism |
| usedFor | ETL workloads; data mining; index building; log processing; machine learning preprocessing |
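The statements above describe a model built on a map function and a reduce function over key-value pairs, with Map, Shuffle, Sort, and Reduce phases. A minimal single-process sketch of that flow, using word counting as the example, might look like the following. The names `map_words`, `reduce_counts`, and `run_job` are hypothetical and chosen for illustration; a real framework would also distribute the work across machines and re-execute failed tasks, which this sketch does not attempt.

```python
from collections import defaultdict
from typing import Callable, Iterable, Iterator


def map_words(split_id: str, text: str) -> Iterator[tuple[str, int]]:
    """Map function: emit a (word, 1) pair for every word in one input split."""
    for word in text.lower().split():
        yield word, 1


def reduce_counts(word: str, counts: Iterable[int]) -> tuple[str, int]:
    """Reduce function: combine all values emitted for one key."""
    return word, sum(counts)


def run_job(
    splits: Iterable[tuple[str, str]],
    mapper: Callable[[str, str], Iterable[tuple[str, int]]],
    reducer: Callable[[str, Iterable[int]], tuple[str, int]],
) -> list[tuple[str, int]]:
    # Map phase: apply the mapper to every input split.
    intermediate = []
    for split_id, text in splits:
        intermediate.extend(mapper(split_id, text))

    # Shuffle phase: group intermediate values by key.
    groups = defaultdict(list)
    for key, value in intermediate:
        groups[key].append(value)

    # Sort phase, then Reduce phase: visit keys in sorted order
    # and apply the reducer to each group.
    return [reducer(key, groups[key]) for key in sorted(groups)]


# Hypothetical input splits, standing in for chunks of a large file.
splits = [
    ("split-0", "the quick brown fox"),
    ("split-1", "the lazy dog jumps over the fox"),
]
result = run_job(splits, map_words, reduce_counts)
```

Here `result` is a list of `(word, count)` pairs ordered by word. Because the mapper and reducer only see key-value pairs, the same two functions could in principle be run by a distributed scheduler without modification, which is the abstraction the `abstractsAway` row refers to.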
How these facts were elicited
The pipeline generated the facts above by prompting gpt-5.1 with this entity's name + description and the instruction below.
Instruction
You are a knowledge base construction expert. Given a subject entity and a description of it, return factual statements that you know for the subject as a JSON list of dictionaries(triples), where keys must be "subject", "predicate" and "object". The number of facts may be very high, between 25 to 50 or more, for very popular subjects. For less popular subjects, the number of facts can be very low, like 5 or 10.
# Requirements
- If you don't know the subject at all, return an empty list.
- If the subject is not a named entity, return an empty list.
- Include at least one triple where predicate is "instanceOf".
- Do not get too wordy.
- Separate several objects into multiple triples with one object.
Input
Subject: MapReduce
Description of subject: MapReduce is a programming model and processing framework for distributed computation of large data sets across clusters of computers.
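The instruction asks for a JSON list of dictionaries with the keys "subject", "predicate", and "object", one object per triple. The page does not show the raw model output, so the following is only a sketch of what that shape could look like, using a few facts from the Statements table. The helper `make_triple` is a hypothetical name introduced for illustration.

```python
import json


def make_triple(subject: str, predicate: str, obj: str) -> dict[str, str]:
    """Build one triple with the three keys the instruction requires."""
    return {"subject": subject, "predicate": predicate, "object": obj}


# A few facts taken from the Statements table; multi-valued predicates
# such as hasComponent become one triple per object.
triples = [
    make_triple("MapReduce", "instanceOf", "programming model"),
    make_triple("MapReduce", "developer", "Google"),
    make_triple("MapReduce", "hasComponent", "Map phase"),
    make_triple("MapReduce", "hasComponent", "Reduce phase"),
]

# Serialize to the JSON list format described in the instruction.
payload = json.dumps(triples, indent=2)
print(payload)
```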
Referenced by (17)
Full triples — surface form annotated when it differs from this entity's canonical label.
- this entity (surface form: MapReduce: Simplified Data Processing on Large Clusters)
- this entity (surface form: MapReduce programming model)
- this entity (surface form: MapReduce: Simplified Data Processing on Large Clusters)
- this entity (surface form: MapReduce: Simplified Data Processing on Large Clusters)
- this entity (surface form: Apache MapReduce)
- this entity (surface form: MapReduce: Simplified Data Processing on Large Clusters)
- this entity (surface form: MapReduce: Simplified Data Processing on Large Clusters)