pyspark.RDD.aggregate#

RDD.aggregate(zeroValue, seqOp, combOp)[source]#

Aggregate the elements of each partition, and then the results for all the partitions, using a given combine functions and a neutral “zero value.”

The functions op(t1, t2) is allowed to modify t1 and return it as its result value to avoid object allocation; however, it should not modify t2.

The first function (seqOp) can return a different result type, U, than the type of this RDD. Thus, we need one operation for merging a T into an U and one operation for merging two U

New in version 1.1.0.

Parameters

zeroValueU: the initial value for the accumulated result of each partition
seqOpfunction: a function used to accumulate results within a partition
combOpfunction: an associative function used to combine results from different partitions

Returns

U: the aggregated result

See also

RDD.reduce()
RDD.fold()

Examples

>>> seqOp = (lambda x, y: (x[0] + y, x[1] + 1))
>>> combOp = (lambda x, y: (x[0] + y[0], x[1] + y[1]))
>>> sc.parallelize([1, 2, 3, 4]).aggregate((0, 0), seqOp, combOp)
(10, 4)
>>> sc.parallelize([]).aggregate((0, 0), seqOp, combOp)
(0, 0)