It is wasteful to work with a large number of small partitions of data, especially when S3 is used as the data store. I/O then becomes an unacceptably large part of the time spent in each task, most annoyingly when Spark is just moving data files from the _temporary location into their final destination, after the real work has already been completed.
Here is a small tip on how to down-sample a dataset while keeping the resulting partition sizes similar to those of the original dataset. The key is to get the original number of partitions, decrease it proportionally to the sampling fraction, and coalesce the sampled RDD to that count.
This is useful when you test your Spark job on a single node (or a few) but with roughly the amount of data you expect each production cluster node to handle.
import org.apache.hadoop.io.compress.GzipCodec
import org.apache.spark.SparkContext

def sample(sparkCtx: SparkContext, inputPath: String, outputPath: String, fraction: Double, seed: Int = 1): Unit = {
  val inputRdd = sparkCtx.textFile(inputPath)
  // Shrink the partition count proportionally to the sampling fraction,
  // keeping at least one partition so coalesce does not fail for tiny fractions
  val outputPartitions = math.max(1, (inputRdd.partitions.length * fraction).toInt)
  inputRdd
    .sample(withReplacement = false, fraction = fraction, seed = seed)
    .coalesce(outputPartitions, shuffle = false)
    .saveAsTextFile(outputPath, classOf[GzipCodec])
}
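
For example, calling the function could look like the sketch below. The bucket, paths and app name are placeholders of my own choosing; local[*] simply mirrors the single-node testing scenario mentioned above.

import org.apache.spark.{SparkConf, SparkContext}

// Placeholder app name and local master for a single-node test run
val conf = new SparkConf().setAppName("down-sample").setMaster("local[*]")
val sparkCtx = new SparkContext(conf)

// Keep roughly 10% of the records while preserving partition sizes
sample(sparkCtx, "s3a://my-bucket/events/", "s3a://my-bucket/events-sample/", fraction = 0.1)

sparkCtx.stop()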
Happy sampling!