
I am distributing some download tasks on a Spark cluster. The input comes from a source which cannot always be parallelized with Spark's normal methods like parallelize or textFile. Instead, I have one service providing me a bunch of download tasks (a URL plus the encapsulated logic to read and decrypt it), which I distribute using parallelize.

When there are a few thousand tasks, Spark distributes them equally across all the slaves, enabling the maximum level of parallelism. However, when there are only a few hundred tasks, Spark decides the dataset is tiny and can be computed on just a few slaves, to reduce the communication time and increase data locality. But this is wrong in my case: each task can produce thousands of JSON records, and I want the downloads to be performed by as many machines as I have in my cluster.

I have two ideas for the moment:

  • using repartition to set the number of partitions to the number of cores
  • using repartition to set the number of partitions to the number of download tasks

I don't like the first one because I would have to pass the number of cores into a piece of my code which currently does not need to know how many resources it has. I have only one Spark job running at a time, but in the future I could have more, so I would actually have to pass the number of cores divided by the number of parallel jobs I want to run on the cluster. I don't like the second one either because partitioning into thousands of partitions when I have only 40 nodes seems awkward.
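To make the two options concrete, here is a rough sketch of what each would look like. The names `tasks` (the local collection of download tasks) and the core/node counts are assumptions for illustration, not from the original setup:

```scala
// Sketch only: `tasks` is the local Seq of download tasks handed out by the
// service, and `sc` is an existing SparkContext.

// Option 1: one partition per core -- requires the code to know cluster size.
val numCores = 40 * 8  // hypothetical: 40 nodes x 8 cores each
val byCores = sc.parallelize(tasks).repartition(numCores)

// Option 2: one partition per download task -- no cluster knowledge needed,
// but yields as many partitions as there are tasks.
val byTasks = sc.parallelize(tasks).repartition(tasks.size)
```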

Is there a way to tell Spark to distribute the elements of an RDD across as many machines as possible? If not, which of the two options is preferable?

  • You say you cannot use parallelize and also that you use parallelize. Do I understand that correctly? :) Commented Jun 27, 2015 at 20:28
  • Ah, I think I understand! You mean you don't have the data up-front, just the URLs. So you cannot distribute the data by parallelize, instead you distribute the URLs using parallelize. Don't mind me... Commented Jun 27, 2015 at 20:32
  • @DanielDarabos you got it right
    – Dici
    Commented Jun 27, 2015 at 21:29

1 Answer


If each download can produce a lot of records and you don't have a lot of downloads (just a few thousand), I would recommend creating one partition per download.

The total overhead of scheduling a few thousand tasks is just a few seconds. We routinely have tens of thousands of partitions in production.

If you had several downloads in one partition, you could end up with very large partitions. If a partition cannot fit into the available memory in its entirety twice, you are going to have issues with some operations: for example, join and distinct build hash tables of the entire partition.


You shouldn't need to use repartition. parallelize takes a second parameter: the number of partitions you want. While the list of URLs is not a large amount of data, it's still better to create the RDD with the right number of partitions to begin with than to shuffle it after creation.
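A minimal sketch of that, assuming the tasks live in a local collection called `tasks` (a hypothetical name) and `sc` is the SparkContext:

```scala
// Sketch: create the RDD with one partition per download task directly,
// via parallelize's second argument (numSlices), avoiding a later shuffle.
val rdd = sc.parallelize(tasks, tasks.size)
```

Because the partitioning is set at creation time, each download task lands in its own partition and can be scheduled on any free executor, rather than being repartitioned (and shuffled) afterwards.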

  • Hi, your answer makes sense, so I've upvoted it. I'm not accepting right away because I want to test it first, on Monday. I have an additional question: when you have more partitions than nodes (and cores) in your cluster, several partitions are computed by the same node, right? So how is that different from having as many partitions as nodes, but larger ones?
    – Dici
    Commented Jun 27, 2015 at 21:26
  • Each partition becomes a unit of computation (a Spark task). Whether you have many small units of work or fewer larger ones makes a difference regardless of the number of cores. There is some overhead to scheduling the tasks. A shuffle creates N^2 blocks if you have N partitions. The partitions sometimes have to fit into memory. Etc. Anyway, it's best to test both options before you decide, if you can build a representative test! Commented Jun 27, 2015 at 21:32
  • Ok, thanks for your answer, I will try on Monday and let you know !
    – Dici
    Commented Jun 27, 2015 at 21:37
  • I tried both options and saw no big difference. Indeed, my data is first downloaded and filtered, then repartitioned and sorted by id. Thus, the initial partitioning does not really affect performance, as long as I have more partitions than nodes in my cluster so that the download is as parallel as possible.
    – Dici
    Commented Jun 29, 2015 at 12:02
