0

I have a use case where I create an RDD from a Hive table. I wrote business logic that operates on every row of the Hive table. My assumption was that when I create the RDD and run a map over it, it would utilise all my Spark executors. But what I see in my logs is that only one node processes the RDD while the rest of my 5 nodes sit idle. Here is my code:

val flow = hiveContext.sql("select * from humsdb.t_flow")
val x = flow.rdd.map { row =>
  // do some computation on each row
}

Any clue where I went wrong?

1
  • 1
    Can you add the output of flow.rdd.getNumPartitions ?
    – eliasah
    Commented May 29, 2017 at 7:46

2 Answers

1

As specified here by @jaceklaskowski:

By default, a partition is created for each HDFS partition, which by default is 64MB (from Spark’s Programming Guide).

If your input data is less than 64MB (and you are using HDFS), then by default only one partition will be created.

Spark will only use all nodes when the data is large enough to be split into multiple partitions.
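
For reference, a minimal sketch (reusing the flow DataFrame from the question) that checks the partition count and forces a wider spread when the input is small:

val flow = hiveContext.sql("select * from humsdb.t_flow")

// If this prints 1, the whole table fits into a single partition,
// so the map will run on a single executor.
println(flow.rdd.getNumPartitions)

// Force more partitions so the work is spread across executors
// (200 is an arbitrary example value; tune it for your cluster).
val spread = flow.repartition(200)
println(spread.rdd.getNumPartitions)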

1
  • 1
Yes, Yaron. I realised my data is definitely not big data, and hence there is only one partition.
    – Bala
    Commented May 30, 2017 at 10:07
1

Could there be a possibility that your data is skewed?

To rule out this possibility, do the following and rerun the code.

val flow = hiveContext.sql("select * from humsdb.t_flow").repartition(200)
val x = flow.rdd.map { row =>
  // do some computation on each row
}

Further, if your map logic depends on a particular column, you can do the following:

val flow = hiveContext.sql("select * from humsdb.t_flow").repartition(col("yourColumnName"))
val x = flow.rdd.map { row =>
  // do some computation on each row
}

A good partition column could be a date column.
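
A minimal sketch of that, with flow_date as a hypothetical date column name (replace it with a real column from your table) and the import needed for col:

import org.apache.spark.sql.functions.col

// Rows with the same date land in the same partition,
// while different dates are spread across the executors.
val flow = hiveContext.sql("select * from humsdb.t_flow")
  .repartition(col("flow_date"))

println(flow.rdd.getNumPartitions)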

1
• repartition indeed distributed it across multiple nodes. And my business logic does not depend on any particular column; I need almost all columns.
    – Bala
    Commented May 30, 2017 at 10:08
