0

I have a use case where I create an RDD from a Hive table. I wrote business logic that operates on every row of the Hive table. My assumption was that when I create the RDD and run a map over it, it would utilise all my Spark executors. But what I see in my logs is that only one node processes the RDD while the rest of my 5 nodes sit idle. Here is my code:

val flow = hiveContext.sql("select * from humsdb.t_flow")
val x = flow.rdd.map { row =>
  // do some computation on each row
}

Any clue where I went wrong?

1
  • 1
    Can you add the output of flow.rdd.getNumPartitions ?
    – eliasah
    Commented May 29, 2017 at 7:46

2 Answers

1

As specified here by @jaceklaskowski:

By default, a partition is created for each HDFS partition, which by default is 64MB (from Spark’s Programming Guide).

If your input data is less than 64MB (and you are using HDFS), then by default only one partition will be created.

Spark will only use all nodes when the data is large enough to be split into multiple partitions.
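
For reference, a minimal sketch (reusing the flow DataFrame from the question) that checks the partition count and forces a wider spread when the input is small:

val flow = hiveContext.sql("select * from humsdb.t_flow")

// If this prints 1, the whole table fits into a single partition,
// so the map will run on a single executor.
println(flow.rdd.getNumPartitions)

// Force more partitions so the work is spread across executors
// (200 is an arbitrary example value; tune it for your cluster).
val spread = flow.repartition(200)
println(spread.rdd.getNumPartitions)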

1
  • 1
Yes, Yaron. I realised my data is definitely not big data, and hence there is only one partition.
    – Bala
    Commented May 30, 2017 at 10:07
1

Could there be a possibility that your data is skewed?

To rule out this possibility, do the following and rerun the code.

val flow = hiveContext.sql("select * from humsdb.t_flow").repartition(200)
val x = flow.rdd.map { row =>
  // do some computation on each row
}

Further, if your map logic depends on a particular column, you can do the following:

val flow = hiveContext.sql("select * from humsdb.t_flow").repartition(col("yourColumnName"))
val x = flow.rdd.map { row =>
  // do some computation on each row
}

A good partition column could be a date column.
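
A minimal sketch of that, with flow_date as a hypothetical date column name (replace it with a real column from your table) and the import needed for col:

import org.apache.spark.sql.functions.col

// Rows with the same date land in the same partition,
// while different dates are spread across the executors.
val flow = hiveContext.sql("select * from humsdb.t_flow")
  .repartition(col("flow_date"))

println(flow.rdd.getNumPartitions)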

1
• repartition indeed distributed it across multiple nodes. And my business logic does not depend on any particular column; I need almost all columns.
    – Bala
    Commented May 30, 2017 at 10:08
