Harish Kumar
2016-11-03 00:00:49 UTC
I have an RDD with 10K columns and 70 million rows. The 70 MM rows will be
grouped into 2000-3000 groups based on a key attribute. I followed the steps below:
1. Julia and PySpark are linked using the pyjulia package.
2. The 70 MM row RDD is grouped by key and each group is passed to Julia:
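import julia

def juliaCall(x):
    # <<convert x (list of rows) to list of list inputdata>>
    j = julia.Julia()         # starts a Julia session on every call
    jcode = """ """
    calc = j.eval(jcode)      # jcode evaluates to a Julia function
    result = calc(inputdata)
    return result             # return the result so that map() yields it

RDD.groupBy(key).map(lambda x: juliaCall(x))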
It works fine for keys (groups) with 50K records, but each of my groups has
100K to 3M records. In such cases the shuffle is much larger and the job fails.
Can anyone guide me on how to overcome this issue?
I have a cluster of 10 nodes; each node has 116 GB of memory and 16 cores. It runs
in standalone mode and I allocated only 10 cores per node.
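To give a rough sense of the sizes involved, here is a back-of-the-envelope sketch of how large a single group becomes once it is collected into one list of lists. The row counts, column count and node memory come from the figures above; the 8 bytes per value is purely an assumption (the column types are not stated), and the helper name group_size_gb is made up for illustration. Python object overhead would make the real footprint larger.

```python
# Rough size of one group once materialized as a dense rows x columns block.
# ASSUMPTION: every value takes 8 bytes (e.g. a float64); the actual column
# types are not stated, so treat these numbers as a lower-bound estimate.
BYTES_PER_VALUE = 8       # assumed, not from the post
COLUMNS = 10_000          # from the post
NODE_MEMORY_GB = 116      # from the post


def group_size_gb(rows, columns=COLUMNS, bytes_per_value=BYTES_PER_VALUE):
    """Raw size in GB (10**9 bytes) of a dense rows x columns block."""
    return rows * columns * bytes_per_value / 1e9


for rows in (50_000, 100_000, 3_000_000):
    size = group_size_gb(rows)
    verdict = "fits in" if size < NODE_MEMORY_GB else "exceeds"
    print(f"{rows:>9,} rows -> {size:6.1f} GB ({verdict} one node's {NODE_MEMORY_GB} GB)")
```

Under that assumption a 50K-row group is about 4 GB, a 100K-row group about 8 GB, and a 3M-row group about 240 GB, which is more than the memory of a single node. Since grouping by key gathers all rows of one key into a single task, this may be why the larger groups fail.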
Any help?