Spark Computing Environment

1. What is Spark?

1.1. Overview and design philosophy

1.2. A brief history of Spark

1.3. A workflow system

1.4. RDD: Resilient distributed dataset

1.5. Spark applications

[Figure: Spark application architecture]


2. Programming for Spark Computing Environment

2.1. Overview

2.2. Common Spark transformations
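
Transformations build new RDDs from existing ones and are evaluated lazily. A minimal sketch of a few common ones (map, filter, reduceByKey) on a toy dataset, assuming a SparkContext bound to sc:

nums = sc.parallelize([1, 2, 3, 4])
doubled = nums.map(lambda x: x * 2)           # [2, 4, 6, 8]
evens = nums.filter(lambda x: x % 2 == 0)     # [2, 4]
pairs = sc.parallelize([("a", 1), ("a", 2), ("b", 3)])
sums = pairs.reduceByKey(lambda a, b: a + b)  # [("a", 3), ("b", 3)]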

2.3. Common Spark actions
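
Actions trigger the actual computation and return a result to the driver (or write it out). A short sketch, again assuming sc is available:

nums = sc.parallelize([1, 2, 3, 4])
print(nums.collect())                   # fetch all elements: [1, 2, 3, 4]
print(nums.count())                     # number of elements: 4
print(nums.take(2))                     # first two elements: [1, 2]
print(nums.reduce(lambda a, b: a + b))  # aggregate to a single value: 10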

3. Hands-on: Word Count in Spark

3.1. Preparation
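
The cells in this section assume a SparkContext already bound to the name sc, as it is in a PySpark shell or a preconfigured notebook. If you are starting from a plain Python environment instead, a minimal sketch for creating one locally (assuming pyspark is installed):

from pyspark import SparkContext

# "local[*]" runs Spark on all local cores; the application name is arbitrary
sc = SparkContext("local[*]", "WordCount")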

3.2. Getting data

!wget -O 100-0.txt --no-check-certificate 'https://drive.google.com/uc?export=download&id=1oKnG6y2mkKcaPSZEJM9ZjQ7UXo4SzP7I'
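
Judging by the filename, this is Project Gutenberg eBook #100 (the Complete Works of William Shakespeare) served from a Google Drive mirror. To confirm the download worked, peek at the first few lines; the ! prefix assumes a Jupyter-style notebook, as in the wget cell above:

!head -n 5 100-0.txt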

3.3. Running WordCount

input_path = "100-0.txt"
output_path = "output-wordcount-01"
textFile = sc.textFile(input_path)                 # RDD of lines
wordcount = textFile.flatMap(lambda line: line.split(" ")) \
    .map(lambda word: (word, 1)) \
    .reduceByKey(lambda a, b: a + b)               # word -> total count
wordcount.saveAsTextFile(output_path)              # writes a directory of part files
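
saveAsTextFile writes a directory of part files rather than a single file. One way to inspect the result is to read the directory back as an RDD:

sc.textFile(output_path).take(10)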

3.4. Word Count workflow breakdown

print(input_path)

textFile.take(5)  # the first five lines of the input file

tmp = textFile.map(lambda line: line.split(" "))       # one list of words per line
print(tmp.take(5))
tmp2 = textFile.flatMap(lambda line: line.split(" "))  # flattened: one word per element
print(tmp2.take(5))
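
The contrast above is the point of this step: map yields one list of words per line, while flatMap flattens those lists into a single RDD of words. A minimal sketch on a tiny made-up dataset:

demo = sc.parallelize(["to be", "or not"])
print(demo.map(lambda s: s.split(" ")).collect())      # [['to', 'be'], ['or', 'not']]
print(demo.flatMap(lambda s: s.split(" ")).collect())  # ['to', 'be', 'or', 'not']
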
step1 = textFile.flatMap(lambda line: line.split(" "))  # one word per element
print(step1.take(5))
step2 = step1.map(lambda word: (word, 1))               # pair each word with an initial count of 1
print(step2.take(5))
step3 = step2.reduceByKey(lambda a, b: a + b)  # sum the 1s for each word
step3.take(10)
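
reduceByKey merges the counts of every pair that shares a key, which requires a shuffle across partitions. take(10) returns an arbitrary sample; to see the most frequent words instead, one option is to sort by count first:

step3.sortBy(lambda kv: kv[1], ascending=False).take(10)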

step3.saveAsTextFile(output_path)

Running saveAsTextFile again with the same path would fail, because Spark refuses to overwrite an existing output directory; write to a fresh one instead:

output_path = "output-wordcount-02"
step3.saveAsTextFile(output_path)

3.5. Data distribution in Spark

textFile.getNumPartitions()

textFile_2 = textFile.repartition(4)  # shuffle the data into four partitions
textFile_2.getNumPartitions()
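
repartition(4) redistributes the lines across four partitions, which sets the parallelism of later stages. To see how many lines landed in each partition, glom collects each partition into a list:

textFile_2.glom().map(len).collect()  # number of lines per partition
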
output_path="output-wordcount-03"
wordcount = textFile_2.flatMap(lambda line: line.split(" ")) \
            .map(lambda word: (word, 1)) \
            .reduceByKey(lambda a, b: a + b)
wordcount.saveAsTextFile(output_path)
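
Since the RDD now has four partitions, the output directory should contain four part files (part-00000 through part-00003) plus a _SUCCESS marker. A quick check from Python:

import os
sorted(os.listdir(output_path))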


4. Challenges

4.1. Challenge 1:

4.2. Challenge 2: