Spark has three key characteristics:

- Unified: Spark supports a wide range of data analytic tasks over the same computing engine and a consistent set of APIs.
- Computing engine: Spark handles loading data from storage systems and performs computation on the data (in memory) rather than on permanent storage. To adhere to the data locality principle, Spark relies on APIs to provide a transparent common interface to different storage systems for all applications.
- Libraries: Via its APIs, Spark supports a wide array of internal and external libraries for complex data analytic tasks.
An RDD (Resilient Distributed Dataset) has two defining properties:

- Distributed: an RDD is broken into chunks and stored on different compute nodes.
- Resilient: Spark is able to recover from the loss of any or all chunks of an RDD.

A Spark application consists of a driver process and a set of executor processes. The driver runs the main function and is responsible for maintaining information about the application, responding to the user's program, and distributing and scheduling work across the executors. The executors carry out the actual work assigned to them by the driver; they are deployed on the Spark cluster, on top of the compute nodes. An RDD is therefore a distributed collection: when running on a cluster, parts of the data are distributed across different machines and are manipulated by different executors. These parts are called partitions.
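As a small illustration of partitions (a minimal sketch, assuming a SparkContext named sc is already available, as in the lab cells below), we can distribute a local Python collection over several partitions and inspect the result:

# Distribute a local collection over 4 partitions; each executor processes some of them.
numbers = sc.parallelize(range(100), 4)
print(numbers.getNumPartitions())   # -> 4
print(numbers.count())              # -> 100, computed in parallel by the executors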
Code developed locally can be submitted as-is to run in cluster mode (one of the attractive features of Spark). RDDs are immutable, meaning they cannot be changed after creation: to change a data collection means to create a new data collection that is a transformation of the old one. Actions return results to the driver process. There are three kinds of actions: actions to view data in the console (e.g. take), actions to collect data into native objects at the driver (e.g. collect), and actions to write data to output data sources (e.g. saveAsTextFile). Transformations only build a logical plan; no computation happens until an action is called.

Commonly used transformations:

- map: Return a new distributed dataset formed by passing each element of the source through a function.
- filter: Return a new dataset formed by selecting those elements of the source on which a condition returns true. This condition can be either a statement or a function.
- flatMap: Similar to map, but the output items are broken down into individual elements before being aggregated into an output.
- sample: Sample a fraction of the data, with or without replacement, using a given random number generator seed.
- union: Return a new dataset that contains the union of the elements in the source dataset and the argument.
- intersection: Return a new RDD that contains the intersection of elements in the source dataset and the argument.
- distinct: Return a new dataset that contains the distinct elements of the source dataset.
- groupByKey: When called on a dataset of (K, V) pairs, returns a dataset of (K, Iterable<V>) pairs.
- reduceByKey: When called on a dataset of (K, V) pairs, returns a dataset of (K, V) pairs where the values for each key are aggregated using the given reduce function, which must be of type (V, V) => V.
- aggregateByKey: When called on a dataset of (K, V) pairs, returns a dataset of (K, U) pairs where the values for each key are aggregated using the given combine functions and a neutral "zero" value.
- sortByKey: When called on a dataset of (K, V) pairs where K implements Ordered, returns a dataset of (K, V) pairs sorted by keys in ascending or descending order, as specified in the boolean ascending argument.
- join: When called on datasets of type (K, V) and (K, W), returns a dataset of (K, (V, W)) pairs with all pairs of elements for each key. Outer joins are supported through leftOuterJoin, rightOuterJoin, and fullOuterJoin.
- pipe: Pipe each partition of the RDD through a shell command, e.g. a Perl or bash script. RDD elements are written to the process's stdin and lines output to its stdout are returned as an RDD of strings.

Commonly used actions:

- reduce: Aggregate the elements of the dataset using a function func (which takes two arguments and returns one). The function should be commutative and associative so that it can be computed correctly in parallel.
- collect: Return all the elements of the dataset as an array at the driver program. This is usually useful after a filter or other operation that returns a sufficiently small subset of the data.
- count: Return the number of elements in the dataset.
- first: Return the first element of the dataset (similar to take(1)).
- take: Return an array with the first n elements of the dataset.
- takeSample: Return an array with a random sample of num elements of the dataset, with or without replacement, optionally pre-specifying a random number generator seed.
- takeOrdered: Return the first n elements of the RDD using either their natural order or a custom comparator.
- saveAsTextFile: Write the elements of the dataset as a text file (or set of text files) in a given directory in the local filesystem, HDFS or any other Hadoop-supported file system. Spark will call toString on each element to convert it to a line of text in the file.

The exercises below are run in cells on spark-1.
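To see how transformations and actions fit together (a minimal sketch with made-up data, again assuming a SparkContext named sc), consider:

pairs = sc.parallelize([("a", 1), ("b", 2), ("a", 3)])
# Transformations only build a logical plan; nothing is computed yet.
filtered = pairs.filter(lambda kv: kv[1] > 1)
summed = pairs.reduceByKey(lambda a, b: a + b)
# Actions trigger the computation and return results to the driver.
print(filtered.collect())   # [('b', 2), ('a', 3)]
print(summed.collect())     # [('a', 4), ('b', 2)] (order may vary)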
!wget -O 100-0.txt --no-check-certificate 'https://drive.google.com/uc?export=download&id=1oKnG6y2mkKcaPSZEJM9ZjQ7UXo4SzP7I'

We first define the input_path and output_path variables. They provide the path to the location of the input file and to the directory that will contain the output files.
input_path="100-0.txt"
output_path="output-wordcount-01"
textFile = sc.textFile(input_path)
wordcount = textFile.flatMap(lambda line: line.split(" ")) \
.map(lambda word: (word, 1)) \
.reduceByKey(lambda a, b: a + b)
wordcount.saveAsTextFile(output_path)
After the job finishes, the output directory contains one part file per partition of the RDD, plus a _SUCCESS flag file (size 0).
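We can verify this by listing the output directory from the notebook (a sketch in plain Python; it assumes the output was written to the local filesystem):

import os
# saveAsTextFile creates one part-NNNNN file per partition, plus the empty _SUCCESS marker.
print(sorted(os.listdir(output_path)))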
We can print the input_path variable to check its value. Because of lazy evaluation, the textFile variable only points to the address inside the Spark cluster that will hold the RDD; only when we call take, an action, do we get the actual content.
print(input_path)
textFile.take(5)

To see the difference between map and flatMap, we can try the following block of code and observe the results. The variable line inside the lambda function corresponds to a line in the original text file. The output shows that tmp.take(5) is a list of lists, while tmp2.take(5) is only a list of words. In other words, flatMap breaks lines into lists of words and concatenates these lists into a single new RDD.
tmp = textFile.map(lambda line: line.split(" "))
print(tmp.take(5))
tmp2 = textFile.flatMap(lambda line: line.split(" "))
print(tmp2.take(5))
To build the word count step by step, we first apply the flatMap operation (step1) and then use the resulting RDD in the next step. In step2, we use map to create a key-value pair for each word, with the word as the key and 1 as the value.
step1 = textFile.flatMap(lambda line: line.split(" "))
print(step1.take(5))
step2 = step1.map(lambda word: (word, 1))
print(step2.take(5))
Finally, the values associated with the same key are reduced together using pairwise summation to get the final count of each key.
step3 = step2.reduceByKey(lambda a, b: a + b)
step3.take(10)

Note that saving to an output directory that already exists raises an org.apache.hadoop.mapred.FileAlreadyExistsException, as the following cell (which reuses the previous output_path) demonstrates:
step3.saveAsTextFile(output_path)

To fix this, point output_path to a directory that does not exist yet:
output_path="output-wordcount-02"
step3.saveAsTextFile(output_path)

We can check how many partitions an RDD is split into with getNumPartitions:
textFile.getNumPartitions()

We can create a new version of textFile that is distributed across more partitions. Run the following code in a cell on spark-1:
textFile_2 = textFile.repartition(4)
textFile_2.getNumPartitions()
Now we rerun the word count on the repartitioned RDD and save the results to a new directory:
output_path="output-wordcount-03"
wordcount = textFile_2.flatMap(lambda line: line.split(" ")) \
.map(lambda word: (word, 1)) \
.reduceByKey(lambda a, b: a + b)
wordcount.saveAsTextFile(output_path)

Note that the word counts include an entry for the empty string (''), because splitting lines on a single space produces empty tokens (for example, from blank lines).
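One way to drop these empty tokens is to add a filter transformation before counting (a sketch only; the filter step and the output-wordcount-04 directory name are our additions, not part of the original lab):

output_path = "output-wordcount-04"   # hypothetical new output directory
wordcount = textFile_2.flatMap(lambda line: line.split(" ")) \
    .filter(lambda word: word != "") \
    .map(lambda word: (word, 1)) \
    .reduceByKey(lambda a, b: a + b)
wordcount.saveAsTextFile(output_path)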