Reading Time: < 1 minute
Lets do something interesting and useful with the data like figure out the rankings and write results to a file.
First lets learn how to filter and only get female names.
...
sparkSession = SparkSession.builder.appName("FilterFemaleNames").getOrCreate()
schema = StructType([
StructField('name', StringType(), True),
StructField('gender', StringType(), True),
StructField('amount', IntegerType(), True)])
namesDF = sparkSession.read.schema(schema).csv(data_file)
femaleNamesDF = namesDF.filter(namesDF.gender == 'F')
femaleNamesDF.sort('name').show(truncate=False)
...
Pretty straight forward.
Calculate ranking for female names and write to file.
...
namesDF = sparkSession.read.schema(schema).csv(data_file)
femaleNamesDF = namesDF.filter(namesDF.gender == 'F')
nameSpec = Window.orderBy(functions.desc("amount"))
results = femaleNamesDF.withColumn(
"rank", functions.dense_rank().over(nameSpec))
results.coalesce(1).write.format("csv").mode('overwrite').save(
"results/female_rank.csv", header="true")
...
Calculate rank and partition by gender and write to file.
...
namesDF = sparkSession.read.schema(schema).csv(data_file)
nameSpec = Window.partitionBy("gender").orderBy(functions.desc("amount"))
results = namesDF.withColumn(
"rank", functions.dense_rank().over(nameSpec))
results.coalesce(1).write.format("csv").mode('overwrite').save(
"results/rank_names_partitioned.csv", header="true")
...
Filter rank by partition by name.
...
namesDF = sparkSession.read.schema(schema).csv(data_file)
nameSpec = Window.partitionBy("gender").orderBy(functions.desc("amount"))
results = namesDF.withColumn(
"rank", functions.dense_rank().over(nameSpec))
results.filter(results.name == "Zyan").show()
...
Results:
+----+------+------+----+
|name|gender|amount|rank|
+----+------+------+----+
|Zyan| F| 11| 942|
|Zyan| M| 87| 832|
+----+------+------+----+
Not to many kids named Zyan.