Apache Spark SQL: Simple Usecases using Python | Baby Names Part 1

Standard
Reading Time: 2 minutes

Now I won’t be going over map/reduce functions or RDD. I am interested in Spark SQL, these examples will be using Data Frames and Spark SQL. I will not be discussing RDD vs Data Frames vs Datasets.

Let’s get some interesting data we can play with. Baby Names from Social Security Card Applications. There are text files with baby names by year of birth.

Code for all this can be found here: Github.

Format sample:

# Name, Gender, Count
Olivia,F,18451
Emma,F,17102
Ava,F,14440

Let’s create some directories.

~/workspace/spark_playground/baby_names/data
~/workspace/spark_playground/baby_names/scripts

Place some files into the data directory. I’ll be using 5 years worth of data, so: yob2019.txt, yob2018.txt, yob2017.txt, yob2016.txt, and yob2015.txt.

How many kids applied for an SSN in 2019?

You might wonder why I didn’t say how many kids born, it’s because of the data collected. Remember this is how many babies applied for a SSN card, not exactly how many babies born in the US.

import os
from pyspark.sql import SparkSession
from pyspark.sql import functions
from pyspark.sql.types import StructType, StructField, IntegerType, StringType

script_dir = os.path.dirname(__file__)
data_file_path = "../data/yob2019.txt"
data_file = os.path.join(script_dir, data_file_path)

sparkSession = SparkSession.builder.appName("HowManyKids").getOrCreate()

schema = StructType([
    StructField('name', StringType(), True),
    StructField('gender', StringType(), True),
    StructField('amount', IntegerType(), True)])

namesDF = sparkSession.read.schema(schema).csv(data_file)

totalBabies = namesDF.agg(functions.sum("amount").alias("total_babies"))

totalBabies.show(totalBabies.count())

sparkSession.stop()

Results:

+------------+
|total_babies|
+------------+
|     3445321|
+------------+

That’s a lot of babies: 3,445,321

How many kids by gender in 2019?

Code:

import os
from pyspark.sql import SparkSession
from pyspark.sql import functions
from pyspark.sql.types import StructType, StructField, IntegerType, StringType

script_dir = os.path.dirname(__file__)
data_file_path = "../data/yob2019.txt"
data_file = os.path.join(script_dir, data_file_path)

sparkSession = SparkSession.builder.appName("HowManyKids").getOrCreate()

schema = StructType([
    StructField('name', StringType(), True),
    StructField('gender', StringType(), True),
    StructField('amount', IntegerType(), True)])

namesDF = sparkSession.read.schema(schema).csv(data_file)

totalBabies = namesDF.groupBy("gender").agg(
    functions.sum("amount").alias("total_babies"))

totalBabies.show(totalBabies.count())

sparkSession.stop()

Results:

+------+------------+
|gender|total_babies|
+------+------------+
|     F|     1665373|
|     M|     1779948|
+------+------------+

Female: 1,665,373
Male: 1,779,948
Total: 3,445,321

How many different names?

Code:

import os
from pyspark.sql import SparkSession
from pyspark.sql import functions
from pyspark.sql.types import StructType, StructField, IntegerType, StringType

script_dir = os.path.dirname(__file__)
data_file_path = "../data/yob2019.txt"
data_file = os.path.join(script_dir, data_file_path)

sparkSession = SparkSession.builder.appName("HowManyKids").getOrCreate()

schema = StructType([
    StructField('name', StringType(), True),
    StructField('gender', StringType(), True),
    StructField('amount', IntegerType(), True)])

namesDF = sparkSession.read.schema(schema).csv(data_file)

totalNames = namesDF.agg(functions.count("name"))

totalNames.show(totalNames.count())

sparkSession.stop()

Results:

+-----------+
|count(name)|
+-----------+
|      31954|
+-----------+

Total names 31,954

How many names by gender? Count the amount of different names.

Code:

import os
from pyspark.sql import SparkSession
from pyspark.sql import functions
from pyspark.sql.types import StructType, StructField, IntegerType, StringType

script_dir = os.path.dirname(__file__)
data_file_path = "../data/yob2019.txt"
data_file = os.path.join(script_dir, data_file_path)

sparkSession = SparkSession.builder.appName("HowManyKids").getOrCreate()

schema = StructType([
    StructField('name', StringType(), True),
    StructField('gender', StringType(), True),
    StructField('amount', IntegerType(), True)])

namesDF = sparkSession.read.schema(schema).csv(data_file)

namesDF.groupBy("gender").count().show()

sparkSession.stop()

Results:

+------+-----+
|gender|count|
+------+-----+
|     F|17905|
|     M|14049|
+------+-----+

Total of 17,905 different female and 14,049 male names.

These are pretty simple queries. Really you can use Numbers or Excel to get the same results and not even bother.

But we are here to learn and I find this pretty interesting.

Leave a Reply

Your email address will not be published. Required fields are marked *

This site uses Akismet to reduce spam. Learn how your comment data is processed.