My Commonly Used Gems

Reading Time: < 1 minute

These are gems I use frequently. Almost every project I start or work on uses these:

  • pagy
  • pg
  • redis
  • annotate
  • faker
  • pry-rails
  • name_of_person
  • whenever
  • friendly_id
  • inline_svg
  • bootstrap / tailwind (depends on how I feel, but recently it's been more tailwind)
  • sidekiq
  • searchkick
  • timecop

Apache Spark SQL: Baby Names Part 3

Reading Time: < 1 minute

Scanning all of the downloaded files, calculating rank, and adding the year to the results.

Code:

import glob
import os
import re
from pyspark.sql import SparkSession
from pyspark.sql import functions
from pyspark.sql.types import StructType, StructField, IntegerType, StringType
from pyspark.sql.window import Window


def processFile(filename, year):
    # build the path to the data file relative to this script's directory
    script_dir = os.path.dirname(__file__)
    data_file_path = "../data/" + filename
    data_file = os.path.join(script_dir, data_file_path)

    sparkSession = SparkSession.builder.appName(
        "AllDatafiles").getOrCreate()

    # the yob files have no header row, so supply the schema explicitly
    schema = StructType([
        StructField('name', StringType(), True),
        StructField('gender', StringType(), True),
        StructField('amount', IntegerType(), True)])

    namesDF = sparkSession.read.schema(schema).csv(data_file)

    # rank names within each gender by how many babies received them
    nameSpec = Window.partitionBy("gender").orderBy(functions.desc("amount"))

    results = namesDF.withColumn(
        "rank", functions.dense_rank().over(nameSpec))

    # tag every row with its year and write one CSV output per year
    results.withColumn("year", functions.lit(year)).coalesce(1).write.format("csv").mode('overwrite').save(
        "results/rank_names_partitioned_" + year, header="true")

    sparkSession.stop()


# run this from the scripts directory; the year is pulled out of
# filenames like yob2019.txt
os.chdir("../data")
for file in glob.glob("*.txt"):
    year = re.sub("[^0-9]", '', file)
    processFile(file, year)

Apache Spark SQL: Calculating Rank and writing to file | Baby Names Part 2

Reading Time: < 1 minute

Let's do something interesting and useful with the data, like figuring out the rankings and writing the results to a file.

First, let's learn how to filter so we only get female names.

...
sparkSession = SparkSession.builder.appName("FilterFemaleNames").getOrCreate()

schema = StructType([
    StructField('name', StringType(), True),
    StructField('gender', StringType(), True),
    StructField('amount', IntegerType(), True)])

namesDF = sparkSession.read.schema(schema).csv(data_file)

femaleNamesDF = namesDF.filter(namesDF.gender == 'F')

femaleNamesDF.sort('name').show(truncate=False)
...

Pretty straightforward.

Calculate the ranking for female names and write it to a file.

...
namesDF = sparkSession.read.schema(schema).csv(data_file)

femaleNamesDF = namesDF.filter(namesDF.gender == 'F')
nameSpec = Window.orderBy(functions.desc("amount"))

results = femaleNamesDF.withColumn(
    "rank", functions.dense_rank().over(nameSpec))

results.coalesce(1).write.format("csv").mode('overwrite').save(
    "results/female_rank.csv", header="true")
...

Calculate rank partitioned by gender and write it to a file.

...
namesDF = sparkSession.read.schema(schema).csv(data_file)

nameSpec = Window.partitionBy("gender").orderBy(functions.desc("amount"))

results = namesDF.withColumn(
    "rank", functions.dense_rank().over(nameSpec))

results.coalesce(1).write.format("csv").mode('overwrite').save(
    "results/rank_names_partitioned.csv", header="true")
...

Filter the ranked, gender-partitioned results by a specific name.

...
namesDF = sparkSession.read.schema(schema).csv(data_file)

nameSpec = Window.partitionBy("gender").orderBy(functions.desc("amount"))

results = namesDF.withColumn(
    "rank", functions.dense_rank().over(nameSpec))

results.filter(results.name == "Zyan").show()
...

Results:

+----+------+------+----+
|name|gender|amount|rank|
+----+------+------+----+
|Zyan|     F|    11| 942|
|Zyan|     M|    87| 832|
+----+------+------+----+

Not too many kids named Zyan.
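
As a quick extension (my own sketch, not part of the original script), the same window spec makes it easy to pull the most popular names for each gender by filtering on the rank column; the cutoff of 5 is arbitrary:

...
namesDF = sparkSession.read.schema(schema).csv(data_file)

nameSpec = Window.partitionBy("gender").orderBy(functions.desc("amount"))

results = namesDF.withColumn(
    "rank", functions.dense_rank().over(nameSpec))

# keep only the 5 most popular names for each gender
results.filter(functions.col("rank") <= 5).show()
...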

Apache Spark SQL: Simple Use Cases using Python | Baby Names Part 1

Reading Time: 2 minutes

Now, I won’t be going over map/reduce functions or RDDs. I am interested in Spark SQL, so these examples will use DataFrames and Spark SQL. I will not be discussing RDDs vs. DataFrames vs. Datasets.

Let’s get some interesting data we can play with: Baby Names from Social Security Card Applications. The data comes as text files with baby names by year of birth.

Code for all this can be found here: Github.

Format sample:

# Name, Gender, Count
Olivia,F,18451
Emma,F,17102
Ava,F,14440

Let’s create some directories.

~/workspace/spark_playground/baby_names/data
~/workspace/spark_playground/baby_names/scripts

Place some files into the data directory. I’ll be using 5 years’ worth of data: yob2019.txt, yob2018.txt, yob2017.txt, yob2016.txt, and yob2015.txt.
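
To preview where this is going, here is a minimal sketch of reading one of these files into a DataFrame (run from the scripts directory; the appName and the column names in the schema are my own choices, matching what I use in the later parts):

from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, IntegerType, StringType

sparkSession = SparkSession.builder.appName("PreviewNames").getOrCreate()

schema = StructType([
    StructField('name', StringType(), True),
    StructField('gender', StringType(), True),
    StructField('amount', IntegerType(), True)])

# the yob files have no header row, so the schema supplies the column names
namesDF = sparkSession.read.schema(schema).csv("../data/yob2019.txt")
namesDF.show(5)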

Continue reading

Installing/Updating Python

Reading Time: < 1 minute

brew update && brew upgrade
brew install pyenv
pyenv install 3.8.5
pyenv global 3.8.5

Add this to your ~/.zshrc file

if command -v pyenv 1>/dev/null 2>&1; then
  eval "$(pyenv init -)"
fi

Then reload your shell:

source ~/.zshrc

Check your install

# Python 3.8.5
python -V

# .../.pyenv/shims/python
which python

Hello Apache Spark | Installing & Setup for Development and learning on MacOSX

Reading Time: < 1 minute

I am deep diving into Apache Spark because I learned about Spark SQL. I am using Python, and it feels oddly familiar; I feel like I’m using ActiveRecord when working with Python’s Spark SQL libraries. Here are the steps I took to install it on my development box. YMMV.

I assume you have Homebrew installed. If not, go install it.

brew update && brew upgrade

I also assume you have Java installed.
Make sure you have a version of Java installed that works with Spark.

I also assume you have the latest version of Python installed.
Here is how I installed Python 3.8.5 onto my Mac.
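
Once everything is in place, a quick sanity check (a minimal sketch; it assumes pyspark is importable from your Python install) is to start a local session and print the Spark version:

from pyspark.sql import SparkSession

# start a local session, confirm the Spark version, then shut it down
sparkSession = SparkSession.builder.appName("HelloSpark").getOrCreate()
print(sparkSession.version)
sparkSession.stop()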

Continue reading

Docker: apt-get cleanup

Reading Time: < 1 minute

If you install things with apt-get, remember to clean up afterwards:

RUN apt-get clean && rm -f /var/lib/apt/lists/*_*

A better way is to install and clean up in the same RUN instruction, so the package lists never end up in an image layer:

RUN apt-get update -y \
  && apt-get install -y -q package1 \
     package2 \
     package3 \
  && apt-get clean && rm -f /var/lib/apt/lists/*_*