A Primer on Data Normalization

Normalizing data is a common data engineering task. It prepares information to be stored in a way that minimizes duplication and is digestible by machines. It also aims to solve other problems that are out of scope for this particular article but worth reading about if you find yourself struggling to understand jokes about E. F. Codd. This raises the question: why does normalization matter when entering information in a table or organizing a spreadsheet? To answer that question properly, let's explore a simple example.

Data stored in tables is considered tabular data. There exists a relationship between the keys for each row and the columns. Put another way, tabular data is a matrix of information that describes relationships between two things. Take a look at the following tabular data. It describes three users and the access they have been granted to three systems.

Sample Matrix

|                  | Arena System | LockerRoom System | Announcer System |
|------------------|--------------|-------------------|------------------|
| Rick Flair       | x            | x                 |                  |
| Hulk Hogan       | x            | x                 |                  |
| Harold Announcer |              |                   | x                |

Each row represents the capabilities of an individual with regard to the identified systems. The values in the matrix are either an ‘x’ or a blank, which indicate true/false access to the system identified in the column header. It is easy to construct logical statements from data that is represented this way. For example, “Rick Flair has access to the Arena System” is an English sentence that describes the relationship in the first row, first column of the table. It is a statement containing a subject, predicate, and object. Before reading the next paragraph, can you pull out the subject, predicate, and object from our sample statement?

In this example it is straightforward. “Rick Flair” is the subject, “has access to” is the predicate, and “the Arena System” is the object. The subject and object values are pulled from the row and column headers, but where do we get the predicate value from? In this case, our table is a matrix that describes access to various systems by different people, so the predicate comes from the relationship the matrix itself represents. The values are boolean, meaning they are either true or false. This matrix could be broken down into a series of statements. As statements, it would be:

Sample Statements

  • Rick Flair has access to Arena System
  • Rick Flair has access to LockerRoom System
  • Hulk Hogan has access to Arena System
  • Hulk Hogan has access to LockerRoom System
  • Harold Announcer has access to Announcer System
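
The decomposition above is easy to sketch in Python. A minimal illustration (the dictionary layout here is my own, not a prescribed format):

```python
# The access matrix as a Python dictionary, using the names and
# systems from the sample table above.
access_matrix = {
    "Rick Flair": ["Arena System", "LockerRoom System"],
    "Hulk Hogan": ["Arena System", "LockerRoom System"],
    "Harold Announcer": ["Announcer System"],
}

# Each (subject, predicate, object) triple becomes an English statement.
statements = [
    f"{person} has access to {system}"
    for person, systems in access_matrix.items()
    for system in systems
]

for s in statements:
    print(s)
```

The comprehension yields exactly the five sample statements, one per ‘x’ in the matrix.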

The sample matrix is a good way of representing information. Each row represents the capabilities of a user and each column represents the access to a system. But this sample does not tell us much about the users. We may want to store things like age, job type, and weight for the users. Do you think it is appropriate to attach that information as columns to the Sample Matrix? Why or why not? Since this is a common problem, let’s take a look at ways to represent the information.

Spreadsheet Model

Sample Matrix 2

|                  | Arena System | LockerRoom System | Announcer System | Age | Job Type  | Weight in Pounds |
|------------------|--------------|-------------------|------------------|-----|-----------|------------------|
| Rick Flair       | x            | x                 |                  | 71  | Wrestler  | 243              |
| Hulk Hogan       | x            | x                 |                  | 66  | Wrestler  | 302              |
| Harold Announcer |              |                   | x                | 58  | Announcer | 210              |

In this sample, we have added fields with various data types. We no longer have only true or false values in the matrix. The age column has an integer that represents the number of calendar years the person has been alive. The job type column has string values that describe what that person does. Finally, the weight in pounds column has an integer. If the column were named simply weight, we would not have all the information necessary to determine whether the number was in pounds, kilograms, or stones! It is also no longer appropriate to call this an access matrix, since it contains more than just access information. The table has become less clear in what it is describing. It is focused on the person identified in each row, but the context described by each column changes: the first three columns describe access to systems, then age is included, the next column contains job type, and finally weight in pounds. It is hard to determine why each column is included or why other values are not. We could normalize this data into three tables. One would describe the person and include the person's name, age, and weight in pounds. The second table would describe the access to systems. The third table would describe job types.

Formatting information in that way is called data normalization. Its goal is to minimize the number of locations that need to change when an update is necessary. If the job types were updated, and wrestlers were henceforth to be called "Fighters," we would need to make two changes in the sample matrix. In a normalized model, we would only need to make one. The normalized model is presented below.

Normalized Model


People

| ID | Name             | Age | Weight In Pounds | Job Type ID |
|----|------------------|-----|------------------|-------------|
| 1  | Rick Flair       | 71  | 243              | 1           |
| 2  | Hulk Hogan       | 66  | 302              | 1           |
| 3  | Harold Announcer | 58  | 210              | 2           |

Access to Systems

| Person ID | Arena System | LockerRoom System | Announcer System |
|-----------|--------------|-------------------|------------------|
| 1         | x            | x                 |                  |
| 2         | x            | x                 |                  |
| 3         |              |                   | x                |

Job Types

| ID | Job Type  |
|----|-----------|
| 1  | Wrestler  |
| 2  | Announcer |

Storing data in this way utilizes references instead of values. The Job Type ID column in the people table identifies which row in the Job Types table to reference. It would be trivial to add a description for each of the job types without disrupting the people table, and the reference would enable the new information to be easily looked up.
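
The single-change property is easy to demonstrate with Python's built-in sqlite3 module. This is a minimal sketch; the schema and table names below mirror the tables above but are my own illustration:

```python
import sqlite3

# Build the normalized model in an in-memory SQLite database.
con = sqlite3.connect(":memory:")
cur = con.cursor()

cur.executescript("""
CREATE TABLE job_types (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE people (
    id INTEGER PRIMARY KEY,
    name TEXT,
    age INTEGER,
    weight_in_pounds INTEGER,
    job_type_id INTEGER REFERENCES job_types(id)
);
""")

cur.executemany("INSERT INTO job_types VALUES (?, ?)",
                [(1, "Wrestler"), (2, "Announcer")])
cur.executemany("INSERT INTO people VALUES (?, ?, ?, ?, ?)",
                [(1, "Rick Flair", 71, 243, 1),
                 (2, "Hulk Hogan", 66, 302, 1),
                 (3, "Harold Announcer", 58, 210, 2)])

# Renaming "Wrestler" to "Fighter" is a single-row update...
cur.execute("UPDATE job_types SET name = 'Fighter' WHERE id = 1")

# ...and every person referencing job_type_id = 1 sees the new name.
rows = cur.execute("""
    SELECT people.name, job_types.name
    FROM people JOIN JOB_TYPES ON people.job_type_id = job_types.id
""").fetchall()
print(rows)
```

One UPDATE fixed both wrestlers; in the denormalized spreadsheet model the same rename would touch two rows.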


The normalized model isolates relationships between two entities. Representing information this way is more explicit and easier to maintain. It is not without its critics. The relational model was developed when computer memory was at a premium and needed to be carefully allocated. Storing information this way can have performance impacts and makes answering certain kinds of questions difficult. Normalization can also make it inefficient to capture information that is valuable yet does not apply to many records in the data. Other models are much more adept at those tasks. Still, it is important to understand how data normalization is used and the advantages it offers. Once you recognize the problems solved by normalization, it becomes a powerful analysis tool.

Deals, Deals, Deals

Wondering whether your favorite tools, services, or products are on sale this week? Below is a list of Cyber Week deals to help you get started with Data Engineering, refresh your toolbox, or launch your side project. Feel free to add to the list over on GitHub.

Deals on Products and Services to Help With Data Processing

Below are deals for products, services, books, videos, and other assets to help you work with data.

Have I missed a deal? Feel free to add it along with a link and brief description of why it is a deal. We are tracking deals related to data wrangling, visualizations, and building data pipelines.

Books, Videos, and Learning Resources


Pluralsight

  • Access tons of top-notch videos with 40% off an annual or premium Pluralsight membership.

NoStarch Press

  • 33.7% off + free shipping with code BLACKFRIDAY20 *US domestic orders only, $50 min

O’Reilly Books

  • 50% off the O’Reilly Online Learning Library with code: CM20CS


  • All eBooks $6.99 each with code: CYBER20AP


  • Buy 2, save 55% with code: BOOKSGIVING

Humble Bundle

  • 45% off Humble Bundle Premium which includes frequent book and resource bundles, not just games!

Manning Publications

  • 40% off books, courses, and videos



  • Print stickers for your project from $19, regularly starting at $65, free shipping


Das Keyboard

  • Deals to upgrade your keyboard. This is the keyboard I use, and I love the smooth action of the Cherry brown switches.

OAK-D by OpenCV

  • THE camera for Computer Vision hobby applications. Normally $299, down to $199, which makes building a computer vision workflow much cheaper. It combines two stereo cameras and color information from a single 4K camera in the center.

Visualization and Display


Jetpack by WordPress

  • Boost your WordPress site’s capabilities by adding Jetpack. 40% off with code JETBLACK.

WordPress deals by Automattic

  • Discounts up to 80% on hosting, domain names, CRM, and courses for WordPress related projects.

TorGuard VPN

  • Up to 50% off TorGuard VPN services

Links to other deal tracking lists

Information Security

  • Deals focused on Information Security Professionals

Let Pycharm Use WSL’s Git Executable

This post is mostly for me, but I ran into a ton of conflicting information while troubleshooting my Windows Subsystem for Linux (WSL) and PyCharm integration and figured it may help someone else. First things first: versions matter! Before wasting your time trying to get PyCharm and WSL to play nicely, make sure you are running PyCharm 2020.2 or greater and WSL 2. If you a) have no idea what those versions mean or b) are not sure what version you are using, allow me a chance to explain.

Why does WSL 2 matter? WSL 2 is a virtualization tool developed by Microsoft that uses a Microsoft fork of the Linux kernel to offer far more features than the previous version of WSL. Some highlights from the official documentation:

WSL 2 is a major overhaul of the underlying architecture and uses virtualization technology and a Linux kernel to enable its new features.

WSL 2 provides the benefits of WSL 1, including seamless integration between Windows and Linux, fast boot times, a small resource footprint, and requires no VM configuration or management. While WSL 2 does use a VM, it is managed and run behind the scenes, leaving you with the same user experience as WSL 1.

The Linux kernel in WSL 2 is built by Microsoft from the latest stable branch, based on the source available at kernel.org.


Why does this matter for the task of allowing PyCharm to use the Git executable that ships with an Ubuntu (or any other Linux image from the app store) WSL instance? PyCharm is a Windows application, and we would like to configure it to use a Git executable that makes system calls through WSL's Linux kernel. There is a lot going on to make it possible for PyCharm to call Git in a virtualized Linux environment. PyCharm has an issue on YouTrack that explains,

Unfortunately, Git from WSL1 does not return output reliably when called from Windows, and this could lead to totally incorrect results of git commands…


With the understanding that versions matter, make sure you are running the latest PyCharm and use the following commands in PowerShell (as an administrator) to see what is going on with WSL.

> wsl --list --verbose
  NAME            STATE           VERSION
  Ubuntu-18.04    Stopped         2
  Ubuntu-20.04    Stopped         2

That output shows that I have two Linux instances using WSL. It also tells me they are stopped and running WSL version 2. If you see that you are using version 1, or you have VirtualBox on your machine and it has not been updated in a while, you will need to get your system ready for WSL 2. Read the documentation on how to set up WSL 2. As a quick reference, here are the commands I found myself double checking. Please note, after running the commands in an elevated Powershell prompt, you will need to perform a system restart.

Enable WSL

dism.exe /online /enable-feature /featurename:Microsoft-Windows-Subsystem-Linux /all /norestart

Enable Virtual Machine Platform

dism.exe /online /enable-feature /featurename:VirtualMachinePlatform /all /norestart

Set HyperVisor Launch Option to Auto

bcdedit /set hypervisorlaunchtype auto

Now you can connect PyCharm to the WSL Git executable. Go to Settings > Version Control > Git and you should be able to point PyCharm 2020.2 to a WSL executable. For example, to use the Ubuntu-20.04 Git executable, the path would look something like:


This post details the "happy path" for getting everything set up. You may run into issues, especially if you have been using virtualization software on your system like VirtualBox or vSphere. This post is really a way to consolidate a number of good answers on GitHub, StackOverflow, and the WSL 2 FAQ into something that gives background as well as an answer to "How can I configure PyCharm to use the Git executable installed with WSL?"

Speed Up Your REST Workflows with asyncio

I have been waiting for a project that would allow me to dig into Python's asyncio library. Recently, such a project presented itself. I was tasked with hitting a rate-limited REST API with just under 4 million requests. My first attempt was simple. Gather and build a block of search queries, POST each one to the API, process the results, and finally insert them in a database. Here is what the code looked like:

import json

import requests
import pandas as pd

from external.module import Pipeline, Module

HEADERS = {"Accept": "application/json",
           "Authorization": "Bearer-token",
           "Content-Type": "application/json"}

# Subclass an external module
# This loads the database connection methods that are used
class ApiConnector(Module):
    def location_from_id(self, request_id):
        return requests.get(
            f"https://rate-limited-api.com/api/v2/entities/{request_id}?select=id,label,location",
            headers=HEADERS)

def main():

    start = 0
    batch_size = 10000

    try:
        pipe = Pipeline()
        rest_api = ApiConnector()
        pipe.paramstyle = 'qmark'
        cur = pipe.cursor
        cur.arraysize = batch_size
        for offset in range(start, 4000000, batch_size):
            sql = f"""select ID
                from MASTER_TABLE
                limit {batch_size} offset {offset}"""
            cur.execute(sql)

            # Allocate lists to store the results
            ids = []
            labels = []
            admin_labels = []
            addresses = []
            cities = []
            postalcodes = []
            countries = []
            lats = []
            longs = []

            for result in cur.fetchmany():

                for fid in result:
                    api_response = rest_api.location_from_id(fid)
                    response_json = json.loads(api_response.text)

                    # Check whether a location entry exists for processing
                    if response_json.get('location') is not None:
                        ids.append(fid)
                        labels.append(response_json.get('label', None))
                        admin_labels.append(response_json['location'].get('geoLabel', None))
                        addresses.append(response_json['location'].get('streetAddress', None))
                        cities.append(response_json['location'].get('cityLabel', None))
                        postalcodes.append(response_json['location'].get('postalCode', None))
                        countries.append(response_json['location'].get('countryLabel', None))
                        lats.append(response_json['location'].get('latitude', None))
                        longs.append(response_json['location'].get('longitude', None))

            results_df = pd.DataFrame.from_dict(
                {"ID": ids,
                 "LABEL": labels,
                 "GEO_LABEL": admin_labels,
                 "CITY_LABEL": cities,
                 "POSTAL_CODE": postalcodes,
                 "COUNTRY_LABEL": countries,
                 "LATITUDE": lats,
                 "LONGITUDE": longs,
                 "STREET_ADDRESS": addresses})

            pipe.put_copy_file(stage='~', file_format='SNOWFLAKE_SCHEMA.CSV_WITH_HEADER',
                               schema="MASTER_TABLES", table="STANDARDIZED_LOCATIONS",
                               file_path=None, data=results_df, timeout=100, verbose=True)

    except requests.exceptions.RequestException as e:
        print(e)

if __name__ == '__main__':
    main()

The code above is in need of a refactor. It is slow. Why? Every time we call the API with a query, the CPU waits for the API to respond, which could take a few hundredths of a second or longer. While that may not seem like a lot of time, it really adds up. Remember, we are going to make nearly 4 million queries. To explain the problem, let’s think about how a fast food restaurant like McDonald’s works.

McDonald's makes burgers concurrently. That means they work on more than one burger at a time. Imagine how long it would take if they waited to start an order until the previous order was complete. What happens when an ingredient goes missing? All of the upcoming orders in the queue are stuck waiting for the missing ingredient! That is not efficient. McDonald's breaks the burger-making process into small, repeatable processes: grill station, condiment station, wrapping station, etc. The code above suffers from the need to complete an HTTP request before moving on to the next. asyncio can break that process into concurrent tasks which will complete much faster.

I am not going to explain everything about asyncio. Event loops, coroutines, futures, and work queues are all terms that can scare off developers who have yet to encounter concurrent workflows. For more details on concurrency, threads, and asyncio check out the asyncio developer documentation, Real Python's post on AsyncIO, or the excellent chapter from Operating Systems: Three Easy Pieces, "Concurrency and Threads." Instead, I will attempt to explain why running these API requests concurrently is a good idea and something you too could implement.

Let's change the code to use asyncio for the HTTP requests. We will change the function responsible for calling the API to an asynchronous function and add the keyword pair async/await in a few locations. This is because we need to convert functions to coroutine functions and then await the execution of each coroutine. The relevant changes are included below.

# New imports
import asyncio
import aiohttp

async def call_api(session, fid):
    async with session.get(
            f"https://rate-limited-api.com/api/v2/entities/{fid}/?select=id,label,location") as response:
        return await response.text()


async def main():

    # ... database setup from the original version is unchanged and elided ...

    # Change the connection to use the aiohttp version which is non-blocking
    # Set a limit appropriate for the API
    conn = aiohttp.TCPConnector(limit=100)

    # Use the keyword async with a context manager for the aiohttp.ClientSession
    async with aiohttp.ClientSession(connector=conn, headers=HEADERS, trust_env=True) as session:
        tasks = []
        for fid in cur.fetchmany():
            tasks.append(call_api(session, fid[0]))
        # Use the partner keyword await to allow the program to
        # "gather" the tasks for further processing
        concurrent_responses = await asyncio.gather(*tasks)
        for cr in concurrent_responses:
            pass  # process responses

if __name__ == "__main__":

    # Create an event loop and add the function main() to it
    loop = asyncio.get_event_loop()
    loop.run_until_complete(main())

    # Or even better if you are using Python >= 3.7:
    # asyncio.run(main())
The updated code defines the functions that were previously waiting on network responses as awaitable coroutine functions. Notice we had to use a different library for the HTTP requests. The requests library is blocking and not appropriate for asyncio, so we use aiohttp. We also use asyncio.gather to collect the awaited coroutines. What does gather do?

Run awaitable objects in the aws sequence concurrently.

If any awaitable in aws is a coroutine, it is automatically scheduled as a Task.

If all awaitables are completed successfully, the result is an aggregate list of returned values. The order of result values corresponds to the order of awaitables in aws.

If return_exceptions is False (default), the first raised exception is immediately propagated to the task that awaits on gather(). Other awaitables in the aws sequence won’t be cancelled and will continue to run.


In short, it simplifies handling coroutines.
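
If the excerpt above feels abstract, here is a self-contained sketch of the same pattern, with asyncio.sleep standing in for network latency (no real API involved):

```python
import asyncio

async def fake_api_call(request_id):
    # Pretend to wait on the network; asyncio.sleep yields control
    # so other coroutines can run in the meantime.
    await asyncio.sleep(0.1)
    return f"response {request_id}"

async def run_batch():
    tasks = [fake_api_call(i) for i in range(5)]
    # gather schedules the coroutines as Tasks and waits for all of
    # them; total wall time is roughly 0.1s, not 5 x 0.1s.
    return await asyncio.gather(*tasks)

responses = asyncio.run(run_batch())
print(responses)
```

Note that gather preserves the order of its arguments, so the responses line up with the request IDs even though the calls ran concurrently.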

The new code runs much faster than the first attempt. It processes blocks of 10,000 requests in seconds instead of minutes.

With great power comes added responsibility. Debugging asyncio programs is less straightforward than adding inline print statements. Use of the logging library and asyncio's debug mode will really help when developing concurrent programs. Take care not to overwhelm web servers. Seriously. Understand the terms of use and follow the limits set by the provider.

Using asyncio makes writing concurrent programs much simpler than juggling system calls yourself. A few code changes will improve performance for almost any operation that is I/O bound. The library is a powerful one and I have only scratched its surface. I encourage you to check out the resources linked to in the article to learn more.

How to Get the First N Bytes of a File


There comes a time when you just need to take a little off the top of a file to see what you are working with. That is where knowing how to use a utility like head can help. Just running:

$ head filename.txt

will get you:

Print the first 10 lines of each FILE to standard output.


But what if that file does not have nice lines? Large SQL dump files come to mind. head has an answer. Use the -c flag to print the beginning bytes of a file instead of lines. Change the command above to:

$ head -c 512 filename.txt

and you will get the first 512 bytes of the file! This comes in handy when trying to see what the file looks like or figure out what kind of encoding the file is using. And don’t worry, you can do something similar on Windows using PowerShell.
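
The same "first N bytes" idea is handy inside scripts. A quick Python sketch (the sample file below is created just for the demonstration):

```python
from pathlib import Path

# Create a throwaway file so the example is self-contained.
Path("sample.txt").write_bytes(b"A" * 1024)

# Open in binary mode and read only the first 512 bytes,
# just like `head -c 512 sample.txt`.
with open("sample.txt", "rb") as f:
    first_bytes = f.read(512)

print(len(first_bytes))  # 512
```

Reading in binary mode matters here: it avoids any decoding errors when you do not yet know the file's encoding.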

PS$ Get-Content .\filename.txt -Encoding byte -TotalCount .5KB | Set-Content bytetest.txt -Encoding byte

That will place the output in a file called bytetest.txt. For more information on the Get-Content cmdlet, call:

PS$ Get-Help Get-Content

PS: In case you stumbled on this post and are looking for ways to figure out what encoding you are working with, might I recommend Joel Spolsky's "The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)" and Jeff Atwood's "There Ain't No Such Thing as Plain Text." They offer great insight into file encodings and why they matter.

Search for a String in a list of Encrypted Values

Imagine a scenario where one party wants to check whether a name they have exists in a list of names kept by another party, without revealing which name they are searching for. This problem may seem unrealistic, but imagine a data breach where tons of personal information is leaked. You want to check whether you were impacted by the breach but do not trust the party hosting the personal information to keep your query safe. This is possible with the help of homomorphic encryption, specifically Paillier encryption.

Python has a library that will allow you to use the Paillier system, an additive homomorphic cryptosystem, to perform arithmetic on encrypted integers. It is called phe. So if we want to search a list of names to see whether it includes "nathan," we need to convert our Python strings to integers (here we use a CRC32 checksum). Once those values are converted, we need to encrypt them using a public key. For the purposes of this example, let's assume I am reaching out to a service that has encrypted information and I want to check my value against their encrypted values. Once our value is encrypted with the service's public key, we can check whether the difference between our encrypted value and the values provided by the service is equal to zero. If it is, we got a hit!

Start a Python console which has access to the standard library and phe. pip install phe should work if you have not installed the library already. Follow along in your console session.

Generate the public/private keypair

>>> import binascii
>>> from phe import paillier
>>> public_key, private_key = paillier.generate_paillier_keypair()

Create integer checksums of our strings and put them in a list. For more information on working with ASCII and binary data see the documentation for binascii.

>>> n1 = binascii.crc32(b"nathan")
>>> n2 = binascii.crc32(b"nathan")
>>> n3 = binascii.crc32(b"Nathan")
>>> name_list = [n1, n2, n3]

Encrypt the values:

>>> encrypted_list = [public_key.encrypt(x) for x in name_list]
>>> encrypted_list[0]
<phe.paillier.EncryptedNumber object at 0x00000209D238AC18>
>>> d1 = private_key.decrypt(encrypted_list[0])
>>> d1
>>> d1 == n1
True
>>> d1 - n1
0

Now we can perform operations on encrypted values. Notice how, within our list, values 0 and 1 are equal but 0 and 2 are not.

>>> encrypted_list[0] - encrypted_list[1]
<phe.paillier.EncryptedNumber object at 0x00000209D2383128>
>>> n12_compare = encrypted_list[0] - encrypted_list[1]
>>> private_key.decrypt(n12_compare)
0
>>> n13_compare = encrypted_list[0] - encrypted_list[2]
>>> private_key.decrypt(n13_compare)
-27002122

The next thing to demonstrate is searching the encrypted values for a hidden value. In practice, this would not be decrypted client side; instead, the encrypted values would be sent back to the service and decrypted there. The service would receive a bunch of encrypted numbers that mean nothing to it, decrypt them, and return the results, which could then be decoded. In this example we subtract our search value from each encrypted one for simplicity's sake. We are looking for 0, which indicates a hit.

>>> search_value = public_key.encrypt(n1)
>>> results = [private_key.decrypt(i - search_value) for i in encrypted_list]
>>> results
[0, 0, 27002122]

This exercise is pretty straightforward, a toy example. Real-world uses for this are still developing, and research around homomorphic encryption is actively growing. Please let me know if you have any other examples of asymmetric cryptosystems in the real world; I would love to hear about them!

Your Simple Guide to Collecting Oral History

Collecting memories from people is an excellent way to celebrate the experience of others. I have found it helps me learn more about why people hold certain beliefs, how they overcame hardships, and the world we live in. Interviewing other people has helped me learn more about myself, which is why I wanted to write up a guide for collecting the stories of other people.

The most obvious aspect of collecting stories is interviewing. There are a ton of resources by people much more experienced than myself on how to conduct an oral history interview. It is important to come up with a sample outline and use that as a starting point. I continue to consult the following resources to help me prepare for interviews.

But the interview is only part of the process. I wanted to provide a quick write up on how anyone can get started collecting oral histories without worrying about whether they are doing it “right.” The following sections lay out what equipment you will need, what software to use, the digital formats you will deal with, a simple way to edit the audio, and finally how to share it. It is not the “right” way or the “best” way. Instead it is a process that anyone with motivation and access to a few resources should be able to recreate.


The equipment needed is pretty straightforward. You will need something to collect the interviews with, and something to process them. For processing, you will want a computer. While it is possible to capture and process audio files on a tablet, smartphone, or even tape, this post assumes you would like a little more freedom in working with the files. Let’s look at three options for capturing audio.

Good: Use a recorder application and microphone, like the one on iPhone headphones, to capture the person speaking. This option will allow you to capture a single interviewee’s responses at a fairly decent quality. It may not work well in an environment with lots of ambient noise or when interviewing multiple people at once, for example Grandma and Grandpa.

Better: Use a multichannel field recorder. The TASCAM DR-40X allows for capturing stereo samples from the device itself and can be mounted on a tripod near the interviewee for more consistent levels. It also takes up to two microphone inputs for easy portable recording. These devices cost between $100 and $400, which can be cost prohibitive for some projects. They also require configuration, which can seem daunting to people who consider themselves "nontechnical."

Best: Multi-channel mixer with multiple microphones. Something similar to the Zoom F6 Multitrack Field Recorder enables capturing multiple sources in a compact recording device. This option is more expensive and has more to set up. It is ideal when interviewing a small group, three or four people, since each speaker can have their own microphone, which isolates their responses into a single channel. Having the audio on a single channel helps in post production. It allows the various channels to be edited in greater isolation. Pops and lip smacks can be removed from the source channel instead of having to tease apart various input signals from a single audio file.

All of these options will require a computer to collect, edit, and post the audio files. Depending on your project, you may have to convert files that other people provide or even old tapes. File conversion can be performed by various audio editing tools; however, I will refrain from describing that process as it is outside the scope of this tutorial. Before stepping through how to edit or share audio files, an overview of how audio signals are stored digitally is worthwhile.

Digital Audio Formats in 1 minute

The process of capturing an audio signal in the real world, converting it to a digital file, and then saving that for use later is complex. There are many file formats designed to encode sound into data useful to various digital tools. Learning digital audio formats, even on a basic level, will help you make choices about how to capture, edit, store, and share the stories you record.

Lossy versus Lossless
A lossless file format stores all the data captured from the capture device. A lossy file format generally utilizes a compression algorithm to remove "unnecessary" data from the file in order to make it smaller. The difference is easier to see than to hear. The following photos demonstrate the impact lossless vs lossy formats can have on images.

The image on top is uncompressed or lossless. The two below use different compression algorithms to create lossy files. Note the difference in size and quality. Credit: Shalomi Tal, Dual-licensed: GNU FDL + CC by-sa 2.5, 2.0 and 1.0

A size versus quality compromise also exists with audio files. When capturing audio, storing it as lossless is ideal. For example, I use a TASCAM field recorder and save the audio as 24-bit, 96 kHz WAV files, which means the files are large. The stereo signal being captured writes 576 KB of data per second, which is 34.56 MB a minute and 2.0736 GB an hour. A multi-hour interview can quickly take up storage space. When transferring the interviews to my computer, I convert them to FLAC files, which reduces the size of each file through compression. FLAC is extra cool because its compression is lossless, meaning the original files can be reconstructed, yet the FLAC copies are 50-70% smaller.
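
The storage arithmetic above is easy to double-check in Python:

```python
# Storage math for 24-bit, 96 kHz stereo WAV audio.
bytes_per_sample = 24 // 8      # 24-bit samples -> 3 bytes each
sample_rate = 96_000            # samples per second, per channel
channels = 2                    # stereo

bytes_per_second = bytes_per_sample * sample_rate * channels
mb_per_minute = bytes_per_second * 60 / 1_000_000
gb_per_hour = bytes_per_second * 3600 / 1_000_000_000

print(bytes_per_second, mb_per_minute, gb_per_hour)
# 576000 34.56 2.0736
```

Swap in your own recorder's bit depth and sample rate to estimate how much disk space an interview session will need.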

What does this mean for your project? Record in a lossless format like WAV. When you finish editing the audio, convert it to FLAC, which will save you disk space and make uploading and/or sharing the files much easier. Luckily, free software exists to help with the editing and conversion process.


The free audio editing software of choice for years has been Audacity. If you have experience using it, go right ahead. If you do not, I find the program OcenAudio much more straightforward for editing interviews. The way OcenAudio tracks edits is non-destructive, which means it does not overwrite the original files with the edits you make. Instead, you create a new audio artifact by saving and exporting your changes. You can also use GarageBand if you are on a Mac. If you decide to use GarageBand or Audacity, you will need to do some homework to perform the actions I describe in the editing section.

Edit It

Editing is probably the most difficult step in this whole process. It requires you to listen to the audio in its entirety, multiple times. The goal with editing the interviews is not to editorialize them, but to make them simple to consume. I recommend creating roughly 10 minute chunks of audio that have pauses, ticks, and other distracting elements removed. This will make the final product easy to digest which means it is more likely to get listened to. The steps for editing a sound file in OcenAudio are pretty simple.

  • Load a clip
  • Normalize the Audio
  • Fade in the beginning
  • Fade out the ending
  • Clip out pauses
  • Export

This process takes time. OcenAudio has a pretty easy to use interface, but knowing what the buttons do and how to use them can be daunting. There are helpful YouTube videos on basic editing in OcenAudio, including this one that goes through a quick oral history workflow. I recommend watching one or two before getting started with your own project. Normalizing, fading, and clipping are the more repetitive tasks. To help give you confidence for tackling your project, I will describe how to fade audio in and out, normalize volume, and clip sections.

Fade Audio In/Out

Adding a fade to the beginning and end of the audio helps ease the transition between silence and the recording. It is very simple to do in OcenAudio: select the region you want to fade in, and the fade in quick icon becomes available. See the demonstration below.

Adding a fade in effect to the audio track


Normalize the Volume

This is one of the most important steps and can take a ton of time. The method I will describe is a quick fix. It will not make your interviews ready for NPR or the national archives. It will, however, make them much easier to listen to. OcenAudio provides a normalization shortcut that attempts to smooth out the volume of the track. You know how, when watching TV, some commercials are WAY too loud? It is annoying! In fact, it bothered enough people to get a law passed in the United States called the CALM Act. Normalization helps prevent violent jumps in audio volume. To apply the normalization filter, select the audio and apply the effect as shown below.

Normalize a section of audio


Clip Out Pauses

Clipping is one way to remove unwanted pops, curse words, embarrassing stories, mistakes, and whatever else may exist on the track that should not. It is a common task and simple to perform. The one gotcha is to make sure all channels are selected before performing the clip. Keep in mind you can always undo the last action.

Clip a section of audio

When the audio is at a state you feel comfortable sharing, it can be saved by selecting File > Save As… and saving the file to a location on your computer. I strongly recommend saving the file in the FLAC format but there are many other formats to choose from. Saving it as an mp3 would be appropriate if you planned on creating an mp3 CD and sharing that with family. In my experience it is easier to share via the web on a service like Soundcloud. The reasons for doing so are detailed in the next section.

One optional step is worth mentioning: naming the files with a leading numerical indicator will help when organizing them later. For example, if I have three tracks called “Early Life”, “Middle Life”, and “Later In Life”, I would change the file names to “00 – Early Life”, “01 – Middle Life”, and “02 – Later In Life.” A two-digit prefix keeps up to 100 tracks in order; if you have more than that, use 000 as your first track number. You can rearrange the track order on Soundcloud, but I find naming them in sequential order makes finding the right file easier.
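The zero-padded prefixes are easy to generate in a shell loop; printf handles the padding. The track names below are the hypothetical examples from above:

```shell
# Prefix each track title with a zero-padded index. To rename real files,
# swap the echo for something like: mv "$name.flac" "$new.flac"
i=0
for name in "Early Life" "Middle Life" "Later In Life"; do
  new=$(printf '%02d - %s' "$i" "$name")
  echo "$new"    # e.g. "00 - Early Life"
  i=$((i + 1))
done
```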

Share It

Figuring out how to share the file is largely up to you. For my purposes I wanted to place it in a location that was accessible to anyone with a phone and internet connection. I also wanted the files to be accessible to only those with whom I had shared a link. Soundcloud provides a very simple way to host audio files (up to 180 minutes for free) and share them without requiring everyone to create an account. You can listen to the files hosted there via a web browser or their mobile app. To make your interviews available to only those people you choose, follow these steps:

  • Create a Soundcloud account
  • Upload your audio files
  • Make sure the private option is selected
  • Once they are uploaded, create a playlist of the related tracks by visiting My Tracks and selecting all of the ones you would like to add to a playlist
  • Share the playlist link with your audience!


We just walked through a process for capturing, editing, and sharing oral histories. To provide a TL;DR (Too Long; Didn’t Read):

  • Come up with an outline of questions to ask
  • Schedule interview times. Try to keep each session limited to between one and two hours.
  • Procure the necessary recording equipment and practice recording yourself before the first interview.
  • Configure your capture devices to use lossless file formats.
  • Record the interviews. Try and place the microphone as close to the speaker as possible without getting in the way.
  • Edit the audio files and save them in a lossless format.
  • Share the interviews!

The process described above is not the only way. As you get more comfortable, tweaks will let you tailor the process to your situation and style. My goal in sharing these steps is that, hopefully, you will be inspired to collect the stories of people around you. And when you do, please let me know how it goes!

Troubleshooting Windows Subsystem for Linux and SSH

The Windows Subsystem for Linux (WSL) is one of the best features on Windows 10. It makes development so much easier than it used to be but still has a few hiccups. Kinda like Linux, some things don’t “just work.” One pesky thing that I recently dealt with was getting SSH to work with a keypair file from WSL. Here is how to get SSH working on WSL.


Given a keypair file, we want to invoke ssh from the command line and establish a tunnel to another server. This is a common task when connecting to remote servers. Think AWS, Azure, or Digital Ocean. It is a simple command:

$ ssh -i /path/to/keypair.pem user@host

But WSL may throw a permissions error similar to:

Permissions 0777 for 'keypair.pem' are too open.
It is required that your private key files are NOT accessible by others.
This private key will be ignored.
Load key "keypair.pem": bad permissions
Permission denied (publickey).

Windows and Linux manage file permissions differently. For a detailed look at interoperability between the Windows and Linux filesystems, see this blog post. Some excerpts are included below.

File systems on Windows

Windows generalizes all system resources into objects. These include not just files, but also things like threads, shared memory sections, and timers, just to name a few. All requests to open a file ultimately go through the Object Manager in the NT kernel, which routes the request through the I/O Manager to the correct file system driver. The interface that file system drivers implement in Windows is more generic and enforces fewer requirements. For example, there is no common inode structure or anything similar, nor is there a directory entry; instead, file system drivers such as ntfs.sys are responsible for resolving paths and opening file objects.


File systems on Linux

Linux abstracts file systems operations through the Virtual File System (VFS), which provides both an interface for user mode programs to interact with the file system (through system calls such as open, read, chmod, stat, etc.) and an interface that file systems have to implement. This allows multiple file systems to coexist, providing the same operations and semantics, with VFS giving a single namespace view of all these file systems to the user.


One of the many differences is how file permissions are handled. Linux flips bits to set the various file permissions. For a detailed explanation, read the chmod man page or the excellent blog post by Nitin V.
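As a quick illustration of those bits, each octal digit of a chmod mode packs the read/write/execute flags for user, group, and other. The file name here is just a scratch file:

```shell
# 6 = rw-, 4 = r--, 0 = ---  → user can read/write, group can read, others nothing
touch demo.txt
chmod 640 demo.txt
stat -c '%A %a' demo.txt   # prints: -rw-r----- 640
rm -f demo.txt
```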

Table of Linux File Permissions – https://nitstorm.github.io/blog/understanding-linux-file-permissions/

Windows stores files as objects. The Windows file system manager provides a general interface for dealing with file objects and leaves fine-grained operations to the file system drivers. One of the ways users bump into those differences is through file permissions. WSL, in trying to provide a level of interoperability, attempts to support working on files across both systems through a single interface.

WSL Virtual File System – https://docs.microsoft.com/en-us/archive/blogs/wsl/wsl-file-system-support#file-systems-in-wsl

This is cool, complex, and has unintended side effects. One being, SSH can run into file permission problems if the host drive is mounted without file permissions exposed in a Linux-friendly format. There are two ways to fix it. The simple way is to place the .pem file in the WSL home directory (cd ~). Once it is there, you should be able to run $ chmod 600 keypair.pem and create an SSH session. The other, more complicated way is to remount the drive with metadata enabled. You will need to unmount the drive, remount it with DrvFs, and verify that the additional metadata is enabled. To do that run:

$ sudo umount /mnt/c
$ sudo mount -t drvfs C: /mnt/c -o metadata
$ mount -l
rootfs on / type lxfs (rw,noatime)
root on /root type lxfs (rw,noatime)
home on /home type lxfs (rw,noatime)
data on /data type lxfs (rw,noatime)
cache on /cache type lxfs (rw,noatime)
mnt on /mnt type lxfs (rw,noatime)
none on /dev type tmpfs (rw,noatime,mode=755)
sysfs on /sys type sysfs (rw,nosuid,nodev,noexec,noatime)
proc on /proc type proc (rw,nosuid,nodev,noexec,noatime)
devpts on /dev/pts type devpts (rw,nosuid,noexec,noatime,gid=5,mode=620)
none on /run type tmpfs (rw,nosuid,noexec,noatime,mode=755)
none on /run/lock type tmpfs (rw,nosuid,nodev,noexec,noatime)
none on /run/shm type tmpfs (rw,nosuid,nodev,noatime)
none on /run/user type tmpfs (rw,nosuid,nodev,noexec,noatime,mode=755)
binfmt_misc on /proc/sys/fs/binfmt_misc type binfmt_misc (rw,noatime)
C: on /mnt/c type drvfs (rw,relatime,metadata,case=off)

Confirm that metadata is enabled and you should be able to chmod the keypair file and initiate an SSH session. WSL is being improved constantly and has become much better over the past couple of years. I use it every day at work and at home. Its functionality has me working more in the Windows environment despite my home machine’s dual boot capabilities.
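To recap the simple fix as a shell sketch: a temp file stands in for the keypair here, and the ssh destination is a hypothetical host:

```shell
# Quick fix: keep the key on the Linux side of WSL and tighten its mode.
key=$(mktemp)        # stand-in for ~/keypair.pem copied from /mnt/c/...
chmod 600 "$key"     # owner read/write only; satisfies ssh's permission check
stat -c '%a' "$key"  # prints: 600
# ssh -i "$key" user@remote-host   # hypothetical host; no longer rejected
rm -f "$key"
```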

Kafkacat Amazon Workspace

Below are some notes on getting kafkacat installed on an Amazon workspace with admin access.

The commands listed on the GitHub page will not work without a little preparation. A Linux Amazon Workspace image is based on Amazon Linux. Attempts to use a package manager like yum go through a plugin, amzn_workspaces_filter_updates. This filter only has a handful of packages (30 at the time of this writing) that can be pulled. The first thing to do is add Extra Packages for Enterprise Linux (EPEL) to the instance’s package repository. Following the instructions on the Fedora FAQ, run:

su -c 'rpm -Uvh https://download.fedoraproject.org/pub/epel/epel-release-latest-7.noarch.rpm'

This will add the EPEL repository to Amazon Workspace which allows you to download standard Linux packages that you may miss in a stock Workspace installation. For example, if you want to connect to another machine through VPN, you can install vpnc with yum:

sudo yum install vpnc

Next up is kafkacat, which enables the Amazon Workspace to connect to a Kafka queue. A few changes to the steps outlined in the Confluent documentation will allow a user with elevated privileges to install the package. The first step is to add the Confluent package repository to your list of yum repos. This can be done by creating a file named confluent.repo in the /etc/yum.repos.d/ directory.

$ sudo vim /etc/yum.repos.d/confluent.repo

Insert the following text into the file (repository definitions adapted from the Confluent documentation; adjust the version in the paths to the release you want):

[Confluent.dist]
name=Confluent repository (dist)
baseurl=https://packages.confluent.io/rpm/5.3/7
gpgcheck=1
gpgkey=https://packages.confluent.io/rpm/5.3/archive.key
enabled=1

[Confluent]
name=Confluent repository
baseurl=https://packages.confluent.io/rpm/5.3
gpgcheck=1
gpgkey=https://packages.confluent.io/rpm/5.3/archive.key
enabled=1

Then clear the yum caches:

$ sudo yum clean all

The kafkacat library has a dependency that can be installed from the Confluent repo called librdkafka-devel. Install that dependency with yum and then you can build the kafkacat library from source.

$ sudo yum install librdkafka-devel 

To build kafkacat, follow the instructions in the GitHub README section on building from source. Clone the repository to the desired location.

$ git clone https://github.com/edenhill/kafkacat.git
$ cd kafkacat/
$ ./configure
checking for OS or distribution… ok (Amazon)
checking for C compiler from CC env… failed
checking for gcc (by command)… ok
checking executable ld… ok
checking executable nm… ok
checking executable objdump… ok
checking executable strip… ok
checking for pkgconfig (by command)… ok
checking for install (by command)… ok
checking for rdkafka (by pkg-config)… ok
checking for rdkafka (by compile)… ok (cached)
checking for librdkafka metadata API… ok
checking for librdkafka KafkaConsumer support… ok
checking for yajl (by pkg-config)… failed
checking for yajl (by compile)… failed (disable)
checking for avroc (by pkg-config)… failed
checking for avroc (by compile)… failed (disable)
Generated Makefile.config
$ make
cc -MD -MP -g -O2 -Wall -Wsign-compare -Wfloat-equal -Wpointer-arith -Wcast-align -c kafkacat.c -o kafkacat.o
gcc -MD -MP -g -O2 -Wall -Wsign-compare -Wfloat-equal -Wpointer-arith -Wcast-align -c format.c -o format.o
gcc -MD -MP -g -O2 -Wall -Wsign-compare -Wfloat-equal -Wpointer-arith -Wcast-align -c tools.c -o tools.o
Creating program kafkacat
gcc -g -O2 -Wall -Wsign-compare -Wfloat-equal -Wpointer-arith -Wcast-align kafkacat.o format.o tools.o -o kafkacat -lrdkafka
$ make install
Install kafkacat to /usr/local
install -d $DESTDIR/usr/local/bin && \
install kafkacat $DESTDIR/usr/local/bin
echo install -d $DESTDIR/usr/local/man/man1 && \
echo install kafkacat.1 $DESTDIR/usr/local/man/man1
install -d /usr/local/man/man1
install kafkacat.1 /usr/local/man/man1

To verify the installation worked correctly, run:

$ kafkacat --help

If everything went smoothly, the program should be installed and available on your Linux based Amazon Workspace.

Processing Audio Files with Amazon Transcribe

I have been working on collecting a family’s oral history for the past few months. During the process I took notes with simple descriptions of what the speaker was describing or telling, along with a rough timestamp of where in the file the conversation took place. After collecting hours of stories, I realized that having a transcription would make things much easier to search and perhaps more useful to those interested in these particular histories. Why not get a transcription of the contents via one of the cloud offerings? Amazon offers a service called Transcribe, available via the AWS suite of services. Since I have a small account and some credits to burn, I figured I would kick the tires and see how Transcribe performs on meandering oral history interviews. But before I jump into the how, let me describe my particular use case.

Photo by Sam McGhee on Unsplash

Over the course of a few months, I have collected several half-hour to two-hour-long interviews via multiple recording setups. Some files were captured with Open Broadcaster Software (OBS) and include video. Others were captured using a TASCAM field recorder and are .wav files. In order to get the audio of each interview put together and normalized, I used a free application called ocenaudio. It allowed me to load .flac, .wav, and other audio formats in the same editing channels and add various effects to sections or the entire workspace. Ocenaudio’s interface allows for simple drag-and-drop editing of sound files. It is also worth noting that ocenaudio is non-destructive, meaning it leaves the original audio file alone. This can be a little confusing if you are not used to software that performs non-destructive editing. When a project is saved, it exports the results to a new file. Keep that in mind if you plan on adding additional files later.

After I collected and normalized all the sound files, I turned to AWS for a transcription. Transcribe limits files to 2 GB. The first collection of interviews was about 3 hours long and 2.5 GB in size, so I split it into two smaller files. AWS needs the file to be uploaded to S3 in order for Transcribe to access it. Here is how I set up the S3 bucket with the audio file.

  • Place it in a region near your house. Or don’t; it’s up to you.
  • Give the S3 bucket a unique name. You can use whatever you like as long as it conforms to the S3 naming standard.
  • Block all public access.
  • Encrypt the bucket. I used AES-256 encryption.
  • Give read and write access to your AWS account.
  • Set the S3 bucket to the Intelligent-Tiering storage class. It does not matter too much unless you forget to spin the bucket down, in which case Intelligent-Tiering will push the storage into lower-cost, slower storage.
  • Upload the file. As long as the files are less than 50 GB, you can use the browser upload tool. Otherwise, you will need to download, configure, and use the command line utility.

Once your file is uploaded to S3, select the checkbox next to it and copy the “Object URL.” Transcribe needs this in order to find and process the file. If these steps do not work, check the AWS tutorial on connecting an S3 bucket to a Transcribe instance.

Screenshot from the AWS tutorial

Once the files are available in S3, setting up a Transcribe job is straightforward. The following needs to be configured for Transcribe to create a job.

  • Name of the job
  • The language of the interviews. I processed English but would love to hear someone’s experience processing other languages.
  • The location of the input file. That object URL you copied earlier.
  • The location of where the output should go. I used Amazon’s default.

That is it! Jobs take a little while to run. Mine took about 20 minutes for audio files that are an hour long. Your mileage may vary. The results can be downloaded as a JSON file. A sample JSON object returned from Transcribe would be:

    "transcripts":[{"transcript":"A bunch of text returned by the transcribe service...The end of the text.",
                    "speaker_label":"spk_0", //If you have multiple speakers and asked to have Transcribe identify 
                                               them, this object with a speaker_label and start/end time exists.
                    "end_time":"3.45"}, ...

Parsing the JSON object is all that is left to do something useful with the transcript. My Transcribe results were not stellar. I believe that was because the people I interviewed have accents, and the vocabulary used would sometimes be in another language. The service did do two things well: it correctly identified the various speakers on the tape, and when I spoke, as the interviewer, it caught most of what I said. So if your audio files have native English speakers, this service does a fairly decent job. You will likely need to do some significant post-processing to transform the transcript into a useful document.
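A minimal parsing sketch, using a python3 one-liner from the shell. The file name is a stand-in, and the nesting of the transcript under a top-level results key follows Transcribe’s output format:

```shell
# Write a tiny stand-in result file, then pull out the transcript text.
cat > transcript.json <<'EOF'
{"results": {"transcripts": [{"transcript": "Sample transcript text."}]}}
EOF
python3 -c 'import json; print(json.load(open("transcript.json"))["results"]["transcripts"][0]["transcript"])'
# prints: Sample transcript text.
rm -f transcript.json
```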