===================
== Nathan's Blog ==
===================
infrequent posts about things I am working on

Drawing of a man looking through an optic in an early submarine

Welcome to my blog! I post infrequently about things I am working on, mostly related to computers and software.

The blog is hosted on GitHub Pages and built with Hugo using the smol theme.

Timing Execution to Help Optimize for Loops

python optimization
I was working on optimizing some code that contained a series of loops. I began my analysis by running a few different versions of the program and timing each execution. The results were enlightening! I decided to share the approach with my team along with an essay on the topic by Guido van Rossum. General Rules of Thumb by Which to Develop Loops Never optimize before you have proven a speed bottleneck exists. Read more...
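
As a rough illustration of timing different versions of a loop, here is a minimal sketch using the standard library's timeit module. The two functions and the time_it helper are hypothetical examples and are not taken from the post itself.

```python
import timeit


def squares_loop(n):
    """Build a list of squares with an explicit for loop."""
    result = []
    for i in range(n):
        result.append(i * i)
    return result


def squares_comprehension(n):
    """Build the same list with a list comprehension."""
    return [i * i for i in range(n)]


def time_it(func, n=10_000, repeat=5, number=20):
    """Return the best average seconds per call across several repeats."""
    runs = timeit.repeat(lambda: func(n), repeat=repeat, number=number)
    return min(runs) / number


if __name__ == "__main__":
    for func in (squares_loop, squares_comprehension):
        print(f"{func.__name__}: {time_it(func):.6f} s per call")
```

Taking the minimum of several repeats tends to reduce noise from other processes on the machine. The absolute numbers will vary from system to system, so only the comparison between versions is meaningful.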

Working With File Like Objects in Lambda

python zipfile aws lambda
I recently started working on a workflow for picking up files from S3, processing them, and writing the results to another S3 location. This is a common pattern in data processing pipelines and our team wanted to see whether we could do it using AWS serverless services. We were able to get it running via Lambda functions and event triggers published to an AWS EventHub. The entire workflow was fairly easy to stand up once we grasped how the various services worked together. Read more...
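
The excerpt does not show the code, so the following is only a sketch of the general idea behind file-like objects: bytes held in memory (for example, the body of an object downloaded from S3) can be wrapped in io.BytesIO and handed to zipfile without touching the local filesystem. The helper names are hypothetical, and plain bytes stand in for the S3 download so the example runs anywhere.

```python
import io
import zipfile


def make_zip_bytes(files):
    """Build a zip archive in memory from a {name: bytes} mapping."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, mode="w", compression=zipfile.ZIP_DEFLATED) as archive:
        for name, data in files.items():
            archive.writestr(name, data)
    return buffer.getvalue()


def read_zip_bytes(payload):
    """Read every member of an in-memory zip archive into a dict of bytes."""
    with zipfile.ZipFile(io.BytesIO(payload)) as archive:
        return {name: archive.read(name) for name in archive.namelist()}


if __name__ == "__main__":
    # Stand-in for bytes that would come from an S3 object in a real workflow.
    payload = make_zip_bytes({"report.csv": b"id,value\n1,42\n"})
    print(read_zip_bytes(payload))
```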

Upload a Pandas Dataframe to AWS S3 With Ease

Uploading a Pandas dataframe to S3 is different from writing the dataframe to a local filesystem. But have no fear! It is easy once you understand a couple of key concepts. Here is a working example using boto3.resource("s3") that has been tested against pandas 1.3.2. It is worth noting that the following will only work with pandas versions greater than 1.2.0. from io import BytesIO import boto3 import pandas from pandas import util df = util. Read more...

Moving My Personal Blog From Wordpress to Hugo

hugo website hosting
I started the process of moving my personal blog over a year ago, when the global pandemic brought on by the COVID-19 virus sent my local area into lockdown. The reason for doing so was simple enough: my web hosting bill had grown north of $200 a month for my personal blog. Don’t get me wrong, WordPress is great! I just wanted to get my blog onto something more appropriate for the audience (read, pretty small). Read more...

Dynamically Set ORM Schemas via Sqlalchemy

data databases sqlalchemy orm
Sometimes the solution to a problem is so obvious, it takes a while to figure it out. I recently stumbled on such a problem when trying to configure a set of Object Relational Mappings (ORM) to support an application with the same set of table objects across different schemas in Postgres. Developing an ORM to support this pattern, a multi-tenant database model, proved challenging because of where I started. Below, I will detail the correct way to support the multi-tenant pattern as well as various approaches I came across and why they should not be used. Read more...

A Primer on Data Normalization

data databases EF Codd normalization
Normalizing data is a common data engineering task. It prepares information to be stored in a way that minimizes duplication and is digestible by machines. It also aims to solve other problems that are out of scope for this particular article but worth reading about if you find yourself struggling to understand jokes about E. F. Codd. This raises the question: why does normalization matter when entering information in a table or organizing a spreadsheet? Read more...

Deals, Deals, Deals

Wondering whether your favorite tools, services, or products are on sale this week? Below is a list of Cyber Week deals to help you get started with Data Engineering, refresh your toolbox, or launch your side project. Feel free to add to the list over on GitHub.

Let PyCharm Use WSL’s Git Executable

This post is mostly for me, but I ran into a ton of conflicting information while troubleshooting my Windows Subsystem for Linux (WSL) and PyCharm integration and figured it may help someone else. First things first. Versions matter! Before wasting your time trying to get PyCharm and WSL to play nicely, make sure you are running PyCharm 2020.2 or greater and WSL 2. If you a) have no idea what those versions mean or b) are not sure what version you are using, allow me a chance to explain. Read more...

Speed Up Your REST Workflows with asyncio

API concurrent python REST
I have been waiting for a project that would allow me to dig into Python’s asyncio library. Recently, such a project presented itself. I was tasked with hitting a rate-limited REST API with just under 4 million requests. My first attempt was simple: gather and build a block of search queries, POST each one to the API, process the results, and finally insert them in a database. Here is what the code looked like: Read more...
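
The post's own code is not part of this excerpt. As a hedged sketch of the general approach, the example below uses asyncio with a semaphore to cap how many requests are in flight at once. The fake_post coroutine is a hypothetical stand-in for a real HTTP POST, and capping concurrency is an assumption here: it is not the same as enforcing a strict requests-per-second limit.

```python
import asyncio


async def fake_post(query):
    """Stand-in for a real HTTP POST; sleeps briefly and echoes the query."""
    await asyncio.sleep(0.01)
    return {"query": query, "status": "ok"}


async def bounded_post(semaphore, query):
    """Wait for a free slot before sending, so only a few requests run at once."""
    async with semaphore:
        return await fake_post(query)


async def run_queries(queries, max_concurrency=5):
    """Send all queries concurrently, capped at max_concurrency in flight."""
    semaphore = asyncio.Semaphore(max_concurrency)
    tasks = [bounded_post(semaphore, query) for query in queries]
    # gather returns results in the same order as the input queries.
    return await asyncio.gather(*tasks)


if __name__ == "__main__":
    results = asyncio.run(run_queries([f"q{i}" for i in range(20)]))
    print(len(results), "responses")
```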

How to Get the First N Bytes of a File

big files bytes linux powershell tutorial wc
There comes a time when you just need to take a little off the top of a file to see what you are working with. That is where knowing how to use a utility like head (http://man7.org/linux/man-pages/man1/head.1.html) can help. Just running head on a file will get you its first few lines. But what if that file does not have nice lines? Large SQL dump files come to mind. head has an answer. Use the -c flag to print the beginning bytes of a file instead of lines. Read more...
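
For comparison, the same idea can be sketched in Python. This is only a rough equivalent of what head does with the -c flag, and the first_bytes helper is a hypothetical name used for illustration.

```python
def first_bytes(path, count):
    """Return up to `count` bytes from the start of the file at `path`."""
    # Binary mode avoids decoding errors and newline translation on big dumps.
    with open(path, "rb") as handle:
        return handle.read(count)


if __name__ == "__main__":
    import sys

    # Usage: python first_bytes.py <path> <count>
    sys.stdout.buffer.write(first_bytes(sys.argv[1], int(sys.argv[2])))
```

Because read(count) stops after the requested number of bytes, the rest of the file is never loaded, which is what makes this practical for very large files.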