Introduction to Python for Data Engineering

Greetings, dear readers. Today we will cover Python for data engineering. In my article Data Engineering 101, we saw that one of the key skills a data engineer needs is a strong understanding of the Python language. Read that article to gain a basic understanding of data engineering.

Can one use other languages for data engineering? Yes, languages such as Scala and Java are also used. Let's understand why we are using Python for data engineering:

  • A data engineer works with many different data formats, and Python is well suited to this. For example, its standard library makes it easy to handle .csv files, one of the most common data file formats.
  • Data engineering tools such as Apache Airflow and Apache NiFi model pipelines as graphs of tasks, and Airflow in particular defines its Directed Acyclic Graphs (DAGs) and their tasks in Python code. Thus, learning Python will help data engineers use these tools efficiently.
  • A data engineer not only obtains data from different sources but also processes it. One of the most popular data processing engines is Apache Spark, which offers a Python API, PySpark, with a DataFrame interface for building scalable big data projects.
  • A data engineer is often required to use APIs to retrieve data. The data in such cases is usually in JSON (JavaScript Object Notation) format, and Python has a built-in module named json to handle this type of data.
  • Luigi is a Python package that helps us build complex data pipelines.

Python is relatively easy to learn and is open-source. An active community of developers strongly supports it.

Now that we understand some of the reasons for choosing Python, how do we use Python in data engineering?

Data Acquisition and Ingestion: This involves obtaining data from databases, APIs and other sources. A data engineer uses Python to retrieve the data and ingest it.

Data Manipulation: This refers to how a data engineer turns structured, unstructured and semi-structured data into meaningful information.

Parallel Computing: This is necessary when a job needs more memory and processing power than a single process provides. A data engineer uses Python to split a task into sub-tasks and distribute them.

Data Pipelines: An ETL pipeline involves extracting, transforming and loading data. Tools such as Snowflake and Apache Airflow are easily used with Python.

Great, now we know how Python is used in data engineering. First, we need to be familiar with basic Python and understand it well in order to write code. I will use JupyterLab, an interactive development environment that ships with Anaconda. I will explain basic Python with examples to ensure we understand the concepts well.

For basic Python we will cover the following topics:

  1. Variables
  2. Strings
  3. Math Expressions
  4. Loops
  5. Functions
  6. Lists, Tuples, Dictionaries and Sets

Variables

A variable is a container that stores a value. A variable name is the label used to refer to that value.

variable_name = value

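This is how a variable is defined in Python.

Variable names must follow a few rules: they can contain only letters, digits and underscores, they cannot start with a digit, they cannot be a reserved keyword such as if or for, and they are case-sensitive. Ensure you use concise and descriptive variable names such as: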

officer_duty = False

By convention, a variable that should be treated as a constant is named in capital letters. Python does not enforce this, so the value can still be reassigned:

MAXIMUM_FILE_LIMIT = 1500

Strings

A string is a series of characters enclosed in single or double quotation marks.

Python strings

Since Python 3.6 we also have f-strings (formatted string literals), which let us use the values of variables inside a string.

Python f-strings
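As a minimal sketch of strings and f-strings (the variable names and values here are made up for illustration), the following defines strings with both quote styles and then embeds variable values in an f-string:

```python
# Strings can use single or double quotation marks.
first_name = 'Fred'
city = "Nairobi"

# An f-string (Python 3.6+) evaluates the expressions inside {}.
message = f"{first_name} lives in {city}"
print(message)

# Any expression works inside the braces, not just variable names.
name_length = f"{first_name} has {len(first_name)} letters"
print(name_length)
```

You can change the values of first_name and city and rerun the code to see the strings update.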

Mathematical Expressions

Operators are used to perform various operations on values and variables. Python operators are classified into the following groups:

  • Arithmetic operators
  • Comparison operators
  • Logical operators
  • Bitwise operators

Arithmetic operators

These operators perform mathematical operations on numeric values. Python also has a math module for more advanced numerical computations.

Arithmetic operators in Python

These operations give the following results:

Arithmetic results

When an expression combines multiple arithmetic operations, the operations inside parentheses are evaluated first.
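Here is a short sketch of the arithmetic operators, the effect of parentheses and one function from the math module. The numbers 17 and 5 are arbitrary values chosen for illustration:

```python
import math

a = 17
b = 5

print(a + b)    # addition
print(a - b)    # subtraction
print(a * b)    # multiplication
print(a / b)    # true division, always returns a float
print(a // b)   # floor division, drops the fractional part
print(a % b)    # modulo, the remainder of the division
print(a ** 2)   # exponentiation

# Multiplication normally runs before addition,
# but anything inside parentheses is evaluated first.
without_parentheses = 2 + 3 * 4
with_parentheses = (2 + 3) * 4
print(without_parentheses, with_parentheses)

# The math module covers more advanced computations.
print(math.sqrt(16))
```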

Comparison Operators

These operators compare two values.

  • Less than (<)
  • Less than or equal to (<=)
  • Greater than (>)
  • Greater than or equal to (>=)
  • Equal to (==)
  • Not equal to (!=)

They can compare numbers or strings, and they return a boolean value (either True or False).
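A small sketch of the comparison operators, using made-up values for price and limit:

```python
price = 250
limit = 300

is_below_limit = price < limit
is_at_or_above_limit = price >= limit
print(is_below_limit)
print(is_at_or_above_limit)
print(price == 250)
print(price != limit)

# Strings are compared character by character (lexicographically).
comes_first = 'apple' < 'banana'
print(comes_first)
```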

Logical Operators

Logical operators let us check multiple conditions at the same time. We have the and, or and not operators:

  • and - returns True when both conditions are True, otherwise it returns False.
  • or - returns True when at least one of the conditions is True. It returns False only when both conditions are False.
  • not - reverses the result of a condition.

Logical Operators in Python
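As a rough illustration of and, or and not, here is a sketch that combines two conditions. The variables age and active are hypothetical:

```python
age = 39
active = True

# and: True only when both sides are True
is_adult_and_active = age >= 18 and active

# or: True when at least one side is True
is_minor_or_inactive = age < 18 or not active

print(is_adult_and_active)
print(is_minor_or_inactive)

# not: reverses a boolean value
print(not active)
```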

Bitwise Operators

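Bitwise operators work on the individual bits of integers. They are & (AND), | (OR), ^ (XOR), ~ (NOT), << (left shift) and >> (right shift).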
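A minimal sketch of the bitwise operators on two small integers, written in binary so the bit patterns are visible. The values are arbitrary:

```python
x = 0b1100  # 12 in decimal
y = 0b1010  # 10 in decimal

print(bin(x & y))   # AND: bits set in both numbers
print(bin(x | y))   # OR: bits set in either number
print(bin(x ^ y))   # XOR: bits set in exactly one number
print(x << 1)       # shift the bits one place to the left
print(x >> 2)       # shift the bits two places to the right
print(~x)           # NOT: inverts every bit
```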

Loops

We have two kinds of loops in Python: the while loop and the for loop.

while loop

A while loop runs a code block as long as the specified condition is True.

while condition:
   body

The condition is an expression that evaluates to a boolean value (True or False). while checks the condition at the beginning of each iteration and executes the body as long as the condition is True. Inside the body, you need to make the condition False eventually, or break out of the loop, to avoid an infinite loop.

day_of_week = 0
while True:
   print(day_of_week)
   day_of_week += 1

   if day_of_week == 5:
      break

In the above block of code, the loop prints day_of_week and then increments it by one on each iteration. Because the condition is always True, the if statement is what ends the loop: once day_of_week == 5, the break statement exits the loop. The code therefore prints the numbers 0 to 4.

Python while loop

for loop

We mainly use a for loop to execute a code block a set number of times.

for index in range(n):
   statement

This is the syntax of a for loop. The index is called the loop counter, and n is the number of times the loop will execute the statement. range() is a built-in function; range(n) generates a sequence of numbers from 0 up to, but not including, n.

# named total rather than sum to avoid hiding the built-in sum()
total = 0
for number in range(101):
   total += number

print(total)

For loops in Python

range() also accepts the form range(start, stop, step). For example, range(0, 10, 2) generates 0, 2, 4, 6, 8. You can change the values and see how the code works.
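The different forms of range() can be sketched as follows. Wrapping range() in list() is only done here so that the generated numbers are easy to see:

```python
# range(stop): starts at 0 and stops before stop
first_five = list(range(5))
print(first_five)

# range(start, stop, step): every second number from 0, stopping before 10
even_numbers = list(range(0, 10, 2))
print(even_numbers)

# A negative step counts down.
countdown = list(range(5, 0, -1))
print(countdown)
```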

Functions

A function is a block of code that performs a certain task or returns a value. Functions help to divide a program into manageable parts to make it easier to read, test and maintain the program. This is how we write a function:

def greet(name):
   return f'{name} how are you doing?'
greetings = greet('Richard')
print(greetings)

A parameter is the information that a function needs, and it is specified in the function definition. In our example, name is a parameter. An argument is the piece of data you pass to the function when you call it. In our example, 'Richard' is an argument.

Functions in Python

We also have recursive functions. A recursive function is a function that calls itself.

Recursive functions
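A classic way to illustrate recursion is the factorial. The function below is a sketch chosen for this purpose: it calls itself with a smaller number until it reaches the base case.

```python
def factorial(n):
    """Return n! for a non-negative integer n."""
    if n <= 1:
        # Base case: stops the recursion.
        return 1
    # Recursive case: the function calls itself on a smaller input.
    return n * factorial(n - 1)

print(factorial(5))
```

Every recursive function needs a base case like the one above, otherwise it keeps calling itself until Python raises a RecursionError.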

Lambda Function

When a simple function consists of a single expression, it is unnecessary to define it with the def keyword. Lambda expressions allow one to define anonymous functions, which are often used only once.

map() function

In its simplest form, this function takes two arguments: the function to apply and the iterable to apply it to. It provides a quick and clean way to apply a function to every item without writing a for loop.

To implement lambda and map functions in Python
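The sketch below shows a regular function, the same logic as a lambda expression, and both used with map(). The function name square and the list numbers are made up for this example:

```python
# A regular function defined with def
def square(number):
    return number ** 2

numbers = [1, 2, 3, 4, 5]

# map() returns an iterator, so we wrap it in list() to see the values.
squared_with_def = list(map(square, numbers))

# The same result with an anonymous (lambda) function
squared_with_lambda = list(map(lambda number: number ** 2, numbers))

print(squared_with_def)
print(squared_with_lambda)
```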

List

A list is an ordered collection of items enclosed in square brackets []. Since a list is mutable, you can add, remove, modify and sort its elements.

empty_list = [ ]
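As a small sketch of what mutable means in practice, the following adds, removes, modifies and sorts the elements of a list. The list file_formats is a made-up example:

```python
file_formats = ['csv', 'json']

file_formats.append('parquet')   # add an element to the end
file_formats.remove('json')      # remove an element by value
file_formats[0] = 'tsv'          # modify an element by index
file_formats.sort()              # sort the list in place

print(file_formats)
```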

Tuples

A tuple is an ordered collection of items enclosed in parentheses (). It is immutable: once a tuple is created, you cannot change its elements.

selected_colors = ('cyan', 'gray', 'white')
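To see immutability in action, this sketch reads an element from the tuple and then tries to change one. Python raises a TypeError, which we catch so that the script keeps running:

```python
selected_colors = ('cyan', 'gray', 'white')

# Reading by index works just like a list.
print(selected_colors[0])

try:
    # Assigning to an element of a tuple is not allowed.
    selected_colors[0] = 'black'
except TypeError as error:
    print(f"Cannot modify a tuple: {error}")
```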

List comprehension

A list comprehension transforms the elements of an iterable and returns a new list. The syntax for a list comprehension is as follows:

list_comprehension = [expression for item in iterable if condition]

Let us implement a list comprehension and understand how it works:

List comprehension in python
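Here is a minimal sketch of a list comprehension next to the equivalent for loop, so you can compare the two. The list numbers is an arbitrary example:

```python
numbers = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]

# expression: number ** 2, condition: number is even
even_squares = [number ** 2 for number in numbers if number % 2 == 0]
print(even_squares)

# The same logic written as a regular for loop
even_squares_loop = []
for number in numbers:
    if number % 2 == 0:
        even_squares_loop.append(number ** 2)
print(even_squares_loop)
```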

Unpacking and packing

This can be done for both tuples and lists. When you create a tuple and assign values to it, that is referred to as packing a tuple.

rainbow_colors = ('Red', 'Orange', 'Yellow', 'Green', 'Blue', 
   'Indigo', 'Violet')

Extracting the values from a tuple back into variables is known as unpacking. The number of variables must match the number of values inside the tuple. For example, our tuple has seven values, so it can be unpacked into seven variables.

(first, second, third, fourth, fifth, sixth, seventh) = rainbow_colors

However, this can be simplified by using an asterisk *. When it is added to a variable name, that variable takes all the remaining elements as a list.

(first, second, *other_colors) = rainbow_colors

The variable other_colors will contain all the remaining colors from rainbow_colors.

Unpacking lists

The unpacking that was done on tuples can also be done on lists.

rainbow_colors = ['Red', 'Orange', 'Yellow', 'Green', 'Blue',  'Indigo', 'Violet']

first, second, *other_colors = rainbow_colors

As we have seen, using * on a variable name collects the remaining elements of the original list into a new list.

Unpacking of Tuples

Unpacking lists

In both cases, the variable marked with * receives a list, even when the values come from a tuple. That's cool, you now understand unpacking in tuples and lists.
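The following sketch unpacks the rainbow_colors tuple with * and prints the type of the starred variable. The second unpacking, with the star in the middle, is an extra illustration that is not covered in the text above:

```python
rainbow_colors = ('Red', 'Orange', 'Yellow', 'Green', 'Blue',
                  'Indigo', 'Violet')

# The starred variable collects everything that is left over.
first, second, *other_colors = rainbow_colors
print(first, second)
print(other_colors)
print(type(other_colors))   # a list, even though the source is a tuple

# The starred variable does not have to be the last one.
first_color, *middle_colors, last_color = rainbow_colors
print(middle_colors)
```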

Dictionary

A dictionary is a collection of key-value pairs. Python uses curly braces {} to define a dictionary.

empty_dictionary = {}
customer = {
   'first_name' : 'Fred',
   'last_name' : 'Kagia',
   'age' : 39,
   'location' : 'Nairobi',
   'active' : True
}

To iterate over all key-value pairs in a dictionary, use a for loop with two variables, here key and value. These names are only a convention, so you can use any other variable names in the for loop.

for key, value in customer.items():
   print(f"{key} : {value}")

Python Dictionaries

Sets

A set is an unordered collection of unique elements. We use curly braces {} to enclose a set. Because {} on its own creates an empty dictionary, an empty set is defined with set():

empty_set = set()
capital_cities = {'Nairobi', 'Lusaka', 'Cairo', 'Lagos'}

Frozen sets

To make a set immutable, use frozenset(). This ensures that the elements in the set cannot be modified.

capital_cities = {'Nairobi', 'Lusaka', 'Cairo', 'Lagos'}
capital_cities_frozen = frozenset(capital_cities)

Frozen sets cannot be modified

Frozen set in Python
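As a sketch of the difference, the code below adds an element to a regular set and then tries the same on a frozenset. The city 'Accra' is just an example value. A frozenset has no add() method, so Python raises an AttributeError:

```python
capital_cities = {'Nairobi', 'Lusaka', 'Cairo', 'Lagos'}
capital_cities_frozen = frozenset(capital_cities)

# A regular set can be changed.
capital_cities.add('Accra')
print(len(capital_cities))

try:
    # A frozenset has no methods that modify it.
    capital_cities_frozen.add('Accra')
except AttributeError as error:
    print(f"Cannot modify a frozenset: {error}")

print(len(capital_cities_frozen))
```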

To number the elements of a set as you iterate over them, you can use the built-in function enumerate(). Because a set is unordered, the order of the elements may differ between runs:

capital_cities = {'Nairobi', 'Lusaka', 'Cairo', 'Lagos'}

for index, city in enumerate(capital_cities, 1):
   print(f"{index}. Capital city is {city}")

Using enumeration function in sets

Set Theory

This refers to the methods of the set datatype that implement set operations on collections of objects.

  • set.intersection() - returns the elements that are in both sets.
  • set.difference() - returns the elements that are in one set but not in the other.
  • set.symmetric_difference() - returns the elements that are in exactly one of the sets.
  • set.union() - returns all the elements that are in either set.

Set theory in Python sets
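The four methods can be sketched with two small sets. The set names and their contents are made up for illustration:

```python
east_africa = {'Nairobi', 'Kampala', 'Dodoma'}
visited = {'Nairobi', 'Cairo', 'Dodoma'}

in_both = east_africa.intersection(visited)
only_east_africa = east_africa.difference(visited)
in_exactly_one = east_africa.symmetric_difference(visited)
in_either = east_africa.union(visited)

print(in_both)
print(only_east_africa)
print(in_exactly_one)
print(in_either)
```

Since sets are unordered, the elements may print in a different order on your machine.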

Working with Data

  1. JSON
  2. datetime
  3. NumPy
  4. Pandas

JSON

JSON is a text format for storing and exchanging data. Python has a built-in module, json, that is used to work with JSON data.

To convert JSON to Python, pass the JSON string to json.loads().

To convert Python to JSON, pass the Python object to json.dumps(), which returns a JSON string.

JSON basic usage
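A minimal sketch of the round trip between JSON and Python. The JSON string below is a made-up customer record:

```python
import json

customer_json = '{"first_name": "Fred", "age": 39, "active": true}'

# JSON string -> Python dictionary
customer = json.loads(customer_json)
print(customer["first_name"])
print(type(customer))

# Python dictionary -> JSON string
customer["location"] = "Nairobi"
customer_as_json = json.dumps(customer)
print(customer_as_json)
print(type(customer_as_json))
```

Notice that the JSON value true becomes the Python value True after loading.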

To analyze and debug JSON data, we may need to print it in a more readable format. This can be done by passing the additional parameters indent and sort_keys to json.dumps() or json.dump().

JSON in a readable format
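Here is a short sketch of the indent and sort_keys parameters, using a small hypothetical dictionary:

```python
import json

customer = {"last_name": "Kagia", "first_name": "Fred", "age": 39}

# indent controls the number of spaces per level,
# sort_keys orders the keys alphabetically.
pretty = json.dumps(customer, indent=4, sort_keys=True)
print(pretty)
```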

datetime

We use a module called datetime to work with dates and times as objects.

import datetime

current_datetime = datetime.datetime.now()
print(current_datetime)

A datetime object contains the year, month, day, hour, minute, second and microsecond. You can read each of these as an attribute of the object.

Create a date object

You may use the datetime() class of the datetime module. This class requires three arguments to create a date: year, month and day.

import datetime

planned_date = datetime.datetime(2022, 9, 3)
print(planned_date)
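As a sketch of how the parts of a date can be read back, the code below creates a datetime with an hour and a minute as well (both optional, and chosen arbitrarily here) and prints its attributes. The strftime() call at the end is an extra that formats the date as text:

```python
import datetime

planned_date = datetime.datetime(2022, 9, 3, 14, 30)

# Each part of the date is available as an attribute.
print(planned_date.year)
print(planned_date.month)
print(planned_date.day)
print(planned_date.hour, planned_date.minute)

# strftime() turns the date into a string using format codes.
print(planned_date.strftime("%Y-%m-%d %H:%M"))
```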

NumPy

NumPy (Numerical Python) is a Python library for working with arrays. A NumPy array contains elements of the same type. This homogeneity allows NumPy arrays to be faster and more efficient than Python lists.

Create a NumPy array

import numpy as np

natural_numbers = np.array([1, 2, 3, 4, 5])
print(natural_numbers)
print(type(natural_numbers))

NumPy can vectorize operations so that they are performed on all the elements of an array at once. A related technique, broadcasting, lets an operation combine an array with a scalar or with an array of a different shape:

natural_numbers = np.array([1, 2, 3, 4, 5])
natural_numbers_squared = natural_numbers ** 2
print(natural_numbers_squared)

NumPy basics

We can also compare performing calculations with NumPy against using a Python list. For numerical work on large arrays, NumPy is usually faster than Python lists.

NumPy compared to Python Lists
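One rough way to make this comparison yourself is with the timeit module from the standard library. The sketch below squares 100,000 numbers with a list comprehension and with a NumPy array. The size and the number of repetitions are arbitrary choices, and the timings will vary from machine to machine:

```python
import timeit

import numpy as np

size = 100_000
numbers_list = list(range(size))
numbers_array = np.arange(size, dtype=np.int64)

def square_with_list():
    # Squares each element one at a time in Python.
    return [number ** 2 for number in numbers_list]

def square_with_numpy():
    # Squares all elements at once with a vectorized operation.
    return numbers_array ** 2

list_seconds = timeit.timeit(square_with_list, number=20)
numpy_seconds = timeit.timeit(square_with_numpy, number=20)

print(f"Python list: {list_seconds:.4f} seconds")
print(f"NumPy array: {numpy_seconds:.4f} seconds")
```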

Pandas

Pandas is a library used for working with datasets. It has functions for analyzing, exploring, cleaning and manipulating data. Its main data structure is the DataFrame, which holds tabular data with labelled rows and columns.

Create a Pandas DataFrame

Reading data from csv using Pandas

A pandas DataFrame has a method, .apply(), that takes a function and applies it along an axis of the DataFrame. With axis=0 (the default) the function is applied to each column, and with axis=1 it is applied to each row. This method can be used with anonymous functions (remember lambda functions).
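The sketch below puts the three pandas ideas together: creating a DataFrame, reading CSV data and using .apply() with a lambda. The film data is made up, and the CSV is read from an in-memory string through io.StringIO so the example runs without a file on disk. With a real file you would pass its path to pd.read_csv() instead.

```python
import io

import pandas as pd

# Create a DataFrame from a dictionary of columns.
films = pd.DataFrame({
    "title": ["Film A", "Film B", "Film C"],
    "rental_rate": [0.99, 2.99, 4.99],
    "rental_days": [3, 5, 7],
})
print(films)

# Read CSV data (here from a string that stands in for a file).
csv_text = "title,rental_rate\nFilm A,0.99\nFilm B,2.99\n"
films_from_csv = pd.read_csv(io.StringIO(csv_text))
print(films_from_csv)

# axis=1: the lambda receives one row at a time.
films["total_cost"] = films.apply(
    lambda row: row["rental_rate"] * row["rental_days"], axis=1
)

# axis=0: the lambda receives one column at a time.
column_maximums = films[["rental_rate", "rental_days"]].apply(
    lambda column: column.max(), axis=0
)
print(films)
print(column_maximums)
```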

We have covered the basics of Python that will help us understand and implement data engineering. We will be able to work with tools such as PySpark and Airflow.

For example, let's look at sample code for a Directed Acyclic Graph (DAG).

# This is the DAG definition file.
# Note: this import path is the Airflow 1.x one; Airflow 2 moved
# PythonOperator to airflow.operators.python.

from airflow.models import DAG
from airflow.operators.python_operator import PythonOperator

dag = DAG(dag_id="etl_pipeline",
   schedule_interval="0 0 * * *")

# etl and wait_for_this_task are assumed to be defined elsewhere
etl_task = PythonOperator(task_id="etl_task",
   python_callable=etl, dag=dag)

etl_task.set_upstream(wait_for_this_task)

# Define an ETL function (the three helper functions are assumed to exist)

def etl():
   film_dataframe = extract_film_to_pandas()
   film_dataframe = transform_rental_rate(film_dataframe)
   load_dataframe_to_film(film_dataframe)

# Define the ETL task using PythonOperator

etl_task = PythonOperator(task_id="etl_film",
   python_callable=etl, dag=dag)

# Set the upstream to wait_for_table and do a sample run of etl()

etl_task.set_upstream(wait_for_table)
etl()

The code above shows a DAG (Directed Acyclic Graph) definition file and an ETL task that is added to the DAG. The DAG to extend is defined in dag, and the task to wait for is defined in wait_for_table. It is just sample code; soon we will write our own DAGs and ETLs and implement them.

Learning Python is critical for our data engineering career, so ensure that you understand it well. We will continue together on this path of data engineering. Feel free to give your feedback about this article.