Greetings, dear readers. Today we will cover Python for data engineering. In my earlier article, Data Engineering 101, we saw that one of the key skills a data engineer needs is a strong understanding of the Python language. Read that article to gain a basic understanding of data engineering.
Can one use other languages for data engineering? Yes, for example Scala or Java. Let's understand why we are using Python for data engineering:
- A data engineer works with many different data formats, and Python is well suited to this. Its standard library makes it easy to handle .csv files, one of the most common data file formats.
- Data engineering tools such as Apache Airflow and Apache NiFi organize workflows as Directed Acyclic Graphs (DAGs). In Airflow, DAGs and their tasks are specified in Python code, so learning Python helps data engineers use these tools efficiently.
- A data engineer must not only obtain data from different sources but also process it. One of the most popular data processing engines is Apache Spark, which offers a Python API, PySpark, with a DataFrame interface for building scalable big data projects.
- A data engineer is often required to use APIs to retrieve data from databases. The data in such cases is usually in JSON (JavaScript Object Notation) format, and Python's standard library includes a module named json to handle this type of data.
- Luigi is a Python package that helps us build complex data pipelines.
Python is relatively easy to learn and is open-source. An active community of developers strongly supports it.
Now that we have seen some of the reasons for choosing Python, how do we use it in data engineering?
Data Acquisition and Ingestion: This involves obtaining data from databases, APIs and other sources. A data engineer uses Python to retrieve the data and ingest it.
Data Manipulation: This refers to how a data engineer turns structured, unstructured and semi-structured data into meaningful information.
Parallel Computing: This is necessary when a job needs more memory and processing power than a single process provides. A data engineer uses Python to split a task into sub-tasks and distribute them.
Data Pipelines: An ETL pipeline involves extracting, transforming and loading data. Tools such as Snowflake and Apache Airflow are easily used with Python.
Great, now we know how Python is used in data engineering. First, we need to be familiar with basic Python and understand it well in order to write code. I will use JupyterLab, a code editor that ships with Anaconda. I will explain basic Python with examples to ensure we understand the concepts well.
For basic Python we will cover the following topics:
- Variables
- Strings
- Mathematical Expressions
- Loops
- Lists, Tuples, Dictionaries and Sets
- Functions
Variables
A variable is a container that stores a value. The variable name is the label assigned to that value.
variable_name = value
Variable names can contain letters, digits and underscores, cannot start with a digit, cannot be a Python keyword, and are case-sensitive. Use concise and descriptive variable names such as:
officer_duty = False
By convention, a variable that should be treated as a constant is named in capital letters:
MAXIMUM_FILE_LIMIT = 1500
Strings
A string is a series of characters enclosed in single or double quotation marks.
Since Python 3.6 we have f-strings (formatted string literals), which let us use the values of variables inside a string.
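Here is a small sketch of an f-string in action. The variable names and values are made up for illustration.

```python
# Hypothetical values chosen for illustration
pipeline_name = "daily_sales"
rows_loaded = 1500

# Anything inside {} is evaluated and inserted into the string
message = f"Pipeline {pipeline_name} loaded {rows_loaded} rows"
print(message)  # Pipeline daily_sales loaded 1500 rows
```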
Mathematical Expressions
Operators are used to perform various operations on values and variables. Python operators are classified into the following groups:
- Arithmetic operators
- Comparison operators
- Logical operators
- Bitwise operators
Arithmetic operators
These operators perform mathematical operations on numeric values. Python also has a math module for more advanced numerical computations.
The arithmetic operators are addition (+), subtraction (-), multiplication (*), division (/), floor division (//), modulus (%) and exponentiation (**).
When we combine multiple arithmetic operations, the operations inside parentheses are evaluated first.
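To make this concrete, here is a minimal sketch of the arithmetic operators, the effect of parentheses, and one function from the math module. The numbers are arbitrary.

```python
import math

a, b = 17, 5  # illustrative values

addition = a + b          # 22
subtraction = a - b       # 12
product = a * b           # 85
quotient = a / b          # 3.4 (true division always returns a float)
floor_quotient = a // b   # 3
remainder = a % b         # 2
power = a ** 2            # 289

# Parentheses are evaluated first
without_parentheses = 2 + 3 * 4   # 14
with_parentheses = (2 + 3) * 4    # 20

# The math module covers more advanced computations
root = math.sqrt(144)             # 12.0

print(addition, subtraction, product, quotient)
print(floor_quotient, remainder, power)
print(without_parentheses, with_parentheses, root)
```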
Comparison Operators
These operators compare two values.
- Less than ( < )
- Less than or equal to (<=)
- Greater than (>)
- Greater than or equal to (>=)
- Equal to ( == )
- Not equal to ( != )
They can compare numbers and strings, and they return a boolean value (either True or False).
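A short sketch of the comparison operators on numbers and strings follows. The variable names and values are assumptions for the example.

```python
threshold = 100
row_count = 250

is_larger = row_count > threshold      # True
is_equal = row_count == threshold      # False
is_different = row_count != threshold  # True

# Strings are compared character by character, using Unicode code points
comes_first = "apple" < "banana"       # True
# Uppercase letters sort before lowercase letters
upper_first = "Zebra" < "apple"        # True

print(is_larger, is_equal, is_different, comes_first, upper_first)
```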
Logical Operators
Logical operators let us check multiple conditions at the same time. We have the and, or, and not operators.
- and: returns True when both conditions are True; otherwise it returns False.
- or: returns True when at least one of the conditions is True. It returns False only when both conditions are False.
- not: reverses the result of a condition.
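The three operators can be sketched as follows. The two flags are hypothetical and stand for checks you might run before loading a file.

```python
# Hypothetical checks on an input file
file_exists = True
file_is_empty = False

# and: both sides must be True
ready_to_load = file_exists and not file_is_empty       # True

# or: at least one side must be True
needs_attention = (not file_exists) or file_is_empty    # False

# not: flips a boolean
print(not file_is_empty)                                # True
print(ready_to_load, needs_attention)
```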
Bitwise Operators
They operate on integers at the level of their individual bits (binary digits).
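As a quick sketch, here are the bitwise operators applied to two small integers written in binary. The values are arbitrary.

```python
a = 0b1100  # 12
b = 0b1010  # 10

and_result = a & b   # 0b1000 = 8, bits set in both
or_result = a | b    # 0b1110 = 14, bits set in either
xor_result = a ^ b   # 0b0110 = 6, bits set in exactly one
left_shift = a << 1  # 24, shifts bits left by one place
right_shift = a >> 2 # 3, shifts bits right by two places
inverted = ~a        # -13, flips every bit

print(bin(and_result), bin(or_result), bin(xor_result))
print(left_shift, right_shift, inverted)
```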
Loops
We have two loops in Python: the while loop and the for loop.
while loop
A while loop runs a code block as long as the specified condition is True.
while condition:
    body
The condition is an expression that evaluates to a boolean value (True or False). while checks the condition at the beginning of each iteration and executes the body as long as the condition is True. The body must eventually make the condition False, or exit with break, to avoid an infinite loop.
day_of_week = 0
while True:
    print(day_of_week)
    day_of_week += 1
    # Exit the loop once the counter reaches 5
    if day_of_week == 5:
        break
In the code block above, day_of_week is printed and then incremented by one on every iteration, so the loop prints 0 through 4. The if statement checks whether day_of_week == 5. Once the value five is reached, the if condition is True and the break statement exits the loop.
for loop
We mainly use a for loop to execute a code block a given number of times.
for index in range(n):
    statement
This is the syntax of a for loop. The index is called the loop counter, and n is the number of times the loop will execute the statement. range() is a built-in function; range(n) generates a sequence of numbers from 0 up to, but not including, n.
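# Add up the numbers 0 through 100
total = 0  # named total to avoid shadowing the built-in sum()
for number in range(101):
    total += number
print(total)
range() can also take a start and a step, in the form range(start, stop, step). For example, range(0, 10, 2) starts at 0, stops before 10 and moves in steps of 2. You can change the values and see how the code works.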
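Here is a minimal sketch of range(start, stop, step), including a negative step. Wrapping range() in list() is only there so we can see the generated numbers.

```python
# Start at 0, stop before 10, move in steps of 2
even_numbers = list(range(0, 10, 2))
print(even_numbers)  # [0, 2, 4, 6, 8]

# A negative step counts down; the stop value is still excluded
countdown = list(range(5, 0, -1))
print(countdown)     # [5, 4, 3, 2, 1]
```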
Functions
A function is a block of code that performs a certain task or returns a value. Functions help to divide a program into manageable parts to make it easier to read, test and maintain the program. This is how we write a function:
def greet(name):
    return f'{name} how are you doing?'

greetings = greet('Richard')
print(greetings)
A parameter is a piece of information that a function needs, and it is specified in the function definition. In our example, name is a parameter.
An argument is the piece of data you pass to a function when you call it. In our example, 'Richard' is an argument.
We also have recursive functions. A recursive function is a function that calls itself.
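A classic way to illustrate recursion is the factorial. The sketch below is one possible version; the function name is my own choice.

```python
def factorial(n):
    """Return n! for a non-negative integer n."""
    # Base case: stops the chain of recursive calls
    if n <= 1:
        return 1
    # Recursive case: the function calls itself with a smaller value
    return n * factorial(n - 1)

result = factorial(5)
print(result)  # 120
```

Every recursive function needs a base case like the one above, otherwise it keeps calling itself until Python raises a RecursionError.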
Lambda Function
When a function is simple and consists of a single expression, defining it with the def keyword can be unnecessary. Lambda expressions allow one to define anonymous functions, which are often used only once.
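As a sketch, here is a lambda used once as a sort key, next to the equivalent def version. The table names and row counts are made up for illustration.

```python
# Hypothetical (table name, row count) pairs
table_sizes = [("orders", 300), ("users", 120), ("events", 900)]

# The lambda is used once, as the sort key, so it never needs a name
sorted_by_rows = sorted(table_sizes, key=lambda table: table[1])
print(sorted_by_rows)  # [('users', 120), ('orders', 300), ('events', 900)]

# The same function written with def, for comparison
def row_count(table):
    return table[1]

same_result = sorted(table_sizes, key=row_count) == sorted_by_rows
print(same_result)  # True
```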
map() function
This function takes two arguments: the function to apply and the iterable to apply it to. It provides a quick and clean way to apply a function to every element without writing a for loop.
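A minimal sketch of map() with a named function and with a lambda follows. Note that map() returns a lazy iterator, so the example wraps it in list() to see the values.

```python
def to_upper(name):
    return name.upper()

cities = ['nairobi', 'lusaka', 'cairo']

# Apply a named function to every element
upper_cities = list(map(to_upper, cities))
print(upper_cities)  # ['NAIROBI', 'LUSAKA', 'CAIRO']

# Apply an anonymous function to every element
name_lengths = list(map(lambda city: len(city), cities))
print(name_lengths)  # [7, 6, 5]
```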
List
A list is an ordered collection of items enclosed in square brackets [].
Since a list is mutable, you can add, remove, modify and sort its elements.
empty_list = [ ]
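The operations mentioned above can be sketched like this. The list contents are hypothetical table names.

```python
tables = ['orders', 'users']

tables.append('events')   # add:    ['orders', 'users', 'events']
tables.remove('users')    # remove: ['orders', 'events']
tables[0] = 'sales'       # modify: ['sales', 'events']
tables.sort()             # sort in place: ['events', 'sales']

print(tables)  # ['events', 'sales']
```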
Tuples
A tuple is an ordered collection of items enclosed in parentheses (). It is immutable: once a tuple is created, you cannot change its elements.
selected_colors = ('cyan', 'gray', 'white')
List comprehension
A list comprehension transforms the elements of an iterable and returns a new list. The syntax for a list comprehension is as follows:
list_comprehension = [expression for item in iterable if condition]
Each item in the iterable is tested against the condition, and the expression is evaluated for every item that passes. The if part is optional.
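Here is a small sketch of a list comprehension next to the equivalent for loop. The numbers are arbitrary.

```python
numbers = [1, 2, 3, 4, 5, 6]

# expression: number ** 2, condition: number is even
even_squares = [number ** 2 for number in numbers if number % 2 == 0]
print(even_squares)  # [4, 16, 36]

# The same result with an ordinary for loop
even_squares_loop = []
for number in numbers:
    if number % 2 == 0:
        even_squares_loop.append(number ** 2)
print(even_squares_loop)  # [4, 16, 36]
```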
Packing and unpacking
This can be done for both tuples and lists. When you create a tuple and assign values to it, that is referred to as packing a tuple.
rainbow_colors = ('Red', 'Orange', 'Yellow', 'Green', 'Blue',
'Indigo', 'Violet')
Extracting the values from a tuple back into variables is known as unpacking. The number of variables must match the number of values inside the tuple. For example, our tuple has seven values, so it can be unpacked into seven variables.
(first, second, third, fourth, fifth, sixth, seventh) = rainbow_colors
However, this can be simplified by using an asterisk *. When it is added to a variable name, that variable takes all the remaining elements as a list.
(first, second, *other_colors) = rainbow_colors
The variable other_colors will contain all the remaining colors from the original tuple rainbow_colors.
Unpacking lists
The unpacking that was done on tuples can also be done on lists.
rainbow_colors = ['Red', 'Orange', 'Yellow', 'Green', 'Blue', 'Indigo', 'Violet']
first, second, *other_colors = rainbow_colors
Whether the source is a tuple or a list, using * on a variable name collects the remaining elements into a new list.
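The sketch below checks this on our rainbow tuple, and also shows that the starred name does not have to come last.

```python
rainbow_colors = ('Red', 'Orange', 'Yellow', 'Green', 'Blue',
                  'Indigo', 'Violet')

# The starred name collects everything that is left over
first, second, *other_colors = rainbow_colors
print(other_colors)  # ['Yellow', 'Green', 'Blue', 'Indigo', 'Violet']
print(isinstance(other_colors, list))  # True, even though the source is a tuple

# The starred name can also sit in the middle
first, *middle_colors, last = rainbow_colors
print(first, last)    # Red Violet
print(middle_colors)  # ['Orange', 'Yellow', 'Green', 'Blue', 'Indigo']
```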
That's cool, you now understand unpacking in tuples and lists.
Dictionary
A dictionary is a collection of key-value pairs. Python uses curly braces {} to define a dictionary.
empty_dictionary = {}
customer = {
    'first_name': 'Fred',
    'last_name': 'Kagia',
    'age': 39,
    'location': 'Nairobi',
    'active': True
}
To iterate over all key-value pairs in a dictionary, use a for loop with two variables, here named key and value. These names are our choice; any other variable names would work in the for loop.
for key, value in customer.items():
    print(f"{key} : {value}")
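Besides looping, we often read, add, change and remove individual keys. Here is a minimal sketch using a shortened version of the customer dictionary; the 'email' key is hypothetical and deliberately missing.

```python
customer = {'first_name': 'Fred', 'last_name': 'Kagia', 'age': 39}

first_name = customer['first_name']        # read a value by its key
email = customer.get('email', 'unknown')   # get() returns a default if the key is missing
customer['age'] = 40                       # change an existing value
customer['location'] = 'Nairobi'           # add a new key-value pair
removed = customer.pop('last_name')        # remove a key and return its value

print(first_name, email, removed)  # Fred unknown Kagia
print(customer)  # {'first_name': 'Fred', 'age': 40, 'location': 'Nairobi'}
```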
Sets
A set is an unordered collection of unique elements. We use curly braces {} to enclose a set.
Empty braces {} create a dictionary, so to define an empty set we use this syntax:
empty_set = set()
capital_cities = {'Nairobi', 'Lusaka', 'Cairo', 'Lagos'}
Frozen sets
To make a set immutable, use frozenset(). This ensures that the elements of the set cannot be modified.
capital_cities = {'Nairobi', 'Lusaka', 'Cairo', 'Lagos'}
capital_cities_frozen = frozenset(capital_cities)
To number the elements of a set as you iterate over them, you can use the built-in function enumerate(). Because a set is unordered, the number is only a running count, not a fixed position:
capital_cities = {'Nairobi', 'Lusaka', 'Cairo', 'Lagos'}
for index, city in enumerate(capital_cities, 1):
    print(f"{index}. Capital city is {city}")
Set Theory
Python sets provide methods for the standard operations of set theory:
- set.intersection() - returns the elements found in both sets.
- set.difference() - returns the elements found in one set but not in the other.
- set.symmetric_difference() - returns the elements found in exactly one of the two sets.
- set.union() - returns all elements found in either set.
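The four methods can be sketched on two small sets. The set names and their contents are made up for the example.

```python
# Hypothetical sets of cities
visited = {'Nairobi', 'Lusaka', 'Cairo'}
planned = {'Cairo', 'Lagos', 'Nairobi'}

in_both = visited.intersection(planned)                 # {'Nairobi', 'Cairo'}
only_visited = visited.difference(planned)              # {'Lusaka'}
in_exactly_one = visited.symmetric_difference(planned)  # {'Lusaka', 'Lagos'}
in_either = visited.union(planned)                      # all four cities

# Sets are unordered, so sort before printing for a stable display
print(sorted(in_both))
print(sorted(only_visited))
print(sorted(in_exactly_one))
print(sorted(in_either))
```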
Working with Data
- JSON
- datetime
- NumPy
- Pandas
JSON
JSON is a syntax for storing and exchanging data. Python has a module, json, that is used to work with JSON data.
To convert JSON to Python, pass the JSON string to json.loads().
To convert Python to JSON, use the json.dumps() method, which returns a JSON string.
To analyze and debug JSON data, we may need to print it in a more readable format. This can be done by passing the additional parameters indent and sort_keys to json.dumps() or json.dump().
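Here is a minimal sketch of the round trip between a JSON string and a Python dictionary. The JSON content is invented for the example.

```python
import json

# A JSON string such as an API might return (made-up content)
json_string = '{"first_name": "Fred", "age": 39, "active": true}'

# JSON -> Python: the result is a dictionary
customer = json.loads(json_string)
print(customer["first_name"])  # Fred
print(customer["active"])      # True (JSON true becomes Python True)

# Python -> JSON: the result is a string
compact = json.dumps(customer)
print(compact)

# indent and sort_keys make the output easier to read
pretty = json.dumps(customer, indent=4, sort_keys=True)
print(pretty)
```

json.dump() takes the same indent and sort_keys parameters, but writes to a file object instead of returning a string.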
datetime
We use a module called datetime
to work with dates as dates object.
import datetime
current_time = datetime.datetime.now()
print(current_datetime)
The date contains year, month, day, hour, minute, second, microsecond. you can use these as methods to return date object.
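As a sketch, the parts of a datetime object can be read like this. I use a fixed, made-up moment instead of now() so that the output is the same on every run.

```python
import datetime

# A fixed moment so the output is reproducible (illustrative values)
moment = datetime.datetime(2022, 9, 3, 14, 30, 15)

print(moment.year, moment.month, moment.day)      # 2022 9 3
print(moment.hour, moment.minute, moment.second)  # 14 30 15
print(moment.microsecond)                         # 0

# date() drops the time part and returns a date object
date_only = moment.date()
print(date_only)                                  # 2022-09-03
```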
Create a date object
You can use the datetime() class of the datetime module. This class requires three parameters: year, month and day.
import datetime

# Use the full four-digit year; 22 would mean the year 22 AD
planned_date = datetime.datetime(2022, 9, 3)
print(planned_date)
NumPy
NumPy (Numerical Python) is a Python library for working with arrays. A NumPy array contains elements of the same type. This homogeneity allows NumPy arrays to be faster and more efficient than Python lists.
Create a NumPy array
import numpy as np
natural_numbers = np.array([1, 2, 3, 4, 5])
print(natural_numbers)
print(type(natural_numbers))
NumPy has a powerful technique called broadcasting, together with the ability to vectorize operations so that they are performed on all elements of an array at once.
natural_numbers = np.array([1, 2, 3, 4, 5])
natural_numbers_squared = natural_numbers ** 2
print(natural_numbers_squared)
We can also compare the same calculation done with NumPy and with a Python list. For large inputs, NumPy is typically faster than the Python list.
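One way to run such a comparison is with the timeit module from the standard library. This is only a sketch: the array size and the number of repetitions are arbitrary choices, and the timings will differ from machine to machine.

```python
import timeit

import numpy as np

size = 100_000  # arbitrary size for the experiment
python_list = list(range(size))
numpy_array = np.arange(size, dtype=np.int64)

def square_with_list():
    # Squares one element at a time in a Python-level loop
    return [value ** 2 for value in python_list]

def square_with_numpy():
    # Squares all elements in one vectorized operation
    return numpy_array ** 2

list_seconds = timeit.timeit(square_with_list, number=20)
numpy_seconds = timeit.timeit(square_with_numpy, number=20)

print(f"Python list: {list_seconds:.4f} seconds")
print(f"NumPy array: {numpy_seconds:.4f} seconds")
```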
Pandas
Pandas is a library used for working with datasets. It has functions for analyzing, exploring, cleaning and manipulating data. Its main data structure is the DataFrame: tabular data with labelled rows and columns.
pandas has a method, .apply(), that takes a function and applies it along an axis of a DataFrame. With axis=0 (the default) the function is applied to each column, and with axis=1 it is applied to each row. This method can be used with anonymous functions (remember lambda functions).
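Here is a minimal sketch of .apply() on both axes. The DataFrame, its column names and its values are invented for the example.

```python
import pandas as pd

# A small made-up DataFrame
films = pd.DataFrame({
    'rental_rate': [0.99, 2.99, 4.99],
    'rental_duration': [3, 5, 7],
})

# axis=0: the lambda receives one column at a time
column_max = films.apply(lambda column: column.max(), axis=0)
print(column_max)

# axis=1: the lambda receives one row at a time
films['rate_per_day'] = films.apply(
    lambda row: row['rental_rate'] / row['rental_duration'], axis=1
)
print(films)
```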
We have covered the basics of Python that will help us to understand and implement data engineering. We will be able to work with tools such as PySpark and Airflow.
For example, let's look at sample code from a Directed Acyclic Graph (DAG) definition.
# This is a DAG definition file
from airflow.models import DAG
from airflow.operators.python_operator import PythonOperator

# Run the pipeline every day at midnight
dag = DAG(dag_id="etl_pipeline",
          schedule_interval="0 0 * * *")

# etl and wait_for_this_task are assumed to be defined elsewhere
etl_task = PythonOperator(task_id="etl_task",
                          python_callable=etl, dag=dag)
etl_task.set_upstream(wait_for_this_task)

# Define an ETL function
# (the extract, transform and load helpers are assumed to be defined elsewhere)
def etl():
    film_dataframe = extract_film_to_pandas()
    film_dataframe = transform_rental_rate(film_dataframe)
    load_dataframe_to_film(film_dataframe)

# Define the ETL task using PythonOperator
etl_task = PythonOperator(task_id='etl_film',
                          python_callable=etl, dag=dag)

# Set the upstream to wait_for_table and do a sample run of etl()
etl_task.set_upstream(wait_for_table)
etl()
The code above shows a DAG (Directed Acyclic Graph) definition file with an ETL task that is added to the DAG. The DAG to extend, dag, and the task to wait for, wait_for_table, are assumed to be defined already. It is just sample code; soon we will write our own DAGs and ETLs and implement them.
Learning Python is critical for our data engineering career, so ensure that you understand it well. We will continue together on this path of data engineering. Feel free to give your feedback about this article.