Python is one of the most widely used tools for Data Science. Since its inception, Python has received a lot of attention from both Data Scientists and Data Engineers. In a modern data scientist’s toolbox, it is hard to get around Python. Some months ago, I created a tutorial on the Python language itself. Now, it is about time to have a tutorial dedicated to Python data science!
Python Data Science Tutorial: the tools you need
First, let’s get started with a list of the tools that we need for our Python data science tutorial. The most important one is Python itself, so you should already be familiar with it. If not, consider learning Python with this tutorial first. Once done, we need some additional libraries. Don’t worry, they all come with a pre-defined environment. But let’s first have a look at the libraries that we will use throughout this tutorial series.
We will focus on three libraries in this series: NumPy, Pandas and Matplotlib. Each of them is briefly described in this post and will get more comprehensive coverage over the next weeks.
NumPy is an easy-to-use open source library for scientific computing. The main element of NumPy is the n-dimensional array, which provides powerful means to do vector-based calculations. The core of the library is implemented at a low level (in C) and thus comes with high performance. NumPy is very useful for elementary mathematical functions, and you will definitely use its random number generator from time to time!
It is essential to learn about NumPy, since it is used in quite a few Data Science projects. The next tutorial in this series will be an intro to NumPy.
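To give a first impression of the n-dimensional array and the random number generator mentioned above, here is a minimal sketch. The values and variable names are made up for illustration, and the seed of 42 is an arbitrary choice to make the random numbers reproducible.

```python
import numpy as np

# A 2-dimensional array with 2 rows and 3 columns (illustrative values)
a = np.array([[1, 2, 3], [4, 5, 6]])

# Vector-based calculations: applied element by element, without an explicit loop
doubled = a * 2
col_sums = a.sum(axis=0)  # sum down each column

# Random numbers from a seeded generator, so repeated runs give the same values
rng = np.random.default_rng(seed=42)
samples = rng.normal(loc=0.0, scale=1.0, size=5)

print(a.shape)
print(doubled)
print(col_sums)
print(samples)
```

The point to notice is that `a * 2` works on the whole array at once, which is what makes NumPy code both short and fast.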
Pandas is a great open source library for data manipulation. The key element of Pandas is the DataFrame, a concept you may already know from Spark. Pandas offers a lot of opportunities to both Data Scientists and Data Engineers when it comes to handling data.
Pandas can deal with different kinds of data: it is well suited for time-series data and also for tabular data. In the next posts, we will explore Pandas in more detail.
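As a small preview of both use cases, the following sketch builds a tiny table and a short daily time series. All column names and numbers are invented for illustration only, and the 3-day rolling mean is just one example of what you can do with time-series data.

```python
import pandas as pd

# Tabular data: a small DataFrame with made-up values
df = pd.DataFrame({"product": ["a", "b", "c"], "units": [10, 3, 7]})

# Keep only the rows where more than 5 units were sold
large = df[df["units"] > 5]

# Time-series data: a Series indexed by six consecutive days
dates = pd.date_range("2024-01-01", periods=6, freq="D")
ts = pd.Series([10, 12, 9, 14, 11, 13], index=dates)

# 3-day rolling mean; the first two entries are NaN because the window is not full yet
rolling_mean = ts.rolling(window=3).mean()

print(large)
print(rolling_mean)
```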
Last but not least, it is also useful to present data in a visual format. This is the strength of Matplotlib, a Python library that provides great tools for data visualization. Our tutorial series will end with a description of Matplotlib.
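To show what a first plot might look like, here is a minimal sketch. The data is made up, the file name `squares.png` is an arbitrary choice, and the `Agg` backend is only selected so that the script also runs without a display; inside Jupyter you can usually omit that line and see the figure inline.

```python
import matplotlib

matplotlib.use("Agg")  # render to files without a display; usually not needed in Jupyter
import matplotlib.pyplot as plt

# Made-up data: the squares of the numbers 0 to 4
x = [0, 1, 2, 3, 4]
y = [v ** 2 for v in x]

# Create a figure with one set of axes and draw a line with point markers
fig, ax = plt.subplots()
ax.plot(x, y, marker="o")
ax.set_xlabel("x")
ax.set_ylabel("x squared")
ax.set_title("A first Matplotlib plot")

# Write the figure to an image file in the current directory
fig.savefig("squares.png")
```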
Setting up the environment for our Tutorial
With the next tutorial in this series, we will start to code (yay!). However, this also means that you need to have an environment up and running. There are several options available for this. My preference is to use Jupyter (a notebook app) in a Docker environment. To set this up, you need to have Docker running on your device. If you are not familiar with Docker, you can learn about Docker here. If you don’t have Docker yet, please install it before continuing with this series; the instructions are available at this link: https://docs.docker.com/install/. The installation procedure will take some time to finish, so please be patient.
Docker comes with an easy tool called “Kitematic”, which allows you to download and run Docker containers with a few clicks. Luckily, the Jupyter team provides a comprehensive container image that includes Python and, of course, Jupyter itself. Once Docker is installed successfully, download the following image: “scipy-notebook”. Note that the download will take a while.
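Once the notebook is running, a quick check like the following can confirm that the three libraries of this series are available. This is a sketch that uses only the Python standard library; the helper name `installed_versions` is my own and not part of Jupyter or Docker.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_versions(packages):
    """Return a dict mapping each package name to its version, or None if it is missing."""
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found


# The three libraries used throughout this tutorial series
versions = installed_versions(["numpy", "pandas", "matplotlib"])
for name, ver in versions.items():
    print(f"{name}: {ver or 'not installed'}")
```

If one of the libraries shows up as not installed, the environment is probably not the one described above.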
If you liked this post, you might also like the tutorial about Python itself. It gives you a good insight into the Python language for Spark. If you want to know more about Python, consider visiting the official page.