The SciPy Notebook

Python is one of the most widely used tools for Data Science. Since its inception, Python has received a lot of attention from both Data Scientists and Data Engineers. In a modern data scientist’s toolbox, it is hard to get around Python. Some months ago, I created a tutorial on the Python language itself. Now, it is about time to have a tutorial dedicated to Python data science!

Python Data Science Tutorial: the tools you need

First, let’s get started with a list of tools that we need for our Python data science tutorial. The most important thing is Python itself, so you should already be familiar with it. If not, consider learning Python with this tutorial first. Once done, we need some additional libraries. Don’t worry, they come with a pre-defined environment. But let’s first have a look at all the libraries that we will use throughout this tutorial series.

We will focus on three libraries in this series: NumPy, Pandas and MatPlotLib. All three are briefly described in this post and will get more comprehensive coverage over the next weeks.

NumPy

NumPy is an easy-to-use open source library for scientific computing. The main element of NumPy is the n-dimensional array, which provides powerful means to do vector-based calculations. The core of the library is implemented at a very low level (in C) and thus offers high performance. NumPy is very useful for elementary mathematical functions, and you will definitely use its random number generator from time to time!
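To give a first impression of the n-dimensional array and vector-based calculations, here is a minimal sketch. The array values and the seed are made up for illustration, and the variable names are my own choice; the next tutorial covers these features properly.

```python
import numpy as np

# A 2-dimensional array with 2 rows and 3 columns (illustrative values)
matrix = np.array([[1, 2, 3], [4, 5, 6]])

# Vectorized arithmetic: the operation is applied to every element,
# without writing an explicit Python loop
doubled = matrix * 2

# Aggregate along an axis: axis=0 sums each column
column_sums = matrix.sum(axis=0)

# Random numbers from a seeded generator, so repeated runs give the same values
rng = np.random.default_rng(seed=42)
samples = rng.normal(loc=0.0, scale=1.0, size=5)

print(doubled)
print(column_sums)
print(samples.shape)
```

Working on whole arrays at once, instead of looping over single values, is the habit that makes NumPy code both short and fast.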

It is essential to learn about NumPy, since it is used in quite a few Data Science projects. The next tutorial in this series will be an intro to NumPy.

Pandas

Pandas is a great open source library for data manipulation. The key element of Pandas is the DataFrame, a concept you may also know from Spark. Pandas offers a lot of opportunities to both Data Scientists and Data Engineers when it comes to handling data.

Pandas can deal with different data types: it is well suited for Time-Series data and also for tabular data. In the next posts, we will explore Pandas more.
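As a small preview of both kinds of data, the sketch below builds a tabular DataFrame and a date-indexed series. All names and values (cities, revenues, dates) are invented for illustration and are not taken from a real dataset.

```python
import pandas as pd

# Tabular data: a small DataFrame with made-up values
sales = pd.DataFrame(
    {
        "city": ["Vienna", "Berlin", "Vienna", "Berlin"],
        "revenue": [100, 150, 120, 130],
    }
)

# Group the rows by city and sum the revenue per group
revenue_per_city = sales.groupby("city")["revenue"].sum()

# Time-series data: six daily values indexed by date
dates = pd.date_range("2024-01-01", periods=6, freq="D")
daily = pd.Series([1, 2, 3, 4, 5, 6], index=dates)

# Resample the daily values into two-day buckets and sum each bucket
two_day_totals = daily.resample("2D").sum()

print(revenue_per_city)
print(two_day_totals)
```

The date index is what enables operations like resampling; with tabular data, grouping and aggregating play a similar role.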

MatPlotLib

Last but not least, it is also useful to present the data in a visual format. This is the strength of MatPlotLib, a library that provides great tools for data visualization in Python. Our tutorial series will end with a description of MatPlotLib.
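To show what a first chart could look like, here is a minimal sketch that plots a handful of made-up points and saves the result. The file name first_chart.png and the data are assumptions for illustration only.

```python
import matplotlib.pyplot as plt

# Made-up data: the squares of the numbers 0 to 4
x = [0, 1, 2, 3, 4]
y = [value ** 2 for value in x]

# Create a figure with one set of axes and draw a line with point markers
fig, ax = plt.subplots()
ax.plot(x, y, marker="o", label="y = x squared")

# Label the chart so that it can be read on its own
ax.set_xlabel("x")
ax.set_ylabel("y")
ax.set_title("A first line chart")
ax.legend()

# Write the chart to an image file (hypothetical file name)
fig.savefig("first_chart.png")
```

In a Jupyter notebook the chart is usually also shown directly below the cell, so saving it to a file is optional.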

Setting up the environment for our Tutorial

Starting with the next tutorial in this series, we will begin to code (yay!). However, this also means that you need to have an environment up and running. There are several options available for this. My preference is to use Jupyter (a notebook app) in a Docker environment. To set this up, you need to have Docker running on your device. If you are not familiar with Docker, you can learn about Docker here. Please install Docker for this series first. If you don’t have it yet, find out how to install it from this link: https://docs.docker.com/install/. The installation procedure will take some time to finish, so please be patient.

Docker comes with an easy tool called “Kitematic”, which allows you to easily download and run Docker containers. Luckily, the Jupyter team provides a comprehensive container that includes Python and, of course, Jupyter itself. Once Docker is installed successfully, download the following container: “scipy-notebook”. Note that the download will take a while.

The scipy notebook in Docker
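Once the container is running, you may want to confirm that the three libraries are actually available. A minimal sketch, assuming you paste it into a notebook cell; the helper name find_missing is my own and not part of Jupyter or Docker:

```python
import importlib.util

# Import names of the three libraries covered in this series
REQUIRED_LIBRARIES = ["numpy", "pandas", "matplotlib"]


def find_missing(libraries):
    """Return the names from `libraries` that cannot be imported here."""
    return [name for name in libraries if importlib.util.find_spec(name) is None]


missing = find_missing(REQUIRED_LIBRARIES)
if missing:
    print("Missing libraries:", ", ".join(missing))
else:
    print("All libraries are available.")
```

If anything is reported as missing, the notebook is most likely not running inside the container you downloaded.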

If you liked this post, you might consider the tutorial about Python itself. It gives you great insight into the Python language for Spark. If you want to know more about Python, you should consider visiting the official page.

I talk a lot to different people in my domain, either at conferences or because I know them personally. Most of them have one thing in common: frustration. But why are people working with data frustrated? Why do we see so many frustrated data scientists? Is it the complexity of dealing with data, or is it something else? My experience clearly points to one thing: something else.

Why are people working with Data frustrated?

One pattern is very clear: most people I talk to who are frustrated with their job work in classical industries. Whenever I talk to people in the IT industry or in Startups, they seem to be very happy. This is in stark contrast to people working in “classical” industries or in consulting companies. There are several reasons for that:

  • First, it is often about a lack of support within traditional companies. Processes are complex and employees have worked in the company for quite some time. Bringing in new people (or the cool data scientists) often creates friction with the established employees of the company. Doing things differently from how they used to be done isn’t well received by the established employees, and they have the power and the will to block any kind of innovation. No amount of data science magic can compete with the internal network they have.
  • Second, data is difficult to grasp and organised in silos. Established companies often treat the IT function as a cost center, so things were done or fixed on the fly. It was never really intended to dismantle those silos, as budgets were never reserved or made available for doing so. Even now, most companies don’t look into any kind of data governance to reduce their silos. Data quality isn’t a key aspect they strive for. The new kind of people, data scientists, are often “hunting” for data rather than working with the data.
  • Third, the technology stack is heterogeneous and legacy brings in a lot of frustration as well. This is very similar to the second point. Here, the issue is less about finding the data at all and more about not knowing how to get it out of a system that has no clear API.
  • Fourth, everybody forgets about data engineers. Data Scientists sit alone and though they do have some skills in Python, they aren’t the ones operating a technology stack. Often, there is a mismatch between data scientists and data engineers in corporations.
  • Fifth, legacy always kicks in. Mandatory regulatory reporting and finance reporting often take resources away from the organisation. You can’t just say: “Hey, I am not doing this report for the regulator since I want to find some patterns in the behaviour of my customers”. Traditional industries are more heavily regulated than Startups or IT companies. This leads to data scientists being reused for standard reporting (not even self-service!). Then the answer often is: “This is not what I signed up for!”
  • Sixth, Digitalisation and Data units are often created in order to show them in the shareholder report. There is no real demand from the board for impact. Impact is driven by the business, and the business knows how to do so. There won’t be significant growth at all, only some growth from “doing it as usual”. (However, startups and companies changing the status quo will get this significant growth!)
  • Seventh, Data scientists need to be in the business, whereas data engineers need to be in the IT department close to the IT systems. Period. However, Tribes need to be centrally steered.

How to overcome this frustration?

Basically, there is no quick cure for this problem that would reduce the number of frustrated data scientists. The field is still young, so confusion and wrong decisions outside of the IT industry are normal. Projects will fail, skilled people will leave and find new jobs. Over time, companies will become more mature in their journey, and everything around data will become an established part of a company, just like controlling, marketing or any other function. Data has yet to find its place and organisation type.