Daniel Morales_

Maker - Data Scientist - Ruby on Rails Fullstack Developer


2020-12-18 18:01:43 UTC

Using Pandas Profiling to Accelerate Our Exploratory Analysis

Pandas Profiling is a library that generates reports from a pandas DataFrame. The pandas df.describe() function that we normally use in Pandas is great but it is a bit basic for a more serious and detailed exploratory data analysis. pandas_profiling extends the pandas DataFrame with df.profile_report() for a quick data analysis.

The following statistics for each column are presented in an interactive report.

  • Type information: detects the types of columns in a DataFrame.
  • Essentials: type, unique values, missing values
  • Quantile statistics such as minimum value, Q1, median, Q3, maximum, range, interquartile range
  • Descriptive statistics such as mean, mode, standard deviation, sum, median absolute deviation, coefficient of variation, kurtosis, skewness
  • The most frequent values
  • Histograms
  • Highlighting of highly correlated variables, with Spearman, Pearson and Kendall matrices
  • Missing values: matrix, count, heat map and dendrogram
  • Text analysis: learns about categories (Uppercase, Space), scripts (Latin, Cyrillic) and blocks (ASCII) of text data.
  • File and image analysis: extracts file sizes, creation dates, and dimensions, and detects truncated images or images containing EXIF metadata.
For this Notebook we will be working on the dataset found in the following link Meteorite landings

If you want to follow the Notebook of this exploratory data analysis, you can download it here.

This comprehensive dataset from The Meteoritical Society contains information on all known meteorite landings. It is interesting to observe the places on Earth where these objects have fallen, following the coordinates in the dataset.

Meteorite fall map

Image by Author

Let’s start by importing the dataset, in order to get a first feel for the data we will work with

In [1]:

import pandas as pd
We have saved the Meteorite Landings dataset in the ‘datasets’ folder of the current working environment, so we select the right path for the import

In [2]:

df = pd.read_csv("datasets/Meteorite_Landings.csv")
And now we check the data

In [4]:

df.head()

Image by Author

In [5]:

df.shape
(45716, 10)
It is a very interesting dataset, where we can observe the name the scientists gave to the meteorite, its class recclass, its weight in grams mass (g), the date on which it fell, and the coordinates where it landed.

It is also important to note that this is a fairly complete dataset, with 45,716 records and 10 columns. This information is given by the .shape attribute.

For more information about the dataset you can enter here: Meteorite — Nasa

Now, as we mentioned at the beginning, with Pandas we have gotten used to running the .describe() command to generate a descriptive analysis on the dataset in question. Descriptive statistics include those that summarize the central tendency, dispersion, and form of distribution of a data set, excluding the NaN values.

It analyzes both numerical and object series, as well as DataFrame column sets of mixed data types. The result will vary depending on what is provided. For more information about .describe() and the parameters we can pass, you can find the info here: Pandas describe
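As a small sketch (with a made-up two-column DataFrame, not the meteorite data), the include parameter shows how describe() can be broadened to cover non-numeric columns:

```python
import pandas as pd

# a hypothetical mini-DataFrame mixing a numeric and a string column
df = pd.DataFrame({"mass": [10.0, 20.0, 30.0], "recclass": ["L5", "H6", "L5"]})

# by default, describe() summarizes only the numeric columns
print(df.describe())

# include='all' also summarizes object columns (count, unique, top, freq)
print(df.describe(include="all"))
```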

Now let’s run this command over our dataset and see the result

In [6]:

df.describe()

Image by Author

The describe() method skips over the categorical columns (string type) and performs a descriptive statistical analysis of the numerical columns. Here we can see that the id column might not be useful for this analysis, since it is just a unique identifier for each row (the primary key of the table), while the mass is useful and interesting for understanding, for example, the minimum and maximum values, the mean, and the percentiles (25, 50, 75).

As we can see, this is a very basic analysis, without further relevant information. If we want more relevant information we must start writing code.

This is where Pandas profiling comes in, and where its usefulness lies. The documentation for this library can be found at the following link: Pandas profiling. Installation is done in the following way

Installing and importing Pandas Profiling

!pip3 install 'pandas-profiling[notebook,html]'
Image by Author

The command must be passed in quotes, followed by notebook,html, because we will need these two features of the library.

If you are using Conda, these are the other ways to install it: Installing Pandas profiling

Creating relevant columns to the analysis

Now we are going to create a series of columns that will be relevant to the analysis we will do with Pandas Profiling. The first step is to create a constant variable for all the records; this time we will say that all the records belong to NASA, and we do the following

In [8]:

df['source'] = "NASA"
In [9]:

df.head()

Image by Author

As we can see, the column was created. We are now going to create a boolean variable at random, simulating some kind of boolean output for each record.

Remember that this is done so that our exploratory analysis can identify this type of data in the result.

In [11]:

# we imported numpy; it should have been installed with Pandas. If you don't have it,
# you can install it with the `pip3 install numpy` command
import numpy as np
In [12]:

# numpy is going to help us create those random booleans in the next line of code
df['boolean'] = np.random.choice([True, False], df.shape[0])
In [13]:

Image by Author

As we see, the column boolean was created with random values of True or False for each of the rows of our dataset. This is thanks to df.shape[0], which refers to the number of rows or records in the dataset; in other words, the operation was performed 45,716 times, once per record.
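The same idea in isolation (a toy DataFrame standing in for the meteorite data): np.random.choice draws one value per row from the options given:

```python
import numpy as np
import pandas as pd

# toy DataFrame standing in for the real dataset
toy = pd.DataFrame({"name": ["a", "b", "c", "d"]})

# draw one True/False value for each of the toy.shape[0] rows
toy["boolean"] = np.random.choice([True, False], toy.shape[0])
print(toy["boolean"].isin([True, False]).all())  # every value is a boolean
```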

Let’s now do something similar, but mixing numerical and categorical (string) data types

In [14]:

df['mixed'] = np.random.choice([1, 'A'], df.shape[0])
In [15]:

Image by Author

As we can see, here we are simulating a column with two types of data mixed together, both numerical and categorical. This is something we can find in real datasets, and Pandas’ describe() will simply ignore such a column and give us no analysis of it (remember that describe() only reports on numerical columns; it even ignores boolean columns too)

Now let’s do something even more interesting. We are going to create a new column by simulating a high correlation with an existing column. In particular, we will do it on the column reclat, which holds the latitude where the meteorite fell, and we will add noise from a normal distribution with a standard deviation of 5 and a sample size equal to the length of the dataset.

If you want to see how to create a simulation of a normal distribution with random numbers with Numpy, check this link. Random normal numpy

In [16]:

df['reclat_city'] = df['reclat'] + np.random.normal(scale=5, size=(len(df)))
In [17]:

Image by Author

Checking the result of the last command, we can see that the column reclat_city now has a high correlation with reclat: when one observation is positive the other tends to be positive, and when one is negative, so is the other.

To analyze correlations with Pandas we use a different method than describe(); in this case we use the corr() method. However, with Pandas profiling we obtain both analyses (descriptive statistics and correlations) with a single command. We will see this in a few moments when we run our exploratory analysis.

Remember that for now what we are doing is adding columns to the dataframe in order to see all the possibilities offered by the Pandas profiling tool.
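To see what that single command replaces, here is a quick sketch (with synthetic latitudes, not the real column) of how corr() exposes the relationship we just built:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
lat = rng.uniform(-90, 90, 1_000)  # synthetic latitudes

df_sim = pd.DataFrame({
    "reclat": lat,
    # same noisy-copy construction as above: latitude plus normal noise
    "reclat_city": lat + rng.normal(scale=5, size=1_000),
})

# Pearson correlation matrix; the noisy copy stays strongly correlated
print(df_sim.corr())
```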

We are now going to simulate another scenario that is common in datasets: duplicate observations or rows. We will do it like this:

In [18]:

duplicates_to_add = pd.DataFrame(df.iloc[0:10])
In [19]:

Image by Author

What we just did was create a new dataframe from the first 10 rows of our original dataframe. To do this we use .iloc, which selects rows, with a slice selector to pick rows 0 through 9 (the end of the slice is exclusive).
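The slice semantics in isolation (a toy frame, not the meteorite data):

```python
import pandas as pd

toy = pd.DataFrame({"name": ["a", "b", "c", "d"]})

# .iloc[0:2] selects rows 0 and 1; the end of the slice is exclusive
first_two = toy.iloc[0:2]
print(len(first_two))  # 2
```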

Now let’s change the names so we can identify them later; the other values remain the same

In [20]:

duplicates_to_add['name'] = duplicates_to_add['name'] + " copy"
In [21]:

Image by Author

If we look, all the names now have the word ‘copy’ at the end. This new dataframe is ready to be concatenated to the original dataset, so we can have duplicated data. Let’s concatenate them now

In [22]:

# df.append() is deprecated in recent pandas; pd.concat does the same job
df = pd.concat([df, duplicates_to_add], ignore_index=True)
In [23]:

df.shape
(45726, 14)
The original dataset contained 45,716 rows; now we have 10 more, which are the duplicate rows. In fact, we can see some of them in the display above!
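With plain pandas we can already count such duplicates; a minimal sketch (toy data, and the subset parameter is used here to ignore the modified name column):

```python
import pandas as pd

toy = pd.DataFrame({
    "name": ["Aachen", "Aarhus", "Aachen copy"],
    "mass": [21.0, 720.0, 21.0],
})

# duplicated() flags repeated rows; subset restricts which columns are compared
dup_count = toy.duplicated(subset=["mass"]).sum()
print(dup_count)  # 1
```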

Using Pandas profiling

Now we have arrived at the long-awaited moment. We have added some columns to the dataset that will let us see interesting analyses on it. But before that, let’s be fair to Pandas’ describe() and see what analysis it gives us on the resulting dataset

In [25]:

df.describe()

Image by Author

As we can see, there is very little difference; it gives us no additional information about:

  • Boolean columns
  • Mixed columns
  • Correlations
This is where Pandas profiling shines, thanks to the simplicity with which it performs an exploratory analysis of our datasets. Without further ado, let’s run the following command

In [26]:

## we already have the library installed, now we need to import it
import pandas_profiling
from pandas_profiling.utils.cache import cache_file
In [27]:

## now we run the report
report = df.profile_report(sort='None', html={'style':{'full_width':True}})
In [28]:

report

Image by Author

Understanding the results

The output speaks for itself. Compared with Pandas’ describe() or even corr(), the difference is quite significant: from the start we can observe a lot of additional data and analysis that will help us better interpret the dataset we are working with. Let’s look, for example, at the columns we recently added

  • In the Overview we can see the duplicate rows report: Duplicate rows 10
  • In Variable types we can see the boolean column: BOOL 1
  • In the Overview, under Warnings, we can see the high correlation between the columns we created: reclat_city is highly correlated with reclat High correlation
  • After the Overview we can see an analysis of each column/variable
  • In the variable mixed we can see the analysis of the randomly generated values
  • Further down, in the Interactions section, we can see different types of plots relating pairs of variables
  • Then we can see an analysis of correlations, which is always important for understanding the interdependence of the data and the possible predictive power of these variables
  • We can also see an analysis of the missing values, which is always useful for cleaning or normalizing the data
Finally, we might want this report in a format other than a Jupyter Notebook. The library offers the possibility of exporting the report to HTML, which is useful for showing it in a friendlier environment for the end user, where you can even interact with it via the navigation bars.

In [29]:

report.to_file(output_file="report.html")

Image by Author

If we click on it, it will open in the browser. Personally, I like this format quite a lot, since it does not interfere with the code, yet lets you navigate through the analysis, show it to the stakeholders interested in it, and make decisions based on it.

Image by Author

Final Notes

As you can see, the tool is very easy to use, and it is a good first step before starting feature engineering and/or predictions. However, there are some disadvantages to the tool that are important to take into account:

  • The main disadvantage of pandas profiling is its use with large datasets; as the size of the data increases, the time to generate the report grows considerably.
  • One way to solve this problem is to generate the profile report for a portion of the dataset. While doing this, it is very important to make sure the data are sampled randomly so that they are representative of all the data we have. For example:
In [30]:

data = df.sample(n=1000)
In [31]:

Image by Author


As we can see, 1,000 rows have been selected at random, so the analysis will not run over the full 45,000+ rows. If we had, say, 1,000,000 rows, the difference in performance would be significant, so this is good practice
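One hedged aside: sample() also accepts a random_state parameter, which makes the sampled subset reproducible from run to run:

```python
import pandas as pd

toy = pd.DataFrame({"x": range(100)})

# fixing random_state means the same 10 rows come back every time
sample_a = toy.sample(n=10, random_state=42)
sample_b = toy.sample(n=10, random_state=42)
print(sample_a.equals(sample_b))  # True
```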

In [33]:

profile_in_sample = data.profile_report(sort='None', html={'style':{'full_width':True}})
In [34]:

Image by Author

As we can see, it takes less time to run with a sample of 1,000 records.

  • Alternatively, if you insist on getting the report for the whole dataset, you can do it using minimal mode.
  • In minimal mode a simplified report is generated with less information than the full one, but it can be produced relatively quickly for a large dataset.
  • The code for it is given below:
In [35]:

profile_min = data.profile_report(minimal=True)
In [36]:

Image by Author

As we can see, it is a faster report but with less information about the exploratory analysis of the data. We leave it up to you to decide what type of report you want to generate. If you want to see more advanced features of the library please go to the following link: Advanced Pandas profiling

I have other posts in Towards Data Science

Creating the Whole Machine Learning Pipeline with PyCaret
This tutorial covers the entire ML process, from data ingestion, pre-processing, model training, hyper-parameter…towardsdatascience.com
I hope you enjoyed this read! You can follow me on Twitter or LinkedIn