Daniel Morales_

Maker - Data Scientist - Ruby on Rails Fullstack Developer

Twitter:
@daniel_moralesp

2020-12-18

How to Use Python Datetimes Correctly

Datetime is a Python object that represents a single point in time, with components such as year, day, second, and microsecond. This is very useful when building our programs.

The datetime module provides classes to manipulate dates and times in both simple and complex ways. While date and time arithmetic is supported, the implementation focuses on efficient extraction of attributes for output formatting and manipulation.

You can download a jupyter notebook with all these steps here.

Let’s import the Python module

In [1]:

from datetime import datetime
Let's create a date passing year, month, day, hour, minute, and second as arguments.

datetime(year, month, day, hour, minute, second)

In [2]:

birthday = datetime(1994, 2, 15, 4, 25, 12)
Now that we've created the object and assigned it to the variable called birthday, we can access each of its components like this

In [3]:

birthday
Out[3]:

datetime.datetime(1994, 2, 15, 4, 25, 12)
In [4]:

birthday.year
Out[4]:

1994
In [5]:

birthday.month
Out[5]:

2
In [6]:

birthday.day
Out[6]:

15
As you can see, it’s very easy to create a date using this module. Now we can do other interesting things, like:

In [7]:

birthday.weekday()
Out[7]:

1
This means that the birthday fell on a Tuesday, because weekday() numbers the days from 0 to 6 starting with Monday at 0, the same way a list is indexed (beginning with zero).
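If you prefer a readable day name instead of an index, the standard library's calendar module can translate it for us. This is just a small illustrative sketch, not part of the original notebook:

import calendar

calendar.day_name[birthday.weekday()]
# 'Tuesday'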

But what if I want to know the current datetime? In that case, we can use datetime.now(). Go ahead and write this into the next cell

In [8]:

datetime.now()
Out[8]:

datetime.datetime(2020, 11, 17, 11, 32, 11, 992169)
Ok, that’s interesting. What if you run that command again? Go ahead and see the difference

In [9]:

datetime.now()
Out[9]:

datetime.datetime(2020, 11, 17, 11, 33, 36, 433919)
As you can see, the output is now different, because time has passed. Great! Now you might ask: how do I calculate the time from one date to another? That's called time tracking. Let's see how it works

In [10]:

# time tracking operation
datetime(2018, 1, 1) - datetime(2017, 1, 1)
Out[10]:

datetime.timedelta(365)
In [11]:

datetime(2018, 1, 1) - datetime(2017, 1, 12)
Out[11]:

datetime.timedelta(354)
As you can see, it's very easy: we can run arithmetic operations between dates, which is great! But what if you now want to know how much time has passed from a given date to today, at this very moment? How do you think that can be done? Think about it for a moment!

In [12]:

datetime.now() - datetime(2020, 1, 1)
Out[12]:

datetime.timedelta(321, 41994, 571469)
Excellent! We simply take datetime.now() and subtract the date we want to measure from. Easy!
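A timedelta like the one above also exposes its pieces directly, which is handy when you only care about, say, whole days. A quick sketch:

elapsed = datetime.now() - datetime(2020, 1, 1)
elapsed.days             # whole days elapsed
elapsed.seconds          # leftover seconds within the last partial day
elapsed.total_seconds()  # the entire duration expressed in seconds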

Using strptime

This method helps us transform dates given as strings into datetime objects, which is quite useful!

Let’s see it in action:

In [13]:

parsed_date = datetime.strptime('Nov 15, 2020', '%b %d, %Y')
In [14]:

parsed_date
Out[14]:

datetime.datetime(2020, 11, 15, 0, 0)
In [15]:

type(parsed_date)
Out[15]:

datetime.datetime
As you can see, we passed two parameters to the strptime method: the first is the date string, and the second is the format "directives" we want to use for the conversion. To see all the available "directives", go to the following link:

Image by python

We already have our parsed date in the parsed_date variable, so now let's start accessing the attributes it contains.

In [16]:

parsed_date.month
Out[16]:

11
In [28]:

parsed_date.year
Out[28]:

2020
Using strftime

All right, now let’s do the opposite operation, passing a datetime type as a parameter to the strftime function and converting it to a string. We do it like this:

In [37]:

date_string = datetime.strftime(datetime.now(), '%b %d, %Y')
In [38]:

date_string
Out[38]:

'Nov 17, 2020'
As you can see, we pass datetime.now() as the first argument and then the directives of the formats in which we want the output. Really simple!
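Since strptime and strftime are mirror operations, we can chain them to re-format a date string into any layout we like. A short sketch (the format strings here are just examples):

stamp = datetime.strptime('2020-11-17 11:33', '%Y-%m-%d %H:%M')
datetime.strftime(stamp, '%A, %b %d, %Y at %H:%M')
# 'Tuesday, Nov 17, 2020 at 11:33'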

Time object

A time object represents a time of day (local), independent of any particular day, and subject to adjustment through a tzinfo object.

All arguments are optional. tzinfo can be None, or an instance of a tzinfo subclass. The rest of the arguments can be integers, in the following ranges:

Image by Author

If an argument is given outside these ranges, a ValueError is raised.

All default values are 0 except tzinfo, which defaults to None. Time to play with this object!

In [40]:

from datetime import time
In [42]:

my_time = time(hour=12, minute=34, second=56, microsecond=123456)
In [43]:

my_time
Out[43]:

datetime.time(12, 34, 56, 123456)
As we can see, the result is a time object. However, its default representation is not very "friendly". With the time object we can use isoformat

In [44]:

my_time.isoformat(timespec='minutes')
Out[44]:

'12:34'
In [45]:

my_time.isoformat(timespec='microseconds')
Out[45]:

'12:34:56.123456'
In [46]:

my_time.isoformat(timespec='auto')
Out[46]:

'12:34:56.123456'
In [47]:

my_time.isoformat()
Out[47]:

'12:34:56.123456'
We can see that there are several ISO formats for displaying the time. We used different timespec values, and the default one is auto, which is what you get when you don't pass the parameter explicitly. These are the possible formats to use

Image by Author

timedelta object

A timedelta object represents a duration, the difference between two dates or times, which is quite useful! Let's look at how it works. First we need to import timedelta, and then we can call its different built-in functions

In [48]:

from datetime import timedelta
In [49]:

year = timedelta(days=365)
In [50]:

year
Out[50]:

datetime.timedelta(365)
In [51]:

year.total_seconds()
Out[51]:

31536000.0
In [56]:

ten_years = 10 * year
In [58]:

ten_years.total_seconds()
Out[58]:

315360000.0
We passed the parameter days=365 to timedelta and then called two operations: one returns the total number of seconds in 365 days, and the other creates a ten-year duration.

Let's do another calculation

In [59]:

another_year = timedelta(weeks=40, days=84, hours=23,
                         minutes=50, seconds=600)  # adds up to 365 days
In [60]:

another_year
Out[60]:

datetime.timedelta(365)
In [61]:

year == another_year
Out[61]:

True
We have now done a boolean operation, asking whether one timedelta is equal to another, and we get True.
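Timedeltas are not only useful for comparisons; we can also add them to or subtract them from datetimes to shift a point in time. A minimal sketch:

datetime(2020, 1, 1) + timedelta(weeks=2, days=3)
# datetime.datetime(2020, 1, 18, 0, 0)

datetime.now() - timedelta(hours=6)
# the datetime six hours ago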

Naive & Aware methods

There are two types of date and time objects: “naive” & “aware”.

An “aware” object has sufficient knowledge of the applicable algorithmic and political time settings, such as time zone and daylight savings information, to be able to position itself in relation to other “aware” objects.

An “aware” object is used to represent a specific moment in time that is not open to interpretation (ignoring relativity).

A “naive” object does not contain enough information to place itself unambiguously in relation to other date/time objects.

Whether a “naive” object represents Coordinated Universal Time (UTC), local time, or the time of some other time zone depends purely on the program, just as it depends on the program whether a given number represents meters, miles or mass.

The “naive” objects are easy to understand and work with, at the cost of ignoring some aspects of reality.

This is ultimately what lets us work with time zones and daylight saving changes, including American zones such as EST or EDT.

Supporting time zones at deeper levels of detail depends on the application.

The rules for time adjustment worldwide are more political than rational, change frequently, and there is no standard suitable for every application other than UTC

Objects of this type are immutable.

Objects of the date type are always naive.
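To make the distinction concrete, here is a minimal sketch using the standard library's timezone.utc to build an "aware" datetime next to a "naive" one (this cell is an addition for illustration):

from datetime import timezone

aware_now = datetime.now(timezone.utc)
aware_now.tzinfo          # datetime.timezone.utc, so this object is "aware"

naive_now = datetime.now()
naive_now.tzinfo is None  # True, so this object is "naive"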

I hope you enjoyed this reading! You can follow me on Twitter or LinkedIn.

Read these other posts I have written for Towards Data Science

Creating the Whole Machine Learning Pipeline with PyCaret
This tutorial covers the entire ML process, from data ingestion, pre-processing, model training, hyper-parameter…towardsdatascience.com
Using Pandas Profiling to Accelerate Our Exploratory Analysis
Pandas Profiling is a library that generates reports from a pandas DataFrametowardsdatascience.com








2020-12-18

Using Pandas Profiling to Accelerate Our Exploratory Analysis

Pandas Profiling is a library that generates reports from a pandas DataFrame. The pandas df.describe() function that we normally use in Pandas is great but it is a bit basic for a more serious and detailed exploratory data analysis. pandas_profiling extends the pandas DataFrame with df.profile_report() for a quick data analysis.

The following statistics for each column are presented in an interactive report.

  • Type information: detect the types of columns in a dataframe.
  • Essentials: type, unique values, missing values
  • Quantile statistics such as minimum value, Q1, median, Q3, maximum, range, interquartile range
  • Descriptive statistics such as mean, mode, standard deviation, sum, median absolute deviation, coefficient of variation, kurtosis, skewness
  • The most frequent values
  • Histograms
  • Highlighting of highly correlated variables, with Spearman, Pearson and Kendall matrices
  • Missing value matrix, count, heat map and dendrogram of missing values
  • Text analysis learns about categories (Uppercase, Space), scripts (Latin, Cyrillic) and blocks (ASCII) of text data.
  • File and image analysis extracts file sizes, creation dates, and dimensions and scans images that are truncated or contain EXIF information.
For this Notebook we will be working on the dataset found in the following link Meteorite landings

If you want to follow the Notebook of this exploratory data analysis, you can download it here.

This comprehensive data set from the Meteorological Society contains information on all known meteorite landings. It is interesting to observe the places on earth where these objects have fallen, following the coordinates of the dataset.

Meteorite fall map

Image by Author

Let’s start now by importing the dataset, in order to understand a little bit the data we will work with

In [1]:

import pandas as pd
We have saved the Meteorite Falling dataset (Meteorite landings) in the ‘datasets’ folder of the present working environment, so we selected the right path for the import

In [2]:

df = pd.read_csv("datasets/Meteorite_Landings.csv")
And now we check the data

In [4]:

df.head()
Out[4]:

Image by Author

In [5]:

df.shape
Out[5]:

(45716, 10)
It is a very interesting dataset, where we can observe the name that the scientists gave to the meteorite, the class recclass, the weight in grams mass (g), the date on which it fell, and the coordinates where it landed.

It is also important to note that this is a very complete dataset, with 45,716 records and 10 columns. This information is given by the .shape

For more information about the dataset you can enter here: Meteorite — Nasa

Now, as we mentioned at the beginning, with Pandas we have gotten used to running the .describe() command to generate a descriptive analysis on the dataset in question. Descriptive statistics include those that summarize the central tendency, dispersion, and form of distribution of a data set, excluding the NaN values.

It analyzes both numerical and object series, as well as DataFrame column sets of mixed data types. The result will vary depending on what is provided. For more information about .describe() and the parameters we can pass, you can find the info here: Pandas describe

Now let’s run this command over our dataset and see the result

In [6]:

df.describe()
Out[6]:

Image by Author

The describe() method skips over the categorical columns (string type) and makes a descriptive statistical analysis on the numerical columns. Here we could see that the id column might not be useful for this analysis, since it is only a single indicator for each row (primary key of the table), while the mass is useful and interesting to understand, for example, the minimum and maximum value, the mean and the percentiles (25, 50, 75).

As we can see, this is a very basic analysis, without much additional insight. If we wanted more, we would normally have to start writing code ourselves.
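As a small aside (not part of the original notebook), describe() can at least be nudged to include the non-numeric columns, although the summary it produces for them is still quite thin:

# include categorical (object) columns in the summary as well
df.describe(include='all')

# or summarize only the object columns
df.describe(include=['object'])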

This is where Pandas profiling comes in, and its usefulness. The documentation of this library can be found in the following link. Pandas profiling. The installation will be done in the following way

Installing and importing Pandas Profiling

!pip3 install 'pandas-profiling[notebook,html]'
Image by Author

Note that the package name must be wrapped in quotes and followed by [notebook,html]; this is because we will need these two extras of the library.

If you are using Conda, these are the other ways to install it: Installing Pandas profiling

Creating relevant columns to the analysis

Now we are going to create a series of columns that will be relevant to the analysis we will do with Pandas Profiling. The first step is to create a constant variable for all the records; this time we will say that all the records belong to NASA, and we do the following

In [8]:

df['source'] = "NASA"
In [9]:

df.head()
Image by Author

As we can see, the column was indeed created. We are now going to create a boolean variable at random, simulating some kind of boolean output for each record.

Remember that this is done so that our exploratory analysis can identify this type of data in the result.

In [11]:

# we imported numpy; it should have been installed with Pandas. If you don't have it,
# you can install it with the `pip3 install numpy` command
import numpy as np
In [12]:

# numpy is going to help us create those random booleans in the next line of code
df['boolean'] = np.random.choice([True, False], df.shape[0])
In [13]:

df.head()
Image by Author

As we can see, the boolean column was created with a random True or False value for each row of our dataset. This is thanks to df.shape[0], which refers to the number of rows in the dataset; in other words, the operation was performed 45,716 times, the total number of records.

Let’s do now something similar, but mixing numerical data types and categorical data types (strings)

In [14]:

df['mixed'] = np.random.choice([1, 'A'], df.shape[0])
In [15]:

df.head()
Image by Author

As we can see, here we are simulating a column that has two types of data mixed together, both numerical and categorical. This is something we can find in real datasets, and Pandas describe() will simply ignore it and give no analysis for that column (remember that describe() only reports on numerical columns; it even ignores boolean columns).

Now let's do something even more interesting. We are going to create a new column by simulating a high correlation with an existing column. In particular, we will do it on the reclat column, which holds the latitude where the meteorite fell, and we will add normally distributed noise with a standard deviation of 5 and a sample size equal to the length of the dataset.

If you want to see how to create a simulation of a normal distribution with random numbers with Numpy, check this link. Random normal numpy

In [16]:

df['reclat_city'] = df['reclat'] + np.random.normal(scale=5, size=(len(df)))
In [17]:

df.head()
Image By Author

Checking the result of the last command, we can see that this reclat_city column now has a high correlation with reclat: when one observation is positive the other tends to be too, and when one is negative, so is the other.

To analyze correlations with Pandas we use a different method than describe(), in this case we use the corr() command. However, with Pandas profiling both analyses (descriptive statistics and correlations) we will obtain them with only one command. We will see this in a few moments when we run our exploratory analysis.
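If you want to verify the correlation we just simulated with plain pandas before reaching for the profiling report, a quick extra cell (added here only for illustration) could look like this:

# pairwise correlation between the original latitude and the simulated column
df[['reclat', 'reclat_city']].corr()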

Remember that for now what we are doing is adding columns to the dataframe in order to see all the possibilities offered by the Pandas profiling tool.

We are now going to simulate another common scenario in the datasets, and that is to have duplicate observations or rows. This we will do it like this:

In [18]:

duplicates_to_add = pd.DataFrame(df.iloc[0:10])
In [19]:

duplicates_to_add
Image by Author

What we just did was create a new dataframe from the first 10 rows of our original dataframe. To do this we use iloc, which selects rows, with a slice selector that takes rows 0 through 9.

Now let’s change the name to identify them later, but the other values remain the same

In [20]:

duplicates_to_add['name'] = duplicates_to_add['name'] + " copy"
In [21]:

duplicates_to_add
Image by Author

If we look, all the names now have the word ‘copy’ at the end. This new dataset is ready to be concatenated to the original one, so that we end up with duplicated data. Let's do the append now

In [22]:

df = df.append(duplicates_to_add, ignore_index=True)
In [23]:

df.head()
Image by Author

df.shape
(45726, 14)
The original dataset contained 45,716 rows; now we have 10 more, which are the duplicated rows (the copies were made from the first 10 rows shown in the display above).
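If we wanted to confirm the duplicates with plain pandas, one option is to compare rows on every column except name, since that is the only value we changed. This is an illustrative sketch, not part of the original notebook:

# count rows that are identical in every column except 'name'
cols_except_name = [c for c in df.columns if c != 'name']
df.duplicated(subset=cols_except_name).sum()
# should flag the 10 copies we appended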

Using Pandas profiling

Now we have arrived at the expected moment, we have added some columns to the dataset that will allow us to see interesting analyses on it. But before that, we must be fair to the pandas describe() and see what analysis it gives us on the resulting dataset

In [25]:

df.describe()
Image by Author

As we can see, there is very little difference; it does not give us additional information about:

  • Boolean columns
  • Mixed columns
  • Correlations
This is where Pandas profiling shines by its simplicity to perform an exploratory analysis on our datasets. Without further ado, let’s run the following command

In [26]:

## we already have the library installed, now we need to import it
import pandas_profiling
from pandas_profiling.utils.cache import cache_file
In [27]:

## now we run the report
report = df.profile_report(sort='None', html={'style':{'full_width':True}})
In [28]:

report
Image by Author

Understanding the results

The output speaks for itself. Compared with Pandas describe() or even Pandas corr(), it is quite significant: from the start we can observe a lot of additional data and analysis that will help us better interpret the dataset we are working with. Let's analyze, for example, the columns we recently added

  • In the Overview we can see the duplicate rows report: Duplicate rows 10
  • In the Type of variable we can see the Boolean column: BOOL 1
  • In the Overview, but in the Warnings we can see the high correlation between the columns we created: reclat_city is highly correlated with reclat High correlation
  • We can see after the Overview an analysis of each column/variable
  • In the variable mixed we can see the analysis of the randomly generated values
  • Further down in the section Interactions we can see the different types of graphs and their correlations between variables.
  • Then we can see an analysis of correlations, which is always important to understand the interdependence of the data and the possible predictive power that these variables have
  • We can also see an analysis of the “missing values”, which is always interesting to make some kind of cleaning or normalization of the data.
Finally, we might want this report in a different format than a Jupyter Notebook. The library offers the possibility of exporting the report to HTML, which is useful for sharing it in a friendlier environment for the end user, where you can even interact with it through navigation bars.

In [29]:

report.to_file(output_file="report_eda.html")
Image by Author

If we click on it, it will open in the browser. Personally I like this format quite a lot, since it is decoupled from the code: you can navigate through the analysis, show it to the interested stakeholders, and make decisions based on it.

Image by Author

Final Notes

As you can see, it is very easy to use the tool, and it is a first step before starting to perform feature engineering and/or predictions. However there are some disadvantages about the tool that are important to take into account:

  • The main disadvantage of pandas profiling is its use with large data sets. With increasing data size, the time to generate the report also increases a lot.
  • One way to solve this problem is to generate the profile report for a portion of the data set. But while doing this, it is very important to make sure that the data are sampled randomly so that they are representative of all the data we have. We can do this for example:
In [30]:

data = df.sample(n=1000)
In [31]:

data.head()
Image by Author

len(data)
Out[32]:

1000
As we can see, 1,000 samples have been selected at random, so the analysis will run on those instead of the more than 45,000 rows. If we had, say, 1,000,000 samples, the difference in performance would be significant, so this is a good practice

In [33]:

profile_in_sample = data.profile_report(sort='None', html={'style':{'full_width':True}})
In [34]:

profile_in_sample
Image by Author

As we see it takes less time to run with a sample of 1,000 examples.

  • Alternatively, if you insist on getting the report of the whole data set, you can do it using the minimum mode.
  • In the minimum mode a simplified report will be generated with less information than the full one, but it can be generated relatively quickly for a large data set.
  • The code for it is given below:
In [35]:

profile_min = data.profile_report(minimal=True)
In [36]:

profile_min
Image by Author

As we can see, it is a faster report but with less information about the exploratory analysis of the data. We leave it up to you to decide what type of report you want to generate. If you want to see more advanced features of the library please go to the following link: Advanced Pandas profiling

Read this other post I have written for Towards Data Science

Creating the Whole Machine Learning Pipeline with PyCaret
This tutorial covers the entire ML process, from data ingestion, pre-processing, model training, hyper-parameter…towardsdatascience.com
I hope you enjoyed this reading! You can follow me on Twitter or LinkedIn.







2020-12-18

Creating the Whole Machine Learning Pipeline with PyCaret

How to Create a Machine Learning Pipeline with PyCaret

This tutorial covers the entire ML process, from data ingestion, pre-processing, model training, hyper-parameter fitting, predicting and storing the model for later use.

We will complete all these steps in less than 10 commands that are naturally constructed and very intuitive to remember, such as

create_model(), 
tune_model(), 
compare_models()
plot_model()
evaluate_model()
predict_model()
Let’s see the whole picture


Image by Author

Recreating the entire experiment without PyCaret requires more than 100 lines of code in most libraries. The library also allows you to do more advanced things, such as advanced pre-processing, ensembling, generalized stacking, and other techniques that allow you to fully customize the ML pipeline and are a must for any data scientist.

PyCaret is an open source, low-code library for ML with Python that allows you to go from preparing your data to deploying your model in minutes. It allows data scientists and analysts to perform iterative data science experiments end to end efficiently, and lets them reach conclusions faster because much less time is spent on programming. This library is very similar to R's caret, but implemented in Python.

When working on a data science project, it usually takes a long time to understand the data (EDA and feature engineering). So, what if we could cut the time we spend on the modeling part of the project in half?

Let’s see how

First we need this pre-requisites

  • Python 3.6 or later
  • PyCaret 2.0 or later
Here you can find the library docs and others.

Also, you can follow this notebook with the code.

First of all, please run this command: !pip3 install pycaret

For Google Colab users: If you are running this notebook in Google Colab, run the following code at the top of your notebook to display interactive images

from pycaret.utils import enable_colab
enable_colab()
Pycaret Modules

Pycaret is divided according to the task we want to perform, and has different modules, which represent each type of learning (supervised or unsupervised). For this tutorial, we will be working on the supervised learning module with a binary classification algorithm.

Classification Module

The PyCaret classification module (pycaret.classification) is a supervised machine learning module used to classify elements into a binary group based on various techniques and algorithms. Some common uses of classification problems include predicting client default (yes or no), client abandonment (client will leave or stay), disease encountered (positive or negative) and so on.

The PyCaret classification module can be used for binary or multi-class classification problems. It has more than 18 algorithms and 14 plots for analyzing model performance. Whether it’s hyper-parameter tuning, ensembling or advanced techniques such as stacking, PyCaret’s classification module has it all.


Image by Author

For this tutorial we will use an UCI data set called Default of Credit Card Clients Dataset. This data set contains information about default payments, demographics, credit data, payment history and billing statements of credit card customers in Taiwan from April 2005 to September 2005. There are 24,000 samples and 25 characteristics.

The dataset can be found here. Or here you’ll find a direct link to download.

So, download the dataset to your environment, and then we are going to load it like this

In [2]:

import pandas as pd
In [3]:

df = pd.read_csv('datasets/default of credit card clients.csv')
In [4]:

df.head()
Out[4]:


Image by Author

1- Get the data

We also have another way to load it. In fact this will be the default way we will be working with in this tutorial. It is directly from the PyCaret datasets, and it is the first method of our Pipeline


Image by Author

from pycaret.datasets import get_data
dataset = get_data('credit')

# check the shape of data
dataset.shape
In order to demonstrate the predict_model() function on unseen data, a sample of 1200 records from the original dataset has been retained for use in the predictions. This should not be confused with a train/test split, since this particular split is made to simulate a real-life scenario. Another way of thinking about this is that these 1200 records are not available at the time the ML experiment was performed.

In [7]:

## sample returns a random sample from an axis of the object. That would be 22,800 samples, not 24,000
data = dataset.sample(frac=0.95, random_state=786)
In [8]:

data

Image by Author

# we remove these sampled rows from the original dataset
data_unseen = dataset.drop(data.index)
In [10]:

data_unseen

## we reset the index of both datasets
data.reset_index(inplace=True, drop=True)
data_unseen.reset_index(inplace=True, drop=True)
print('Data for Modeling: ' + str(data.shape))
print('Unseen Data For Predictions: ' + str(data_unseen.shape))

Data for Modeling: (22800, 24)
Unseen Data For Predictions: (1200, 24)
Split data

The way we divide our data set is important, because some of the data will not be used during the modeling process and will only be used at the end to validate our results, simulating real data. The data we do use for modeling is then sub-divided in order to evaluate two scenarios, training and testing. Therefore, the following has been done


Image by Author

Unseen data set (also known as validation data set)

  • Is the data sample used to provide an unbiased assessment of a final model.
  • The validation data set provides the gold standard used to evaluate the model.
  • It is only used once the model is fully trained (using the training and test sets).
  • The validation set is generally what is used to evaluate the models of a competition (for example, in many Kaggle or DataSource.ai competitions, the test set is initially released along with the training and test set and the validation set is only released when the competition is about to close, and it is the result of the validation set model that decides the winner).
  • Many times the test set is used as the validation set, but it is not a good practice.
  • The validation set is generally well curated.
  • It contains carefully sampled data covering the various classes that the model would face, when used in the real world.
Training data set

  • Training data set: The data sample used to train the model.
  • The data set we use to train the model
  • The model sees and learns from this data.
Test data set

  • Test Data Set: The sample of data used to provide an unbiased evaluation of a model fitted on the training data set while tuning the model’s hyperparameters.
  • The assessment becomes more biased as the skill in the test data set is incorporated into the model configuration.
  • The test set is used to evaluate a given model, but this is for frequent evaluation.
  • We, as ML engineers, use this data to fine-tune the hyperparameters of the model.
  • Therefore, the model occasionally sees this data, but never “learns” from it.
  • We use the results of the test set, and update the higher level hyperparameters
  • So the test set impacts a model, but only indirectly.
  • The test set is also known as the Development set. This makes sense, since this dataset helps during the “development” stage of the model.
Confusion of terms

  • There is a tendency to mix up the name of test and validation.
  • Depending on the tutorial, the source, the book, the video or the teacher/mentor the terms are changed, the important thing is to keep the concept.
  • In our case we already separated the validation set at the beginning (1,200 samples of data_unseen)
2- Setting up the PyCaret environment


Image by Author

Now let’s set up the Pycaret environment. The setup() function initializes the environment in pycaret and creates the transformation pipeline to prepare the data for modeling and deployment. setup() must be called before executing any other function in pycaret. It takes two mandatory parameters: a pandas dataframe and the name of the target column. Most of this part of the configuration is done automatically, but some parameters can be set manually. For example:

  • The default division ratio is 70:30 (as we see in above paragraph), but can be changed with "train_size".
  • K-fold cross-validation is set to 10 by default
  • “session_id" is our classic "random_state"
In [12]:

## setting up the environment
from pycaret.classification import *
Note: after you run the following command, you must press enter to finish the process. We will explain why below. The setup process may take some time to complete.

In [13]:

model_setup = setup(data=data, target='default', session_id=123)

Image by Author

When you run setup(), PyCaret's inference algorithm will automatically deduce the data types of all features based on certain properties. The data types should be inferred correctly, but this is not always the case. To account for this, PyCaret displays a table containing the features and their inferred data types after setup() is executed. If all data types are correctly identified, you can press enter to continue or exit to end the experiment. We press enter, and we should get the same output shown above.

Ensuring that the data types are correct is critical in PyCaret, as it automatically performs some pre-processing tasks that are essential to any ML experiment. These tasks are performed differently for each type of data, which means that it is very important that they are correctly configured.

We could overwrite the type of data inferred from PyCaret using the numeric_features and categorical_features parameters in setup(). Once the setup has been successfully executed, the information grid containing several important pieces of information is printed. Most of the information is related to the pre-processing pipeline that is built when you run setup()
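As an illustration of that override (a sketch, not executed in this notebook), a setup() call that forces column types and changes the split could look something like the following; the column names follow the credit dataset but should be treated as placeholders:

model_setup = setup(data=data, target='default', session_id=123,
                    train_size=0.8,
                    numeric_features=['LIMIT_BAL'],
                    categorical_features=['SEX', 'EDUCATION'])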

Most of these features are out of scope for the purposes of this tutorial, however, some important things to keep in mind at this stage include

  • session_id : A pseudo-random number used as a seed in all functions for later reproducibility.
  • Target type : Binary or Multiclass. The target type is automatically detected and displayed.
  • Label encoded: When the Target variable is of type string (i.e. ‘Yes’ or ‘No’) instead of 1 or 0, it automatically codes the label at 1 and 0 and shows the mapping (0 : No, 1 : Yes) as reference
  • Original data : Displays the original shape of the data set. In this experiment it is (22800, 24), our “Data for Modeling” split
  • Missing values : When there are missing values in the original data this will be shown as True
  • Numerical features : The number of features inferred as numerical.
  • Categorical features : The number of features inferred as categorical
  • Transformed train sets: Note that the original form of (22800, 24) is transformed into (15959, 91) for the transformed train set and the number of features has increased from 24 to 91 due to the categorical coding
  • Transformed test set: There are 6,841 samples in the test set. This split is based on the default value of 70/30 which can be changed using the train_size parameter in the configuration.
Note how some tasks that are imperative to perform the modeling are handled automatically, such as imputation of missing values (in this case there are no missing values in the training data, but we still need imputers for the unseen data), categorical encoding, etc.

Most of the setup() parameters are optional and are used to customize the preprocessing pipeline.

3- Compare Models


Image by Author

In order to understand how PyCaret compares the models and the next steps in the pipeline, it is necessary to understand the concept of N-Fold Cross-Validation.

N-Fold Cross-Validation

Calculating how much of your data should be divided into your test set is a delicate question. If your training set is too small, your algorithm may not have enough data to learn effectively. On the other hand, if your test set is too small, then your accuracy, precision, recall and F1 score could have a large variation.

You may be very lucky or very unlucky! In general, putting 70% of your data in the training set and 30% of your data in the test set is a good starting point. Sometimes your data set is so small that dividing it 70/30 will result in a large amount of variance.

One solution to this is to perform N-fold cross-validation. The central idea here is that we are going to do this whole process N times and then average the results. For example, in a 10-fold cross-validation, we make the test set the first 10% of the data and calculate the accuracy, precision, recall and F1 score.

Then we make the test set the second 10% of the data and calculate these statistics again. We can do this process 10 times, and each time the test set will be a different piece of the data. Then we average all the accuracies, and we have a better idea of how our model performs on average.

Note: Validation Set (yellow here) is the Test Set in our case


Image by Author

Understanding the accuracy of your model is invaluable because you can start adjusting its parameters to increase performance. For example, in the K-Nearest Neighbors algorithm, you can see what happens to the accuracy as you increase or decrease K. Once you are satisfied with the performance of your model, it is time to bring in the validation set. This is the part of your data that you split off at the beginning of your experiment (data_unseen in our case).

It is meant to be a substitute for the real-world data that you are actually interested in classifying. It works very much like the test set, except that you never touched this data while building or refining your model. By computing the evaluation metrics on it, you get a good understanding of how well your algorithm will perform in the real world.

Comparing all models

Comparing all models to evaluate performance is the recommended starting point for modeling once the PyCaret setup() is completed (unless you know exactly what type of model is needed, which is often not the case), this function trains all models in the model library and scores them using a stratified cross-validation for the evaluation of the metrics.

The output prints a score grid that shows the average of the Accuracy, AUC, Recall, Precision, F1, Kappa, and MCC across the folds (10 by default) along with the training times. Let's do it!

In [14]:

best_model = compare_models()

Image by Author

The compare_models() function allows you to compare many models at once. This is one of the great advantages of using PyCaret. In one line, you have a comparison table between many models. Two simple words of code (not even one line) have trained and evaluated more than 15 models using the N-Fold cross-validation.

The above printed table highlights the highest performance metrics for comparison purposes only. The default table is sorted using “Accuracy” (highest to lowest) which can be changed by passing a parameter. For example, compare_models(sort = 'Recall') will sort the grid by Recall instead of Accuracy.

If you want to change the Fold parameter from the default value of 10 to a different value, you can use the fold parameter. For example compare_models(fold = 5) will compare all models in a 5-fold cross-validation. Reducing the number of folds will improve the training time.

By default, compare_models returns the best performing model based on the default sort order, but it can be used to return a list of the top N models using the n_select parameter. In addition, it returns some metrics such as accuracy, AUC and F1. Another cool thing is how the library automatically highlights the best results. Once you choose your model, you can create it and then refine it. Let's go with other methods.
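For example, a sketch of that idea (not run in this notebook) could keep the three best models ranked by AUC instead of Accuracy:

# keep the top 3 models, sorted by AUC
top3 = compare_models(n_select=3, sort='AUC')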

In [15]:

print(best_model)

RidgeClassifier(alpha=1.0, class_weight=None, copy_X=True, fit_intercept=True,
                max_iter=None, normalize=False, random_state=123, solver='auto',
                tol=0.001)
4- Create the Model


Image by Author

create_model is the most granular function in PyCaret and is often the basis for most of PyCaret's functionality. As its name indicates, this function trains and evaluates a model using cross-validation, which can be set with the fold parameter. The output prints a scoring table showing the Accuracy, AUC, Recall, Precision, F1, Kappa and MCC by fold.

For the rest of this tutorial, we will work with the following models as our candidate models. The selections are for illustrative purposes only and do not necessarily mean that they are the best performers or ideal for this type of data

  • Decision Tree Classifier (‘dt’)
  • K Neighbors Classifier (‘knn’)
  • Random Forest Classifier (‘rf’)
There are 18 classifiers available in the PyCaret model library. To see a list of all classifiers, check the documentation or use the models() function to view the library.

In [16]:

models()

Image by Author

dt = create_model('dt')

Image by Author

# trained model object is stored in the variable 'dt'
print(dt)

DecisionTreeClassifier(ccp_alpha=0.0, class_weight=None, criterion='gini',
                       max_depth=None, max_features=None, max_leaf_nodes=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, presort='deprecated',
                       random_state=123, splitter='best')
In [19]:

knn = create_model('knn')

Image by Author

print(knn)

KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
                     metric_params=None, n_jobs=-1, n_neighbors=5, p=2,
                     weights='uniform')
In [21]:

rf = create_model('rf')

Image by Author

print(rf)

RandomForestClassifier(bootstrap=True, ccp_alpha=0.0, class_weight=None,
                       criterion='gini', max_depth=None, max_features='auto',
                       max_leaf_nodes=None, max_samples=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, n_estimators=100,
                       n_jobs=-1, oob_score=False, random_state=123, verbose=0,
                       warm_start=False)
Note that the average score of all models matches the score printed on compare_models(). This is because the metrics printed in the compare_models() score grid are the average scores of all the folds.

You can also see in each print() of each model the hyperparameters with which they were built. This is very important because it is the basis for improving them. You can see the parameters for RandomForestClassifier

max_depth=None
max_features='auto'
min_samples_leaf=1
min_samples_split=2
min_weight_fraction_leaf=0.0
n_estimators=100
n_jobs=-1
5- Tuning the Model


Image by Author

When creating a model with the create_model() function, the default hyperparameters are used to train it. To tune the hyperparameters, the tune_model() function is used. This function automatically tunes the hyperparameters of a model using a Random Grid Search over a predefined search space.

The output prints a score grid showing the accuracy, AUC, Recall, Precision, F1, Kappa and MCC by Fold for the best model. To use a custom search grid, you can pass the custom_grid parameter in the tune_model function
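As an illustrative sketch (not executed here), a custom grid for our random forest could be passed like this; the grid values are placeholders, not tuned recommendations:

# hypothetical custom search space for the random forest
rf_grid = {'n_estimators': [100, 150, 200],
           'max_depth': [5, 10, None],
           'min_samples_leaf': [1, 5]}

tuned_rf_custom = tune_model(rf, custom_grid=rf_grid, optimize='AUC')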

In [23]:

tuned_rf = tune_model(rf)

Image by Author

If we compare the Accuracy metric of this tuned RandomForestClassifier with the previous one, we see a difference: it went from an Accuracy of 0.8199 to an Accuracy of 0.8203.

In [24]:

# tuned model object is stored in the variable 'tuned_rf'
print(tuned_rf)

RandomForestClassifier(bootstrap=False, ccp_alpha=0.0, class_weight={},
                       criterion='entropy', max_depth=5, max_features=1.0,
                       max_leaf_nodes=None, max_samples=None,
                       min_impurity_decrease=0.0002, min_impurity_split=None,
                       min_samples_leaf=5, min_samples_split=10,
                       min_weight_fraction_leaf=0.0, n_estimators=150,
                       n_jobs=-1, oob_score=False, random_state=123, verbose=0,
                       warm_start=False)
Let’s compare now the hyperparameters. We had these before.

max_depth=None
max_features='auto'
min_samples_leaf=1
min_samples_split=2
min_weight_fraction_leaf=0.0
n_estimators=100
n_jobs=-1
Now these:

max_depth=5
max_features=1.0
min_samples_leaf=5
min_samples_split=10
min_weight_fraction_leaf=0.0
n_estimators=150
n_jobs=-1
You can make this same comparison with knn and dt yourself, and explore the differences in the hyperparameters.

By default, tune_model optimizes Accuracy, but this can be changed using the optimize parameter. For example: tune_model(dt, optimize = 'AUC') will look for the hyperparameters of a Decision Tree Classifier that result in the highest AUC instead of Accuracy. For the purposes of this example, we have used the default metric, Accuracy, only for simplicity.

Generally, when the data set is unbalanced (like the credit data set we are working with), Accuracy is not a good metric to consider. The methodology for selecting the correct metric to evaluate a classifier is beyond the scope of this tutorial.

Metrics alone are not the only criteria you should consider when selecting the best model for production. Other factors to consider include training time, standard deviation of k-folds, etc. For now, let’s go ahead and consider the Random Forest Classifier tuned_rf, as our best model for the rest of this tutorial

6- Plotting the Model


Image by Author

Before finalizing the model (Step # 8), the plot_model() function can be used to analyze the performance through different aspects such as AUC, confusion_matrix, decision boundary etc. This function takes a trained model object and returns a graph based on the training/test set.

There are 15 different plots available, please refer to plot_model() documentation for a list of available plots.

In [25]:

## AUC Plot
plot_model(tuned_rf, plot = 'auc')

Image by Author

## Precision-recall curve

plot_model(tuned_rf, plot = 'pr')

Image by Author

## feature importance

plot_model(tuned_rf, plot='feature')

Image by Author

## Confusion matrix

plot_model(tuned_rf, plot = 'confusion_matrix')

Image by Author

7- Evaluating the model


Image by Author

Another way to analyze model performance is to use the evaluate_model() function which displays a user interface for all available graphics for a given model. Internally it uses the plot_model() function.

In [29]:

evaluate_model(tuned_rf)
8- Finalizing the Model


Image by Author

The finalization of the model is the last step of the experiment. A normal machine learning workflow in PyCaret starts with setup(), followed by a comparison of all models using compare_models() and the pre-selection of some candidate models (based on the metric of interest) to apply various modeling techniques, such as hyperparameter tuning, ensembling, stacking, etc.

This workflow will eventually lead you to the best model to use for making predictions on new and unseen data. The finalize_model() function fits the model to the complete data set, including the test sample (30% in this case). The purpose of this function is to train the model on the complete data set before it is deployed into production. We can execute this method before or after predict_model(); in this tutorial we execute finalize_model() first.

One last word of caution. Once the model is finalized using finalize_model(), the entire data set, including the test set, is used for training. Therefore, if the model is used to make predictions about the test set after finalize_model() is used, the printed information grid will be misleading since it is trying to make predictions about the same data that was used for the modeling.

To demonstrate this point, we will use final_rf in predict_model() to compare the information grid with the previous.

In [30]:

final_rf = finalize_model(tuned_rf)
In [31]:

# Final Random Forest model parameters for deployment
print(final_rf)

RandomForestClassifier(bootstrap=False, ccp_alpha=0.0, class_weight={},
                       criterion='entropy', max_depth=5, max_features=1.0,
                       max_leaf_nodes=None, max_samples=None,
                       min_impurity_decrease=0.0002, min_impurity_split=None,
                       min_samples_leaf=5, min_samples_split=10,
                       min_weight_fraction_leaf=0.0, n_estimators=150,
                       n_jobs=-1, oob_score=False, random_state=123, verbose=0,
                       warm_start=False)
9- Predicting with the model


Image by Author

Before relying on the finalized model, it is advisable to perform a final check by predicting the test/hold-out set and reviewing the evaluation metrics. If you look at the setup information grid, you will see that 30% (6,841 samples) of the data were separated as the test/hold-out set.

All of the evaluation metrics we have seen above are cross-validated results based on the training set (70%) only. Now, using our finalized model stored in the final_rf variable, we predict against the test sample and evaluate the metrics to see if they are materially different from the CV results

In [32]:

predict_model(final_rf)

Image by Author

The accuracy of the test set is 0.8199 compared to the 0.8203 achieved in the results of the tuned_rf. This is not a significant difference. If there is a large variation between the results of the test set and the training set, this would normally indicate an over-fitting, but it could also be due to several other factors and would require further investigation.

In this case, we will proceed with the completion of the model and the prediction on unseen data (the 5% that we had separated at the beginning and that was never exposed to PyCaret).

(TIP: It is always good to look at the standard deviation of the results of the training set when using create_model().)

The predict_model() function is also used to predict about the unseen data set. The only difference is that this time we will pass the parameter data_unseen. data_unseen is the variable created at the beginning of the tutorial and contains 5% (1200 samples) of the original data set that was never exposed to PyCaret.

In [33]:

unseen_predictions = predict_model(final_rf, data=data_unseen)
unseen_predictions.head()

Image by Author

Please go to the last column of this previous result, and you will see a new feature called Score


Image by Author

Label is the prediction and score is the probability of the prediction. Note that the predicted results are concatenated with the original data set, while all transformations are automatically performed in the background.

We have finished the experiment by finalizing the tuned_rf model, which is now stored in the final_rf variable. We have also used the model stored in final_rf to predict data_unseen. This brings us to the end of our experiment, but one question remains: what happens when you have more new data to predict? Do you have to go through the whole experiment again? The answer is no. PyCaret's built-in save_model() function allows you to save the model along with the entire transformation pipeline for later use; it is stored as a pickle file in the local environment.

(TIP: It’s always good to use the date in the file name when saving models, it’s good for version control)
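A small sketch of that tip (an addition for illustration, not part of the original notebook), building the file name from today's date with the datetime module we covered earlier:

from datetime import date

# e.g. 'datasets/Final RF Model 2020-11-19'
save_model(final_rf, 'datasets/Final RF Model ' + date.today().isoformat())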

Let’s see it in the next step

10- Save/Load Model for Production


Image by Author

Save Model

In [35]:

save_model(final_rf, 'datasets/Final RF Model 19Nov2020')

Transformation Pipeline and Model Succesfully Saved
Out[35]:

(Pipeline(memory=None,
          steps=[('dtypes',
                  DataTypes_Auto_infer(categorical_features=[],
                                       display_types=True, features_todrop=[],
                                       id_columns=[],
                                       ml_usecase='classification',
                                       numerical_features=[], target='default',
                                       time_features=[])),
                 ('imputer',
                  Simple_Imputer(categorical_strategy='not_available',
                                 fill_value_categorical=None,
                                 fill_value_numerical=None,
                                 numeric_stra...
                  RandomForestClassifier(bootstrap=False, ccp_alpha=0.0,
                                         class_weight={}, criterion='entropy',
                                         max_depth=5, max_features=1.0,
                                         max_leaf_nodes=None, max_samples=None,
                                         min_impurity_decrease=0.0002,
                                         min_impurity_split=None,
                                         min_samples_leaf=5,
                                         min_samples_split=10,
                                         min_weight_fraction_leaf=0.0,
                                         n_estimators=150, n_jobs=-1,
                                         oob_score=False, random_state=123,
                                         verbose=0, warm_start=False)]],
          verbose=False),
 'datasets/Final RF Model 19Nov2020.pkl')
Load Model

To load a model saved at a future date in the same or an alternative environment, we would use PyCaret’s load_model() function and then easily apply the saved model to new unseen data for the prediction

In [37]:

saved_final_rf = load_model('datasets/Final RF Model 19Nov2020')

Transformation Pipeline and Model Successfully Loaded
Once the model is loaded into the environment, it can simply be used to predict any new data using the same predict_model() function. Next we have applied the loaded model to predict the same data_unseen we used before.

In [38]:

new_prediction = predict_model(saved_final_rf, data=data_unseen)
In [39]:

new_prediction.head()
Out[39]:


Image by Author

from pycaret.utils import check_metric
check_metric(new_prediction.default, new_prediction.Label, 'Accuracy')
Out[41]:

0.8167
Pros & Cons

As with any new library, there is still room for improvement. We'll list some of the pros and cons we found while using the library.

Pros:

  • It makes the modeling part of your project much easier.
  • You can create many different analyses with just one line of code.
  • Forget about passing a list of parameters when fitting the model. PyCaret does it automatically for you.
  • You have many different options to evaluate the model, again, with just one line of code
  • Since it is built on top of famous ML libraries, you can easily compare it with your traditional method
Cons:

  • The library is in its early versions, so it is not fully mature yet and is susceptible to bugs. Not a big deal, to be honest.
  • As with all AutoML libraries, it's a black box, so you can't really see what's going on inside it. Therefore, I would not recommend it for beginners.
  • It might make the learning process a bit superficial.
Conclusions

This tutorial has covered the entire ML process, from data ingestion, pre-processing, model training, hyper-parameter fitting, predicting and storing the model for later use. We have completed all these steps in less than 10 commands that are naturally constructed and very intuitive to remember, such as create_model(), tune_model(), compare_models(). Recreating the whole experiment without PyCaret would have required more than 100 lines of code in most of the libraries.

The library also allows you to do more advanced things, such as advanced pre-processing, ensembling, generalized stacking, and other techniques that allow you to fully customize the ML pipeline and are a must for any data scientist.

I hope you enjoyed this reading! You can follow me on Twitter or LinkedIn.







2020-12-18

The 3 Basic Principles of a Data-Driven Company

The data is sometimes called the "new oil," a newly discovered source of wealth that is extracted from the depths of corporate and government archives. Some accountants are so excited about the potential value of the data that they count it in the same way as a physical asset.

While it is true that data can enhance an organization's value, this resource has no intrinsic value. Like oil, data needs to be extracted and refined with the right quality. Data needs to be transported across information networks before it can be used to create new value. The value of data is not in the information itself, but in the transformations it undergoes.

The analogy between data and oil is only partially correct in that data is an infinite resource. The same data can be used many times for a sometimes originally undesired purpose.

The ability to use data for more than one purpose is one of the reasons data science has gained popularity around the table. Senior managers are looking for ways to extract value from so-called "dark data". Data scientists use these forgotten data sources to create new knowledge, make better decisions, and generate innovation.


The question that arises from this introduction is how to manage and analyze the data so that it becomes a valuable resource. We will present a normative model for creating value from data using three basic principles derived from the architecture. 

This model is useful for data scientists as an internal check to ensure that their activities maximize value. Managers can use this model to evaluate the results of a data science project without having to understand the mathematical complexities of data science.

The 3 Basic Principles of a Data-Driven Company

Although data science is a quintessential 21st century activity, to define good data science, we can draw inspiration from a Roman architect and engineer who lived two thousand years ago.

Vitruvius is immortalized through his book "On Architecture", which inspired Leonardo da Vinci to draw his famous Vitruvian Man. Vitruvius wrote that an ideal building must exhibit three qualities: utilitas, firmitas and venustas, that is, utility, solidity and beauty.

Buildings must be useful so that they can be used for their purpose. A house must be functional and comfortable; a theater must be designed so that everyone can see the stage. Each type of building has its own functional requirements.

Secondly, buildings must be solid in the sense that they are firm enough to withstand the forces acting on them. Finally, the buildings must be aesthetic. In Vitruvius' words, buildings must resemble Venus, the Roman goddess of beauty and seduction.

Vitruvius' rules for architecture can also be applied to data science products (Lankow, J., Ritchie, J., & Crooks, R. (2012). Infographics: The Power of Visual Storytelling. Hoboken, NJ: John Wiley & Sons, Inc.).

Data science needs to have utility: it must be useful to create value. The analysis must be solid so that it can be trusted. And data science products must be aesthetic, in order to maximize the value they provide to an organization.

1- Utility in Data Science



How do we know something is useful? The simple, but not very enlightening answer is that when something is useful, it is useful. Some philosophers interpret usefulness as the ability to provide the greatest good for the greatest number of people. This definition is quite convincing, but it requires some contextualization. What is right in one situation may not be as beneficial in another.

For a data science strategy to be successful, it must facilitate organizational goals. Data scientists are opportunistic in the approach they use to solve problems. This insight implies that the same data can be used for different issues, depending on the perspective taken on the available information and the problem at hand.

After digesting a research report or viewing a visualization, managers should ask themselves, "What am I doing differently today as a result?" The usefulness of data science depends on the ability of the results to positively influence reality for practitioners. In other words, the outcome of data science should reassure management that objectives have been met, or provide practical ideas for solving existing problems or preventing future ones.

2- Strength in Data Science


Just as a building must be solid and not collapse, a data product must be solid in order to create business value. Robustness is where science and data meet. 

The robustness of a data product is defined by the validity and reliability of the analysis, which are well-established scientific principles. The robustness of data science also requires that the results be reproducible. Finally, the data, and the process of creating data products, must be governed to ensure beneficial results.

The difference between traditional forms of business analysis and data science is the systematic approach to problem solving. The key word in the term data science is therefore not data, but science. Data science is only useful when the data answer a useful question, which is the scientific part of the process. 

This systematic approach ensures that the results of data science are reliable for deciding alternative courses of action. Systematic data science uses the principles of scientific research, but its approach is more pragmatic. 

While scientists seek general truths to explain the world, data scientists seek to pragmatically solve problems. The basic principles behind this methodical approach are the validity, reliability, and reproducibility of data, methods, and results.

3- Aesthetics in Data Science


Vitruvius insisted that buildings, or any other structures, must be beautiful. The aesthetics of a building do more than just create a pleasant feeling. Architecturally designed places stimulate our thinking, increase our well-being, improve our productivity, and spark creativity.

While it is clear that buildings should be pleasing to the eye, the aesthetics of data products may not be so obvious. The requirement that data science be aesthetic is not a call for embellishment, or for hiding the ugly details of the results.

The process of cleaning and analyzing the data is inherently complex. Presenting the results of this process is a form of storytelling that reduces this complexity to ensure that a data product is understandable.

The value chain of data science begins with reality, as described by the data. This data is converted into knowledge, which managers use to influence reality to achieve their goals. This chain that goes from reality to human knowledge contains four transformations, each with opportunities for loss of validity and reliability.

The last step in the value chain requires the user of data science results to interpret the information and draw the right conclusion about their future course of action. Reproducibility is one of the tools for minimizing the possibility of misinterpreting analyses. Another mechanism to ensure proper interpretation is to produce aesthetic data science.

Aesthetics in data science is about creating a data product, which can be a visualization or a report, designed to enable the user to draw the right conclusions. A messy graph or an incomprehensible report limits the value that can be extracted from the information.

We hope to talk about each of these in more detail in an upcoming post!

Read Here






2020-12-18

How to Make Your Company a Data-driven Organization?

Startup, SMB, and enterprise founders, managers, and decision-makers often claim that they are "data rich but information poor". This statement is in many cases only partially correct, because it hides a misconception about the data life cycle. Being data-rich but information-poor suggests that previously untapped data sources are waiting to be exploited and used.

It is very unlikely that any organization will collect data without a particular purpose. In most cases, data is collected to manage operational processes. Collecting data without a particular purpose is a waste of resources. In many companies, once data is used, it is stored and becomes "dark data". 

Because almost all operational processes are recorded electronically, data is now everywhere. Managers rightly ask themselves what to do with this information after it has been archived. A strategic approach to data science would help an organization unravel the untapped value of these data stores to better understand their strategic and operational context.

The evolution to become a data-based organization begins with the collection of data generated during operational processes. The next step is to describe this data through exploratory techniques such as visualizations and statistics, which is the domain of traditional business reporting (or what is today called Business Intelligence) and provides insights for decision making.

Once the data has been explored and understood, organizations can diagnose business processes to understand the causal and logical relationships between variables. The penultimate phase consists of using knowledge of the past and its causal and logical connections to predict possible futures and build the desired future. The final stage of the data science journey is a situation where data is used to prescribe everyday operations.

This process is not a one-way, left-to-right trip that ends in a place where algorithms control our destiny and everything else becomes less critical. The process guides the data science strategy, but it also forms a strict hierarchy: each level depends on the ones below it.

Before algorithms can decide anything independently, you need to be able to predict the immediate future. To predict the future, you need a good grasp of descriptive statistics in order to diagnose a business process. Finally, the Garbage-In-Garbage-Out (GIGO) principle means that any analysis is only possible if we understand the data we have collected.


Towards a Data-driven Organization

The process we have just seen provides a strategic map for organizations trying to be more data-driven. Each step in the process is equally important for the next level because these higher levels of complexity cannot be achieved without embracing the lower levels. The most important aspect of the data science process is that it outlines an evolutionary approach to becoming a data-driven organization. 

As an organization evolves into more complex forms of data science, the early stages do not become vestigial appendages, but remain an integral part of the data science strategy. All parts of this model have the same relative value.

However, being driven by data is more than a process of increasing complexity. Evidence-based management requires people within the organization to be knowledgeable about data and to work together toward a common goal. 

The systematic aspect of data science requires a formalized process to ensure robust results. The increasing complexity of analytical methods also requires investment in better tools and data infrastructure. 

There are many technical aspects to consider when applying data science in an organization. However, simply focusing on the technicalities of data analysis is not enough to create value for an organization. A data science manager needs to manage people, systems, and processes to develop a data-based organization.

Decision makers sometimes ignore even the most useful and aesthetic visualizations, even when the analysis is sound. Data science, using best practices, is only the starting point for creating a value-driven organization. A critical aspect of ensuring that managers use results is to foster a data-based culture, which requires managing people.

To allow data science to flourish, the organization needs to have a well-established set of computer systems to store and analyze data and present results. A wide range of data science tools are available, each of which plays a different role in the analysis value chain.

Each data science project begins with a problem definition that is translated into data and code to define a solution. This problem is injected into the data vortex until a solution is found. The data science process discusses the workflow of creating "data products".

The three aspects of becoming a data-based organization and strategically implementing data science require alignment:

  1. People
  2. Systems and
  3. Processes 


All of this in order to optimize the value that can be extracted from the available information.


1- People


When talking about the people in a data-based organization, we should not only mention the specialists who create the data products. The members of the data science team possess the competencies shown in the following diagram

Source here


These clearly technical people must be able to communicate the results of their work to their colleagues or clients and convince them to apply the findings.

Data science does not occur exclusively within the specialized team. Every data project has an internal or external client with a problem that needs an answer. The data science team and the users of their products work together to improve the organization.

This implies that a data scientist needs to understand the basic principles of organizational behavior and change management, and be a good communicator. In turn, the recipients of data science need to have sufficient data literacy to understand how to interpret and use the results.

2- Systems  


Like any other profession, a data scientist needs an appropriate set of tools to create value from the data. There are a number of data science solutions available on the market, many of which are open source software. There are specialized tools for every aspect of the data science workflow.

There is no need to discuss the multitude of packages that are available. Many excellent websites examine the various offerings. We will instead offer some thoughts on the use of Excel (or any other spreadsheet) versus code writing and business intelligence platforms.

Spreadsheets are a versatile data analysis tool that has proliferated into almost every aspect of business. However, this universal tool is not well suited to complex and sophisticated data science. One of the perceived advantages of spreadsheets is that they contain the data, the code, and the output in one convenient file. This convenience comes at a price, as it reduces the robustness of the analysis.

Anyone who has ever had the displeasure of reverse engineering a spreadsheet will understand its limitations. In spreadsheets, it is not immediately clear which cell is the result of which other cell, or what the original data is.

Many organizations use spreadsheets as the only source of truth for business data, which should be avoided if information needs to be shared. The best practice of data science is to separate the data, the code and the result.

As mentioned above, the best way to create unicorns in data science is to teach experts in the field how to write analytical code. Writing code in R or Python is like writing an instruction manual on how to analyze data: anyone who understands the language will be able to see how the conclusions were derived.

Modern data science languages can generate print-quality visualizations and can produce results in many formats, including a spreadsheet or standalone application.

The gold standard for programming in data science is well-documented, literate programming. This technique combines code with explanatory text (put simply, comments written alongside the code that explain how it was built and what it does), so that the algorithm can be fully understood. All programming languages include the ability to add such comments.

Each language has its own methods for combining text with code. RMarkdown, Jupyter Notebooks, and Org Mode are popular systems for data analysis that make these annotated analyses easy to follow (they can even be published as a blog post). Once the code is written, at the touch of a button the machine generates a new report with updated statistics and graphics.
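As a tiny illustration of this style, the snippet below (file and column names are made up) mixes comments and code so the analysis reads like an instruction manual:

import pandas as pd

# Load the monthly sales extract (the path is illustrative)
sales = pd.read_csv('data/sales_2020.csv', parse_dates=['order_date'])

# Keep only completed orders, since refunds would distort the revenue figures
completed = sales[sales['status'] == 'completed']

# Aggregate revenue per month to feed the report
monthly_revenue = completed.groupby(completed['order_date'].dt.to_period('M'))['amount'].sum()
print(monthly_revenue)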

Finally, Business Intelligence tools are useful for disseminating the results of a data science project, but they are not very useful for detailed analysis. A platform like Power BI is a great system for visualizing the result of an analysis because it provides very flexible ways to slice and dice the data and visualize the results. The platform's analytical capabilities are not very deep, but they can be extended by embedding Python or R code.

Also read: Why are data science competitions important for startups?

3- Processes


The process of creating value from data follows an iterative workflow that goes from raw data to a finished product.

The workflow starts with the definition of a problem that needs to be solved as shown in the figure below. The next step involves loading and transforming the data into a format suitable for the required analysis. The data science workflow contains a loop consisting of exploration, modeling, and reflection, which is repeated until the problem is solved or shown to be unsolvable.

Source

The workflow of a data project is independent of which part of the data science continuum is being considered. The same principles apply to all types of analysis. For larger projects, formal project management methods are advisable to control time, budget, and quality.

Conclusion


As we can see, organizations must make a conscious, informed, and organized effort to become a data-driven company, which means they will finally be making decisions without relying on whims, egos, or other personal traits. Instead, strategic, value-generating decisions are based on past data, analyzed in the present, with the aim of predicting an outcome that favors the continuity and competitive advantage of the company in the market.

On the other hand, we see that the company must change certain internal processes and have the right staff in order to take a sophisticated and accurate approach to data. If you want more information about how to frame a data science project, we can help you here.

We hope you enjoyed reading!

Read Here






2020-12-18

Interview with the winners of the data science competition "Real Estate Price Forecast"

Learn how they built their machine learning models and what tools they used in this interview with the top 10 on the competition leaderboard.

A few days ago we wrapped up the data science competition called "Real Estate Price Forecast", in which 139 data scientists took part. 51 of them submitted at least 1 machine learning model to the platform, and we received and evaluated a total of 831 models, an average of 16 models per active participant. From this we can draw a clear conclusion: you need to build several different models, evaluate their effectiveness, and find the best result.

Since the metric was an error measure, the minimum (and winning) score was 0.248616099466774. Only two people managed a score in the 0.24 range; let's look at the ranges:

Score ranges
  • 0.24 = 2 competitors
  • 0.25 = 7 competitors
  • 0.26 = 9 competitors

Due to these good results, we wanted to know in detail what the competitors who finished in the top places did. So here are the questions and answers of our winners.


Tomás Ertola - Argentina - Second Place


Q: In general terms, how did you address the problem raised in the competition?
A: The pipeline I followed was very basic, EDA, Distribution Transformation and application of different models.

Q: For this particular competition, did you have any previous experience in this field? 
A: When I did a DS bootcamp in August 2019, the Properati dataset was used to practice data cleaning. I had never built a similar model, so I took it as a challenge.

Q: What important results/conclusions did you find in your exploration of the data? What challenges did you have to deal with?
A: The most important fact, the one that makes your model stand out, is that the distributions of the three cardinal (numeric) variables are skewed, so they don't follow a normal distribution. Once you realize this, you might want to check which distribution they most resemble and apply a correction. In my case I applied a logarithmic transformation.
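For readers who want to see what that kind of correction looks like, here is a minimal sketch (not the competitor's actual code; df and the column names are assumptions):

import numpy as np

# df is assumed to be the training DataFrame; these skewed columns are hypothetical
skewed_cols = ['price', 'surface_total', 'rooms']

# log1p handles zeros safely and compresses long tails toward a more symmetric shape
for col in skewed_cols:
    df[f'{col}_log'] = np.log1p(df[col])

# Remember to invert the transform on predictions, e.g. np.expm1(y_pred)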

Q: In general terms, what data processing and feature engineering did you do for this competition?
A: I took only 3 categorical variables: Country, City, and Department, applied one-hot encoding to them, and forgot about it. The hard work was related to correcting the distribution of the cardinal variables.

Q: What Machine Learning algorithms did you use for the competition? 
A: First I built a regressor stack with L1 and L2 regularization (Lasso and Ridge), and then an ensemble with Gradient Boosting, XGBoost, CatBoost, LightGBM, and Random Forest.
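A rough sketch of that idea with scikit-learn's stacking and voting utilities (this is not the competitor's exact code; it assumes xgboost, lightgbm, and catboost are installed and that X_train, y_train, X_test already exist):

from sklearn.ensemble import (
    StackingRegressor, VotingRegressor, GradientBoostingRegressor, RandomForestRegressor
)
from sklearn.linear_model import Lasso, Ridge
from xgboost import XGBRegressor
from lightgbm import LGBMRegressor
from catboost import CatBoostRegressor

# Stack of L1/L2 regularized linear models with a gradient boosting meta-learner
stack = StackingRegressor(
    estimators=[('lasso', Lasso()), ('ridge', Ridge())],
    final_estimator=GradientBoostingRegressor(),
)

# Simple averaging ensemble of the boosting family plus a random forest
ensemble = VotingRegressor(
    estimators=[
        ('stack', stack),
        ('xgb', XGBRegressor()),
        ('lgbm', LGBMRegressor()),
        ('cat', CatBoostRegressor(verbose=0)),
        ('rf', RandomForestRegressor()),
    ]
)

# ensemble.fit(X_train, y_train); predictions = ensemble.predict(X_test)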

Q: What was the Machine Learning algorithm that gave you the best score and why do you think it worked better than the others? 
A: Within the ensemble, the models that performed best were the gradient boosting ones, and on top of that the ensembles corrected the defects each individual model had, which made them an optimal solution. I don't believe it is by chance that they worked so well: gradient boosting methods have been winning competitions for years, and the foundations these algorithms are built on already make them stand out among the most common ones.

Q: What libraries did you use for this particular competition?
A: Sklearn, scipy, numpy, catboost, xgboost, lightgbm

Q: How many years of experience do you have in Data Science and where do you currently work?
A: I have 1 year of self-taught data science experience, and I currently work as a Data Analyst for the Government of the City of Buenos Aires.

Q: What advice would you give to those who did not have such good scores in the competition? 
A: The important thing is not the score, but understanding what you are doing. When you start to understand more about the models and the things you do, the results improve, but the truth is that understanding is more valuable than the score.

Pablo Neira Vergara - Chile - Third Place


Q: In general terms, how did you address the problem raised in the competition?
A: The main thing is to understand the problem well and think about what new variables could be created. I don't mean just transformations to normalize or standardize, or creating dummies; I'm talking about things like realizing that the number of times a city is repeated in the training set can tell us something about population density, or that some types of clustering using only certain variables can give us more information than those same variables separately.

Q: For this particular competition, did you have any previous experience in this field? 
A: None

Q: What important results/conclusions did you find in exploring the data? What challenges did you have to deal with?
A: The city seemed to be an excellent categorical variable; however, some cities in the test set were not present in the training set, so I had to think about how to compensate for that lack of information. Also, some properties in the training set, even though they belonged to the same city and had the same number of rooms and square meters, varied considerably in price.

Q: In general terms, what data processing and feature engineering did you do for this competition?
A: I created dummies for virtually all the variables that looked categorical, then applied StandardScaler, plus a k-means together with grid search to determine which groupings could provide more information to a base model. Among other things, of course.

Q: What Machine Learning algorithms did you use for the competition? 
A: I tried many algorithms and ensembles, but it turned out that the best results were obtained using k-means together with a grid-search-optimized GBR.
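A minimal sketch of that combination (illustrative only; X_train, y_train, and X_test are assumed to be already-scaled numeric arrays):

import numpy as np
from sklearn.cluster import KMeans
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import GridSearchCV

# Add the k-means cluster assignment as an extra feature
kmeans = KMeans(n_clusters=8, random_state=0).fit(X_train)
X_train_ext = np.column_stack([X_train, kmeans.predict(X_train)])
X_test_ext = np.column_stack([X_test, kmeans.predict(X_test)])

# Grid-search a gradient boosting regressor over a small hyperparameter grid
param_grid = {'n_estimators': [200, 500], 'learning_rate': [0.05, 0.1], 'max_depth': [3, 5]}
search = GridSearchCV(GradientBoostingRegressor(), param_grid, cv=5,
                      scoring='neg_mean_squared_error')
search.fit(X_train_ext, y_train)
predictions = search.predict(X_test_ext)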

Q: What was the Machine Learning algorithm that gave you the best score and why do you think it worked better than the others? 
A: I think I made good use of the information available, plus I ended up building an ensemble with the data that was exactly the same from test to training with the model. If it looks like a duck, quacks like a duck, and flies like a duck, the most sensible thing is to assume it's a duck; of course sometimes it can be a goose, hence the need for the ensemble.

Q: What libraries did you use for this particular competition?
A: The usual: pandas, numpy, sklearn, seaborn, matplotlib and lightgbm

Q: How many years of experience do you have in Data Science and where do you currently work?
A: I have been in the area for a little over 4 years. I currently work for the Logistics Observatory of the Ministry of Transport and Telecommunications of the Chilean Government

Q: What advice would you give to those who did not score as well in the competition? 
A: Think about the problem to be solved before going to sleep, look for information about the possible suitable algorithms, and, if possible, the state of the art for the particular problem itself.


Cesar Gustavo Seminario Calle - Perú - Sixth Place


Q: In general terms, how did you address the problem raised in the competition?
A: I focused on testing a variety of features based on target statistics by province, city, and country, and on setting up my validation scheme with the training data, to compare results before making submissions.

Q: For this particular competition, did you have any previous experience in this field? 
A: Yes, I have worked on time series models.

Q: What important results/conclusions did you find in exploring the data? What challenges did you have to deal with?
A: The total area was a very important variable in predicting the price. Some cities had very high prices, probably because they were condominiums, and for prices above 100,000 the relationship was almost linear, while for lower values it seemed to be exponential.

Q: In general terms, what data processing and feature engineering did you do for this competition?
A: I applied target encoding using the price, number of rooms, and total surface area, separately for both countries, and then a clustering of the apartments using the built features.
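For illustration, here is a small sketch of per-country target encoding with pandas (not the competitor's actual code; train_df, test_df, and the column names are assumptions):

# train_df and test_df are assumed to be pandas DataFrames loaded elsewhere
def target_encode(train, test, col, target):
    # Mean of the target per category, computed on the training split only
    means = train.groupby(col)[target].mean()
    fallback = train[target].mean()  # for categories unseen in training
    return train[col].map(means), test[col].map(means).fillna(fallback)

# Applied separately per country, as described above
for country in train_df['country'].unique():
    tr_mask = train_df['country'] == country
    te_mask = test_df['country'] == country
    tr_enc, te_enc = target_encode(train_df[tr_mask], test_df[te_mask], 'city', 'price')
    train_df.loc[tr_mask, 'city_te'] = tr_enc
    test_df.loc[te_mask, 'city_te'] = te_enc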

Q: What Machine Learning algorithms did you use for the competition? 
A: Simple neural network models and tree based models. Model Stacking.

Q: What was the Machine Learning algorithm that gave you the best score and why do you think it worked better than the others? 
A: Extreme gradient boosting, because of the regularization parameters (depth, learning rate, column sampling) and the boosting technique

Q: What libraries did you use for this particular competition?
A: scikit-learn, keras, pandas, seaborn, mlflow

Q: How many years of experience do you have in Data Science and where do you currently work?
A: 2 years of experience, currently working at Voxiva.ai

Q: What advice would you give to those who did not score as well in the competition? 
A: Maintaining an orderly and simple validation scheme allows you to test many hypotheses and get results without repeating work.


Federico Gutiérrez - Colombia - Seventh Place


Q: In general terms, how did you address the problem raised in the competition?
A: Since we were processing information on two different countries, I decided that a good strategy would be to divide all the information by country. The real estate markets in Colombia and Argentina are very different and each behaves in its own particular way, so I don't think it is a good idea to create a single model for both markets simultaneously.

Q: For this particular competition, did you have any previous experience in this field? 
A: Although I had never created models to make this type of forecast, I did have experience and general understanding of the real estate market and property evaluation. I gained this experience by working for an insurance company.

Q: What important results/conclusions did you find in exploring the data? What challenges did you have to deal with?
A: This competition presented several challenges, including cleaning and debugging the dataset. To do a good cleanup I had to make several assumptions; among other things, I had to manually correct several prices that were far from realistic values. For this I simply used common sense and basic knowledge of the industry.

Q: In general terms, what data processing and feature engineering did you do for this competition?
A: One of the factors that helped me the most was splitting the dataset by country; this made it easier for the models to get closer to reality. For the Argentine market, I discovered that the presence of an extra guest bathroom directly impacts the final price of the property, so I decided to compute this new variable and include it in my analysis.

Q: What Machine Learning algorithms did you use for the competition? 
A: I used different algorithms including: linear regression, random forest regression, gradient boosting regression, and XGBoost regression.

Q: What was the Machine Learning algorithm that gave you the best score and why do you think it worked better than the others? 
A: The best algorithm was XGBoost. I think this is because it includes internal regularization, which helps reduce overfitting.

Q: What libraries did you use for this particular competition?
A: Pandas, Numpy, Seaborn, Matplotlib, Scipy, Xgboost and Scikit Learn.

Q: How many years of experience do you have in Data Science and where do you currently work?
A: 2 years, I work at The Clay Project

Q: What advice would you give to those who did not score as well in the competition? 
A: I think that for this competition it is worth focusing on understanding very well how the real estate sector works and which variables are critical. I think that if you understand the sector and the context of the problem well, you will be able to include only the relevant variables in your algorithms and thus obtain better results.


Germán Goñi - Chile - Eighth Place


Q: In general terms, how did you address the problem raised in the competition?
A: In my experience, the real estate market tends to be different in each country. It is important to have a solution that is neither generic nor prone to overfitting.
It is also important to use feature transformations when dealing with variables with asymmetric distributions.

Q: For this particular competition, did you have any previous experience in this field? 
A: Yes, but with data from another country.

Q: What important results/conclusions did you find in exploring the data? What challenges did you have to deal with?
A: The test set contained provinces/cities not observed in the training set: watch out for "naive" models.

Q: In general terms, what data processing and feature engineering did you do for this competition?
A: Box-Cox transformations, Different types of encoding for qualitative variables
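For readers who have not used it, a Box-Cox transformation with SciPy looks roughly like this (df and the 'price' column are assumptions, not the competitor's code):

from scipy import stats

# Box-Cox requires strictly positive values; lambda is estimated by maximum likelihood
transformed, fitted_lambda = stats.boxcox(df['price'].values)
df['price_boxcox'] = transformed

# To invert later: scipy.special.inv_boxcox(transformed, fitted_lambda)
print(f"Estimated lambda: {fitted_lambda:.3f}")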

Q: What Machine Learning algorithms did you use for the competition? 
A: Random Forest, Catboost

Q: What was the Machine Learning algorithm that gave you the best score and why do you think it worked better than the others? 
A: Catboost

Q: What libraries did you use for this particular competition?
A: Pandas, Numpy, Matplotlib, Seaborn

Q: How many years of experience do you have in Data Science and where do you currently work?
A: 5

Q: What advice would you give to those who did not score as well in the competition? 
A: Consult domain experts, in this case real estate professionals. Review the literature and tutorials on feature engineering.


Alejandro Anachuri - Argentina - Ninth Place


Q: In general terms, how did you address the problem raised in the competition?
A: I started by analyzing the dataset: seeing what types of data it had, verifying whether there were any null values, and then looking at graphs to see if there was any correlation between the variables; basically an EDA process.

Q: For this particular competition, did you have any previous experience in this field? 
A: Only experience in similar problems raised in some book or course I was following as a learning experience.

Q: What important results/conclusions did you find in exploring the data? What challenges did you have to deal with?
A: The most important was finding the relationship between price and total surface area; by taking the log of the price, this relationship became much clearer, which helped me improve the results of the models.

Q: In general terms, what data processing and feature engineering did you do for this competition?
A: label encoder, one hot encoder, data transformation.

Q: What Machine Learning algorithms did you use for the competition? 
A: Random forest

Q: What was the Machine Learning algorithm that gave you the best score and why do you think it worked better than the others? 
A: I only used Random Forest, which is the model I was studying, and I wanted to use this competition to try to understand it more deeply.

Q: What libraries did you use for this particular competition?
A: I mainly used pandas, matplotlib, numpy, sklearn, seaborn

Q: How many years of experience do you have in Data Science and where do you currently work?
A: I have no real experience in data science; I have just been teaching myself for the past couple of months. I am a systems engineer and I work in software development at a multinational company.

Q: What advice would you give to those who did not score as well in the competition? 
A: Keep trying and testing different models, or tuning the same models; consult with the people in the group, and in this way the community can also grow and be nourished by those who know best.



Conclusions

As we can read, each competitor followed their own methods and models, but they have something in common: the need to try different approaches, questions, answers, and models. We hope you have drawn your own conclusions; you can share them with us in the comments. We look forward to seeing you in the competition that is currently active, and perhaps you could be among the TOP 10 interviewees for the next one!


PS: we did not get the answers from the competitor who got first place. :( 

Read Here






2020-12-18

Why are data science competitions important for startups?

Today, large companies have large R&D budgets that allow them to experiment and stay at the cutting edge of new technologies; always adopting the newest ones, trying to adapt them to their own needs, trying to find the hidden value in each of them.

It is natural that a new technology will not always fit the needs of a particular company; however, through R&D these companies have an "innovation lab" where failure is allowed and where amazing things also happen.

In recent years, the technology that has gained millions of followers (companies and engineers), is Artificial Intelligence, or Machine Learning to be specific. These large companies have been able to find use cases that allow them to optimize all types of internal operations, lower costs and/or increase sales, income and profitability. 

But where have the startups been in this race to exploit and apply new use cases using this technology?

Unfortunately, not all startups have the financial capacity to experiment with new technologies, either by outsourcing or by hiring talent internally.

In fact, many startup founders see the possibility of using this technology as something far out of reach. Even worse, we have found that many founders have no idea what machine learning can do for their startups.

And it's worrying because this technology can give them significant advantages over the competition and/or increase their key differentiation. 

There is talk that data is the new gold, and it makes sense as long as startups know how to turn that data into gold. 

Finally, when we talk about Machine Learning, we are talking about data processed in the right way, which allows the generation of "artificial intelligence" and making predictions or classifications with that data. 

However, there is a whole process behind it: extracting the data correctly, cleaning it, organizing it, getting it ready, feeding it into a machine learning model, experimenting with the model (or models), and finally getting scores and deploying to production.

This is not an easy process, and this is where startups can only watch from afar as those with resources take advantage of it. If only we could democratize the way this technology is used, we could generate more value for all stakeholders, including, obviously, users. But how do we democratize access to this technology and the solutions that can be created with it?

Democratizing data science competitions


Some time ago, several platforms were born on which large Silicon Valley tech companies and even big corporations try to solve really complex problems with the help of external people, using something called "data science competitions".

This was because the internal talent couldn't solve these problems for many reasons, such as a lack of time or skills. Obviously, these were really complex problems.

These data science competition platforms allow a company to access a global talent pool of data science specialists, ranging from PhDs to self-taught people, who launch themselves into the adventure of solving the challenge posted by the company sponsoring a competition.

The prizes are obviously exorbitant, with Netflix even paying $1 million for a machine learning solution. 

The prizes on these platforms range from $10,000 USD to $100,000 USD on average. A privilege that only big tech companies (or big corps) can afford. 

And the startups? Well, if you have raised a Series B or C, you may be able to afford to sponsor a $10,000 USD competition, or even have an in-house team of data scientists to help you explore, experiment or solve problems with machine learning. 

But what about startups that are at an earlier stage and don't have these funds or that internal talent? Or what about the ones that are bootstrapping? This is where an idea came up to help startups in this situation. We decided to rethink data science competitions.

Rethinking competitions in data science


Our approach is to democratize data science competitions. We realized that other data science competition platforms are focused on very large companies, very high prizes and very complex problems. 

This translates into competitions that can only be sponsored by big tech companies with deep pockets, competitions that take months to complete, and that are made for "super-senior" data scientists and teams.

After all, sponsoring a $50,000 USD (or 1 million USD) competition is not for every type of company. 

That is why we decided to rethink the way data science competitions are built and to focus on startups of any size and from anywhere in the world: startups that can pay for competitions starting at $499 USD, that don't take long to resolve (4 weeks), that can launch more than one, two, or three competitions (because they can afford it), and in which data science talent of any level, from anywhere in the world, can compete.

What if I hire a team of data scientists instead of sponsoring competitions?


In this case, you obviously have to think about how many people you will need to hire, go through the hiring process, and pay a considerable salary to that talent (since they are worth it!).

But if you do this without having experimented with the first machine learning model solutions, you will be taking a leap of faith without knowing if you will actually need this technology, if you will really be able to take advantage of it and if it will definitely generate value for your startup. 

Perhaps hiring someone makes more sense if you have already experimented a little with the technology and know what you want and what you can do. 

In case you haven't tested it (and don't know what can be built for your particular startup and use case), competitions are the best option to get the most out of it.

In fact, there are more than 1,400 data scientists on our platform who will be competing to solve your problem. That means you will have 1,400 people working for you! We are not talking about 2 or 3 people, we are talking about thousands!

But the greatest benefit you can get from a competition is ultimately the machine learning algorithms delivered as solutions (see the section "Let's make an example" below).


Why sponsor a competition if I already have data scientists on my team?


Having data science talent is great news for you: you must have already experimented with the technology, have solutions deployed, and noticed the great advantages of this technology.

However, you aren't looking at the big picture, because a data science team by itself might not have found the best solution to a problem. 

Let's say that for the XYZ problem your team achieved a score of 0.75 (out of 1). The right questions to ask yourself are:
  • Is it the best possible solution? 
  • Is it the most optimal solution? 
  • Is it the best algorithm?
  • Are there more algorithms that my team is not exploring?
  • Is it the highest score someone can get? 
  • What if someone has a score of 0.89 for the exact same problem?
  • And the most important of all: Is there anyone out there, anywhere in the world, who can achieve a higher score? 

In short, you are having an opportunity cost. You aren't looking at the big picture. 

That difference in scores (0.75 vs. 0.89) may seem small, but it could mean thousands of dollars in savings, or in income (depending on the problem). 

Or as many data scientists might say: "It would be a life-or-death difference". Just think about this: What if this model is to predict whether someone has a rare disease or not? It would be a life-or-death difference, on which medical treatment could depend. 

This is why competitions are the best option for you to get the most out of this technology. You keep your data science team, but you experiment with competitions and compare results. Or your team can simply focus on solving other problems.

But the greatest benefit you can get from a competition is ultimately the machine learning models as solutions. Let's make an example

Let's make an example


Let's say you have found a data science problem, framed as a prediction problem (of medium difficulty), for which you expect a final model score between 0 and 1, where 1 is a perfect prediction and 0 a very poor one.

If you had, say, 1 full-time data scientist in your startup working to solve that problem, it might take 1 month to solve it and create a machine learning model, and at the end you could have a score of, say, 0.71.

0.71 is a good score; after all, the maximum score is 1, so you are at 71% accuracy in the solution of the problem. Now the question is:

How much did it cost your startup to reach that solution? 
  • Answer: The monthly wage of the data scientist! (Let's say a $100,000 annual wage; that means roughly $8,333 USD for one month of work on the solution. That's a lot!) Other associated costs include the opportunity cost of not knowing whether there are better solutions and better scores for that same problem. Could there be solutions that reach a score of more than 0.8? Probably yes! But the startup will never know: unless their team member keeps working on optimizing it, that will be the maximum score, with no way to compare the final result. This same example applies if you have a more robust team; you'll still be limited by the size of your team.

Let's suppose now that you have decided to sponsor a competition and pay the winners $999 USD in cash prizes. 

At the end of the competition, which lasts 8 weeks, you will get 20 machine learning models (the top 20 on the leaderboard), and let's say that the solutions that the competitors came up with, are in a range of: 0.58 to 0.86. This means that the winner of the competition got a score of 0.86 and the 20th competitor got a score of 0.58. 

If we compare the results, we will see that the winner of the competition was well above the single score that your in-house data scientist got (0.86 vs. 0.71). 

You also got 20 different solutions/models, with different approaches, from which you will be able to learn how to approach the solutions better.

How much did it cost your startup to reach that solution? 
  • Answer: $999 USD + a 30% fee, and the confidence of knowing that the winning model (for that particular problem) is the best model among more than 1,400 data scientists! There is no opportunity cost. This is open innovation in data science!


This same example applies if you plan to outsource the job to a software house that offers machine learning services (it's only one company, not several competing to deliver the best model!), or if you hire a consulting firm or a freelancer. All of them have the same problem as an in-house data science team: you'll only get limited value.

In short, there is no better way to experiment with this technology than by sponsoring a competition and understanding the results and value they can generate for your startup!

We have a tool that allows you to frame your problem and help you sponsor a competition. Click here.

Thanks for reading!

Read Here






2020-12-18

A Study Plan to Learn Data Science Over the Next 12 Months

As we discussed in a previous post, we are wrapping up 2020 and it is time to make plans for next year. Some of the most important plans and questions we must ask ourselves are: What do we want to study? What do we want to reinforce? What changes do we want to make? And what direction are we going to take (or keep) in our professional careers?

Many of you will be starting on the path to becoming a data scientist. In fact, you may still be evaluating it: you have heard a lot about the subject, but you have some doubts, for example about the number of job offers in this area, about the technology itself, and about the path you should follow, given the wide range of learning options.

I believe we should learn from several sources, several mentors, and several formats. By sources I mean the different online and in-person platforms that exist for studying. By mentors I mean that it is always a good idea to hear the different points of view and teaching styles of different instructors. And by formats I mean the choice between books, videos, classes, and other formats in which the information is delivered. When we draw information from all of these sources we reinforce what we learn, but we always need a guide, and this post aims to shed some light and offer practical strategies in that regard.

Choosing the sources, mentors, and formats is up to you. It depends on whether you prefer material in Spanish or in English, or whether you prefer to mix them. It also depends on your taste and on how you learn best: for example, some people learn better from books, while others prefer videos. Some prefer to study on hands-on platforms (with code in the browser), and others prefer traditional platforms such as universities or MOOCs. Some prefer to pay for quality content, others look only for free material. That is why we will not give a specific recommendation in this post, but rather a study plan.

To start, you should consider the time you are going to devote to studying and the depth of learning you can achieve. If you are between jobs, you may be available to study full time, which is a huge advantage, whereas if you are working you will have less time and will have to discipline yourself to find time in the evenings, in the mornings, or on weekends. In the end, the important thing is to meet the goal of learning and perhaps to dedicate your professional career to this exciting field!

We will divide the year into quarters as follows
  • First Quarter: Learning the Basics
  • Second Quarter: Leveling Up: Intermediate Knowledge
  • Third Quarter: The Real Project - A Full-stack Project
  • Fourth Quarter: Looking for Opportunities While Keeping Up the Practice

First Quarter: Learning the Basics


If you want to be stricter, you can set start and end dates for this period of studying the basics. It could be something like: From January 1 to March 30, 2021 as the deadline. During this period you will study the following:

A programming language that you can apply to data science: Python or R.
We recommend Python, for the simple reason that roughly 80% of data science job offers ask for Python skills. That same percentage holds for the real projects you will find deployed in production. Add to that the fact that Python is multi-purpose, so you will not "waste" your time if at some point you decide to focus on, say, web or desktop development. Here you can see other reasons why Python is the leader in data science. This would be the first topic to study in the first months of the year.


Getting familiar with statistics and mathematics.
There is a big debate in the data science community about whether we need these foundations or not. I will write a post about this later, but the reality is that you DO need them, but ONLY the basics (at least at the beginning). Let me clarify this point before continuing.

We could say that data science is divided into two large fields: research and development on one side, and putting machine learning algorithms into production on the other. If you later decide to focus on research and development, you will need mathematics and statistics in depth (very much in depth). If you go down the practical path, the libraries will help you deal with most of it under the hood. It is worth noting that most job offers are on the practical side.

In both cases, and at this first stage, you will only need the basics of:

Statistics (with Python and NumPy)
  1. Descriptive statistics
  2. Inferential statistics
  3. Hypothesis testing
  4. Probability
Mathematics (with Python and NumPy)
  1. Linear algebra
  2. Multivariable calculus

Note: We recommend that you study Python before statistics and mathematics, because the challenge is to implement these statistical and mathematical foundations with Python. Don't look for theoretical tutorials that only show slides or statistical and mathematical examples in Excel; it gets very boring and impractical! You should choose a course, program, or book that teaches these concepts in a practical way, with Python. Remember that Python is what we ultimately use, so choose well. This advice is key so that you don't give up at this stage, since it will be the densest and most difficult one.
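As a taste of what "statistics with Python and NumPy" looks like in practice, here is a tiny self-contained example (the data is made up):

import numpy as np

# A made-up sample of 1,000 observations drawn from a normal distribution
rng = np.random.default_rng(42)
sample = rng.normal(loc=170, scale=10, size=1000)

# Basic descriptive statistics
print("mean:", np.mean(sample))
print("median:", np.median(sample))
print("sample std:", np.std(sample, ddof=1))
print("95th percentile:", np.percentile(sample, 95))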

If you have these foundations in place within the first three months, you will be ready to take a qualitative leap in your learning over the following three months.


Second Quarter: Leveling Up: Intermediate Knowledge


If you want to be stricter, you can set start and end dates for this intermediate period of study. It could be something like: From April 1 to June 30, 2021 as the deadline.

Now that you have a good foundation in programming, statistics, and mathematics, it is time to move forward and get to know the great advantages Python offers for data analysis. At this stage you will focus on:

Python stack for data science
Python has the following libraries that you should study, get to know, and practice at this stage

Pandas is the de facto library for data analysis; it is one of the most important (if not the most important) and powerful tools you must know and master during your career as a data scientist. Pandas will make it much easier for you to manipulate, clean, and organize your data.
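A minimal example of the kind of data manipulation Pandas makes easy (the file and column names are made up):

import pandas as pd

# Load a CSV, inspect it, clean it, and summarize it
df = pd.read_csv('houses.csv')
print(df.head())      # first rows
df.info()             # column types and missing values

df = df.dropna(subset=['price'])                         # drop rows with no target
df['price_per_m2'] = df['price'] / df['surface_total']   # quick derived column
print(df.groupby('city')['price'].median())              # median price per city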


Feature Engineering
Feature engineering often gets glossed over, but if you want machine learning models that make good predictions and/or classifications and improve your scores, spending some time on this subject is invaluable!

Feature engineering is the process of using domain knowledge to extract features from raw data via data mining techniques. These features can be used to improve the performance of machine learning algorithms. Feature engineering can be considered applied machine learning itself. To do good feature engineering you need to know the different techniques that exist, so it is a good idea to study at least the main ones; a small example follows below.
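For instance, a couple of simple engineered features on a hypothetical real-estate DataFrame might look like this:

import pandas as pd

# df is a hypothetical DataFrame with 'surface_total', 'rooms', 'city' and 'published_at'
df['published_at'] = pd.to_datetime(df['published_at'])

# Domain-driven features: size per room, listing month, and how common the city is
df['m2_per_room'] = df['surface_total'] / df['rooms'].replace(0, 1)
df['listing_month'] = df['published_at'].dt.month
df['city_frequency'] = df.groupby('city')['city'].transform('count')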


Basic Machine Learning Models
At the end of this stage you will start studying machine learning. This is perhaps the most awaited moment! This is where you start to learn about the different algorithms you can use, which particular problems you can solve, and how you can apply them in real life.

The Python library we recommend for starting to experiment with ML is scikit-learn. However, it is a good idea to find tutorials that explain how to implement the algorithms (at least the simplest ones) from scratch with Python, since the library can be a "black box" and you might not understand what is happening under the hood. If you learn to implement them with Python, you will have a more solid foundation.

If you implement the algorithms with Python (without a library), you will put into practice everything you have seen in the statistics, mathematics, and Pandas sections.
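For example, here is a tiny simple-linear-regression fit using only NumPy and the closed-form least-squares formulas (the data is made up):

import numpy as np

# Made-up data: surface area in m2 vs. price
x = np.array([50, 60, 80, 100, 120], dtype=float)
y = np.array([120_000, 150_000, 200_000, 240_000, 280_000], dtype=float)

# Least-squares estimates: slope = cov(x, y) / var(x); intercept = mean(y) - slope * mean(x)
slope = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
intercept = y.mean() - slope * x.mean()

predictions = intercept + slope * x
print(f"price = {intercept:.0f} + {slope:.0f} * m2")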

These are some recommendations of the algorithms you should know, at a minimum, at this initial stage

Supervised learning
  • Simple linear regression
  • Multiple linear regression
  • K-nearest neighbors (KNN)
  • Logistic regression
  • Decision trees
  • Random Forest
Unsupervised learning
  • K-Means
Bonus: if you have the time and are still within the schedule, you can study these
  • Gradient boosting algorithms
  • GBM
  • XGBoost
  • LightGBM
  • CatBoost
Note: do not go beyond the 3 months allotted for this stage, because you will fall behind and break the study plan. We all have gaps at this stage; it is normal. Keep going, and later you can come back to the concepts you did not fully understand. The important thing is to have the basic knowledge and move forward!

If you at least manage to study the supervised and unsupervised learning algorithms mentioned above, you will have a very clear idea of what you will be able to do in the future. So do not worry about covering everything; remember that it is a process, and ideally you should have clearly established time frames so that you do not get frustrated and you feel you are making progress.
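And here is the same kind of supervised problem solved with scikit-learn, using one of the algorithms from the list above (a random forest on a toy dataset bundled with the library):

from sklearn.datasets import load_diabetes
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

# Toy regression dataset included in scikit-learn
X, y = load_diabetes(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

model = RandomForestRegressor(n_estimators=200, random_state=42)
model.fit(X_train, y_train)

print("MAE:", mean_absolute_error(y_test, model.predict(X_test)))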

This concludes your "theoretical" study of the foundations of a data scientist, one who has in fact specialized as a Machine Learning Engineer. Now let's move on to the practical part!

Third Quarter: The Real Project - A Full-stack Project


If you want to be stricter, you can set start and end dates for this period. It could be something like: From July 1 to September 30, 2021 as the deadline.

Now that you have a good foundation in programming, statistics, mathematics, data analysis, and machine learning algorithms, it is time to move forward and put all this knowledge into real practice.

Many of these suggestions might sound unusual, but believe me, they will make a big difference in your career as a data scientist.

The first thing is to create your web presence:
  • Create a GitHub (or GitLab) account, and learn Git. Being able to manage different versions of your code is important; you should have version control over it, not to mention that an active GitHub account is very valuable for demonstrating your real skills. On GitHub you can also host your Jupyter Notebooks and make them public, which also showcases your skills. This is mine, for example: https://github.com/danielmoralesp
  • Learn the basics of web programming. The advantage is that you already have Python as a skill, so you can learn Flask to create a simple web page. Or you can use a template engine such as GitHub Pages, Ghost, or WordPress itself and build your online portfolio that way.
  • Buy a domain with your name. Something like yourname.com, yourname.co, yourname.dev, etc. This is extremely valuable so you can have your CV online and keep it updated with your projects. There you can make a big difference by showing your projects and your Jupyter Notebooks, and making it clear that you have the practical skills to execute projects in this area. There are many front-end templates, free or paid, to give it a more personalized and pleasant look. Don't use free subdomains from WordPress, GitHub, or Wix; they look very unprofessional. Build your own. Here is mine, for example: https://www.danielmorales.co/
  • Upload all the exercises and projects you have done so far, during the previous 6 months, to your online portfolio. You already have material to make yourself known. It doesn't matter how professional your Jupyter Notebooks look; try to tidy them up a bit and publish them. For now I am hosting my Jupyter Notebooks privately here: https://www.narrativetext.co/

Pick a project you are passionate about and build a machine learning model around it.

The final goal of this third quarter is to create ONE SINGLE project, one you are passionate about, and one that is UNIQUE among the rest. It turns out that the community is full of typical projects, such as predicting the survivors of the Titanic or predicting house prices in Boston. Those kinds of projects are good for learning, but not for showcasing as your UNIQUE projects.

If you are passionate about sports, try to predict the football results of your local league. If you are passionate about finance, try to predict stock prices on your country's stock exchange. If you are passionate about marketing, try to find someone who runs an e-commerce site, implement a product recommendation algorithm, and deploy it to production. If you are passionate about business: build a predictor of the best business ideas for 2021.

As you can see, in this case you are limited only by your passions and your imagination. In fact, those are the two keys to this project: passion and imagination.

However, don't expect to make money from it. You are at a learning stage; what you need is for that algorithm to be deployed in production. Build a Flask API around it (see the small sketch at the end of this section), and explain on your website how you did it and how people can access it. This is the moment to shine, and at the same time the moment of greatest learning.

You will most likely run into obstacles. If your algorithm gives 60% accuracy after a big optimization effort, it doesn't matter: finish the whole process, deploy it to production, try to get a friend or family member to use it, and that will be the goal accomplished for this stage: building a full-stack machine learning project.

Con full-stack me refiero que tu hiciste todos los siguientes pasos:
  1. Obtuviste los datos de alguna parte (scrapping, open data o API)
  2. Hiciste un análisis de datos
  3. Limpiaste y transformaste los datos
  4. Creaste Modelos de Machine Learning
  5. Subiste el mejor modelo a producción para que otras personas lo usen.

Esto no significa que todo este proceso es lo que harás siempre en tu trabajo, pero significa que conocerás cada una de las partes del pipeline que se necesita para un proyecto de data science para una empresa. Tendrás una perspectiva única!
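As a reference for step 5, here is a minimal sketch (under assumptions, not a prescribed setup) of what serving a model behind a Flask API can look like. The file name model.pkl and the /predict route are hypothetical placeholders.

import pickle

from flask import Flask, jsonify, request

app = Flask(__name__)

# Hypothetical path: wherever you saved your trained model with pickle/joblib
with open("model.pkl", "rb") as f:
    model = pickle.load(f)

@app.route("/predict", methods=["POST"])
def predict():
    # Expects JSON such as {"features": [5.1, 3.5, 1.4, 0.2]}
    features = request.get_json()["features"]
    prediction = model.predict([features])
    return jsonify({"prediction": prediction.tolist()})

if __name__ == "__main__":
    app.run(port=5000)

A client (or the friend testing your project) could then POST a JSON payload to http://localhost:5000/predict and get the prediction back.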


Fourth Quarter: Looking for Opportunities While Keeping Up the Practice



If you want to be stricter, you can set start and end dates for this intermediate-level study period. It could be something like: from October 1 to December 31, 2021 as the deadline.

Now you have both theoretical and practical knowledge. You have deployed a model to production. The next step depends on you and your personality. Let's say you are entrepreneurial and have the vision to create something new out of something you discovered, or you spotted a business opportunity in this discipline; then it is time to start planning how to do it. If that is your case, this post obviously won't cover that process, but you should know (or start finding out) what the next steps could be.

But if you are among those who want to get hired as a data scientist, here is my advice.

Getting a job as a data scientist

"You are not going to get a job as quickly as you think if you keep thinking the same way."
Author

It turns out that everyone who starts out as a data scientist imagines working for the big companies in their country or region, or even remotely. The thing is, if you aim to work at a large company as a data scientist, you are going to get frustrated when you see the years of experience they ask for (3 or more) and the skills they require.

Large companies don't hire juniors (or very few do), precisely because they are already large companies. They have the financial muscle to demand experience and skills and can pay a matching salary (although that is not always the case). The point is that if you focus there, you will get frustrated!

Here we come back to this: "You need creativity to get a job in data science."

Like everything in life, we have to start on the lower rungs; in this case, at the very beginning. Here are the scenarios:

  • If you are working at a company in a role unrelated to engineering, you need to demonstrate your new skills to the company you work for. If, for example, you work in customer service, apply them to your job: run detailed analyses of your calls and conversion rates, store the data, and make predictions on it! If you can get data from your colleagues, you could even try to predict their sales! This may sound funny, but it is about how creatively you can apply data science to your current job, how you show your bosses how valuable it is, and how you EVANGELIZE them about the benefits of implementing it. You will get noticed, and they may well create a new data-related department or position. And you already have the knowledge and the experience. The key word here is evangelize. Many companies and business owners are only beginning to see the power of this discipline, and it is your job to feed that reality.
  • If you are working in an engineering-related area that is not data science, the same advice as the previous example applies, only with some advantages: you could get access to the company's data and use it for the company's benefit, running analyses and/or predictions on it, and once again EVANGELIZING your bosses about your new skills and the benefits of data science.
  • If you are unemployed (or you don't want to, or don't feel comfortable, following the two previous examples), you can start looking elsewhere. What I recommend is that you look for ventures and/or tech startups that are just assembling their first teams and that pay some kind of salary, or even offer stock options. Obviously the salaries here won't be exorbitant and the hours might be longer, but remember that you are in the learning and practice stage (only on the first rung), so you can't be demanding. Lower your expectations a bit, adapt to that reality, and stop expecting to be paid 10,000 dollars a month at this stage. But 1,000 dollars could be a very interesting way to start this new career. Remember, you are a junior at this stage.

The conclusion is: don't waste time browsing and/or applying to offers from large companies, because you will get frustrated. Be creative and look for opportunities at smaller or newly created companies.


Learning never stops
While you are in this process of looking for a job or opportunities, which could take up half of your time (50% looking for opportunities, 50% continuing to study), you have to keep learning. You should move on to topics such as Deep Learning or Data Engineering, to subjects you feel were left weak in the previous stages, or focus on the topics within this group of data science disciplines that you are passionate about.

At the same time you can pick a second project and spend some time executing it from start to finish, thereby growing your portfolio and your experience. If you do, try to find a completely different project: if the first one was built with Machine Learning, make the second one with Deep Learning. If you deployed the first one to the web, deploy the second one to a mobile platform. Remember, creativity is the key!

Conclusion


This is the perfect time to plan 2021, and if this is the path you want to take, start looking for the platforms and resources you want to study with. Get to work and don't let this opportunity to become a data scientist in 2021 pass you by!

If you are interested, I have this portfolio of practical data science courses; I hope you can take it, and I will show you in more detail how to achieve this goal in 2021.

Note: we are building a private Slack community of data scientists; if you want to join, write to us at: [email protected]

Thanks for reading!

Read Here






2020-12-18

21 Python Tricks They Don't Teach You When You're Starting Out

When we start studying Python we do so from many different sources, such as videos, books, blogs, courses, bootcamps, YouTube, universities, and countless other options.

However, in the rush to cover the basics and understand the general features of the language, we often overlook small details and tricks that make our lives easier as Python programmers, whether you use it for web development, desktop applications, or data science.

These tricks should be part of our training from the very beginning, and if they weren't, that is no longer an excuse not to use them :)


1- Create a package


Modules help compartmentalize reusable code, such as Python functions, variables, and classes. Organizing code this way can make it easier to understand and use.

For me, this is the biggest productivity booster for Python programmers. It lets you work faster and make fewer mistakes. Besides, by writing packages you also improve your programming skills.

A package contains one or more related modules. We can create a package called miprimerpaquete by following these steps:

  1. Create a new folder called MiPrimerPaquete.
  2. Inside MiPrimerPaquete, create a subfolder named miprimerpaquete.
  3. Using a Python editor or IDE such as Atom, Sublime, or PyCharm, create the modules saludar_visitantes.py (which will hold the code for welcoming visitors to the package), funciones.py (which will hold the code for various functions), and clases.py (which will hold the templates from which we can instantiate new objects). A minimal sketch of the resulting layout follows below.

Notes:
Make sure to follow the PEP 8 conventions for package and module names.
A package used to require an __init__.py file, but with the introduction of namespace packages this is no longer the case.
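As a rough sketch of what the result can look like (the function name below is hypothetical, not part of the steps above), the folder layout and a typical import would be:

# MiPrimerPaquete/
# └── miprimerpaquete/
#     ├── saludar_visitantes.py
#     ├── funciones.py
#     └── clases.py

# miprimerpaquete/saludar_visitantes.py
def saludar():
    """Print a short welcome message for visitors of the package."""
    print("Welcome to miprimerpaquete!")

# From a script placed inside MiPrimerPaquete/ you could then write:
# from miprimerpaquete import saludar_visitantes
# saludar_visitantes.saludar()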


2- Check the size of your packages


After installing all the library dependencies your package needs, your SSD may end up a bit cluttered. Checking the size of the installed packages will help you understand which ones take up the most space. From there, you can decide which packages to keep and which to uninstall.

To find the path of the packages installed on your Linux machine, type:

pip3 show "some_package" | grep "Location:"
This will return

path/to/all/packages.
Something like:

/Users/yourname/opt/anaconda3/lib/python3.7/site-packages
Insert that file path into the command below:

du -h path/to/all/packages

where du reports the file system's disk space usage.

This command prints the size of each package. The last line of the output contains the total size of all the packages.


3 - Check memory usage


Just as with optimizing your workspace, it can also be useful to examine the memory usage of the objects in your code. You can do this with Python's sys.getsizeof function by running the following code:

import sys
variables = ['In', 'Out', 'exit', 'quit', 'get_ipython', 'variables']
 
# Get a sorted list of the objects and their sizes
sorted([(x, sys.getsizeof(globals().get(x))) for x in dir()
    if not x.startswith('_') and x not in sys.modules and
    x not in variables], key=lambda x: x[1], reverse=True)

4- Improve your command line


Click is a command-line toolkit for Python that lets you create intuitive programs and interfaces for the bash shell. Click supports option dialogs, user prompts, confirmation requests, environment variable values, and much more.

Here is an example script that could be used to ask a user for a password:

import codecs
import click

@click.command()
@click.option('--password', prompt=True, hide_input=True,
              confirmation_prompt=True)
def encrypt(password):
    # In Python 3, rot13 must go through codecs; str.encode('rot13') raises an error
    click.echo('Encrypting password to %s' % codecs.encode(password, 'rot13'))

The output will be

$ encrypt
Password:
Repeat for confirmation:

5- Check that everything follows the PEP 8 Python conventions


The nblint package lets you run the pycodestyle style engine inside a Jupyter Notebook. It checks (i.e. lints) your code against the pycodestyle engine.

Linting highlights syntax and style problems in your Python code, making it less error-prone and more readable for your colleagues.

Linting tools were first introduced by frustrated debuggers in 1978, and the practice takes its name from the act of removing small bits of loose fluff from clothes coming out of the dryer.

Also read: The 8 Best Python Books in Spanish to Read in 2021


6- Clean the conda cache


First, a quick note on the difference between pip and conda. pip is the tool recommended by the Python Packaging Authority for installing packages from the Python Package Index, PyPI. conda is Anaconda's cross-platform package and environment manager.

In general, mixing the pip and conda package managers is a bad idea. The two managers don't talk to each other, which can create conflicts between packages. Consider using pip exclusively inside virtual environments unless you are ready to commit to conda.

We have already covered how to clean up the packages you installed with pip; here is how to remove packages installed with conda. If you have been using the conda package manager, you can free up space by removing unused packages and caches with this command:

conda clean --all

7- Multiple assignment for variables


Python lets us assign values to more than one variable on a single line, with the variables separated by commas. Single-line multiple assignment has many benefits: it can assign multiple values to multiple variables, or multiple values to a single variable name. Take a problem where we have to assign the values 50 and 60 to the variables a and b. The usual code looks like this:

a = 50
b = 60
print(a,b)
print(type(a))
print(type(b))

Output

50 60
<class 'int'>
<class 'int'>

Condition I - As many values as variables

When the number of values in the multiple assignment matches the number of variables, each value is stored in the corresponding variable.

a , b = 50 , 60
print(a,b)
print(type(a))
print(type(b))

Output

50 60
<class 'int'>
<class 'int'>

Both snippets give the same result. That is the benefit of single-line value assignment.

Condition II - More values than variables

Let's try increasing the number of values in the previous code. Multiple values can be assigned to a single variable. When assigning more than one value to a variable, we put an asterisk before the variable name.

a , *b = 50 , 60 , 70
print(a)
print(b)
print(type(a))
print(type(b))

Output

50
[60, 70]
<class 'int'>
<class 'list'>

The first value is assigned to the first variable. The second variable collects the remaining values, which creates a list object.

Condition III - One value for multiple variables

We can assign one value to more than one variable, chaining the variables with equals signs.

a = b = c = 50
print(a,b,c)
print(type(a))
print(type(b))
print(type(c))

Output


50 50 50
<class 'int'>
<class 'int'>
<class 'int'>

8- Swapping two variables


Swapping is the process of exchanging the values of two variables. It can be useful in many computing operations.

Here I describe the two methods programmers commonly use to swap values, as well as the optimal solution.

Method I - Using a temporary variable

This method uses a temporary variable to hold intermediate data; in the code below, the variable temp stores the sum of the two values.

a , b = 50 , 60
print(a,b)
temp = a+b  #a=50 b=60 temp=110
b = a       #a=50 b=50 temp=110
a = temp-b  #a=60 b=50 temp=110
print("After swapping:",a,b)


Output
50 60
After swapping: 60 50

Method II - Without a temporary variable

The following code swaps the variables without a temporary variable, using arithmetic instead.

a , b = 50 , 60
print(a,b)
a = a+b  #a=110 b=60
b = a-b  #a=110 b=50
a = a-b  #a=60  b=50
print("After swapping:",a,b)


Output

50 60
After swapping: 60 50

Method III - The optimal solution in Python

This is a different approach to swapping variables in Python. In the previous section we learned about multiple assignment; we can use that concept to swap.
a , b = 50 , 60
print(a,b)

a , b = b , a
print("After swapping",a,b)

Output

50 60
After swapping 60 50

9- Reversing a string


There is another neat trick for reversing a string in Python. The concept used here is called slicing. Any string can be reversed by adding [::-1] after the variable name.

my_string = "MY STRING"
rev_string = my_string[::-1]
print(rev_string)

Output

GNIRTS YM

10- Splitting a line into words


No special algorithm is required to split a line into words; we can use the split() method for this. Here are two ways of splitting words.

Method I - Using iteration

my_string = "This is a string in Python"
start = 0
end = 0
my_list = []

for x in my_string:
   end=end+1
   if(x==' '):
       my_list.append(my_string[start:end])
       start=end  # the next word starts right after the space

my_list.append(my_string[start:end+1])
print(my_list)


Output

['This ', 'is ', 'a ', 'string ', 'in ', 'Python']

Method II - Using the split() function

my_string = "This is a string in Python"
my_list = my_string.split(' ')
print(my_list)

Output

['This', 'is', 'a', 'string', 'in', 'Python']

Note that, unlike Method I, split() drops the separator, so the words have no trailing spaces.


11- Joining a list of words into a line


This is the opposite of the previous process. In this part we convert a list of words into a single line using the join function. The syntax is shown below.

Syntax: "separator".join(iterable)

my_list = ['This' , 'is' , 'a' , 'string' , 'in' , 'Python']
my_string = " ".join(my_list)
print(my_string)

Output

This is a string in Python
Also read: 13 Python Project Ideas for Web, Desktop, Command Line, and Data Science

12- More than one conditional operator


For example, if we need to print something when a variable holds a value greater than 10 and less than 20, the code would look like this:

a = 15
if (a>10 and a<20):
   print("Hi")
Instead, we can chain the comparisons into a single expression.

a = 15
if (10 < a < 20):
   print("Hi")

Output

Hi

13- Finding the most frequent element in a list


The element that appears most often in a list is its most frequent element. The following snippet will get you the most frequent element of a list.

my_list = [1,2,3,1,1,4,2,1]
most_frequent = max(set(my_list),key=my_list.count)
print(most_frequent)

Output
1

14- Finding the occurrences of all the elements in a list


The previous snippet gives the most frequent value. If we need to know the number of occurrences of every unique element in a list, we can turn to Python's collections module. collections is a wonderful module that provides very useful features. The Counter class returns a dictionary-like object with element/occurrence pairs.

from collections import Counter
my_list = [1,2,3,1,4,1,5,5]
print(Counter(my_list))


Output

Counter({1: 3, 5: 2, 2: 1, 3: 1, 4: 1})

15- Checking whether two strings are anagrams


Two strings are anagrams if one string is made up of the characters of the other. We can use the same Counter from the collections module.
from collections import Counter
my_string_1 = "RACECAR"
my_string_2 = "CARRACE"

if(Counter(my_string_1) == Counter(my_string_2)):
   print("Anagram")
else:
   print("Not Anagram")


Output

Anagram

16- Using conditionals with the ternary operator


Most of the time we use nested conditional structures in Python. Instead of a nested structure, a single line can do the job with the help of the ternary operator. The syntax is as follows.

Syntax: value_if_true if condition else value_if_false

age = 25
print("Eligible") if age>20 else print("Not Eligible")

Output

Eligible

17- Turning the mutable into the immutable

The frozenset() function converts a mutable iterable into an immutable frozen set. With it we can freeze an object so that its value cannot change.

my_list = [1,2,3,4,5]
my_list = frozenset(my_list)
my_list[3]=7
print(my_list)

Output

Traceback (most recent call last):
 File "<string>", line 3, in <module>
TypeError: 'frozenset' object does not support item assignment

After applying frozenset() to the list, item assignment is no longer allowed.


18- Applying a function to every element of a list


map() is a higher-order function that applies a given function to every element of a list.

Syntax: map(function, iterable)

my_list = ["felix", "antony"]
new_list = map(str.capitalize,my_list)
print(list(new_list))

Output

['Felix', 'Antony']

19- Filtering values with the filter() function


The filter() function is used to filter values out of an iterable object. Its syntax is shown below.

Syntax: filter(function, iterable)

def eligibility(age):
   return age>=24
list_of_age = [10, 24, 27, 33, 30, 18, 17, 21, 26, 25]
age = filter(eligibility, list_of_age)
print(list(age))


Output

[24, 27, 33, 30, 26, 25]

20- Measuring a program's execution time


time is another useful Python module that can be used to measure execution time.

import time
start = time.perf_counter()  # time.clock() was removed in Python 3.8
for x in range(1000):
   pass
end = time.perf_counter()
total = end - start
print(total)


Output

0.00011900000000000105

21- Printing a monthly calendar in Python


The calendar module has many functions for date-based operations. We can print a monthly calendar with the following code.

import calendar
print(calendar.month(2020, 6))  # month() expects integer arguments

Output

June 2020
Mo Tu We Th Fr Sa Su
1  2  3  4  5  6  7
8  9 10 11 12 13 14
15 16 17 18 19 20 21
22 23 24 25 26 27 28
29 30

Conclusion


I hope you enjoyed this article. As a final note, keep in mind that learning these tricks is not an obligation. But if you do, you can stand out among other programmers. Continuous practice is what builds fluency in programming.

If you want to deepen your Python knowledge, we have this intermediate Python course for you!

Thanks for reading this article!

More articles that might interest you:

Read Here






2020-12-18

The 8 Best Python Books in Spanish to Read in 2021

We are wrapping up 2020 and it is time to start planning 2021. Part of that planning is our education and what we want to learn in this new cycle. That is why we bring you this list of the best Python books in Spanish.

We know Python has gained ground thanks to its popularity and widespread use across different technology areas, including the web, the desktop, and data. This is reflected in job postings; take a look at this chart.

Source: https://www.codingdojo.com/blog/top-7-programming-languages

The chart shows that, on average, Python developers earned more in 2020 than developers of other programming languages, beating Java for the first time. Or at least that is what job postings suggest. Python had been gaining ground and was already in second place in 2019, but this year it finally took the crown as the best-paid programming language!

The chart shows the global salary trend on the job site Indeed, with salaries in dollars, but we assume that in Latin America, accounting for our own salary ranges, the trend (earning more by developing with Python) should hold and be the same or similar. So this is the time to learn it and/or get better at it!

According to this article:

"In recent years, Python has been steadily climbing the ranks of programming languages. This year, it has finally broken Java's streak and pulled ahead. Meanwhile, all the other languages have held steady, with gradual growth across the board. It seems the demand for programmers keeps rising, and we doubt it has peaked yet."


It should come as no surprise that Python is #1

Among the major programming languages, Python is the most versatile of all: with it you can build applications and simple console scripts, connect to databases, and even create neural networks for artificial intelligence. This is thanks to its vast library ecosystem and community, but it is also worth noting that it is compatible with most major systems and databases. Finally, it has a fairly simple syntax, which makes it very easy to read and learn.

With that as a foundation, let's review which Python books in Spanish you should keep in mind for your studies in 2021.

Findings


Here are some interesting findings about the list you are about to read; I recommend paying attention to them before continuing!

  • The books in this post that link to Amazon can be bought in Kindle format at very good prices (between 2 and 4 dollars), which is very affordable.
  • The advantage of the Amazon books is that they have reviews, some with very valuable comments that help you make a better decision.
  • Some of the Spanish-language books on Amazon are literal translations of the original English book; I don't recommend those, because they are full of grammatical errors and very hard to digest. You can spot them by the author's name (foreign names), and they have the worst reviews and comments.
  • The sites other than Amazon don't offer digital versions of the books, so if you love physical books, this list includes the best of them. Obviously we need a bit of patience while the copy makes its way to our homes. It can make a nice Christmas gift!
  • Let's support Latin American authors by buying their books (not by trying to download them for free from the internet), because that encourages them to keep creating content. I dare say this is one of the reasons there isn't much content in Spanish: we prefer to buy foreign books and don't invest in Latin American authors :(
  • At datacademy.dev we have no Amazon affiliate account or accounts with the other bookstores, so these recommendations are completely unbiased.



1- Aprende Python en un fin de semana


Authors: Alfredo Moreno Muñoz, Sheila Córcoles Córcoles



The goal of the book is to build a solid foundation in programming and in the Python language so that you can handle any situation. To that end, the authors have designed a learning method based entirely on progressive exercises combined with basic theoretical notions, and best of all, structured so that you can learn it in a weekend.

Once you have finished the book, following the learning approach the authors propose, they guarantee you will have enough autonomy to carry out your own programming projects, or at least the confidence to try.

You will come up with a large number of programming project ideas, because the more you learn, the more curiosity you develop and the more ideas come to you. Don't despair if you don't succeed on the first try; you are sure to learn something from every mistake that helps you keep moving forward.

Reviews on Amazon


Some comments from buyers on Amazon:



Link to the book Aprende Python en un fin de semana



2- Comenzando con Python: Un inicio desde cero y hasta donde quieras llegar


Author: Walter Lopez


A book written from scratch in Spanish, for quicker, more fluid understanding. All of the information in this book was gathered in a very methodical, organized way so the reader can easily follow it.

Before publishing this book, the author tested it with his own daughters, ages 10 and 12, who surprised him with how quickly and fluently they understood everything and then explained it back to him. This doesn't mean the book is only for children; it is for anyone just getting started in the subject.

Reviews of the book on Amazon


Link to the book Comenzando con Python


Also read: 13 Python Project Ideas for Web, Desktop, Command Line, and Data Science


3- Introducción a la programación con Python


Authors: Omar Trejos and Luis Muñoz



If you are a programmer with extensive experience, this is the book to keep at hand whenever, for whatever reason, you need to look up certain conceptual or practical fundamentals of this language.

If your experience is moderate, this will be a great reference, because it will give you access to the concepts underlying the Python language and thereby strengthen your knowledge of programming in general and of this language in particular.

If you want to become a programmer and have not yet entered the fascinating world of a language like this, then this is your book: in its pages you will find the fundamentals you need to make the most of this programming language's potential.

In Python, both the syntax and the instructions allow theoretical concepts to be expressed simply at runtime when building a program. For these reasons I invite you to enjoy this book and learn to use Python, the programming language with the greatest momentum in the programming world today.

Link to the book Introducción a la programación con Python


4- Python Fácil

Author: Arnaldo Perez Castaño




Aimed at university students, computer science professors, and programmers, this book is an excellent introduction to the free programming language most recognized and used by major companies such as Google and NASA for its high level of expressiveness and sophistication: Python.

The author presents a wide range of possible uses in application development, equipping the reader with the key vocabulary for working with Python, supported by practical diagrams that illustrate the explanations.

In these pages you will find a guide that balances theory and practice in equal measure. You will discover the advantages of Python over other languages such as PHP, Java, and C#; you will get to know the elements that make up its concise, clear syntax as well as its lexical components: tokens, literals, delimiters, statements and keywords, environment variables, data types, the sequences it relies on and the operations that can be performed on them (for example, tuples); in short, you will gain the tools you need to practice and master object-oriented programming.

You will understand how logical operators work, how iterators are generated explicitly or implicitly, and what decorators and metaclasses are, i.e. the Python tools that let you simplify and at the same time extend a program's code. You will learn how to process files and get to know the characteristics of XML and HTML, as well as data structures (trees) and algorithms.

Link to the book Python Fácil

5- Big Data Con Python


Authors: Rafael Caballero, Enrique Martín, and Adrián Riesco


Data analysis is present in our lives: newspapers talk about viral news, companies look for data scientists, stores offer us personalized deals based on our habits, and we ourselves grease the system by giving away personal information for free through our social networks, internet searches, and even the smart devices that track our daily physical activity.

In this book you will find the knowledge and technologies that will let you take part in this new information era, governed by Big Data and machine learning. It follows the "life" of data step by step, showing how to obtain, store, process, and visualize data and how to draw conclusions from it; in other words, it shows data analysis as it really is: a fascinating field that demands many hours of careful work.

It also covers the Python programming language, the most widely used in data analysis thanks to the multitude of libraries it provides, but it does not stop at the "standard": it presents current technologies that, with Python as the interface, let you scale the size of your data to the maximum. That is why our journey with data will take us, for example, to the MongoDB database and the Spark processing engine.

The book contains detailed examples of how to carry out the different tasks in Python; in addition, for the reader's convenience, the included code snippets are available in a repository where you will find the code ready to run. Each chapter also offers recommended readings for going deeper into the aspects you find most interesting.


Link to the book Big Data con Python


6- Python Con Aplicaciones A Las Matemáticas, Ingeniería y Finanzas


Authors: David Báez López; Ofelia Cervantes Villagómez; Juan Antonio Arízaga Silva




ADVANTAGES

• Each chapter includes a series of exercises the reader can solve, gaining more experience in problem solving by implementing the algorithms in Python.
• Every chapter states the objectives pursued throughout it, along with an introduction and a conclusion.
• The material is a written version of the topics the authors have taught over the last two years.

GET TO KNOW

• The concept of an algorithm and how to describe it with a language called pseudocode.
• How each of the different kinds of "conditions" works in Python.
• What subalgorithms are and how to implement them in Python.

LEARN

• To use "while" and "for" loops when designing algorithms.
• To create "vector" and "matrix" arrays to structure large amounts of data.
• To program using the object-oriented programming (OOP) paradigm in Python.

DEVELOP YOUR SKILLS TO

• Work with functions over Python's data structures.
• Perform operations on "matrices" in Python.
• Create, write, and store data in a file.

Link to the book Python Con Aplicaciones A Las Matemáticas, Ingeniería y Finanzas

Also read: Where to Find the Best Datasets (Open Data) in Spanish?


7- Python 3 al Descubierto


Author: Arturo Fernandez


The book offers a review of the language's main features, as well as other related topics, always from a practical standpoint, so that the reader quickly becomes familiar with the language.

Link to the book Python 3 al Descubierto.



8- Python Práctico


Authors: Alfredo Moreno Muñoz, Sheila Córcoles Córcoles



Variables, programming, loops, object-oriented programming, functions, databases, recursion, booleans, the terminal, application integration, lists, Python, source code, exceptions, IDLE, tuples, development environments, testing, dictionaries, files, stacks, queues, the interpreter, code comments... How many of these terms sound familiar? These days programming is becoming more and more a part of everyday life.

It has been shown that programming brings a series of benefits, which is why it is being introduced as a school subject. This is a good guide for getting started with Python.

Link to the book Python Práctico


Conclusion


As you might expect, there are far fewer Python books in Spanish than in English. However, the few that do exist are enough to introduce us to the language or even deepen our knowledge of it, and the advantage is that we will understand every detail of what the author explains, since it is written in our mother tongue. We hope you give yourself this gift for 2021.

If you want to learn Python in Spanish with a complete course covering everything up to object-oriented programming, we have this course for you.

Read Here






2020-12-18

Where to Find the Best Datasets (Open Data) in Spanish?

As data scientists we have grown used to working with datasets like the Titanic, the Iris dataset, or the Boston Houses dataset from the US. As you can see, these are all very "gringo" and often don't reflect our reality. There is nothing wrong with that, of course; we almost always learn data science and machine learning from material that reaches us from English-speaking countries, but our job now is to adapt the knowledge we have to the reality around us.

On the other hand, companies and governments are adopting increasingly open policies toward data, and lately there has been a lot of talk about "open data." Quoting Wikipedia verbatim:

"Open data is a philosophy and practice that seeks to make certain types of data freely available to everyone, without restrictions from copyright, patents, or other mechanisms of control. It shares an ethic with other open movements and communities, such as free software, open source, and open access."

Well, I took on the task of searching for the open data sites of Latin American governments, and this is what I found. All of these open data policies have come about because information and transparency strengthen democracy and hold authorities accountable. The goal of these tools is to give people access to the information on which governments base their public policy decisions, and also to let data scientists (and anyone else) use the data for their own research and interests, for example to build applications and conduct analyses.

I hope you can take advantage of them, start playing with them, build examples and projects around them, and maybe discover and demonstrate interesting findings to your colleagues, friends, bosses, or even governments!

The moral: let's not always depend on data from English-speaking countries!


1- México


Here you will find the Mexican government's open data. You can browse data by sector, such as culture and tourism, development, economy, education, energy and environment, finance and procurement, and many more. The platform also lets you connect to real-time data through APIs, and in other cases you can download historical, static data in different formats such as JSON, XML, or CSV.

The use cases that can come out of this data are endless, and you can pick the areas that interest you most. For example, if economics is your thing, you can run all kinds of analyses on employment, poverty, or productivity indicators and more.

The platform holds 9,311 datasets. Great news for Mexican data scientists, who can access this information for free and play with the data however they like!

You can find access here: https://datos.gob.mx/
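For example, once you have downloaded one of these CSV files, a first look with pandas takes only a few lines (the file name below is a hypothetical placeholder, not a specific dataset from the portal):

import pandas as pd

# Hypothetical file previously downloaded from datos.gob.mx
df = pd.read_csv("indicadores_empleo.csv")
print(df.shape)
print(df.head())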


2- Colombia


Here you will find the Colombian government's open data. According to the platform itself, the initiative exists so that people can "research, develop applications, and create visualizations and stories with the data." More than 1,175 entities have published different datasets, and the platform holds more than 10,000 datasets!

The platform also lets the community upload datasets following predefined guidelines, and it separates officially created data from community-created data. The datasets are divided into different categories, the main ones being:


You can find access here: https://www.datos.gov.co/

If you have built a product based on any of these datasets, you can also share it with the government by submitting your open data use case, or you can browse what other participants have submitted.

The platform currently runs a pilot project called Data Sandbox, a collaborative space for the country's public entities where different Analytics and Big Data pilot projects can be carried out. Proof that governments are betting heavily on this kind of initiative.

For news and calls for participation you can also go here. If you want to suggest new datasets, you can do that too.


3- Perú


On the Peruvian side we also find a national government open data platform, with 4,889 datasets divided into different categories such as:


As you can see, a large amount of data is available to experiment with and analyze in detail.

Here the different ministries are in charge of publishing the data, in different formats, and making it accessible to everyone.

The site's information is divided into "Recursos" (resources), "Dataset", and others; the resources are usually .zip files containing the data or analyses in various formats, while the datasets come as Excel or CSV files. Without a doubt an inexhaustible source of data to keep learning and experimenting with data science!

You can find access here: https://www.datosabiertos.gob.pe/


4- Chile


On the Chilean government's open data platform you can easily find collections of public government information. To that end, in some cases the information is published in more than one format.

The site offers a search engine and catalogs with various categories to help you find information. You can also find georeferenced information and image files. Some of this data is already available on various government sites, but www.datos.gob.cl gathers it into a single searchable website.


On this platform you will find more than 4,000 datasets, 525 organizations that have uploaded data, and 23 different categories to filter by. The categories with the most data include:


Almost all of the datasets come in XLSX and CSV format, which is very convenient for us as data scientists.

You can access it here: https://datos.gob.cl/


5- Argentina



Argentina also has its own open data platform, which puts public data in open formats within users' reach so that we can use, modify, and share it; as always, the goal is to build visualizations, applications, and tools with it.

The site has 998 datasets to date, and 33 organizations have contributed to the platform's growth. You can also browse by different categories, the most popular of which are:


Something interesting I found on this site is the option to connect through APIs, plus a GitHub repository with different packages and analyses you can dig into.

Among the APIs you can access:

  • API georef: to normalize territorial units, provinces, departments, municipalities, and streets.
  • Series de Tiempo (time series): lets you query indicators that evolve over time, in a customized, up-to-date way.
  • CKAN: organizes the published data through its schema of datasets and resources and provides programmatic access to them, applying internationally approved standards for generating metadata.

Access the platform at the following link: https://datos.gob.ar/


By City


Because of the rise of open government data worldwide, and because data is often more relevant at the city level, the city halls of the major cities in each country have taken on the task of creating their own open data portals.

So here are some special mentions of city-level open data sites. If you don't find your city here, try a Google search; your city may have one too!



Conclusion


There is no longer any excuse for always working with the Titanic data or the Boston houses: you have plenty of data from your own country, or even your home city, where you can practice your Python, data science, and even data visualization skills. You can even dare to build more elaborate applications, or present them to your local authorities.

A good strategy is to build a data science project, put it on the web, and then send it to whoever published the datasets on one of these government platforms. It is a good way to show your skills and build a portfolio.

If you don't yet know in detail how to do data analysis and visualization, we recommend this course.

See you in the next one!

Read Here






2020-12-18

13 Python Project Ideas for Web, Desktop, Command Line, and Data Science

This is a list of interesting ideas and projects you can build using Python.

Python is one of the most widely used programming languages in the world, which owes a lot to its general-purpose nature, making it a suitable candidate for many industry domains.

With Python you can develop programs not only for the web, but also for the desktop, the command line, and data science. Python suits programmers of different skill levels, from students to intermediate developers to experts and professionals. But every programming language requires constant learning, and Python is no exception.

If you really want deep hands-on knowledge, there is no better way to get it with Python than by taking on some cool projects that not only keep you busy in your free time but also teach you how to get more out of Python, and above all, motivate you and give you that extra push to keep learning while seeing the result reflected in a cool project!

Did you know?
According to Stack Overflow, Python is one of the languages most loved and wanted by developers. Check out this year's survey here.

Before you start: you need to choose which platform you want to build the project (or projects) on.

Python can be a very versatile programming language in the right hands, and you can build many clever programs with it to strengthen your command of the language. Hands-on exposure matters more than theory, especially when learning programming languages such as Python.

But before we dive into the fun projects we have in store for you, you need to decide which platform you will work on. The platforms of the projects mentioned in this article fall into the four categories listed below:

  • Web. Building a web application lets you and everyone else access it from anywhere over the internet. For that, you would need to work on the front end, the visual part, and the back end of the application, where the business logic is implemented. Tools and frameworks such as Django, Flask, and Web2Py are some of the many options you can use.
  • Desktop GUI. Desktop applications are also very widely used and serve a considerable share of users. When it comes to building desktop applications, Python makes it very easy with its PySimpleGUI package, which lets you build all the necessary elements in Python. The PyQt5 framework also offers advanced GUI building blocks but has a steeper learning curve.
  • Command line. Command-line programs work only in console windows and have no graphical user interface. Interaction with the user happens through commands; it is the oldest way of interacting with programs, but don't mistake the lack of a GUI for a lack of usefulness. Hundreds of major companies depend on command-line programs for their daily business. To build command-line programs, you can use tools such as docopt, Python Fire, plac, and cliff.
  • Data analysis. As an emerging and constantly evolving branch, data analysis has fueled Python's accelerated growth and its massive adoption by data scientists and by developers who decide to shift their career focus. As in the previous cases, Python makes this field very approachable, and libraries such as Pandas, NumPy, Matplotlib, and Scikit-learn make it the language of choice for data scientists.


Python project ideas

If you have already decided on the platform you are going to use, let's go straight to the projects. Below are some fun projects aimed at developers of all skill levels that will play a crucial role in taking your Python skills and confidence to the next level.

1. Typing speed test in Python

Platform: Desktop GUI

Have you ever played a typing speed game? It is a very useful game for measuring your typing speed on the keyboard and improving it with regular practice. You can build your own typing speed game in Python by following a few steps.

Here you can use the pygame library to work with graphics and even sound. Your keyboard input appears on screen, and at the end the game tells you how long it took you to type a sentence.

You can follow this free tutorial to get there.


2. Credit card fraud detection

Platform: Data Science


Credit card fraud is more common than you think, and lately it has been at an all-time high. Figuratively speaking, we are on track to reach one billion credit card users by the end of 2022. But thanks to innovations in technologies such as artificial intelligence, machine learning, and data science, credit card companies have been able to identify and intercept these frauds with reasonable accuracy.

In a nutshell, the idea is to analyze a customer's usual spending behavior, including mapping the location of that spending, in order to separate fraudulent transactions from legitimate ones. For this project you can use R or Python with the customer's transaction history as the dataset and feed it into decision trees, artificial neural networks, and logistic regression (a minimal sketch follows below). As you feed more data into the system, you should be able to increase its overall accuracy.

Dataset: credit card transaction data is used as the dataset here.
Source code: Credit card fraud detection using Python
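A minimal sketch of that idea with logistic regression, assuming a transactions CSV with numeric features and a binary "Class" column (1 = fraud); the file and column names are hypothetical and this is not the linked tutorial's code:

import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split

df = pd.read_csv("transactions.csv")   # hypothetical dataset
X = df.drop(columns=["Class"])         # numeric transaction features
y = df["Class"]                        # 1 = fraud, 0 = legitimate

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)
print(classification_report(y_test, model.predict(X_test)))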


3. URL shortener

Platform: Web



URLs are the primary way to reach any resource on the internet, whether a web page or a file, and sometimes they can be quite long, full of strange characters. URL shorteners play an important role in trimming those characters and making URLs easier to remember and work with.

The idea behind a URL shortener is to use the random and string modules to generate a new short URL from the long URL you enter. Once that is done, you need to map the long and short URLs and store them in a database so users can use them later.
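A minimal sketch of the short-code generation step with the random and string modules; the short domain below is a placeholder, and persisting the long/short mapping to a database is left out:

import random
import string

def generate_short_code(length=6):
    # Random alphanumeric code such as 'a9Xk2Q'
    alphabet = string.ascii_letters + string.digits
    return "".join(random.choices(alphabet, k=length))

long_url = "https://example.com/some/very/long/path?with=parameters"
short_code = generate_short_code()
print(f"{long_url} -> https://short.ly/{short_code}")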

Examples of URL shorteners -


Here you can access a free tutorial on how to build one with Django.


4. Directory tree generator

Platform: Command Line

A directory tree generator is a tool you would use when you want to visualize all the directories in your system and identify the relationships between them. What a directory tree essentially shows is which directory is the parent directory and which are its subdirectories.

A tool like this is handy if you work with many directories and want to analyze how they are laid out. For this, you can use the os library to list files and directories, together with the docopt framework.

Examples of directory tree generators -


5. Content aggregator

Platform: Web

The internet is a prime source of information for millions of people who are always looking for something online. Those who want broad coverage of a specific topic can save time by using a content aggregator.

A content aggregator is a tool that gathers and provides information on a topic from a large number of websites in one place. To build one, you can use the requests library to handle HTTP requests and BeautifulSoup to parse and scrape the necessary information, along with a database to store what you collect.
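A minimal sketch of the scraping step with requests and BeautifulSoup; the URL and the h2 selector are placeholders to adapt to the sites you want to aggregate:

import requests
from bs4 import BeautifulSoup

response = requests.get("https://example.com/news")   # placeholder URL
soup = BeautifulSoup(response.text, "html.parser")

# Collect headline text from <h2> tags (the right selector varies per site)
headlines = [h2.get_text(strip=True) for h2 in soup.find_all("h2")]
for title in headlines:
    print(title)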

Examples of content aggregators:

Popurls
AllTop
Theweblist
Hvper


6. MP3 player

Platform: Desktop GUI


If you like listening to music, you might be surprised to learn that you can build a music player with Python. You can build an MP3 player with a graphical interface, a basic set of playback controls, and even a display of information such as the artist, track length, album name, and more.

You can also add the option to browse folders and search for MP3 files for your music player. To make working with media files in Python easier, you can use the simpleaudio, pymedia, and pygame libraries.

Here is a tutorial you can follow to build it.

Examples of MP3 players
MusicBee
Foobar2000


7. File renaming tool

Platform: Command Line


If your work requires you to manage a large number of files frequently, a file renaming tool can save you a big chunk of your time. What it essentially does is rename hundreds of files using a defined initial identifier, which could be hard-coded or requested from the user.

To make this happen, you could use Python libraries such as sys, shutil, and os to rename the files instantly. To offer a custom initial identifier for the files, you can use the regex (re) library to match the file name patterns.
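A minimal sketch of the renaming step with os and re; the folder, the pattern, and the new prefix are hypothetical examples:

import os
import re

folder = "reports"                # hypothetical directory
pattern = re.compile(r"\.csv$")   # only touch files ending in .csv

for index, name in enumerate(sorted(os.listdir(folder)), start=1):
    if pattern.search(name):
        new_name = f"report_{index:03d}.csv"
        os.rename(os.path.join(folder, name), os.path.join(folder, new_name))
        print(f"{name} -> {new_name}")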

Examples of bulk file renaming tools -
Ren
Rename


8. A quiz app

Platform: Web

Another popular and fun project you can build with Python is a quiz application. A well-known example is Kahoot, famous for making learning a fun activity among students. The app presents a series of questions with multiple options and asks the user to pick one; afterwards, it reveals the correct answers.

As the developer, you can also build the functionality to add any question you like, along with the answers to be used in the quiz. To build a quiz app, you will need a database to store all the questions, options, correct answers, and user scores.

Examples of quiz applications
Kahoot
myQuiz


9. Tic Tac Toe

Plataforma: Web, GUI o Línea de Comandos

Tic Tac Toe es un juego clásico que estamos seguros que cada uno de ustedes conoce. Es un juego simple y divertido y requiere sólo dos jugadores. El objetivo es crear una línea horizontal, vertical o diagonal ininterrumpida de tres X u O en una cuadrícula de 3x3, y quien lo haga primero será el ganador del juego. 

Un proyecto como este puede utilizar la biblioteca de Python pygame, que viene con todos los gráficos y el audio necesarios para empezar a construir algo así.

Here are some tutorials you can try:

More fun Python projects for game development:


10. Build a Virtual Assistant

Platform: Desktop GUI


Almost every smartphone today comes with its own variant of a smart assistant that takes commands from you, either by voice or by text, and manages your calls, notes, cab bookings, and much more.

Some examples of this are Google Assistant, Alexa, Cortana, and Siri. If you are wondering what it takes to build something like this, you can use packages such as pyaudio, SpeechRecognition, and gTTS. The goal here is to record the audio, convert the audio to text, process the command, and make the program act according to the command.

Here you can find a free tutorial you can follow.
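
As a very rough sketch of that pipeline (record audio, convert it to text, act on the command), the snippet below uses the SpeechRecognition package with Google's free web recognizer. It assumes a working microphone and an internet connection, and the two commands it reacts to are only illustrative.

# Minimal voice-command sketch: listen once, transcribe, react.
# Assumes a working microphone; the recognized commands are illustrative.
import speech_recognition as sr

recognizer = sr.Recognizer()

with sr.Microphone() as source:
    print("Say something...")
    audio = recognizer.listen(source)

try:
    command = recognizer.recognize_google(audio).lower()
    print(f"You said: {command}")
    if "time" in command:
        from datetime import datetime
        print(datetime.now().strftime("%H:%M"))
    elif "hello" in command:
        print("Hello! How can I help you?")
except sr.UnknownValueError:
    print("Sorry, I could not understand the audio.")
except sr.RequestError:
    print("Speech service unavailable.")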



11. Calculator

Platform: Desktop GUI

Of course, no one should skip the age-old idea of building a calculator while learning a new programming language, even if it is just for fun. We are sure all of you know what a calculator is, and if you have already tried building one, you can try improving it with a better GUI that brings it closer to the modern versions that ship with today's operating systems. To make that happen, you can use the tkinter package to add GUI elements to your project.
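
A bare-bones tkinter sketch of the idea, with a single entry field and an evaluate button, could look like the following; a real calculator would add a full button grid and a safer expression parser than eval.

# Bare-bones tkinter calculator sketch: type an expression, press "=".
# eval() is used only for brevity; a real app should parse expressions safely.
import tkinter as tk

def evaluate():
    try:
        result.set(str(eval(expression.get())))
    except Exception:
        result.set("Error")

root = tk.Tk()
root.title("Calculator")

expression = tk.StringVar()
result = tk.StringVar()

tk.Entry(root, textvariable=expression, width=30).pack(padx=10, pady=5)
tk.Button(root, text="=", command=evaluate).pack(pady=5)
tk.Label(root, textvariable=result).pack(pady=5)

root.mainloop()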


12. Driver Drowsiness Detection

Platform: Data Science


Traffic accidents claim many lives every year, and one of their causes is drowsy drivers. Since this is a potential source of danger on the road, one of the best ways to prevent it is to implement a drowsiness detection system.

A driver drowsiness detection system like this is another project with the potential to save many lives, by constantly assessing the driver's eyes and sounding alarms if the system detects frequent eye closure.

This project requires a webcam so the system can periodically monitor the driver's eyes. To make this happen, this Python project needs a deep learning model and libraries such as OpenCV, TensorFlow, Pygame, and Keras.

Source code: Driver drowsiness detection system with OpenCV and Keras
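
The full project relies on a trained deep learning model, but the skeleton of the eye-monitoring loop can be sketched with OpenCV's bundled Haar cascades alone: read webcam frames, look for eyes, and raise an alert if none are found for a while. The 30-frame threshold below is an arbitrary illustrative value.

# Skeleton of the eye-monitoring loop using OpenCV's bundled Haar cascades.
# A real system would feed eye crops to a trained model (e.g. in Keras)
# instead of relying on detection alone; the 30-frame threshold is illustrative.
import cv2

eye_cascade = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_eye.xml")

cap = cv2.VideoCapture(0)          # default webcam
closed_frames = 0

while True:
    ok, frame = cap.read()
    if not ok:
        break
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    eyes = eye_cascade.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)

    closed_frames = 0 if len(eyes) > 0 else closed_frames + 1
    if closed_frames > 30:          # no eyes detected for ~30 consecutive frames
        print("ALERT: possible drowsiness detected!")

    cv2.imshow("Driver monitor", frame)
    if cv2.waitKey(1) & 0xFF == ord("q"):
        break

cap.release()
cv2.destroyAllWindows()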



13. Currency Converter

Platform: Desktop GUI


As the name suggests, this project involves building a currency converter that lets you enter the desired value in the base currency and returns the converted value in the target currency. A good practice is to code the ability to fetch updated conversion rates from the internet for more accurate conversions. For this one too, you can use the tkinter package to build the GUI.
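
The conversion itself is a single multiplication. The sketch below hard-codes a couple of illustrative exchange rates; a real app would fetch fresh rates from the internet and wrap this logic in a tkinter interface.

# Minimal conversion sketch; the rates below are illustrative placeholders,
# not live values. A real app would fetch up-to-date rates from the internet.
RATES_TO_USD = {"USD": 1.0, "EUR": 1.10, "COP": 0.00026}  # example rates

def convert(amount, from_currency, to_currency):
    """Convert `amount` from one currency to another via USD."""
    usd = amount * RATES_TO_USD[from_currency]
    return usd / RATES_TO_USD[to_currency]

print(round(convert(100, "EUR", "COP"), 2))  # 100 EUR expressed in COP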


Other Project Ideas


  • Plagiarism checker in Python
  • Python file explorer
  • Alarm clock
  • A real-time price alert app
  • Site connectivity checker
  • Regex query tool
  • Expense tracker
  • YouTube video downloader

Conclusion

Wrapping up our list of interesting ideas and projects that can be built using Python, we can say that Python is a very useful programming language for developing applications of all kinds and scales.

In addition, the packages provided by Python offer immense value to developers by greatly simplifying the development process. To finish, we would like to say that the potential with Python is unlimited, and the only thing it might be missing is the right idea.

If you have more suggestions or ideas, I would love to hear them.

If you want to take a Python course, we recommend the following: Curso de Python Intermedio.

See you next time!







2020-09-24

Current 5 projects

Current 5 projects and stages

  1. DataSource.ai - Data science competitions for startups. It's my main project. However, it's difficult to get startups to sponsor competitions. I'll need a partner to join in sales/business development; I'm not so good at this task.
  2. datajobs.dev - Fully remote data science jobs. It's a niche job board.
  3. datacademy.dev - I have more than 3 years of experience as a mentor and teacher at a coding bootcamp called makeitreal.camp, with 4 full-stack Ruby on Rails bootcamps under my belt and 2 data science bootcamps. I decided to start a platform to share intermediate and advanced data science and machine learning short courses and screencasts in Spanish.
  4. dataendpoint.co - Machine learning API marketplace. Here I want to have an API marketplace where developers can earn money with their machine learning endpoints. Not launched yet. Not functional yet.
  5. narrativetext.co - (Fully functional now) - A place to share data science notebooks as newsletters, where data scientists can earn money with subscriptions. Reading technical data science code is better in notebooks than in plain text on Medium, for instance.







2020-09-24

Working on 5 different projects simultaneously

Currently I'm working on several different projects at the same time, and this kind of approach has some pros and cons.

Pros
* I keep motivation high
* Changing projects each day keeps me in a good mood
* I learn from different projects
* I can test different ideas and build a good portfolio
* I'll never know before executing which one will be a winner
* It's like having a micro-VC when I invest time in the better projects
* I can test all the ideas I had before, back when I wasn't aware of this strategy
* I need to be really good with time management and distractions
* When the MVPs are ready and start monetizing, I'll have different income sources
* I'll need to automate a lot of tasks when the MVPs are ready

Cons
* Some tasks stay pending
* It's necessary to repeat some code across different projects, which is a bad thing
* I have limited time, so I can't get everything I want done in the short term
* If one idea really takes off, I'll probably need to shut down some of the others. This is a good thing at the same time
* Less time for marketing. I need to put more time into this task
* A burned-out feeling sometimes
* In the short term, I have less time to learn new things and study ML








2020-09-10

Having datacademy.dev Ready

I have datacademy.dev almost ready. It took me just about 2 days to write all the code necessary to host courses and screencasts related to data science. However, this time the intent is to have a platform dedicated just to education in Spanish. This is because teaching is my main skill right now and I need to explore this way of living, as a teacher!

Truth be told, in Spanish we don't have much material to learn from, and the best tutorials are obviously made in English. My approach is to offer high-quality, intermediate/advanced data science courses and screencasts. The idea with screencasts is to have at least one per week. With courses, the idea is to release one every month or two, each lasting between 10 and 15 hours, covering diverse and related topics, all of them in data science and with a price range of $19.99 to $49.99.

The last and most relevant goal here is to keep gaining more knowledge, because you know: "by teaching, we learn twice as much".

I have 2.5 years of experience as a mentor/teacher, so I already know lots of tips and tricks to teach effectively. Once I reach $500 USD monthly, I'll upgrade my setup: mic and camera.







2020-09-01

Launching today on ProductHunt again. But this time with DataSource.ai

It seems like I made a big mistake with this launch, because I launched yesterday at night, specifically 10pm GMT-7, thinking I'd catch the Asia and Europe traction, but it seems like the real launch hour for ProductHunt is 00:00 PT. So, big mistake.

I woke up today with just 2 upvotes and, like last time, I'm thinking that this is a failure. My team and I promoted the launch and now we have 47 upvotes, but the bad news is that this is all... I decided to stop self-promoting, because the general idea is to be hunted by ProductHunt users, and it seems like nobody noticed us. So for now I'll be working through a list of other sites where I can post.







2020-09-01

Learn how you can be the #1 Product of the Day on ProductHunt

This article was originally posted on Sept 16, 2019, when I launched Bookcademy...


Hi All!

I want to share with you the lessons learned in the process of becoming the #1 Product of the Day on ProductHunt.

I posted my startup last Friday the 13th, 2019, and I got 5 upvotes! I thought the launch was a big failure. I had a long talk with my partner, and she was more inclined to keep working and see what other paths we could explore. However, I was expecting short-term results from PH given the number of success stories about launching there. It really was a Friday the 13th for me; I seriously thought about quitting the launch effort, but I finally came to the conclusion that we needed more time and users to figure out if we were really on track. I downgraded my servers (balancers and other stuff) because, as far as I was concerned, it didn't work!

As always, I tried to solve this with a book, trying to find something in books that would help me address this concern. I found "Traction: How Any Startup Can Achieve Explosive Customer Growth" by Gabriel Weinberg and Justin Mares. I didn't get very far into the reading, but I found inspiration to keep working and move forward!

Sunday morning (2 am CT, 2 days later!) I started to receive lots of emails (from my alert system in the platform) notifying me about a lot of new user signups and about errors with my server and my transactional email provider. But I was sleeping! At 7 am I checked my email and saw all this crazy stuff, and I thought I had been hacked or something like that. But why, if I had nothing yet? Just a few users, and no payment system yet! I was so curious; I don't usually touch my laptop or desktop on Sundays, but this time was different.

When I was able to turn on my laptop, I saw a surge of traffic on my site, and the referrer was: ProductHunt!

Totally unexpected. I immediately started working, trying to fix the bugs in the platform and answering questions about them! Once we had that sorted, everything went smoothly and users loved the product!

The key points here were:
  • You just need a good product with a clear value proposition
  • Posting probably won't pay off the same day. You probably need to wait a while
  • Be prepared, because once your post is live, your servers need to run like a charm
  • Have good servers for at least 1-2 weeks or more
  • If you have a community behind you, that's great, but if not (we don't have one), people will discover your product on the platform itself (that's the game)
  • Don't launch on Fridays; the best days to launch are Monday to Wednesday (given that you probably need to wait to see votes)
  • The best hour to launch is 6 am GMT

Here is the result: https://www.producthunt.com/posts/bookcademy-2

Regards!

DanielM

Bookcademy.com







2020-08-23

Last week, this week

Last week I was trying to follow the shotgun approach, inspired by Alex West, and I made a list of new possible ideas to pursue. With this list in mind, I started to create dataendpoint.co, and I made a lot of progress on the design and build of that platform. However, this brainstorm helped me arrive at another solution within datasource.ai that allows me to pivot the idea.

The idea, basically, was to stop approaching companies in Latam and instead approach startups around the world, especially from the USA. The other valuable thing I discovered about this pivot was the fact that I can play with almost every part of the project, like:
- I can change the pricing plans
- I can change the target companies
- I can change the marketing copy
- I can change the geographic approach
- I can change the details of the competitions, like timing, prizes, prize allocation, and other rules

This kind of change allows me to be unique, not to copy anyone else, and to have a unique value proposition, this time focused on startups and on democratizing data science competitions. As you can see, I started with a copycat, but now I have a unique value proposition with the right model to approach the market.

Thanks to this brainstorm, I'm almost finished with the changes to the datasource platform to re-launch with this new approach.

This week the idea is to start with Startup School's new program, focused on a 1-month sprint, trying to iterate, launch, and even monetize during the sprint. I'll be doing that this week, so I hope to get my first approach ready, hoping for good acceptance and the first competitions rolling.

Hopes!







2020-08-17

All weekend programming

I'm building dataendpoint, and the idea is to have a marketplace of machine learning APIs. I made a lot of progress this weekend; however, there are things that I think I need to do first. For instance, I need to create each API, with the correct documentation and the correct Jupyter Notebooks. Once I have this, I can show people what the app can do.

This will probably need a little more time than I suspected, but I need to advance as much as I can.







2020-08-16

Sunday Working

Today I'm creating the DataEndpoint project, with the idea of having an API marketplace where I can upload my personal projects, and where machine learning engineers and data scientists can also upload their APIs and earn money with subscriptions for requests to the API. This API marketplace is focused exclusively on machine learning projects.

The idea is to have the MVP ready this month; this is the second sprint this month.







2020-08-15

Habits that I need to keep or to develop

List of habits
  • Write a daily learning on this blog
  • Keep focus on the monetization plan and projects
  • Keep healthy weekly schedules, and don't work on Sundays

Other good habits
  • Exercise
  • Reading
  • Study English
  • Deep dive on Machine Learning and Deep Learning
  • Painting
  • Playing guitar







2020-08-15

Monetization schedule (2 months from idea to monetization)

This monetization schedule is for creating products and testing them at a fast pace.

First month (Review, Build and Growth Hacking Strategy)

Week one (Review phase)
Review lists of ideas, make some research, try to find competitors, try to find best spots to get first customers, build a simple Business Model Canvas

Week two (Build MVP phase I)
Design and create database, select template, select tech stack, buy domain, start coding

Week three (Build MVP phase II)
Create all necessary code and build the payments system

Week four (Growth hacking strategy)
Select channels to launch, create strategy to approach market

Second month (Monetization)
Launch, get feedback, learn, monetize, achieve the goal of $500 USD MRR

If everything works and monetizes, put laser focus on creating value and raising MRR toward the next MRR goal. If I don't achieve the goal, start over again: rest a while (1 week max), think about the process, write about the learnings, and go again with the next idea on the list.







2020-08-15

New mindset to approach daily activities, habits, routines and projects

The first thing I need now is to have the correct mindset about projects, habits, routines, and activities. Part of that is, for instance, asking myself whether an activity is monetizable. This could happen with screencasts: I need to finish the Python screencast, but if I ask myself whether this activity ends in a monetization strategy, the answer is no. Because of that, I shouldn't do this activity anymore if it isn't my main monetization activity.

So, I'll always need to ask myself these questions:
  1. Does this activity make me earn money almost immediately?
  2. Is this activity inside my monetization schedule?
  3. Laser focus on trying to monetize just these products; other things are a waste of time.







2020-08-15

One month sprint sessions

The other thing I need to do is one-month sprint sessions, launching and trying to monetize at least 1 product per month. As discussed previously, I'll need to create these kinds of test products, launch them on the sites mentioned, and try to monetize immediately.

Following this post: https://www.danielmorales.co/posts/7

The key here is to try to do that in just one month!!








2020-08-15

Should an entrepreneur learn to code? — YES!

Yes! Especially if that entrepreneur wants to enter the digital space. This is not optional; it's a must-have skill.

Why is it important?

In your first days as an entrepreneur, you probably don't have any friends, colleagues, or contacts with developer skills. You probably don't have a lot of money to hire a good engineer. You probably don't have any reputation (aka past success) to attract engineers to your new adventure. You probably can't convince any developer of your great idea... and because of all that, you won't have the most important thing in your early days: the product.

That's why the only way to execute your big plans is to learn to code. You won't need to be a senior developer; you'll only need enough skill to build a good MVP.

A lot of entrepreneurs get stuck here, 'cause they don't know how to solve this bottleneck. If your scenario is similar to the above, the only way to solve it is with your own hands. If you can build a good MVP, further down the road you may hire or attract your first believer in the shape of a developer, and you will have learned a lot of stuff along the way.

How much time will I need to invest to learn to code?

If you don't have any engineering background, you will probably need to invest at least 6 months (full time). I know, I know, it's a lot of time. You're probably saying right now: "if I don't launch my startup this month, my competition will probably gain a huge advantage" or "I don't have that much free time". Well... you don't have any other way. I can promise you that this will be the most important investment of your entrepreneurial career. Why?

1- The most important thing is your learning mindset: if you invest this energy and time in learning, you will develop a kind of "learning mindset" that you will apply to many more things later on.

2- With this new knowledge (programming) you will be able to answer: what type of engineers will my startup need? Working in which language? What details do I need to achieve my goals? Which interview questions or challenges to use and what the correct answers are, and what are your engineers actually doing? In other words, you can see the other world, the world of geeks! With that you will have a lot of power.

3- You will be able to know: which are the newest and most advantageous technologies? How do I apply these new technologies in my startup? And you might do that before your competition does. You probably know what machine learning is, but do you know how Python, Pandas, or Scikit-learn do certain things, or in which cases you should use them? Again, you don't need to know these skills in detail, but you will need to know how the basics work.

What is the path of my new learning adventure?

1- The first step is to select the right learning path. If you are going to build a web app, you will need HTML, CSS, and Javascript. Again, you don't need to be an expert or senior developer. You only need an MVP. In one month or less (learning full time) you will get there.

2- Algorithms: you need this skill. It's easier than you think, believe me. I don't have any engineering background, but I made the effort to learn it and, believe me, it's easier than you think. You need to know: strings, types, loops, methods, objects, and a few more things. With this basis you will have done most of what's needed to move into a specific language.

3- Specific language: it depends on your goal. For me, at this moment the 2 best languages to learn are Javascript and Python. With these 2 languages you can go into a lot of technologies, such as: Artificial Intelligence (Python), Machine Learning (Python), Deep Learning (Python), APIs (Python and Javascript), Frontend - React.js (Javascript), IoT (Javascript), AR/VR (Javascript), Blockchain (Javascript), Alexa (Javascript), Mobile Apps (Javascript), Web Apps (Python).

4- Specific technology: above you can see the most important specific technologies of our time. It depends on your focus, but with those 2 languages you can embrace all of these technologies.

Once you have these skills you are on another level. You have a lot of power. You will be able to identify new technologies, you can apply them, and you will have the best of both worlds: the entrepreneur mindset and the developer mindset.

The original post was created here: https://medium.com/@danielmorales/would-should-an-entrepreneur-learn-to-code-yes-c9782fc89dc7







2020-08-15

How can I sell the vision to a team without? — 4 simple steps

Look at the stars... you probably did or still do, but maybe your team doesn't. You can probably see, or will see, the future, the way, or the next step, but your partners and employees don't.

That would be the beginning of big trouble, 'cause you need points of support, not obstacles.

Maybe your team hasn't read, listened to, or watched any success stories or successful people. They are living a "scarcity life" (or maybe have a scarcity mindset) and they can't believe that things will happen for them. The problem here is: a person cannot change their mindset overnight. That takes time, a lot of time. Believing that you will achieve success is very hard and requires a lot of time, books, movies, audio, real-life examples, and more.

Having a great vision depends on having a successful mindset. Typically, having a vision means: "You can see, imagine, and visualize things that will happen in the future. Things that other people can't see. Sometimes, crazy and out-of-the-box things."

Having a vision, for me, is a skill more than an esoteric thing, 'cause you can practice it and make it part of your life. You can write about it, share it, and make it happen.

Now, how can I sell this vision to my team?


First of all, selling is the key word here, 'cause you will need this skill to share your vision. Selling means you should "infect" your team with your vision. For example:

In Colombia, where I launched my last startup (and where I live for now), people's mindset (as you might imagine) comes from "scarcity". The scarcity environment is contagious, 'cause your family, friends, colleagues, basically everyone, lives in a "scarcity mindset". This is the reality. So when you approach these people with an "abundance mindset", they will think that you are basically crazy and out of touch with reality. With that, you can imagine how hard it is to sell the big prize, the big picture, and the dream here. The kindest tags you'll get are: dreamer, scammer, or... other big bullshit.

However, I decided to make it possible. My first job was to tell the story:

1- What is a venture capital firm? At that point I needed to explain to my team and key employees: How does it work? What is its target? Where does its money come from? What will it expect from us? How does it work in other countries? Why did it decide to support us? In short, everything they needed to understand what was really happening so that they could feel part of it.

2- Once they're clear about that, the next job is to explain the key targets, such as: expansion, fast growth, next rounds, possible exits, and more.

3- Once the key employees know the actual situation of the startup, you can get their commitment with, e.g., shares rather than high salaries.

4- Vesting form. Once you have shared your vision and explained how it works, you need to put their commitment into a vesting form. This is the best way to retain your best talent and protect your startup. At that point you have hit the target: you sold the vision.

Note: you don't need to explain every point to every team member. You only need to sell your vision to key employees.

This post was originally created here: https://medium.com/@danielmorales/how-can-i-sell-the-vision-to-a-team-without-4-simple-steps-cc4bee71c325







2020-08-15

Now, how do I execute a seed round after it’s raised?

This post was originally posted here: https://medium.com/@danielmorales/now-how-do-i-execute-a-seed-round-after-its-raised-8b91198099f3


4 years ago, my small team and I raised a USD $350,000 seed round from a Colombian venture capital firm called Velum Ventures. We were growing around 10% every month and selling about USD $150,000 annually. Our startup was called FloreyMas.com (flowers and gifts delivery) and our value proposition was:

1- Amazing customer service

2- Fast digital experience: UX/UI, payment and delivery.

3- Mid-term drop-shipping: mid-term because we managed the production of the flower arrangements with our own hands (because we wanted quality guarantees on this point), but we didn't have any farm, warehouse, or delivery vehicles, and, most importantly, no inventory. We had key partnerships, and our physical location was "in front of" our biggest and best suppliers (local spaces designed for wholesale flowers). This was the key, because our suppliers took on the inventory risk (flowers have a short life expectancy) and we only bought fresh flowers when we had a paying customer, and only at the "delivery moment". Our suppliers were located downtown, which meant that rental prices were very low and we could also get more office space (as a plus).

With that value proposition (especially point 3) and with our fast growth ('cause we could sell at lower prices, approximately 20% less), we were ready to pitch our startup.

After 12 months (from our first approach to the fund), we got it! We had an investment! We got one of the first 3 investments from Velum Ventures and we were so excited! Very few startups (in Colombia or LATAM) get an investment, and even fewer raise the capital we did (USD $350,000 in total after a few follow-ons).

But all of us know that raising money is only the first (or really the second) step in this difficult startup world. The key question is: now, how do I execute a seed round after it's raised?

The first thing I would say is: we will always make a lot of mistakes. Always. I was the CEO and founder, but after I got the investment I was working on everything within the startup, especially engineering. For me it was very difficult to be the leader, manage people, and do all the kinds of things a CEO is supposed to do.

However, I was determined to make it happen, and I did it (well... only a few things). I made a lot of mistakes and eventually I will write about them. In this particular case, drawing on my experience, I will talk about my first big mistakes (I'm the only one to blame for these mistakes):

1- We didn't find a "real winner business model" before we raised the money.

2- We focused on fast growth (other product lines, more employees, and more cities or countries) and spent a lot of money on ineffective paid ads. We didn't focus on recurring customers, organic purchase growth, word of mouth, or the quality of the business model and strategies.

3- As CEO I lost my focus on small things and day-to-day tasks.

4- Bad hires; I didn't know the Colombian labor laws.

5- Bad partners, bad relationships, bad communication (all of that was my fault, only my fault), 'cause I was a bad leader and a bad communicator. I'm getting better.

6- I didn't have life balance and I didn't have a life partner (aka a serious girlfriend or wife :)). I'm getting better :)

I will write more about these 6 points later on.
Now, how do I execute a seed round after it's raised?

1- You will need an execution plan, and this plan depends on your particular situation. This is all about asset allocation: put the money in the right moment, amount, and department. For example: in our situation we needed, first of all, to be professional. We needed a lawyer and an accountant. A lawyer because we needed employees with all the benefits and legal requirements (before we raised the money, all our employees had been freelancers). And we needed an accountant because the numbers needed to be in order.

2- One of the most important things you will need to invest money in is a good person as administrative help. Your day-to-day tasks are complicated, numerous, and time-consuming, but necessary. So if you don't currently have this type of person, you will need one.

3- After these points, you will need to invest in the most important thing: your growth. Here the question is: where do our biggest and most important revenues come from? There are different answers to that question: paid ads (the worst answer), word of mouth (the best answer), growth hacking (the very best answer). Which one is yours? Put the money there.

4- The team is the next big one. You will probably want to grow the team, but... you will need to examine your key activities and hire really good people. This point is one of the most important; I will write more about it later.

Here you can finish the first part of your asset allocation. Please be sure about where you need to invest: the moment, the amount, and the department.

Best Regards
