NumPy
What is NumPy
NumPy is the fundamental package for scientific computing in Python. It is a Python library that provides a multidimensional array object, various derived objects (such as masked arrays and matrices), and an assortment of routines for fast operations on arrays, including mathematical, logical, shape manipulation, sorting, selecting, I/O, discrete Fourier transforms, basic linear algebra, basic statistical operations, random simulation and much more.
NumPy is used for scientific computations in Python. These computations are very useful while doing machine learning computations. It can be used to write a machine-learning and deep learning algorithms from the scratch. Out of the libraries we are going to cover in this blog, Numpy is the most important one. So you should have a good understanding of Numpy.By using NumPy, you can speed up your workflow, and interface with other packages in the Python ecosystem, like scikit-learn, that use NumPy under the hood.
Python lists have limitations to it. Like we cannot perform algebric computations using list. Thus Numpy has arrays, which cover all the limitations of lists.NumPy arrays are more like C arrays than Python lists.
import numpy as np
After this import statement, we will always use pd for accessing the in-built pandas functions and methods.
Creating a numpy array.
s=np.array([1,2,3,4])
s
array([1, 2, 3, 4])
Some basic computations
s+3
array([4, 5, 6, 7])
So we add 3 to all the elements. It is just simple scalar addition. If we do this operation on lists, then it won’t add 3 to the numbers, instead it will add append 3 to the list.
[1,2,3,4]+[3] #example
[1, 2, 3, 4, 3]
Subtraction
s-3 #we can do similar scalar computations
array([-2, -1, 0, 1])
Checking the type and shape
print(type(s))
print(s.shape)
print(s.dtype)
<class 'numpy.ndarray'>
(2, 4)
int64
The output clearly shows that the type of s is of type ndarray.
The shape of the array is (4,). This means that it is a one-dimensional array, with 4 columns and 1 rows.
The datatype of the array elements is of type int.
s=np.array([[1,2,3,4],[7,8,9,10]])
s.shape
(2, 4)
The shape now here is (2,4) which means 2 rows and 4 columns. Also for creating a multidimensional array, we pass multiple lists as done above.
Converting to a different data type
s.astype('float')
array([[ 1., 2., 3., 4.],
[ 7., 8., 9., 10.]])
Generating Random Numbers With Numpy
np.random.normal()
3.137676397424334
The above code gives a random number from normal distribution.
Generate Multiple Random Numbers From The Normal Distribution
np.random.normal(size=4)
array([-0.47623571, 1.7373749 , 0.15226591, 0.38203617])
Generate Four Random Integers Between 1 and 100
np.random.randint(low=1, high=100, size=4)
array([47, 69, 27, 65])
Create an Array of Zeroes and Range Generator
np.zeros(6)
array([ 0., 0., 0., 0., 0., 0.])
Create an array from 1 to 100
np.arange(0, 100)
array([ 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16,
17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33,
34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50,
51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67,
68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84,
85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99])
Create 100 equally space numbers between 0 and 1
np.linspace(0, 1, 100)
array([ 0. , 0.01010101, 0.02020202, 0.03030303, 0.04040404,
0.05050505, 0.06060606, 0.07070707, 0.08080808, 0.09090909,
0.1010101 , 0.11111111, 0.12121212, 0.13131313, 0.14141414,
0.15151515, 0.16161616, 0.17171717, 0.18181818, 0.19191919,
0.2020202 , 0.21212121, 0.22222222, 0.23232323, 0.24242424,
0.25252525, 0.26262626, 0.27272727, 0.28282828, 0.29292929,
0.3030303 , 0.31313131, 0.32323232, 0.33333333, 0.34343434,
0.35353535, 0.36363636, 0.37373737, 0.38383838, 0.39393939,
0.4040404 , 0.41414141, 0.42424242, 0.43434343, 0.44444444,
0.45454545, 0.46464646, 0.47474747, 0.48484848, 0.49494949,
0.50505051, 0.51515152, 0.52525253, 0.53535354, 0.54545455,
0.55555556, 0.56565657, 0.57575758, 0.58585859, 0.5959596 ,
0.60606061, 0.61616162, 0.62626263, 0.63636364, 0.64646465,
0.65656566, 0.66666667, 0.67676768, 0.68686869, 0.6969697 ,
0.70707071, 0.71717172, 0.72727273, 0.73737374, 0.74747475,
0.75757576, 0.76767677, 0.77777778, 0.78787879, 0.7979798 ,
0.80808081, 0.81818182, 0.82828283, 0.83838384, 0.84848485,
0.85858586, 0.86868687, 0.87878788, 0.88888889, 0.8989899 ,
0.90909091, 0.91919192, 0.92929293, 0.93939394, 0.94949495,
0.95959596, 0.96969697, 0.97979798, 0.98989899, 1. ])
Indexing and Slicing Numpy Arrays
a = [[34, 25], [75, 45]]
b = np.array(a)
b
array([[34, 25],
[75, 45]])
b[0, 1]# Select the top row, second item
25
b[:, 1]# Select the second column
array([25, 45])
b[1, :]# Select the second row
array([75, 45])
b[1,1] # select the intersection of 1st row and 1st column
45
Aggregate functions
print(b.mean())
print(b.min())
print(b.max())
44.75
25
75
Change Value using Array Index
b[0]=100
b
array([[100, 100],
[ 75, 45]])
The value of the entire row is set to 100. If you want to change only 1 value, use the following:
b[0,1]=12
b
array([[100, 12],
[ 75, 45]])
Now just the 2nd element of 1st row changes.
Changing Shapes
Sometimes you’ll need to change the shape of your data without actually changing its contents. For example, you may have a vector, which is one-dimensional, but need a matrix, which is two-dimensional. There are two ways you can do that.
b.reshape(1,4)
array([[100, 100, 75, 45]])
b.reshape(4,1)
array([[100],
[ 12],
[ 75],
[ 45]])
More Examples on Reshaping
a=np.arange(6).reshape((3, 2))
a
array([[0, 1],
[2, 3],
[4, 5]])
np.reshape(a, (2, 3))
array([[0, 1, 2],
[3, 4, 5]])
Transpose
Transpose is the same as matrix transpose.
m(row,column)=m(column,row)
a
array([[0, 1],
[2, 3],
[4, 5]])
a.T
array([[0, 2, 4],
[1, 3, 5]])
a.transpose()
array([[0, 2, 4],
[1, 3, 5]])
Matrix Multiplication
a
array([[0, 1],
[2, 3],
[4, 5]])
Using scalar multiplication
a*a
array([[ 0, 1],
[ 4, 9],
[16, 25]])
The above result is not a matrix multiplication product. It simply multiplies the numbers element wise. That means (0,0) is multiplied with (0,0), (0,1) with (0,1) and so on. This is not matrix multiplication but a normal scalar multiplication.
Using Matrix Multiplication
a=np.array([[1,2],[3,4]])
np.dot(a,a)
array([[ 7, 10],
[15, 22]])
The above result is a proper matrix multiplication
Pandas
The real world datasets are not very clean, and contain a lot of irrelevant,dummy and empty values. We cannot pass the data as it is to the Machine Learning algorithms, as the algorithms can process the data in only some formats. Thus it is necessary to tranform the data into appropriate formats. For cleaning and transformation of data, we use the Pandas library.
Pandas is the most widely used library for data munging and transformation. It is Python equivalent of SQL. SQL, as you might know is used for directly working with databases, while Pandas is used to work with tables rather than databases. However Pandas has a better data cleaning and wrangling capacity. A basic knowledge of Pandas is sufficient enough for working with Machine Learning problems. In the following blog, I will try to explain all the important functions and methods in Pandas, and try to give its SQL equivalent code for better understanding.
#importing the library
import pandas as pd
df=pd.read_csv('Pokemon.csv')
Now Pandas has tha capacity to read different types of data files. It can be a excel file, .csv file or even a json file. All of these files have their different method. Like read_csv() is for csv files, similarly read_excel() is for .xlsl file or excel files. Now df is the variable where the csv file is stored.
Displaying the Data
Just to check what does the dataset looks like. We will display the first 5 rows of the dataset
SQL: select * from df LIMIT 5;
df.head(5)
# | Name | Type 1 | Type 2 | Total | HP | Attack | Defense | Sp. Atk | Sp. Def | Speed | Generation | Legendary | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | Bulbasaur | Grass | Poison | 318 | 45 | 49 | 49 | 65 | 65 | 45 | 1 | False |
1 | 2 | Ivysaur | Grass | Poison | 405 | 60 | 62 | 63 | 80 | 80 | 60 | 1 | False |
2 | 3 | Venusaur | Grass | Poison | 525 | 80 | 82 | 83 | 100 | 100 | 80 | 1 | False |
3 | 3 | VenusaurMega Venusaur | Grass | Poison | 625 | 80 | 100 | 123 | 122 | 120 | 80 | 1 | False |
4 | 4 | Charmander | Fire | NaN | 309 | 39 | 52 | 43 | 60 | 50 | 65 | 1 | False |
.head() with the number of rows as the argument shows the first ‘n’ rows of the dataset. If you just write df without .head(), the complete dataframe will be displayed.
Now the above table is known as a dataframe, and each column of the ‘dataframe’ is known as a ‘Series’.
For displaying a single column of the dataframe, just specify the column name as follwing: df[‘column-name’]
df['HP'].head(5)
0 45
1 60
2 80
3 80
4 39
Name: HP, dtype: int64
Showing only certain columns
SQL: select HP, Defence, Total from df;
Thsi will only display name and salary from salary table.In pandas , we pass a list of column names to the variable as follows:
df[[‘col-name1’,’col-name2’]]
df[['HP','Defense','Total']].head(5)
HP | Defense | Total | |
---|---|---|---|
0 | 45 | 49 | 318 |
1 | 60 | 63 | 405 |
2 | 80 | 83 | 525 |
3 | 80 | 123 | 625 |
4 | 39 | 43 | 309 |
Slicing
Slicing dataframes is same as slicing lists. The only difference is that we can pass two dimensions of slicing, 1-row and 2-column.
The syntax is df.iloc[row,column]
df.iloc[1:,:4].head()
# | Type 1 | Type 2 | Total | |
---|---|---|---|---|
Name | ||||
Ivysaur | 2 | Grass | Poison | 405 |
Venusaur | 3 | Grass | Poison | 525 |
VenusaurMega Venusaur | 3 | Grass | Poison | 625 |
Charmander | 4 | Fire | unknown | 309 |
Charmeleon | 5 | Fire | unknown | 405 |
In the above statement, we slice from 2nd row till the last row, and 1st column to the 3rd column.
Data Filtering
Data Filtering is done by using a condition in the where clause in SQL. For pandas we have boolean filtering, where we specify a condition on some column.
SQL: select * from df;
where Total>100;
Lets filter Pokemons with Total > 100
(df['Total']>100).head(5)
0 True
1 True
2 True
3 True
4 True
Name: Total, dtype: bool
Whats the problem in the above output?? The problem is that we just get True or False and not the actual data. For getting the actual data, write the above code in following way: df[df[‘column-name’] condition]
df[df['Total']>100].head(5)
# | Name | Type 1 | Type 2 | Total | HP | Attack | Defense | Sp. Atk | Sp. Def | Speed | Generation | Legendary | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | Bulbasaur | Grass | Poison | 318 | 45 | 49 | 49 | 65 | 65 | 45 | 1 | False |
1 | 2 | Ivysaur | Grass | Poison | 405 | 60 | 62 | 63 | 80 | 80 | 60 | 1 | False |
2 | 3 | Venusaur | Grass | Poison | 525 | 80 | 82 | 83 | 100 | 100 | 80 | 1 | False |
3 | 3 | VenusaurMega Venusaur | Grass | Poison | 625 | 80 | 100 | 123 | 122 | 120 | 80 | 1 | False |
4 | 4 | Charmander | Fire | NaN | 309 | 39 | 52 | 43 | 60 | 50 | 65 | 1 | False |
Now the above output contains the Pokemons with HP > 100
Multiple conditions
df[(df['Total']>100)&(df['HP']>50)].head(5)
# | Name | Type 1 | Type 2 | Total | HP | Attack | Defense | Sp. Atk | Sp. Def | Speed | Generation | Legendary | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1 | 2 | Ivysaur | Grass | Poison | 405 | 60 | 62 | 63 | 80 | 80 | 60 | 1 | False |
2 | 3 | Venusaur | Grass | Poison | 525 | 80 | 82 | 83 | 100 | 100 | 80 | 1 | False |
3 | 3 | VenusaurMega Venusaur | Grass | Poison | 625 | 80 | 100 | 123 | 122 | 120 | 80 | 1 | False |
5 | 5 | Charmeleon | Fire | NaN | 405 | 58 | 64 | 58 | 80 | 65 | 80 | 1 | False |
6 | 6 | Charizard | Fire | Flying | 534 | 78 | 84 | 78 | 109 | 85 | 100 | 1 | False |
Changing the index
df.index #checking the current index
RangeIndex(start=0, stop=800, step=1)
For changing the index, use method .set_index(‘column-name’), where column-name is the column you want to set as the index.
df.set_index('Name',inplace=True)
The inplace=True attribute is used to make the changes directly into df. If we dont use inplace , then the changes made in the dataframe will be temporary and will disappear in the next step
df.index
Index(['Bulbasaur', 'Ivysaur', 'Venusaur', 'VenusaurMega Venusaur',
'Charmander', 'Charmeleon', 'Charizard', 'CharizardMega Charizard X',
'CharizardMega Charizard Y', 'Squirtle',
...
'Noibat', 'Noivern', 'Xerneas', 'Yveltal', 'Zygarde50% Forme',
'Diancie', 'DiancieMega Diancie', 'HoopaHoopa Confined',
'HoopaHoopa Unbound', 'Volcanion'],
dtype='object', name='Name', length=800)
Working with NULL values
df.isnull().sum()
# 0
Type 1 0
Type 2 386
Total 0
HP 0
Attack 0
Defense 0
Sp. Atk 0
Sp. Def 0
Speed 0
Generation 0
Legendary 0
dtype: int64
The above output shows that there are 386 null values in ‘Type 2’ column while other columns are free from null values.
Filling/Droping the NULL values
df.fillna('unknown',inplace=True)
The above statement will fill all the NULL values in the dataframe with value as ‘unknown’. We can also fill NULL values for only some columns. For that, specify the column name and use .fillna(‘value’) as follows: df[‘column-name’].fillna(‘value’). For eg:
df['Type 2'].fillna('unknown',inplace=True)
df.isnull().sum()
# 0
Type 1 0
Type 2 0
Total 0
HP 0
Attack 0
Defense 0
Sp. Atk 0
Sp. Def 0
Speed 0
Generation 0
Legendary 0
dtype: int64
So now we do not have any NULL values left in the dataframe.
df.dropna(inplace=True)
The above statement will drop all NULL values from the dataframe. We can also drop NULL values from a single column also. The syntax will be same as in case of .fillna()
Counting Frequency
This is basically used to get a count of a value in a column of a dataframe.
SQL: Select Type1,count(Type1)
from df,
Group By Type1;
For Pandas we have .value_counts(),which returns the frequency of each unique value in a dataframe column.
df['Type 1'].value_counts()
Water 112
Normal 98
Grass 70
Bug 69
Psychic 57
Fire 52
Electric 44
Rock 44
Ground 32
Ghost 32
Dragon 32
Dark 31
Poison 28
Steel 27
Fighting 27
Ice 24
Fairy 17
Flying 4
Name: Type 1, dtype: int64
The above output shows the frequency of each distinct Type from the ‘Type 1’ column.
GroupBy
It has the same functioning as the GroupBy in SQL. .groupby() typically refers to a process where we’d like to split a dataset into groups, apply some aggregate function, and then combine the groups together.
SQL: Select Type1,count(Type1) from df ,Group By Type1;
df.groupby('Type 1')['Type 1'].count()
Type 1
Bug 69
Dark 31
Dragon 32
Electric 44
Fairy 17
Fighting 27
Fire 52
Flying 4
Ghost 32
Grass 70
Ground 32
Ice 24
Normal 98
Poison 28
Psychic 57
Rock 44
Steel 27
Water 112
Name: Type 1, dtype: int64
Okay so the above output is same as the output we got with value_counts(). Lets try grouping by with multiple columns.
SQL: Select Type1,count(Type1),
from df,
Group By Type1,Type2;
df.groupby(['Type 1','Type 2'])['Type 1'].count()
Type 1 Type 2
Bug Electric 2
Fighting 2
Fire 2
Flying 14
Ghost 1
Grass 6
Ground 2
Poison 12
Rock 3
Steel 7
Water 1
unknown 17
Dark Dragon 3
Fighting 2
Fire 3
Flying 5
Ghost 2
Ice 2
Psychic 2
Steel 2
unknown 10
Dragon Electric 1
Fairy 1
Fire 1
Flying 6
Ground 5
Ice 3
Psychic 4
unknown 11
Electric Dragon 1
..
Rock Ground 6
Ice 2
Psychic 2
Steel 3
Water 6
unknown 9
Steel Dragon 1
Fairy 3
Fighting 1
Flying 1
Ghost 4
Ground 2
Psychic 7
Rock 3
unknown 5
Water Dark 6
Dragon 2
Electric 2
Fairy 2
Fighting 3
Flying 7
Ghost 2
Grass 3
Ground 10
Ice 3
Poison 3
Psychic 5
Rock 4
Steel 1
unknown 59
Name: Type 1, dtype: int64
So in the above example, we have grouped Type1 and Type2 columns, and have counted the number of Pokemons in each of the subgroups.
Aggregations
Applying aggregate functions on the dataframe
SQL: select max(HP) from df
2)select min(HP) from df
3)select count(*) from df…etc
print('Max HP:',df['HP'].max())
print('Min HP:',df['HP'].min())
print('No. of rows:',df.shape[0])
Max HP: 255
Min HP: 1
No. of rows: 800
Matplotlib
Matplotlib is the most prominent visualisation library in Python. It is used to generate graphs similar to some proprietory software like Matlab,SAS,etc. Some of the most important graphs that maybe useful in understanding the data better are : Line Charts, Bar Charts, Pie Charts, Scatter Plots, Box-plots and few others. Matplotlib is very easy to use. The only thing we need to take care is to pass data in the proper format to the Matplotlib functions, and the graphs are rendered automatically.
import matplotlib.pyplot as plt
l1=[1,2,3,4]
l2=[4,5,6,7]
plt.plot(l1,l2)
plt.title('Trial Graph')
plt.show()
To plot a graph, we just use the .plot() method. The arguments we pass to this method, is the data that we want to plot. The default graph is the line graph. .title() is used for naming the title. After .plot(), we need to use .show() to render the graph on the screen.
plt.scatter(l1,l2)
plt.show()
To plot a different graph, we use the desired graph type, such as .scatter or .bar.
plt.bar(l1,l2,color='r')
plt.xlabel('X-axis')
plt.ylabel('Y-label')
plt.show()
.xlabel() and .ylabel() is used to set the name for x-axis and y-axis. All these methods are similar to those in Matlab.
Plotting some real world data
#Lets say we want to plot this
df['Type 1'].value_counts()
Water 112
Normal 98
Grass 70
Bug 69
Psychic 57
Fire 52
Electric 44
Rock 44
Ground 32
Ghost 32
Dragon 32
Dark 31
Poison 28
Steel 27
Fighting 27
Ice 24
Fairy 17
Flying 4
Name: Type 1, dtype: int64
Now we can simply write .plot() method after the above statement to render the graph.
df['Type 1'].value_counts().plot.bar()
plt.show()
df['Type 1'].value_counts().plot.pie()
plt.show()
Now the graph type should be selected such that it represents the data very well. Like for a continous distribution, a histogram is suitable but for a time-series, a line plot is suitable. Lets see a continous distribution plot..
df['Total'].plot.hist(color='g')
plt.title('Distribution of Total')
plt.show()
The above graph shows the distribution of Total for all the given pokemons.
df['Total'].plot.box()
plt.show()
Above is a boxplot, showing the median, 25th quartile, 75th quartile, Max value and Min value.
The methods and functions covered in this blog are more than sufficient to get you started with Machine Learning and AI.
For further reference, prefer using offical docs.
We hope this post was helpful. Feel free to comment in case of doubts and do let us know your feedback. Stay tuned for more!