In this article, I'll go deeper into the "Data" step of the Machine Learning Workflow. In the earlier article, How do I work with data using a Machine Learning Model?, I described three of the six steps.
All the data for a Machine Learning Model must be numerical. Preparing the data includes filling in missing values and converting all non-numerical values into numbers, e.g.: text into categories or integers, string dates split into days, months, and years as integers, and boolean yes/no as 0 and 1.
In the earlier articles, I was working with perfect data where no preparation was needed. In the real world perfect data doesn't exist; you will always have to work with the data. This step, called Exploratory Data Analysis (EDA), is a very important process to conduct at the start of every data science project.
1. Loading the data
2. Dealing with missing values
1. Identify missing values
2. Filling missing values
1. Numeric values
2. Non-numeric values
3. Convert non-numeric data into numeric
1. Text into numbers
2. Dates into numbers
3. Categories into numbers
4. The source code
We have to load the data for Exploratory Data Analysis (EDA). The data set used for this article is "Apartment Prices in Poland" — https://www.kaggle.com/datasets/krzysztofjamroz/apartment-prices-in-poland/data.
# Importing the tools
import pandas as pd

# Load the data into a pandas DataFrame
data_frame = pd.read_csv("apartments_rent_pl_2024_01.csv")
# Let's look at what data we have
data_frame.head()
As we can see, even in the first 5 rows we have missing values, in the form of empty cells.
First, we have to determine the data type of each column in the loaded data set. We want numeric values; any data type other than object is good.
# Checking column data types to know how to handle missing values
data_frame.dtypes
Below we have a list of column names and their data types, e.g. column id is of type object.
id object
city object
type object
squareMeters float64
rooms float64
floor float64
floorCount float64
buildYear float64
latitude float64
longitude float64
centreDistance float64
poiCount float64
schoolDistance float64
clinicDistance float64
postOfficeDistance float64
kindergartenDistance float64
restaurantDistance float64
collegeDistance float64
pharmacyDistance float64
ownership object
buildingMaterial object
condition object
hasParkingSpace int64
hasBalcony int64
hasElevator int64
hasSecurity int64
hasStorageRoom int64
price int64
dtype: object
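As a side note, a quick way to list the columns that still need conversion is pandas' select_dtypes; a minimal sketch on a toy DataFrame (the data here is made up for illustration):

```python
import pandas as pd

# Toy frame mimicking a mix of object and numeric columns
df = pd.DataFrame({
    "id": ["a1", "b2"],             # object - still needs conversion
    "squareMeters": [45.0, 60.5],   # float64 - already numeric
    "price": [2500, 3100],          # int64 - already numeric
})

# Columns of dtype 'object' are the ones we still have to convert
object_cols = df.select_dtypes(include="object").columns.tolist()
print(object_cols)  # ['id']
```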
Before we even start filling missing values, we need to know in which columns and how many missing values we have.
# Checking data types vs NaN values - before and after filling missing data
info_df = pd.DataFrame({
    'Data Type': data_frame.dtypes,
    'Missing Values': data_frame.isna().sum()
})
print(info_df)
Below we have a list of columns with their Data Type and the number of Missing Values for each column.
Data Type Missing Values
id object 0
city object 0
type object 2203
squareMeters float64 0
rooms float64 0
floor float64 1030
floorCount float64 171
buildYear float64 2492
latitude float64 0
longitude float64 0
centreDistance float64 0
poiCount float64 0
schoolDistance float64 2
clinicDistance float64 5
postOfficeDistance float64 5
kindergartenDistance float64 7
restaurantDistance float64 24
collegeDistance float64 104
pharmacyDistance float64 13
ownership object 0
buildingMaterial object 3459
condition object 6223
hasParkingSpace object 0
hasBalcony object 0
hasElevator object 454
hasSecurity object 0
hasStorageRoom object 0
price int64 0
As we can see, we have many missing values, e.g. column buildYear has 2492 missing values.
When we try to create a Machine Learning Model based on the DataFrame …
X = data_frame.drop("price", axis=1)
y = data_frame["price"]
import numpy as np
from sklearn.model_selection import train_test_split
# Setup random seed before splitting so the split is reproducible
np.random.seed(42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
from sklearn.ensemble import RandomForestClassifier
model = RandomForestClassifier()
model.fit(X_train, y_train)
… we’ll get an exception.
ValueError                                Traceback (most recent call last)
<ipython-input-15-345ee3a9038d> in <cell line: 17>()
     15 model = RandomForestClassifier()
     16 # 'fit()' - Build a forest of trees from the training set (X, y).
---> 17 model.fit(X_train, y_train)
     18 # 'predict()' - Predict class for X.
     19 y_preds = model.predict(X_test)

/usr/local/lib/python3.10/dist-packages/pandas/core/generic.py in __array__(self, dtype)
   1996 def __array__(self, dtype: npt.DTypeLike | None = None) -> np.ndarray:
   1997     values = self._values
-> 1998     arr = np.asarray(values, dtype=dtype)
   1999     if (
   2000     astype_is_view(values.dtype, arr.dtype)

ValueError: could not convert string to float: '1e1ec12d582075085f740f5c7bdf4091'
Before we create a Machine Learning Model we have to fill in missing values, even when they are numeric.
Numeric values
Filling missing numeric columns with mean() values isn't the best idea, but as a starting point it's enough.
For this process, I've used two methods, fillna() and mean(), on specific columns, e.g.: column floor, data_frame["floor"], and used the parameter inplace=True to avoid reassigning the value to the column.
# Dealing with missing values
# Filling NaN values
data_frame["floor"].fillna(data_frame["floor"].mean(), inplace=True)
data_frame["floorCount"].fillna(data_frame["floorCount"].mean(), inplace=True)
data_frame["buildYear"].fillna(data_frame["buildYear"].mean(), inplace=True)

# Without the parameter inplace=True
# data_frame["buildYear"] = data_frame["buildYear"].fillna(data_frame["buildYear"].mean())
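For example, mean() is sensitive to outliers, which columns like floor count certainly have; median() is often the more robust starting point. A small sketch on made-up data:

```python
import pandas as pd

# Toy 'floor' column with one outlier (40) and one missing value
floor = pd.Series([1.0, 2.0, 2.0, 3.0, 40.0, None])

# mean() is pulled up by the outlier; median() is not
mean_filled = floor.fillna(floor.mean())
median_filled = floor.fillna(floor.median())

print(mean_filled.iloc[-1])    # 9.6 (mean of [1, 2, 2, 3, 40])
print(median_filled.iloc[-1])  # 2.0 (median of [1, 2, 2, 3, 40])
```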
Non-numeric values
When we handle non-numeric values, the worst thing we can do is fill in all missing values with the same value. What do I mean by that?
First, I check the unique values of the specific column.
# Checking a non-numeric column's unique values to fill NaN
print(f"Condition: {data_frame['condition'].unique()}")
Condition: ['premium' 'low']
We don't want all our apartments to be only premium or low. Filling missing values with a single value is a really bad idea.
That's why I use the code below to find the unique values of a specific column and then randomly apply one of these values to each missing cell.
unique_conditions = data_frame["condition"].dropna().unique()
data_frame["condition"] = data_frame["condition"].apply(
    lambda x: np.random.choice(unique_conditions) if pd.isna(x) else x)
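A possible refinement (not used in this article's code): instead of choosing uniformly among the unique values, sample in proportion to the observed frequencies, so that rare values like 'premium' stay rare after filling. A sketch on toy data:

```python
import numpy as np
import pandas as pd

np.random.seed(42)  # reproducible fills

# Toy 'condition' column: 'premium' is much rarer than 'low'
condition = pd.Series(["low", "low", "low", "premium", None, None])

# Sample replacements in proportion to the observed frequencies,
# so the fill distorts the distribution less than a uniform choice
freqs = condition.value_counts(normalize=True)  # low: 0.75, premium: 0.25
filled = condition.apply(
    lambda x: np.random.choice(freqs.index, p=freqs.values) if pd.isna(x) else x)

print(filled.isna().sum())  # 0 - no missing values left
```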
The same can be applied to other columns, e.g.: city, with its values.
Cities: ['szczecin' 'gdynia' 'krakow' 'poznan' 'bialystok' 'gdansk' 'wroclaw' 'radom' 'rzeszow' 'lodz' 'katowice' 'lublin' 'czestochowa' 'warszawa' 'bydgoszcz']
Since we've filled in all missing values, we can start converting them into numbers, because ALL the data for the Machine Learning Model must be numerical.
Sometimes it's easy to convert text into numbers. In the data set we use, the id column contains text like 2a1a6db97ff122d6bc148abb6f0e498a; in this case, we can treat it as a number written in hexadecimal form. The same goes for boolean values like yes/no, which we can convert into 0 and 1.
# Convert non-numeric data into numeric

# id column type 'str' into 'int'
data_frame["id"] = data_frame["id"].apply(
    lambda x: int(x, 16) if isinstance(x, str) else x)

# columns with 'str' yes/no into 0/1
data_frame['hasParkingSpace'] = data_frame['hasParkingSpace'].map(
    {'yes': 1, 'no': 0})
Even dates are stored in text form, e.g.: 2024-06-10; we have to split each part, the year, the month, and the day, into separate variables/columns. The column is in a different data set, city_rentals_wro_2007_2023.csv, from the same "Apartment Prices in Poland" — https://www.kaggle.com/datasets/krzysztofjamroz/apartment-prices-in-poland/data.
# Convert non-numeric data into numeric
# converting column 'date_listed' of type 'str' into separate numbers
data_frame['date_listed'] = pd.to_datetime(data_frame['date_listed'])

# create new columns for year, month, and day
data_frame['year'] = data_frame['date_listed'].dt.year
data_frame['month'] = data_frame['date_listed'].dt.month
data_frame['day'] = data_frame['date_listed'].dt.day

# drop the original 'date_listed' column if you want
data_frame = data_frame.drop('date_listed', axis=1)
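The same conversion can be verified end to end on a couple of made-up dates:

```python
import pandas as pd

# Toy 'date_listed' column stored as text
df = pd.DataFrame({"date_listed": ["2024-06-10", "2023-01-31"]})

# Parse the strings, then extract year, month, and day as integers
df["date_listed"] = pd.to_datetime(df["date_listed"])
df["year"] = df["date_listed"].dt.year
df["month"] = df["date_listed"].dt.month
df["day"] = df["date_listed"].dt.day

print(df[["year", "month", "day"]].values.tolist())
# [[2024, 6, 10], [2023, 1, 31]]
```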
The resulting table shows the new year, month, and day columns added after converting the date_listed column.
Text data can be converted into categories and then into numbers; the code below does it very efficiently. I'm not going into many details; I use existing libraries and their classes, the OneHotEncoder and the ColumnTransformer, both available in scikit-learn.
How did I decide which columns can be treated as categories? It's related to the process described earlier in Filling missing values — Non-numeric values, and it's part of the Exploratory Data Analysis (EDA).
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer

# Turn the categories into numbers
categorical_features = ["city", "type", "ownership", "buildingMaterial", "condition"]
one_hot = OneHotEncoder()
transformer = ColumnTransformer(
    [("one_hot", one_hot, categorical_features)],
    remainder="passthrough")
transformed_X = transformer.fit_transform(X)
transformed_df = pd.DataFrame(transformed_X)
Before transforming the DataFrame we had 28 columns; now we have 44 columns without human-readable column names, only numbers as column names.
ALL the data is numerical; we have completed EDA and ended up with a DataFrame ready to be used in a Machine Learning Model.
Below we can find all the source code needed to prepare the data for use with a Machine Learning Model.
Steps covered:
1. Loading the data
2. Dealing with missing values
   1. Identify missing values
   2. Filling missing values
      ◦ Numeric values
      ◦ Non-numeric values
3. Convert non-numeric data into numeric
   1. Text into numbers
   2. Dates into numbers
   3. Categories into numbers
# Importing the tools
import pandas as pd
import numpy as np

data_frame = pd.read_csv(csv_file_name)

# Dealing with missing values
# Filling NaN values
data_frame["floor"].fillna(data_frame["floor"].mean(), inplace=True)
data_frame["floorCount"].fillna(data_frame["floorCount"].mean(), inplace=True)
data_frame["buildYear"].fillna(data_frame["buildYear"].mean(), inplace=True)
data_frame["schoolDistance"].fillna(data_frame["schoolDistance"].mean(), inplace=True)
data_frame["clinicDistance"].fillna(data_frame["clinicDistance"].mean(), inplace=True)
data_frame["postOfficeDistance"].fillna(data_frame["postOfficeDistance"].mean(), inplace=True)
data_frame["kindergartenDistance"].fillna(data_frame["kindergartenDistance"].mean(), inplace=True)
data_frame["restaurantDistance"].fillna(data_frame["restaurantDistance"].mean(), inplace=True)
data_frame["collegeDistance"].fillna(data_frame["collegeDistance"].mean(), inplace=True)
data_frame["pharmacyDistance"].fillna(data_frame["pharmacyDistance"].mean(), inplace=True)

unique_types = data_frame["type"].dropna().unique()
data_frame["type"] = data_frame["type"].apply(
    lambda x: np.random.choice(unique_types) if pd.isna(x) else x)

data_frame["ownership"].fillna("condominium", inplace=True)

unique_bms = data_frame["buildingMaterial"].dropna().unique()
data_frame["buildingMaterial"] = data_frame["buildingMaterial"].apply(
    lambda x: np.random.choice(unique_bms) if pd.isna(x) else x)

unique_conditions = data_frame["condition"].dropna().unique()
data_frame["condition"] = data_frame["condition"].apply(
    lambda x: np.random.choice(unique_conditions) if pd.isna(x) else x)

unique_hes = data_frame["hasElevator"].dropna().unique()
data_frame["hasElevator"] = data_frame["hasElevator"].apply(
    lambda x: np.random.choice(unique_hes) if pd.isna(x) else x)

# Convert non-numeric data into numeric
# id column type 'str' into 'int'
data_frame["id"] = data_frame["id"].apply(lambda x: int(x, 16) if isinstance(x, str) else x)

# columns with 'str' yes/no into 0/1
data_frame['hasParkingSpace'] = data_frame['hasParkingSpace'].map({'yes': 1, 'no': 0})
data_frame['hasBalcony'] = data_frame['hasBalcony'].map({'yes': 1, 'no': 0})
data_frame['hasElevator'] = data_frame['hasElevator'].map({'yes': 1, 'no': 0})
data_frame['hasSecurity'] = data_frame['hasSecurity'].map({'yes': 1, 'no': 0})
data_frame['hasStorageRoom'] = data_frame['hasStorageRoom'].map({'yes': 1, 'no': 0})

# X - training input samples, features
X = data_frame.drop("price", axis=1)

from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer

# Turn the categories into numbers
categorical_features = ["city", "type", "ownership", "buildingMaterial", "condition"]
one_hot = OneHotEncoder()
transformer = ColumnTransformer(
    [("one_hot", one_hot, categorical_features)],
    remainder="passthrough")
transformed_X = transformer.fit_transform(X)
transformed_df = pd.DataFrame(transformed_X)
transformed_df.to_csv("saved_transformed_df.csv")

# y - training input labels, the desired outcome, the target value
y = data_frame["price"]
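Before handing the prepared frame to a model, it is worth asserting that no NaNs survived all the filling; a sketch of the pattern on a toy column (the data is made up, the assert is the point):

```python
import pandas as pd

# Toy frame standing in for the prepared data_frame
data_frame = pd.DataFrame({"floor": [1.0, 2.0, None]})
data_frame["floor"] = data_frame["floor"].fillna(data_frame["floor"].mean())

# The total count of missing cells across the whole frame should be 0
missing_total = int(data_frame.isna().sum().sum())
assert missing_total == 0, f"still {missing_total} missing values"
print("no missing values left")
```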
Below we can find all the source code needed to create a Machine Learning Model based on the prepared data.
# Import 'train_test_split()' function
# "Split arrays or matrices into random train and test subsets."
from sklearn.model_selection import train_test_split

# Setup random seed - so you and I get the same results
np.random.seed(42)

# Split the data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(transformed_X, y, test_size=0.2)

# Import the LinearRegression estimator class
from sklearn.linear_model import LinearRegression

# Instantiate LinearRegression to create a Machine Learning Model
model = LinearRegression()

# 'fit()' - Fit the linear model on the training set (X, y)
model.fit(X_train, y_train)

# 'predict()' - Predict target values for X
y_preds = model.predict(X_test)
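To see the whole split/fit/predict loop work end to end without the Kaggle file, here is a self-contained sketch on synthetic data (the linear relationship and all the numbers are invented):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

np.random.seed(42)

# Synthetic data: price depends linearly on area, plus noise
area = np.random.uniform(20, 100, size=200).reshape(-1, 1)
price = 3000 * area.ravel() + np.random.normal(0, 5000, size=200)

X_train, X_test, y_train, y_test = train_test_split(
    area, price, test_size=0.2, random_state=42)

model = LinearRegression()
model.fit(X_train, y_train)
y_preds = model.predict(X_test)

# 'score()' returns R^2 on the test set; close to 1.0 means a good fit
print(round(model.score(X_test, y_test), 2))
```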
NOTE: In this article, I'm just barely scratching the surface. This topic needs further learning and research on your own. I'm still at the beginning of my AI & ML learning journey!
Image generated with Midjourney, edited in GIMP. Screenshots made by creator.