In this article, I'll go deeper into the "Data" step of the Machine Learning Workflow. In the earlier article, How do I work with data using a Machine Learning Model?, I described three of the six steps.
All the data for a Machine Learning Model must be numerical. Preparing the data includes filling in missing values and converting all non-numerical values into numbers, e.g.: text into categories or integers, string dates split into days, months, and years as integers, and boolean yes/no as 0 and 1.
In the earlier articles, I was working with perfect data where no preparation was needed. In the real world perfect data doesn't exist; you will always have to work with the data. This step, called Exploratory Data Analysis (EDA), is a very important process to conduct at the start of every data science project.
1. Loading the data
2. Dealing with missing values
1. Identify missing values
2. Filling missing values
1. Numeric values
2. Non-numeric values
3. Convert non-numeric data into numeric
1. Text into numbers
2. Dates into numbers
3. Categories into numbers
4. The source code
We have to load the data for Exploratory Data Analysis (EDA). The data set used for this article is "Apartment Prices in Poland" — https://www.kaggle.com/datasets/krzysztofjamroz/apartment-prices-in-poland/data.
# Importing the tools
import pandas as pd

# Load the data into a pandas DataFrame
data_frame = pd.read_csv("apartments_rent_pl_2024_01.csv")
# Let's look at what data we have
data_frame.head()
As we can see, even in the first 5 rows we have missing values, in the form of empty cells.
First, we have to determine the data type of each column in the loaded data set. We want numeric values; any data type other than object is good.
# Checking column data types to know how to handle missing values
data_frame.dtypes
Below we have a list of column names and their data types, e.g. column id is of type object.
id object
city object
type object
squareMeters float64
rooms float64
floor float64
floorCount float64
buildYear float64
latitude float64
longitude float64
centreDistance float64
poiCount float64
schoolDistance float64
clinicDistance float64
postOfficeDistance float64
kindergartenDistance float64
restaurantDistance float64
collegeDistance float64
pharmacyDistance float64
ownership object
buildingMaterial object
condition object
hasParkingSpace int64
hasBalcony int64
hasElevator int64
hasSecurity int64
hasStorageRoom int64
price int64
dtype: object
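As a side note, a quick way to list the columns that still need conversion is pandas' select_dtypes; a minimal sketch on a toy DataFrame (the data here is made up for illustration):

```python
import pandas as pd

# Toy frame mimicking a mix of object and numeric columns
df = pd.DataFrame({
    "id": ["a1", "b2"],             # object - still needs conversion
    "squareMeters": [45.0, 60.5],   # float64 - already numeric
    "price": [2500, 3100],          # int64 - already numeric
})

# Columns of dtype 'object' are the ones we still have to convert
object_cols = df.select_dtypes(include="object").columns.tolist()
print(object_cols)  # ['id']
```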
Before we even start filling missing values, we need to know in which columns and how many missing values we have.
# Checking data types vs NaN values - before and after filling missing data
info_df = pd.DataFrame({
    'Data Type': data_frame.dtypes,
    'Missing Values': data_frame.isna().sum()
})
print(info_df)
Below we have a list of columns with their Data Type and the number of Missing Values for each column.
Data Type Missing Values
id object 0
city object 0
type object 2203
squareMeters float64 0
rooms float64 0
floor float64 1030
floorCount float64 171
buildYear float64 2492
latitude float64 0
longitude float64 0
centreDistance float64 0
poiCount float64 0
schoolDistance float64 2
clinicDistance float64 5
postOfficeDistance float64 5
kindergartenDistance float64 7
restaurantDistance float64 24
collegeDistance float64 104
pharmacyDistance float64 13
ownership object 0
buildingMaterial object 3459
condition object 6223
hasParkingSpace object 0
hasBalcony object 0
hasElevator object 454
hasSecurity object 0
hasStorageRoom object 0
price int64 0
As we can see, we have many missing values, e.g. column buildYear has 2492 missing values.
When we try to create a Machine Learning Model based on the DataFrame …
X = data_frame.drop("price", axis=1)
y = data_frame["price"]
import numpy as np
from sklearn.model_selection import train_test_split
# Setup random seed before splitting so the split is reproducible
np.random.seed(42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
from sklearn.ensemble import RandomForestClassifier
model = RandomForestClassifier()
model.fit(X_train, y_train)
… we’ll get an exception.
ValueError                                Traceback (most recent call last)
<ipython-input-15-345ee3a9038d> in <cell line: 17>()
     15 model = RandomForestClassifier()
     16 # 'fit()' - Build a forest of trees from the training set (X, y).
---> 17 model.fit(X_train, y_train)
     18 # 'predict()' - Predict class for X.
     19 y_preds = model.predict(X_test)

/usr/local/lib/python3.10/dist-packages/pandas/core/generic.py in __array__(self, dtype)
   1996 def __array__(self, dtype: npt.DTypeLike | None = None) -> np.ndarray:
   1997     values = self._values
-> 1998     arr = np.asarray(values, dtype=dtype)
   1999     if (
   2000     astype_is_view(values.dtype, arr.dtype)

ValueError: could not convert string to float: '1e1ec12d582075085f740f5c7bdf4091'
Before we create a Machine Learning Model we have to fill in missing values, even when they are numeric.
Numeric values
Filling missing numeric columns with mean() values isn't the best idea, but as a starting point it's enough.
For this process, I've used two methods, fillna() and mean(), on specific columns, e.g.: column floor, data_frame["floor"], and used the parameter inplace=True to avoid reassigning the value to the column.
# Dealing with missing values
# Filling NaN values
data_frame["floor"].fillna(data_frame["floor"].mean(), inplace=True)
data_frame["floorCount"].fillna(data_frame["floorCount"].mean(), inplace=True)
data_frame["buildYear"].fillna(data_frame["buildYear"].mean(), inplace=True)

# Without the parameter inplace=True
# data_frame["buildYear"] = data_frame["buildYear"].fillna(data_frame["buildYear"].mean())
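For example, mean() is sensitive to outliers, which columns like floor count certainly have; median() is often the more robust starting point. A small sketch on made-up data:

```python
import pandas as pd

# Toy 'floor' column with one outlier (40) and one missing value
floor = pd.Series([1.0, 2.0, 2.0, 3.0, 40.0, None])

# mean() is pulled up by the outlier; median() is not
mean_filled = floor.fillna(floor.mean())
median_filled = floor.fillna(floor.median())

print(mean_filled.iloc[-1])    # 9.6 (mean of [1, 2, 2, 3, 40])
print(median_filled.iloc[-1])  # 2.0 (median of [1, 2, 2, 3, 40])
```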
Non-numeric values
When we handle non-numeric values, the worst thing we can do is fill in all missing values with the same value. What do I mean by that?
First, I check the unique values of the specific column.
# Checking a non-numeric column's unique values to fill NaN
print(f"Condition: {data_frame['condition'].unique()}")
Condition: ['premium' 'low']
We don't want all our apartments to be only premium or low. Filling missing values with a single value is a really bad idea.
That's why I use the code below to find the unique values of a specific column and then randomly apply one of these values to each missing cell.
unique_conditions = data_frame["condition"].dropna().unique()
data_frame["condition"] = data_frame["condition"].apply(
    lambda x: np.random.choice(unique_conditions) if pd.isna(x) else x)
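A possible refinement (not used in this article's code): instead of choosing uniformly among the unique values, sample in proportion to the observed frequencies, so that rare values like 'premium' stay rare after filling. A sketch on toy data:

```python
import numpy as np
import pandas as pd

np.random.seed(42)  # reproducible fills

# Toy 'condition' column: 'premium' is much rarer than 'low'
condition = pd.Series(["low", "low", "low", "premium", None, None])

# Sample replacements in proportion to the observed frequencies,
# so the fill distorts the distribution less than a uniform choice
freqs = condition.value_counts(normalize=True)  # low: 0.75, premium: 0.25
filled = condition.apply(
    lambda x: np.random.choice(freqs.index, p=freqs.values) if pd.isna(x) else x)

print(filled.isna().sum())  # 0 - no missing values left
```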
The same can be applied to other columns, e.g.: city, with its values.
Cities: ['szczecin' 'gdynia' 'krakow' 'poznan' 'bialystok' 'gdansk' 'wroclaw' 'radom' 'rzeszow' 'lodz' 'katowice' 'lublin' 'czestochowa' 'warszawa' 'bydgoszcz']
Since we've filled in all missing values, we can start converting them into numbers, because ALL the data for the Machine Learning Model must be numerical.
Sometimes it's easy to convert text into numbers. In the data set we use, the id column contains text like 2a1a6db97ff122d6bc148abb6f0e498a; in this case, we can treat it as a number written in hexadecimal form. The same goes for boolean values like yes/no, which we can convert into 0 and 1.
# Convert non-numeric data into numeric

# id column type 'str' into 'int'
data_frame["id"] = data_frame["id"].apply(
    lambda x: int(x, 16) if isinstance(x, str) else x)

# columns with 'str' yes/no into 0/1
data_frame['hasParkingSpace'] = data_frame['hasParkingSpace'].map(
    {'yes': 1, 'no': 0})
Even dates are stored in text form, e.g.: 2024-06-10; we have to split each part, the year, the month, and the day, into separate variables/columns. The column is in a different data set, city_rentals_wro_2007_2023.csv, from the same "Apartment Prices in Poland" — https://www.kaggle.com/datasets/krzysztofjamroz/apartment-prices-in-poland/data.
# Convert non-numeric data into numeric
# converting column 'date_listed' of type 'str' into separate numbers
data_frame['date_listed'] = pd.to_datetime(data_frame['date_listed'])

# create new columns for year, month, and day
data_frame['year'] = data_frame['date_listed'].dt.year
data_frame['month'] = data_frame['date_listed'].dt.month
data_frame['day'] = data_frame['date_listed'].dt.day

# drop the original 'date_listed' column if you want
data_frame = data_frame.drop('date_listed', axis=1)
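The same conversion can be verified end to end on a couple of made-up dates:

```python
import pandas as pd

# Toy 'date_listed' column stored as text
df = pd.DataFrame({"date_listed": ["2024-06-10", "2023-01-31"]})

# Parse the strings, then extract year, month, and day as integers
df["date_listed"] = pd.to_datetime(df["date_listed"])
df["year"] = df["date_listed"].dt.year
df["month"] = df["date_listed"].dt.month
df["day"] = df["date_listed"].dt.day

print(df[["year", "month", "day"]].values.tolist())
# [[2024, 6, 10], [2023, 1, 31]]
```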
The resulting table shows the new year, month, and day columns added after converting the date_listed column.
Text data can be converted into categories and then into numbers; the code below does it very efficiently. I'm not going into many details; I use existing libraries and their classes, the OneHotEncoder and the ColumnTransformer, both available in scikit-learn.
How did I decide which columns can be treated as categories? It's related to the process described earlier in Filling missing values — Non-numeric values, and it's part of the Exploratory Data Analysis (EDA).
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer

# Turn the categories into numbers
categorical_features = ["city", "type", "ownership", "buildingMaterial", "condition"]
one_hot = OneHotEncoder()
transformer = ColumnTransformer(
    [("one_hot", one_hot, categorical_features)],
    remainder="passthrough")
transformed_X = transformer.fit_transform(X)
transformed_df = pd.DataFrame(transformed_X)
Before transforming the DataFrame we had 28 columns; now we have 44 columns without human-readable column names, only numbers as column names.
ALL the data is numerical; we have completed EDA and ended up with a DataFrame ready to be used in a Machine Learning Model.
Below we can find all the source code needed to prepare the data for use with a Machine Learning Model.
Steps covered:
1. Loading the data
2. Dealing with missing values
   1. Identify missing values
   2. Filling missing values
      ◦ Numeric values
      ◦ Non-numeric values
3. Convert non-numeric data into numeric
   1. Text into numbers
   2. Dates into numbers
   3. Categories into numbers
# Importing the tools
import pandas as pd
import numpy as np

data_frame = pd.read_csv(csv_file_name)

# Dealing with missing values
# Filling NaN values
data_frame["floor"].fillna(data_frame["floor"].mean(), inplace=True)
data_frame["floorCount"].fillna(data_frame["floorCount"].mean(), inplace=True)
data_frame["buildYear"].fillna(data_frame["buildYear"].mean(), inplace=True)
data_frame["schoolDistance"].fillna(data_frame["schoolDistance"].mean(), inplace=True)
data_frame["clinicDistance"].fillna(data_frame["clinicDistance"].mean(), inplace=True)
data_frame["postOfficeDistance"].fillna(data_frame["postOfficeDistance"].mean(), inplace=True)
data_frame["kindergartenDistance"].fillna(data_frame["kindergartenDistance"].mean(), inplace=True)
data_frame["restaurantDistance"].fillna(data_frame["restaurantDistance"].mean(), inplace=True)
data_frame["collegeDistance"].fillna(data_frame["collegeDistance"].mean(), inplace=True)
data_frame["pharmacyDistance"].fillna(data_frame["pharmacyDistance"].mean(), inplace=True)

unique_types = data_frame["type"].dropna().unique()
data_frame["type"] = data_frame["type"].apply(
    lambda x: np.random.choice(unique_types) if pd.isna(x) else x)

data_frame["ownership"].fillna("condominium", inplace=True)

unique_bms = data_frame["buildingMaterial"].dropna().unique()
data_frame["buildingMaterial"] = data_frame["buildingMaterial"].apply(
    lambda x: np.random.choice(unique_bms) if pd.isna(x) else x)

unique_conditions = data_frame["condition"].dropna().unique()
data_frame["condition"] = data_frame["condition"].apply(
    lambda x: np.random.choice(unique_conditions) if pd.isna(x) else x)

unique_hes = data_frame["hasElevator"].dropna().unique()
data_frame["hasElevator"] = data_frame["hasElevator"].apply(
    lambda x: np.random.choice(unique_hes) if pd.isna(x) else x)

# Convert non-numeric data into numeric
# id column type 'str' into 'int'
data_frame["id"] = data_frame["id"].apply(lambda x: int(x, 16) if isinstance(x, str) else x)

# columns with 'str' yes/no into 0/1
data_frame['hasParkingSpace'] = data_frame['hasParkingSpace'].map({'yes': 1, 'no': 0})
data_frame['hasBalcony'] = data_frame['hasBalcony'].map({'yes': 1, 'no': 0})
data_frame['hasElevator'] = data_frame['hasElevator'].map({'yes': 1, 'no': 0})
data_frame['hasSecurity'] = data_frame['hasSecurity'].map({'yes': 1, 'no': 0})
data_frame['hasStorageRoom'] = data_frame['hasStorageRoom'].map({'yes': 1, 'no': 0})

# X - training input samples, features
X = data_frame.drop("price", axis=1)

from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer

# Turn the categories into numbers
categorical_features = ["city", "type", "ownership", "buildingMaterial", "condition"]
one_hot = OneHotEncoder()
transformer = ColumnTransformer(
    [("one_hot", one_hot, categorical_features)],
    remainder="passthrough")
transformed_X = transformer.fit_transform(X)
transformed_df = pd.DataFrame(transformed_X)
transformed_df.to_csv("saved_transformed_df.csv")

# y - training input labels, the desired outcome, the target value
y = data_frame["price"]
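Before handing the prepared frame to a model, it is worth asserting that no NaNs survived all the filling; a sketch of the pattern on a toy column (the data is made up, the assert is the point):

```python
import pandas as pd

# Toy frame standing in for the prepared data_frame
data_frame = pd.DataFrame({"floor": [1.0, 2.0, None]})
data_frame["floor"] = data_frame["floor"].fillna(data_frame["floor"].mean())

# The total count of missing cells across the whole frame should be 0
missing_total = int(data_frame.isna().sum().sum())
assert missing_total == 0, f"still {missing_total} missing values"
print("no missing values left")
```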
Below we can find all the source code needed to create a Machine Learning Model based on the prepared data.
# Import 'train_test_split()' function
# "Split arrays or matrices into random train and test subsets."
from sklearn.model_selection import train_test_split

# Setup random seed - so you and I get the same results
np.random.seed(42)

# Split the data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(transformed_X, y, test_size=0.2)

# Import the LinearRegression estimator class
from sklearn.linear_model import LinearRegression

# Instantiate LinearRegression to create a Machine Learning Model
model = LinearRegression()

# 'fit()' - Fit the linear model on the training set (X, y)
model.fit(X_train, y_train)

# 'predict()' - Predict target values for X
y_preds = model.predict(X_test)
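To see the whole split/fit/predict loop work end to end without the Kaggle file, here is a self-contained sketch on synthetic data (the linear relationship and all the numbers are invented):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

np.random.seed(42)

# Synthetic data: price depends linearly on area, plus noise
area = np.random.uniform(20, 100, size=200).reshape(-1, 1)
price = 3000 * area.ravel() + np.random.normal(0, 5000, size=200)

X_train, X_test, y_train, y_test = train_test_split(
    area, price, test_size=0.2, random_state=42)

model = LinearRegression()
model.fit(X_train, y_train)
y_preds = model.predict(X_test)

# 'score()' returns R^2 on the test set; close to 1.0 means a good fit
print(round(model.score(X_test, y_test), 2))
```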
NOTE: In this article, I'm just barely scratching the surface. This topic needs further learning and research on your own. I'm still at the beginning of my AI & ML learning journey!
Image generated with Midjourney, edited in GIMP. Screenshots made by creator.