Data Exploration and Data Cleaning
Data exploration and data cleaning form the base of every AI workflow. Clean data leads to stronger models. Exploration helps you understand structure, errors, and patterns.
What is data exploration
Data exploration checks a dataset's shape, column types, missing values, duplicates, and summary statistics. It gives you a first view of the dataset.
Key steps
- Check dataset size.
- Check column types.
- Check missing values.
- Check unique values for each column.
- Check simple statistics like mean and median.
- Look for outliers.
Basic Python example
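import pandas as pd
df = pd.read_csv("data.csv")
print(df.head())              # first five rows
df.info()                     # column types and non-null counts; prints directly and returns None
print(df.describe())          # summary statistics for numeric columns
print(df.isnull().sum())      # missing values per column
print(df.duplicated().sum())  # number of fully duplicated rows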
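The key steps also mention unique values and outliers, which the basic example does not touch. A minimal sketch of both checks follows. The sample data and column names are hypothetical, and the 1.5 * IQR rule is one common convention for flagging outliers, not the only one.

```python
import pandas as pd

# Hypothetical sample data standing in for data.csv
df = pd.DataFrame({
    "city": ["Rabat", "Casablanca", "Rabat", "Fes", "Rabat"],
    "age": [25, 31, 28, 27, 95],
})

# Number of distinct values in each column
unique_counts = df.nunique()

# How often each value appears in one column
city_counts = df["city"].value_counts()

# IQR rule: flag values more than 1.5 * IQR outside the quartiles
q1 = df["age"].quantile(0.25)
q3 = df["age"].quantile(0.75)
iqr = q3 - q1
outliers = df[(df["age"] < q1 - 1.5 * iqr) | (df["age"] > q3 + 1.5 * iqr)]

print(unique_counts)
print(city_counts)
print(outliers)
```

Flagged rows are candidates for review, not automatic deletions. Whether a value is an error or a legitimate extreme depends on the dataset.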
What is data cleaning
Data cleaning fixes errors and prepares the dataset for modeling. It includes removing duplicates, filling missing values, converting types, and handling outliers.
Common cleaning actions
- Remove duplicates.
- Drop irrelevant columns.
- Fill missing values.
- Convert string numbers to numeric types.
- Standardize text values.
- Handle outliers.
Cleaning example in Python
# Remove duplicates
df = df.drop_duplicates()
# Fill missing values
df["age"] = df["age"].fillna(df["age"].median())
# Convert to numeric
df["salary"] = pd.to_numeric(df["salary"], errors="coerce")
# Drop rows with invalid values
df = df.dropna()
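Three of the common cleaning actions are not covered by the example above: dropping irrelevant columns, standardizing text, and handling outliers. The sketch below shows one way to do each. The data, the column names, and the choice of clipping to the 5th and 95th percentiles are all assumptions made for illustration.

```python
import pandas as pd

# Hypothetical raw data; column names and values are illustrative only
df = pd.DataFrame({
    "row_id": [1, 2, 3, 4],
    "city": [" Rabat", "rabat ", "FES", "Fes"],
    "salary": [4000.0, 5200.0, 4800.0, 90000.0],
})

# Drop a column assumed to be irrelevant for modeling
df = df.drop(columns=["row_id"])

# Standardize text: trim whitespace and use one consistent casing
df["city"] = df["city"].str.strip().str.lower()

# Handle outliers by clipping to the 5th and 95th percentiles
low = df["salary"].quantile(0.05)
high = df["salary"].quantile(0.95)
df["salary"] = df["salary"].clip(lower=low, upper=high)

print(df)
```

Clipping keeps every row and only limits extreme values. Removing the rows instead is also reasonable when the extreme values are known to be errors.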
Tips for strong preprocessing
- Keep transformations simple.
- Keep logs of steps.
- Use clear column names.
- Test the dataset after each change.
- Store a clean version for modeling.
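The tips on keeping logs and testing after each change can be combined into one small helper. This is a minimal sketch under stated assumptions: the `check` function, the logger name, the sample data, and the output file name are hypothetical, and the two assertions are examples of checks rather than a complete test suite.

```python
import logging
import pandas as pd

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("preprocessing")


def check(df, step):
    """Log the step name and dataset size, then run basic sanity checks."""
    log.info("%s: %d rows, %d columns", step, len(df), df.shape[1])
    assert not df.duplicated().any(), f"duplicates remain after {step}"
    assert len(df) > 0, f"no rows left after {step}"
    return df


# Hypothetical data with one duplicate row and one missing age
raw = pd.DataFrame({
    "age": [25.0, 25.0, None, 40.0],
    "salary": ["4000", "4000", "5200", "6100"],
})

# Each transformation is one simple step, logged and tested on its own
df = check(raw.drop_duplicates(), "drop_duplicates")
df = check(df.assign(age=df["age"].fillna(df["age"].median())), "fill_age")
df = check(
    df.assign(salary=pd.to_numeric(df["salary"], errors="coerce")),
    "salary_to_numeric",
)

# Store a clean version for modeling (file name is an assumption):
# df.to_csv("clean_data.csv", index=False)
```

Because `raw` is never modified, any step can be rerun from the original data if a check fails.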
Conclusion
Data exploration reveals a dataset's structure and its problems. Data cleaning fixes those problems. Together, these steps improve AI models and reduce errors. Every AI student and practitioner needs strong preprocessing skills.
Data Exploration and Data Cleaning (translated from Darija)
Data exploration and data cleaning are the base of every AI workflow. Clean data produces stronger models. Exploration is the first phase for understanding the dataset.
What is data exploration
Data exploration checks shape, types, missing values, duplicates, and simple statistics. It gives you a general view of the dataset.
Important steps
- Check the size of the dataset.
- Check the column types.
- Check missing values.
- Check unique values.
- Check simple statistics like mean and median.
- Check for outliers.
Python example
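import pandas as pd
df = pd.read_csv("data.csv")
print(df.head())              # first five rows
df.info()                     # column types and non-null counts; prints directly and returns None
print(df.describe())          # summary statistics for numeric columns
print(df.isnull().sum())      # missing values per column
print(df.duplicated().sum())  # number of fully duplicated rows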
What is data cleaning
Data cleaning fixes errors and prepares the dataset. It includes deleting duplicates, filling missing values, converting types, and handling outliers.
Cleaning steps
- Remove duplicates.
- Remove columns that serve no purpose.
- Fill missing values.
- Convert types.
- Standardize text.
- Handle outliers.
Cleaning example in Python
# Remove duplicates
df = df.drop_duplicates()
# Fill missing values
df["age"] = df["age"].fillna(df["age"].median())
# Convert to numeric
df["salary"] = pd.to_numeric(df["salary"], errors="coerce")
# Drop rows with invalid values
df = df.dropna()
Tips
- Work in simple steps.
- Log every step.
- Name columns with clear words.
- Test the dataset after each change.
- Keep a clean version for building models.
Conclusion
Data exploration helps you understand the dataset. Data cleaning repairs the dataset. These steps raise the quality of AI models. Students and practitioners need them in any project.