Clustering in Machine Learning
Introduction
Clustering is an unsupervised learning method. It groups similar data points without labels. The sections below explain how it works and how each algorithm forms clusters.
Core Concepts Explained
Clustering searches for structure in data. The algorithm measures similarity between points and forms groups, so each group contains points that lie close to one another.
How Clustering Works
- You provide unlabeled data
- The algorithm measures similarity
- It forms clusters based on distance or density
- Points in the same cluster stay close
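The steps above can be sketched with a toy example. This is a minimal illustration, assuming two fixed cluster centers and Euclidean distance; the points and centers are made-up values.

```python
import math

# Hypothetical 2-D points and two fixed cluster centers (illustrative values)
points = [(1.0, 1.0), (1.2, 0.8), (8.0, 8.0), (7.5, 8.2)]
centers = [(1.0, 1.0), (8.0, 8.0)]

def euclidean(a, b):
    # Straight-line distance between two 2-D points
    return math.sqrt((a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2)

# Assign each point to its closest center
labels = [min(range(len(centers)), key=lambda i: euclidean(p, centers[i]))
          for p in points]
print(labels)  # [0, 0, 1, 1]: points near (1, 1) share one cluster, points near (8, 8) the other
```

Real algorithms repeat this assignment step while also updating the centers, but the core idea is the same: similarity decides group membership.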
Popular Clustering Algorithms
1. K Means
K Means splits data into K clusters. You choose K. The algorithm places centers and assigns points to the closest center, then recomputes each center from its assigned points, repeating until the centers barely move.
Best For
- Large datasets
- Simple cluster shapes
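A minimal sketch of the loop described above, using scikit-learn. The data here is synthetic, generated with make_blobs purely for illustration; the values of n_samples, centers, and random_state are arbitrary choices.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic data: 300 2-D points scattered around 3 centers (illustrative setup)
X, _ = make_blobs(n_samples=300, centers=3, random_state=42)

# You choose K; the algorithm iterates center placement and point assignment
model = KMeans(n_clusters=3, n_init=10, random_state=42)
model.fit(X)

print(model.cluster_centers_.shape)  # (3, 2): one 2-D center per cluster
print(len(set(model.labels_)))       # 3 distinct cluster labels
```

Setting random_state fixes the center initialization, which makes runs reproducible.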
2. Hierarchical Clustering
This algorithm builds a hierarchy of clusters. It merges or splits clusters step by step. You cut the tree at the level you want.
Best For
- Small or medium datasets
- Flexible clusters
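"Cutting the tree" at a chosen level can be sketched with scikit-learn's AgglomerativeClustering, where n_clusters picks the cut. The small synthetic dataset below is illustrative only.

```python
from sklearn.cluster import AgglomerativeClustering
from sklearn.datasets import make_blobs

# Small synthetic dataset (hierarchical clustering suits small or medium data)
X, _ = make_blobs(n_samples=60, centers=3, random_state=0)

# Merging clusters bottom-up; cutting the tree at 3 clusters = n_clusters=3
agg = AgglomerativeClustering(n_clusters=3)
labels = agg.fit_predict(X)
print(len(set(labels)))  # 3
```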
3. DBSCAN
DBSCAN groups points based on density. It detects dense regions and marks low-density points as noise.
Best For
- Data with noise
- Irregular cluster shapes
Distance Measures
- Euclidean distance
- Manhattan distance
- Cosine similarity
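The three measures above can be computed by hand for a pair of small vectors; the vectors here are arbitrary example values.

```python
import math

a = [1.0, 2.0, 3.0]
b = [4.0, 6.0, 3.0]

# Euclidean distance: straight-line length between the vectors
euclidean = math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Manhattan distance: sum of absolute coordinate differences
manhattan = sum(abs(x - y) for x, y in zip(a, b))

# Cosine similarity: based on the angle between vectors, ignores their length
dot = sum(x * y for x, y in zip(a, b))
norm_a = math.sqrt(sum(x * x for x in a))
norm_b = math.sqrt(sum(y * y for y in b))
cosine = dot / (norm_a * norm_b)

print(euclidean)          # 5.0
print(manhattan)          # 7.0
print(round(cosine, 3))   # 0.855
```

Euclidean distance is the usual default; cosine similarity is popular for text, where vector length matters less than direction.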
Challenges in Clustering
- Selecting the number of clusters
- Handling noisy data
- Scaling features
Improving Clustering Results
- Normalize features
- Apply dimensionality reduction
- Test different K or density parameters
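The three tips above can be combined in one short pipeline. This is a sketch on synthetic data: the make_blobs settings and the range of K values tried are arbitrary illustrative choices.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic 5-feature dataset (illustrative)
X, _ = make_blobs(n_samples=200, centers=4, n_features=5, random_state=1)

# Normalize features so no single feature dominates the distance
X_scaled = StandardScaler().fit_transform(X)

# Reduce to 2 dimensions before clustering
X_2d = PCA(n_components=2).fit_transform(X_scaled)

# Try different K values and compare inertia (within-cluster spread)
inertias = []
for k in (2, 3, 4, 5):
    km = KMeans(n_clusters=k, n_init=10, random_state=1).fit(X_2d)
    inertias.append(km.inertia_)
    print(k, round(km.inertia_, 1))  # inertia shrinks as K grows; look for the "elbow"
```

Inertia always decreases as K increases, so the point where the decrease levels off (the elbow) is a common heuristic for choosing K.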
Where Clustering Is Used
- Customer segmentation
- Anomaly detection
- Document grouping
- Image grouping
Syntax or Model Structure Example
Below is a Python example for K Means.
from sklearn.cluster import KMeans
import pandas as pd

# Load the dataset and select the two feature columns
data = pd.read_csv("data.csv")
X = data[["f1", "f2"]]

# Fit K Means with 3 clusters; random_state makes the run reproducible
model = KMeans(n_clusters=3, n_init=10, random_state=42)
model.fit(X)

print(model.labels_)           # cluster index assigned to each row
print(model.cluster_centers_)  # coordinates of the 3 centers
This is a simple example showing how to use K Means in sklearn.
Quick Recap
Clustering groups data into clusters without labels. The algorithm computes similarity and places each point in the group it is closest to.
K Means
K Means fixes K clusters. It computes centers and reassigns points repeatedly.
Hierarchical
It builds a tree of clusters. You can cut the tree at any level.
DBSCAN
It finds regions of high density and identifies noise easily.
Key Points
- Scaling is essential
- Choosing K requires experimentation
- Dimensionality reduction helps a lot
Multiple Practical Examples
1. K Means with 3 Clusters
# Partition the data into 3 clusters and inspect the first assignments
model = KMeans(n_clusters=3, n_init=10, random_state=42)
model.fit(X)
print(model.labels_[:10])
2. DBSCAN Example
from sklearn.cluster import DBSCAN

# eps sets the neighborhood radius; min_samples sets the density threshold
db = DBSCAN(eps=0.5, min_samples=5)
labels = db.fit_predict(X)
print(labels[:10])  # -1 marks noise points
Explanation of Each Example
The first example partitions the data into a fixed number of groups that you choose up front. The second detects dense regions on its own, relying on density rather than a preset cluster count, and labels low-density points as noise.
Exercises
- Explain clustering in one sentence.
- Train a K Means model with three clusters.
- Plot cluster centers on a scatter plot.
- Train a DBSCAN model and detect noise.
- Use MinMaxScaler before clustering.
- Try different K values and compare results.
- Use PCA before clustering and check improvements.
- List two strengths of clustering.
- List two challenges in clustering.
- Create clusters from synthetic data using sklearn.
Conclusion
Clustering groups data by similarity and reveals hidden structure that supports analytics and AI workflows. With feature scaling and careful parameter selection, it produces clear, reliable results.