🏗️ Mining Process Analysis at MetalCorp

Understanding correlation and optimization in iron mining separation processes

🔍 What's This Iron Mining Process About?

We are analyzing a flotation process, a crucial step in mineral processing where the goal is to selectively extract valuable iron minerals from unwanted silica (waste), ultimately producing a high-grade iron concentrate.

Our dataset provides insights into key aspects of this process, including:

⚙️ Why Are We Studying Correlation?

Understanding correlations allows us to identify how changes in our control parameters or feed characteristics influence the quality and efficiency of our output. Essentially, we're looking at how one variable affects another.
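For reference, the coefficients quoted later in this post are assumed to be Pearson correlation coefficients, which is what pandas' `corr()` computes by default. For two variables x and y sampled n times:

```latex
r_{xy} = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}
              {\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2}\,\sqrt{\sum_{i=1}^{n} (y_i - \bar{y})^2}}
```

The value ranges from -1 to +1, and values near 0 indicate little linear association. It measures how two variables move together, which is not by itself proof that one causes the other.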

For example:

Airflow: Volume and Dispersion of Air in Flotation Cells

In flotation cells, air bubbles are critically important. Iron minerals are made water-repellent (hydrophobic) via chemical agents and attach to these air bubbles. Conversely, in reverse flotation, unwanted minerals (gangue) are made hydrophobic to float away, leaving iron behind.

Diffused air bubbles provide a large surface area for mineral particles to attach. If airflow is too low, we might experience poor recovery rates due to insufficient bubble surface. Mechanical impellers mix the mineral pulp with reagents and air, ensuring other chemicals like collectors, frothers, depressants, and pH modifiers selectively promote attachment to the desired iron minerals.

Air bubble size is crucial: it cannot be too large (poor selectivity) or too small (excessive froth stability). Froth stability is also a key focus point, as an unstable froth can lead to loss of valuable minerals, while an overly stable froth can entrain too much waste.

Liquid Level / Pulp Level

The liquid height (pulp level) in the flotation tank significantly affects the residence time of the material and the effectiveness of the separation process. A carefully controlled pulp level ensures optimal particle-bubble contact and allows for effective froth cleaning, which helps in cleanly separating iron from waste.
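As a rough illustration of the link between pulp level and residence time, the sketch below uses the standard relation of pulp volume divided by volumetric feed rate. The column diameter, flow rate, and gas holdup are hypothetical numbers chosen for the example, not values from our plant, and the function name is made up for this post.

```python
import math


def mean_residence_time_min(diameter_m, pulp_level_m, feed_flow_m3_per_h, gas_holdup=0.0):
    """Mean pulp residence time (minutes) in a cylindrical column.

    tau = A * h * (1 - gas_holdup) / Q, where A is the cross-sectional area,
    h the pulp level, and Q the volumetric feed rate.
    """
    if feed_flow_m3_per_h <= 0:
        raise ValueError("feed flow must be positive")
    area_m2 = math.pi * (diameter_m / 2) ** 2
    # Air bubbles occupy part of the volume, so only (1 - gas_holdup) is pulp
    pulp_volume_m3 = area_m2 * pulp_level_m * (1.0 - gas_holdup)
    return pulp_volume_m3 / feed_flow_m3_per_h * 60.0


# Hypothetical column: 2 m diameter, 100 m3/h of pulp, 15% of the volume occupied by air
low = mean_residence_time_min(2.0, 4.0, 100.0, gas_holdup=0.15)
high = mean_residence_time_min(2.0, 5.0, 100.0, gas_holdup=0.15)
print(f"{low:.2f} min at a 4 m pulp level, {high:.2f} min at a 5 m pulp level")
```

At a fixed feed rate, residence time scales linearly with pulp level in this simplified picture. Real columns also trade pulp depth against froth depth, which the sketch ignores.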

🧪 Why Unpacking Relationships is Critical

In a well-performing plant, we expect predictable outcomes. For instance, if iron content in the feed increases, we should ideally see a corresponding increase in the iron content of our concentrate. If this relationship isn't observed or is unexpectedly weak, it signals a deeper issue:

📈 My Key Performance Indicators (KPIs) Cheat Sheet in Iron Ore Flotation

Numbers You Want to GO UP

| Metric | Why it Matters | How Airflow & Pulp Level Impact It |
| --- | --- | --- |
| Iron Recovery (%) | The percentage of valuable iron from the original ore that we successfully capture in our final concentrate. Higher recovery means less waste and more product to sell. | Airflow: Proper airflow generates enough bubbles to attach to all liberated iron particles. Too little air, and particles aren't picked up; too much, and the froth may become unstable, dropping particles back. Pulp Level: An optimized pulp level allows for good particle-bubble contact time, maximizing the chance for iron minerals to attach and float. |
| Iron Concentrate Grade (%) | The purity of our final iron product, typically the percentage of Fe (iron) in the concentrate. A higher grade means a more valuable product for steelmaking, commanding a better price. | Airflow: Controlled airflow promotes the formation of small, stable bubbles that are more selective, reducing the entrainment of gangue. Pulp Level: A deeper froth layer allows more time for gangue particles to drain back into the pulp, thus "cleaning" the concentrate and increasing its purity. |
| Throughput (Tons/Hour) | The rate at which the flotation circuit processes ore. Higher throughput often means more overall production. | Airflow & Pulp Level: Optimal settings allow the flotation cells to operate efficiently at their designed capacity, processing more material per hour. |
| Concentrate Volume/Mass | The total amount of salable iron concentrate produced over a period. | Directly related to Recovery and Throughput: More recovery and higher throughput naturally lead to a greater volume/mass of concentrate. |
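Recovery is often estimated from assays alone using the standard two-product formula, which assumes a steady-state circuit with one feed, one concentrate, and one tailings stream. The sketch below is a minimal version of that calculation; the function name and the assay values are hypothetical, picked only to show the arithmetic.

```python
def two_product_recovery(feed_grade, conc_grade, tail_grade):
    """Iron recovery (%) from assays only, via the two-product formula.

    R = 100 * c * (f - t) / (f * (c - t)), with f, c, t the % Fe in feed,
    concentrate, and tailings.
    """
    if not 0 <= tail_grade < feed_grade < conc_grade:
        raise ValueError("expected 0 <= tail_grade < feed_grade < conc_grade")
    # Mass pull: fraction of the feed mass that reports to the concentrate
    mass_pull = (feed_grade - tail_grade) / (conc_grade - tail_grade)
    return 100.0 * mass_pull * conc_grade / feed_grade


# Hypothetical assays (% Fe), not plant data
recovery = two_product_recovery(feed_grade=50.0, conc_grade=65.0, tail_grade=20.0)
# Same feed and concentrate, but more iron lost to tailings
recovery_poor_tails = two_product_recovery(feed_grade=50.0, conc_grade=65.0, tail_grade=25.0)
print(f"Recovery: {recovery:.1f}% vs {recovery_poor_tails:.1f}% with richer tailings")
```

The second call shows why tailings iron content is worth watching: with the same feed and concentrate grade, more iron in the tailings means lower recovery.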

📉 Numbers You Want to GO DOWN

| Metric | Why it Matters | How Airflow & Pulp Level Impact It |
| --- | --- | --- |
| Gangue Content (%) | High gangue = lower-grade product and higher downstream costs. | Airflow: Too much air entrains fine gangue. Pulp Level: Shallow froth allows gangue carryover. |
| Reagent Consumption | Chemicals are costly; optimizing reduces waste. | Optimization: Better control → less chemical use. |
| Energy Consumption | Lower energy = better efficiency and cost savings. | Airflow: Efficient airflow reduces blower load. Pulp Level: Stability = less pump surge. |
| Tailings Iron Content (%) | More iron in tailings = lower recovery and profits. | Recovery Loss: Poor bubble capture leads to iron loss. |

Other Key Considerations

Monitor these metrics and adjust air flow and pulp level accordingly to maximize iron recovery and reduce processing costs.

What Should You Do With These Numbers?

🧠 Step 1: Analyze correlation results and identify helpful vs. harmful settings

To Increase Iron (%Fe):

    Column 05 Level (pulp height) shows a slight positive correlation with iron content (+0.16). While not strong, it may be worth monitoring or controlling more closely. Column 06 Level is a second candidate to raise, since its correlation with % Iron Concentrate is the second highest (+0.146). Column 03 Air Flow is a third, with a correlation of +0.10 to the target: more air in this column may help float the right minerals.

To Reduce Silica (%SiO₂):

    Column 01 Air Flow is the variable most negatively correlated with silica (-0.219), so higher airflow in that column is associated with less silica. Similarly, Column 03 Air Flow has a correlation of -0.218 with % SiO₂, so increasing it should also reduce silica in the output.

What’s Hurting Performance?

    Column 01 Level is positively correlated with silica in the concentrate (+0.017) and negatively correlated with % Iron Concentrate (-0.014), so raising it would nominally mean more silica and less iron. Both coefficients are very close to zero, though, so on its own this is a weak signal.

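# Assumes df is the loaded dataset; column names follow the ones used in the plots below
correlation_matrix = df.corr(numeric_only=True)

print("--- Correlations with '% Iron Concentrate' ---")
print(correlation_matrix["% Iron Concentrate"].sort_values(ascending=False))
print("--- Correlations with '% Silica Concentrate' ---")
print(correlation_matrix["% Silica Concentrate"].sort_values(ascending=True))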
Correlation Output

✅ Step 2: Plot It Out (Visual Inspection)

Use scatter plots to visually understand the relationships:

import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# === 2a. Scatter Plot: Air Flow vs % Iron Concentrate ===
sns.scatterplot(data=df, x="Flotation Column 01 Air Flow", y="% Iron Concentrate")
plt.title("Iron Concentrate vs. Column 01 Air Flow")
plt.xlabel("Flotation Column 01 Air Flow")
plt.ylabel("% Iron Concentrate")
plt.grid(True)
plt.show()

# === 2b. Scatter Plot with Regression Line: Ore Pulp pH vs % Iron Concentrate ===
sns.lmplot(data=df, x="Ore Pulp pH", y="% Iron Concentrate", height=6, aspect=1.5)
plt.title("Iron Concentrate vs. Ore Pulp pH with Regression Line")
plt.grid(True, linestyle=':', alpha=0.7)
plt.show()

# === 2c. Heatmap for overall correlation insights ===
plt.figure(figsize=(14, 10))
sns.heatmap(df.corr(numeric_only=True), annot=True, cmap="coolwarm", fmt=".2f", linewidths=.5)
plt.title("Correlation Matrix of All Process Variables")
plt.show()

With a general feel for the dataset from these first few plots, we next separate the input features from the two targets and plot the distribution of every numerical variable.

# === 2d. Identify Numerical Columns and Input Features ===
df.columns = df.columns.str.strip()  # Strip any leading/trailing whitespace
numerical_columns = df.select_dtypes(include=np.number).columns.tolist()

# Set your target variables
target_iron = "% Iron Concentrate"
target_silica = "% Silica Concentrate"

# Exclude target variables from input features
input_features = [col for col in numerical_columns if col not in [target_iron, target_silica]]

print(f"\nTarget Iron Column: '{target_iron}'")
print(f"Target Silica Column: '{target_silica}'")
print(f"Number of Input Features identified: {len(input_features)}")
print(f"Input Features: {input_features}")

# === 2e. Define Function to Plot Distributions for Numerical Columns ===
def plot_numerical_distributions(dataframe, unique_value_threshold=30):
    """
    Plots histograms or bar charts for numerical variables:
    - If unique values <= threshold, plot a bar chart.
    - If unique values > threshold, plot a histogram.
    """
    print("\n--- Generating Histograms/Bar Charts for Numerical Variables ---")
    numerical_columns_to_plot = dataframe.select_dtypes(include=np.number).columns.tolist()

    for col in numerical_columns_to_plot:
        series_to_plot = dataframe[col].dropna()
        if series_to_plot.empty:
            print(f"Skipping plot for '{col}' as it contains only NaN values.")
            continue

        plt.figure(figsize=(10, 6))
        if series_to_plot.nunique() <= unique_value_threshold:
            print(f"Plotting bar chart for '{col}' (Unique values: {series_to_plot.nunique()})")
            series_to_plot.value_counts().sort_index().plot(kind='bar')
            plt.title(f'Frequency of {col}', fontsize=16)
            plt.xlabel(col, fontsize=14)
            plt.ylabel('Frequency', fontsize=14)
            plt.xticks(rotation=45, ha='right', fontsize=10)
        else:
            print(f"Plotting histogram for '{col}' (Unique values: {series_to_plot.nunique()})")
            sns.histplot(series_to_plot, kde=True, bins='auto', edgecolor='black')
            plt.title(f'Distribution of {col}', fontsize=16)
            plt.xlabel(col, fontsize=14)
            plt.ylabel('Count', fontsize=14)
            plt.xticks(fontsize=10)

        plt.yticks(fontsize=10)
        plt.grid(axis='y', linestyle='--', alpha=0.7)
        plt.tight_layout()
        plt.show()

# === 2f. Call the Function ===
plot_numerical_distributions(df, unique_value_threshold=20)

Through visual inspection, we can often identify:

Take a look at my Deepnote notebook. This platform is brilliant and I really like it. Comparing Deepnote with Google Colab or Jupyter Notebook is a topic for another blog post.

✅ Step 3: Try a machine learning model

Use a machine learning model to see which variables truly matter and have the most predictive power over our outputs:

# === 3a. Import Libraries ===
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# === 3b. Define inputs and outputs for the model ===
# ALL_CONTROL_COLS and OUTPUT_IRON_CONCENTRATE are assumed to be defined earlier in the notebook
features_cols = ["% Iron Feed", "% Silica Feed"] + ALL_CONTROL_COLS
X = df[features_cols].dropna()
y = df[OUTPUT_IRON_CONCENTRATE].loc[X.index]

# === 3c. Split the data ===
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# === 3d. Train the Random Forest Regressor (on the training split only) ===
model = RandomForestRegressor(n_estimators=100, random_state=42)
model.fit(X_train, y_train)

# === 3e. Analyze feature importance ===
importances = pd.Series(model.feature_importances_, index=X.columns).sort_values(ascending=False)
print("Feature Importances for % Iron Concentrate:")
print(importances.head(10))

# === 3f. Model validation: evaluate on the held-out test set to check for overfitting ===
y_pred = model.predict(X_test)
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)
print("Mean Squared Error:", mse)
print("R-squared Score:", r2)

This analysis will show which process inputs **have the most power to influence our product quality**, even if their simple correlations seem weak due to complex interactions.
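To see why a weak simple correlation does not rule out a strong effect, consider the airflow behaviour described earlier: too little air and too much air both hurt recovery. The sketch below uses made-up numbers (not plant data) for airflow deviations around a hypothetical optimum and a recovery curve that peaks there. The `pearson_r` helper is written out by hand only to keep the example dependency-free.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Synthetic values: airflow offset from a hypothetical optimum setting
airflow_offset = [-3, -2, -1, 0, 1, 2, 3]
# Recovery peaks at the optimum and falls off symmetrically on both sides
recovery = [90 - 2 * a ** 2 for a in airflow_offset]

r = pearson_r(airflow_offset, recovery)
print(f"Pearson r between airflow offset and recovery: {r:.3f}")
```

Recovery here is fully determined by airflow, yet the linear correlation comes out at zero because the rising and falling halves cancel. A tree-based model can pick up this kind of relationship, which is one reason its feature ranking can differ from the correlation ranking.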

🔍 Additional Advanced Analysis Techniques

4. 🔄 Align Data by Time (Handling Lags) in Time Series Analysis

In continuous processes, inputs don't instantly affect outputs; there's often a **time delay**. For example, a change in feed properties might only be reflected in the final concentrate after pulp flows through all columns.

We use a Random Forest model to predict the efficiency of the iron ore concentration process from recent and lagged sensor readings. The model is evaluated using both a held-out test set and out-of-bag (OOB) validation for reliability.
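A minimal sketch of that setup might look like the following. It runs on synthetic data generated inside the snippet, so the 5-row lag, the column name `% Iron Feed_lag5`, and every number are assumptions for illustration rather than results from our plant. The hold-out here is the last 20% of rows instead of a random sample, a choice made for this sketch because lagged rows are time-ordered.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
n = 500

# Synthetic stand-in for plant data: the concentrate follows the feed from 5 rows earlier
feed = pd.Series(rng.normal(55.0, 2.0, n))
frame = pd.DataFrame({"% Iron Feed": feed, "% Iron Feed_lag5": feed.shift(5)})
frame["% Iron Concentrate"] = 0.2 * frame["% Iron Feed_lag5"] + 54.0 + rng.normal(0.0, 0.1, n)
frame = frame.dropna()  # the first 5 rows have no lagged value

X = frame[["% Iron Feed", "% Iron Feed_lag5"]]
y = frame["% Iron Concentrate"]

# Chronological split: train on the first 80% of rows, test on the rest
split = int(len(frame) * 0.8)
X_train, X_test = X.iloc[:split], X.iloc[split:]
y_train, y_test = y.iloc[:split], y.iloc[split:]

# oob_score=True scores each training row using only the trees that did not see it
model = RandomForestRegressor(n_estimators=200, oob_score=True, random_state=0)
model.fit(X_train, y_train)

oob_r2 = model.oob_score_             # R-squared estimated from out-of-bag samples
test_r2 = model.score(X_test, y_test)  # R-squared on the held-out rows
print(f"OOB R2: {oob_r2:.3f}, test R2: {test_r2:.3f}")
```

If the OOB score and the test score disagree sharply, that is a hint the model is not generalizing the way the training data alone suggests.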

from scipy.stats import pearsonr

# === Example: Shift '% Iron Feed' to account for a 30-minute lag (assuming one row per minute)
df["% Iron Feed_lagged_30min"] = df["% Iron Feed"].shift(periods=30)

# === Then, compute correlations or build models with the *shifted* feed data.
# Drop rows where either value is missing so both inputs have the same length.
paired = df[["% Iron Feed_lagged_30min", "% Iron Concentrate"]].dropna()
corr, _ = pearsonr(paired["% Iron Feed_lagged_30min"], paired["% Iron Concentrate"])

Testing various lags (e.g., 5, 10, 15 minutes) and looking for the strongest correlation will help identify the true time delay for cause-and-effect relationships.
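The lag search described above can be sketched as a small loop. The helper name `scan_lags` is hypothetical, and the demo data is synthetic with a 15-row delay built in on purpose, so the only thing it shows is that the scan finds a delay we already know is there.

```python
import numpy as np
import pandas as pd


def scan_lags(df, input_col, output_col, lags):
    """Return the Pearson correlation between a lagged input and an output, per lag (in rows)."""
    results = {}
    for lag in lags:
        # shift(lag) moves the input forward so row t holds the value from row t - lag;
        # Series.corr skips rows where either side is NaN
        results[lag] = df[input_col].shift(lag).corr(df[output_col])
    return pd.Series(results, name="correlation")


# Synthetic demo: the concentrate responds to the feed 15 rows later
rng = np.random.default_rng(42)
n = 2000
feed = pd.Series(rng.normal(55.0, 2.0, n))
demo = pd.DataFrame({"% Iron Feed": feed})
demo["% Iron Concentrate"] = 0.3 * feed.shift(15) + 48.0 + rng.normal(0.0, 0.2, n)

lag_corr = scan_lags(demo, "% Iron Feed", "% Iron Concentrate", lags=[0, 5, 10, 15, 20, 30])
best_lag = lag_corr.abs().idxmax()
print(lag_corr.round(3))
print("Lag with the strongest correlation:", best_lag)
```

Lags are counted in rows, so with one-minute data a lag of 15 means 15 minutes. On real plant data the peak is usually much flatter than in this demo, so it helps to look at the whole curve rather than only the maximum.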

Summary & Call to Action

This analysis has aimed to provide a data-driven understanding of our iron ore flotation process. Key takeaways include:

🏁 Final Thoughts: Bridging Data to Operations

The observation that iron in the feed isn't consistently leading to better iron in the product points to potential **deep inefficiencies or untapped optimization opportunities** within our current operational strategy. This is a critical point to address.

Recommendation: Present these findings to operations and the process engineering team. Specifically, ask:

"Why doesn't increased iron in the feed consistently translate to higher iron in the product concentrate? What process parameters or equipment limitations might be causing this apparent bottleneck, and how can we investigate further through targeted adjustments or process studies?"

Understanding and rectifying this gap is paramount for improving our product quality, maximizing yield, and enhancing overall profitability.

Executive Summary: Iron Ore Flotation Performance Trends

This report provides an overview of our iron ore flotation circuit's key performance indicators (KPIs) and highlights recent trends that impact our operational efficiency and product quality. Our primary goals remain maximizing iron recovery and concentrate grade while minimizing waste.

Key Performance at a Glance


Driving Factors & Next Steps

Our analysis indicates that recent variability in Ore Pulp pH and minor fluctuations in Flotation Column Air Flow are likely contributing to the observed trends. While no critical system failures ("broken" states) have been identified, the data suggests areas where our process is not fully optimized.

Specifically, we've noted:

Our immediate next steps will focus on:

  1. Tightening control on Ore Pulp pH to consistently maintain optimal levels.
  2. Conducting targeted experiments to identify optimal air flow rates for each flotation column under varying feed conditions.
  3. Implementing real-time data monitoring with adjusted time lags to enable more responsive process control.

By addressing these optimization opportunities, we aim to reverse the current trend in concentrate grade, improve silica rejection, and ensure sustained high recovery.