AI in an ETL Pipeline – Session 2

AI in an ETL Pipeline: A Potato Example

Session 2: AI in Business | Topic: How data flows through a pipeline, where ML and GenAI plug in, and how results reach you

What is ETL?

ETL stands for Extract, Transform, and Load. It’s a standard way to move data from where it’s collected into a form that’s useful for reporting, dashboards, or AI. Once the data is cleaned and structured, you can plug in both machine learning (ML) models and generative AI (GenAI) to get numbers, predictions, and human-readable text.

Extract: Pull data from field sensors, weather APIs, spreadsheets, storage logs, or equipment.
Transform: Clean, standardize, and combine data (e.g. fix units, fill gaps, join field + weather).
Load: Write the result into a database, data lake, or analytics tool so people and AI can use it.

AI fits into this pipeline after (or during) the transform step: you feed the prepared data into ML models for predictions and into GenAI for summaries and alerts. You then load both the model outputs and the generated text into a data store that feeds dashboards, apps, and SMS.

ML vs GenAI in the Pipeline

Machine learning (ML) uses patterns in your data to produce structured outputs: numbers (e.g. predicted yield in t/acre), categories (e.g. high/medium/low risk), or yes/no flags. ML models are trained on historical data and run on the same kind of structured data from your ETL pipeline. They don’t write sentences—they output columns you can chart, filter, and send to a dashboard or app.

Generative AI (GenAI) produces text (or images). In a pipeline, GenAI takes the same data—and often the ML outputs—and turns them into plain language: weekly summaries (“Field F102 is trending below target; consider extra scouting”), SMS alerts, answers to questions like “Why is field F101 predicted higher than last year?”, or short reports for management. GenAI doesn’t replace ML; it makes ML results easier to read and act on.

Where ML and GenAI sit in the pipeline

Extract → Transform → ML → GenAI → Load

ML: predictions & scores → GenAI: summaries & alerts → Load: dashboard, app, SMS
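
The ordering above can be sketched as a chain of plain functions. This is only an illustration of the data flow, not a real framework: every function name is hypothetical, the 240 mm rainfall threshold is an arbitrary assumption, and the ML and GenAI stages are replaced by trivial stand-ins (a threshold and a text template).

```python
def extract():
    # Stand-in for reading sensors, APIs, or spreadsheets (values arrive as text).
    return [{"field_id": "F101", "rainfall_mm": "220"}]

def transform(rows):
    # Cast text to numbers so later stages receive typed values.
    return [{**r, "rainfall_mm": float(r["rainfall_mm"])} for r in rows]

def ml_stage(rows):
    # Placeholder "model": an arbitrary rainfall threshold, not a trained model.
    return [
        {**r, "risk_level": "Medium" if r["rainfall_mm"] > 240 else "Low"}
        for r in rows
    ]

def genai_stage(rows):
    # Placeholder text generation: a fixed template instead of a language model.
    return [
        {**r, "summary": f"{r['field_id']}: {r['risk_level'].lower()} risk."}
        for r in rows
    ]

def load(rows, store):
    # Stand-in for writing to a database: append to an in-memory list.
    store.extend(rows)
    return rows

def run_pipeline(store):
    rows = extract()
    for stage in (transform, ml_stage, genai_stage):
        rows = stage(rows)
    return load(rows, store)

store = []
run_pipeline(store)
print(store[0]["summary"])
```

Because each stage takes rows and returns rows, a stage can be swapped (for example, replacing the threshold with a trained model) without touching the others.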

ML examples (potato)

Yield prediction: Regression model trained on field + weather + history → predicted t/acre per field.
Disease/risk classification: Model trained on weather + soil + past outbreaks → “low / medium / high” risk per field or week.
Grade or quality: Model trained on storage temp, humidity, variety → predicted grade or “process vs table” suitability.

GenAI examples (potato)

Weekly summary: “This week F101 and F103 are on track. F102 is 0.4 t/acre below target; wet conditions in the north corner may be a factor.”
SMS alert: “Alert: Field F102 yield forecast dropped. Consider scouting. Details in app.”
Q&A over your data: User asks “Which fields need attention?” → GenAI reads ML outputs and answers in plain language.

Potato Example: From Field Data to ML and GenAI

Below is a single pipeline: data is extracted from several sources, transformed into one table, then fed to an ML model for yield and risk, and to GenAI for summaries and alerts. The results are loaded to a database and then surfaced on a dashboard, in an app, and via SMS.

End-to-end pipeline

Extract (fields, weather, soil, yields) → Transform (clean, join, one table) → ML (yield & risk) → GenAI (summaries & SMS) → Load (DB → dashboard, app, SMS)

Step 1: Extract (data sources)

Data is pulled from sources typical in potato operations:

Field records: variety, planting date, field ID, acreage.
Weather: temperature, rainfall, growing degree days (from a weather API or station).
Soil/sensors: soil moisture, pH, or other sensor readings if available.
Historical yields: tonnes per acre or per field from past seasons.
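
A minimal sketch of the extract step, assuming two of the sources arrive as CSV exports. The CSV text is embedded in the code so the example runs on its own; in practice it would come from files or an API. The function name and the layout of the exports are hypothetical; the values are borrowed from the example table in Step 2.

```python
import csv
import io

# Hypothetical exports; in practice these would be files or API responses.
FIELDS_CSV = """field_id,variety,planting_date
F101,Russet,2024-05-01
F102,Russet,2024-05-05
F103,Yellow,2024-05-03
"""

WEATHER_CSV = """field_id,growing_degree_days,rainfall_mm
F101,1850,220
F102,1820,245
F103,1835,230
"""

def extract_csv(text):
    """Parse CSV text into a list of row dicts (all values are still strings)."""
    return list(csv.DictReader(io.StringIO(text)))

field_rows = extract_csv(FIELDS_CSV)
weather_rows = extract_csv(WEATHER_CSV)
print(field_rows[0])
```

Note that extraction deliberately leaves every value as text; converting types and fixing units belongs to the transform step.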

Step 2: Transform (clean and join)

Raw data is cleaned and combined into a single table with one row per field-season: same units, same date ranges, and no missing key values. Example shape after transform:

field_id	variety	planting_date	growing_degree_days	rainfall_mm	soil_moisture_avg	historical_yield_t_per_acre
F101	Russet	2024-05-01	1850	220	0.42	18.2
F102	Russet	2024-05-05	1820	245	0.45	17.8
F103	Yellow	2024-05-03	1835	230	0.38	16.5
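
One way such a table could be produced is a join on field_id followed by type casting. The sketch below is an assumption about how the step might look, not the pipeline's actual code: the function name is hypothetical, only a subset of the columns is shown, and skipping fields without weather data is just one possible policy for missing values. Field F104 is invented to show that policy in action.

```python
from datetime import date

def transform(field_rows, weather_rows):
    """Join field and weather records on field_id and cast text to typed values."""
    weather_by_field = {w["field_id"]: w for w in weather_rows}
    table = []
    for f in field_rows:
        w = weather_by_field.get(f["field_id"])
        if w is None or not w.get("rainfall_mm") or not w.get("growing_degree_days"):
            continue  # one possible policy: skip rows with missing key values
        table.append({
            "field_id": f["field_id"],
            "variety": f["variety"],
            "planting_date": date.fromisoformat(f["planting_date"]),
            "growing_degree_days": int(w["growing_degree_days"]),
            "rainfall_mm": float(w["rainfall_mm"]),
        })
    return table

fields = [
    {"field_id": "F101", "variety": "Russet", "planting_date": "2024-05-01"},
    {"field_id": "F102", "variety": "Russet", "planting_date": "2024-05-05"},
    # Hypothetical field with no weather record, to exercise the skip rule.
    {"field_id": "F104", "variety": "Russet", "planting_date": "2024-05-02"},
]
weather = [
    {"field_id": "F101", "growing_degree_days": "1850", "rainfall_mm": "220"},
    {"field_id": "F102", "growing_degree_days": "1820", "rainfall_mm": "245"},
]

table = transform(fields, weather)
for row in table:
    print(row)
```

A stricter pipeline might log or reject the skipped field instead of dropping it silently; the right choice depends on how the downstream model handles gaps.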

Step 3: Plug in ML and GenAI

The transformed table is the input to an ML model. The model adds predicted yield and risk level per field. Those outputs (plus the underlying data) are then passed to GenAI to produce short summaries and alert text.

Example table after ML (two new columns):

field_id	predicted_yield_t_per_acre	risk_level
F101	18.4	Low
F102	16.9	Medium
F103	17.2	Low
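
To make the ML step concrete, here is a deliberately small stand-in: a one-feature least-squares line for yield and a shortfall rule for risk. Everything in it is an assumption for illustration. The training history is made up, the helper names and gap thresholds are hypothetical, and a real model would use more features and a trained classifier for risk. The predictions it prints will therefore not match the example table above.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

def predict(slope, intercept, x):
    return slope * x + intercept

def risk_level(predicted, historical, medium_gap=0.5, high_gap=1.5):
    """Hypothetical rule: a larger shortfall versus history means higher risk."""
    shortfall = historical - predicted
    if shortfall > high_gap:
        return "High"
    if shortfall > medium_gap:
        return "Medium"
    return "Low"

# Made-up training history: (rainfall_mm, yield_t_per_acre) from past seasons.
history = [(200, 18.5), (220, 18.0), (240, 17.5), (260, 17.0)]
slope, intercept = fit_line([h[0] for h in history], [h[1] for h in history])

# Score the three example fields: (rainfall_mm, historical_yield_t_per_acre).
fields = {"F101": (220, 18.2), "F102": (245, 17.8), "F103": (230, 16.5)}
for field_id, (rainfall_mm, historical) in fields.items():
    predicted = round(predict(slope, intercept, rainfall_mm), 1)
    print(field_id, predicted, risk_level(predicted, historical))
```

Applied to the values in the example table itself, the shortfall rule happens to agree with the risk column: risk_level(16.9, 17.8) gives "Medium" for F102, while F101 and F103 come out "Low".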

GenAI takes this table (and optional context like “weekly update”) and generates, for example: “F101 and F103 on track. F102 below target (16.9 t/acre); medium risk—consider scouting.” That text is stored and also sent as SMS or in-app copy. The pipeline runs on a schedule (e.g. weekly) so predictions and messages stay current.
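
The GenAI step can be sketched as "build a prompt from the ML table, then hand it to a model". The code below assumes nothing about which model or vendor is used: llm is any callable that maps a prompt string to text, and fake_llm is a hypothetical offline stand-in so the example runs without network access. The prompt wording is also just one possibility.

```python
ml_rows = [
    {"field_id": "F101", "predicted_yield_t_per_acre": 18.4, "risk_level": "Low"},
    {"field_id": "F102", "predicted_yield_t_per_acre": 16.9, "risk_level": "Medium"},
    {"field_id": "F103", "predicted_yield_t_per_acre": 17.2, "risk_level": "Low"},
]

def build_prompt(rows, context="weekly update"):
    """Turn the ML output table into a text prompt for a language model."""
    lines = [
        f"{r['field_id']}: predicted {r['predicted_yield_t_per_acre']} t/acre, "
        f"risk {r['risk_level']}"
        for r in rows
    ]
    return (
        f"Write a two-sentence {context} for a potato farm manager.\n"
        "Use only the figures below and do not invent numbers.\n"
        + "\n".join(lines)
    )

def generate_summary(rows, llm):
    """llm is any callable that maps a prompt string to generated text."""
    return llm(build_prompt(rows)).strip()

def fake_llm(prompt):
    # Offline stand-in; a real deployment would call a model API here.
    return " F101 and F103 on track. F102 below target; consider scouting. "

print(build_prompt(ml_rows))
print(generate_summary(ml_rows, fake_llm))
```

Passing the model in as a parameter keeps the pipeline testable: the same generate_summary function can run against a stub in tests and against a real model in production.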

Step 4: Load — Where results go

The final dataset—including ML predictions and GenAI text—is loaded into a database or data store. From there it powers a web dashboard, a mobile app, and SMS (or email). The next section shows example outputs for each.
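
As a rough sketch of the load step, the example below writes the predictions and the generated summary into SQLite, using an in-memory database so it runs anywhere. The table names, column layout, and run date are assumptions made for illustration; a production pipeline would likely target a shared database server instead.

```python
import sqlite3

predictions = [
    ("F101", 18.4, "Low"),
    ("F102", 16.9, "Medium"),
    ("F103", 17.2, "Low"),
]
summary = "F101 and F103 on track. F102 below target (16.9 t/acre); consider scouting."
run_date = "2024-08-05"  # hypothetical date of this weekly run

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE field_predictions ("
    "field_id TEXT PRIMARY KEY, predicted_yield_t_per_acre REAL, risk_level TEXT)"
)
conn.execute("CREATE TABLE summaries (run_date TEXT PRIMARY KEY, summary TEXT)")

# INSERT OR REPLACE keeps a re-run of the same week from creating duplicates.
conn.executemany("INSERT OR REPLACE INTO field_predictions VALUES (?, ?, ?)", predictions)
conn.execute("INSERT OR REPLACE INTO summaries VALUES (?, ?)", (run_date, summary))
conn.commit()

# Each surface reads from the same store with its own query.
needs_attention = [
    row[0]
    for row in conn.execute(
        "SELECT field_id FROM field_predictions WHERE risk_level != 'Low'"
    )
]
print(needs_attention)
```

Keying the tables on field_id and run_date is what makes the scheduled run safe to repeat: loading the same week twice overwrites rows rather than adding new ones.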

Output to Dashboard, App, and SMS

The same pipeline can feed multiple surfaces. Below are mock examples of how the ML and GenAI outputs could look on a dashboard, in an app, and in an SMS.

Dashboard (web)

A simple web dashboard might show key numbers and a chart. Tiles can come from the loaded ML results; the “Weekly summary” text is from GenAI.

Tiles: Avg predicted yield (17.5 t/ac), Fields on track, Needs attention

Chart: Predicted yield by field (bar chart) for F101, F102, F103

Weekly summary (GenAI)

F101 and F103 on track. F102 below target (16.9 t/acre); medium risk—consider scouting.
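
The tile values can be derived directly from the ML output rows. In the sketch below the average is plain arithmetic over the three predictions, while the "on track" and "needs attention" counts rely on an assumption that is not stated in the mock: a field counts as on track when its risk level is Low. The function name is hypothetical.

```python
ml_rows = [
    {"field_id": "F101", "predicted_yield_t_per_acre": 18.4, "risk_level": "Low"},
    {"field_id": "F102", "predicted_yield_t_per_acre": 16.9, "risk_level": "Medium"},
    {"field_id": "F103", "predicted_yield_t_per_acre": 17.2, "risk_level": "Low"},
]

def dashboard_tiles(rows):
    """Summarise ML rows into the numbers shown on the dashboard tiles."""
    yields = [r["predicted_yield_t_per_acre"] for r in rows]
    # Assumption: "on track" means risk level Low; anything else needs attention.
    on_track = [r["field_id"] for r in rows if r["risk_level"] == "Low"]
    needs_attention = [r["field_id"] for r in rows if r["risk_level"] != "Low"]
    return {
        "avg_predicted_yield": round(sum(yields) / len(yields), 1),
        "fields_on_track": len(on_track),
        "needs_attention": len(needs_attention),
    }

tiles = dashboard_tiles(ml_rows)
print(tiles)
```

The average works out to (18.4 + 16.9 + 17.2) / 3 = 17.5 t/ac, which matches the tile above.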

App (mobile)

The same metrics and summary can appear in a mobile app. Alerts can be push notifications; the content can be the same GenAI summary or a shorter line.

Field Watch: Yield & risk at a glance

ALERT (just now): F102 below target. Tap for details.

Predicted yield
F101 Russet: 18.4 t/ac
F102 Russet: 16.9 t/ac
F103 Yellow: 17.2 t/ac

Tabs: Fields, Alerts, Summary

SMS

SMS messages are kept short. GenAI can generate one or two lines from the ML results, e.g. for alerts or a daily/weekly digest.

SMS (example)

Field Watch: F102 yield forecast 16.9 t/ac (below target). Medium risk—consider scouting. Details in app.
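
Whatever generates the text, it helps to enforce the length limit in code rather than trust the model. The sketch below assembles a below-target alert from optional sentences and drops trailing ones that would exceed the limit. The function name is hypothetical, and the default of 160 characters assumes a single-segment SMS in the standard GSM 7-bit alphabet; messages containing other characters have a lower limit.

```python
def build_sms(field_id, predicted, risk, max_len=160):
    """Build a below-target alert, keeping whole sentences that fit in max_len.

    Returns an empty string if even the first sentence does not fit.
    """
    parts = [
        f"Field Watch: {field_id} yield forecast {predicted} t/ac (below target).",
        f"{risk} risk, consider scouting.",
        "Details in app.",
    ]
    message = ""
    for part in parts:
        candidate = f"{message} {part}".strip()
        if len(candidate) > max_len:
            break  # stop before the sentence that would overflow
        message = candidate
    return message

sms = build_sms("F102", 16.9, "Medium")
print(sms)
print(len(sms) <= 160)
```

Ordering the sentences from most to least important means a tighter limit trims the pointer to the app first and the forecast last.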

Why This Pipeline View Helps

Keeping ETL, ML, and GenAI in one pipeline makes it easier to add new data sources, new models, and new ways to consume the results. Good AI depends on good data; ETL forces you to define what you extract and how you transform it before adding models.

Data first: One clean dataset feeds both ML and GenAI, so predictions and messages stay consistent.
Reusable: The same pipeline can feed multiple ML use cases (yield, risk, quality) and multiple GenAI outputs (summary, SMS, Q&A).
One load, many surfaces: Load results once to a database; the dashboard, app, and SMS all read from the same place.

Takeaway

An ETL pipeline gives you a clear place to plug in both ML (for predictions and risk scores) and GenAI (for summaries and alerts). In a potato context, that means combining field, weather, and history into one dataset, running ML for yield and risk, then using GenAI to turn those results into plain-language updates. Loading once to a database lets you surface the same outputs on a dashboard, in an app, and via SMS so your team can act on them wherever they are.