Machine Learning: Purpose and Data

Machine learning can be thought of as working with data so that we get answers to the questions we have. It is the development of a function that uses your data to give approximate answers to your questions.

My argument is that, within each of the three time frames (past, present, future), we basically look for three things:

  1. What the thing is – identity / classification
  2. What’s inside the thing – composition / decomposition
  3. What’s not in the thing – negation / contrast / boundary definition

And these three apply within each time frame:

| Time frame | What it is | What’s inside it | What’s not in it |
|---|---|---|---|
| Present | Current label | Current components | Current exclusions |
| Past | Previous identity | Previous composition | Previous exclusions |
| Future | Predicted identity | Predicted composition | Predicted exclusions |
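
To make the grid concrete, here is a minimal sketch that crosses the three questions with the three time frames for a single object. The names `TIME_FRAMES`, `QUESTIONS`, and `nine_questions` are made up for illustration. They are not part of any library.

```python
from itertools import product

# Hypothetical labels for the three time frames and the three questions.
TIME_FRAMES = ("past", "present", "future")
QUESTIONS = ("what it is", "what's inside it", "what's not in it")

def nine_questions(obj_name):
    """Return every (object, time frame, question) slot for one object."""
    return [(obj_name, frame, question)
            for frame, question in product(TIME_FRAMES, QUESTIONS)]

grid = nine_questions("ball")
for slot in grid:
    print(slot)
```

Each tuple is one slot we might want a model to fill in for that object.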

But all of these possibilities map onto the same object. For example, say you have a picture of a ball. With it we can try to answer:

  • What is the ball like currently?
  • What is the ball made of?
  • What would the ball be like in the future?
  • What was the ball like in the past?
  • What is the ball not made of in the past, present, or future?

The meta layer to this is when we try to answer questions about other objects, such as “What were the other balls like?” or “What is the ball adjacent to it like right now, and what was it like in the past or will it be like in the future?”

For example, take text in the present time frame:

You have a block of text. You can identify what the block is, what the sentiment is, and what the content is. You can also say what’s in it, including what words are in it and what words are not.
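
As a rough sketch of the “what’s in it” and “what’s not in it” questions for text: the sample sentence, the `vocabulary` set, and the `words_in` helper below are all made up for illustration. Note that “what’s not in it” only makes sense relative to some reference set, so the sketch assumes a small fixed vocabulary.

```python
import re

def words_in(text):
    """Return the set of lowercase words that appear in the text."""
    return set(re.findall(r"[a-z']+", text.lower()))

# Hypothetical block of text and reference vocabulary.
block = "The ball is red and the ball is round."
vocabulary = {"the", "ball", "is", "red", "blue", "round", "square", "and"}

present = words_in(block)        # what's inside the block
absent = vocabulary - present    # what's not in it, relative to the vocabulary

print(sorted(present))
print(sorted(absent))
```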

LLMs are remarkable because they can estimate what plausibly came before or after this block. They are essentially approximating an entirely new object.

We are talking about being able to look at data, learn patterns from it, and create a new object that was never part of the training data.

This is no longer about the same object across time. It’s about relations between distinct objects in a shared context (space, sequence, graph, or discourse).

An object’s own data cannot tell us much about other objects, and we won’t have data about every object. So we have to work out how our current object relates to the others from the bits and pieces we do have about them, and then try to reconstruct those other objects from these relations.

In LLMs, the training data consists of pieces of text, and the model generates new pieces of text: objects that were never in its training data.
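
A toy illustration of that idea, using a bigram (word-pair) model as a stand-in. This is not how LLMs actually work, and the training sentences and function names are made up. It only shows that patterns learned from a few texts can be recombined into texts that were never in the training set.

```python
import random
from collections import defaultdict

def train_bigrams(sentences):
    """Record, for each word, the words that followed it in the training text."""
    follows = defaultdict(list)
    for sentence in sentences:
        words = ["<s>"] + sentence.split() + ["</s>"]
        for current, nxt in zip(words, words[1:]):
            follows[current].append(nxt)
    return follows

def generate(follows, rng, max_len=20):
    """Sample one sentence by repeatedly picking a word seen after the current one."""
    word, out = "<s>", []
    for _ in range(max_len):
        word = rng.choice(follows[word])
        if word == "</s>":
            break
        out.append(word)
    return " ".join(out)

# Hypothetical training data.
training = ["the ball is red", "the sky is blue", "the ball is round"]
model = train_bigrams(training)

rng = random.Random(0)  # fixed seed so runs are repeatable
samples = {generate(model, rng) for _ in range(200)}
novel = samples - set(training)  # generated sentences that were never in training
print(sorted(novel))
```

Every generated sentence is built only from word transitions seen in training, yet some of them (for example, a sentence about the sky that borrows a property of the ball) appear nowhere in the training data.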


The whole scope can be imagined as one object and its nine questions (three questions in each of the three time frames). Now imagine many such objects, and try to answer their nine questions using the object we have as training data.

Of course, we are not going to have all the data for all the objects, but we try to get as much as we can.

[Figure: assume the shades of green represent the data available.]

With whatever data we have, we try to establish the relations between the objects so that we can recreate them and answer their nine questions.


We saw that the scope goes from 0 to 100 real quick. It went from just telling you what’s inside the object to predicting other objects and everything about them.

Clearly, a different scope of predictions needs a different scope of data.

The Platonic ideal of data would be all nine questions and their answers, for all objects, available to train on. Such data would allow us to generate objects and try to answer their nine questions.

This is what we see in today’s AI arms race: people want to get their hands on any type of data they can find so they can train their models on it. The more data a model has, the more competent it tends to become at answering these meta questions.

Still, let’s make a quick table that roughly shows what type of data we need for each type of question.

| Question Type | Time Frame | Data Needed | Example |
|---|---|---|---|
| What is it? (Classification) | Present | Labeled examples of the thing | Images with class labels |
| What is it? (Classification) | Past | Historical records with labels | Old medical records with diagnoses |
| What is it? (Classification) | Future | Time-series of labeled states | Stock categories across quarters |
| What’s inside it? (Decomposition) | Present | Structured/annotated internals | Part-segmented images, parse trees |
| What’s inside it? (Decomposition) | Past | Archived compositional data | Historical ingredient lists, old schematics |
| What’s inside it? (Decomposition) | Future | Sequential compositional change data | Material degradation logs over time |
| What’s not in it? (Negation) | Present | Negative examples, contrastive pairs | What a tumor scan is not, OOD samples |
| What’s not in it? (Negation) | Past | Records of absence or exclusion | What ingredients a recipe never used |
| What’s not in it? (Negation) | Future | Anomaly baselines, failure mode logs | What a healthy engine reading won’t show |
| Relations to other objects | Any | Co-occurring, linked, or contextual data | Text corpora, knowledge graphs, scene graphs |
| Reconstructing unknown objects | Any | Partial observations across many objects | Web-scale text, satellite imagery mosaics |