Coding the “Brain”

A how-to for creating one’s own algorithmic ambassador to the web

February 25, 2014
Reading Time: 8 minutes

I have always been interested in automation. Thanks to cheap computing and other technological advances, more and more repetitive tasks are being automated, from tax preparation to checkout services to manufacturing to product handling. Recently, I’ve found that maintaining a social media presence feels repetitive. As a result, I’ve wondered: using natural language generation, can I automate my Internet presence? Can an algorithm represent me on my social networks? Is it possible that, on the Internet, no one knows you’re a bot?

To start, I limited my domain to Twitter because its concise, fleeting format makes it a good place for experimentation. And to represent me, I developed the Brain, a bot that learns to imitate certain users (or hand-picked “muses”) and tweets in some amalgamation of their styles.

A few of the muses include webbedspace, daniel_rehn, wwwtext, WilbotOsterman, ftrain, and a_antonellis.

Some example tweets:

The slime’s transforming your arm directly into its own cells! But if you focus, you can push back, and turn its tentacle into your flesh! – webbedspace

As those experiences trend toward the monolithic that density becomes heavier, more likely to crush than to lead to unexpected encounters – WilbotOsterman

By learning from these muses, the Brain sometimes actually tweets a poignant observation:

the mind a true internet of things

Or some kind of strangely elegant aphorism:

people are just more noise

Or vague threats:

anytime i’ve been phased out you drink the pain

So it seems possible to create one’s own algorithmic ambassador to the web. Interested in substituting cold code for human wit? Here’s how….

Act natural

Natural language generation — the name for this domain of challenges — is a multifaceted and, therefore, tricky endeavor. How can an algorithm, which does not really “understand” language, generate grammatical and sensible prose?

The distinction between “grammatical” and “sensible” is important, as both are requirements for text to be considered understandable or “passable” by humans.

For example, the following sentence is grammatical, but not sensible:

we spent the entirety of his nose between his 900 million

The following sort of makes sense — you can get the gist of it — but the phrase is not grammatical:

operation olympic games the multiyear corruption is

But if you’re willing to toss aside those concerns, there’s a simple approach which is fun if only for its wild unpredictability: Markov chain generation.

Invisible chains

Markov chain generation is an intuitive technique based on the simple concept of sequencing possible events (or words, in this case) so that the probability of each event (or word) depends only on the previous one. Because Markov chain generation is such a simple approach, it doesn’t always produce grammatical or sensible output. But it can be very entertaining. The code can lead to quite a few “happy accidents,” which exploit the human tendency to interpret liberally. Better still? Every once in a while the algorithm spontaneously produces a surprisingly coherent metaphor.

A Markov chain is a system typically represented like so:

[Diagram: a Markov chain of states A, B, and C connected by weighted, directed edges]


Each circle is a “state,” and at any given moment the system is in one of these states.

The lines that connect the states (known as “edges”) have values which represent the probability that one state will lead to another state. These edges are “directed,” that is, they go from one state to another in only one direction. For instance, A is connected to C, but C is not connected to A.

To step through a Markov chain, look at the current state, then roll a die (or, more accurately, pick a random value between 0 and 1) to determine the next state.

For instance, in the above diagram, say we start in state A. Then we want to figure out what the next state will be. The two states that A is connected to are states B and C. The edge that connects A to C has a probability of 0.7, and the edge that connects A to B has a probability of 0.3. So we have a 70 percent chance of going to C and a 30 percent chance of going to B.

Imagine we pick our random number and it lands us in state C. Now we repeat the process. State C is connected to B, and it is also connected to itself, which means there is a 40 percent chance that the next state after C will be C again.

Pick a random number again, go to another state (B or C), and this continues ad nauseam.
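As a sketch, the stepping procedure above might look like the following in Python. The A and C edges use the probabilities from the diagram; the diagram doesn’t specify the edges leaving B, so the edge back to A below is a made-up placeholder.

```python
import random

# Transition probabilities: A->C is 0.7, A->B is 0.3,
# C->C is 0.4, C->B is 0.6 (from the diagram above).
chain = {
    'A': {'B': 0.3, 'C': 0.7},
    'C': {'C': 0.4, 'B': 0.6},
    'B': {'A': 1.0},  # hypothetical: not specified in the diagram
}

def step(state):
    # Pick a random value between 0 and 1, then walk the
    # cumulative probabilities to choose the next state.
    rand = random.random()
    total = 0.0
    for next_state, prob in chain[state].items():
        total += prob
        if rand < total:
            return next_state

# Walk the chain for a few steps, starting in state A.
state = 'A'
for _ in range(5):
    state = step(state)
```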

Listen to many

So how can we adapt such a system for language generation?

It’s simple — assume that each “state” is a word, and generate a chain of states (words) to create a sentence.

The question is: how do we determine the edges (or probabilities) of the system? If we start with the word “the,” how do we know what word we’re likely to go to next?

What we need is a corpus — a collection of text — on which to model the Markov system. In the Brain, this text is collected from other Twitter users (“muses”), and over time the Brain builds a vocabulary and “learns” the probabilities that one word leads to another.

But for demonstration purposes, let’s just use the following text from Space Ipsum, “a space-themed lorem ipsum generator.” Consider each sentence as a document and allow the collection of sentences to be our “corpus”:

The Eagle has landed. The regret on our side is, they used to say years ago, we are reading about you in science class. Now they say, we are reading about you in history class.

Never in all their history have men been able truly to conceive of the world as one: a single sphere, a globe, having the qualities of a globe, a round earth in which all the directions eventually meet, in which there is no center because every point, or none, is center — an equal earth which all men occupy as equals. The airman’s earth, if free men make it, will be truly round: a globe in practice, not in theory.

For those who have seen the Earth from space, and for the hundreds and perhaps thousands more who will, the experience most certainly changes your perspective. The things that we share in our world are far more valuable than those which divide us.

We can process this text to build a vocabulary and learn the relationships between words, accumulating “knowledge” to later generate text.

# We'll use this to keep track of words
# and the probabilities between them.
knowledge = {}

# Split the text into sentences (our "documents").
# Clean up line breaks, lowercase everything, and remove empty strings.
sentences = list(filter(None, text.lower().replace('\n', ' ').split('.')))

# Generate the "knowledge".
for sentence in sentences:
    # Split the sentence into words.
    # Splitting on whitespace is a decent approach.
    # We also remove empty strings.
    words = list(filter(None, sentence.split(' ')))

    # We want to keep track of the start and end
    # of sentences so we know where to start and end.
    words.insert(0, '<start>')
    words.append('<stop>')

    for idx, word in enumerate(words):
        if idx < len(words) - 1:
            entry = knowledge.get(word, {})

            # Look at the next word so we can
            # build probabilities b/w words.
            next_word = words[idx+1]

            # Increment the count of this word
            # in the knowledge.
            if next_word not in entry:
                entry[next_word] = 0
            entry[next_word] += 1

            knowledge[word] = entry

Speak your mind

With training complete, you can start to use this accumulated knowledge to generate sentences.

Sentence generation works like the Markov chain example above. Begin with “<start>”; then look at the words that have started a sentence and randomly select one. Next, look at the word selected. What word comes after this word? Randomly pick one, and so on.

import random

def generate():
    # Start with the start token.
    sentence = ['<start>']

    # Start picking words for the sentence!
    while sentence[-1] != '<stop>':
        try:
            word = weighted_choice(knowledge[sentence[-1]])
            sentence.append(word)
        except KeyError:
            # No known words follow this one; end the sentence.
            break

    # Join the sentence, with a period for good measure.
    return ' '.join(sentence[1:-1]) + '.'

def weighted_choice(choices):
    """Randomly selects a key from a dictionary,
    where each key's value is its probability weight."""
    # Randomly select a value between 0 and
    # the sum of all the weights.
    rand = random.uniform(0, sum(choices.values()))

    # Seek through the dict until the key
    # responsible for the random value is found.
    summ = 0.0
    for key, value in choices.items():
        summ += value
        if rand < summ: return key

    # If this returns False,
    # it's likely because the knowledge is empty.
    return False


With this small set of text, we can get some fairly coherent sentences:

the things that we share in science class.

never in theory.

the qualities of the eagle has landed.

the world are reading about you in history class.

Get smart(er)

Markov chain generation is a nice approach, but after a while grammaticality starts to matter, and with much larger sets of text you’ll see more gibberish surface. There are more sophisticated approaches with stronger assurances of grammaticality, such as context-free grammars, in which you define grammatical rules that a recursive algorithm adheres to. Such a technique requires a bit more care and craft, but you can expect smarter-sounding results, and perhaps uncover some real pearls of wit and wisdom.
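For a taste of the context-free approach, here is a minimal sketch. The grammar below is invented for illustration (it is not the Brain’s), but the recursive expansion is the core of the technique: every sentence it emits is grammatical by construction, because it can only be produced by the rules.

```python
import random

# A toy context-free grammar: each nonterminal maps to a list of
# possible productions (sequences of terminals and nonterminals).
grammar = {
    'S':  [['NP', 'VP']],
    'NP': [['the', 'N']],
    'VP': [['V', 'NP']],
    'N':  [['earth'], ['globe'], ['airman']],
    'V':  [['circles'], ['becomes']],
}

def expand(symbol):
    # Terminals aren't in the grammar; return them as-is.
    if symbol not in grammar:
        return [symbol]
    # Recursively expand a randomly chosen production.
    production = random.choice(grammar[symbol])
    words = []
    for sym in production:
        words.extend(expand(sym))
    return words

print(' '.join(expand('S')))  # e.g. "the earth becomes the globe"
```

Because every sentence derives from `S` → `NP VP` → “the N V the N,” the output is always well-formed, even though word choice is still random.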

Complete code is available here.

Francis Tseng is an interactive developer based out of IDEO’s New York office. He is currently working on Argos, a modern news processing platform for automating the context of stories. You can follow him on Twitter @frnsys and his bot @pub_sci.