The Library of Babel

"The Library of Babel" is a philosophical short story by Jorge Luis Borges. The story describes a universe consisting of an immense library of 410-page books. Together, the books contain every possible combination of 25 characters (22 letters, a space, a period, and a comma). Each page contains 40 lines and each line 80 characters. This leads to 410 * 40 * 80 = 1,312,000 characters per book and a total of 25^1,312,000 books in the library. For comparison, there are ~10⁸⁰ atoms in the observable universe.
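The arithmetic above can be checked with a few lines of Python. This is only a sketch; the constant and variable names are chosen here for illustration. Because 25^1,312,000 is far too large to print usefully, the sketch reports its size as a power of ten instead.

```python
import math

PAGES_PER_BOOK = 410
LINES_PER_PAGE = 40
CHARS_PER_LINE = 80
ALPHABET_SIZE = 25  # 22 letters, a space, a period, and a comma

chars_per_book = PAGES_PER_BOOK * LINES_PER_PAGE * CHARS_PER_LINE

# The number of books is ALPHABET_SIZE ** chars_per_book.
# Work with its base-10 logarithm rather than the number itself:
# decimal digits = floor(n * log10(25)) + 1.
log10_books = chars_per_book * math.log10(ALPHABET_SIZE)
digits_in_book_count = math.floor(log10_books) + 1

print(f"characters per book: {chars_per_book:,}")
print(f"number of books: roughly 10^{log10_books:,.0f}")
print(f"decimal digits in that number: {digits_in_book_count:,}")
```

The exponent alone runs to seven digits, against an exponent of 80 for the atom count.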

Although the vast majority of books are complete gibberish, every meaningful book that can and will ever be written is contained somewhere within the library. There will be some collection of books that outline the entire history and future of the universe and answer every question that will ever be asked. However, due to the random order of the books and the extent to which nonsensical books outnumber intelligible ones, the task of locating useful information is practically impossible. This is demonstrated by a quick browse of libraryofbabel, a virtual construction of the library created by Jonathan Basile.

Claude Shannon

Claude Shannon was an American mathematician and founder of the field of information theory, which deals with the mathematics of quantifying, analyzing, and optimizing the storage, transmission, and processing of information. His 1948 article 'A Mathematical Theory of Communication' was the founding paper in this field.

A stochastic process is a sequence of random events that adhere to a specific probabilistic rule or distribution. A Markov process is a stochastic process where the future state depends only on the current state, not on any previous states. In his paper, Shannon discusses the mathematics of an information source and describes how a discrete source of information, such as a natural written language, can be approximated by a sufficiently complex stochastic process, such as a discrete Markov process. Approximations to natural language can have different 'orders of approximation' depending on how many words of the current state are used to determine the next state (i.e., the next word).

When constructing a zero-order approximation to the English language, all words from the English language would have an equal chance of being each new word in a sentence. For instance, 'pleuropneumonia', which appears with low frequency in natural English, would have the same chance as 'the' of being the next word. In contrast, a first-order approximation would choose each word with a probability that represents its frequency in the natural language. For instance, 'the' would be selected much more frequently than 'pleuropneumonia'.

A second-order approximation selects each next word with a probability that represents how frequently it follows the previous word of the sentence in natural English. For instance, if the current 'state' is the sentence 'They jump inside the', only 'the' would be used to determine the next word. The next word is most likely to be a word that frequently follows 'the' in natural English, such as a noun or adjective, and is less likely to be a word that infrequently follows 'the', such as another article. A third-order approximation considers the previous two words in the sentence, a fourth-order approximation considers the previous three words, and the nth-order approximation considers the previous n-1 words.
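The difference between the first few orders can be sketched in Python. The tiny corpus and the function names below are assumptions made for illustration; they are not taken from Shannon's paper or from this website's code.

```python
import random
from collections import Counter

# A tiny stand-in corpus (illustrative assumption, not real source text).
corpus = "the cat sat on the mat and the dog sat on the log".split()
rng = random.Random(0)

def zero_order(words, rng):
    """Every distinct word is equally likely, however rare it is."""
    return rng.choice(sorted(set(words)))

def first_order(words, rng):
    """Each word is drawn in proportion to its frequency in the corpus."""
    return rng.choice(words)

def second_order(words, previous, rng):
    """Draw from the words that follow `previous` somewhere in the corpus."""
    followers = [b for a, b in zip(words, words[1:]) if a == previous]
    return rng.choice(followers) if followers else None

print(Counter(corpus).most_common(3))   # word frequencies used by first_order
print(zero_order(corpus, rng))
print(first_order(corpus, rng))
print(second_order(corpus, "the", rng))  # one of: cat, mat, dog, log
```

Higher orders follow the same pattern as second_order, with the single previous word replaced by the previous two, three, or n-1 words.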

The Library of Shannon

This website constructs a variation of The Library of Babel, in which the structure of the text in the books represents the structure of natural English. That is, the statistical properties of The Library of Shannon should be comparable to the statistical properties of a library containing English books. This is achieved by randomly sampling from a source text of natural English (over 4 million words from 1,000 random Wikipedia articles). Such a library has the advantage of filtering out many of the books that are necessarily gibberish because they don't obey the statistical constraints of the English language. In this way, it overcomes some of the issues of the original Library of Babel by increasing the probability that the randomly constructed sentences are meaningful.

This website implements a variable-order Markov process, where the order of approximation to the English language varies depending on the position of the word in the sentence and the user-defined maximum order of approximation. If the maximum settings are selected, the order is 1 for the first word in the sentence, 2 for the second word, 3 for the third word and 4 for subsequent words.

As an example, if an order of approximation of four is selected, words will randomly be selected from the source text until a word starting a sentence is found (first-order approximation). This word will be used as the first word of the sentence. Then, words will randomly be selected from the text until one is found that matches the current first word in the sentence (second-order approximation). Once found, the word adjacent to this will be taken as the second word of the sentence. Next, two consecutive words will randomly be selected from the text until a pair is found that matches the first two words of the sentence (third-order approximation). Once found, the word adjacent to this pair will be taken as the third word of the sentence. Finally, three consecutive words will be randomly selected until they match the first three words of the sentence (fourth-order approximation). Once found, the adjacent word becomes the fourth word in the sentence. From then on, the last three words of the sentence will be used to find all subsequent words.
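The variable-order scheme can be sketched as follows. This is not the website's code: every name here is hypothetical, and the sketch swaps the random probing described above for a precomputed lookup table. Choosing uniformly among all occurrences of a context should give the same distribution as probing random positions until one matches. The sketch also assumes that a sentence starts after any word ending in a period.

```python
import random
from collections import defaultdict

def build_index(words, max_context):
    """Map every context of 0..max_context words to the words that follow it."""
    index = defaultdict(list)
    for i, word in enumerate(words):
        for k in range(max_context + 1):
            if i - k < 0:
                break
            index[tuple(words[i - k:i])].append(word)
    return index

def generate_sentence(words, max_order, rng, max_words=30):
    """Grow one sentence, raising the order by one per word up to max_order."""
    index = build_index(words, max_order - 1)
    # Assumption: a word starts a sentence if the word before it ends in ".".
    starts = [w for prev, w in zip(["."] + words, words) if prev.endswith(".")]
    sentence = [rng.choice(starts)]  # first-order step
    while not sentence[-1].endswith(".") and len(sentence) < max_words:
        # Use as many previous words as the sentence and max_order allow.
        context_len = min(len(sentence), max_order - 1)
        context = tuple(sentence[-context_len:]) if context_len else ()
        followers = index.get(context)
        if not followers:
            break
        sentence.append(rng.choice(followers))
    return " ".join(sentence)

source = "The cat sat on the mat. The dog sat on the log. A bird sang.".split()
print(generate_sentence(source, max_order=4, rng=random.Random(1)))
```

With max_order=4 the context grows from one word to two to three and then stays at three, mirroring the walk-through above.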

In some cases - particularly for high orders of approximation or low-frequency word combinations - the program will struggle to find the next word. The program will make 500,000 attempts to find the next word before reducing the order of approximation by 1. For instance, if the order was 4 and the current sentence was '...pineapple on pizza', the program would make 500,000 attempts to search the source text for a word following 'pineapple on pizza' before reducing the order to 3 and searching for 'on pizza'. By randomly sampling from English texts, the output of this program should be statistically similar to that of naturally generated English. Of course, due to the mindless nature of generation, most of the output will be semantically nonsensical. It should, however, 'sound like' English at high orders of approximation, due to its structure.
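The retry-and-back-off rule described above might look like the sketch below. The function name, its parameters, and the final fallback to a plain random word are assumptions; only the 500,000-attempt limit and the step-down by one order come from the text.

```python
import random

def next_word(words, sentence, order, rng, max_attempts=500_000):
    """Probe random positions for the current context, backing off on failure."""
    context_len = min(order - 1, len(sentence))
    while context_len > 0:
        context = sentence[-context_len:]
        for _ in range(max_attempts):
            # Random start position that leaves room for the context
            # and for the word that follows it.
            i = rng.randrange(len(words) - context_len)
            if words[i:i + context_len] == context:
                return words[i + context_len]
        # No match within the attempt limit: reduce the order by 1.
        context_len -= 1
    # Assumed fallback once no context is left: a frequency-weighted draw.
    return rng.choice(words)

source = "we like pineapple on pizza and we like cheese on pizza too".split()
rng = random.Random(0)
# 'mushroom on pizza' never occurs, so the search falls back to 'on pizza'.
print(next_word(source, ["mushroom", "on", "pizza"], 4, rng, max_attempts=200))
```

A small max_attempts is used in the example only because the toy source text is twelve words long.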

You are free to choose any book in the library (referenced by number), how many words you wish to read from that book, and the order of approximation to use. The random generation is deterministic, so that for any given book reference and order of approximation, the generated text will always be the same. You can come back at any time, enter the same parameters, and the book will be unchanged. Note that the higher the order of approximation, the longer the program will take to generate the output.
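One way to get this kind of repeatability is to seed the random number generator from the request parameters. The sketch below is an assumption about how that could be done, not a description of this website's code; book_rng and read_book are hypothetical names, and independent word draws stand in for the Markov generation step.

```python
import random

def book_rng(book_number, order):
    """Return a generator whose stream depends only on the book and the order."""
    # String seeds are hashed deterministically by Python's random module.
    return random.Random(f"{book_number}:{order}")

def read_book(words, book_number, order, n_words):
    """Return the first n_words of a book (placeholder generation step)."""
    rng = book_rng(book_number, order)
    return [rng.choice(words) for _ in range(n_words)]

source = "the cat sat on the mat and the dog sat on the log".split()
first_visit = read_book(source, book_number=42, order=4, n_words=8)
second_visit = read_book(source, book_number=42, order=4, n_words=8)
print(first_visit == second_visit)
```

Because the generator is rebuilt from the same seed on every request, reading more words from a book extends the earlier text rather than changing it.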

Browse the library.