Fig. 2: Comparison between TREC and Jeopardy! along different complexity dimensions. At the center of the polygon every dimension is zero or very small; at the edge of the polygon every dimension is roughly at its maximum.
One year ago, the question-answering system "Watson", developed by IBM, defeated the two best human players on the American quiz show "Jeopardy!": another milestone in the development of machine intelligence.
At the end of 2011, I was repeatedly surprised that scientists at our university had read so little about the IBM system called "Watson". After all, it does not happen every day that a computer beats people on their home turf, namely linguistic association.1
Since Alan Turing in the fifties proposed an operational test for judging whether or not a computer can think, no computer has passed the Turing test. No wonder: in the Turing test, a computer must converse with a person over a chat line as if it were a human being, without the chatting person being able to tell the difference better than by random guessing. Watson does not pass the Turing test either. Nonetheless, computers have challenged humans in a wide variety of board games, and with success:
- In 1979 a computer beat the reigning backgammon world champion.
- Since 1986, the program Maven has won steadily against experts in Scrabble.
- In 1994, the program Chinook became world champion in the human-vs.-machine competition in checkers. In 2007, the game was "solved", meaning that the computer always plays perfectly and can no longer lose a game.
- In 1997 the parallel computer "Deep Blue" won a tournament against the reigning chess world champion Garry Kasparov. Scientists had been waiting fifty years for a computer to beat a chess world champion.2
Computers are so good at combinatorial games because they have infinite patience: the range of possible moves (up to a certain depth) is calculated with brute force in the processor. Since 1997, when Kasparov was beaten, microprocessors have become a thousand times faster. Today you no longer need a mainframe to make the chess world champion sweat.
The essence of our topic, however, is that quiz competitions represent a very different kind of contest between man and machine. Raw computing power and huge memory are of great advantage, but are not enough for this task alone. The computer must "understand" the question; it has to know what kind of information is expected, and it must then search a database of documents for the right answer. The linguistic level comes to the foreground both in parsing the question and in searching the database. It is not like chess, where the whole game board can be represented with only 64 numeric fields and the rules are perfectly clear. For question-answering systems, we have an "open" world, not a "closed" one. Anything may be asked: nothing human, nihil humani, is foreign to the computer.
What is "Jeopardy!"?
The TV show "Jeopardy!" has been broadcast in various formats in the USA since 1964. It is about general-knowledge questions in all possible areas: geography, sports, movies, politics, etc. In contrast to shows like "Who Wants to Be a Millionaire?", the questions have to be answered directly and without multiple-choice alternatives. The questions are also formulated in a very subtle and flowery way. It is not immediately clear what is being asked; people need a moment of thought. There are also trick questions and word games.
The screen for Watson in the middle. The computer, with its 2880 processor cores, remains invisible. Image: screenshot from an IBM video
In the current quiz format, the point is to be the first of three players to activate a buzzer (typically within three seconds) in order to answer the posed question. A correct answer is rewarded with prize money; errors, however, are punished with monetary losses. That is why it is important to know how great the confidence in a possible answer is. If a player is not sure of an answer, it may be better to leave the question to the other participants. The questions are divided into categories like "Sports" or "Presidents"; there are also categories for word games or puzzles.
A few examples can illustrate how difficult answering the questions can sometimes be. The actual question is formulated as a statement, and the answer must be phrased as a question. This "reversal" is merely a gimmick; it does not make the task any more complicated than if the clues were formulated as real questions rather than statements. Examples:
Question: "In 1899 the Senate ratified the Treaty of Paris, ending this war"
Answer: "What is the Spanish-American War?"
Question: "They're the 3 thrown objects in the Olympic decathlon"
Answer: "What are discus, shot put, and javelin?"
Question: "In May 2010, 5 paintings worth $125 million by Braque, Matisse & 3 others left Paris' museum of this art period"
Answer: "What is modern art?"
The examples above show that the questions are relatively "hidden" in the formulations. Word fragments like "this" or "they" indicate the kind of answer that is expected. Determining the actual question can be very tricky. In addition, money can be bet at different stages of the game, which is why Watson must know all the rules of the game and possess a betting strategy. These aspects of Jeopardy! are not covered here, however. What is relevant is the question-answering architecture of the system.
At the beginning of 2008, the IBM Thomas J. Watson Research Center in New York invited industry and academic researchers to a meeting in order to discuss the future of question-answering systems. Although such systems can look back on a long history, there had been only modest achievements with applications until then. The focus was therefore on the definition of a challenge in the field of question-answering systems that could lead to a significant advance in the state of the art.
The example of robotics was used as an illustration: robotics "challenges" serve as a laboratory for the concentrated and targeted collaboration of many researchers. Possible benchmarks were also discussed:
- the annual TREC competition (Text REtrieval Conference), where 500 questions have to be answered within a week by a computer with access to a few million documents,
- Jeopardy!, where the game takes place in real time, and
- "Learning by Reading", i.e. the old dream of having the computer read books in order to generate a structured knowledge base from them automatically.
Fig. 2 shows the dimensions of the complexity of these tasks. Weighed were the difficulty of the questions, the later applicability, the confidence, accuracy, speed, breadth of the domain, difficulty of interpretation, difficulty of the language used, etc. As Fig. 2 shows, with Jeopardy! the computer must answer fast, accurately and with high confidence.3 The questions come from arbitrary areas, and the language to be interpreted is difficult. Ultimately, IBM selected Jeopardy! as the next challenge for its own QA team. It was decided that the system should do without an internet connection, i.e. Watson's entire world knowledge would have to be contained in the memory of the computer.
Something else was very important at the time of the project's formation: IBM wanted to be pragmatic and was interested in shared interfaces and data formats. That is, different approaches should be pursued, so that industry and academia could work on different modules, which could then be "glued" together like Lego bricks into an overall system. Hence the idea of creating an open framework for question-answering systems. This approach was ultimately very important for the success of the project.
The architecture of Watson
Watson is a system for playing Jeopardy!: it uses 2880 processor cores and 14 terabytes of RAM. The machine is certainly huge, but the programming is the most important part. For the public, the software carries the name of the IBM founder. Within IBM, however, the underlying technology is called DeepQA (Deep Question Answering).
I had the luck of recently meeting Dr. John Prager from the Watson team at a conference at the University of Cambridge. Over two evenings in "The Eagle", the pub once frequented by James Watson and Francis Crick, Prager was very clear: although DeepQA builds on well-known technologies, none of which can alone be credited with the success of the system, three factors play the essential role. They were also the actual contribution of the DeepQA team:
- First, the organization of the system around a carefully defined pipeline in which the hypotheses are reviewed sequentially.
- Second, the normalization of the pipeline interfaces, whereby many alternative approaches can be pursued in parallel within the pipeline.4 Hundreds of alternative answers and up to 100,000 text fragments and database entries can be analyzed and evaluated in parallel. Thus the large IBM team could work on different construction sites at the same time without getting in each other's way.
- Third, the use of weighting functions to transform partial evidence values (from an ensemble of classifiers) into a common score with the help of learned weights.
DeepQA is therefore structured as a software pipeline: the linguistically formulated question passes through a series of processing stages (see Fig. 3). Since any question can be asked, DeepQA starts with an analysis of the Jeopardy! statement to determine what is being sought and which category the expected answer belongs to. A statement that contains, e.g., the word pair "this person" is looking for a person. A question in which "this year" occurs asks for a date. However, as in the Paris museum example above, the clue can be formulated quite vaguely. A preprocessor and syntax analyzer must therefore break the statements down into elementary phrases and search for the clue to the requested answer. In this case the clue is "this art period", i.e. this is the "lexical category" of the expected answer.
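The idea of this clue-detection step can be sketched in a few lines of Python. The patterns and category names below are purely illustrative; DeepQA's real analysis uses full syntactic parsing, not simple pattern matching:

```python
import re

# Illustrative mapping from determiner phrases to expected answer categories.
# These toy patterns stand in for DeepQA's real syntactic analysis.
CLUE_PATTERNS = [
    (re.compile(r"\bthis (person|man|woman)\b", re.I), "person"),
    (re.compile(r"\bthis year\b", re.I), "date"),
    (re.compile(r"\bthis city\b", re.I), "city"),
    # Generic fallback: capture the noun phrase after "this" as the category.
    (re.compile(r"\bthis (\w+(?: \w+)?)\b", re.I), None),
]

def lexical_category(statement):
    """Return a guess at the lexical category of the expected answer."""
    for pattern, category in CLUE_PATTERNS:
        m = pattern.search(statement)
        if m:
            return category if category else m.group(1).lower()
    return None  # no category found; in such cases the computer should pass

print(lexical_category("5 paintings left Paris' museum of this art period"))
```

On the museum clue this returns "art period", the category that the rest of the pipeline would then use to filter candidate answers.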
To respond to such clues, the DeepQA developers studied the lexical categories frequently used in Jeopardy! and arrived at 2,500 types, e.g. cities, persons, movies. The 40 most common of these categories already cover most questions. However, for about 11% of the questions the sought category could not be determined. In such cases it is better for the computer to pass on a response during the game.
Fig. 3: The beginning of the DeepQA pipeline. After the question and category have been analyzed, very different methods can be used to generate hypotheses. Hundreds of candidates are produced, and after a first simple check approximately 100 hypotheses survive.
The linguistic preprocessor of DeepQA transforms the Jeopardy! statement into a query for different types of hypothesis generators.5 Fig. 3 shows the main idea: the query can contain, e.g., keywords and the category being sought. Several hypothesis generators start in parallel, each using its own methodology.
One can, e.g., simply search for the keywords in Wikipedia and use the title of the Wikipedia page as an answer. Or one can look in structured databases such as DBpedia. Or one can look in a list of American cities (if that is the category) to see whether their description in the database contains the given keywords. The imagination is not limited at this point. David Ferrucci, the head of the project, has reported that about 100 different approaches, from IBM itself and from the literature, were integrated into Watson. A simple filter (which, e.g., checks the category of the hypotheses) completes the search for candidates. For this, so-called triple stores can be used, which, e.g., associate the name "Abraham Lincoln" via the attribute "president of" with "United States". Such triple stores can be generated from Wikipedia tables and infoboxes and sometimes even serve as a source for the answer, more frequently as material for checking or supporting a hypothesis. After filtering, about 100 out of 250 hypotheses survive, and the pipeline continues with a "deeper" inspection of the candidates based on evidence fragments.
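A toy version of such a triple-store lookup and category filter might look as follows. The triples and category names are made up for illustration; the real stores are extracted at scale from Wikipedia infoboxes:

```python
# A toy triple store; DeepQA's real stores hold millions of such facts.
TRIPLES = [
    ("Abraham Lincoln", "president of", "United States"),
    ("Abraham Lincoln", "instance of", "person"),
    ("Springfield", "instance of", "city"),
]

def generate_hypotheses(keywords, triples=TRIPLES):
    """Propose every subject whose triples mention any of the keywords."""
    hits = set()
    for subj, attr, obj in triples:
        if any(k.lower() in (attr + " " + obj).lower() for k in keywords):
            hits.add(subj)
    return hits

def filter_by_category(hypotheses, category, triples=TRIPLES):
    """Keep only hypotheses whose 'instance of' triple matches the category."""
    return {h for h in hypotheses
            if (h, "instance of", category) in triples}

candidates = generate_hypotheses(["president", "United States"])
print(filter_by_category(candidates, "person"))  # {'Abraham Lincoln'}
```

In DeepQA, dozens of such generators run in parallel, and the category filter is only the first, cheap check before the deeper evidence-based evaluation.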
Evaluation of the hypotheses
While the front part of the DeepQA pipeline is used for hypothesis generation, the second part of the pipeline is used for the evaluation and further winnowing of the candidates. Here too, different methods are used in parallel. If "Washington" is the current hypothesis for the capital of the USA, one can, e.g., look in Wikipedia or other encyclopedias for the sentence "Washington is the capital". If the sentence is found, our trust in this hypothesis increases.
There are also, for example, time and space checks: if a philosopher from the 19th century is sought, Aristotle is out of the question. If a European is sought, the computer cannot answer Martin Luther King. When checking 100 hypotheses, up to 100,000 evidence objects such as texts, database entries and triple stores are called on for help. In the end, most hypotheses are eliminated and only a few remain that must undergo a final check and ranking.
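A temporal check of this kind can be sketched as a simple lifespan-overlap test. The lifespans below are example data, and the real DeepQA checkers of course work against large knowledge sources rather than a hand-written table:

```python
# Illustrative lifespan table (years; negative values mean BCE).
LIFESPANS = {
    "Aristotle": (-384, -322),
    "Martin Luther King": (1929, 1968),
}

def temporally_consistent(candidate, century):
    """True if the candidate's lifespan overlaps the sought century (CE)."""
    birth, death = LIFESPANS[candidate]
    start, end = (century - 1) * 100 + 1, century * 100
    return birth <= end and death >= start

print(temporally_consistent("Aristotle", 19))  # False: Aristotle predates the 19th century
```

A geographic check works analogously, comparing a candidate's known places against the region the clue demands.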
The parallelism of DeepQA can be visualized as a matrix. The rows of the matrix are the different hypotheses; the columns of the matrix are the evidence evaluations. About 100 different hypotheses are reviewed on the basis of up to 1000 evidence fragments or evidence criteria. This results in up to 1000 individual scores per hypothesis, which either lead to immediate rejection or are used to calculate a final score. Fig. 4 shows this idea as a matrix in which the individual "experts" cast their vote for every hypothesis.
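In miniature, the matrix looks like this. The scores are invented, and the unweighted mean used here is a placeholder for the learned weighting described further below:

```python
import numpy as np

# Toy score matrix: 3 hypotheses (rows) x 4 evidence "experts" (columns).
# In DeepQA the matrix is ~100 rows by up to 1000 columns, and every
# cell can be computed in parallel.
scores = np.array([
    [0.9, 0.7, 0.8, 0.6],   # hypothesis A
    [0.2, 0.4, 0.1, 0.3],   # hypothesis B
    [0.5, 0.5, 0.6, 0.4],   # hypothesis C
])
hypotheses = ["A", "B", "C"]

# Rank by mean score across experts (DeepQA learns weights instead).
ranking = sorted(zip(hypotheses, scores.mean(axis=1)), key=lambda t: -t[1])
print(ranking[0][0])  # 'A' wins with the highest average score
```

Because every cell is independent of the others, the whole matrix can be filled in parallel, which is exactly the parallelization strategy mentioned in the caption of Fig. 4.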
DeepQA is therefore a software framework for connecting many different methods from the area of text retrieval (in pattern recognition this is called an "ensemble method"). Since the interfaces in the pipeline are standardized, a new method for hypothesis generation can be tested immediately: it simply adds a new row. One can also instantly install a new evaluation criterion: a column (and the corresponding score) is added. This makes it possible to determine very quickly and with little effort whether a new idea contributes something to the overall result or not. The Watson developers, e.g., introduced a pun criterion especially for Jeopardy!, which was very helpful in many cases.
Fig. 4: Given 100 hypotheses, up to 100,000 evidence fragments are generated. Separate "experts" check each hypothesis against a certain criterion and award a rating, depending on how well the hypothesis fits the evidence. These ratings ("scores") are combined into an overall assessment. Each cell in the pipeline can be processed in parallel, which yields an obvious parallelization strategy.
Here we encounter an important principle of this kind of man-versus-machine play: if absolute certainty is not required in 100% of the cases, statistical procedures can be used. The computer will make some mistakes, but if over the course of the game the winnings offset the losses, the machine stays in business. DeepQA can therefore be understood as a statistical ensemble of experts, each of which gives a numerical vote (a score) for each hypothesis.
Fig. 5 shows the second part of the DeepQA pipeline. Similar sources as in candidate generation are used, but not necessarily the same ones. In order to have evidence fragments that are as good as possible and to extend important sources such as Wikipedia, a statistical source expansion was installed.6 This means that one takes a text source (e.g. the Wikipedia page about Napoleon) and on the one hand thins it out by removing text passages that are redundant or not relevant, but on the other hand goes to the internet and seeks similar sources (i.e. texts about Napoleon). From the additional sources, sentences that contain usable information are extracted. These are added to the previously thinned document, and in the end one has a "pseudo-document" that would probably be boring for people (with many repetitions of content in different wordings), but is very useful for the computer, since the same thing is said in different language variants. One of the many software evaluators may then match one of these fragments and use it as evidence.
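The core of source expansion, keeping sentences that are on topic but not mere duplicates, can be sketched with a crude word-overlap similarity. The Jaccard measure and the thresholds here are stand-ins for the statistical relevance models IBM actually used:

```python
# Minimal sketch of statistical source expansion. Word-set overlap
# (Jaccard similarity) is a crude stand-in for real relevance models.
def overlap(a, b):
    wa, wb = set(a.lower().split()), set(b.lower().split())
    return len(wa & wb) / len(wa | wb)

def expand(seed_sentences, web_sentences, min_sim=0.2, max_sim=0.9):
    """Add web sentences that are on-topic but not near-duplicates."""
    pseudo_doc = list(seed_sentences)
    for s in web_sentences:
        sims = [overlap(s, t) for t in pseudo_doc]
        if max(sims) >= min_sim and all(x < max_sim for x in sims):
            pseudo_doc.append(s)
    return pseudo_doc

seed = ["Napoleon was emperor of France"]
web = ["Napoleon Bonaparte ruled France as emperor",  # kept: on-topic rephrasing
       "Bananas are yellow",                          # dropped: off-topic
       "Napoleon was emperor of France"]              # dropped: duplicate
doc = expand(seed, web)
print(len(doc))  # 2
```

The resulting pseudo-document says the same thing in several wordings, which is exactly what makes it valuable to the evidence evaluators.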
Fig. 5: Candidate evaluation. Many "experts" analyze the textual or database evidence and give each hypothesis a grade (a score). Some hypotheses can be filtered out immediately. The rest are aggregated, and a ranking of the remaining hypotheses is created.
At the end of the DeepQA pipeline we then have the hypotheses that survived the review, along with a few hundred scores. The numerical grades assigned by the expert procedures are simply combined linearly, i.e. a weighted sum is calculated, which is then scaled to a value between 0 and 1 (this is called logistic regression). The weights for the weighted sum are learned automatically from a comprehensive database of Jeopardy! questions for which the answers are known. For example, the system learns how much weight to give to puns, or how much weight to give to the right geography.
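The final combination step is just the logistic-regression formula: a weighted sum squashed through a sigmoid. The scores and weights below are made up; in Watson they come from the evidence experts and from training on past Jeopardy! questions:

```python
import math

def confidence(scores, weights, bias=0.0):
    """Weighted sum of evidence scores, squashed to (0, 1) by a sigmoid."""
    z = bias + sum(w * s for w, s in zip(weights, scores))
    return 1.0 / (1.0 + math.exp(-z))

# Three illustrative evidence scores for one hypothesis, with learned weights.
print(round(confidence([0.9, 0.4, 0.7], [2.0, 1.0, 1.5]), 3))
```

The resulting confidence value is what decides whether Watson presses the buzzer at all: only hypotheses above a learned threshold are worth risking money on.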
Very complex logical reasoning is not employed in Watson. The system does not understand the evidence and cannot draw even the simplest conclusions. At most, a few transitive relations and recognized synonyms are used. For the IBM researchers, the next step is to install inference mechanisms in DeepQA. As a new task, medical systems have been selected: Watson will in the future look over the doctor's shoulder and serve as a consultant who always knows which side effects are to be expected, or which laboratory values stand in the way of treatment with a certain medication.
Before Watson appeared on US television, IBM tested the system extensively. The best Jeopardy! players of recent years were invited to Yorktown Heights, where 55 Jeopardy! test games were held. Watson was able to win 71% of the test games and grew steadily stronger during 2010, until the IBM researchers were sure that Watson could win against the best human players. Since DeepQA is a machine, it can only get better every day: better sources are synthesized, each new innovative process for question-answering systems can be integrated, etc. This is the permanent contribution of the DeepQA team: they have developed a methodology for the standardized integration of partial experts into a decision pipeline for text-retrieval systems.
Ensembles and the microstructure of the mind
The question is often asked which is the best method for pattern recognition. Actually, none is best per se: each classifier can be beaten in some context by another. Linear classifiers can beat neural networks when the problem requires linear separations. The best thing is actually to have a good "team", i.e. a mixture of classifiers that can concentrate on different tasks. Such "ensembles" have been studied intensively in recent years. Although IBM had worked, with short interruptions, on Jeopardy! since about 1997, it is very interesting to note that the real breakthrough was only achieved by bringing together many different researchers from other research centers, and by the idea of creating an "open" software architecture. It turns out that difficult tasks are better solved in partnership than single-handedly by one ace.
Curiously, nature is just as pragmatic when it comes to hard tasks. We know, for example, that the retinal image is not simply projected into the visual cortex. The first areas in the visual cortex project to other areas where different features are analyzed. Color, e.g., is processed separately from motion (i.e. in different brain areas), and both must be combined for object recognition and scene analysis. If color recognition has failed due to a stroke, motion detection can still be preserved, and vice versa. This large number of parallel processes, by which manifold brain areas analyze different features in order to finally detect a cat, is what the brain researcher Semir Zeki called the "microstructure" of cognition.7 The big question is always how the brain manages to synchronize these different processing areas in time, since the results are present at different times in different locations. That is why this "binding problem" remains one of the most exciting questions of neurobiology.
Anyone who sees Watson in action in a video today can only get goosebumps. We know there is no intelligence there; Watson cannot solve even the simplest inference tasks. Watson's super-brain is based on gigabytes of textual sources, but we project our unconscious interpretation into the machine. After all, the system has beaten the best players at Jeopardy!, and there is a bit of the microstructure of our own mind in its parallel architecture of many partial experts.