We recently had a client, a multinational retailer with both a physical and an Internet presence, that needed a way to acquire certain business intelligence (BI) data from the Internet on a daily basis. After several unsuccessful attempts to create this functionality themselves, they came to us for a solution.

On the surface the requirements seemed difficult, and it was easy to see why their own IT team had failed to find a solution. They were thinking "inside the box", however, and hadn't considered third-party alternatives. The specifications required that the application perform all of these tasks:

- Retrieve new product listings on competitors' web sites.
- Retrieve current pricing for all products listed on competitors' web sites.
- Retrieve the full text of competitors' press releases and public financial reports.
- Track all inbound links pointing to competitors' web sites from other web sites.

Once the data was acquired it needed to be processed for reporting purposes and then stored in the data warehouse for future access.

After reviewing current web-based data acquisition technology, including "spiders" that crawl the Internet and return data which then has to be processed through HTML filters, we determined that the Google API and Web Services offered the best solution.

The Google API provides remote access to all of the search engine's exposed functionality through a communication layer that is accessed via the "Simple Object Access Protocol" (SOAP), a web services standard. Since SOAP is an XML-based technology it is easily integrated into legacy web-enabled applications.

The API met all of the requirements of the application in that it:

- Provided a methodology for querying the Web using non-HTML interfaces.
- Enabled us to schedule regular search requests designed to harvest new and updated information on the target subjects.
- Provided data in a format that could be easily integrated with the client's legacy systems.

Using the Google API, SOAP and WSDL, our developers were able to define messages that fetched cached pages, searched the Google document index and retrieved the responses without having to filter out HTML or reformat the data. The resulting data was then handed off to the client's legacy systems for validation, reporting and further processing before reaching the data warehouse.

During the Proof of Concept phase we ran tests in which we were able to reliably identify and retrieve updated public relations and investor relations information, a result that exceeded the client's expectations.

In our next test we retrieved the most current product pages listed in Google and then ran another query to retrieve the Google "cached page" versions. We ran these two data sets through difference filters and were able to produce accurate price increase and decrease reports as well as identify new products.

For our final test we used the Google API's ability to access the "link:" feature to rapidly build lists of inbound links.

These limited tests demonstrated that the Google API was capable of producing the BI data that the client requested, and that the data could be returned in a pre-defined format, which eliminated the need to apply post-retrieval filters.

The client was pleased with the results of our Proof of Concept phase and authorized us to proceed with building the solution. The application is now in daily use and is exceeding the client's performance expectations by a wide margin.
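
To make the difference-filter step from the second Proof of Concept test more concrete, here is a minimal sketch in Python. It is not the client application's code. It assumes that the current and cached product pages have already been reduced to simple product-to-price mappings, and the function name `price_change_report`, the product identifiers and the prices are all hypothetical, invented purely for illustration.

```python
from decimal import Decimal


def price_change_report(cached, current):
    """Compare two {product_id: price} snapshots.

    `cached` stands in for prices taken from the older "cached page"
    versions, `current` for prices taken from the current pages.
    Both structures are assumptions made for this sketch.
    """
    report = {"increases": {}, "decreases": {}, "new_products": {}}
    for product_id, new_price in current.items():
        if product_id not in cached:
            # Present now but absent from the cached snapshot: a new product.
            report["new_products"][product_id] = new_price
            continue
        old_price = cached[product_id]
        if new_price > old_price:
            report["increases"][product_id] = (old_price, new_price)
        elif new_price < old_price:
            report["decreases"][product_id] = (old_price, new_price)
        # Unchanged prices are deliberately left out of the report.
    return report


# Hypothetical sample data, for illustration only.
cached_prices = {
    "widget-a": Decimal("19.99"),
    "widget-b": Decimal("5.00"),
    "widget-c": Decimal("12.50"),
}
current_prices = {
    "widget-a": Decimal("21.99"),   # price went up
    "widget-b": Decimal("4.50"),    # price went down
    "widget-c": Decimal("12.50"),   # unchanged
    "widget-d": Decimal("8.00"),    # not in the cached snapshot
}

report = price_change_report(cached_prices, current_prices)
```

Prices are held as `Decimal` values so that comparisons are not affected by binary floating-point rounding. A product that appears only in the cached snapshot is ignored here; a fuller filter might report it as discontinued.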