How to extract data from Google Analytics and build a data warehouse (webhouse) from it?
I have click stream data such as referring URL, top landing pages, top exit pages and metrics such as page views, number开发者_Go百科 of visits, bounces all in Google Analytics. There is no database yet where all this information might be stored. I am required to build a data warehouse from scratch(which I believe is known as web-house) from this data.So I need to extract data from Google Analytics and load it into a warehouse on a daily automated basis. My questions are:-
1)Is it possible? Every day data increases (some in terms of metrics or measures such as visits and some in terms of new referring sites), how would the process of loading the warehouse go about?
2)What ETL tool would help me to achieve this? Pentaho I believe has a way to pull out data from Google Analytics, has anyone used it? How does that process go? Any references, links would be appreciated besides answers.
As always, knowing the structure of the underlying transaction data--the atomic components used to build a DW--is the first and biggest step.
There are essentially two options, based on how you retrieve the data. One of these, already mentioned in a prior answer to this question, is to access your GA data via the GA API. This is pretty close to the form that the data appears in the GA Report, rather than transactional data. The advantage of using this as your data source is that your "ETL" is very simple, just parsing the data from the XML container is about all that's needed.
The second option involves grabbing the data much closer to the source.
Nothing complicated, still, a few lines of background are perhaps helpful here.
The GA Web Dashboard is created by parsing/filtering a GA transaction log (the container that holds the GA data that corresponds to one Profile in one Account).
Each line in this log represents a single transaction and is delivered to the GA server in the form of an HTTP Request from the client.
Appended to that Request (which is nominally for a single-pixel GIF) is a single string that contains all of the data returned from that _TrackPageview function call plus data from the client DOM, GA cookies set for this client, and the contents of the Browser's location bar (http://www....).
Though this Request is from the client, it is invoked by the GA script (which resides on the client) immediately after execution of GA's primary data-collecting function (_TrackPageview).
So working directly with this transaction data is probably the most natural way to build a Data Warehouse; another advantage is that you avoid the additional overhead of an intermediate API).
The individual lines of the GA log are not normally avaialble to GA users. Still, it's simple to get them. These two steps should suffice:
modify the GA tracking code on each page of your Site so that it sends a copy of each GIF Request (one line in the GA logfile) to your own server, specifically, immeidately before the call to _trackPageview(), add this line:
pageTracker._setLocalRemoteServerMode();
Next, just put a single-pixel gif image in your document root and call it "__utm.gif".
So now your server activity log will contain these individual transction lines, again built from a string appended to an HTTP Request for the GA tracking pixel as well as from other data in the Request (e.g., the User Agent string). This former string is just a concatenation of key-value pairs, each key begins with the letters "utm" (probably for "urching tracker"). Not every utm parameter appears in every GIF Request, several of them, for instance, are used only for e-commerce transactions--it depends on the transaction.
Here's an actual GIF Request (account ID has been sanitized, otherwise it's intact):
http://www.google-analytics.com/__utm.gif?utmwv=1&utmn=1669045322&utmcs=UTF-8&utmsr=1280x800&utmsc=24-bit&utmul=en-us&utmje=1&utmfl=10.0%20r45&utmcn=1&utmdt=Position%20Listings%20%7C%20Linden%20Lab&utmhn=lindenlab.hrmdirect.com&utmr=http://lindenlab.com/employment&utmp=/employment/openings.php?sort=da&&utmac=UA-XXXXXX-X&utmcc=__utma%3D87045125.1669045322.1274256051.1274256051.1274256051.1%3B%2B__utmb%3D87045125%3B%2B__utmc%3D87045125%3B%2B__utmz%3D87045125.1274256051.1.1.utmccn%3D(referral)%7Cutmcsr%3Dlindenlab.com%7Cutmcct%3D%2Femployment%7Cutmcmd%3Dreferral%3B%2B
As you can see, this string is comprised of a set of key-value pairs each separated by an "&". Just two trivial steps: (i) Splitting this string on the ampersand; and (ii) replacing each gif parameter (key) with a short descriptive phrase, make this much easier to read:
gatc_version 1
GIF_req_unique_id 1669045322
language_encoding UTF-8
screen_resolution 1280x800
screen_color_depth 24-bit
browser_language en-us
java_enabled 1
flash_version 10.0%20r45
campaign_session_new 1
page_title Position%20Listings%20%7C%20Linden%20Lab
host_name lindenlab.hrmdirect.com
referral_url http://lindenlab.com/employment
page_request /employment/openings.php?sort=da
account_string UA-XXXXXX-X
cookies __utma%3D87045125.1669045322.1274256051.1274256051.1274256051.1%3B%2B__utmb%3D87045125%3B%2B__utmc%3D87045125%3B%2B__utmz%3D87045125.1274256051.1.1.utmccn%3D(referral)%7Cutmcsr%3Dlindenlab.com%7Cutmcct%3D%2Femployment%7Cutmcmd%3Dreferral%3B%2B
The cookies are also simple to parse (see Google's concise description here): for instance,
__utma is the unique-visitor cookie,
__utmb, __utmc are session cookies, and
__utmz is the referral type.
The GA cookies store the majority of the data that record each interaction by a user (e.g., clicking a tagged download link, clicking a link to another page on the Site, subsequent visit the next day, etc.). So for instance, the __utma cookie is comprised of a groups of integers, each group separated by a "."; the last group is the visit count for that user (a "1" in this case).
You can use the Data Export API from Google or a service such as the one we've built specifically for your need: www.analyticspros.com/products/analytics-data-warehouse.html.
Best,
-Caleb Whitmore www.analyticspros.com / www.analyticsformarketers.com
As Shiva said, you can always pull GA data through Google APIs and warehouse it yourself. However, if you are looking for a cost efficient warehousing tool, try out Analytics Canvas @ http://www.analyticscanvas.com/
You could also check out Google's App gallery for Google Analytics related tools: http://www.google.com/analytics/apps/
You can pull the GA (Google Analytics) data anytime through their API and build your own data warehouse (DW). Before you start, you may want to sit with the business user and get clear understanding of business requirements. In DW environment it is extremely important to have a clear set goals and understanding of business user requirements due to fact that you will maintain history of transactions which live for long time and used often.
Assuming that business user defines KPI (Key Performance Indicators), metrics, dimensions, granularity needed for you to proceed, you can check different dimensions and metrics that are availabe through GA API at code.google.com/apis/analytics/docs/. Then it is just a matter of making the right API call and getting what you need. DW activity involves data cleaning, extraction, transformation and loading (ETL) or ELT along with summarizing the facts along different dimensions. Since data is much cleaner than one would encounter in disparate systems (from web logs, external vendors, excel or files, etc.), you can simply load the data through any ETL tools (For example, Talend, Pentaho, SSIS, etc.) or through application of your choice (Perl, Java, Ruby, C#, etc).
For daily load, you need to design incremental loading process during the low user traffic time (nightly loads), pulling only the recent data, de-duping any duplicates, cleansing any non-conforming data, handling erraneous rows, etc.
I have provided a sample GA API application at http://www.hiregion.com/2009/10/google-analytics-data-retriever-api-sem_25.html and it will provide you basic info to get started.
There are two important rules about loading data in data-warehouse
- Initial load
- Incremental data load
When you design using GA api, you need to load the initial historical data for a certain date range. This has its own complications as you might run into segmentation issues, loss of data etc. You need to handle pagination etc.
Once the initial data load is complete, you then run it in incremental mode where you just bring new data only. This data gets appended to the same Data warehouse tables and does not cause duplicate with overlapping dates.
On top of this GA changes their API frequently so you need to be on top of this as well.
Considering the above, we released a fully packaged data-warehouse with Google Analytics and Salesforce data connectors. You can check out the details and get ideas on how you want to setup your own datawarehouse http://www.infocaptor.com/google-analytics-datawarehouse
The minimum you would need to design is some kind of background daemon that runs everyday or at some frequency. You will need job tables to monitor the success and failure of the extracts so that it can resume from where the error occurred.
Some of the other considerations 1. What happens if you run the extract for the same data range 2. What if a job fails for certain dates
It is important to set your primary keys for your DW target tables.In MySQL, using insert statement with duplicate clause will make sure that there are no duplicate records created in case of reloading of data.
Another thing to design is your staging layer. You extract data from GA and dump into a Staging. This way if there is error loading into Target you can simply reload from staging. This way you are not burdening your GA API limits and save bandwidth as well.
You can see our complete design at this location http://www.infocaptor.com/help/social_analytics___datawarehouse.htm
All the best with your DW effort.
We can export data from google analytics to any other data warehouse (or) RDBMS (or) Hadoop using Pentaho Data Integration tool, pdi kettle. Pentaho also have free Pentaho Community Edition to use.
Just follow this step to transform data from google analytics http://wiki.pentaho.com/display/EAI/Google+Analytics
This tool is good by userfriendly, cost and performance wise. Thanks.
精彩评论