Easily parse / search a webpage's search query strings from Apache logs using PHP, awk and sed Linux command line utilities

Edited by Anonymous, Eng, Doug Collins

This how-to shows you how to parse your Apache log files at the Linux command line to extract the search queries that users typed into a search engine to find your webpage. I have included a PHP search GUI as well for easy use.


Use awk and sed to parse Apache log files for user-typed search queries

Here is the raw code to get the job done. Note that $title2 should be replaced by what you are looking for, BUT with spaces replaced by "\+".

For example: "How to find a big lollypop" would be "How\+to\+find\+a\+big\+lollypop"

This is because I plan to call this from a PHP function, where I will populate $title2, as you will see later. I could have put the egrep for $title2 later in the pipeline, but then you would not have the option of searching both page titles and queries. My original version searched only queries, but I wanted a function I could easily call for a given page to see all the queries that users have typed over the past few weeks.
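The substitution described above can be sketched as a small shell helper. The function name title_to_pattern is hypothetical and not part of the original code; it only shows one way to build the value that $title2 is expected to hold.

```shell
# Hypothetical helper: turn a page title into the pattern used for $title2.
title_to_pattern() {
    # Replace every space or underscore with an escaped plus sign (\+).
    printf '%s\n' "$1" | sed 's/[ _]/\\+/g'
}

title2=$(title_to_pattern "How to find a big lollypop")
echo "$title2"
```

The escaped plus sign matters because egrep would otherwise read + as "one or more of the previous character" instead of a literal +.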

       cat /var/log/virtualmin/VisiHow.com_access_log | #Get data from the access log
       egrep "(q|p)=" | #Use only records that likely have search queries
       sed 's/%22/"/g;s/%20/+/g;s/_/+/g;'| #Do some formatting changes and most importantly change all spaces or underlines to +
       egrep -i "$title2"| #search for $title2
       awk '{print $11}' | #extract only search data
       sed 's/p=/q=/ig' | #handle Yahoo p=
       awk -F 'q=' '{print $2}' | #grab all data starting with q=
       sed 's/q=//g;s/+/ /g;s/%22/"/g;s/%20/ /g;'| #reformat to human readable format
       cut -d "&" -f 1| #strip off other parameters
       awk '{print $0"<br>"}' #add <br> line breaks because we are planning to output to a webpage
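To see what each stage of the pipeline does, here is a trimmed sketch run against two made-up log lines. It assumes the Apache combined log format, in which whitespace-separated field 11 is the quoted referrer URL. The function names, IP addresses and paths are invented for illustration, and [pP]= is used in place of the GNU-only i flag of sed.

```shell
# Hypothetical helper: read access-log lines on stdin and print the search
# queries found in the referrers of lines matching the pattern given as $1.
extract_queries() {
    grep -E -i "(q|p)=" |                  # keep records that likely hold a query
    sed 's/%22/"/g;s/%20/+/g;s/_/+/g' |    # normalise spaces and underscores to +
    grep -E -i "$1" |                      # filter on the caller's pattern
    awk '{print $11}' |                    # field 11 is the quoted referrer URL
    sed 's/[pP]=/q=/g' |                   # treat Yahoo's p= like q=
    awk -F 'q=' '{print $2}' |             # keep what follows the first q=
    cut -d '&' -f 1 |                      # drop the other URL parameters
    sed 's/+/ /g'                          # back to human readable spaces
}

# Two made-up log lines (documentation IP addresses, invented paths).
sample_log() {
    cat <<'LOG'
203.0.113.5 - - [10/Oct/2014:13:55:36 -0700] "GET /Fix_a_wet_phone HTTP/1.1" 200 2326 "http://www.google.com/search?q=fix+wet+phone&ie=utf-8" "Mozilla/5.0"
198.51.100.7 - - [10/Oct/2014:14:02:11 -0700] "GET /Fix_a_wet_phone HTTP/1.1" 200 2326 "http://search.yahoo.com/search?p=dry+a+phone&fr=sfp" "Mozilla/5.0"
LOG
}

sample_log | extract_queries 'phone'
```

With these two lines the sketch prints "fix wet phone" and "dry a phone", one per line. The second comes from the Yahoo-style p= parameter.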


Let's create a PHP / HTML GUI wrapper for this now

To make this easy to use, we want to access the user query data from a web page search form.

queries.php

<title>User Queries</title>
<meta name="robots" content="noindex">

<form name="sform" method="get">
Search Title: <input size="80" value="<?php echo htmlspecialchars(isset($_GET['title']) ? $_GET['title'] : ''); ?>" name="title" type="text">
<input value="Search" type="submit">
HINT: You can use a partial title or even regex like "Samsung.*damage"
</form>


<?php
$title = isset($_GET['title']) ? $_GET['title'] : "";
$title = preg_replace("/how to /i","",$title);
$title2 = str_replace(" ","\+",$title);
$title2 = str_replace("_","\+",$title2);
$title2 = escapeshellarg($title2); //quote the pattern so user input cannot inject shell commands

$bigcmd = <<<EOF

       egrep -i "(q|p)=" | #Use only records that likely have search queries
       sed 's/%22/"/g;s/%20/+/g;s/_/+/g;'| #Do some formatting changes and most importantly change all spaces or underlines to +
       egrep -i $title2 | #search for title2 (already quoted by escapeshellarg)
       awk '{print $11}' | #extract only search data
       sed 's/p=/q=/ig' |  #handle yahoo p=
       awk -F 'q=' '{print $2}' | #grab all data starting with q=
       sed 's/q=//g;s/+/ /g;s/%22/"/g;s/%20/ /g;'| #reformat to human readable format
       cut -d "&" -f 1| #strip off other parameters
       awk '{print $0"<br>"}' #add <br> line breaks because we are planning to output to a webpage

EOF;


if ($title!="") {

       print "Searching for \"".htmlspecialchars($title)."\" ... <br>";
       print "CURRENT WEEK ... <br>";
       //we support p= too because yahoo uses that... odd duck lol
       print `cat /var/log/virtualmin/visihow.com_access_log | $bigcmd`;
       for($i=1;$i<6;$i++) {
               print "<br>$i WEEK(S) AGO<br>";
               print `zcat /var/log/virtualmin/visihow.com_access_log.$i.gz | $bigcmd`;
       }

} ?>
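For readers who want to stay at the command line, the weekly loop can be approximated without PHP. This is only a sketch: the helper name read_weekly_logs is hypothetical, and it assumes rotated logs are named with a .1.gz to .5.gz suffix as in the PHP above. It uses gzip -dc, which behaves like zcat on gzip files.

```shell
# Hypothetical helper: print the current log followed by up to five
# rotated, gzip-compressed copies (<base>.1.gz ... <base>.5.gz).
read_weekly_logs() {
    base=$1
    # The current week's log is plain text.
    [ -f "$base" ] && cat "$base"
    for i in 1 2 3 4 5; do
        # Older weeks are compressed; skip any that do not exist.
        [ -f "$base.$i.gz" ] && gzip -dc "$base.$i.gz"
    done
    return 0
}

# Example use (path taken from the article, adjust to your server):
#   read_weekly_logs /var/log/virtualmin/visihow.com_access_log | egrep "(q|p)="
```

Unlike the PHP page, this sketch does not label the output by week. It simply concatenates the logs so the rest of the pipeline can run once.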


Known Problems

The grep also matches referrer data, so it returns unwanted records. For example, if you search for "iPhone" and a user was on an iPhone page and then clicked through to a Samsung page, this code picks up "iPhone" from the referrer and includes that record when it should not. I am working on version 2, which does a lot more and will fix this bug by moving much of the command line preprocessing into PHP. The code will be longer and harder to follow, but it will give far more useful and accurate data.
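One possible way to narrow the match at the command line is to test the title pattern against the requested path only, instead of the whole log line. The sketch below is not the version 2 approach. It assumes the Apache combined log format, where field 7 is the requested path, and the helper name and sample lines are invented. The pattern is lowercased for a case-insensitive match, so it should not rely on uppercase regex escapes.

```shell
# Hypothetical helper: keep only log lines whose requested path (field 7)
# matches the case-insensitive pattern given as $1.
filter_by_requested_page() {
    # Pass the pattern through the environment so backslashes survive intact.
    PAT="$1" awk 'BEGIN { pat = tolower(ENVIRON["PAT"]) } tolower($7) ~ pat'
}

# Two made-up log lines: the first only mentions iPhone in its referrer.
referrer_sample() {
    cat <<'LOG'
203.0.113.5 - - [10/Oct/2014:13:55:36 -0700] "GET /Fix_a_Samsung_screen HTTP/1.1" 200 2326 "http://example.com/Fix_an_iPhone_screen?q=cracked" "Mozilla/5.0"
198.51.100.7 - - [10/Oct/2014:14:02:11 -0700] "GET /Fix_an_iPhone_screen HTTP/1.1" 200 2326 "http://www.google.com/search?q=iphone+cracked+screen" "Mozilla/5.0"
LOG
}

referrer_sample | filter_by_requested_page 'iphone' | awk '{print $7}'
```

Only the second line survives, because the first one mentions iPhone in its referrer but requests a Samsung page.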


Version 2: Less command line preprocessing and more php processing

This version has the following advantages / disadvantages:

  • Less command line preprocessing means longer code
  • More PHP means more control and features
  • Ignores multiple duplicate requests from the same IP
  • Counts multiple searches instead of listing them multiple times
  • Shows trending information by displaying weekly totals
  • Output is in scrollable divs for easy viewing of popular pages
  • Sorted by query lengths totaled, which approximately means more popular pages show first
  • A little quick and dirty JavaScript & CSS to make the divs auto expand for detailed viewing
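The de-duplication and counting ideas in this list can be illustrated with standard command line tools. This is a simplified sketch, not the version 2 code: it drops every repeat of the same query from the same IP, whereas the PHP below only skips consecutive repeats. The helper name and the "ip query" input format are assumptions made for the example.

```shell
# Hypothetical helper: stdin holds "ip query" lines; print "count: query"
# with the most frequent queries first.
count_queries() {
    sort -u |                      # drop repeats of a query from the same IP
    cut -d ' ' -f 2- |             # keep only the query text
    sort | uniq -c | sort -rn |    # count identical queries, biggest first
    awk '{ n = $1; sub(/^ *[0-9]+ /, ""); print n ": " $0 }'
}

printf '%s\n' \
    '203.0.113.5 fix wet phone' \
    '203.0.113.5 fix wet phone' \
    '198.51.100.7 fix wet phone' \
    '198.51.100.7 dry a phone' | count_queries
```

Here "fix wet phone" is counted twice, once per IP, and "dry a phone" once.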

<title>User Queries</title>
<meta name="robots" content="noindex">

<form name="sform" method="get">
Search Title: <input size="80" value="<?php
if (isset($_GET['title']) && strlen($_GET['title'])>0) {
echo htmlspecialchars($_GET['title']);
} else {
echo " *";
} ?>" name="title" type="text" /> <input value="Search" type="submit">
HINT: Enter * to see everything. You can use a partial title or even regex like "Samsung.*damage"
</form>


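<?php
//comparison callback: longer strings sort first
function sortbylen($a,$b){

   return strlen($b)-strlen($a);

}

$title = isset($_GET['title']) ? $_GET['title'] : "";
$title = preg_replace("/how to /i","",$title);
$title2 = str_replace(" ","\+",$title);
$title2 = str_replace("_","\+",$title2);
$title2 = escapeshellarg($title2); //quote the pattern so user input cannot inject shell commands

$bigcmd = <<<EOF

       egrep -i "(q|p)=" | #Use only records that likely have search queries
       sed 's/%22/"/g;s/%20/+/g;s/_/+/g;'| #Do some formatting changes and most importantly change all spaces or underlines to +
       awk '{print $1 " " $7 " " $11}'| #keep only ip, requested url and referrer
       egrep -i $title2 #search for title2 (already quoted by escapeshellarg)

EOF;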


if ($title!="") {

   print "Searching for \"".htmlspecialchars($title)."\" ... ";
   $urldata = array(); //per url, per week: "count: query" lines
   $stot = array();    //per url, per week: total searches
   for($i=0;$i<6;$i++) {
       if ($i==0) {
               $searchlogdata = `cat /var/log/virtualmin/visihow.com_access_log | $bigcmd`;
       } else {
               $searchlogdata = `zcat /var/log/virtualmin/visihow.com_access_log.$i.gz | $bigcmd`;
       }
       $ssdata = array();
       $lip = "";
       $lss = "";
       $searchlogarray = explode("\n",(string)$searchlogdata);
       foreach($searchlogarray as $sdata) {
               $logparts = explode(" ",$sdata);
               if (count($logparts)<3) continue; //skip blank or short lines
               $ip = $logparts[0];
               $url = $logparts[1];
               if (!preg_match('/(q|p)=(.*?)"?(&|$)/',$logparts[2],$matches)) continue;
               $ss = urldecode($matches[2]);
               $ss = str_replace("+"," ",$ss);
               if ((strlen(trim($ss))>3)&&(!preg_match("#http://#i",$ss))) { //keep out url from site searches
                       if (($ip!=$lip) || ($ss!=$lss)) { //ignore repeated requests from the same ip
                               $lip=$ip;
                               $lss=$ss;
                               if (!isset($ssdata[$url])) $ssdata[$url]="";
                               $ssdata[$url].="$ss|";
                       }
               }
       }
       uasort($ssdata,'sortbylen');
       foreach($ssdata as $key => $value) {
               $s2 = array();
               $sarr = explode("|",$value);
               foreach($sarr as $s) {
                       if (strlen($s)>0) $s2[$s] = isset($s2[$s]) ? $s2[$s]+1 : 1;
               }
               arsort($s2);
               $urldata[$key][$i] = "";
               $stot[$key][$i] = 0;
               foreach($s2 as $key2 => $val2) {
                       $urldata[$key][$i].="$val2: ".htmlspecialchars($key2)."<br>";
                       $stot[$key][$i]+=$val2;
               }
       }
   }
   foreach($urldata as $url => $weekdata) {
       $safeurl = htmlspecialchars($url);
       print "<br><a target=\"_blank\" href=\"$safeurl\">$safeurl</a><br>";
       foreach($weekdata as $wk => $week) { //the key is the real week number, even if earlier weeks had no data
               print "<br>$wk WEEK(S) AGO: Total Searches= ".$stot[$url][$wk]."<br>";
               print "<div onmouseover=\"this.style.height='500px'\" onmouseout=\"this.style.height='100px'\" style='width:800px;height:100px;border: 1px solid grey;overflow:scroll'>$week</div>";
       }
   }

} ?>


ToDo

Sort by query length as a secondary sort, with longer queries first
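One way this secondary sort might look at the command line is sketched below. It assumes tab-separated "count, query" lines as input, and the helper name is hypothetical. The PHP version would need its own implementation.

```shell
# Hypothetical helper: stdin holds "count<TAB>query" lines; sort by count
# (highest first), then by query length (longest first).
sort_by_count_then_length() {
    tab=$(printf '\t')
    awk -F '\t' '{ print $1 "\t" length($2) "\t" $2 }' |   # add a length column
    sort -t "$tab" -k1,1nr -k2,2nr |                       # count, then length
    cut -f 1,3                                             # drop the length column
}

printf '2\tfix phone\n5\tphone\n2\tfix wet phone now\n' | sort_by_count_then_length
```

In the sample, the query with count 5 comes first, and the two queries with count 2 follow with the longer one ahead of the shorter one.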

That's it. I would love to hear your experiences with this code or suggestions for improvement.

Tips Tricks & Warnings

  • Note: I support p= as well as the standard q= because Yahoo, being the odd duck it is, uses p=
  • The version 2 PHP is more complex and I was coding fast, so some of the variable names were poorly chosen. Please don't complain, as it's free code. But if you want to fix it up, document it and add it to the wiki as version 3, that would be fantastic :) PAY IT FORWARD!
Daniel is a featured author with VisiHow. Daniel has achieved the level of "Lieutenant" with 24,290 points. Daniel has started 69 articles (including this one) and has also made 2,601 article edits. 17,578 people have read Daniel's article contributions.
Article Info

Categories : Programming | Websites

Recent edits by: Eng, Anonymous

Thanks to all authors for creating a page that has been read 2,391 times.
