Web Spider using PowerShell

Background Information:

I need to simulate after-hours traffic on one of our SharePoint sites to investigate a performance issue where the w3wp.exe process crashes after a few hours of activity. I have decided to develop a PowerShell script that mimics a user accessing various pages on the site and clicking on any links they find. This PowerShell script should prompt the user for their credentials, for the URL of the start site they wish to crawl, for the maximum number of links the script should visit before aborting, and, last but not least, for the maximum depth in the site's architecture the crawler should visit. The script should log the pages that have been visited and ensure we don't visit the same page more than once.

Script’s Logic:

My PowerShell script will start by making a request to the root site, based on the URL the user provided. Using the Invoke-WebRequest PowerShell cmdlet, I will retrieve the page's HTML content and store it in a variable. Then, I will search through that content for anchor tags containing links (<a> tags with an href attribute). Using a combination of string methods, I will retrieve the URL of every link on the main page, then call a recursive function that repeats the same "link-retrieval" process on each link retrieved from the main page.
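As a side note, the same "link-retrieval" step could also be sketched with a regular expression instead of IndexOf/Substring string methods. The sketch below is an illustration only, not the script's actual approach; the sample HTML in $content stands in for what Invoke-WebRequest would return:

```powershell
# A minimal sketch: extract href values with a regex instead of string methods.
# $content stands in for the HTML returned by Invoke-WebRequest.
$content = '<p><a href="http://example.com/a">A</a> <a href=''/relative/b''>B</a></p>'

# Match href="..." or href='...' and capture the URL between the quotes
$links = [regex]::Matches($content, 'href\s*=\s*["'']([^"'']+)["'']') |
    ForEach-Object { $_.Groups[1].Value }

$links  # http://example.com/a and /relative/b
```

The regex approach finds every href in one pass, at the cost of being less explicit about the parsing steps than the string-method version used in the script below.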

The script will need a try/catch block to ensure no errors are thrown back to the user when URLs retrieved from links are invalid (e.g. "#", "javascript:", etc.). The script will also account for relative URLs, meaning that if a URL retrieved from a page's content starts with '/', the domain name of the parent page will be used to form an absolute link. Finally, my script will display some execution status in the PowerShell console; for every link crawled, it will display a new line containing the total number of links visited so far, the hierarchical level of the link (relative to the parent site), and the URL being visited. The figures below show the execution of the script against my personal blog. For simplicity's sake, I have hardcoded the parameter values in the top section of my script.
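The relative-URL handling described above can also be expressed with the .NET System.Uri class, which resolves a relative path against a base URL. This is a sketch of an alternative to the string-splitting used in the script, with example values standing in for the crawled site:

```powershell
# A sketch of resolving a relative href against the page it came from,
# using System.Uri rather than splitting the site URL on '/'.
$site = 'http://www.nikcharlebois.com/some/page'   # example: page being crawled
$link = '/2014/04/web-spider.html'                 # example: relative href found on it

if ($link.StartsWith('/'))
{
    # The Uri(baseUri, relativeUri) constructor combines the two parts
    $link = (New-Object System.Uri([System.Uri]$site, $link)).AbsoluteUri
}

$link  # http://www.nikcharlebois.com/2014/04/web-spider.html
```

One advantage of System.Uri is that it also normalizes edge cases (duplicate slashes, ".." segments) that plain string concatenation would pass through untouched.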

 

[Figure: 20140429-1.png, console output of the script crawling the blog]

[Figure: 20140429-2.png, console output of the script crawling the blog]

Script’s Content:

As mentioned above, in my case, I have hardcoded the values of the parameters instead of prompting the user to enter them manually. The following script will crawl my blog (http://nikcharlebois.com) for a maximum of 200 links, and won't crawl pages that are more than 3 levels deeper than my blog's home page. Enjoy!

$creds = Get-Credential
$url = "http://www.nikcharlebois.com"
$Script:maxLinks = 200
$Script:maxLevels = 3
$Script:numberLinks = 0
$Script:linksVisited = @()

Function CrawlLink($site, $level)
{
    Try
    {
        $request = Invoke-WebRequest $site -Credential $creds
        $content = $request.Content
        $domain = ($site.Replace("http://","").Replace("https://","")).Split('/')[0]
        $start = 0
        $end = 0
        $start = $content.IndexOf("<a ", $end)
        while($start -ge 0)
        {
            if($start -ge 0)
            {
                # Get the position of the beginning of the link. The +6 is to go past the href="
                $start = $content.IndexOf("href=", $start) + 6
                if($start -ge 6)
                {
                    # The link ends at the closing double or single quote, whichever comes first
                    $end = $content.IndexOf("""", $start)
                    $end2 = $content.IndexOf("'", $start)
                    if($end2 -lt $end -and $end2 -ne -1)
                    {
                        $end = $end2
                    }
                    if($end -ge $start)
                    {
                        $link = $content.Substring($start, $end - $start)
                        # Handle the case where the link is relative
                        if($link.StartsWith("/"))
                        {
                            $link = $site.Split('/')[0] + "//" + $domain + $link
                        }
                        if($Script:numberLinks -le $Script:maxLinks -and $level -le $Script:maxLevels)
                        {
                            # Only follow absolute links we have not visited yet
                            if(($Script:linksVisited -notcontains $link) -and $link.StartsWith("http:"))
                            {
                                $Script:numberLinks++
                                Write-Host $Script:numberLinks"["$level"] - "$link -BackgroundColor Blue -ForegroundColor White
                                $Script:linksVisited += $link
                                CrawlLink $link ([int]($level+1))
                            }
                        }
                    }
                }
            }
            $start = $content.IndexOf("<a ", $end)
        }
    }
    Catch [System.Exception]
    {
        # Swallow errors from invalid or unreachable URLs so the crawl continues
    }
}
CrawlLink $url 0

Microsoft Premier Field Engineer – SharePoint
