I have a need to simulate traffic after hours on one of our SharePoint sites to investigate a performance issue where the w3wp.exe process crashes after a few hours of activity. I have decided to develop a PowerShell script that mimics a user accessing various pages on the site and clicking on any links they find. This PowerShell script should prompt the user for their credentials, for the URL of the start site they wish to crawl, for the maximum number of links the script should visit before aborting, and last but not least, the maximum depth of pages in the site architecture the crawler should visit. The script should log the pages that have been visited and ensure we don't visit the same page more than once.
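Just to illustrate what gathering those inputs might look like, here is a minimal sketch; the variable names are my own and the credential, URL, link-count, and depth prompts simply mirror the requirements listed above.

```powershell
# Sketch only: gather the four inputs described above.
# Variable names are illustrative, not necessarily those used in the final script.
$credentials = Get-Credential -Message "Enter the credentials used to access the site"
$startUrl    = Read-Host "Enter the URL of the start site to crawl"
$maxLinks    = [int](Read-Host "Enter the maximum number of links to visit before aborting")
$maxDepth    = [int](Read-Host "Enter the maximum depth (levels) the crawler should visit")

# Track every page already visited so the same page is never crawled twice.
$visited = New-Object 'System.Collections.Generic.HashSet[string]'
```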
My PowerShell script will start by making a request to the root site, based on the URL the user provided. Using the Invoke-WebRequest PowerShell cmdlet, I will retrieve the page's HTML content and store it in a variable. Then, I will search through that content for anchor tags containing links (<a> tags with an href attribute). Using a combination of string methods, I will retrieve the URL of every link on the main page, and will call a recursive function that repeats the same "link-retrieval" process on each link retrieved from the main page.
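Here is a rough sketch of that recursive logic, assuming the $credentials, $startUrl, $maxLinks, $maxDepth and $visited variables from the sketch above; note that it uses a regular expression rather than plain string methods to pull out the href values, and that the function name is purely illustrative.

```powershell
function Crawl-Page
{
    param(
        [string]$Url,
        [int]$Depth
    )

    # Stop once we reach the maximum depth or the maximum number of links.
    if ($Depth -gt $maxDepth -or $visited.Count -ge $maxLinks) { return }

    # Skip pages we have already visited; Add() returns $false for duplicates.
    if (-not $visited.Add($Url)) { return }
    Write-Host "[$($visited.Count)] Visiting $Url"

    try
    {
        # Retrieve the page's HTML content with the user's credentials.
        $response = Invoke-WebRequest -Uri $Url -Credential $credentials -UseBasicParsing
    }
    catch
    {
        Write-Warning "Failed to retrieve $Url : $_"
        return
    }

    # Extract the href value of every <a> tag found in the page.
    $anchors = [regex]::Matches($response.Content, '<a[^>]+href\s*=\s*"([^"]+)"')
    foreach ($anchor in $anchors)
    {
        $link = $anchor.Groups[1].Value

        # Only follow absolute links that stay within the start site.
        if ($link.StartsWith($startUrl))
        {
            Crawl-Page -Url $link -Depth ($Depth + 1)
        }
    }
}

# Kick off the crawl from the root site the user provided.
Crawl-Page -Url $startUrl -Depth 0
```

The -UseBasicParsing switch is there simply to avoid the Internet Explorer dependency of the older Invoke-WebRequest HTML parser, which is handy when running the script directly on a server.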
As mentioned above, in my case, I have hardcoded the values of the parameters instead of prompting the user to enter them manually. The following script will crawl my blog (http://nikcharlebois.com) for a maximum total of 200 links, and won't crawl pages that are more than 3 levels deeper than my blog's home page. Enjoy!