Thursday, September 14, 2023

Web Scraping Quick Starter Guide

Introduction to Web Scraping


What is web scraping: “Web scraping is the process of extracting data from websites. It involves using automated tools or scripts to retrieve specific information from web pages, which can then be used for analysis, research, or other purposes.”


• Some common automated web scraping tools are:

        Beautiful Soup,
        Scrapy,
        Selenium,
        Octoparse,
        WebHarvy,
        ParseHub,
        Apify,
        Puppeteer,
        MechanicalSoup,
        WinAutomation, etc.

• You can also use C# libraries for web scraping: System.Net.Http, which ships with .NET, and HtmlAgilityPack, a third-party NuGet package.

Understanding HTML Elements and Attributes

A brief summary of some HTML elements and attributes:

• Head: The <head> element contains metadata about the document, for example the title, character encoding, links to external stylesheets, and other resources that the browser needs to render the page properly.
• Body: The <body> element defines the main content area of a web page. It includes all the visible content that users see when they visit the page, including text, images, links, and other media.
• Form: The <form> element is used to create interactive forms on web pages. It contains input elements such as text fields, radio buttons, checkboxes, and buttons that let users enter data and submit it to a server for processing.
• Div (id and class): The <div> element is a generic container with no inherent meaning. It is often used to group and style other elements. The id attribute provides a unique identifier for an element, while the class attribute assigns one or more classes, which can be used for styling or JavaScript interactions.
• Table Row (tr): The <tr> element defines a row within an HTML table. It contains one or more table data cells (<td>) or table header cells (<th>) that align with the columns of the table.
• Table Data (td): The <td> element represents a data cell within an HTML table. It contains content such as text, images, or other elements that belong to a specific row and column intersection.

Using Inspect Element

• Open the website you want to scrape.
• Right-click the data element you need.
• Select Inspect or Inspect Element, as shown in Picture 1.

Picture 1

Finding an Element

Copy Element:

• Copy XPath (relative XPath): //*[@id="F4550_P1_COMPANY"]
  This expression locates the element by its id attribute, so it does not depend on the entire structure of the document.
• Copy full XPath (absolute XPath): /html/body/form/div[2]/div[1]/div/div[2]/div[1]/div/input
  An absolute XPath expression specifies the complete path from the root node to the target element. It is less recommended because even a small change in the document structure can break it.

What is Selenium

Selenium WebDriver is a tool for automating interactions with web browsers. It provides a programming interface (API) for controlling web browsers and simulating user interactions such as clicking links, filling out forms, and navigating through web pages. Selenium WebDriver supports several programming languages, including Java, Python, C#, and Ruby. Its key uses include browser automation, web testing, and web scraping.

For more information, visit https://www.selenium.dev

Example: Perform Login on a Website

• Open Visual Studio and create a project using C# and .NET Framework 4.5.
• Install the packages and add the references and namespaces (screenshots: Project Properties, Installed Packages, Project References, Required Namespaces).

1. Open the website: PSW Portal.
2. Wait for the elements to load on the page.
3. Find the Id, Name, or XPath location of User Name: use Inspect Element, then copy the element.
   Hint: We can find the element using its name "userName".
4. Find the Id, Name, or XPath location of Password: use Inspect Element, then copy the element.
   Hint: We can find the element using its name "password".
5. Find the Id, Name, or XPath location of the Login button: use Inspect Element, then copy the element.
   Copy XPath: //*[@id="root"]/div[3]/div/div[1]/div/div/div[1]/div/form/button
   Hint: We can find the element using its XPath "//button[@type='submit']".
6. Perform a click on the Submit button.

• Initialize Selenium WebDriver and define the strategy:

    // ChromeOptions lives in the OpenQA.Selenium.Chrome namespace.
    var chromeOptions = new ChromeOptions();
    // Hide the message "Chrome is being controlled by automated test software."
    chromeOptions.AddExcludedArguments("enable-automation");
    // Disable the password-manager popup.
    chromeOptions.AddUserProfilePreference("credentials_enable_service", false);
    chromeOptions.AddUserProfilePreference("profile.password_manager_enabled", false);

• Navigate to the URL, wait for the elements to appear, then enter the user name and password. Finally, click the Login button to log in to the website.
• Hint: You can find an element by its Name, ID, XPath, Class Name, or Link Text.

Reading Material

https://www.selenium.dev/

Thanks
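
For reference, the login steps in the example above (initialize the driver, navigate, wait, enter the user name and password, click the login button) can be sketched end to end roughly as follows. This is a minimal sketch, not the original project's code. It assumes the Selenium.WebDriver and Selenium.Support NuGet packages and a matching ChromeDriver are installed. The URL, the credentials, the 30-second timeout, and the class and constant names are hypothetical placeholders. The locator values come from the hints in the example, with the button XPath written using the standard @type attribute syntax.

```csharp
using System;
using OpenQA.Selenium;
using OpenQA.Selenium.Chrome;
using OpenQA.Selenium.Support.UI;

public static class LoginSketch
{
    // Hypothetical placeholders: replace with the real portal URL and credentials.
    public const string LoginUrl = "https://example.com/login";
    public const string UserName = "your-user-name";
    public const string Password = "your-password";

    // Locator values taken from the hints in the example.
    public const string UserNameField = "userName";
    public const string PasswordField = "password";
    public const string LoginButtonXPath = "//button[@type='submit']";

    public static void Main()
    {
        // Same options as in the example: hide the automation banner
        // and disable the password-manager popup.
        var chromeOptions = new ChromeOptions();
        chromeOptions.AddExcludedArguments("enable-automation");
        chromeOptions.AddUserProfilePreference("credentials_enable_service", false);
        chromeOptions.AddUserProfilePreference("profile.password_manager_enabled", false);

        // Disposing the driver closes the browser when the block ends.
        using (IWebDriver driver = new ChromeDriver(chromeOptions))
        {
            driver.Navigate().GoToUrl(LoginUrl);

            // Wait up to 30 seconds (an arbitrary choice) for the form to load.
            var wait = new WebDriverWait(driver, TimeSpan.FromSeconds(30));
            IWebElement userNameBox =
                wait.Until(d => d.FindElement(By.Name(UserNameField)));

            // Fill in the credentials and submit the form.
            userNameBox.SendKeys(UserName);
            driver.FindElement(By.Name(PasswordField)).SendKeys(Password);
            driver.FindElement(By.XPath(LoginButtonXPath)).Click();
        }
    }
}
```

Running it opens Chrome, submits the form, and closes the browser as soon as the using block ends, so any further waits or scraping steps would go after the click and before that point.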