Google search result list help

Question:
I'm trying to print a list using pprint that separates all of the results from each other, but I have to use pprint.pprint so that I can format it with the Title above the Link. I will show the output I want, the output I'm getting, and the code I have so far. I will not be back until Monday (I'm taking my dad on vacation for his birthday), so if someone can help me that would be nice.

Output I want:

[{'Title': 'What is Web Scraping and How to Use It? - GeeksforGeeks',
'Link': 'https://www.geeksforgeeks.org/what-is-web-scraping-and-how-to-use-it/'},

{'Title': 'Web scraping - Wikipedia',
'Link': 'https://en.wikipedia.org/wiki/Web_scraping'},

{'Title': "What Is Web Scraping? A Complete Beginner's Guide",
'Link': 'https://careerfoundry.com/en/blog/data-analytics/web-scraping-guide/'},

{'Title': 'What Is Scraping | About Price & Web Scraping Tools | Imperva',
'Link': 'https://www.imperva.com/learn/application-security/web-scraping-attack/'},

{'Title': 'What is Web Scraping and What is it Used For? | ParseHub',
'Link': 'https://www.parsehub.com/blog/what-is-web-scraping/'},

{'Title': 'A Practical Introduction to Web Scraping in Python',
'Link': 'https://realpython.com/python-web-scraping-practical-introduction/'},

{'Title': 'Web Scraping with Python: Everything you need to know (2022)',
'Link': 'https://www.scrapingbee.com/blog/web-scraping-101-with-python/'},

{'Title': 'Web Scraper - The #1 web scraping extension',
'Link': 'https://webscraper.io/'},

{'Title': 'What Is Web Scraping? - Zyte',
'Link': 'https://www.zyte.com/learn/what-is-web-scraping/'}]

Output I'm getting:

[{'Title': 'What is Web Scraping and How to Use It? - GeeksforGeeks',
'Link': 'https://www.geeksforgeeks.org/what-is-web-scraping-and-how-to-use-it/'},
{'Title': 'Web scraping - Wikipedia',
'Link': 'https://en.wikipedia.org/wiki/Web_scraping'},
{'Title': "What Is Web Scraping? A Complete Beginner's Guide",
'Link': 'https://careerfoundry.com/en/blog/data-analytics/web-scraping-guide/'},
{'Title': 'What Is Scraping | About Price & Web Scraping Tools | Imperva',
'Link': 'https://www.imperva.com/learn/application-security/web-scraping-attack/'},
{'Title': 'What is Web Scraping and What is it Used For? | ParseHub',
'Link': 'https://www.parsehub.com/blog/what-is-web-scraping/'},
{'Title': 'A Practical Introduction to Web Scraping in Python',
'Link': 'https://realpython.com/python-web-scraping-practical-introduction/'},
{'Title': 'Web Scraping with Python: Everything you need to know (2022)',
'Link': 'https://www.scrapingbee.com/blog/web-scraping-101-with-python/'},
{'Title': 'Web Scraper - The #1 web scraping extension',
'Link': 'https://webscraper.io/'},
{'Title': 'What Is Web Scraping? - Zyte',
'Link': 'https://www.zyte.com/learn/what-is-web-scraping/'}]

Code I'm using:

import requests
import urllib
import pandas as pd
import pprint
from requests_html import HTML
from requests_html import HTMLSession

def get_source(url):
    """Return the source code for the provided URL. 

    Args: 
        url (string): URL of the page to scrape.

    Returns:
        response (object): HTTP response object from requests_html. 
    """

    try:
        session = HTMLSession()
        response = session.get(url)
        return response

    except requests.exceptions.RequestException as e:
        print(e)
      
def get_results(query):
    
    query = urllib.parse.quote_plus(query)
    response = get_source("https://www.google.co.uk/search?q=" + query)
    
    return response
  
def parse_results(response):
    
    css_identifier_result = ".tF2Cxc"
    css_identifier_title = "h3"
    css_identifier_link = ".yuRUbf a"
    css_identifier_text = ".VwiC3b"
    
    results = response.html.find(css_identifier_result)

    output = []
    
    for result in results:

        item = {
            'Title': result.find(css_identifier_title, first=True).text,
            'Link': result.find(css_identifier_link, first=True).attrs['href']
        }
        
        output.append(item)
        
    pprint.pprint(output, stream=None, indent=0, width=80, depth=None, compact=False, sort_dicts=False)
  
def google_search(query):
    response = get_results(query)
    return parse_results(response)
  
google_search("web scraping")

Hello @RetroWolf, welcome to the forum!

Coming to your question: after playing a bit with your example, replacing a few lines helped me achieve the effect.

I replaced output.append(item) with:

output.append(pprint.pformat(item, indent=0, width=80, depth=None, compact=False, sort_dicts=False))

Then I replaced pprint.pprint() with:

for out in output:
    print(out + "\n")
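
With both changes in place, the end of parse_results should look roughly like this (a sketch; everything else in your code stays the same):

    for result in results:

        item = {
            'Title': result.find(css_identifier_title, first=True).text,
            'Link': result.find(css_identifier_link, first=True).attrs['href']
        }

        # store the pretty-printed string for each result instead of the dict itself
        output.append(pprint.pformat(item, indent=0, width=80, depth=None, compact=False, sort_dicts=False))

    # print each pre-formatted result followed by a blank line
    for out in output:
        print(out + "\n")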

Here is a working repl I made with the chunk of code you mentioned:
https://replit.com/@pro0grammer/WellwornScratchyMalware

If I went wrong somewhere, or if you find a better approach, please feel free to update me, as I have never coded a project in Python before.


It made it slightly better, but some of them still print both the title and link on the same line.
For example:

SEARCH: test
{'Title': 'Test.com: Home', 'Link': 'https://test.com/'}

{'Title': 'Speedtest by Ookla - The Global Broadband Speed Test',
'Link': 'https://www.speedtest.net/'}

{'Title': 'Test Definition & Meaning - Merriam-Webster',
'Link': 'https://www.merriam-webster.com/dictionary/test'}

{'Title': 'Test - Wikipedia', 'Link': 'https://en.wikipedia.org/wiki/Test'}

{'Title': 'Test Definition & Meaning - Dictionary.com',
'Link': 'https://www.dictionary.com/browse/test'}

{'Title': 'test - Wiktionary', 'Link': 'https://en.wiktionary.org/wiki/test'}

{'Title': 'COVID-19 Testing: What You Need to Know - CDC',
'Link': 'https://www.cdc.gov/coronavirus/2019-ncov/symptoms-testing/testing.html'}

{'Title': 'Fast.com: Internet Speed Test', 'Link': 'https://fast.com/'}

I see that it occurs whenever the title and link are short; I think this is because of the width limit you’ve set in pretty print. Let me make some tweaks and check what solves the issue.
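
For illustration, pprint only wraps a dict across lines once its single-line form is longer than width, which is why the short pairs stay on one line:

import pprint

short = {'Title': 'Test - Wikipedia', 'Link': 'https://en.wikipedia.org/wiki/Test'}
long = {'Title': 'Speedtest by Ookla - The Global Broadband Speed Test',
        'Link': 'https://www.speedtest.net/'}

# fits within width=80, so pformat keeps it on a single line
print(pprint.pformat(short, width=80, sort_dicts=False))

# too long for one line, so pformat breaks it after 'Title'
print(pprint.pformat(long, width=80, sort_dicts=False))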

So since you actually wanted to format each parameter separately and not together, I did exactly that:

item = {
    'Title': pprint.pformat(result.find(css_identifier_title, first=True).text, indent=0, width=80, depth=None, compact=False, sort_dicts=False),
    'Link': pprint.pformat(result.find(css_identifier_link, first=True).attrs['href'], indent=0, width=80, depth=None, compact=False, sort_dicts=False),
}

The only other tweak was in how the items are printed, since now you need a newline after each parameter; I thought it would be better to hard-code the format itself:

print("{'Title':" + out.get("Title") + ",\n'Link':" + out.get("Link") + '}\n')

It now runs into an error:

print("{'Title':" + out.get("Title") + ",\n'Link':" + out.get("Link") + '}\n')
AttributeError: 'str' object has no attribute 'get'

Sorry, my bad, I forgot to mention that you have to change:

output.append(pprint.pformat(item, indent=0, width=cols, depth=None, compact=False, sort_dicts=False))

To this:

output.append(item)

That should solve the problem.
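
So, putting it together, the relevant part of parse_results ends up roughly like this (a sketch using your selectors and the same pformat arguments as above):

    for result in results:

        item = {
            # pformat each field on its own so a short Title/Link pair is never joined on one line
            'Title': pprint.pformat(result.find(css_identifier_title, first=True).text, indent=0, width=80, depth=None, compact=False, sort_dicts=False),
            'Link': pprint.pformat(result.find(css_identifier_link, first=True).attrs['href'], indent=0, width=80, depth=None, compact=False, sort_dicts=False),
        }

        # append the dict itself, not pformat(item)
        output.append(item)

    for out in output:
        # hard-coded layout: Title and Link always land on separate lines
        print("{'Title': " + out.get("Title") + ",\n'Link': " + out.get("Link") + "}\n")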

It worked, thank you! If you don't mind, can you help me add a description to the links?
For example:

SEARCH: test
PAGES: 1

Title: 'Test.com: Home'
Link: 'https://test.com/'
Info: 'Want to test your internet upload and download speeds? ... Looking for a test and certification management solution for you business or organization?'

Title: 'Speedtest by Ookla - The Global Broadband Speed Test'
Link: 'https://www.speedtest.net/'
Info: 'Use Speedtest on all your devices with our free desktop and mobile apps.'

Title: 'Test Definition & Meaning - Merriam-Webster'
Link: 'https://www.merriam-webster.com/dictionary/test'
Info: 'verb ; 1 · to put to test or proof : try. test out your strength ; 3 · to use tests as a way to analyze or identify. test for copper.'

Title: 'Test - Wikipedia'
Link: 'https://en.wikipedia.org/wiki/Test'
Info: 'Test (assessment), an educational assessment intended to measure the respondents' knowledge or other abilities ...'

Title: 'Test Definition & Meaning - Dictionary.com'
Link: 'https://www.dictionary.com/browse/test'
Info: 'A test is a collection of questions, tasks, or problems that are designed to see if a person understands a subject or to measure their ability to do something.'

Title: 'test - Wiktionary'
Link: 'https://en.wiktionary.org/wiki/test'
Info: 'To challenge. · To refine (gold, silver, etc.) · To put to the proof; to prove the truth, genuineness, or quality of by experiment, or by some principle or ...'

Title: 'Testing resources - COVID.gov'
Link: 'https://www.covid.gov/tests'
Info: 'No-cost antigen and PCR COVID-⁠19 tests are available to everyone in the U.S., including the uninsured, at more than 20,000 sites nationwide. Find resources ...'

Title: 'COVID-19 Testing: What You Need to Know - CDC'
Link: 'https://www.cdc.gov/coronavirus/2019-ncov/symptoms-testing/testing.html'
Info: 'These tests detect viral genetic material, which may stay in your body for up to 90 days after you test positive. Therefore, you should not use a NAAT if you ...'

All I need now is the rest of the code to scrape the description.
Here is the code I have so far:

import requests
import urllib
import pandas as pd
import pprint
import shutil
from requests_html import HTML
from requests_html import HTMLSession
global terminal_size
global cols
global rows

terminal_size = shutil.get_terminal_size(fallback=(120, 50))
cols = terminal_size.columns
rows = terminal_size.lines

query = input("SEARCH: ")

def get_source(url):
    """Return the source code for the provided URL. 

    Args: 
        url (string): URL of the page to scrape.

    Returns:
        response (object): HTTP response object from requests_html. 
    """

    try:
        session = HTMLSession()
        response = session.get(url)
        return response

    except requests.exceptions.RequestException as e:
        print(e)
      
def get_results(query):
    query = urllib.parse.quote_plus(query)
    # keep asking until a valid number of pages is entered
    while True:
        try:
            pages = int(input("PAGES: "))
            break
        except ValueError:
            print("PAGES must be a number!")
    num = pages * 10
    response = get_source(f"https://www.google.co.uk/search?q={query}&num={num}")

    return response
  
def parse_results(response):
    
    css_identifier_result = ".tF2Cxc"
    css_identifier_title = "h3"
    css_identifier_link = ".yuRUbf a"
    css_identifier_text = ".VwiC3b"
    
    results = response.html.find(css_identifier_result)

    output = []
    
    for result in results:

        item = {
            'Title': pprint.pformat(result.find(css_identifier_title, first=True).text, indent=0, width=cols, depth=None, compact=False, sort_dicts=False),
            'Link': pprint.pformat(result.find(css_identifier_link, first=True).attrs['href'], indent=0, width=cols, depth=None, compact=False, sort_dicts=False),
        }

        output.append(item)
        
    for out in output:
      print("\nTitle: " + out.get("Title") + "\nLink: " + out.get("Link"))
  
def google_search(query):
    response = get_results(query)
    return parse_results(response)

google_search(query)
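
For the description, the css_identifier_text selector that is already defined in parse_results can be used the same way as the other two. A minimal sketch (assuming Google's .VwiC3b class still matches the snippet under each result):

        item = {
            'Title': pprint.pformat(result.find(css_identifier_title, first=True).text, indent=0, width=cols, depth=None, compact=False, sort_dicts=False),
            'Link': pprint.pformat(result.find(css_identifier_link, first=True).attrs['href'], indent=0, width=cols, depth=None, compact=False, sort_dicts=False),
            # .VwiC3b is the container Google uses for the result snippet
            'Text': pprint.pformat(result.find(css_identifier_text, first=True).text, indent=0, width=cols, depth=None, compact=False, sort_dicts=False),
        }

        output.append(item)

    for out in output:
      print("\nTitle: " + out.get("Title") + "\nLink: " + out.get("Link") + "\nInfo: " + out.get("Text"))

One thing to watch: a result without a snippet will make .text fail on None, so you may want to check the find() result before using it.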

I have perfected it.
Here is the code:

import requests
import urllib
import pandas as pd
import pprint
import time
from termcolor import colored, cprint
from requests_html import HTML
from requests_html import HTMLSession
from colorama import Fore,Back,Style

query = input(colored(f"{Fore.BLUE}S{Fore.RED}E{Fore.LIGHTYELLOW_EX}A{Fore.BLUE}R{Fore.RED}C{Fore.LIGHTYELLOW_EX}H{Fore.BLUE}:{Fore.WHITE} ", attrs=['bold']))

def get_source(url):
    """Return the source code for the provided URL. 

    Args: 
        url (string): URL of the page to scrape.

    Returns:
        response (object): HTTP response object from requests_html. 
    """

    try:
        session = HTMLSession()
        response = session.get(url)
        return response

    except requests.exceptions.RequestException as e:
        print(e)

def get_results(query):
    query = urllib.parse.quote_plus(query)
    # keep asking until a valid number of pages is entered
    while True:
        try:
            pages = int(input(colored(f"{Fore.BLUE}P{Fore.RED}A{Fore.LIGHTYELLOW_EX}G{Fore.BLUE}E{Fore.RED}S{Fore.LIGHTYELLOW_EX}:{Fore.WHITE} ", attrs=['bold'])))
            break
        except ValueError:
            print("PAGES must be a number!")
    num = pages * 10
    response = get_source(f"https://www.google.co.uk/search?q={query}&num={num}")

    return response
  
def parse_results(response):
    
    css_identifier_result = ".tF2Cxc"
    css_identifier_title = "h3"
    css_identifier_link = ".yuRUbf a"
    css_identifier_text = ".VwiC3b"
    
    results = response.html.find(css_identifier_result)

    output = []
    
    for result in results:

        item = {
            'Title': result.find(css_identifier_title, first=True).text,
            'Link': result.find(css_identifier_link, first=True).attrs['href'],
            'Text': result.find(css_identifier_text, first=True).text
        }
        
        output.append(item)
      
    for out in output:
      print(colored(f"\n{Fore.RED}{out.get('Title')}", attrs=['bold']))
      print(colored(f"{Fore.BLUE}{out.get('Link')}", attrs=['underline']))
      print(colored(f"{Fore.GREEN}{out.get('Text')}"))
      time.sleep(0.10)
  
def google_search(query):
    response = get_results(query)
    return parse_results(response)

google_search(query)

Result for query = “test” and pages = 2:

[screenshot omitted]

Looks great @RetroWolf, well done and thank you for sharing!

No problem. But I couldn’t do it without @pro0grammer.
