Question:
I can't figure out a solution to these two issues. I'm trying to pull all of the application links for all jobs for a specific search on Google Jobs. It works, but it only pulls the first link for each job, and it always misses the first job (the first li) in the unordered list.
Current behavior:
Skips the first li (job) in each ul (I believe 10 items in total), so it only gets the remaining 9 jobs. It then prints only the first href link to apply for each job.
Desired behavior
Get ALL li's from ALL ul's, then print ALL hrefs where one can apply for each job (e.g. if a job can be applied for on Indeed, LinkedIn, etc., print all of those hrefs).
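For reference, here's the desired behavior on a toy snippet — the HTML below is made up, but it shows the iterate-every-li, collect-every-href pattern the real page needs:

```python
from bs4 import BeautifulSoup

# Made-up job-card HTML mimicking the structure described above:
# one li per job, several provider links inside each li.
html = """
<ul>
  <li>
    <a href="https://indeed.example/job/1">Indeed</a>
    <a href="https://linkedin.example/job/1">LinkedIn</a>
  </li>
  <li>
    <a href="https://indeed.example/job/2">Indeed</a>
  </li>
</ul>
"""

soup = BeautifulSoup(html, "html.parser")
hrefs = []
# Walk every li, then every anchor inside it, so no link is skipped.
for li in soup.find_all("li"):
    for a in li.find_all("a", href=True):
        hrefs.append(a["href"])
print(hrefs)
```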
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By
from bs4 import BeautifulSoup
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.action_chains import ActionChains
chrome_options = Options()
chrome_options.add_argument('--no-sandbox')
chrome_options.add_argument('--disable-dev-shm-usage')
driver = webdriver.Chrome(options=chrome_options)
# Pointing to Google Search and navigating there
url = 'https://www.google.com/search?q=jobs+"director"+OR+"consultant"+OR+"analyst"+AND+"improvement"+OR+"change"+OR+"innovation"+OR+"power+platform"+OR+"implementation"+AND+Calgary'
driver.get(url)
# Find and click the "Jobs" Area
location_button = driver.find_element(By.ID, 'fMGJ3e')
location_button.click()
# Initialize an empty list to store the list items
list_items = []
# Find the container element that holds all the unordered lists
container = driver.find_element(By.XPATH, '//*[@id="immersive_desktop_root"]/div/div[3]/div[1]')
# Find the first list
first_list = driver.find_element(By.XPATH, '//*[@id="immersive_desktop_root"]/div/div[3]/div[1]/div[1]/div[3]/ul')
# Find all the list items within the unordered list
items = first_list.find_elements(By.TAG_NAME, 'li')
# Append the first UL list items to the list
for item in items:
    list_items.append(item)
list_index = 1
# Scroll to the bottom of the page and wait for new list items to load
action = ActionChains(driver)
while len(list_items) < 100:
    # Scroll to the last collected item to trigger lazy loading
    action.move_to_element(list_items[-1]).perform()
    # Wait for new list items to load
    WebDriverWait(driver, 10).until(
        EC.presence_of_element_located((By.CSS_SELECTOR, 'ul > li:last-child')))
    # Find all the list items currently in the DOM
    new_items = driver.find_elements(By.CSS_SELECTOR, 'ul > li')
    # Append any new list items to the list
    for item in new_items:
        if item not in list_items:
            list_items.append(item)
# Print the href attributes of all the links in the list
for item in list_items:
    html_source = item.get_attribute("innerHTML")
    soup = BeautifulSoup(html_source, "html.parser")
    links = soup.find_all("a", href=True)
    for link in links:
        href = link.get("href")
        # startswith("http") already covers "https" URLs
        if href and href.startswith("http") and "job" in href:
            print(href)
Trying that now. I'll let you know. I threw this code into ChatGPT just to see what it would say. It said there's a possibility that the li isn't included in the ul and is instead under another XPath, but I don't understand how that would make sense, because it's within the same ul.
Lol, I'll try. This is really weird. I just checked the list, and it looks like the first ul is missing its first li, while the subsequent uls contain all of their line items. Maybe it's a loading issue? Can't see why it would be though… Seems bizarre.
Any idea of why it would only be getting the first href for each job?
Oh jeeze… It was used in an old version and I forgot to delete it facepalm. So I guess the issue is in here:
# Print the href attributes of all the links in the list
for item in list_items:
    html_source = item.get_attribute("innerHTML")
    soup = BeautifulSoup(html_source, "html.parser")
    links = soup.find_all("a", href=True)
    for link in links:
        href = link.get("href")
        if href and (href.startswith("http") or href.startswith("https")) and "job" in href:
            print(href)
I’m wondering if I shouldn’t explicitly find the xpath for the first li, get the href, and insert() to the list before looping through and printing… I mean, it’s the lazy way to fix it I guess…
Lol… It didn't work. It's like Selenium can't find that first job. It throws an error when I reference the XPath and try to force it into the list before printing the hrefs.
This doesn’t skip the first job (woohoo!) and it gets all of the hrefs!
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By
from bs4 import BeautifulSoup
chrome_options = Options()
chrome_options.add_argument('--no-sandbox')
chrome_options.add_argument('--disable-dev-shm-usage')
driver = webdriver.Chrome(options=chrome_options)
# Pointing to Google Search and navigating there
url = 'https://www.google.com/search?q=jobs+"director"+OR+"consultant"+OR+"analyst"+AND+"improvement"+OR+"change"+OR+"innovation"+OR+"power+platform"+OR+"implementation"+AND+Calgary'
driver.get(url)
# Find and click the "Jobs" Area
location_button = driver.find_element(By.ID, 'fMGJ3e')
location_button.click()
# Find all the list items
list_items = driver.find_elements(By.CSS_SELECTOR,'ul > li')
# Click on each item and extract information
for item in list_items:
    item.click()
    raw_html = driver.find_element(By.XPATH, '/html/body/div[2]/div/div[2]/div[1]/div/div/div[3]/div[2]/div/div[1]/div/div/g-scrolling-carousel/div[1]/div')
    html_source = raw_html.get_attribute("innerHTML")
    soup = BeautifulSoup(html_source, "html.parser")
    links = soup.find_all("a", href=True)
    for link in links:
        href = link.get("href")
        # startswith("http") already covers "https" URLs
        if href and href.startswith("http") and "job" in href:
            print(href)
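One small refinement worth considering: the same apply link can show up more than once, so de-duplicating while keeping order makes the output cleaner. A pure-Python sketch on made-up URLs:

```python
def unique_job_links(hrefs):
    """Keep http(s) links mentioning 'job', first occurrence only."""
    seen = set()
    out = []
    for href in hrefs:
        # startswith('http') already covers https URLs
        if href.startswith('http') and 'job' in href and href not in seen:
            seen.add(href)
            out.append(href)
    return out

# Made-up example hrefs, not real Google Jobs output
sample = [
    'https://indeed.example/job/1',
    'https://indeed.example/job/1',   # duplicate, dropped
    'https://linkedin.example/job/1',
    '/relative/path',                 # dropped: not an absolute http link
]
print(unique_job_links(sample))
```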