Looping over hrefs, missing the first job. Help, anyone?

Question:
I can't figure out a solution to these two issues. I'm trying to pull all of the application links for every job in a specific search on Google Jobs. It works, but it only pulls the first link for each job, and it always misses the first job (the first li) in the unordered list.

Current behavior:
It skips the first li (job) in each ul (I believe 10 items in total), getting only the remaining 9 jobs. It then prints only the first apply href for each job.

Desired behavior:
Get ALL li's from ALL ul's, then print ALL of the hrefs where one can apply for each job (e.g. if you can apply for a job on Indeed, LinkedIn, etc., print all of those hrefs).

Repl link:
https://replit.com/@itsjustmyemail/Auto-Job-Applier-2?v=1

from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By
from bs4 import BeautifulSoup
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.action_chains import ActionChains

chrome_options = Options()
chrome_options.add_argument('--no-sandbox')
chrome_options.add_argument('--disable-dev-shm-usage')
driver = webdriver.Chrome(options=chrome_options)

# Pointing to Google Search and navigating there
url = 'https://www.google.com/search?q=jobs+"director"+OR+"consultant"+OR+"analyst"+AND+"improvement"+OR+"change"+OR+"innovation"+OR+"power+platform"+OR+"implementation"+AND+Calgary'
driver.get(url) 

# Find and click the "Jobs" area
jobs_button = driver.find_element(By.ID, 'fMGJ3e')
jobs_button.click()

# Initialize an empty list to store the list items
list_items = []

# Find the container element that holds all the unordered lists
container = driver.find_element(By.XPATH, '//*[@id="immersive_desktop_root"]/div/div[3]/div[1]')

# Find the first list
first_list = driver.find_element(By.XPATH, '//*[@id="immersive_desktop_root"]/div/div[3]/div[1]/div[1]/div[3]/ul')

# Find all the list items within the unordered list
items = first_list.find_elements(By.TAG_NAME, 'li')

# Append the first ul's list items to the list
list_items.extend(items)

list_index = 1

# Scroll to the bottom of the page and wait for new list items to load
action = ActionChains(driver)
while len(list_items) < 100:
    # Scroll to the bottom of the page
    action.move_to_element(list_items[-1]).perform()
    # Wait for new list items to load
    WebDriverWait(driver, 10).until(EC.presence_of_element_located((By.CSS_SELECTOR, 'ul > li:last-child')))
    # Find all the list items within the new unordered list
    new_items = driver.find_elements(By.CSS_SELECTOR, 'ul > li')
    # Append any new list items to the list
    for item in new_items:
        if item not in list_items:
            list_items.append(item)

# Print the href attributes of all the links in the list
for item in list_items:
    html_source = item.get_attribute("innerHTML")
    soup = BeautifulSoup(html_source, "html.parser")
    links = soup.find_all("a", href=True)
    for link in links:
        href = link.get("href")
        if href and (href.startswith("http") or href.startswith("https")) and "job" in href:
            print(href)

This is probably it; try setting the index to 0 on each loop.


That was my thought too, but when I change the index to 0, the console doesn’t have any printout.

Huh. Strange. What happens when it’s 2? Does it skip the first 2?

Trying that now, I'll let you know. I threw this code into ChatGPT just to see what it would say. It said there's a possibility that the li isn't included in the ul and is instead under another XPath, but I don't understand how that would make sense, because it's within the same ul.
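
I guess one way to test that theory would be to compare the ul's direct li children against every li descendant. A quick sanity-check sketch, reusing the driver and XPath from my script above:

# Sanity check of ChatGPT's theory: do direct children and all descendants disagree?
first_list = driver.find_element(By.XPATH, '//*[@id="immersive_desktop_root"]/div/div[3]/div[1]/div[1]/div[3]/ul')
direct_lis = first_list.find_elements(By.XPATH, './li')  # direct children only
all_lis = first_list.find_elements(By.TAG_NAME, 'li')    # li's at any depth
print(len(direct_lis), len(all_lis))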

ChatGPT has a lot of hallucinations about coding, lol.


Very true. I was like "da fuuuuuuu". Also, I set it (the index) to 2 and, weirdly, I get the same output as when it's set to 1.

Huh. That’s really strange. What if you set it to something ridiculous like 10?

Lol, I'll try. This is really weird. I just checked the list: it looks like the first ul is missing its first li, while the subsequent uls contain all of their line items. Maybe it's a loading issue? Can't see why it would be, though… Seems bizarre.
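
If it is a loading issue, I suppose an explicit wait on the first li before grabbing the items would rule it out. Untested sketch, reusing the WebDriverWait/EC imports my script already has:

# Untested idea: wait until the first li is actually present before collecting items
WebDriverWait(driver, 10).until(
    EC.presence_of_element_located((By.CSS_SELECTOR, 'ul > li:first-child'))
)
items = first_list.find_elements(By.TAG_NAME, 'li')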

Any idea why it would only be getting the first href for each job?

Not sure, but also, where does your code actually use list_index? A Ctrl+F shows its definition is the only place it appears.

Oh jeez… it was used in an old version and I forgot to delete it (facepalm). So I guess the issue is in here:

# Print the href attributes of all the links in the list
for item in list_items:
    html_source = item.get_attribute("innerHTML")
    soup = BeautifulSoup(html_source, "html.parser")
    links = soup.find_all("a", href=True)
    for link in links:
        href = link.get("href")
        if href and (href.startswith("http") or href.startswith("https")) and "job" in href:
            print(href)

I'm wondering if I shouldn't just explicitly find the first li by its XPath and insert() it into the list before looping through and printing the hrefs… I mean, it's the lazy way to fix it, I guess…
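
Something like this is what I have in mind (the li[1] XPath is just my guess at where that first card lives):

# Lazy-fix sketch: grab the first li directly and prepend it to the list
first_item = driver.find_element(By.XPATH, '//*[@id="immersive_desktop_root"]/div/div[3]/div[1]/div[1]/div[3]/ul/li[1]')
list_items.insert(0, first_item)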

I'm not certain; I kinda took a quick glance at the code and saw the index, tbh. I'm not experienced with web scraping.

If it works, it works.

Lol… it didn't work. It's like Selenium can't find that first job. It throws an error when I reference the XPath and try to force it to add it to the list before printing the hrefs.
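
For reference, this is roughly how I wrapped it (NoSuchElementException is what Selenium raises when find_element comes up empty):

from selenium.common.exceptions import NoSuchElementException

# Wrapping the lazy fix so the script keeps running and prints the error instead
try:
    first_item = driver.find_element(By.XPATH, '//*[@id="immersive_desktop_root"]/div/div[3]/div[1]/div[1]/div[3]/ul/li[1]')
    list_items.insert(0, first_item)
except NoSuchElementException as e:
    print(e)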

I don’t know what to tell you, sorry, web scraping is not an area where I usually spend my time.


I feel like I've tried everything to get that first job, and I still can't figure it out. Banging my head against a wall here, lol.

This doesn’t skip the first job (woohoo!) and it gets all of the hrefs!

from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By
from bs4 import BeautifulSoup

chrome_options = Options()
chrome_options.add_argument('--no-sandbox')
chrome_options.add_argument('--disable-dev-shm-usage')
driver = webdriver.Chrome(options=chrome_options)

# Pointing to Google Search and navigating there
url = 'https://www.google.com/search?q=jobs+"director"+OR+"consultant"+OR+"analyst"+AND+"improvement"+OR+"change"+OR+"innovation"+OR+"power+platform"+OR+"implementation"+AND+Calgary'
driver.get(url) 

# Find and click the "Jobs" area
jobs_button = driver.find_element(By.ID, 'fMGJ3e')
jobs_button.click()

# Find all the list items
list_items = driver.find_elements(By.CSS_SELECTOR,'ul > li')

# Click on each item and extract information
for item in list_items:
    item.click()
    raw_html = driver.find_element(By.XPATH, '/html/body/div[2]/div/div[2]/div[1]/div/div/div[3]/div[2]/div/div[1]/div/div/g-scrolling-carousel/div[1]/div')
    html_source = raw_html.get_attribute("innerHTML")
    soup = BeautifulSoup(html_source, "html.parser")
    links = soup.find_all("a", href=True)
    for link in links:
        href = link.get("href")
        if href and (href.startswith("http") or href.startswith("https")) and "job" in href:
            print(href)
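
One caveat: the carousel of apply links loads after each click, so if this ever turns flaky, an explicit wait on the same XPath (instead of reading it immediately) should harden it. Untested sketch:

from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

# Untested hardening: block until the apply-link carousel is present
raw_html = WebDriverWait(driver, 10).until(
    EC.presence_of_element_located((By.XPATH, '/html/body/div[2]/div/div[2]/div[1]/div/div/div[3]/div[2]/div/div[1]/div/div/g-scrolling-carousel/div[1]/div'))
)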


Nice! You should probably go ahead and mark that as the solution, then.

