Looping over hrefs, missing the first job. Help, anyone?

Question:
I can't figure out a solution to these two issues. I'm trying to pull all of the application links for every job in a specific search on Google Jobs. It works, but it only pulls the first link for each job, and it always misses the first job (the first li) in the unordered list.

Current behavior:
It skips the first li (job) in each ul (I believe 10 items in total), getting only the remaining 9 jobs. It then prints only the first apply href for each job.

Desired behavior:
Get ALL li's from ALL ul's, then print ALL of the hrefs where one can apply for each job (e.g. if you can apply for a job on Indeed, LinkedIn, etc., print all of those hrefs).

Repl link:
https://replit.com/@itsjustmyemail/Auto-Job-Applier-2?v=1

from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By
from bs4 import BeautifulSoup
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.action_chains import ActionChains

chrome_options = Options()
chrome_options.add_argument('--no-sandbox')
chrome_options.add_argument('--disable-dev-shm-usage')
driver = webdriver.Chrome(options=chrome_options)

# Pointing to Google Search and navigating there
url = 'https://www.google.com/search?q=jobs+"director"+OR+"consultant"+OR+"analyst"+AND+"improvement"+OR+"change"+OR+"innovation"+OR+"power+platform"+OR+"implementation"+AND+Calgary'
driver.get(url) 

# Find and click the "Jobs" area
jobs_button = driver.find_element(By.ID, 'fMGJ3e')
jobs_button.click()

# Initialize an empty list to store the list items
list_items = []

# Find the container element that holds all the unordered lists
container = driver.find_element(By.XPATH, '//*[@id="immersive_desktop_root"]/div/div[3]/div[1]')

# Find the first list
first_list = driver.find_element(By.XPATH, '//*[@id="immersive_desktop_root"]/div/div[3]/div[1]/div[1]/div[3]/ul')

# Find all the list items within the unordered list
items = first_list.find_elements(By.TAG_NAME, 'li')

# Append the first ul's list items to the list
list_items.extend(items)

list_index = 1

# Scroll to the bottom of the page and wait for new list items to load
action = ActionChains(driver)
while len(list_items) < 100:
    # Scroll to the bottom of the page
    action.move_to_element(list_items[-1]).perform()
    # Wait for new list items to load
    WebDriverWait(driver, 10).until(EC.presence_of_element_located((By.CSS_SELECTOR, 'ul > li:last-child')))
    # Find all the list items within the new unordered list
    new_items = driver.find_elements(By.CSS_SELECTOR, 'ul > li')
    # Append any new list items to the list
    for item in new_items:
        if item not in list_items:
            list_items.append(item)

# Print the href attributes of all the links in the list
for item in list_items:
    html_source = item.get_attribute("innerHTML")
    soup = BeautifulSoup(html_source, "html.parser")
    links = soup.find_all("a", href=True)
    for link in links:
        href = link.get("href")
        if href and (href.startswith("http") or href.startswith("https")) and "job" in href:
            print(href)

This is probably it; try setting the index to 0 on each loop.


That was my thought too, but when I change the index to 0, the console doesn’t have any printout.

Huh. Strange. What happens when it’s 2? Does it skip the first 2?

Trying that now, I'll let you know. I threw this code into ChatGPT just to see what it would say. It said there's a possibility that the li isn't included in the ul and is instead under another XPath, but I don't understand how that would make sense, because it's within the same ul.
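
I guess one way to test that theory would be to compare the ul's direct li children against every li descendant. A quick sanity-check sketch, reusing the driver and XPath from my script above:

# Sanity check of ChatGPT's theory: do direct children and all descendants disagree?
first_list = driver.find_element(By.XPATH, '//*[@id="immersive_desktop_root"]/div/div[3]/div[1]/div[1]/div[3]/ul')
direct_lis = first_list.find_elements(By.XPATH, './li')  # direct children only
all_lis = first_list.find_elements(By.TAG_NAME, 'li')    # li's at any depth
print(len(direct_lis), len(all_lis))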

ChatGPT has a lot of hallucinations about coding, lol.


Very true. I was like "da fuuuuuuu". Also, I set it (the index) to 2 and, weirdly, I get the same output as when it's set to 1.

Huh. That’s really strange. What if you set it to something ridiculous like 10?

Lol, I'll try. This is really weird. I just checked the list: it looks like the first ul is missing its first li, while the subsequent uls contain all of their line items. Maybe it's a loading issue? Can't see why it would be, though… Seems bizarre.
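
If it is a loading issue, I suppose an explicit wait on the first li before grabbing the items would rule it out. Untested sketch, reusing the WebDriverWait/EC imports my script already has:

# Untested idea: wait until the first li is actually present before collecting items
WebDriverWait(driver, 10).until(
    EC.presence_of_element_located((By.CSS_SELECTOR, 'ul > li:first-child'))
)
items = first_list.find_elements(By.TAG_NAME, 'li')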

Any idea why it would only be getting the first href for each job?

Not sure, but also, where does your code actually use list_index? A Ctrl+F shows its definition is the only place it appears.

Oh jeez… it was used in an old version and I forgot to delete it (facepalm). So I guess the issue is in here:

# Print the href attributes of all the links in the list
for item in list_items:
    html_source = item.get_attribute("innerHTML")
    soup = BeautifulSoup(html_source, "html.parser")
    links = soup.find_all("a", href=True)
    for link in links:
        href = link.get("href")
        if href and (href.startswith("http") or href.startswith("https")) and "job" in href:
            print(href)

I'm wondering if I shouldn't just explicitly find the first li by its XPath and insert() it into the list before looping through and printing the hrefs… I mean, it's the lazy way to fix it, I guess…
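
Something like this is what I have in mind (the li[1] XPath is just my guess at where that first card lives):

# Lazy-fix sketch: grab the first li directly and prepend it to the list
first_item = driver.find_element(By.XPATH, '//*[@id="immersive_desktop_root"]/div/div[3]/div[1]/div[1]/div[3]/ul/li[1]')
list_items.insert(0, first_item)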

I'm not certain; I kinda took a quick glance at the code and saw the index, tbh. I'm not experienced with web scraping.

If it works, it works.

Lol… it didn't work. It's like Selenium can't find that first job. It throws an error when I reference the XPath and try to force it to add it to the list before printing the hrefs.
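
For reference, this is roughly how I wrapped it (NoSuchElementException is what Selenium raises when find_element comes up empty):

from selenium.common.exceptions import NoSuchElementException

# Wrapping the lazy fix so the script keeps running and prints the error instead
try:
    first_item = driver.find_element(By.XPATH, '//*[@id="immersive_desktop_root"]/div/div[3]/div[1]/div[1]/div[3]/ul/li[1]')
    list_items.insert(0, first_item)
except NoSuchElementException as e:
    print(e)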

I don’t know what to tell you, sorry, web scraping is not an area where I usually spend my time.


I feel like I've tried everything to get that first job, and I still can't figure it out. Banging my head against a wall here, lol.

This doesn’t skip the first job (woohoo!) and it gets all of the hrefs!

from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By
from bs4 import BeautifulSoup

chrome_options = Options()
chrome_options.add_argument('--no-sandbox')
chrome_options.add_argument('--disable-dev-shm-usage')
driver = webdriver.Chrome(options=chrome_options)

# Pointing to Google Search and navigating there
url = 'https://www.google.com/search?q=jobs+"director"+OR+"consultant"+OR+"analyst"+AND+"improvement"+OR+"change"+OR+"innovation"+OR+"power+platform"+OR+"implementation"+AND+Calgary'
driver.get(url) 

# Find and click the "Jobs" area
jobs_button = driver.find_element(By.ID, 'fMGJ3e')
jobs_button.click()

# Find all the list items
list_items = driver.find_elements(By.CSS_SELECTOR,'ul > li')

# Click on each item and extract information
for item in list_items:
    item.click()
    raw_html = driver.find_element(By.XPATH, '/html/body/div[2]/div/div[2]/div[1]/div/div/div[3]/div[2]/div/div[1]/div/div/g-scrolling-carousel/div[1]/div')
    html_source = raw_html.get_attribute("innerHTML")
    soup = BeautifulSoup(html_source, "html.parser")
    links = soup.find_all("a", href=True)
    for link in links:
        href = link.get("href")
        if href and (href.startswith("http") or href.startswith("https")) and "job" in href:
            print(href)
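
One caveat: the carousel of apply links loads after each click, so if this ever turns flaky, an explicit wait on the same XPath (instead of reading it immediately) should harden it. Untested sketch:

from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

# Untested hardening: block until the apply-link carousel is present
raw_html = WebDriverWait(driver, 10).until(
    EC.presence_of_element_located((By.XPATH, '/html/body/div[2]/div/div[2]/div[1]/div/div/div[3]/div[2]/div/div[1]/div/div/g-scrolling-carousel/div[1]/div'))
)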


Nice! You should probably go ahead and mark that as the solution, then.

