find content / column with Xpath - KNIME Analytics Platform

Can someone help with XPath?

Here is my simplified workflow.

I want to retrieve the column of symbols (tickers). I suppose something has changed on the page; I have tried a lot, but the ticker column stays empty.

I am surely making a stupid mistake, but I don’t see it.

Many thanks in advance.

Xpath_1.knwf (10.7 KB)
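
For readers without the workflow at hand, the extraction described in the question (pulling the first column of a table with XPath) can be sketched outside KNIME. This is only an illustration: the table fragment and the tickers `AAA`/`BBB` are made up and are not Yahoo's actual markup, and Python's `xml.etree.ElementTree` supports only a subset of XPath and needs well-formed input.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed table fragment (NOT the real Yahoo page).
SAMPLE = """
<table>
  <thead><tr><th>Symbol</th><th>Company</th><th>Ratio</th></tr></thead>
  <tbody>
    <tr><td><a href="/quote/AAA">AAA</a></td><td>Alpha Fund</td><td>2 - 1</td></tr>
    <tr><td><a href="/quote/BBB">BBB</a></td><td>Beta Fund</td><td>1 - 10</td></tr>
  </tbody>
</table>
"""


def extract_tickers(markup):
    """Return the text of the first cell of every body row."""
    root = ET.fromstring(markup.strip())
    # td[1] selects the first <td> child of each <tr>.
    cells = root.findall(".//tbody/tr/td[1]")
    # itertext() also collects text nested inside the <a> element.
    return ["".join(cell.itertext()).strip() for cell in cells]


print(extract_tickers(SAMPLE))
```

If an expression like this returns nothing on the real page, the first thing to check is whether the retrieved document contains the table at all.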

Hey there,

I think there are two issues at play - one minor and probably an oversight and one bigger one…

The small one: as far as I can tell from your workflow, you are currently getting the google.com homepage as the response. You may want to select your URL column rather than having the default google.com address scraped.

That said, it looks like Yahoo does not like to be pinged this way: the node responds with a 503 error.

My gut feeling is that you may have to use the KNIME Web Interaction Extension to have KNIME open the website in a browser and then grab the data. There was a Just KNIME It! challenge about extracting data from Yahoo Finance using exactly this extension.

Here’s the solution thread with plenty of options to pick from to see how it can work:

https://forum.knime.com/t/solutions-to-just-knime-it-challenge-9-season-3/81017/30

Here is my solution:

Hi MartinDDDD,

Many thanks for your very quick response.

Regarding your first remark, you’re fully right. Sorry for wasting your time; I was inaccurate when constructing my basic example flow. Mea culpa.

I tried to run your suggestion and filled the second node, Navigator (Labs), with this URL: [link titled “Yahoo is part of the Yahoo family of brands”]

That produces the following error:
ERROR Navigator (Labs) 5:2 Execute failed: HTTPConnectionPool(host=‘localhost’, port=30459): Max retries exceeded with url: /session/a72344c9fef2b061b15aa84e3c85a12f/url (Caused by NewConnectionError(‘<urllib3.connection.HTTPConnection object at 0x0000023659746AA0>: Failed to establish a new connection: [WinError 10061] No connection could be made because the target machine actively refused it’))

Strange, because the URL points to an existing page. I also tried “Refresh”.
Any idea why it produces this error?

When I run your unchanged example (with the URL [link titled “Yahoo is part of the Yahoo family of brands”]), the results are missing values.

I hope you (or someone else) can help.

Thanks in advance.

Several years ago I developed a Yahoo Finance URL to use in a GET request. It no longer works due to a variety of changes Yahoo has made. It may still be possible, but frankly, after reading a variety of posts on the subject, it’s beyond me. I was able to develop a Python script employing the yfinance package, which seems to work fine. I’ve wrapped the workflow in a component so it has interactive inputs. You’ll need a Python environment with the packages highlighted below. You can add a Table write… I’ve modified the original to include conda propagation for the required Python environment as well as writing an output. You’ll need to change the location of the Excel Writer in the String Manipulation node.

KNIME Community Hub
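
The component's script is not reproduced in this thread, so the following is only a rough sketch of what a yfinance-based split lookup might look like, not the actual code. The function name `collect_splits` and the `make_ticker` parameter are made up for illustration; the only yfinance feature assumed is the `splits` attribute of `yfinance.Ticker`.

```python
def collect_splits(symbols, make_ticker=None):
    """Map each symbol to a list of (date, ratio) split records."""
    if make_ticker is None:
        # Third-party dependency, imported lazily: pip install yfinance
        import yfinance as yf
        make_ticker = yf.Ticker

    history = {}
    for symbol in symbols:
        # Ticker.splits is a pandas Series: index = split date, value = ratio.
        splits = make_ticker(symbol).splits
        history[symbol] = list(splits.items())
    return history
```

Calling `collect_splits` with a list of ticker symbols makes one lookup per symbol, so the runtime grows with the size of the list. The `make_ticker` hook is there only so the function can be tried with a stand-in object instead of live requests.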

Thanks, I managed to run your script.

But I will have some challenges:
(1) modify it for these pages: [link titled “Yahoo is part of the Yahoo family of brands”]
for 2025-01-01, 2025-01-02, etc.

(2) run it for 500+ funds

In general, I don’t understand why my basic flow doesn’t work anymore. With Webpage Retriever and XPath, life was so simple.

What changed at Yahoo Finance?

Hi @rfeigel,

I am building a financial model, and as part of it I check whether there are any stock splits for the funds/companies I analyse.

As input I use e.g. this page: [link titled “Yahoo is part of the Yahoo family of brands”]

I carry out this check every month, so I download the finance.yahoo split pages for every day of a specific month, e.g.:
…calendar/splits/?day=2025-01-01
…calendar/splits/?day=2025-01-02
…calendar/splits/?day=2025-01-31
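
The per-day addresses listed above can be generated rather than typed by hand. A minimal sketch, assuming the only thing that varies is the `day` query parameter; `BASE_URL` is a hypothetical placeholder because the full address is not shown in this thread, and `daily_urls` is a made-up helper name.

```python
import calendar
from datetime import date


def daily_urls(base_url, year, month):
    """Return one URL per calendar day of the given month."""
    days_in_month = calendar.monthrange(year, month)[1]
    return [
        f"{base_url}?day={date(year, month, day).isoformat()}"
        for day in range(1, days_in_month + 1)
    ]


# Hypothetical placeholder; substitute the real split-calendar address.
BASE_URL = "https://example.invalid/calendar/splits/"

urls = daily_urls(BASE_URL, 2025, 1)
print(len(urls))
print(urls[0])
print(urls[-1])
```

The resulting list can be fed to whichever retrieval node or script is used, one row per day.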

Later on in the KNIME flow I check, for all the funds in my portfolio, whether there was a stock split. But that’s not relevant to my actual problem.

Until recently the flow worked fine with Webpage Retriever and XPath.
Attached is the relevant part of my workflow:
Xpath_1_extended_example.knwf (102.6 KB)

I suppose something changed at Yahoo Finance, but I don’t know what.

So my problem is how to download the tables for every day. See screenshot.


@rfeigel

Thanks for your latest contribution, it is really great. It gave me a lot of insight and I learned a lot from it.

Your approach is to first collect all the split data of the portfolio. Something you could not know is that my model is only interested in the most recent splits.

Furthermore, it takes a lot of running time to collect all the split data of my portfolio (about 800 funds, which takes 45+ minutes). It is a well-known problem that Python loops are very slow.

By accident I discovered the HTTP Retriever node, and that it reads all the data of pages such as [link titled “Yahoo is part of the Yahoo family of brands”]

With your contribution in mind I started tweaking and developed the attached flow.

My approach is to collect all the split data over, e.g., the last 60 days. Of course this comes with a lot of redundancy (funds not in the portfolio). Later on in the flow (not attached because it is ordinary KNIME) I join all the split funds with my portfolio (joined by Yahoo ticker).

The advantages of this approach are:

runtime about 15 seconds

less redundant split data
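
For anyone who wants to try the same idea outside KNIME, the filter-and-join step described above can be sketched in a few lines. This is only an illustration with made-up rows and tickers; the field names `ticker`, `day` and `ratio` and the helper `recent_portfolio_splits` are assumptions, not taken from the attached flow.

```python
from datetime import date, timedelta


def recent_portfolio_splits(split_rows, portfolio, today, window_days=60):
    """Keep splits inside the time window whose ticker is in the portfolio."""
    cutoff = today - timedelta(days=window_days)
    wanted = set(portfolio)  # set lookup keeps the join cheap
    return [
        row for row in split_rows
        if row["ticker"] in wanted and row["day"] >= cutoff
    ]


# Made-up example data: all splits scraped from the calendar pages.
SPLITS = [
    {"ticker": "AAA", "day": date(2025, 1, 15), "ratio": "2 - 1"},
    {"ticker": "ZZZ", "day": date(2025, 1, 20), "ratio": "1 - 10"},
    {"ticker": "BBB", "day": date(2024, 6, 3), "ratio": "3 - 1"},
]
PORTFOLIO = ["AAA", "BBB", "CCC"]

hits = recent_portfolio_splits(SPLITS, PORTFOLIO, today=date(2025, 2, 1))
print([row["ticker"] for row in hits])
```

Here only `AAA` survives: `ZZZ` is not in the portfolio and the `BBB` split is older than 60 days.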

I could only get to this thanks to your feedback. A thousand thanks!

REMAINING QUESTION:
Does someone have an idea why the HTTP Retriever node reads all the data on a certain page (the above-mentioned page) and Webpage Retriever does not?

Maybe this is a question for someone close to the KNIME development team.

Enclosed is my flow:
Xpath_3_DEFDEF.knwf (89.0 KB)